Batch PDF Highlighting

The batch highlighting tool processes all PDF documents in a directory tree and creates new PDFs with built-in highlight annotations. Documents can be highlighted for a single query or for multiple queries specified in a CSV file.

Run the batch highlighter from the command line. When started without arguments, it lists the available options:

$ batch-pdf-highlight
 
usage: batch-pdf-highlight
    --check-file-time          highlight only if input PDF or query file is newer than output file
    --delete-input-pdf         delete input pdf after highlighting
 -h,--help                     print this message
 -i,--input-dir <dir>          input folder
 -l,--language <lang>          language for text analysis
 -nav                          add navigation links to document
 -o,--output-dir <dir>         output folder
 -q,--query <str>              search query with keywords to highlight
 -qenc,--charset <file>        charset of the query file, default is UTF8
 -qf,--query-file <file>       file in csv format with keywords to highlight
 -t,--threads <num>            thread count for concurrent highlighting
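
The rule behind --check-file-time can be pictured as a comparison of file modification times. The Python sketch below only illustrates that rule and is not the tool's actual implementation. The function name needs_highlighting is hypothetical, and treating a missing output file as "needs highlighting" is an assumption.

```python
import os


def needs_highlighting(input_pdf, output_pdf, query_file=None):
    """Return True if the output is missing or older than any of its sources.

    Hypothetical helper: mirrors the documented --check-file-time rule
    (input PDF or query file newer than output file) using mtimes.
    """
    # Assumption: a missing output file always needs to be produced.
    if not os.path.exists(output_pdf):
        return True

    out_mtime = os.path.getmtime(output_pdf)
    sources = [input_pdf]
    if query_file is not None:
        sources.append(query_file)

    # Re-highlight if any source was modified after the output was written.
    return any(os.path.getmtime(src) > out_mtime for src in sources)
```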

The batch highlighter works in offline mode and does not need the Highlighter Server to be running. In fact, the server and the batch tool cannot run at the same time.

Simple Query Highlighting

To highlight documents for a single query, use the -q option, as in:

batch-pdf-highlight -i D:\input\pdfs -o D:\output-dir -l en -q "home style price"

The above command highlights the terms home, style, and price using English language rules.
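
When the tool is driven from a script, it can help to assemble the arguments as a list instead of a single shell string, so that a multi-word query stays one argument. The Python sketch below does this using only flags from the usage listing. The helper name build_command is hypothetical, and it assumes batch-pdf-highlight is on the PATH.

```python
def build_command(input_dir, output_dir, language, query=None,
                  query_file=None, threads=None, check_file_time=False):
    """Build an argument list for batch-pdf-highlight (hypothetical helper).

    Only flags shown in the tool's usage listing are emitted.
    Assumption: the executable is reachable as "batch-pdf-highlight".
    """
    cmd = [
        "batch-pdf-highlight",
        "-i", str(input_dir),
        "-o", str(output_dir),
        "-l", language,
    ]
    if query is not None:
        cmd += ["-q", query]            # the whole query is one argument
    if query_file is not None:
        cmd += ["-qf", str(query_file)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if check_file_time:
        cmd.append("--check-file-time")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids manual quoting of the query string.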

Using Query Files

With a query file, you can highlight all your PDF documents for many keywords and phrases, possibly hundreds, at once.

The query file is a plain CSV file with the following columns:

  • query

  • color

  • tag

  • bookmark

The query column is the only required one. Multi-term queries are handled as phrases, with no need to put them in quotes. The color column sets the RGB color of the highlight for the query keywords, written as a six-digit hex value such as 00FF00.

Tags can be used to categorize and group keywords. If a color is not specified but a tag is, all queries with the same tag are assigned the same color. In addition, the PDF bookmarks created for highlights are grouped by tag. (More details about customizing bookmark creation and PDF output can be found here.)

The bookmark column controls PDF bookmark creation at the query level by specifying a bookmark path template string. If it is not defined, the default path template is used; see PDF bookmark options.

To highlight PDF documents for a query file, use the -qf option, as in:

batch-pdf-highlight -i D:\input\pdfs -o D:\output-dir -l en -qf queries.csv

The query CSV file can list queries only, as in:

query
New Zealand
Australia
United Kingdom
Alice

or it can include colors and tags as well:

query,color,tag
New Zealand,00FF00,Place
Australia,00FF00,Place
United Kingdom,00FF00,Place
Alice,FFFF00,Character
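
For long keyword lists, a query file in the layout above can be generated instead of typed by hand. The Python sketch below writes the query, color, and tag columns with the standard csv module. The function name write_query_file is hypothetical. UTF-8 is used because it is the documented default charset. How the tool treats quoted fields (for example, queries containing commas) is not covered here, so it is safest to test such cases first.

```python
import csv


def write_query_file(path, rows, encoding="utf-8"):
    """Write a query CSV with query, color and tag columns.

    Hypothetical helper. `rows` is an iterable of (query, color, tag)
    tuples; color is a six-digit hex RGB string such as "00FF00".
    """
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(["query", "color", "tag"])  # header row as in the examples
        for query, color, tag in rows:
            writer.writerow([query, color, tag])
```

For example, write_query_file("queries.csv", [("New Zealand", "00FF00", "Place"), ("Alice", "FFFF00", "Character")]) produces a file that can be passed to the -qf option.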

To quickly test how PDF highlighting works for your documents and keywords, use our online demo.

Configuration File

The batch highlighter configuration file is named application.conf and is loaded from the <highlighter>/conf/batch/ directory. This config file does not exist by default, but the directory contains application.conf.sample, which lists the most commonly used options.

Log File

The batch highlighter creates a log of all handled files, listing any issues or errors that occurred. The default path to the log file is <highlighter>/logs/batch-highlighter.log.

The file is overwritten every time the batch tool runs.

