Highlighter Data Files

By default, PDF Highlighter stores all of its data on the file system. If you don't explicitly set a data directory in your application.conf, Highlighter creates and uses a "highlighter-cache" folder in the system's tmp directory.

To change where Highlighter keeps its data files, set the highlighter.dataDir property in application.conf:

highlighter {
  dataDir = "D:/highlighter-data"
}
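
The fallback rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Highlighter's own code: the function name effective_data_dir is hypothetical, and Python's tempfile.gettempdir() is assumed to stand in for the "system's tmp directory" (the actual location Highlighter resolves may differ on your system).

```python
import tempfile
from pathlib import Path
from typing import Optional

# Folder name taken from the text above.
DEFAULT_FOLDER = "highlighter-cache"


def effective_data_dir(configured: Optional[str]) -> Path:
    """Return the configured data directory, or the documented fallback.

    `configured` models the value of highlighter.dataDir; None or an empty
    string means the property was not set.
    """
    if configured:
        return Path(configured)
    # Assumption: gettempdir() approximates the system's tmp directory.
    return Path(tempfile.gettempdir()) / DEFAULT_FOLDER


if __name__ == "__main__":
    print(effective_data_dir("D:/highlighter-data"))
    print(effective_data_dir(None))
```

Because the tmp directory is often cleared on reboot or by system maintenance, setting highlighter.dataDir explicitly is likely the safer choice for a long-running server.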

The details below are for in-depth understanding and are not needed for a routine server setup.

There are three types of persisted data:

  1. Full text search index.

  2. Text positions cache.

  3. Results cache.

Full Text Search Index

One might ask why Highlighter needs its own search engine when it deals with one document at a time and, in the most common use case, is integrated with an external search solution. The reason is that an external search engine usually doesn't provide the kind of data Highlighter needs. Highlighter indexes each PDF page individually so that it can quickly locate the pages of a document that need to be highlighted.

Full text search index files are located in the index/solr directory under the data directory and are managed by the Apache Solr instance embedded in Highlighter.

The total index size depends on the set of PDF documents; it is roughly 5%-10% of the total document size.
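
To compare this rough figure with an actual installation, one could total the file sizes below the index directory and divide by the size of the document set. The sketch below assumes the index lives at index/solr under the data directory and that the PDFs sit in a single folder; both paths, and the function names, are placeholders for illustration.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    total = 0
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = Path(folder) / name
            if path.is_file():
                total += path.stat().st_size
    return total


def size_ratio(index_dir: Path, documents_dir: Path) -> float:
    """Index size as a fraction of document size; 0.0 if there are no documents."""
    docs = dir_size_bytes(documents_dir)
    return dir_size_bytes(index_dir) / docs if docs else 0.0


if __name__ == "__main__":
    # Hypothetical locations; adjust to your own setup.
    data_dir = Path("D:/highlighter-data")
    documents = Path("D:/pdf-documents")
    print(f"index/documents ratio: {size_ratio(data_dir / 'index' / 'solr', documents):.1%}")
```

The same helper can be pointed at any other subdirectory of the data directory to see how the space is divided.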

If you're going to use only the "highlight-for-xml" highlighting method, the full text search indexing module can be disabled.

Text Positions Cache

This cache stores the position of every word on each page of a document. When the cache exists, Highlighter can handle highlighting requests without reading the PDF each time. For performance reasons, multiple files are created per indexed document.

Text position cache files are located below the index/text directory under the data directory. The total size depends on the set of PDF documents; it is roughly 15%-20% of the total document size.
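
For capacity planning, the two rough ratios (about 5%-10% of document size for the search index and about 15%-20% for the text positions cache) can be turned into a quick estimate. The sketch below is plain arithmetic on those figures; the function names are made up for this example, and the results cache is left out because its size depends on request volume rather than on document size.

```python
from typing import Dict, Tuple

# Percent ranges quoted in the text; rough guidance, not guarantees.
INDEX_PERCENT = (5, 10)
TEXT_CACHE_PERCENT = (15, 20)


def estimate_bytes(documents_bytes: int, percent: Tuple[int, int]) -> Tuple[int, int]:
    """Return the (low, high) estimate in bytes for a percent range."""
    low, high = percent
    return documents_bytes * low // 100, documents_bytes * high // 100


def estimate_footprint(documents_bytes: int) -> Dict[str, Tuple[int, int]]:
    """Estimate the disk space for the index, the text cache, and both combined."""
    index = estimate_bytes(documents_bytes, INDEX_PERCENT)
    text = estimate_bytes(documents_bytes, TEXT_CACHE_PERCENT)
    return {
        "index": index,
        "text_positions": text,
        "total": (index[0] + text[0], index[1] + text[1]),
    }


if __name__ == "__main__":
    gib = 1024 ** 3
    for name, (low, high) in estimate_footprint(40 * gib).items():
        print(f"{name}: {low / gib:.1f} to {high / gib:.1f} GiB")
```

For a 40 GiB document set this works out to roughly 2 to 4 GiB for the index and 6 to 8 GiB for the text positions cache, or 8 to 12 GiB in total.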

Results Cache

One file is created per handled highlight request, usually containing less than 1 KB of persisted data.

The results cache is cleaned up automatically in accordance with the settings.

