Highlight for Highlight File

The web service method /highlight-for-xml highlights a PDF file at the text positions specified in an XML highlights file. This method requires the following parameters:

  • uri - PDF document location.

  • xml - Highlights file location. This URL is typically served by a search application with highlighting data provided by the search engine.

For all available options, see the API documentation.
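As a rough illustration of the two required parameters, the sketch below builds a request URL for /highlight-for-xml. It assumes that uri and xml are passed as URL-encoded query string parameters and that the service is reached at https://api.highlight4.me; the helper name buildHighlightForXmlUrl is hypothetical, so check the API documentation for the exact request format.

```javascript
// Hypothetical helper, not part of the Highlighter API.
// Assumption: uri and xml are sent as URL-encoded query string parameters.
function buildHighlightForXmlUrl(highlighterUrl, documentUrl, highlightsUrl) {
  const params = new URLSearchParams({
    uri: documentUrl,   // PDF document location
    xml: highlightsUrl, // highlights file location
  });
  // Drop trailing slashes so the path is not doubled up.
  const base = highlighterUrl.replace(/\/+$/, "");
  return base + "/highlight-for-xml?" + params.toString();
}

const requestUrl = buildHighlightForXmlUrl(
  "https://api.highlight4.me",
  "http://host/path/document.pdf",
  "http://host/path/highlight.xml"
);
console.log(requestUrl);
```

URLSearchParams takes care of percent-encoding, so document and highlight URLs that contain their own query strings stay intact.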

Integration

Earlier versions of Adobe Reader (up to version 8, and version 9 after an option change) supported PDF highlighting, taking as input an XML-like structure that contains term offsets and lengths. A document URL that relies on this Adobe Reader feature specifies the highlight file location as the xml parameter after the hash:

http://host/path/document.pdf#xml=http://host/path/highlight.xml

Assuming that your search application already generates PDF document links in the above format, the simplest way to integrate Highlighter is to use our jQuery plugin:

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="//api.highlight4.me/js/jquery.pdf-highlighter.js"></script>
<script>
jQuery(document).ready(function ($) {
    // Plugin options
    var hlConfig = {
        highlighterUrl: "https://api.highlight4.me", // update for self-hosted Highlighter
        resolveDocumentBase: true,
        updateHref: true
    };
    // Attach the highlighter to PDF links that carry an xml parameter.
    // TODO: update the #ResultsDiv selector to match your results element
    $('#ResultsDiv a[href*=".pdf#xml="], #ResultsDiv a[href*=".pdf?xml="]').pdfHighlighter(hlConfig);
});
</script>

The snippet above loads the jQuery library and the plugin script "jquery.pdf-highlighter.js", then uses a jQuery selector to attach the highlighter to all PDF links inside the results element.
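Where the plugin is not an option, a link in the Adobe format can be taken apart by hand. The following is a minimal sketch, not the plugin's actual implementation: the function name splitHighlightLink is hypothetical, and it assumes the highlight file location is everything after the first ".pdf#xml=" or ".pdf?xml=" in the link.

```javascript
// Hypothetical helper, not part of jquery.pdf-highlighter.js.
// Splits "document.pdf#xml=highlight.xml" (or "?xml=") into its two parts.
function splitHighlightLink(href) {
  // Non-greedy match up to the first ".pdf" that is followed by #xml= or ?xml=
  const match = /^(.*?\.pdf)[#?]xml=(.*)$/i.exec(href);
  if (!match) {
    return null; // not a highlight link
  }
  return {
    uri: match[1], // PDF document location
    xml: match[2], // highlights file location, left exactly as written in the link
  };
}

const parts = splitHighlightLink(
  "http://host/path/document.pdf#xml=http://host/path/highlight.xml"
);
console.log(parts);
```

The two values correspond to the uri and xml parameters of /highlight-for-xml. If your search application percent-encodes the highlight file location, decode it before reuse.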

Highlights file format

Adobe highlight file format

The standard Adobe highlight file uses loc elements to specify highlight ranges. Each highlight range line has attributes:

  • pg: Specifies the page on which the highlight is located. Pages are numbered sequentially, with the first page in a file having a page number of zero.

  • pos: Specifies the offset of the highlight on the page. The offset of the first character on a page is zero. (While Adobe XML allows offsets to be specified either in words or characters, PDF Highlighter supports only character offsets.)

  • len: Specifies the number of words or characters to highlight, in the same units as pos.

Example:

<XML>
<Body units=characters mode=active version=2>
<Highlight>
<loc pg=0 pos=29 len=4>
<loc pg=0 pos=40 len=4>
</Highlight>
</Body>
</XML>
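A search application typically generates this file from a list of hit positions. The sketch below shows one way to do so. The function name buildAdobeHighlightXml and the shape of the input objects are assumptions made for illustration, and the output simply mirrors the example above.

```javascript
// Hypothetical generator for the standard Adobe highlight file.
// Each range is { pg, pos, len }: zero-based page, character offset, length.
function buildAdobeHighlightXml(ranges) {
  const locLines = ranges.map(({ pg, pos, len }) => {
    for (const value of [pg, pos, len]) {
      if (!Number.isInteger(value) || value < 0) {
        throw new RangeError("pg, pos and len must be non-negative integers");
      }
    }
    // Attribute values are left unquoted, as in the example above.
    return `<loc pg=${pg} pos=${pos} len=${len}>`;
  });
  return [
    "<XML>",
    "<Body units=characters mode=active version=2>",
    "<Highlight>",
    ...locLines,
    "</Highlight>",
    "</Body>",
    "</XML>",
  ].join("\n");
}

const adobeXml = buildAdobeHighlightXml([
  { pg: 0, pos: 29, len: 4 },
  { pg: 0, pos: 40, len: 4 },
]);
console.log(adobeXml);
```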

Extended highlight file format

For simpler integration with other tools, Highlighter supports an extended highlight file format that adds some non-standard extensions to the Adobe highlight file. The extended format simplifies document highlighting for NLP output data. The differences between the formats are described below.

Additional body element attributes:

  • positions: By default, pos offsets used in highlight ranges are Adobe-compatible. If this attribute is set to "internal", pos offsets used in highlight ranges match positions in the text extracted by Highlighter (i.e. the content returned by the /extract service).

Highlight range line (the loc element) attributes:

  • pg: The page index is optional. If pg is missing, the pos offset specifies the text position within the whole document rather than within a page.

  • color: Specifies the RGB color code to use for the highlight range, written as six hexadecimal digits (for example, FF00FF).

Example:

<xml>
<body units="characters" positions="internal">
<highlight>
<loc pos="0" len="5" color="FF00FF" />
<loc pos="22" len="10" color="0000FF" />
</highlight>
</body>
</xml>
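With positions="internal", offsets can be computed directly against the text returned by the /extract service. The sketch below finds every occurrence of a term in such text and emits an extended highlight file. The function name buildExtendedHighlightXml is hypothetical, and it is an assumption that JavaScript string indices line up with the character positions Highlighter uses, which may not hold for text containing characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical generator for the extended highlight file format.
// text: content previously obtained from the /extract service
// term: exact, case-sensitive string to highlight
// color: six hexadecimal digits, e.g. "FF00FF"
function buildExtendedHighlightXml(text, term, color) {
  const locLines = [];
  let from = 0;
  // An empty term would match everywhere, so it produces no ranges.
  while (term.length > 0) {
    const pos = text.indexOf(term, from);
    if (pos === -1) break;
    // No pg attribute: pos is a document-level offset.
    locLines.push(`<loc pos="${pos}" len="${term.length}" color="${color}" />`);
    from = pos + term.length;
  }
  return [
    "<xml>",
    '<body units="characters" positions="internal">',
    "<highlight>",
    ...locLines,
    "</highlight>",
    "</body>",
    "</xml>",
  ].join("\n");
}

const extendedXml = buildExtendedHighlightXml(
  "Hello world, hello PDF world",
  "world",
  "FF00FF"
);
console.log(extendedXml);
```

The same loop can be driven by NLP output instead of a plain string search, as long as the entity offsets refer to the extracted text.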

Phrase highlighting support

See dtSearch Engine integration for the extended format that allows phrase highlighting with dtSearch.

