dtSearch Engine

Because the dtSearch engine can generate a PDF highlight file, the recommended way to integrate it with PDF Highlighter is the "highlight-for-xml" service method.

Follow the dtSearch guide for web search application development and hit highlighting. The client requirements mentioned in the "Highlighting hits in PDF files" article do not apply when a PDF Highlighter server is used. You should still implement the callback URL that serves "xml" as described in the dtSearch guide, but when combined with PDF Highlighter, any standard web browser can be used (no Acrobat Reader required).

Phrase Highlighting

Because the PDF highlight file format contains no hints about the search query or the relations between found words, PDF Highlighter cannot by itself mark the terms of a matching phrase as a single hit. It does, however, support phrase recognition when the additional "HitsByWord" data from the dtSearch results is provided.

HitsByWord data can be sent to Highlighter either as part of the highlight file or as a request parameter.

Option 1: Including HitsByWord with highlight file

To include HitsByWord with the highlight file, add a dtsearch_hitsByWord element to the generated XML, with the HitsByWord data as its text node, as in this example:

<XML>
<Body units=characters color=#ff00ff mode=active version=2>
<Highlight>
<loc pg=0 pos=42 len=9>
<loc pg=0 pos=52 len=4>
<loc pg=4 pos=34 len=9>
...
<loc pg=225 pos=151 len=4>
</Highlight>
<dtsearch_hitsByWord>
[0]park, 5 (4776 6378 7352 69193 69614 )
[1]community plan, 9 (6 7751 8180 8302 8698 9106 40587 40695 40701 )
</dtsearch_hitsByWord>
</Body>
</XML>

With more recent dtSearch APIs that provide the SearchResults.UrlEncodeItemWithIndexId method, the implementation is straightforward.

When collecting search results, get the URL-encoded item from dtSearch using the SearchResults.UrlEncodeItemWithIndexId method, not SearchResults.UrlEncodeItem, and add it to your search results page:

string urlEncodedItem = searchResults.UrlEncodeItemWithIndexId(
    iItem: iItem,
    indexId: indexId,
    additionalSearchFlags: dtsSearchWantHitsByWord | dtsSearchWantHitsArray | dtsSearchWantHitsByWordOrdinals,
    asSearch: true);

In the controller action that returns the highlight file from dtSearch, extend the generated file with the HitsByWord data:

[HttpGet]
[Route("pdf-highlight-xml")]
public IActionResult GetPdfHighlightXml([FromQuery] string urlEncodedItem)
{
    using (SearchResults res = new SearchResults())
    {
        res.UrlDecodeItemWithIndex(urlEncodedItem, indexPath);
        res.GetNthDoc(0);

        var hlXml = res.MakePdfWebHighlightFile(0);
        // get extra HitsByWord data used by PDF Highlighter for phrase matching
        var hitsByWord = res.CurrentItem.HitsByWord;
        if (hitsByWord != null && hitsByWord.Count > 0)
        {
            // wrap the HitsByWord lines in a dtsearch_hitsByWord element
            StringBuilder buf = new StringBuilder();
            buf.Append("<dtsearch_hitsByWord>\n");
            hitsByWord.ForEach(s => buf.Append(s).Append('\n'));
            buf.Append("</dtsearch_hitsByWord>\n");
            // insert the element just before the closing Body tag
            hlXml = hlXml.Replace("</Body>", buf.ToString() + "</Body>");
        }
        return Content(hlXml, "text/xml");
    }
}

Option 2: Add dtsearch_hitsByWord parameter to highlighting request

In the dtSearch application that executes the SearchJob, enable the dtsSearchWantHitsByWord and dtsSearchWantHitsArray flags, along the lines of:

SearchJob searchJob = new SearchJob();
// ... search job init
int searchFlags = form.getSearchFlags();
 
// update search flags for HitsByWord data
searchFlags |= SearchFlags.dtsSearchWantHitsByWord | SearchFlags.dtsSearchWantHitsArray;
 
// set updated flags and execute search
searchJob.setSearchFlags(searchFlags);
searchJob.execute();

Collect HitsByWord when reading search results:

SearchResults results = searchJob.getResults();
// ...
for (int i = firstHit; i <= lastHit && i < results.getCount(); ++i) {
    results.getNthDoc(i);
    // ... get document location and any other fields you need

    String hlInfo = results.urlEncodeItem(i); // get data about document hits
    String hitsByWord = results.getDocDetailItem("_hitsByWord"); // get HitsByWord details
    // ... save these two with your results
}

Add the HitsByWord data as the dtsearch_hitsByWord parameter to the highlight request, along with uri and xml:

String hlParams = "uri=" + urlEncode(document.getFilename()) + "&xml=" + urlEncode(callbackUrlToHighlightXml);
if (StringUtils.isNoneBlank(hitsByWord))
    hlParams += "&dtsearch_hitsByWord=" + urlEncode(hitsByWord);

Depending on the programming language and dtSearch API you use to retrieve HitsByWord, you may get a List<String> (list of strings) instead of a single String. In that case, join the strings with a newline character ('\n') as the separator and send the resulting string as the dtsearch_hitsByWord parameter to Highlighter.
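The joining step described above can be sketched in Java as follows. This is a minimal illustration, not part of the dtSearch or PDF Highlighter APIs: the class name HitsByWordParam and both method names are hypothetical, and the only assumption about the data is that each list entry is one HitsByWord line.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of dtSearch or PDF Highlighter.
public class HitsByWordParam {

    // Joins the HitsByWord entries with '\n' as the separator.
    public static String join(List<String> hitsByWord) {
        return String.join("\n", hitsByWord);
    }

    // Appends the URL-encoded dtsearch_hitsByWord parameter to an existing
    // parameter string; returns the input unchanged when there is no data.
    public static String appendTo(String hlParams, List<String> hitsByWord) {
        if (hitsByWord == null || hitsByWord.isEmpty()) {
            return hlParams;
        }
        return hlParams + "&dtsearch_hitsByWord="
                + URLEncoder.encode(join(hitsByWord), StandardCharsets.UTF_8);
    }
}
```

Calling HitsByWordParam.appendTo(hlParams, hitsByWordList) would then take the place of the manual concatenation of a single String. URLEncoder.encode with a Charset argument requires Java 10 or later.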

For common words, "HitsByWord" may contain more data than web servers allow in an HTTP GET request. When sending HitsByWord, it is recommended either to submit the highlighting request using POST, or to set the "dtsearch_hitsByWord" parameter to a callback URL from which PDF Highlighter can fetch the data.
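If the highlighting request is issued from Java code, the POST variant might look like the sketch below. It rests on assumptions that should be checked against your PDF Highlighter setup: that the parameters are accepted as an application/x-www-form-urlencoded body, and that highlighterUrl is whatever endpoint you already use for GET requests. The class HighlightPostRequest and its methods are hypothetical names; the request is only built here, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of dtSearch or PDF Highlighter.
public class HighlightPostRequest {

    // Encodes the parameters as an application/x-www-form-urlencoded body.
    public static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds a POST request that carries uri, xml and (optionally)
    // dtsearch_hitsByWord in the body instead of the query string.
    public static HttpRequest build(String highlighterUrl, String uri,
                                    String xml, String hitsByWord) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("uri", uri);
        params.put("xml", xml);
        if (hitsByWord != null && !hitsByWord.isBlank()) {
            params.put("dtsearch_hitsByWord", hitsByWord);
        }
        return HttpRequest.newBuilder(URI.create(highlighterUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }
}
```

Because the data travels in the request body, the URL length limits that affect GET requests no longer apply. When the highlighted document is opened directly by the browser, an HTML form with method POST serves the same purpose.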

