FoxTrot Search Forum
FoxTrot Search for macOS Forum

Re: Indexing PDFs with lots of text [message #1311 is a reply to message #1308] Sat, 27 November 2021 14:42
FoxTrot Engineering
Messages: 417
Registered: April 2020
Senior Member
To see exactly what has been indexed for a given document, Option-click it in the result list (or, in FoxTrot 7.1, use the "display type" popup menu in the window's toolbar). If the document has been parsed incorrectly or only partially, you will see that easily.

PDF parsing is not an exact science: in many cases the human-readable text must be reconstructed from individual characters spread over a page, with many font and encoding issues. Our FAQ gives some useful tips.

By default, FoxTrot uses Spotlight's metadata importer to extract text from PDFs at indexing time. I think that, like most (all?) Apple-provided Spotlight importers, it truncates its result to 10 MB of plain text in UTF-16 encoding, i.e. about 5,000,000 characters. The only workaround is to use Xpdf instead.
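To make the arithmetic concrete, here is a minimal Python sketch that estimates whether a given amount of extracted text would cross that limit. The 10 MB figure is only my understanding of the importer's behavior, and treating it as 10,000,000 bytes is an assumption; the function names are hypothetical and not part of FoxTrot or Spotlight.

```python
# Assumed cap: 10 MB of UTF-16 text, taken here as 10,000,000 bytes.
# This value is an estimate, not a documented Spotlight constant.
ASSUMED_LIMIT_BYTES = 10_000_000


def utf16_size(text: str) -> int:
    """Return the size of `text` in bytes when encoded as UTF-16 (no BOM)."""
    return len(text.encode("utf-16-le"))


def would_be_truncated(text: str, limit: int = ASSUMED_LIMIT_BYTES) -> bool:
    """True if the UTF-16 encoding of `text` exceeds the assumed limit."""
    return utf16_size(text) > limit


if __name__ == "__main__":
    sample = "a" * 5_000_001  # one character past the approximate limit
    print(utf16_size(sample), would_be_truncated(sample))
```

Most characters take 2 bytes in UTF-16, which is where the figure of about 5,000,000 characters comes from; characters outside the Basic Multilingual Plane take 4 bytes, so text containing many of them would hit the limit sooner.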

Xpdf is noticeably slower and produces different results: sometimes better, sometimes worse, depending on the document. There is currently no way to index some documents with Xpdf and others with Spotlight's importer, but that is something we would like to improve in the future.

Also, note that highlighting found words in the PDF preview uses a third PDF engine (PDFKit), which gives different results from both Spotlight's PDF importer and Xpdf.

The problem you mention, a cover page that prevents the following pages from being indexed, seems to be a bug in Spotlight's PDF importer. Is this document correctly indexed using Xpdf? Does word highlighting work correctly in this document? Maybe you can send me a sample file so I can report it to Apple.
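If you want to check outside FoxTrot which pages yield text with Xpdf, a rough Python sketch along these lines may help. It assumes the standalone `pdftotext` command-line tool from the Xpdf distribution is installed and on your PATH; its output may not match exactly what FoxTrot's built-in Xpdf engine produces, and the helper names are hypothetical.

```python
import subprocess


def pdftotext_cmd(pdf_path: str, page: int) -> list[str]:
    """Build a pdftotext command that writes one page as UTF-8 to stdout."""
    return ["pdftotext", "-f", str(page), "-l", str(page),
            "-enc", "UTF-8", pdf_path, "-"]


def extract_page(pdf_path: str, page: int) -> str:
    """Run pdftotext on a single page and return the extracted text."""
    result = subprocess.run(pdftotext_cmd(pdf_path, page),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def empty_pages(page_texts: dict[int, str]) -> list[int]:
    """Return the page numbers whose extracted text is blank."""
    return [n for n, text in sorted(page_texts.items()) if not text.strip()]


# Example (requires pdftotext and a real file; the path is a placeholder):
# texts = {n: extract_page("sample.pdf", n) for n in range(1, 4)}
# print(empty_pages(texts))
```

If the pages after the cover come back with text here, that points toward the Spotlight importer rather than the document itself.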


Jérôme - FoxTrot Engineering
 