FoxTrot Search Forum
FoxTrot Search for macOS Forum

Re: Indexing PDFs with lots of text [message #1310 is a reply to message #1308] Fri, 26 November 2021 22:47
Grant Barrett
Messages: 24
Registered: October 2019
Junior Member
Another thing I've done is test whether "garbage" characters were causing the problem.

I discovered I had two versions of the same document created by two different sources at two different times using two different PDF output methods. One of them was fully indexed and searchable by FoxTrot, the other was not. Their word counts were nearly the same.

I was able to determine that the cover page on the "bad" PDF was the problem: deleting it made the document fully indexable by FoxTrot. When I export that "bad" document as a new linearized PDF/A and then export its text, all of the text on the problematic cover page displays as squares with question marks in them. Usually this is the sign of a missing font, but a PDF should fall back to system fonts. It can also mean many other things that are beyond the scope of what we're doing here. (My fonts are in good shape, but I am cleaning the font cache and verifying everything just to be sure.)

In any case, I've used that and some other garbage text as a starting point to search the other "bad" files — and some "good" ones — but it looks like that one success may have been something of a red herring.

For example,

􏰃 = °
􏰁 = ±

If I search in FoxTrot for those two garbage characters (and not for what they're actually supposed to be), I find them in a number of PDFs that are fully searchable. And they are not found in all of the "bad" documents.
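The elimination logic here can be sketched in a few lines of Python. This is only an illustration, not anything FoxTrot provides: it assumes the text of each PDF has already been exported to a string by some other tool, and the function name `culprit_candidates` is made up for this example. A character is only a plausible culprit if it appears in every "bad" document and in none of the "good" ones.

```python
def culprit_candidates(bad_texts, good_texts):
    """Return characters present in every "bad" text and in no "good" text.

    bad_texts / good_texts are lists of strings (hypothetical inputs:
    the exported text of each PDF, obtained by whatever tool you use).
    """
    if not bad_texts:
        return set()
    # Characters common to all of the "bad" documents
    in_all_bad = set.intersection(*(set(t) for t in bad_texts))
    # Characters seen in at least one "good" document
    in_any_good = set().union(*(set(t) for t in good_texts))
    return in_all_bad - in_any_good


# Toy data: the private-use character shows up in a "good" text too,
# so only BEL (\x07) survives as a candidate.
bad = ["a\x07b", "c\x07\U0010FC03"]
good = ["ab\U0010FC03c"]
print(culprit_candidates(bad, good))
```

With the two characters above, this kind of check would return nothing, since they occur in searchable PDFs and are missing from some "bad" ones.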

There are, of course, many other potential "garbage" characters. My suspicion is that one or two of them are acting as an "end of document" marker, a BEL, or some other control character that FoxTrot does not like.
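One way to narrow the list without guessing is to scan the exported text for characters whose Unicode category marks them as something other than ordinary text. The following is a minimal sketch under assumptions: it works on text already exported from the PDF, the name `find_suspect_chars` is hypothetical, and the choice of which categories to flag is mine, not something known about FoxTrot's indexer.

```python
import unicodedata
from collections import Counter

# Unicode general categories worth flagging (an assumption, not a FoxTrot rule):
# Cc = control (BEL is here), Cf = format, Co = private use,
# Cn = unassigned, Cs = surrogate
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "Cs"}

# Control characters that are normal in exported text; form feed is
# often used as a page separator by text export tools.
ALLOWED = {"\n", "\r", "\t", "\f"}


def find_suspect_chars(text):
    """Count suspicious characters, keyed by (code point, category)."""
    counts = Counter()
    for ch in text:
        if ch in ALLOWED:
            continue
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES:
            counts[(f"U+{ord(ch):04X}", category)] += 1
    return counts


# Toy input containing one private-use character and one BEL
sample = "Temp 20\U0010FC03C \x07 end\n"
for (code_point, category), n in find_suspect_chars(sample).items():
    print(code_point, category, n)
```

Running this over a "bad" and a "good" export and comparing the two reports would show which code points are unique to the problem files.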

I am attaching the garbage text from the page that, once deleted from a "bad" PDF, allowed the PDF to be fully searchable. I suspect there may be one or more culprits in there, but I am not able to do a full trial-and-error right now.

[Updated on: Fri, 26 November 2021 22:47]
