FoxTrot Search Forum
FoxTrot Search for macOS Forum

Indexing PDFs with lots of text [message #1308] Fri, 26 November 2021 21:17
Grant Barrett
I'm trying to understand FoxTrot's limits when it comes to indexing the text of large PDF files.

I work with lots of dictionaries, so the files tend to have many pages of large dimensions, and a lot of text in a multicolumn format, often with many abbreviations and other languages in etymological notes, or even two or more languages equally throughout in the case of bilingual or multilingual dictionaries. The text in the PDFs is either created during the publishing process by the book publisher or later via OCR, often by me from out-of-copyright titles.

But I've recently noticed that for some PDFs, FoxTrot simply does not index all of the text. I've tried to pin down what these files have in common, but I can find no pattern. For example, the text of each of these PDFs is not fully indexed:

DOA1.pdf - 1976pp; 228.9 MB; 3,170,727 words; 18,120,048 characters
TCD1.pdf - 1864pp; 1.29 GB; 2,773,628 words; 16,932,731 characters
DOA2.pdf - 987pp; 114.7 MB; 1,572,111 words; 8,856,352 characters
HTO1.pdf - 1832pp; 1.65 GB; 3,259,509 words; 22,297,853 characters
DAA1.pdf - 675pp; 92.1 MB; 842,993 words; 6,580,574 characters

And these are files in which the PDFs' text *is* fully indexed:

HJG1.pdf - 1081pp; 1.13 GB; 857,170 words; 4,985,458 characters
LAL1.pdf - 731pp; 847 MB; 546,168 words; 3,090,798 characters
CAE1.pdf - 617pp; 720 MB; 647,164 words; 3,877,006 characters

Based on that, my best guess is that there is a character limit somewhere between 4,985,458 (the largest file that is fully indexed) and 6,580,574 (the smallest file that is not).
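A quick way to sanity-check that guess is to ask, for each figure listed above, whether the largest fully indexed file is still smaller than the smallest incompletely indexed one. The following is a minimal sketch using only the numbers from the two lists; the names (`gap`, `separates`, the dictionaries) are made up for illustration, and GB sizes are converted at 1 GB = 1000 MB as an assumption.

```python
# Figures copied from the two lists above: (pages, size in MB, words, characters).
# Sizes given in GB are converted at 1 GB = 1000 MB (an assumption that does
# not change the outcome of the comparison).
NOT_FULLY_INDEXED = {
    "DOA1": (1976, 228.9, 3_170_727, 18_120_048),
    "TCD1": (1864, 1290.0, 2_773_628, 16_932_731),
    "DOA2": (987, 114.7, 1_572_111, 8_856_352),
    "HTO1": (1832, 1650.0, 3_259_509, 22_297_853),
    "DAA1": (675, 92.1, 842_993, 6_580_574),
}
FULLY_INDEXED = {
    "HJG1": (1081, 1130.0, 857_170, 4_985_458),
    "LAL1": (731, 847.0, 546_168, 3_090_798),
    "CAE1": (617, 720.0, 647_164, 3_877_006),
}
METRICS = ("pages", "size_mb", "words", "characters")


def gap(metric):
    """Return (largest fully indexed value, smallest not-fully-indexed value)."""
    i = METRICS.index(metric)
    largest_ok = max(values[i] for values in FULLY_INDEXED.values())
    smallest_bad = min(values[i] for values in NOT_FULLY_INDEXED.values())
    return largest_ok, smallest_bad


def separates(metric):
    """True if a single threshold on this metric splits the two groups."""
    largest_ok, smallest_bad = gap(metric)
    return largest_ok < smallest_bad


for metric in METRICS:
    verdict = "separates" if separates(metric) else "overlaps"
    print(metric, gap(metric), verdict)
```

Of the four figures, only the character count splits the two groups cleanly. Page count, file size, and even word count overlap (DAA1 has fewer words than HJG1 yet is not fully indexed). With only eight files this is consistent with a character limit, not proof of one.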

I thought that perhaps a file's presence in the "resource hog" section of the "Indexed Data" tab of the "Manage Indices" window would indicate that it is not fully searchable, but it doesn't. Plenty of the files listed there are fully searchable, and plenty are not.

I am running the latest FoxTrot, 7.1.2. To try to resolve the problem, I rebuilt the Spotlight index of the drive in question, then did a complete rebuild of all my FoxTrot indexes. Neither step solved the problem.

One solution is to split the PDFs into smaller files. That works: all of the pieces of the larger file are then completely searchable. This confirms for me that, in some way, the file size or text size is the problem. It is not an ideal solution, though, because it prevents proximity searches across the points where the file is split (I am trying to figure out a script to split files with overlapping page ranges, like: 1-300, 299-601, 599-901, 899-1201, etc.).
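An overlapping split of that kind could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a fixed chunk length and a fixed overlap rather than the exact ranges quoted above, the function names and output file naming are hypothetical, and the actual splitting relies on the third-party pypdf library, which is not something FoxTrot provides. Whether the text layer of every PDF survives the split would need checking on real files.

```python
from pathlib import Path


def overlapping_ranges(total_pages, chunk=300, overlap=2):
    """Return 1-based inclusive (start, end) page ranges covering the file.

    Each range shares `overlap` pages with the previous one, so a phrase that
    straddles a split point still appears whole in at least one piece.
    """
    if chunk <= overlap:
        raise ValueError("chunk must be larger than overlap")
    if total_pages < 1:
        return []
    ranges = []
    start = 1
    while True:
        end = min(start + chunk - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            break
        start = end - overlap + 1  # step back so the pieces overlap
    return ranges


def split_pdf(src, chunk=300, overlap=2):
    """Write one PDF per overlapping range next to the source file."""
    # Imported here so the range logic above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party; assumed installed

    src = Path(src)
    reader = PdfReader(src)
    for start, end in overlapping_ranges(len(reader.pages), chunk, overlap):
        writer = PdfWriter()
        for index in range(start - 1, end):  # pypdf pages are 0-based
            writer.add_page(reader.pages[index])
        out = src.with_name(f"{src.stem}_{start:04d}-{end:04d}.pdf")
        with open(out, "wb") as handle:
            writer.write(handle)


print(overlapping_ranges(1000))
```

For a 1000-page file with the defaults, the ranges are 1-300, 299-598, 597-896, and 895-1000. Calling `split_pdf("DOA1.pdf")` (a hypothetical invocation) would then write one file per range.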

I created a test index of four of the problem files and indexed them using xpdf. That did seem to help, but it was incredibly slow to index, and I don't see a way to have some indexes use Spotlight and others use xpdf. I am about to gather as many of the problem files as I can find and run another xpdf test on all of them to see whether they all become fully searchable this way.

However, as I continue this time-consuming trial and error, I wanted to reach out and ask whether this problem is familiar to anyone and whether you can recommend solutions besides the ones I've come up with.

Thanks.
 