FoxTrot Search Forum: FoxTrot Search User Forum » Indexing PDFs with lots of text


Indexing PDFs with lots of text [message #1308]

Fri, 26 November 2021 21:17

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

I'm trying to understand FoxTrot's limits when it comes to indexing the text of large PDF files.

I work with lots of dictionaries, so the files tend to have many pages of large dimensions, and a lot of text in a multicolumn format, often with many abbreviations and other languages in etymological notes, or even two or more languages equally throughout in the case of bilingual or multilingual dictionaries. The text in the PDFs is either created during the publishing process by the book publisher or later via OCR, often by me from out-of-copyright titles.

But I've recently noticed that for some PDFs, FoxTrot will simply not index all their text. I've tried to pin down the characteristics of these files, but I can see no pattern. For example, these are all files in which the PDFs' text is not fully indexed:

DOA1.pdf - 1976pp; 228.9 MB; 3,170,727 words; 18,120,048 characters
TCD1.pdf - 1864pp; 1.29 GB; 2,773,628 words; 16,932,731 characters
DOA2.pdf - 987pp; 114.7 MB; 1,572,111 words; 8,856,352 characters
HTO1.pdf - 1832pp; 1.65 GB; 3,259,509 words; 22,297,853 characters
DAA1.pdf - 675pp; 92.1 MB; 842,993 words; 6,580,574 characters

And these are files in which the PDFs' text *is* fully indexed:

HJG1.pdf - 1081pp; 1.13 GB; 857,170 words; 4,985,458 characters
LAL1.pdf - 731pp; 847 MB; 546,168 words; 3,090,798 characters
CAE1.pdf - 617pp; 720 MB; 647,164 words; 3,877,006 characters

Based on that, the best guess I have is that it's a character limit between 4,985,458 and 6,580,574.
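The bounds in that guess can be checked mechanically. The sketch below takes the largest character count among the fully indexed files and the smallest among the incompletely indexed ones, using the figures from the two lists above. The function and variable names are made up for illustration.

```python
# Character counts copied from the two file lists in this post.
not_fully_indexed = {
    "DOA1.pdf": 18_120_048,
    "TCD1.pdf": 16_932_731,
    "DOA2.pdf": 8_856_352,
    "HTO1.pdf": 22_297_853,
    "DAA1.pdf": 6_580_574,
}
fully_indexed = {
    "HJG1.pdf": 4_985_458,
    "LAL1.pdf": 3_090_798,
    "CAE1.pdf": 3_877_006,
}


def limit_bounds(indexed, truncated):
    """Return (lower, upper) bounds for a hypothetical character limit.

    The limit must lie above the largest fully indexed count and at or
    below the smallest count that was not fully indexed.
    """
    lower = max(indexed.values())
    upper = min(truncated.values())
    if lower >= upper:
        raise ValueError("counts overlap; one character limit cannot explain them")
    return lower, upper


lower, upper = limit_bounds(fully_indexed, not_fully_indexed)
print(f"limit is between {lower:,} and {upper:,} characters")
```

This only shows that the data is consistent with a single cutoff; it does not show that a cutoff is the cause.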

I thought that perhaps the presence of files in the "resource hog" section of the "Indexed Data" tab of the "Manage Indices" window would also be an indicator to their incomplete searchability, but it isn't. There are plenty of files in there that are fully searchable, and plenty of files in there that are not fully searchable.

I have updated to the latest FoxTrot, 7.1.2. To try to resolve this problem, I rebuilt the Spotlight index of the drive in question, then I did a complete rebuild of all my FoxTrot indexes. None of that solved the problem.

One solution is to split the PDFs into smaller files. That works: all of the pieces of the larger file are then completely searchable. This confirms for me that in some way, the file or text size is a problem. This is not an ideal solution as it prevents proximity searches where the file is split (though I am trying to figure out a script to split files like: 1-300, 299-601, 599-901, 899-1201, etc.).

I created a test index of four of the problem files and indexed them using xpdf. That did seem to help, but it was incredibly slow to index, and I don't see a way to have some indexes use Spotlight and some use xpdf. I am about to take as many of the problem files as I can find and do another xpdf test of all of them to see if they are all now fully searchable this way.

However, as I continue this time-consuming trial-and-error, I wanted to reach out and ask if this problem is familiar to anyone and if you have recommended solutions besides the ones I've come up with.

Thanks.


Re: Indexing PDFs with lots of text [message #1309 is a reply to message #1308]

Fri, 26 November 2021 21:51

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

I forgot to mention that, when I am working out which files do not seem to be fully indexed, I am searching in FoxTrot for phrases that I have copied directly from the PDFs themselves. I can also search the PDFs in Preview and find the phrases.


Re: Indexing PDFs with lots of text [message #1310 is a reply to message #1308]

Fri, 26 November 2021 22:47

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

Another thing I've done is test whether "garbage" characters were causing the problem.

I discovered I had two versions of the same document created by two different sources at two different times using two different PDF output methods. One of them was fully indexed and searchable by FoxTrot, the other was not. Their word counts were nearly the same.

I was able to determine that the cover page on the "bad" PDF was the problem, and deleting it made it fully indexable by FoxTrot. That "bad" document, exported as a new linearized PDF/A, then having its text exported, shows that all of the text on that problematic cover page displays as squares with question marks in them. Usually this is the sign of a missing font, but a PDF should fall back to system fonts. It can also mean many other things beyond the scope of what we're doing here. (My fonts are in good shape but I am cleaning the font cache and verifying everything just to be sure.)

In any case, I've used that and some other garbage text as a starting point to search the other "bad" files — and some "good" ones — but it looks like that one success may have been something of a red herring.

For example,

􏰃 = °
􏰁 = ±

If I search in FoxTrot for those two garbage characters (and not for what they're actually supposed to be), I find them in a number of PDFs that are fully searchable. And they are not found in all of the "bad" documents.

There are, of course, many other potential "garbage" characters. My suspicion is that there are one or two garbage characters that are acting as an "end document" or BEL or some other weird character that FoxTrot does not like.

I am attaching the garbage text from the page that, once I deleted it from a "bad" PDF, allowed the PDF to be fully searchable. I suspect there may be one or more culprits in there but I am not able to do a full trial-and-error right now.

Attachment: garbage-text.txt
(Size: 2.48KB, Downloaded 135 times)

[Updated on: Fri, 26 November 2021 22:47]


Re: Indexing PDFs with lots of text [message #1311 is a reply to message #1308]

Sat, 27 November 2021 14:42

FoxTrot Engineering
Messages: 384
Registered: April 2020

Senior Member

To see what has been indexed exactly for a given document, option-click it in the result list (or use the "display type" popup menu in the window's toolbar, in FoxTrot 7.1). If the document has been incorrectly parsed, or partially parsed, you will see that easily.

PDF parsing is not an exact science, as in many cases the human-readable text must be reconstructed from individual characters spread over a page, with many font and encoding issues. Our FAQ gives some useful tips.

By default, FoxTrot uses Spotlight's metadata importer to extract text from PDFs at indexing time, and I think that, like most (all?) Apple-provided Spotlight importers, it truncates its result to 10 MB of plain text in UTF-16 encoding, i.e. about 5,000,000 characters. The only workaround is to use Xpdf instead.
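As a rough check of that arithmetic, here is a minimal sketch that converts a byte limit into a UTF-16 character capacity. Whether "10 MB" is meant as decimal or binary megabytes is not stated above, so both are shown. The function names are illustrative and not part of FoxTrot or Spotlight.

```python
BYTES_PER_UTF16_UNIT = 2  # one UTF-16 code unit; characters outside the BMP use two units


def utf16_char_capacity(limit_bytes):
    """Upper bound on the characters that fit in limit_bytes of UTF-16 text."""
    return limit_bytes // BYTES_PER_UTF16_UNIT


def utf16_size(text):
    """Actual size in bytes of text encoded as UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-le"))


decimal_capacity = utf16_char_capacity(10 * 1000 * 1000)
binary_capacity = utf16_char_capacity(10 * 1024 * 1024)
print(f"{decimal_capacity:,} characters if 10 MB means 10,000,000 bytes")
print(f"{binary_capacity:,} characters if 10 MB means 10,485,760 bytes")
```

Either reading lands near 5,000,000 characters, and text with many characters outside the Basic Multilingual Plane would reach the byte limit sooner.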

Using Xpdf is considerably slower, and produces different results: sometimes better, sometimes worse, depending on the document. There is currently no way to index some documents using Xpdf and others using Spotlight's importer, but that is something we would like to improve in the future.

Also, note that highlighting found words in the PDF preview uses a third PDF engine (PDFKit), which gives different results from both Spotlight's PDF importer and Xpdf.

The problem you mention of a cover page which prevents indexing the following pages seems to be a bug in Spotlight's PDF importer. Is this document correctly indexed using Xpdf? Does word highlight work correctly in this document? Maybe you can send me a sample file so I can report it to Apple.

Jérôme - FoxTrot Engineering


Re: Indexing PDFs with lots of text [message #1313 is a reply to message #1311]

Sat, 27 November 2021 16:44

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

Thank you for your helpful reply. I think my best course of action will be to a) determine which files exceed the character limit; b) split them into multiple PDFs with overlapping pages; c) continue to use the Spotlight importer.


Re: Indexing PDFs with lots of text [message #1314 is a reply to message #1311]

Sat, 27 November 2021 17:42

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

I put together a script using pdftotext that counts characters in files. I ran it on 10 problem files. Six of the 10 are well under 5,000,000 characters, which means that something else is amiss. Here are the counts:

1554967
2167651
2293207
2458849
2903970
3388007
5458417
7954470
11318310
11481914
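For readers who want to reproduce this kind of count, here is a minimal sketch, not the script used above. It assumes the pdftotext command-line tool is on the PATH and that passing "-" as the output file sends the text to standard output. The 5,000,000 figure is the approximate limit from the reply above, not an exact constant, and the function names are illustrative.

```python
import subprocess

APPROX_LIMIT = 5_000_000  # approximate limit quoted earlier in the thread


def count_characters(pdf_path):
    """Extract text with the pdftotext CLI and return its character count.

    Assumes pdftotext is installed; "-" writes the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return len(result.stdout.decode("utf-8", errors="replace"))


def split_by_limit(counts, limit=APPROX_LIMIT):
    """Partition counts into those under the limit and those at or over it."""
    under = [c for c in counts if c < limit]
    over = [c for c in counts if c >= limit]
    return under, over


# The ten counts listed above.
counts = [1554967, 2167651, 2293207, 2458849, 2903970,
          3388007, 5458417, 7954470, 11318310, 11481914]
under, over = split_by_limit(counts)
print(len(under), "under the limit,", len(over), "at or over it")
```

count_characters is not called here because it needs real PDF files; it would be used as count_characters("FILENAME.pdf").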

I have been, as you mention, option-clicking to look at the parsed text in FoxTrot. I have read the FAQ and taken all steps proposed there. I am now in the process of trying to determine what makes these other <5,000,000-character files problem files. I can make them, and the odd cover-page one, available to you via a private link.


Re: Indexing PDFs with lots of text [message #1315 is a reply to message #1314]

Sat, 27 November 2021 18:09

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

Thanks for bearing with me as I dribble out information. I have confirmed that all of the files with character counts under about 5,000,000 characters are, in fact, just fine. As you suggested, it looks like my confusion was the discrepancy between what is shown in macOS Preview, what is shown in the FoxTrot search snippets, and what is shown in the option-click indexed text.


Re: Indexing PDFs with lots of text [message #1316 is a reply to message #1315]

Sat, 27 November 2021 19:53

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

For anyone who may find them useful, I am attaching the scripts I am using to mitigate this problem.

PDF-character-count.sh — A modified version of a script from Timothy J. Luoma https://gist.github.com/tjluoma/205e3d85e46eb6025b87c6db5b77375b that adds a PDF file's character count to the PDF's file name. It uses pdftotext, which you can install in terminal with

pip install pdftotext

To use in the terminal:

PDF-character-count.sh FILENAME.pdf

You can also run it with a wildcard in a directory to do character counts of every PDF in the directory:

PDF-character-count.sh *.pdf

splitPDF-overlap.py — A slightly modified version of a script from Benjamin Han at http://www.cs.cmu.edu/~benhdj/Mac/unix.html#splitPDF. It divides a PDF into chunks at the page count you specify with a one-page overlap (to make sure that "words near" searches will be more likely to return results if what you want happens to be across the new document split). The only thing I changed in the script from Ben's version is changing

startPageNum = splitPageNum + 1

to

startPageNum = splitPageNum - 1

To use, figure out how you want to split your PDF. For example, running the character-count script on a big PDF shows it has 19,422,027 characters. Since we want our PDFs to be under 5,000,000 characters (which will ensure they are indexable by Spotlight), we will divide this file into four pieces, meaning that each resulting file will have about 4,855,507 characters if the text is spread evenly. But we have to convert that character count into a page count for this script. Our PDF file has 1153 pages, which divided into four chunks would be about 288 pages each. So our instructions to the script will give the split pages at 288-page intervals.

splitPDF-overlap.py FILENAME.pdf 288 576 864
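The arithmetic above can be sketched in a few lines. This is an illustration only, not part of either attached script. It assumes the text is spread roughly evenly across pages, and the function name plan_split is made up.

```python
import math


def plan_split(char_count, page_count, char_limit=5_000_000):
    """Return (chunks, pages_per_chunk, split_points) for a PDF.

    Assumes characters are distributed evenly over the pages, which is
    only an approximation for real documents.
    """
    chunks = math.ceil(char_count / char_limit)
    if chunks > page_count:
        raise ValueError("more chunks than pages; cannot split by whole pages")
    pages_per_chunk = page_count // chunks
    # One split point between each pair of consecutive chunks; the last
    # chunk absorbs any leftover pages.
    split_points = [pages_per_chunk * i for i in range(1, chunks)]
    return chunks, pages_per_chunk, split_points


chunks, pages_per_chunk, split_points = plan_split(19_422_027, 1153)
print(chunks, pages_per_chunk, split_points)
```

For the example file this gives four chunks of about 288 pages and the split pages 288, 576 and 864. If the text is unevenly distributed, a lower char_limit leaves some safety margin.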

Our resulting files will be:

FILENAME.part1.1_288.pdf
FILENAME.part2.287_576.pdf
FILENAME.part3.575_864.pdf
FILENAME.part4.863_1153.pdf

As you can see, consecutive files overlap at each split. File one ends at page 288, but file two starts at page 287, and so on.
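The page ranges in those file names follow directly from the split pages. Here is a minimal sketch of that calculation, assuming the same rule as the modified line in the script (the next chunk starts one page before the split page). The function names are illustrative, and the file-name pattern is copied from the list above.

```python
def overlapping_ranges(page_count, split_points, overlap=1):
    """Return (start, end) page ranges, 1-based and inclusive.

    Each chunk after the first starts `overlap` pages before the
    previous split page, mirroring startPageNum = splitPageNum - 1.
    """
    if any(b <= a for a, b in zip(split_points, split_points[1:])):
        raise ValueError("split points must be strictly increasing")
    if split_points and not (overlap < split_points[0] and split_points[-1] < page_count):
        raise ValueError("split points must fall inside the document")
    ranges = []
    start = 1
    for split in split_points:
        ranges.append((start, split))
        start = split - overlap
    ranges.append((start, page_count))
    return ranges


def part_names(stem, ranges):
    """Build names in the FILENAME.partN.START_END.pdf pattern shown above."""
    return [f"{stem}.part{i}.{start}_{end}.pdf"
            for i, (start, end) in enumerate(ranges, start=1)]


ranges = overlapping_ranges(1153, [288, 576, 864])
for name in part_names("FILENAME", ranges):
    print(name)
```

A larger overlap value would widen the shared region, at the cost of more duplicated pages in each piece.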

One side effect of this split is that the resulting split files can sometimes *each* be as large as or larger than the original file. To remedy this without noticeable quality loss, I run the following in terminal:

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}'

This recursively finds PDF files in the current directory and its subdirectories, runs ocrmypdf to invoke tesseract but *doesn't* rerun OCR (that's what the "--skip-text" is telling it), and optimizes the PDFs it finds, if possible. If the optimized PDF it makes is smaller than the one it started with, it keeps the new one. If the new one is larger than the one it started with, it scraps the new one, and keeps the original as-is. To use this, you need only to install ocrmypdf (which will then install tesseract and other dependencies), which can be done with homebrew. (I highly recommend it, anyway, as it is an excellent tool on top of tesseract, allowing it to handle PDFs more effectively and giving you much more control.)

If I were cleverer, I would combine this all into one grand script, but, alas, that is beyond my ken at the moment.

Attachment: PDF-character-count.sh
(Size: 1.38KB, Downloaded 133 times)
Attachment: splitPDF-overlap.py
(Size: 2.32KB, Downloaded 446 times)

[Updated on: Sat, 27 November 2021 20:05]


Re: Indexing PDFs with lots of text [message #1317 is a reply to message #1316]

Sat, 27 November 2021 23:19

Grant Barrett
Messages: 24
Registered: October 2019

Junior Member

I might as well include the page-counting script, too. It works just like the character-counting one, at least as far as using it in the terminal. Using it and the character-counting script means you don't have to open up each file by hand to figure out how many split chunks to make.

Attachment: PDF-page-count.sh
(Size: 1.29KB, Downloaded 117 times)


Re: Indexing PDFs with lots of text [message #1318 is a reply to message #1308]

Sat, 11 December 2021 23:57

Itkind
Messages: 3
Registered: December 2021

Junior Member

Hi, the help file says:

With FoxTrot 7.2 and later, you can specify whether to use Xpdf or Spotlight’s importer, for each individual file. In the search results list, select the files that need to be re-parsed, and select “choose PDF parser” in the contextual menu. This requires having write permission to these files, to set an extended attribute.
You can also set the PDF parser to use for specific files using one of the following Terminal.app commands:

xattr -w com.ctmdev.foxtrot.extractor xpdf [file ...]
xattr -w com.ctmdev.foxtrot.extractor spotlight [file ...]
xattr -d com.ctmdev.foxtrot.extractor [file ...]

Where can I find version 7.2?


Re: Indexing PDFs with lots of text [message #1319 is a reply to message #1318]

Mon, 13 December 2021 18:25

FoxTrot Engineering
Messages: 384
Registered: April 2020

Senior Member

The FAQ has been updated online by mistake; version 7.2 is not available yet.

Jérôme - FoxTrot Engineering

