FoxTrot Search for macOS Forum
Re: Indexing PDFs with lots of text [message #1316 is a reply to message #1315] Sat, 27 November 2021 19:53
Grant Barrett
Messages: 24
Registered: October 2019
Junior Member
For anyone who may find them useful, I am attaching the scripts I am using to mitigate this problem.

PDF-character-count.sh — A modified version of a script from Timothy J. Luoma (https://gist.github.com/tjluoma/205e3d85e46eb6025b87c6db5b77375b) that adds a PDF file's character count to the PDF's file name. It uses the pdftotext command-line tool, which is part of Poppler and can be installed in the terminal with
brew install poppler
To use in the terminal:

PDF-character-count.sh FILENAME.pdf
You can also run it with a wildcard in a directory to do character counts of every PDF in the directory:

PDF-character-count.sh *.pdf
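
In case the attachment ever goes missing, here is a minimal Python sketch of the same idea (my own reconstruction, not Luoma's script, and the file name is hypothetical). It assumes Poppler's pdftotext is on your PATH and appends each PDF's character count to its file name:

#!/usr/bin/env python3
# count-pdf-chars.py -- rough reconstruction, not the attached script.
# Assumes Poppler's pdftotext CLI is installed (brew install poppler).
import subprocess, sys
from pathlib import Path

for arg in sys.argv[1:]:
    pdf = Path(arg)
    # "-" tells pdftotext to write the extracted text to stdout.
    text = subprocess.run(["pdftotext", str(pdf), "-"],
                          capture_output=True, text=True, check=True).stdout
    count = len(text)
    print(f"{pdf.name}: {count} characters")
    # e.g. report.pdf -> report [1234567 chars].pdf
    pdf.rename(pdf.with_name(f"{pdf.stem} [{count} chars]{pdf.suffix}"))

As with the shell version, you can run it on a single file or let the shell expand a wildcard: python3 count-pdf-chars.py *.pdf
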
splitPDF-overlap.py — A slightly modified version of a script from Benjamin Han at http://www.cs.cmu.edu/~benhdj/Mac/unix.html#splitPDF. It divides a PDF into chunks at the page count you specify, with a one-page overlap (to make it more likely that "words near" searches will return results if what you want happens to fall across the new document split). The only change I made to Ben's version is
startPageNum = splitPageNum + 1
to
startPageNum = splitPageNum - 1
To use, figure out how you want to split your PDF. For example, running the character-count script on a big PDF shows it has 19,422,027 characters. Since we want our PDFs to be under 5,000,000 characters (which will ensure they are indexable by Spotlight), we will divide this file into four pieces, meaning that each resulting file will have about 4,855,507 characters. But we have to convert that character count into a page count for this script. Our PDF file has 1153 pages, which divided into four chunks would be about 288 pages each. So our instructions to the script will indicate the starting page number of each chunk at 288-page intervals.
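
That arithmetic is easy to script. Here is a throwaway Python snippet (mine, not part of either attached script) that turns a character count, page count, and character budget into the page arguments for the splitter:

import math
chars, pages, limit = 19_422_027, 1153, 5_000_000
pieces = math.ceil(chars / limit)            # 4 pieces
interval = pages // pieces                   # about 288 pages each
print([interval * i for i in range(1, pieces)])  # [288, 576, 864]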

splitPDF-overlap.py FILENAME.pdf 288 576 864
Our resulting files will be:

FILENAME.part1.1_288.pdf
FILENAME.part2.287_576.pdf
FILENAME.part3.575_864.pdf
FILENAME.part4.863_1153.pdf

As you can see, there is a one-page overlap at each split. File one ends at page 288, but file two starts at page 287, and so on.
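
If you would rather not rely on Ben's script, the same overlap trick is straightforward with the pypdf library (pip install pypdf). This is only a sketch of the idea under that assumption, not his code:

#!/usr/bin/env python3
# Sketch of the one-page-overlap split using pypdf -- not Ben's script.
import sys
from pypdf import PdfReader, PdfWriter

src = sys.argv[1]                        # e.g. FILENAME.pdf
splits = [int(n) for n in sys.argv[2:]]  # e.g. 288 576 864

reader = PdfReader(src)
# 1-based page ranges; each chunk after the first starts one page
# *before* the previous split point -- that's the overlap.
starts = [1] + [s - 1 for s in splits]
ends = splits + [len(reader.pages)]

for i, (start, end) in enumerate(zip(starts, ends), 1):
    writer = PdfWriter()
    for p in range(start - 1, end):      # pypdf pages are 0-indexed
        writer.add_page(reader.pages[p])
    out = src.replace(".pdf", f".part{i}.{start}_{end}.pdf")
    with open(out, "wb") as f:
        writer.write(f)
    print(out)

Run against the example above, this produces the same four overlapping files.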

One side effect of this split is that the resulting files can sometimes *each* be as large as or larger than the original file. To remedy this without noticeable quality loss, I run the following in the terminal:

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}'
This recursively finds PDF files in the current directory and its subdirectories and runs ocrmypdf on each one, but *doesn't* rerun OCR (--tesseract-timeout=0 makes tesseract give up immediately, and --skip-text skips pages that already have a text layer); it just optimizes the PDFs it finds, where possible. If the optimized PDF is smaller than the one it started with, it keeps the new one. If the new one is larger, it scraps the new one and keeps the original as-is. To use this, you only need to install ocrmypdf (which will pull in tesseract and other dependencies), which can be done with Homebrew. (I highly recommend it anyway, as it is an excellent tool on top of tesseract, allowing it to handle PDFs more effectively and giving you much more control.)

If I were cleverer, I would combine this all into one grand script, but, alas, that is beyond my ken at the moment.
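
That said, a rough outline of such a wrapper might look like the following. This is an untested sketch, not a finished tool: the character limit, the helper, and the assumption that splitPDF-overlap.py is executable and on your PATH are all mine.

#!/usr/bin/env python3
# Untested sketch combining the steps above. Assumes pdftotext,
# ocrmypdf, pypdf, and splitPDF-overlap.py are all installed/on PATH.
import math, subprocess, sys
from pypdf import PdfReader

LIMIT = 5_000_000  # character budget per file, per the discussion above

def char_count(pdf):
    out = subprocess.run(["pdftotext", pdf, "-"],
                         capture_output=True, text=True, check=True)
    return len(out.stdout)

for pdf in sys.argv[1:]:
    if char_count(pdf) > LIMIT:
        pages = len(PdfReader(pdf).pages)
        pieces = math.ceil(char_count(pdf) / LIMIT)
        interval = pages // pieces
        splits = [str(interval * i) for i in range(1, pieces)]
        subprocess.run(["splitPDF-overlap.py", pdf, *splits], check=True)
    # Optimize in place without re-running OCR, as described above.
    subprocess.run(["ocrmypdf", "--tesseract-timeout=0", "--skip-text",
                    "--jbig2-lossy", "--optimize", "3",
                    "--output-type", "pdf", pdf, pdf], check=True)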

[Updated on: Sat, 27 November 2021 20:05]
