FoxTrot Search for macOS Forum
Re: Indexing PDFs with lots of text [message #1316 is a reply to message #1315] Sat, 27 November 2021 19:53
Grant Barrett
Messages: 24
Registered: October 2019
Junior Member
For anyone who may find them useful, I am attaching the scripts I am using to mitigate this problem.

PDF-character-count.sh — A modified version of a script from Timothy J. Luoma (https://gist.github.com/tjluoma/205e3d85e46eb6025b87c6db5b77375b) that adds a PDF file's character count to the PDF's file name. It uses the pdftotext command-line tool, which is part of Poppler and can be installed in the terminal with
brew install poppler
To use in the terminal:

PDF-character-count.sh FILENAME.pdf
You can also run it with a wildcard in a directory to do character counts of every PDF in the directory:

PDF-character-count.sh *.pdf
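
In case the attachment ever goes missing, here is a minimal Python sketch of the same idea (my own reconstruction, not Luoma's script, and the file name is hypothetical). It assumes Poppler's pdftotext is on your PATH and appends each PDF's character count to its file name:

#!/usr/bin/env python3
# count-pdf-chars.py -- rough reconstruction, not the attached script.
# Assumes Poppler's pdftotext CLI is installed (brew install poppler).
import subprocess, sys
from pathlib import Path

for arg in sys.argv[1:]:
    pdf = Path(arg)
    # "-" tells pdftotext to write the extracted text to stdout.
    text = subprocess.run(["pdftotext", str(pdf), "-"],
                          capture_output=True, text=True, check=True).stdout
    count = len(text)
    print(f"{pdf.name}: {count} characters")
    # e.g. report.pdf -> report [1234567 chars].pdf
    pdf.rename(pdf.with_name(f"{pdf.stem} [{count} chars]{pdf.suffix}"))

As with the shell version, you can run it on a single file or let the shell expand a wildcard: python3 count-pdf-chars.py *.pdf
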
splitPDF-overlap.py — A slightly modified version of a script from Benjamin Han at http://www.cs.cmu.edu/~benhdj/Mac/unix.html#splitPDF. It divides a PDF into chunks at the page count you specify, with a one-page overlap (to make it more likely that "words near" searches will return results if what you want happens to fall across the new document split). The only change I made to Ben's version is
startPageNum = splitPageNum + 1
to
startPageNum = splitPageNum - 1
To use, figure out how you want to split your PDF. For example, running the character-count script on a big PDF shows it has 19,422,027 characters. Since we want our PDFs to be under 5,000,000 characters (which will ensure they are indexable by Spotlight), we will divide this file into four pieces, meaning that each resulting file will have about 4,855,507 characters. But we have to convert that character count into a page count for this script. Our PDF file has 1153 pages, which divided into four chunks would be about 288 pages each. So our instructions to the script will indicate the starting page number of each chunk at 288-page intervals.
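
That arithmetic is easy to script. Here is a throwaway Python snippet (mine, not part of either attached script) that turns a character count, page count, and character budget into the page arguments for the splitter:

import math
chars, pages, limit = 19_422_027, 1153, 5_000_000
pieces = math.ceil(chars / limit)            # 4 pieces
interval = pages // pieces                   # about 288 pages each
print([interval * i for i in range(1, pieces)])  # [288, 576, 864]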

splitPDF-overlap.py FILENAME.pdf 288 576 864
Our resulting files will be:

FILENAME.part1.1_288.pdf
FILENAME.part2.287_576.pdf
FILENAME.part3.575_864.pdf
FILENAME.part4.863_1153.pdf

As you can see, there is a one-page overlap at each split. File one ends at page 288, but file two starts at page 287, and so on.
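
If you would rather not rely on Ben's script, the same overlap trick is straightforward with the pypdf library (pip install pypdf). This is only a sketch of the idea under that assumption, not his code:

#!/usr/bin/env python3
# Sketch of the one-page-overlap split using pypdf -- not Ben's script.
import sys
from pypdf import PdfReader, PdfWriter

src = sys.argv[1]                        # e.g. FILENAME.pdf
splits = [int(n) for n in sys.argv[2:]]  # e.g. 288 576 864

reader = PdfReader(src)
# 1-based page ranges; each chunk after the first starts one page
# *before* the previous split point -- that's the overlap.
starts = [1] + [s - 1 for s in splits]
ends = splits + [len(reader.pages)]

for i, (start, end) in enumerate(zip(starts, ends), 1):
    writer = PdfWriter()
    for p in range(start - 1, end):      # pypdf pages are 0-indexed
        writer.add_page(reader.pages[p])
    out = src.replace(".pdf", f".part{i}.{start}_{end}.pdf")
    with open(out, "wb") as f:
        writer.write(f)
    print(out)

Run against the example above, this produces the same four overlapping files.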

One side effect of this split is that the resulting files can sometimes *each* be as large as or larger than the original file. To remedy this without noticeable quality loss, I run the following in the terminal:

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}'
This recursively finds PDF files in the current directory and its subdirectories and runs ocrmypdf on each one, but *doesn't* rerun OCR (--tesseract-timeout=0 makes tesseract give up immediately, and --skip-text skips pages that already have a text layer); it just optimizes the PDFs it finds, where possible. If the optimized PDF is smaller than the one it started with, it keeps the new one. If the new one is larger, it scraps the new one and keeps the original as-is. To use this, you only need to install ocrmypdf (which will pull in tesseract and other dependencies), which can be done with Homebrew. (I highly recommend it anyway, as it is an excellent tool on top of tesseract, allowing it to handle PDFs more effectively and giving you much more control.)

If I were cleverer, I would combine this all into one grand script, but, alas, that is beyond my ken at the moment.
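
That said, a rough outline of such a wrapper might look like the following. This is an untested sketch, not a finished tool: the character limit, the helper, and the assumption that splitPDF-overlap.py is executable and on your PATH are all mine.

#!/usr/bin/env python3
# Untested sketch combining the steps above. Assumes pdftotext,
# ocrmypdf, pypdf, and splitPDF-overlap.py are all installed/on PATH.
import math, subprocess, sys
from pypdf import PdfReader

LIMIT = 5_000_000  # character budget per file, per the discussion above

def char_count(pdf):
    out = subprocess.run(["pdftotext", pdf, "-"],
                         capture_output=True, text=True, check=True)
    return len(out.stdout)

for pdf in sys.argv[1:]:
    if char_count(pdf) > LIMIT:
        pages = len(PdfReader(pdf).pages)
        pieces = math.ceil(char_count(pdf) / LIMIT)
        interval = pages // pieces
        splits = [str(interval * i) for i in range(1, pieces)]
        subprocess.run(["splitPDF-overlap.py", pdf, *splits], check=True)
    # Optimize in place without re-running OCR, as described above.
    subprocess.run(["ocrmypdf", "--tesseract-timeout=0", "--skip-text",
                    "--jbig2-lossy", "--optimize", "3",
                    "--output-type", "pdf", pdf, pdf], check=True)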

[Updated on: Sat, 27 November 2021 20:05]
