Re: Indexing PDFs with lots of text [message #1316 is a reply to message #1315]
Sat, 27 November 2021 19:53
Grant Barrett (Messages: 35, Registered: October 2019, Member)
 For anyone who may find them useful, I am attaching the scripts I am using to mitigate this problem.
 PDF-character-count.sh: A modified version of a script from Timothy J. Luoma (https://gist.github.com/tjluoma/205e3d85e46eb6025b87c6db5b77375b) that adds a PDF file's character count to the PDF's file name. It uses pdftotext, which you can install from the terminal. To use it in the terminal:
 
     PDF-character-count.sh FILENAME.pdf
 
 You can also run it with a wildcard in a directory to do character counts of every PDF in the directory:
 
     PDF-character-count.sh *.pdf
 
 splitPDF-overlap.py: A slightly modified version of a script from Benjamin Han at http://www.cs.cmu.edu/~benhdj/Mac/unix.html#splitPDF. It divides a PDF into chunks at the page numbers you specify, with a one-page overlap (so that "words near" searches are more likely to return results when the text you want straddles a split). The only thing I changed from Ben's version is changing
 
     startPageNum = splitPageNum + 1
 
 to
 
     startPageNum = splitPageNum - 1
 
 To use it, first figure out how you want to split your PDF. For example, running the character-count script on a big PDF shows it has 19,422,027 characters. Since we want our PDFs to be under 5,000,000 characters (which will ensure they are indexable by Spotlight), we will divide this file into four pieces, so each resulting file will have about 4,855,507 characters. But we have to convert that character count into a page count for this script. Our PDF file has 1153 pages, which divided into four chunks is about 288 pages each. So our instructions to the script will give the split page numbers at 288-page intervals:
 
     splitPDF-overlap.py FILENAME.pdf 288 576 864
 
 Our resulting files will be:
 FILENAME.part1.1_288.pdf
 FILENAME.part2.287_576.pdf
 FILENAME.part3.575_864.pdf
 FILENAME.part4.863_1153.pdf
 
 As you can see, the files overlap at each split: file one ends at page 288, but file two starts at page 287, and so on.
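 
 For anyone who would rather not do the arithmetic by hand, here is a rough Python sketch of the conversion from character count to split points, plus the page ranges that come out the other end. It is not part of either script: the function names plan_split_points and overlapping_ranges are made up for illustration, and it assumes the text is spread evenly across the pages, which real PDFs often violate.

```python
import math


def plan_split_points(total_chars, total_pages, max_chars=5_000_000):
    """Return the page numbers to pass to the split script.

    Assumes characters are spread evenly across pages (an approximation).
    """
    # Smallest number of chunks that keeps the average chunk under the limit.
    chunks = math.ceil(total_chars / max_chars)
    pages_per_chunk = total_pages // chunks
    return [pages_per_chunk * i for i in range(1, chunks)]


def overlapping_ranges(split_points, total_pages):
    """Return (first_page, last_page) for each part, matching the file names above."""
    ranges = []
    start = 1
    for split in split_points:
        ranges.append((start, split))
        # Mirrors the modified line: startPageNum = splitPageNum - 1
        start = split - 1
    ranges.append((start, total_pages))
    return ranges


points = plan_split_points(19_422_027, 1153)
print(points)
for part, (first, last) in enumerate(overlapping_ranges(points, 1153), start=1):
    print(f"FILENAME.part{part}.{first}_{last}.pdf")
```

 Because the even-spread assumption is only approximate, it is worth running the character-count script again on the resulting parts to confirm each one really is under the limit.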
 
 One side effect of this split is that the resulting split files can sometimes *each* be as large as or larger than the original file. To remedy this without noticeable quality loss, I run the following in terminal:
 
     find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}'
 
 This recursively finds PDF files in the current directory and its subdirectories, runs ocrmypdf to invoke tesseract but *doesn't* rerun OCR (that's what the "--skip-text" is telling it), and optimizes the PDFs it finds, if possible. If the optimized PDF it makes is smaller than the one it started with, it keeps the new one. If the new one is larger than the one it started with, it scraps the new one and keeps the original as-is. To use this, you need only install ocrmypdf (which will then install tesseract and other dependencies), which can be done with homebrew. (I highly recommend it anyway, as it is an excellent tool on top of tesseract, allowing it to handle PDFs more effectively and giving you much more control.)
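 
 To make the keep-the-smaller-file rule concrete, here is a minimal Python sketch of that decision on its own. This is only an illustration of the behavior described above, not ocrmypdf's actual code; the function name keep_smaller and the idea of writing the optimized copy to a separate path first are my own assumptions.

```python
import os


def keep_smaller(original_path, optimized_path):
    """Keep whichever of the two files is smaller, under the original's name.

    Returns True if the original was replaced by the optimized copy.
    Hypothetical helper for illustration; not taken from ocrmypdf.
    """
    if os.path.getsize(optimized_path) < os.path.getsize(original_path):
        # The optimized copy is smaller: it takes the original's place.
        os.replace(optimized_path, original_path)
        return True
    # The optimized copy is the same size or larger: discard it.
    os.remove(optimized_path)
    return False
```

 Either way only one file is left behind, under the original name, so the step is safe to run over a whole directory.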
 If I were cleverer, I would combine this all into one grand script, but, alas, that is beyond my ken at the moment.
 [Updated on: Sat, 27 November 2021 20:05]