how to find all non searchable pdf [message #1193] |
Sun, 02 May 2021 20:05 |
SonicxSonicx
Messages: 13 Registered: April 2021
|
Junior Member |
|
|
Hello,
could you help me?
I want to find all non-searchable PDFs (I mean PDFs without OCR, like image-only PDFs).
I have thousands and thousands of PDFs on my storage, and I want to move all "non-searchable" PDFs to a new directory for OCR.
I set:
- all items of type: PDF
- then apply advanced filter: Contents: Does not contain the string: a (I use "a" because it is the most common character)
But the result contains a lot of PDFs, including searchable PDFs that contain "a".
Could you help me please?
|
|
|
|
Re: how to find all non searchable pdf [message #1200 is a reply to message #1193] |
Tue, 04 May 2021 10:36 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
I am not sure why this does not seem to work for you. It should. However, to find PDFs with no text at all, instead of [does not contain the string] [a], it is better to use:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] []
If you want to find PDF files whose textual content length is less than 1000 characters (instead of absolutely empty), the following should theoretically work:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}$]
However, due to a bug, the latter currently does not work. This will be fixed in release 7.1, but in the meantime this one works:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}\x00*$]
Also note that this regular expression currently finds documents whose content length is between 1 and 1000 characters, and misses documents with a length of 0 characters. This will also be fixed in 7.1. Note that the maximum length you can search for with this kind of regular expression is 65535.
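As a cross-check outside FoxTrot, the same "at most 1000 characters of text" idea can be approximated in the shell. This is only a sketch under assumptions that are not part of FoxTrot: it relies on the pdftotext command-line tool (shipped with Xpdf and Poppler) being installed, and the names char_count and is_short_pdf are invented for illustration. The text pdftotext extracts is not necessarily the text FoxTrot indexes.

```shell
#!/bin/sh
# Sketch only: char_count and is_short_pdf are hypothetical helper names.
# Assumes pdftotext (Xpdf or Poppler) is on the PATH.

# Print the number of characters read from standard input.
char_count() {
    wc -m | tr -d '[:space:]'
}

# Succeed when the text extracted from PDF $1 has at most $2 characters
# (default 1000, the threshold used in the regular expression above).
# A file that pdftotext cannot read yields no text and also counts as short.
is_short_pdf() {
    limit=${2:-1000}
    count=$(pdftotext "$1" - 2>/dev/null | char_count)
    [ "${count:-0}" -le "$limit" ]
}
```

For example, is_short_pdf scan.pdf && echo "little or no text" would flag a single file. The count includes spaces, line breaks, and any page-break characters in the extracted text.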
Jérôme - FoxTrot Engineering
|
|
|
Re: how to find all non searchable pdf [message #1201 is a reply to message #1200] |
Tue, 04 May 2021 17:53 |
SonicxSonicx
Messages: 13 Registered: April 2021
|
Junior Member |
|
|
Thank you,
but it doesn't work.
1)
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] []
No results: this filter finds zero PDFs, but there are many.
2)
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}$]
No results: this filter finds zero PDFs, but there are many.
Another app (DEVONthink) has a filter for this: "Word count" "is" "0".
But that app (DEVONthink) can't do what I need.
|
|
|
|
Re: how to find all non searchable pdf [message #1203 is a reply to message #1201] |
Tue, 04 May 2021 18:24 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
Did you enable "Prefer Xpdf for PDF documents" in the First Aid / manage third party importers window (that can be opened by pressing the command and option keys when launching FoxTrot)?
If so, Xpdf seems to always return a single space as content instead of an empty string.
However, searching for:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] [ ] <- type a space here!
does not work either (this will be fixed in version 7.1).
But the following should actually work in 7.0.4:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [ignore blanks] []
The regular expression, modified to work around the bug in 7.0.4, should also work:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^ \x00*$] <- don't miss the space between ^ and \
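The [ignore blanks] idea, treating content that consists only of spaces or line breaks as empty, can be mimicked in the shell to see what an extractor returns for a given file. This is a minimal sketch, assuming the pdftotext tool from Xpdf or Poppler is installed; is_blank_text and has_no_text are names made up here, and pdftotext may not return exactly what FoxTrot's importer indexes.

```shell
#!/bin/sh
# Sketch only: is_blank_text and has_no_text are hypothetical helper names.
# Assumes pdftotext (Xpdf or Poppler) is on the PATH.

# Succeed when standard input is empty or holds nothing but whitespace
# (spaces, tabs, newlines, form feeds).
is_blank_text() {
    ! grep -q '[^[:space:]]'
}

# Succeed when pdftotext extracts no visible text from PDF $1.
# A file that pdftotext cannot read also ends up in this category.
has_no_text() {
    pdftotext "$1" - 2>/dev/null | is_blank_text
}
```

Running has_no_text on a file that FoxTrot reports unexpectedly is a quick way to tell whether the PDF really lacks a text layer or only appears to.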
Jérôme - FoxTrot Engineering
|
|
|
|
Re: how to find all non searchable pdf [message #1206 is a reply to message #1205] |
Wed, 05 May 2021 09:51 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
Option-click on a found file (or use the "display type: plain text" popup menu in the toolbar) to show the text that has been extracted and indexed. This should help you understand why some files are found when you think they should not be, and vice versa. This will also show some metadata that you can target with an additional condition, for example:
[then apply advanced filter] [other metadata] [contains one of the strings / does not contain any of the strings] [√ ignore case, √ ignore accents, √ multiple strings] [SomeScannerBrand—SomeAppSignature—etc]
(use the "em-dash" character, not the regular hyphen -, to separate the strings, i.e. shift-option-dash on a QWERTY keyboard, or option-dash on some other keyboards)
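A similar metadata check can be tried on the command line. This is a rough sketch, not FoxTrot's behaviour: it assumes the pdfinfo tool (Xpdf or Poppler) is installed and looks only at its Producer and Creator lines, whereas FoxTrot's [other metadata] may cover other fields. The function names are invented, the strings are the same placeholders as above, and it ignores case but not accents.

```shell
#!/bin/sh
# Sketch only: contains_any and producer_matches are hypothetical names.
# Assumes pdfinfo (Xpdf or Poppler) is on the PATH.

# Succeed when standard input contains at least one of the given strings,
# ignoring case. Strings are matched literally, not as regular expressions.
contains_any() {
    [ "$#" -gt 0 ] || return 1
    text=$(cat)
    for needle in "$@"; do
        if printf '%s\n' "$text" | grep -q -i -F -e "$needle"; then
            return 0
        fi
    done
    return 1
}

# Succeed when the Producer or Creator line that pdfinfo reports for PDF $1
# contains one of the remaining arguments.
producer_matches() {
    file=$1
    shift
    pdfinfo "$file" 2>/dev/null | grep -E '^(Producer|Creator):' | contains_any "$@"
}
```

For example, producer_matches file.pdf SomeScannerBrand SomeAppSignature succeeds when either placeholder string appears in those two lines.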
Jérôme - FoxTrot Engineering
|
|
|
|
Re: how to find all non searchable pdf [message #1270 is a reply to message #1207] |
Wed, 29 September 2021 12:50 |
Des Bw
Messages: 26 Registered: June 2017
|
Junior Member |
|
|
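I get good results with pdffonts.
Here is a simple script I run using Hazel:
# Hazel passes the file path as $1; exit status 0 means the file matches.
if pdffonts "$1" | grep -q Type
then
    # Fonts are listed, so the file is already OCRed: FAIL the match.
    exit 1
else
    # No fonts are listed: image-only PDF that still needs OCR.
    exit 0
fi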
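Without Hazel, the same pdffonts test can be run over a folder to move the PDFs that list no fonts into a separate directory for OCR. This is a sketch under assumptions: pdffonts (Xpdf or Poppler) is installed, its output starts with two header lines followed by one row per font, and has_fonts and move_unsearchable are names made up for this example. It only looks at files directly inside the source folder, not in subfolders.

```shell
#!/bin/sh
# Sketch only: has_fonts and move_unsearchable are hypothetical names.
# Assumes pdffonts (Xpdf or Poppler) is on the PATH.

# Read pdffonts output on standard input; succeed when at least one
# non-empty row follows the two header lines.
has_fonts() {
    awk 'NR > 2 && NF { found = 1 } END { exit !found }'
}

# Move every *.pdf directly inside $1 (no subfolders) that lists no fonts
# into $2, and print the original path of each moved file.
# A file that pdffonts cannot read lists no fonts and is moved as well.
move_unsearchable() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*.pdf; do
        [ -f "$f" ] || continue
        if ! pdffonts "$f" 2>/dev/null | has_fonts; then
            mv -- "$f" "$dest"/ && printf '%s\n' "$f"
        fi
    done
}
```

A call such as move_unsearchable ~/Documents/pdf ~/Documents/to-ocr (both paths are examples) prints each file it moves. It is worth trying it on a copy of a few files first, since a PDF that lists no fonts is only a heuristic for "not OCRed".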
|
|
|