FoxTrot Search Forum
FoxTrot Search for macOS Forum

how to find all non searchable pdf [message #1193] Sun, 02 May 2021 20:05
SonicxSonicx
Messages: 8
Registered: April 2021
Junior Member
Hello,

could you help me?

I want to find all non-searchable PDFs (I mean PDFs without OCR, such as image-only PDFs).

I have thousands and thousands of PDFs on my storage, and I want to move all the "non-searchable" ones to a new directory for OCR.

I set:

- all items of type: PDF
- then apply advanced filter: Contents: does not contain the string: a (I use "a" because it is the most common character)

But the result contains a lot of PDFs, including searchable PDFs that contain "a".

Could you help me please?

Re: how to find all non searchable pdf [message #1197 is a reply to message #1193] Mon, 03 May 2021 08:55
Darren Ingram
Messages: 12
Registered: May 2018
Junior Member
I did not find a solution, at least not in the past. In the end I hired a company to write a small application to do the job for me. It was worth it at the time for a specific project.
Re: how to find all non searchable pdf [message #1200 is a reply to message #1193] Tue, 04 May 2021 10:36
FoxTrot Engineering
Messages: 383
Registered: April 2020
Senior Member
I am not sure why this does not seem to work for you. It should. However, to find PDFs with no text at all, instead of [does not contain the string] [a], it is better to use:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] []

If you want to find PDF files whose textual content is at most 1000 characters long (instead of absolutely empty), the following should theoretically work:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}$]

However, due to a bug, the latter currently does not work. This will be fixed in release 7.1, but in the meantime this one works:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}\x00*$]

Also note that this regular expression currently finds documents whose content length is between 1 and 1000 characters, and misses documents whose content length is 0. This will also be fixed in 7.1. Note that the maximum length you can search for with this kind of regular expression is 65535.
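
As a rough illustration of the same length threshold outside FoxTrot, the sketch below counts the characters arriving on standard input and succeeds when the count is at most a limit (1000 by default). That is what ^.{0,1000}$ expresses when . also matches newlines. The function name text_is_short is hypothetical, and feeding it from pdftotext (part of Xpdf and Poppler) is an assumption, not a description of what FoxTrot does internally.

```shell
# Hypothetical helper, not part of FoxTrot.
# Succeeds (exit status 0) when stdin holds at most LIMIT characters.
# Newlines count as characters, as they do for "." when the
# ". also applies to newlines" option is enabled.
text_is_short() {
    limit="${1:-1000}"
    # wc -m counts characters in the current locale;
    # tr strips the padding that some wc versions add
    count=$(wc -m | tr -d '[:space:]')
    [ "$count" -le "$limit" ]
}
```

Possible usage, assuming pdftotext is installed: pdftotext some.pdf - | text_is_short 1000 && echo "little or no text"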


Jérôme - FoxTrot Engineering
Re: how to find all non searchable pdf [message #1201 is a reply to message #1200] Tue, 04 May 2021 17:53
SonicxSonicx
Messages: 8
Registered: April 2021
Junior Member
Thank you,

but it doesn't work.

1)
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] []

No results: this filter finds zero PDFs, but there are many.

2)
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}$]


No results: this filter finds zero PDFs, but there are many.

Another app (DEVONthink) has a filter for this: "Word count" "is" "0".

But DEVONthink can't do what I need.

Re: how to find all non searchable pdf [message #1202 is a reply to message #1197] Tue, 04 May 2021 17:54
SonicxSonicx
Messages: 8
Registered: April 2021
Junior Member
Dear Darren,

Could you provide me with this app? Of course, we can share the costs.

Thank you.
Re: how to find all non searchable pdf [message #1203 is a reply to message #1201] Tue, 04 May 2021 18:24
FoxTrot Engineering
Messages: 383
Registered: April 2020
Senior Member
Did you enable "Prefer Xpdf for PDF documents" in the First Aid / manage third party importers window (that can be opened by pressing the command and option keys when launching FoxTrot)?
If so, Xpdf seems to always return a single space as content instead of an empty string.

However, searching for:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] [ ] <- type a space here!
does not work either (it will be fixed in version 7.1)

But the following should actually work in 7.0.4:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [ignore blanks] []

Also, the modified regular expression to work around the bug in 7.0.4 should also work:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^ \x00*$] <- don't miss the space between ^ and \
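
The idea behind [ignore blanks] can also be sketched in shell: treat content as empty when nothing remains after removing whitespace and NUL bytes. This is a minimal sketch under assumptions. text_is_blank is a hypothetical name, and the input is assumed to come from an extractor such as pdftotext, which typically emits page-break form feeds even for image-only PDFs, much like the single space returned by the Xpdf importer.

```shell
# Hypothetical helper, not part of FoxTrot.
# Succeeds (exit status 0) when stdin contains nothing but
# whitespace (spaces, tabs, newlines, form feeds) and NUL bytes.
text_is_blank() {
    # Delete every whitespace character and every NUL byte,
    # then check whether anything is left
    remaining=$(tr -d '[:space:]\000')
    [ -z "$remaining" ]
}
```

Possible usage, assuming pdftotext is installed: pdftotext some.pdf - | text_is_blank && echo "no text layer found"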


Jérôme - FoxTrot Engineering
Re: how to find all non searchable pdf [message #1205 is a reply to message #1203] Tue, 04 May 2021 21:57
SonicxSonicx
Messages: 8
Registered: April 2021
Junior Member
Thank you for your time, but each of these steps gives different results, and none is ideal.

Either a lot of non-OCR PDFs are missing, or a lot of OCR PDFs are included.
Re: how to find all non searchable pdf [message #1206 is a reply to message #1205] Wed, 05 May 2021 09:51
FoxTrot Engineering
Messages: 383
Registered: April 2020
Senior Member
Option-click on a found file (or use the "display type: plain text" popup menu in the toolbar) to show the text that has been extracted and indexed. This should help to understand why some files are found when you think they should not, and vice versa. Also, this will show some metadata that you can target with an additional condition, for example:
[then apply advanced filter] [other metadata] [contains one of the strings / does not contain any of the strings] [√ ignore case, √ ignore accents, √ multiple strings] [SomeScannerBrand—SomeAppSignature—etc]

(Use the "em-dash" character, not the ordinary hyphen (-), to separate strings: shift-option-dash on a QWERTY keyboard, or option-dash on some other keyboards.)
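
The metadata condition above has a rough command-line counterpart. pdfinfo (shipped with Xpdf and Poppler) prints Producer: and Creator: lines, and the sketch below checks those two lines against a list of signatures, ignoring case. producer_matches is a hypothetical helper, and the signatures are the same placeholders used above, not real product names.

```shell
# Hypothetical helper, not part of FoxTrot.
# Reads pdfinfo-style output on stdin and succeeds when the Producer
# or Creator line matches one of the alternatives in $1.
# $1 is an extended regular expression, e.g. 'BrandA|BrandB'.
producer_matches() {
    # Keep only the two metadata lines of interest, then match
    # case-insensitively (similar to the "ignore case" option)
    grep -E '^(Producer|Creator):' | grep -Eiq -- "$1"
}
```

Possible usage, assuming pdfinfo is installed: pdfinfo some.pdf | producer_matches 'SomeScannerBrand|SomeAppSignature' && echo "scanner output"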


Jérôme - FoxTrot Engineering
Re: how to find all non searchable pdf [message #1207 is a reply to message #1202] Wed, 05 May 2021 09:55
Darren Ingram
Messages: 12
Registered: May 2018
Junior Member
As it was custom-written, I suspect there would be problems with its intellectual property. However, if you wish to contact the company and ask about "PDF Reporter", which they wrote for me, maybe they can do a version for you?


Shane Stanley sstanley«~at~»myriad-com«|dot|»com«|dot|»au is the person to contact. Do mention our exchange.
Re: how to find all non searchable pdf [message #1270 is a reply to message #1207] Wed, 29 September 2021 12:50
Des Bw
Messages: 26
Registered: June 2017
Junior Member
I find good results with pdffonts.
Here is a simple script I run using Hazel:

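# The script uses $1 as the path of the PDF to test.
# pdffonts lists the fonts used in a PDF; font rows contain "Type"
# (Type 1, TrueType, ...), so a match means the file has a text layer.
if pdffonts "$1" 2>/dev/null | grep -q 'Type'
then
   exit 1   # FAIL when the file is OCRed
else
   exit 0   # pass when no font is found (image-only PDF)
fi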
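
To cover the original goal of moving every non-searchable PDF into a separate folder for OCR, the same pdffonts test can be wrapped in a loop. This is only a sketch under assumptions: has_fonts, move_unsearchable and the LIST_FONTS variable are hypothetical names, pdffonts output is assumed to be two header lines followed by one row per font, and the destination folder should sit outside the source folder. A PDF that pdffonts cannot read is treated as having no fonts, so review the moved files before running OCR on them.

```shell
# Succeeds when the pdffonts-style listing on stdin has at least one font row.
has_fonts() {
    # Skip the two header lines, then count the non-empty rows
    # (grep -c exits non-zero on a count of 0, hence "|| true")
    rows=$(sed '1,2d' | grep -c . || true)
    [ "$rows" -gt 0 ]
}

# Move PDFs without any font (likely image-only) from folder $1 into folder $2.
move_unsearchable() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    # LIST_FONTS lets you swap in another font lister; it defaults to pdffonts.
    # File names containing newlines are not handled by this simple loop.
    find "$src" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        if ! "${LIST_FONTS:-pdffonts}" "$pdf" 2>/dev/null | has_fonts; then
            mv -n "$pdf" "$dest/"   # -n: never overwrite an existing file
        fi
    done
}
```

Possible usage: move_unsearchable "$HOME/Documents/PDFs" "$HOME/Documents/ToOCR" (both paths are placeholders).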