FoxTrot Search Forum: FoxTrot Search User Forum » how to find all non searchable pdf


how to find all non searchable pdf [message #1193]

Sun, 02 May 2021 20:05

Sonicx
Messages: 13
Registered: April 2021

Junior Member

Hello,

Could you help me?

I want to find all non-searchable PDFs (I mean PDFs without OCR, such as image-only PDFs).

I have thousands and thousands of PDFs on my storage, and I want to move all the "non-searchable" ones to a new directory for OCR.

I set:

- all items of type: PDF
- then apply advanced filter: Contents: Does not contain the string: a (I use "a" because it is the most common character)

But the results contain a lot of PDFs, including searchable PDFs that contain "a".

Could you help me, please?


Re: how to find all non searchable pdf [message #1197 is a reply to message #1193]

Mon, 03 May 2021 08:55

Darren Ingram
Messages: 12
Registered: May 2018

Junior Member

I did not find a solution, at least not in the past. In the end I hired a company to write a small application to do the job for me. It was worth it at the time for a specific project.


Re: how to find all non searchable pdf [message #1200 is a reply to message #1193]

Tue, 04 May 2021 10:36

FoxTrot Engineering
Messages: 427
Registered: April 2020

Senior Member

I am not sure why this does not seem to work for you. It should. However, to find PDFs with no text at all, instead of [does not contain the string] [a], it is better to use:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] []

If you want to find PDF files whose textual content length is less than 1000 characters (instead of absolutely empty), the following should theoretically work:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}$]

However, due to a bug, the latter currently does not work. This will be fixed in release 7.1, but in the meantime this one works:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}\x00*$]

Also note that this regular expression currently finds documents whose content length is between 1 and 1000 characters, and misses a length of 0 characters. This will also be fixed in 7.1. Note that the maximum length you can search with this kind of regular expression is 65535.
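
For anyone who wants to cross-check the length test outside FoxTrot, here is a rough shell sketch. It assumes the text has already been extracted by some other tool (for example `pdftotext file.pdf -` from Poppler or Xpdf, which may extract text differently from FoxTrot). The function name `is_short_text` and its default of 1000 are illustrative only and simply mirror the {0,1000} bound above.

```shell
#!/bin/sh
# is_short_text [MAX]: read extracted text on stdin and succeed (status 0)
# when it contains at most MAX characters. Hypothetical helper; MAX
# defaults to 1000 to mirror the {0,1000} bound in the regular expression.
is_short_text() {
    max=${1:-1000}
    # wc -m counts characters in the current locale; $(( )) strips any padding.
    count=$(($(wc -m)))
    [ "$count" -le "$max" ]
}

# Example (assumes pdftotext is installed):
#   pdftotext scan.pdf - | is_short_text 1000 && echo "scan.pdf has little or no text"
```

Unlike the regular expression in 7.0.4, this also reports a length of 0 as short.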

Jérôme - FoxTrot Engineering


Re: how to find all non searchable pdf [message #1201 is a reply to message #1200]

Tue, 04 May 2021 17:53

Sonicx
Messages: 13
Registered: April 2021

Junior Member

Thank you,

but it doesn't work.

1)
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] []

No results: zero PDFs match this filter, but there are many such files.

2)
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^.{0,1000}$]

No results: zero PDFs match this filter, but there are many such files.

One other app (DEVONthink) has a filter for this: "Word count" "is" "0".

But that app (DEVONthink) can't do what I need.

Attachment: smart rule.png
Attachment: Snímek obrazovky 2021-05-04 v 17.35.17.png


Re: how to find all non searchable pdf [message #1202 is a reply to message #1197]

Tue, 04 May 2021 17:54

Sonicx
Messages: 13
Registered: April 2021

Junior Member

Dear Darren,

Could you provide me with this app? Of course, we can share the costs.

Thank you.


Re: how to find all non searchable pdf [message #1203 is a reply to message #1201]

Tue, 04 May 2021 18:24

FoxTrot Engineering
Messages: 427
Registered: April 2020

Senior Member

Did you enable "Prefer Xpdf for PDF documents" in the First Aid / manage third party importers window (that can be opened by pressing the command and option keys when launching FoxTrot)?
If so, Xpdf seems to always return a single space as content instead of an empty string.

However, searching for:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [] [ ] <- type a space here!
does not work either (it will be fixed in version 7.1)

But the following should actually work in 7.0.4:
[all items of type] [PDF]
[then apply advanced filter] [contents] [is exactly the string] [ignore blanks] []

Also, the modified regular expression to work around the bug in 7.0.4 should also work:
[all items of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [". also applies to newlines"] [^ \x00*$] <- don't miss the space between ^ and \
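
To check independently whether a file's extracted text really is just blanks or padding, something like the following shell sketch could be used. It is not FoxTrot's implementation: it assumes the text comes from an external extractor such as `pdftotext file.pdf -`, and `is_blank_text` is a made-up helper name. It treats whitespace and NUL bytes as "nothing", roughly what [ignore blanks] and the \x00* workaround are meant to tolerate.

```shell
#!/bin/sh
# is_blank_text: read extracted text on stdin and succeed (status 0) when
# nothing remains after deleting whitespace and NUL bytes.
# Hypothetical helper, not part of FoxTrot or Xpdf.
is_blank_text() {
    remaining=$(($(tr -d '[:space:]\000' | wc -c)))
    [ "$remaining" -eq 0 ]
}

# Example (assumes pdftotext is installed):
#   pdftotext scan.pdf - | is_blank_text && echo "scan.pdf has no text layer"
```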

Jérôme - FoxTrot Engineering


Re: how to find all non searchable pdf [message #1205 is a reply to message #1203]

Tue, 04 May 2021 21:57

Sonicx
Messages: 13
Registered: April 2021

Junior Member

Thank you for your time. Each of these steps gives different results, but none is ideal:

either a lot of non-OCR PDFs are missing, or a lot of OCR'd PDFs are included.


Re: how to find all non searchable pdf [message #1206 is a reply to message #1205]

Wed, 05 May 2021 09:51

FoxTrot Engineering
Messages: 427
Registered: April 2020

Senior Member

Option-click on a found file (or use the "display type: plain text" popup menu in the toolbar) to show the text that has been extracted and indexed. This should help to understand why some files are found when you think they should not, and vice versa. Also, this will show some metadata that you can target with an additional condition, for example:
[then apply advanced filter] [other metadata] [contains one of the strings / does not contain any of the strings] [√ ignore case, √ ignore accents, √ multiple strings] [SomeScannerBrand—SomeAppSignature—etc]

(use the "em-dash" character, and not the classical -, to separate strings, i.e. shift-option-dash on a qwerty keyboard, or option-dash on some other keyboards)
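
The same kind of metadata check can be sketched in a shell, outside FoxTrot. This is only an illustration under assumptions: it expects "Key: value" lines such as those printed by `pdfinfo` (Poppler or Xpdf), looks only at the Creator and Producer fields, and uses the hypothetical helper name `matches_signature`. The signature strings are the same placeholders as above.

```shell
#!/bin/sh
# matches_signature PATTERN: read pdfinfo-style "Key: value" lines on stdin
# and succeed (status 0) when the Creator or Producer value matches PATTERN,
# ignoring case. PATTERN is an extended regular expression, so several
# signatures can be joined with |. Hypothetical helper name.
matches_signature() {
    # Keep only the values of the Creator and Producer lines, then search them.
    sed -n -E 's/^(Creator|Producer):[[:space:]]*//p' | grep -Eiq -- "$1"
}

# Example (assumes pdfinfo is installed):
#   pdfinfo scan.pdf | matches_signature 'SomeScannerBrand|SomeAppSignature' \
#       && echo "scan.pdf was produced by a known scanner"
```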

Jérôme - FoxTrot Engineering


Re: how to find all non searchable pdf [message #1207 is a reply to message #1202]

Wed, 05 May 2021 09:55

Darren Ingram
Messages: 12
Registered: May 2018

Junior Member

As it was custom-written, I suspect there would be problems with its intellectual property. However, if you wish to contact the company and ask about "PDF Reporter" that they wrote for me, maybe they can do a version for you?

Shane Stanley sstanley«~at~»myriad-com«|dot|»com«|dot|»au is the person to contact. Do mention our exchange.


Re: how to find all non searchable pdf [message #1270 is a reply to message #1207]

Wed, 29 September 2021 12:50

Des Bw
Messages: 26
Registered: June 2017

Junior Member

I find good results with pdffonts.
Here is a simple script I run using Hazel:

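# Hazel passes the path of the file being tested as $1.
# pdffonts lists the fonts used in a PDF; a file with no font rows usually has no text layer.
if pdffonts "$1" | grep -q Type
then
   # Fonts found: the file is already OCRed (searchable), so FAIL the match.
   exit 1
else
   # No fonts found: treat the file as image-only and let it match.
   exit 0
fi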
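
For a one-off batch run without Hazel, the same idea can be sketched as a small loop that moves font-less PDFs into a separate folder for OCR. The function names `count_fonts` and `move_unsearchable` and the example paths are my own assumptions; the sketch also assumes the usual pdffonts layout (a column-header line, a line of dashes, then one row per font) and skips any file that pdffonts cannot read.

```shell
#!/bin/sh
# count_fonts: read pdffonts output on stdin and print the number of font rows.
# Assumes two header lines (column names, dashes) followed by one row per font.
count_fonts() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# move_unsearchable SRC DEST: move every *.pdf directly inside SRC that
# reports zero fonts into DEST. Hypothetical helper; requires pdffonts.
move_unsearchable() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*.pdf; do
        [ -f "$f" ] || continue
        # Skip files pdffonts cannot read instead of treating them as font-less.
        out=$(pdffonts "$f" 2>/dev/null) || continue
        fonts=$(printf '%s\n' "$out" | count_fonts)
        if [ "$fonts" -eq 0 ]; then
            mv -- "$f" "$dest"/
        fi
    done
}

# Example:
#   move_unsearchable "$HOME/Documents/pdfs" "$HOME/Documents/needs-ocr"
```

It is probably worth trying this on a copy of a small folder first: a PDF that mixes scanned pages with a few typed pages still reports fonts and will be left where it is.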

