Broken words at the end of line [message #1269] |
Wed, 29 September 2021 12:42 |
Des Bw
Messages: 26 Registered: June 2017
|
Junior Member |
|
|
When a word falls at the end of a line, it is broken into two parts. In one of the PDF files I am reading right now, the word "behaviorism" appears as "be- haviorism" because it fell at the end of a line.
I knew the file contained the term "behaviorism", but FoxTrot was not able to find it because of this break.
- Is there any way to tell FoxTrot to ignore the "- " part so that broken words are read as one (unbroken) word?
[Updated on: Wed, 29 September 2021 12:43]
|
|
|
Re: Broken words at the end of line [message #1271 is a reply to message #1269] |
Wed, 29 September 2021 15:53 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
Depending on how the PDF file was created, it may be possible (though I am actually not sure of this) that hyphenated words are handled as a single word split across two lines. Usually, however, the software generating the PDF handles them as two distinct words separated by a hyphen and a line feed, and in that case FoxTrot won't find the full word.
If you want to specifically search for a given hyphenated form of a word, you may search for a quoted string: ["be haviorism"].
If you want to search for several hyphenated forms of a word, you may search for multiple quoted strings using the | operator: ["behaviorism"|"be haviorism"|"behavio rism"].
If you are a regular expression expert, you may also try something like:
[any document of type] [PDF]
[then apply advanced filter] [contents] [contains the regular expression] [be(\p{Pd}\s+)?ha(\p{Pd}\s+)?vio(\p{Pd}\s+)?ri(\p{Pd}\s+)?sm ]
(In this regex, \p{Pd} matches any kind of dash character (minus, hyphen, dash, etc.) and \s+ matches one or more whitespace characters, including line feed or return.)
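For readers who want to test this kind of pattern outside FoxTrot, here is a minimal Python sketch of the same idea. It is only an illustration, not FoxTrot's implementation: the function name `hyphen_tolerant_pattern` is made up for this example, the fragment list must be supplied by hand, and because Python's standard `re` module does not support \p{Pd}, the sketch assumes a small explicit set of dash characters (hyphen-minus plus U+2010 to U+2015) instead.

```python
import re

# Assumption: Python's standard "re" module has no \p{Pd}, so a few common
# dash characters are listed explicitly (hyphen-minus and U+2010..U+2015).
DASH_CLASS = "[\\-\u2010-\u2015]"

# An optional "dash followed by one or more whitespace characters".
OPTIONAL_BREAK = "(?:" + DASH_CLASS + r"\s+)?"


def hyphen_tolerant_pattern(fragments):
    """Build a regex that matches the fragments joined together,
    allowing an optional dash + whitespace between any two of them."""
    return re.compile(OPTIONAL_BREAK.join(re.escape(f) for f in fragments))


# Same break points as in the regex above: be|ha|vio|ri|sm
pattern = hyphen_tolerant_pattern(["be", "ha", "vio", "ri", "sm"])

print(bool(pattern.search("the rise of behaviorism")))
print(bool(pattern.search("the rise of be-\nhaviorism")))
```

The break points still have to be guessed in advance, which is the main limitation of this approach.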
Hope this helps
Jérôme - FoxTrot Engineering
|
|
|
Re: Broken words at the end of line [message #1272 is a reply to message #1271] |
Thu, 30 September 2021 10:18 |
Des Bw
Messages: 26 Registered: June 2017
|
Junior Member |
|
|
Thank you for the reply. Yes, I know how to do these searches using regex or FoxTrot's own system. The problem is that I often don't know whether a word is hyphenated or not. For word ranking and everything else, broken words are irrelevant (ignored) in the current system.
- As to the first point, I think the way PDF-generating software creates hyphenated words is pretty consistent.
It is always a hyphen followed by a return. When I look at these breaks in the text version of the index, they always appear as a hyphen followed by a space, as I have shown in the example.
(I am assuming all the ranking and proximity magic works on the text version of the index. I assume the PDF formatting, such as the margins, is irrelevant to the search algorithm.)
- So, I was hoping that FoxTrot could be programmed at its core to always remove (ignore) the HYPHENspace sequence, so that broken words are read as one (unified) word.
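To make the request concrete, here is a rough Python sketch of what removing the HYPHENspace sequence from already-extracted text could look like. It is purely illustrative and says nothing about how FoxTrot builds its index; the helper name `join_broken_words` and the rule "lowercase letter, hyphen-minus, whitespace, lowercase letter" are assumptions made for this example.

```python
import re

# Assumed rule: a hyphen-minus followed by whitespace, sitting between two
# lowercase ASCII letters, is treated as an end-of-line word break.
BROKEN_WORD = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def join_broken_words(text):
    """Remove 'hyphen + whitespace' sequences so broken words read as one."""
    return BROKEN_WORD.sub("", text)


print(join_broken_words("the rise of be- haviorism"))
print(join_broken_words("a well-known fact"))
print(join_broken_words("pre- and post-test scores"))
```

The last call shows the cost of doing this unconditionally: a legitimate suspended hyphen ("pre- and post-test") is also glued together, so a blanket rule trades some false negatives for some false positives.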
|
|
|
Re: Broken words at the end of line [message #1273 is a reply to message #1272] |
Thu, 30 September 2021 10:46 |
Des Bw
Messages: 26 Registered: June 2017
|
Junior Member |
|
|
PDF-generating software puts a hyphen at the end of a line if a word needs to be broken in two. The search algorithm in FoxTrot treats the two parts of such a broken word as separate words.
To be more specific, I have attached a sample page.
Download the file, index it in FoxTrot, open it in a new window, and search for the word "possessives".
- How many occurrences does FoxTrot give you?
- None
But if I search for the same word in Adobe Acrobat, it gives me 1 occurrence, because Acrobat is intelligent enough to remove those carriage returns (it recognizes the word as one).
I think FoxTrot needs to do the same. The results will be more accurate if these carriage returns are removed and the word is treated as one.
[Updated on: Thu, 30 September 2021 10:48]
|
|
|
Re: Broken words at the end of line [message #1274 is a reply to message #1273] |
Thu, 30 September 2021 11:11 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
Thanks for providing this sample document. This one can actually be processed correctly if you use Xpdf, instead of Spotlight's importer, to process PDF documents. To do so, press the command and option keys while launching FoxTrot, and enable "manage third-party metadata importers". Then enable "prefer Xpdf for PDF documents" (and do NOT enable the stream order mode). You will then need to rebuild your indices.
This will work for documents created with software that handles hyphenation correctly; PDFs created from TextEdit, for example, will still generate two separate words. PDFs created from Pages won't.
Jérôme - FoxTrot Engineering
|
|
|
Re: Broken words at the end of line [message #1275 is a reply to message #1274] |
Thu, 30 September 2021 15:27 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
In fact, as far as I know, the PDF standard (in recent versions) allows creating documents that handle hyphenated words correctly; for example, the German word "Drucker" can be displayed as "Druk-" (with a k) followed by "ker" (with a second k) on the next line, while still returning "Drucker" (with ck) to applications that process the content. However, I don't know how to create such a PDF file, and I can't find a sample file to download.
Your example file, as well as the PDF files I tried to generate using TextEdit, Pages, Word or LibreOffice, do not use this feature; they contain two distinct words separated by a hyphen-minus character (U+002D, the plain old -) or a hyphen character (U+2010). Xpdf (and probably Acrobat too) then uses a hack that deletes the last character of a line when it is a hyphen-minus, which will probably be fine in your case. However, I am quite surprised and disappointed that such a hack is still needed in 2021, for a file format that is more than 30 years old.
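As a rough illustration of the kind of hack described above, here is a short Python sketch that joins lines and drops a trailing hyphen-minus. It is not Xpdf's actual code; the function name `join_lines_dropping_hyphens` is hypothetical, and the sketch deliberately handles only the hyphen-minus (U+002D), so a line ending in U+2010 is left untouched.

```python
def join_lines_dropping_hyphens(lines):
    """Concatenate text lines into one string.

    When a line ends with a hyphen-minus, the hyphen is deleted and the next
    line is glued on directly; otherwise lines are joined with a space.
    """
    parts = []
    for line in lines:
        stripped = line.strip()
        if stripped.endswith("-"):
            parts.append(stripped[:-1])   # word continues on the next line
        else:
            parts.append(stripped + " ")  # normal line break becomes a space
    return "".join(parts).strip()


print(join_lines_dropping_hyphens(["the use of posses-", "sives in English"]))
```

A real importer would need to decide what to do with compound words that are legitimately hyphenated at a line end; this sketch simply removes the hyphen in every case.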
Jérôme - FoxTrot Engineering
|
|
|
Re: Broken words at the end of line [message #1276 is a reply to message #1275] |
Thu, 30 September 2021 18:21 |
Des Bw
Messages: 26 Registered: June 2017
|
Junior Member |
|
|
Quote: To do so, press the command and option keys while launching FoxTrot, and enable "manage third-party metadata importers". Then enable "prefer Xpdf for PDF documents" (and do NOT enable the stream order mode). You will then need to rebuild your indices.
This will work for documents created with a software that handles hyphenation correctly
Thank you so much. That is what I have been looking for.
As for the PDF-creating software, most of the documents I use are created by TeX, so they should be fine.
Your German case is pretty advanced. I don't think TeX can do that either.
Quote: I am quite surprised and disappointed that such a hack is still needed in 2021, for a file format that is more than 30 years old.
Well, we insert these hyphens to aid readability for the human eye. There is a way to avoid word breaking in LaTeX (TeX), but then the white space between words often looks inconsistent (wider on one line, narrower on another). For that reason, breaking the word with a hyphen is assumed to be the best solution. The hyphen is there to remind the reader that the word has been broken. Otherwise, the reader will attempt to read each of the pieces separately, causing an unnecessary headache.
The best we can do is make the hyphen consistent, to assist the software that applies the hack.
|
|
|
|
Re: Broken words at the end of line [message #1278 is a reply to message #1277] |
Fri, 01 October 2021 09:26 |
FoxTrot Engineering
Messages: 406 Registered: April 2020
|
Senior Member |
|
|
Both Xpdf and Spotlight's importer have issues with some PDF documents, especially OCR'ed ones; for some files Xpdf produces the best result, and for others it is the Spotlight importer. See the FAQ for more info.
Xpdf is also considerably slower at indexing than Spotlight's importer.
Jérôme - FoxTrot Engineering
|
|
|