FoxTrot Search Forum
FoxTrot Search for macOS Forum

Do html files rely on Spotlight index? [message #1352] Sat, 19 February 2022 07:21
Atlas
Messages: 140
Registered: August 2009
Senior Member
I installed Foxtrot on two separate machines, and the first one is able to index html files just fine. However, the spotlight index for the second machine is not able to index html files (and I've tried everything but I can't fix the spotlight index on this second machine). At first, I thought this wouldn't affect Foxtrot index and that I would still be able to search for my html files using Foxtrot. However, it seems that Foxtrot cannot search for html files on the second machine either. Basically, any file that the spotlight on the second machine cannot search, Foxtrot cannot search for it either, so it looks like Foxtrot indexes html files using the index from spotlight. Is this correct? I just want to confirm the behavior, and report this as a bug if it's not expected behavior.

An example of a file that Foxtrot cannot search is attached here. In my testing, Devonthink is able to index this html file independently of spotlight on the second machine, and it can search the file. Therefore, I don't think the file itself is the problem.
Re: Do html files rely on Spotlight index? [message #1353 is a reply to message #1352] Sat, 19 February 2022 10:51
FoxTrot Engineering
Messages: 406
Registered: April 2020
Senior Member
FoxTrot does not rely on Spotlight's index; it does however use Spotlight's metadata importers to extract indexable text from files. So yes, if for some reason a Spotlight importer does not work correctly, files handled by this importer won't be indexed correctly both with Spotlight and with FoxTrot.
For HTML files, FoxTrot has a fallback extractor (Gumbo) that is used when Spotlight's importer does not return any data at all for a given HTML file (this sometimes happens). However, if Spotlight's importer returns some partial or garbled data, then FoxTrot indexes that.
You can option-click a file from FoxTrot's search result list (when searching by filename, for example) to see the plain text that was extracted from a given file.
What do you get for the file you attached, on both machines? Do these machines have the same version of macOS? A quick test here with your files shows very different results on macOS 10.14 (most visible text is actually indexed, as well as a bunch of base64-encoded images) and macOS 12 (no base64 data, but much of the text is missing). Interestingly, saving the source file from stackoverflow using the current version of SingleFile creates a file that can be extracted correctly on macOS 12.
FoxTrot 7.5 will allow using Gumbo instead of Spotlight's importer for all HTML files, or for specific HTML files. This will however require using Terminal.app, as Spotlight's importer is supposed to be fast and to give good results in most cases.
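Since the extracted text comes from Spotlight's importer, it can also help to look at the importer's output directly on each machine, outside FoxTrot. The following is only a rough sketch: it assumes macOS's mdimport tool with its test-import mode (-t) and a debug level (-d3) that prints all extracted attributes, and the exact output may differ between macOS versions. The function name and the MDIMPORT_CMD variable are made up for illustration.

```shell
# Sketch: print what Spotlight's importer extracts from one file.
# A test import does not modify the Spotlight index or FoxTrot's index.
# MDIMPORT_CMD is a hypothetical override, useful only for dry runs.
show_importer_output() {
    # -t  = test import
    # -d3 = debug level assumed to print all attributes, text content included
    "${MDIMPORT_CMD:-mdimport}" -t -d3 "$1" 2>&1
}

# Example (run on both machines and compare):
#   show_importer_output ~/Documents/page.html | grep -c alfabravo
```

If the string appears in the output on one machine and not on the other, the difference is in the importer rather than in FoxTrot's index.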


Jérôme - FoxTrot Engineering
Re: Do html files rely on Spotlight index? [message #1354 is a reply to message #1353] Sat, 19 February 2022 12:05
Atlas
Messages: 140
Registered: August 2009
Senior Member
Thank you for inspecting the file I showed. The two machines are indeed on two different MacOS versions, and for good reasons (because I anticipate that certain versions might break my tools). However, I didn't expect MacOS 12 to have an impact on the way files are indexed.

(1) On the machine with older MacOS, I can search for the file and view the html file just fine in Foxtrot.
(2) For the machine with MacOS 12, I cannot search the file's contents. For example, I know the unique string "alfabravo" is in the file, but Foxtrot will not show the file if I search for that string or any string in the file. HOWEVER, the strange thing is that the html file is perfectly viewable in Foxtrot when I search for the file by name, and I CAN search for the string "alfabravo" when I open the file up in Foxtrot viewer (use "View in Foxtrot"). So somehow ... Foxtrot DOES have access to the indexable text of the html file when it opens up the file, but it can't search for it???
(3) The problem is not consistent for all html files on MacOS 12. Roughly half of my html files are indexable and searchable, even though they are created by older versions of SingleFile, and I don't see any particular reason why some are indexable and not others. I actually don't mind Spotlight's behavior because I gave up on using Spotlight (keeps breaking with updates). What I'm concerned with is that the contents of html files are also suddenly becoming unsearchable for Foxtrot as well. Foxtrot has been the reliable rock that I use to search.

In the meantime, can you please show me how to use the Gumbo importer to index all html files in a folder? This would really help me, because a lot of my research database has become invisible to Foxtrot. Thank you for looking into this.

[Updated on: Sat, 19 February 2022 12:18]


Re: Do html files rely on Spotlight index? [message #1360 is a reply to message #1354] Sun, 20 February 2022 10:36
FoxTrot Engineering
Messages: 406
Registered: April 2020
Senior Member
(2): finding a file relies on the index, and thus on the text that was extracted from the file at indexing time (usually using a Spotlight metadata importer, or in some specific cases an alternate text extractor like Xpdf or Gumbo). Highlighting found occurrences in the preview is completely different, and uses PDFKit, WebKit, or macOS's text engine depending on the file type. So yes, the result can be different. However, as I said in the previous message, you can option-click a found file to display the plain text that has actually been indexed, and then you should see why a file can or can't be found.

The current FoxTrot version uses Gumbo only when the Spotlight importer completely fails to process an HTML file (usually because of a charset problem). With version 7.5, you will be able to use Gumbo for all HTML files by typing this command in Terminal.app:
defaults write com.ctmdev.FoxTrotShared PreferGumbo -bool YES
or to use Gumbo for some files only:
xattr -w com.ctmdev.foxtrot.extractor gumbo [file [file…]]
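For the "all html files in a folder" case asked about earlier in the thread, the per-file command can be combined with find. This is a minimal sketch, assuming FoxTrot 7.5 or later honors the attribute as described above; the function name and the XATTR_CMD variable are hypothetical and not part of FoxTrot.

```shell
# Sketch: mark every .html/.htm file under a folder for Gumbo extraction.
# XATTR_CMD is a hypothetical override so the command can be previewed first.
tag_html_for_gumbo() {
    # $1 = folder to scan recursively; file extensions are matched case-insensitively
    find "$1" -type f \( -iname '*.html' -o -iname '*.htm' \) \
        -exec "${XATTR_CMD:-xattr}" -w com.ctmdev.foxtrot.extractor gumbo {} +
}

# Preview what would be run:  XATTR_CMD=echo tag_html_for_gumbo ~/Research
# Apply for real:             tag_html_for_gumbo ~/Research
```

The attribute can presumably be removed again with xattr -d com.ctmdev.foxtrot.extractor on the same files if the Spotlight importer turns out to give better results for some of them.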


Jérôme - FoxTrot Engineering
Re: Do html files rely on Spotlight index? [message #1361 is a reply to message #1360] Sun, 20 February 2022 10:52
Atlas
Messages: 140
Registered: August 2009
Senior Member
Thanks for clarifying on the difference between what the indexer sees and what the preview sees.

(1) When I option-click the html file on the machine where Foxtrot is not able to search for the file content, I can see that the index doesn't contain any meaningful text other than the page title. So something has gone wrong on that machine with MacOS12, where the Spotlight importer is not able to index the html file correctly. Is that a good interpretation of the situation?
(2) I just realized that we're only on version 7.1.3, which seems far from 7.5. So does that mean the commands you've kindly shown are not available yet?

Thanks for clarifying this issue, because it's not always clear how Foxtrot interacts with Spotlight index. In the current situation with html files, it really looks to me like Foxtrot search relies on Spotlight importer to extract indexable text, which means Foxtrot search will fail if Spotlight importer fails (for whatever reason). Seems like Gumbo might be a solution and I'm glad we're working on it.

[Updated on: Sun, 20 February 2022 10:53]


Re: Do html files rely on Spotlight index? [message #1364 is a reply to message #1361] Sun, 20 February 2022 18:42
FoxTrot Engineering
Messages: 406
Registered: April 2020
Senior Member
(1) Yes; either these HTML files are not valid (though they may display correctly in most browsers), or the HTML Spotlight importer on macOS 12 does a poorer job than previous versions. The fact that files produced by the current version of SingleFile do not have this problem may suggest the former.
(2) Yes; the release date for version 7.5 is not scheduled yet.


Jérôme - FoxTrot Engineering
Re: Do html files rely on Spotlight index? [message #1367 is a reply to message #1364] Wed, 23 February 2022 15:39
Atlas
Messages: 140
Registered: August 2009
Senior Member
I've tried producing html files using different versions of SingleFile and also other methods, and Spotlight is just not indexing them consistently. In fact, my Monterey doesn't even index pdf files consistently. Have you had other reports of Monterey's Spotlight having issues with indexing? Makes me wonder if I should downgrade my system to Big Sur or just do a fresh reinstall of Monterey again.