FoxTrot Search Forum
FoxTrot Search for macOS Forum

BUG: Foxtrot looks like it is not indexing large files completely [message #1593] Sat, 10 December 2022 12:19
Atlas
Messages: 130
Registered: August 2009
Senior Member
Background

- I've posted previously [url=https://forum.foxtrot-search.com/index.php?t=msg&th=518&#msg_1553]here[/url] that Foxtrot doesn't seem to index large html files completely. In that thread I described how Foxtrot fails to index html files completely even when they are as small as 35 MB.
- Over the past month, I thought the problem might be restricted to html files. However, it looks like Foxtrot is also not indexing large text files completely.

Details of current bug

- I have a large tsv file that's 650 MB (which I believe Foxtrot treats as a text file). This is an archive file.
- I put this large file in its own folder and created a separate Foxtrot index that targets only the parent folder of this file.
- I removed the plain text file limit using "defaults write com.ctmdev.FoxTrotShared PlainTextFileLimitMB -int 0", following the instructions on the website.
- When I search for a text string inside this tsv file, I only get search results if the string appears near the very top of the file. Strings near the bottom of the file are not searchable.
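
For anyone who wants to reproduce this without a private archive file, here is a minimal Python sketch that writes a synthetic tsv file with a unique marker token at the start of every megabyte. The function name `write_probe_tsv` and the `PROBEMARK…` tokens are made up for illustration, and the sketch assumes the indexer treats each marker as a single searchable word. It is only a way to see where indexing stops, not a statement about how Foxtrot works internally.

```python
FILLER_LINE = b"lorem\tipsum\tdolor\tsit\tamet\n"


def write_probe_tsv(path, total_mb, step_mb=1):
    """Write a TSV of about total_mb MiB with one unique marker per step_mb block.

    Returns the list of marker tokens, in file order.
    """
    block_bytes = step_mb * 1024 * 1024
    markers = []
    with open(path, "wb") as f:
        for offset_mb in range(0, total_mb, step_mb):
            # Hypothetical marker format: encodes its own offset in MiB.
            marker = f"PROBEMARK{offset_mb:05d}MB"
            markers.append(marker)
            head = f"{marker}\toffset_mb\t{offset_mb}\n".encode("ascii")
            # Fill the rest of the block with whole filler lines, then pad
            # so that every block is exactly block_bytes long and ends in "\n".
            remaining = block_bytes - len(head)
            full_lines, pad = divmod(remaining, len(FILLER_LINE))
            f.write(head)
            f.write(FILLER_LINE * full_lines)
            if pad:
                f.write(b"x" * (pad - 1) + b"\n")
    return markers
```

Calling `write_probe_tsv("probe.tsv", 650)` produces a file of roughly the same size as the one described above. After indexing it, searching for the markers in order (PROBEMARK00000MB, PROBEMARK00001MB, ...) should show the first offset that returns no hit, which brackets the effective size limit to within one megabyte.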

Current concern

- I have a feeling that Foxtrot produces an INCOMPLETE index of a file if its size is over a certain limit, no matter what the file format is. If this is the case, users need to at least know what that size limit is so they can be prepared to deal with it. But ideally, if this bug exists, it should be fixed.
Re: BUG: Foxtrot looks like it is not indexing large files completely [message #1598 is a reply to message #1593] Thu, 22 December 2022 17:55
AJKS
Messages: 48
Registered: June 2020
Member
If I remember correctly, it isn't FoxTrot but Spotlight that has a limit of something like 1 million characters.
Spotlight will only index the first 1 million (or whatever the limit is) characters in your documents.
Foxtrot isn't to blame.
Re: BUG: Foxtrot looks like it is not indexing large files completely [message #1600 is a reply to message #1598] Fri, 23 December 2022 05:32
Atlas
Can Foxtrot Engineering please confirm whether the problem is that Foxtrot relies on Spotlight to create its index? My understanding is that Foxtrot uses its own indexer, which is why its search results are often better than Spotlight's (though it does use Spotlight to index certain metadata). And in the case of my html files, I've specifically requested that Foxtrot use the Gumbo indexer, so that should not depend on Spotlight.

Can Foxtrot Engineering please chime in and let me know what their assessment of this issue is as well? Thanks.
Re: BUG: Foxtrot looks like it is not indexing large files completely [message #1601 is a reply to message #1600] Fri, 23 December 2022 19:18
FoxTrot Engineering
Messages: 383
Registered: April 2020
Senior Member
- Gumbo does indeed seem to have some size limits when parsing large HTML files. I did not find any obvious setting to control this, and I can't tell what these limits are.

- .tsv and .csv files are currently not handled by FoxTrot's built-in text extractor (which is used for .txt files when the hidden preferences PlainTextFileLimitMB or PlainTextPreferredEncodings are set, or when "prefer alternatives: plain text files: FoxTrot built-in" is enabled in the "manage third-party metadata importers" window). You can however use the Aliases hidden preference to handle these files as .txt files.

- there is a bug that can make these settings have no effect (PlainTextFileLimitMB, PlainTextPreferredEncodings, and "prefer alternatives: plain text files: FoxTrot built-in"). Not sure when this bug was introduced. So even if you set both PlainTextFileLimitMB and Aliases for .tsv files, you may still have the 10 MB limit of Spotlight's importer.

- No, FoxTrot does not rely on Spotlight itself, just on Spotlight's importers to extract text from various document formats.


Jérôme - FoxTrot Engineering
Re: BUG: Foxtrot looks like it is not indexing large files completely [message #1602 is a reply to message #1601] Sat, 24 December 2022 04:15
Atlas
FoxTrot Engineering wrote on Fri, 23 December 2022 19:18
- there is a bug that can make these settings have no effect (PlainTextFileLimitMB, PlainTextPreferredEncodings, and "prefer alternatives: plain text files: FoxTrot built-in"). Not sure when this bug was introduced. So even if you set both PlainTextFileLimitMB and Aliases for .tsv files, you may still have the 10 MB limit of Spotlight's importer.
Thanks for looking into this. I think what you said above is exactly the case. I've set both PlainTextFileLimitMB and Aliases for .tsv files, and Foxtrot is still not indexing large .tsv or .csv files completely. This happens on both version 7.5.3 and 7.5.2. To be honest, I'm not sure how widespread this problem is, because the index size limit might affect other file types beyond html, tsv, and csv.

If Foxtrot is indeed producing incomplete indexes for large files, I strongly recommend that we prioritize fixing this bug, because I think it's a pretty serious issue. Incomplete indexes produce queries that fail silently: everything looks like it works as expected until the user searches for a keyword that appears later in the document. Alternatively, if we're unable to fix it, we should at least understand the conditions under which this bug occurs (which kinds of files have a size limit, and how large the limits are), so that users can proactively try to avoid it.
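
Until the limits are known, one possible stopgap is to split large text files into parts that stay below the 10 MB Spotlight importer limit mentioned above. The following Python sketch is only an illustration: `split_on_lines` and the `.partNNNN` naming are invented here, the 9 MiB default is an arbitrary safety margin rather than a documented threshold, and I have not verified that split files are indexed completely.

```python
import os


def split_on_lines(src_path, out_dir, max_bytes=9 * 1024 * 1024):
    """Split src_path into numbered parts of at most max_bytes each.

    Parts are cut only at line ends, so no row is broken in two. A single
    line longer than max_bytes is written to a part of its own, which will
    then exceed the limit. Returns the list of part paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    base, ext = os.path.splitext(os.path.basename(src_path))
    parts = []
    out = None
    size = 0
    try:
        with open(src_path, "rb") as src:
            for line in src:
                # Start a new part when the current one would overflow.
                if out is None or (size > 0 and size + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    name = f"{base}.part{len(parts) + 1:04d}{ext}"
                    part_path = os.path.join(out_dir, name)
                    out = open(part_path, "wb")
                    parts.append(part_path)
                    size = 0
                out.write(line)
                size += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

For example, `split_on_lines("archive.tsv", "archive_parts")` would write archive.part0001.tsv, archive.part0002.tsv, and so on into the archive_parts folder, which could then be indexed in place of the original file. The trade-off is that a search hit points to a part file rather than to the original.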

[Updated on: Sat, 24 December 2022 04:16]

Re: BUG: Foxtrot looks like it is not indexing large files completely [message #1603 is a reply to message #1602] Sat, 24 December 2022 09:50
FoxTrot Engineering
FoxTrot Engineering wrote on Fri, 23 December 2022 19:18
- there is a bug that can make these settings have no effect (PlainTextFileLimitMB, PlainTextPreferredEncodings, and "prefer alternatives: plain text files: FoxTrot built-in"). Not sure when this bug was introduced.
This bug was introduced in version 7.5.2 (September 2022).


Jérôme - FoxTrot Engineering
Re: BUG: Foxtrot looks like it is not indexing large files completely [message #1604 is a reply to message #1603] Sun, 25 December 2022 02:42
Atlas
Thanks for updating us on the origin of this bug. Do you anticipate that it could be fixed in an upcoming update? Thanks again for looking into the issue.