Refining PDF Index Output

sina_iravanian · May 4, 2011

Hi all,
First off, thanks for the great tool.
I've noticed that in some PDF files (especially those generated by LaTeX), some character combinations are converted to non-standard characters: for example, "ff" is replaced with a single character (0xfb00), and "fi" with a single character (0xfb01). These could be turned back into their original strings after indexing, which would prevent search problems.
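
To make the idea concrete, here is a minimal Python sketch of that replacement step. It is not Zotero's code; the table covers only the seven Latin ligatures in the Unicode block U+FB00 to U+FB06, and the name `expand_ligatures` is made up for illustration.

```python
# Hypothetical helper, not part of Zotero: expand the Latin ligature
# code points U+FB00..U+FB06 back into plain letter sequences.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters mapped to strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with the ligatures listed above spelled out."""
    return text.translate(_TABLE)
```

Running extracted text through something like `expand_ligatures` before it is written to the index would let a search for "first" match a PDF that stores the word with the 0xfb01 character.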

ajlyon · May 4, 2011

That would be quite nice, and I don't think it would be that hard to patch into the existing code. I'll take a look at doing this.

ajlyon · May 4, 2011

Ideally, though, this would be handled by SQLite through proper collations. I know that matching and collation were supposed to become more flexible with, IIRC, Firefox 4. Still, I'll look into it.

sina_iravanian · May 4, 2011

Actually, the contents of the ".zotero-ft-cache" files suffer from this problem. I use the "Retrieve metadata for PDF" functionality a lot, and I wonder whether it uses the contents of that cache file.

P.S.:
For reference, character combinations of this kind are called "ligatures", and I found a short list of them here:
http://en.wikipedia.org/wiki/Typographic_ligature#Ligatures_in_Unicode_.28Latin-derived_alphabets.29

ajlyon · May 4, 2011

The correct solution is to treat these as their normalized equivalents (fl, etc.) when we do the database queries that underlie searches. The database software allows this, but it looks like Mozilla doesn't include that module (see mention at https://wiki.mozilla.org/Firefox/Projects/FTS_and_Awesomebar#Week_of_2010.2F03.2F08).
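
As a rough illustration of query-time normalization, the sketch below registers a custom collation through Python's standard `sqlite3` module. It is not how Zotero or Firefox accesses SQLite, and it only covers equality comparisons, not full-text search; the collation name `nfkc`, the table `words`, and the function names are all made up for the example.

```python
import sqlite3
import unicodedata


def nfkc_compare(a: str, b: str) -> int:
    """Compare two strings after Unicode NFKC normalization."""
    na = unicodedata.normalize("NFKC", a)
    nb = unicodedata.normalize("NFKC", b)
    return (na > nb) - (na < nb)


def find_matches(words, query):
    """Store words in an in-memory table and look up query with the
    hypothetical 'nfkc' collation, so ligature forms match plain ones."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("nfkc", nfkc_compare)
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT word FROM words WHERE word COLLATE nfkc = ? ORDER BY rowid",
        (query,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

With this in place, `find_matches(["o\ufb03ce", "office", "other"], "office")` returns both spellings of "office", because the stored text is left untouched and only the comparison is normalized.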

This is actually part of a fairly large problem that affects some other languages much more.

We could, as I said, intercept those characters when making the full-text index, but it'd only be right to intercept the whole set of them, which is multilingual and quite large.
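
For what intercepting the whole set could look like at indexing time, one option is Unicode compatibility normalization (NFKC) instead of a hand-written table. The sketch below uses Python's standard `unicodedata` module and is only an assumption about one possible approach; `normalize_for_index` is a made-up name. NFKC also folds many other compatibility characters, such as full-width letters and superscript digits, which may or may not be what an index wants.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Apply NFKC normalization, which replaces compatibility characters
    (ligatures, full-width forms, and others) with their plain equivalents."""
    return unicodedata.normalize("NFKC", text)
```

The trade-off compared with normalizing at query time is that the index no longer records which form the PDF actually contained.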