Refining PDF Index Output

Hi all,
First off, thanks for the great tool.
I've noticed that in some PDF files (especially those generated by LaTeX), certain character combinations are converted to non-standard single characters, e.g., "ff" is replaced with U+FB00 (0xfb00) and "fi" with U+FB01 (0xfb01). Could they be turned back into their original strings at indexing time, so that searches for the plain text still match?
  • That would be quite nice, and I don't think it would be that hard to patch into the existing code. I'll take a look at doing this.
  • Optimally, though, this would be handled by SQLite through proper collations. I know that matching and collation were supposed to become more flexible with, iirc, Firefox 4. Still, I'll look into it.
  • Actually the contents of the ".zotero-ft-cache" files suffer from this problem. I use the "Retrieve metadata for PDF" functionality a lot, and I wonder if this functionality uses contents in the mentioned cache file.

    P.S.:
    For reference, these character combinations are called "ligatures", and I found a short list of them here:
    http://en.wikipedia.org/wiki/Typographic_ligature#Ligatures_in_Unicode_.28Latin-derived_alphabets.29
  • The correct solution is to treat these as their normalized equivalents (fl, etc.) when we do the database queries that underlie searches. The database software allows this, but it looks like Mozilla doesn't include that module (see mention at https://wiki.mozilla.org/Firefox/Projects/FTS_and_Awesomebar#Week_of_2010.2F03.2F08).

    This is actually part of a fairly large problem that affects some other languages much more.

    We could, as I said, intercept those characters when making the full-text index, but it'd only be right to intercept the whole set of them, which is multilingual and quite large.
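
For anyone who wants to experiment with the "intercept when indexing" idea, here is a minimal sketch. It is not Zotero's code: it assumes Python and its standard `unicodedata` module, and the function name `normalize_for_index` is made up for illustration. Unicode compatibility normalization (NFKC) maps the Latin ligatures mentioned above back to plain letters without needing a hand-written table. Note that it also rewrites many other compatibility characters (superscripts, full-width forms, and so on), which is the "large, multilingual set" concern raised above.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Hypothetical helper: fold compatibility characters before indexing.

    NFKC replaces compatibility characters such as the ligatures
    U+FB00 (ff) and U+FB01 (fi) with their plain-letter equivalents.
    It also changes other compatibility characters, so it is broader
    than a ligature-only replacement.
    """
    return unicodedata.normalize("NFKC", text)


# Text as it might come out of a LaTeX-generated PDF.
extracted = "di\ufb00erent \ufb01les"
print(normalize_for_index(extracted))
```

The same function would have to be applied to the search query as well, so that both sides of the comparison are in the same form.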
