Refining PDF Index Output
Hi all,
First off, thanks for the great tool.
I have noticed that in some PDF files (especially those generated by LaTeX), certain character combinations are converted to single non-standard characters: for example, "ff" is replaced with the single character U+FB00, and "fi" with the single character U+FB01. These could be converted back to their original strings after indexing, which would prevent search problems.
P.S.:
For reference, character combinations of this kind are called "ligatures", and I found a short list of them here:
http://en.wikipedia.org/wiki/Typographic_ligature#Ligatures_in_Unicode_.28Latin-derived_alphabets.29
This is actually part of a fairly large problem that affects some other languages much more than English.
We could, as I said, intercept those characters when building the full-text index, but it would only be right to intercept the whole set of them, which is multilingual and quite large.
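To make the two options concrete, here is a rough Python sketch, not the tool's actual indexing code. The function names `expand_latin_ligatures` and `normalize_for_index` are made up for illustration. The first handles only the five Latin f-ligatures through an explicit table; the second assumes that Unicode compatibility normalization (NFKC) is acceptable for index text, which covers the wider multilingual set but also folds other compatibility characters such as superscripts and full-width forms.

```python
import unicodedata

# Hypothetical table: only the Latin f-ligatures U+FB00..U+FB04.
LATIN_F_LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def expand_latin_ligatures(text: str) -> str:
    """Replace the five Latin f-ligatures with their plain letters."""
    return text.translate(LATIN_F_LIGATURES)


def normalize_for_index(text: str) -> str:
    """Apply NFKC, which decomposes compatibility characters
    (ligatures included) across all scripts, not just Latin."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "e\ufb03cient \ufb01le"
    print(expand_latin_ligatures(sample))
    print(normalize_for_index(sample))
```

The table approach changes nothing except the listed characters, while the NFKC approach is broader and needs no maintenance, at the cost of altering more of the extracted text than ligatures alone.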