Ligatures are not copied from PDF in Zotero's PDF viewer, Report ID 865672033

edited September 26, 2022
Report ID 865672033

Thank you for all the lovely work put in Zotero!

Zotero displays ligatures, like "fi" or "ft", just fine in PDFs. Surprisingly,
a) ligatures are not copied from the PDF in Zotero's PDF reader, i.e. words with ligatures are munged when copied from the PDF
EDIT: Also, Zotero annotations (highlighted text) extracted from PDFs will have mangled text.
b) words with ligatures do not turn up in searches.

Example from PDF archived in Zotero:
"easily drift as speculation" - copied text using Preview (Mac OS X)
"easily dri as speculation" - copied text using Zotero’s PDF viewer
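
To pinpoint exactly which characters go missing, the two copied strings can be diffed. This is only a minimal Python sketch using the standard library; the helper name `dropped_spans` is my own and has nothing to do with Zotero or PDF.js.

```python
from difflib import SequenceMatcher


def dropped_spans(reference: str, extracted: str) -> list[str]:
    """Return the pieces of `reference` that are missing or altered in `extracted`."""
    matcher = SequenceMatcher(None, reference, extracted, autojunk=False)
    return [
        reference[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


preview_text = "easily drift as speculation"  # copied using Preview
zotero_text = "easily dri as speculation"     # copied using Zotero's PDF viewer
print(dropped_spans(preview_text, zotero_text))
```

For the example above this points at the "ft" pair, i.e. the ligature glyph is dropped entirely rather than replaced by another character.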

This is serious, because Zotero searches in PDFs will not find words with ligatures. A search for "drift" in the example above will turn up 0 hits when
- searching the PDF in Zotero
- searching PDFs in a collection in Everything mode

The ligature issue was raised in April 2021:
https://forums.zotero.org/discussion/89000/search-returns-results-are-missing-compared-to-equivalent-library-in-endnote
The upshot then was: "Ultimately, Zotero should find results regardless of whether the search text or the extracted text includes a ligature."
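
As an illustration of what ligature-insensitive matching could look like, here is a minimal Python sketch. It assumes the extracted text still contains the Unicode ligature code point (for example U+FB01 for "fi") and folds it with NFKC normalization. The function names are hypothetical and this is not how Zotero or PDF.js actually implement search. Note that NFKC also folds other compatibility characters (full-width forms, superscripts), which may or may not be wanted.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01 into plain letters."""
    return unicodedata.normalize("NFKC", text)


def ligature_insensitive_find(needle: str, haystack: str) -> bool:
    """Case-insensitive substring test that ignores ligature code points."""
    return fold_ligatures(needle).casefold() in fold_ligatures(haystack).casefold()


# Works when the ligature survives extraction as a single code point:
print(ligature_insensitive_find("significant", "a signi\ufb01cant result"))

# Does not help when the glyph was dropped during extraction:
print(ligature_insensitive_find("drift", "easily dri as speculation"))
```

In the PDF reported here the ligature seems to be lost during text extraction ("dri" instead of "drift"), so normalization at search time alone would not fix this particular file.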

The ligature issue seems to persist. I will be happy to help if I can.

Best regards,
Joe Siri Ekgren

*ligature: two or more letters, like "fi" or "ft", combined in a single glyph
  • Can you link to an example PDF, or email it to support@zotero.org with a link to this thread?
  • Thank you for your reply. I have now sent a mail to support@zotero.org.

    I sent a copy of a PDF, "Versification...", which has ligatures that Zotero can display, but not copy or search. The PDF was added January 24, 2022. Indexed "partial". Reindexing had no effect. The string "fi" is found 468 times.

    1) Duplicating and renaming using an external PDF reader and importing the file into Zotero does not solve the issue.

    2) But I stumbled over a workaround:
    Deleting the first page using an external PDF reader seems to fix this specific document:

    - open the PDF in Mac OS X Preview
    - duplicate the PDF
    - delete the first page (of 309)
    - save
    - import to Zotero

    The ligatures can now be copied and are found when searching. Searching for the string "fi" has "more than 1000 matches".

    3) Deleting the first page in the Zotero PDF viewer does not solve the issue.

    Thank you for any thoughts on this issue.

    Best regards,

    Joe
  • edited September 27, 2022
    Yes, Preview.app recognizes ligatures correctly in this file, but the two other PDF readers I tested also show empty characters.
  • We'll report the issue to PDF.js (our underlying library).
  • @martynas_b and @

    Thank you for your time and consideration. Best regards, Joe
  • @martynas_b The issue of ligatures vanishing from Zotero PDF annotations and searches could be due to embedded fonts.

    Using Acrobat Reader -> File -> Properties -> Fonts to list the embedded fonts, it seems the original PDF has about 100 embedded fonts*. These fonts have different
    a) types such as
    TrueType,
    TrueType (CID),
    Type 1,
    Type 1 (CID)
    b) encoding such as
    ANSI
    Identity-H
    Roman
    Custom

    Deleting the first page in the Mac PDF reader (Preview) makes ligatures appear in Zotero annotations and document searches. It seems that deleting the first page removes all TrueType fonts and reduces the number of embedded fonts to six. These six embedded fonts
    a) are all of type
    Type 1
    b) and use the encodings
    Roman
    Custom

    I hope this helps.

    Best regards,
    Joe

    * variants such as italic and bold are registered as separate fonts