Ligatures are not copied from PDF in Zotero's PDF viewer, Report ID 865672033
Report ID 865672033
Thank you for all the lovely work put in Zotero!
Zotero displays ligatures, like "fi" or "ft", just fine in PDFs. Surprisingly,
a) ligatures are not copied from PDFs in Zotero's PDF reader, i.e. words with ligatures are munged when copied from the PDF
EDIT: Also, Zotero annotations (highlighted text) extracted from PDFs will have mangled text.
b) words with ligatures do not turn up in searches.
Example from PDF archived in Zotero:
"easily drift as speculation" - copied text using Preview (Mac OS X)
"easily dri as speculation" - copied text using Zotero’s PDF viewer
This is serious, because Zotero searches in PDFs will not find words with ligatures. A search for "drift" in the example above turns up 0 hits when
- searching the PDF in Zotero
- searching PDFs in a collection in Everything mode
The ligature issue was raised in April 2021:
https://forums.zotero.org/discussion/89000/search-returns-results-are-missing-compared-to-equivalent-library-in-endnote
The upshot then was: "Ultimately, Zotero should find results regardless of whether the search text or the extracted text includes a ligature."
The ligature issue seems to persist. I will be happy to help if possible.
Best regards,
Joe Siri Ekgren
*ligature: two or more letters, like "fi" or "ft", combined in a single glyph
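The goal quoted above (finding results regardless of whether the search text or the extracted text includes a ligature) could be sketched roughly as follows. This is only an illustration in Python, not how Zotero implements search; the function names `fold_ligatures` and `matches` are made up for this example. It relies on Unicode NFKC normalization, which expands ligature code points such as U+FB01 ("fi") into their component letters. Note that it can only help when the extracted text actually contains a ligature character. In the "dri" example above the glyph is missing from the extracted text altogether, and normalization cannot bring it back.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Expand Unicode ligature code points (e.g. U+FB01) into plain letters.

    NFKC also folds other compatibility characters (for example, superscript
    digits become ordinary digits), which is usually acceptable for search.
    """
    return unicodedata.normalize("NFKC", text)


def matches(query: str, extracted: str) -> bool:
    """Case-insensitive substring search that ignores ligature differences.

    Hypothetical helper: both the query and the extracted PDF text are
    folded the same way before comparing.
    """
    return fold_ligatures(query).casefold() in fold_ligatures(extracted).casefold()


if __name__ == "__main__":
    extracted = "we de\ufb01ne the term"  # contains the single "fi" ligature character
    print(matches("define", extracted))   # plain query, ligature in the text
    print(matches("de\ufb01ne", "We define the term"))  # ligature in the query
    # The glyph was dropped during extraction, so there is nothing to fold:
    print(matches("drift", "easily dri as speculation"))
```

Folding both sides with the same function keeps the comparison symmetric, so it does not matter whether the ligature appears in the query, in the extracted text, or in both.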
I sent a copy of a PDF, "Versification...", which has ligatures that Zotero can display, but not copy or search. The PDF was added on January 24, 2022, and its index status is "partial". Reindexing had no effect. A search for the string "fi" finds 468 matches.
1) Duplicating and renaming the PDF in an external PDF reader and then importing the file into Zotero does not solve the issue.
2) But I stumbled on a workaround that seems to work for this specific document: deleting the first page in an external PDF reader. The steps:
- open the PDF in Mac OS X Preview,
- duplicate the PDF
- delete the first page (of 309)
- save
- import to Zotero
The ligatures can now be copied and are found when searching. A search for the string "fi" now reports "more than 1000 matches".
3) Deleting the first page in the Zotero PDF viewer does not solve the issue.
Thank you for any thoughts on this issue.
Best regards,
Joe
Thank you for your time and consideration. Best regards, Joe
Using Acrobat Reader -> File -> Properties -> Fonts to list the embedded fonts, it seems the original PDF has about 100 embedded fonts*. These fonts have different
a) types such as
True Type,
True Type (CID),
Type 1,
Type 1 (CID)
b) encoding such as
ANSI
Identity-H
Roman
Custom
Deleting the first page in the Mac PDF reader (Preview) makes ligatures appear in Zotero annotations and document searches. It seems that deleting the first page removes all TrueType fonts and reduces the number of embedded fonts to six. These six embedded fonts all
a) are of type
Type 1
b) and use the encodings
Roman
Custom
I hope this helps.
Best regards,
Joe
* variants such as italic and bold are registered as separate fonts