Extracted PDF highlights include soft hyphen (U+00AD)

aaaaaaaaaaaaaaa · January 30, 2024

I appreciate that when I make highlight annotations in PDFs in Zotero, the extracted text that appears in the left sidebar automatically expunges line breaks and line-ending hyphens. However, I've noticed that when making highlights in some PDF files, an invisible soft hyphen character (U+00AD) is sometimes copied over into the extracted annotation. Usually this character appears in the middle of a word, which can prevent the annotation from being returned when searching for the term in the left-sidebar annotation window (for example, searching for "economic" without the soft hyphen would fail to return annotations containing "economic" with the soft hyphen). This can also lead to abnormal behavior when copying the extracted quotes into other programs like LibreOffice. Would the devs consider automatically removing this invisible character from extracted highlights, in the same way that line breaks and line-ending hyphens are removed?
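
To illustrate the search problem, here is a minimal Python sketch. It is not Zotero's code; the function name `strip_soft_hyphens` and the sample annotation (including where the soft hyphen sits in the word) are made up for the example.

```python
SOFT_HYPHEN = "\u00ad"  # the invisible SOFT HYPHEN code point


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


# Hypothetical extracted annotation; the soft hyphen position is invented.
annotation = "eco\u00adnomic growth"

# A plain substring search misses the word while the soft hyphen is present.
print("economic" in annotation)                      # False
# After stripping the soft hyphen, the same search succeeds.
print("economic" in strip_soft_hyphens(annotation))  # True
```

Stripping the character when the annotation text is extracted (rather than at search time) would also avoid the odd behavior when pasting into other programs, though that is only one possible approach.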

I'm using 7.0.0-beta.54+6b996d4f9 on macOS and the TestFlight beta on my iPad/iPhone. It seems that highlight annotations created on iOS include the soft hyphen character, whereas annotations made on my computer remove the soft hyphen and substitute a regular space character in its place.

martynas_b · February 1, 2024

So you're suggesting that the behavior of both the iOS and macOS apps is incorrect because the word can't be found later, correct?

We would like to investigate in which cases PDFs use those characters. Could you send an example PDF (or just single page) to support@zotero.org with a link to this thread?

martynas_b · February 12, 2024

I checked the characters in the PDF files, and it seems those hyphens don't exist. It appears they haven't been OCRed at all.

Do you have examples of other PDF viewers that work better with these PDF files?

aaaaaaaaaaaaaaa · February 12, 2024

@martynas_b Assuming I'm understanding what you mean by OCR, I'm observing something very different. For instance, when I use the PDF viewer app Podofyllin to analyze the embedded text of the PDF files I sent you, copy the problematic lines to my clipboard, and paste the text into an online Unicode character viewer tool, it shows that the problematic character is \u00AD, which is a soft hyphen.
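
For anyone who wants to repeat this check without an online tool, the following is a small Python sketch that lists the code point and Unicode name of each character in a pasted string. The helper name `describe_chars` and the sample string are only illustrations, not text taken from the PDFs in question.

```python
import unicodedata


def describe_chars(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Hypothetical clipboard contents with a soft hyphen in the middle.
sample = "pre\u00adserved"
for entry in describe_chars(sample):
    print(entry)
```

In the output, the invisible character shows up as its own entry, `U+00AD SOFT HYPHEN`, between the entries for the letters "e" and "s".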

The app PDF Expert behaves almost flawlessly, but not 100%. PDF Expert properly removes the soft-hyphen from the selected text when I copy/paste the excerpt in question onto my clipboard. For example, if I select and copy/paste the word "preserved" (printed on the page with "pre-" at the end of the line followed by "served" at the start of the next line), the text on my clipboard automatically merges the two fragments together, forming the full word preserved (as opposed to the broken words pre served or pre served, which are the results from iOS Zotero and macOS Zotero, respectively—note that they might look the same onscreen depending on your browser, but in fact each of the three versions include different characters). However, Zotero and PDF Expert both behave in a similarly-flawed way when I make a highlight/annotation that includes a word split by a soft-hyphen at the end of a line: the text contents of the highlight annotation—which are automatically extracted from the page and displayed in the left-sidebar alongside other highlights/annotations—includes a space in the middle of the word. To reiterate, PDF Expert properly mends together the split fragments of the word when copy/pasting from the page, but it behaves sub-par (like Zotero) when making highlights on the page, to the extent that it fails to merge the split fragments of the word separated by a soft hyphen at the end of line.

This is unexpected behavior because Zotero automatically merges the split fragments of a word when the hyphen at the end of the line is a typical "Hyphen-Minus" (U+002D) character. I would expect the same behavior from a soft hyphen, especially since there might not be any visual difference between these distinct Unicode hyphen characters. To the eye, they appear identical, yet they behave differently. Oddly (pleasantly), Zotero is doing something behind the scenes that makes the "Hyphen-Minus" character behave like a soft hyphen (as it should!), yet Zotero fails to give actual soft hyphens the same behind-the-scenes treatment, and so renders the soft hyphen improperly as a "space" character.
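
To make the expected behavior concrete, here is a rough Python sketch that treats a line-ending soft hyphen the same way as a line-ending Hyphen-Minus. It is only an illustration under my own assumptions, not how Zotero or PDF.js actually work; `join_lines` and `LINE_END_HYPHENS` are made-up names, and a real implementation would need extra care because genuinely hyphenated compounds that break at a line end would be merged incorrectly.

```python
LINE_END_HYPHENS = ("-", "\u00ad")  # Hyphen-Minus and soft hyphen


def join_lines(lines: list[str]) -> str:
    """Join lines of extracted text, mending words split at line ends."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if result.endswith(LINE_END_HYPHENS):
            # Drop the trailing hyphen and glue the two fragments together.
            result = result[:-1] + line
        elif result:
            # Ordinary line break: separate the lines with a single space.
            result += " " + line
        else:
            result = line
    # Soft hyphens inside a line are only break hints, so drop them as well.
    return result.replace("\u00ad", "")


print(join_lines(["was pre-", "served in"]))       # was preserved in
print(join_lines(["was pre\u00ad", "served in"]))  # was preserved in
```

Both calls give the same merged word, which is the result I would hope to see in the extracted annotation regardless of which hyphen character the PDF uses.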

martynas_b · February 12, 2024

OK, maybe that character actually exists. We have to investigate why PDF.js (which the Zotero reader is based on) doesn't see this character. Thanks for the analysis!