PDF Reader - highlighted text contains no spaces with some PDFs

richard.masters · June 30, 2021

I have just noticed that in the beta PDF Reader *for some PDFs* highlighted text is shown in the annotations panel without spaces - i.e. all text is run together.

When I open those same PDFs in other PDF readers, highlighted text is shown correctly, with spaces.

I cannot see any differences between PDFs that the PDF Reader handles correctly and those that it doesn't - has anyone else noticed this, or can help with diagnosis?

Zotero 5.0.97-beta.33+fdcd4e51c on Windows 10

martynas_b · June 30, 2021

The highlighted text shown in the annotations sidebar is extracted text (the same text you get when copying a selection). Some PDFs, especially scanned and OCRed ones, have a very poor-quality text layer with various errors. If you want to know whether other PDF readers extract it correctly, you have to copy the text from them and paste it somewhere.

If those PDFs are scanned, newer and more advanced OCR software could help by replacing the text layer.

richard.masters · June 30, 2021

Thanks for responding @martynas_b. What I don't understand is that, with the exact same PDF file, Zotero PDF Reader gives me:
"plagueofbedbugsandfamilyillness.Attheotherendofthecountry,acountessispaintingbotanical"
and PDF-XChange Editor gives me:
"plague of bed bugs and family illness. At the other end of the country, a count
ess is painting botanical"

As far as I can see, the two programs must be interpreting what they pull out of the text layer differently.

martynas_b · June 30, 2021

@richard.masters If you want, you can send us the PDF to support@zotero.org with a link to this thread, and we will try to improve our PDF reader in future versions.

richard.masters · June 30, 2021

Thanks @martynas_b - done.

m.lana · August 2, 2022

hi,
i would like to renew interest in this issue, which persists in version 6.0.11 of Zotero (i am on a mac with macOS Monterey 12.2.1).

i found a PDF whose text, when imported into an editor, shows only 'stringed' words (a group of words forms 1 single string), while the same text imported into a word processor shows both spaced and stringed words.
when i paste the text into Sublime Text, it appears that the stringed words are separated by 1 <0x2029> character, while the spaced words are separated by 2 characters: <0x2029> and a usual space.
apparently <0x2029> is U+2029, the Unicode character 'PARAGRAPH SEPARATOR'.
could you, the developers, convert <0x2029> into a space for the needs of the users?
best
Maurizio
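
For anyone who wants to check pasted text for the same separators, here is a minimal Python sketch. It assumes the extracted text has already been pasted into a string; `list_separators` is a hypothetical helper name, not part of Zotero, and it only reports characters in the Unicode separator categories (Zs, Zl, Zp).

```python
import unicodedata


def list_separators(text):
    """Return (index, code point, Unicode name) for each separator character.

    Separator characters are those whose Unicode general category starts
    with "Z": ordinary spaces (Zs), line separators (Zl) and paragraph
    separators (Zp).
    """
    found = []
    for index, char in enumerate(text):
        if unicodedata.category(char).startswith("Z"):
            code_point = f"U+{ord(char):04X}"
            name = unicodedata.name(char, "UNNAMED")
            found.append((index, code_point, name))
    return found


# Illustrative sample only: one "stringed" gap (U+2029 alone) and one
# "spaced" gap (U+2029 followed by an ordinary space).
sample = "plague\u2029of\u2029 bed"
for entry in list_separators(sample):
    print(entry)
```

With the sample above, the first gap shows up as a lone PARAGRAPH SEPARATOR and the second as a PARAGRAPH SEPARATOR followed by a SPACE, which matches the two patterns described in this post.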

martynas_b · August 2, 2022

@m.lana We really need an example PDF to investigate cases like this. You could send it to support@zotero.org with a link to this thread.

m.lana · August 2, 2022

ok. i am sending it now

m.lana · August 22, 2022

any update?

martynas_b · August 22, 2022

@m.lana Could you send it again?

m.lana · August 22, 2022

re-sent just now!
:-)

martynas_b · August 22, 2022

@m.lana The PDF file has issues in other PDF viewers as well (e.g. in Preview.app each word is on a separate line). We plan to make some improvements, but not any time soon.

m.lana · August 22, 2022

hum... when i open it in the preview app i see it perfectly paged.
and the trouble with words not being separated when importing from annotations into notes also happens with other PDFs

it happens because at the end of every word there is a <0x2029>, i.e. U+2029, the Unicode character 'PARAGRAPH SEPARATOR',
and preview properly treats every word as a 'finished' paragraph.
the minimum would be to change <0x2029> into spaces when one selects an annotation for extraction as a note. this way a true 'PARAGRAPH SEPARATOR' is lost, but that is not terrible given that one is generating excerpts of text
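
The replacement suggested here can be sketched in a few lines of Python. This is a user-side illustration only, not Zotero's code; `paragraph_separators_to_spaces` is a hypothetical name, and collapsing the doubled spaces afterwards is an added assumption so that gaps that already had a space do not end up with two.

```python
import re

PARAGRAPH_SEPARATOR = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def paragraph_separators_to_spaces(text):
    """Replace each U+2029 with a space, then collapse runs of spaces."""
    spaced = text.replace(PARAGRAPH_SEPARATOR, " ")
    # A gap that was "U+2029 + space" is now two spaces; reduce it to one.
    return re.sub(r" {2,}", " ", spaced).strip()


# Illustrative input mixing both gap styles described in this thread.
extracted = "plague\u2029of\u2029 bed\u2029bugs"
print(paragraph_separators_to_spaces(extracted))
```

As noted above, this loses any genuine paragraph breaks, which is acceptable only because the text is a short excerpt.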

martynas_b · August 22, 2022

@m.lana Is there a reason why that PDF is doing this?

Yes, we can probably treat the paragraph separator as a space, but we usually try to avoid working around random bugs in other PDF exporters.

m.lana · August 23, 2022

i understand your argument.
in fact i saw that when this happens, i can paste the flawed text extraction into a programmer's editor, search for <0x2029> and replace it with a space.
not terrible.
i also noticed that the problem already shows when you drag the pointer to select an area of text: if the text is justified left and right but the selection highlight is narrower than the visual margins of the text, then the text contains <0x2029> characters