PDF Reader - highlighted text contains no spaces with some PDFs
I have just noticed that in the beta PDF Reader *for some PDFs* highlighted text is shown in the annotations panel without spaces - i.e. all text is run together.
When I open those same PDFs in other PDF readers, highlighted text is shown correctly, with spaces.
I cannot see any differences between PDFs that the PDF Reader handles correctly and those that it doesn't - has anyone else noticed this, or can help with diagnosis?
Zotero 5.0.97-beta.33+fdcd4e51c on Windows 10
If those PDFs are scanned, newer and more advanced OCR software could help by replacing the text layers.
"plagueofbedbugsandfamilyillness.Attheotherendofthecountry,acountessispaintingbotanical"
and PDF-XChange Editor giving me:
"plague of bed bugs and family illness. At the other end of the country, a count
ess is painting botanical"
As far as I can see, the two programs must be interpreting what they pull out of the text layer differently.
I would like to renew interest in this issue, which remains in Zotero 6.0.11 (I am on a Mac with macOS Monterey 12.2.1).
I found a PDF whose text, when imported into an editor, shows only 'stringed' words (a group of words forms one single string), while the same text imported into a word processor shows both spaced and stringed words.
When I paste the text into Sublime Text, it appears that the stringed words are separated by one <0x2029> character, while the spaced words are separated by two characters: <0x2029> and a regular space.
Apparently <0x2029> is the UTF-16 code unit for the Unicode character U+2029 'PARAGRAPH SEPARATOR'.
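For anyone who wants to check their own PDFs, here is a rough Python sketch that lists which separator characters occur in a pasted excerpt. The function name `describe_separators` and the sample string are made up for illustration; this is not anything Zotero itself runs.

```python
import unicodedata
from collections import Counter

def describe_separators(text):
    """Count separator and control characters in text, keyed by code point and Unicode name."""
    counts = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        # "Z*" categories are separators (Zs space, Zl line, Zp paragraph);
        # "Cc" covers control characters such as tab and newline.
        if category.startswith("Z") or category == "Cc":
            # Control characters have no Unicode name, so fall back to a placeholder.
            label = f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
            counts[label] += 1
    return dict(counts)

# Hypothetical excerpt: three words joined only by U+2029, one joined by U+2029 plus a space.
sample = "plague\u2029of\u2029bed\u2029 bugs"
print(describe_separators(sample))
```

If the only separator reported between words is U+2029, the excerpt matches the 'stringed' pattern described above.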
Could the developers convert <0x2029> into a space, for the benefit of users?
best
Maurizio
:-)
The problem of words not being separated when importing annotations into notes also happens with other PDFs.
It happens because at the end of every word there is a <0x2029>, which represents the Unicode character 'PARAGRAPH SEPARATOR',
and Preview properly treats every word as a 'finished' paragraph.
The minimum fix would be to change <0x2029> into a space when one selects an annotation for extraction as a note. That way a true 'PARAGRAPH SEPARATOR' is lost, but this is not terrible given that one is generating excerpts of text.
Yes, we can probably treat the paragraph separator as a space, but usually we try to avoid fixing other PDF exporters' random bugs.
In fact, I saw that when this happens, I can paste the flawed text extraction into a programmer's editor, search for <0x2029> and replace it with a space.
Not terrible.
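The same search-and-replace can be scripted. A minimal Python sketch, assuming the excerpt is already available as a string; `normalize_separators` is a hypothetical helper, not part of Zotero:

```python
import re

PARAGRAPH_SEPARATOR = "\u2029"  # Unicode PARAGRAPH SEPARATOR

def normalize_separators(text):
    """Turn each paragraph separator into a space, then collapse repeated spaces."""
    spaced = text.replace(PARAGRAPH_SEPARATOR, " ")
    # Words separated by U+2029 plus a regular space would otherwise end up with two spaces.
    return re.sub(r" {2,}", " ", spaced).strip()

# Hypothetical excerpt mixing both separator patterns.
print(normalize_separators("plague\u2029of\u2029bed\u2029 bugs"))
```

Collapsing the doubled spaces handles both cases at once: words joined by a lone <0x2029> and words joined by <0x2029> plus a space. As noted earlier in the thread, any genuine paragraph break is flattened too.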
I realized that this problem already shows up when you drag the pointer to select an area of text: if the text is justified on both the left and the right but the selection highlight is narrower than the visual margins of the text, then the text contains <0x2029> characters.