PDF Reader - highlighted text contains no spaces with some PDFs

I have just noticed that in the beta PDF Reader *for some PDFs* highlighted text is shown in the annotations panel without spaces - i.e. all text is run together.

When I open those same PDFs in other PDF readers, highlighted text is shown correctly, with spaces.

I cannot see any differences between PDFs that the PDF Reader handles correctly and those that it doesn't - has anyone else noticed this, or can help with diagnosis?

Zotero 5.0.97-beta.33+fdcd4e51c on Windows 10
  • The highlighted text that is visible in the annotations sidebar is extracted text (the same text you get when copying a selection). Some PDFs, especially scanned and OCRed ones, have a very poor-quality text layer with various errors. To find out whether other PDF readers extract it correctly, you have to copy the text from them and paste it somewhere.

    If those PDFs are scanned, re-running them through newer, more advanced OCR software could help by replacing the text layer.
  • Thanks for responding @martynas_b. What I don't understand is that, with the exact same PDF file, Zotero PDF Reader gives me:
    "plagueofbedbugsandfamilyillness.Attheotherendofthecountry,acountessispaintingbotanical"
    while PDF-XChange Editor gives me:
    "plague of bed bugs and family illness. At the other end of the country, a count
    ess is painting botanical"

    As far as I can see, the two programs must be interpreting what they pull out of the text layer differently.
  • @richard.masters If you want, you can send us the PDF to support@zotero.org with a link to this thread, and we will try to improve our PDF reader in future versions.
  • hi,
    i would like to renew interest in this issue, which persists in Zotero 6.0.11 (i am on a Mac with macOS Monterey 12.2.1).

    i found a PDF whose text, when imported into an editor, shows only 'stringed' words (a group of words makes one single string), while the same text imported into a word processor shows both spaced and stringed words.
    when i paste the text into Sublime Text, it appears that the stringed words are separated by 1 character, <0x2029>, while the spaced words are separated by 2 characters: <0x2029> and a usual space.
    apparently <0x2029> is the UTF-16 encoding of the Unicode character 'PARAGRAPH SEPARATOR' (U+2029)
    could you, the developers, convert <0x2029> into a space for the needs of the users?
    best
    Maurizio
  • @m.lana We really need an example PDF to investigate cases like this. You could send it to support@zotero.org with a link to this thread.
  • ok. i am sending it now
  • any update?
  • @m.lana Could you send it again?
  • re-sent just now!
    :-)
  • @m.lana The PDF file has issues in other PDF viewers as well (e.g. in Preview.app each word is on a separate line). We plan to make some improvements, but not any time soon.
    hum... when i open it in Preview.app i see it perfectly laid out.
    and the trouble of words not being separated when importing annotations into notes also happens with other PDFs

    it happens because at the end of every word there is a <0x2029>, the UTF-16 encoding of the Unicode character 'PARAGRAPH SEPARATOR' (U+2029),
    and Preview properly treats every word as a 'finished' paragraph.
    the minimum would be to change <0x2029> into spaces when one selects an annotation for extraction as a note. this way a true 'PARAGRAPH SEPARATOR' is lost, but that is not terrible given that one is generating excerpts of text

  • @m.lana Is there a reason why that PDF is doing this?

    Yes, we can probably treat the paragraph separator as a space, but usually we try to avoid fixing other PDF exporters' random bugs.
  • i understand your argument.
    in fact i saw that when this happens, i can paste the flawed text extraction into a programmer's editor, search for <0x2029> and replace it with a space.
    not terrible.
    i also noticed that this problem already shows when you drag the pointer to select an area of text: if the text is left and right justified but the highlight of the selection is narrower than the visual margins of the text, then the text contains <0x2029> characters
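
For readers who hit the same problem, the manual search-and-replace workaround described above can be scripted. The following is only a minimal sketch in Python, not anything Zotero itself does: the function name `normalize_extracted_text` is hypothetical, and it assumes the only defect in the copied text is U+2029 (PARAGRAPH SEPARATOR) standing where a space belongs, sometimes next to a real space.

```python
import re

# U+2029 is the Unicode PARAGRAPH SEPARATOR reported in this thread.
PARAGRAPH_SEPARATOR = "\u2029"


def normalize_extracted_text(text: str) -> str:
    """Replace every U+2029 with a space, then collapse doubled spaces.

    Hypothetical helper: it assumes paragraph separators in the
    extracted text only ever stand in for word boundaries.
    """
    # Step 1: turn each paragraph separator into an ordinary space.
    spaced = text.replace(PARAGRAPH_SEPARATOR, " ")
    # Step 2: words that were separated by U+2029 *and* a space now have
    # two spaces between them, so collapse runs of spaces into one.
    return re.sub(r" {2,}", " ", spaced).strip()


if __name__ == "__main__":
    # Mixed input: some words separated by U+2029 only, one by U+2029 + space.
    sample = "plague\u2029of\u2029bed\u2029 bugs\u2029and\u2029family\u2029illness."
    print(normalize_extracted_text(sample))
```

As noted earlier in the thread, this discards any genuine paragraph breaks, which is usually acceptable for short excerpts but not for longer passages.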