PDF Reader - highlighted text contains no spaces with some PDFs

I have just noticed that in the beta PDF Reader *for some PDFs* highlighted text is shown in the annotations panel without spaces - i.e. all text is run together.

When I open those same PDFs in other PDF readers, highlighted text is shown correctly, with spaces.

I cannot see any differences between PDFs that the PDF Reader handles correctly and those that it doesn't. Has anyone else noticed this, or can anyone help with diagnosis?

Zotero 5.0.97-beta.33+fdcd4e51c on Windows 10
  • The highlighted text that is visible in the annotations sidebar is extracted text (the same text that you get when copying a selection). Some PDFs, especially scanned and OCRed ones, have a very poor quality text layer with various errors. If you want to know whether other PDF readers are extracting it correctly, you have to copy and paste that text somewhere.

    If those PDFs are scanned, re-running OCR with newer, more advanced software could help by replacing the text layer.
  • Thanks for responding @martynas_b. What I don't understand is that, with the exact same PDF file, the Zotero PDF Reader gives me:
    and PDF-XChange Editor gives me:
    "plague of bed bugs and family illness. At the other end of the country, a count
    ess is painting botanical"

    As far as I can see, the two programs must be interpreting what they pull out of the text layer differently.
  • @richard.masters If you want, you can send us the PDF to support@zotero.org with a link to this thread, and we will try to improve our PDF reader in future versions.
  • Hi,
    I would like to revive interest in this issue, which persists in Zotero 6.0.11 (I am on a Mac running macOS Monterey 12.2.1).

    I found a PDF whose text, when imported into a text editor, shows only 'stringed' words (a group of words runs together as one single string), whereas the same text imported into a word processor shows both spaced and stringed words.
    When I paste the text into Sublime Text, it appears that the stringed words are separated by 1 character, <0x2029>, while the spaced words are separated by 2 characters: <0x2029> and a regular space.
    Apparently <0x2029> is the UTF-16 encoding of the Unicode character 'PARAGRAPH SEPARATOR' (U+2029).
    Could the developers convert <0x2029> into a space when extracting the text?
  • @m.lana We really need an example PDF to investigate cases like this. You could send it to support@zotero.org with a link to this thread.
  • OK, I am sending it now.
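
The <0x2029> observation above suggests a workaround that can be applied to copied text on the user side. As a rough sketch only (this is not Zotero's code; the function names `separator_names` and `normalize_separators` are made up for illustration, and treating U+2028 LINE SEPARATOR the same way as U+2029 is an assumption), the following Python checks which separator characters a copied highlight contains and replaces them with an ordinary space:

```python
import re
import unicodedata

PARAGRAPH_SEPARATOR = "\u2029"  # the <0x2029> character reported in the thread
LINE_SEPARATOR = "\u2028"       # assumption: handled the same way as U+2029


def separator_names(text: str) -> list[str]:
    """Return the Unicode names of all separator characters found in text.

    Characters in the general categories Zs, Zl and Zp (space, line and
    paragraph separators) are reported, which makes it easy to see whether
    words are split by a regular space or by something else.
    """
    found = {
        unicodedata.name(ch)
        for ch in text
        if unicodedata.category(ch).startswith("Z")
    }
    return sorted(found)


def normalize_separators(text: str) -> str:
    """Replace paragraph/line separators with spaces and collapse runs."""
    # Turn each separator into an ordinary space.
    replaced = text.replace(PARAGRAPH_SEPARATOR, " ").replace(LINE_SEPARATOR, " ")
    # A separator followed by a real space would leave two spaces; collapse them.
    return re.sub(r" {2,}", " ", replaced).strip()


if __name__ == "__main__":
    # Hypothetical sample imitating the pattern described in the thread:
    # some words split only by U+2029, others by U+2029 plus a space.
    sample = "plague\u2029of\u2029 bed bugs"
    print(separator_names(sample))
    print(normalize_separators(sample))
```

Pasting a problematic highlight into `separator_names` shows whether the run-together words are really separated by U+2029, which is also useful information to include when sending an example PDF to the developers.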