Zotfile extracting garbled text from Adobe Acrobat Pro DC

smorris11 · April 4, 2020

I recently began using Adobe Acrobat Pro DC to work with my PDFs. For the first time since installing it, I tried to extract highlighted text using Zotfile. The result is a string of garbled text roughly the same length as the text I highlighted. Here is an example:
">O'I\UKSLKZLY]PJLZ&)LJH\ZLL]PKLUJLZOV^Z[OL'^VYR( Z[\K'MV\UK[OH[ ̧[OVZLBWHY[PJPWHU[ZDYLJLP]PUNI\UKSLKZLY]PJLZ^LYL [OYLL[VMV\Y[PTLZTVYLSPRLS'[VHJOPL]LHTHQVYLJVUVTPJV\[JVTL [OHU[OVZL^OVZLZLY]PJLZ^LYLUV[I\UKSLK¹(I[(ZZVJPH[LZ 7H[O^H'Z[V:\JJLZZ!(U0U[LYPT(UHS'ZPZVM:LY]PJLZHUK6\[JVTLZ PU;OYLL7YVNYHTZ1\UL "

This only happens with some PDFs, not all. For the ones it hasn't worked with, I have made sure the text is editable using OCR. What am I missing?

I have used Zotfile for years (albeit with Acrobat Pro 9) and have never had issues with any PDF.

qqbb · April 4, 2020

A first test could be to select some text in your pdf viewer and copy-paste it to a text editor. If this doesn't work, there's something wrong with your pdf. See the discussion here for some possible issues, e.g., a poor OCR layer.

Problems could also occur if the file is saved in a more recent PDF format. Try to use PDF version 1.3 to 1.5, if possible. I don't know whether version 1.6 is problematic, but issues have been reported for version 1.7. You can check the PDF version in the file properties, accessible with Ctrl+D in many viewers. Some related issues on GitHub are:
https://github.com/jlegewie/zotfile/issues/418
https://github.com/zotero/zotero/issues/1018
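
If you would rather check the version without opening a viewer, here is a minimal Python sketch that reads the version declared in the file header. The function name `pdf_header_version` is made up for illustration. Note that from PDF 1.4 onward the document catalog may carry a /Version entry that overrides the header, so treat the result as a first hint rather than the final word.

```python
import re


def pdf_header_version(path):
    """Return the version string from the PDF header (e.g. '1.7'), or None."""
    # The header normally sits at the very start of the file; read a little
    # more than strictly needed in case a few stray bytes precede it.
    with open(path, "rb") as handle:
        head = handle.read(1024)
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```

Calling `pdf_header_version("paper.pdf")` (a hypothetical file name) returns a string such as `"1.7"`, or `None` if no header is found in the first kilobyte.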

smorris11 · April 6, 2020

qqbb - Thanks for the suggestions. I can copy and paste text just fine from the pdf viewer to the text editor. I tried saving the file in an older format with no luck.

I also re-downloaded Acrobat 9 and tried it - also no luck. Here is a link to a pdf with the problem: https://furman.box.com/s/p9f798z27t3i3q2u5ajdat08b9ywwe0l.

Any other suggestions?

qqbb · April 6, 2020

It seems to me that your PDF file has a font encoding issue. Selecting the whole text in my PDF viewer (Ctrl/Cmd+A), copying (Ctrl/Cmd+C), and pasting (Ctrl/Cmd+V) to a text editor, I mostly get garbled text. Some parts are fine, e.g., the quotes on the first page. I get the same issue with this online tool.
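
As a rough illustration of why this looks like an encoding problem rather than random corruption: in the garbled excerpt from the first post, most characters appear to be shifted by a constant amount. The sketch below adds 25 to each character code. The offset of 25 is only an observation about this particular sample, not a general rule, and the helper name `shift_codes` is made up.

```python
def shift_codes(text, offset=25):
    """Add a fixed offset to the code of every non-whitespace character."""
    # Whitespace is left alone; the garbled sample has already lost most of
    # its original spaces, so the decoded words run together.
    return "".join(ch if ch.isspace() else chr(ord(ch) + offset) for ch in text)


# Fragment copied from the garbled excerpt in the first post.
sample = r"I\UKSLKZLY]PJLZ"
print(shift_codes(sample))  # prints "bundledservices"
```

A few characters in the excerpt (the apostrophes, for example) do not map cleanly with this offset, so the sketch is not a repair tool. It only shows that the extracted text consists of consistent but wrongly interpreted character codes, which is what you would expect when the mapping back to Unicode is missing.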

If you don't have this problem on your machine, it's possible that your Adobe program can fix the issue. E.g., try to remove the "Smallest File Size" setting when using the Distiller. If you didn't create the PDF yourself and have the same problem on your machine, try to run OCR software on it.

You can find more background on this issue here. Specifically, in your file the HelveticaNeue fonts that are used for the main text are missing a ToUnicode table. To see this, you could run xpdf's pdffonts command-line tool on your file as explained here; it shows 'no' in the 'uni' column for those fonts.
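
For checking several files at once, the pdffonts output can be scanned with a short Python sketch like the one below. It assumes the pdffonts binary (from xpdf or poppler-utils) is on your PATH and that the output starts with a header row containing a 'uni' column followed by a dashed separator row. The function names are made up for illustration. A 'no' in that column marks a candidate for extraction problems, not a guaranteed failure.

```python
import subprocess


def fonts_without_unicode_map(pdffonts_output):
    """Return names of fonts whose 'uni' column reads 'no'."""
    lines = pdffonts_output.splitlines()
    if len(lines) < 3:
        return []  # no output, or a header with no fonts listed
    header = lines[0].split()
    # The 'type' column can contain spaces ("Type 1C"), so locate 'uni' by
    # counting columns from the right, where every value is a single token.
    from_right = len(header) - header.index("uni")
    missing = []
    for line in lines[2:]:  # skip the header row and the dashed row
        fields = line.split()
        if len(fields) >= len(header) and fields[-from_right] == "no":
            # Caveat: a font name containing spaces would be cut short here.
            missing.append(fields[0])
    return missing


def check_pdf(path):
    """Run pdffonts on a file and list fonts lacking a ToUnicode map."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return fonts_without_unicode_map(result.stdout)
```

With a file like yours, `check_pdf("file.pdf")` (a hypothetical path) should list the HelveticaNeue fonts.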

Some general tips (not specific to the issue above):
1) If you create PDF files yourself, try to save them in PDF/A format, preferably PDF/A-1 (based on PDF version 1.4).
2) If your file has an issue related to the PDF version, try to convert the PDF to an older version.
3) If you have MS Word installed, right-click your PDF file and try to open it with Word. It will try to convert the PDF to DOCX. This could show you issues with the file (e.g., garbled text). If this works without issues, you could try to save the resulting DOCX as a PDF file, selecting PDF/A in the options.

smorris11 · April 7, 2020

qqbb - Completely right. I got an error when messing around with it that indicated a font encoding issue. Before I got your message, however, I found the original source of the file and re-imported it into Zotero (the PDF, and others with the same issue, came from a group library, and I did not import the file originally). I tried highlighting the re-imported file and extracting annotations...no problem. Not sure how the files got messed up.

Really appreciate the help. I am going to save your suggestions for the future. Thanks.

qqbb · April 7, 2020

"Not sure how the files got messed up."

Maybe someone tried to "optimize" the pdf files by running Distiller with the "Smallest File Size" setting on them. This might have removed the ToUnicode table, as explained here: https://stackoverflow.com/a/12190234.

For group libraries, it might be useful to communicate the goal of keeping the files in a long-term archiving format such as PDF/A.