Zotfile extracting garbled text from Adobe Acrobat Pro DC
I recently began using Adobe Acrobat Pro DC to work with my PDFs. For the first time since installing it, I tried to extract highlighted text using Zotfile. The result is a string of garbled text roughly the same length as the text I highlighted. Here is an example:
">O'I\UKSLKZLY]PJLZ&)LJH\ZLL]PKLUJLZOV^Z[OL'^VYR( Z[\K'MV\UK[OH[ ̧[OVZLBWHY[PJPWHU[ZDYLJLP]PUNI\UKSLKZLY]PJLZ^LYL [OYLL[VMV\Y[PTLZTVYLSPRLS'[VHJOPL]LHTHQVYLJVUVTPJV\[JVTL [OHU[OVZL^OVZLZLY]PJLZ^LYLUV[I\UKSLK¹(I[(ZZVJPH[LZ 7H[O^H'Z[V:\JJLZZ!(U0U[LYPT(UHS'ZPZVM:LY]PJLZHUK6\[JVTLZ PU;OYLL7YVNYHTZ1\UL "
This only happens with some PDFs, not all. For the ones it hasn't worked with, I have made sure the text is editable using OCR. What am I missing?
I have used Zotfile for years (albeit with Acrobat Pro 9) and have never had issues with any PDF.
Problems could also occur if the file is saved in a more recent PDF format. Try to use PDF version 1.3 to 1.5, if possible. I don't know whether version 1.6 is problematic, but issues have been reported for version 1.7. You can check the PDF version in the file properties, accessible with Ctrl+D in many viewers. Some related issues on GitHub are:
https://github.com/jlegewie/zotfile/issues/418
https://github.com/zotero/zotero/issues/1018
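If you would rather check the version from a script than from a viewer, the version is also written in the first bytes of the file as a `%PDF-x.y` header. Here is a minimal Python sketch, assuming the header is intact. The function name `pdf_header_version` is just one I picked for illustration. Note that since PDF 1.4 a file may override the header with a /Version entry in its catalog, which this sketch does not read, so treat the result as a first hint only.

```python
import re
from typing import Optional


def pdf_header_version(path: str) -> Optional[str]:
    """Return the version from the '%PDF-x.y' header, or None if not found."""
    # The header normally sits at the very start of the file; reading the
    # first 1024 bytes tolerates a little leading junk.
    with open(path, "rb") as handle:
        head = handle.read(1024)
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```

Calling `pdf_header_version("paper.pdf")` (with your own file name in place of the placeholder) returns a string such as "1.4", which you can compare against the 1.3 to 1.5 range suggested above.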
I also reinstalled Acrobat 9 and tried it, with no luck either. Here is a link to a PDF with the problem: https://furman.box.com/s/p9f798z27t3i3q2u5ajdat08b9ywwe0l
Any other suggestions?
If you don't have this problem on your machine, it's possible that your Adobe program can fix the issue. E.g., try to remove the "Smallest File Size" setting when using the Distiller. If you didn't create the PDF yourself and have the same problem on your machine, try to run OCR software on it.
You can find more background on this issue here. Specifically, in your file the HelveticaNeue fonts that are used for the main text are missing a ToUnicode table. To see this, you could run xpdf's pdffonts function on your file as explained here, which shows 'no' for the 'uni' column.
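When a font has no ToUnicode table, the extractor only sees the font's internal character codes, which is why the output has the right length but the wrong letters. In the sample posted at the top of this thread, the codes happen to look like the real characters shifted down by 25 code points. This is only an observation about that one sample, not something a missing ToUnicode table guarantees. The helper name `shift_decode` and the default offset of 25 are my own assumptions for illustration:

```python
def shift_decode(garbled: str, offset: int = 25) -> str:
    """Shift every non-whitespace character up by `offset` code points."""
    # Whitespace is left untouched; everything else is moved up by the
    # offset that was observed in this one sample.
    return "".join(
        ch if ch.isspace() else chr(ord(ch) + offset) for ch in garbled
    )


# Fragment copied from the garbled sample (raw string because of the backslash).
fragment = r"I\UKSLKZLY]PJLZ"
print(shift_decode(fragment))  # prints: bundledservices
```

The spaces were lost during extraction, so the decoded words run together. For other files the mapping could be arbitrary, in which case running OCR on the file is the more reliable fix.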
Some general tips (not specific to the issue above):
1) If you create PDF files yourself, try to save them in PDF/A format, preferably PDF/A-1 (based on PDF version 1.4).
2) If your file has an issue related to the PDF version, try to convert the PDF to an older version.
3) If you have MS Word installed, right-click your PDF file and try to open it with Word. It will try to convert the PDF to DOCX. This could show you issues with the file (e.g., garbled text). If this works without issues, you could try to save the resulting DOCX as a PDF file, selecting PDF/A in the options.
Really appreciate the help. I am going to save your suggestions for the future. Thanks.
For group libraries, it might be useful to communicate the goal of keeping the files in a long-term archiving format such as PDF/A.