'Extract PDF annotations' message hangs on Linux/Ubuntu
Hello,
Thanks for Zotero!
I'm trying to extract the annotations from PDFs on a 64-bit GNU/Linux machine (Ubuntu 12.10) with the Zotero 4.0.12 plugin for Firefox. The extraction previously worked as expected, after I followed the instructions to install pdftotext and pdfinfo and created symbolic links in the Zotero data directory (called pdftotext-Linux-x86_64 and pdfinfo-Linux-x86_64) to their executables in /usr/bin.
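For anyone repeating the symlink setup described above, here is a minimal Python sketch of that step. The function name, the `search_path` parameter and the example data directory path are my own assumptions for illustration; only the link names (pdftotext-Linux-x86_64, pdfinfo-Linux-x86_64) come from the post, so check where your Zotero data directory actually is before running anything like this.

```python
import os
import shutil
from pathlib import Path


def link_poppler_tools(data_dir, tools=("pdftotext", "pdfinfo"),
                       suffix="Linux-x86_64", search_path=None):
    """Create <tool>-<suffix> symlinks in data_dir pointing at the installed tools.

    search_path is a hypothetical override for PATH, mainly useful for testing;
    None means "use the normal PATH".
    """
    data_dir = Path(data_dir)
    created = {}
    for tool in tools:
        # Locate the executable, e.g. /usr/bin/pdftotext
        target = shutil.which(tool, path=search_path)
        if target is None:
            raise FileNotFoundError(f"{tool} was not found; install it first")
        link = data_dir / f"{tool}-{suffix}"
        # Replace a stale or broken link left over from an earlier attempt
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)
        created[tool] = link
    return created


# Example call (the directory below is an assumption, not a documented default):
# link_poppler_tools(os.path.expanduser("~/zotero-data"))
```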
Recently, however, when I try to extract annotations, Zotero just hangs with the pop-up message 'extracting annotations' and a progress bar. I can cancel the pop-up with ESC, and Firefox returns to normal, but the annotations are not extracted, no matter how long I wait.
Pdfinfo and pdftotext seem to be working: I can call them outside of Zotero, and pdftotext certainly does convert PDFs to text, as its name implies (but I can't find my annotations in its output). However, pdftotext ALWAYS seems to generate the following error, irrespective of the PDF file being processed: "Syntax Error (392865): Bad block header in flate stream". The error appears even when an output txt file is successfully generated.
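To separate "the tool printed an error" from "the tool actually failed" when checking this outside of Zotero, something like the following sketch may help. It runs a command, then reports the exit code, the text output and the stderr lines separately. The helper name `run_tool` and the file name `sample.pdf` are hypothetical; the only pdftotext usage assumed is passing `-` as the output file to write to stdout.

```python
import subprocess


def run_tool(argv, timeout=60):
    """Run a command and return (exit_code, stdout_text, stderr_lines)."""
    proc = subprocess.run(
        argv,
        capture_output=True,   # collect stdout and stderr instead of printing them
        text=True,
        errors="replace",      # do not crash on bytes that are not valid text
        timeout=timeout,
    )
    stderr_lines = [line for line in proc.stderr.splitlines() if line.strip()]
    return proc.returncode, proc.stdout, stderr_lines


# Example (sample.pdf is a placeholder file name):
# code, text, messages = run_tool(["pdftotext", "sample.pdf", "-"])
# print("exit code:", code)
# print("characters of text:", len(text))
# for msg in messages:
#     print("stderr:", msg)
```

An exit code of 0 together with non-empty text and some stderr lines would match what is described above: the message is printed, but the conversion itself still produces output.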
I have the latest pdfinfo and pdftotext (0.20.4-0ubuntu1.2) available in the Ubuntu Quantal repository.
Any help would be appreciated.
Rob
EDIT: This bug seems to have been resolved on Zotfile 3.1 updated 04/feb/2014 on Zotero 4.0.16. Hurray!
I assume Zotero 4.0.13 is a typo; that hasn't been released yet.
I'll try on the Zotfile forum and will report back if there is a solution.
I think the problem is with pdf.js, and perhaps due to my Firefox Java plugin. For example, I can't view annotated PDFs in Firefox's pdf.js at all (non-annotated ones are fine). Furthermore, I have a warning message on my Firefox Add-ons/Plugins page regarding (one of) my Java plug-ins: I think it's deactivated because it's out of date (despite being the newest available for my platform).
So, this isn't a Zotfile or Zotero problem. Just thought I'd share my findings. Given Java's headache, I may look into alternative ways that I can use pdftotext and other poppler-tools (as used in Mac's Zotfile) to extract annotations.
It may be true that some annotations might not be "extractable", but in order to figure that out, we would need to know how those annotations were produced. What software did you use? Do you have a sample PDF that you could share?
Thank you for the clarification regarding pdf.js and javascript.
The annotations were made in Acrobat Reader on Android and synced through Dropbox back to a Linux machine. They also don't extract on a Windows 7 machine, so maybe it is indeed the note-taking program. Any recommended alternatives to Acrobat for highlighting/annotating on Android? I'll try to find one and try again.
Yes, I can provide a sample annotated PDF. See the link below. It was annotated in the same way as described above, and its annotations were previously extracted with no major problems, but now I'm unable to extract them from it or from any other annotated PDF.
Thanks a million for the assistance. Link below
https://www.dropbox.com/s/61v0t7hm84vefs9/Zou_2006_The adaptive lasso and its oracle properties.pdf?m
It's interesting that you say these used to extract fine, because Zotfile ships its own version of pdf.js which hasn't changed for half a year.
This behaviour started around mid-August 2013. Previously, this PDF had its annotations extracted just fine. Maybe that is approximately when pdf.js was last updated... anyway, it looks like I'll have to dig around with newer/older versions of pdf.js to see what the problem is. I'm not sure how to start: I can't go to the Mozilla pdf.js bug page because Zotfile has its own custom pdf.js. The Zotfile developer seems swamped and not too interested in Linux or Windows.
The error (with a sample PDF) should be reported on the Zotfile thread so that the Zotfile developer can figure out how to proceed.
Thanks again, it is interesting to know that at least the PDF pages, but not the annotations, are displaying in the development version of PDF.js. I'll try posting on the Zotfile thread...
So submitting a ticket for pdf.js worked great in this case. Thanks aurimas!
I have had problems with extracting annotations from clearscan pdfs
clearscan OCR gives me:
"I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h ' e h &?U.h B ^ [ h [;. SU?MRU+@&Hh P&[^U.h R2h [;.h 7R+-?M&7.h M&eh ' . h U.SU.Y.P[.+h APh R[:.U" ( :1)
searchable image OCR gives me:
"As it brought new life to his intellectual interests it also awakened a new feeling for life itself through the mediation of the woman." ( :1)
just discovered this recently after I had converted dozens of scanned books from searchable image to clearscan (much clearer and 20-60% smaller) and began highlighting in pdfexpert 5.0 on ipad mini
as much as I understand it, the clearscan removes the searchable background image and replaces it with vector fonts so maybe they're not extractable, but www.sumnotes.net does it fine
any suggestions?
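One way to at least detect this kind of garbled output automatically is a rough heuristic: measure what fraction of the non-space characters are letters. This is only a sketch, not anything Zotfile does. The function names and the 0.8 threshold are assumptions I picked for illustration, and text with many digits or symbols (tables, formulas) would trip it too.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters (0.0 for empty text)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def looks_garbled(text, threshold=0.8):
    """Flag text whose letter ratio falls below an (arbitrary) threshold."""
    return letter_ratio(text) < threshold


# The two samples quoted in the post above (the first one shortened)
clearscan_sample = "I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h"
searchable_sample = ("As it brought new life to his intellectual interests it also "
                     "awakened a new feeling for life itself through the mediation "
                     "of the woman.")

if __name__ == "__main__":
    for name, sample in [("clearscan", clearscan_sample),
                         ("searchable image", searchable_sample)]:
        print(name, round(letter_ratio(sample), 2), looks_garbled(sample))
```

Roughly half of the non-space characters in the ClearScan sample are letters, while the searchable-image sample is almost entirely letters, so the two fall clearly on opposite sides of the threshold.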
ps: The other fix, which started this discussion, was very easy and I just implemented it in zotfile by looking at the corresponding commit for pdf.js.