'Extract PDF annotations' message hangs on Linux/Ubuntu

yourwelcome · September 12, 2013

Hello,
Thanks for Zotero!
I'm trying to extract the annotations from PDFs on a 64-bit GNU/Linux machine (Ubuntu 12.10) with the Zotero 4.0.12 plugin for Firefox. The extraction previously worked as expected, after I followed the instructions to install pdftotext and pdfinfo and created symbolic links in the Zotero data directory (called pdftotext-Linux-x86_64 and pdfinfo-Linux-x86_64) to their executables in /usr/bin.
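For readers who want to reproduce the symlink setup described above, here is a minimal Python sketch. The link names and /usr/bin come from the post; the function name `link_pdf_tools` and the data-directory path in the usage comment are assumptions for illustration, and your Zotero data directory may be elsewhere.

```python
from pathlib import Path


def link_pdf_tools(data_dir, bin_dir="/usr/bin", suffix="Linux-x86_64"):
    """Create <tool>-<suffix> symlinks in data_dir pointing at bin_dir/<tool>.

    Hypothetical helper: only the link names and /usr/bin are taken from
    the post above; everything else is illustrative.
    """
    data_dir = Path(data_dir)
    created = []
    for tool in ("pdftotext", "pdfinfo"):
        target = Path(bin_dir) / tool
        link = data_dir / f"{tool}-{suffix}"
        # Replace a stale symlink, but never delete a regular file:
        # symlink_to() raises FileExistsError if a real file is in the way.
        if link.is_symlink():
            link.unlink()
        link.symlink_to(target)
        created.append(link)
    return created


# Example call (the path is an assumption, adjust to your own setup):
# link_pdf_tools(Path.home() / "zotero-data")
```

Only existing symlinks are replaced, so running it twice is harmless and a real file with the same name is left untouched.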

Recently, however, when I try to extract annotations zotero just hangs with the pop-up message 'extracting annotations'+progress bar. I can cancel the pop-up with ESC, and firefox returns to normal, but the annotations are not being extracted, no matter how long I wait.

Pdfinfo and pdftotext seem to be working: I can call them outside of Zotero, and they certainly do convert PDFs to text, as their names imply (though I can't find my annotations in their output). However, pdftotext ALWAYS seems to generate the following error, irrespective of the PDF file being processed: "Syntax Error (392865): Bad block header in flate stream". The error appears even when an output text file is successfully generated.
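One way to check a tool like this outside of Zotero is to run it and look at the exit status, the standard output and the error stream separately, since a message on the error stream does not by itself mean the conversion failed. The sketch below is only an illustration; `run_and_collect` is a hypothetical helper name, and the pdftotext call in the comment assumes the tool is on your PATH.

```python
import subprocess


def run_and_collect(cmd):
    """Run cmd and return (succeeded, stdout_text, stderr_lines).

    Messages on stderr are collected rather than treated as failure,
    so warnings can be inspected alongside a successful conversion.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    messages = [line for line in proc.stderr.splitlines() if line.strip()]
    return proc.returncode == 0, proc.stdout, messages


# Example call (assumes pdftotext is installed; "-" sends the text to stdout):
# ok, text, messages = run_and_collect(["pdftotext", "sample.pdf", "-"])
```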

I have the latest pdfinfo and pdftotext (0.20.4-0ubuntu1.2) available in the Ubuntu Quantal repository.

Any help would be appreciated.
Rob

EDIT: This bug seems to have been resolved on Zotfile 3.1 updated 04/feb/2014 on Zotero 4.0.16. Hurray!

adamsmith · September 12, 2013

Extracting annotations is not a Zotero feature but a Zotfile feature. You should post in the Zotfile thread, though I don't believe the developer is terribly motivated to troubleshoot individual issues with annotations. You can try to see whether the program you use for annotating makes a difference. Annotation extraction doesn't rely on pdftotext, but on pdf.js.

I assume Zotero 4.0.13 is a typo, that's not yet been released.

yourwelcome · September 12, 2013

Sorry, all-in-all that was a pretty ill-conceived question: pdftotext is not used for annotation extraction and this is not the appropriate forum.
I'll try on the Zotfile forum and will report back if there is a solution.

adamsmith · September 12, 2013

(The Zotfile thread is here on the forum: https://forums.zotero.org/discussion/5301/15/zotfile-zotero-plugin-to-rename-move-and-attach-pdfs-send-them-to-ipad-extract-pdf-annotations/ )

yourwelcome · November 2, 2013

Hello again,
I think the problem is with pdf.js, and perhaps due to my Firefox Java plugin. For example, I can't view annotated PDFs in Firefox's pdf.js at all (non-annotated ones are fine). Furthermore, I have a warning message on my Firefox Add-ons/Plugins page regarding (one of) my Java plug-ins: I think it's deactivated because it's out of date (despite being the newest available for my platform).

So, this isn't a Zotfile or Zotero problem. Just thought I'd share my findings. Given Java's headache, I may look into alternative ways that I can use pdftotext and other poppler-tools (as used in Mac's Zotfile) to extract annotations.

aurimas · November 3, 2013

FYI JavaScript (the programming language used to create Zotero, Zotfile, pdf.js, etc.) is not at all related to Java. JavaScript does not require a plugin/add-on for your browser; it runs natively. Java does need a plug-in, but as I said, it's unrelated.

It may be true that some annotations might not be "extractable", but in order to figure that out, we would need to know how those annotations were produced. What software did you use? Do you have a sample PDF that you could share?

yourwelcome · November 3, 2013

Hi Aurimas,
Thank you for the clarification regarding pdf.js and javascript.
The annotations were made in Acrobat Reader on Android, then synced through Dropbox back to a Linux machine. Extraction also fails on a Windows 7 machine, so maybe it is indeed the note-taking program. Any recommended alternatives to Acrobat for highlighting/annotating on Android? I'll try to find one and try again.

Yes, I can provide a sample annotated PDF. See the link below. This was annotated in the same way as described above, and the annotations were previously extracted with no major problems, but now I'm unable to do so with it or any other annotated PDF.

Thanks a million for the assistance. Link below:

https://www.dropbox.com/s/61v0t7hm84vefs9/Zou_2006_The adaptive lasso and its oracle properties.pdf?m

adamsmith · November 3, 2013

That doesn't open properly in pdf.js for me, either (i.e. the second page - the one with the annotations - doesn't load at all.), so that would seem to be a general problem.
It's interesting that you say these used to extract fine, because Zotfile ships its own version of pdf.js which hasn't changed for half a year.

yourwelcome · November 3, 2013

Yes, I also cannot view the annotated page in pdf.js. I think this is the key.
This behaviour started around mid-August 2013. Previously, this PDF had its annotations extracted just fine. Maybe that is approximately the time that pdf.js was last updated... Anyway, it looks like I'll have to dig around with newer/older versions of pdf.js to see what the problem is. I'm not sure how to start: I can't go to the Mozilla pdf.js bug page because Zotfile has its own custom pdf.js. The Zotfile developer seems swamped and not too interested in Linux or Windows.

aurimas · November 3, 2013

Looks like the latest development version of PDF.js is able to display the second page, though it does not display the highlighting. I'm not sure if that means that the extraction will not work.

The error (with a sample PDF) should be reported on the Zotfile thread so that the Zotfile developer can figure out how to proceed.

yourwelcome · November 3, 2013

@aurimas
Thanks again, it is interesting to know that at least the PDF pages, though not the annotations, are displaying in the development version of PDF.js. I'll try posting on the Zotfile thread...

aurimas · November 3, 2013

Submitted a ticket for pdf.js https://github.com/mozilla/pdf.js/issues/3885 with a minimal test case.

Joscha · January 8, 2014

The problem was recently fixed in pdf.js and I have applied the same fix to the development version of Zotfile. With this fix, the annotations are extracted correctly!
So submitting a ticket for pdf.js worked great in this case. Thanks aurimas!

mjtowen · January 9, 2014

I am reposting on this thread from here https://forums.zotero.org/discussion/33061/pdf-text-screwed-after-sync-to-webdav-server/#Item_5, apologies if this is the wrong place

I have had problems with extracting annotations from clearscan pdfs

clearscan OCR gives me:
"I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h ' e h &?U.h B ^ [ h [;. SU?MRU+@&Hh P&[^U.h R2h [;.h 7R+-?M&7.h M&eh ' . h U.SU.Y.P[.+h APh R[:.U" ( :1)

searchable image OCR gives me:
"As it brought new life to his intellectual interests it also awakened a new feeling for life itself through the mediation of the woman." ( :1)

just discovered this recently after I had converted dozens of scanned books from searchable image to clearscan (much clearer and 20-60% smaller) and began highlighting in pdfexpert 5.0 on ipad mini

as much as I understand it, the clearscan removes the searchable background image and replaces it with vector fonts so maybe they're not extractable, but www.sumnotes.net does it fine

any suggestions?

adamsmith · January 9, 2014

if I understand you correctly, you're in the wrong thread. This thread is about ZotFile's "Extract Annotation" feature, not about Zotero's full text indexing of pdf files. If I'm right about this and you're talking about indexing in general, please continue on the original thread.

mjtowen · January 9, 2014

I think I am in the right thread, I'm sending a pdf to my tablet with zotfile, annotating, then returning pdf to zotero standalone and extracting annotations with zotfile, I'm not posting about indexing in zotero

adamsmith · January 9, 2014

ok, right thread then. If you open the annotated PDF in Firefox, i.e. with pdf.js: do you see the annotations? Can you select and copy text?

mjtowen · January 9, 2014

I opened an annotated clearscan pdf file in firefox with pdf.js, I can see the highlighted text and I can select and copy text, but when I extract the annotations I still get the same garbled extracts

aurimas · January 9, 2014

Can you post a sample PDF somewhere?

mjtowen · January 9, 2014

https://www.dropbox.com/s/favrx6gasblbl0a/Pages%20from%2028347.pdf

aurimas · January 10, 2014

@Joscha, is the annotation extraction code for pdf.js written specifically for Zotfile (e.g. updateMarkup in canvas.js)? I can't find any reference to it anywhere else. I think the issue here is that the text in the PDF is represented with wide characters (16 bit) and it's not handled correctly. I've got it sort of working, but I'd rather report this upstream so it can be fixed properly.
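To illustrate the kind of failure suspected here, the following sketch shows what happens when text stored as 16-bit units is read one byte at a time. It assumes, purely for illustration, that the 16-bit codes are plain UTF-16BE; in a real PDF the codes are typically font-specific and mapped through the font's encoding, which this sketch does not model. The function names are hypothetical and are not part of pdf.js or Zotfile.

```python
def decode_single_byte(raw: bytes) -> str:
    """Treat every byte as one character (the suspected incorrect handling)."""
    return raw.decode("latin-1")


def decode_wide(raw: bytes) -> str:
    """Treat every two bytes as one 16-bit character (big-endian)."""
    return raw.decode("utf-16-be")


# Illustrative input: a short string stored as 16-bit units.
raw = "As it".encode("utf-16-be")

garbled = decode_single_byte(raw)  # each letter is preceded by a NUL character
correct = decode_wide(raw)         # the original text
```

With plain ASCII text the single-byte reading only interleaves NUL characters; with font-specific codes the result can look far more scrambled, as in the garbled extract quoted earlier in the thread.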

mjtowen · January 10, 2014

Apologies if I have misled but have just seen the thread title, I am using win 8.1 on laptop, to iPad mini tablet, not Linux if that makes a diff?

Joscha · January 10, 2014

aurimas, pdf.js doesn't support the extraction of annotations itself, so the version included in Zotfile is modified. updateMarkup is one of the functions I added. Others are charInAnnot and makeCharDims (I think they are the important ones). The other important thing is that the version in Zotfile is based on a pdf.js version from roughly a year ago. pdf.js development is pretty rapid, and automatically merging in the new changes often does not work. It might be the case that pdf.js started supporting this example PDF sometime between a year ago and now.

I ported the Zotfile modifications to a more recent pdf.js version maybe 2 months ago. The result is here but still has problems with spaces and some other issues. Looking at the commits in my fork of pdf.js at https://github.com/jlegewie/pdf.js might give you a good idea of the changes. Anyway, if you can fix it in Zotfile directly, go ahead.

In general, the annotation support should be updated to the most recent pdf.js version and really needs a proper testing mechanism so that new versions can easily be compared to old versions for a number of test PDFs. Otherwise, modifications can break other things. I won't have time to do this myself, and it would really be something I could use help with!

ps: The other fix, which started this discussion, was very easy and I just implemented it in zotfile by looking at the corresponding commit for pdf.js.
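The testing mechanism mentioned in this post (comparing a new version's output against an old version's output for a set of test PDFs) could look roughly like the sketch below. It is written in Python purely for illustration and is not Zotfile code: the extraction step itself is left out, `compare_extractions` is a hypothetical name, and the two dictionaries are assumed to map a test PDF's file name to the list of annotation strings each version extracted from it.

```python
def compare_extractions(baseline, candidate):
    """Return {pdf_name: (expected, actual)} for every PDF whose output changed.

    baseline  -- annotations extracted by the known-good (old) version
    candidate -- annotations extracted by the version under test
    A PDF that is missing from candidate is reported with actual=None.
    """
    regressions = {}
    for name, expected in baseline.items():
        actual = candidate.get(name)
        if actual != expected:
            regressions[name] = (expected, actual)
    return regressions


# Illustrative data only; the file names and strings are made up.
old = {"a.pdf": ["first highlight"], "b.pdf": ["a note", "a highlight"]}
new = {"a.pdf": ["first highlight"], "b.pdf": ["a note", "ahighlight"]}
changed = compare_extractions(old, new)  # reports only "b.pdf"
```

Storing the baseline output next to the test PDFs would let each port to a newer pdf.js be checked in one pass, including issues like the missing spaces mentioned above.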

adamsmith · January 10, 2014

(@mjtowen - good to mention that, but in this case there's no difference. The problem is the same on Linux.)

yourwelcome · February 6, 2014

Problem solved! I can now get both highlights and comments, using Zotfile 3.1 updated 04/feb/2014 on Zotero 4.0.16 (both standalone and Firefox add-on).