Zotfile: Selectively extract types of annotations? pdf.js?

jhanks · April 23, 2015

I would like to have Zotfile extract only notes I have taken, but not highlights. I don't see any preference for what kinds of annotations get extracted, but might it be possible to modify the pdf.js to make it blind to the types of stuff I don't want extracted?
For my purposes, making this a permanent change would not be a problem. However, I don't even know where pdf.js is located, so I can't begin to look through it for anything obvious that I (a non-programmer) can monkey with.
Any suggestions appreciated.

adamsmith · April 23, 2015

FWIW, you definitely don't want to change this in pdf.js -- you just want to tell zotfile to not add highlights to notes.
That would be in extract.js, which is significantly less complex:

https://github.com/jlegewie/zotfile/blob/master/chrome/content/zotfile/pdfextract/extract.js

But you'd still need to modify it and then rebuild ZotFile (there are instructions on how to build ZotFile from source on that repository). This isn't super challenging as coding goes, but if you've never done anything like it, I'd imagine it's rather daunting.

jhanks · April 24, 2015

OK well, I would like to build some skills, so maybe this is a good project for me. If I get anywhere, I'll post in this thread about it. Thanks for the advice/leads!

adamsmith · April 24, 2015

It's not a bad project for that. I'd recommend running ZotFile straight from the git source. The process is described for Zotero at https://www.zotero.org/support/dev/source_code and is exactly the same for ZotFile, except for the extension ID zotero@chnm.gmu.edu, which you need to adjust accordingly.

jhanks · April 27, 2015

OK, I think I've figured it out. I haven't used my modified .xpi file for actual work yet, but an initial test showed it to have the desired effect.
Here is what I did:
1. Download the zip for Zotfile from GitHub
https://github.com/jlegewie/zotfile
2. Decompress and navigate to ...../zotfile-master/chrome/content/zotfile/pdfextract/pdfjs/src/
3. Open getPDFAnnotations.js
4. Go to line 23, which says: var SUPPORTED_ANNOTS = ['Text','Highlight','Underline'],
5. Delete the 'Highlight' entry. I did this because it works best for me: my text notes will still be extracted, and any actual passages from the PDF that I want in the annotations I can pull out by using underline instead of highlight.
6. Build the .xpi file using the instructions on the GitHub page. Since I am on Linux, I used the Makefile at the top of the Zotfile directory.
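
For readers following along, the edit in steps 4 and 5 can be sketched roughly as below. Only the SUPPORTED_ANNOTS line comes from this thread. The filterSupported helper and the sample annotation objects are hypothetical, added to illustrate how a whitelist like this typically behaves. This is not a copy of how getPDFAnnotations.js actually uses the variable.

```javascript
// Line 23 of getPDFAnnotations.js as quoted in step 4:
//   var SUPPORTED_ANNOTS = ['Text','Highlight','Underline'],
// After step 5, with the 'Highlight' entry removed:
var SUPPORTED_ANNOTS = ['Text', 'Underline'];

// Hypothetical helper (not from ZotFile): keep only annotations whose
// type appears in the whitelist.
function filterSupported(annotations, supported) {
  return annotations.filter(function (annot) {
    return supported.indexOf(annot.type) !== -1;
  });
}

// Hypothetical sample data standing in for annotations read from a PDF.
var sample = [
  { type: 'Text', content: 'My own note' },
  { type: 'Highlight', content: 'A highlighted passage' },
  { type: 'Underline', content: 'An underlined passage' }
];

var kept = filterSupported(sample, SUPPORTED_ANNOTS);
console.log(kept.map(function (a) { return a.type; }));
```

With the shortened list, the highlight in the sample data is dropped while the text note and the underlined passage are kept, which matches the behavior described in step 5.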

I guess the next step would be to try and figure out how to make an option in the Zotfile preferences to control what is included in the variable SUPPORTED_ANNOTS.
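
A minimal sketch of what that next step might look like, assuming the preference were stored as a comma-separated string such as 'Text, Underline'. The string format and the function name parseSupportedAnnots are assumptions made for illustration. Neither exists in ZotFile, and wiring this into the real preferences pane would take additional work not shown here.

```javascript
// The three annotation types listed in the original SUPPORTED_ANNOTS line.
var KNOWN_ANNOTS = ['Text', 'Highlight', 'Underline'];

// Hypothetical: turn a comma-separated preference value into a whitelist.
// Unknown names are ignored; an empty or invalid value falls back to
// extracting every known type.
function parseSupportedAnnots(prefValue) {
  if (typeof prefValue !== 'string' || prefValue.trim() === '') {
    return KNOWN_ANNOTS.slice();
  }
  var requested = prefValue.split(',').map(function (s) {
    return s.trim();
  });
  var valid = KNOWN_ANNOTS.filter(function (type) {
    return requested.indexOf(type) !== -1;
  });
  return valid.length > 0 ? valid : KNOWN_ANNOTS.slice();
}

// Example: a user who wants notes and underlines but no highlights.
var SUPPORTED_ANNOTS = parseSupportedAnnots('Text, Underline');
console.log(SUPPORTED_ANNOTS);
```

Falling back to the full list on bad input is a design choice in this sketch: a mistyped preference then extracts too much rather than silently extracting nothing.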

Thanks for the help!

chrisoutwright · April 8, 2020

Great instructions. I wonder if one could fix the way LiquidText annotations are handled. First question: is the extraction process separate from the file handling itself? That is, does Zotfile change the PDF in any way when I call the "get from tablet" function to update my local library?

So far it seems that commented highlights (annotations/notes attached to the highlighted text) are separated when syncing the file back to the Zotero library.

I wonder if one could put those together again. Is the extraction process involved in any of this? If so, is there documentation on how to use the manual (PDF Reference Manual 1.7) so that getPDFAnnotations can be modified to that end and the xpi file rebuilt? Or is getPDFAnnotations just for extracting the annotations?

So far the manual suggests that the original markup annotations (from LiquidText) were either mishandled by the extractor (markup transformed into text annotations) or are not supported. Any ideas on how to preserve these markup annotations when going from LiquidText to PDF readers on a PC?