parsing external file with Zotero Import Script

Hi,

I'm still fine-tuning my PDF solution. Is there a way to parse an FDF file (which is basically a text file) with an import script in Zotero?
I've had a look at some of the import scripts with Scaffold, but I couldn't find one that suited my needs. Is it theoretically possible to show a file-selector window ("Select the FDF file you want to parse") and then open and parse the selected file with some sort of file-system object?
Just enough to get me started...

SeeYa Matthias
  • Hi,
    I've found the solution myself. Adapted RDF-import. Will be back for other questions...

    *returns to hacking*

    Matthias
  • edited May 9, 2007
    Ok, here's an evening's work.

    I now read the comments and the Highlights directly from the PDF! However, for the Highlights to be extracted, one still has to use Acrobat's option to automatically copy highlighted content into a hidden note (and this is probably not going to change, since everything else would require me to basically rewrite pdf2txt in javascript).

    Still, it imports every note and highlight from a PDF into a new Zotero note. While this is unsatisfactory, it is at least a proof of concept. Ideally I would like to create a summary from the data, which is then imported as HTML alongside the PDF. However, I could not find out how to create a new HTML file from within Zotero (I'm sure it would not be a problem) and then import it automatically. The notes are just a very limited way of outputting that information.

    But I need some help (and sleep).

    In Scaffold create:
    Label: "PDF: Comments and Highlights - Import"
    Creator: "Matthias Heim"
    Target: "PDF"

    Detect Code:
    (I have no idea how this works - but it does. It should test whether the file opens up with %pdf, but I have an inkling that it doesn't perform that test)
    Zotero.configure("dataMode", "pdf");

    function detectImport() {
    var text=Zotero.read(4);
    if (text.search(/pdf/i)!=-1) {
    return true;
    }
    return false; // explicit result when the first four characters do not contain "pdf"
    }
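
    For anyone following along, here is a minimal sketch of a stricter header test, assuming the file really starts with the usual "%PDF-x.y" marker. `looksLikePdf` is a hypothetical helper, not part of the Zotero translator API; inside a translator it would presumably be fed the result of a longer read such as `Zotero.read(8)`.

```javascript
// Hypothetical helper (not a Zotero function): returns true only when the
// text begins with a PDF header such as "%PDF-1.4".
function looksLikePdf(header) {
  // Anchored at the start, so "pdf" appearing elsewhere does not count.
  return /^%PDF-\d\.\d/.test(header);
}
```

    Anchoring the pattern avoids accepting any file that merely contains the letters "pdf" in its first few characters.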


    Code:

    function doImport() {
    var text;
    var holdOver = ""; // part of the text held over from the last loop
    var cleanpdf = ""; // pdf, cleaned of all binary-streams (can be handled even if file big)
    var streamopen = false;
    Zotero.debug("PDF-Handler starts");

    Zotero.setCharacterSet("utf-8");

    // read cleanpdf (discard all binary streams), the resulting pdf (in string) should be small even for big files.
    // streams are enclosed inbetween "stream" and "endstream"
    while(text = Zotero.read(4096)) {
    holdOver+=text;
    if ((streamopen==true) && (holdOver.length>4105)) { //discard leftover trash from the previous round, keep enough for overlap
    holdOver=holdOver.slice(holdOver.length-4097);
    }
    var searchposition=holdOver.search(((streamopen==true)?"endstream":"stream"));
    while (searchposition!=-1) {
    if (streamopen==true) {
    holdOver=holdOver.slice(searchposition+9); // skip all 9 characters of "endstream"
    streamopen=false;
    } else {
    cleanpdf+=holdOver.slice(0,searchposition);
    streamopen=true;
    holdOver=holdOver.slice(searchposition+6); // skip all 6 characters of "stream"
    }
    searchposition=holdOver.search((streamopen?"endstream":"stream"));
    }
    }
    cleanpdf+=holdOver;

    var numberofcontents=0; // for debugging purposes, count the number of "Contents()"
    var dummy=cleanpdf.search(/Contents\s*\(/i);
    while (dummy!=-1) {
    numberofcontents++;
    dummy=cleanpdf.indexOf("(",dummy)+1;
    var newtext=cleanpdf.substr(dummy,cleanpdf.slice(dummy).search(/\)\//));
    if (newtext!="") { // create a new note for every comment and highlight.
    var newItem = new Zotero.Item();
    newItem.itemType = 'note';
    newItem.note = newtext.replace(/\\r/g,"").replace(/\\\(/g,"").replace(/\\\)/g,""); //clean up
    newItem.complete();
    }
    cleanpdf=cleanpdf.slice(dummy+6);
    dummy=cleanpdf.search(/Contents\s*\(/i);
    }
    Zotero.debug("I have found "+numberofcontents.toString()+" Contents in PDF.");

    Zotero.debug("PDF-Handler ends");
    }
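
    A minimal sketch of the stream-stripping idea on its own, assuming the whole file is already available as one string (the script above works on 4096-character chunks instead). `stripStreams` is a hypothetical helper name, not a Zotero function, and the sketch assumes the two keywords only ever occur as stream delimiters.

```javascript
// Hypothetical helper: removes everything from each "stream" keyword up to
// and including the matching "endstream" keyword.
function stripStreams(pdfText) {
  var OPEN = "stream";
  var CLOSE = "endstream";
  var clean = "";
  var pos = 0;
  while (true) {
    var start = pdfText.indexOf(OPEN, pos);
    if (start === -1) {
      clean += pdfText.slice(pos); // no further streams: keep the rest
      break;
    }
    clean += pdfText.slice(pos, start); // keep the text before the stream
    var end = pdfText.indexOf(CLOSE, start + OPEN.length);
    if (end === -1) break; // unterminated stream: drop the remainder
    pos = end + CLOSE.length; // resume right after "endstream"
  }
  return clean;
}
```

    Using the keyword lengths instead of hard-coded offsets keeps the delimiters themselves out of the cleaned text.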
  • Although I seem to be having a conversation with myself, I have vastly improved the code above (which was one big mess).

    The code now really parses the PDF file. It recreates the PDF objects and the page tree and looks for annotations (instead of just crudely throwing everything between "Contents()" into a note). It no longer creates a new note for every annotation, but instead creates a new document with notes. Every note is accompanied by its page number (by the way, there really should be a way of storing those with a note) and a tag telling whether it is a highlight or an annotation.

    I would appreciate if somebody actually tried this - I can make a screencast if necessary.
    Also, I would appreciate it if somebody could take a look at the "Store()"-function. While the parsing of the file is as efficient as I could make it, the actual storing of the annotations seems to create a huge SQL-overload - some bigger files (with several hundreds of annotations) take up to 10 min. to store. (compared to just 20 sec. of parsing the file and extracting the annotations).

    Anyway, here's the code, make sure you get everything, I had to distribute it over two posts:

    var objlist=new Object(); // stores the PDF-objects, stripped of all streams
    var pagelist=new Array(); // stores the pageobject of each page

    var pagenumberdifference=0; // offset between the real page number and the PDF page index; has to be global because getPages() is recursive
    var pagenumbertype=0; // 1=(normal) 2=[roman] 0=consistent with pdf

    // returns failsafe either "" or the content of an object in the objectlist;
    function getObj(objcode) {
    var objcontent=objlist[objcode.replace(/^\s+|\s+$/g,"")]; // trim all surrounding whitespace, not just one character
    if (objcontent!=undefined) {
    return objcontent;
    } else {
    return "";
    }
    }

    // creates a new Annotation Object
    function Annotation(objcode) {
    this.objreference=objcode;
    this.contents="";
    this.type="";

    var objcontent=getObj(objcode);
    if (objcontent=="") return; // Object does not exist

    // extract contents
    var pos=objcontent.search(/\/Type\s*\/Annot\s*/i);
    if (pos!=-1) { // Annotation found
    // extract type
    pos=objcontent.search(/\/Subtype\s*\/Text\s*/i);
    if (pos!=-1) this.type="Note";
    pos=objcontent.search(/\/Subtype\s*\/Highlight\s*/i);
    if (pos!=-1) this.type="Highlight";
    pos=objcontent.search(/\/Subtype\s*\/Freetext\s*/i);
    if (pos!=-1) this.type="Text";

    pos=objcontent.search(/\/Contents\s*/i);
    if (pos!=-1) { // simply hidden in Contents()-Tag
    objcontent=objcontent.slice(objcontent.indexOf("(",pos)+1);
    //objcontent=objcontent.slice(0,objcontent.search(/\)\s*\//i)); // weaker version, might be incompatible
    objcontent=objcontent.slice(0,objcontent.search(/[^\\]\)/i)+1);

    objcontent=objcontent.replace(/\\r/ig," "); // cleaning up linebreaks and other escaped sequences
    objcontent=objcontent.replace(/\\\)/ig,")");
    objcontent=objcontent.replace(/\\\(/ig,"(");
    objcontent=objcontent.replace(/\\\s/ig,"");

    // change type to pagerenumber and trim to number if Pagerenumber-Popup
    pos=objcontent.search(/PageRenumber/i);
    if (pos!=-1) {
    objcontent=objcontent.replace(/[^\d\-\[\]\(\)]/g,"");
    this.type="PageRenumber";
    }

    this.contents=objcontent;
    } else {
    // a more complex version would require parsing of streams - not implemented
    }

    }
    }

    // creates a new Page Object
    function Page(objcode) {
    this.objreference=objcode;
    this.realpagename=""; // the caller usually sets the page number; the constructor overrides it if it finds a PageRenumber note
    this.annots=new Array();

    // now check whether page has annotations
    var objcontent=getObj(objcode);
    var pos=objcontent.search(/\/Annots/i);
    if (pos!=-1) { // Annotations found
    if ((objcontent.indexOf("[",pos)==-1) || (objcontent.indexOf("R",pos)<objcontent.indexOf("[",pos))) { // probably child object used, but this is just guessing
    pos=objcontent.indexOf("R",pos)-4;
    pos=objcontent.lastIndexOf(" ",pos);
    objcontent=objcontent.slice(pos,objcontent.indexOf("R",pos)-1); // extract obj where childobject can be found
    objcontent=getObj(objcontent);
    pos=0;
    }
    objcontent=objcontent.slice(objcontent.indexOf("[",pos)+1,objcontent.indexOf("]",pos)-2); // discard anything except childtree
    objcontent=objcontent.replace(/\s*$/,"").replace(/R*$/,""); // discard final R
    var pagetree=objcontent.split(" R ");
    for (var dummy=0; dummy<pagetree.length; dummy++) {
    this.annots.push(new Annotation(pagetree[dummy]));
    if (this.annots[this.annots.length-1].contents=="") {
    this.annots.pop();
    } else {
    if (this.annots[this.annots.length-1].type=="PageRenumber") { // read real pagenumber from a popup note
    this.realpagename=this.annots[this.annots.length-1].contents;
    this.annots.pop();
    } //else /*DEBUG*/ Zotero.debug(this.annots[this.annots.length-1].contents);
    }
    }
    }
    }
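
    The /Contents extraction above relies on a regular expression to find the closing parenthesis. As a hedged alternative, here is a minimal sketch of a small scanner for PDF literal strings, which allow balanced nested parentheses, backslash escapes, octal codes and backslash-newline continuations. `readLiteralString` is a hypothetical helper and is not part of the script above; it does not handle hexadecimal strings or UTF-16 encoded text.

```javascript
// Hypothetical helper: reads the literal string that opens at text[openIndex]
// (which must be "(") and returns { value, end }, or null if it is malformed.
function readLiteralString(text, openIndex) {
  if (text.charAt(openIndex) !== "(") return null;
  var escapes = { n: "\n", r: "\r", t: "\t", b: "\b", f: "\f" };
  var out = "";
  var depth = 1; // nesting level of unescaped parentheses
  var i = openIndex + 1;
  while (i < text.length) {
    var ch = text.charAt(i);
    if (ch === "\\") {
      var next = text.charAt(i + 1);
      var octal = /^[0-7]{1,3}/.exec(text.slice(i + 1, i + 4));
      if (octal) {
        out += String.fromCharCode(parseInt(octal[0], 8) & 0xff);
        i += 1 + octal[0].length;
      } else if (next === "\r" && text.charAt(i + 2) === "\n") {
        i += 3; // backslash + CRLF: line continuation, emits nothing
      } else if (next === "\n" || next === "\r") {
        i += 2; // backslash + single EOL: line continuation
      } else {
        // Known escape letters map to control characters; anything else
        // (including "(", ")" and "\") stands for itself.
        out += escapes.hasOwnProperty(next) ? escapes[next] : next;
        i += 2;
      }
    } else if (ch === "(") {
      depth++; out += ch; i++;
    } else if (ch === ")") {
      depth--;
      if (depth === 0) return { value: out, end: i };
      out += ch; i++;
    } else {
      out += ch; i++;
    }
  }
  return null; // ran off the end without a closing parenthesis
}
```

    Calling it with the index of the "(" that follows /Contents returns the decoded text plus the position of the closing parenthesis, so the caller can continue scanning from there.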
  • code continued:

    // initializes the objlist object, reading all PDF-objects from the file, stripping them of streams
    function createObjlist () {
    var text;
    var cleanpdf="";
    var holdOver = ""; // part of the text held over from the last loop
    var streamopen = false;
    var newcontent=0;

    Zotero.setCharacterSet("utf-8");

    /*DEBUG*/ var chunkcounter=0;
    // read cleanpdf (discard all binary streams), the resulting pdf (in string) should be small even for big files.
    // streams are enclosed inbetween "stream" and "endstream"
    while (text = Zotero.read(4096)) {
    /*DEBUG*/ chunkcounter++;
    /*DEBUG*/ if (chunkcounter%100==0) Zotero.debug ("Reading chunk: "+chunkcounter.toString());
    holdOver+=text;
    if ((streamopen==true) && (holdOver.length>4105)) { //discard leftover trash from the previous round, keep enough for overlap
    holdOver=holdOver.slice(holdOver.length-4097);
    }

    // run through read text, discard everything that is in a stream
    var searchposition=holdOver.search(((streamopen==true)?"endstream":"stream"));
    while (searchposition!=-1) {
    if (streamopen==true) {
    holdOver=holdOver.slice(searchposition+9); // skip all 9 characters of "endstream"
    streamopen=false;
    } else {
    cleanpdf+=holdOver.slice(0,searchposition);
    newcontent=searchposition;
    streamopen=true;
    holdOver=holdOver.slice(searchposition+6); // skip all 6 characters of "stream"
    }
    searchposition=holdOver.search((streamopen?"endstream":"stream"));
    }

    // read objects from PDF
    if (newcontent>0) {
    var searchposition=cleanpdf.indexOf("endobj"); // search from the start: newcontent is an offset into holdOver, not into cleanpdf
    while (searchposition!=-1) {
    var cleanpdf1=cleanpdf.slice(0,searchposition);
    cleanpdf=cleanpdf.slice(searchposition+6);

    searchposition=cleanpdf1.lastIndexOf(" obj");
    var obj=cleanpdf1.slice(searchposition+4);
    cleanpdf1=cleanpdf1.slice(0,searchposition);
    searchposition=cleanpdf1.search(/\d+\s*\d+\s*$/ig);

    // store in objlist under name of PDF-object
    objlist[cleanpdf1.slice(searchposition)]=obj;

    cleanpdf=cleanpdf1+cleanpdf;
    searchposition=cleanpdf.indexOf("endobj", cleanpdf1.length);
    }
    newcontent=0;
    }
    }
    cleanpdf+=holdOver;

    // read PDF trailer from cleanpdf, store in objlist.root
    dummy=cleanpdf.search(/^trailer/im);
    if (dummy==-1) {
    Zotero.debug("PDF-Trailer could not be found, parsing will be impossible");
    } else {
    // Identify PDF Root
    var dummy=cleanpdf.indexOf("/Root",dummy);
    objlist.root=cleanpdf.slice(dummy+6,cleanpdf.indexOf(" R",dummy+6));
    if (objlist[objlist.root]==undefined) {
    Zotero.debug("Possible PDF-Structure Problem. Rootobject identified but not found.");
    return;
    }
    // Identify Pagetree-Root
    var dummy=objlist[objlist.root].indexOf("/Pages");
    objlist.pagetreeroot=objlist[objlist.root].slice(dummy+7,objlist[objlist.root].indexOf(" R",dummy+7));
    }
    }

    // recursively parses the pagetree (from objcode onwards) and adds each page to the variable pagelist[]
    function getPages(objcode) {
    var objcontent=getObj(objcode);
    if (objcontent=="") return; // Object does not exist
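    // a /Pages node may list other keys (e.g. /Count) before /Kids, so test the type and locate /Kids separately
    var pos=(objcontent.search(/\/Type\s*\/Pages(?![a-z])/i)!=-1)?objcontent.search(/\/Kids/i):-1;
    if (pos!=-1) { // Pagetree found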
    if ((objcontent.indexOf("[",pos)==-1) || (objcontent.indexOf("R",pos)<objcontent.indexOf("[",pos))) { // probably child object used, but this is just guessing
    pos=objcontent.indexOf("R",pos)-4;
    pos=objcontent.lastIndexOf(" ",pos);
    objcontent=objcontent.slice(pos,objcontent.indexOf("R",pos)-1); // extract obj where childobject can be found
    objcontent=getObj(objcontent);
    pos=0;
    }
    objcontent=objcontent.slice(objcontent.indexOf("[",pos)+1,objcontent.indexOf("]",pos)-2); // discard anything except childtree
    objcontent=objcontent.replace(/\s*$/,"").replace(/R*$/,""); // discard final R
    var pagetree=objcontent.split(" R ");
    for (var dummy=0; dummy<pagetree.length; dummy++) {
    getPages(pagetree[dummy]); // recursively descend into the pagetree
    }
    } else { // probably page
    var pos=objcontent.search(/\/Type\s*\/Page(?![a-z])/i); // the lookahead keeps "/Pages" nodes from being counted as pages
    if (pos!=-1) { // Page found, add to Array pages
    pagelist.push(new Page(objcode));
    if (pagelist[pagelist.length-1].realpagename=="") {
    pagelist[pagelist.length-1].realpagename=(pagelist.length+pagenumberdifference).toString();
    switch (pagenumbertype) {
    case 1:
    pagelist[pagelist.length-1].realpagename="("+pagelist[pagelist.length-1].realpagename+")";
    break;
    case 2:
    pagelist[pagelist.length-1].realpagename="["+pagelist[pagelist.length-1].realpagename+"]";
    break;
    }
    } else {
    pagenumberdifference=parseInt(pagelist[pagelist.length-1].realpagename.replace(/[^\d\-]/g,""),10)-pagelist.length;
    if (pagelist[pagelist.length-1].realpagename.search(/\(/)!=-1) {
    pagenumbertype=1;
    } else {
    if (pagelist[pagelist.length-1].realpagename.search(/\[/)!=-1) {
    pagenumbertype=2;
    } else {
    pagenumbertype=0;
    }
    }
    }
    /*DEBUG*/ if (pagelist.length%100==0) Zotero.debug("Parsing page: "+pagelist[pagelist.length-1].realpagename);
    }
    }
    }

    // reads annotations and highlights from pagelist and stores them in Zotero
    function Store() {
    var newItem = new Zotero.Item();
    newItem.itemType = 'document';
    newItem.title="PDF-Import"; // Change as soon as you know how
    for (var pagecounter=0; pagecounter<pagelist.length; pagecounter++) {
    var annotscounter=0;
    while (annotscounter<pagelist[pagecounter].annots.length) {
    var newNote={}; // plain object holding the note text and its tags
    if (pagelist[pagecounter].annots[annotscounter].type=="Highlight") {
    newNote.note='"'+pagelist[pagecounter].annots[annotscounter].contents+'" '+pagelist[pagecounter].realpagename;
    } else {
    newNote.note="p. "+pagelist[pagecounter].realpagename+" : "+pagelist[pagecounter].annots[annotscounter].contents;
    }
    newNote.tags = new Array();
    newNote.tags.push(pagelist[pagecounter].annots[annotscounter].type);
    newItem.notes.push(newNote);
    annotscounter++;
    }
    }
    newItem.complete();
    }

    function doImport() {
    Zotero.debug("PDF-Handler starts");

    createObjlist();
    Zotero.debug("PDF-Root-Object is: "+objlist.root+" R\nPageTree-Root-Object is: "+objlist.pagetreeroot+" R");
    getPages(objlist.pagetreeroot);

    Store();

    Zotero.debug("PDF-Handler finished");
    }
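
    For readers who want the overall idea without the Zotero plumbing, here is a minimal sketch of the two central steps, indexing objects and walking the page tree, assuming the PDF text has already been stripped of streams. `indexObjects`, `readRefArray` and `collectPages` are hypothetical names, not functions from the script above. The sketch also assumes a classic layout with inline /Kids arrays; files that keep their objects inside compressed object streams (PDF 1.5 and later) would not be readable this way.

```javascript
// Index every "N G obj ... endobj" body by its "N G" reference.
function indexObjects(cleanPdf) {
  var objects = {};
  var re = /(\d+)\s+(\d+)\s+obj\b([\s\S]*?)endobj/g;
  var m;
  while ((m = re.exec(cleanPdf)) !== null) {
    objects[m[1] + " " + m[2]] = m[3];
  }
  return objects;
}

// Return the references listed in an inline array such as "/Kids [3 0 R 4 0 R]".
function readRefArray(body, key) {
  var m = new RegExp("/" + key + "\\s*\\[([^\\]]*)\\]").exec(body);
  if (!m) return [];
  var refs = [];
  var re = /(\d+)\s+(\d+)\s+R/g;
  var r;
  while ((r = re.exec(m[1])) !== null) refs.push(r[1] + " " + r[2]);
  return refs;
}

// Depth-first walk: /Pages nodes are descended into, /Page leaves are
// collected in document order. "seen" guards against reference cycles.
function collectPages(objects, ref, pages, seen) {
  pages = pages || [];
  seen = seen || {};
  var body = objects[ref];
  if (body === undefined || seen[ref]) return pages;
  seen[ref] = true;
  if (/\/Type\s*\/Pages\b/.test(body)) {
    readRefArray(body, "Kids").forEach(function (kid) {
      collectPages(objects, kid, pages, seen);
    });
  } else if (/\/Type\s*\/Page\b/.test(body)) {
    pages.push(ref);
  }
  return pages;
}
```

    The position of a reference in the returned array plus one gives the PDF page index, which is the number the script above then adjusts with its PageRenumber notes.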
  • I just came across this and it sounds very interesting. I am, however, not sure how to get this code running? Can you help me?

    Thanks!
  • I also would love to try this code. How can I do it?
  • See http://www.zotero.org/support/dev/creating_translators_for_sites , but I'm not sure that the code works with the current release of Zotero.
  • Hi Matthias!

    I'm wondering whether there has been any progress with importing PDF highlights/comments easily into Zotero notes? Or does anyone know if there have been other developments in how to easily import PDF highlighting into Zotero?
  • I haven't tried this, but Joscha has been doing some work with highlighting/annotating in Zotfile:
    http://www.columbia.edu/~jpl2136/zotfile.html
  • Just as an update, since this thread pops up time and again:
    This will no longer work! Don't try it.

    While I haven't tried Zotfile, I don't think it does exactly what this script did, namely extract PDF annotations and highlights into Zotero notes. Correct me if I'm wrong. My script was an ugly hack; it worked for an old version of Zotero, but cluttered up the database.

    Though I would still wish to use automatic annotation and highlight extraction with Zotero, I don't think anything really practical will come along anytime soon.

    Nevertheless, allow me to leave this ugly hack here, if only as proof that Zotero/JavaScript is easily able to handle binary PDFs on its own to extract metadata.

    But don't try to get it running on any Zotero database you intend to use for actual work.

    Cheers,

    Matthias
  • mheim - while not Zotfile's main function, Joscha has indeed implemented exactly this functionality in the most recent version (I believe it's in beta). As I understand it, it only works on Mac so far, though. See Joscha's post from August 18 here:
    http://forums.zotero.org/discussion/18737/
  • The Zotfile implementation is not pure JavaScript-- it uses the Poppler PDF library to do the same basic thing, but I understand it's safe and reliable.