parsing external file with Zotero Import Script

mheim · May 8, 2007

Hi,

I'm still fine-tuning my PDF solution. Is there a way to parse an FDF file (which is basically a text file) with an import script in Zotero?
I've had a look at some of the import scripts with Scaffold, but I couldn't find one that suited my needs. Is it theoretically possible to show a file-selector window ("Select the FDF file you want to parse") and then open and parse the selected file with some sort of file system object?
Just get me started,...

SeeYa Matthias

mheim · May 8, 2007

Hi,
I've found the solution myself. Adapted RDF-import. Will be back for other questions...

*returns to hacking*

Matthias

mheim · May 8, 2007

Ok, here's an evening's work.

I now read the comments and the Highlights directly from the PDF! However, for the Highlights to be extracted, one still has to use Acrobat's option to automatically copy highlighted content into a hidden note (and this is probably not going to change, since everything else would require me to basically rewrite pdf2txt in javascript).

Still, it imports every note and highlight from a PDF into a new Zotero note. While this is unsatisfactory, it is at least a proof of concept. Ideally I would like to create a summary from the data, which is then imported as HTML alongside the PDF. However, I could not find out how to create a new HTML file from within Zotero (I'm sure it would not be a problem) and then import it automatically. The notes are just a very limited way of outputting that information.

But I need some help (and sleep).

In Scaffold create:
Label: "PDF: Comments and Highlights - Import"
Creator: "Matthias Heim"
Target: "PDF"

Detect Code:
(I have no idea how this works, but it does. It should test whether the file starts with %PDF, but I have an inkling that it doesn't perform that test.)

Zotero.configure("dataMode", "pdf");

function detectImport() {
	var text=Zotero.read(4);
	if (text.search(/pdf/i)!=-1) {
		return true;
	}
}
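
For anyone puzzling over the detect code: a PDF file begins with "%PDF-" followed by a version number, so the four characters read above are "%PDF", and the case-insensitive search for "pdf" matches them. A stricter check could look roughly like the sketch below. The helper name looksLikePdf is made up for illustration and is not part of the Zotero translator API; it assumes the caller has already read at least the first eight characters of the file.

```javascript
// Hypothetical helper (not a Zotero API): checks for the standard PDF header.
// Assumes `head` holds at least the first 8 characters of the file.
function looksLikePdf(head) {
	// The header has the form "%PDF-<major>.<minor>", e.g. "%PDF-1.4".
	return /^%PDF-\d\.\d/.test(head);
}

console.log(looksLikePdf("%PDF-1.4"));        // a real header
console.log(looksLikePdf("notes about pdf")); // merely mentions "pdf"
```

Anchoring the pattern at the start of the string is the main difference from the original test, which accepts "pdf" anywhere in the characters it read.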

Code:


function doImport() {
	var text;
	var holdOver = "";	// part of the text held over from the last loop
	var cleanpdf = ""; // pdf, cleaned of all binary-streams (can be handled even if file big)
	var streamopen = false;
	Zotero.debug("PDF-Handler starts");
	
	Zotero.setCharacterSet("utf-8");
	
	// read cleanpdf (discard all binary streams); the resulting pdf (as a string) should be small even for big files.
	// streams are enclosed between "stream" and "endstream"
	while(text = Zotero.read(4096)) {
		holdOver+=text;
		if ((streamopen==true) && (holdOver.length>4105)) { //discard leftover trash from the previous round, keep enough for overlap
			holdOver=holdOver.slice(holdOver.length-4097);
		}
		var searchposition=holdOver.search(((streamopen==true)?"endstream":"stream"));
		while (searchposition!=-1) {
			if (streamopen==true) {
				holdOver=holdOver.slice(searchposition+9); // "endstream" is 9 characters long
				streamopen=false;
			} else {
				cleanpdf+=holdOver.slice(0,searchposition);
				streamopen=true;
				holdOver=holdOver.slice(searchposition+6); // "stream" is 6 characters long
			}
			searchposition=holdOver.search((streamopen?"endstream":"stream"));
		}
	}
	cleanpdf+=holdOver;
	
	var numberofcontents=0; // for debugging purposes, count the number of "Contents()"
	var dummy=cleanpdf.search(/Contents\s*\(/i);
	while (dummy!=-1) {
		numberofcontents++;
		dummy=cleanpdf.indexOf("(",dummy)+1;
		var newtext=cleanpdf.substr(dummy,cleanpdf.slice(dummy).search(/\)\//));
		if (newtext!="") { // create a new note for every comment and highlight.
			var newItem = new Zotero.Item();
			newItem.itemType = 'note';
			newItem.note = newtext.replace(/\\r/g,"").replace(/\\\(/g,"").replace(/\\\)/g,""); //clean up
			newItem.complete();
		}
		cleanpdf=cleanpdf.slice(dummy+6);
		dummy=cleanpdf.search(/Contents\s*\(/i);
	}
	Zotero.debug("I have found "+numberofcontents.toString()+" Contents in PDF.");
	
	Zotero.debug("PDF-Handler ends");
}
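
The stream-stripping loop above is the core trick, so here is the same idea as a standalone sketch that can be run outside Zotero. It is not the code from the post: the function name stripStreams is invented, and it takes an array of strings in place of repeated Zotero.read() calls. It assumes the words "stream" and "endstream" only ever occur as stream delimiters, which a real PDF does not guarantee.

```javascript
// Hypothetical standalone version of the stream-stripping idea.
// `chunks` stands in for the successive results of Zotero.read(4096).
function stripStreams(chunks) {
	let clean = "";       // text outside of streams
	let buffer = "";      // unprocessed tail carried over between chunks
	let inStream = false; // true while between "stream" and "endstream"
	for (const chunk of chunks) {
		buffer += chunk;
		for (;;) {
			const marker = inStream ? "endstream" : "stream";
			const pos = buffer.indexOf(marker);
			if (pos === -1) break;
			if (!inStream) clean += buffer.slice(0, pos); // keep text before a stream
			buffer = buffer.slice(pos + marker.length);   // skip the whole marker
			inStream = !inStream;
		}
		// Keep only enough of the tail to recognise a marker split across chunks.
		const keep = (inStream ? "endstream" : "stream").length - 1;
		const cut = Math.max(0, buffer.length - keep);
		if (!inStream) clean += buffer.slice(0, cut);
		buffer = buffer.slice(cut);
	}
	return inStream ? clean : clean + buffer;
}

// The markers are deliberately split across chunk boundaries here.
console.log(stripStreams(["1 0 obj <</Length 4>> str", "eam\nXXXX\nends", "tream endobj"]));
```

Slicing by marker.length avoids hard-coding the two offsets, and the short carried-over tail keeps memory use bounded even when a single stream spans many chunks.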

mheim · May 15, 2007

Although I seem to be having a conversation with myself, I have vastly improved the code from above (which was one big mess).

The code now really parses the PDF file. It recreates the PDF objects and the page tree and looks for annotations (instead of just crudely throwing everything in between "Contents()" into a note). It does not create a new note for every annotation, but instead creates a new document with notes. Every note is accompanied by its page number (btw, there really should be a way of storing those with a note) and a tag telling whether it is a highlight or an annotation.

I would appreciate it if somebody actually tried this. I can make a screencast if necessary.
Also, I would appreciate it if somebody could take a look at the "Store()" function. While the parsing of the file is as efficient as I could make it, the actual storing of the annotations seems to create a huge SQL overhead: some bigger files (with several hundred annotations) take up to 10 min. to store, compared to just 20 sec. for parsing the file and extracting the annotations.

Anyway, here's the code. Make sure you get everything; I had to split it over two posts:


var objlist=new Object(); // stores the PDF-objects, stripped of all streams
var pagelist=new Array();  // stores the pageobject of each page

var pagenumberdifference=0; // indicates that the real pagenumber is different, has to be global because function is recursive
var pagenumbertype=0; // 1=(normal) 2=[roman] 0=consistent with pdf

// returns failsafe either "" or the content of an object in the objectlist;
function getObj(objcode) {
	var objcontent=objlist[objcode.replace(/^\s|\s$/g,"")];
	if (objcontent!=undefined) {
		return objcontent;
	} else {
		return "";
	}
}

// creates a new Annotation Object
function Annotation(objcode) {
	this.objreference=objcode;
	this.contents="";
	this.type="";
	
	var objcontent=getObj(objcode);
	if (objcontent=="") return; // Object does not exist
	
	// extract contents
	var pos=objcontent.search(/\/Type\s*\/Annot\s*/i);
	if (pos!=-1) { // Annotation found
		// extract type
		pos=objcontent.search(/\/Subtype\s*\/Text\s*/i);
		if (pos!=-1) this.type="Note";
		pos=objcontent.search(/\/Subtype\s*\/Highlight\s*/i);
		if (pos!=-1) this.type="Highlight";
		pos=objcontent.search(/\/Subtype\s*\/Freetext\s*/i);
		if (pos!=-1) this.type="Text";
		
		pos=objcontent.search(/\/Contents\s*/i);
		if (pos!=-1) { // simply hidden in Contents()-Tag
			objcontent=objcontent.slice(objcontent.indexOf("(",pos)+1);
			//objcontent=objcontent.slice(0,objcontent.search(/\)\s*\//i)); // weaker version, might be incompatible
			objcontent=objcontent.slice(0,objcontent.search(/[^\\]\)/i)+1);
			
			objcontent=objcontent.replace(/\\r/ig," ");   // cleaning up linebreaks and other escaped sequences
			objcontent=objcontent.replace(/\\\)/ig,")");
			objcontent=objcontent.replace(/\\\(/ig,"(");
			objcontent=objcontent.replace(/\\\s/ig,"");
			
			// change type to pagerenumber and trim to number if Pagerenumber-Popup
			pos=objcontent.search(/PageRenumber/i);
			if (pos!=-1) {
				objcontent=objcontent.replace(/[^\d\-\[\]\(\)]/g,"");
				this.type="PageRenumber";
			}
			
			this.contents=objcontent;
		} else {
			// a more complex version would require parsing of streams - not implemented
		}
		
	}
}

// creates a new Page Object
function Page(objcode) {
	this.objreference=objcode;
	this.realpagename=""; // caller usually handles the pagenumber, however, the constructor may modify it if it stumbles upon a PageRenumber-note
	this.annots=new Array();
	
	// now check whether page has annotations
	var objcontent=getObj(objcode);
	var pos=objcontent.search(/\/Annots/i);
	if (pos!=-1) { // Annotations found
		if ((objcontent.indexOf("[",pos)==-1) || (objcontent.indexOf("R",pos)<objcontent.indexOf("[",pos))) { // probably child object used, but this is just guessing
			pos=objcontent.indexOf("R",pos)-4;
			pos=objcontent.lastIndexOf(" ",pos);
			objcontent=objcontent.slice(pos,objcontent.indexOf("R",pos)-1); // extract obj where childobject can be found
			objcontent=getObj(objcontent);
			pos=0;
		}
		objcontent=objcontent.slice(objcontent.indexOf("[",pos)+1,objcontent.indexOf("]",pos)-2); // discard anything except childtree
		objcontent=objcontent.replace(/\s*$/,"").replace(/R*$/,""); // discard final R
		var pagetree=objcontent.split(" R ");
		for (var dummy=0; dummy<pagetree.length; dummy++) {
			this.annots.push(new Annotation(pagetree[dummy]));
			if (this.annots[this.annots.length-1].contents=="") {
				this.annots.pop();
			} else	{
				if (this.annots[this.annots.length-1].type=="PageRenumber") { // read real pagenumber from a popup note
					this.realpagename=this.annots[this.annots.length-1].contents;
					this.annots.pop();
				} //else /*DEBUG*/ Zotero.debug(this.annots[this.annots.length-1].contents);
			}
		}
	}
}
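
The fiddliest part of the Annotation constructor is pulling the text out of "/Contents (...)". In a PDF literal string, parentheses may be escaped with a backslash or simply nested in balanced pairs, so looking for the first unescaped ")" can cut a comment short. Below is a rough standalone sketch of a character-by-character reader. It is not the code from this post: the name extractContents is invented, it only handles the single-character escapes and line continuations, and it ignores octal escapes and hex strings entirely.

```javascript
// Hypothetical helper: reads the literal string that follows /Contents.
// Simplified on purpose: no octal escapes (\ddd), no hex strings (<...>).
const ESCAPES = new Map([["n", "\n"], ["r", "\r"], ["t", "\t"], ["b", "\b"], ["f", "\f"]]);

function extractContents(objText) {
	const start = objText.search(/\/Contents\s*\(/);
	if (start === -1) return "";
	let i = objText.indexOf("(", start) + 1;
	let depth = 1; // nesting level of unescaped parentheses
	let out = "";
	while (i < objText.length) {
		const ch = objText[i];
		if (ch === "\\") {
			const next = objText[i + 1];
			if (next === undefined) break;
			if (ESCAPES.has(next)) {
				out += ESCAPES.get(next);
			} else if (next !== "\n" && next !== "\r") {
				out += next; // covers \( \) and \\
			} // a backslash before a line break is a continuation: emit nothing
			i += 2;
			continue;
		}
		if (ch === "(") depth++;
		if (ch === ")" && --depth === 0) return out;
		out += ch;
		i++;
	}
	return out; // unterminated string: return what was read
}

const sample = "<< /Type /Annot /Subtype /Text /Contents (see \\(p. 3\\) and (nested)\\rmore) /Rect [0 0 1 1] >>";
console.log(JSON.stringify(extractContents(sample)));
```

Unlike the constructor above, this sketch turns "\r" into a real carriage return instead of a space; replace it afterwards if single-line notes are wanted.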

mheim · May 15, 2007

code continued:


// initializes the objlist object, reading all PDF-objects from the file, stripping them of streams
function createObjlist () {
	var text;
	var cleanpdf="";
	var holdOver = "";	// part of the text held over from the last loop
	var streamopen = false;
	var newcontent=0;
	
	Zotero.setCharacterSet("utf-8");
	
	/*DEBUG*/	var chunkcounter=0;
	// read cleanpdf (discard all binary streams); the resulting pdf (as a string) should be small even for big files.
	// streams are enclosed between "stream" and "endstream"
	while (text = Zotero.read(4096)) {
		/*DEBUG*/	chunkcounter++;
		/*DEBUG*/	if (chunkcounter%100==0) Zotero.debug ("Reading chunk: "+chunkcounter.toString());
		holdOver+=text;
		if ((streamopen==true) && (holdOver.length>4105)) { //discard leftover trash from the previous round, keep enough  for overlap
			holdOver=holdOver.slice(holdOver.length-4097);
		}
		
		// run through read text, discard everything that is in a stream
		var searchposition=holdOver.search(((streamopen==true)?"endstream":"stream"));
		while (searchposition!=-1) {
			if (streamopen==true) {
				holdOver=holdOver.slice(searchposition+9); // "endstream" is 9 characters long
				streamopen=false;
			} else {
				cleanpdf+=holdOver.slice(0,searchposition);
				newcontent=searchposition;
				streamopen=true;
				holdOver=holdOver.slice(searchposition+6); // "stream" is 6 characters long
			}
			searchposition=holdOver.search((streamopen?"endstream":"stream"));
		}
		
		// read objects from PDF
		if (newcontent>0) {
			var searchposition=cleanpdf.indexOf("endobj",newcontent);
			while (searchposition!=-1) {
				var cleanpdf1=cleanpdf.slice(0,searchposition);
				cleanpdf=cleanpdf.slice(searchposition+6);
				
				searchposition=cleanpdf1.lastIndexOf(" obj");
				var obj=cleanpdf1.slice(searchposition+4);
				cleanpdf1=cleanpdf1.slice(0,searchposition);
				searchposition=cleanpdf1.search(/\d+\s*\d+\s*$/ig);
			
				// store in objlist under name of PDF-object
				objlist[cleanpdf1.slice(searchposition)]=obj;
						
				cleanpdf=cleanpdf1+cleanpdf;
				searchposition=cleanpdf.indexOf("endobj", cleanpdf1.length);
			}
			newcontent=0;
		}
	}
	cleanpdf+=holdOver;
	
	// read PDF trailer from cleanpdf, store in objlist.root
	dummy=cleanpdf.search(/^trailer/im);
	if (dummy==-1) {
		Zotero.debug("PDF-Trailer could not be found, parsing will be impossible");
	} else {
		// Identify PDF Root
		var dummy=cleanpdf.indexOf("/Root",dummy);
		objlist.root=cleanpdf.slice(dummy+6,cleanpdf.indexOf(" R",dummy+6));
		if (objlist[objlist.root]==undefined) {
			Zotero.debug("Possible PDF-Structure Problem. Rootobject identified but not found.");
			return;
		}
		// Identify Pagetree-Root
		var dummy=objlist[objlist.root].indexOf("/Pages");
		objlist.pagetreeroot=objlist[objlist.root].slice(dummy+7,objlist[objlist.root].indexOf(" R",dummy+7));
	}
}

// recursively parses the pagetree (from objcode onwards), adds to variable pagelist[]
function getPages(objcode) {
	var objcontent=getObj(objcode);
	if (objcontent=="") return; // Object does not exist
	var pos=objcontent.search(/\/Type\s*\/Pages\s*\/Kids/i);
	if (pos!=-1) { // Pagetree found
		if ((objcontent.indexOf("[",pos)==-1) || (objcontent.indexOf("R",pos)<objcontent.indexOf("[",pos))) { // probably child object used, but this is just guessing
			pos=objcontent.indexOf("R",pos)-4;
			pos=objcontent.lastIndexOf(" ",pos);
			objcontent=objcontent.slice(pos,objcontent.indexOf("R",pos)-1); // extract obj where childobject can be found
			objcontent=getObj(objcontent);
			pos=0;
		}
		objcontent=objcontent.slice(objcontent.indexOf("[",pos)+1,objcontent.indexOf("]",pos)-2); // discard anything except childtree
		objcontent=objcontent.replace(/\s*$/,"").replace(/R*$/,""); // discard final R
		var pagetree=objcontent.split(" R ");
		for (var dummy=0; dummy<pagetree.length; dummy++) {
			getPages(pagetree[dummy]); // Recursively descend into the pagetree
		}
	} else { // probably page
		var pos=objcontent.search(/\/Type\s*\/Page/i);
		if (pos!=-1) { // Page found, add to Array pages
			pagelist.push(new Page(objcode));
			if (pagelist[pagelist.length-1].realpagename=="") {
				pagelist[pagelist.length-1].realpagename=(pagelist.length+pagenumberdifference).toString();
				switch (pagenumbertype) {
					case 1:
						pagelist[pagelist.length-1].realpagename="("+pagelist[pagelist.length-1].realpagename+")";
						break;
					case 2:
						pagelist[pagelist.length-1].realpagename="["+pagelist[pagelist.length-1].realpagename+"]";
						break;
				}
			} else {
				pagenumberdifference=parseInt(pagelist[pagelist.length-1].realpagename.replace(/[^\d\-]/g,""),10)-pagelist.length; // radix 10, so a leading zero is not read as octal
				if (pagelist[pagelist.length-1].realpagename.search(/\(/)!=-1) {
					pagenumbertype=1;
				} else {
					if (pagelist[pagelist.length-1].realpagename.search(/\[/)!=-1) {
						pagenumbertype=2;
					} else {
						pagenumbertype=0;
					}
				}
			}
			/*DEBUG*/	if (pagelist.length%100==0) Zotero.debug("Parsing page: "+pagelist[pagelist.length-1].realpagename);
		}
	}
}

// reads Annotations and Highlights from pagelist and stores them into Zotero
function Store() {
	var newItem = new Zotero.Item();
	newItem.itemType = 'document';
	newItem.title="PDF-Import"; // Change as soon as you know how
	for (var pagecounter=0; pagecounter<pagelist.length; pagecounter++) {
		var annotscounter=0;
		while (annotscounter<pagelist[pagecounter].annots.length) {
			var newNote= new Object(); // a plain object suffices, only named properties are set
			if (pagelist[pagecounter].annots[annotscounter].type=="Highlight") {
				newNote.note='"'+pagelist[pagecounter].annots[annotscounter].contents+'" '+pagelist[pagecounter].realpagename;
			} else {
				newNote.note="p. "+pagelist[pagecounter].realpagename+" : "+pagelist[pagecounter].annots[annotscounter].contents;
			}
			newNote.tags = new Array();
			newNote.tags.push(pagelist[pagecounter].annots[annotscounter].type);
			newItem.notes.push(newNote);
			annotscounter++;
		}
	}
	newItem.complete();
}

function doImport() {
	Zotero.debug("PDF-Handler starts");
	
	createObjlist();
	Zotero.debug("PDF-Root-Object is: "+objlist.root+" R\nPageTree-Root-Object is: "+objlist.pagetreeroot+" R");
	getPages(objlist.pagetreeroot);
	
	Store();
	
	Zotero.debug("PDF-Handler finished");
}
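
To make the overall structure easier to follow, here is a compact sketch of the two middle steps, building an object table from the stream-free text and walking the page tree, written as plain functions that run outside Zotero. This is a simplified illustration, not the code from these posts: the names indexObjects, parseRefs and collectPages are invented, and it assumes that /Kids is always an inline array (never an indirect reference) and that the text contains no stray "obj" or "endobj" tokens.

```javascript
// Hypothetical helpers illustrating the object table and the page-tree walk.

// Maps "<number> <generation>" to the text between "obj" and "endobj".
function indexObjects(cleanPdf) {
	const objects = new Map();
	const re = /(\d+)\s+(\d+)\s+obj\b([\s\S]*?)endobj/g;
	let m;
	while ((m = re.exec(cleanPdf)) !== null) {
		objects.set(m[1] + " " + m[2], m[3].trim());
	}
	return objects;
}

// Turns "3 0 R 4 0 R" into ["3 0", "4 0"].
function parseRefs(arrayText) {
	const refs = [];
	const re = /(\d+)\s+(\d+)\s+R\b/g;
	let m;
	while ((m = re.exec(arrayText)) !== null) refs.push(m[1] + " " + m[2]);
	return refs;
}

// Depth-first walk from a /Pages node; returns page object keys in page order.
function collectPages(objects, key, pages = [], seen = new Set()) {
	if (seen.has(key)) return pages; // guard against reference cycles
	seen.add(key);
	const body = objects.get(key);
	if (body === undefined) return pages;
	const kids = body.match(/\/Kids\s*\[([^\]]*)\]/);
	if (kids) {
		for (const ref of parseRefs(kids[1])) collectPages(objects, ref, pages, seen);
	} else if (/\/Type\s*\/Page\b/.test(body)) {
		pages.push(key);
	}
	return pages;
}

const pdfText =
	"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n" +
	"2 0 obj << /Type /Pages /Kids [3 0 R 4 0 R] >> endobj\n" +
	"3 0 obj << /Type /Page >> endobj\n" +
	"4 0 obj << /Type /Pages /Kids [5 0 R] >> endobj\n" +
	"5 0 obj << /Type /Page >> endobj\n";
console.log(collectPages(indexObjects(pdfText), "2 0"));
```

The position of a key in the returned array plus one is the physical page number, which is what the script above adjusts whenever it meets a PageRenumber note.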

jodler · August 5, 2008

I just came across this and it sounds very interesting. I am, however, not sure how to get this code running. Can you help me?

Thanks!

ohthere · October 6, 2010

I also would love to try this code. How can I do it?

ajlyon · October 7, 2010

See http://www.zotero.org/support/dev/creating_translators_for_sites , but I'm not sure that the code works with the current release of Zotero.

studiosus · September 4, 2011

Hi Matthias!

I'm wondering whether there has been any progress with importing PDF highlights/comments easily into Zotero notes? Or does anyone know if there have been any other developments in easily importing PDF highlighting into Zotero?

adamsmith · September 4, 2011

I haven't tried this, but Joscha has been doing some work with highlighting/annotating in Zotfile:
http://www.columbia.edu/~jpl2136/zotfile.html

mheim · November 15, 2011

Just as an update, since this thread pops up time and again:
This will no longer work! Don't try it.

While I haven't tried Zotfile, I don't think it does exactly what this script did, namely extract PDF annotations and highlights into Zotero notes. Correct me if I'm wrong. My script was an ugly hack; it worked for an old version of Zotero, but cluttered up the database.

Though I would still wish to use automatic annotation and highlight extraction with Zotero, I don't think anything really practical will come along anytime soon.

Nevertheless, allow me to leave this ugly hack here, if only as proof that Zotero/JavaScript is easily able to handle binary PDFs on its own to extract metadata.

But don't try to get it running on any Zotero database you intend to use for actual work.

Cheers,

Matthias

adamsmith · November 15, 2011

mheim - while it is not Zotfile's main function, Joscha has indeed implemented exactly this functionality in the most recent version (I believe it's in beta). As I understand it, it only works on Mac so far, though. See Joscha's post from August 18 here:
http://forums.zotero.org/discussion/18737/

ajlyon · November 15, 2011

The Zotfile implementation is not pure JavaScript: it uses the Poppler PDF library to do the same basic thing, but I understand it's safe and reliable.