parsing external file with Zotero Import Script
Hi,
I'm still fine-tuning my PDF solution. Is there a way to parse an FDF file (which is basically a text file) with an import script in Zotero?
I've had a look at some of the import scripts with Scaffold, but I couldn't find one that suited my needs. Is it theoretically possible to show a file-selector window ("Select the FDF file you want to parse") and then open and parse the selected file with some sort of file system object?
Just get me started,...
SeeYa Matthias
I've found the solution myself. Adapted RDF-import. Will be back for other questions...
*returns to hacking*
Matthias
I now read the comments and the highlights directly from the PDF! However, for the highlights to be extracted, one still has to use Acrobat's option to automatically copy highlighted content into a hidden note (and this is probably not going to change, since anything else would require me to basically rewrite pdf2txt in JavaScript).
Still, it imports every note and highlight from a PDF into a new Zotero note. While this is unsatisfactory, it is at least a proof of concept. Ideally I would like to create a summary from the data, which is then imported as HTML alongside the PDF. However, I could not find out how to create a new HTML file from within Zotero (I'm sure it would not be a problem) and then import it automatically. The notes are just a very limited way of outputting that information.
But I need some help (and sleep).
In Scaffold create:
Label: "PDF: Comments and Highlights - Import"
Creator: "Matthias Heim"
Target: "PDF"
Detect Code:
(I have no idea how this works - but it does. It should test whether the file opens up with %pdf, but I have an inkling that it doesn't perform that test)
Zotero.configure("dataMode", "pdf");
function detectImport() {
var text=Zotero.read(4);
if (text.search(/pdf/i)!=-1) {
return true;
}
}
Code:
function doImport() {
var text;
var holdOver = ""; // part of the text held over from the last loop
var cleanpdf = ""; // pdf, cleaned of all binary-streams (can be handled even if file big)
var streamopen = false;
Zotero.debug("PDF-Handler starts");
Zotero.setCharacterSet("utf-8");
// read cleanpdf (discard all binary streams), the resulting pdf (in string) should be small even for big files.
// streams are enclosed inbetween "stream" and "endstream"
while(text = Zotero.read(4096)) {
holdOver+=text;
if ((streamopen==true) && (holdOver.length>4105)) { //discard leftover trash from the previous round, keep enough for overlap
holdOver=holdOver.slice(holdOver.length-4097);
}
var searchposition=holdOver.search(((streamopen==true)?"endstream":"stream"));
while (searchposition!=-1) {
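if (streamopen==true) {
holdOver=holdOver.slice(searchposition+9); // skip the whole keyword: "endstream" is 9 characters long
streamopen=false;
} else {
cleanpdf+=holdOver.slice(0,searchposition);
streamopen=true;
holdOver=holdOver.slice(searchposition+6); // skip the whole keyword: "stream" is 6 characters long
}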
searchposition=holdOver.search((streamopen?"endstream":"stream"));
}
}
cleanpdf+=holdOver;
var numberofcontents=0; // for debugging purposes, count the number of "Contents()"
var dummy=cleanpdf.search(/Contents\s*\(/i); // declared with var to avoid leaking globals
while (dummy!=-1) {
numberofcontents++;
dummy=cleanpdf.indexOf("(",dummy)+1;
var newtext=cleanpdf.substr(dummy,cleanpdf.slice(dummy).search(/\)\//));
if (newtext!="") { // create a new note for every comment and highlight.
var newItem = new Zotero.Item();
newItem.itemType = 'note';
newItem.note = newtext.replace(/\\r/g,"").replace(/\\\(/g,"").replace(/\\\)/g,""); //clean up
newItem.complete();
}
cleanpdf=cleanpdf.slice(dummy+6);
dummy=cleanpdf.search(/Contents\s*\(/i);
}
Zotero.debug("I have found "+numberofcontents.toString()+" Contents in PDF.");
Zotero.debug("PDF-Handler ends");
}
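The script above cuts each comment at the first ")/" and then deletes escaped parentheses. PDF literal strings may contain balanced unescaped parentheses as well as backslash escapes, so a small reader that tracks nesting depth is more robust. This is only a sketch: readLiteralString is a hypothetical helper, not part of the Zotero translator API, and it does not decode octal escapes.

```javascript
// Hypothetical helper (not a Zotero API): reads a PDF literal string that
// starts with "(" at index `start` of `source`.
// Returns the decoded text, or null if the string is missing or unterminated.
function readLiteralString(source, start) {
    if (source.charAt(start) !== "(") return null;
    var out = "";
    var depth = 1; // nesting depth of unescaped parentheses
    var i = start + 1;
    while (i < source.length) {
        var ch = source.charAt(i);
        if (ch === "\\") {
            var next = source.charAt(i + 1);
            if (next === "n") out += "\n";
            else if (next === "r") out += "\r";
            else if (next === "t") out += "\t";
            else if (next === "(" || next === ")" || next === "\\") out += next;
            // any other escape: the backslash and the next character are skipped
            // (octal codes such as \101 are not decoded in this sketch)
            i += 2;
            continue;
        }
        if (ch === "(") depth++;
        if (ch === ")") {
            depth--;
            if (depth === 0) return out; // closing parenthesis of the string
        }
        out += ch;
        i++;
    }
    return null; // ran out of input before the string was closed
}

var sample = "/Contents (A \\(nested\\) (balanced) note)/Type";
var comment = readLiteralString(sample, sample.indexOf("("));
```

In the import loop this could stand in for the substr/search pair, starting from the index of the "(" that follows "Contents".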
The code now really parses the PDF file. It recreates the PDF objects and the page tree and looks for annotations (instead of just crudely throwing everything in between "Contents()" into a note). It does not create a new note for every annotation, but instead creates a new document with notes. Every note is accompanied by its page number (by the way, there really should be a way of storing those with a note) and a tag telling whether it is a highlight or an annotation.
I would appreciate if somebody actually tried this - I can make a screencast if necessary.
Also, I would appreciate it if somebody could take a look at the "Store()"-function. While the parsing of the file is as efficient as I could make it, the actual storing of the annotations seems to create a huge SQL-overload - some bigger files (with several hundreds of annotations) take up to 10 min. to store. (compared to just 20 sec. of parsing the file and extracting the annotations).
Anyway, here's the code, make sure you get everything, I had to distribute it over two posts:
var objlist=new Object(); // stores the PDF-objects, stripped of all streams
var pagelist=new Array(); // stores the pageobject of each page
var pagenumberdifference=0; // indicates that the real pagenumber is different, has to be global because getPages() is recursive
var pagenumbertype=0; // 1=(normal) 2=[roman] 0=consistent with pdf
// returns failsafe either "" or the content of an object in the objectlist;
function getObj(objcode) {
var objcontent=objlist[objcode.replace(/^\s+|\s+$/g,"")]; // trim all leading and trailing whitespace, not just one character
if (objcontent!=undefined) {
return objcontent;
} else {
return "";
}
}
// creates a new Annotation Object
function Annotation(objcode) {
this.objreference=objcode;
this.contents="";
this.type="";
var objcontent=getObj(objcode);
if (objcontent=="") return; // Object does not exist
// extract contents
var pos=objcontent.search(/\/Type\s*\/Annot\s*/i);
if (pos!=-1) { // Annotation found
// extract type
pos=objcontent.search(/\/Subtype\s*\/Text\s*/i);
if (pos!=-1) this.type="Note";
pos=objcontent.search(/\/Subtype\s*\/Highlight\s*/i);
if (pos!=-1) this.type="Highlight";
pos=objcontent.search(/\/Subtype\s*\/Freetext\s*/i);
if (pos!=-1) this.type="Text";
pos=objcontent.search(/\/Contents\s*/i);
if (pos!=-1) { // simply hidden in Contents()-Tag
objcontent=objcontent.slice(objcontent.indexOf("(",pos)+1);
//objcontent=objcontent.slice(0,objcontent.search(/\)\s*\//i)); // weaker version, might be incompatible
objcontent=objcontent.slice(0,objcontent.search(/[^\\]\)/i)+1);
objcontent=objcontent.replace(/\\r/ig," "); // cleaning up linebreaks and other escaped sequences
objcontent=objcontent.replace(/\\\)/ig,")");
objcontent=objcontent.replace(/\\\(/ig,"(");
objcontent=objcontent.replace(/\\\s/ig,"");
// change type to pagerenumber and trim to number if Pagerenumber-Popup
pos=objcontent.search(/PageRenumber/i);
if (pos!=-1) {
objcontent=objcontent.replace(/[^\d\-\[\]\(\)]/g,"");
this.type="PageRenumber";
}
this.contents=objcontent;
} else {
// a more complex version would require parsing of streams - not implemented
}
}
}
// creates a new Page Object
function Page(objcode) {
this.objreference=objcode;
this.realpagename=""; // caller usually handles the pagenumber, however, the constructor may modify it if he stumbles upon a PageRenumber-note
this.annots=new Array();
// now check whether page has annotations
var objcontent=getObj(objcode);
var pos=objcontent.search(/\/Annots/i);
if (pos!=-1) { // Annotations found
if ((objcontent.indexOf("[",pos)==-1) || (objcontent.indexOf("R",pos)<objcontent.indexOf("[",pos))) { // probably child object used, but this is just guessing
pos=objcontent.indexOf("R",pos)-4;
pos=objcontent.lastIndexOf(" ",pos);
objcontent=objcontent.slice(pos,objcontent.indexOf("R",pos)-1); // extract obj where childobject can be found
objcontent=getObj(objcontent);
pos=0;
}
objcontent=objcontent.slice(objcontent.indexOf("[",pos)+1,objcontent.indexOf("]",pos)-2); // discard anything except childtree
objcontent=objcontent.replace(/\s*$/,"").replace(/R*$/,""); // discard final R
var pagetree=objcontent.split(" R ");
for (var dummy=0; dummy<pagetree.length; dummy++) {
this.annots.push(new Annotation(pagetree[dummy]));
if (this.annots[this.annots.length-1].contents=="") {
this.annots.pop();
} else {
if (this.annots[this.annots.length-1].type=="PageRenumber") { // read real pagenumber from a popup note
this.realpagename=this.annots[this.annots.length-1].contents;
this.annots.pop();
} //else /*DEBUG*/ Zotero.debug(this.annots[this.annots.length-1].contents);
}
}
}
}
// initializes the objlist object, reading all PDF-objects from the file, stripping them of streams
function createObjlist () {
var text;
var cleanpdf="";
var holdOver = ""; // part of the text held over from the last loop
var streamopen = false;
var newcontent=0;
Zotero.setCharacterSet("utf-8");
/*DEBUG*/ var chunkcounter=0;
// read cleanpdf (discard all binary streams), the resulting pdf (in string) should be small even for big files.
// streams are enclosed inbetween "stream" and "endstream"
while (text = Zotero.read(4096)) {
/*DEBUG*/ chunkcounter++;
/*DEBUG*/ if (chunkcounter%100==0) Zotero.debug ("Reading chunk: "+chunkcounter.toString());
holdOver+=text;
if ((streamopen==true) && (holdOver.length>4105)) { //discard leftover trash from the previous round, keep enough for overlap
holdOver=holdOver.slice(holdOver.length-4097);
}
// run through read text, discard everything that is in a stream
var searchposition=holdOver.search(((streamopen==true)?"endstream":"stream"));
while (searchposition!=-1) {
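if (streamopen==true) {
holdOver=holdOver.slice(searchposition+9); // skip the whole keyword: "endstream" is 9 characters long
streamopen=false;
} else {
cleanpdf+=holdOver.slice(0,searchposition);
newcontent=searchposition;
streamopen=true;
holdOver=holdOver.slice(searchposition+6); // skip the whole keyword: "stream" is 6 characters long
}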
searchposition=holdOver.search((streamopen?"endstream":"stream"));
}
// read objects from PDF
if (newcontent>0) {
var searchposition=cleanpdf.indexOf("endobj",newcontent);
while (searchposition!=-1) {
var cleanpdf1=cleanpdf.slice(0,searchposition);
cleanpdf=cleanpdf.slice(searchposition+6);
searchposition=cleanpdf1.lastIndexOf(" obj");
var obj=cleanpdf1.slice(searchposition+4);
cleanpdf1=cleanpdf1.slice(0,searchposition);
searchposition=cleanpdf1.search(/\d+\s*\d+\s*$/ig);
// store in objlist under name of PDF-object
objlist[cleanpdf1.slice(searchposition)]=obj;
cleanpdf=cleanpdf1+cleanpdf;
searchposition=cleanpdf.indexOf("endobj", cleanpdf1.length);
}
newcontent=0;
}
}
cleanpdf+=holdOver;
// read PDF trailer from cleanpdf, store in objlist.root
dummy=cleanpdf.search(/^trailer/im);
if (dummy==-1) {
Zotero.debug("PDF-Trailer could not be found, parsing will be impossible");
} else {
// Identify PDF Root
var dummy=cleanpdf.indexOf("/Root",dummy);
objlist.root=cleanpdf.slice(dummy+6,cleanpdf.indexOf(" R",dummy+6));
if (objlist[objlist.root]==undefined) {
Zotero.debug("Possible PDF-Structure Problem. Rootobject identified but not found.");
return;
}
// Identify Pagetree-Root
var dummy=objlist[objlist.root].indexOf("/Pages");
objlist.pagetreeroot=objlist[objlist.root].slice(dummy+7,objlist[objlist.root].indexOf(" R",dummy+7));
}
}
// recursively parses the pagetree (from objcode onwards) and adds each page to the global pagelist[]
function getPages(objcode) {
var objcontent=getObj(objcode);
if (objcontent=="") return; // Object does not exist
var pos=objcontent.search(/\/Type\s*\/Pages\s*\/Kids/i);
if (pos!=-1) { // Pagetree found
if ((objcontent.indexOf("[",pos)==-1) || (objcontent.indexOf("R",pos)<objcontent.indexOf("[",pos))) { // probably child object used, but this is just guessing
pos=objcontent.indexOf("R",pos)-4;
pos=objcontent.lastIndexOf(" ",pos);
objcontent=objcontent.slice(pos,objcontent.indexOf("R",pos)-1); // extract obj where childobject can be found
objcontent=getObj(objcontent);
pos=0;
}
objcontent=objcontent.slice(objcontent.indexOf("[",pos)+1,objcontent.indexOf("]",pos)-2); // discard anything except childtree
objcontent=objcontent.replace(/\s*$/,"").replace(/R*$/,""); // discard final R
var pagetree=objcontent.split(" R ");
for (var dummy=0; dummy<pagetree.length; dummy++) {
getPages(pagetree[dummy]); // Recursively descend into each child of the pagetree
}
} else { // probably page
var pos=objcontent.search(/\/Type\s*\/Page/i);
if (pos!=-1) { // Page found, add to Array pages
pagelist.push(new Page(objcode));
if (pagelist[pagelist.length-1].realpagename=="") {
pagelist[pagelist.length-1].realpagename=(pagelist.length+pagenumberdifference).toString();
switch (pagenumbertype) {
case 1:
pagelist[pagelist.length-1].realpagename="("+pagelist[pagelist.length-1].realpagename+")";
break;
case 2:
pagelist[pagelist.length-1].realpagename="["+pagelist[pagelist.length-1].realpagename+"]";
break;
}
} else {
pagenumberdifference=parseInt(pagelist[pagelist.length-1].realpagename.replace(/[^\d\-]/g,""))-pagelist.length;
if (pagelist[pagelist.length-1].realpagename.search(/\(/)!=-1) {
pagenumbertype=1;
} else {
if (pagelist[pagelist.length-1].realpagename.search(/\[/)!=-1) {
pagenumbertype=2;
} else {
pagenumbertype=0;
}
}
}
/*DEBUG*/ if (pagelist.length%100==0) Zotero.debug("Parsing page: "+pagelist[pagelist.length-1].realpagename);
}
}
}
// reads Annotations and Highlights from pagelist and stores them into Zotero
function Store() {
var newItem = new Zotero.Item();
newItem.itemType = 'document';
newItem.title="PDF-Import"; // Change as soon as you know how
for (var pagecounter=0; pagecounter<pagelist.length; pagecounter++) {
var annotscounter=0;
while (annotscounter<pagelist[pagecounter].annots.length) {
var newNote = new Object(); // a plain object: it carries named properties (note, tags), not indexed elements
if (pagelist[pagecounter].annots[annotscounter].type=="Highlight") {
newNote.note='"'+pagelist[pagecounter].annots[annotscounter].contents+'" '+pagelist[pagecounter].realpagename;
} else {
newNote.note="p. "+pagelist[pagecounter].realpagename+" : "+pagelist[pagecounter].annots[annotscounter].contents;
}
newNote.tags = new Array();
newNote.tags.push(pagelist[pagecounter].annots[annotscounter].type);
newItem.notes.push(newNote);
annotscounter++;
}
}
newItem.complete();
}
function doImport() {
Zotero.debug("PDF-Handler starts");
createObjlist();
Zotero.debug("PDF-Root-Object is: "+objlist.root+" R\nPageTree-Root-Object is: "+objlist.pagetreeroot+" R");
getPages(objlist.pagetreeroot);
Store();
Zotero.debug("PDF-Handler finished");
}
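Both Page() and getPages() pull object references out of an array by slicing between "[" and "]" and splitting on " R ", which depends on exact spacing. A regular expression that matches each "objnum gen R" triple tolerates line breaks and extra spaces. This is a sketch only: parseReferences is a hypothetical helper, and it assumes the keys in objlist have the form "objnum gen" with a single space.

```javascript
// Hypothetical helper (not a Zotero API): collects "objnum gen" keys from
// text containing indirect references, e.g. "/Kids [3 0 R 10 0 R]".
function parseReferences(text) {
    var refs = [];
    var pattern = /(\d+)\s+(\d+)\s+R\b/g; // object number, generation number, "R"
    var match;
    while ((match = pattern.exec(text)) !== null) {
        // normalise whitespace to a single space between the two numbers
        refs.push(match[1] + " " + match[2]);
    }
    return refs;
}

var kids = parseReferences("/Kids [3 0 R 10 0 R\n25 1 R]");
```

Each returned key could then be passed to getObj() or getPages() in place of the entries produced by split(" R ").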
Thanks!
I'm wondering whether there has been any progress with importing PDF highlights/comments easily into Zotero notes? Or does anyone know whether there have been other developments in easily importing PDF highlighting into Zotero?
http://www.columbia.edu/~jpl2136/zotfile.html
This will no longer work! Don't try it.
While I haven't tried Zotfile, I think it does not do exactly what this script did, namely extract PDF annotations and highlights into Zotero notes - correct me if I'm wrong. My script was an ugly hack; it worked for an old version of Zotero, but cluttered up the database.
Though I would still wish to use automatic annotation and highlight extraction with Zotero, I don't think anything really practical will come along anytime soon.
Nevertheless, allow me to leave this ugly hack here, if only as proof that Zotero/JavaScript is easily able to handle binary PDFs on its own to extract metadata.
But don't try to get it running on any Zotero database you intend to use for actual work.
Cheers,
Matthias
http://forums.zotero.org/discussion/18737/