PDF Annotations Import

mheim · April 30, 2007

Hi,

Nowadays everybody in academia works with PDF files. While Zotero already has basic support for them, something I would really appreciate is the ability to import my PDF annotations and highlights into Zotero. I don't think that pdf2txt supports annotations at the moment, but annotation import would immediately bring PDF support up to the level of HTML support, which would be great.

Thanks for this great tool - it's what I've been waiting for...

Matthias

scot · April 30, 2007

Have you seen this thread?

Quoted from Dan Stillman:

Annotations do not work with PDF files, only HTML and image files. Since PDFs are loaded by plugins, we don't have any way of accessing the internal structure.

Acrobat offers some built-in annotation functions, though, and it may be possible that Zotero would be able to read such data in the future (for example, to index it for searching).

It sounds as if Zotero won't be able to handle PDF annotations in the same way that it handles HTML annotations, so "importing" existing annotations and treating them identically to Zotero's HTML additions will not be possible. However, if what you want to do is use or search the text of your Acrobat annotations, that sounds at least possible.

It would require a manual (or automatic) way for Zotero to 'update' its copy of the PDF annotations when they get changed, and ideally some mechanism for jumping to the appropriate part of the PDF when a PDF annotation is found in a search. (Though such a capability may be exactly what Dan means cannot be achieved, since "PDFs are loaded by plugins.")

dstillman · April 30, 2007

There's an Acrobat JavaScript API that could conceivably enable this sort of interaction, but I'm not sure it allows access from the browser to the PDF file—have to look into it.

mheim · April 30, 2007

At least in href links it is quite easy to open a specific PDF page.
According to this Adobe Tech Note (http://www.adobe.com/cfusion/knowledgebase/index.cfm?id=317300), you just add #page=PGNR to the URL, where PGNR is the page number.
So, to open the PDF linked on that page, which provides further detail from page 5 onwards, you can just click on http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf#page=5
Apparently it is even possible to open the PDF at a specific comment or note, but you would have to know the comment ID.
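
A minimal sketch of building such links in JavaScript, assuming the viewer honours the "page" and "comment" open parameters as the Adobe document describes them. The function name pdfLink and its arguments are made up for illustration.

```javascript
// Build a link that opens a PDF at a given page, optionally at a comment.
// Hypothetical helper: it only assembles the URL, it cannot check whether
// the PDF viewer actually supports these open parameters.
function pdfLink(url, page, commentId) {
  if (typeof page !== "number" || page < 1 || page % 1 !== 0) {
    throw new Error("page must be a positive integer");
  }
  // Drop any fragment already present so only one "#..." part remains.
  var link = url.split("#")[0] + "#page=" + page;
  if (commentId) {
    // The comment ID has to be known in advance; page comes first.
    link += "&comment=" + encodeURIComponent(commentId);
  }
  return link;
}
```

Calling pdfLink with the PDFOpenParameters.pdf address and page 5 rebuilds the link quoted above.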

My point is that, although Zotero obviously cannot access PDFs the way it accesses HTML, Acrobat basically already supports the annotation and note-taking features that Zotero enables on HTML. It would be the icing on the cake if Zotero could access that data and organise it.
Personally, that would be a favourite feature of mine. While students will obviously very often annotate websites, postgraduates (at least in the humanities) are bombarded with PDFs. In real-life numbers, this means that my current project (the first I manage in Zotero) currently involves 42 magazine articles (PDF), and I have so far grabbed four web pages into Zotero for annotation.
On the other hand, pdf2txt really does not seem to support comments, so that seems to be it, unless somebody comes up with another PDF front end for Zotero.

mheim · May 5, 2007

Hello,

I am now using a workaround for my problem. In Adobe Acrobat 8 (maybe also in the Reader) you can change the settings so that the program automatically copies each highlighted passage into the note that comes with every highlight in a PDF (but usually remains closed). One can then use "Summarize Comments" to get a summary of just the comments (read more about this here).

However, since this only creates another PDF, I have written a short JScript (which unfortunately only works on Windows) that parses the Forms Data Format (FDF) file you can export from Acrobat and turns the annotations and highlights into a nice HTML file, which can then be imported into Zotero and (again) be annotated. (It sounds much more complicated than it actually is.)

Here's the script; use at your own risk.

Best wishes,

Matthias

/* 
 * copy code into a file called "parse-fdf.js"
 *
 * This script parses an FDF-file (as can be exported by Adobe Acrobat)
 * and writes the highlights and annotations it finds into a readable
 * HTML-file.
 * Just drop the FDF-file onto the script.
 * In Adobe Acrobat 8 (maybe also in the Reader) you can change the
 * comment-settings so that the highlighted text is automatically copied
 * into the note which comes with every highlight. Only then can
 * this small script extract it from the FDF file. It also works with
 * FDFs created by Foxit Reader (note, however, that Foxit cannot yet
 * automatically copy highlighted text into a note).
 *
 * ! Requires Windows Script Host to be installed.
 * ! It will overwrite any previously parsed output without notice.
 *
 * This Software is licensed under the terms of the GPL.
 * Visit http://www.gnu.org/copyleft/gpl.html for further information.
 *
 * Copyright 2007. Matthias Heim
 * Version 0.1 (works for me, don't expect updates)
 *
 * Originally published in this thread in the Zotero Forum.
 * http://forums.zotero.org/discussion/772/pdf-annotations-import/#Item_5
 *
 */


// Test whether only one file passed on command line (or dropped onto js)
if (WScript.Arguments.length!=1) {
  WScript.Echo("Drop 1 (only) FDF file onto Program");
  WScript.Quit();
}

// Test whether file has FDF extension
var filename=WScript.Arguments.Item(0);
if (filename.search(/\.fdf$/i)==-1) {
  WScript.Echo("No FDF file, Quitting");
  WScript.Quit();
}


// Open FDF file for reading (its content is read into s further below)
var objFSO = new ActiveXObject("Scripting.FileSystemObject");
var objFile = objFSO.OpenTextFile(filename, 1);

// Cut ".fdf"-extension from filename
filename=filename.replace(/\.fdf$/i,"");


// Open HTML-Output File and initialize it
var objOutputFile = objFSO.CreateTextFile(filename+".pdf-Comments.html", true);
objOutputFile.WriteLine("<HTML><HEAD><TITLE>Annotations and Highlights in '"
  +filename+".pdf'</TITLE></HEAD><BODY><CENTER><P><H1>&laquo; "
  +filename.slice(filename.lastIndexOf("\\")+1)
  +" &raquo;</H1><hr noshade width=60%><H3>Highlights and Annotations</H3></P></CENTER><TABLE width=80%>");

// Read FDF-file into s
var s="";
while (!objFile.AtEndOfStream) s += objFile.ReadLine();
objFile.Close(); 


// Create Storage Pool
var StoreIndex=0;
var StorePgNr = new Array();
var StoreCType = new Array();
var StoreText = new Array();

// Loop through fdf (in s) and extract notes and highlights
// Check for the first Pagenumber (that comes with any comment)
var pos=s.search(/R\s*\/Page/i);
while (pos!=-1) {
  var firstpart=s.slice(0,pos);
  s=s.slice(pos+4);

  //Get Pagenumber from String akin to "ge 12>>" (+1 because FDF page numbers start at 0)
  var pgnr=parseInt(s.slice(s.search(/\d/),s.search(">")),10)+1;
  
  //Get Comment-Type
  var ctype=0;
  var dummy=firstpart.search(/Subtype\s*\/Text/i);
  if (dummy!=-1) {
    ctype=1;
  } else {
    dummy=firstpart.search(/Subtype\s*\/Highlight/i);
    if (dummy!=-1) {
        ctype=2;
    } // else append new comment types here
  }

  //Get Comment (if of known type) from String as in "Contents (This is the comment)/"
  if (ctype!=0) {
    firstpart=firstpart.slice(dummy);
    dummy=firstpart.search(/Contents\s*\(/i);
    if (dummy!=-1) {
      firstpart=firstpart.slice(dummy);
      firstpart=firstpart.slice(firstpart.search(/\(/)+1,firstpart.search(/\)\//));
      firstpart=firstpart.replace(/\\r/ig," ");   // cleaning up linebreaks
      if (firstpart!="") {   // store comment in pool if not empty
        StorePgNr[StoreIndex]=pgnr;
        StoreCType[StoreIndex]=ctype;
        StoreText[StoreIndex]=firstpart;
        StoreIndex++;  
      }
    } //else no comment found
  }

  pos=s.search(/R\s*\/Page/i);
}

if (StoreIndex>0) {
  
  // Sort Entries
  for (var i=0; i<StoreIndex; i++) {
    for (var j=StoreIndex-1; j>i; j--) {
      if (StorePgNr[i]>StorePgNr[j]) {
        var pgnr=StorePgNr[j];
        var ctype=StoreCType[j];
        var cText=StoreText[j];

        StorePgNr[j]=StorePgNr[i];
        StoreCType[j]=StoreCType[i];
        StoreText[j]=StoreText[i];

        StorePgNr[i]=pgnr;
        StoreCType[i]=ctype;
        StoreText[i]=cText; 
      }
    }
  }

  // List Entries
  var page=0;
  var newpage="";
  for (var i=0; i<StoreIndex; i++) {
    if (StorePgNr[i]>page) { // new Page
      page=StorePgNr[i];
      objOutputFile.WriteLine("<TR><TD width=25%> </TD><TD><HR noshade width=30% size=1></TD></TR>");
      // Turn all backslashes of the Windows path into forward slashes for the file:// link
      newpage='<A href="file:///'+filename.replace(/\\/g,"/")+".pdf#page="+page.toString()+'">\nPage '+page.toString()+"</A>";
    } else {
      newpage="";
    }

    switch (StoreCType[i]) {
      case 1:
        objOutputFile.WriteLine("<TR valign=middle><TD width=25% align=center><small>"+newpage+"</small></TD><TD><TABLE align=center frame=vsides cellpadding=10 cellspacing=10 width=95% bgcolor=cornsilk><TD>\n"+StoreText[i]+"\n</TD></TABLE></TD></TR>");
        break;
      case 2:
        objOutputFile.WriteLine("<TR valign=middle><TD width=25% align=center><small>"+newpage+"</small></TD><TD><TABLE frame=box cellpadding=10 cellspacing=10 width=100%><TD>\n"+StoreText[i]+"\n</TD></TABLE></TD></TR>");
        break;
    }   
  }
}

//WScript.Echo("Conversion finished");
objOutputFile.WriteLine("</TABLE></BODY></HTML>");
objOutputFile.Close();
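
For anyone not on Windows, the extraction step can be sketched in plain JavaScript without Windows Script Host. This is a rough sketch, not a replacement for the script above: the function names (decodePdfString, escapeHtml, extractAnnotations) are made up, and it assumes the same FDF layout the script relies on (each annotation lists /Subtype before /Contents and is followed by an indirect reference and /Page N). Unlike the script, it also decodes escaped parentheses and escapes HTML special characters.

```javascript
// Decode the common backslash escapes of a PDF literal string.
// Line breaks and tabs become spaces; \( \) \\ become the literal character.
function decodePdfString(raw) {
  return raw.replace(/\\([rnt()\\])/g, function (match, ch) {
    return (ch === "r" || ch === "n" || ch === "t") ? " " : ch;
  });
}

// Escape characters that have a special meaning in HTML.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Extract { page, type, text } records from the text of an FDF file.
// Assumption: same layout as the script above expects. Unescaped nested
// parentheses inside a comment are not handled.
function extractAnnotations(fdf) {
  var records = [];
  var chunkRe = /([\s\S]*?)R\s*\/Page\s*(\d+)/g;
  var m;
  while ((m = chunkRe.exec(fdf)) !== null) {
    var chunk = m[1];
    var typeMatch = /Subtype\s*\/(Text|Highlight)\b/.exec(chunk);
    if (!typeMatch) continue; // unknown comment type
    // Comment text: escaped characters or anything except "\" and ")".
    var textMatch = /Contents\s*\(((?:\\[\s\S]|[^\\)])*)\)/
      .exec(chunk.slice(typeMatch.index));
    if (!textMatch || textMatch[1] === "") continue; // no comment text
    records.push({
      page: parseInt(m[2], 10) + 1, // FDF page numbers start at 0
      type: typeMatch[1],
      text: escapeHtml(decodePdfString(textMatch[1])),
      order: records.length // original position, used as tie-breaker
    });
  }
  // Sort by page; comments on the same page keep their original order.
  records.sort(function (a, b) {
    return (a.page - b.page) || (a.order - b.order);
  });
  return records;
}
```

The returned records can then be written out as table rows in the same way the script does, using record.page for the #page= link and record.type to pick the row style.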