PDF highlighting/annotation - possible solution

Recent versions of Okular, a KDE document viewer, can add annotations, highlights, and some other useful stuff to PDFs and probably to all other supported formats, e.g., DjVu. However, this is sort of a "fake" solution, since Okular does not change anything in the file; instead, it creates a separate XML file that holds all the relevant information. Despite the obvious drawbacks, this approach has its own benefits: the PDF files are not modified, so they can be printed without any problem, and the XML files can be added to the Zotero database, providing a simple way to add annotations to PDFs.

As far as I know, recent versions of Okular are available on all major platforms (Linux, Mac, and Windows), so this solution should be quite portable.
  • I think it would be just super if pdftohtml could be integrated into Zotero, to automate conversion from PDF to HTML, which could then be annotated.
    http://pdftohtml.sourceforge.net/

    Conversion with pdftohtml is not foolproof, though. One of the difficulties is that there are so many kinds of PDFs. Pdftohtml has an option for including a text layer (from OCR) in the produced HTML, with the -hidden parameter, and it can work with files that include text and image illustrations, with the -c parameter.

    PDFs that contain images only, with no text or text layer, don't fare too well with pdftohtml, though. In my experience, the images produced in the conversion are not high-resolution enough to be read, particularly when zoomed to a readable text size. But for those kinds of PDFs, there's the pdfimages utility. The images extracted with pdfimages would need to be put in some kind of simple web page, but I bet it would be possible to program that.

    In any case, the pdftohtml and pdfimages tools are kin of pdftotext, which Zotero already uses for full-text indexing. In fact, Linux distros tend to package all three of the utilities together with some others, as poppler-utils. And like pdftotext, pdftohtml and pdfimages are cross-platform.

    So I'd think that eventually it would be possible to make a Zotero (or Firefox?) extension for converting PDFs to HTML for annotations and highlighting that would not rely on an external website, the way that the PDF Download extension does (and all the others I've seen). At present, the conversion can always be done on the command line.
  • I think sybillie is on to a great idea! It would be amazing to dump the PDF data to some kind of file that we can mark up, while still keeping the original PDF file for posterity and having it all linked together!
  • Another option could be feeding the PDF file to a service like Zoho Viewer: http://www.zoho.com
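
For anyone who wants to try the command-line route from the pdftohtml comment above, here is a rough Python sketch of how the two conversions and the "simple web page" for extracted images might be scripted. It is only an illustration, not something Zotero does. The function names, the output file names (text, img, images.html), and the page layout are all made up for this example. It assumes the poppler-utils versions of pdftohtml and pdfimages are on the PATH, and that the installed pdfimages supports the -png option.

```python
import shutil
import subprocess
from html import escape
from pathlib import Path


def pdftohtml_cmd(pdf, out_base, complex_layout=True, hidden_text=False):
    """Build the pdftohtml argument list (nothing is executed here)."""
    cmd = ["pdftohtml"]
    if complex_layout:
        cmd.append("-c")       # text plus image illustrations
    if hidden_text:
        cmd.append("-hidden")  # include the hidden (OCR) text layer
    return cmd + [str(pdf), str(out_base)]


def pdfimages_cmd(pdf, image_root):
    """Build the pdfimages argument list; -png is assumed to be available."""
    return ["pdfimages", "-png", str(pdf), str(image_root)]


def image_page(image_names, title="Extracted images"):
    """Return a bare-bones HTML page that shows the images in order."""
    body = "\n".join(
        '<p><img src="{0}" alt="{0}"></p>'.format(escape(name, quote=True))
        for name in image_names
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        "<title>{}</title></head>\n<body>\n{}\n</body></html>\n"
    ).format(escape(title), body)


def convert(pdf, out_dir):
    """Run both tools and write images.html next to the extracted images."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for tool in ("pdftohtml", "pdfimages"):
        if shutil.which(tool) is None:
            raise RuntimeError(tool + " not found on PATH")
    subprocess.run(pdftohtml_cmd(pdf, out_dir / "text"), check=True)
    subprocess.run(pdfimages_cmd(pdf, out_dir / "img"), check=True)
    # pdfimages names its output <image-root>-NNN.<ext>
    names = sorted(p.name for p in out_dir.glob("img-*.png"))
    (out_dir / "images.html").write_text(image_page(names), encoding="utf-8")
```

Calling convert("paper.pdf", "paper_html") would leave the original PDF untouched and put the HTML and images in a separate folder, which fits the idea of keeping the original file and linking the marked-up copy to it.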