Translators for PDFs

mronkko · March 12, 2009

For people who store a large number of PDFs on their computer, the ability to import these into Zotero automatically would be valuable. This could be implemented by introducing translators for PDF files.

Several databases embed the citation information in the PDFs, and it can be extracted quite easily using pdftotext. For example, in a PDF file that I have, the first two lines of the pdftotext output contain

"Understanding dynamic capabilities
Sidney G Winter Strategic Management Journal; Oct 2003; 24, 10; ABI/INFORM Global pg. 991"

This could be parsed to generate an article item.
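As a rough illustration of what such parsing might look like, here is a minimal Python sketch for the two pdftotext lines quoted above. The function name, the returned field names, and the known_journals argument are all made up for this example. The list of journal names is needed because the author and the journal are not separated by any delimiter in that line, so this is a sketch under that assumption and not a general solution.

```python
import re

# Matches the part of the citation line after the author/journal text, e.g.
# "; Oct 2003; 24, 10; ABI/INFORM Global pg. 991"
TAIL = re.compile(
    r";\s*(?P<date>[A-Za-z]{3,9}\.? \d{4});\s*"
    r"(?P<volume>\d+),\s*(?P<issue>\d+);\s*"
    r"ABI/INFORM \w+\s+pg\.\s*(?P<page>\d+)"
)


def parse_abi_inform(text, known_journals):
    """Parse an ABI/INFORM-style cover line (hypothetical helper).

    known_journals is an assumed list of journal names used to split
    "author journal", because the line itself has no separator there.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    title, citation = lines[0], " ".join(lines[1:])
    m = TAIL.search(citation)
    if m is None:
        return None
    head = citation[:m.start()].strip()
    for journal in known_journals:
        if head.endswith(journal):
            author = head[: -len(journal)].strip()
            break
    else:
        # Journal not recognised: keep the unsplit text as the author field.
        journal, author = None, head
    return {
        "title": title,
        "author": author,
        "journal": journal,
        "date": m.group("date"),
        "volume": m.group("volume"),
        "issue": m.group("issue"),
        "firstPage": m.group("page"),
    }


sample = (
    "Understanding dynamic capabilities\n"
    "Sidney G Winter Strategic Management Journal; Oct 2003; 24, 10; "
    "ABI/INFORM Global pg. 991"
)
item = parse_abi_inform(sample, ["Strategic Management Journal"])
```

If the journal is not in the list, the sketch leaves the journal empty instead of guessing where the author name ends.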

Running the same article through OCR produces the following at the start of the file

"Understanding dynamic capabilities
Sidney G Winter
Strategic Management Journal; ()ct 2003; 24, l(); ABUINFORM Global
pg. 991
Strategic Management Journal
Strat. Mgmt. J., 24: 99]-995 (2003)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smj.3l8
/
Lwwwmz
/rmnnnnrnn~mm UNDERSTANDING DYNAMIC CAPABILITIES
\ SIDNEY G. WINTER*
The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A.
Defining ordinary or `zero-level' capabilities as those thatpermit afirm to `make a living ' in the
short term, one can define dynamic capabilities as those that operate to extend, modny or create
ordinary capabilities. Logically, one can then proceed to elaborate a hierarchy of higher-order
capabilities. However, it is argued here that the strategic substance of capabilities involves
patterning of activity, and that costly investments are typically required to create and sustain
such patterning-for example, in product development. Firms can accomplish change without
reliance on dynamic capability, by means here termed `ad hoc problem solving.' Whether higher-
order capabilities are created or not depends on the costs and benefits of the investments relative
to ad hoc problem solving, and so does the `level of the game' at which strategic competition
effectively occurs. Copyright C 2003 John Wiley & Sons, Ltd."

This could be translated into a full bibliographic entry that also contains the abstract.

OCR was done with Ghostscript and Tesseract. The commands are

gs -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=temp.tiff" $OPTIONS -c save pop -f "understanding dynamic capabilities.pdf"

tesseract temp.tiff temp
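For anyone who wants to script these two steps, here is a minimal Python sketch that wraps the same commands. The function names are made up for this example, the $OPTIONS variable from the shell command is left out because its contents are not given here, and the sketch assumes that gs and tesseract are on the PATH and that Tesseract writes its text to temp.txt.

```python
import subprocess


def build_ocr_commands(pdf_path, tiff_path="temp.tiff", out_base="temp"):
    """Return the Ghostscript and Tesseract argument lists (hypothetical helper)."""
    gs_cmd = [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-r300x300", "-sDEVICE=tiffg3",
        f"-sOutputFile={tiff_path}",
        "-c", "save", "pop",
        "-f", pdf_path,
    ]
    # Tesseract appends ".txt" to the output base name.
    tesseract_cmd = ["tesseract", tiff_path, out_base]
    return gs_cmd, tesseract_cmd


def ocr_pdf(pdf_path, tiff_path="temp.tiff", out_base="temp"):
    """Render the PDF to TIFF, OCR it, and return the recognised text."""
    for cmd in build_ocr_commands(pdf_path, tiff_path, out_base):
        subprocess.run(cmd, check=True)  # raises if either tool fails
    with open(out_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, such as the one above.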

Potential use cases include dragging and dropping PDFs onto the library to generate bibliographic entries and import the dropped files as sub-items, or using a menu to mass-import PDFs.

One option would be to implement this feature as a Zotero plugin. The advantage of this approach is that the plugin could bundle the Tesseract OCR engine and the relevant parts of Ghostscript and use them by default for the conversion.

Mikko

Tjowens · March 12, 2009

A few thoughts.

One, the retrieve PDF metadata function basically solves this issue.

Two,

"Understanding dynamic capabilities
Sidney G Winter
Strategic Management Journal; ()ct 2003; 24, l(); ABUINFORM Global
pg. 991
Strategic Management Journal
Strat. Mgmt. J., 24: 99]-995 (2003)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smj.3l8

is pretty much useless for Zotero. While there are tools for parsing bibliographic information like this, they are still too unreliable for this sort of case. Now, if these PDFs had RIS or BibTeX files on the front of them it could be a slightly different story. That said, the retrieve metadata for PDF solution seems to be a much stronger one for the same problem.

mark · March 12, 2009

Note that the retrieve metadata for PDF feature is only available in Zotero 1.5 beta.

mronkko · March 12, 2009

Retrieve metadata is an awesome feature, but it does not work for all PDFs:
1) Google Scholar does not always provide the correct information.
2) Not all PDFs that contain the title, author, and journal seem to work. For example, the PDF that I used as an example does not go through metadata retrieval; instead it produces an error message saying that the PDF does not contain OCRed text.
3) Not all

Moreover, Google Scholar does not give you an abstract of the article. For these reasons I suggested that, if this information exists in the PDF, a future version of Zotero could include translators for PDF files and fall back to the current metadata retrieval process only when no translator exists. The fact that the tools for parsing the information quoted in the previous post are unreliable just indicates that a system for user-contributed translators for PDFs processed by pdftotext would be useful.

Mikko

noksagt · March 12, 2009

"Understanding dynamic capabilities
Sidney G Winter
Strategic Management Journal; ()ct 2003; 24, l(); ABUINFORM Global
pg. 991
Strategic Management Journal
Strat. Mgmt. J., 24: 99]-995 (2003)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smj.3l8

is pretty much useless for Zotero.

In this particular example, I don't think that is true. It contains a DOI, which can be found using a regex. Some documents have multiple DOIs throughout, but in most documents with any of:

only one DOI
a DOI that is repeated more than others
only one DOI on the first page & no DOIs until the last few pages

that DOI corresponds to the article itself, & citation information can be retrieved from, e.g., CrossRef. Other desktop reference management software already does this. It obviously doesn't work for all PDFs (but then the current solution does not work for all PDFs either).
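The three rules above can be sketched roughly as follows. This is only an illustration: the regex is a simple approximation of DOI syntax, the function names are made up, and tail_pages (how many pages count as "the last few") is an assumed parameter. The CrossRef lookup itself is not shown.

```python
import re
from collections import Counter

# Approximate DOI pattern: "10.", a registrant code, "/", then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return DOI-like strings in text, without trailing punctuation."""
    return [m.group(0).rstrip(".,;)").lower() for m in DOI_RE.finditer(text)]


def pick_article_doi(pages, tail_pages=3):
    """Guess which DOI belongs to the article (hypothetical helper).

    pages is a list of strings, one per page of extracted text.
    Returns None when none of the three rules applies.
    """
    all_dois = [d for page in pages for d in find_dois(page)]
    if not all_dois:
        return None
    counts = Counter(all_dois)
    # Rule 1: only one distinct DOI in the whole document.
    if len(counts) == 1:
        return all_dois[0]
    # Rule 2: one DOI is repeated more often than every other DOI.
    (top, n_top), (_, n_next) = counts.most_common(2)
    if n_top > n_next:
        return top
    # Rule 3: exactly one DOI on the first page and none until the tail.
    first = set(find_dois(pages[0]))
    tail_start = max(1, len(pages) - tail_pages)
    middle = [d for page in pages[1:tail_start] for d in find_dois(page)]
    if len(first) == 1 and not middle:
        return next(iter(first))
    return None
```

When no rule applies the sketch returns None, so the caller can fall back to whatever other recognition method is available.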

Now, if these PDFs had RIS or BibTeX files on the front of them it could be a slightly different story.

And, of course, some (not nearly enough) PDFs do have embedded metadata. This is another occasionally-available source of info about a PDF that other programs parse & that Zotero currently ignores.

That said, the retrieve metadata for PDF solution seems to be a much stronger one for the same problem.

I agree that it is very useful, but I think it could still be improved & that Zotero shouldn't just ignore other useful information that is in some PDFs.

dstillman · March 12, 2009

The use of Google Scholar is just the initial implementation. We'll be adding additional recognition techniques (including use of DOIs) in upcoming releases.

mronkko · March 12, 2009

I recently imported a large number of PDF files into Zotero in one huge drag-and-drop operation. While metadata lookup did a good job of finding data for a number of articles, approximately two thirds of my collection was not recognized. A large majority of these papers are published journal articles that include the title, author, etc. on the front page, and the details could be retrieved by running the article through pdf2txt or OCR and then applying regular expressions to the front page.

The only problem is that each publisher or database that includes a custom first page uses its own format. However, within one journal or database the format is always the same, which would allow implementing a translator system similar to the one currently in use for web pages. This is why I proposed a feature to include PDF translators in some future version.

How do I mark a part of a post as a quote?
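To make the idea a bit more concrete, here is a minimal Python sketch of such a translator system: each translator has a detect step and a parse step, and the first one that recognises the front page wins. All of the names here are hypothetical, the sample translator only handles a cleaned-up version of the "Strat. Mgmt. J." citation line from the earlier example, and nothing in this sketch reflects how Zotero's web translators are actually implemented.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PdfTranslator:
    """One translator per journal or database (hypothetical structure)."""
    label: str
    detect: Callable[[str], bool]            # does this front page match?
    parse: Callable[[str], Optional[dict]]   # extract the fields, or None


def translate(first_page_text, translators):
    """Return the item from the first matching translator, else None."""
    for translator in translators:
        if translator.detect(first_page_text):
            item = translator.parse(first_page_text)
            if item is not None:
                item["translator"] = translator.label
                return item
    return None  # caller falls back to the existing metadata lookup


# Example translator for lines like "Strat. Mgmt. J., 24: 991-995 (2003)".
SMJ_LINE = re.compile(
    r"Strat\. Mgmt\. J\., (?P<volume>\d+): (?P<pages>\d+-\d+) \((?P<year>\d{4})\)"
)


def smj_parse(text):
    m = SMJ_LINE.search(text)
    if m is None:
        return None
    return {"journal": "Strategic Management Journal", **m.groupdict()}


SMJ = PdfTranslator(
    label="Strategic Management Journal (example)",
    detect=lambda text: "Strat. Mgmt. J." in text,
    parse=smj_parse,
)

page = "Strategic Management Journal\nStrat. Mgmt. J., 24: 991-995 (2003)\n"
item = translate(page, [SMJ])
```

Returning None when nothing matches keeps the current metadata retrieval as the fallback, so a missing or failing translator would not make things worse than they are now.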

Mikko

ahoward · March 12, 2009

I'm sorry, Mikko, but this cannot be done. There is no consistent formatting for title pages of articles such that regular expressions would have anything to hinge upon. The fields have to be consistently in the same place or expressly identified for this to work. There's no way to automate the ingestion of the data contained in a highly variable set of title pages. Believe me, such things have been proposed before.

mronkko · March 12, 2009

Here is a script for parsing the first page of PDF files that have been retrieved from JSTOR. Similar translators could be written for other databases. Journal-level translators would also be possible.

Here is a bash script that processes the PDFs:
http://pastebin.com/f7f343bc

And here is a Perl script that does the parsing:
http://pastebin.com/f4141f288

Mikko