Translators for PDFs
For people who store a large number of PDFs on their computer, the ability to import them into Zotero automatically would be valuable. This could be implemented by introducing translators for PDF files.
Several databases embed the citation information in the PDFs, and it can be extracted quite easily using pdftotext. For example, in a PDF file that I have, the first two lines of the pdftotext output contain
"Understanding dynamic capabilities
Sidney G Winter Strategic Management Journal; Oct 2003; 24, 10; ABI/INFORM Global pg. 991"
This could be parsed to generate an article item.
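As a rough illustration of that parsing step, here is a minimal Python sketch. It assumes the layout seen in the single example above (author and journal, then month and year, volume and issue, database name, and page); other PDFs from the same database may differ. Because nothing in the line separates the author from the journal name, the sketch takes a hypothetical `known_journals` list to split them. The function name and the dictionary keys are illustrative only and are not part of Zotero.

```python
import re

# Assumed layout of the citation line, inferred from one example only:
#   <author> <journal>; <Mon YYYY>; <volume>, <issue>; <database> pg. <page>
HEADER_RE = re.compile(
    r"^(?P<lead>.+?);\s*"
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<year>\d{4});\s*"
    r"(?P<volume>\d+),\s*(?P<issue>\d+);\s*"
    r"(?P<database>.+?)\s+pg\.\s*(?P<page>\d+)$"
)


def parse_header(title_line, citation_line, known_journals=()):
    """Return a dict describing the article, or None if the line does not match."""
    m = HEADER_RE.match(citation_line.strip())
    if m is None:
        return None
    lead = m.group("lead")
    author, journal = lead, None
    # The author and journal are not delimited in the text, so split them
    # only when the line ends with a journal name we already know.
    for name in known_journals:
        if lead.endswith(name):
            author = lead[: -len(name)].strip()
            journal = name
            break
    return {
        "itemType": "journalArticle",
        "title": title_line.strip(),
        "author": author,
        "journal": journal,
        "date": f"{m.group('month')} {m.group('year')}",
        "volume": m.group("volume"),
        "issue": m.group("issue"),
        "firstPage": m.group("page"),
        "source": m.group("database"),
    }


item = parse_header(
    "Understanding dynamic capabilities",
    "Sidney G Winter Strategic Management Journal; Oct 2003; 24, 10; "
    "ABI/INFORM Global pg. 991",
    known_journals=("Strategic Management Journal",),
)
```

A line that does not fit the assumed layout makes `parse_header` return `None`, so a caller can fall through to some other import method.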
Running the same article through OCR produces the following at the start of the file
"Understanding dynamic capabilities
Sidney G Winter
Strategic Management Journal; ()ct 2003; 24, l(); ABUINFORM Global
pg. 991
Strategic Management Journal
Strat. Mgmt. J., 24: 99]-995 (2003)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smj.3l8
/
Lwwwmz
/rmnnnnrnn~mm UNDERSTANDING DYNAMIC CAPABILITIES
\ SIDNEY G. WINTER*
The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A.
Defining ordinary or `zero-level' capabilities as those thatpermit afirm to `make a living ' in the
short term, one can define dynamic capabilities as those that operate to extend, modny or create
ordinary capabilities. Logically, one can then proceed to elaborate a hierarchy of higher-order
capabilities. However, it is argued here that the strategic substance of capabilities involves
patterning of activity, and that costly investments are typically required to create and sustain
such patterning-for example, in product development. Firms can accomplish change without
reliance on dynamic capability, by means here termed `ad hoc problem solving.' Whether higher-
order capabilities are created or not depends on the costs and benefits of the investments relative
to ad hoc problem solving, and so does the `level of the game' at which strategic competition
effectively occurs. Copyright C 2003 John Wiley & Sons, Ltd."
This could be translated into a full bibliographic entry that also contains the abstract.
OCR was done with Ghostscript and Tesseract. The commands are
gs -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=temp.tiff" $OPTIONS -c save pop -f "understanding dynamic capabilities.pdf"
tesseract temp.tiff temp
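The two commands above could be wrapped for batch use along these lines. This is only a sketch: the function names and the `extra_gs_options` parameter (standing in for `$OPTIONS`, whose contents are not given here) are assumptions, and it relies on `gs` and `tesseract` being on the PATH. The flags themselves are copied from the commands above.

```python
import subprocess


def build_ocr_commands(pdf_path, tiff_path="temp.tiff", out_base="temp",
                       extra_gs_options=()):
    """Build the Ghostscript and Tesseract argument lists without running them."""
    gs_cmd = [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-r300x300", "-sDEVICE=tiffg3",
        f"-sOutputFile={tiff_path}",
        *extra_gs_options,          # stands in for $OPTIONS (contents unknown)
        "-c", "save", "pop",
        "-f", pdf_path,
    ]
    # Tesseract writes its result to <out_base>.txt
    tesseract_cmd = ["tesseract", tiff_path, out_base]
    return gs_cmd, tesseract_cmd


def ocr_pdf(pdf_path, tiff_path="temp.tiff", out_base="temp"):
    """Rasterise the PDF, OCR it, and return the recognised text."""
    gs_cmd, tesseract_cmd = build_ocr_commands(pdf_path, tiff_path, out_base)
    subprocess.run(gs_cmd, check=True)         # raises if gs fails
    subprocess.run(tesseract_cmd, check=True)  # raises if tesseract fails
    with open(out_base + ".txt", encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, such as the one in the example.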
Potential use cases include dragging and dropping PDFs onto the library to generate bibliographic entries and import the dropped files as sub-items, or using a menu to mass-import PDFs.
One option would be to implement this feature as a Zotero plugin. The advantage of this approach is that the plugin could bundle the Tesseract OCR engine and the relevant parts of Ghostscript and use them by default for the conversion.
Mikko
One, the retrieve PDF metadata function basically solves this issue.
Two, this is pretty much useless for Zotero. While there are tools for parsing bibliographic information like this, they are still too unreliable for these sorts of cases. Now, if these PDFs had RIS or BibTeX files at the front of them, it could be a slightly different story. That said, the retrieve metadata for PDF solution seems to be a much stronger one for the same problem.
1) Google Scholar does not always provide the correct information
2) Not all PDFs that do contain the title, author, and journal seem to work. For example, the PDF that I used as an example does not go through metadata retrieval but results in an error message saying that the PDF does not contain OCRed text.
3) Not all
Moreover, Google Scholar does not give you an abstract of the article. For these reasons I suggested that, if this information exists in the PDF, a future version of Zotero would include translators for PDF files and use the current metadata retrieval process only if no translator exists. The fact that the parsing tools mentioned in the previous post are unreliable just indicates that a system for user-contributed translators for PDFs processed by pdftotext would be useful.
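To make the proposal concrete, here is a minimal Python sketch of "try the PDF translators first, fall back to the existing retrieval otherwise". Everything in it is hypothetical: the registry, the `register` decorator, the `abi_inform` detection rule, and the `fallback` callable (which stands in for the current metadata retrieval) are assumptions for illustration and do not correspond to any existing Zotero API.

```python
from typing import Callable, List, Optional

# A translator takes the pdftotext output and returns an item dict,
# or None if it does not recognise the format.
Translator = Callable[[str], Optional[dict]]
TRANSLATORS: List[Translator] = []


def register(func: Translator) -> Translator:
    """Add a user-contributed translator to the (hypothetical) registry."""
    TRANSLATORS.append(func)
    return func


@register
def abi_inform(text: str) -> Optional[dict]:
    # Hypothetical detection rule: the second non-empty line names the database.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2 or "ABI/INFORM" not in lines[1]:
        return None
    return {"title": lines[0], "citation": lines[1]}


def import_pdf_text(text: str, fallback: Callable[[str], dict]) -> dict:
    """Use the first translator that recognises the text, else the fallback."""
    for translator in TRANSLATORS:
        item = translator(text)
        if item is not None:
            return item
    return fallback(text)
```

Each journal or database format would get its own small function, so a translator that breaks affects only PDFs from that one source.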
Mikko
- only one DOI
- a DOI that is repeated more than others
- only one DOI on the first page & no DOIs until the last few pages
mean that the DOI corresponds to the article itself & citation information can be retrieved from, e.g., CrossRef. Other desktop reference management software already does this. It obviously doesn't work for all PDFs (but then the current solution does not work for all PDFs either). And, of course, some (not nearly enough) PDFs do have embedded metadata. This is another occasionally available source of info about a PDF that other programs parse & that Zotero currently ignores. I agree that it is very useful, but I think it could still be improved & wouldn't just ignore other useful information that is in some PDFs.
The only problem is that every publisher or database that includes a custom first page stores this information in a different format. However, for any one journal or database the format is always the same, which would allow implementing a translator system similar to the one currently in use for web pages. This is why I proposed a feature to include PDF translators in some future version.
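The DOI rules listed earlier in the thread (a single DOI, one DOI repeated more often than the others, or a lone first-page DOI with no others until the last few pages) can be sketched in Python as follows. The regular expression is a common approximation of DOI syntax rather than a full specification, and `last_pages=3` is an assumed reading of "the last few pages". Looking the DOI up at CrossRef would be a separate step that is not shown.

```python
import re
from collections import Counter

# Approximate DOI pattern; real DOIs can be more varied than this.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def guess_article_doi(pages, last_pages=3):
    """pages: list of per-page text strings. Return the likely DOI or None."""
    # Strip punctuation that often trails a DOI at the end of a sentence.
    per_page = [[d.rstrip(".,;)") for d in DOI_RE.findall(p)] for p in pages]
    all_dois = [d for page in per_page for d in page]
    if not all_dois:
        return None
    counts = Counter(all_dois)
    # Rule 1: only one distinct DOI in the whole document.
    if len(counts) == 1:
        return all_dois[0]
    # Rule 2: one DOI is repeated more often than any other.
    (top, n1), (_, n2) = counts.most_common(2)
    if n1 > n2:
        return top
    # Rule 3: exactly one DOI on the first page and none on the pages
    # between the first page and the last few (usually the references).
    first = set(per_page[0])
    tail_start = max(1, len(pages) - last_pages)
    middle = per_page[1:tail_start]
    if len(first) == 1 and not any(middle):
        return next(iter(first))
    return None
```

This only works on clean text. Character-level OCR errors like those in the sample output earlier in the thread would produce a DOI that does not resolve, so a failed lookup should be treated as "no DOI found".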
How do I mark a part of a post as a quote?
Mikko
Here is a bash script that processes PDFs
http://pastebin.com/f7f343bc
And here is a Perl script that does the parsing
http://pastebin.com/f4141f288
Mikko