Help Needed: Workflow for existing PDFs into Zotero

rcr1991 · April 20, 2008

I have thousands of PDFs stored locally. Want to eventually have them all in Zotero as fully indexed and searchable files.

My inclination is that these PDFs have some value--some are from 'pay' sites, and re-downloading them seems inefficient.

Ideally there would be a way to search the PDF, look up its DOI or PMID, get that info, make a Zotero record, link to the file, and make it fully text searchable.
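For the "find its DOI or PMID" part, a minimal sketch of one possible approach: scan the text extracted from the PDF with regular expressions. The patterns and the function names (`find_doi`, `find_pmid`) are illustrative assumptions, not part of any tool mentioned in this thread, and older papers often print neither identifier.

```python
import re

# Assumed patterns: a DOI starts with "10.", a 4-9 digit registrant code,
# a slash, then a suffix; a PMID is printed as "PMID" followed by digits.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
PMID_RE = re.compile(r'\bPMID:?\s*(\d{1,8})\b', re.IGNORECASE)

def find_doi(text):
    """Return the first DOI-looking string in text, or None."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")

def find_pmid(text):
    """Return the first PMID printed in text (digits only), or None."""
    match = PMID_RE.search(text)
    return match.group(1) if match else None
```

Whatever this returns is only a candidate; it still has to be checked against a lookup service before it is trusted.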

I have pdftotext running in batch mode, can create text files of non-image PDFs.
Preparing to OCR 'image' files so they can be indexed after pdftotext. Stumped by how to link the newpdf.txt file to the original PDF file.
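One simple way to keep each text file linked to its PDF is to give it the same name in the same folder, changing only the extension. A rough sketch, assuming the `pdftotext` command is on the PATH; the helper names (`text_path_for`, `pdftotext_command`, `convert_folder`) are made up for illustration.

```python
from pathlib import Path
import subprocess

def text_path_for(pdf_path):
    """Sidecar text file: same folder and name as the PDF, .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")

def pdftotext_command(pdf_path):
    """Command line that writes the PDF's text to its sidecar file."""
    pdf = Path(pdf_path)
    return ["pdftotext", str(pdf), str(text_path_for(pdf))]

def convert_folder(folder):
    """Run pdftotext on every *.pdf in folder; return {pdf: txt} on success."""
    mapping = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(pdftotext_command(pdf))
        if result.returncode == 0:
            mapping[pdf] = text_path_for(pdf)
    return mapping
```

With this naming, `reprints/smith2001.pdf` and `reprints/smith2001.txt` stay paired without any separate lookup table.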

But the "look up at PubMed and put into Zotero" function eludes me. I've spent the weekend searching the web and have come across several links, but no actual code that I can get to work to make this happen. In my searching I realize there are a lot of people out there trying to solve this problem, but the tools seem to not be available.

So I am open for any suggestions for workflow and tools that would accomplish the complete integration of existing PDFs into Zotero.

Have looked at BibDesk, JabRef, EndNote, Papers. Nothing does it all, and many say 'manually find the citation' then we'll do the rest. That is the part I want to automate. I'll pay to have the tool created and make it freely available.

Any suggestions?

Rob

markb · April 21, 2008

Well, I am in the same situation.
I don't have a worked-out solution, but here is how I would approach it.
(1) first of all, you need to batch things (obviously). It's a lot easier to batch in linux vs windows. It may be too big a pain to transfer into a linux environment. Anyways, you can probably get away with batching in windows, too.
(2) use pdftotext for each file (batch this) - name so that it relates to original pdf
(3) how to uniquely ID the paper?
(a) using text version of pdf (from step 2) - get journal name and volume/page/year info using a perl script.
(b) grab the title (perl again)
(c) I'll comment a little on this below
(4) get full citation information via pubmed (can again use perl to query the website)
(5) put this citation format into a zotero-compatible file format (again, probably perl is best but maybe an easier way)
(6) import into zotero
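Step (5) could look something like the sketch below, written in Python rather than perl. It writes one record in RIS, a tagged plain-text format that Zotero can import. The dictionary keys (`authors`, `title`, `first_page`, and so on) are a hypothetical layout for whatever steps (3) and (4) produce, and only a handful of RIS tags are covered.

```python
def to_ris(citation):
    """Format one journal-article citation (a dict) as an RIS record."""
    lines = ["TY  - JOUR"]  # record type: journal article
    for author in citation.get("authors", []):
        lines.append(f"AU  - {author}")
    # (RIS tag, assumed dictionary key) pairs; missing keys are skipped.
    fields = (
        ("TI", "title"),
        ("JO", "journal"),
        ("PY", "year"),
        ("VL", "volume"),
        ("SP", "first_page"),
        ("DO", "doi"),
    )
    for tag, key in fields:
        value = citation.get(key)
        if value:
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # end of record
    return "\n".join(lines) + "\n"
```

Concatenating the output for every paper into one `.ris` file would give a single file to import in step (6).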

I know this requires some significant programming skills (perl, prob linux) and will invariably screw up with a subset of your pdfs. But, assuming that you are using the biomed literature (you mentioned Pubmed), I think this could be robust.

If I end up writing something like this myself, I will give details and post all code.

Mark

Con_Sole · April 23, 2008

Same problem here... I have also tried batch downloading the PDFs again from PMC, but all I got was an IP ban :( It would also be a lot easier to use the already downloaded PDF files...

rcr1991 · April 25, 2008

markb,

thanks for the thinking on this. you've outlined the steps. i've stumbled through the process to steps 5 and 6 using kludged together solutions to test feasibility. some new questions to refine my understanding of the next steps:

now i have a properly identified pdf, and its text file. i have read about 'layers' in PDFs, and I was kind of assuming that is what I would want to create to import to Zotero, 'cause without the text the full text search wouldn't work on image-only PDFs. sorry if i am being obtuse, but a number of journals are scanning their archives in 'image-only' PDF format. it seems that some of the workflows suggest applying Adobe or other OCR to these image-only PDFs, and if I understand it this OCR'd text does get stored as a text layer, 'cause pdftotext cannot extract 'text' from an 'image', right?

So do I need to create that type of PDF for Zotero? And with what tools? Not scared of Linux or command line, just want Zotero full of all my reprints :).
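To sort out which files need OCR in the first place, one rough heuristic is to look at how much real text pdftotext got out of each PDF: an image-only scan yields little more than whitespace and page breaks. A small sketch; the function name `needs_ocr` and the 100-characters-per-page threshold are arbitrary assumptions that would need tuning on a real collection.

```python
def needs_ocr(extracted_text, page_count=1, min_chars_per_page=100):
    """Guess whether a PDF is image-only from its extracted text."""
    # Count only letters and digits, so the whitespace and form feeds
    # emitted for blank pages do not count as text.
    useful = sum(1 for ch in extracted_text if ch.isalnum())
    return useful < min_chars_per_page * max(page_count, 1)
```

Files flagged this way would go to the OCR step; the rest already have a text layer that can be indexed as-is.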

I have this idea, maybe stupid, that if I could make Zotero think my computer was the NLM it would find all my PDFs, rename them, index them, etc. Only half joking.

Again, I appreciate the dialog.

Rob

markb · April 27, 2008

Hi Rob,
well, I think this is a generally important topic. As I said, I too have a large number of pdfs.
So some points, numbered for clarity:
(1) my workflow was oriented toward just getting the full citation information into zotero (with abstract, also). Just let me expound on this for clarity. The first thing is that a paper can be uniquely identified by first author and starting page number (except in some rare cases). This information can be put into pubmed to get the citation information for that exact paper. So the second thing is writing a robust routine to gather the first author name and page number from a text converted pdf. This sounds slightly tricky to me; I'd guess that something I wrote in perl might work for 80% of the pdfs; the other 20% would require some special handling. If you have done this, bravo!
The next step is writing something (perl again) to automatically query pubmed with the author name and page number. This should be pretty easy - perl is great at this sort of thing.
So, from pubmed, you get the full citation info and abstract.
(2) Ok, so I was done at this stage. You want your pdfs to be fulltext searchable within zotero, right? This seems more difficult to me. I am thinking an attached note with the full converted text would be good. Yes, I have heard about the OCR conversion for imaged files (e.g. many at JSTOR). There should be good open-source versions of this, sorry I don't know them.
A bigger possibility: clearly, the google desktop search and yahoo desktop search are able to look within pdfs (this is how I allowed full text access). So there should be a way to have a routine within zotero to do this... however, this sounds more like recoding zotero to me.
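The PubMed query from point (1) might be sketched like this, in Python rather than perl, using NCBI's E-utilities search endpoint. The `[1au]` (first author), `[pg]` (pagination) and `[dp]` (publication date) field tags are PubMed search tags; the function names are invented for illustration, and `find_pmids` needs a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_term(first_author, first_page, year=None):
    """Build a PubMed search term from first author and starting page."""
    parts = [f"{first_author}[1au]", f"{first_page}[pg]"]
    if year:
        parts.append(f"{year}[dp]")  # optional, narrows the rare collisions
    return " AND ".join(parts)

def esearch_url(term):
    """URL that asks E-utilities for matching PubMed IDs as JSON."""
    query = urlencode({"db": "pubmed", "term": term, "retmode": "json"})
    return ESEARCH + "?" + query

def find_pmids(term):
    """Run the search and return the list of matching PMIDs (strings)."""
    with urlopen(esearch_url(term)) as response:
        data = json.load(response)
    return data["esearchresult"]["idlist"]
```

A result with exactly one PMID is probably the right paper; zero or several hits would fall into the 20% needing special handling. When looping over thousands of files, pausing between requests should help stay within NCBI's usage limits and avoid a ban like the one mentioned above.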

I think enhanced pdf support would really be a big, big bonus for zotero. Critical for scientists and more and more people. PDF, for better or worse, is the standard now...

I hope this helps and if I misunderstood anything (probable) please comment.

Mark

nien · June 2, 2008

can u explain it in more details?
i don't understand the "perl" things
sorry...

lucky me, i just start writing my thesis when i found zotero
not yet many documents
its alright for me to redownload

but if there is other way, will give it a try

scot · June 2, 2008

This only deals with part of the question at hand, but Acrobat Pro can batch-process Image-based PDFs into Image-based-PDFs-with-a-plaintext-layer (the searchable PDFs you want for Zotero). There's a description of the process here:

http://www.acrobatusers.com/forums/aucbb/viewtopic.php?id=14400

The process of automating their entry into Zotero seems daunting indeed. Extracting page numbers or publication data from text with no semantic markup sounds pretty difficult. I only have hundreds, not thousands of PDFs, so I'm importing them manually.

It helps to have two screens, and you set up your windows so you can see the first page of the PDF and the filename (in a file manager) at the same time. Then you re-find the bibliographic info in a Zotero-friendly database (by typing some significant search terms from the metadata). When you find it, you import the data into Zotero and drag your file onto the new item in Zotero, and that's it. Zotero imports it and indexes the text layer of the PDF. It's painful only if you have to comb more than one database to find the bibliographic info (or if you have to do it 2K times). Good luck.

lakelander · July 24, 2008

I also think this is an important topic. My last big project was done using EndNote and Word for the bib management and writing, and UltraRecall for gathering, organising and searching material, mostly PDFs. UR has no bibliographic capability, so I still needed EndNote, but it seems Zotero combines everything you need for managing and gathering information for academic writing. UR can search for text in a text-based PDF file to identify files containing the search terms, but as with Zotero, you then have to enter the search again in Acrobat to find the item location in the document. Is there a method of performing only a Zotero search of PDF text and identifying the passages containing the search terms?

dstillman · July 24, 2008

Is there a method of performing only a zotero search of PDF text and identifying the passages containing the search terms?

Context for search term matches is available from the underlying search system for fulltext searches but not yet presented in the UI, mainly because we don't have a great way to do so. It's also much harder to provide context for non-fulltext searches, which would be nice to do for consistency.

lakelander · July 24, 2008

Thanks Dan,

If I understand you correctly the answer is "Yes, well.. no, but maybe sometime."

I suppose it is only reasonable that users should expect to continue to expend some of their own effort in their research. It would be even easier, wouldn't it, if zotero could guess what I ought to be thinking about then find it for me without any of that pesky typing at all.

wouterstomp · August 13, 2008

This might work:

Referencer (http://icculus.org/referencer) is a program that runs on linux (I would suggest using Ubuntu) and automatically retrieves the pdf's metadata provided it includes an arXiv ID or DOI code.

It generates a bibtex file which can then be imported by zotero.

If it works it is probably the simplest solution currently available.

wouterstomp · August 13, 2008

Another option is cb2Bib (http://www.molspaces.com/d_cb2bib-overview.php), which works on Windows too. If you import the list of PDFs and do a network reference query for each of them you will get a nice BibTeX file which you can import into Zotero. It will take a few mouse clicks for each PDF but it certainly beats doing it all manually.

wouterstomp · August 13, 2008

Also see this thread: http://forums.zotero.org/discussion/255/importing-metadata-from-pdf/

sybille · August 14, 2008

I imported a group of almost 400 PDFs into Zotero manually last summer. I came up with a method to automate some parts of the process, enough so that the importing was not unbearably tedious. :) Here's the method I used:

1) Make a list of the titles of the PDFs.
In my case, the PDFs all had versions of their respective titles in the filenames that contained enough information that I was able to find the articles using Google Scholar, along with some other data.
So, on linux, I used the shell (cat) to make a list of the filenames in the directory where the PDFs were located and then I used a command line text editor (sed) to edit that list so that only the names of the article titles remained, one to a line. And I saved all of that as a plain text file.
Of course there are plenty of other ways to make a list of the titles of the PDFs, that's just what was easiest for me.

2) Install the Firefox extension "Context Search"
https://addons.mozilla.org/en-US/firefox/addon/240
Context Search allows you to access all of the installed Firefox search engines from the right-click menu.
I then installed the Google Scholar search plugin, since I had already verified that my PDF articles were listed there.
http://mycroft.mozdev.org/search-engines.html?name=scholar.google.com
(1st on the list)
I also used "Manage Search Engines..." in the Firefox search toolbar to move the Google Scholar plugin to the top of the list.

3) Use Google Scholar to enter the articles into Zotero.
So I opened the text file containing the list of the articles in Firefox, selected an article title, right-clicked, and chose Google Scholar from the "Search for" list. This opened a new tab with the Google Scholar results, from which the article's metadata could be added to Zotero in the normal way.

4) Add the PDF file to the Zotero item.
This was made easier by adding an entry to the Places menu in the GNOME file manager for the folder in which all the PDFs were stored.
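The filename clean-up from step 1 can also be scripted. Below is a rough Python equivalent, with an extra helper that builds the Google Scholar search link used in step 3. It assumes the filenames separate words with underscores, hyphens or dots, which may not match every collection, and the function names and the `titles.txt` output name are made up for this sketch.

```python
import re
from pathlib import Path
from urllib.parse import urlencode

def title_from_filename(filename):
    """Turn 'Some_article-title.pdf' into 'Some article title'."""
    stem = Path(filename).stem                # drop folder and extension
    words = re.sub(r"[_\-.]+", " ", stem)     # separators become spaces
    return re.sub(r"\s+", " ", words).strip()

def scholar_url(title):
    """Google Scholar search URL for one title."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": title})

def write_title_list(folder, out_file="titles.txt"):
    """Write one cleaned-up title per line for every *.pdf in folder."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    titles = [title_from_filename(p.name) for p in pdfs]
    Path(out_file).write_text("\n".join(titles) + "\n", encoding="utf-8")
    return titles
```

The links are meant for opening one at a time in the browser, as in step 3, so the metadata can still be checked by eye before it goes into Zotero.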

It did take a while to go through all of the articles in this way, but not that long once I settled on the method. I put the articles into a separate Zotero collection sorted so that the most recently added item was on top, to make it easier to keep track of things, and I did it in batches.

The advantage of the manual import is that it allowed me to check the metadata in Zotero as I added the items, to make sure that they were accurate for my preferences (full author names, for example). I find that a quick manual check is always a good idea whenever adding new items into Zotero, because there's just too much inconsistency in the available metadata.

Even if it was time-consuming, I'd rather spend the time once as described above than have to check, by hand, each individual reference and bibliography I make in each of my projects, which is how I used to do things pre-Zotero.