Importing and associating PDF files with references

jbenjam · June 3, 2008

Hi,

I'm just starting out with Zotero.

I have a collection of hundreds of PDF files from various journals, and would like to associate those files with their respective bibliographic information. Is there an automatic way to do this, or am I stuck manually adding the PDF file, and dragging it to the appropriate bibliographic item?

Thanks,
-Ben

gal.avraham · June 4, 2008

I'm joining Ben's question, too. I have a huge pile of PDF documents saved in a directory on my computer. Is there a way of "telling" Zotero to associate those files with their respective bibliographic information automatically?
Cheers,

Gal

scot · June 4, 2008

ben and gal,
Unfortunately there is no way to do this automatically, for the simple reason that (unless I'm wrong) there is no consistent standard for storing metadata or document IDs within PDFs. Sure, some individual article databases may have patterns, putting citation info in a human-readable format on the first or last page, for example. But (1) this has generally not been done with machine-readability in mind, so it's tricky to extract, and (2) every publisher seems to do it differently. It would not be impossible to write a 'scraper' to extract unique data from a PDF, and then try to associate it with a record in Zotero, but as far as I know, no one has done it, and there would likely have to be a different scraper for each source (database or publisher) of PDFs. As I said, not impossible, but I wouldn't hold your breath.

If you do drag and drop, it helps a lot to have 2 monitors, and to set things up so that you can preview the PDF and still see its filename in a file manager (for dragging). You can keep Zotero's advanced search box open as well (to search for authors' names, for example), and if I remember right, you can drag straight to it. It's mildly painful, but only for the time it takes. I can probably do 20 an hour, more if all the items are in my database already.

In addition, if you have any older scanned PDFs which don't have a searchable text layer (and if full-text searching is really of value to you), you might want to find a friend with Adobe Acrobat Pro, and run your PDFs through its OCR function, which can apparently be automated. Find the procedure here:

http://www.acrobatusers.com/forums/aucbb/viewtopic.php?id=14400

Cheers.

noksagt · June 4, 2008

This topic has been discussed a few times (here and here), but I'll rehash it a bit...

(unless I'm wrong) there is no consistent standard for storing metadata or document IDs within PDFs.

PDFs have had 'info dictionaries' for a long time & now support XMP. These are under-used by publishers and other content providers, though. pdfinfo can read them both for documents that do have them.
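
For anyone who wants to check what their own files contain, here is a minimal Python sketch that calls pdfinfo and turns its "Key: value" output into a dictionary. It assumes pdfinfo is installed and on your PATH; the function names parse_pdfinfo and read_info are made up for this example. As far as I know, pdfinfo also has a -meta option that prints the XMP packet, but check the man page for your version.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key: value' lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def read_info(pdf_path: str) -> dict:
    """Run pdfinfo on one file (assumes it is on PATH) and parse the result."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Calling read_info("paper.pdf") on a file with an info dictionary should give you keys such as Title and Author; many publisher PDFs will simply leave them empty.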

There are other programs that can interact with PDFs directly. A popular recipe is to search the PDF for a string that looks like an identifier (DOI, PMID, arXiv ID, etc.) & then look up information for that match. Obviously, this doesn't work with EVERY pdf, but it works with many of them.
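
A rough sketch of that recipe in Python, assuming you have already extracted the text of the PDF (for example with pdftotext). The patterns and the function name find_identifiers are my own illustration, not what any particular program uses. They are deliberately loose, will miss some identifiers (old-style arXiv IDs, for instance) and will occasionally match junk.

```python
import re

# Loose patterns: expect both misses and false hits on real-world text.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
PMID_RE = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_identifiers(text: str) -> dict:
    """Return the first DOI, PMID and arXiv ID found in text, if any."""
    found = {}
    doi = DOI_RE.search(text)
    if doi:
        # Drop punctuation that often trails a DOI at the end of a sentence.
        found["doi"] = doi.group(0).rstrip(".,;)")
    pmid = PMID_RE.search(text)
    if pmid:
        found["pmid"] = pmid.group(1)
    arxiv = ARXIV_RE.search(text)
    if arxiv:
        found["arxiv"] = arxiv.group(1)
    return found
```

Whatever identifier comes back still has to be looked up somewhere to get the actual bibliographic record; this only covers the "find a string that looks like an identifier" half.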

generally not been done with machine-readability in mind, so it's tricky to extract, and (2) every publisher seems to do it differently. It would not be impossible to write a 'scraper' to extract unique data from a PDF, and then try to associate it with a record in Zotero, but as far as I know, no one has done it

Other programs HAVE tackled the problem of using heuristics to identify non-standardized, human-readable text in PDFs too (a good example is the way CiteSeer parses bibliographies). This process benefits from having a database of other documents, so it may be better as a longer-term goal for the server.

In addition, if you have any older scanned PDFs which don't have a searchable text layer (and if full-text searching is really of value to you), you might want to find a friend with Adobe Acrobat Pro

gscan2pdf is a free/open-source alternative on Linux that can use OCR. See sybille's notes.

scot · June 5, 2008

Thanks, noksagt for the corrections. It's nice to know the situation is not as bleak as I thought.

srk · June 16, 2008

Would this workflow, er, work? Use CB2BIB (http://www.molspaces.com/cb2bib) to extract bibliographic information (in BibTeX format) from the PDF files and rename them, followed by something like JabRef (http://jabref.sourceforge.net/) to write the bibliographic data in XMP format directly into the PDF files?

Never tried it, so YMMV. Let everyone on the forum know if it works. This seems to be a common problem.

khazaei · March 3, 2010

It is possible to associate PDF files with their respective bibliographic information automatically.

Just see here:
http://www.zotero.org/support/retrieve_pdf_metadata

nmsalgueiro · April 2, 2010

I have the same problem. Zotero can't find the correct DOI for a lot of PDFs in my library, so it can't ask the resolver about its bibliographical data. What I would like to know is this:

If I manually find the DOI of a document, is there a way to say to Zotero "THIS is the DOI for THIS document, go find its metadata"? This is actually a very simple feature, but I can't manage to find it in this software. Thanks in advance for any help you may provide.

fcheslack · April 2, 2010

"Add Item by Identifier" (the magic wand icon) supports DOIs.

nmsalgueiro · April 2, 2010

fcheslack, thank you so much for that! I'm a newbie on this software, and the piece of info you gave me is invaluable. Again, thank you for such a quick answer.