Importing and associating PDF files with references
Hi,
I'm just starting out with Zotero.
I have a collection of hundreds of PDF files from various journals, and would like to associate those files with their respective bibliographic information. Is there an automatic way to do this, or am I stuck manually adding the PDF file, and dragging it to the appropriate bibliographic item?
Thanks,
-Ben
Cheers,
Gal
Unfortunately there is no way to do this automatically, for the simple reason that (unless I'm wrong) there is no consistent standard for storing metadata or document IDs within PDFs. Sure, some individual article databases may have patterns, putting citation info in a human-readable format on the first or last page, for example. But (1) this has generally not been done with machine-readability in mind, so it's tricky to extract, and (2) every publisher seems to do it differently. It would not be impossible to write a 'scraper' to extract unique data from a PDF and then try to associate it with a record in Zotero, but as far as I know, no one has done it, and there would likely have to be a different scraper for each source (database or publisher) of PDFs. As I said, not impossible, but I wouldn't hold your breath.
If you do drag and drop, it helps a lot to have two monitors, and to set things up so that you can preview the PDF and still see its filename in a file manager (for dragging). You can also keep Zotero's advanced search box open (to search for authors' names, for example), and if I remember right, you can drag straight to it. It's mildly painful, but only for the time it takes. I can probably do 20 an hour, more if all the items are in my database already.
In addition, if you have any older scanned PDFs that don't have a searchable text layer (and if full-text searching is really of value to you), you might want to find a friend with Adobe Acrobat Pro and run your PDFs through its OCR function, which can apparently be automated. Find the procedure here:
http://www.acrobatusers.com/forums/aucbb/viewtopic.php?id=14400
Cheers.
There are other programs that can interact with PDFs directly. A popular recipe is to search the PDF for a string that looks like an identifier (DOI, PMID, arXiv ID, etc.) and then look up the information for that match. Obviously, this doesn't work with EVERY PDF, but it works with many of them. Other programs HAVE also tackled the problem of using heuristics to identify non-standardized, human-readable text in PDFs (a good example is the way CiteSeer parses bibliographies). This process benefits from having a database of other documents, so it may be better as a longer-term goal for the server. gscan2pdf is a free/open source alternative on Linux that can use OCR. See sybille's notes.
Never tried it, so YMMV. Let everyone on the forum know if it works. This seems to be a common problem.
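For anyone who wants to experiment with the identifier-search recipe mentioned above, here is a rough sketch in Python. It is not how Zotero or any of the programs named in this thread work internally. It assumes you have already extracted the text of the PDF's first page or two by some other means (for example with a tool such as pdftotext), and the regular expressions and the function name find_identifiers are my own loose guesses at what these identifiers usually look like, not official grammars. Looking up the metadata for a match is left out.

```python
import re

# Loose heuristics (assumptions, not official identifier grammars).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+', re.IGNORECASE)
ARXIV_RE = re.compile(r'\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)', re.IGNORECASE)
PMID_RE = re.compile(r'\bPMID:?\s*(\d{1,8})\b', re.IGNORECASE)


def find_identifiers(text):
    """Return the first DOI, arXiv ID and PMID found in text (None if absent).

    `text` is assumed to be plain text already extracted from a PDF.
    """
    found = {"doi": None, "arxiv": None, "pmid": None}

    match = DOI_RE.search(text)
    if match:
        # Sentence punctuation often gets glued to the end of a DOI.
        found["doi"] = match.group(0).rstrip(".,;)")

    match = ARXIV_RE.search(text)
    if match:
        found["arxiv"] = match.group(1)

    match = PMID_RE.search(text)
    if match:
        found["pmid"] = match.group(1)

    return found


# Made-up sample text standing in for an extracted first page.
sample = ("J. Example Stud. 12 (2009) 34-56. "
          "doi:10.1234/example.2009.001. PMID: 12345678")
print(find_identifiers(sample))
```

As said above, this will miss plenty of PDFs (scans without a text layer, older articles with no identifier printed on them), so treat a None result as "check by hand" rather than "no identifier exists".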
Just see here:
http://www.zotero.org/support/retrieve_pdf_metadata
If I manually find the DOI of a document, is there a way to say to Zotero "THIS is the DOI for THIS document, go find its metadata"? This is actually a very simple feature, but I can't manage to find it in this software. Thanks in advance for any help you may provide.