Importing metadata from PDF

Jim Offord · December 27, 2006

I have a large number of PDF files that I have downloaded over the last several years. Is there a way that I can get the bibliographic information out of these files, into Zotero, and then link or store the PDF with that information?

dancohen · December 28, 2006

Currently there's no automatic way of doing this (and it would be hard to implement, since PDFs don't have a common place where citation information is stored). You could grab the citation information automatically by finding these articles (alas, probably one at a time) through any of Zotero's supported sites, such as Google Scholar or JSTOR, and then attaching your PDFs to those citations. You could also manually create items (under the green "plus" button) for each of your PDFs, cut and paste the citation information into them from the PDFs, and then attach the PDFs to the items. Unfortunately there's no perfect solution here for legacy collections that aren't associated with a bibliographic tool like EndNote or BibTeX.

elwood151 · January 8, 2007

Hi Jim,

I recently found the free tool ** CB2BIB ** (available for Windows, MacOS and UNIX), which might help you extract metadata from PDF files.
-> http://www.molspaces.com/cb2bib/

Kind regards

Martin

Jim Offord · January 10, 2007

Martin:

Thank you for the reference to CB2BIB. It is unclear to me from the documentation how to use this program. Does it use data from the clipboard?

Do you know if there is a way to do large numbers of PDF files at once using this program?

elwood151 · January 28, 2007

Hi Jim,

I haven't used this program "seriously" yet, I've just played around with it a little.

Did you read this "usage information" here:
http://www.molspaces.com/d_cb2bib-overview.php#usage ?

Isn't the following what you're looking for?
(Quote)
"Multiple retrieving from PDF files
Multiple PDF or convertible to text files can be sequentially processed by dragging a set of files into cb2Bib's PDFImport window. By starting the processing button, files are sequentially converted to text and send to cb2Bib clipboard panel for reference extraction. If the automatic recognition fails, the process pauses and allows for cb2Bib manual extraction. Alternatively, if automatic recognition succeeds references are optionally saved and next file is processed. See Configuring PDFImport section for setting your to text converter."

Sorry for not being able to be more concrete at this time,

Kind regards

Martin

jefelino · September 16, 2007

This isn't just an issue for legacy collections; it is very common to obtain PDFs from sources other than their original database (e.g. colleagues' emails, course websites), and it would be a major inconvenience to track down the original source, use a third-party tool, or manually enter the metadata every time.

It's true that PDFs don't have a common format for storing bibliographic information, but PDFs that came from particular databases often do come with standard cover pages. A good solution would be a family of "translators", like those used for webpages, that can extract this information. Another good solution would be to fully integrate a tool like cb2Bib. It would be a very large boost to Zotero's usefulness if I could drag one or several PDFs into the Zotero panel and have their metadata automatically extracted.

scot · September 16, 2007

Perhaps, but the number of databases is dizzying, and 'no standard way of presenting the data' really does call for serious heuristics (and tentative results). Not to shoot down the idea, and there may in fact be a few databases with well-structured cover pages and lots of items out there that would justify the effort (JSTOR?), but it would be pretty hard.

And you know, of course, that you don't need to track a PDF down to its original source if it is a PDF of a paper-journal article. If you have a good discipline-specific database for your area(s) which Zotero supports, or if Google Scholar covers your area sufficiently well, you can just get the metadata from there, import it quickly into Zotero, and simply drag the PDF onto your new Zotero entry. I did a bunch of this last week, and the easy cases were very easy. It's just a matter of typing a few unique words from the author or title fields into the database, importing the metadata for a batch of articles, and then dragging the PDFs from the filesystem onto their newly created entries. Zotero does the rest. It helps if you have a file manager (like XYplorer for Windows) which has a PDF preview function, as well as plenty of screen space so you can see your PDF cover page, your database, your list of PDFs to import and your Zotero list all at the same time.

The hard cases were hard, and time consuming, but you could hardly avoid that with even a good set of translators, it seems to me. Things that would have any chance of being easy to construct a translator for were (in my case) from easily traceable sources, and therefore pretty quick to look up. Odd conference papers, or one-off journal articles would make for manual work anyway. In the end I found it easiest just to let Zotero import all the easy ones, and type the rest in by hand.

You're right. If you really have to hunt, it's not worth the time. If you have to look in more than one place for the metadata, it's faster just to type it in. And perhaps you'd find it less arduous than it sounds. Zotero is reasonably quick for the fingers (SHIFT-CTRL-N, first letter of item type (J), TAB, title, TAB, Author Surname, TAB, First name, TAB, etc.). Add to that the fact that automatic data sources (at least for my field) almost always have some typographic oddity which needs manual correcting. I started to think that manually typing everything wouldn't be all that bad. Of course that all depends on whether the collection has real value for you. You won't be too motivated to manually enter data for things you don't see yourself really needing metadata for.

arnegj · February 14, 2008

Just an idea on this feature. For newer PDF files that have the DOI on the front page, the DOI could be extracted (for instance using pdftotext and a regular expression); it then points to the webpage, which gives the metadata. Although this would fail for older PDFs, it would be highly useful to be able to browse for a PDF folder from Zotero and import metadata for all PDFs with a DOI. It would also be a very useful feature for a Thunderbird plugin.
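
The DOI-extraction idea can be sketched roughly as follows. This is only an illustration, not how Zotero works: it assumes the `pdftotext` command-line tool (from Poppler or Xpdf) is installed, the function names `first_page_text` and `find_doi` are made up for this example, and the regular expression is a common heuristic that will miss some unusual or older DOIs.

```python
import re
import subprocess

# Heuristic pattern for modern DOIs: "10.", a registrant code, "/", a suffix.
# This is an assumption for illustration; some older DOIs will not match.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def first_page_text(pdf_path):
    """Return the text of page 1, using the external pdftotext tool."""
    result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;:)")
```

Calling `find_doi(first_page_text(path))` for each file in a folder would give the DOIs that could then be looked up. PDFs without a DOI on the first page return `None` and would still need manual handling.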

khazaei · March 3, 2010

I recently realized there is a way to do this.
Put all of your PDF files in one folder, or in several classified folders, and then use this free software:
http://www.mendeley.com/

In Mendeley, go to the File menu and choose "Add Folder" (here you point to the folder or folders that contain your PDFs). Mendeley imports all the bibliographic information it finds in your PDF files.

The next step is to export your library as RIS and import it into Zotero.

adamsmith · March 3, 2010

Actually, Zotero has a very similar feature - it just wasn't available 2 years ago, when the thread was last active.
Select the PDFs, right-click and choose "Retrieve Metadata". Last I heard, Mendeley's feature worked a little better, and they have some clever ideas for using additional data (including user-provided data), but Zotero does use Google Scholar results as well as DOIs on the first page to get metadata, and that works in a large majority of cases.

khazaei · March 3, 2010

Thank you Adamsmith,
It is great, so we do not need Mendeley any more.
http://www.zotero.org/support/retrieve_pdf_metadata

thomasmid · April 9, 2011

I think this comment might fit here:

I have 500ish PDFs on my computer and a comparable amount of bib data (one record for each PDF) stored in Zotero. They are not associated yet, but I would like to associate them for a new organizational schema.

Is it possible to automate the process of right-clicking for "add attachment" and then "attach stored copy of file" for all of them? [For example: bulk upload the PDFs as items to Zotero and then ask it to retrieve the metadata from my Zotero bibs?]

Simon · April 9, 2011

On Windows or Mac OS X, you should be able to drag the PDFs onto the references in Zotero, which will do the same thing. This will still require going through each individually, but should be significantly faster than right-clicking and selecting in the open dialog.
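
To make the dragging go faster, a small script could suggest which PDF belongs to which reference by comparing file names with titles. The sketch below is a hypothetical helper that runs outside Zotero and does not touch the Zotero library. It assumes you already have a list of titles (for example, from an exported bibliography) and that the file names resemble the titles; `normalize` and `suggest_matches` are names invented for this example.

```python
import difflib
import re


def normalize(name):
    """Lowercase, drop a .pdf extension, and reduce punctuation to spaces."""
    name = re.sub(r"\.pdf$", "", name.lower())
    return re.sub(r"[^a-z0-9]+", " ", name).strip()


def suggest_matches(pdf_names, titles, cutoff=0.6):
    """Map each PDF file name to its most similar title, or None."""
    by_normalized = {normalize(t): t for t in titles}
    suggestions = {}
    for pdf in pdf_names:
        close = difflib.get_close_matches(
            normalize(pdf), list(by_normalized), n=1, cutoff=cutoff
        )
        suggestions[pdf] = by_normalized[close[0]] if close else None
    return suggestions
```

The suggestions are only guesses based on string similarity, so each pairing should still be checked by eye before dragging the file onto the reference.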

joshrbaxter · July 7, 2012

I have all my PDFs in a single folder in my Google Drive (which is pretty nice; the web interface has really nice search tools). Is there a way to link my PDFs with my Zotero references so that it will work on multiple machines that may have different directory paths?

adamsmith · July 7, 2012

Josh - that's unrelated to this thread, please start a new one.

Jejecks · November 3, 2013

I'm having a huge problem.
I have no idea why, but I can't import or collect references from PDF documents!!!

What do I have to do?

adamsmith · November 3, 2013

Start a new thread - using the red "Start a new discussion" button at the top left of this page - and provide some detail about what you're trying to achieve, what you're currently doing, and how it's failing.

godblessfq · January 9, 2014

Would it be a good idea to retrieve the metadata automatically when PDF files are dropped on Zotero? It is tedious when I need to get the metadata of PDF files already in the library: I have to select them, right-click and select "find metadata".
I want the metadata because I have some PDF files with names that are unrelated to their content.
Thank you very much!

dstillman · January 9, 2014

Please start a new thread. I'm closing this one, which is ancient.