Use a PDF's existing metadata

I posted something related a long while ago. There was little response, so here's a second try.
Wouldn't it be nice if Zotero had a JabRef-like PDF metadata extraction feature? (I think Mendeley does this as well; cb2bib does it.) What's that all about? Metadata (author, keywords, etc.) can be associated with a given PDF either by placing it in the "Document Information Dictionary" (that's old school) or by embedding it as an XMP stream. Either way, the metadata stays linked to the document. In JabRef, if you drag a PDF with metadata onto the program, it automatically extracts the author names and so on and creates a new bibliographic entry. With that, the Zotero feature "Retrieve metadata for PDF" would gain completely new value.
Cheers
Stefan
  • This would be a huge leap for Zotero. It would be a big item in the list "Why I use Zotero".
  • Actually, I may be off-track here, but wouldn't you need just the DOI?
  • edited July 12, 2010
    rquiroga - my sense is that Zotero already does what you're asking for: it looks for a DOI in a PDF and fetches the info when you use "Retrieve Metadata from PDF".
    skreisel wants to use (and/or write) tags in the PDFs themselves. That has been discussed before. I think one of the reasons it isn't done is that there is no uniform standard for bibliographic tags, so the DocInfo or XMP stream would be a bit ad hoc. But I'm not sure. Since this has been discussed before, if you want to get into it more, please search around and find the old thread so we don't go in circles.
  • There is no reason to start new threads. Bump the old thread with information if needed.

    Here are threads on the topic:
    http://forums.zotero.org/discussion/3079/importing-and-associating-pdf-files-with-references/
    http://forums.zotero.org/discussion/8635/rename-file-update-document-metadata/

    AFAIK, it is still the case that Zotero would require executables outside of Firefox to do this (as PDF text extraction and similar features currently do). Reading some types of metadata should be relatively easy without adding large dependencies. Writing metadata may require further development of an external tool that is both small and cross-platform.
  • edited July 12, 2010
    Thanks all for the comments.

    Just to get back to you:

    rquiroga - sure, if the PDF extraction, including the DOI, works properly, then Zotero retrieves metadata effectively.

    adamsmith - "I think one of the reasons this isn't done is that there is no uniform standard for bibliographical tags [...]"; while I'd agree, BibTeX is a good candidate - even Google Scholar outputs BibTeX. So if the PDF metadata complies with the BibTeX "standard" (there aren't a whole lot of fields, just the basics like author, keywords, etc.), why not use it?

    noksagt - Most of the threads discuss if and how metadata could be written to a PDF. That's an important topic, and I see the point that relying on external software would undermine Zotero's independence. I've "outsourced" the "Retrieve metadata for PDF" step, and the subsequent writing of metadata to the PDF, to cb2bib - it's simply much more flexible (e.g. it also looks up metadata in PubMed).

    So basically what it boils down to is: Could Zotero read PDF metadata (i.e. from the "Document Information Dictionary" or the XMP-stream) and use it to create new bibliographic items?

    Cheers
  • "Most of the threads go into discussing if and how metadata could be written to a PDF."
    The threads I linked discussed both reading and writing metadata. This thread has not really added anything that is not there. You, yourself, noted there were other threads on the topic of reading metadata.

    "Could Zotero read PDF metadata (i.e. from the "Document Information Dictionary" or the XMP-stream) and use it to create new bibliographic items?"
    As in past threads: pdfinfo (which Zotero already uses) can, indeed, read this information. In principle, it is possible. Someone just has to write the code to do this.
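
A minimal sketch of what reading this information could look like, assuming pdfinfo is on the PATH and prints its usual "Key: value" lines. The function names and the sample text are made up for illustration (only the Subject line is taken from this thread); this is not Zotero's actual code.

```python
import subprocess


def parse_pdfinfo_output(text):
    """Turn pdfinfo's 'Key: value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (dates, "doi:..." strings) are kept intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def read_pdf_info(pdf_path):
    """Run pdfinfo on a file and parse what it prints.

    Assumes the pdfinfo executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo_output(result.stdout)


# Illustrative sample in the shape pdfinfo prints; not output from a real file.
SAMPLE_OUTPUT = """Title:          An example title
Author:         A. Author
Subject:        Nature Reviews Neuroscience 10, 670 (2009). doi:10.1038/nrn2698
Pages:          12
"""

if __name__ == "__main__":
    for key, value in parse_pdfinfo_output(SAMPLE_OUTPUT).items():
        print(f"{key} = {value}")
```

Which of these fields would map onto which bibliographic fields is a separate question that the sketch leaves open.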
  • noksagt - Sorry for any inconvenience caused...
  • No worries. This is a somewhat frequently asked question. Ticket created:
    https://www.zotero.org/trac/ticket/1695
  • Does the info available to pdfinfo in PDFs from major sites actually contain useful info? I haven't checked recently, but from what I've seen in the past it often does not.
  • It is far from perfect, but it is improving. Elsevier has been including it since 2009 or so, for example.
  • Dan, noksagt: "Does the info available to pdfinfo in PDFs from major sites actually contain useful info?"
    I don't think so either, and one can't expect that there will ever be a consensus with thousands of journals around.
    However, as I pointed out, there are workarounds to force the relevant things into the PDF using e.g. either cb2bib or JabRef (both stick to BibTeX). Populating the "Document Information Dictionary" won't be enough, simply because it's limited to title, author, subject, keywords, some copyright information and date codes, so the XMP stream has to be used. I'm not sure whether xpdf's pdfinfo can read the latter in full. cb2bib uses ExifTool.
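
For illustration, here is a rough sketch of pulling a few Dublin Core fields out of an XMP packet with Python's standard library. It assumes the packet has already been extracted from the PDF as a string (that step is not shown), and the sample packet is made up. How cb2bib or JabRef lay out their BibTeX fields inside XMP is not covered here.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP for RDF containers and Dublin Core properties.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_dublin_core(xmp_packet, fields=("title", "creator", "subject")):
    """Collect the rdf:li values of selected dc:* properties into lists."""
    root = ET.fromstring(xmp_packet)
    result = {}
    for field in fields:
        values = []
        for prop in root.iter(f"{{{DC_NS}}}{field}"):
            # Values sit in rdf:li items inside an rdf:Alt/Seq/Bag container.
            for item in prop.iter(f"{{{RDF_NS}}}li"):
                if item.text and item.text.strip():
                    values.append(item.text.strip())
        if values:
            result[field] = values
    return result


# Made-up packet for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">An example title</rdf:li></rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>A. Author</rdf:li><rdf:li>B. Author</rdf:li></rdf:Seq>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

if __name__ == "__main__":
    print(read_dublin_core(SAMPLE_XMP))
```

Unlike the Document Information Dictionary, the XMP packet can carry multi-valued properties such as several creators, which is why the values come back as lists.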
  • Nature's publisher includes the DOI in a pdfinfo-accessible field in its journals, e.g.,

    Subject: Nature Reviews Neuroscience 10, 670 (2009). doi:10.1038/nrn2698

    yet these PDFs typically fail with Retrieve metadata :-( Yes, I can use Add item by identifier with this DOI manually, but that will not link the new item to the PDF in question.
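
A small sketch of how a DOI could be picked out of such a Subject field. The regular expression is a common heuristic rather than a complete DOI grammar, and `extract_doi` is a hypothetical helper, not something Zotero provides.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix that
# runs until whitespace, a quote or an angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def extract_doi(text):
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the surrounding sentence.
    return match.group(0).rstrip(".,;)")


if __name__ == "__main__":
    subject = "Nature Reviews Neuroscience 10, 670 (2009). doi:10.1038/nrn2698"
    print(extract_doi(subject))
```

Looking the DOI up and attaching the result to the existing PDF item would be the remaining step, which this sketch does not attempt.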