Get at that PDF metadata...

Hi.
I started a thread a couple of years ago -> https://forums.zotero.org/discussion/13415/take-pdfs-existing-metadata/
Clearly there has been a lot of development since then. However, functionality that would let the user take pre-populated metadata from "within" an attached PDF (i.e. the document information dictionary or an XMP stream) has not really been on the roadmap (as opposed to retrieving metadata from the full text via pdftotext).
See more recent posts:
-> https://forums.zotero.org/discussion/24831/extract-metadata-from-pdf-file-itself/
-> https://forums.zotero.org/discussion/25108/add-metadata-to-pdf/
This may partly be because the developers have correctly concluded that "there are no real bibliographic metadata standards for PDFs and so importing from XMP tags (the generic pdf metadata fields) would almost certainly be connected with rather substantial losses in data...", and perhaps also because Google Scholar has "grown up", enabling sensible full-text comparisons against that "gold-standard" reference set. Nonetheless, some of us go to great pains to add metadata to a PDF, "hard-wiring" that information to the document. Admittedly, everybody does it their own way, but I would guess that most put the author's name in the author field of the PDF's document information dictionary (DID) or follow "strict" BibTeX conventions, as cb2bib does.
As pointed out back then, pdfinfo (with the -meta switch) does a good job of getting at that information. Why not use it? As https://forums.zotero.org/discussion/21295/how-is-it-possible-to-append-metadata-to-a-pdf/ pointed out: "Zotero Retrieve Metadata" "doesn't even look at that data". Or is everybody waiting for pdf.js?
Cheers
Stefan
  • I believe this is mainly a question of effort. If you're interested in supplying a patch that would allow reading (or even writing?) of XMP metadata (probably as a fallback), I think devs would be glad to accept it.
    There are tons of things to do in Zotero; I just don't think this is a priority for any of the active devs, paid or volunteer.
  • Yup! If I had the ab(g)ility to do it - I would have done it ;)
    I looked at the repository today, with no prior idea of how Zotero works under the hood. The only sensible place I found a reference to pdfinfo is in fulltext.js. This may be too much of a dev thing: I didn't see an "easy" entry point. Easy for me (and I shouldn't promise too much) would be: have pdfinfo (or e.g. exiftool, which works even better) write out a text file and have the information in it picked up. Would that then fall under the topic "translator", bypassing any "real" programming? The information stored in the PDFs is "pure" BibTeX. An ideal workflow would need to somehow trigger the creation of an item object with the PDF attached, run pdfinfo (or exiftool), have e.g. the BibTeX translator read the generated text file, and then populate the item object. Sounds simple...
    Cheers
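
To make the "have pdfinfo write out text and pick the information up" step concrete, here is a minimal Python sketch. It is not how Zotero does anything; it only assumes that `pdfinfo` is on the PATH and prints the document information dictionary as `Key: value` lines. The function names and the sample text are made up for illustration.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key:   value' lines into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        # Split at the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def read_info_dictionary(pdf_path):
    """Run pdfinfo on a file and parse its output.

    Assumes the pdfinfo binary is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


# Invented sample in the shape pdfinfo prints, so the parser can be tried
# without a PDF at hand.
SAMPLE = """Title:          An Example Article
Author:         A. Author
Creator:        LaTeX with hyperref
Pages:          12
"""

if __name__ == "__main__":
    print(parse_pdfinfo(SAMPLE))
```

Turning the resulting dict into a Zotero item is the part that still needs "real" programming; this only covers reading the fields.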
  • well, the entry point would be
    https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/recognizePDF.js
    and you'd have to add another fallback to Zotero_RecognizePDF
    there that then calls pdfinfo (which is indeed in fulltext.js) and checks and reads out the XMP fields.
    But no, I don't think there is any way to do this easily without real programming. Certainly not in a way that would be acceptable as a patch into Zotero. You can't build the whole thing as a translator, no, translators can't, e.g., use the pdf tools. The reverse is possible - you can call on a translator from recognizePDF - but anything that assumes bibtex in XMP fields will almost certainly not be accepted into Zotero. While it may make sense for you & your set-up, it doesn't make sense as a general solution.

    You can take a look, but to be honest, if you have no prior experience coding, I don't think you will be able to do it in a reasonable time-frame. If you do decide to work on this, any further questions should go to the zotero-dev listserv.
    https://groups.google.com/forum/?fromgroups=#!forum/zotero-dev
    And as I say above - don't expect any solution that's not generally applicable to be accepted into Zotero (though you could of course personally use it or - since you'd likely only change one file - pretty easily write a plugin.)
  • I guess "Zotero Retrieve Metadata" "doesn't even look at that data" is absolutely correct - I did not see a reference to anything non-fulltext in the code.
    I didn't quite understand "but anything that assumes bibtex in XMP fields will almost certainly not be accepted into Zotero". If I parse the XMP metadata into a text file, simply copy it, and then use "Import from clipboard", that perfectly populates the item's fields: author goes to author, abstract to abstract, etc.
    Thanks!
    Stefan
  • I didn't quite understand "but anything that assumes bibtex in XMP fields will almost certainly not be accepted into Zotero".
    I mean if you were to propose such a solution as a patch, Zotero devs wouldn't accept it (or at least I'm pretty sure they wouldn't).
  • As you've probably already seen, I implemented basic XMP support using pdf.js a while ago in the pdfjs branch of my personal fork (https://github.com/simonster/zotero/tree/pdfjs/). The main reason this didn't make it into the main Zotero tree is that the pdf.js-based text extraction didn't work too well. I haven't checked recently to see if that's changed, but even if text extraction still sucks, nothing would prevent us from using the XMP code from that branch.

    The other reason this didn't make it into the main Zotero tree is that it's not necessarily clear what we should do with the XMP metadata once we have it. We need to make sure that we can read major publishers' metadata and develop heuristics for when we think the XMP metadata is bad or likely to be wrong so that we can do a Google Scholar lookup in those cases. This is something that I want to do, but at the moment the value doesn't seem so great.

    In any case, embedding BibTeX as XMP metadata in a PDF seems kind of wrong. XMP is XML-based (and usually RDF-based). BibTeX requires a parser of its own, it isn't very expressive, and it lacks fields for common metadata (e.g. DOIs). Do any major publishers embed BibTeX into their PDFs? Most of the PDFs I have don't have any embedded metadata, but the ones that do have RDF.
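
As an illustration of the "XML-based (and usually RDF-based)" point, here is a small Python sketch that reads a few Dublin Core fields from an XMP packet with the standard library. The sample packet is invented, and the choice of fields (`dc:title`, `dc:creator`, `dc:identifier`) is an assumption about what a publisher might embed; this is not the code from the pdfjs branch.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented example packet; real packets vary a lot between producers.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">An Example Article</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li><rdf:li>B. Writer</rdf:li></rdf:Seq></dc:creator>
      <dc:identifier>doi:10.1234/example.5678</dc:identifier>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_dublin_core(xmp_text):
    """Collect title, creators and identifiers from an XMP packet."""
    root = ET.fromstring(xmp_text)

    def values(tag):
        found = []
        for el in root.iter("{%s}%s" % (NS["dc"], tag)):
            items = el.findall(".//rdf:li", NS)
            if items:
                # Values wrapped in rdf:Alt / rdf:Seq / rdf:Bag containers.
                found.extend((li.text or "").strip() for li in items)
            elif el.text and el.text.strip():
                # Plain text value directly inside the element.
                found.append(el.text.strip())
        return found

    titles = values("title")
    return {
        "title": titles[0] if titles else None,
        "creators": values("creator"),
        "identifiers": values("identifier"),
    }


if __name__ == "__main__":
    print(read_dublin_core(SAMPLE_XMP))
```

Deciding whether the values are trustworthy is a separate problem that a parser like this does not address.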
  • no, I had missed that - thanks
  • in terms of low-hanging fruit: If we find a DOI in the metadata we could get CrossRef data & likely save an additional number of GS requests. If I understand your code correctly, you already do this in https://github.com/simonster/zotero/blob/pdfjs/chrome/content/zotero/recognizePDF.js#L272
    but given the quality & lack of restrictions on the CrossRef API, I'd actually reverse the order of this, i.e. use it in any case.

    That doesn't address skreisel's request at all, but would still be neat.
    As I say above, given the current state of XMP, my guess is that it is too rarely good, so we should use it as a fallback only.
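
A rough Python sketch of the DOI idea: look for a DOI in whatever metadata text is available, build a CrossRef lookup URL if one is found, and otherwise fall through to the existing full-text route. The regular expression is a common approximation rather than a complete DOI grammar, `choose_lookup` and its return values are hypothetical, and no request is actually sent here.

```python
import re
from urllib.parse import quote

# Approximate DOI pattern: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def find_doi(text):
    """Return the first DOI-looking string in text, or None."""
    match = DOI_PATTERN.search(text)
    if not match:
        return None
    # Drop punctuation that usually belongs to the surrounding sentence.
    return match.group(0).rstrip(".,;)")


def crossref_url(doi):
    """Build a CrossRef REST API URL for a DOI (the request itself is not made)."""
    return "https://api.crossref.org/works/" + quote(doi)


def choose_lookup(metadata_text):
    """Prefer CrossRef when a DOI is present, else signal the full-text route.

    "fulltext" is only a placeholder label for the existing lookup path.
    """
    doi = find_doi(metadata_text)
    if doi:
        return ("crossref", crossref_url(doi))
    return ("fulltext", None)


if __name__ == "__main__":
    print(choose_lookup("dc:identifier = doi:10.1234/example.5678"))
    print(choose_lookup("Title: An Example Article"))
```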
  • This is sort of stealing the fruit (to stay in the picture) of thy neighbor, or just plain advertisement: take a look at Per Constans' cb2bib (that's what I use for PDF metadata) -> http://www.molspaces.com/cb2bib/
    He's written something that resembles the way Zotero gets at the essential data before querying Google Scholar, except that it can also use other databases, most notably PubMed (works ~90% of the time). The metadata is then pulled from those sources and added as BibTeX-compliant fields in XMP (http://www.molspaces.com/d_cb2bib-metadata.php).
    So in my case I'll have the PMID (and/or DOI) there already - it doesn't solve the much more generic case Simon pointed out -> "We need to make sure that we can read major publishers' metadata and develop heuristics for when we think the XMP metadata is bad or likely to be wrong so that we can do a Google Scholar lookup in those cases." I actually go to pains cleansing the pdf of that data before I send it to cb2bib...
    @Simon: So if I'd use your fork, I'd get it to work off the shelf ;)