Get at that PDF metadata...
Hi.
I started a thread a couple of years ago -> https://forums.zotero.org/discussion/13415/take-pdfs-existing-metadata/
Clearly there has been a lot of development since then. However, functionality that would let the user take pre-populated metadata from "within" an attached pdf (i.e. the document information dictionary or an XMP stream) has not really been on the roadmap (as opposed to retrieving metadata from the full text via pdftotext).
See more recent posts:
-> https://forums.zotero.org/discussion/24831/extract-metadata-from-pdf-file-itself/
-> https://forums.zotero.org/discussion/25108/add-metadata-to-pdf/
This may partly be because the developers have correctly concluded that "there are no real bibliographic metadata standards for PDFs and so importing from XMP tags (the generic pdf metadata fields) would almost certainly be connected with rather substantial losses in data...", and perhaps also because Google Scholar has "grown up", enabling sensible full-text comparisons with that "gold-standard" reference set. Nonetheless, some of us go to pains to add metadata to a pdf, "hard-wiring" that information to the document. Admittedly, everybody does it their own way, but I would guess that most put the author's name in the author field of the pdf's document information dictionary (DID) or follow "strict" bibtex standards as cb2bib does.
As pointed out back then, pdfinfo does a good job of getting at this information (using the -meta switch). Why not use it? As https://forums.zotero.org/discussion/21295/how-is-it-possible-to-append-metadata-to-a-pdf/ pointed out, Zotero's "Retrieve Metadata" "doesn't even look at that data". Or is everybody waiting for pdf.js?
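To make the pdfinfo idea concrete, here is a minimal Python sketch of calling it and reading the document information dictionary. It assumes pdfinfo is on the PATH and that its default output is a list of "Key: value" lines. The function names `run_pdfinfo` and `parse_info_lines` are made up for illustration and are not part of Zotero. The output of -meta is returned raw, since its exact layout may differ between pdfinfo builds.

```python
import subprocess


def run_pdfinfo(pdf_path: str, meta: bool = False) -> str:
    """Run pdfinfo on a file and return its raw text output.

    Assumes a pdfinfo binary is installed and on the PATH.
    With meta=True the -meta switch is added to request the XMP stream.
    """
    cmd = ["pdfinfo"]
    if meta:
        cmd.append("-meta")
    cmd.append(pdf_path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_info_lines(text: str) -> dict:
    """Parse 'Key: value' lines (assumed output format) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as timestamps survive.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info
```

Something like `parse_info_lines(run_pdfinfo("paper.pdf"))` would then give a dict with whatever keys the file carries (Title, Author, and so on), if any.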
Cheers
Stefan
There are tons of things to do in Zotero; I just don't think this is a priority for any of the active devs, paid or volunteer.
I looked at the repository today, with no prior idea of how Zotero works under the hood. The only sensible place I found a reference to pdfinfo is in fulltext.js. This may be too much of a dev thing: I didn't see an "easy" entry point. Easy for me (I shouldn't promise too much) would be: have pdfinfo (or e.g. exiftool, which works even better) write a text file and have the information in it picked up. Would that then go under the topic "translator", bypassing any "real" programming? The information stored in the PDFs is "pure" bibtex. An ideal workflow would need to somehow trigger the creation of an item object with the pdf attached, trigger pdfinfo (or exiftool), have e.g. the bibtex translator read the generated text file and then populate the item object. Sounds simple...
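For the "pick up the information" step, a rough Python sketch of reading a few fields from an XMP packet once it has been extracted as text. This only covers the generic Dublin Core properties (dc:title, dc:creator, dc:description) and says nothing about how cb2bib lays out its bibtex fields. The function name `read_dublin_core` is hypothetical and this is not how Zotero does anything.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs used by XMP for RDF and Dublin Core.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_dublin_core(xmp_packet: str) -> dict:
    """Collect dc:title, dc:creator and dc:description values from XMP text."""
    root = ET.fromstring(xmp_packet)
    fields = {}
    for name in ("title", "creator", "description"):
        values = []
        for prop in root.iter("{%s}%s" % (DC_NS, name)):
            # Values usually sit in rdf:li items inside rdf:Alt or rdf:Seq.
            items = [li.text.strip()
                     for li in prop.iter("{%s}li" % RDF_NS)
                     if li.text and li.text.strip()]
            if items:
                values.extend(items)
            elif prop.text and prop.text.strip():
                # Fall back to plain text directly inside the property.
                values.append(prop.text.strip())
        if values:
            fields[name] = values
    return fields
```

The result is a dict of lists, e.g. one entry per author under "creator", which a later step could map onto item fields.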
Cheers
https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/recognizePDF.js
and you'd have to add another fallback to Zotero_RecognizePDF there that then calls pdfinfo (which is indeed in fulltext.js) and checks and reads out the XMP fields.
But no, I don't think there is any way to do this easily without real programming, and certainly not in a way that would be acceptable as a patch to Zotero. You can't build the whole thing as a translator, because translators can't, e.g., use the pdf tools. The reverse is possible - you can call a translator from recognizePDF - but anything that assumes bibtex in XMP fields will almost certainly not be accepted into Zotero. While it may make sense for you & your set-up, it doesn't make sense as a general solution.
You can take a look, but to be honest, if you have no prior coding experience, I don't think you will be able to do it in a reasonable time-frame. If you do decide to work on this, any further questions should go to the zotero-dev listserv.
https://groups.google.com/forum/?fromgroups=#!forum/zotero-dev
And as I say above - don't expect any solution that's not generally applicable to be accepted into Zotero (though you could of course personally use it or - since you'd likely only change one file - pretty easily write a plugin.)
I didn't quite understand "but anything that assumes bibtex in XMP fields will almost certainly not be accepted into Zotero". If I parse the XMP metadata into a text file, simply copy it and then use "Import from clipboard", that perfectly populates the item's fields - author goes to author, abstract to abstract, etc.
Thanks!
Stefan
The other reason this didn't make it into the main Zotero tree is that it's not necessarily clear what we should do with the XMP metadata once we have it. We need to make sure that we can read major publishers' metadata and develop heuristics for when we think the XMP metadata is bad or likely to be wrong so that we can do a Google Scholar lookup in those cases. This is something that I want to do, but at the moment the value doesn't seem so great.
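To give a flavour of what such heuristics might look like, here is a very rough Python sketch. The checks (minimum title length, file-name-like titles, missing authors) are assumptions picked for illustration, not rules anyone has agreed on. The names `looks_usable` and `choose_source` are hypothetical.

```python
import re

# Titles that are really file names left behind by the authoring tool.
FILENAME_LIKE = re.compile(r"\.(pdf|docx?|tex|indd|qxd)$", re.IGNORECASE)


def looks_usable(meta: dict) -> bool:
    """Guess whether embedded metadata is worth trusting (illustrative rules)."""
    title = (meta.get("title") or "").strip()
    authors = [a for a in meta.get("authors", []) if a and a.strip()]
    if len(title) < 10 or not authors:
        return False
    if FILENAME_LIKE.search(title):
        return False
    if title.lower().startswith("microsoft word"):
        return False
    return True


def choose_source(meta: dict) -> str:
    """Use embedded metadata when it looks sane, else fall back to a lookup."""
    return "embedded" if looks_usable(meta) else "lookup"
```

In practice the rules would have to be tuned against what major publishers actually embed, which is exactly the open question here.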
In any case, embedding BibTeX as XMP metadata in a PDF seems kind of wrong. XMP is XML-based (and usually RDF-based). BibTeX requires a parser of its own, it isn't very expressive, and it lacks fields for common metadata (e.g. DOIs). Do any major publishers embed BibTeX in their PDFs? Most of the PDFs I have don't have any embedded metadata, but the ones that do have RDF.
but given the quality & lack of restrictions on the CrossRef API, I'd actually reverse the order of this, i.e. use it in any case.
That doesn't address skreisel's request at all, but would still be neat.
As I say above, given the current state of XMP, my guess is that it is too rarely good, so we should use it as a fallback only.
He's written something that resembles the way Zotero gets at the essential data before querying Google Scholar - except that there's a way to use other databases, most notably PubMed (works ~90% of the time). The metadata is then pulled from those sources and added as bibtex-compliant fields in XMP (http://www.molspaces.com/d_cb2bib-metadata.php).
So in my case I'll have the PMID (and/or DOI) there already - it doesn't solve the much more generic case Simon pointed out -> "We need to make sure that we can read major publishers' metadata and develop heuristics for when we think the XMP metadata is bad or likely to be wrong so that we can do a Google Scholar lookup in those cases." I actually go to pains cleansing the pdf of that data before I send it to cb2bib...
@Simon: So if I used your fork, I'd get it to work off the shelf ;)