DOI not recognized in pdf metadata if file otherwise contains no OCR
I have a large batch of old PDF scans. Zotero could not automatically retrieve the metadata for any of them.
I managed to match each paper to its DOI automatically (the filename contained the reference) and wrote the DOI into the metadata of the PDF file itself, in two ways:
1) add the DOI as the Subject
2) add a custom field called DOI where I again wrote the DOI
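For anyone who wants to reproduce this step, here is a minimal sketch, assuming Python with the third-party pypdf library (the post does not say which tool was actually used). The function names `doi_metadata` and `write_doi` are made up for illustration, and the DOI pattern is a common approximation that may not cover every DOI.

```python
import re

# Approximate pattern for modern DOIs; an assumption, not an exhaustive rule.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def doi_metadata(doi):
    """Build the two metadata entries described above: Subject and a custom DOI field."""
    if not DOI_PATTERN.fullmatch(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return {"/Subject": doi, "/DOI": doi}


def write_doi(src_path, dst_path, doi):
    """Copy the pages of src_path into dst_path and attach the DOI metadata.

    Hypothetical helper. It copies pages only, so existing metadata
    entries of the source file are not carried over in this sketch.
    """
    # Imported here so the rest of the sketch works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(doi_metadata(doi))
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Calling `write_doi("scan.pdf", "scan_with_doi.pdf", "10.1000/xyz123")` (placeholder file names and DOI) would produce a copy whose Subject and custom DOI fields both hold the DOI.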
Acrobat Reader properly displays both occurrences of the DOI.
In the PDF files that had OCR data, the DOI was properly extracted from the PDF metadata.
But in the files that had no OCR data, Zotero still could not recognize the DOI. Essentially, it complained that the PDF does not contain OCR text. That is a pity: the DOI is clearly documented in the PDF metadata, yet Zotero won't even look for it.
(As a workaround, I ended up writing the DOI onto the pages of the PDF. Then the file did contain some text, and it all worked.)
We do have an open ticket to also check the filename, but due to limitations on filenames, that relies on various character substitutions, so I expect that to also be useful solely for people who make it to the forums or documentation and manually assign DOIs according to the documented convention. That's a pretty small audience.
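To illustrate why character substitutions come into it: a DOI contains "/" and may contain ":" or other characters that are not allowed in filenames on common systems. The sketch below uses percent-encoding from the Python standard library as one possible reversible substitution. This is an assumption for illustration only, not the convention from the ticket, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote


def doi_to_filename(doi):
    """Encode a DOI so it is safe to use as a PDF filename (hypothetical scheme)."""
    # safe="" also encodes "/", which quote() would otherwise leave alone.
    return quote(doi, safe="") + ".pdf"


def filename_to_doi(name):
    """Reverse doi_to_filename: strip a .pdf extension and decode the rest."""
    stem = name[:-4] if name.lower().endswith(".pdf") else name
    return unquote(stem)
```

Any scheme of this kind only helps if whoever names the files applies exactly the same substitutions, which is the audience problem described above.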
You can, of course, OCR the PDF yourself outside of Zotero, in which case Zotero may be able to recognize it. There's also an OCR plugin for Zotero, though that may be harder to set up than an external tool you already have.
But if there's no text in the PDF, there's just not a good automated solution here.
[Edit: What adamsmith said.]
It's possible there wouldn't actually be any false positives from blindly using identifiers in metadata, but that's not something we've evaluated.
Indeed, without any OCR, one would have to rely fully on the metadata DOI being correct. Many of the other data fields in the PDF were pretty meaningless, so I can see why there is little trust there.
Still, Zotero is happy to accept the DOI from the PDF metadata as long as it also finds the DOI printed in the text (i.e. even with no readable data about author, title, etc.). But that's probably fine. Why would the printed DOI be wrong?