DOI not recognized in pdf metadata if file otherwise contains no OCR

rfolmer · May 24, 2021

I have a large batch of old PDF scans. Zotero could not automatically retrieve metadata for any of them.
I managed to match each paper to its DOI automatically (the filename contained the reference) and wrote the DOI into the PDF file's own metadata, in two ways:
1) as the Subject
2) as a custom field called DOI
Acrobat Reader displays the DOI correctly in both places.

In the PDF files that had OCR data, the DOI was properly extracted from the PDF metadata.
But in the files that had no OCR data, Zotero still could not recognize the DOI. It simply complained that the PDF does not contain OCR text. That is a pity: the DOI is clearly documented in the PDF metadata, yet Zotero won't even look for it.

(As a workaround, I ended up writing the DOI onto the pages of the PDF. Then there clearly was some OCR text, and it all worked.)
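
To make the request concrete, here is a minimal sketch of what looking for a DOI in the two metadata fields described above could amount to. It is not Zotero's code. It assumes the PDF's document information has already been read into a plain dictionary by some PDF library, and the key names `/DOI` and `/Subject`, the helper name `find_doi_in_metadata`, and the sample values are all illustrative assumptions.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
# This is a pragmatic pattern, not a full validator of every registered DOI.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def find_doi_in_metadata(metadata, keys=("/DOI", "/Subject")):
    """Return the first DOI-like string found in the given fields, or None."""
    for key in keys:
        value = metadata.get(key) or ""
        match = DOI_PATTERN.search(value)
        if match:
            # Trailing punctuation is assumed to be sentence debris.
            return match.group(0).rstrip(".,;")
    return None


# Hypothetical metadata, shaped like the two fields described above.
info = {
    "/Title": "scan0042",
    "/Subject": "doi:10.1000/xyz123",
    "/DOI": "10.1000/xyz123",
}
print(find_doi_in_metadata(info))
```

The custom field is checked before the Subject here only because it is the less ambiguous of the two; a Subject field can hold arbitrary text.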

adamsmith · May 24, 2021

Zotero doesn't look in the PDF metadata at all because it tends to be of little value. The only reason there were DOIs in your PDFs is that you explicitly wrote them there, which isn't exactly a common use case.

dstillman · May 24, 2021

PDF metadata is generally very low quality, so Zotero doesn't even look at it. We could consider looking for just a DOI. If we could validate it against text from the PDF, there's no real downside. I'm less sure about just trusting a DOI with no OCRed text. But either way, bear in mind that you added the DOI yourself, so it's not even a general solution that would help most people.
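
As a rough illustration of the "validate it against text from the PDF" idea, the sketch below accepts a metadata DOI only when most words of the title that the DOI resolves to also appear in the PDF's text layer. This is an assumption about how such a check might look, not a description of Zotero's implementation. The function names, the `resolved_title` input (a title obtained from some DOI lookup) and the 0.8 threshold are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse everything except letters and digits to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def doi_is_corroborated(resolved_title, pdf_text, min_fraction=0.8):
    """Return True if most words of the resolved title occur in the PDF text."""
    title_words = normalize(resolved_title).split()
    if not title_words:
        return False
    text_words = set(normalize(pdf_text).split())
    if not text_words:
        # No text layer at all: there is nothing to validate against.
        return False
    hits = sum(1 for word in title_words if word in text_words)
    return hits / len(title_words) >= min_fraction
```

With a check like this, a scan that has no text layer always fails, which matches the behavior reported at the top of the thread.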

We do have an open ticket to also check the filename, but due to limitations on filenames, that relies on various character substitutions, so I expect that to also be useful solely for people who make it to the forums or documentation and manually assign DOIs according to the documented convention. That's a pretty small audience.
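
To show why character substitution is needed at all: a DOI always contains a "/", which cannot appear in a filename. The sketch below uses percent-encoding as one reversible substitution. This is purely an illustrative assumption and not the convention referred to in the ticket; the helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def doi_to_filename(doi, extension=".pdf"):
    """Encode a DOI so that it can be used as a filename."""
    # safe="" forces "/" to be escaped as well (quote() leaves it alone by default).
    return quote(doi, safe="") + extension


def filename_to_doi(filename, extension=".pdf"):
    """Reverse doi_to_filename: strip the extension and decode the rest."""
    if extension and filename.endswith(extension):
        filename = filename[: -len(extension)]
    return unquote(filename)


name = doi_to_filename("10.1000/xyz123")
print(name, "->", filename_to_doi(name))
```

Whatever substitution is chosen, it has to be reversible, otherwise two different DOIs could map to the same filename.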

You can, of course, OCR the PDF yourself outside of Zotero, in which case Zotero may be able to recognize it. There's also an OCR plugin for Zotero, though that may be harder to set up than an external tool you already have.

But if there's no text in the PDF, there's just not a good automated solution here.

[Edit: What adamsmith said.]

dstillman · May 24, 2021

Actually, it looks like we do look for DOIs and ISBNs in the metadata already. We have to check the specifics, but I'm guessing we do what I say above and require those to be validated by the text (e.g., a title showing up), in which case it still wouldn't work for a PDF without text.

It's possible there wouldn't actually be any false positives from blindly using identifiers in metadata, but that's not something we've evaluated.

rfolmer · May 27, 2021

OK. I can see why this is deliberate, and it explains the behavior I see.
Without any OCR text, one would have to rely fully on the metadata DOI being correct. Many of the other data fields in the PDF were pretty meaningless, so I can see why there is little trust there.

Still, Zotero is happy to accept the DOI from the PDF metadata if the only thing it finds in the text is that same DOI (i.e. no readable data about author, title, etc.). But that's probably fine. Why would the printed DOI be wrong?