PDF metadata retrieval occasionally incorrect

dstillman · November 21, 2018

This discussion was created from comments split from: zotero 5 metadata retrieval not working for mac.

Walt8 · November 21, 2018

dstillman, thank you so much for your quick response. Looks like things are working again. However, I have also noticed on occasion that the metadata retrieval is incorrect (it's a book but it recognizes it as a document, or sometimes as an entirely different document written by a different author). I've typically just manually changed the info but it seems a bit buggy compared to zotero 4 ...

dstillman · November 21, 2018

PDF metadata retrieval is an inherently inexact process. Generally speaking, Zotero 5 should do much better.

Earlier versions of Zotero used Google Scholar to retrieve metadata, which often resulted in limited metadata (e.g., no abstracts). More importantly, Google would block you after you retrieved more than a few items, particularly in the standalone version of Zotero. Since the standalone version is the only version in Zotero 5, continuing to use Google Scholar wasn't an option.

Starting with Zotero 5.0.36, Zotero uses a new system we developed. You can read the details in the linked blog post, but basically, it allows unlimited retrieval without rate-limiting, should generally produce more complete metadata, and, for non-academic PDFs without DOIs or other identifiers, will produce whatever basic metadata it's able to extract. For the last case, those items often wouldn't have been in Google Scholar anyway, so it's still an improvement.

If there's a specific PDF for which the metadata is incorrect, you can report it through Zotero as explained there, but if you post examples here we can say more about particular files. If something is just recognized as a document, there's probably not much to be done — anything can be a PDF, but that doesn't mean there's metadata to extract — but we'd certainly want to know about something that was identified incorrectly. (In the rare cases where that happens, it's usually because the recognized paper appears as a citation in the first few pages and Zotero picks up that one by mistake. It shouldn't be possible to end up with a completely different document.)

Walt8 · November 27, 2018

Hello there, I've been downloading some PDFs and Zotero is just not extracting any metadata from them at all. I believe the metadata is not being read properly, as I feel certain these PDFs should have metadata attached. Thank you.

adamsmith · November 27, 2018

What are some examples?

Walt8 · November 27, 2018

Sure:

Walter Benjamin, Rolf Tiedemann, Howard Eiland, Kevin McLaughlin-The Arcades Project-Belknap Press of Harvard University Press (2002).pdf

Walter Benjamin, George Steiner, John Osbourne-The Origin of German Tragic Drama-Verso (2003).pdf

Walt8 · November 27, 2018

Not sure if that is what you meant.

DWL-SDCA · November 27, 2018

Are these not older books that have since been reprinted? Are these scans of older books converted into PDF versions? Are these commercially available PDFs, or did you receive them from a colleague?

I ask these questions because the nature and quality of the pdf can be significant.

Walt8 · November 27, 2018

Yes, they are scans of older books into PDF versions. I'm just realizing that they are not OCR'd, which I assume is the reason there is no metadata. They are not commercially available items; however, they are taken from a site that also has commercially available items.

scocaud · April 14, 2022

Here is an example of a PDF that Zotero processed incorrectly: https://www.ouvrirlascience.fr/wp-content/uploads/2022/04/Guide_Partager_les_donnees_web.pdf

And here is the reference generated by Zotero:

Wittenburg (Ed.), P., Hellström (Ed.), M., Carlo-Maria Zwölf (Ed.), Abroshan, H., Asmi, A., Bernardo, G. D., Couvreur, D., Gaizer, T., Holub, P., Hooft, R., Häggström, I., Kohler, M., Koureas, D., Kuchinke, W., Milanesi, L., Padfield, J., Rosato, A., Staiger, C., Uytvanck, D. V., & Weigel, T. (2018). Persistent identifiers : Consolidated assertions [Text/pdf]. https://doi.org/10.15497/RDA00027

dstillman · April 14, 2022

@scocaud: That's a generic document that's unlikely to be automatically recognized. There just happens to be a DOI in the first few pages, so that's being picked up instead, though Zotero should be able to figure out that it's unrelated. We'll try to fix that — thanks.
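To make the failure mode concrete, here is a minimal sketch of how a naive "first DOI in the first few pages" lookup ends up with an unrelated reference. This is an illustration only, not Zotero's actual recognizer: the function name `find_first_doi`, the `max_pages` cutoff, the DOI pattern, and the sample page text are all assumptions made up for this example.

```python
import re

# A commonly used DOI-like pattern (an assumption; real DOIs can be more varied).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_first_doi(page_texts, max_pages=3):
    """Return the first DOI-like string found in the first `max_pages` pages, or None.

    `page_texts` is a list of strings, one per page of already-extracted text.
    """
    for text in page_texts[:max_pages]:
        match = DOI_PATTERN.search(text)
        if match:
            # Drop trailing punctuation that often follows a DOI in running text.
            return match.group(0).rstrip(".,;)")
    return None


# Hypothetical extracted text: the document has no identifier of its own,
# but its second page cites another work by DOI.
pages = [
    "A generic guide with no identifier of its own",
    "Further reading: https://doi.org/10.15497/RDA00027.",
]
found = find_first_doi(pages)
print(found)
```

With a lookup like this, the cited DOI is treated as if it identified the PDF itself, which matches the symptom reported above. A more careful approach would need some check that the metadata behind the DOI actually corresponds to the document, such as comparing titles.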

scocaud · April 14, 2022

Thanks for the explanation; I hadn't thought of that.