PDF metadata retrieval occasionally incorrect

This discussion was created from comments split from: zotero 5 metadata retrieval not working for mac.
  • dstillman, thank you so much for your quick response. Looks like things are working again. However, I have also noticed on occasion that the metadata retrieval is incorrect (it's a book but it recognizes it as a document, or sometimes as an entirely different document written by a different author). I've typically just manually changed the info but it seems a bit buggy compared to zotero 4 ...
  • PDF metadata retrieval is an inherently inexact process. Generally speaking, Zotero 5 should do much better.

    Earlier versions of Zotero used Google Scholar to retrieve metadata, which often resulted in limited metadata (e.g., no abstracts) and, more importantly, meant that Google would block you after retrieving more than a few items (particularly in the standalone version of Zotero, which is the only version in Zotero 5, so continuing to use it wasn't an option).

    Starting with Zotero 5.0.36, Zotero uses a new system we developed. You can read the details in the linked blog post, but basically, it allows unlimited retrieval without rate-limiting, should generally produce more complete metadata, and, for non-academic PDFs without DOIs or other identifiers, will produce whatever basic metadata it's able to extract. For the last case, those items often wouldn't have been in Google Scholar anyway, so it's still an improvement.

    If there's a specific PDF for which the metadata is incorrect, you can report it through Zotero as explained there, but if you post examples here we can say more about particular files. If something is just recognized as a document, there's probably not much to be done — anything can be a PDF, but that doesn't mean there's metadata to extract — but we'd certainly want to know about something that was identified incorrectly. (In the rare cases where that happens, it's usually because the recognized paper appears as a citation in the first few pages and Zotero picks up that one by mistake. It shouldn't be possible to end up with a completely different document.)
  • Hello there, I've been downloading some pdfs and zotero is just not extracting any metadata from them at all. I believe the metadata is not being read properly as I feel certain these pdfs should have metadata attached. Thank you.
  • What are some examples?
  • Sure:

    Walter Benjamin, Rolf Tiedemann, Howard Eiland, Kevin McLaughlin-The Arcades Project-Belknap Press of Harvard University Press (2002).pdf

    Walter Benjamin, George Steiner, John Osbourne-The Origin of German Tragic Drama-Verso (2003).pdf
  • Not sure if that is what you meant.
  • Are these not older books that have since been reprinted? Are these scans of the older books converted to PDF? Are these commercially available PDFs, or did you receive them from a colleague?

    I ask these questions because the nature and quality of the pdf can be significant.
  • Yes, they are scans of older books converted to PDF. I'm just realizing that they are not OCR'd, which I assume is the reason there is no metadata. They are not commercially available items; however, they are taken from a site that also has commercially available items.
  • Here is an example of how Zotero incorrectly processed a PDF: https://www.ouvrirlascience.fr/wp-content/uploads/2022/04/Guide_Partager_les_donnees_web.pdf

    and here is the reference generated by Zotero:

    Wittenburg (Ed.), P., Hellström (Ed.), M., Carlo-Maria Zwölf (Ed.), Abroshan, H., Asmi, A., Bernardo, G. D., Couvreur, D., Gaizer, T., Holub, P., Hooft, R., Häggström, I., Kohler, M., Koureas, D., Kuchinke, W., Milanesi, L., Padfield, J., Rosato, A., Staiger, C., Uytvanck, D. V., & Weigel, T. (2018). Persistent identifiers : Consolidated assertions [Text/pdf]. https://doi.org/10.15497/RDA00027
  • @scocaud: That's a generic document that's unlikely to be automatically recognized. There just happens to be a DOI in the first few pages, so that's being picked up instead, though Zotero should be able to figure out that it's unrelated. We'll try to fix that — thanks.
  • Thanks for the explanation which I hadn't thought of.
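
For readers who want to see why a DOI in the first few pages can lead to the wrong item, here is a minimal Python sketch of the failure mode described above. It is not Zotero's actual recognizer: the regular expression, the three-page window, and the function names `find_dois` and `naive_identify` are all assumptions made for illustration. It assumes the page text has already been extracted, which is also why a scan without OCR yields nothing.

```python
import re

# Illustrative DOI pattern (an assumption, not Zotero's real matcher).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(page_texts, max_pages=3):
    """Return DOIs seen in the first `max_pages` pages, in order, without repeats."""
    found = []
    for text in page_texts[:max_pages]:
        for match in DOI_RE.findall(text):
            doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
            if doi not in found:
                found.append(doi)
    return found


def naive_identify(page_texts):
    """Hypothetical naive strategy: treat the first DOI found as the document's own."""
    dois = find_dois(page_texts)
    return dois[0] if dois else None


# A guide with no DOI of its own that cites another work on page 2.
guide_pages = [
    "Partager les donnees liees aux publications scientifiques",
    "See Wittenburg et al. (2018), https://doi.org/10.15497/RDA00027.",
]

# A scanned book without OCR: the extracted text of every page is empty.
scanned_pages = ["", "", ""]

print(naive_identify(guide_pages))
print(naive_identify(scanned_pages))
```

With these inputs, the naive strategy returns the cited work's DOI for the guide and nothing for the scan. A more careful recognizer would need extra evidence, such as the title on the first page matching the DOI's record, before accepting the match.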