Majority of PDF metadata retrieval is wrong

carstenk · March 18, 2018

Hi,

The majority of PDFs are incorrectly recognized since Zotero was switched over to the new service. The mismatches are often hilarious (don't even have a subject matter in common). I can't figure out what the supposed commonality is (it's not the ISBN/ISSN, for example, and it's not the DOI, where one exists).

Is there some way to configure it to use the Google Scholar resolver again?

In this state, Zotero becomes borderline useless to me as I have to create all "parents" manually.

Is anyone else seeing similar behaviour?

adamsmith · March 19, 2018

Have you reported any of these? You're the only person so far who has reported any significant issues with PDF recognition. What sorts of PDFs are we talking about?

(And also a reminder that even with the new service, you're still almost always better off importing items via the Save to Zotero button rather than using Retrieve Metadata.)

carstenk · March 19, 2018

Well, I have now reinstalled the app and the plugin, and the issue seems to have gone away. Some minor problems remain, such as Zotero's insistence on adding the creators of edited collections as authors (it fails to distinguish almost 100% of the time, something the Google service did not do, or at least not nearly as often). I was also amused to note that the Place field was occasionally filled by the entire address of the publisher (i.e. SAGE's address in London, right down to the postal code).

Regarding your point about importing—sure, but the Google Scholar service was way more accurate. I never had any _mis_matches, just the occasional case of "too little" information.

I'll hold off on reporting these issues until/unless they occur again now that I've reinstalled (which I assume has reset my various parameters).

adamsmith · March 19, 2018

For one, Google Scholar _never_ includes DOIs or Abstracts, so I'd say the case of "too little information" wasn't just occasional but constant.

dstillman · March 19, 2018

Google Scholar blocks the standalone Zotero (which doesn't share the browser cookie store) after a very small number of requests, so its accuracy is somewhat beside the point. But yes, in many cases we should now be returning much more complete metadata, including the canonical metadata for a much larger proportion of articles.

Reinstalling shouldn't really have any effect on the recognizer, so you might try re-adding one of the earlier PDFs that was recognized incorrectly to see if it's still incorrect and report it if so.

Note that "reporting" means right-clicking on the new parent item and selecting "Report Inaccurate Metadata" (which is only available for a limited time after the retrieval).

Re: "borderline useless", keep in mind that, while the recognizer should do a pretty good job, and we understand that people's workflows differ, adding items and their associated PDFs via the "Save to Zotero" button is still the best/recommended way of getting the majority of items into Zotero, so a misbehaving recognizer generally shouldn't be a significant impediment.

scholarium · August 13, 2018

For books, metadata retrieval now seems to work much worse than before. Approximate results for ebooks in PDF form with readable text (OCR) and an ISBN within either the first five or the last five pages: 10% correctly recognised, 40% with the parent item incorrectly classified as Journal Article and only title and author metadata (mostly correct), 10% totally off (different, unrelated metadata), 40% no metadata found.

The ISBN being frequently on the last pages of ebooks may account for the lack of recognition, but classification as Journal Article is a bug. Any plans to improve on those issues soon?
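To illustrate the point about ISBN placement, here is a minimal sketch of scanning both the first and the last few pages for a checksum-valid ISBN-13. It is not Zotero's recognizer code; `find_isbn`, `is_valid_isbn13`, and the `window` parameter are hypothetical names, and the page text is assumed to have been extracted already.

```python
import re
from typing import List, Optional

# Hypothetical helper, not part of Zotero. Matches ISBN-13 candidates such as
# "978-0-306-40615-7" or "9780306406157".
ISBN13_RE = re.compile(r"\b97[89](?:[- ]?\d){10}\b")


def is_valid_isbn13(digits: str) -> bool:
    """Check the ISBN-13 checksum (weights alternate 1, 3; sum must be 0 mod 10)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def find_isbn(pages: List[str], window: int = 5) -> Optional[str]:
    """Return the first valid ISBN-13 found in the first or last `window` pages."""
    if len(pages) <= 2 * window:
        candidates = pages  # short document: scan everything once
    else:
        candidates = pages[:window] + pages[-window:]
    for text in candidates:
        for match in ISBN13_RE.finditer(text):
            digits = re.sub(r"[- ]", "", match.group())
            if is_valid_isbn13(digits):
                return digits
    return None
```

Scanning only the opening pages would miss the many ebooks that print the ISBN on a closing colophon page, which is the case described above.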

Thanks for the great work!

adamsmith · August 13, 2018

The items with just very basic metadata are Zotero's fallback guesses if it doesn't find anything more systematic. Since it has no way to distinguish item types for those, it defaults to the most common one, journal article.

The 10% that are totally off would be useful to report.
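The fallback behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual recognizer: `Item`, `build_parent_item`, and the shape of `lookup_result` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    item_type: str
    title: str
    creators: List[str] = field(default_factory=list)
    from_fallback: bool = False


def build_parent_item(
    lookup_result: Optional[dict],
    guessed_title: str,
    guessed_authors: List[str],
) -> Item:
    """Prefer systematic metadata; otherwise fall back to a bare guess."""
    if lookup_result is not None:
        # A DOI/ISBN-style lookup succeeded, so the item type is known.
        return Item(
            item_type=lookup_result["itemType"],
            title=lookup_result["title"],
            creators=list(lookup_result.get("creators", [])),
        )
    # Nothing systematic was found: only title and authors guessed from the
    # PDF text are available, so the type defaults to the most common one.
    return Item(
        item_type="journalArticle",
        title=guessed_title,
        creators=list(guessed_authors),
        from_fallback=True,
    )
```

Under this reading, a book whose ISBN was not found ends up as a journal article with just a title and authors, which matches the 40% case reported above.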