Retrieving Metadata on PDFs

aliviah · April 6, 2017

Recently, every PDF I have imported into Zotero has failed when I try to retrieve its metadata, leaving me with a "No matching references found" message, even for searchable PDFs. In the past this happened only once in a while, but for several weeks it's been every single PDF I use. I need this metadata for the citations. How can I get it to work again? The code for the error report I submitted is 192633482.

dstillman · April 6, 2017

4.0.28.7

This is an extremely old version of Zotero, from October 2015. Upgrade to the latest version from the download page.

aliviah · April 6, 2017

Okay, I did this, but it did not resolve the problem. The new error report is 1892333757.

dstillman · April 7, 2017

OK, can you provide a Debug ID (different from a Report ID) for a retrieval attempt that fails?

aliviah · April 7, 2017

The Debug ID is D1796524728

dstillman · April 7, 2017

The document you're trying to import isn't an academic paper — it doesn't have any global identifiers (e.g., a DOI or PubMed ID) and isn't something you'd find in Google Scholar. Any document can be turned into a PDF, but that doesn't mean Zotero will be able to find metadata for it.
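To illustrate what "having a global identifier" means in practice, here is a minimal Python sketch that scans text extracted from a PDF for something shaped like a DOI. It is not Zotero's actual implementation. The function name `find_doi` and the regular expression are assumptions made for illustration, and the pattern covers common modern DOIs rather than every valid one.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This matches typical modern DOIs; it is a heuristic, not a full grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if none is found."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trim punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


# A report from an organization's website typically has no such string.
print(find_doi("Annual Report 2016, Example Organization"))
# An academic paper usually prints its DOI on the first page.
print(find_doi("Available at https://doi.org/10.1234/example.5678."))
```

When a lookup like this finds nothing, there is no identifier to query a bibliographic database with, which is the situation described above.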

aliviah · April 7, 2017

Okay, thank you!

lizrohit · April 13, 2017

Mendeley is able to do it. So what I do is retrieve the metadata with Mendeley and then export the bibliography to Zotero. It is annoying that Zotero can't support this. I usually have these issues with consumer/market research or PDF printouts from research companies. It is a shame that Zotero can't do it.

dstillman · April 13, 2017

@lizrohit: Not sure what you mean by "able to do it" — my answer was specific to the PDF aliviah was trying to import. If you have examples of PDFs that work in Mendeley but that don't work in Zotero, we can take a look.

It's unlikely that any software could extract high-quality metadata from random non-academic PDFs without also generating a lot of junk data (which was the case with Mendeley in the past, though I thought they stopped using that method). It's possible they're using metadata that other users have entered for the same files, which is something we hope to start doing, but there are various privacy considerations there (e.g., we'd only draw from files in public libraries).

DWL-SDCA · April 13, 2017

Please be very cautious about user data. At several points I've tried Mendeley's user metadata, and at least 30 percent of the time the records contained major inaccuracies. Some were so bad that we suspected sabotage by trolls. Other problems were clearly there because the person who entered the metadata never thought beyond their own use or purposes. Some were filler: ISSNs such as 1234-5678 and 9999-9999; DOIs of journal articles that didn't begin with 10 (or even with a numeral); publication places such as "unknown", "Unknown City", "Needs Verification", and "Maybe Chicago". Lots and lots of journal names, publisher names, and book titles had spelling errors or obvious "fat-finger" typos.

My belief is that bad data is worse than no data. When user-supplied data is necessary, the source is probably difficult to find. This suggests that it might be difficult for secondary users to recognize that there are problems with the metadata. I also believe that 1) one should never, ever cite something unread; and 2) sometimes it will be necessary to hand-enter metadata.
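Some of the filler values described above can be caught mechanically. The Python sketch below is only an illustration under stated assumptions: the ISSN check uses the standard mod-11 check digit, the DOI check is a loose shape test ("10.", a numeric registrant code, a slash, a suffix), and the placeholder list and the record keys `ISSN`, `DOI`, and `place` are hypothetical. It does not describe how Zotero or Mendeley screen records.

```python
import re

# Hypothetical placeholder values, taken from the examples in this thread.
PLACEHOLDER_PLACES = {"unknown", "unknown city", "needs verification", "maybe chicago"}


def issn_is_valid(issn):
    """Check the NNNN-NNNC shape and the ISSN mod-11 check digit."""
    m = re.fullmatch(r"(\d{4})-(\d{3})([\dXx])", issn.strip())
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    # The first seven digits are weighted 8 down to 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return m.group(3).upper() == expected


def doi_looks_plausible(doi):
    """Loose shape test only; it cannot tell whether the DOI resolves."""
    return re.fullmatch(r"10\.\d{4,9}/\S+", doi.strip()) is not None


def suspicious_fields(record):
    """Return the names of fields in record that fail a basic check."""
    flagged = []
    if "ISSN" in record and not issn_is_valid(record["ISSN"]):
        flagged.append("ISSN")
    if "DOI" in record and not doi_looks_plausible(record["DOI"]):
        flagged.append("DOI")
    if record.get("place", "").strip().lower() in PLACEHOLDER_PLACES:
        flagged.append("place")
    return flagged


print(suspicious_fields({"ISSN": "1234-5678", "DOI": "x10/abc", "place": "Unknown City"}))
```

Both filler ISSNs mentioned above fail the check digit, so this kind of test would flag them. It would not catch misspelled journal names or a well-formed identifier attached to the wrong item, which still need a human reader.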

dstillman · April 13, 2017

Yes, we'd only use data that we had high confidence was accurate, established through various methods, for exactly these reasons.

"When user-supplied data is necessary, the source is probably difficult to find."

There's no reason to think this is true. This thread is just about PDFs that people download that aren't academic papers. E.g., the PDF from aliviah above was just a PDF from an organization's website. If you save them via Zotero, they'll already have a correct URL, and the relevant metadata is quite possibly on the first page of the PDF or on the page it was downloaded from. It just doesn't have a non-URL identifier and isn't in Google Scholar. But there's no reason everybody should need to enter the same fields by hand over and over.