Best way to match 200+ PDFs with metadata

yotiao · November 17, 2019

Hello,

I've recently moved from another reference organiser/manager and in the process ended up with hundreds of PDFs that Zotero's matching magic didn't recognise. I have records with the correct title and first author name - and nothing else. The CrossRef and Google Scholar lookups do not work on them, even though when I just copy-paste the title to Google Scholar it finds the correct paper.

I do not want to manually match all these papers. Is there any way to do this automatically? I was thinking that the Zotero Storage solution would have some sort of matching service built-in (like Apple's iTunes Match), where the messy records would be straightened out based on other records in the cloud. But I don't think they do that.

Any ideas would be appreciated...
Best
yot

dstillman · November 17, 2019

If you moved from another reference manager, how did you end up with PDFs without metadata? Were you not able to export the actual metadata? What program were you using previously? Generally you would want to export to RIS, BibTeX, or similar when moving between tools, not just copy PDFs.

You may want to consider redoing the transfer if possible, but in any case, were the papers that weren't recognized academic papers, or something else? Generally speaking, academic papers (at least modern ones) should have a pretty high recognition rate, and it should be close to 100% for ones with a DOI on the first page. Other documents that just happen to be in PDF format wouldn't be recognized beyond possibly title and author.

I was thinking that the Zotero Storage solution would have some sort of matching service built-in (like Apple's iTunes Match), where the messy records would be straightened out based on other records in the cloud.

We do use cloud data to help recognize files, but many/most academic papers are watermarked, so we don't bother trying to match based on the exact file hash.

yotiao · November 19, 2019

Hi,
Well, I think there are two non-mutually-exclusive reasons for this issue: 1) I already had this many unannotated PDFs (my library is over 5000 papers, almost all of them from academic life science journals) and didn't realise it; 2) the unannotated papers cannot be annotated automatically. Many of those are News & Views-type articles (with embedded titles often different from the published ones) and many others are old PDFs - simply scanned papers with no OCR or any metadata embedded in them.

But I do think that some non-trivial proportion of those papers were annotated by me before. I was a (devout) Papers user for 10+ years and I did export my entire library as BibTeX before importing it into Zotero. I think I will have to do it again for the papers that are unannotated...

As a side note, in my experience (and I am a hoarder of papers) it is not that unusual for a reference manager to fail at annotating. Zotero and Papers are quite similar in this respect.

If that problem (cloud-based matching of metadata) could be solved, that would be a massive advantage for all users...

Thanks again,
Jarek

dstillman · November 20, 2019

But I do think that some non-trivial proportion of those papers were annotated by me before. I was a (devout) Papers user for 10+ years and I did export my entire library as BibTeX before importing it into Zotero. I think I will have to do it again for the papers that are unannotated...

I don't understand this part. If you had items with metadata in Papers, they should've been exported that way. If you only had PDFs without metadata before, they wouldn't.

If that problem (cloud-based matching of metadata) could be solved, that would be a massive advantage for all users...

I'm not sure what you mean here. We do cloud-based matching already. As I say, if there's a DOI on the first page, it should work nearly every time. And if it's an academic paper that has a DOI assigned but that doesn't have a DOI in the PDF, we can usually look up the DOI and retrieve metadata. We can't do anything for non-OCRed PDFs, and we can't do much more than extract the title and sometimes author for random PDF documents. (Even if we matched the exact file, for privacy reasons we couldn't reveal data entered by other users that wasn't in the file contents, which rules out non-OCRed PDFs and probably many/most random PDF documents.)
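To illustrate why a DOI on the first page makes recognition so reliable, here is a minimal Python sketch of pulling a DOI out of first-page text. This is not Zotero's recognizer: the pattern is a common heuristic for modern DOIs, and `find_doi` is a hypothetical helper name. It assumes the page text has already been extracted, which is exactly what fails for non-OCRed scans.

```python
import re

# Heuristic pattern for modern DOIs ("10.", a 4-9 digit registrant code,
# a slash, then a suffix). An assumption for illustration, not Zotero's code.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(first_page_text):
    """Return the first DOI-like string in the text, or None if there is none."""
    match = DOI_PATTERN.search(first_page_text)
    if match is None:
        return None
    # Drop trailing punctuation that often follows a DOI in running text.
    return match.group(0).rstrip(".,;)")
```

A DOI found this way can be pasted into Zotero's Add Item by Identifier dialog. A scanned page with no text layer yields no text to search, so the function returns `None`.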

If there's a specific PDF that you think should be recognized automatically but that isn't, you can link to it here (if it's available publicly) or email it to support@zotero.org with a link to this thread.

qqbb · November 20, 2019

I have records with the correct title and first author name - and nothing else. The CrossRef and Google Scholar lookups do not work on them, even though when I just copy-paste the title to Google Scholar it finds the correct paper.

If you could create parent items with the correct title in the title field, you could use lookup engines:

https://www.zotero.org/support/locate

For example, use "Google Scholar - Title Only":

https://github.com/bwiernik/zotero-tools/blob/master/engines.json

This will not be automatic, but it might help accelerate the process of finding matching metadata.
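As a rough illustration of what a title-only lookup does, the Python sketch below builds a Google Scholar search URL from a record's title. The `allintitle:` query form and the function name `scholar_title_url` are assumptions made for this example. They are not necessarily the template used in the engines.json file linked above.

```python
from urllib.parse import urlencode


def scholar_title_url(title):
    """Build a Google Scholar URL that searches for the title as a phrase."""
    # Collapse runs of whitespace so titles copied from PDFs stay on one line.
    clean_title = " ".join(title.split())
    query = 'allintitle:"{}"'.format(clean_title)
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})
```

Opening the resulting URL in a browser is still a manual step per item, so this only speeds up the search and does not automate the matching.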

many others are old PDFs - simply scanned papers with no OCR or any metadata embedded in them

Running OCR software on those PDF files could help with automatic metadata retrieval.
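One way to batch this is sketched below in Python, assuming the open-source OCRmyPDF command-line tool is installed. That tool is my choice for the example and is not named earlier in this thread. The helper `ocr_commands` is hypothetical and only builds the commands, so you can inspect them before running anything.

```python
from pathlib import Path


def ocr_commands(pdf_dir, out_dir):
    """Build one OCRmyPDF command per PDF found directly in pdf_dir."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = Path(out_dir) / pdf.name
        # --skip-text leaves pages that already have a text layer untouched.
        commands.append(["ocrmypdf", "--skip-text", str(pdf), str(target)])
    return commands
```

Each command can then be run with `subprocess.run(cmd, check=True)`, provided the output folder exists. Writing to a separate folder keeps the original scans intact in case the OCR output is poor.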