Restrict metadata match scope

dvs0826 · August 23, 2009

Consider the following use case:
On your hard drive you have a collection of PDFs in a folder, one entry for each article in some issue of some journal. You'd like to enter all these PDFs into Zotero with maximum possible citation information.

Current solution:
Drag all the PDFs into Zotero. We assume Retrieve PDF Metadata is insufficient as none of these documents contain DOI information in them, so we don't use it. Instead, we browse the journal issue in an online database and manually use the magic wand to add each article by DOI. Then we manually match the PDFs to the items by dragging each PDF onto its corresponding full-info DOI entry.

Proposed solution:
1) Drag all the PDFs into Zotero.
2) Either search through the journal issue, manually adding each DOI with the magic wand, or batch import them if that's possible for said journal.
3) Select all PDF files and choose a new option "Associate Items with Current Collection". This loops through each selected item and goes through the normal PDF metadata retrieval process of looking on Google Scholar. If any of the Google Scholar results match the DOI of one of the items created in step 2, a correspondence is made. Otherwise, a full-text search is run on the PDF, attempting to match text from the PDF to the titles of each of the items created in step 2.
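
The matching in step 3 could be sketched roughly as follows. This is an illustration in Python, not Zotero code: the `associate` function, the dictionary shapes, and the assumption that the lookup step yields a set of candidate DOIs per PDF are all made up for the example.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def associate(pdfs, items):
    """Map each PDF name to the key of the matching item, or None.

    pdfs:  PDF name -> {"dois": set of candidate DOIs, "text": extracted text}
    items: item key -> {"doi": str, "title": str}
    (Both shapes are hypothetical.)
    """
    by_doi = {i["doi"].lower(): k for k, i in items.items() if i.get("doi")}
    matches = {}
    for name, pdf in pdfs.items():
        # First pass: a candidate DOI that belongs to an item in the collection.
        key = next((by_doi[d.lower()] for d in pdf["dois"] if d.lower() in by_doi), None)
        if key is None:
            # Second pass: look for a whole item title inside the PDF text.
            text = f" {normalize(pdf['text'])} "
            hits = []
            for k, item in items.items():
                title = normalize(item["title"])
                if title and f" {title} " in text:
                    hits.append(k)
            # Accept only an unambiguous title match.
            key = hits[0] if len(hits) == 1 else None
        matches[name] = key
    return matches
```

Restricting the candidates to the items already in the collection is what makes the title fallback reasonably safe: a title only has to be unique among a handful of items rather than across all of Google Scholar.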

This would significantly increase the accuracy of PDF metadata retrieval. I'm not sure how common the use case is for other people, but it's very common for me.

dstillman · August 23, 2009

We assume Retrieve PDF Metadata is insufficient as none of these documents contain DOI information in them, so we don't use it.

DOI is only the first pass of the recognizer.

Do these PDFs have embedded text? Are the associated articles on Google Scholar? If the answer to both is "yes", then Retrieve PDF Metadata should work. If it doesn't currently, then it should be made to work.

dstillman · August 23, 2009

See also my response here: http://forums.zotero.org/discussion/8383/#Item_4

In other words, until that's adjusted, the recognizer is only failing because it's not searching using the right data, and limiting the scope wouldn't help.

dvs0826 · August 23, 2009

Well, consider the document with DOI 10.1145/1330511.1330517, which I obtained by saving it to disk through my ACM Digital Library membership. In it is the following text on the first page:

ACM Reference Format:
Weidlich, A. and Wilkie, A. 2008. Realistic rendering of birefringency in uniaxial crystals. ACM Trans. Graph. 27, 1, Article 6 (March 2008), 12 pages.
DOI = 10.1145/1330511.1330517 http://doi.acm.org/10.1145/1330511.1330517

When I do Retrieve PDF Metadata, it correctly resolves all the authors and title. But it does not populate the publication, volume, issue, or even the DOI for that matter. Does this mean that it missed the DOI information in this document? Currently I can't figure out a way of knowing if the PDF metadata it obtained was a result of finding the DOI or of matching text on Google Scholar.

When I click the magic wand and enter 10.1145/1330511.1330517 into the box, I get considerably more information.

Even if it does end up resorting to Google Scholar to locate the document, if that succeeds then shouldn't it at that point have the correct DOI? Or does Google Scholar's index not always contain the DOI?

dstillman · August 23, 2009

Or does Google Scholar's index not always contain the DOI?

http://scholar.google.com/scholar?hl=en&q=Realistic+rendering+of+birefringency+in+uniaxial+crystals&btnG=Search

Click "Import into BibTeX" (or "EndNote" or "RefMan", depending on your settings, though Zotero will save those automatically rather than displaying them), or just click the address bar icon. What you got is all that's there.

However, as you note, that PDF does have the DOI on the first page, which (as far as I know) Zotero should find. We'll have to look into why that's not happening.
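
For illustration, a first-pass DOI scan over extracted first-page text might look like the Python sketch below. It is not the recognizer's actual code; the pattern and the `find_dois` name are assumptions, and the pattern covers only the common modern DOI shape.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, a slash, then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def find_dois(text):
    """Return the distinct DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_RE.finditer(text):
        # Trim punctuation that usually belongs to the sentence, not the DOI.
        # (A DOI that really ends in one of these characters would be cut short.)
        doi = match.group(0).rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found

first_page = (
    "Weidlich, A. and Wilkie, A. 2008. Realistic rendering of birefringency "
    "in uniaxial crystals. ACM Trans. Graph. 27, 1, Article 6 (March 2008), 12 pages.\n"
    "DOI = 10.1145/1330511.1330517 http://doi.acm.org/10.1145/1330511.1330517"
)
print(find_dois(first_page))
```

On the ACM reference text quoted earlier in the thread, the bare DOI and the copy inside the URL collapse to a single result.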

dstillman · August 23, 2009

Fixed in the latest dev build. Now running the recognizer on that PDF creates a parent item identical to what you get from Add Item by Identifier. Thanks.

dvs0826 · August 23, 2009

Awesome. I have many documents with the DOI listed in the exact same manner that all fail in the same manner, or sometimes even result in completely wrong documents. Since I'm new to Zotero, do beta updates get released frequently or would I need to be brave and update to trunk to test out your changes within a reasonable amount of time?

An example of a document that gives the DOI in the exact same format and results in metadata for a completely different document being retrieved has DOI 10.1145/1330511.1330520, and is resolved to a document named "Agent-oriented software engineering (workshop session)" from a different conference and author.

dstillman · August 23, 2009

That other paper is detected correctly in the new build.

A new beta should be out this week. You can safely try trunk builds in a separate profile. See the SVN and Trac access page for more info.

dvs0826 · August 23, 2009

Whatever you changed to fix that is definitely made of win. I had hundreds of failed documents before that required manual attention, now I haven't found a single one that fails using trunk :)

dvs0826 · August 23, 2009

It's still made of win, but I did find a document that still fails even with the fix:

10.1145/1330511.1330518

I think it's failing because the DOI crosses a newline boundary.

dstillman · August 24, 2009

Where did you obtain the PDF for that article? Not from ACM, I would think, since the DOI in the PDF available on that page doesn't appear to cross a newline, and it is recognized correctly for me.

dvs0826 · August 24, 2009

Odd, I could have sworn it did. Maybe I was looking at a different document. I have seen a few that cross newline boundaries though and I'm pretty sure that caused a problem. This one definitely did cause a problem, and when I look at it again I notice the DOI has a space in it. This appears to be a typo in the ACM paper as it shouldn't have a space in it, but maybe that's actually the problem. It's written in the paper as

10.1145/ 1330511.1330518

The embedded space is after the / and before the 133. It's possible there's nothing reasonable you can do about that since it looks like it might be an error in the document.

dstillman · August 24, 2009

Yeah, unless that's a common error, that's probably going to have to remain undetected (or fall back to an improved Google Scholar mode).

Do you have examples of any that cross newline boundaries? I suppose accounting for that might fix the above as well.
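
One way to account for both cases, sketched in Python under stated assumptions: tolerate whitespace (a space or a newline) immediately after the slash, then rejoin the two halves. The `LOOSE_DOI_RE` pattern and `find_doi_loose` name are hypothetical, and this handles only a break directly after the slash; a break elsewhere in the suffix is ambiguous, since the text on the next line may or may not belong to the DOI.

```python
import re

# Hypothetical tolerant pattern: optional whitespace right after the slash.
LOOSE_DOI_RE = re.compile(r"\b(10\.\d{4,9})/\s*([-._;()/:A-Za-z0-9]+)")

def find_doi_loose(text):
    """Return the first DOI-like string, rejoined across a gap after the slash."""
    match = LOOSE_DOI_RE.search(text)
    if match is None:
        return None
    prefix, suffix = match.group(1), match.group(2)
    # Drop trailing sentence punctuation picked up by the suffix class.
    return prefix + "/" + suffix.rstrip(".,;")

print(find_doi_loose("DOI = 10.1145/ 1330511.1330518"))   # embedded space
print(find_doi_loose("DOI = 10.1145/\n1330511.1330518"))  # wrapped line
```

The looser pattern trades precision for recall: text such as "10.1145/ and" would also be joined, so a result found this way is probably worth verifying with a DOI lookup before it is trusted.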

dvs0826 · August 24, 2009

I'll try to look for one, but now that I think about it, I'd bet that an embedded space is what caused it to wrap in the first place, in which case it's the same problem.

dvs0826 · August 24, 2009

OK, here we go. http://portal.acm.org/citation.cfm?doid=1289603.1289605 wraps the DOI across a newline.