Restrict metadata match scope
Consider the following use case:
On your hard drive you have a collection of PDFs in a folder, one entry for each article in some issue of some journal. You'd like to enter all these PDFs into Zotero with maximum possible citation information.
Current solution:
Drag all the PDFs into Zotero. We assume Retrieve PDF Metadata is insufficient because none of these documents contain DOI information, so we don't use it. Instead we look up the journal issue in an online database and add each article by DOI with the magic wand, one at a time. Then we manually pair the PDFs with the items by dragging each PDF onto its corresponding full-info DOI entry.
Proposed solution:
1) Drag all the PDFs into Zotero.
2) Either search through the journal issue manually adding each DOI with the magic wand, or batch import them if it's possible for said journal.
3) Select all PDF files and choose a new option "Associate Items with Current Collection". This loops through each selected item and goes through the normal PDF metadata retrieval process of looking on Google Scholar. If any of the Google Scholar results match the DOI of one of the items created in step 2, the PDF is attached to that item. Otherwise a full-text search is done on the PDF, attempting to match text from the PDF to the titles of the items created in step 2.
This would significantly increase the accuracy of PDF metadata retrieval. I'm not sure how common the use case is for other people, but it's very common for me.
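To make step 3 a bit more concrete, here is a rough sketch of the matching logic in Python. It is only an illustration under some assumptions: it skips the Google Scholar lookup entirely and checks DOIs printed in the PDF's own text instead, it assumes the PDF text has already been extracted to a string, and the `items` list of dicts with "doi" and "title" keys is a made-up stand-in for the entries created in step 2, not anything from Zotero's actual code.

```python
import re

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then non-space characters.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def normalize(text):
    """Lowercase and reduce every run of non-alphanumerics to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_pdf_to_item(pdf_text, items):
    """Return the item in `items` that matches `pdf_text`, or None.

    `items` is a hypothetical list of dicts with "doi" and "title" keys,
    standing in for the entries created in step 2.
    """
    # Pass 1: any DOI printed in the PDF text that equals an item's DOI.
    found = {m.group(0).rstrip(".,;)").lower() for m in DOI_RE.finditer(pdf_text)}
    for item in items:
        doi = (item.get("doi") or "").lower()
        if doi and doi in found:
            return item

    # Pass 2: fall back to looking for an item's full title in the text.
    haystack = " " + normalize(pdf_text) + " "
    hits = [
        it for it in items
        if normalize(it.get("title") or "")
        and (" " + normalize(it["title"]) + " ") in haystack
    ]
    # Prefer the longest title, since short titles match by accident more easily.
    return max(hits, key=lambda it: len(it["title"])) if hits else None
```

Because the candidates are limited to the items in the current collection, even a crude title comparison like this has far fewer chances to pick the wrong document than an open-ended search does.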
Do these PDFs have embedded text? Are the associated articles on Google Scholar? If the answer to both is "yes", then Retrieve PDF Metadata should work. If it doesn't currently, then it should be made to work.
In other words, until that's adjusted, the recognizer is only failing because it's not searching using the right data, and limiting the scope wouldn't help.
ACM Reference Format:
Weidlich, A. and Wilkie, A. 2008. Realistic rendering of birefringency in uniaxial crystals. ACM Trans. Graph. 27, 1, Article 6 (March 2008), 12 pages.
DOI = 10.1145/1330511.1330517 http://doi.acm.org/10.1145/1330511.1330517
When I do Retrieve PDF Metadata, it correctly resolves all the authors and title. But it does not populate the publication, volume, issue, or even the DOI for that matter. Does this mean that it missed the DOI information in this document? Currently I can't figure out a way of knowing if the PDF metadata it obtained was a result of finding the DOI or of matching text on Google Scholar.
When I click the magic wand and enter 10.1145/1330511.1330517 into the box, I get considerably more information.
Even if it does end up resorting to Google Scholar to locate the document, shouldn't it have the correct DOI at that point if the lookup succeeds? Or does Google Scholar's index not always contain the DOI?
Click "Import into BibTeX" (or "EndNote" or "RefMan", depending on your settings, though Zotero will save those automatically rather than displaying them), or just click the address bar icon. What you got is all that's there.
However, as you note, that PDF does have the DOI on the first page, which (as far as I know) Zotero should find. We'll have to look into why that's not happening.
Another example: the document with DOI 10.1145/1330511.1330520 gives its DOI in exactly the same format, yet Retrieve PDF Metadata returns metadata for a completely different document, "Agent-oriented software engineering (workshop session)", from a different conference and by different authors.
A new beta should be out this week. You can safely try trunk builds in a separate profile. See the SVN and Trac access page for more info.
10.1145/1330511.1330518
I think it's failing because the DOI crosses a newline boundary.
10.1145/ 1330511.1330518
The embedded space is after the / and before the 133. It's possible there's nothing reasonable you can do about that since it looks like it might be an error in the document.
Do you have examples of any that cross newline boundaries? I suppose accounting for that might fix the above as well.
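For what it's worth, here is a minimal sketch in Python of an extractor that tolerates whitespace (a space or a line break) right after the slash, as in the example above. The pattern and the `extract_dois` name are assumptions for illustration only, not the recognizer's actual code. It does not handle a break in the middle of the suffix, and it can produce false positives when a DOI prefix ends a line and unrelated text follows.

```python
import re

# "10.", a 4-9 digit registrant code, "/", optional whitespace, then the suffix.
DOI_WITH_GAP = re.compile(r"(10\.\d{4,9}/)\s*(\S+)")


def extract_dois(text):
    """Return the DOIs found in `text`, rejoined across a gap after the slash."""
    dois = []
    for prefix, suffix in DOI_WITH_GAP.findall(text):
        # Drop trailing punctuation that belongs to the sentence, not the DOI.
        doi = prefix + suffix.rstrip(".,;)")
        if doi not in dois:  # keep first-seen order, skip repeats
            dois.append(doi)
    return dois


print(extract_dois("DOI = 10.1145/ 1330511.1330518"))
```

With the string from the post above, this rejoins the two halves into 10.1145/1330511.1330518.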