"Retrieve Metadata for PDF" not working as expected

z8080 · December 16, 2010

Hi all,

I usually add new items to my Zotero collection by dragging a PDF I have on my hard-drive into Zotero and then choosing the right-click menu option "Retrieve Metadata for PDF" to have Zotero automatically create an item that has the PDF as an attachment and that fills in automatically all the tags (Title, Author, etc).

The problem is, the metadata retrieval isn't always working, even for PDFs that you would expect it to work for, such as recent papers. If I look at the Document Properties of one of those PDFs, for example, I see that the Title is "doi:10.1016/j.neubiorev.2009.03.005", which should presumably enable Zotero to retrieve all the metadata from the DOI database!

An example of a PDF that did *not* have problems retrieving metadata was one that had Title = "PII: 0013-4694(93)90006-H", so presumably also a link to an online database, only this time a "working" one, as opposed to the previous example!

Is there any way to make this feature more useful, for example by having Zotero search for metadata in more places? Many thanks in advance for any help!

adamsmith · December 16, 2010

Zotero looks for information in the first pages of the document, not in its title or meta-tags - first it checks for a doi, then it just tries google scholar.
I can't tell you why it failed for the document in question, but usually it's doing well for recent pdfs - there is also always the question of whether DOI lookup works for a given doi - have you tried "add by identifier" for this one?
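
As a rough illustration of the "check the first pages for a doi" step, here is a minimal Python sketch. It is not Zotero's actual code: the pattern, the function name find_doi, and the sample page text are assumptions made for the example, and the text is assumed to have already been extracted from the PDF.

```python
import re

# Loose DOI pattern (an assumption, not Zotero's): "10.", a 4-9 digit
# registrant code, "/", then a suffix that runs until whitespace or a quote.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+', re.IGNORECASE)


def find_doi(page_text):
    """Return the first DOI-like string in page_text, or None."""
    match = DOI_PATTERN.search(page_text)
    if match is None:
        return None
    # Heuristic: trailing punctuation usually belongs to the sentence,
    # not to the DOI (this can mis-trim the rare DOI that ends in ")").
    return match.group(0).rstrip(".,;)")


# Hypothetical first-page text containing the DOI mentioned in this thread.
first_page = (
    "Neuroscience and Biobehavioral Reviews\n"
    "doi:10.1016/j.neubiorev.2009.03.005\n"
    "Review"
)
print(find_doi(first_page))
```

Finding the string is only half of it: the lookup against the DOI database can still fail for a given doi, which is a separate step this sketch does not cover.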

z8080 · December 16, 2010

It appears that Add By Identifier doesn't work either for that PDF, but if you say Zotero scans the content of the PDF and Google Scholar to find its title, author, etc, then it is really strange it could not do it for this particular PDF (a 2010 Elsevier-published journal article). I find that I have to complete these fields by hand more and more often, which is a shame since the Retrieve Metadata feature could really be a huge time saver.

adamsmith · December 16, 2010

I think in some cases Zotero gives up when it finds a DOI, but then doesn't get information back from the DOI database - I don't think that's actually intended, nor is it consistent behavior.
But why don't you use Zotero's own import features - i.e. the URL bar item? I understand that initially, to get your pdfs into Zotero, you need the retrieve function - but then?

z8080 · December 16, 2010

I'm not quite sure what you mean - I usually create new items in my collection based on PDFs that I receive, and I use Retrieve Metadata to save me having to fill in all the fields. If I find an article online I could, indeed, use the URL bar commands to make it into an item, but if I want to have a local copy of the PDF attached to that item, I still have to drag the direct link to the PDF into Zotero, and then probably the fields will be filled in correctly only in those cases where the Retrieve Metadata would work as well.

Am I doing things the hard way here?.. :-)

adamsmith · December 16, 2010

yes. If you find an article online and use the URL bar icon, in many cases the pdf will attach automatically (JSTOR, highwire, informaworld).
If it doesn't, your data is still likely to be better and more reliable, as it's right from the source and Zotero doesn't have to guess - and you can just attach the pdf afterwards (download and drag from the file system).
Google scholar data, which you'll often get for "retrieve" is pretty good, but not super reliable - and, for example you never get full first names, only initials.

z8080 · December 16, 2010

This is what I used to do until I noticed a couple of times that the item wasn't saving correctly (fields were missing, or PDF wasn't attached), and so I started doing this the other way (saving the PDF manually, then adding it and retrieving the metadata for it).

However I just tried applying the method you suggested on a new article, and this time it worked, so maybe some bugs were fixed in the meantime (well, I'm guessing they are all the time).

Many thanks once again for your help adam!

adamsmith · December 16, 2010

you have to get a sense for the quality of the database - for some, like JSTOR, you don't even have to check, for others you need to be wary of the quality of imports. Whether a pdf imports or not is mostly consistent within a database - either it works or it doesn't.

DWL-SDCA · December 17, 2010

Bringing in PDF metadata usually works fine and is one of those "best things since sliced bread". However, it must be used with caution.

Just a few moments ago I was importing metadata from several PDFs of articles published in the Journal of the Indian Academy of Forensic Medicine. This journal doesn't use the DOI system. Thus, Zotero uses Google Scholar. For some articles, a similarly titled article is imported instead of the one I have requested. This seems to be a problem when the article I want isn't included in Google Scholar. This doesn't happen frequently, but even if it happens only every now and then, having the wrong metadata attached to a pdf can be frustrating. For articles without DOIs I always bring in metadata one article at a time and check them carefully.

For example: grabbing metadata from the following article:

A study of homicidal deaths by mechanical injuries in Surat, Gujarat. J Indian Acad. Forensic Med. 2010; 32(2):134-138.

http://medind.nic.in/jal/t10/i2/jalt10i2p134.pdf

will instead bring metadata for the article:

Pattern of head injury in homicidal deaths. Indian J Forensic Med. Technol. 2009; 3(2):18-21.

This is the wrong article, wrong authors, wrong year, and the wrong journal. Sometimes the article error is much less obvious.

bothide · December 18, 2010

After only one day of using zotero, I have learned to like most of the things I have used. But I am somewhat disappointed by the low success rate of finding even very recent references in (in my case physics) journals. What would it take to include the retrieval directly from the journal web site first? Another thing that is quite disappointing is the BibTeX support. I am currently trying to get around this with the help of LyZ...

DWL-SDCA · December 18, 2010

Can you not use the address bar icon to download recently published articles directly from a journal's website? If not, does the journal not offer EndNote or RIS downloads? Either of these methods should bring the article into Zotero. If you have the doi number you can bring in the article using the "Magic Wand" tool. (Just click the tool and paste in the doi.) Very recently published articles that do not have a doi and have not had time to be found by Google Scholar can be a problem. This last mentioned issue will exist with any bibliographic management program.

bothide · December 18, 2010

Of course I can do it manually, but my point was that I want it to be done automatically when I click on "Retrieve Metadata for PDF" for a group of files. Doing it manually is not an option when the files for which the "Retrieve Metadata for PDF" did not work are in the hundreds... So I think that zotero should not rely only on Google Scholar but also try (automatically) the web site for the journal in question.

About LyZ I have tried to get it to work on three different systems but only get error messages of the type

SERVER ERROR:
[Exception... "Component returned failure code: 0x80520001 (NS_ERROR_FILE_UNRECOGNIZED_PATH) [nsILocalFile.initWithPath]" nsresult: "0x80520001 (NS_ERROR_FILE_UNRECOGNIZED_PATH)" location: "JS frame :: chrome://lyz/content/lyz.js :: anonymous :: line 85" data: no]

and

Could not contact server at: \\.\pipe\lyxpipe

To judge from comments from more or less satisfied LyZ users on the net it seems that it should indeed be possible to run LyZ from inside zotero. Wonder what tricks these users have used to get it to work...?

Simon · December 18, 2010

Trying every journal website is infeasible. Determining the journal from the PDF would require detection code for each supported journal (of which there are several hundred). Someone would also need to write code to search the websites. This is a rather large undertaking. Unless we can get a volunteer to write all this code, it's unlikely to happen.

You're probably best off making LyZ-related inquiries at the LyZ launchpad page.

bothide · December 18, 2010

If we all contribute with our "own journals" it would be feasible, I think... A good starting point would be a template with generic detection code and a few examples.

noksagt · December 18, 2010

What would it take to include the retrieval directly from the journal web site first?

How would zotero even know what journal site to look at?

Another thing that is quite disappointing is the BibTeX support.

I have not yet seen specifics, though I asked you for them in another thread.

adamsmith · December 18, 2010

yeah - I also don't see how that would even start to look logistically. You could try if Mendeley performs much better for you - they use, in addition to look-up, the cddb approach of using data input by other users. That seems potentially promising to me - (though I feel somewhat queasy about the data privacy implications involved) - but last time someone compared the two, Zotero did, if anything, slightly better.
So I guess for the foreseeable future you'll just have to deal.

bothide · December 18, 2010

About BibTeX support, the most frustrating things that I have found so far are:

1. Too liberal a use of {} which limits the effects of bibstyle, possibly producing wrong results in terms of lower/upper case in TITLE.

2. No way to use @STRING constructs and journal abbreviations.

3. No way of modifying the key to include, e.g., the .bib file name as an element.

4. No way of changing the layout of the bib entries themselves ("" vs {}, lower vs upper case, etc)

bothide · December 18, 2010

About the site to look at: By extracting the journal name from the pdf file and then going to the journal's web site and there searching for the article based on the other data (volume, issue, page no, publication year) in the pdf file.

noksagt · December 18, 2010

Which of those does LyZ fix? In your previous post, you said that "the default output does not conform to de-facto standards". None of the issues you list appear to be issues of standards to me.

(1) I believe that {} is only triggered when an uppercase letter appears in a non-initial position in a word (though there have been examples where we should probably break on punctuation in addition to spaces). Do you have other specific examples of it being over-zealous?

(2) No program I know handles this well. Are you aware of any? This might be possible when/if there is hierarchical support, so that journal names can be better normalized.

(3) I don't think I've ever seen that notation. Examples?

(4) How does this impact you? Zotero data should be stored in title-case as much as possible & your .bst should handle case changes (as you said in your first point). There are various tools to perform these sorts of transformations, though.
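
To make the rule in (1) concrete, here is a minimal Python sketch of brace protection as described there: a word is wrapped in {} only if it has an uppercase letter after its first character. This is an illustration of the stated rule, not Zotero's export code; the function name protect_title and the sample titles are made up for the example.

```python
def protect_title(title):
    """Wrap words with an uppercase letter in a non-initial position in {}."""
    words = []
    # Splitting on spaces only, as in the behaviour described in (1).
    for word in title.split(" "):
        if any(ch.isupper() for ch in word[1:]):
            words.append("{" + word + "}")
        else:
            words.append(word)
    return " ".join(words)


print(protect_title("Imaging of mRNA in HeLa cells"))
# Only "mRNA" and "HeLa" are braced; "Imaging" is left to the .bst.

print(protect_title("DNA-binding proteins"))
# The whole token "DNA-binding" is braced because the split is on spaces.
```

The second call shows the punctuation case mentioned in (1): because the split is on spaces only, the braces cover more of the title than strictly needed.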

noksagt · December 19, 2010

By extracting the journal name from the pdf file and then going to the journal's web site and there searching for the article based on the other data (volume, issue, page no, publication year) in the pdf file.

But extracting that information is the very part that is non-trivial.

Finding a journal name is hard. You can look in the text for journal names already in your zotero database, but you'd have to avoid having the bibliography provide false positives & I have no idea how you'd identify some journal names that are likely very common (like 'Nature' or 'Science'). Using heuristics to find vol/issue would also be hard (page & year seem easier, but not without issues).

Zotero has reasonable success by querying google scholar with phrases in the pdf if a unique identifier (doi) is not found. Going back to the publisher after such a process might make for better data, but would not help with false identification.
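
The false-positive problem with journal names can be shown with a small Python sketch. Everything here is hypothetical: the function name find_journal_candidates, the list of known journals, and the sample page text are assumptions for illustration, not anything Zotero does.

```python
import re


def find_journal_candidates(text, known_journals):
    """Return known journal names that occur as whole words/phrases in text."""
    hits = []
    for name in known_journals:
        # Case-insensitive whole-word match; deliberately naive.
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            hits.append(name)
    return hits


# Hypothetical inputs: a tiny journal list and text from a PDF page.
known = ["Nature", "Science", "Physical Review Letters"]
page = (
    "The nature of the transition is discussed. "
    "[1] A. Author, Physical Review Letters 99, 1 (2007)."
)
print(find_journal_candidates(page, known))
```

Both hits are wrong for this page: "Nature" matches an ordinary word in the body text, and "Physical Review Letters" comes from the bibliography rather than from the article's own journal.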

z8080 · December 21, 2010

The reason I said the Save To Zotero button (URL bar) doesn't usually work for me is because, most of the time, what it saves as an attachment to the item is a *link* to the PDF, rather than the PDF itself. Is this how it should behave?

adamsmith · December 21, 2010

no - but that happens for some translators, yes.