Retrieving Metadata

Hi,

just in the process of moving to Zotero from an EndNote/Papers setup. I have a library of around 1500 papers. About half have been identified and had metadata attached, but there are still around 700 papers that the locate search engines cannot identify, and it would be a real pain to do this manually. I'm also trying different locate engines to see whether they will identify and retrieve info for the PDFs. So, I have two questions:

1. Are there any other ways of batch-retrieving metadata? Many of my PDFs are text-based, but they are not being identified.

2. I am having trouble adding new locate engines. I have followed the instructions but working with the standalone version on a Mac I cannot see an option to add a new engine via a web page and the search engine's icon.

Some help would be great.

Thanks
  • edited November 15, 2012
    Retrieve Metadata and locate engines aren't related. All of the logic of Retrieve Metadata is in the code itself.

    Can you provide an example of a PDF that's not being recognized?
  • Hi Dan,

    here's a link to a paper I have in my library that can't be identified (just as an example, not my site):

    http://ewasteschools.pbworks.com/f/Law2002ObjectsandSpacesTheoryCulture&Society.pdf

    Thanks for the info about the locate engines and metadata. How is Zotero retrieving metadata then?

    Thanks.
  • That PDF works fine for me. My guess is that you hit a limit on Google Scholar (which is one of the things Zotero uses). I don't think Zotero currently always displays an appropriate error message in that case.

    Wait a number of hours, or until tomorrow, and try a single PDF again. If that works, go in smaller batches.
  • Okay, that makes sense. Only thing is I've already waited since yesterday. I'll give it a bit longer.

    I've come from papers and they use a number of sources to retrieve data. Is Zotero the same and is there a way to add to the sources being used?

    Thanks again.
    Zotero tries to identify PDFs by looking for a DOI in the first couple of pages. If a DOI is not found, it picks a long string of consecutive words and searches Google Scholar. If Google Scholar returns any results, the first result is imported. (You're likely getting stuck here: Google Scholar thinks you're a script, which in this case you are, and blocks you.) Otherwise, PDF metadata retrieval fails. In an upcoming Zotero release, metadata retrieval will also look for ISBNs (for books). I'm not sure how else we can go about figuring out what the PDF is.

    If you are coming from another bibliographic management software, there are better ways to transfer your library (along with metadata) than transferring PDFs and trying to fetch their metadata. If you're transferring from EndNote, see http://www.zotero.org/support/kb/importing_records_from_endnote

    I'm not too familiar with Papers, but I believe it can export the library in either RIS or BibTeX format, which Zotero will be able to import. I'm not sure if the attachment files will be linked properly though.

    Finally, PDF metadata retrieval is not a very good (or reliable) way to import metadata. If you haven't done so yet, I would encourage you to look at http://www.zotero.org/support/quick_start_guide If you already have a large collection of PDFs with no associated metadata, then this would be your only reasonable choice, but you probably don't want to make this your standard workflow.
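    (For the curious, the DOI-sniffing step described above could be sketched like this. This is not Zotero's actual code; the regex is just a common pattern for modern DOIs, and the function name is my own.)

```python
import re

# Common pattern for modern DOIs (10.xxxx/suffix). Not exhaustive,
# but close to what PDF metadata sniffers typically look for.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

def extract_doi(first_pages_text):
    """Return the first DOI-like string found in the text, or None."""
    match = DOI_RE.search(first_pages_text)
    if match is None:
        return None
    # Trim trailing punctuation that often clings to DOIs in PDF text.
    return match.group(0).rstrip('.,;)')

print(extract_doi("See doi:10.1000/182 for details."))  # -> 10.1000/182
```

    If no DOI turns up, the fallback is the Google Scholar full-text search described above, which is where the blocking happens.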
  • Thanks aurimas,

    I've been through some of this stuff, but I guess I have a special case in that I have always kept my EndNote and Papers databases separate, i.e. I have used EndNote for references and citations only, and Papers as the mechanism for managing my PDF files. My rationale was that neither was good at both, but each was good at either reference or PDF management.

    So, the upshot is, I am trying to bring together my EndNote reference library and my Papers PDF library in Zotero. So far, this has meant cleaning my reference library in EndNote with search and replace, importing it into Zotero, importing the PDFs separately, using Retrieve Metadata to get the best metadata for the files, and then using Duplicate Items to identify and resolve conflicts and duplications (I really like Duplicate Items; it works better than EndNote or Papers). I am now at the stage of retrieving the metadata, as doing this manually for hundreds of PDFs would be a lot of work.

    Once this is set up I can imagine using Zotero as a 'normal' user might, that is, importing files and reference metadata one by one, but for now batch processing is really going to save me a lot of time.

    Any more advice would be warmly welcomed.

    Thanks.
  • BTW: if I log in on another machine running Zotero, can I do the retrieval that way, thus bypassing Google's IP policing?
  • edited November 15, 2012
    Possibly, but only if the other machine connects with a sufficiently different IP.

    I seem to remember a discussion about slowing down the request rate to make the Zotero script less script-like. Did anything come of that? Sage also has limits to the number of records allowed within a time span.

    Let me repeat what has been said before: Google Scholar is OK for finding articles but not quite so good for getting good metadata into Zotero. GS often omits authors (sometimes several authors), gets the order of authors wrong, provides incorrect dates and pagination, etc. GS will guide you to articles on a publisher's website. Much better metadata is available there.
  • (Staggering GS requests is still on the menu, but nothing has been implemented. Nothing similar is planned for Sage afaik, but I'd say that's also somewhat less important, since Sage never works in the background and fails silently like the GS translator does.)
  • After 50 downloads, Sage "wants" no less than 6 seconds between citation requests. If you exceed this limit, your IP is locked out for an hour. If you exceed the limit again, it is necessary to send them an email with a code before you again have access.
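    (A client could honor a limit like that with a small throttle that enforces a minimum gap between requests. A minimal sketch, not anything Sage or Zotero ships; the class name and the injected clock/sleep hooks are my own, and the hooks exist mainly to make the behavior testable.)

```python
import time

class RequestThrottle:
    """Enforce a minimum gap between successive requests,
    e.g. the 6 seconds Sage reportedly wants after 50 downloads."""

    def __init__(self, min_gap_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Sleep just long enough to respect the gap, then record the request."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_gap - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request = self.clock()
```

    Calling `throttle.wait()` before each citation request would keep a batch job under the 6-second ceiling without slowing down requests that are already far enough apart.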
  • Well, I used another machine (at home) and managed to retrieve metadata for around 20 PDFs before GS blocked me again.

    DWL-SDCA - yes, I've experienced the inaccuracy of GS in Papers too, but in my case, where I want to batch-match hundreds of papers, doing it manually, which would be the most accurate way, implies a lot of work. I once used the British Library with EndNote, but they were really inaccurate and inconsistent.

    If, like me, others are migrating from another setup for managing references and have lots of PDFs then some way to batch process metadata is really useful and attractive.
  • Okay, the retrieval of metadata is a real pain with an archive being imported over. There must be other ways around this, or other ways to match a paper to identifying data. I still have around 1200 papers to match, and for the time being it looks like I can do around 20 per day...
  • On one hand, we can try to trick Google Scholar into thinking Zotero is not a script by delaying requests by a random number of seconds (maybe doing some other magic), but I'm not sure we should be in the business of tricking Google (idk what the Zotero core dev team thinks of this though).

    Maybe we can get Google to cooperate with Zotero. They don't seem too fond of Google Scholar in general, though, since there is still no API for it, nor is it even included as a choice on google.com (while other specialized searches are).
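    (The staggering idea mentioned above could be as simple as a randomized pause before each request, so the traffic looks less like a fixed-interval script. A sketch with made-up parameter values, not an actual Zotero patch; `pdf_queue` and `query_google_scholar` are hypothetical names.)

```python
import random

def staggered_delays(n, base=3.0, jitter=4.0):
    """Yield a randomized pause (in seconds) before each of n requests.
    The base/jitter values are illustrative, not tuned against GS."""
    for _ in range(n):
        yield base + random.uniform(0.0, jitter)

# Usage sketch: pause between successive Google Scholar queries.
# for delay in staggered_delays(len(pdf_queue)):   # pdf_queue is hypothetical
#     time.sleep(delay)
#     query_google_scholar(next_pdf)               # hypothetical function
```

    Whether randomized pacing would actually avoid the block is an open question; it only makes the request pattern less regular, it doesn't reduce the total volume.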
  • Microsoft Academic Search has an API: http://academic.research.microsoft.com/about/Microsoft Academic Search API User Manual.pdf

    I think Google Scholar used to be preferred because it was more comprehensive, but maybe things have changed?
  • I don't think MS Academic Search indexes manuscript content. Searching for a sentence from a few abstracts I had on hand did not return any results. Also their metadata is much worse than Google Scholar (from my experience). It seems to be user generated. At the very least, it's user-editable. IMO it's absolutely useless as far as Zotero is concerned.
  • IIRC I suggested contacting GS once and one of the core-devs said that it's very hard to get that type of cooperation with google, simply because they're so big and so busy.
    I wouldn't be concerned about tricking GS, since Zotero doesn't actually do what they're trying to fend off (i.e. systematic scraping of their entire database). I know Simon is OK with staggering the requests.

    Agree with Aurimas on the current uselessness of MS Academic for Zotero.
  • Although I agree about the problems with metadata from MSAS, it seems that every record has a correct DOI. I have had success clicking through the DOI to the publisher's site, where I receive the correct metadata. I am not able to reliably get DOIs directly from Google Scholar.
  • But what about clicking on the Google scholar search result? Doesn't that typically take you to the publisher's website?

    Either way, without full text indexing I don't see how we can use MSAS.
  • Scirus.com seems to do some full-text indexing, but its search results seem to be rather poor in metadata. (it's Elsevier owned)
  • @aurimas In my experience, clicking on the main link will take me to a full-text item that may or may not be on the publisher's website. When I am taken to a publisher's website, I'm usually taken directly to the PDF. Unless my university has a subscription and the publisher recognizes that I'm from an appropriate IP range, I find that I'm at a tollgate that often has no metadata Zotero can recognize, and that I need to search the site for the article's abstract page.

    In short, I find that the ability to directly download metadata to Zotero from GS or MSAS is more of a nuisance than a benefit. I get grossly incomplete metadata that is also frequently wrong. If it were possible for Zotero to identify DOIs and follow them to the publishers' sites and grab metadata there, I would be really pleased.
  • If it were possible for Zotero to identify DOIs and follow them to the publishers' sites and grab metadata there, I would be really pleased.
    This is likely how the Google Scholar and PubMed translators are going to evolve. I'm still working out some of the kinks, though. The ability to navigate to a different domain has only recently been introduced into Zotero. Hopefully this will be ready by the end of the year.
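    (One existing way to go from a DOI to clean metadata without scraping the publisher's page is DOI content negotiation: requesting the DOI from doi.org with an `Accept` header asking for CSL JSON. A sketch under that assumption; the parsing function and field selection are my own, and the sample values in the usage note come from the paper linked earlier in this thread.)

```python
import json
from urllib.request import Request, urlopen

def parse_csl(record):
    """Pull a few core fields out of a CSL-JSON record,
    the format served by doi.org content negotiation."""
    return {
        "title": record.get("title", ""),
        "authors": [a.get("family", "") for a in record.get("author", [])],
        "journal": record.get("container-title", ""),
    }

# Network usage sketch (not run here):
# req = Request("https://doi.org/10.1000/182",
#               headers={"Accept": "application/vnd.citationstyles.csl+json"})
# with urlopen(req) as resp:
#     print(parse_csl(json.load(resp)))
```

    Metadata fetched this way comes from the registration agency's record rather than a scraped landing page, which sidesteps both the tollgate problem and GS's rate limiting.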
  • An additional way of doing it is to provide the user with a list of possible matches, sourced from various databases, from which they can choose. This doesn't help much with large batch retrievals, but it does help to identify smaller sets, without having to rely on full-text indexing or on the PDFs being OCR'd text.

    BTW: the process of retrieving metadata is slowing down for me - perhaps Google is quicker to lock me out now, which means I still have hundreds of PDFs to retrieve data for.
  • ...but does help to identify smaller sets as well as not having to rely on textual indexing as well as PDFs that are not OCR text.
    How can PDFs be identified on any database without looking at their content? If we could reliably identify article/book titles within PDFs, then perhaps we could ditch Google Scholar.
  • aurimas, maybe the user can input author and title and get a list of possible matches back?
  • edited November 21, 2012
    In that case you can just Google it, which is only slightly less convenient, but a lot more flexible than a built-in search interface in Zotero. Or use the magic wand tool to put in a DOI and it will import an article automatically, but that's not useful in your case, since Zotero probably already imported articles with DOIs.

    Which makes me wonder why 700 of 1500 articles did not have detectable DOIs in them. Are you sure these remaining articles are OCR'ed? Alternatively, they could be old articles that actually don't have DOIs.

    Edit: I didn't think my statement through. Importing metadata from the web (after Googling for them) will not attach them to an existing PDF, so that's not a solution for you either.
  • Hi,
    I am also migrating a large database and ran into this request-limit problem.
    Could I suggest not stopping metadata retrieval when the limit is exceeded, and instead continuing with the queue?
    Many PDFs seem to be found by means other than GS, but the process stops because one paper is not found by DOI. I then have to re-select the entries, excluding the problematic one, and restart the retrieval. Quite long and painful.
    Thanks