Dubious quality of metadata from google scholar

I use zotero for my research (biomedical) and I like many things about it.

However, one major problem I have is the dubious quality of the metadata that is retrieved from google scholar. I would say the majority of the entries retrieved from pdfs have errors or limitations that prevent correct citations from being generated. Some of the common ones are:

- lack of ending page number
- lack of "Journal Abbr" field
- lack of author first names (a problem only for some output styles)
- weird handling of umlauts and accents and so on
- occasional all caps
- miscellaneous inaccuracies

In contrast, when I add references directly from pubmed, they are basically 99.9% correct.

I have a couple of related questions.

Regarding my existing library (~2000 publications), is there any way I can improve the metadata by making it somehow search pubmed and replace the google scholar metadata with the data from pubmed? Or am I going to have to pay a poor undergrad to do that manually?

Second, is there any simple way of adding new pdfs such that the metadata comes from pubmed? I realize I can press the scraper in pubmed, then download the pdf, then import it and associate it with the reference. But it would be nice if there were an easier way to do it. I don't suppose the scraper is able to find the associated pdf?

Thanks for any input.
  • Zotero can't magically improve the data supplied by google scholar, so there isn't much we can do about this except tell you to try avoiding google scholar as a source for data. We might be able to fix all caps and _maybe_ umlauts if you provide an example, but none of the other issues are fixable, since the data simply isn't there.

    There is no function to automatically update data from another source, no. It's something that I think would be nice to have, but it's a lot of work to implement, so I don't see it happening anytime soon.

    The easiest way to get articles with PDFs is to click through from pubmed to the publisher and get the data from there - for many publishers (including Sage, Wiley, Elsevier) Zotero attaches the PDF automatically. Zotero also gets PDFs when you use PubMed Central (as opposed to NCBI pubmed).
    Aurimas has mentioned investigating trying to get PDFs through pubmed directly, but I don't think that's anywhere close.
  • Thanks for your quick and informative response.

    I do realize the problem is with the quality of the data in google scholar, not with how zotero scrapes it (because the scraping of pubmed is essentially perfect, or at least, I haven't noticed any problems with it).

    But this is kind of a shame... for me, the automatic retrieval of metadata was one of the main things that drew me to zotero, because it enabled me to build a database from my huge library of pdfs. But now that I'm actually using it "in production", I am realizing that the metadata is really flawed. It's good enough for navigating my papers, for the most part (aside from the occasional "epic fail"), but certainly not suitable for making bibliographies. This really limits the usefulness of the automatic metadata, obviously. Like I mentioned, I'm going to have to get someone to manually go through and fix my entire database by looking up each paper in pubmed. I guess I should count myself lucky that I'm at the point in my career where I can pay someone to do that rather than do it myself!

    Would you guys ever consider improving the retrieve metadata feature so that it would look at some reliable data sources first (e.g. pubmed) and only resort to google scholar as a fallback? Is that even technically feasible, i.e. how easy is it to find a paper in pubmed based on the doi that you pull from the pdf?
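On the feasibility question, a DOI-to-PubMed lookup can be sketched roughly as below. This assumes NCBI's E-utilities `esearch` endpoint and the `[aid]` (article identifier) search field, which covers publisher-submitted DOIs. The function names and the example DOI are made up for illustration, and nothing here reflects how Zotero itself works.

```python
import json
from typing import List
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here as the lookup service).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_doi_query(doi: str) -> str:
    """Return an esearch URL that searches PubMed for a DOI.

    The [aid] tag restricts the search to article identifiers.
    """
    params = {"db": "pubmed", "term": f"{doi.strip()}[aid]", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text: str) -> List[str]:
    """Extract the list of PMIDs from an esearch JSON response body."""
    data = json.loads(response_text)
    return data.get("esearchresult", {}).get("idlist", [])


# Hypothetical DOI, for illustration only.
url = build_pubmed_doi_query("10.1000/example")
```

Fetching `url` and passing the body to `parse_pmids` would give zero, one, or several PMIDs. Anything other than exactly one hit is probably best treated as "no reliable match", falling back to another source.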

    Regarding your final suggestion, I might try adding future citations from the publishers, it seems like you have a lot of the market covered. I wonder how the bibliographic information scraped from the publishers compares to pubmed in quality. I guess there's only one way to find out. Pubmed central is not a good option though, as you know the papers are not really final versions. My personal opinion of pubmed central is that it's a rather unsuccessful attempt at solving the access problem it set out to solve.

    Thanks again for your response and for zotero, which despite these issues is an incredibly useful program.
  • "Would you guys ever consider improving the retrieve metadata feature so that it would look at some reliable data sources first (e.g. pubmed) and only resort to google scholar as a fallback?"
    In general yes - in fact Zotero does that already - when it finds a DOI in a PDF it gets metadata from CrossRef, which is usually quite good. Doing other things along those lines (Pubmed, using ISBN etc.) would certainly be doable. On the part of the core development team, the retrieve metadata function isn't currently a priority afaik (once you have moved your data/files into Zotero it's not really a much used feature anymore), so I don't think they'll do much to improve it any time soon, but I'm also sure that they'd be excited to accept patches - it is an open source project after all.
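The DOI route described above can be sketched as follows. This is a minimal illustration, not Zotero's actual code: the regular expression is a common heuristic for DOIs rather than a complete grammar, the function names are hypothetical, and the lookup URL assumes the public CrossRef REST API at `api.crossref.org`.

```python
import re
from typing import Optional
from urllib.parse import quote

# Heuristic DOI pattern: "10.", a 4-9 digit registrant code, a slash, a suffix.
# Real-world DOIs vary, so expect occasional misses and false positives.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in text extracted from a PDF, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop trailing punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def crossref_url(doi: str) -> str:
    """Build the CrossRef REST API URL for a DOI (percent-encoding the slash)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="")


# Hypothetical first-page text, for illustration only.
page_text = "J. Example Res. 2012. doi:10.1234/abc.def-5. Received 3 May 2012"
doi = find_doi(page_text)
```

If `find_doi` returns `None`, there is nothing to look up, which is the case where a fallback such as a title search comes in.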
    "I wonder how the bibliographic information scraped from the publishers compares to pubmed in quality."
    Data from most publisher sites is either good or excellent - if you find problems you can let us know. Zotero uses an individual translator for most of these and they can often be tweaked to improve import.
  • Thanks again for your feedback. You're right, the ones that come from CrossRef are pretty good, the only weakness is the journal name abbreviation which is usually not correct on those.

    Are you sure you're not underestimating the importance of the retrieve metadata feature? For one, lots of people have folders and folders of pdfs and this is the primary way for them to turn it into a zotero database and get started with zotero. Secondly, for me at least, "retrieve metadata" is still the way I get papers into my database. My workflow is basically, search pubmed, download some papers, read (or skim) them, and then add the ones that are relevant to my database. By the time I've decided if I want them or not, I don't have the pubmed pages around anymore, so I "retrieve metadata". Now I'm thinking of tweaking this procedure and adding things from pubmed as I download them, but then I'm going to have to later delete the ones I don't want. Anyway, bottom line is, I think retrieve metadata is a really important function. I wish I could offer to help improve it but I'm afraid the science job doesn't leave too much time for such projects...

    Thanks again.
  • The basic issue with retrieve metadata is that it's an imperfect process by design. It necessarily involves some degree of guesswork, so relying heavily on it is just not going to be a workflow that gets you consistently clean and reliable data into Zotero.

    I don't know how other people do things, but other than getting it via e-mail from a colleague, I essentially never download a PDF to my hard drive outside of Zotero - there is just no reason to get an article without getting the metadata at the same time. If I want to throw it out later that's just as fast in Zotero as in the file system and if I want to keep it I don't have to go through another step (or actually three: drag to Zotero, retrieve metadata, check metadata).
  • Yes, I see what you're saying. I am going to try adjusting my workflow to accommodate the limitations of metadata retrieval.
  • Hi,
    I am looking for the same functionality, and for the same reason/workflow as smwilson (maybe it's because I'm a medical scientist too). So having a powerful system that retrieves PDF metadata from Pubmed instead of Google Scholar would definitely be great.

    smwilson
    May 22nd 2012
    "I'm going to have to get someone to manually go through and fix my entire database by looking up each paper in pubmed. I guess I should count myself lucky that I'm at the point in my career where I can pay someone to do that rather than do it myself!"


    adamsmith
    May 22nd 2012
    "In general yes - in fact Zotero does that already - when it finds a DOI in a PDF it gets metadata from CrossRef, which is usually quite good. Doing other things along those lines (Pubmed, using ISBN etc.) would certainly be doable. On the part of the core development team, the retrieve metadata function isn't currently a priority afaik (once you have moved your data/files into Zotero it's not really a much used feature anymore), so I don't think they'll do much to improve it any time soon, but I'm also sure that they'd be excited to accept patches - it is an open source project after all."


    Maybe the student you would pay for this boring exercise could work on developing the code for this new Zotero feature that we all want!
  • There has been some movement on this:
    http://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/