Improving the quality of PDF metadata retrieval

I've been importing my collection of PDFs into Zotero recently.

The way metadata retrieval works, as I understand it, is as follows.
1. Grep for a DOI in the text output from pdftotext.
2. If one is found, query CrossRef for that DOI.
3. If not, query Google Scholar (GS) with a few lines selected from the text.
4. Select the first available GS result.
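
To make step 1 concrete, here is a minimal sketch of a DOI grep over pdftotext output. It is not Zotero's actual code; the function names and the regular expression are my own assumptions, and the pattern is a common heuristic rather than a complete DOI grammar.

```python
import re
import subprocess

# Heuristic DOI shape: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
# Assumption: this covers most modern DOIs, but it is not a full DOI grammar.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def pdf_to_text(pdf_path):
    """Run the pdftotext command-line tool; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not to the DOI.
    return match.group(0).rstrip(".,;)")
```

Something like `find_doi(pdf_to_text("paper.pdf"))` would then decide between the CrossRef branch (a DOI was found) and the Google Scholar fallback (it returned None). The file name here is only a placeholder.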

The citation quality is best when a DOI is found. Google Scholar is not very good for citations: sometimes the year is wrong or the capitalization is wrong, and the DOI is not imported. It is very good, though, at finding the paper in the first place from the queried lines. It also points to the publisher's site as the search result.

As an example of getting the year wrong, see this.
http://scholar.google.com/scholar?cluster=6212186994448200360&hl=en&as_sdt=0,5
In the citation, it shows the year as 2005, when it is actually 1998.
Also, DOIs are usually not present in the PDFs themselves for papers from before 2000 or so. However, the publishers' websites do show them.

So I've been thinking about doing a two-stage query when it falls back to Google Scholar.
1. Parse the title and author from the GS citation result.
2. Query CrossRef with the title and author (http://www.crossref.org/guestquery/).

This should be better than just using GS. Essentially, GS would just be used as a search engine for CrossRef.
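
As a rough sketch of the second stage: the snippet below uses the public CrossRef REST API (`api.crossref.org/works`) rather than the guest query form linked above, which is my substitution. The function names are hypothetical, and blindly taking the first hit is a simplification; a real implementation would probably compare the returned title against the GS title before trusting it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the REST endpoint is used here instead of the guestquery form.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_url(title, author, rows=3):
    """Build a works query from the title and author parsed out of GS."""
    params = {
        "query.bibliographic": title,
        "query.author": author,
        "rows": rows,
    }
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def first_doi(response):
    """Pick the DOI of the top-ranked item from a decoded JSON response."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    return items[0].get("DOI")


def lookup_doi(title, author):
    """Query CrossRef over the network and return the best-guess DOI."""
    url = build_crossref_url(title, author)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return first_doi(json.load(resp))
```

Once a DOI comes back, the rest could follow the existing DOI path (step 2 in the first list), so GS only supplies the search and CrossRef supplies the citation data.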

Would this work? Can this be implemented easily? I don't have enough experience in the web-programming world to judge for myself. I can take a crack at it though, if it is feasible.
  • We're planning something along those lines - probably not going through CrossRef, but going right to the publisher website (which would then also give us abstracts and some other info which CrossRef doesn't have). I understand the functionality is mostly in place, this will just need to be finished, tested and fine-tuned.
  • Great. Excellent.

    Can I try it out, please? I understand it is not ready yet, but even something half-way implemented would help me hugely. At the least, it would significantly reduce the busy-work I'm doing now.
  • I can try to remember to let you know when there is something you can try, but there isn't anything workable - even in alpha status - right now. When I said "functionality is mostly in place" I was referring to the underlying structure, not an actually working version.
  • Oh. No worries. It'd be great if you could let me know when it is ready to try out.

    Thanks, especially for your astonishing pace in answering user questions.