Improving the quality of PDF metadata retrieval
I've been importing my collection of PDFs into Zotero recently.
The way it works, as I understand it, is as follows.
1. Grep for a DOI in the text output from pdftotext.
2. If found, query Crossref for the DOI.
3. If not, query Google Scholar (GS) with a few lines selected from the text.
4. Select the first available GS result.
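To make steps 1 to 3 concrete, here is a minimal Python sketch of that decision as I understand it. It is not Zotero's actual code: the DOI pattern is modeled on the regex Crossref commonly recommends for modern DOIs, and `find_doi`, `plan_lookup`, and the "first few non-empty lines" rule are my own assumptions for illustration.

```python
import re

# Assumed pattern for modern DOIs (prefix 10.NNNN, a slash, then a suffix).
# This is a guess at a reasonable regex, not the one Zotero uses.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')


def find_doi(text):
    """Return the first DOI-looking string in the text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip('.,;')


def plan_lookup(text, max_lines=3):
    """Decide which service to query for a pdftotext dump.

    Returns ("crossref", doi) when a DOI is found, otherwise
    ("scholar", query) built from the first few non-empty lines.
    The line selection rule here is hypothetical.
    """
    doi = find_doi(text)
    if doi is not None:
        return ("crossref", doi)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return ("scholar", " ".join(lines[:max_lines]))
```

Older PDFs would mostly take the second branch, since they rarely carry a DOI in their text.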
The citation quality is best when a DOI is found. Google Scholar is not very good for citation data: sometimes the year is wrong or the capitalization is wrong, and the DOI is not imported. It is very good, though, at finding the paper in the first place from the queried lines, and its search result points to the publisher's site.
As an example of getting the year wrong, see this.
http://scholar.google.com/scholar?cluster=6212186994448200360&hl=en&as_sdt=0,5
In the citation, it shows the year as 2005, when it is actually 1998.
Also, DOIs are usually not present in the PDFs themselves for papers from before 2000 or so. However, the publishers' websites do show them.
So I've been thinking about doing a two-stage query when it falls back to Google Scholar.
1. Parse the title and author from GS citation result.
2. Query Crossref with the title and author (http://www.crossref.org/guestquery/).
This should be better than just using GS. Essentially, GS would just be used as a search engine for Crossref.
Would this work? Can this be implemented easily? I don't have enough experience in the web-programming world to judge for myself. I can take a crack at it though, if it is feasible.
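To make the second stage concrete, here is a rough Python sketch. Note the assumptions: instead of the guest query form linked above, it targets the Crossref REST API (`https://api.crossref.org/works` with the `query.bibliographic` and `query.author` parameters), and the helper names `build_crossref_query`, `fetch_json`, and `pick_match` are made up for illustration. Accepting only an exact title match after normalization is my own conservative choice, not something Crossref prescribes.

```python
import json
import re
import urllib.request
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_query(title, author, rows=3):
    """Build a Crossref search URL from the title and author parsed out of GS."""
    params = {
        "query.bibliographic": title,
        "query.author": author,
        "rows": rows,
    }
    return CROSSREF_WORKS + "?" + urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def normalize(s):
    """Lowercase and collapse punctuation and whitespace for comparison."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def pick_match(response, title):
    """Return the first Crossref item whose title equals the GS title
    after normalization, or None if nothing matches."""
    wanted = normalize(title)
    for item in response.get("message", {}).get("items", []):
        titles = item.get("title") or []
        if any(normalize(t) == wanted for t in titles):
            return item
    return None
```

Usage would be something like `pick_match(fetch_json(build_crossref_query(title, author)), title)`. If that returns an item, its `DOI` field could feed the existing DOI lookup; if it returns None, the GS citation would be kept as the fallback.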
Can I try it out, please? I understand it is not ready yet, but even something half-implemented would help me hugely. At the very least, it would significantly reduce the busywork I'm doing now.
Thanks, especially for your astonishing pace in answering user questions.