Why the "retrieve PDF metadata" function doesn't use web translators

Hi,

Firstly, sorry if this question is in the wrong forum but I didn't know where to post.

I found that the metadata extracted from my PDFs is poor compared to the metadata extracted, for the same article, by the corresponding web translator.

I wonder why we can't use this web translator when importing a PDF file. I mean, if we can extract the DOI from the PDF file, we can easily retrieve the web page of the article and use the web translator to retrieve the information, can't we?

For example, if I have a PDF file corresponding to this DOI: 10.1152/ajplung.00262.2006
I can enter in my browser the address: http://dx.doi.org/10.1152/ajplung.00262.2006
Then the browser redirects me to this page: http://ajplung.physiology.org/content/292/2/L476
And I can use the "HighWire 2.0" translator to get more information, such as the abstract, volume, abbreviated journal name, or keywords.

Is there a way to automate such a procedure?
Tell me if I'm wrong, but I think it could be a way to improve metadata extraction from PDF files.
The same procedure could be used when adding documents by DOI, ISBN, etc.
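As a rough sketch of the first step of this idea (this is not Zotero's actual code, and the regex is only an approximation of DOI syntax), extracting a DOI from PDF text and building the resolver URL could look like:

```javascript
// Minimal sketch (not Zotero's actual code): pull the first DOI-like string
// out of text extracted from a PDF and build a resolver URL for it.
function extractDOI(text) {
  // Rough approximation of the DOI pattern: "10.", a registrant code, "/",
  // then a suffix with no whitespace or quote/angle characters.
  const match = text.match(/10\.\d{4,9}\/[^\s"'<>]+/);
  return match ? match[0] : null;
}

function doiResolverURL(doi) {
  // Real code should escape characters that are not URL-safe.
  return "http://dx.doi.org/" + doi;
}

const doi = extractDOI("doi: 10.1152/ajplung.00262.2006 Am J Physiol");
console.log(doiResolverURL(doi));
// http://dx.doi.org/10.1152/ajplung.00262.2006
```

Opening that URL in a browser then produces the redirect to the publisher page described above.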

What do you think about this ?

Regards
  • while it's desirable, it's not "easy"
    http://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/
  • I read the whole conversation and, while nomadize seems to have the same problem as me, my proposal is slightly different.

    He asked for a GUI to choose the correct database. I propose using the DOI to find the URL of the article and then using the already-written translators on that page to find the data.

    I tried to investigate the code, but I can't find the portion used to select a translator from a URL. Is the "target" variable used? Or the detectWeb function?
    Could you point me to this piece of code?

    Thanks for your answer
  • Most of the relevant code is in https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js

    If a DOI is found inside the PDF we query CrossRef for metadata, which is fairly accurate and complete, but might not include abstract and keywords (perhaps something else).
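For illustration only (the endpoint Zotero itself queries may differ), CrossRef's public REST API allows looking up a work by DOI, and building such a lookup URL is straightforward:

```javascript
// Hypothetical sketch of a CrossRef lookup URL for a DOI. CrossRef's public
// REST API lives at api.crossref.org; this is not necessarily the endpoint
// Zotero queries internally.
function crossRefLookupURL(doi) {
  return "https://api.crossref.org/works/" + encodeURIComponent(doi);
}

console.log(crossRefLookupURL("10.1152/ajplung.00262.2006"));
// https://api.crossref.org/works/10.1152%2Fajplung.00262.2006
```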

    If we don't detect a DOI, we pick out a phrase from the PDF that looks unique and query Google Scholar. The metadata from Google Scholar is not as great.

    I do understand what you're proposing and it's not a bad idea. It's also not the easiest thing to implement correctly. There are several DOI/Google Scholar look-up improvements that I would like to make, but it's going to take some time.
  • Thanks for your answer. I had already found the code in recognizePDF.js; the piece of code I'm looking for is the one that filters the available translators when opening a webpage.

    I think that when a DOI can't be found, Zotero could use a phrase from the PDF (as is currently done) and, with the help of Google Scholar, it may be possible to find a DOI, locate the article page, and extract data with the corresponding translator.

    Zotero's behavior when importing a PDF could be:
    1. Search for the DOI in the PDF.
    2. If a DOI is found:
    2.1. Generate a URL of the form: http://dx.doi.org/$DOI
    2.2. Open the URL and wait for the redirection.
    2.3. Get the redirected URL.
    2.4. Identify the translator associated with this site.
    2.5. If a translator is available:
    2.5.1. Get data from the translator.
    2.6. Else, if no translator is available:
    2.6.1. Get data from CrossRef.
    3. Else, if no DOI is found:
    3.1. Use a phrase from the PDF to query Google Scholar.
    3.2. If Google Scholar gives a hit:
    3.2.1. Try to find a DOI.
    3.2.2. If a DOI is found:
    3.2.2.1. Apply the same procedure as if a DOI had been found in the PDF (go to 2).
    3.2.3. Else, if no DOI is found:
    3.2.3.1. Get data from Google Scholar.

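The numbered procedure above can be sketched as a single function. The network-dependent pieces (DOI detection, redirect following, translator lookup) are passed in as callbacks so the branching stays explicit; every helper name here is hypothetical, none of them are Zotero APIs.

```javascript
// Sketch of the proposed decision flow. Every property of `helpers` is a
// hypothetical callback; none of these names exist in Zotero.
function retrievePDFMetadata(pdfText, helpers) {
  let doi = helpers.findDOI(pdfText);                  // 1. search the PDF
  if (!doi) {                                          // 3. no DOI in the PDF
    const hit = helpers.queryGoogleScholar(pdfText);   // 3.1
    if (!hit) return null;
    doi = helpers.findDOI(hit.citation);               // 3.2.1
    if (!doi) return hit.metadata;                     // 3.2.3 fall back to Scholar
  }
  const url = helpers.resolveDOI(doi);                 // 2.1-2.3 follow the redirect
  const translator = helpers.findTranslator(url);      // 2.4
  return translator
    ? helpers.runTranslator(translator, url)           // 2.5 use the site translator
    : helpers.queryCrossRef(doi);                      // 2.6 fall back to CrossRef
}
```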
    What do you think about this?

    I tried to modify the code myself, but I can't figure out how to load the http://dx.doi.org/$DOI page and get the redirected URL.

    It's just a suggestion. I really love Zotero, but it seems to be beaten by Mendeley or EndNote on the PDF import side. I think my proposal can help improve the data extraction.
  • the piece of code I'm looking for is the one that filters the available translators when opening a webpage.
    the code is in translate.js but it's a bit more complicated than you make it out to be:
    Each translator has a regex for the URLs on which it runs. When that regex is matched, it then runs a "detectWeb" function to determine if there is an item (or multiple items) on that page.
    Making all of that run smoothly and internally is, as I said, not trivial. The problem really isn't coming up with the logic, but coding 2.3-2.6 in a stable way.
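The matching described above can be illustrated with simplified stand-ins (these objects are not real Zotero translators):

```javascript
// Simplified illustration of translator selection: each translator declares a
// "target" regex for the URLs it handles; when the regex matches, its
// detectWeb function decides whether the page actually contains an item.
const translators = [
  { label: "HighWire 2.0",
    target: /^https?:\/\/[^/]*physiology\.org\//,
    detectWeb: (doc, url) => "journalArticle" },
  { label: "Google Scholar",
    target: /^https?:\/\/scholar\.google\./,
    detectWeb: (doc, url) => "multiple" }
];

function selectTranslator(url, doc) {
  // First translator whose target matches the URL and whose detectWeb
  // reports an item type wins.
  return translators.find(t => t.target.test(url) && t.detectWeb(doc, url));
}

console.log(selectTranslator("http://ajplung.physiology.org/content/292/2/L476", {}).label);
// HighWire 2.0
```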

    FWIW, I think Zotero has EndNote beat hands down on PDF import. Mendeley may do a bit better by now.
    http://musingsaboutlibrarianship.blogspot.com/2010/07/extracting-metadata-from-pdfs-comparing.html
  • What do you think about this?
    That's more or less the logic I would follow.

    I think the first thing to do would be to make Google Scholar (and PubMed, since it's long been requested) into a recursive translator. That is, have them follow links and try to use translators on those sites (just like you suggest). Note that this has only recently become possible with the release of Zotero 3.0.4. Furthermore, I'm not sure if we should force this sort of recursive behavior onto everybody, since this does involve additional HTTP requests and, depending on how it's coded, may significantly slow down metadata retrieval. That means that we need to think about how and if users would have the option to turn this on or off.

    A similar thing can be done for DOIs, but I think it's a little more involved, since CrossRef is not the only DOI registration authority. I'm currently looking into this part myself (among other things).
