CrossRef

Automatic metadata retrieval for PDFs is a very useful feature of Zotero. When a DOI can be found, the lookup goes via a CrossRef query. This works well, since the number of errors is reduced compared to Google Scholar. However, the data generated this way is missing page/article numbers for APS journals. The missing journal abbreviation is also something that could be fixed automatically.
Would it be possible to update crossref.js in such a way that it follows the URL to find the missing info?
  • We're probably not going to add a fallback like that, but it is often possible to notify the publishers when data is missing in CrossRef. I've found that they are usually quite responsive and do update their data in the CrossRef database.
  • I tried to extract DOIs from PDFs using Python. My approach is to attempt extracting the DOI using pypdf and then download the contents of http://dx.doi.org/<specific DOI>, which fetches the publication page. I'm mostly interested in publications from Nature, Science, AGU, or ScienceDirect. With AGU papers you can, for example, download a RIS citation which can be imported into Zotero. Is this approach similar to what you are doing? I ask because I'm missing things like the abstract or keywords when I try 'Retrieve metadata from pdf'.
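    The extraction step described above could be sketched like this. The regex, the `extract_doi` helper, and the commented-out pypdf usage are all my own illustrative choices, not what Zotero or the poster actually uses:

    ```python
    import re

    # A common heuristic DOI pattern: CrossRef DOIs start with "10.",
    # a registrant code, then a suffix. This is not a full spec.
    DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

    def extract_doi(text):
        """Return the first DOI-like string found in text, or None."""
        match = DOI_RE.search(text)
        if match is None:
            return None
        # Trim punctuation that often trails a DOI in running text.
        return match.group(0).rstrip('.,;)')

    # Hypothetical usage with pypdf (assumed installed; not run here):
    #   from pypdf import PdfReader
    #   text = PdfReader("paper.pdf").pages[0].extract_text() or ""
    #   doi = extract_doi(text)
    #   if doi:
    #       # Resolving via dx.doi.org redirects to the publisher page.
    #       import urllib.request
    #       page = urllib.request.urlopen("https://dx.doi.org/" + doi).read()
    ```

    Searching only the first page or two is usually enough, since most publishers print the DOI in the article header or footer.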
  • It's similar, but once the DOI is extracted, it's sent to CrossRef's DOI lookup service, which sends the data back, and that data doesn't include keywords or abstracts.
    I don't think going to the actual publisher page and scraping from there is feasible for a DOI lookup.
  • Ah, okay. Well, what is feasible is to extract the download location of the RIS file from the publisher page (to which dx.doi.org forwards). This is what I want to try for myself, but maybe it could be incorporated into Zotero? This is of course not general; it only works for certain publishers (so far I have only checked Nature and AGU).
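    A crude version of that link extraction might be a regex over the fetched HTML. This is a sketch under the assumption that the publisher exposes a plain `.ris` href; many sites build the export link dynamically, where this would fail:

    ```python
    import re

    def find_ris_link(html):
        """Return the first href ending in .ris found in an HTML page, or None.

        A heuristic only: it assumes a static anchor tag with a double-quoted
        href, which holds for some publishers but by no means all.
        """
        match = re.search(r'href="([^"]+\.ris[^"]*)"', html)
        return match.group(1) if match else None
    ```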
  • You can certainly try. I don't see how you can do this in a general way: essentially you'd need Zotero to check the target URL against a regexp and select the right translator for every page it gets back (so that it can look for the RIS, BibTeX, etc. in the right place), which I don't think you can do with a lookup (or multiple) translator.
    In other words, I can see how this works when you already know where the DOI will take you, but I don't see what you do in the normal case, where you don't.

    If you manage, no guarantees about getting this back into Zotero. I think it sounds interesting, but there might be concerns about leaking user data (you'd be making requests to multiple third-party URLs) and about excessive HTTP requests. I don't make these kinds of calls; I'm just flagging the issues.
  • I would like to take joosthoek's side in this discussion. Very often I face a situation where the bibliographic data is incomplete or even erroneous. It would thus be very valuable if it could be verified against another source; say, the publisher's webpage is used to verify the CrossRef or Google Scholar information. Maybe this is against the general policy of Zotero, but I feel that if a partial solution exists, one should use it. The aim should not be to create a perfect tool for all cases in life, but a tool that is able to satisfy the majority of users.

    At present, the behavior of Zotero is rather inconsistent. Following adamsmith's logic, when getting bibliographic data from the publisher's web page one should only aim at getting the DOI, and the rest should be accomplished by sending a request to CrossRef. That is not how Zotero, with its rather large number of translators, works!

    Very close to the requested verification tool would be allowing users to fill in missing fields themselves using some simple scripting language. Say,

    for all Journal_Articles in My_Collection:
        if (Publication == "Physical Review Letters") { Journal_Abbr = "Phys. Rev. Lett." }
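    The pseudocode above could be expressed as a small batch pass over exported items. The field names follow Zotero's conventions (publicationTitle, journalAbbreviation), but the lookup table, function name, and item representation here are illustrative assumptions of mine, not an actual Zotero scripting facility:

    ```python
    # Illustrative journal-name -> abbreviation table; extend as needed.
    ABBREVIATIONS = {
        "Physical Review Letters": "Phys. Rev. Lett.",
        "Physical Review B": "Phys. Rev. B",
    }

    def fill_abbreviations(items, table=ABBREVIATIONS):
        """Fill an empty journalAbbreviation field from a lookup table.

        `items` is a list of dicts using Zotero-style field names
        (publicationTitle, journalAbbreviation); modified in place.
        """
        for item in items:
            if item.get("journalAbbreviation"):
                continue  # never overwrite data the user already entered
            abbrev = table.get(item.get("publicationTitle", ""))
            if abbrev:
                item["journalAbbreviation"] = abbrev
        return items
    ```

    Skipping items that already have an abbreviation keeps the pass safe to re-run.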
  • There are no sides to take here. I was pointing out to Joost what Zotero currently does and why I don't think what he's suggesting can be done.

    Batch editing and journal abbreviations are separate topics. There is, in fact, an experimental plugin to address the abbreviations issue, which should work with this or the next version of the Zotero beta:
    http://citationstylist.org/tools/?#abbreviations-gadget-entry
  • Dear adamsmith, thank you for clarifying this issue. I was already pointed to the abbreviation gadget in another thread. Here I used the journal-abbreviation example to illustrate the option of giving users a way to improve the data they already have in hand. Abbreviations were just an example. Here is another one:
    The name of the journal is "Physical Review Letters", and CrossRef correctly provides this info. However, Google Scholar always uses a different capitalization: "Physical review letters". If some subordination rules were set (say, the journal's webpage has priority over CrossRef, and CrossRef has priority over Google Scholar), this inconsistency would never take place.
    Batch editing could just be an alternative to this approach if adamsmith considers it too HTTP-request-intensive.
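    The subordination rule proposed above amounts to a priority merge across sources. As a toy sketch (the source names and the priority order are purely illustrative):

    ```python
    # Illustrative priority order: earlier sources win.
    PRIORITY = ("publisher", "crossref", "google-scholar")

    def merge_field(values, priority=PRIORITY):
        """Pick one field value by source priority.

        `values` maps source name -> field value; the first non-empty value
        from the highest-priority source is returned, or None if all are empty.
        """
        for source in priority:
            value = values.get(source)
            if value:
                return value
        return None
    ```

    With such a rule, CrossRef's "Physical Review Letters" would always win over Google Scholar's "Physical review letters".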
  • edited October 17, 2011
    To be clear:
    If some subordination rules are set, say, Journal's webpage has a priority over CrossRef, CrossRef has a priority over Google Scholar this inconsistency would never take place.
    The way metadata retrieval works, CrossRef does take priority over Google Scholar. Zotero just doesn't always find the DOI in a PDF.

    And as I note above, the principal problem with querying the publisher site first is that I don't think it's technically feasible.
    If I'm wrong about that and it can be done, there may be other issues to worry about (the data leaking and the HTTP requests, though, again, this isn't something I decide).

    An option to update existing data, e.g. from CrossRef, is yet another topic. I think that would be nice to have, but it's not trivial to do, as you'd have to deal with merge conflicts (i.e. what do you do if the local and the remote version of the record have different information in a field?). Not saying that can't be done, but it takes work.
  • @joosthoek: as far as I know, Zotero indexes all PDF files. Thus, the information in the abstract is just redundant. In addition, it unnecessarily increases the size of MS Word files, because Zotero also incorporates this information in the fields.
    What I think could be a valuable feature is if Zotero were able to give context-sensitive hints for searches based on the indexed material (à la Google): show similar articles...
  • Thank you, adamsmith, for paying attention to this thread. As far as I know, the developers of Zotero have already dealt with the merging problem for local duplicates. In my opinion, local vs. remote conflicts can be handled in a similar way: by giving the choice to users, or by using a subordination rule.
    Concerning CrossRef vs. Google Scholar: why not go a step further with the following scenario?
    i) A DOI can be found in the PDF -> request to CrossRef -> successful return.
    ii) A DOI cannot be found in the PDF -> Google Scholar search -> free DOI lookup at CrossRef -> successful return, as in i).

    An extra query to CrossRef is not going to kill your internet traffic, but it can do some verification.
    I made manual tests for old articles which do not contain a DOI. By querying CrossRef with the results of a Google Scholar search, I was always able to find a unique DOI and fill in the missing bibliographic data!
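    The two-step scenario could be sketched as a simple dispatch. The `query.bibliographic` parameter is part of CrossRef's REST API for free-form citation matching; the DOI regex and function names are illustrative, and the Google Scholar step is represented only by a citation string passed in from outside:

    ```python
    import re
    import urllib.parse

    # Heuristic DOI pattern (registrant code after "10.", then a suffix).
    DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

    def crossref_search_url(citation_text):
        """CrossRef free-form search URL for step ii."""
        query = urllib.parse.urlencode(
            {"query.bibliographic": citation_text, "rows": "1"})
        return "https://api.crossref.org/works?" + query

    def plan_lookup(pdf_text, citation_guess):
        """Choose between the two steps in the scenario above.

        (i) A DOI found in the PDF text goes straight to a CrossRef lookup;
        (ii) otherwise, query CrossRef with a citation string (e.g. the top
        Google Scholar result) and take the best-matching DOI from there.
        """
        match = DOI_RE.search(pdf_text)
        if match:
            doi = match.group(0).rstrip('.,;)')
            return ("doi", "https://api.crossref.org/works/" + urllib.parse.quote(doi))
        return ("search", crossref_search_url(citation_guess))
    ```

    Either branch ends at CrossRef, which is what keeps the resulting metadata consistent between the two paths.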
  • @joosthoek: as far as I know, Zotero indexes all PDF files. Thus, the information in the abstract is just redundant.
    No, it's not. We try very hard to get abstracts when we write translators, and many users complain when they're not imported. One reason is that you may want to restrict a search to the abstract. Another use case is annotated bibliographies. Yet another is reports. And finally, some users just like having the info available in the right-hand pane without opening the PDF.
  • Yes, that all makes sense for abstracts, but there is no chance for articles published before the Internet era...