Entering an item's ISBN-13 and getting multiple erroneous PubMed entries

I've noticed that about 5-10% of the time, when I enter a book's ISBN-13 into the magic wand button in Zotero Standalone 3.0.8 (on OS X 10.8 with Firefox 14.0.1), instead of the sought-after book item being added to my collection, I get 3-4 PubMed entries for journal articles. Try it yourself:

ISBN-13 for Oxford World's Classics edition of "Timaeus and Critias": 978–0–19–280735–9
This produces the following articles for me:
"Antidepressant drugs affect dopamine uptake"
"Outpatient phenothiazine use and bone marrow depression. A report from the drug epidemiology unit and the Boston collaborative drug surveillance program"
"Radiochemical assay of glutathione S-epoxide transferase and its enhancement by phenobarbital in rat liver in vivo"
"Surgical treatment of duodenal ulcer"

All, I'm sure, groundbreaking work - but not exactly what the doctor ordered.

Occasionally, deleting the hyphens in the ISBN will bring up the proper entry, but in this case the lookup fails completely.

As soon as this started happening I switched to entering ISBN-10s, which have never produced this behavior, but many new titles, like "Timaeus and Critias" above, no longer include an ISBN-10.

This bug aside, thanks for making Zotero great!
  • I can reproduce that. It's two separate issues:
    1. Zotero looks up 4 PubMed IDs for ISBNs separated by hyphens (if you look at the items you can see that they have PubMed IDs 9, 280735, 19, etc.) - that shouldn't be the case
    2. If you remove the hyphens, Zotero doesn't find the item, although it's in WorldCat.

    We'll look into both of them (I'm not going to be able to do much for the next couple of weeks, maybe someone else will get to it earlier).
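    To illustrate what seems to be happening (just a sketch, not Zotero's actual lookup code): split the identifier from the first post on its separator characters and you are left with bare numbers, several of which happen to be valid PubMed IDs.

        // Sketch only - not Zotero's actual lookup code.
        // Splitting the pasted identifier on its separators leaves bare
        // numbers; 9, 19 and 280735 are all valid PubMed IDs.
        var pasted = "978–0–19–280735–9";       // copied verbatim from the first post
        var pieces = pasted.split(/[^0-9Xx]+/);
        console.log(pieces);                    // ["978", "0", "19", "280735", "9"]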
  • So, after discovering the debug output in the preferences and taking a look at how things work under the hood, I worked around the PubMed issue by simply blocking the eutils.ncbi.nlm.nih.gov domain in my machine's hosts file (the exact entry is shown at the end of this post), which allows the Crossref-PubMed-WorldCat lookup to proceed to WorldCat. Since I don't foresee needing PubMed in the near future, that kind of fix is fine for me, but changing the magic wand's parser to better handle the apparent conflict between ISBN-13 syntax and multiple PubMed lookups would be good for everyone. (While you're at it, it'd be nice if the PDF metadata parser looked past the first few pages for things like ISBNs, or even searched Google Scholar with text from the middle of the PDF. I know most people probably use Zotero with journal full-text PDFs - and grabbing the JSTOR etc. URL on the first page of humanities articles without DOIs and pulling the metadata from JSTOR would be neat too - but the transition to ebooks in academia is accelerating.)

    As for the second issue, the problem seems to be on WorldCat's end. Looking at the URLs in the Zotero debug output, the hiccup seems to be that the initial WorldCat ISBN search successfully generates a results page with various slightly different entries from libraries, but the link to the first result (and, by process of elimination, the correct one in terms of what local US libraries are generating/using) is failing. WorldCat returns an "Our system is taking too long to respond" page whether you click the link on the human-readable results page, copy/paste that page's URL from the appropriate place in the Zotero debug output, or copy/paste the URL for the RIS file. WorldCat also failed in a similar way for a translation of Plato's Republic (ISBN-10: 0465069347), which did not have the PubMed problem at all.
    But weirdly, the second link on the search results page for both books resolves fine, and through it one can access the RIS file with all the correct metadata WorldCat seems to provide otherwise - either via Zotero's citation icon in the browser URL bar or by downloading the RIS file from WorldCat. Does this kind of hours-long selective server overload happen to WorldCat often? And could Zotero be modified to process the 2nd, 3rd, etc. results on the ISBN search results page should the first fail?
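    For completeness, the hosts-file workaround mentioned above amounts to a single line like the one below (assuming the standard /etc/hosts location on OS X; the point is just that requests to PubMed's E-utilities host fail, so the lookup proceeds to WorldCat as described):

        # /etc/hosts - keep the magic wand from reaching PubMed's E-utilities
        127.0.0.1    eutils.ncbi.nlm.nih.gov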
  • the PDF metadata parser (...) even searched Google Scholar with text from the middle of the PDF.
    it does that already, so many ebooks should work fine, as should JSTOR articles.
    Some improvements are planned; I think detecting ISBNs on the first five pages might be an option to consider.

    I don't see a reason against a fallback to the 2nd search result, but I've not heard of this particular problem occurring much, so I'm not sure if it's worth it - WorldCat is usually quite robust and reliable.
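    Conceptually it would just mean walking the result links in order instead of giving up when the first one times out, something like this (a rough sketch, not the actual WorldCat translator code; fetchRIS is a made-up stand-in for whatever requests the RIS record):

        // Rough sketch of a "try the next result" fallback - not the actual
        // WorldCat translator. fetchRIS is a hypothetical stand-in for the
        // function that requests and returns the RIS record for a result URL.
        function firstWorkingRecord(resultUrls, fetchRIS) {
            for (var i = 0; i < resultUrls.length; i++) {
                var ris = fetchRIS(resultUrls[i]);  // may return null on a timeout
                if (ris) return ris;                // stop at the first result that resolves
            }
            return null;                            // every candidate failed
        }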
  • Well, the WorldCat error is still there (and it doesn't matter how you try to get at the entries, e.g. searching by title or author rather than ISBN), so I submitted a comment on their web form. Do you all have a better way to contact them?

    As for the metadata issue with ebook PDFs, I have to say that hasn't been my experience over my first few weeks with Zotero, trying to get my library into the system. Now, I tried all of your counterparts, and Zotero does work better than they do, but for 80% of my ebooks and 40-50% of my journal articles, Zotero's metadata retrieval fails. Looking at the pdftotext output using the arguments Zotero uses made it obvious why for ebooks: not enough pages were being parsed. So I eventually found my way to the newest recognizePDF.js file on GitHub, and boy does it work better! It correctly pulled the ISBNs out of all 8 ebook PDFs that the 3.0.8 recognizePDF.js had failed on. 2 of them still failed because of the same WorldCat error mentioned above, but the Zotero debug output shows the ISBNs were recovered. It is quite a bit slower, but that's not an issue for me. So push that improved file out in the next update!

    This is probably not the right place to make this kind of specific suggestion, but looking at the source for recognizePDF.js, I might recommend that where it handles the case of two ISBNs next to each other, like the hardback and paperback, the code be expanded to cover more cases. It might even be a moot point, since it's based only on what the comments suggest the code does, not the code itself - I couldn't follow the regular expressions - and it hasn't seemed to be a problem in practice. But with legitimate ebooks, i.e. not simply page scans, there is often an ebook ISBN separate from the paperback and hardback ones. And some even have an ISBN-10 and ISBN-13 for paperback and hardback, and sometimes even the ebook, for 4 or 6 total ISBNs. And then there are ISBNs buried inside the Library of Congress data as well as elsewhere on the same page, for more repetition.
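    Just to make that concrete, I mean something along these lines (a rough sketch based only on my reading of the comments, not the actual recognizePDF.js code; the sample copyright-page text is made up from ISBNs already mentioned in this thread):

        // Rough sketch, not the actual recognizePDF.js code: collect every
        // ISBN-10/ISBN-13 candidate on a copyright page, so ebook, paperback
        // and hardback variants are all captured rather than just one pair.
        function extractISBNs(text) {
            var re = /ISBN(?:-1[03])?:?\s*([0-9][0-9\u2013\- ]{8,16}[0-9Xx])/g;
            var found = [], m;
            while ((m = re.exec(text)) !== null) {
                var digits = m[1].replace(/[^0-9Xx]/g, "");   // strip spaces and dashes
                if (digits.length === 10 || digits.length === 13) found.push(digits);
            }
            return found;
        }

        console.log(extractISBNs("ISBN 978-0-19-280735-9 (pbk.)  ISBN-10: 0465069347"));
        // -> ["9780192807359", "0465069347"]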
  • A note on the original issue with ISBNs - I have looked at this again, and the problem is that the ISBN-13 you give isn't separated by hyphens,
    978-0-19-280735-9
    but by en-dashes,
    978–0–19–280735–9
    which is incorrect (ISBN blocks should be separated by hyphens or spaces)
    If you try this out, the first ISBN gives you the correct
    Timaeus and Critias
    the second one gives you the PubMed articles. We could test for en-dashes, but I'd rather not mess with that.
    Also, we have recently improved ISBN detection and now use WorldCat only when the Library of Congress doesn't find an entry for an ISBN.
    Article recognition will improve somewhat with the next minor Zotero update, and bigger improvements are still planned for the medium term.
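    (Going back to the en-dash point: "testing for en-dashes" would just mean normalizing dash variants before the identifier check, roughly like the sketch below - not actual Zotero code, just the idea.)

        // Sketch only, not actual Zotero code: normalize dash variants to
        // plain hyphens before deciding whether the input is an ISBN, so the
        // en-dash version from the first post behaves like the hyphen one.
        function normalizeDashes(query) {
            return query.replace(/[\u2010-\u2015\u2212]/g, "-");   // hyphen/dash variants, minus sign
        }

        var cleaned = normalizeDashes("978–0–19–280735–9");        // -> "978-0-19-280735-9"
        var looksLikeISBN13 = /^97[89](?:[- ]?\d){10}$/.test(cleaned);
        console.log(cleaned, looksLikeISBN13);                     // "978-0-19-280735-9" true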