Can ISBN import be improved?

mark · January 11, 2012

I regularly use the "add item by identifier" function. Apart from a small and easily fixable UI problem, I find it works well for DOIs, but not for many books.

Current problems with ISBN book import:
1. All creators are imported as authors (editor/author distinction is lost)
2. Multiple creators are not supported, only the first creator is imported (always as author, cf. #1)
3. Complex names are truncated ("van der Zee, Emile" becomes "Zee, Emile")
4. Multiple places (e.g. Oxford/New York) are imported with messy punctuation in between (Oxford;;New York)
5. Publishers are inconsistent (e.g. some "John Benjamins", others "John Benjamins Pub. Co.")

Many of these problems may be due to the source repository Zotero import is relying on, but unfortunately it seems the user has no say in what repository is used (Worldcat? Google Books?).

If it were possible to get the Library of Congress data, that would be near-perfect; in my experience they have not only the most extensive collection but also the highest-quality metadata.

DWL-SDCA · January 11, 2012

WorldCat, while being one of the wonders of the world, is based upon data provided by member libraries from many nations. Unless there has been a recent improvement, there is no "standard" listing of places, publishers, etc. Author names are entered according to the convention of the place where the cataloging was done.

My own experience with LOC data has not been as you describe. Perhaps it is better than most, but I find many differences in publisher names and places. I find that author names are not very consistent -- especially when books are released by different publishers.

I have given up hope that this will be fixed for items published in the past and I've little hope that there will be consistency for future publications in my lifetime. (Who would establish the gold standard?) My own experience with operating the SafetyLit database is that the metadata we are fed from publishers isn't consistent even from the same publisher. We add only 700 records a week and are able to hand edit to improve the consistency of publisher names and places. This work is done by volunteers. This requires lots of time -- an unnecessary cost to publishers when the goal is to list items for sale and not to blend their products with that of other publishers to facilitate a comprehensive search or a listing of works by an author.

adamsmith · January 11, 2012

While what DWL says about Worldcat is correct, ISBN data is worse than the data available from Worldcat. The biggest issue here is actually COinS, on which the Worldcat translator relies (that explains, e.g., the issues with creators) - I have asked about shifting the translator to RIS (which Worldcat also supplies readily) but never received an answer.

I believe it should also be possible to query LoC for ISBNs and get much better data where it's available, but that would involve a lot more work.

ajlyon · January 11, 2012

Thinking about this, I agree we should try to query LoC -- one possibility is to query xISBN, OCLC's ISBN cross-reference service, which will give us LCCNs for getting the full MARC or MODS data from LoC. For items without an LCCN, we can still fall back on OCLC's data, preferably using RIS. This wouldn't be too hard, and it would be much, much cleaner. This is what I have tentatively planned for Zandy already, so I'll look into it, hopefully soon.

adamsmith · January 11, 2012

ajlyon - do you have any thoughts about converting Worldcat to RIS? It's easy to do, but I'm concerned about messing things up with ISBN lookup.

ajlyon · January 11, 2012

Don't worry about breaking ISBN lookup. We'll test thoroughly and make it work. We can also implement a fallback to COinS if the RIS fails.

mark · January 12, 2012

That would be great. The problems with creators (my points 1-3) are the most serious and it sounds like these would be partly solved by using RIS instead of COinS. I fully understand that metadata providers differ in how they handle publishers and places.
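To make the creator issue concrete, here is a minimal sketch of reading creators from RIS. It assumes the usual RIS line shape (two-character tag, two spaces, a hyphen, then the value) and treats AU/A1 as authors and A2/ED as editors, which is a common convention but an assumption here. The sample record is made up for illustration and is not actual Worldcat output, and `parse_creators` is a hypothetical helper, not the translator's code.

```python
import re

# Made-up RIS fragment for illustration only; not real Worldcat output.
SAMPLE_RIS = """\
TY  - BOOK
AU  - van der Zee, Emile
AU  - Doe, Jane
A2  - Roe, Richard
ER  -
"""

# Assumed mapping from RIS tags to creator roles.
CREATOR_TAGS = {"AU": "author", "A1": "author", "A2": "editor", "ED": "editor"}

# Two-character tag, two spaces, hyphen, optional space, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  -\s?(.*)$")


def parse_creators(ris_text):
    """Return every creator in the record, in order, with a role attached."""
    creators = []
    for line in ris_text.splitlines():
        match = RIS_LINE.match(line)
        if not match:
            continue
        tag, value = match.groups()
        role = CREATOR_TAGS.get(tag)
        if role is None or not value.strip():
            continue
        # Split on the first comma only, so a particle such as
        # "van der" stays part of the family name.
        last, _, first = value.partition(",")
        creators.append({
            "lastName": last.strip(),
            "firstName": first.strip(),
            "creatorType": role,
        })
    return creators


for creator in parse_creators(SAMPLE_RIS):
    print(creator)
```

Because every creator line is read and the name is split at the comma rather than at whitespace, a record like this keeps all three creators and the full "van der Zee". Whether editors come through still depends on the source actually tagging them as such.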

adamsmith · January 12, 2012

I've started on the Worldcat translator - it will improve Worldcat somewhat: Some abstracts, better names, all available names.
That should be done soon-ish. But editors still won't work - they could, but Worldcat seems to only know authors.

Ajlyon - wouldn't SRU work really well for LoC?
http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dc.resourceIdentifier=9780199286546&maximumRecords=1
gives us MARCXML for any ISBN
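A small sketch of building that kind of SRU request for an arbitrary ISBN. The endpoint and parameter names are copied from the URL above; the helper name `loc_sru_url` is made up for illustration, and the sketch only constructs the URL without sending the request or parsing the response.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in the post above.
LOC_SRU_BASE = "http://z3950.loc.gov:7090/voyager"


def loc_sru_url(isbn, maximum_records=1):
    """Build an SRU searchRetrieve URL for one ISBN (hypothetical helper)."""
    # Drop hyphens and spaces so formatted ISBNs can be used too.
    cleaned = isbn.replace("-", "").replace(" ", "")
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": "dc.resourceIdentifier=" + cleaned,
        "maximumRecords": str(maximum_records),
    }
    # urlencode percent-encodes the "=" inside the query value (%3D),
    # which is equivalent to the literal "=" in the hand-written URL.
    return LOC_SRU_BASE + "?" + urlencode(params)


url = loc_sru_url("978-0-19-928654-6")
print(url)
```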

adamsmith · January 12, 2012

I have a new version of the Worldcat translator, but given its importance it needs testing both on the site and for ISBNs:

To test, download this file:
https://github.com/adam3smith/translators/raw/worldcat/Open%20WorldCat.js
and place it in the translator folder in your Zotero data directory
http://www.zotero.org/support/zotero_data#locating_your_zotero_library
replacing the existing one with the same name.
Restart Firefox/Zotero Standalone.
Don't expect too much - Worldcat's RIS isn't that great either - but you should see some improvements in data quality, most notably multiple creators.
Let us know about any issues or observations.

(Also - Worldcat RIS provides multiple ISBNs - the current translator imports all of them - what would be the desired behavior?)

mark · January 15, 2012

Does ISBN import use this updated version of the translator? If so, these are issues I run into:

1. Name of (last) author imported with a period
2. Some ISBNs don't work (e.g. 0313304483)

(How do I revert to the old translator?)

adamsmith · January 15, 2012

"Does ISBN import use this updated version of the translator?"
yes, it should.

I'll check for the period.
The ISBN - it seems like the translator only works for 13 digit ISBNs, not for 10 digit. I don't think it's actually getting called for the 10 digit ones, so I'm not sure how my changes could cause that.

Revert by using "reset translators" from the advanced tab of the Zotero preferences. For good measure, follow up with "update translators" from the general tab. You may have to restart Firefox/Zotero Standalone.

mark · January 15, 2012

I've previously used the "import by identifier" function with both 10 and 13 digit ISBNs.

Note that multi-author import works fine indeed with the new one.

ajlyon · January 16, 2012

ISBN-10 and ISBN-13 are working for me; I just tried 5776118557 and 9785776118555 (equivalent ones, but still).

adamsmith · January 16, 2012

Oh, I think I know what I did, I think I'm looking for the second result instead of the first one. *headdesk*

ajlyon · January 16, 2012

"(Also - Worldcat RIS provides multiple ISBNs - the current translator imports all of them - what would be the desired behavior)"

I've covered some of this ground before: https://www.zotero.org/trac/ticket/1604 and https://www.zotero.org/trac/ticket/1606

If you embed the ISBN parsing script at https://github.com/ajlyon/identifiers-js, you could check if they're the same and drop the ISBN-10 in that case, but that'd certainly be more than any other translator does.
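A rough sketch of that check, written independently of identifiers-js (whose API is not shown in this thread): convert each ISBN-10 to its ISBN-13 form using the standard 978 prefix and the ISBN-13 check digit, then drop the ISBN-10 when the matching ISBN-13 is already in the list. All function names here are made up for illustration, and input validation is left out.

```python
def clean_isbn(raw):
    """Strip hyphens and spaces; uppercase a trailing ISBN-10 'x'."""
    return raw.replace("-", "").replace(" ", "").upper()


def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 (978 prefix, recomputed check digit)."""
    core = "978" + isbn10[:9]  # the old ISBN-10 check digit is discarded
    # ISBN-13 weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def drop_redundant_isbn10(isbns):
    """Keep ISBN-13s; keep an ISBN-10 only if its ISBN-13 twin is absent."""
    cleaned = [clean_isbn(i) for i in isbns]
    thirteens = {i for i in cleaned if len(i) == 13}
    kept = []
    for isbn in cleaned:
        if len(isbn) == 10 and isbn10_to_isbn13(isbn) in thirteens:
            continue  # same book, already listed as ISBN-13
        if isbn not in kept:
            kept.append(isbn)
    return kept


print(drop_redundant_isbn10(["5776118557", "9785776118555"]))
# ['9785776118555']
print(drop_redundant_isbn10(["0313304483", "9785776118555"]))
# ['0313304483', '9785776118555']
```

The second call keeps both entries because the two ISBNs identify different books, so nothing is lost when a record lists several editions.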

adamsmith · January 16, 2012

thanks mark - that was very helpful. (I actually hadn't been stupid, I just hadn't realized/never seen a possible item display)
A new version is up under the same link.

All ISBNs (as long as they're in Worldcat) should now work, and the periods after author names should be gone. Further testing much appreciated.

adamsmith · January 16, 2012

Actually - ajlyon just pushed the fixed version to the repository. Please report any problems with ISBN or Worldcat here.

adamsmith · December 13, 2012

I'm happy to say that we now query the Library of Congress for ISBNs and only move on to Worldcat if we don't find anything there. You should be seeing a marked improvement in the data imported via ISBN.
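The lookup order can be sketched roughly as below. This is only an illustration of the "try LoC first, then Worldcat" idea, not Zotero's actual code: the two lookup functions are stand-ins backed by placeholder dictionaries, and every name and record here is hypothetical.

```python
# Placeholder catalogues standing in for real network lookups.
FAKE_LOC = {"9780199286546": {"title": "Placeholder title from LoC"}}
FAKE_WORLDCAT = {
    "9780199286546": {"title": "Placeholder title from Worldcat"},
    "9785776118555": {"title": "Another placeholder title"},
}


def lookup_loc(isbn):
    """Hypothetical LoC lookup; returns a record dict or None."""
    return FAKE_LOC.get(isbn)


def lookup_worldcat(isbn):
    """Hypothetical Worldcat lookup; returns a record dict or None."""
    return FAKE_WORLDCAT.get(isbn)


# Order matters: the first source that returns a record wins.
SOURCES = [("Library of Congress", lookup_loc), ("Worldcat", lookup_worldcat)]


def first_match(isbn, sources=SOURCES):
    """Try each source in turn and return the first record found."""
    for name, lookup in sources:
        try:
            record = lookup(isbn)
        except Exception:
            # In this sketch a failing source is treated like "not found".
            record = None
        if record:
            return dict(record, source=name)
    return None


print(first_match("9780199286546"))  # found in the LoC stand-in
print(first_match("9785776118555"))  # falls through to the Worldcat stand-in
print(first_match("0000000000000"))  # found nowhere: None
```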