Documentation request: Coaxing your favorite database towards helpfulness.

scot · March 19, 2007

Am I right in saying that screen scraping is only a best-effort stopgap measure to help import from unenlightened databases (even if it does work very well sometimes)? And that it will always be subject to changes in page layout and perhaps in URLs? That it's hard to get right and hard to keep right, given the 1000s of databases that Zotero users are likely to find 'essential' for their work? ("How could you let the Norwegian Potato History Archive Catalog import mangled page references for going on three days now?!" [etc]) It seems that what really needs to happen is for data providers to move towards enlightenment and give Zotero the handles to accurately import metadata without having to resort to scraping. One of my favorite databases (Copac) jumped at the chance to add COinS support to their new interface after I only mentioned Zotero to them, but I expect others will need a little further persuasion. Could you (Zotero developers or other in-the-know community members) put together a "Guide to help your favorite data provider towards community standards"?

I'm ready and eager to write intelligent, persuasive letters to my favorite data providers, but I don't know enough just yet, and although there are some pointers on the Zotero site (to standards like COinS, etc), I don't have time to learn everything I need to write those 'intelligent user' letters which can soften the minds of even the most hardened of administrators.

This could include:

(1) What are the best technologies for Zotero to use for importing bibliographic and other metadata?

(2) What it means for data providers to use those technologies. How hard is it? What would they have to do? I always feel that the kind of email I intend to write benefits greatly from some understanding of the size of the favour you are requesting.

(3) Links to (a) useful discussions of those technologies, (b) data providers who have implemented them, with some description of who those providers are and how much weight they carry (such things matter to people), and (c) user-level tools (like Zotero) which implement them. Zotero is cool, but it carries more weight to say 'What I'm asking has the potential to benefit all kinds of users'.

(4) Short summaries of the knock-down drag-out arguments for these new technologies.

(5) How to tell what technologies your provider may already be using. (Display a record, view source and look for what?) This avoids the embarrassment of asking for things that are already there, and it helps you diagnose import problems by showing you what is on offer. It might also be useful in this whole process if there were a way to figure out which translator Zotero is using to import from my site.

What do you think?
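On point (5), a rough sketch of what "view source and look for what?" could mean in practice. It assumes the two conventions discussed in this thread: COinS records appear as span elements with class "Z3988", and unAPI pages carry a link element with rel="unapi-server" plus abbr elements with class "unapi-id". The sample page, its URL, and the record id are made up for illustration, and this is not how Zotero itself detects metadata.

```python
from html.parser import HTMLParser


class MetadataSniffer(HTMLParser):
    """Collects COinS spans and unAPI markers found in a page's HTML."""

    def __init__(self):
        super().__init__()
        self.coins = []           # title attributes of <span class="Z3988">
        self.unapi_server = None  # href of <link rel="unapi-server">
        self.unapi_ids = []       # title attributes of <abbr class="unapi-id">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        rels = (attrs.get("rel") or "").split()
        if tag == "span" and "Z3988" in classes:
            self.coins.append(attrs.get("title") or "")
        elif tag == "link" and "unapi-server" in rels:
            self.unapi_server = attrs.get("href")
        elif tag == "abbr" and "unapi-id" in classes:
            self.unapi_ids.append(attrs.get("title") or "")


def sniff(html):
    """Parse an HTML string and return the populated sniffer."""
    parser = MetadataSniffer()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical record page, invented for this example.
SAMPLE_PAGE = """
<html><head>
<link rel="unapi-server" type="application/xml" title="unAPI"
      href="http://example.org/unapi">
</head><body>
<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.btitle=Potato+History"></span>
<abbr class="unapi-id" title="record-42"></abbr>
</body></html>
"""

if __name__ == "__main__":
    found = sniff(SAMPLE_PAGE)
    print("COinS records:", len(found.coins))
    print("unAPI server:", found.unapi_server)
    print("unAPI ids:", found.unapi_ids)
```

If all three come back empty for a record page, the site is probably relying on a screen-scraping translator (or has no support at all).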

noksagt · March 19, 2007

Individual scrapers are great IF sites don't make changes which require updated scrapers.

While I use COinS, it is far from perfect. Two notable deficiencies are the lack of abstract support and the lack of file links (for automatic download of a PDF of the article).

Eventual UnAPI support in Zotero offers the promise of a "standard" which has both of these things (as most bibliographic export/import formats have them).

This isn't exactly what you want, but Zotero has:
http://dev.zotero.org/docs/exposing_your_metadata
http://www.zotero.org/documentation/compatible_standards_and_software

The former is on the wiki, so could be expanded.

scot · March 19, 2007

Thanks, noksagt. It's interesting (and mildly disappointing) that COinS is missing abstract support and links. Still, it's perhaps the 'way to go' just now for many sites (or not)? Can Zotero make use of a solution which uses COinS, and scrapes the abstracts and links?

I'd be glad to hear some more comments regarding the best strategies to improve our bibliographic and metadata import in Zotero. (Should I ask my library to do anything, and if so what?) Does it just amount to: (1) keeping on top of our favorite sites with evolving scrapers, (2) encouraging them to "expose their metadata" in friendly ways, perhaps with COinS now, adding UnAPI later, and all the while, (3) writing clever translators to use whatever metadata we can get, and scrape the rest?

dstillman · March 19, 2007

Exposed metadata is certainly the ideal, and any efforts to convince site operators to implement such changes are quite welcome, but since screen scraping will likely be necessary on many sites for the foreseeable future, one thing we're planning is a better system for the community to contribute to the upkeep of translators that rely on screen scraping. We get errors when translators fail and have a mechanism for pushing updated translators daily, but the turnaround time on updates would likely be much shorter if community members could take a look at the error, make the necessary (and sometimes trivial) changes, and mark the fixed translator as ready to push to the repository.

We hope to have such a system up soon.

noksagt · March 19, 2007

COinS was designed (after OpenURL) for findability, not for citation.

I am not aware of any hybrid approaches which allow both COinS and some supplementary data currently.

Site translators aren't too complex, so some sites may want to build their own.

As for individual libraries: most use one of a handful of popular catalogs. Most of these can be coaxed into having discoverable metadata. Direct your librarian to this site. If you're feeling particularly helpful, deduce which catalog you're using & see if there could already be Zotero support.

Matthias · March 19, 2007

I agree with Scot that it may be beneficial to have a single page that describes (in simple words) the metadata "standards" supported by Zotero and which gives advice on possible implementation requirements. While screen scraping will surely remain important, it should be our goal to convince data providers to properly expose their metadata.

I also think that it may be a good idea to convince the developers of web catalog products to support "standards" such as COinS (or even better unAPI). In addition to end-user requests, the fine folks at CHNM in conjunction with the Mellon Foundation may be in a powerful position to talk with data providers and product developers about this.

dancohen · March 19, 2007

A quick couple of points on this very helpful thread:

1) Simon added unAPI support to the SVN today (and is working out some of the kinks). I think unAPI shows a lot of promise and hopefully having support for it in Zotero will help with adoption.

2) I am going to a "big picture" Mellon meeting next week, and I would be happy to raise the importance of supporting standards. Many other Mellon projects are interested in this too, of course, because it eases exchange of information between and among various digital collections and applications. Ideally we shouldn't have to do any scraping; it's obviously a pragmatic short- or medium-term solution until metadata is exposed consistently (hopefully using standards).

Matthias · March 20, 2007

Dan (and Simon), many thanks for adding unAPI support, this is great news and I'm eager to try this out!

Re. raising the issue of standards support, I'm sure that many of us would greatly appreciate it if you'd bring this up at a coming Mellon meeting. Many good standard methods exist to exchange bibliographic metadata, but they need to be agreed upon and implemented by the different parties (and especially the big players) to become truly useful. I don't think that implementing these standards would take much effort for big vendors; they just need to be convinced of the benefits.

scot · April 16, 2007

Could I re-raise this issue in a slightly more focused manner? There is a manager of a particularly useful library database for my discipline (Biblical Studies) which is so far not compatible with Zotero. It's apparently a homegrown OPAC of some sort, but quite featureful. The database manager seems very interested in new technological developments which can help with research. I'd like to write to him and propose that he make his data Zotero-friendly.

What is the best avenue to ask him to take? From the previous discussion I assume it would be to 'add unAPI tags to his pages.' Although Zotero doesn't have support for it in its public builds, it is already in the dev pipeline, and will presumably show up before too long. Or is it still best to ask for COinS support, until unAPI gets wider adoption? (Or indeed, is it easy enough to implement both simultaneously?)

(1) Could someone comment on unAPI as a recommendation for libraries to 'expose their metadata'? Presumably if it is a good solution we could throw together some documentation to help any forward-thinking library implement it.

(2) Is there any detail specifically related to Zotero-compatibility (i.e. not available from the canonical documentation or promotion websites) which I should pass on to get them started?

noksagt · April 16, 2007

unAPI support is now in 1.0.0b4.r2 & so it can be implemented to make sites compatible with Zotero.

unAPI retains more metadata than COinS & uses standard bibliographic export formats that may be useful to people who don't use Zotero, but do use some other bibliographic software. If the product already has these export formats, unAPI is easy to add. If it doesn't, implementing the export formats will only improve the product.
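To give a feel for the size of the request, here is a minimal sketch of the page markup and the three kinds of request that unAPI (version 1, as I understand the spec) involves: one link element per page pointing at the server, one abbr element per record, and a server that answers a bare request, an id-only request, and an id-plus-format request. The server URL, the record id, and the "mods" format name are placeholders, not anything a particular catalog uses.

```python
from html import escape
from urllib.parse import urlencode


def unapi_markup(server, record_id):
    """Return the two HTML snippets a record page would carry."""
    link = (
        '<link rel="unapi-server" type="application/xml" '
        f'title="unAPI" href="{escape(server)}" />'
    )
    abbr = f'<abbr class="unapi-id" title="{escape(record_id)}"></abbr>'
    return link, abbr


def unapi_urls(server, record_id, fmt):
    """Return the three request URLs a client may issue."""
    return {
        # Lists every format the server can produce.
        "all_formats": server,
        # Lists the formats available for this one record.
        "formats_for_record": f"{server}?{urlencode({'id': record_id})}",
        # Fetches the record itself in the chosen format.
        "record": f"{server}?{urlencode({'id': record_id, 'format': fmt})}",
    }


if __name__ == "__main__":
    # Hypothetical server and record id.
    server = "http://example.org/unapi"
    for snippet in unapi_markup(server, "record-42"):
        print(snippet)
    for name, url in unapi_urls(server, "record-42", "mods").items():
        print(name, "->", url)
```

The server side then only has to map the id and format onto an export the catalog can already produce, which is why existing export formats make unAPI cheap to add.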

COinS is also fairly easy to implement & there's no reason not to have both. COinS can additionally be used by LibX and by a number of other bookmarklets, which make it easier to find downloadable copies of articles.
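As an illustration of "fairly easy", a sketch of generating one COinS span from a record. It assumes the usual COinS convention: an empty span with class "Z3988" whose title attribute holds a URL-encoded OpenURL ContextObject, HTML-escaped. The helper name and the sample record are invented for the example.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(fields, fmt="journal"):
    """Build a COinS <span> from a dict of referent fields.

    `fields` maps OpenURL key names (e.g. "atitle", "date") to values;
    each is emitted with the "rft." prefix.
    """
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", f"info:ofi/fmt:kev:mtx:{fmt}"),
    ]
    pairs += [(f"rft.{key}", value) for key, value in fields.items()]
    # URL-encode the key/value pairs, then HTML-escape the "&" separators.
    title = escape(urlencode(pairs))
    return f'<span class="Z3988" title="{title}"></span>'


if __name__ == "__main__":
    # Hypothetical article record.
    print(coins_span({"atitle": "Potato blight", "date": "1999"}))
```

Note that the journal and book key sets have no slot for an abstract or a link to the PDF, which is the deficiency mentioned earlier in the thread.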

I don't think any Zotero-specific documentation is needed for either COinS or unAPI--there are servers whose implementations predate Zotero & Zotero seems to work with them.

kkraus · April 16, 2007

"COinS can additionally be used by LibX and by a number of other bookmarklets, which make it easier to find downloadable copies of articles."

This is a good point that Noksagt has made elsewhere on this thread, but it bears emphasis: COinS tags can be used to help users not only grab metadata and cite sources, but also locate full-text electronic copies through their local institutions. In addition to LibX, I'd also mention OpenURL resolvers such as SFX. Bottom line is that there are lots of good reasons why content providers might want to embed COinS in their markup.
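A small sketch of why that works: the title attribute of a COinS span is already an OpenURL query string, so a tool can append it to the base URL of the user's institutional resolver. The resolver address below is a placeholder, and real tools such as LibX may add their own parameters; this only shows the basic idea.

```python
from urllib.parse import parse_qsl, urlencode


def resolver_link(coins_title, resolver_base):
    """Turn the (already HTML-unescaped) title of a COinS span into a resolver URL."""
    # Parse and re-encode so that odd spacing or encoding is normalized.
    pairs = parse_qsl(coins_title, keep_blank_values=True)
    return f"{resolver_base}?{urlencode(pairs)}"


if __name__ == "__main__":
    # Hypothetical COinS title and resolver base URL.
    title = "ctx_ver=Z39.88-2004&rft.atitle=Potato+blight&rft.date=1999"
    print(resolver_link(title, "http://resolver.example.edu/openurl"))
```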