Re-prioritize generic translators

adamsmith · September 15, 2016

(continued from https://forums.zotero.org/discussion/62433/report-id-964910888#latest)

I think it's time to re-order the priority of generic translators.

Background (mainly for those not familiar):
Zotero runs four translators on every site a user opens in a browser: unAPI, DOI, COinS, and Embedded Metadata. They're run in order of priority, from the lowest number to the highest. Their current priorities are:
unAPI: 300
COinS: 310
DOI: 320
Embedded Metadata: 400
(All site-specific translators have priority 100, so are run first. Catalog/system translators with somewhat specific URL patterns like Atypon have priority 200).
While users can now select which translators to actually use for import, most people just click the Save to Zotero button, which uses the translator with the lowest priority number among those that detect importable content, so the priority order is quite important.
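As a rough illustration of that selection rule (not Zotero's actual code), here is a minimal Python sketch. The priority numbers are the ones listed above; the function name `default_translator` and the example page are made up for illustration.

```python
# Priorities as listed above; a lower number means the translator is tried first.
CURRENT_PRIORITIES = {
    "unAPI": 300,
    "COinS": 310,
    "DOI": 320,
    "Embedded Metadata": 400,
}


def default_translator(detected, priorities):
    """Return the detected translator with the lowest priority number, or None.

    `detected` is the set of translators that found importable content
    on the page (hypothetical input for this sketch).
    """
    candidates = [name for name in detected if name in priorities]
    if not candidates:
        return None
    return min(candidates, key=lambda name: priorities[name])


# Hypothetical page that exposes both a DOI and embedded metadata.
print(default_translator({"DOI", "Embedded Metadata"}, CURRENT_PRIORITIES))  # DOI
```

With the current numbers, DOI wins on such a page even when the embedded metadata is richer.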
Why a change?:
We have seen two very positive developments: we're seeing more DOIs used, and we're seeing embedded metadata of higher quality, especially on academic sites, which are increasingly using Google/Highwire meta tags. With good embedded metadata, import quality in Zotero is higher than with DOIs. In particular, the embedded metadata can supply abstracts and PDF attachments. Moreover, the DOI importer will often pick up cited items, which is confusing for users.
Proposed change:
unAPI: 300
Embedded Metadata: 310
COinS: 320
DOI: 330

I think given what we're seeing (and even more so once we add JSON-LD support to EM), this makes sense, will lead to an overall improved user experience, and will make it easier for sites/publishers to get reliable results when implementing support for Zotero via EM.

Interested to hear from anyone, @zuphilip and @Dan Stillman in particular.

zuphilip · September 16, 2016

Thank you for starting this discussion. Yes, I absolutely agree that at the moment, in a lot of cases, the EM translator should be preferred over the DOI translator (mostly when some DOIs are present in the reference section of an HTML article).

However, I am not sure that we should also move EM above COinS. I think that in most cases COinS are added specifically so that data can be extracted into reference management systems or used for OpenURL linking. But embedded metadata might also serve other purposes, like search engine optimization, which might be the "wrong" data for us. Thus, I would slightly argue to prefer COinS over EM. Do you have different arguments or experience concerning these two translators?

In the future, we want to have a generic translator, which might even be part of the EM translator: https://github.com/zotero/translators/issues/1092 . What are the implications from that?

adamsmith · September 16, 2016

Yes, I think COinS versus EM is probably the more controversial part and I'm also open for this to go either way. My argument for EM would be that it allows for richer metadata (again, abstract & PDF, probably more) and should just be preferred, especially because I'd argue that the presence of COinS correlates with high-quality EM.

Again, though, I think this is a lot less clear-cut than downgrading DOI.

dstillman · September 16, 2016

My argument for EM would be that it allows for richer metadata (again, abstract & PDF, probably more) and should just be preferred

Yeah, this seems reasonable to me.

EM as fallback translator would then present a bit of a pickle, though, since COinS would never trigger, but we can discuss workarounds on the ticket — I don't think it needs to affect this decision.

Semi-relatedly, though, do we have a sense for how often DOI actually produces a result for the current page (rather than cited items) when no other translator triggers? I'm almost inclined to say that DOI should always be relegated to the context menu (as it would be if EM became the fallback translator). There could possibly even be some modification to the save icon to indicate that secondary translators like that were available.

adamsmith · September 16, 2016

Semi-relatedly, though, do we have a sense for how often DOI actually produces a result for the current page (rather than cited items) when no other translator triggers?

Good question. Not really -- I don't think it's that rare, but I'd have said between 30-50% of relevant sites (i.e. sites that don't have another translator detecting anyway)?

DWL-SDCA · September 16, 2016

There are a few publishers that consistently have awful EM. Some of those also have peculiar and inconsistent formatting of author names. Off the top of my head, Emerald Group journals are a major publisher with both EM and DOI problems. But there are many more.

For these I use both the default translator and at least one of the others available from the drop-down. Then combine them into one or the other to obtain a more complete record.

I import more than 2500 journal articles every week. I'll be glad to keep notes of detailed publishers' / stand-alone journals' metadata quality and share it with you. Wouldn't it be better to base this decision on real information instead of sense?

( I'm compulsive about getting the most complete metadata available -- so much that for publishers that only provide author initials instead of full names we have volunteers who find the full names even if they must go to the author's university, agency, or institution to find those names. Think CMoS ¶14.72 and APA 6.27. )

zuphilip · September 16, 2016

My argument for EM would be that it allows for richer metadata (again, abstract & PDF, probably more) and should just be preferred, especially because I'd argue that the presence of COinS correlates with high-quality EM.

Sorry, but I am not convinced. The metadata might not be much more than some keywords optimizing search engine ranking. On the other hand, everybody using COinS tries to optimize the bibliographic metadata. Moreover, it is possible to use COinS in search result pages as multiples, but then the EM would only grab the metadata of the search result website itself. This would affect most VuFind catalogues, see for example https://katalog.tub.tuhh.de/Search/Results?lookfor=zotero&type=AllFields

How about "merging" these two translators into one or maybe calling one translator from the other?

Semi-relatedly, though, do we have a sense for how often DOI actually produces a result for the current page (rather than cited items) when no other translator triggers?

I searched my library for catalogue name equals "CrossRef" and found 81 items, which I checked: only 7 items were produced from a website where there is no other translator present. The total count of My Library is 5314.
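To put those counts in proportion, here is a small Python calculation using only the three numbers above; the helper name `share` is just for illustration.

```python
def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


library_size = 5314    # total items in My Library
crossref_items = 81    # items with catalogue name "CrossRef"
doi_only_items = 7     # of those, saved from pages with no other translator

print(f"{share(crossref_items, library_size):.1f}% of the library came via CrossRef")
print(f"{share(doi_only_items, crossref_items):.1f}% of those had no other translator")
```

So roughly 1.5% of the library came through the DOI translator, and under a tenth of those items depended on it as the only option.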

dstillman · September 16, 2016

That's a fair point about 'multiple'.

How about "merging" these two translators into one or maybe calling one translator from the other?

This was what I was going to suggest on #1092, since if EM became the fallback translator and we were also inclined to prioritize it, this would allow COinS to supersede a generic webpage save.

If COinS remained a separate translator that was called from EM, though, it'd mean running its detectWeb() twice on every page. If that turned out to have a performance impact, we'd want to either combine them (which would mean users couldn't distinguish between the two manually, which maybe is fine) or add caching of detectWeb() for all runs on a given page load.
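The caching idea could look something like the following Python sketch. It is not Zotero code: `DetectCache`, `page_id`, and `coins_detect` are hypothetical names, and the stand-in detector only checks for the `Z3988` span class that COinS uses.

```python
class DetectCache:
    """Cache detection results per (page load, translator) pair."""

    def __init__(self):
        self._results = {}
        self.calls = 0  # counts how often a detector actually ran

    def detect(self, page_id, translator_name, detect_fn, document):
        key = (page_id, translator_name)
        if key not in self._results:
            self.calls += 1
            self._results[key] = detect_fn(document)
        return self._results[key]


def coins_detect(document):
    # Hypothetical stand-in for the COinS detectWeb(): report an item type
    # if the page contains a COinS span, otherwise nothing.
    return "journalArticle" if 'class="Z3988"' in document else None


cache = DetectCache()
page = '<span class="Z3988" title="ctx_ver=Z39.88-2004"></span>'

# First call: COinS runs on its own. Second call: EM asks for the same result.
first = cache.detect("load-1", "COinS", coins_detect, page)
second = cache.detect("load-1", "COinS", coins_detect, page)
print(first, second, cache.calls)  # journalArticle journalArticle 1
```

Keying on the page load means a reload or navigation gets a fresh detection, while repeated calls within one load reuse the stored result.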

Rintze · September 16, 2016

Semi-relatedly, though, do we have a sense for how often DOI actually produces a result for the current page (rather than cited items) when no other translator triggers?

In my experience (I have 35 items with Library Catalog = "CrossRef" in a library of 740), the DOI translator works more often than not, and when it does the metadata is quite good.

A lot of these items were from publishers for which we have translators (maybe those were temporarily broken), like Springer Link, Hindawi Publishers, ACS, etc. A few links that currently still only report the DOI translator:

http://scripts.iucr.org/cgi-bin/paper?S1399004714002788
http://www.nature.com/articles/sdata201514
http://www.nature.com/articles/srep19233
http://content.karger.com/Article/Abstract/170217
http://europepmc.org/abstract/MED/9836874

zuphilip · September 16, 2016

@Rintze : All these examples also provide metadata which can be extracted by using the EM translator, which is just "behind" the currently more prioritized DOI translator.

Rintze · September 16, 2016

Yeah, sorry, I shouldn't have written "only". I meant that these were the links in my library that didn't show a publisher-specific translator.

adamsmith · September 16, 2016

OK, zuphilip has convinced me. I'd suggest the following then:

Immediately: switch the priority of DOI and EM, leave COinS before EM
Medium term: incorporate COinS into the EM translator (FWIW, I don't think users need to distinguish between the two. Only librarians -- no offense -- know what COinS means and technically it is embedded metadata, too.)
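A quick Python sketch of what the immediate step would change. The current numbers are from the first post; swapping the two values (so EM gets 320 and DOI gets 400) is an assumption for illustration, since no final numbers have been settled here. The helper names `swap` and `pick` are made up.

```python
CURRENT = {"unAPI": 300, "COinS": 310, "DOI": 320, "Embedded Metadata": 400}


def swap(priorities, a, b):
    """Return a copy of `priorities` with the values of `a` and `b` exchanged."""
    updated = dict(priorities)
    updated[a], updated[b] = priorities[b], priorities[a]
    return updated


def pick(detected, priorities):
    """Return the detected translator with the lowest priority number, or None."""
    return min(detected, key=lambda name: priorities[name], default=None)


# Assumed outcome of the immediate step: EM 320, DOI 400, COinS still at 310.
IMMEDIATE = swap(CURRENT, "DOI", "Embedded Metadata")

article_page = {"DOI", "Embedded Metadata"}    # hypothetical article page
catalogue_page = {"COinS", "Embedded Metadata"}  # hypothetical search results page

print(pick(article_page, CURRENT), "->", pick(article_page, IMMEDIATE))
print(pick(catalogue_page, CURRENT), "->", pick(catalogue_page, IMMEDIATE))
```

On the article page the default moves from DOI to Embedded Metadata, while the catalogue page keeps COinS, so the 'multiple' case raised above is unaffected.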

dstillman · October 16, 2016

For what it's worth:

https://github.com/zotero/zotero/issues/1110

There's BibTeX on the page (which we don't do anything with), embedded metadata (which omits at least the DOI), and a DOI (which we don't extract properly currently but which gives the best metadata). So here, at least, if we moved EM over DOI we'd be getting worse data, though we wouldn't need to show the Select Items dialog. Still probably makes sense, but it's probably pretty common for EM to be worse than DOI data.

dstillman · October 16, 2016

Ultimately, I suspect we'll want to merge DOI into the EM translator as well (as suggested here), though I'm not sure what logic we'd use to handle a case like this.