Re-prioritize generic translators
(continued from https://forums.zotero.org/discussion/62433/report-id-964910888#latest)
I think it's time to re-order the priority of generic translators.
Background (mainly for those not familiar):
Zotero runs four translators on every site a user opens in a browser: unAPI, DOI, COinS, and Embedded Metadata. They're run in order of priority, from the lowest number to the highest. Their current priorities are:
unAPI: 300
COinS: 310
DOI: 320
Embedded Metadata: 400
(All site-specific translators have priority 100, so they are run first. Catalog/system translators with somewhat specific URL patterns, like Atypon, have priority 200.)
While users can now select which translators to actually use for import, most people just click the Save to Zotero button, which uses the translator with the lowest priority number that detects importable content, so the priority order is quite important.
Why a change?
We have seen two very positive developments: more DOIs are being used, and embedded metadata is of higher quality, especially on academic sites, which increasingly use Google Scholar/HighWire meta tags. With good embedded metadata, import quality in Zotero is higher than with DOIs. In particular, embedded metadata often provides abstracts and PDF attachments. Moreover, the DOI importer will often pick up cited items, which is confusing for users.
Proposed change:
unAPI: 300
Embedded Metadata: 310
COinS: 320
DOI: 330
I think given what we're seeing (and even more so once we add JSON-LD support to EM), this makes sense, will lead to an overall improved user experience, and will make it easier for sites/publishers to get reliable results when implementing support for Zotero via EM.
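To make the effect concrete, here is a minimal Python sketch of how the default translator would change under the proposal. It is not Zotero's actual selection code: the `default_translator` helper and the example page are hypothetical, and only the priority numbers are taken from the lists above.

```python
# Illustrative only: Zotero's real translator selection is implemented in JavaScript.
# Priority numbers are copied from the current and proposed lists in this post.
CURRENT = {"unAPI": 300, "COinS": 310, "DOI": 320, "Embedded Metadata": 400}
PROPOSED = {"unAPI": 300, "Embedded Metadata": 310, "COinS": 320, "DOI": 330}


def default_translator(priorities, detected):
    """Return the detecting translator with the lowest priority number, or None.

    Hypothetical helper: `detected` is the set of translators that found
    importable content on the page.
    """
    candidates = [name for name in detected if name in priorities]
    if not candidates:
        return None
    return min(candidates, key=lambda name: priorities[name])


# Hypothetical page: it has meta tags and also contains DOIs
# (for example in its reference list), so both translators detect content.
detected_on_page = {"DOI", "Embedded Metadata"}

print(default_translator(CURRENT, detected_on_page))   # DOI
print(default_translator(PROPOSED, detected_on_page))  # Embedded Metadata
```

On such a page, the Save to Zotero button would go from using DOI to using Embedded Metadata, while pages where only one generic translator detects content are unaffected.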
Interested to hear from anyone, @zuphilip and @Dan Stillman in particular.
However, I am not sure that we should also move EM above COinS. I think that in most cases COinS are provided specifically so that data can be extracted into reference management systems or used for OpenURL linking. Embedded metadata, by contrast, may also serve other purposes, such as search engine optimization, which might make it the "wrong" data for us. Thus, I would lean slightly toward preferring COinS over EM. Do you have different arguments or experience concerning these two translators?
In the future, we want to have a generic translator, which might even be part of the EM translator: https://github.com/zotero/translators/issues/1092. What are the implications of that?
Again, though, I think this is a lot less clear-cut than downgrading DOI.
EM as a fallback translator would then present a bit of a pickle, though, since COinS would never trigger, but we can discuss workarounds on the ticket; I don't think it needs to affect this decision.
Semi-relatedly, though, do we have a sense for how often DOI actually produces a result for the current page (rather than cited items) when no other translator triggers? I'm almost inclined to say that DOI should always be relegated to the context menu (as it would be if EM became the fallback translator). There could possibly even be some modification to the save icon to indicate that secondary translators like that were available.
For these I use both the default translator and at least one of the others available from the drop-down. I then merge them into one or the other to obtain a more complete record.
I import more than 2500 journal articles every week. I'll be glad to keep detailed notes on the metadata quality of publishers and stand-alone journals and share them with you. Wouldn't it be better to base this decision on real data rather than on impressions?
(I'm compulsive about getting the most complete metadata available, so much so that for publishers that only provide author initials instead of full names we have volunteers who find the full names, even if they must go to the author's university, agency, or institution to find those names. Think CMoS ¶14.72 and APA 6.27.)
How about "merging" these two translators into one, or maybe calling one translator from the other? I searched my library for items where the catalogue name equals "CrossRef" and found 81 items, which I checked: only 7 items were produced from a website where no other translator is present. My Library contains 5314 items in total.
If COinS remained a separate translator that was called from EM, though, it'd mean running its detectWeb() twice on every page. If that turned out to have a performance impact, we'd want to either combine them (which would mean users couldn't distinguish between the two manually, which maybe is fine) or add caching of detectWeb() for all runs on a given page load.
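As a rough illustration of the caching option, the following Python sketch memoizes detection results per translator and page load. It is an assumption-laden sketch, not Zotero code: `DetectCache`, the `page_id` key, and the stand-in `coins_detect_web` function are all hypothetical, and the real detectWeb() is a JavaScript function inside each translator.

```python
from typing import Callable, Dict, Optional, Tuple


class DetectCache:
    """Hypothetical per-page-load cache for detection results."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], Optional[str]] = {}
        self.calls = 0  # number of real detection runs, kept for illustration

    def detect(
        self,
        translator: str,
        page_id: str,
        detect_web: Callable[[], Optional[str]],
    ) -> Optional[str]:
        """Run detect_web at most once per (translator, page load)."""
        key = (translator, page_id)
        if key not in self._results:  # also caches "nothing found" (None)
            self.calls += 1
            self._results[key] = detect_web()
        return self._results[key]


def coins_detect_web() -> Optional[str]:
    # Stand-in for the COinS translator's detectWeb(); pretends the page
    # contains a single journal article.
    return "journalArticle"


cache = DetectCache()
# First call: COinS runs as its own translator.
first = cache.detect("COinS", "page-load-1", coins_detect_web)
# Second call: EM invokes COinS on the same page load and gets the cached result.
second = cache.detect("COinS", "page-load-1", coins_detect_web)
print(first, second, cache.calls)  # journalArticle journalArticle 1
```

Keying on the page load rather than the URL is a design assumption here: it keeps a reloaded or dynamically changed page from reusing stale results.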
A lot of these items were from publishers for which we have translators (maybe those were temporarily broken), like Springer Link, Hindawi Publishers, ACS, etc. A few links that currently still only report the DOI translator:
http://scripts.iucr.org/cgi-bin/paper?S1399004714002788
http://www.nature.com/articles/sdata201514
http://www.nature.com/articles/srep19233
http://content.karger.com/Article/Abstract/170217
http://europepmc.org/abstract/MED/9836874
Immediately: switch the priority of DOI and EM, leave COinS before EM.
Medium term: incorporate COinS into the EM translator. (FWIW, I don't think users need to distinguish between the two. Only librarians, no offense, know what COinS means, and technically it is embedded metadata, too.)
https://github.com/zotero/zotero/issues/1110
There's BibTeX on the page (which we don't do anything with), embedded metadata (which omits at least the DOI), and a DOI (which we don't extract properly currently but which gives the best metadata). So here, at least, if we moved EM over DOI we'd be getting worse data, though we wouldn't need to show the Select Items dialog. Still probably makes sense, but it's probably pretty common for EM to be worse than DOI data.