DOI and duplicate

thanatophobic · October 2, 2018

Hya

It is always nice to use such excellent software.
The documentation reads "Zotero currently uses the title, DOI, and ISBN fields to determine duplicates." under duplicate detection.
The rule for DOIs reads, "Each DOI® name is a unique "number", assigned to identify only one entity." http://www.doi.org/doi_handbook/2_Numbering.html#2.1
But some publishers DO NOT FOLLOW the rule. Many entries share the same DOI, 10.1056/NEJMc1804294. I have also seen all the abstracts of a meeting share one DOI.
I guess some sort of tweak may be necessary for duplicate detection.
cheers

dstillman · October 2, 2018

I'm not sure what you mean here. 10.1056/NEJMc1804294 is https://www.nejm.org/doi/10.1056/NEJMc1804294

Where are you seeing many entries sharing that DOI?

bwiernik · October 2, 2018

Some conferences will publish all of the abstracts for the meeting as one article in a journal, with a single DOI. I’ve also seen some journals bundle all of the commentaries or letters responding to an article together into one publication and give it a single DOI.

In a technical sense, neither of these is “many entries sharing same DOI”. They are single publications with several component parts. This is enough of an edge case that I don’t think it is necessary to change anything about the duplicate detection algorithm other than providing an option to manually mark items as non-duplicates.

djross3 · October 2, 2018

DOIs are not truly unique identifiers:
1. One-to-Many: Yes, sometimes multiple items are included under the same general DOI (as bwiernik explained).
2. Many-to-One: Different websites give the same article different DOIs that point to their servers. This applies to journals available through multiple databases (and possibly directly at the publisher).
3. New DOIs: I've often found newly published articles, especially ahead-of-print, have DOIs that don't work. This is usually resolved quickly (weeks or months?) but given the (justifiable) bias toward citing the newest, state-of-the-art research in many fields, these papers will be cited more often than most. (These probably are unique most of the time, but I wouldn't be surprised if some are duplicates or errors.)
4. Duplicates (or just errors): some publishers (especially less prestigious/sophisticated ones) just mistakenly assign the same DOI multiple times and might (or might not) fix it later.

Ideally, DOIs should be unique identifiers, but that's not always the case in practice.
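
To see how often cases 1 and 4 occur in a given library, one could group items by DOI and list the DOIs attached to more than one distinct title. This is only a minimal sketch in Python, not how Zotero itself works: the item dictionaries with `title` and `doi` keys and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Resolver prefixes that are often stored together with the DOI itself.
RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lowercase a DOI (DOIs are case-insensitive) and strip a resolver prefix."""
    doi = raw.strip().lower()
    for prefix in RESOLVER_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def shared_dois(items):
    """Return {doi: [titles]} for every DOI attached to more than one distinct title.

    `items` is a hypothetical list of dicts with "title" and "doi" keys.
    """
    titles_by_doi = defaultdict(set)
    for item in items:
        if item.get("doi"):  # skip items without a DOI
            titles_by_doi[normalize_doi(item["doi"])].add(item["title"].strip().lower())
    return {doi: sorted(titles) for doi, titles in titles_by_doi.items() if len(titles) > 1}
```

A DOI that shows up in the result is not necessarily a publisher error. It may be a bundle of letters or abstracts published as one item, so the output is a list to check by hand rather than a list of mistakes.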

adamsmith · October 2, 2018

All of this is correct, but only 1) and 4) are relevant for Zotero's duplicate detection algorithm, and I think they are sufficiently rare edge cases to ignore for this purpose.
(Ideally none of these, with the possible exception of 2), should occur, obviously, but that's not our concern here.)

djross3 · October 2, 2018

I understand. However, do note that (2) will give false negatives for duplicates, so the DOI also isn't a unique identifier that can always reliably determine duplicates. This might be important for someone who, for example, searches multiple databases for relevant entries and adds them to Zotero without manually checking. They would end up with many duplicates that Zotero ignores because the DOIs differ.

As @bwiernik said, having an option to manually remove duplicates would be fine. As a heuristic, DOIs are fine for identifying possible duplicates, but it's frustrating that there's no way to clear out the duplicates list after checking manually. I know this feature is complicated and in development, though.

bwiernik · October 2, 2018

An option to manually mark non-duplicates is planned. For now, I recommend not worrying about it.

Gurdas_Sandhu · October 2, 2018

Not really a Zotero issue, but related to duplicate detection. The case is similar to (2) in djross3's post.

A year ago, the Transportation Research Record (TRR) journal of the Transportation Research Board (TRB) moved to Sage publications.
http://www.trb.org/Research/Blurbs/177011.aspx

And guess what, Sage has assigned different DOIs to all historical articles. I happened to add a few pre-2017 articles yesterday, and they seemed familiar, so I wondered whether I already had them in my database. But they would not show up in my duplicates list because they had different DOIs.

Can the user have more control over how duplicates are detected? Or, AT LEAST be able to mark a group of articles as duplicates (thus, different than marking as non-duplicates and maybe easier?). Or, allow a "maybe duplicates" collection based on somewhat less stringent criteria such as identical titles?
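
The "maybe duplicates" idea could look roughly like the following. This is a hedged sketch in Python, not a Zotero feature: the item dictionaries with `key` and `title` fields, and both function names, are hypothetical. It groups items whose titles match after lowercasing and removing punctuation, and it ignores the DOI entirely.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Casefold a title, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", title.casefold())
    return " ".join(text.split())


def maybe_duplicates(items):
    """Return lists of item keys whose normalized titles are identical.

    `items` is a hypothetical list of dicts with "key" and "title" fields.
    DOIs are deliberately not consulted, so two records of the same article
    with different DOIs still land in the same group.
    """
    groups = defaultdict(list)
    for item in items:
        groups[normalize_title(item["title"])].append(item["key"])
    return [keys for keys in groups.values() if len(keys) > 1]
```

Matching on titles alone will also catch unrelated items with generic titles such as "Introduction" or "Editorial", so the result is a list of candidates to review, not something to merge automatically.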

Ideally, I wish DOIs were strictly unique. I will write to TRR and Sage but doubt that's going to help.

djross3 · October 2, 2018

Aside from difficulty in finding them, once you recognize duplicates that actually are duplicates, why not just merge the items rather than manually marking them as duplicates?

Regarding TRR and Sage, my guess is that Sage is now using their own prefix. One problematic aspect of DOIs is that only the first part of the DOI is centrally assigned, and that just then points to a collection of DOIs defined by the second part at the host. That's why most DOIs follow a mostly numerical format but some publishers/journals decide to have mostly alphabetic DOIs, again just for that second part. In the end, it's probably better if Sage uses their own DOIs because they're now managing the material, but of course this just reveals the inherent weakness in DOIs: It's still up to publishers/distributors to manage their content and keep it online. If Sage or another publisher shuts down and disappears from the internet, the DOI is useless. Or if TRR changes distributors again, they might reset all the DOIs again as well. Unlikely, and still helpful in most cases for locating articles, but yet another example of why DOIs are better in theory than practice. Certainly we would never want to replace full citations with DOIs only!
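
The two-part structure described above can be made concrete with a small sketch. It is Python, and `split_doi` is a made-up helper name. It assumes the prefix is everything before the first slash and begins with "10.", and it splits only on that first slash because the suffix may itself contain slashes.

```python
def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash.

    The prefix is the centrally assigned part; the suffix is chosen by
    whoever registers the DOI. Raises ValueError if the string does not
    look like "10.<registrant>/<suffix>".
    """
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI of the form 10.xxxx/suffix: {doi!r}")
    return prefix, suffix


# The DOI mentioned at the top of this thread.
prefix, suffix = split_doi("10.1056/NEJMc1804294")
print(prefix)  # 10.1056
print(suffix)  # NEJMc1804294
```

Comparing prefixes is a quick way to spot that two DOIs were registered under different prefixes, but it says nothing about whether they point to the same article.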

Gurdas_Sandhu · October 2, 2018

You're right, I could just use the merge function. I did not think of that since I've never used it.

Just in case this is helpful to anyone, here's an example of the same article with different DOIs:
https://doi.org/10.3141/1987-09
https://doi.org/10.1177/0361198106198700109

Interestingly, when I click the DOI of another article from 2013, it is redirecting to the SAGE page that has the same DOI.
http://dx.doi.org/10.3141/2340-02

So, SAGE is not assigning new DOIs to all articles but redirecting some of them? Confusing.

thanatophobic · October 3, 2018

Umm,

https://doi.org/10.1172/JCI18937
https://doi.org/10.1172/JCI200318937
These two DOIs point to the same article. Two DOIs within a single publisher.

Now I've learned that duplicate detection is not an easy task.
cheers

DWL-SDCA · October 3, 2018

The same TRB / Sage DOI problem occurred for many (but not all) Blackwell journals when Blackwell was bought by Wiley. For a year or two the Blackwell DOIs still connected. The old DOIs no longer resolve. I could editorialize about my disappointment at this but I'll save that for another venue.