Two simple improvements for Duplicate Items (#papercut)

mark · December 16, 2017

While the current implementation of Duplicate Items probably works reasonably well for many users, it quickly becomes unusable in large libraries. In my library of more than 13,000 items, the Duplicate Items view is permanently populated by 149 items, most of which are true non-duplicates (e.g. different editions, articles that come in multiple parts, books that come in multiple volumes, etc.), and another sizable subset of which are near-duplicates that are unresolvable because of item type differences. Any newly introduced simple full duplicates are almost impossible to spot in this mess.

Here are two simple improvements for Duplicate Items that would make life easier for people like me:

1. Allow users to mark items as non-duplicates. Allow users to hide items from the Duplicate Items view so that the pane does not fill up with false positives that are not actionable. See this recent thread but also this ancient comment by @danstillman, where the hope still was that the detection algorithm would improve soonish. A rough solution would be to simply prevent items marked as non-duplicates from showing up in the Duplicate Items view. (A problem with that could be that newly introduced duplicates of those items would then also not show up, but for such boundary cases the benefits outweigh the cost.)

2. Allow resolving near-duplicates with item type differences. Near-duplicates where only the item type differs are shown under Duplicate Items but do not permit any action because 'Merged items must be of the same type'. See this ancient message describing the problem. Note that this is increasingly common with the rise of preprint servers, where items start out as preprints and later come out as papers. It looks like the existing version resolution dialog can already display (and therefore handle) item type differences, so little extra UI work seems needed (I can't speak to what needs to be done under the hood to make this happen, of course).

Thanks for considering these papercuts!

DWL-SDCA · December 16, 2017

Please excuse my naïveté here, but a quick look at the Zotero database structure seems to show that each record item (article, book, book section, thesis, etc.) in a library has a unique identifier. Is it not possible to add a new field (NotDuplicateWith) to each record of the appropriate table(s) and have this field contain all of the unique IDs of items marked as non-duplicates? The duplicate detection utility could consult this field and exclude those record pairs (or multiples) from the list. If a newly added item is a duplicate (or possible duplicate), its ID number wouldn't be in any existing record's NotDuplicateWith field, so it could still show up in the duplicate utility's results. This is essentially what we successfully did with the SafetyLit database and its duplicate detection system.
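To make the idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `not_duplicate_with` mapping and the `visible_pairs` helper are hypothetical names, not anything that exists in Zotero, and the real duplicate detection would supply the candidate groups.

```python
from itertools import combinations

# Hypothetical data: item ID -> IDs the user has marked as "not a duplicate of".
# This mirrors the NotDuplicateWith idea; it is not an existing Zotero field.
not_duplicate_with = {
    101: {102},
    102: {101},
}


def visible_pairs(candidate_group, not_duplicate_with):
    """Return the pairs from one detected group, minus pairs marked as non-duplicates."""
    pairs = []
    for a, b in combinations(sorted(candidate_group), 2):
        # Check both directions so a one-sided mark still hides the pair.
        if b in not_duplicate_with.get(a, set()) or a in not_duplicate_with.get(b, set()):
            continue
        pairs.append((a, b))
    return pairs


# Items 101 and 102 were marked as non-duplicates, so nothing is listed.
print(visible_pairs({101, 102}, not_duplicate_with))       # []
# A newly added item 103 is in nobody's exclusion list, so it still shows up.
print(visible_pairs({101, 102, 103}, not_duplicate_with))  # [(101, 103), (102, 103)]
```

Filtering by pair rather than by item is what keeps a later true duplicate of a marked item visible.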

dstillman · December 17, 2017

It's not quite that simple — while the local implementation is easy, Zotero data needs to sync, so the API has to support this somehow — but we're planning to implement marking of non-duplicates using our item relation mechanism.

"Allow resolving near-duplicates with item type differences"

I think we can just allow merging any items and hide fields that aren't valid for the selected item type. Issue created.
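For readers wondering what "hide fields that aren't valid for the selected item type" could look like, here is a rough Python sketch. It is not Zotero's implementation: `VALID_FIELDS` is a hypothetical, heavily simplified field map, and `merge_items` is an illustrative helper that assumes the user has already picked the target type.

```python
# Hypothetical, heavily simplified field map; the real type/field schema is much larger.
VALID_FIELDS = {
    "preprint": {"title", "date", "repository", "url"},
    "journalArticle": {"title", "date", "publicationTitle", "volume", "url"},
}


def merge_items(master, other, target_type):
    """Merge two item dicts into target_type, keeping only fields valid for that type."""
    merged = {"itemType": target_type}
    for field in sorted(VALID_FIELDS[target_type]):
        # Prefer the master's value; fall back to the other item's value.
        value = master.get(field) or other.get(field)
        if value:
            merged[field] = value
    return merged


preprint = {"itemType": "preprint", "title": "A Study", "date": "2017",
            "repository": "arXiv"}
article = {"itemType": "journalArticle", "title": "A Study", "date": "2018",
           "publicationTitle": "Journal of Examples", "volume": "12"}

# Merging into the journal article type drops the preprint-only "repository" field.
print(merge_items(article, preprint, "journalArticle"))
```

In this sketch the invalid fields are simply left out of the merged item; a real dialog would presumably hide them from the comparison as well.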

rtgilbert · December 14, 2018

My library is rather small (~2000 items) yet I have the false duplicate problem as well. Am posting here and here to keep the issue alive.

bjohas · August 15, 2020

Adding myself to this thread.

mark · February 5, 2025

I missed the seven-year anniversary of this thread but will bump it just to note that my Duplicate Items pane has grown to 276 items and is as unmanageable as ever because these papercuts still exist.

Besides the two things I noted above, I will add a third, which is by far the quickest to implement:

3. Hide duplicate items that cannot be resolved from the Duplicate Items pane. The rationale is that the Duplicate Items pane as it currently stands is unusable in any library of reasonable size. As long as we can't merge near-duplicates of different types, it doesn't make sense to taunt me with their existence, and their listing crowds out the actual duplicates I do want to resolve.
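As a rough illustration of how small this change could be, here is a Python sketch. It assumes the detection step already produces groups of candidate items and that "cannot be resolved" means the group mixes item types; `resolvable_groups` and the sample data are hypothetical.

```python
def resolvable_groups(groups):
    """Keep only duplicate groups whose items all share one item type."""
    return [g for g in groups if len({item["itemType"] for item in g}) == 1]


# Hypothetical output of duplicate detection: one mergeable group, one mixed-type group.
groups = [
    [{"id": 1, "itemType": "book"}, {"id": 2, "itemType": "book"}],
    [{"id": 3, "itemType": "preprint"}, {"id": 4, "itemType": "journalArticle"}],
]

shown = resolvable_groups(groups)
print([[item["id"] for item in g] for g in shown])  # [[1, 2]]
```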