Two simple improvements for Duplicate Items (#papercut)

While the current implementation of Duplicate Items probably works reasonably well for many users, in large libraries it quickly becomes unusable. In my library of more than 13,000 items, the Duplicate Items view is permanently populated by 149 items, most of which are true non-duplicates (e.g. different editions, articles that come in multiple parts, books that come in multiple volumes, etc.), and another sizable subset of which are near-duplicates that are unresolvable because of item type differences. Any newly introduced simple full duplicates are almost impossible to spot in this mess.

Here are two simple improvements for Duplicate Items that would make life easier for people like me:

1. Allow users to mark items as non-duplicates. Allow users to hide items from the Duplicate Items view so that it does not fill up with false positives that are not actionable. See this recent thread but also this ancient comment by @danstillman, where the hope still was that the detection algorithm would improve soonish. A rough solution would be to simply prevent items marked as non-duplicates from showing up in the Duplicate Items view. (A problem with that could be that newly introduced duplicates of those items would then also not show up, but for such boundary cases the benefits outweigh the cost.)

2. Allow resolving near-duplicates with item type differences. Near-duplicates where only the item type differs are shown under Duplicate Items but do not permit any action because 'Merged items must be of the same type'. See this ancient message describing the problem. Note that this is increasingly common with the rise of preprint servers, which have items starting out as preprints and later coming out as papers. It looks like the existing version resolution dialog can already display (and therefore handle) item type differences, so little extra UI work seems needed (I can't speak to what needs to be done under the hood to make this happen, of course).
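
To make the second suggestion concrete, here is a minimal sketch of what a merge across item types might look like, assuming the merged item simply drops fields that are not valid for the chosen type. The `VALID_FIELDS` table and the `merge_items` helper are hypothetical and purely illustrative; they are not Zotero's actual schema or API.

```python
# Hypothetical per-type field lists, for illustration only (not Zotero's schema).
VALID_FIELDS = {
    "preprint": {"title", "date", "repository", "url"},
    "journalArticle": {"title", "date", "publicationTitle", "volume", "pages", "url"},
}


def merge_items(master, other, target_type):
    """Merge two item dicts into one item of target_type.

    Values from `master` win; `other` only fills in fields that are empty
    in `master`. Fields that are not valid for target_type are left out.
    """
    merged = {"itemType": target_type}
    for field in sorted(VALID_FIELDS[target_type]):
        value = master.get(field) or other.get(field)
        if value:
            merged[field] = value
    return merged


article = {
    "itemType": "journalArticle",
    "title": "A Study",
    "date": "2018",
    "publicationTitle": "Journal X",
    "volume": "12",
}
preprint = {
    "itemType": "preprint",
    "title": "A Study",
    "date": "2017",
    "repository": "arXiv",
    "url": "https://example.org/abs/1",
}

merged = merge_items(article, preprint, "journalArticle")
```

In this sketch the published article keeps its own date and journal details, picks up the URL from the preprint, and the preprint-only `repository` field is dropped because it is not valid for a journal article.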

Thanks for considering these papercuts!
  • Please excuse my naïveté here, but a quick look at the Zotero database structure seems to show that each record item (article, book, book section, thesis, etc.) in a library has a unique identifier. Is it not possible to add a new field (NotDuplicateWith) to each record of the appropriate table(s) and have this field contain the unique IDs of all items marked as non-duplicates of that record? The duplicate detection utility could consult this field and exclude those record pairs (or multiples) from the list. If a newly added item is a duplicate (or possible duplicate), its ID number wouldn't be in any existing record's NotDuplicateWith field, so it could still show up in the duplicate utility's results. This is essentially what we successfully did with the SafetyLit database and its duplicate detection system.
  • edited December 17, 2017
    It's not quite that simple — while the local implementation is easy, Zotero data needs to sync, so the API has to support this somehow — but we're planning to implement marking of non-duplicates using our item relation mechanism.
    "Allow resolving near-duplicates with item type differences":
    I think we can just allow merging any items and hide fields that aren't valid for the selected item type. Issue created.
  • My library is rather small (~2000 items) yet I have the false duplicate problem as well. Am posting here and here to keep the issue alive.
  • Adding myself to this thread.
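
The non-duplicate marking discussed in this thread can be sketched as a filter over candidate pairs. This is a minimal illustration only, assuming duplicate detection yields pairs of item IDs and that user decisions are stored as unordered ID pairs; the function name `unresolved_pairs` and the data layout are hypothetical and do not reflect how Zotero stores relations or syncs them.

```python
def unresolved_pairs(candidate_pairs, not_duplicates):
    """Return the candidate pairs the user has not marked as non-duplicates.

    Pairs are compared without regard to order, so (1, 2) and (2, 1)
    count as the same decision.
    """
    marked = {frozenset(pair) for pair in not_duplicates}
    return [pair for pair in candidate_pairs if frozenset(pair) not in marked]


# Items 1 and 2 are different editions; the user marked them as non-duplicates.
not_duplicates = [(2, 1)]

# Before a new import: the only candidate is the pair already dismissed.
before = unresolved_pairs([(1, 2)], not_duplicates)

# After importing item 5, which looks like both 1 and 2:
after = unresolved_pairs([(1, 2), (1, 5), (2, 5)], not_duplicates)
```

Because the exclusion is stored per pair rather than per item, the dismissed pair (1, 2) stays hidden while the pairs involving the newly imported item 5 still surface, which avoids the boundary case raised in the original post.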