Large collections and bulk de-duplication

One of our group libraries contains more than 60k records, and a large share of them are duplicates. Removing them one by one through the deduplication menu would be extremely costly in both time and effort, but as far as I know there are no built-in or third-party solutions for bulk deduplication in Zotero.

1. We've had the idea of deduplicating the SQLite file directly. That should be possible with SQL, pandas, or similar tools; however, the Zotero database file contains multiple tables, so I suppose deduplicating just one of them would corrupt the whole database. I doubt this idea is actually viable, but perhaps someone has tried it?

2. Are there any new solutions we haven't heard of?

3. On a side note: is there a 'maximum intended' collection size for the best Zotero performance? For example, the aforementioned 60k library makes Zotero freeze constantly on Windows; Ubuntu works slightly better for some reason. My laptop has a modest configuration, but it would still be good to know what resources a library like this requires.
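
To illustrate the worry in point 1, here is a minimal sketch on a toy in-memory database. The table and column names (`items`, `itemFields`, `attachments`) are made up for illustration and are not Zotero's real schema. It finds duplicate candidates by normalized title without modifying anything, then shows that deleting a row from one table alone leaves dangling rows in the others.

```python
import sqlite3

# Toy schema: all names are illustrative assumptions, NOT Zotero's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (itemID INTEGER PRIMARY KEY, dateAdded TEXT);
    CREATE TABLE itemFields (
        itemID INTEGER REFERENCES items(itemID),
        fieldName TEXT,
        value TEXT
    );
    CREATE TABLE attachments (
        attachmentID INTEGER PRIMARY KEY,
        parentItemID INTEGER REFERENCES items(itemID),
        path TEXT
    );
    INSERT INTO items VALUES (1, '2020-01-01'), (2, '2021-06-01'), (3, '2021-07-01');
    INSERT INTO itemFields VALUES
        (1, 'title', 'Deep Learning'),
        (2, 'title', 'deep learning '),
        (3, 'title', 'Another Paper');
    INSERT INTO attachments VALUES (10, 2, 'paper.pdf');
""")


def duplicate_groups(conn):
    """Return lists of itemIDs that share a normalized title (read-only)."""
    rows = conn.execute("""
        SELECT GROUP_CONCAT(itemID)
        FROM (
            SELECT itemID, LOWER(TRIM(value)) AS norm
            FROM itemFields
            WHERE fieldName = 'title'
        )
        GROUP BY norm
        HAVING COUNT(*) > 1
    """).fetchall()
    return [sorted(int(i) for i in row[0].split(",")) for row in rows]


def orphaned_rows(conn):
    """Count rows in dependent tables that point at an item that no longer exists."""
    fields = conn.execute(
        "SELECT COUNT(*) FROM itemFields "
        "WHERE itemID NOT IN (SELECT itemID FROM items)"
    ).fetchone()[0]
    files = conn.execute(
        "SELECT COUNT(*) FROM attachments "
        "WHERE parentItemID NOT IN (SELECT itemID FROM items)"
    ).fetchone()[0]
    return fields, files


groups = duplicate_groups(conn)

# Naive "dedupe one table only": remove item 2 from items and nothing else.
# SQLite does not enforce foreign keys unless PRAGMA foreign_keys is ON,
# so the delete succeeds and silently leaves dangling rows behind.
conn.execute("DELETE FROM items WHERE itemID = 2")
orphans = orphaned_rows(conn)
```

Here `groups` contains the pair of items whose titles match after trimming and lowercasing, and `orphans` counts the field and attachment rows left pointing at the deleted item. Any real attempt would presumably need to work on a backup copy and clean up every dependent table consistently.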
  • There's not currently anything better than the Duplicate Items view.

    This is more of a conceptual problem than a technical one. It's not really clear what any sort of automatic deduplication process would do — duplicate items can have both different versions of fields and different versions of files (e.g., from PDF watermarks or separate snapshots of the same webpage), neither of which can necessarily be resolved in any automated manner.

    If we could come up with a UI for it, we could at least offer to automatically merge the subset of items that can be resolved automatically. We could also merge exact file matches when merging parent items, but even then the attachment title, filename, and embedded note could all differ.

    Out of curiosity, how did you end up with so many duplicates?
  • There is no need to go as deep as you describe, at least for starters. Here is how I imagined it: Zotero looks for similar items, as it currently does in the 'Duplicate Items' view; if only one of them has an attachment, that one stays; if all of them have attachments, an additional criterion is applied, e.g. the newest item stays, or the one with the most complete metadata.

    Being able to do something like this would be great, perhaps not even through the UI but through some workaround such as editing the SQLite file (which, I presume, would not work). I realize our issue is unusual, but others might also find a use for this kind of 'rough' deduplication.
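
The rule I have in mind could be sketched roughly as follows. This is only an illustration, not anything Zotero provides: the `Record` class, its fields, and `pick_keeper` are hypothetical names, and the tie-break order (most complete metadata first, then newest) is my own assumption, as is the handling of groups where several but not all items have attachments.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Record:
    """Hypothetical stand-in for one library item in a duplicate group."""
    item_id: int
    date_added: date
    has_attachment: bool
    fields: dict = field(default_factory=dict)


def filled(record):
    """Number of metadata fields that are not empty."""
    return sum(1 for value in record.fields.values() if value)


def pick_keeper(group):
    """Choose which item of a duplicate group stays.

    1. If exactly one item has an attachment, it stays.
    2. Otherwise, among the items with attachments (or all items if none
       has one), keep the one with the most filled fields, newest on ties.
    """
    with_files = [r for r in group if r.has_attachment]
    if len(with_files) == 1:
        return with_files[0]
    candidates = with_files or group
    return max(candidates, key=lambda r: (filled(r), r.date_added))


# Made-up records for illustration only.
a = Record(1, date(2020, 1, 1), False, {"title": "X", "DOI": "doi-placeholder"})
b = Record(2, date(2021, 1, 1), True, {"title": "X"})
c = Record(3, date(2022, 1, 1), True, {"title": "X", "DOI": ""})
d = Record(4, date(2019, 1, 1), True, {"title": "X", "DOI": "doi-placeholder"})

keeper_single_file = pick_keeper([a, b])      # only b has an attachment
keeper_newest = pick_keeper([b, c])           # equal metadata, c is newer
keeper_fullest = pick_keeper([b, c, d])       # d has the most filled fields
```

This deliberately ignores the harder cases raised above, such as conflicting field values or differing files, and only decides which item survives.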

    As for the number of duplicates: I guess the main reason is that we used that group as a database for our text corpora. Since we were collecting items collaboratively and from sources not really suited for that, it turned out to be quite a mess.
  • And Dan, just conceptually: for most of the work we do these days (we're a small Dutch think tank), we try to collect as big a corpus as we possibly can in Zotero, and we then textmine it to identify the main topics, how they change over time, etc. Just to get a bird's eye view of what 'academia' has been (or has not been) working on with respect to our research topics. We gather the items from various aggregators (from the usual academic aggregators, Publish or Perish sources, Unpaywall, etc.) and get them into Zotero. But as Yevhen indicated, that creates lots of duplicates.
    We have a Ukrainian developer looking at ways to dedupe in SQLite without 'breaking' it. But if you could already implement what you suggested (at least automatically merging the 'obvious' ones within a library, across collections), that would be a great step forward.