Large collections and bulk de-duplication
One of our group libraries contains more than 60k records, and a large share of them are duplicates. Removing them one by one through the deduplication menu would be extremely costly in both time and effort, but as far as I know, there are no built-in or third-party solutions for bulk deduplication in Zotero.
1. We've had the idea of deduplicating the SQLite file directly. That should be possible with SQL, pandas, or similar tools; however, the Zotero database file contains multiple tables, so I suppose deduplicating just one of them would corrupt the whole thing. I doubt this idea is actually viable, but perhaps someone has tried it?
2. Are there any new solutions we haven't heard of?
3. On a side note: is there a 'maximum intended' collection size for the best Zotero performance? For example, the aforementioned 60k library causes Zotero to freeze constantly on Windows; Ubuntu works slightly better for some reason. I have a modest laptop configuration http://i65.tinypic.com/2a4ogok.png - but it'd still be good to know what resources a collection like this requires.
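To make the concern in point 1 concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The two tables (`items`, `item_tags`) and their columns are made up for illustration and are not Zotero's real schema; the point is only that deleting duplicate rows from one table leaves rows in another table pointing at items that no longer exist.

```python
import sqlite3

def orphaned_rows_after_naive_dedup() -> int:
    """Dedup one table only, then count dangling references in the other."""
    con = sqlite3.connect(":memory:")
    # Hypothetical two-table layout for illustration; NOT Zotero's real schema.
    con.executescript("""
        CREATE TABLE items (item_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE item_tags (item_id INTEGER, tag TEXT);
        INSERT INTO items VALUES (1, 'Same paper'), (2, 'Same paper'), (3, 'Other');
        INSERT INTO item_tags VALUES (1, 'corpus'), (2, 'to-read'), (3, 'misc');
    """)
    # Naive dedup: keep the lowest item_id per title, delete the rest.
    con.execute("""
        DELETE FROM items
        WHERE item_id NOT IN (SELECT MIN(item_id) FROM items GROUP BY title)
    """)
    # Rows in the second table that now point at a deleted item.
    (orphans,) = con.execute("""
        SELECT COUNT(*) FROM item_tags
        WHERE item_id NOT IN (SELECT item_id FROM items)
    """).fetchone()
    con.close()
    return orphans

if __name__ == "__main__":
    print("dangling rows:", orphaned_rows_after_naive_dedup())
```

In a real database with many interlinked tables, every table that references the deleted rows would need the same care, which is presumably why editing a single table is risky.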
This is more of a conceptual problem than a technical one. It's not really clear what any sort of automatic deduplication process would do — duplicate items can have both different versions of fields and different versions of files (e.g., from PDF watermarks or separate snapshots of the same webpage), neither of which can necessarily be resolved in any automated manner.
If we could come up with a UI for it, we could at least offer to automatically merge the subset of items that can be resolved automatically. We could also merge exact file matches when merging parent items, but even then the attachment title, filename, and embedded note could all differ.
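As a rough sketch of what "the subset of items that can be resolved automatically" might mean, the function below merges two records only when no field conflicts, i.e. each field is either identical in both or empty in one of them. The plain-dict record format and the `auto_merge` name are assumptions for illustration, not anything Zotero provides.

```python
from typing import Optional

def auto_merge(a: dict, b: dict) -> Optional[dict]:
    """Return the merged record, or None if any field has two different values."""
    merged = {}
    for field in a.keys() | b.keys():
        va = a.get(field) or ""
        vb = b.get(field) or ""
        if va and vb and va != vb:
            return None  # conflicting values: needs a human decision
        merged[field] = va or vb  # take whichever side is non-empty
    return merged

if __name__ == "__main__":
    # One side just fills a gap in the other: safe to merge.
    print(auto_merge({"title": "X", "DOI": ""}, {"title": "X", "DOI": "10.1/abc"}))
    # Two different titles: not resolvable automatically.
    print(auto_merge({"title": "X"}, {"title": "X (preprint)"}))
```

Attachments are deliberately left out here, since differing files, filenames, and notes are exactly the part that resists automation.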
Out of curiosity, how did you end up with so many duplicates?
Being able to do something like this would be great, probably not even through the UI but through some workaround, such as editing the SQLite file (which, I presume, would not work). I suppose our issue is indeed unusual, but others might also find a use for 'rough' deduplication.
As for the number of duplicates: I guess the main reason is that we used that group as a database for our text corpora. We were collecting the records collaboratively, and from sources not really suited to that, so it turned out to be quite a mess.
We have a Ukrainian developer looking at ways to dedupe in SQLite without 'breaking' it. But if you could implement what you suggested, at least automatically merging the 'obvious' ones (in libraries, across collections), that would already be a great step forward.
Did you get anywhere with this? We're looking at citation trees (ping @dlesieur), which means that we have to merge periodically (as Zotero doesn't have a function to check for duplicates on import).
We've also implemented a non-destructive merge: on merge, the item data of both items is written to a note. So if something goes wrong, you can still find the missing metadata.
Björn
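For readers wondering what such a merge could look like, here is a minimal sketch of the general idea, not the actual implementation described above. Items are assumed to be plain dicts, and `merge_with_backup_note` is a hypothetical helper: the master's non-empty fields win, and the untouched data of both items is serialized into a note text.

```python
import json
from typing import Tuple

def merge_with_backup_note(master: dict, other: dict) -> Tuple[dict, str]:
    """Merge `other` into `master` and return (merged item, backup note text)."""
    merged = dict(other)
    for key, value in master.items():
        # The master wins unless its value is empty and the other item has one.
        if value or key not in merged:
            merged[key] = value
    # Keep both originals verbatim so lost metadata can be recovered later.
    note = json.dumps({"master": master, "other": other},
                      indent=2, sort_keys=True, ensure_ascii=False)
    return merged, note

if __name__ == "__main__":
    merged, note = merge_with_backup_note(
        {"title": "A Study", "date": ""},
        {"title": "a study", "date": "2019"},
    )
    print(merged)
    print(note)
```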
I imported a Mendeley library into Zotero and now have around 2000 duplicates.
An option to "treat the newer version as master" would be really helpful, since I am not making any more educated assessment than that myself anyway.
More likely I will delete my entire Zotero library and re-import the Mendeley library into an empty one. Not exactly satisfying either.
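The "treat the newer version as master" rule can be sketched roughly as follows. This is only an illustration on plain dicts: the `title` and `modified` field names, the ISO timestamp format, and matching duplicates by lower-cased title are all assumptions, and no such option exists in Zotero as far as this thread indicates.

```python
from collections import defaultdict
from datetime import datetime

def pick_newest_masters(items):
    """Group items by normalized title; return (master, duplicates) per group."""
    groups = defaultdict(list)
    for item in items:
        # Assumption: items with the same lower-cased title are duplicates.
        groups[item["title"].strip().lower()].append(item)
    plan = []
    for group in groups.values():
        if len(group) < 2:
            continue  # no duplicates for this title
        newest_first = sorted(
            group,
            key=lambda it: datetime.fromisoformat(it["modified"]),
            reverse=True,
        )
        plan.append((newest_first[0], newest_first[1:]))
    return plan

if __name__ == "__main__":
    library = [
        {"id": 1, "title": "Same Paper", "modified": "2018-01-05T10:00:00"},
        {"id": 2, "title": "same paper ", "modified": "2019-03-01T09:30:00"},
        {"id": 3, "title": "Other", "modified": "2017-06-01T00:00:00"},
    ]
    for master, dupes in pick_newest_masters(library):
        print("keep", master["id"], "merge in", [d["id"] for d in dupes])
```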