Duplicate detection?

adamsmith · April 8, 2011

Zotero has been around since fall 2006 ;-)
There is no duplicate detection that's currently useable, no, so 2.)

s marple · April 8, 2011

OK, gosh, I was pretty sure I used Zotero for my master's in 2005; I guess not. At any rate, if I am going to spend the next 10+ hours trying to fix my dissertation, which is giving me all sorts of corrupted citation errors and has duplicated all of my works cited in my bibliography (I gather because of the duplication issue), how can I keep the duplication problem from cropping up in the future? My issue seems to have developed from using two different computers with the same synced database. I have 2-4 records for each reference.
Or is it time for me to finally give up on Zotero?

adamsmith · April 8, 2011

I can't tell you how you duplicated your references - that doesn't happen with regular Zotero use.
I also can't tell you how you got corrupted citations - again, that's not something that happens with regular Zotero use.
I don't know if you want to give up on Zotero; you'll have to decide that for yourself. Hundreds, possibly thousands of people (me included) write their dissertations with Zotero successfully and without major issues.
Duplicate detection is certainly an important feature and it's going to happen eventually, but for regular use, not having the feature is a minor nuisance, not a major disaster.

s marple · April 8, 2011

Hey, the problem is that the duplication just showed up in my entire database after I synced with my other computer last year. Every entry is duplicated or quadruplicated. It is not that I added the same ref multiple times, having forgotten that I already had it in my database. For my smaller projects I have just manually dealt with the refs page, hoping for some sort of fix.
Does this mean that perhaps there is a fix for my database?

fbennett · April 8, 2011

You would want to proceed with caution, but if you're willing to set up a separate Firefox profile with a second copy of your database, you could give the duplicates detection system in the multilingual version a try. It's been used successfully by at least one major project that I know of. Take a look at this page, and at this page, and watch this screencast (terrible narration by yours truly, but it gives the idea). If the procedures there don't look too daunting, post back and I can walk you through the process. The multilingual client is still experimental, so I can't guarantee instant success; but if you have a little leeway in your schedule, it might save you some extra typing.

101james · June 26, 2011

I've been using Zotero for almost a year now and really appreciate the ease of importing items to the database, the portability of those databases and the ability to share with colleagues. This is the 'magic bit' that saves hours of time and makes me a big Zotero fan.
Killer, though, is the absence of duplicate detection. I cannot import all my references from EndNote. Each time the import fails and I try again, duplicates are created. I've posted about this before, like so many others. It is becoming massively inconvenient not having one unique database for my work.
As I've said before, I'd accept a beta and sub-optimal solution (e.g., my libraries will no longer link to existing documents) as a price to pay for having a single library free of duplicates.
Please offer a solution to this long standing sore of an issue! Thanks.

mark · June 27, 2011

@101james, I totally agree that duplicate detection is the single most essential missing feature in Zotero. (And I have been advocating the inclusion of a simple duplicate check upon adding items for years.) But the way you describe your specific case makes me think you may be running into a problem that can be solved in another way. Note that whenever you import, the imported items (a) go into a new collection and (b) get the same "date created" setting. That gives you two ways to easily select and delete all items of an import gone wrong. For that kind of problem, duplicate detection is not the only (and perhaps not even the best) solution.

Now if you're trying to import an EndNote library that significantly overlaps with items already in your library, yes, then you're going to need duplicate detection badly.
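
To illustrate the shared-timestamp point, here is a minimal Python sketch. It does not use Zotero's own data model: the record layout and the `date_created` field name are assumptions made up for the example. It only shows that items from one import batch can be picked out by their common timestamp.

```python
from collections import defaultdict


def group_by_timestamp(items):
    """Group item titles by creation timestamp (hypothetical 'date_created' field)."""
    batches = defaultdict(list)
    for item in items:
        batches[item["date_created"]].append(item["title"])
    return dict(batches)


# Made-up records: two of them were created by the same import run.
library = [
    {"title": "Old paper", "date_created": "2011-01-03 10:15:00"},
    {"title": "Imported A", "date_created": "2011-06-26 09:00:00"},
    {"title": "Imported B", "date_created": "2011-06-26 09:00:00"},
]

batches = group_by_timestamp(library)
failed_import = batches["2011-06-26 09:00:00"]  # everything from the bad import
```

Anything in `failed_import` could then be selected and deleted in one go, leaving the older items untouched.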

njudge · July 7, 2011

I'm thinking about setting up a group page for my entire department. Obviously, though, such a large group page would have to wait until a duplicate deletion feature is in Zotero in an easy-enough format for mass use. Is there some sort of ETA for when this feature will be available?

mark · July 11, 2011

I, too, am starting to use Zotero for collaborations on a larger scale. The lack of duplicate detection is simply unexplainable to colleagues. I want to draw attention yet again to my simple and workable suggestion to provide at least a basic author/title check at import. Something like "Do you really want to add this item? It looks like it already exists in your library. [A] Cancel and go to similar item. [B] Add anyway."

I know there is working code by Frank Bennett in the multilingual version, but with all the warnings plastered over the page that that is an experimental version and that it should be used with a separate profile there is no way anybody is going to use that in a production environment. From which I conclude with some regret that even the simplest form of duplicate detection, the one that would have forestalled a lot of problems down the track, is not in place as we speak, despite Dan Cohen's remark in October 2006:

We're definitely going to have duplicate detection in a (near) future release.

I hate to be nagging like this but as I've tried to make clear here and elsewhere I do think Zotero could sometimes benefit more from a stance that was aimed at first providing quick and simple solutions that work for a lot of users and only then working out the one perfect solution to all related problems. Had this been done in the case of duplicate detection, this would probably have made the problems a lot less ugly, because right now our libraries are more messy and duplicate-ridden than they could have been.

ajlyon · July 11, 2011

I'm using Frank's builds of Zotero on a regular basis, and I think that they're fine for most power users of Zotero, who make regular backups and who are fine with the occasional display / UI glitch. The risk of actual data damage or loss with the multilingual version is minimal, and it is being used by at least dozens of people, maybe a hundred, around the world, with no known data loss.

mark · July 12, 2011

I'm tempted; from the screencast it seems Frank has done an admirable job in coding a fully functional duplicates solution. But really I've been thinking of something much simpler: a simple author/title based check the moment an item is added. Seems to me that without something like that in place it is still too easy to add duplicates to one's library in the first place.

Frank's code obviously solves a lot of the more complex problems; I can imagine it also working as a solution for duplicates introduced by two collaborating researchers with partially overlapping (group) libraries, which is great. But it is still, as you say, for power users only.

Which means (unless this is going to be merged into the main development line soon, which I don't think we can expect) that my main argument still stands: prevention is better than cure, and a simple solution that comes soon is better than the mother of all duplicate detection solutions that takes five or six years to materialise.

jneef · July 14, 2011

Just like mark, I also think that a quick duplicate check at the time of adding a new entry would be extremely helpful and should not be too hard to implement. A short warning as suggested above should be enough. For me, this would make Zotero advance from "awesome" status to "incredible!"...

schmid · July 20, 2011

Agree!!! And would it really be so difficult? As a first step, simply checking for identity of DOI or URL with an item in the database would be sufficient. For a nice user interface, see this post by Mark:
http://forums.zotero.org/discussion/42/2/duplicate-detection/#Comment_71964

I also tried the hidden feature for duplicate detection as described in
http://forums.zotero.org/discussion/13658/barrier-to-entry-no-duplicate-detection/#Item_2
This works with some restrictions (slow with my 1000+ items, dozens of false positives) - better than nothing, so adding it to the regular gear menu would be already helpful for those users who don't want to tinker in the about:configs. But even if improved, it won't be a substitute for a warning that comes already when I add a duplicate!
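
The DOI-or-URL identity check suggested as a first step could look roughly like the Python sketch below. This is not Zotero code; the dictionary layout, the field names `doi` and `url`, and both function names are assumptions for illustration only.

```python
def normalize_doi(doi):
    """Lower-case a DOI and drop common prefixes (DOIs are case-insensitive)."""
    doi = (doi or "").strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def find_identifier_match(new_item, library):
    """Return the first existing item sharing a non-empty DOI or URL, else None."""
    new_doi = normalize_doi(new_item.get("doi"))
    new_url = (new_item.get("url") or "").strip()
    for existing in library:
        if new_doi and new_doi == normalize_doi(existing.get("doi")):
            return existing
        if new_url and new_url == (existing.get("url") or "").strip():
            return existing
    return None


# Made-up records for the example.
library = [
    {"title": "First example", "doi": "10.1000/ABC123", "url": ""},
    {"title": "Second example", "doi": "", "url": "http://example.org/paper"},
]
by_doi = find_identifier_match({"doi": "doi:10.1000/abc123"}, library)
by_url = find_identifier_match({"url": "http://example.org/paper"}, library)
no_hit = find_identifier_match({"title": "No identifiers"}, library)
```

Empty identifiers are skipped on purpose, so two items that both lack a DOI are never reported as matching each other.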

hagver · July 20, 2011

Agree completely! Duplicate detection on import is separate and much simpler than detecting duplicates later. Is there any reason at all not to implement it immediately?

mark · July 20, 2011

So for this pre-add check to be done right it needs to be fast. I propose this basic workflow: Upon adding a new item, check a low number of strategically chosen fields and assign a duplicate score according to some simple rules, similar to spam rating systems. If duplicate score exceeds x, bring up the interface I proposed above (x and the weight of individual rules could be made customizable but there is no need in a first version).

My proposal of fields to use, ranked by descending weight:
1. DOI
2. author last name
3. title not case-sensitive (only first n words?)
4. year
5. publication
6. page numbers

A DOI match is all-or-nothing, which is good, but not all items have a DOI. Author last name + title + year should probably receive a combined weight equal to or higher than that of DOI. Given the importance of these first four, perhaps 5 and 6 add little value.

Interface-wise, it is really important that the user gets to see the existing item that the new item is assumed to be a duplicate of. So the prompt should include a citation form of the existing item.
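
The scoring workflow proposed in this post can be sketched in Python as follows. Everything concrete here is an assumption chosen for illustration: the weights, the threshold, the five-word title cutoff, and the field names. The only constraint taken from the proposal is that author + title + year together weigh as much as a DOI match.

```python
import re

# Hypothetical weights: author + title + year (4 + 4 + 2) equals the DOI weight.
WEIGHTS = {"doi": 10, "author_last": 4, "title": 4, "year": 2,
           "publication": 1, "pages": 1}
THRESHOLD = 10   # the "x" of the proposal; value is a guess
TITLE_WORDS = 5  # compare only the first n words of the title


def field_key(item, field):
    """Case-insensitive, whitespace-normalized comparison key for one field."""
    value = re.sub(r"\s+", " ", str(item.get(field) or "").strip().lower())
    if field == "title":
        value = " ".join(value.split()[:TITLE_WORDS])
    return value


def duplicate_score(new_item, existing):
    """Sum the weights of all non-empty fields that match."""
    score = 0
    for field, weight in WEIGHTS.items():
        a, b = field_key(new_item, field), field_key(existing, field)
        if a and a == b:
            score += weight
    return score


def likely_duplicates(new_item, library, threshold=THRESHOLD):
    """Return (score, item) pairs at or above the threshold, best first."""
    hits = [(duplicate_score(new_item, it), it) for it in library]
    return sorted((h for h in hits if h[0] >= threshold), key=lambda h: -h[0])


# Made-up records for the example.
existing = {"doi": "10.1000/example", "author_last": "Smith",
            "title": "An example title about duplicate records",
            "year": "2010", "publication": "Journal of Examples", "pages": "1-10"}
other = {"author_last": "Smith", "title": "A different paper", "year": 2010}
new = {"author_last": "smith",
       "title": "AN EXAMPLE TITLE ABOUT DUPLICATE RECORDS", "year": 2010}

matches = likely_duplicates(new, [existing, other])
```

With these numbers, `new` reaches the threshold against `existing` on author, title and year alone, while `other` (same author and year, different title) stays below it. The items returned in `matches` are what the prompt would show to the user.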

ajlyon · July 20, 2011

This could be done as a Zotero plugin, even-- there is an event triggered when a new item is added, so we don't need to dig around in the Zotero innards to do that. We can let saving proceed normally, then offer the option to delete the new (or old) item from the "Possible Duplicate!" dialog box.
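
As a rough illustration of that save-then-prompt flow, here is a Python sketch built on a toy library class. None of this is Zotero's plugin API: `ToyLibrary`, the listener mechanism, `same_work` and the `ask_user` callback are all hypothetical stand-ins.

```python
class ToyLibrary:
    """Toy stand-in for a reference library with an 'item added' event."""

    def __init__(self):
        self.items = []
        self.listeners = []

    def add(self, item):
        self.items.append(item)       # saving proceeds normally
        for listener in self.listeners:
            listener(self, item)      # event fires after the save

    def remove(self, item):
        # Remove by identity so an equal-looking item is not removed by mistake.
        self.items = [it for it in self.items if it is not item]


def same_work(a, b):
    """Crude match on author and title, ignoring case."""
    key = lambda it: (it["author"].casefold(), it["title"].casefold())
    return key(a) == key(b)


def make_duplicate_listener(ask_user):
    """Build a listener that asks the user what to do about a possible duplicate."""
    def on_item_added(library, new_item):
        for old in library.items:
            if old is not new_item and same_work(old, new_item):
                choice = ask_user(old, new_item)  # stands in for the dialog box
                if choice == "delete_new":
                    library.remove(new_item)
                elif choice == "delete_old":
                    library.remove(old)
                return  # handle only the first possible duplicate
    return on_item_added


library = ToyLibrary()
# The lambda plays the role of a user who always clicks "delete the new item".
library.listeners.append(make_duplicate_listener(lambda old, new: "delete_new"))
first = {"author": "Smith", "title": "An Example Title"}
second = {"author": "smith", "title": "an example title"}
library.add(first)
library.add(second)
```

After the second `add`, only `first` remains in `library.items`; any other answer from `ask_user` would keep both items.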

pedrobrasil · July 20, 2011

Hello Zotero friends,

My experience in detecting duplicates through a combination of author last name or title (with EndNote, many years ago) is not good. That's because duplicated references often come from different sources, such as LILACS and SCOPUS. Some of these remote databases use different character encodings, so some names come out differently, especially those with Latin characters. For example:

author name: Lilacs - alberto muños
Scopus - Munos, Alberto

Sometimes when "muños" is imported, it becomes something weird like "mun/A$s" in the reference manager. Thus, the problem here is symbols such as ~ ç ¨ that may mess up the character comparison. The same happens with the paper title.

In my humble opinion DOI (or other unique identifier) and URL are nice when they are present. After that, some combination of numeric fields such as year, issue, volume and initial page is unlikely to happen twice.
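
Both ideas can be sketched in a few lines of Python. The field names (`year`, `volume`, `issue`, `first_page`) and function names are assumptions for the example. Note that stripping accents only helps when the characters arrived intact; it cannot repair a name that was already garbled on import, which is where the numeric key is useful.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters and drop combining marks (so 'ñ' becomes 'n')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(name):
    """Accent-, case- and order-insensitive key for a personal name."""
    cleaned = strip_diacritics(name).lower().replace(",", " ")
    return tuple(sorted(cleaned.split()))


def numeric_key(item):
    """Key built only from numeric fields, which survive encoding problems."""
    fields = ("year", "volume", "issue", "first_page")
    return tuple(str(item.get(f) or "").strip() for f in fields)


# The two spellings from the example above compare equal after normalization.
same_author = name_key("alberto muños") == name_key("Munos, Alberto")

# Made-up records: the author string is garbled in one of them.
a = {"author": "mun/A$s", "year": 2009, "volume": 12, "issue": 3, "first_page": 45}
b = {"author": "Munos, Alberto", "year": "2009", "volume": "12", "issue": "3",
     "first_page": "45"}
same_numbers = numeric_key(a) == numeric_key(b)
```

A numeric-only match is still just a hint: it says two records are worth showing side by side, not that they are certainly the same work.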

Early in this topic, probabilistic linkage was proposed; if text matching is really necessary, then this might be the way to go. Although Zotero does not (yet) have a duplicate detection tool, it is a very, very handy tool. I must congratulate its developers. Keep on going... ;-) Champions never give up!

Kind regards,

Pedro

dstillman · July 22, 2011

I've committed duplicate detection functionality to the trunk:

https://www.zotero.org/trac/changeset/9932

The matching algorithms are currently fairly simplistic, but the basic functionality is in place, and we'll be improving the detection going forward.

As with the trunk in general, I don't recommend trying this with a production database.

This functionality will be included in the next beta of Zotero.

dstillman · July 22, 2011

As for pre-flight:

We can let saving proceed normally, then offer the option to delete the new (or old) item from the "Possible Duplicate!" dialog box.

That wouldn't be a good way to go about this, as deletions get synced to the server and carried around with the library history. A pre-flight check would have to be (somewhat awkwardly, I still think) integrated into the translator architecture, but Simon would have to comment on the feasibility of that.

ajlyon · July 23, 2011

I'll run the trunk through its paces and post bugs as necessary. Thanks for working on this.

As for the pre-flight, is the extra burden of deletion syncing really such a big deal? I would imagine that there wouldn't be an awful lot of spurious deletions-- I don't think that duplicates are quite that common.

dstillman · July 23, 2011

As for the pre-flight, is the extra burden of deletion syncing really such a big deal?

Wouldn't be terrible, but they could add up, and it's a pattern that would be better to avoid if possible.

fbennett · July 23, 2011

With the preservation of links when items are merged, there's not much difference between catching duplicates during translation and catching them later.

ajlyon · July 23, 2011

Except that a person might be interested in knowing that they're creating a duplicate when saving -- it's unlikely to be intentional. And presumably the item merge information would also have to be synced, so there's sync overhead there too.

I'm not dead-set on pre-flight checking using deletion-- a behavior that builds off of the itemDone event and allows the user to prevent saving would of course be cleaner. With the migration of translation into the server and connectors, it's a little hard to imagine a cross-platform way to implement it, but it might still be possible. For now though, we should probably see how this solution works out, and let the translator code rest and settle down before exploring that route.

mark · July 24, 2011

With all due respect for hypothetical technical disadvantages of one suggested way of doing preflight, with ajlyon I think it is important not to lose sight of the fact that many users are really looking for a simple preflight check more than a fully featured solution for after the fact. Users don't want to add duplicates to their library, period. Interface-wise, giving them a way to avoid that seems a much more user friendly route than the after the fact type solution.

dstillman · July 24, 2011

The technical disadvantages of that approach aren't hypothetical, but it occurs to me that a plugin could delete rows from the deletion log too. There could still be some minor side effects (e.g., for installed Zotero plugins that watched for add events), but they'd be more minor. An itemDone-based approach would still be the cleanest, though.

mark · July 25, 2011

Agreed on the delete log — duplicates aren't the type of thing you'd expect to be kept in your Trash anyway.

jneef · August 24, 2011

For those of you who are following this thread and who didn't notice yet: There's a beta version of Zotero 3 out. It contains duplicate detection and merge functionality and my first impression of it is very good - it seems quite powerful and user-friendly.
Thanks a lot to the Zotero team!

mark · August 24, 2011

It is a nice interface and quite helpful after the fact. Thank you to Dan for adding it. I keep thinking however that prevention is better than cure, and that many users would be helped by a simple, quick preflight check that would alert them when they are possibly adding a duplicate.

Code-savvy Zotero users, here's your chance to write a plugin that would be massively popular! (See ajlyon's point, and useful UI suggestions above.)

jneef · August 26, 2011

I tried the duplicate detection and I have one suggestion for an improvement: when I merge two entries, all the attached files are kept. When I have a duplicate entry, I usually also have a duplicate PDF attached to it. It would be nice to be able to select which files to keep or discard. For me, one of the main points of getting rid of duplicates is reducing the size of the database, so getting rid of duplicate PDF files is quite important. Of course I can still manually delete one of the files afterwards, but this means that I have to go look up the freshly merged entry again. The ability to choose during the merging process would save a lot of time. Another way would be to add the ability to delete attachments in the duplicate detection view.

dstillman · August 26, 2011

Please post comments on the 3.0 feature to the new thread dedicated to that.

Since this thread goes back to literally the first few days of Zotero's existence, I'm going to close it now that duplicate detection is available. If someone wants to take up Mark's suggestion, feel free to start a new thread.