Duplicate detection?

pedrobrasil · September 24, 2010

Hello all,

I have read most of the 91 messages above, and at some point I felt pity for the developers. Almost four years of users desperately requesting the same thing, and it seems programming this stuff is not really easy. I'm a physician, so I can't say how easy or difficult it could possibly be. But come on guys ... if you are really really desperate ... donate!!! Isn't it good and useful??? ... then donate as much as you can! Get the programmers at your institutions involved! Sell Zotero's functionality to your librarians.

Giving some thought to this discussion: what is the problem with duplicates?

The problem I have faced more than once is that I cite one reference in a paper ... and later in the same paper I cite the duplicated reference, thinking it is the first. Thus in the bibliography I have the "same" reference inserted twice with different numbers. If this is the case, then a careful examination of the bibliography will detect the error. Fixing it is the problem. Editing the citations may take some time, but it will work.

The second problem is when, after inserting a citation, the user deletes the reference from the library thinking that it is a duplicate; then when the bibliography is inserted, Zotero will not be able to find the corresponding reference. When I used to work with EndNote (version 10), it had a warning window saying that it was not able to find the cited reference, and at the same time it opened the library and asked which reference should be used instead! At that point it was also possible to add a new reference to the library to replace the old one! Never happened to me though!

Now a question to further understand the duplicate/merging problem!!!

Suppose I have a good reference and I want to cite it in paper X and paper Y. So I import it from PubMed or LILACS or whatever into collection X, and later I drag it into collection Y. Am I creating a duplicate? If so, then deleting duplicates is really dangerous! But if it works like Gmail, where everything is in "my collection" (the "inbox") and I'm able to tag it with several tags such as collection X and collection Y, then the reference is really just one, but it may appear in several collections because it has several tags. If this is the case, then duplicates are really unnecessary and annoying.

Anyway, congratulations to the developers (even if duplicate finding is not yet available), because Zotero has helped me a lot and I'm really a fan ... my students are forced to be introduced to Zotero, as to other nice free and open source software!

Don't give up! ;-)

fbennett · September 24, 2010

@pedrobrasil,

When you drag an item to another collection inside the same library or group, it doesn't create a new copy (it works, as you say, like Gmail). If you drag the item across libraries, it does create a separate copy. So (at present, at least) if you are collaborating with other authors on a document, everyone should draw their references from the same library.

There is a plan to smooth this out (so that the reference looks and feels like a single copy, everywhere), as the developers find time to work on it.

I'm sure that improved duplicates management will come in due course -- no one is giving up!

SalishSea · September 29, 2010

Clearly duplicate detection is both hard to do, and valued. There is a relevant statistical literature. Here's one lead:

http://www.springerlink.com/content/788fq0p1wkjy63lt/

There may well be methods in R (open source statistics software) that could be used for this purpose. I am not sure whether one could bring R to bear directly on the problem without having everyone install R. That's probably too much to ask! Anyhow, hopefully something to think about.

As always, thanks.

pedrobrasil · September 29, 2010

Pretty cool, this book. About probabilistic linkage ... it is complex stuff and very useful in several situations. I've seen it used to connect very large databases, such as finding tuberculosis notification subjects in the general mortality databases. This would be pretty cool, but my guess is that this level of complexity won't be necessary in Zotero. Will it? About R ... I'm familiar with R, but it took me about 2 to 3 years to get there. On a local R list I've seen people discussing weird stuff such as using Twitter with R, sending mail with R, and managing references with R. There is an R package that is able to connect to and import references from MEDLINE, for example. There are several command-line functions to deal with text, regular expressions, and finding duplicates. Again, I agree that asking Zotero users to have R is asking too much.
But I have to agree these were very nice insights! How could probabilistic linkage be programmed in Zotero? No idea! :-(

SalishSea · September 29, 2010

Pedro,

I have to agree, not every problem needs a hammer, and this problem may well not need a jackhammer like R. The point is more to say 1) probabilistic methods are preferable to deterministic methods, 2) there are methods for this sort of thing, and 3) there may even be extant code. I don't think we would want to bundle R with this, but one option might be a batch process enabled on Amazon's cloud cluster, so that you could have R sync to your Zotero cloud, do its thing, and send back a script of recommended deletes or something like that. You could also do that locally if you wanted, but then you'd need R.

Lots of options, clearly.

The interesting thing with these databases is that - at least those that have been scraped - there should be some high value fields (PMIDs, etc) which have high validity and high discriminatory power. So, that should make the process a lot easier.
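
To make the point about high-value fields concrete, here is a minimal Python sketch of a weighted-agreement score. It is a much-simplified stand-in for real probabilistic record linkage, not an implementation of it. The field names, the weights in `WEIGHTS`, and the record layout (plain dictionaries) are all assumptions made up for illustration; nothing here is Zotero code.

```python
from difflib import SequenceMatcher

# Hypothetical weights: identifiers such as PMIDs discriminate far better
# than titles or years, so agreement on them counts for much more.
WEIGHTS = {"pmid": 10.0, "doi": 10.0, "title": 4.0, "year": 1.0}


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(value).lower().split())


def field_agreement(field, a, b):
    """Return a similarity in [0, 1], or None when either record lacks the field."""
    if not a.get(field) or not b.get(field):
        return None
    x, y = normalize(a[field]), normalize(b[field])
    if field == "title":
        # Titles are compared fuzzily; everything else must match exactly.
        return SequenceMatcher(None, x, y).ratio()
    return 1.0 if x == y else 0.0


def match_score(a, b):
    """Weighted average agreement over the fields that both records have."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        similarity = field_agreement(field, a, b)
        if similarity is None:
            continue  # a missing field is neither evidence for nor against
        total += weight * similarity
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Pairs scoring above some threshold would be offered to the user as candidates rather than merged automatically; choosing that threshold, and the weights, is where the statistical literature comes in.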

I would not underestimate the challenges, but I would think that a quick walk from the Centre for History and New Media to the Stats department would yield a graduate student looking for an M.A. topic.

mark · September 30, 2010

Indeed, where are the computer science grads and undergrads looking for topics? Zotero needs a Summer of Code! So many useful plugins could be written!

dstillman · September 30, 2010

Again, the main thing keeping this from happening (aside from other development priorities, including grant requirements) isn't that it's particularly difficult to implement—Frank Bennett even provided a prototype implementation, though I haven't had a chance to review it—but that we'd have to deal with merged items in an intelligent way, which has implications in both client and server code. We're not going to roll out duplicate detection that produces broken citation links when you actually use it.

That can definitely be solved, but just remember that this isn't as simple as searching for matching unique ids.

We're well aware of the demand for this feature, though.

mark · October 2, 2010

Dan/others, have you ever considered that it would be a good start to at least do a summary duplicate check when adding new items? In fact this would solve most of the problem in personal libraries. Note that the very first post in this thread (and Dan Cohen's reply to it, back in 2006) refers not to duplicate detection after the fact (where there is the possibility of breaking citation links), but before adding possible duplicate items. Why not start with that?

Note that as long as some simple check like that is not in place, the problem will get progressively worse because people have no way of avoiding adding (and citing) duplicate items. So my suggestion would be to start with a low-tech, low-complexity solution that serves a lot of users, and worry about the more difficult cases later.

I'm thinking of something like this:

Do you really want to add this item? It looks like it already exists in your library.

Smith, Joe. 2010. How to avoid duplicate entries. Ms., Amsterdam.

[A] Cancel and go to similar item. [B] Add anyway.

A simple check and warning like that would be welcomed by all users and would certainly ease the pain of waiting for the more complex solution.
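
A rough Python sketch of such an add-time check could look like the following. It is only an illustration of the idea: the item layout (dictionaries with `title`, `author` and `year`), the comparison key, and the `confirm` callback standing in for the dialog are all hypothetical, and none of it reflects how Zotero stores or adds items.

```python
import re


def duplicate_key(item):
    """Build a crude comparison key from title, first author surname and year."""
    title = re.sub(r"[^a-z0-9]+", " ", item.get("title", "").lower()).strip()
    surname = item.get("author", "").split(",")[0].strip().lower()
    return (title, surname, str(item.get("year", "")))


def check_before_adding(new_item, library):
    """Return the existing items that look like new_item (empty list if none)."""
    key = duplicate_key(new_item)
    return [item for item in library if duplicate_key(item) == key]


def add_item(new_item, library, confirm):
    """Add new_item unless a look-alike exists and confirm() declines.

    confirm(new_item, similar) plays the role of the warning dialog: it
    returns True for "Add anyway" and False for "Cancel and go to similar
    item". The item the user ends up with is returned.
    """
    similar = check_before_adding(new_item, library)
    if similar and not confirm(new_item, similar):
        return similar[0]
    library.append(new_item)
    return new_item
```

Because the check only runs before an item enters the library, nothing already cited is ever deleted, so it cannot break citation links.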

fbennett · October 2, 2010

mark,

That's the way my experimental code works. It flags newly arrived items as "on probation", as it were, and then allows you to open a vetting display where possible clashes between the new arrivals and existing content can be resolved.

It provides a primitive "merger" facility that will slot the master reference into any collections where the newcomer is present, but it doesn't include a mechanism for mapping the identity of the deleted item to the master -- so word processing documents that depend on the newcomer will break. That's a bit beyond my skill level, but (although I would say so, being myself) I do like the interface and the workflow.
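
For readers who want a picture of what "slotting the master into the newcomer's collections" means, here is a small Python sketch. It is not the experimental code itself; the representation (a dictionary mapping collection names to sets of item IDs) and the function name are assumptions made for illustration.

```python
def merge_into_master(master_id, newcomer_id, collections):
    """Replace newcomer_id with master_id in every collection that holds it.

    collections maps a collection name to a set of item IDs. Returns the
    names of the collections that were changed.
    """
    changed = []
    for name, members in collections.items():
        if newcomer_id in members:
            members.discard(newcomer_id)
            members.add(master_id)
            changed.append(name)
    return changed
```

Note what the sketch leaves out: nothing records that the newcomer's ID now means the master, which is exactly the gap that breaks documents citing the newcomer.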

There's a screenshot here, if you're interested. Patience, but the developers are aware of our eagerness for duplicates handling, and when a solution arrives I'm sure it will be a good one.

hagver · October 17, 2010

Like so many people have written now: a simple heuristic duplicate warning on import (configurable, please) would solve almost all problems. This can't be so difficult!

Anything at a later stage is complicated and awkward - including manually wading through a thousand records.

Please, please, a quick and simple solution on import, way overdue.

Keith Ralston · November 2, 2010

Hey guys. I appreciate all the hard work. I have seen a patch for deduping. I was just wondering if you have a status for working it into release?

I would be more than happy to help test it.

101james · November 4, 2010

Duplicate search - please. I'm with everyone else in begging for this! As a non-tech person, I'm sure there are huge complexities associated with what, at first sight, is a simple request.
For my part:
- Even a crude check would be helpful (e.g., "Zotero thinks the following may be duplicates...")
- It seems one problem is maintaining links between writing and the library as duplicates are removed? From what I can understand this creates complexity with the unique identifying code for each database entry. I'm quite happy to re-insert references to my work (i.e., break the link between the library and the paper) if this is what it takes to have a remove duplicates function. Also, if this keeps Zotero's code simple and so more reliable it definitely gets a big vote from me.

Thanks for all the work. Any news on when this duplicate removal function will come?

mronkko · November 5, 2010

The problem with duplicates is that there are a lot of non-trivial issues. We are using a group library for 12 people in the research group. If someone just deletes duplicates, it will break many papers, causing extra work and confusion for other people besides the one who did the duplicate removal.

If something "quick and dirty" is implemented now, it will most likely backfire by causing compatibility issues when users later migrate to a "more proper" duplicate detection.

There is some functionality for duplicate removing available in the latest version, but it is not enabled by default due to potential problems. If you search the forum, you will probably find instructions. (I do not remember how to enable it.)

101james · November 5, 2010

Just to reinforce a few points and add views to those raised here.
I'm a fairly new user (months) to Zotero. I've a couple of papers that use shared libraries. I'd certainly like a duplicate detection for these libraries as well as my 'own' (i.e., not shared) libraries. Why?
At the moment I can't import my references from EndNote completely. If the process has a problem, as it has three times, duplicates very quickly build up.
I'd like to feel confident Zotero will stay reliable. Many software problems seem to start from trying to maintain backward compatibility. As the software gets more complex and more involved, it is harder to maintain and less reliable. If that means users perhaps have to do a little extra, then perhaps that is helpful?
Perhaps a solution is to have a choice about which libraries are examined for duplicates? This way libraries that are shared and ongoing remain untouched. For a shared library, perhaps only administrators could run the duplicate check, to avoid the problems highlighted in the previous post.
Thanks for flagging the availability of some functionality to check for duplicates.

mark · November 6, 2010

If the process has a problem, as it has three times, duplicates very quickly build up.

Anytime you import from a file, a collection is created that contains all items imported. So if you find a problem with that particular import, you can easily delete all of them from your library (additionally, even if a collection was not created, you would be able to sort by "Date added" and select all imported items to delete them if needed). So, while duplicate detection is sorely needed, in the particular case you mention there are other (and better) solutions.

mark · November 6, 2010

Mronkko writes:

The problem with duplicates is that there are a lot of non-trivial issues. We are using a group library for 12 people in the research group. If someone just deletes duplicates it will break many papers causing extra work and confusion for also other people besides the one that did the duplicate removal.

If something "quick and dirty" is implemented now, it will most likely backfire by causing compatibility issues when users later migrate to a "more proper" duplicate detection.

I have to register my disagreement with this. Yes, duplicate detection can be extremely tricky, but no, the implementation of basic functionality should not as a rule wait until all conceivable usage scenarios can be catered for. As I and others have argued, had there been a quick fix for avoiding duplicates from the start, then the problems in many individual libraries (and I'm willing to hazard a guess that most libraries are still mainly individual) would not have accumulated to the point where introducing duplicate detection causes potential problems because people have been citing duplicates etc etc. You have to start somewhere, and starting with basic functionality that works for many users is probably always going to be better than waiting until that imaginary moment when you can address all conceivable usage scenarios.

bdarcus · November 10, 2010

Just highlighting Dan's point that ...

We're not going to roll out duplicate detection that produces broken citation links when you actually use it.

... and mronkko's note about citation breakage with group libraries underlines a broader problem with the citation design that we've gone over before, but which also makes document portability a real challenge. So moving to a system that is less brittle to begin with alongside a way to merge "duplicates" (which is part of a broader problem of selecting metadata sets for particular resources) is critical.

But yeah, I think mark's point about stopping duplicates on import should be doable without the above: low-hanging fruit. I don't remember, but didn't Frank's code fix this? If it did, can you review it Dan?

adamsmith · November 10, 2010

I'm pretty sure Frank's code doesn't test on import, no.

fbennett · November 10, 2010

The experimental code is a compromise; it tags items on import as "new", so they can be reviewed as a batch via a special view. It would need an update to run against the trunk. I can update the patch, if there's interest.

Maple42 · November 10, 2010

Recently, I used zotero to collect references. I found it was pretty good. But there is only one thing which I think should be improved. I found there is no warning about duplicate references. I think it is better to give a warning when we add a similar reference even if it cannot delete it automatically.

darrask · November 12, 2010

Another vote for duplicate detection. It should be impossible to import a new reference that is already present. I ended up with a huge library, that I'm still trying to clean up.

ajlyon · November 12, 2010

I do think it would be good to get your patch up to speed, Frank. Maybe we can get it into the trunk while the 2.1 betas are still happening and provide some temporary workflow to address this problem inherent to research.

ajlyon · November 12, 2010

And if we land it on trunk, I'll try to learn enough about programming the Zotero core to help support it-- I feel for you trying to support citeproc-js single-handedly.

fbennett · November 13, 2010

Avram: I'll get back to it, although I won't hurry. In the short term, there has been a report of trouble with the Google Scholar translator (fetching the same document repeatedly in multiple mode). If you could take a look at that one, it would be a huge help.

citeproc-js seems to be settling down nicely, at long last. Most of the time I've put into it over the past couple of weeks has been spent on refactoring and voluntary extensions. Shouldn't be much of a burden going forward. He said.

Kevin_T · January 31, 2011

Hi,
I do not know much about the structure of the zotero-code but it could be possible to customize a saved search as a duplicate-check.

1. Each item should get an additional property “version”. Standard value should be 0.
2. The program should check on import if the entry already exists (e.g. title/author). There is no need to make the detection perfect, as long as every duplicate is found.
3. Then it should set the value of the existing item to 1 and the value of the new item to 2.
4. A saved search could be customized to show all items with version > 0.

Here are all potential duplicates listed. One can go through it step by step deleting duplicate entries or (in case of false duplicate) set the version-number to 0.
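
The four steps above can be sketched in a few lines of Python. This is only an illustration of the proposal, not Zotero code: items are plain dictionaries, `version` is the proposed extra property, and `looks_the_same` is a deliberately loose, made-up title/author test.

```python
def looks_the_same(a, b):
    """Deliberately loose test: same title and same author, ignoring case."""
    return (a["title"].strip().lower() == b["title"].strip().lower()
            and a["author"].strip().lower() == b["author"].strip().lower())


def import_item(new_item, library):
    """Add new_item, flagging it and any look-alike already in the library."""
    new_item["version"] = 0                  # step 1: default value
    for existing in library:                 # step 2: check on import
        if looks_the_same(existing, new_item):
            existing["version"] = 1          # step 3: existing item -> 1
            new_item["version"] = 2          #         new item      -> 2
    library.append(new_item)


def potential_duplicates(library):
    """Step 4: the 'saved search' listing every item with version > 0."""
    return [item for item in library if item.get("version", 0) > 0]


def mark_not_duplicate(item):
    """Reset a false positive so it drops out of the list."""
    item["version"] = 0
```

Erring on the side of flagging too much is acceptable here, since the user reviews the list and resets false positives by hand.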

Once such code is implemented, it will be easier to develop it into a real duplicate check, but for now this would really be enough.

Regards,
Kevin

adamsmith · January 31, 2011

fbennett actually has implemented experimental duplicate detection in his experimental multilingual version of Zotero - Sean of Zotero has tweeted that they're already looking at the code - so things are moving, if slowly.

Remember the real challenge is not to find duplicates but to not break citations that use one of them.

Kevin_T · February 1, 2011

Right, and the challenge is a really big one, too. It might take years to solve it cleanly and nicely.

That is why I would recommend to split the task.

The first one is to avoid importing duplicates.
The second one is to deal with existing duplicates.

The first task is the main one: if it is done correctly, it makes the second one unnecessary, and luckily it is the easier one to do.

In addition, it would greatly simplify the second task if there were a list of all duplicates. The user could decide for himself which items to delete, which to keep, or how else to deal with them (e.g. one could set a tag "duplicate" instead of deleting).

fbennett · February 1, 2011

The trial code I've implemented solves both problems, actually. The URLs of deleted items are reserved, and retained on file for 2 years. Any documents that reconnect to Zotero during that time will have their IDs remapped to a real existing ID. So there's something on the stocks, but the code needs careful review before it can be considered for inclusion in Zotero proper. (In the current version, it's also impossibly slow with large numbers of references, but I have a devious plan to speed things up.)
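
As a rough illustration of the remapping idea (and emphatically not the trial code itself), the Python sketch below keeps a small redirect table from deleted item IDs to the surviving master, with entries honoured for two years. The class name, the use of plain string IDs instead of URLs, and the approximation of two years as 730 days are all assumptions.

```python
from datetime import date, timedelta

# Assumption: "2 years" approximated as 730 days.
RETENTION = timedelta(days=2 * 365)


class MergeLog:
    """Remembers which deleted item ID was merged into which master ID."""

    def __init__(self):
        # deleted item ID -> (master ID, date of deletion)
        self._redirects = {}

    def record_merge(self, deleted_id, master_id, today):
        """Reserve deleted_id and point it at master_id."""
        self._redirects[deleted_id] = (master_id, today)

    def resolve(self, item_id, today):
        """Follow redirects to a live ID; return None once a redirect has expired."""
        seen = set()
        while item_id in self._redirects:
            if item_id in seen:
                return None  # defensive: a redirect cycle should never happen
            seen.add(item_id)
            master_id, deleted_on = self._redirects[item_id]
            if today - deleted_on > RETENTION:
                return None
            item_id = master_id
        return item_id
```

A document that reconnects within the retention window would ask `resolve` for each cited ID and rewrite its citations to the returned master; following the chain matters because a master can itself be merged away later.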

mark · February 1, 2011

Really looking forward to this Frank. Thanks for your work!

Kevin_T · February 1, 2011

wow, great :)

I did not know that you were this far along already. I am looking forward to the final version and will try to be patient.