Duplicate detection?
I have read most of the 91 messages above, and at some point I felt pity for the developers. Almost four years of users desperately requesting the same thing; it seems programming this stuff is not really easy. I'm a physician, so I can't say how easy or difficult it could possibly be. But come on, guys... if you are really, really desperate... donate!!! Isn't it good and useful??? ... then donate as much as you can! Get the programmers at your institutions involved! Sell Zotero's functionality to your librarians.
Giving some thought to this discussion, what is the problem with duplicates?
The problem I have faced more than once is that I cite one reference in a paper... and later in the same paper I cite the duplicate reference, thinking it is the first. Thus, in the bibliography I would have the "same" reference inserted twice with different numbers. In this case, a careful examination of the bibliography would detect the error. Fixing it is the problem: editing the citations may take some time, but it will work.
The second problem is that after inserting a citation, the user deletes the reference from the library, thinking it is a duplicate; then, when the bibliography is inserted, Zotero will not be able to find the corresponding reference. When I used to work with EndNote (version 10), it had a warning window saying it was not able to find the cited reference, and at the same time it opened the library and asked which reference should be used instead! At that point it was also possible to add a new reference to the library to replace the old one! It never happened to me, though!
Now a question to further understand the duplicate/merging problem!!!
Suppose I have a good reference and I want to cite it in paper X and paper Y. So I import it from PubMed or LILACS or whatever into collection X, and later I drag it into collection Y. Am I creating a duplicate? If so, then deleting duplicates is really dangerous! But if it works like Gmail, where everything is in "My Library" (the "inbox") and I am able to tag it with several tags, such as collection X and collection Y, then the reference is really just one item that appears in several collections because it has several tags. If that is the case, then duplicates are really unnecessary and annoying.
Anyway, congratulations to the developers (even if duplicate finding is not yet available), because Zotero has helped me a lot and I'm really a fan... my students are required to get to know Zotero, along with other nice free and open-source software!
Don't give up! ;-)
When you drag an item to another collection inside the same library or group, it doesn't create a new copy (it works, as you say, like Gmail). If you drag the item across libraries, it does create a separate copy. So (at present, at least) if you are collaborating with other authors on a document, everyone should draw their references from the same library.
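In other words, within a single library an item's collection memberships behave like labels on one record, and only a cross-library drag produces a second record. Here is a minimal sketch of that distinction in Python (toy names, not Zotero's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    key: str
    title: str
    collections: set = field(default_factory=set)  # label-style memberships

@dataclass
class Library:
    items: dict = field(default_factory=dict)      # key -> Item

    def add_to_collection(self, key: str, collection: str) -> None:
        # Within one library: no copy, just another membership (a Gmail-style label).
        self.items[key].collections.add(collection)

def copy_across_libraries(src: Library, dst: Library, key: str) -> Item:
    # Across libraries: a genuinely separate record comes into existence,
    # which is exactly where duplicates (and broken citations) start.
    orig = src.items[key]
    dup = Item(key=key, title=orig.title)
    dst.items[dup.key] = dup
    return dup
```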
There is a plan to smooth this out (so that the reference looks and feels like a single copy, everywhere), as the developers find time to work on it.
I'm sure that improved duplicates management will come in due course -- no one is giving up!
http://www.springerlink.com/content/788fq0p1wkjy63lt/
There may well be methods in "R" (the open-source statistics software) that could be used for this purpose. I am not sure whether one could bring R to bear directly on the problem without having everyone install R. That's probably too much to ask! Anyhow, hopefully something to think about.
As always, thanks.
But I have to agree these were very nice insights! How could probabilistic linkage be programmed in Zotero? No idea! :-(
I have to agree, not every problem needs a hammer, and this problem may well not need a jackhammer like R. The point is more to say 1) probabilistic methods are preferable to deterministic methods, 2) there are methods for this sort of thing, and 3) there may even be extant code. I don't think we would want to bundle R with this, but one option might be a batch process running on Amazon's cloud cluster, so that you could have R sync with your Zotero cloud, do its thing, and send back a script of recommended deletes, or something like that. You could also do that locally if you wanted, but then you'd need R.
Lots of options, clearly.
The interesting thing with these databases is that - at least for those that have been scraped - there should be some high-value fields (PMIDs, etc.) which have high validity and high discriminatory power. So that should make the process a lot easier.
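To make that concrete: a common pattern in record linkage is to match deterministically on a high-validity identifier first, and only fall back to a probabilistic score over weaker fields. A rough sketch in standard-library Python (the field names, weights, and threshold are illustrative assumptions, not anything Zotero actually does):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    # Tier 1: a shared high-validity identifier settles the question outright.
    for id_field in ("pmid", "doi"):
        if rec_a.get(id_field) and rec_a.get(id_field) == rec_b.get(id_field):
            return 1.0
    # Tier 2: weighted similarity over weaker, noisier fields.
    weights = {"title": 0.6, "author": 0.3, "year": 0.1}
    return sum(w * similarity(str(rec_a.get(f, "")), str(rec_b.get(f, "")))
               for f, w in weights.items())

# Pairs scoring above a tunable threshold would be flagged for human review,
# never deleted automatically.
DUPLICATE_THRESHOLD = 0.85
```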
I would not underestimate the challenges, but I would think that a quick walk from the Centre for History and New Media to the Stats department would yield a graduate student looking for an M.A. topic.
That can definitely be solved, but just remember that this isn't as simple as searching for matching unique ids.
We're well aware of the demand for this feature, though.
Note that as long as some simple check like that is not in place, the problem will get progressively worse because people have no way of avoiding adding (and citing) duplicate items. So my suggestion would be to start with a low-tech, low-complexity solution that serves a lot of users, and worry about the more difficult cases later.
I'm thinking of something like this: a simple check and warning like that would be welcomed by all users and would certainly ease the pain of waiting for the more complex solution.
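For instance, a low-tech import-time check might just normalize a few fields into a lookup key and warn on collision. A minimal sketch, assuming a simple in-memory index (the helper names are hypothetical, not Zotero's import code):

```python
import re

def dedup_key(item: dict) -> str:
    """Normalize title + first author + year into a crude lookup key."""
    def norm(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return "|".join((norm(item.get("title", "")),
                     norm(item.get("first_author", "")),
                     str(item.get("year", ""))))

def check_on_import(item: dict, index: dict):
    """Return the existing item if the new one looks like a duplicate, else index it."""
    key = dedup_key(item)
    if key in index:
        return index[key]  # caller warns: "This may already be in your library."
    index[key] = item
    return None
```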
That's the way my experimental code works. It flags newly arrived items as "on probation", as it were, and then allows you to open a vetting display where possible clashes between the new arrivals and existing content can be resolved.
It provides a primitive "merger" facility that will slot the master reference into any collections where the newcomer is present, but it doesn't include a mechanism for mapping the identity of the deleted item to the master -- so word processing documents that depend on the newcomer will break. That's a bit beyond my skill level, but (although I would say so, being myself) I do like the interface and the workflow.
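In data terms, the merge step described above comes down to something like the following (a sketch reusing the toy Item/Library shape from earlier in the thread, not the actual experimental code):

```python
def merge_into_master(master, newcomer, library) -> None:
    # Slot the master into every collection where the newcomer appears...
    master.collections |= newcomer.collections
    # ...then drop the newcomer. The missing piece -- the one that breaks
    # word-processor documents -- is a persistent mapping from
    # newcomer.key to master.key that the citation plugin could consult.
    del library.items[newcomer.key]
```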
There's a screenshot here, if you're interested. Patience, but the developers are aware of our eagerness for duplicates handling, and when a solution arrives I'm sure it will be a good one.
A simple duplicate check on import (configurable, please) would solve almost all problems. This can't be so difficult!
Anything at a later stage is complicated and awkward - including manually wading through a thousand records.
Please, please, a quick and simple solution on import, way overdue.
I would be more than happy to help test it.
For my part:
- Even a crude check would be helpful (e.g., "Zotero thinks the following may be duplicates...")
- It seems one problem is maintaining links between your writing and the library as duplicates are removed? From what I can understand, this creates complexity around the unique identifying code for each database entry. I'm quite happy to re-insert references in my work (i.e., break the link between the library and the paper) if that is what it takes to have a remove-duplicates function. Also, if this keeps Zotero's code simple and therefore more reliable, it definitely gets a big vote from me.
Thanks for all the work. Any news on when this duplicate removal function will come?
If something "quick and dirty" is implemented now, it will most likely backfire by causing compatibility issues when users later migrate to a "more proper" duplicate detection.
There is some duplicate-removal functionality available in the latest version, but it is not enabled by default due to potential problems. If you search the forum, you will probably find instructions. (I do not remember how to enable it.)
I'm a fairly new user (months) of Zotero. I have a couple of papers that use shared libraries. I'd certainly like duplicate detection for these libraries as well as my 'own' (i.e., not shared) libraries. Why?
At the moment I can't import my references from EndNote completely. If the process has a problem, as it has three times, duplicates very quickly build up.
I'd like to feel confident Zotero will stay reliable. Many software problems seem to start from trying to maintain backward compatibility: as the software gets more complex and more involved, it is harder to maintain and less reliable. If avoiding that means users have to do a little extra work, then perhaps that is acceptable?
Perhaps a solution is to offer a choice of which libraries are examined for duplicates? That way, libraries that are shared and ongoing remain untouched. In a shared library, perhaps only administrators should be able to run the duplicate check, to avoid the problems highlighted in the previous post.
Thanks for flagging the availability of some functionality to check for duplicates.
But yeah, I think mark's point about stopping duplicates on import should be doable without the above: low-hanging fruit. I don't remember, but didn't Frank's code fix this? If it did, can you review it, Dan?
citeproc-js seems to be settling down nicely, at long last. Most of the time I've put into it over the past couple of weeks has been spent on refactoring and voluntary extensions. Shouldn't be much of a burden going forward. He said.
I do not know much about the structure of the Zotero code, but it might be possible to use a saved search as a duplicate check.
1. Each item should get an additional property “version”. The default value should be 0.
2. The program should check on import whether the entry already exists (e.g., by title/author). There is no need to make the detection perfect, as long as every duplicate is found.
3. It should then set the version of the existing item to 1 and the version of the new item to 2.
4. A saved search could be customized to show all items with version > 0.
This lists all potential duplicates. One can go through them step by step, deleting duplicate entries or (in the case of a false positive) setting the version number back to 0.
Once such code is implemented, it will be easier to evolve it into a real duplicate check, but for now this would really be enough; see the sketch below.
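A minimal sketch of this scheme, assuming items are plain dictionaries and using the crude title/author check from step 2 (illustrative names, not Zotero's API):

```python
def looks_like(existing: dict, new: dict) -> bool:
    """Crude step-2 check: same title and author. Over-matching is acceptable."""
    return (existing["title"].lower() == new["title"].lower()
            and existing["author"].lower() == new["author"].lower())

def import_item(new: dict, library: list) -> None:
    new.setdefault("version", 0)        # step 1: default version is 0
    for existing in library:
        if looks_like(existing, new):
            existing["version"] = 1     # step 3: flag the existing item...
            new["version"] = 2          # ...and the incoming one
            break
    library.append(new)

def potential_duplicates(library: list) -> list:
    # Step 4: the "saved search" -- everything with version > 0.
    return [item for item in library if item.get("version", 0) > 0]
```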
Regards,
Kevin
Remember, the real challenge is not finding duplicates but not breaking the citations that use one of them.
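One way to picture that requirement: merging away a duplicate would have to leave behind a forwarding record, so that documents citing the removed item still resolve. A speculative sketch of such a redirect table (nothing like this is confirmed to exist in Zotero):

```python
# Forwarding table: ID of a merged-away duplicate -> ID of the surviving master.
redirects: dict = {}

def merge_ids(dup_id: str, master_id: str) -> None:
    redirects[dup_id] = master_id

def resolve(item_id: str) -> str:
    # Follow the chain: a master may itself have been merged away later.
    while item_id in redirects:
        item_id = redirects[item_id]
    return item_id
```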
That is why I would recommend splitting the task.
The first part is to avoid importing duplicates.
The second part is to deal with existing duplicates.
The first task is the main one: if it is done correctly, it makes the second one unnecessary - and luckily it is also the easier one to do.
In addition, dealing with the second task would be much simpler if there were a list of all duplicates. The user could then decide for himself which items to delete, which to keep, or how else to deal with them (e.g., one could set a tag "duplicate" instead of deleting); see the sketch below.
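A sketch of building such a list non-destructively: group items by a crude key (for instance, the normalized title/author/year key from earlier in the thread) and tag, rather than delete, anything that collides (hypothetical helpers, not Zotero code):

```python
from collections import defaultdict

def list_duplicates(library: list, key_fn) -> list:
    """Group items by key_fn; return only the groups with more than one member."""
    groups = defaultdict(list)
    for item in library:
        groups[key_fn(item)].append(item)
    return [group for group in groups.values() if len(group) > 1]

def tag_duplicates(library: list, key_fn) -> None:
    # Non-destructive: mark candidates for human review instead of deleting them.
    for group in list_duplicates(library, key_fn):
        for item in group:
            item.setdefault("tags", []).append("duplicate")
```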
I did not know you were this far along already. I am looking forward to the final version and will try to be patient.