Duplicate detection?
There is no duplicate detection that's currently useable, no, so 2.)
or is it time for me to finally give up on zotero?
I also can't tell you how you got corrupted citations - again, that's not something that happens with regular Zotero use.
I don't know if you want to give up on Zotero; you'll have to decide that for yourself. Hundreds, possibly thousands, of people (me included) write their dissertations with Zotero successfully and without major issues.
Duplicate detection is certainly an important feature and it's going to happen eventually, but for regular use, not having the feature is a minor nuisance, not a major disaster.
does this mean that perhaps there is a fix for my database?
The killer, though, is the absence of duplicate detection. I cannot import all my references from EndNote. Each time the import fails and I try again, duplicates are created. Like so many others, I've posted on this before. It is becoming massively inconvenient not having one unique database for my work.
As I've said before, I'd accept a beta and sub-optimal solution (e.g., my libraries will no longer link to existing documents) as a price to pay for having a single library free of duplicates.
Please offer a solution to this long-standing sore of an issue! Thanks.
Now if you're trying to import an EndNote library that significantly overlaps with items already in your library, yes, then you're going to need duplicate detection badly.
I know there is working code by Frank Bennett in the multilingual version, but with all the warnings plastered over the page that it is an experimental version and should be used with a separate profile, there is no way anybody is going to use it in a production environment. From which I conclude, with some regret, that even the simplest form of duplicate detection, the one that would have forestalled a lot of problems down the track, is not in place as we speak, despite Dan Cohen's remark in October 2006: I hate to be nagging like this, but as I've tried to make clear here and elsewhere, I do think Zotero could sometimes benefit more from a stance that was aimed at first providing quick and simple solutions that work for a lot of users, and only then working out the one perfect solution to all related problems. Had this been done in the case of duplicate detection, the problems would probably have been a lot less ugly, because right now our libraries are messier and more duplicate-ridden than they could have been.
Frank's code obviously solves a lot of the more complex problems; I can imagine it also working as a solution for duplicates introduced by two collaborating researchers with partially overlapping (group) libraries, which is great. But it is still, as you say, for power users only.
Which means (unless this is going to be merged into the main development line soon, which I don't think we can expect) that my main argument still stands: prevention is better than cure, and a simple solution that comes soon is better than the mother of all duplicate detection solutions that takes five or six years to materialise.
http://forums.zotero.org/discussion/42/2/duplicate-detection/#Comment_71964
I also tried the hidden feature for duplicate detection as described in
http://forums.zotero.org/discussion/13658/barrier-to-entry-no-duplicate-detection/#Item_2
This works with some restrictions (slow with my 1000+ items, dozens of false positives) - better than nothing, so adding it to the regular gear menu would already be helpful for those users who don't want to tinker in about:config. But even if improved, it won't be a substitute for a warning that appears as soon as I add a duplicate!
My proposal of fields to use, ranked by descending weight:
1. DOI
2. author last name
3. title not case-sensitive (only first n words?)
4. year
5. publication
6. page numbers
A DOI match is hit or miss, which is good, but not all items have a DOI. Author last name + title + year should probably receive a combined weight that is equal to or higher than that of DOI. Given the importance of these first four, perhaps 5 and 6 add little value.
Interface-wise, it is really important that the user gets to see the existing item that the new item is assumed to be a duplicate of. So the prompt should include a citation form of the existing item.
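To make the weighting idea concrete, here is a rough sketch of how such a score might be computed. This is not Zotero code: the dictionary keys, the point values, the threshold of 100 and the choice of n = 5 title words are all assumptions picked only so that author + title + year together weigh as much as a DOI match.

```python
import re

# Hypothetical point values (not from Zotero); a DOI match alone is worth 100,
# and author + title + year together also add up to 100.
DOI_POINTS = 100
FIELD_POINTS = {"author": 35, "title": 35, "year": 30, "publication": 10, "pages": 10}
TITLE_WORDS = 5  # "only first n words?" -- n = 5 is an assumption


def _norm(value):
    """Lower-case and collapse whitespace; missing values become ''."""
    return re.sub(r"\s+", " ", str(value or "").strip().lower())


def _title_key(title):
    """Case-insensitive key built from the first TITLE_WORDS words of the title."""
    words = re.findall(r"\w+", str(title or "").lower())
    return " ".join(words[:TITLE_WORDS])


def duplicate_score(a, b):
    """Return a point score for two items given as plain dicts."""
    doi_a, doi_b = _norm(a.get("doi")), _norm(b.get("doi"))
    if doi_a and doi_b:
        # DOI is treated as decisive in both directions when both items have one.
        return DOI_POINTS if doi_a == doi_b else 0
    score = 0
    for field, points in FIELD_POINTS.items():
        if field == "title":
            key_a, key_b = _title_key(a.get(field)), _title_key(b.get(field))
        else:
            key_a, key_b = _norm(a.get(field)), _norm(b.get(field))
        if key_a and key_a == key_b:  # empty fields never count as a match
            score += points
    return score


def is_probable_duplicate(a, b, threshold=100):
    return duplicate_score(a, b) >= threshold
```

With these numbers, an item is flagged either on a DOI match or when author last name, title and year all agree; publication and page numbers only matter as tie-breakers if the threshold is raised.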
My experience of detecting duplicates through a combination of author last name or title (with EndNote, many years ago) is not good. That's because duplicated references often come from different sources, such as LILACS and SCOPUS. Some of these remote databases have different character encodings, and thus some names come out differently, especially those with Latin characters. For example:
author name: Lilacs - alberto muños
Scopus - Munos, Alberto
Sometimes when "muños" is imported, it becomes something weird like "mun/A$s" in the reference manager. Thus, the problem here is symbols such as ~ ç ¨ that may mess up the character comparison. The same happens with the paper title.
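For the simpler half of this problem (accented vs. unaccented spellings and differing name order), a comparison could fold names to a common form first. The sketch below uses only Python's standard `unicodedata` module; the helper names `fold` and `last_name` are made up for illustration, and this does nothing for text that is already corrupted on import, such as "mun/A$s".

```python
import unicodedata


def fold(text):
    """Strip diacritics and case so that 'muños' and 'Munos' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def last_name(author):
    """Guess the last name from 'Last, First' or 'First Last' (a crude heuristic)."""
    author = author.strip()
    if "," in author:
        return fold(author.split(",", 1)[0].strip())
    parts = author.split()
    return fold(parts[-1]) if parts else ""


# The two spellings from the example above end up with the same key.
same = last_name("alberto muños") == last_name("Munos, Alberto")
```

The last-name guess is deliberately naive; multi-word surnames would need more care.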
In my humble opinion, DOI (or another unique identifier) and URL are nice when they are present. After that, some combination of numeric fields such as year, issue, volume and initial page is unlikely to occur twice.
Early in this topic, probabilistic linkage was proposed; if text matching is really necessary, this might be a way to go. Although Zotero does not (yet) have a duplicate detection tool, it is a very, very handy tool. I must congratulate its developers. Keep on going... ;-) Champions never give up!
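As an illustration of the numeric-fields idea, items could be grouped by a key made of year, volume, issue and initial page, sidestepping text comparison entirely. The field names below are assumptions, and in this sketch items missing any of the four values are simply skipped rather than guessed at.

```python
import re
from collections import defaultdict


def first_int(value):
    """Return the first run of digits in value as an int, or None if there is none."""
    match = re.search(r"\d+", str(value or ""))
    return int(match.group()) if match else None


def numeric_key(item):
    """(year, volume, issue, initial page), or None if any of them is missing."""
    key = tuple(first_int(item.get(f)) for f in ("year", "volume", "issue", "pages"))
    return None if None in key else key


def group_candidates(items):
    """Return lists of items that share the same numeric key."""
    groups = defaultdict(list)
    for item in items:
        key = numeric_key(item)
        if key is not None:
            groups[key].append(item)
    return [group for group in groups.values() if len(group) > 1]
```

Because only the first number in the page field is used, "123-130" and "pp. 123-30" yield the same initial page.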
Kind regards,
Pedro
https://www.zotero.org/trac/changeset/9932
The matching algorithms are currently fairly simplistic, but the basic functionality is in place, and we'll be improving the detection going forward.
As with the trunk in general, I don't recommend trying this with a production database.
This functionality will be included in the next beta of Zotero.
As for the pre-flight, is the extra burden of deletion syncing really such a big deal? I would imagine that there wouldn't be an awful lot of spurious deletions-- I don't think that duplicates are quite that common.
I'm not dead-set on pre-flight checking using deletion-- a behavior that builds off of the itemDone event and allows the user to prevent saving would of course be cleaner. With the migration of translation into the server and connectors, it's a little hard to imagine a cross-platform way to implement it, but it might still be possible. For now though, we should probably see how this solution works out, and let the translator code rest and settle down before exploring that route.
Thanks a lot to the Zotero team!
Code-savvy Zotero users, here's your chance to write a plugin that would be massively popular! (See aylon's point, and useful UI suggestions above.)
Since this thread goes back to literally the first few days of Zotero's existence, I'm going to close it now that duplicate detection is available. If someone wants to take up Mark's suggestion, feel free to start a new thread.