Duplicate detection?
The important feature is merging entries that describe the same item but differ. A simple approach that would work for me would be: 1) information (and links) that exists in only one entry is combined, 2) in the case of a conflict, let me pick a ‘master’ entry. A more complex solution would be to let me choose on a per-tag basis, but I could live without something that complex.
I'm sure that other users have different needs and the issue is more complex than I understand. But, I would prefer to have an imperfect solution over none.
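The simple merge rule described above can be sketched roughly as follows. This is only an illustration in Python, not Zotero code: items are assumed to be plain dictionaries, fields such as `links` are assumed to be lists in both entries, and `merge_entries` is a made-up name.

```python
def merge_entries(master, other):
    """Merge two records for the same item (hypothetical helper).

    Assumptions: records are plain dicts; multi-valued fields (links,
    tags) are lists in both records; an empty string or None means
    "no value".
    """
    merged = {}
    fields = list(master) + [f for f in other if f not in master]
    for field in fields:
        m, o = master.get(field), other.get(field)
        if isinstance(m, list) or isinstance(o, list):
            # Rule 1 for multi-valued fields: keep everything, without repeats.
            combined = list(m or [])
            for value in (o or []):
                if value not in combined:
                    combined.append(value)
            merged[field] = combined
        elif m in (None, ""):
            # Rule 1: the value exists in only one entry, so take it.
            merged[field] = o
        else:
            # Rule 2: on a conflict (or agreement) the master entry wins.
            merged[field] = m
    return merged
```

A per-tag choice would replace the last branch with a prompt to the user for each conflicting field.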
It would be nice to get a quick comment from Dan or another dev on whether duplicates are going to be addressed in 2.1 (which I agree they should be).
If a simple internal interface leveraging Zotero search were exposed in the translators, and tied to a popup warning mechanism (use existing item/download duplicate item/reject), contributors could slot it in to fix the translators they use often.
Without experimenting with heuristics, an author merge feature could be implemented simply as a manual select-from-search. (Similar to one very popular webmail client's contact merging feature.)
A similar function for other potentially duplicated items such as publisher, journal or book title, location, etc. could eventually give some human logic help to the heuristic check for full-duplicates.
This could help with the first-name-disambiguation confusion which has been reported (incorrectly) as a bug in a number of styles. Sometimes the disambiguation is triggered when the same author is entered twice with slight differences, such as with or without a period following an initial.
see: http://forums.zotero.org/discussion/7457/should-there-be-a-no-givenname-disambiguation-default-style/
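To illustrate the with/without-period case, here is a small Python sketch of a comparison key for author names. The function name `name_key` and the (last, first) argument layout are assumptions for the example, not anything Zotero exposes, and it only handles periods, spacing and case.

```python
import re

def name_key(last, first):
    """Comparison key for an author name (hypothetical helper).

    Periods in the given name are treated as spaces, runs of whitespace
    are collapsed, and case is ignored, so "J. R." and "J R" compare equal.
    """
    given = re.sub(r"\s+", " ", first.replace(".", " ")).strip()
    return (last.strip().casefold(), given.casefold())
```

Two creators with the same key would be candidates for a manual merge. A full first name and a bare initial still produce different keys, so that case would remain a human decision.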
1. Notify the user when they try to add an article, and ask whether to continue adding it, if the library already contains an article with the same ...
1) title (case-insensitive, ignoring whitespace),
2) DOI, or
3) publication, volume, issue, and pages, all together
This would be satisfying, at least for most scholars.
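The three criteria above could be checked along these lines. This is a sketch in Python with made-up field names (`title`, `doi`, `publication`, `volume`, `issue`, `pages`) on plain dictionaries, not the way Zotero stores items. DOIs are compared case-insensitively, which is my addition.

```python
import re

def normalize_title(title):
    # Case-insensitive, ignoring all whitespace.
    return re.sub(r"\s+", "", title).casefold()

def is_probable_duplicate(new, existing):
    """Return True if any one of the three criteria matches."""
    # 1) Same title.
    if new.get("title") and existing.get("title"):
        if normalize_title(new["title"]) == normalize_title(existing["title"]):
            return True
    # 2) Same DOI (empty DOIs never match).
    new_doi = (new.get("doi") or "").strip().lower()
    old_doi = (existing.get("doi") or "").strip().lower()
    if new_doi and new_doi == old_doi:
        return True
    # 3) Publication, volume, issue and pages all present and all equal.
    locator = ("publication", "volume", "issue", "pages")
    if all(new.get(k) and new.get(k) == existing.get(k) for k in locator):
        return True
    return False
```

The check would run against each existing item once the metadata for the new article is available, and a match would trigger the "add anyway?" question.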
Regarding the timing,
it should obviously be done after retrieving the article's metadata, and
it would be best done before the article is actually added to the library.
If many articles are added at once, the dialog box can simply be shown once for each article.
2. Duplicate management in the existing library
1) Detect the duplicates with the above algorithm when the user asks to do so (i.e. press a specific button)
2) Show a list of duplicates
3) Let the users do the rest!
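For the existing library, steps 1) and 2) amount to grouping items that share any of the matching keys. Below is a rough Python sketch under the same assumptions as before: plain dictionaries, illustrative field names, and function names (`match_keys`, `find_duplicate_groups`) that are invented for the example. It links items transitively, so if A matches B by title and B matches C by DOI, all three land in one group.

```python
import re
from collections import defaultdict

def match_keys(item):
    """Keys under which an item can match another (illustrative fields)."""
    keys = []
    if item.get("title"):
        keys.append(("title", re.sub(r"\s+", "", item["title"]).casefold()))
    if item.get("doi"):
        keys.append(("doi", item["doi"].strip().lower()))
    locator = tuple(item.get(k) for k in ("publication", "volume", "issue", "pages"))
    if all(locator):
        keys.append(("locator", locator))
    return keys

def find_duplicate_groups(library):
    """Return lists of indexes into `library` that look like duplicates."""
    parent = list(range(len(library)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, item in enumerate(library):
        for key in match_keys(item):
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(library)):
        groups[find(idx)].append(idx)
    return [members for members in groups.values() if len(members) > 1]
```

The returned groups are what would be shown in the list for step 2); step 3) stays with the user.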
I know that these algorithms will be far from perfect, but again, what many existing users definitely and desperately need is this simple improvement. I think there has been a strong and persistent need for this functionality, as the many replies in this discussion show, and it might be given priority over others.
If there remain any possible complications in implementing the algorithms I suggested, I'm eager to hear about that.
Likewise, the titles of articles sometimes have something like "[review article]" appended. Sometimes they do, sometimes they don't. So I would want to leave off the last part of the title when doing a comparison by title. Etc.
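As one example of such a variant, a title comparison that leaves off a trailing bracketed note could look like this in Python. The name `title_key` is invented, and the rule (drop one trailing `[...]` group, then ignore case and whitespace) is only one possible reading of "leave off the last part of the title".

```python
import re

# One bracketed group at the very end of the title, e.g. " [review article]".
_TRAILING_TAG = re.compile(r"\s*\[[^\[\]]*\]\s*$")

def title_key(title):
    """Comparison key that ignores a trailing "[...]" annotation."""
    stripped = _TRAILING_TAG.sub("", title)
    if not stripped.strip():
        # The whole title was bracketed, so keep it as it is.
        stripped = title
    return re.sub(r"\s+", "", stripped).casefold()
```

Two titles would then be treated as the same when their keys are equal. Offering this as one selectable comparison among several would fit the idea of letting users choose their own duplicate check.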
I think there should be multiple, easily developed algorithms that the user can choose from for their duplicate checking.
Kieren: Would you want to try some of these out? I imagine duplicate detection and handling would be a good killer app for zotero-browser.
Well I have a working installation of my browser here, and the (non-gui parts of the) code should be identical in the browser to what would end up in zotero, so in principle I agree. My time is a bit limited for the next month or so, but in principle I can have a look later in June. Developing duplicate detection algorithms was the third use I came up with for the thing, after enhanced reports, and playing with my naive text mining environment.
If you fork on github, just notify me of changes through pull requests.
One temporary solution.
Step:
Occasionally...
1) Export your references from zotero as .bib
2) Use jabref to find duplicates and correct the database and save it.
3) Delete all data from zotero.
4) Import the corrected .bib file into zotero.
I don't know if the JabRef people are willing to share the code/algorithm and, if so, how portable it is to Zotero's code.
I certainly don't think that the current solution is sufficient, but it may alleviate the pain of some users for now.
I would also vote for this to be a high development priority, else I am reluctant to make the complete switch from EndNote using my 4000+ item library :-(
Thx
Instructions by Frank Bennett here.
Thank god for this description:
http://forums.zotero.org/discussion/13658/barrier-to-entry-no-duplicate-detection/#Item_2
It made cleaning up my library very easy! It's definitely a good start.
However, I would still prefer a "pre-import" warning.
Still no news from the devs about this function?
Best,
Jan