Duplicate detection?

brandtb · February 28, 2010

I'm just getting started with Zotero. So far it is great, but dups are already an issue. From reading this discussion, there are two issues. The first is finding exact duplicates. The second is merging entries about the same item that are not exact. Finding and removing exact duplicates is not hard right now. Detection on import would be nice.

Merging entries about the same item that differ is the important feature. A simple approach that would work for me would be: 1) Information (and links) that exist in only one entry is combined, 2) In the case of a conflict, let me pick a ‘master’ entry. A more complex solution would be to let me choose on a per-tag basis, but I could live without something that complex.
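The merge rule proposed above can be sketched in a few lines. This is only an illustration, assuming each entry is a plain dictionary of field names to values; `merge_entries` and the field names are hypothetical and not part of Zotero.

```python
def merge_entries(master, other):
    """Merge two records (plain dicts) that describe the same item.

    Fields present in only one record are kept. When both records
    have a value for a field, the value from `master` wins.
    Empty strings and None count as missing.
    """
    merged = dict(other)
    for field, value in master.items():
        if value not in ("", None):
            merged[field] = value  # master overrides on conflict
    # Drop fields that ended up empty in both records.
    return {k: v for k, v in merged.items() if v not in ("", None)}
```

For example, merging a master entry that has a title and year with a second entry that has a differently capitalized title, a DOI, and page numbers keeps the master's title and year and picks up the DOI and pages from the other entry.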

I'm sure that other users have different needs and the issue is more complex than I understand. But, I would prefer to have an imperfect solution over none.

adamsmith · February 28, 2010

the biggest problem about merging is actually how to deal with citations using the "merged" item. I believe everything else would be quite doable.
It would be nice to get a quick comment from Dan or another dev if duplicates are going to be addressed in 2.1 (which I agree they should be).

emorales · March 17, 2010

Hi. This thread was started by CB on October 10, 2006. Do we have pdf duplicate detection yet?

kalital · March 31, 2010

I'm another scholar who would greatly appreciate duplicate detection. I do a lot of searches with minor keyword differences to catch all possible references, and it would be great if Zotero didn't add exact duplicates. It would also lower my total number of citation downloads from Google Scholar, and thus lower my risk of being locked out for adding unnecessary numbers of items. It's been 4 years since folks started asking for it. Is it actually coming soon?

noksagt · March 31, 2010

"plus it would lower my total number of citation downloads from Google Scholar, and thus lower my risk of being locked out for adding unnecessary numbers of items"

I don't know how likely this is: Zotero would need to retrieve the bibliographic data before it could tell you whether the record was an exact duplicate, and Google would count that retrieval as a query.

glientsc · April 2, 2010

Why not just go to "My library" and sort for title, then it is easy to spot duplicates as you scroll down to clean up your database...

kalital · April 2, 2010

@glientsc This is what I do, but it's a real pain to spot them when the list is several hundred citations long. I've gone back to using EndNote and Mendeley. I find them more functional in combination, though EndNote is painfully slow.

rwielage · April 30, 2010

This is becoming a real headache for us. We would appreciate a duplicate prevention facility, even if it were not perfect. One could let the user determine what to compare. I'd be happy with the last names of the authors, year, volume and page number. Someone else might want to compare something else. Anyway, if the comparison matches, bring up a dialogue box with the complete information of both references and let the user decide whether it's a duplicate. This is a case where a feature that works most of the time is better than not having the feature at all.
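A rough sketch of such a user-configurable comparison follows, assuming records are dictionaries and that author last names are stored as a list. The function name `same_reference` and the field names are made up for illustration; a match would only trigger the confirmation dialogue, not an automatic rejection.

```python
DEFAULT_FIELDS = ("authors", "year", "volume", "first_page")


def _norm(value):
    """Normalize a field value for comparison (case and outer whitespace)."""
    if isinstance(value, (list, tuple)):
        return tuple(str(v).strip().lower() for v in value)
    return str(value).strip().lower()


def same_reference(a, b, fields=DEFAULT_FIELDS):
    """True when every chosen field is present in both records and equal.

    `fields` is the user's choice of what to compare; a field that is
    missing or empty in either record counts as a non-match.
    """
    for field in fields:
        left, right = a.get(field), b.get(field)
        if not left or not right or _norm(left) != _norm(right):
            return False
    return True
```

Passing a different `fields` tuple, for example `("authors", "year")`, loosens or tightens the check without changing the code.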

fbennett · May 1, 2010

Agree that this is an important item. I had a similar idea with the approach I suggested under the zotero-dev post that I linked above: the risk of acquiring duplicates is greatest when items enter Zotero through a translator; translators are limited to a single site, and typically work on a limited set of item types; given those constraints, checking for duplicates in the translator after the item fields are collected, but before submission, is simple to do reliably -- which is maybe not the case for a one-size-fits-all duplicates bloodhound function.

If a simple internal interface leveraging Zotero search were exposed in the translators, and tied to a popup warning mechanism (use existing item/download duplicate item/reject), contributors could slot it in to fix the translators they use often.

rwielage · May 3, 2010

We are also having the problem when users import references from other reference managers.

jameshalgren · May 4, 2010

A first step would be to build an author management tool which allows for merging multiple versions of the same author name.

Without experimenting with heuristics, an author merge feature could be implemented simply as a manual select-from-search. (Similar to one very popular webmail client's contact merging feature.)

A similar function for other potentially duplicated items such as publisher, journal or book title, location, etc. could eventually give some human logic help to the heuristic check for full-duplicates.

This could help with the first-name-disambiguation confusion which has been reported (incorrectly) as a bug in a number of styles. Sometimes, the disambiguation is triggered when the same author is entered twice with slight differences, such as with/without a period following an initial, etc.
see: http://forums.zotero.org/discussion/7457/should-there-be-a-no-givenname-disambiguation-default-style/

rwielage · May 4, 2010

I don't think item duplication needs to involve author disambiguation. Users have needed this feature for years. If this could be implemented with 99% sensitivity and 90% selectivity (i.e. miss 1% of dups and produce erroneous dup messages 10% of the time) users would be far ahead of not having duplicate checking at all. Let's not let the perfect be the enemy of the good.

ohthere · May 5, 2010

I suggest a list of very simple algorithms for duplication management that can be easily implemented. Although this would not be perfect, I think most users will be *far happier* having this, than having none.

1. Notify the user when they try to add an article, and ask whether to continue adding it, if the library already contains an article with the same ...
1) title (case-insensitive, ignoring whitespace),
2) DOI, or
3) publication, volume, issue, and page, all together

This would be satisfying, at least for most scholars.
Regarding the timing,
it should obviously be done after retrieving the article's meta data, and
it would be best to be done before the article is actually added to the library.

If many articles are added at once, you can just show the dialog box many times.

2. Duplicate management in the existing library
1) Detect the duplicates with the above algorithm when the user asks to do so (i.e. press a specific button)
2) Show a list of duplicates
3) Let the users do the rest!
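The two proposals above can be sketched as follows. This is a minimal illustration, assuming items are dictionaries with `title`, `DOI`, `publication`, `volume`, `issue` and `pages` fields; the function names are hypothetical and nothing here reflects Zotero's actual code.

```python
import re
from collections import defaultdict

CITE_FIELDS = ("publication", "volume", "issue", "pages")


def match_keys(item):
    """Return the keys under which an item may collide with another item."""
    keys = []
    # 1) title, case-insensitive and ignoring all whitespace
    title = re.sub(r"\s+", "", item.get("title", "")).lower()
    if title:
        keys.append(("title", title))
    # 2) DOI
    doi = item.get("DOI", "").strip().lower()
    if doi:
        keys.append(("doi", doi))
    # 3) publication, volume, issue and pages, only when all are present
    cite = tuple(item.get(f, "").strip().lower() for f in CITE_FIELDS)
    if all(cite):
        keys.append(("cite", cite))
    return keys


def is_duplicate(candidate, library):
    """Check a new item before it is added (proposal 1)."""
    wanted = set(match_keys(candidate))
    return any(wanted & set(match_keys(item)) for item in library)


def find_duplicates(library):
    """List groups of item indices that share a key (proposal 2)."""
    seen = defaultdict(list)
    for index, item in enumerate(library):
        for key in match_keys(item):
            seen[key].append(index)
    return sorted({tuple(ix) for ix in seen.values() if len(ix) > 1})
```

`is_duplicate` would run after the metadata is retrieved and before the item is saved; `find_duplicates` only produces the list and leaves the decision about each group to the user.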

I know that these algorithms will be far from perfection, but again, what is definitely and desperately needed for many existing users, is this simple improvement. I think there has been strong and persistent need for this functionality, as is proved by the many replies on this discussion, and this functionality might be given priority over others.

If there remain any possible complications in implementing the algorithms I suggested, I'm eager to hear about that.

rwielage · May 6, 2010

I heartily agree. I think, however, there should be multiple algorithms that the user can choose to include or omit depending on their circumstance. For instance, for the sources from which we get our references the journal name may be the full name, the abbreviation, or the name plus the sponsor of the journal. Sometimes the issue number is included and sometimes it isn't. Sometimes the final page number is complete and sometimes it is shortened (849-852 might be 849-52). I'd like the ability to include or omit these elements in the duplicate checking.

Likewise, the titles of articles sometimes have something like "[review article]" appended. Sometimes they do, sometimes they don't. So I would want to leave off the last part of the title when doing a comparison by title. Etc.

I think there should be multiple, easily developed algorithms that the user can choose from for their duplicate checking.
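Two of the normalizations mentioned above, shortened page ranges and bracketed title suffixes, could look roughly like this. It is a sketch under the assumption that pages and titles are plain strings; the helper names are invented for illustration.

```python
import re


def expand_page_range(pages):
    """Expand an abbreviated end page, e.g. '849-52' becomes '849-852'."""
    match = re.fullmatch(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*", pages)
    if not match:
        return pages.strip()  # not a simple numeric range; leave it alone
    start, end = match.groups()
    if len(end) < len(start):
        # Borrow the missing leading digits from the start page.
        end = start[: len(start) - len(end)] + end
    return f"{start}-{end}"


def strip_bracketed_suffix(title):
    """Drop a trailing bracketed note such as '[review article]'."""
    return re.sub(r"\s*\[[^\]]*\]\s*$", "", title).strip()
```

Each helper could be switched on or off by the user before the comparison runs, which matches the idea of choosing which elements take part in duplicate checking.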

ajlyon · May 6, 2010

I wonder if a couple of these proposed algorithms could be implemented using Kieren's zotero-browser approach to accessing Zotero data (http://github.com/singingfish/zotero-browser), to see how effective they are.

Kieren: Would you want to try some of these out? I imagine duplicate detection and handling would be a good killer app for zotero-browser.

mark · May 6, 2010

I'm quite sure I have added my voice to this in this or some other thread, but I'll do it again. I think this is important for enough users to be a development priority.

kieren · May 6, 2010

ajlyon

Well I have a working installation of my browser here, and the (non-gui parts of the) code should be identical in the browser to what would end up in zotero, so in principle I agree. My time is a bit limited for the next month or so, but in principle I can have a look later in June. Developing duplicate detection algorithms was the third use I came up with for the thing, after enhanced reports, and playing with my naive text mining environment.

If you fork on github, just notify me of changes through pull requests.

mav1234 · May 12, 2010

this is really necessary, please consider implementing it. Manual deletion of duplicates is frustrating!

Joby Joseph · May 22, 2010

IMP: As mentioned in the post below, this will break things if you are using the word-processor plugin. If you are not, and you instead use something like LyX/LaTeX with a normal .bib file generated from Zotero export, you can follow the temporary solution below.

One temporary solution.
Step:
Occasionally...
1) Export your references from zotero as .bib
2) Use jabref to find duplicates and correct the database and save it.
3) Delete all data from zotero.
4) Import the corrected .bib file into zotero.

I don't know if the JabRef people are willing to share the code/algorithm and, if so, how portable it is to Zotero's code.

fbennett · May 22, 2010

@Joby Joseph, Unfortunately, this would break all references in all documents that rely on the word-processor plugins.

Joby Joseph · May 22, 2010

@fbennett Thanks for pointing this out. I was not aware of that as I was not using the plugin.

Dologan · June 1, 2010

Just want to add my voice to the cry for some form of duplicate removal and prevention (really, how difficult can it be to pop up an alert when an article of an existing name is added and stop the importing?).

my_kan · July 2, 2010

There is an urgent need to add the duplication detection function, or else it cannot compete with Endnote.

ajlyon · July 2, 2010

The duplicate detection hidden preference does in fact work decently well -- take a look at https://www.zotero.org/trac/ticket/1146 for information.

mark · July 2, 2010

In my experience it doesn't work that well. Takes a long time to load in my 5000 item library and generates loads of false positives, including different chapters from books and articles with relatively similar titles but different authors. A very strict algorithm (author lastnames, titles, and dates exactly identical) would be an indispensable option especially for larger libraries.

ajlyon · July 2, 2010

Well, it did take a long time and had many false positives, but it was fairly easy to browse the resulting list and quickly identify the real duplicates.

I certainly don't think that the current solution is sufficient, but it may alleviate the pain of some users for now.

my_kan · July 2, 2010

Could you please show the steps? thx

phlustik · July 28, 2010

Sorry for my ignorance, I loaded zotero trunk into a separate Firefox profile and searched for the duplicate detection feature but do not see extensions.zotero.debugShowDuplicates in the about:config and consequently no Duplicates menu option. How do I get it?
I would also vote for this to be a high development priority, else I am reluctant to make the complete switch from EndNote using my 4000+ item library :-(
Thx

Gracile · July 28, 2010

@phlustik
Instructions by Frank Bennett here.

gebauer · August 3, 2010

Hello,

Thank god for this description:
http://forums.zotero.org/discussion/13658/barrier-to-entry-no-duplicate-detection/#Item_2

It made cleaning up my library very easy! It's definitely a good start.
However, I would still prefer a "pre-import" warning.

Still no news from the devs about this function?

Best,
Jan