[3.0b1] duplicate detection

I read the great news that 3.0 will have duplicate detection, so I tried the beta. Two questions regarding this feature:

1. How can the files (PDFs) associated with the two versions be viewed from inside the duplicate items view (i.e. without the need to search for the two items first)?

2. How can I tell Zotero that two items are not duplicates?

EDIT:

Also, a suggestion: if the files associated with the two items are identical, only one should be kept. Of course, even if the contents of two PDFs are identical, the files themselves might not be byte-identical (because of, e.g., timestamping done by journals upon download), so this won't work perfectly. Nevertheless, it would be useful.
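The suggestion above could be sketched as follows (a minimal illustration, not Zotero code; the function names and paths are hypothetical): compare attachments by a content hash, and treat byte-identical files as safe to deduplicate.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def keep_one_if_identical(path_a, path_b):
    """True if the two attachments are byte-identical (safe to keep one)."""
    return file_digest(path_a) == file_digest(path_b)
```

As noted, two journal downloads of the same article can differ by a timestamp or watermark, so a byte-level hash would miss those; it only catches exact copies.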
  • A means of marking (some) items in a matching set as confirmed non-duplicates would be useful. We'll have a lot of these in legal materials, with multiple instances of a legal decision.

    One approach might be to cull (all but one of) partners that are linked with an explicit relation. That would work very well for law; I'm not sure about other contexts.

    (In passing, compliments to Dan for the interface; it's very nicely designed.)
  • 1. How can the files (PDFs) associated with the two versions be viewed from inside the duplicate items view (i.e. without the need to search for the two items first)?
    After selecting a set, you can press right-arrow to expand all the selected items, and then click one of the attachments. It doesn't look like double-click is currently working to open attachments, which we'll fix, but you can still view it from the context menu or Locate menu.
    2. How can I tell Zotero that two items are not duplicates?
    No way to do this now, but we'll see what we can do. Easiest (and likely fastest, in terms of performance) would be to just give the option to remove a given item from duplicate detection altogether. There could be an option to reset blacklisted items from the folder's context menu. Blacklisted items could also be cleared automatically when the detection algorithm improves, though that would be somewhat confusing for users.

    The main issue is just that the detection algorithm is fairly simplistic at the moment. Once that improves, there will be far fewer false positives.
    if the files associated with the two items are identical, only one should be kept. Of course even if the contents of the PDFs are identical, the PDFs themselves might not be (because of e.g. timestamping done by journals upon download etc.), so this won't work perfectly.
    Also on the to-do list, and yes, that's one problem with it. But better than nothing.
  • (In passing, compliments to Dan for the interface; it's very nicely designed.)
    (Thanks.)
  • Thanks for the replies!

    Indeed, false positives are not common. I have 477 items and only 2 false positives. False positives are probably harmful only if they clutter up the duplicates view (i.e. there are too many of them).
  • edited August 24, 2011
    I also quite like the interface (though I do hope a preflight check is still on the radar; users would be much helped by it). However, I get almost only false positives. Merely having the same title (but different authors and different years) appears to be enough.

    Most of my false positives are items which have the same title but different authors, different years, or different item types. So I have "Greenberg (1963) The languages of Africa [book]" marked as a duplicate of "Cust (1881) The languages of Africa [journal article]". That is way off.

    What is worse is that the lax matching sometimes makes it impossible to resolve the real duplicates. E.g. I have a real duplicate of "Nunberg (1994) Idioms [journal article]" which is also marked as a duplicate of a false positive, "Ayto (2006) Idioms [book section]". This cannot be fixed since, Zotero tells me, "Merged items must all be of the same item type". (By the way, whoever implemented that blocking message must have realized that duplicate items also have a very strong tendency to be of the same item type, no?)

    It appears that a "strict" mode, which insists on the same item type, title, author, and year, would be very handy.
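The strict mode proposed above could be sketched like this (a hypothetical illustration, not Zotero's actual algorithm; the field names are assumptions): two items match only when item type, normalized title, first author, and year all agree.

```python
import unicodedata

def normalize(text):
    """Case-fold and strip punctuation/extra whitespace for comparison."""
    text = unicodedata.normalize("NFC", text).casefold()
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def strict_key(item):
    """Duplicate key requiring same type, title, first author, and year."""
    return (
        item["itemType"],
        normalize(item["title"]),
        normalize(item.get("firstAuthor", "")),
        item.get("year"),
    )

# The Greenberg/Cust pair from the post above: same title, but the
# strict key keeps them apart because type, author, and year differ.
greenberg = {"itemType": "book", "title": "The Languages of Africa",
             "firstAuthor": "Greenberg", "year": 1963}
cust = {"itemType": "journalArticle", "title": "The languages of Africa",
        "firstAuthor": "Cust", "year": 1881}
```

Under this key, `strict_key(greenberg) != strict_key(cust)`, so a title-only collision like this would no longer be flagged as a duplicate.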
  • False positives are a huge problem for me -- I feel like the algorithm is running into issues with Cyrillic or something, because I have items whose titles share nothing but a few initial letters (say, the first few of a 20-30 letter title) that are being marked as duplicates.
  • These two items are marked as duplicates: https://gist.github.com/1167767

    Камалова, Люция. 2011. В июне в Болгарах появится палаточный лагерь для паломников. ИА “Татар-информ”, February 17. http://www.tatar-inform.ru/news/tatarstan/2011/02/17/258131/.
    Семенов, Никодим. 2011. В Казани турецкие и российские ученые не нашли общего языка. ИА REGNUM, May 20, sec. Новости Татарстана. http://www.regnum.ru/news/fd-volga/tatarstan/1407071.html.

    The titles have the same initial two characters, and the publications have the same initial three characters. That's about all. My guess is that the removeDiacritics function is somehow to blame, but I'm not sure.
  • ajlyon: We'll look into what's happening for you.
    Merely having the same title (but different authors and different years) appears to be enough.
    Yes, right now title matches will show up as duplicates. This will be improved.
    By the way, whoever implemented that blocking message must have realized that duplicate items also have a very strong tendency to be of the same item type, no?
    That's a fair point, but if the algorithm is improved to include other fields, I don't think we'll want to ignore a good match just because it's a different item type.
  • I'm also getting false positives for items whose titles are written in Japanese. E.g., these three all get selected as duplicates of one another:

    [1] 今日のニュースなのか. (2011, May 10). Asahi Shimbun, pp. 3-4.
    [2] 東京物語. (2010, August 15). ところどころ. 東京.
    [3] 東京物語その2. (2010). ところどころ. 東京.

    Item types: [1] is newspaper article, [2] is newspaper article, [3] is book section.
  • The issue with non-Latin characters should be fixed on the trunk. The fix will be in 3.0b2, which will be out shortly.
  • I "played around" with the duplicate detection for a while, and it seems that for me (using Zotero Standalone 3.0b1) selecting a master item works, but selecting fields to take from the other item (title, abstract, etc.) doesn't. For example, I have two duplicates (journal articles), one with a complete author list but without the abstract. If I select the item with the complete author list as the master item and choose to keep the abstract of the other item from the dropdown menu next to "abstract", the merged item will have the complete author list but will still lack the abstract. The same happens for other fields.
    Also, as I previously posted in the wrong thread, it would be great if one could in some way decide on what attached files to keep (i.e. delete duplicate pdfs).
  • w80
    edited August 29, 2011
    A quirk of duplicate detection that makes it difficult to follow up the merged duplicates afterwards:

    In the duplicate detection view, if one tries to select all and move all the duplicates into a separate collection, the select-all command appears to work (i.e., all the items get highlighted and Zotero tells you how many items are selected), but when one tries to drag all items into a collection, only the items where the mouse was positioned before the click-and-drag move. In other words, the interface looks like select-all but then behaves like select-individual. This means that there is no way to group all identified duplicates for later post-merge tracking and deletion of supernumerary PDFs.
  • jneef: Confirmed. Fixing for 3.0b2. Thanks.

    w80: Hadn't considered dragging, but I'll see if I can fix that.
  • I am running Zotero from the SVN (revision 10319) and noted the following issues:

    1) Duplicate items that have an attachment show the small expand/collapse triangle, but clicking the triangle does not expand the item.

    2) If both items have exactly the same attachment file, the merged item should probably keep only one copy instead of two.

    3) A feature for merging all exact duplicates with a single click would be useful.
  • 1) Duplicate items that have an attachment show the small expand/collapse triangle, but clicking the triangle does not expand the item.
    Might be able to fix that, but right-arrow will expand all selected items.
  • w80: Try Control-click-drag (or Cmd-click-drag on a Mac). I think the only other solution would be to add collection filing to the context menu (which should be done anyway for accessibility), but in the meantime the modifier key circumvents selection of the duplicate set on mouse-down.
  • Thanks, Dan -- interestingly, Ctrl-click-drag did not work, but Shift-click-drag did.
  • I am using Standalone 3.0b2.1 on Windows. I would like to check my library for duplicates, but I can't find the menu (or whatever) from which to initiate this. Help?
  • Right-click on "My Library".
  • After some playing, I have found a way to select only one of the duplicate items from the group. This may be useful for checking which collections the items belong to, for expanding one of them with the triangle (which doesn't work yet anyway), etc.

    The way is to select the duplicates group with the mouse and then use the up/down arrow keys to select individual items.

    However, a more intuitive way would be welcome.
  • ben58: The other way is to use a modifier key (Command on a Mac, probably Control elsewhere) when clicking to deselect.
  • edited October 10, 2011
    Just heard about version 3. Is there a fix for documents breaking when you delete a reference that has been cited, i.e. merging duplicates rather than deleting them? If not, will there be an indication of whether a reference has been cited, or how many times it has been cited?
  • edited October 10, 2011
    If you merge items using the duplicate detection interface, your documents will continue to work, regardless of which was cited.
  • edited October 10, 2011
    Great. We're looking forward to some sort of duplicate processing. I think the user should be able to set parameters for the duplicate match. Is there a description/picture of the duplicate results interface somewhere?

    As others have pointed out, a "preflight check" would be helpful, too.
  • Closing this thread. Please start new threads for new issues related to the duplicate detection in 3.0.
This discussion has been closed.