[3.0b1] duplicate detection

I read the great news that 3.0 will have duplicate detection, so I tried the beta. Two questions regarding this feature:

1. How can the files (PDFs) associated with the two versions be viewed from inside the duplicate items view (i.e. without the need to search for the two items first)?

2. How can I tell Zotero that two items are not duplicates?

EDIT:

Also, a suggestion: if the files associated with the two items are identical, only one should be kept. Of course, even if the contents of two PDFs are identical, the files themselves might not be byte-identical (because of, e.g., timestamping done by journals upon download), so this won't work perfectly. Nevertheless, it would be useful.
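The suggestion above could be sketched as follows (a minimal illustration, not Zotero code; the function names and paths are hypothetical): compare attachments by a content hash, and treat byte-identical files as safe to deduplicate.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def keep_one_if_identical(path_a, path_b):
    """True if the two attachments are byte-identical (safe to keep one)."""
    return file_digest(path_a) == file_digest(path_b)
```

As noted, two journal downloads of the same article can differ by a timestamp or watermark, so a byte-level hash would miss those; it only catches exact copies.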
  • A means of marking (some) items in a matching set as confirmed non-duplicates would be useful. We'll have a lot of these in legal materials, with multiple instances of a legal decision.

    One approach might be to cull (all but one of) partners that are linked with an explicit relation. That would work very well for law; I'm not sure about other contexts.

    (In passing, compliments to Dan for the interface; it's very nicely designed.)
  • 1. How can the files (PDFs) associated with the two versions be viewed from inside the duplicate items view (i.e. without the need to search for the two items first)?
    After selecting a set, you can press right-arrow to expand all the selected items, and then click one of the attachments. It doesn't look like double-click is currently working to open attachments, which we'll fix, but you can still view it from the context menu or Locate menu.
    2. How can I tell Zotero that two items are not duplicates?
    No way to do this now, but we'll see what we can do. Easiest (and likely fastest, in terms of performance) would be to just give the option to remove a given item from duplicate detection altogether. There could be an option to reset blacklisted items from the folder's context menu. Blacklisted items could also be cleared automatically when the detection algorithm improves, though that would be somewhat confusing for users.

    The main issue is just that the detection algorithm is fairly simplistic at the moment. Once that improves, there will be far fewer false positives.
    if the files associated with the two items are identical, only one should be kept. Of course even if the contents of the PDFs are identical, the PDFs themselves might not be (because of e.g. timestamping done by journals upon download etc.), so this won't work perfectly.
    Also on the to-do list, and yes, that's one problem with it. But better than nothing.
  • (In passing, compliments to Dan for the interface; it's very nicely designed.)
    (Thanks.)
  • Thanks for the replies!

    Indeed, false positives are not common. I have 477 items and only 2 false positives. False positives are probably harmful only if they clutter up the duplicates view (i.e. there are too many of them).
  • edited August 24, 2011
    I also quite like the interface (though I do hope a preflight check is still on the radar; users would be much helped by it). However, I get almost only false positives. Merely having the same title (but different authors and different years) appears to be enough.

    Most of my false positives are items which have the same title but different authors, different years, or different item types. So I have "Greenberg (1963) The languages of Africa [book]" marked as a duplicate of "Cust (1881) The languages of Africa [journal article]". That is way off.

    What is worse is that the lax matching sometimes makes it impossible to resolve the real duplicates. E.g. I have a real duplicate of "Nunberg (1994) Idioms [journal article]" which is also marked as a duplicate of a false positive, "Ayto (2006) Idioms [book section]". This cannot be fixed since, Zotero tells me, "Merged items must all be of the same item type". (By the way, whoever implemented that blocking message must have realized that duplicate items also have a very strong tendency to be of the same item type, no?)

    It appears that a "strict" mode, which insists on the same item type, title, author, and year, would be very handy.
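The strict mode proposed above could be sketched like this (a hypothetical illustration, not Zotero's actual algorithm; the field names are assumptions): two items match only when item type, normalized title, first author, and year all agree.

```python
import unicodedata

def normalize(text):
    """Case-fold and strip punctuation/extra whitespace for comparison."""
    text = unicodedata.normalize("NFC", text).casefold()
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def strict_key(item):
    """Duplicate key requiring same type, title, first author, and year."""
    return (
        item["itemType"],
        normalize(item["title"]),
        normalize(item.get("firstAuthor", "")),
        item.get("year"),
    )

# The Greenberg/Cust pair from the post above: same title, but the
# strict key keeps them apart because type, author, and year differ.
greenberg = {"itemType": "book", "title": "The Languages of Africa",
             "firstAuthor": "Greenberg", "year": 1963}
cust = {"itemType": "journalArticle", "title": "The languages of Africa",
        "firstAuthor": "Cust", "year": 1881}
```

Under this key, `strict_key(greenberg) != strict_key(cust)`, so a title-only collision like this would no longer be flagged as a duplicate.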
  • False positives are a huge problem for me -- I feel like the algorithm is running into issues with Cyrillic or something, because I have items whose titles share nothing but a few initial letters (say, the first few of a 20-30 letter title) that are being marked as duplicates.
  • These two items are marked as duplicates: https://gist.github.com/1167767

    Камалова, Люция. 2011. В июне в Болгарах появится палаточный лагерь для паломников. ИА “Татар-информ”, February 17. http://www.tatar-inform.ru/news/tatarstan/2011/02/17/258131/.
    Семенов, Никодим. 2011. В Казани турецкие и российские ученые не нашли общего языка. ИА REGNUM, May 20, sec. Новости Татарстана. http://www.regnum.ru/news/fd-volga/tatarstan/1407071.html.

    The titles have the same initial two characters, and the publications have the same initial three characters. That's about all. My guess is that the removeDiacritics function is somehow to blame, but I'm not sure.
  • ajlyon: We'll look into what's happening for you.
    Merely having the same title (but different authors and different years) appears to be enough.
    Yes, right now title matches will show up as duplicates. This will be improved.
    By the way, whoever implemented that blocking message must have realized that duplicate items also have a very strong tendency to be of the same item type, no?
    That's a fair point, but if the algorithm is improved to include other fields, I don't think we'll want to ignore a good match just because it's a different item type.
  • I'm also getting false positives for items whose titles are written in Japanese. E.g., these three all get selected as duplicates of one another:

    [1] 今日のニュースなのか. (2011, May 10). Asahi Shimbun, pp. 3-4.
    [2] 東京物語. (2010, August 15). ところどころ. 東京.
    [3] 東京物語その2. (2010). ところどころ. 東京.

    Item types: [1] is newspaper article, [2] is newspaper article, [3] is book section.
  • The issue with non-Latin characters should be fixed on the trunk. The fix will be in 3.0b2, which will be out shortly.
  • I "played around" with the duplicate detection for a while, and it seems that for me (using Zotero Standalone 3.0b1) selecting a master item works, but selecting fields to take from the other item (title, abstract, etc.) doesn't. For example, I have two duplicates (journal articles), one with a complete author list but without the abstract. If I select the item with the complete author list as the master item and choose to keep the abstract of the other item from the dropdown menu next to "abstract", the merged item will have the complete author list but will still lack the abstract. The same happens for other fields.
    Also, as I previously posted in the wrong thread, it would be great if one could in some way decide on what attached files to keep (i.e. delete duplicate pdfs).
  • w80
    edited August 29, 2011
    A quirk of duplicate detection that makes it difficult to follow up the merged duplicates afterwards:

    In the duplicate detection view, if one tries to select all and move all the duplicates into a separate collection, the select-all command appears to work (i.e., all the items get highlighted and Zotero tells you how many items are selected), but when one tries to drag all items into a collection, only the items where the mouse was positioned before the click-and-drag move. In other words, the interface looks like select-all but then behaves like select-individual. This means that there is no way to group all identified duplicates for later post-merge tracking and deletion of supernumerary PDFs.
  • jneef: Confirmed. Fixing for 3.0b2. Thanks.

    w80: Hadn't considered dragging, but I'll see if I can fix that.
  • I am running Zotero from the SVN (revision 10319) and noted the following issues:

    1) Duplicate items that have an attachment show the small expand/collapse triangle, but clicking the triangle does not expand the item.

    2) If both items have exactly the same attachment file, the merged item should probably keep only one copy instead of two.

    3) A feature for merging all exact duplicates with a single click would be useful.
  • 1) Duplicate items that have an attachment show the small expand/collapse triangle, but clicking the triangle does not expand the item.
    Might be able to fix that, but right-arrow will expand all selected items.
  • w80: Try Control-click-drag (or Cmd-click-drag on a Mac). I think the only other solution would be to add collection filing to the context menu (which should be done anyway for accessibility), but in the meantime the modifier key circumvents selection of the duplicate set on mouse-down.
  • Thanks, Dan -- interestingly, Ctrl-click-drag did not work, but Shift-click-drag did.
  • I am using Standalone 3.0b2.1 on Windows. I would like to check my library for duplicates, but I can't find the menu (or whatever) from which to initiate this. Help?
  • Right-click on "My Library".
  • After some playing, I have found a way to select only one of the duplicate items from the group. This may be useful for checking which collections the items belong to, for expanding one of them with the triangle (which doesn't work yet anyway), etc.

    The way is to select the duplicates group with the mouse and then use the up/down arrow keys to select individual items.

    However, a more intuitive way would be welcome.
  • ben58: The other way is to use a modifier key (Command on a Mac, probably Control elsewhere) when clicking to deselect.
  • edited October 10, 2011
    Just heard about version 3. Is there a fix for documents breaking when you delete a reference that has been cited, i.e. merging duplicates rather than deleting them? If not, will there be an indication of whether a reference has been cited, or how many times it has been cited?
  • edited October 10, 2011
    If you merge items using the duplicate detection interface, your documents will continue to work, regardless of which was cited.
  • edited October 10, 2011
    Great. We're looking forward to some sort of duplicate processing. I think the user should be able to set parameters for the duplicate match. Is there a description/picture of the duplicate results interface somewhere?

    As others have pointed out, a "preflight check" would be helpful, too.
  • Closing this thread. Please start new threads for new issues related to the duplicate detection in 3.0.
This discussion has been closed.