[3.0b1] duplicate detection
I read the great news that 3.0 will have duplicate detection, so I tried the beta. Two questions regarding this feature:
1. How can the files (PDFs) associated with the two versions be viewed from inside the duplicate items view (i.e. without the need to search for the two items first)?
2. How can I tell Zotero that two items are not duplicates?
EDIT:
Also, a suggestion: if the files associated with the two items are identical, only one should be kept. Of course even if the contents of the PDFs are identical, the PDFs themselves might not be (because of e.g. timestamping done by journals upon download etc.), so this won't work perfectly. Nevertheless it'd be useful.
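To illustrate the suggestion above, here is a minimal sketch of a byte-level comparison between two attachment files. It is not Zotero code; the helper names `file_digest` and `same_bytes` are made up for this example. As noted, it only catches files that are byte-for-byte identical, so two downloads of the same article with different embedded timestamps would still count as different.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(path_a, path_b):
    """True only if both files have exactly the same bytes (hypothetical helper)."""
    return file_digest(path_a) == file_digest(path_b)
```

A merge step could call `same_bytes` on each pair of attachments and keep a single copy when it returns True, leaving everything else for the user to decide.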
One approach might be to cull (all but one of) partners that are linked with an explicit relation. That should work very well for law; I'm not sure about other contexts.
(In passing, compliments to Dan for the interface; it's very nicely designed.)
The main issue is just that the detection algorithm is fairly simplistic at the moment. Once that improves, there will be far fewer false positives. Also on the to-do list, and yes, that's one problem with it. But better than nothing.
Indeed, false positives are not common. I have 477 items and only 2 false positives. False positives are probably harmful only if they clutter up the duplicates view (i.e. there are too many of them).
Most of my false positives are items which have the same title but different authors, different years, or different item types. So I have "Greenberg (1963) The languages of Africa [book]" marked as a duplicate of "Cust (1881) The languages of Africa [journal article]". That is way out.
What is worse is that the lax matching sometimes makes it impossible to resolve the real duplicates. E.g. I have a real duplicate of "Nunberg (1994) Idioms [journal article]" which is also marked as a duplicate of a false positive, "Ayto (2006) Idioms [book section]". This cannot be fixed since, Zotero tells me, "Merged items must all be of the same item type". (By the way, whoever implemented that blocking message must have realized that duplicate items also have a very strong tendency to be of the same item type, no?)
It appears that a "strict" mode which insists on same item type, same title, same author and same year would be very very handy.
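A rough sketch of what such a "strict" mode could look like, assuming each item exposes an item type, title, first author and year. The `Item` class and the function names are hypothetical and only for illustration; this is not how Zotero actually detects duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical minimal record; real Zotero items carry many more fields.
    item_type: str
    title: str
    first_author: str
    year: int


def strict_key(item):
    """Build a comparison key from item type, title, first author and year."""
    # Case-fold and collapse whitespace so trivial differences do not matter.
    title = " ".join(item.title.casefold().split())
    author = " ".join(item.first_author.casefold().split())
    return (item.item_type, title, author, item.year)


def is_strict_duplicate(a, b):
    """True only if all four fields agree after normalization."""
    return strict_key(a) == strict_key(b)


greenberg = Item("book", "The languages of Africa", "Greenberg", 1963)
cust = Item("journalArticle", "The languages of Africa", "Cust", 1881)
nunberg_1 = Item("journalArticle", "Idioms", "Nunberg", 1994)
nunberg_2 = Item("journalArticle", "idioms ", "Nunberg", 1994)
ayto = Item("bookSection", "Idioms", "Ayto", 2006)

print(is_strict_duplicate(greenberg, cust))       # False
print(is_strict_duplicate(nunberg_1, nunberg_2))  # True
print(is_strict_duplicate(nunberg_1, ayto))       # False
```

With a key like this, the two Nunberg records would form a group of their own and could be merged without the Ayto book section getting in the way. The trade-off is that real duplicates with a misspelled author or a wrong year would be missed.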
Камалова, Люция. 2011. В июне в Болгарах появится палаточный лагерь для паломников. ИА “Татар-информ”, February 17. http://www.tatar-inform.ru/news/tatarstan/2011/02/17/258131/.
Семенов, Никодим. 2011. В Казани турецкие и российские ученые не нашли общего языка. ИА REGNUM, May 20, sec. Новости Татарстана. http://www.regnum.ru/news/fd-volga/tatarstan/1407071.html.
The titles have the same initial two characters, and the publications have the same initial three characters. That's about all. My guess is that the removeDiacritics function is somehow to blame, but I'm not sure.
[1] 今日のニュースなのか. (2011, May 10). Asahi Shimbun, pp. 3-4.
[2] 東京物語. (2010, August 15). ところどころ. 東京.
[3] 東京物語その2. (2010). ところどころ. 東京.
Item types: [1] is newspaper article, [2] is newspaper article, [3] is book section.
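One way false positives like these could arise, purely as a guess and not a description of Zotero's actual code, is a title normalization that keeps only ASCII letters and digits. The sketch below uses two made-up functions, `naive_title_key` and `unicode_title_key`, to show how unrelated Cyrillic or Japanese titles would all collapse to the same empty key under such a rule, and how a Unicode-aware rule keeps them apart.

```python
import re


def naive_title_key(title):
    """Hypothetical normalization: lowercase, then keep only a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def unicode_title_key(title):
    """Alternative: keep letters and digits from any script."""
    return "".join(ch for ch in title.casefold() if ch.isalnum())


titles = [
    "В июне в Болгарах появится палаточный лагерь для паломников",
    "В Казани турецкие и российские ученые не нашли общего языка",
    "今日のニュースなのか",
    "東京物語",
]

# Under the ASCII-only rule every title becomes the empty string,
# so all four would compare equal.
naive_keys = {naive_title_key(t) for t in titles}

# Under the Unicode-aware rule the four keys stay distinct.
unicode_keys = {unicode_title_key(t) for t in titles}

print(len(naive_keys), len(unicode_keys))
```

If something like the first function were in play, skipping items whose normalized title comes out empty would already avoid this class of false positive.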
Also, as I previously posted in the wrong thread, it would be great if one could in some way decide on what attached files to keep (i.e. delete duplicate pdfs).
In the duplicate detection window, if one tries to select all and move all the duplicates into a separate collection, the select-all command appears to work (i.e., all the items get highlighted and Zotero tells you how many items are selected), but when one tries to drag all items into a collection, only the items under the mouse pointer at the start of the click-and-drag actually move. In other words, the interface looks like select all, but then behaves like select individual. This means that there is no way to gather all identified duplicates into a group for later post-merge tracking and deletion of supernumerary PDFs.
w80: Hadn't considered dragging, but I'll see if I can fix that.
1) The items that are duplicate and have an attachment have the small collapse-expand triangle, but clicking on this triangle does not expand the item.
2) If both items have exactly the same attachment file, the merged item should probably keep only one copy instead of having two copies of the same file.
3) A feature for merging all exact duplicates with a single click would be useful.
The way to do it is to select the duplicates group with the mouse, and then use the up/down arrow keys to select individual items.
However, some more intuitive way would be welcome.
As others have pointed out, a "preflight check" would be helpful, too.