Duplicates

mel47 · August 22, 2011

Hi,
I tested this new feature and have a comment.
When different titles are detected and I click on the little icon, I can see only the first 5-6 words, which are identical, so I don't know what the difference could be. I suppose it's the presence or absence of the final period. Would it be possible to see the full title, maybe with a tooltip?

Another question: would it be possible to easily see which collections each copy belongs to? If I understand correctly, merging is only about the "content". I would like to have the ability to just delete one copy from certain folders, directly from the Duplicates folder.

Hope it's clear,
Thanks
Mel

dstillman · August 22, 2011

When different titles are detected and I click on the little icon, I can see only the first 5-6 words, which are identical, so I don't know what the difference could be.

We might be able to add a tooltip, but you can always use the list box in the right-hand panel to scroll through the different versions of the full items.

I would like to have the ability to just delete one copy from certain folders, directly from the Duplicates folder.

We might be able to activate the usual method for the version you're viewing, but it's unlikely that you'll be able to do anything beyond that. I think collection management is too complicated to work into this interface.

mapmaker · September 20, 2011

The Duplicates feature in the new 3.0b versions is nice but it seems to base its determination on the title only. In the grey literature it is not uncommon to run across different papers (different authors and years) with the same title.

Is there a way to provide some options for duplicate detection, such as matching on title, year, and author?
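To illustrate the difference between title-only matching and matching on title, year, and author, here is a minimal Python sketch. It is not Zotero's actual detection code; the item fields (`title`, `year`, `first_author`) and the function names are hypothetical, and the normalization shown is only one plausible choice.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def title_key(item):
    """Title-only key: items with the same normalized title collide."""
    return normalize(item["title"])


def strict_key(item):
    """Stricter key (an assumption, not Zotero's rule): title, year, first author."""
    return (
        normalize(item["title"]),
        item.get("year"),
        normalize(item.get("first_author", "")),
    )


def group_duplicates(items, key):
    """Group items by key and keep only groups with more than one member."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical grey-literature items: same title, but not all the same paper.
items = [
    {"title": "Water Quality Assessment", "year": 2003, "first_author": "Smith"},
    {"title": "Water Quality Assessment", "year": 2009, "first_author": "Jones"},
    {"title": "Water quality assessment.", "year": 2003, "first_author": "Smith"},
]

loose = group_duplicates(items, title_key)    # all three items in one group
strict = group_duplicates(items, strict_key)  # only the two Smith 2003 items
```

With the title-only key the 2009 paper by a different author is flagged as a duplicate; with the stricter key it is not, at the cost of missing true duplicates whose year or author was entered differently.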

Thanks.

adamsmith · September 20, 2011

Duplicate detection is known to be rudimentary and will be improved, although, obviously, a couple of false positives don't hurt, because Zotero asks you before merging.

I'd be reluctant about options. This is a feature that should just work and shouldn't require a separate tab in some preferences where users have to tinker. Too many options make software less user-friendly.

smilingallison · September 24, 2011

Hi

I'm having a hard time figuring out how to start the duplicate detection process. I'm trying to figure out what reference manager software to use right now, and although Zotero seems just great, duplicate detection is a must for me. I've upgraded to the 3.0 version and have gone through all of the menu buttons but can't seem to figure out how to do it.

Any help would be much appreciated.
Thank you!
Allison

adamsmith · September 24, 2011

In the left hand panel, either right-click on "My Library" and select "Show Duplicates" or scroll down to the bottom of the panel, right above the trash, and click on "View Duplicates".

lfochtmann · November 6, 2011

I'm running the 3.0b2.1 beta on a Windows 7 64-bit computer. When I right-click on "My Library" and then select "Show Duplicates", it takes me to the "Duplicate Items" link, but absolutely nothing appears there, even though I can see many duplicates in my collection. I'm not sure if I'm doing something wrong or if it's a bug. Are the duplicates supposed to display automatically in the "Duplicate Items" folder? If not, having to right-click on "My Library" does not seem as intuitive as putting the Show Duplicates command in one of the menus at the top (e.g., Tools).

adamsmith · November 6, 2011

They should show up automatically in the folder. Sounds like it might be a bug. I don't really know how to troubleshoot this (Dan would probably have to say), but for a start:
Restart Firefox, go to the duplicate folder, and then see if there is an error report that you can submit (and if so, post the error ID here).
http://www.zotero.org/support/reporting_bugs

fbennett · November 6, 2011

Not sure if it's relevant, but I think there's a check of the item type as well as the title. Items of different types will not be identified as duplicates.

lfochtmann · November 6, 2011

Thanks for the idea, but it's clearly not what's happening. They are each showing up as PDF files. I got an error number, which is 780981757, and also posted a new message in the troubleshooting forum, since it's happening with the browser-based as well as the standalone version of Zotero.

adamsmith · November 6, 2011

Wait, these are individual PDF files? I don't think Zotero would detect those as duplicates because they don't have any metadata, i.e. they don't have an item type at all. While I'm sure it's possible to check for duplicate files (some hash sum or so), there's really not much of a reason to keep bare files in Zotero, so I'm not sure it's worth the effort.

Duplicate detection works for regular Zotero items.
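The "hash sum" idea mentioned above can be sketched outside Zotero in a few lines of Python. This is only an illustration using the standard library, not a Zotero feature; the function names and the default `*.pdf` pattern are assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(folder, pattern="*.pdf"):
    """Group files under `folder` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob(pattern)):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Note that this only catches exact copies. Two PDFs of the same article downloaded from different sources will usually differ in their bytes and would not be grouped.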

lfochtmann · November 6, 2011

OK. This is literally my first day using Zotero. I had been playing with Mendeley, but it seemed to have some significant weaknesses. I had thought, from some of the things that I'd read, that Zotero would let me organize PDF files as well as other web-based information that I'm researching, and that it also had the ability to extract the metadata from the PDF files of research articles, similar to what Mendeley does. So I figured it would also detect duplicates of identically named PDFs or those with identical metadata. It sounds like I was mistaken....

adamsmith · November 6, 2011

Sure, you can use Zotero to organize PDFs, but you want them as attachments to Zotero items that actually have the citation data.
Zotero does detect metadata for PDFs, but it doesn't do it automatically like Mendeley. You need to select them (multiple at once is fine), right-click, and choose "Retrieve Metadata for PDF".
Once you've done that (and if it works; the rate of success varies), the PDF will be attached to a Zotero item (I think I remember that's the same in Mendeley). Those will be recognized as duplicates.

fbennett · November 6, 2011

Another tool you might look at is Qiqqa, which began life as a PDF-centric document manager and has since extended into the general ref manager space. I'm only casually familiar -- I've not used it myself -- but it might be worth a look while you're checking tools out.

chritogjon · November 7, 2011

This seems to be a thread of "suggestions for the duplicates feature". So, I'll continue it with my 2c...

The duplicate search should ignore/exclude items which are related. The relation between two (or more) items has to be entered manually, so clearly the user does not consider these to be duplicates.

Thank you for the great tool!
Kind regards,

fbennett · November 7, 2011

I'm not sure excluding related items would be a good idea.

Suppose I have related items Ax and Ay, and related items Bx and By. Suppose also that all items in both sets would ordinarily turn up as a single duplicates group. In that case, how should the duplicate relation be shown in the UI? And if the user requests a merge, how should that be done?

We do have one instance where related items are generated automatically. The Google Scholar translator will set relations between alternative cites to the same legal case. If you download the case twice, you have the situation described above.

In the Google Scholar parallel cites situation, it would be sufficient for the algorithm to be fussier about matching, so that congruent elements in each set are properly paired in the merge display.

chritogjon · November 8, 2011

I see my knowledge and experience in this area is very shallow :) I guess my answer will be too:

The example you describe is complex: ({Ax, Ay}, {Bx, By}). Parentheses () stand for a duplicate group and curly brackets {} for related items. Such a group would be displayed in the list as a duplicate group.

[basic assumption:] When an item is displayed in the duplicate list, it is known which other items it duplicates, right? Now, when every item in a duplicate group (Ax, Ay, Bx, By) is related to every other item in the same group, that group would be left out of the list.

That condition does not hold for your example of a complex duplicate group ({Ax, Ay}, {Bx, By}): the duplicate item Ay is not related to Bx or By. But it does hold in the simple case, when no Bx and By exist. I can name a few such duplicate-but-related items:
- a book and a chapter from that book (the book title and a chapter name are similar) [related]
- a paper in a journal and a conference contribution (the title and the authors are the same) [related]
- a paper in a journal and a webpage of the scientific topic (same title, same author) [related]
- a book in two different editions (title, authors are the same, ISBN, edition, abstract are different) [in a relation]
- a document in two different language versions (same authors, different titles and language) [related]
- two books with the same titles, different authors [related]
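The rule proposed above (hide a duplicate group only when every item in it is related to every other item) can be written down as a small Python sketch. This is an illustration of the suggestion, not how Zotero behaves; the item IDs, the function names, and the representation of relations as unordered pairs are all assumptions.

```python
from itertools import combinations


def fully_related(group, relations):
    """True when every pair of items in the group is linked by a relation."""
    return all(frozenset(pair) in relations for pair in combinations(group, 2))


def visible_duplicate_groups(groups, related_pairs):
    """Drop duplicate groups whose members are all mutually related."""
    relations = {frozenset(pair) for pair in related_pairs}
    return [group for group in groups if not fully_related(group, relations)]


# Hypothetical data: one complex group and one simple book/chapter pair.
related_pairs = [("Ax", "Ay"), ("Bx", "By"), ("Book", "Chapter")]
groups = [["Ax", "Ay", "Bx", "By"], ["Book", "Chapter"]]

shown = visible_duplicate_groups(groups, related_pairs)
```

Here the book/chapter pair is hidden because its two items are related to each other, while the complex group stays visible because, for example, Ay is not related to Bx.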

Or, there is another question: Am I misusing the relation feature?

I haven't experienced the Google Scholar automated relation setting yet. But even if the relation is not set manually, it fits the scenario described above (simple pairs are ignored, complex groups are displayed).

P.S. I love "fuzzy" :) It has always been a great help whenever it came my way...

fbennett · November 8, 2011

You're certainly not misusing the Relations feature.

I'm easily confused about these things, but I think I follow your logic, and I think it makes sense.

nihili · February 3, 2012

We might be able to activate the usual method for the version you're viewing, but it's unlikely that you'll be able to do anything beyond that. I think collection management is too complicated to work into this interface.

Great work on the new version!

Dan, any timeline on implementing the "usual method"? Right now, I need to go back to the full library display to find out which collections will be changed for each duplicate.

dstillman · February 3, 2012

Right now, I need to go back to the full library display to find out which collections will be changed for each duplicate.

Why? When you merge duplicates, the remaining item gets added to all collections where a version appeared.
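The merge behavior described above amounts to taking the union of the collections of all versions. A minimal Python sketch of that idea follows; it is not Zotero's implementation, and the dictionary layout and the `merge_duplicates` name are hypothetical.

```python
def merge_duplicates(master, others):
    """Return a copy of `master` placed in every collection any version was in."""
    merged = dict(master)
    collections = set(master.get("collections", ()))
    for item in others:
        collections.update(item.get("collections", ()))
    merged["collections"] = sorted(collections)
    return merged


# Hypothetical duplicates filed in different collections.
kept = {"title": "Some article", "collections": ["Thesis"]}
dropped = [{"title": "Some article", "collections": ["Reading list", "Thesis"]}]

result = merge_duplicates(kept, dropped)
```

After the merge, the remaining item appears in both "Reading list" and "Thesis", so no collection loses its copy.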

Nigel007 · March 10, 2012

First time comment, so apologies if I'm not quite following protocol.

The duplicates feature is great, BUT I have 5 papers from a single conference (obviously the same conference name for each of the 5). Each has different author(s) and a different paper title, but Zotero treats them as if they were duplicates. They need to be 5 separate references, and it was only by looking at the list that I realised what was happening. Is there a way of marking them 'not duplicates' so this doesn't happen the next time I look for duplicates? Thanks.

Nigel007 · March 10, 2012

I'm working on duplicates, so another comment. I have 2 books, both by the same author & publisher (titles as shown in the title list):
"Otago cavalcade, 1911-1915"
"Otago cavalcade, 1901-1905"
These are definitely different books (the titles show that), but they show as duplicates.

Any suggestions for how I could make them definitely not duplicates?

adamsmith · March 11, 2012

Not currently, no. See here for more:
http://forums.zotero.org/discussion/22395/duplicates/#Item_6

mronkko · March 11, 2012

It is not possible to mark items as not duplicates.