marking non-duplicates

voynich · June 12, 2012

I have just started using Zotero seriously (even though I have had it installed since its beginnings and have followed its development). One problem that I seem to have with my data is that of "non-duplicates", i.e., items which Zotero thinks are the same while in fact they are not. The two examples I have in my data are: a journal article and a book by the same author with the same title, and two anonymous reviews of the same book in different journals.

Is there a way to tell Zotero these are not duplicates?

dstillman · June 12, 2012

Not currently, but the duplicate detection algorithm should get more sophisticated in future versions. (Marking as non-duplicates might happen too, but the goal is for it to be much less necessary.)

voynich · June 12, 2012

Thanks! For the time being I have added numbers in square brackets to the title as a quick solution. I think the marking feature will not be needed once the planned algorithm is implemented (I understand it will allow users to define which fields one wants to take into account when detecting duplicates). Many thanks!

dstillman · June 12, 2012

I understand it will allow users to define which fields one wants to take into account when detecting duplicates

Not sure where you got that—that's not planned—but it will use more fields in more complex ways than it's using now. Right now, as you've noticed, it's mostly the title that matters (if there's no id in common).

voynich · June 12, 2012

I was just guessing. I think it would be useful to allow users to select the fields to be taken into account -- but not knowing how the new algorithm will work, I will not insist on it :-)

mark · June 12, 2012

FYI, here's a thread in which the virtues and vices of the current duplication detection mechanism(s) are discussed.

voynich · June 14, 2012

mark: Thanks -- I should have found it before asking the question :-)

jaymanxv · July 19, 2012

I have read the discussion above, but my question differs slightly. These are the titles of two journal articles I have referenced:

Direct-use Values of Non-Timber Forest Products...
Direct-use Values of Secondary Resources...

They also have one author in common. Is this the same issue as voynich's?

dstillman · July 20, 2012

jaymanxv: Are you saying those two items are identified as duplicates in the Duplicates view? Do they have the same DOI?

jaymanxv · July 23, 2012

thanks! missed that :-/

mreiter · October 2, 2012

In my opinion, items should only be considered as duplicates by Zotero if

- title
- author(s), editor(s), etc.
- volume

are the same.

If you want to have different volumes of the same encyclopedia in your library, at the moment, they will all show up as duplicates because the title is obviously the same.

Being able to define the criteria for qualification as duplicate myself would be perfect, but the above mentioned would already make the feature much more useful for me and is much easier to implement, I guess.

tomcloyd · December 13, 2012

Completely agree with mreiter about the desirability of user-designated fields for comparison to determine duplicates!

b0c5 · August 7, 2017

I am having trouble with this. Different volumes of Knuth's The Art of Computer Programming are being marked as duplicates. Is there anything I can do?

paulvdh · September 20, 2017

Hi
I agree the method Zotero uses is too limited (described here: https://www.zotero.org/support/duplicate_detection).
If you import http://ftp.math.utah.edu/pub/tex/bib/tugboat.html you will find lots of duplicates, which are actually different installments of recurring columns.
It doesn't seem to have been fixed yet. Does anyone know a fix or a workaround?
Paul

realtime99 · December 12, 2017

Zotero seems especially inaccurate with magazine articles, which don't have DOIs or ISBNs. If anonymously authored, matching titles and years alone cause Zotero to think they're duplicates. Consider these film reviews:

“Alice in Wonderland.” Life, vol. 30, June 1951, pp. 85–87.
“Alice in Wonderland.” Library Journal, vol. 76, Aug. 1951, p. 1239.

Zotero marks these and 10 more like them as duplicates.

Is there any reason that it would be a bad strategy for Zotero to require a matching publication title before marking articles as duplicates?

@paulvdh, here are the duplicate criteria:
"Zotero currently uses the title, DOI, and ISBN fields to determine duplicates. If these fields match (or are absent), Zotero also compares the years of publication (if they are within a year of each other) and author/creator lists (if at least one author last name plus first initial matches) to determine duplicates. The algorithm will be improved in the future to incorporate other fields."

So only info in those fields would affect the duplicate evaluation.
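
To make the quoted criteria concrete, here is a minimal Python sketch of one possible reading of them. It is based only on the documentation text above and is not Zotero's actual code: the `Item` class, its field names, and `looks_like_duplicate` are all made up for illustration, and the rule that a missing field never blocks a match is an assumption drawn from the "(or are absent)" wording.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Item:
    # Hypothetical record; field names are illustrative, not Zotero's schema.
    title: str
    year: Optional[int] = None
    creators: List[Tuple[str, str]] = field(default_factory=list)  # (last, first)
    doi: Optional[str] = None
    isbn: Optional[str] = None
    publication: str = ""  # never consulted below, which is the complaint above


def _norm(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(text.lower().split())


def _creator_keys(item: Item) -> set:
    # "Last name plus first initial", as in the quoted description.
    return {(last.lower(), first[:1].lower()) for last, first in item.creators}


def looks_like_duplicate(a: Item, b: Item) -> bool:
    # Assumption: a missing field never rules out a match.
    for attr in ("doi", "isbn"):
        x, y = getattr(a, attr), getattr(b, attr)
        if x and y and x.lower() != y.lower():
            return False
    if _norm(a.title) != _norm(b.title):
        return False
    if a.year is not None and b.year is not None and abs(a.year - b.year) > 1:
        return False
    keys_a, keys_b = _creator_keys(a), _creator_keys(b)
    if keys_a and keys_b and not (keys_a & keys_b):
        return False
    return True


life = Item("Alice in Wonderland", 1951, publication="Life")
journal = Item("Alice in Wonderland", 1951, publication="Library Journal")
print(looks_like_duplicate(life, journal))  # True
```

Under this reading, the two anonymous reviews match because nothing is left to tell them apart: no identifiers, no creators, same title, same year.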

mark · December 12, 2017

+1 for the original, simplest request, to mark things as non-duplicates. (This should probably apply only to the pairs of items currently detected as possible duplicates, so that detection still works if true duplicates of items in a non-duplicate set are added later; but that's a minor detail. In fact, I'd already be happy with a rough-and-ready solution that just hides items marked as non-duplicates from the Duplicate Items view.)

Motivation: In a big library like mine, the Duplicate Items collection is perennially populated by at least 150 items that are non-duplicates, with the effect that true duplicates are really hard to find (especially since sorting by date makes pairs harder to spot and compare).

janz5102 · February 11, 2018

I feel the same way as mark. If the library is big enough, you will have a hard time picking out the real duplicates from the large number of items falsely marked as duplicates.

An option to mark "false duplicates" as "non-duplicates" would be a great relief for me as well.

DO_part73 · February 28, 2018

I have the same problem: lots of false duplicates. I'm convinced that title+DOI+ISBN is a bad strategy (especially because items still count as duplicates when DOI/ISBN are empty! This means that if DOI/ISBN information is missing for whatever reason and the titles are similar, the items are treated as duplicates, right?).

So I end up having a lot of (wrong) items in the "Duplicate Items" section, which I don't pay attention to any more. This is a pity, and somehow defeats the purpose of this good tool, whose aim is to "automatically find duplicates".

I'm not sure that "mark as non-duplicates" is a straightforward option, because I can't see how it would be implemented: would Zotero include a flag somewhere in each of the items saying that X is not a duplicate of Y for each case? Then it has to store maybe thousands of these flags... I think the most elegant solution is to give the user the capability of defining which fields (and in which order) Zotero should look into to decide if two items are duplicates or not. Then the rules are fixed and Zotero can rebuild Duplicate Items as many times as necessary.

I hope this makes sense. Thanks for this great software!

rupert · October 10, 2018

Making this problem worse is the fact that a PhD thesis does not have DOI or ISBN fields. But a thesis which has a corresponding journal article with the same title, by the same author, in about the same year, is an extremely common thing.

stroom · January 25, 2022

Same here: a lot of false duplicates in the Duplicate items section. Would be great to be able to mark them as such.

mark · February 18, 2022

Yup, just to update my estimate from 4 years back: I now have about 235 items in my "Duplicates" pane that are not, in fact, duplicates, so it has only become harder to spot genuine new duplicates.

qqbb · April 5, 2022

An option to mark items as "false duplicates" could cause performance issues and other problems, I think. With many items marked as false duplicates, Zotero would have to work a lot to take this information into account. You'd also get new issues. Let's say Zotero found A = B and you marked A ≠ B. Now you add C and Zotero finds A = B = C, but it will take into account A ≠ B. Therefore, Zotero will output A = C and B = C. Now mark A ≠ C and B ≠ C, adding two new false duplicates to the list...
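
The A/B/C scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Zotero stores or would store anything: items are plain string ids, the matcher is a stand-in that treats every pair as a match, and `candidate_pairs` and `not_duplicates` are hypothetical names. Exclusions are kept as a set of unordered pairs, so each check is a single hash lookup.

```python
from itertools import combinations


def candidate_pairs(items, is_duplicate, not_duplicates):
    """Return matching pairs that the user has not marked as non-duplicates."""
    pairs = []
    for a, b in combinations(items, 2):
        # frozenset makes the pair unordered: (A, B) and (B, A) are the same key.
        if frozenset((a, b)) in not_duplicates:
            continue
        if is_duplicate(a, b):
            pairs.append((a, b))
    return pairs


def always_match(a, b):
    # Stand-in matcher: every pair looks like a duplicate (e.g. same title).
    return True


not_duplicates = {frozenset(("A", "B"))}  # the user marked A != B
print(candidate_pairs(["A", "B", "C"], always_match, not_duplicates))
# [('A', 'C'), ('B', 'C')]
```

The output reproduces the situation described above: once C is added, the A/B exclusion alone no longer clears the list, and two more pairs would have to be marked by hand.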

Perhaps an interface similar to the conditions for saved searches that would allow adjusting the criteria for the Duplicate Items collection could be a better solution. If all criteria were listed with checkboxes, you could tighten or relax the conditions for the duplicate items detection. It wouldn't be a problem to add new conditions that could be switched off by default, e.g., "Not duplicate if different item type", "Not duplicate if different year".
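
A rough Python sketch of what such switchable conditions could look like. Everything here is hypothetical (the `Criteria` class, the dict keys such as `itemType`, and the title-only baseline), and it is meant only to show the idea of optional conditions that are off by default, not any existing or planned Zotero setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    # Each flag corresponds to one checkbox; all are off by default.
    same_item_type: bool = False  # "Not duplicate if different item type"
    same_year: bool = False       # "Not duplicate if different year"


def is_duplicate(a: dict, b: dict, criteria: Criteria) -> bool:
    # Baseline assumption for this sketch: titles must match.
    if a["title"].strip().lower() != b["title"].strip().lower():
        return False
    if criteria.same_item_type and a.get("itemType") != b.get("itemType"):
        return False
    if criteria.same_year and a.get("year") != b.get("year"):
        return False
    return True


talk = {"title": "Some Title", "itemType": "presentation", "year": 2021}
paper = {"title": "Some Title", "itemType": "journalArticle", "year": 2022}

print(is_duplicate(talk, paper, Criteria()))                     # True
print(is_duplicate(talk, paper, Criteria(same_item_type=True)))  # False
```

Because the rules are evaluated from scratch each time, nothing has to be stored per item; changing a checkbox just changes which pairs pass.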

livey_liwei · May 21, 2022

@qqbb I acknowledge the potential issues with marking items as "false duplicates". However, many "false duplicates" could be handled readily even without this function. I would ask for the detection mechanism to be improved so that it introduces far fewer "false duplicates". For example, when the papers in a book collection are formatted as books in some database, all of these papers are recognized as duplicates, even though they clearly have different titles and authors. This is one of the major causes of false duplicates in my case, and also the most difficult to handle.

I'm not sure why the detection mechanism is designed in such a way. I raised this issue in 2019, and it still persists, even though changing the detection mechanism to handle it seems trivial to me.

The reason I have come back to this issue is that I found a plugin that can merge the items in the duplicates folder in batch, but the presence of false duplicates, especially conference papers with different titles, makes the job quite cumbersome.

In my case, I occasionally do literature reviews on different topics, and hence I need to import the relevant papers at different times. Often, some duplicates get imported. If the false duplicates issue were solved, it would save me a lot of time manually cleaning up the library, and spare me from having different versions of the same paper.

I regard this as a fundamental function of database management software, because importing data is the beginning of everything. If possible, I hope that the team will consider giving higher priority to this issue.

qqbb · June 17, 2022

I would request to improve the detection mechanism so that it will be able to introduce much fewer "false duplicates".

@livey_liwei: Yes, this was also my main point. The suggestion was to make the criteria for the duplicate items detection adjustable. New users with a small library might want to see more suggestions for duplicate items, while users with a large library might prefer tighter criteria.

Not currently, but the duplicate detection algorithm should get more sophisticated in future versions. (Marking as non-duplicates might happen too, but the goal is for it to be much less necessary.)

@dstillman: This seems like a good approach to me. Marking as non-duplicates could be nice, but if you have hundreds of false duplicates, it's probably not the best solution to start with. For example, I have many presentation items and journal articles with the same title. They're related, but they aren't duplicates. So, an obvious optional criterion could be that duplicates should be of the same item type. Rather than marking dozens of items as non-duplicates, tightening the criteria for the duplicate items detection seems more efficient.

livey_liwei · June 18, 2022

Another possible solution is to give users an interface for appending custom rules to the detection mechanism, to handle whatever issues they encounter.

fulaoshi · July 17, 2022

+1 for this - I came here to make the same request

eefuller · July 23, 2022

Although there's a degree of fuzziness in the definition of some item types, and the web plugin in particular may not choose the correct one depending on the source page, type is still a reasonably strong predictor of whether or not two items are the same. Zotero even recognizes this to the extent that items all have to be of the same type in order to be merged. So it seems especially silly not to provide any way to stop suggesting a merge when the user knows the items are correct as is and not merge-eligible. I have a lot of items that are different manifestations of the same work (e.g., a conference presentation and its published version), or related works by the same author with the same or similar titles (e.g., a dissertation chapter published as a journal article, the completed dissertation, and a later book). Not all of these can have ISBNs or DOIs, but they're all different types. Insistence on keeping these things in the list of duplicates makes the process of seeing and resolving real duplicates harder. Giving users the choice to edit the criteria for determining duplicates would solve this.

mark · August 4, 2022

+1: that is an excellent idea. For items that Zotero won't merge because of item type differences, either (i) don't show them in Duplicate Items to start with or (ii) give users the option to hide them.

The first seems to hit the #papercut sweet spot of being both easy to implement and solving a fairly big and recurring annoyance, at least for power users!
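
As a rough illustration of option (i), here is a small Python sketch that post-filters candidate duplicate groups by item type. It assumes, purely for the example, that a group is a list of dicts with an `itemType` key; `mergeable_groups` is a made-up name and this is not Zotero code.

```python
from collections import defaultdict


def mergeable_groups(groups):
    """Split each candidate group by item type; keep only subgroups of 2+ items."""
    result = []
    for group in groups:
        by_type = defaultdict(list)
        for item in group:
            by_type[item["itemType"]].append(item)
        # A subgroup with a single item has nothing it could be merged with.
        result.extend(sub for sub in by_type.values() if len(sub) > 1)
    return result


groups = [
    # Thesis and article with the same title: different types, so hidden.
    [{"id": 1, "itemType": "thesis"}, {"id": 2, "itemType": "journalArticle"}],
    # Two articles plus a talk: only the two articles remain as a group.
    [
        {"id": 3, "itemType": "journalArticle"},
        {"id": 4, "itemType": "journalArticle"},
        {"id": 5, "itemType": "presentation"},
    ],
]
print([[item["id"] for item in g] for g in mergeable_groups(groups)])
# [[3, 4]]
```

Items that could never be merged with each other drop out of the view, while same-type pairs inside a mixed group are still shown.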

mark · November 25, 2022

Just revisiting to note that there are still 230 items listed in my 'duplicate items' that are all non-resolvable non-duplicates with item type differences, making the duplicate items impossible to navigate and nearly useless...

realtime99 · December 6, 2022

The last dev response to this request was 10 years ago. It said that they planned to improve the duplicate detection algorithm at that time. Just curious if anyone has information about whether the dev team sees the duplicate issues posted since then as significant enough to justify further work, or if they consider this a non-priority. Maybe it was discussed elsewhere on the forum?

adamsmith · December 6, 2022

I'm pretty sure there's been work on marking items as non-duplicates in the last 6 months, yes.
I've not seen anything on improving the detection algorithm (or allowing some level of customization) but that doesn't mean no one is thinking about it, just that there isn't any public code.