Duplicate detection false positives

I've encountered an error in duplicate detection--items with the same or very similar titles are identified as duplicates even though they have different author/year/publication in one case, and different patent number and issue date in another. IIRC, the older duplicate detection algorithm tended to be overly stringent with title-matching--items whose titles differed only in punctuation or capitalization eluded detection. Perhaps the current problem is an unintended consequence of solving that issue?

If there were any way to manually mark non-duplicates (https://forums.zotero.org/discussion/23682/marking-nonduplicates/) that would be one way to solve the problem. Another would be to have the algorithm look to a secondary field to confirm duplicate identity.

example 1:
http://www.ncbi.nlm.nih.gov/pubmed/?term=12152921
http://www.nap.edu/catalog.php?record_id=12875

example 2:
http://www.google.com/patents/about?id=XNMDAAAAEBAJ
http://www.google.com/patents/about?id=moYXAAAAEBAJ
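
To make the secondary-field idea concrete, here is a minimal sketch in Python (not Zotero's own code). A title match only counts as a duplicate if a second field also agrees: the year, or at least one creator's last name. The `Item` class, its field names, and `normalize_title` are hypothetical and purely illustrative; this is an assumption about how such a check could look, not how Zotero's detection works.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical, simplified record; not Zotero's data model."""
    title: str
    year: str = ""
    creators: list = field(default_factory=list)  # creator last names


def normalize_title(title: str) -> str:
    """Lowercase and drop punctuation so trivial differences do not block a match."""
    cleaned = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(cleaned.split())


def is_probable_duplicate(a: Item, b: Item) -> bool:
    """A title match is necessary; the secondary fields must not contradict it."""
    if normalize_title(a.title) != normalize_title(b.title):
        return False
    # Secondary check 1: if both items have a year, the years must agree.
    if a.year and b.year and a.year != b.year:
        return False
    # Secondary check 2: if both list creators, they must share a last name.
    names_a = {name.lower() for name in a.creators}
    names_b = {name.lower() for name in b.creators}
    if names_a and names_b and not (names_a & names_b):
        return False
    return True
```

Under these assumptions, two items with the same title but different years or entirely different creators would no longer be flagged, while punctuation and capitalization differences in the title would still be tolerated.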
  • Here's a false positive case that I've run into:

    Item 1 is an edited book.

    Item 2 is a chapter in that book. The chapter has the same title as the book (really!); the chapter author is not one of the editors of the book.

    All other information in the two items (editors, place of publication, publisher, and date) is the same.

    I can reproduce this by making two new dummy items like Items 1 and 2 above. They show up as duplicate items only if Item 2 includes the name of the book editor(s); if Item 2 does not include the name of the book editor(s), then the two items do not show up as duplicates.
  • edited July 10, 2013
    I have read in the forums about implementing a button to ignore false positives, but those messages are pretty old, and this is the last thread I've seen that mentions the problem. Is this kind of button going to be implemented? It would surely give more flexibility to the user.

    Thanks.
  • I encountered a "false positive duplicates" problem with 3 articles with different authors, titles, summaries, tags etc. - the single common feature being the journal and volume (DOI and ISSN); how can this occur when "Zotero currently uses the title, DOI, and ISBN fields to determine duplicates"? Titles were obviously not included in the duplicate criterion; are the three fields OR rather than AND connected? A solution would be greatly appreciated.
  • the single common feature being the journal and volume (DOI and ISSN)
    I'm not sure what you mean by this. DOI is a document identifier, not a journal identifier. So yes, if the DOI matches, that's enough to consider it a match. Are you saying the different items have the same DOI? If so, that seems like a mistake.
  • Thanks for the quick reply - yes, indeed, the 3 articles have identical DOIs (in the meantime, another one turned up). The PubMed accession numbers are: 21858264, 21743881, 22888485, and 21468385; the common DOI is 10.4193/Rhin.
    Can I do anything to correct this mistake?
  • So the journal seems to only have a single DOI for all articles, which isn't how DOIs are typically - and should be - used.

    CrossRef does assign DOIs not just to articles, but also to journal titles, volumes, issues, etc., but I don't really think they should be part of the article data.
  • edited August 8, 2013
    Well, I guess we can require some remote title similarity as a basic sanity check in order to match by identifier, but yeah, this seems like the journal's problem. (And it's entirely possible that different articles in a given journal would have titles closer than whatever distance threshold we set.)
  • Apparently, the journal's setting of DOIs is inconsistent (I checked other items in my database); so my problem arose from their ambiguous handling of DOIs. But still, some improvement to duplicate detection in Zotero might be helpful ... thanks in advance!
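
For reference, the sanity check suggested earlier in the thread (require some remote title similarity before accepting a match by identifier) could look roughly like the Python sketch below. It uses the standard library's `difflib.SequenceMatcher`; the function names and the `min_similarity` threshold of 0.3 are illustrative assumptions, not Zotero's actual code or a tuned value. As noted above, different articles in the same journal can still have titles closer than any threshold chosen.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] for how alike two titles are, ignoring case."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def same_by_identifier(doi_a, title_a, doi_b, title_b, min_similarity=0.3):
    """Accept a shared DOI as a match only if the titles are remotely similar.

    min_similarity is an arbitrary illustrative threshold (an assumption).
    """
    if not doi_a or not doi_b:
        return False
    # DOIs are case-insensitive, so compare them in lowercase.
    if doi_a.strip().lower() != doi_b.strip().lower():
        return False
    return title_similarity(title_a, title_b) >= min_similarity
```

With a check like this, several unrelated articles that all carry one journal-level DOI would not be merged purely on the identifier, at the cost of occasionally missing true duplicates whose titles were entered very differently.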