marking non-duplicates

I have just started using Zotero seriously (even though I have had it installed since its early days and have followed its development). One problem I seem to have with my data is that of "non-duplicates", i.e. items which Zotero thinks are the same when in fact they are not. The two examples I have in my data are: a journal article and a book by the same author with the same title, and two anonymous reviews of the same book in different journals.

Is there a way to tell Zotero these are not duplicates?
  • Not currently, but the duplicate detection algorithm should get more sophisticated in future versions. (Marking as non-duplicates might happen too, but the goal is for it to be much less necessary.)
  • Thanks! For the time being I have added numbers in square brackets to the title as a quick solution. I think the marking feature will not be needed once the planned algorithm is implemented (I understand it will allow users to define which fields one wants to take into account when detecting duplicates). Many thanks!
  • edited June 12, 2012
    I understand it will allow users to define which fields one wants to take into account when detecting duplicates
    Not sure where you got that—that's not planned—but it will use more fields in more complex ways than it's using now. Right now, as you've noticed, it's mostly the title that matters (if there's no id in common).
  • I was just guessing. I think it would be useful to allow users to select the fields to be taken into account -- but not knowing how the new algorithm will work, I will not insist on it :-)
  • FYI, here's a thread in which the virtues and vices of the current duplication detection mechanism(s) are discussed.
  • mark: Thanks -- I should have found it before asking the question :-)
  • edited July 19, 2012
    I have read the above discourse yet I still have a question which differs slightly. These are the titles of two j. articles I've ref'd:

    Direct-use Values of Non-Timber Forest Products...
    Direct-use Values of Secondary Resources...

    They also have one author in common. Is this the same issue as voynich?
  • jaymanxv: Are you saying those two items are identified as duplicates in the Duplicates view? Do they have the same DOI?
  • thanks! missed that :-/
  • In my opinion, items should only be considered as duplicates by Zotero if

    - title
    - author(s), editor(s), etc.
    - volume

    are the same.

    If you want to have different volumes of the same encyclopedia in your library, at the moment, they will all show up as duplicates because the title is obviously the same.

    Being able to define the criteria for what qualifies as a duplicate myself would be perfect, but the criteria mentioned above would already make the feature much more useful for me and would be much easier to implement, I guess.
  • Completely agree with mreiter about the desirability of user-designated fields for comparison to determine duplicates!
  • I am having trouble with this. I am getting different volumes of Knuth's The Art of Computer Programming marked as duplicates. Is there anything I can do?
  • Hi
    I agree the method Zotero uses is too limited (described here: https://www.zotero.org/support/duplicate_detection).
    If you import http://ftp.math.utah.edu/pub/tex/bib/tugboat.html
    you will find lots of duplicates, which are actually different installments of recurring columns.
    It doesn't seem to have been fixed yet. Does anyone know a fix or a workaround?
    Paul
  • Zotero seems especially inaccurate with magazine articles, which don't have DOIs or ISBNs. If anonymously authored, matching titles and years alone cause Zotero to think they're duplicates. Consider these film reviews:

    “Alice in Wonderland.” Life, vol. 30, June 1951, pp. 85–87.
    “Alice in Wonderland.” Library Journal, vol. 76, Aug. 1951, p. 1239.

    Zotero marks these and 10 more like them as duplicates.

    Is there any reason it would be a bad strategy for Zotero to require a matching publication title before marking articles as duplicates?

    @paulvdh, here are the duplicate criteria:
    "Zotero currently uses the title, DOI, and ISBN fields to determine duplicates. If these fields match (or are absent), Zotero also compares the years of publication (if they are within a year of each other) and author/creator lists (if at least one author last name plus first initial matches) to determine duplicates. The algorithm will be improved in the future to incorporate other fields."

    So only info in those fields would affect the duplicate evaluation.
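
As a rough illustration of the criteria quoted above, here is a minimal sketch of one possible reading of them. It is not Zotero's actual code: the `Item` class, the `looks_like_duplicate` function, and the `require_same_publication` switch (the stricter rule asked about in this post) are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    # Hypothetical record type; field names are assumptions, not Zotero's schema.
    title: str
    year: Optional[int] = None
    doi: Optional[str] = None
    isbn: Optional[str] = None
    publication: Optional[str] = None
    # Creators as (last name, first name) pairs; empty for anonymous items.
    creators: list = field(default_factory=list)


def _compatible(a, b):
    # A value present on both sides must match; a missing value never blocks a match.
    return a is None or b is None or a.strip().lower() == b.strip().lower()


def _creator_keys(item):
    # Last name plus first initial, as in the quoted description.
    return {(last.lower(), first[:1].lower()) for last, first in item.creators}


def looks_like_duplicate(a, b, require_same_publication=False):
    if a.title.strip().lower() != b.title.strip().lower():
        return False
    if not (_compatible(a.doi, b.doi) and _compatible(a.isbn, b.isbn)):
        return False
    # Years count as matching when they are within one year of each other.
    if a.year is not None and b.year is not None and abs(a.year - b.year) > 1:
        return False
    # If both items have creators, at least one must be shared.
    keys_a, keys_b = _creator_keys(a), _creator_keys(b)
    if keys_a and keys_b and not (keys_a & keys_b):
        return False
    # Optional stricter rule: differing publication titles rule out a match.
    if require_same_publication and not _compatible(a.publication, b.publication):
        return False
    return True


life = Item("Alice in Wonderland", year=1951, publication="Life")
lj = Item("Alice in Wonderland", year=1951, publication="Library Journal")
print(looks_like_duplicate(life, lj))                                 # True
print(looks_like_duplicate(life, lj, require_same_publication=True))  # False
```

Under this reading, the two anonymous reviews match because nothing other than title and year is available to tell them apart; comparing the publication title is enough to separate them.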
  • +1 for the original, simplest request, to mark things as non-duplicates. (This probably should apply only to the pairs of items currently detected as possible duplicates, to make sure detection will still work if true duplicates of items in a non-duplicate set are added later; but that's a minor detail. In fact I'd already be happy with a rough and ready solution that just hides items marked as non-duplicates from the Duplicate Items view.)

    Motivation: In a big library like mine, the Duplicate Items collection is perennially populated by at least 150 items that are non-duplicates, with the effect that true duplicates are really hard to find (especially since sorting by date makes pairs harder to spot and compare).
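
The pairwise marking described in this post could be sketched roughly as follows. This is only an illustration of the idea, not an existing Zotero feature; `mark_not_duplicate`, `visible_candidates`, and the integer item IDs are hypothetical.

```python
# Unordered pairs of item IDs that the user has declared to be distinct.
not_duplicates = set()


def mark_not_duplicate(id_a, id_b):
    # frozenset makes the pair order-independent: (1, 2) equals (2, 1).
    not_duplicates.add(frozenset((id_a, id_b)))


def visible_candidates(candidate_pairs):
    # Hide only the exact pairs the user has dismissed.
    return [pair for pair in candidate_pairs
            if frozenset(pair) not in not_duplicates]


mark_not_duplicate(2, 1)

# Item 4 is added later and is detected as a possible duplicate of item 1.
candidates = [(1, 2), (1, 4)]
print(visible_candidates(candidates))  # [(1, 4)]
```

Because the exclusion is stored per pair rather than per item, an item that was dismissed once can still show up against a true duplicate that is added later.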
  • I feel the same way as mark. If the library is big enough, you will have a hard time finding the real duplicates among the many items falsely marked as duplicates.

    An option to mark "false duplicates" as "non-duplicates" would be a great relief for me as well.
  • I have the same problem: lots of false duplicates. I'm convinced that title+DOI+ISBN is a bad strategy (especially because items count as duplicates if DOI/ISBN are empty! This means that if for whatever reason the DOI/ISBN information is missing and the titles are similar, the items are treated as duplicates, right?).

    So I end up having a lot of (wrong) items in the "Duplicate Items" section, which I don't pay attention to any more. This is a pity, and somewhat defeats the purpose of this good tool, whose aim is to "automatically find duplicates".

    I'm not sure "mark as non-duplicates" is a straightforward option, because I can't see how it would be implemented: would Zotero store a flag somewhere for each case, saying that X is not a duplicate of Y? Then it might have to store thousands of these flags... I think the most elegant solution is to give the user the ability to define which fields (and in which order) Zotero should look at to decide whether two items are duplicates. Then the rules are fixed and Zotero can rebuild the Duplicate Items list as many times as necessary.

    I hope this makes sense. Thanks for this great software!
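
The user-defined comparison fields suggested in this post might look something like the sketch below. It is a simplified illustration only: items are plain dictionaries, `find_duplicate_groups` and `normalize` are hypothetical helpers, and matching is exact after normalization, which is cruder than what a real implementation would need.

```python
from collections import defaultdict


def normalize(value):
    # Case-insensitive, whitespace-insensitive comparison; missing values become "".
    return "" if value is None else " ".join(str(value).lower().split())


def find_duplicate_groups(items, fields):
    # Items are grouped by the user-chosen fields; any group with more
    # than one member is reported as a set of possible duplicates.
    groups = defaultdict(list)
    for item in items:
        key = tuple(normalize(item.get(name)) for name in fields)
        groups[key].append(item)
    return [group for group in groups.values() if len(group) > 1]


library = [
    {"title": "The Art of Computer Programming", "volume": 1},
    {"title": "The Art of Computer Programming", "volume": 2},
]

print(len(find_duplicate_groups(library, ["title"])))            # 1
print(len(find_duplicate_groups(library, ["title", "volume"])))  # 0
```

Since the groups are recomputed from the chosen fields each time, nothing has to be stored per item. One caveat of this simple version is that two items which both lack a chosen field are treated as matching on that field.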
  • Making this problem worse is the fact that a PhD thesis does not have DOI or ISBN fields. But a thesis with a corresponding journal article of the same title, by the same author, in about the same year, is an extremely common thing.