Mark as non-duplicate

I have a couple of items that are quite similar - two presentations on similar topics, that share authors (different date, different presentation contents). Because the titles and authors are basically the same, Zotero keeps telling me that they are possible duplicates. I would like to be able to remove them from the list of duplicates without merging them. Would it be possible to add a blacklist for particular pairs of items, so that they are never marked as possible duplicates?
  • Not currently possible. How similar is "basically the same" though?
  • I know it's not currently possible, that's why I put it in feature requests :)

    Abramowitz, G., 2013. The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER).

    Best, M.J., 2014. The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER).

    They are both presentations, and are named the same because that's the name of the experiment. They have the same authors (whole group), but in a different order (presenter first). They have different dates, places and meeting names. And the actual documents are probably only about 20% similar.
  • The feature is generally planned, though I don't think it's high on the priority list: https://www.zotero.org/support/requested_features#zotero_interface
    "They have the same authors (whole group), but in a different order (presenter first)"
    That should be sufficient to differentiate (along with different dates). I'll take a closer look at the code.
  • Actually, the authors were in the same order, but a different one was marked as a presenter (all others marked as contributors). But I also tried re-ordering it so that the presenter was first in both cases.

    I think my problem would be solved if the duplicate detector took "meeting name" - it seems fairly unlikely that anyone would ever present the same document at the same meeting.
  • aurimas,

    Similar issues here. Not sure if these should get flagged as duplicates according to the current code or not:

    I have several email items where I have stored several people's comments on a paper. The title and date of the items are the same, and they contain the same list of creators. However, the creators are in a different order (and are different types) and the abstracts (where the email text is stored) are different.

    I have several journal articles with one or two (of several) of the authors the same across many articles, the same title and publication, but all different years, volumes, page numbers, etc.

    I have a paper published in a conference proceedings and then its abstract published in a journal. They have the same titles, authors, and years, but one is a conference proceeding and one is a journal article (different publication titles).

    I have two sets of the previous issue from the same conference with the same authors (the two sets have different titles). All four of these are flagged as being duplicates together.
  • I'm trying to figure out a solution to this too, and I wonder why development of a simple tag "not duplicate" isn't a high priority. Seems essential for library management. For me, I see confusions between articles and presentations and book chapters most often. Why not just make "item type" (e.g., book, journal article) part of the dup id algorithm? Then if I have the type right Z won't see them as dups. Below are two I'm looking at right now. One is a book, one is an article, and they're coded that way in Zotero, but Z still calls them dups.

    Maynard, D. W., & Heritage, J. (2005). Conversation analysis, doctor-patient interaction and medical communication. In L. T. Reynolds & N. J. Herman-Kinney (Eds.), The handbook of symbolic interactionism (Vols. 1-Book, 1-Section, Vol. 39, pp. 428–435). Rowman Altamira.


    Maynard, D. W., & Heritage, J. (2005). Conversation analysis, doctor-patient interaction and medical communication. Medical Education, 39(4), 428–435. doi:10.1111/j.1365-2929.2005.02111.x
  • Here are two more...one conference, one j. pub. Very similar titles, but authors in different order, different date, and also coded as different type. I agree with an earlier comment I saw that more than title should be used, including author list, date, and type.

    Hsu, V., Montaquila, J. M., & Brick, J. M. (2010). Using a Match Rate Model to Predict Areas Where USPS-Based Address Lists May Be Used in Place of Traditional Listing. In Proceedings of the Survey Research Methods Section, American Statistical Association. Retrieved from http://www.amstat.org/Sections/Srms/Proceedings/y2010/Files/306727_57064.pdf

    Montaquila, J. M., Hsu, V., & Brick, J. M. (2011). Using a “Match Rate” Model to Predict Areas Where USPS-Based Address Lists May Be Used in Place of Traditional Listing. Public Opinion Quarterly, 75(2), 317–335. doi:10.1093/poq/nfr008
  • Here is an issue with patents, which we do cite in our works. They are considered as duplicates by zotero. It would be great to tell zotero that it is mistaken.

    [1] F.H. Hurley, Electrodeposition of Aluminum, 2446331, 1948. http://www.google.co.uk/patents/US2446331.
    [2] T.P. Wier, F.H. Hurley, Electrodeposition of Aluminum, 2446349, 1948. http://www.google.com/patents/US2446349.
    [3] T.P. Wier, Electrodeposition of Aluminum, 2446350, 1948. http://www.google.com/patents/US2446350.
  • Yet another example, in case it helps improve the code:

    Constrained school choice
    Type Journal Article
    Author Guillaume Haeringer
    Author Flip Klijn
    URL http://linkinghub.elsevier.com/retrieve/pii/S002205310900057X
    Volume 144
    Issue 5
    Pages 1921-1947
    Publication Journal of Economic Theory
    ISSN 00220531
    Date 9/2009


    Constrained school choice : an experimental study
    Type Journal Article
    Author Caterina Calsamiglia
    Author Guillaume Haeringer
    Author Flip Klijn
    URL http://linkinghub.elsevier.com/retrieve/pii/S002205310900057X
    Volume 144
    Issue 5
    Pages 1921-1947
    Publication American Economic Review
    Date September 2009
  • My guess would be that for that last one the DOI is duplicate? But even if it's not, it's only somewhat a false duplicate:
    The volume, issue, and page range info is just wrong for the AER article (it's actually 100(4): 1860-74.) and, in fact, duplicated from the JET one.
    That Zotero would guess that this much overlap can't be a coincidence makes sense--and it's right about it, too.
  • Good catch, thanks. I must have inadvertently merged the experimental and non-experimental papers before, which would explain the mix of AER with JET's volume and page numbers.

    I have fixed these and the two papers are no longer considered as duplicates by Zotero.
  • I have a bunch of items that are movie reviews all with the same title (the movie title) and the same year, but everything else is different: author, publication, volume, day, month, pages...

    Zotero thinks these are all duplicates. Is that expected? Does it just check for title and year? If so, is there any way to change the criteria to add at least one other field as a differentiator?

    Maybe the duplicate criteria should take into account the number of blank fields and/or conflicting fields in some way. If there are more non-blank conflicting fields than matching fields, it seems pretty unlikely that the two items are duplicates.
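The heuristic suggested in the post above (weigh conflicting non-blank fields against matching ones) could be sketched roughly as follows. This is only an illustration in Python, not Zotero's detection code: the function names, the dictionary representation of an item, and the field names are all assumptions made for the example.

```python
def compare_fields(a, b):
    """Count matching and conflicting fields among those non-blank in both items.

    `a` and `b` are plain dicts mapping field names to values (a hypothetical
    representation, not Zotero's data model).
    """
    matches = conflicts = 0
    for field in a.keys() & b.keys():
        va = str(a[field]).strip().lower()
        vb = str(b[field]).strip().lower()
        if not va or not vb:
            # A blank field is treated as no evidence either way.
            continue
        if va == vb:
            matches += 1
        else:
            conflicts += 1
    return matches, conflicts


def looks_like_duplicate(a, b):
    """Flag a pair only if conflicting fields do not outnumber matching ones."""
    matches, conflicts = compare_fields(a, b)
    return matches > 0 and conflicts <= matches
```

With two movie reviews that share only title and year but differ in author, publication, and pages, `compare_fields` would report two matches and three conflicts, so the pair would not be flagged under this rule.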
  • There is an open access project based at the Bond University Center for Research in Evidence-Based Practice (CREBP) with the aim of drastically reducing the time to construct a Systematic Review. http://crebp-sra.com

    One of the key parts of this effort is identifying duplicate articles in a database.

    The PHP / MySQL code is at https://github.com/CREBP/SRA

    While I am decidedly _not_ a programmer, it looks as though the algorithms could be useful for Zotero or for a Zotero plug-in. My own database programmers at SafetyLit.org are pleased with what they see and are using these scripts to improve our duplicate detection process.

    An interesting article in the BioMed Central journal Systematic Reviews ( http://www.ncbi.nlm.nih.gov/pubmed/25588387 ) found:
    The sensitivity (84%) and specificity (100%) of the SRA-DM was superior to EndNote (sensitivity 51%, specificity 99.83%). Validation testing on three additional biomedical literature searches demonstrated that SRA-DM consistently achieved higher sensitivity than EndNote (90% vs 63%), (84% vs 73%) and (84% vs 64%). Furthermore, the specificity of SRA-DM was 100%, whereas the specificity of EndNote was imperfect (average 99.75%) with some unique records wrongly assigned as duplicates. Overall, there was a 42.86% increase in the number of duplicates records detected with SRA-DM compared with EndNote auto-deduplication.
    I apologize if this suggestion is more intrusive than helpful.
  • It is not clear to me why Zotero seems to be relying only on title and author. The fact that the two below, in different journals and years are seen as the same is troubling. Is there some way I can correct this?


    Wyon DP. The effects of moderate heat stress on typewriting performance. Arch Sci Physiol. 1973;27(4):499–509.

    Wyon DP. The effects of moderate heat stress on typewriting performance. Ergonomics. 1974;17(3):309–318.
  • No way to correct this currently. I also don't see this as such a critical error. It's not like Zotero automatically merges duplicates, it just displays them in the duplicate special collection. Yes, it's annoying that you have to ignore this false positive.

    Let's appreciate, however, that this is a fairly exceptional case: the same author publishing an article with exactly the same title within one year (those are actually exactly the criteria Zotero uses). Yes, it's a different journal, but the reason we don't match on this metadata is that the form in which journal titles are scraped from the web varies widely (e.g., full vs. abbreviated), so matching on it would cause a lot of actual duplicates to be missed. One way we could improve this is to also check the ISSNs of the journals to determine whether they are different. There's one small problem with multiple ISSNs being entered in the field, but we can figure out how to resolve this.

    So, in short, we should be able to fix the case you supply above.
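The ISSN check proposed in the post above might look something like this. It is a minimal sketch in Python, not Zotero's implementation: the function names are hypothetical, and it assumes that several ISSNs in one field are separated by commas, semicolons, or whitespace, and that two journals should count as different only when both fields are populated and share no ISSN.

```python
import re

# Matches an ISSN written with or without the hyphen, e.g. 0022-0531 or 00220531.
ISSN_PATTERN = re.compile(r"\b(\d{4})-?(\d{3}[\dXx])\b")


def parse_issns(field):
    """Return the set of ISSNs found in a free-text field, normalized to NNNN-NNNN."""
    return {f"{head}-{tail.upper()}" for head, tail in ISSN_PATTERN.findall(field or "")}


def journals_differ(issn_field_a, issn_field_b):
    """True only if both fields contain ISSNs and none of them is shared."""
    a = parse_issns(issn_field_a)
    b = parse_issns(issn_field_b)
    if not a or not b:
        # Missing data: the ISSN gives no reason to rule out a duplicate.
        return False
    return a.isdisjoint(b)
```

Treating each field as a set is one way around the multiple-ISSN problem: a print ISSN and an electronic ISSN entered together still match a record that carries only one of them.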
  • I agree that one has to wonder about people who publish something with identical titles. He actually has a third with the same title in 1975 that, although the type = Book Chapter (the others are type = Journal Article), is also seen by Zotero as a duplicate.

    Wyon DP. The effects of moderate heat stress on typewriting performance. Prevision quantitative des effets physiologiques et psychologiques de l’environnement thermique chez l’homme. Paris: Paris, Centre National de la Recherche Scientifique; 1975. p. A499–509.

    Does this mean that I need to make sure that the ISSN (and ISBN) fields are populated?
  • Currently populating ISSN fields will not help, but we'll implement that in the near future (I hope). In any case, more complete metadata is always better than less complete, so I would certainly encourage you to populate those fields if possible.

    Displaying duplicates for different item types is a bit different. On the one hand, Zotero doesn't always correctly identify item types, so displaying these as duplicates could help people correct such errors. On the other hand, Zotero currently doesn't provide convenient ways to merge such items. Fixing this case of false-positive will be more difficult and may have to wait for a general mechanism of marking items as non-duplicates.
  • Thanks for the update -- is that general mechanism on the to do list? I got the impression from reading the forum that it was not likely to be forthcoming in the foreseeable future.
  • I don't have an ETA. It's on the list, just not sure how far up.
  • Here is an interesting example where Zotero is NOT identifying a duplicate:

    1. Bolen AR, Henneberger PK, Liang X, Sama SR, Preusse PA, Rosiello RA, Milton DK. The validation of work-related self-reported asthma exacerbation. Occup Environ Med. 2007 May;64(5):343–348. PMCID: PMC2092554

    2. Bolen AR, Henneberger PK, Liang X, Sama SR, Preusse PA, Rosiello RA, Milton DK. The Validation of Work-related Self-reported Asthma Exacerbation. Occup Environ Med. 2007;64:343–348.
  • Those two should be duplicates. Do both have data in the DOI field? Are they the same?
  • Can you export both items as Zotero RDF and post the contents on https://gist.github.com ?
  • Sorry: I was under deadline to file a "faculty activity report" with an updated syllabus for input to the Lyterati system the University just adopted and didn't have time to check back here. By now, I've manually merged the two items and cannot access the separate records. But, #2 probably had a blank DOI field.

    Next, I need to figure out how to edit the NLM style so that I can get both PMCID and DOI in the bibliography.
  • Change the pmcid macro to:
    <macro name="pmcid">
      <group delimiter=". " prefix=" ">
        <text variable="DOI" prefix="doi: "/>
        <text variable="PMCID" prefix="PMCID: "/>
        <choose>
          <if variable="PMCID" match="none">
            <text variable="PMID" prefix="PMID: "/>
          </if>
        </choose>
      </group>
    </macro>
  • Hi, I use Zotero with historic newspaper articles. This means I have numerous articles from the same newspaper with the same title but with different dates.
    For example:
    Title: Shipping Intelligence. Hobsons Bay
    Publication: The Argus
    Date: 21 August 1869

    Title: Shipping Intelligence. Hobsons Bay
    Publication: The Argus
    Date: 21 October 1869

    or titled TO THE EDITOR, or Local News. Or a run of letters with the same heading but each with a slightly different date. They all come up as duplicates.
  • yeah, I don't think that can be avoided with auto detection.
  • edited June 19, 2017
    Hi, I'm having a similar issue: two articles with exactly the same title but they only share one author (and the author of the single-author article is not the 1st author of the 3-authors article). All the other information (journal, etc.) is different. Zotero detects them as duplicates.

    It's like having one article:
    Author1. [TITLE].
    and the "duplicate"
    Author2, Author1, Author3. [TITLE].

    Is there a way to specify which fields should be used to detect duplicates?
  • I have lots of false duplicates. I'm convinced the title+DOI+ISBN is a bad strategy (especially because the duplicate is true if DOI/ISBN are empty! this means that if for whatever reason DOI/ISBN information is missing, then if the title is similar they are duplicates, right?).

    So I end up having a lot of (wrong) items in the "Duplicate Items" section, which I no longer pay attention to. This is a pity, and somewhat defeats the purpose of this good tool, whose aim is to "automatically find duplicates".

    I'm not sure if "mark as non-duplicates" is a straightforward option, because I can't see how it would be implemented: would Zotero include a flag somewhere in each of the items saying that X is not a duplicate of Y for each case? Then it has to store maybe thousands of these flags... I think the most elegant solution is to give the user the ability to define which fields (and in which order) Zotero should look at to decide whether two items are duplicates. Then the rules are fixed and Zotero can rebuild the Duplicate Items list as many times as necessary.

    I hope this makes sense. Thanks for this great software!
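The idea of user-defined comparison fields in the post above can be sketched as follows. This is a hypothetical Python illustration, not an existing Zotero setting or API: the function names and the dictionary item format are invented for the example, and it assumes that a field that is blank in either item never counts as a match.

```python
def normalize(value):
    """Lower-case a value and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value or "").lower().split())


def is_duplicate(a, b, fields):
    """Compare two items on a user-chosen, ordered list of fields.

    Stops at the first field that is blank in either item or that differs.
    """
    for field in fields:
        va = normalize(a.get(field))
        vb = normalize(b.get(field))
        if not va or not vb or va != vb:
            return False
    return True
```

For the two Wyon articles quoted earlier in this thread, comparing on title and creators alone would report a duplicate, while adding the publication field to the list would separate them.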
  • Like you, I have a lot of false duplicates (any sufficiently large library will). See Dan's post in this earlier thread — in brief, marking as non-duplicates is indeed non-trivial but planned (though the issue hasn't seen activity since Nov 2016); but customising the duplicate detection rules themselves would seem to be a bit more complex. It would be lovely though.
  • OK, I get it. So let's say there are two options being considered here: option 1 is "mark as non-duplicate", option 2 is changing how Zotero defines a 'duplicate' (i.e. the user defines which fields Zotero compares to decide if there is a duplicate or not).

    Here is a powerful reason why option 1 is worse than option 2: for option 1 the developers need to change the *database structure* (e.g., adding a "non-duplicate of" field), while in option 2 the developers simply change the comparison *algorithm*, using the database structure exactly as it is now.

    In my opinion this alone should lead the developers to choose option 2. But maybe I'm missing something.

    Any option would be welcome of course. Cheers
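For comparison, option 1 amounts to keeping a list of item pairs that the user has declared distinct and filtering the detected candidates against it. A rough Python sketch, under the assumption that every item has a stable key; the class name and the keys are hypothetical, and nothing here reflects how Zotero would actually store such data:

```python
class NonDuplicateRegistry:
    """Remembers unordered pairs of item keys the user has marked as distinct."""

    def __init__(self):
        self._pairs = set()

    def mark(self, key_a, key_b):
        # frozenset makes the pair order-independent: (A, B) equals (B, A).
        self._pairs.add(frozenset((key_a, key_b)))

    def is_marked(self, key_a, key_b):
        return frozenset((key_a, key_b)) in self._pairs

    def filter_candidates(self, candidate_pairs):
        """Drop candidate duplicate pairs that the user has already rejected."""
        return [(a, b) for a, b in candidate_pairs if not self.is_marked(a, b)]
```

Each rejected pair costs one stored record, so the storage grows with the number of false positives the user dismisses rather than with the size of the library.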