Mark as non-duplicate

naught101 · May 21, 2014

I have a couple of items that are quite similar - two presentations on similar topics, that share authors (different date, different presentation contents). Because the titles and authors are basically the same, Zotero keeps telling me that they are possible duplicates. I would like to be able to remove them from the list of duplicates without merging them. Would it be possible to add a blacklist for particular pairs of items, so that they are never marked as possible duplicates?

aurimas · May 21, 2014

Not currently possible. How similar is "basically the same" though?

naught101 · May 21, 2014

I know it's not currently possible, that's why I put it in feature requests :)

Abramowitz, G., 2013. The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER).

Best, M.J., 2014. The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER).

They are both presentations, and are named the same because that's the name of the experiment. They have the same authors (whole group), but in a different order (presenter first). They have different dates, places and meeting names. And the actual documents are probably only about 20% similar.

aurimas · May 21, 2014

The feature is generally planned, though I don't think it's high on the priority list: https://www.zotero.org/support/requested_features#zotero_interface

> They have the same authors (whole group), but in a different order (presenter first)

That should be sufficient to differentiate (along with different dates). I'll take a closer look at the code.

naught101 · May 21, 2014

Actually, the authors were in the same order, but a different one was marked as a presenter (all others marked as contributors). But I also tried re-ordering it so that the presenter was first in both cases.

I think my problem would be solved if the duplicate detector took "meeting name" - it seems fairly unlikely that anyone would ever present the same document at the same meeting.

bwiernik · May 26, 2014

aurimas,

Similar issues here. Not sure if these should get flagged as duplicates according to the current code or not:

I have several email items where I have stored several people's comments on a paper. The title and date of the items are the same, and they contain the same list of creators. However, the creators are in a different order (and are different types) and the abstracts (where the email text is stored) are different.

I have several journal articles with one or two (of several) of the authors the same across many articles, the same title and publication, but all different years, volumes, page numbers, etc.

I have a paper published in a conference proceedings and then its abstract published in a journal. They have the same titles, authors, and years, but one is a conference proceeding and one is a journal article (different publication titles).

I have two sets of the previous issue from the same conference with the same authors (the two sets have different titles). All four of these are flagged as being duplicates together.

Matt Jans · December 21, 2014

I'm trying to figure out a solution to this too, and I wonder why development of a simple tag "not duplicate" isn't a high priority. Seems essential for library management. For me, I see confusions between articles and presentations and book chapters most often. Why not just make "item type" (e.g., book, journal article) part of the dup id algorithm? Then if I have the type right Z won't see them as dups. Below are two I'm looking at right now. One is a book, one is an article, and they're coded that way in Zotero, but Z still calls them dups.

Maynard, D. W., & Heritage, J. (2005). Conversation analysis, doctor-patient interaction and medical communication. In L. T. Reynolds & N. J. Herman-Kinney (Eds.), The handbook of symbolic interactionism (Vols. 1-Book, 1-Section, Vol. 39, pp. 428–435). Rowman Altamira.

Maynard, D. W., & Heritage, J. (2005). Conversation analysis, doctor-patient interaction and medical communication. Medical Education, 39(4), 428–435. doi:10.1111/j.1365-2929.2005.02111.x

Matt Jans · December 21, 2014

Here are two more...one conference, one j. pub. Very similar titles, but authors in different order, different date, and also coded as different type. I agree with an earlier comment I saw that more than title should be used, including author list, date, and type.

Hsu, V., Montaquila, J. M., & Brick, J. M. (2010). Using a Match Rate Model to Predict Areas Where USPS-Based Address Lists May Be Used in Place of Traditional Listing. In Proceedings of the Survey Research Methods Section, American Statistical Association. Retrieved from http://www.amstat.org/Sections/Srms/Proceedings/y2010/Files/306727_57064.pdf

Montaquila, J. M., Hsu, V., & Brick, J. M. (2011). Using a “Match Rate” Model to Predict Areas Where USPS-Based Address Lists May Be Used in Place of Traditional Listing. Public Opinion Quarterly, 75(2), 317–335. doi:10.1093/poq/nfr008

strath-mts · January 9, 2015

Here is an issue with patents, which we do cite in our work. Zotero considers them duplicates. It would be great to be able to tell Zotero that it is mistaken.

[1] F.H. Hurley, Electrodeposition of Aluminum, 2446331, 1948. http://www.google.co.uk/patents/US2446331.
[2] T.P. Wier, F.H. Hurley, Electrodeposition of Aluminum, 2446349, 1948. http://www.google.com/patents/US2446349.
[3] T.P. Wier, Electrodeposition of Aluminum, 2446350, 1948. http://www.google.com/patents/US2446350.

vanderlindenma · January 30, 2015

Yet another example, in case it helps improve the code:

Constrained school choice
Type Journal Article
Author Guillaume Haeringer
Author Flip Klijn
URL http://linkinghub.elsevier.com/retrieve/pii/S002205310900057X
Volume 144
Issue 5
Pages 1921-1947
Publication Journal of Economic Theory
ISSN 00220531
Date 9/2009

Constrained school choice : an experimental study
Type Journal Article
Author Caterina Calsamiglia
Author Guillaume Haeringer
Author Flip Klijn
URL http://linkinghub.elsevier.com/retrieve/pii/S002205310900057X
Volume 144
Issue 5
Pages 1921-1947
Publication American Economic Review
Date September 2009

adamsmith · January 30, 2015

My guess would be that for that last one the DOI is duplicate? But even if it's not, it's only somewhat a false duplicate:
The volume, issue, and page range info is just wrong for the AER article (it's actually 100(4): 1860-74.) and, in fact, duplicated from the JET one.
That Zotero would guess that this much overlap can't be a coincidence makes sense--and it's right about it, too.

vanderlindenma · January 30, 2015

Good catch, thanks. I must have inadvertently merged the experimental and non-experimental papers before, which would explain the mix of AER with JET's volume and page numbers.

I have fixed these and the two papers are no longer considered as duplicates by Zotero.

realtime99 · January 31, 2015

I have a bunch of items that are movie reviews all with the same title (the movie title) and the same year, but everything else is different: author, publication, volume, day, month, pages...

Zotero thinks these are all duplicates. Is that expected? Does it just check for title and year? If so, is there any way to change the criteria to add at least one other field as a differentiator?

Maybe the duplicate criteria should take into account the number of blank fields and/or conflicting fields in some way. If there are more non-blank conflicting fields than matching fields, it seems pretty unlikely that the two items are duplicates.
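
A minimal sketch of the heuristic suggested above, assuming items are plain dictionaries mapping field names to values. This is not Zotero's code: `likely_duplicates`, the field names, and the sample reviews are all made up for illustration, and the rule that at least one field must match is an extra assumption.

```python
def likely_duplicates(a: dict, b: dict) -> bool:
    """Compare only fields that are non-blank in both items (hypothetical helper)."""
    matching = 0
    conflicting = 0
    for field in a.keys() & b.keys():
        va = str(a[field]).strip().lower()
        vb = str(b[field]).strip().lower()
        if not va or not vb:
            continue  # blank on either side: no evidence either way
        if va == vb:
            matching += 1
        else:
            conflicting += 1
    # Assumption: require at least one match, and reject the pair when
    # conflicting fields outnumber matching ones.
    return matching > 0 and conflicting <= matching


# Two made-up reviews of the same movie: title and year match, the rest differs.
review_1 = {"title": "Some Movie", "year": "2014", "author": "Reviewer A",
            "publication": "Paper X", "pages": "12"}
review_2 = {"title": "Some Movie", "year": "2014", "author": "Reviewer B",
            "publication": "Paper Y", "pages": "40-41"}

print(likely_duplicates(review_1, review_2))
```

Here two fields match and three conflict, so the pair would not be reported. Fields present in only one item are treated like blanks.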

DWL-SDCA · February 2, 2015

There is an open access project based at the Bond University Center for Research in Evidence-Based Practice (CREBP) with the aim of drastically reducing the time to construct a Systematic Review. http://crebp-sra.com

One of the key parts of this effort is identifying duplicate articles in a database.

The PHP / MySQL code is at https://github.com/CREBP/SRA

While I am decidedly _not_ a programmer, it looks as though the algorithms could be useful for Zotero or for a Zotero plug-in. My own database programmers at SafetyLit.org are pleased with what they see and are using these scripts to improve our duplicate detection process.

An interesting article in the BioMed Central journal Systematic Reviews ( http://www.ncbi.nlm.nih.gov/pubmed/25588387 ) found:

The sensitivity (84%) and specificity (100%) of the SRA-DM was superior to EndNote (sensitivity 51%, specificity 99.83%). Validation testing on three additional biomedical literature searches demonstrated that SRA-DM consistently achieved higher sensitivity than EndNote (90% vs 63%), (84% vs 73%) and (84% vs 64%). Furthermore, the specificity of SRA-DM was 100%, whereas the specificity of EndNote was imperfect (average 99.75%) with some unique records wrongly assigned as duplicates. Overall, there was a 42.86% increase in the number of duplicates records detected with SRA-DM compared with EndNote auto-deduplication.

I apologize if this suggestion is more intrusive than helpful.

dmilton · February 24, 2015

It is not clear to me why Zotero seems to be relying only on title and author. The fact that the two below, in different journals and years, are seen as the same is troubling. Is there some way I can correct this?

Wyon DP. The effects of moderate heat stress on typewriting performance. Arch Sci Physiol. 1973;27(4):499–509.

Wyon DP. The effects of moderate heat stress on typewriting performance. Ergonomics. 1974;17(3):309–318.

aurimas · February 24, 2015

No way to correct this currently. I also don't see this as such a critical error. It's not like Zotero automatically merges duplicates, it just displays them in the duplicate special collection. Yes, it's annoying that you have to ignore this false positive.

Let's appreciate, however, that this is a fairly exceptional case: the same author publishing an article with exactly the same title within one year (those are actually exactly the criteria Zotero uses). Yes, it's a different journal, but the reason we don't match on this metadata is that the form in which journal titles are scraped from the web varies widely (e.g. full vs. abbreviated), so matching on it would hide a lot of actual duplicates. One way I can see to improve this is to also check the ISSNs of the journals to determine whether they are different. There's one small problem with multiple ISSNs being entered in the field, but we can figure out how to resolve this.

So, in short, we should be able to fix the case you supply above.
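
To make the criteria above concrete, here is a rough sketch (not Zotero's actual code) of a title + creator + year check with the proposed ISSN tie-breaker. The function names, the dictionary layout, the title normalization, and the "share at least one creator last name" reading of "same author" are all assumptions; the ISSNs in the example are placeholders, not the journals' real ones.

```python
import re


def normalize_title(title: str) -> str:
    # Assumption: compare titles case-insensitively, ignoring punctuation.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def parse_issns(field: str) -> set:
    # The ISSN field may hold several values, so collect all of them.
    found = re.findall(r"\d{4}-?\d{3}[\dXx]", field)
    return {issn.replace("-", "").upper() for issn in found}


def last_names(item: dict) -> set:
    return {name.strip().lower() for name in item.get("creators", [])}


def possible_duplicate(a: dict, b: dict) -> bool:
    if normalize_title(a["title"]) != normalize_title(b["title"]):
        return False
    if not last_names(a) & last_names(b):
        return False
    if abs(a["year"] - b["year"]) > 1:
        return False
    issns_a = parse_issns(a.get("ISSN", ""))
    issns_b = parse_issns(b.get("ISSN", ""))
    # Only rule the pair out when both sides have ISSNs and none overlap;
    # a blank ISSN field gives no evidence either way.
    if issns_a and issns_b and issns_a.isdisjoint(issns_b):
        return False
    return True


title = "The effects of moderate heat stress on typewriting performance"
wyon_1973 = {"title": title, "creators": ["Wyon"], "year": 1973,
             "ISSN": "1111-1111"}  # placeholder ISSN
wyon_1974 = {"title": title, "creators": ["Wyon"], "year": 1974,
             "ISSN": "2222-2222"}  # placeholder ISSN

print(possible_duplicate(wyon_1973, wyon_1974))
```

With both ISSN fields filled in and disjoint, the pair is no longer flagged. If either field is blank, the check falls back to the title, creator, and year behaviour.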

dmilton · February 24, 2015

I agree that one has to wonder about people who publish something with identical titles. He actually has a third with the same title in 1975 that, although the type = Book Chapter (the others are type = Journal Article), is also seen by Zotero as a duplicate.

Wyon DP. The effects of moderate heat stress on typewriting performance. Prevision quantitative des effets physiologiques et psychologiques de l’environnement thermique chez l’homme. Paris: Paris, Centre National de la Recherche Scientifique; 1975. p. A499–509.

Does this mean that I need to make sure that the ISSN (and ISBN) fields are populated?

aurimas · February 24, 2015

Currently populating ISSN fields will not help, but we'll implement that in the near future (I hope). In any case, more complete metadata is always better than less complete, so I would certainly encourage you to populate those fields if possible.

Displaying duplicates for different item types is a bit different. On the one hand, Zotero doesn't always correctly identify item types, so displaying these as duplicates could help people correct such errors. On the other hand, Zotero currently doesn't provide convenient ways to merge such items. Fixing this case of false-positive will be more difficult and may have to wait for a general mechanism of marking items as non-duplicates.

dmilton · February 24, 2015

Thanks for the update -- is that general mechanism on the to do list? I got the impression from reading the forum that it was not likely to be forthcoming in the foreseeable future.

aurimas · February 24, 2015

I don't have an ETA. It's on the list, just not sure how far up.

dmilton · February 27, 2015

Here is an interesting example where Zotero is NOT identifying a duplicate:

1. Bolen AR, Henneberger PK, Liang X, Sama SR, Preusse PA, Rosiello RA, Milton DK. The validation of work-related self-reported asthma exacerbation. Occup Environ Med. 2007 May;64(5):343–348. PMCID: PMC2092554

2. Bolen AR, Henneberger PK, Liang X, Sama SR, Preusse PA, Rosiello RA, Milton DK. The Validation of Work-related Self-reported Asthma Exacerbation. Occup Environ Med. 2007;64:343–348.

bwiernik · February 27, 2015

Those two should be duplicates. Do both have data in the DOI field? Are they the same?

aurimas · February 27, 2015

Can you export both items as Zotero RDF and post the contents on https://gist.github.com ?

dmilton · March 1, 2015

Sorry: I was under deadline to file a "faculty activity report" with an updated syllabus for input to the Lyterati system the University just adopted and didn't have time to check back here. By now, I've manually merged the two items and cannot access the separate records. But, #2 probably had a blank DOI field.

Next, I need to figure out how to edit the NLM style so that I can get both PMCID and DOI in the bibliography.

adamsmith · March 1, 2015

Change the pmcid macro to:
<macro name="pmcid">
  <group delimiter=". " prefix=" ">
    <text variable="DOI" prefix="doi: "/>
    <text variable="PMCID" prefix="PMCID: "/>
    <choose>
      <!-- Show the PMID only when the item has no PMCID -->
      <if variable="PMCID" match="none">
        <text variable="PMID" prefix="PMID: "/>
      </if>
    </choose>
  </group>
</macro>

LizDownes · March 4, 2015

Hi, I use Zotero with historic newspaper articles. This means I have numerous articles from the same newspaper with the same title but with different dates.
For example:
Title: Shipping Intelligence. Hobsons Bay
Publication: The Argus
Date: 21 August 1869

Title: Shipping Intelligence. Hobsons Bay
Publication: The Argus
Date: 21 October 1869

or titled TO THE EDITOR, or Local News. Or a run of letters with the same heading but each with a slightly different date. They all come up as duplicates.

adamsmith · March 4, 2015

yeah, I don't think that can be avoided with auto detection.

DO_part73 · June 19, 2017

Hi, I'm having a similar issue: two articles with exactly the same title but they only share one author (and the author of the single-author article is not the 1st author of the 3-authors article). All the other information (journal, etc.) is different. Zotero detects them as duplicates.

It's like having one article:
Author1. [TITLE].
and the "duplicate"
Author2, Author1, Author3. [TITLE].

Is there a way to specify which fields should be used to detect duplicates?

DO_part73 · February 28, 2018

I have lots of false duplicates. I'm convinced the title+DOI+ISBN is a bad strategy (especially because the duplicate is true if DOI/ISBN are empty! this means that if for whatever reason DOI/ISBN information is missing, then if the title is similar they are duplicates, right?).

So I end up having a lot of (wrong) items in the "Duplicate Items" section, which I don't pay attention to any more. This is a pity, and somewhat defeats the purpose of this good tool, whose aim is to "automatically find duplicates".

I'm not sure that "mark as non-duplicates" is a straightforward option, because I can't see how it would be implemented: would Zotero include a flag somewhere in each of the items saying that X is not a duplicate of Y for each case? Then it has to store maybe thousands of these flags... I think the most elegant solution is to give the user the capability of defining which fields (and in which order) Zotero should look at to decide whether two items are duplicates. Then the rules are fixed and Zotero can rebuild the Duplicate Items list as many times as necessary.

I hope this makes sense. Thanks for this great software!
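
For what it's worth, the "X is not a duplicate of Y" flags asked about above could be kept as a set of unordered ID pairs rather than as a flag on each item. The sketch below is purely hypothetical (the class and function names are invented, and nothing here reflects how Zotero stores or plans to store this), but it suggests that a few thousand such pairs would be a small amount of data.

```python
class NonDuplicateRegistry:
    """Hypothetical store of item-ID pairs the user has declared distinct."""

    def __init__(self):
        self._pairs = set()

    def mark(self, id_a, id_b):
        # frozenset makes the pair unordered: (1, 2) and (2, 1) are the same.
        self._pairs.add(frozenset((id_a, id_b)))

    def is_marked(self, id_a, id_b):
        return frozenset((id_a, id_b)) in self._pairs


def unresolved_pairs(candidate_pairs, registry):
    """Drop candidate duplicate pairs that the user has already rejected."""
    return [(a, b) for a, b in candidate_pairs
            if not registry.is_marked(a, b)]


registry = NonDuplicateRegistry()
registry.mark(101, 102)  # made-up item IDs

candidates = [(102, 101), (102, 103)]
print(unresolved_pairs(candidates, registry))
```

The detector itself stays unchanged in this design; the registry only filters its output.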

mark · February 28, 2018

Like you, I have a lot of false duplicates (any sufficiently large library will). See Dan's post in this earlier thread — in brief, marking as non-duplicates is indeed non-trivial but planned (though the issue hasn't seen activity since Nov 2016); but customising the duplicate detection rules themselves would seem to be a bit more complex. It would be lovely though.

DO_part73 · February 28, 2018

OK, I get it. So let's say there are two options being considered here: option 1 is "mark as non-duplicate", option 2 is changing how Zotero defines a 'duplicate' (i.e. the user defines which fields Zotero compares to decide if there is a duplicate or not).

Here is a powerful reason why option 1 is worse than option 2: for option 1 the developers need to change the *database structure* (e.g. adding the "non-duplicate of" field), while in option 2 the developers simply change the *algorithm* of comparisons, using the database structure exactly as it is now.

In my opinion this alone should lead the developers to choose option 2. But maybe I'm missing something.

Any option would be welcome of course. Cheers
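
As an illustration of option 2, here is a small sketch in which the list of compared fields is a user setting. Everything in it (`find_duplicate_groups`, the dictionary layout, the item IDs) is hypothetical and only meant to show the idea, using the newspaper example from earlier in the thread.

```python
from collections import defaultdict


def find_duplicate_groups(items, fields):
    """Group item IDs whose values agree on every user-chosen field."""
    groups = defaultdict(list)
    for item in items:
        key = tuple(str(item.get(f, "")).strip().lower() for f in fields)
        groups[key].append(item["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


items = [
    {"id": 1, "title": "Shipping Intelligence. Hobsons Bay",
     "publication": "The Argus", "date": "21 August 1869"},
    {"id": 2, "title": "Shipping Intelligence. Hobsons Bay",
     "publication": "The Argus", "date": "21 October 1869"},
]

# Comparing title and publication only flags the two articles...
print(find_duplicate_groups(items, ["title", "publication"]))
# ...while adding the date to the compared fields separates them.
print(find_duplicate_groups(items, ["title", "publication", "date"]))
```

Note that exact matching on more fields also hides real duplicates whose metadata differs slightly (for example abbreviated journal titles), which is the trade-off mentioned earlier in the thread.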