Mark as non-duplicate
I have a couple of items that are quite similar - two presentations on similar topics, that share authors (different date, different presentation contents). Because the titles and authors are basically the same, Zotero keeps telling me that they are possible duplicates. I would like to be able to remove them from the list of duplicates without merging them. Would it be possible to add a blacklist for particular pairs of items, so that they are never marked as possible duplicates?
Abramowitz, G., 2013. The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER).
Best, M.J., 2014. The PALS Land sUrface Model Benchmarking Evaluation pRoject (PLUMBER).
They are both presentations, and are named the same because that's the name of the experiment. They have the same authors (whole group), but in a different order (presenter first). They have different dates, places and meeting names. And the actual documents are probably only about 20% similar.
I think my problem would be solved if the duplicate detector took "meeting name" into account - it seems fairly unlikely that anyone would ever present the same document at the same meeting.
Similar issues here. Not sure if these should get flagged as duplicates according to the current code or not:
I have several email items where I have stored several people's comments on a paper. The title and date of the items are the same, and they contain the same list of creators. However, the creators are in a different order (and are different types) and the abstracts (where the email text is stored) are different.
I have several journal articles with one or two (of several) of the authors the same across many articles, the same title and publication, but all different years, volumes, page numbers, etc.
I have a paper published in a conference proceedings and then its abstract published in a journal. They have the same titles, authors, and years, but one is a conference proceeding and one is a journal article (different publication titles).
I have two sets of the previous issue from the same conference with the same authors (the two sets have different titles). All four of these are flagged as being duplicates together.
Maynard, D. W., & Heritage, J. (2005). Conversation analysis, doctor-patient interaction and medical communication. In L. T. Reynolds & N. J. Herman-Kinney (Eds.), The handbook of symbolic interactionism (Vols. 1-Book, 1-Section, Vol. 39, pp. 428–435). Rowman Altamira.
Maynard, D. W., & Heritage, J. (2005). Conversation analysis, doctor-patient interaction and medical communication. Medical Education, 39(4), 428–435. doi:10.1111/j.1365-2929.2005.02111.x
Hsu, V., Montaquila, J. M., & Brick, J. M. (2010). Using a Match Rate Model to Predict Areas Where USPS-Based Address Lists May Be Used in Place of Traditional Listing. In Proceedings of the Survey Research Methods Section, American Statistical Association. Retrieved from http://www.amstat.org/Sections/Srms/Proceedings/y2010/Files/306727_57064.pdf
Montaquila, J. M., Hsu, V., & Brick, J. M. (2011). Using a “Match Rate” Model to Predict Areas Where USPS-Based Address Lists May Be Used in Place of Traditional Listing. Public Opinion Quarterly, 75(2), 317–335. doi:10.1093/poq/nfr008
[1] F.H. Hurley, Electrodeposition of Aluminum, 2446331, 1948. http://www.google.co.uk/patents/US2446331.
[2] T.P. Wier, F.H. Hurley, Electrodeposition of Aluminum, 2446349, 1948. http://www.google.com/patents/US2446349.
[3] T.P. Wier, Electrodeposition of Aluminum, 2446350, 1948. http://www.google.com/patents/US2446350.
Constrained school choice
Type Journal Article
Author Guillaume Haeringer
Author Flip Klijn
URL http://linkinghub.elsevier.com/retrieve/pii/S002205310900057X
Volume 144
Issue 5
Pages 1921-1947
Publication Journal of Economic Theory
ISSN 00220531
Date 9/2009
Constrained school choice : an experimental study
Type Journal Article
Author Caterina Calsamiglia
Author Guillaume Haeringer
Author Flip Klijn
URL http://linkinghub.elsevier.com/retrieve/pii/S002205310900057X
Volume 144
Issue 5
Pages 1921-1947
Publication American Economic Review
Date September 2009
The volume, issue, and page range info is just wrong for the AER article (it's actually 100(4): 1860-74) and is, in fact, duplicated from the JET one.
That Zotero would guess that this much overlap can't be a coincidence makes sense--and it's right about it, too.
I have fixed these and the two papers are no longer considered as duplicates by Zotero.
Zotero thinks these are all duplicates. Is that expected? Does it just check for title and year? If so, is there any way to change the criteria to add at least one other field as a differentiator?
Maybe the duplicate criteria should take into account the number of blank fields and/or conflicting fields in some way. If there are more non-blank conflicting fields than matching fields, it seems pretty unlikely that the two items are duplicates.
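For what it's worth, here is a minimal Python sketch of that idea. It is not Zotero's actual code: the function name `likely_duplicates`, the field names, and the simple case-insensitive comparison are all assumptions made for illustration. The sample values are taken from two pairs of references posted in this thread.

```python
# Hypothetical sketch, not Zotero's implementation. Field names are assumptions.
COMPARED_FIELDS = ("title", "publication", "date", "volume", "pages")


def likely_duplicates(a, b, fields=COMPARED_FIELDS):
    """Return False when non-blank conflicting fields outnumber matching ones."""
    matching = 0
    conflicting = 0
    for field in fields:
        value_a = (a.get(field) or "").strip().lower()
        value_b = (b.get(field) or "").strip().lower()
        if not value_a or not value_b:
            # A blank on either side counts neither for nor against.
            continue
        if value_a == value_b:
            matching += 1
        else:
            conflicting += 1
    return conflicting <= matching


wyon_title = "The effects of moderate heat stress on typewriting performance"
wyon_1973 = {"title": wyon_title, "publication": "Arch Sci Physiol",
             "date": "1973", "volume": "27", "pages": "499-509"}
wyon_1974 = {"title": wyon_title, "publication": "Ergonomics",
             "date": "1974", "volume": "17", "pages": "309-318"}

bolen_a = {"title": "The validation of work-related self-reported asthma exacerbation",
           "publication": "Occup Environ Med", "date": "2007 May",
           "volume": "64", "pages": "343-348"}
bolen_b = {"title": "The Validation of Work-related Self-reported Asthma Exacerbation",
           "publication": "Occup Environ Med", "date": "2007",
           "volume": "64", "pages": "343-348"}

print(likely_duplicates(wyon_1973, wyon_1974))  # False: 1 match, 4 conflicts
print(likely_duplicates(bolen_a, bolen_b))      # True: 4 matches, 1 conflict
```

Note that the two Bolen records still "conflict" on the date ("2007 May" vs "2007"), so a real implementation would probably want to normalize values before comparing them. This check is also only meant as a second pass over pairs that were already flagged; on its own it would accept two items with no comparable fields at all.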
One of the key parts of this effort is identifying duplicate articles in a database.
The PHP / MySQL code is at https://github.com/CREBP/SRA
While I am decidedly _not_ a programmer, it looks as though the algorithms could be useful for Zotero or for a Zotero plug-in. My own database programmers at SafetyLit.org are pleased with what they see and are using these scripts to improve our duplicate detection process.
An interesting article in the BioMed Central journal Systematic Reviews ( http://www.ncbi.nlm.nih.gov/pubmed/25588387 ) found I apologize if this suggestion is more intrusive than helpful.
Wyon DP. The effects of moderate heat stress on typewriting performance. Arch Sci Physiol. 1973;27(4):499–509.
Wyon DP. The effects of moderate heat stress on typewriting performance. Ergonomics. 1974;17(3):309–318.
Let's appreciate, however, that this is a fairly exceptional case: the same author publishing an article with exactly the same title within one year (those are actually exactly the criteria Zotero uses). Yes, it's a different journal, but the reason we don't match on this metadata is that the form in which journal titles are scraped from the web varies widely (e.g. full vs. abbreviated), so requiring a match would hide a lot of actual duplicates. One way I can see to improve this is to also check the ISSNs of the journals to determine whether they are different. There's one small problem with multiple ISSNs being entered in the field, but we can figure out how to resolve this.
So, in short, we should be able to fix the case you supply above.
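To make the ISSN idea a bit more concrete, here is a rough Python sketch of one way the check could handle several ISSNs in a single field. This is only an illustration under my own assumptions, not what Zotero does or will do: the helper names `parse_issns` and `issns_conflict` are made up, and `1234-567X` is a placeholder value rather than a real journal's ISSN. The value `00220531` is the ISSN from the Journal of Economic Theory item earlier in this thread.

```python
import re

# An ISSN is eight characters: seven digits plus a check character (digit or X),
# usually written with a hyphen in the middle. The hyphen is optional here.
ISSN_PATTERN = re.compile(r"\b(\d{4})-?(\d{3}[\dXx])\b")


def parse_issns(field):
    """Extract every ISSN from a free-text field, normalized to 8 characters."""
    return {first + second.upper() for first, second in ISSN_PATTERN.findall(field or "")}


def issns_conflict(field_a, field_b):
    """Return True only if both fields contain ISSNs and none are shared.

    A missing ISSN on either side is treated as "unknown", not as a conflict,
    so items without ISSNs are still compared on the other criteria.
    """
    issns_a = parse_issns(field_a)
    issns_b = parse_issns(field_b)
    if not issns_a or not issns_b:
        return False
    return issns_a.isdisjoint(issns_b)


print(issns_conflict("00220531", "0022-0531"))             # False: same ISSN, different form
print(issns_conflict("0022-0531, 1234-567X", "1234567x"))  # False: one ISSN in common
print(issns_conflict("0022-0531", "1234-567X"))            # True: both present, none shared
print(issns_conflict("", "0022-0531"))                     # False: nothing to compare
```

Treating the field as a set and only reporting a conflict when the two sets are disjoint is one way around the multiple-ISSN problem: a journal's print and online ISSNs can both sit in the field without causing a false mismatch.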
Wyon DP. The effects of moderate heat stress on typewriting performance. Prevision quantitative des effets physiologiques et psychologiques de l’environnement thermique chez l’homme. Paris: Paris, Centre National de la Recherche Scientifique; 1975. p. A499–509.
Does this mean that I need to make sure that the ISSN (and ISBN) fields are populated?
Displaying duplicates for different item types is a bit different. On the one hand, Zotero doesn't always correctly identify item types, so displaying these as duplicates could help people correct such errors. On the other hand, Zotero currently doesn't provide convenient ways to merge such items. Fixing this kind of false positive will be more difficult and may have to wait for a general mechanism for marking items as non-duplicates.
1. Bolen AR, Henneberger PK, Liang X, Sama SR, Preusse PA, Rosiello RA, Milton DK. The validation of work-related self-reported asthma exacerbation. Occup Environ Med. 2007 May;64(5):343–348. PMCID: PMC2092554
2. Bolen AR, Henneberger PK, Liang X, Sama SR, Preusse PA, Rosiello RA, Milton DK. The Validation of Work-related Self-reported Asthma Exacerbation. Occup Environ Med. 2007;64:343–348.
Next, I need to figure out how to edit the NLM style so that I can get both PMCID and DOI in the bibliography.
<macro name="pmcid">
  <group delimiter=". " prefix=" ">
    <text variable="DOI" prefix="doi: "/>
    <text variable="PMCID" prefix="PMCID: "/>
    <choose>
      <if variable="PMCID" match="none">
        <text variable="PMID" prefix="PMID: "/>
      </if>
    </choose>
  </group>
</macro>
ie
Title: Shipping Intelligence. Hobsons Bay
Publication: The Argus
Date: 21 August 1869
Title: Shipping Intelligence. Hobsons Bay
Publication: The Argus
Date: 21 October 1869
or titled TO THE EDITOR, or Local News. Or a run of letters with the same heading but each with a slightly different date. They all come up as duplicates.
It's like having one article:
Author1. [TITLE].
and the "duplicate"
Author2, Author1, Author3. [TITLE].
Is there a way to specify which fields should be used to detect duplicates?
So I end up having a lot of (wrong) items in the "Duplicate Items" section, which I don't pay attention to any more. This is a pity, and somewhat defeats the purpose of this good tool, whose aim is to "automatically find duplicates".
I'm not sure if "mark as non-duplicates" is a straightforward option, because I can't see how it would be implemented: would Zotero include a flag somewhere in each of the items saying that X is not a duplicate of Y for each case? Then it has to store maybe thousands of these flags... I think the most elegant solution is to give the user the capability of defining which fields (and in which order) Zotero should look into to decide if two items are duplicates or not. Then the rules are fixed and Zotero can rebuild the Duplicate Items list as many times as necessary.
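To illustrate what the "flag" approach would involve, here is a small Python sketch. It is purely hypothetical and says nothing about Zotero's actual database: the class `NonDuplicateRegistry` and its methods are names I made up. The point is only that the flags would not need to live on each item; they can be one separate collection of unordered ID pairs that is consulted after the normal duplicate search.

```python
class NonDuplicateRegistry:
    """Hypothetical store of item-ID pairs the user has marked as non-duplicates."""

    def __init__(self):
        # Each entry is a frozenset of two item IDs, so (X, Y) and (Y, X)
        # are the same entry and only one record per pair is stored.
        self._pairs = set()

    def mark(self, id_a, id_b):
        """Record that id_a and id_b are not duplicates of each other."""
        self._pairs.add(frozenset((id_a, id_b)))

    def is_marked(self, id_a, id_b):
        """Check whether this pair was marked, in either order."""
        return frozenset((id_a, id_b)) in self._pairs

    def filter_pairs(self, candidate_pairs):
        """Drop marked pairs from the detector's list of candidate pairs."""
        return [(a, b) for a, b in candidate_pairs if not self.is_marked(a, b)]


registry = NonDuplicateRegistry()
registry.mark(101, 205)  # item IDs here are arbitrary example numbers

candidates = [(205, 101), (101, 300)]
print(registry.filter_pairs(candidates))  # [(101, 300)]
```

Even with thousands of marked pairs this is a cheap lookup, but it does mean one extra piece of stored data that has to be synced and cleaned up when items are deleted.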
I hope this makes sense. Thanks for this great software!
Here is a powerful reason why option 1 is worse than option 2: for option 1 the developers need to change the *database structure* (e.g. adding the "non-duplicate of" field), while in option 2 the developers simply change the *algorithm* of comparisons, using the database structure exactly as it is now.
In my opinion this alone should lead the developers to choose option 2. But maybe I'm missing something.
Any option would be welcome of course. Cheers