Option to disable disambiguation?

fbennett · September 9, 2011

This is a continuation from the papercuts thread here:

http://forums.zotero.org/discussion/19556?page=1#Item_4

The issue is whether name disambiguation should be controlled through the UI, and whether it should be disabled by default.

fbennett · September 9, 2011

mronkko raises the following example:

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. London: Lawrence Erlbaum Associates.

If I use Zotero to cite this book using the APA style, the in-text citation is the following:

(J. Cohen, Cohen, West, & Aiken, 2003)

However, what this journal uses is the non-disambiguated version:

(Cohen, Cohen, West, & Aiken, 2003)

This is related to an earlier discussion, which ended when adamsmith posted confirmation from the APA that the initial should be included here, even though there is no clash with other cited primary authors.

I agree that it's a weird rule; the initial in this case contributes nothing to disambiguation, and including it is not natural, even if you know the general rule. (I have a patch for citeproc-js ready to go that would restrict primary-name disambiguation to clashes on primary names only, but I filed it in the attic when the APA response came through.)

What mronkko suggests sounds very possible: given the obscurity of the official APA position, we may have a de facto split between a "shy" and "aggressive" form of the rule. If that is the case, we should (additionally) support the "shy" rule in CSL.

That's a separate point from the main topic of discussion, but it's related: I think that disabling name disambiguation is not a good idea, because it can produce errors in the document (ambiguous cites) that are not visually obvious. If there are rules in the style, they should be applied, and if the rules need adjusting, the thing to do is to adjust them.

adamsmith · September 9, 2011

I've actually thought about this - in the example APA gave me, they actually had two different _first authors_ with the same last name. That seems to me to be a more reasonable rule. Does someone have APA at hand? The relevant passage is p. 176.

fbennett · September 9, 2011

Aha! If someone can confirm, that would be great.

fbennett · September 10, 2011

Here we go. An entry on the APA blog presents example cites where the primary author occurs only once in the document, with authors having the same family name and a different initial in secondary position. In the example, the primary author does not receive an initial.

I'll commit a change that suppresses the initial on the primary author in this case.

(This discussion makes me feel all the more strongly that we shouldn't lightly adopt suppression of name disambiguation, since mronkko's clear and specific complaint has ferreted out a fault in the processor logic, and fixing it will benefit everyone.)

adamsmith · September 10, 2011

great, thanks for tracking that down and sorry for the confusion I caused back in the other thread.

adamsmith · September 10, 2011

I would like to bring back up my idea to use disambiguation by cite (i.e. initials/first names are only added when citations would otherwise be identical) more widely and as the de-facto default.
1. Do other people agree that's a good idea in general?
2. If so, what would be a practical way to implement it?

mronkko · September 10, 2011

I would like to bring back up my idea to use disambiguation by cite (i.e. initials/first names are only added when citations would otherwise be identical) more widely and as the de-facto default.
1. Do other people agree that's a good idea in general?

I do not agree with this idea. Restricting the disambiguation to work this way would not be compatible with how the APA style is currently used in many journals.

See e.g. how the two different Smiths are cited in this article using the APA style http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.6787&rep=rep1&type=pdf

Most of the problems caused by the disambiguation are a result of having the same author name stored in different forms for different citations. For example, a paper written by me could have any one of the following forms for the author name:

Ronkko, M
Ronkko, M. V.
Ronkko, Mikko
Ronkko, Mikko V.

The current version of Zotero treats these as different authors. If I cite four test articles, each using a different form, the result is this:

(M. Ronkko, 2001; M. V. Ronkko, 2002; Mikko Ronkko, 2003; Mikko V. Ronkko, 2004)

Because different databases store the same author in different ways, this case is commonly encountered when you are just building your reference database. This is very confusing to new users of Zotero, who do not know why the author's first name is included in the citation. One way to solve what I think is a usability issue in the name disambiguation would be to consider authors to be the same if one name can be an abbreviated version of the other.
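As a rough illustration of the "abbreviated version" test suggested above, here is a minimal Python sketch. The helper names (`tokens`, `token_matches`, `could_abbreviate`) are hypothetical and are not taken from Zotero or citeproc-js, and the sketch assumes that trailing given names may be dropped but leading ones may not.

```python
import re

def tokens(given):
    """Split a given-name string into lowercase tokens, dropping periods."""
    return [t.lower() for t in re.split(r"[\s.]+", given) if t]

def token_matches(short, full):
    """A single-letter token matches any token starting with that letter;
    longer tokens must match exactly."""
    if len(short) == 1:
        return full.startswith(short)
    return short == full

def could_abbreviate(short_given, full_given):
    """Return True if short_given could be an abbreviated form of full_given.

    Assumption (not from the thread): trailing given names may be omitted
    ("M" for "Mikko V."), but a name in the middle may not be skipped.
    """
    s, f = tokens(short_given), tokens(full_given)
    if len(s) > len(f):
        return False
    return all(token_matches(a, b) for a, b in zip(s, f))

# The four forms listed above, compared against the fullest one.
for form in ["M", "M. V.", "Mikko", "Mikko V."]:
    print(form, could_abbreviate(form, "Mikko V."))
```

Under these assumptions all four forms are accepted as abbreviations of "Mikko V.", while a form with a different or extra initial is not.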

What do other people think about this idea?

I marked number 8 on the papercuts list with strikethrough for now.

adamsmith · September 10, 2011

I do not agree with this idea. Restricting the disambiguation to work this way would not be compatible with how the APA style is currently used in many journals.

You're misunderstanding what I'm suggesting. I'm talking about the default behavior of Zotero styles of which we don't know the actual requirements. That's not the case for APA: the style is already set to givenname-disambiguation-rule="primary" and should obviously stay that way (and the fix that Frank is implementing applies to that setting of the rule).
But the vast majority of author-date styles in Zotero do not have a givenname disambiguation rule specified and thus default to disambiguating all (visible) names (the only available behavior under CSL 0.8.1), which may very well be required by some styles, but is certainly not the most common behavior.

fbennett · September 11, 2011

I don't have a view on the best default disambig rule. The "by-cite" rule is more robust now than it was a few months ago, and it may be less puzzling to users. The CSL specification calls for "all-names" as the default, so the CSL spec tracker is probably the place for discussions if you want to make the change.

adamsmith · September 11, 2011

posted here: https://github.com/citation-style-language/schema/issues/76

I see mronkko's point about inconsistent databases. I don't have much to add, except that the problem would also become _much_ rarer with a "by-cite" default, since it would only affect citations by the same author(s) from the same year with inconsistent first names in Zotero.

mronkko · September 11, 2011

I posted a feature request related to inconsistent databases here

https://bitbucket.org/fbennett/citeproc-js/issue/130/name-disambiguation-should-be-robust-for

I believe that these two separate issue reports will address all the reasons why I think that name disambiguation (as currently implemented in Zotero) should be a feature that the user can disable.

fbennett · September 11, 2011

Thanks for filing the ticket, Mikko. The thread here will reach a larger audience, but the issue won't get lost in the shuffle now.

On the suggestion to disambiguate names only if the first-position initial differs, I suppose the obvious example would be "G.W. Bush" versus "G.H.W. Bush".

adamsmith · September 11, 2011

One intermediate solution would be to just treat initials and full first names as equivalent, i.e.
G. W. Bush = George W. Bush = George Walker Bush != George Bush != G. H. W. Bush
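The equivalence in the line above can be sketched in Python as follows. This is only one possible reading: the function names are hypothetical, and the sketch assumes that two given names match when they have the same number of parts and each pair of parts agrees either in full or on the initial.

```python
import re

def given_tokens(given):
    """Split a given-name string into lowercase tokens, dropping periods."""
    return [t.lower() for t in re.split(r"[\s.]+", given) if t]

def same_person_guess(given_a, given_b):
    """Treat an initial and a spelled-out name as equivalent, position by position.

    Assumption: the number of given-name parts must be equal, so
    "George" never matches "George W." and "G. W." never matches "G. H. W.".
    """
    a, b = given_tokens(given_a), given_tokens(given_b)
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if len(x) == 1 or len(y) == 1:
            # At least one side is an initial: compare first letters only.
            if x[0] != y[0]:
                return False
        elif x != y:
            # Both sides are spelled out: they must agree exactly.
            return False
    return True

print(same_person_guess("G. W.", "George Walker"))   # the "=" cases above
print(same_person_guess("George W.", "George"))      # the "!=" cases above
print(same_person_guess("G. W.", "G. H. W."))
```

Note that a relation defined this way is not transitive: "G. W." would match both "George W." and "Gerald W.", although those two do not match each other.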

fbennett · September 11, 2011

I had the same thought after posting. Let me do some thinking out loud.

The effect would be to completely disable full-name disambiguation in these pairs, even in rule sets that call for it as the final step. Names would either disambiguate on the initials, or not at all. You can get a similar effect by applying "all-names-with-initials" or "primary-name-with-initials". If that is the desired result, those rules can be applied in the style without any changes to the processor. Thinking further ...

(1) Under either a "-with-initials" rule or the proposal, the following two names would not be distinguished:

Jesse Jackson -> J Jackson (final clash) -> Jackson
J Jackson -> J Jackson (final clash) -> Jackson

(2) Under the "-with-initials" rules, you would get the same result if the full names are present:

Janet Jackson -> J Jackson (final clash) -> Jackson
Jesse Jackson -> J Jackson (final clash) -> Jackson

(3) Putting the two samples together, and assuming all-names disambiguation, would this be the intended result under the proposal?

Jesse Jackson -> J Jackson (clash, but abort) -> Jackson
J Jackson -> J Jackson (clash, but abort) -> Jackson

(4) But:

Jesse Jackson -> J Jackson (clash) -> Jesse Jackson
Janet Jackson -> J Jackson (clash) -> Janet Jackson
J Jackson -> J Jackson (clash) -> J Jackson

You can see the difficulty. Whether expansion is limited to initials for the final comparison depends on whether more than one full (non-initials) name exists in the clash set. That's possible, but it would be very hard to get right, and explaining the logic when there are questions ("I add this one name, and suddenly initials and names appear all over my document!") would not be much fun. It would also suggest that having "J Jackson" and "Janet Jackson" coexist in the bibliography is acceptable, which it's not, in any style, unless the authors really are different people, and one of them is firm about always being represented by their initial only -- in which case the result at (3) would be incorrect, and the correct result impossible to produce.
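Cases (1), (2) and (4) above can be reproduced with a toy model in Python. This is a deliberately simplified sketch, not the logic of citeproc-js: it treats the clash set as a whole, assumes every entry is a distinct name form, and the function names are made up for illustration.

```python
def initial_form(given, family):
    """Reduce the given name to its first letter, e.g. ("Jesse", "Jackson") -> "J Jackson"."""
    return f"{given[0]} {family}" if given else family

def render_clash_set(names, full_names_allowed):
    """Render a list of (given, family) names that clash on family name alone.

    full_names_allowed=False models a "-with-initials" rule;
    full_names_allowed=True models a rule whose final step expands
    to the full given name.
    """
    with_initials = [initial_form(g, f) for g, f in names]
    if len(set(with_initials)) == len(names):
        return with_initials                      # initials were enough
    if full_names_allowed:
        return [f"{g} {f}" for g, f in names]     # expand to what is stored
    return [f for _, f in names]                  # final clash: bare family names

# Case (1): "-with-initials" rule, one full name and one initial.
print(render_clash_set([("Jesse", "Jackson"), ("J", "Jackson")], False))
# Case (4): full-name expansion with three forms in the clash set.
print(render_clash_set(
    [("Jesse", "Jackson"), ("Janet", "Jackson"), ("J", "Jackson")], True))
```

The proposal at (3) would need a third branch whose behavior depends on how many spelled-out names are in the set, which is where the difficulty described above comes in.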

A less painful alternative would be to set "all-names-with-initials" as the default rule, except for styles that explicitly require full-name expansion (like Chicago).

Another route might be to provide a full rule-selection mechanism in Document Preferences (maybe even with "disable" as an option, I guess). That would at least embed guidance on the effect of the rules in the application where people are more likely to see it.

mronkko · September 12, 2011

You can see the difficulty. Whether expansion is limited to initials for the final comparison depends on whether more than one full (non-initials) name exists in the clash set. That's possible, but it would be very hard to get right, and explaining the logic when there are questions ("I add this one name, and suddenly initials and names appear all over my document!") would not be much fun. It would also suggest that having "J Jackson" and "Janet Jackson" coexist in the bibliography is acceptable, which it's not, in any style, unless the authors really are different people, and one of them is firm about always being represented by their initial only -- in which case the result at (3) would be incorrect, and the correct result impossible to produce.

These two issues are already present in the current implementation: Adding one reference with initial where others are with full first name will cause the names to suddenly appear in all citations when one is added. Also the J Jackson and Janet Jackson issue is already present.

Another route might be to provide a full rule-selection mechanism in Document Preferences (maybe even with "disable" as an option, I guess). That would at least embed guidance on the effect of the rules in the application where people are more likely to see it.

I like this idea. Not because it allows for disabling name disambiguation, but because it tells the user about name disambiguation and a little bit of how it works. Also for the rare case of someone who prefers to use initial only as the first name, there could be an option to use strict disambiguation. However, I think that these instances would be extremely rare.

I gave this a bit of thought when driving to work. I agree that it quickly gets complex and is difficult to get right. (I also took a look at the citeproc code, but since it had very few comments I was not able to really understand it, so the following is not based on the existing implementation.) One way to decide whether authors with the same last name should be disambiguated would be:

1) Give each author name a score: A name gets +10 points for each name that is spelled out and +1 point for each initial.

2) Form a list of all author pairs and sort these in descending order based on the sum of the scores

3) Go through the list of author pairs one by one
3.1) Make the names comparable: If one of the authors does not have a middle name, remove that from the other as well, and if a name for one author is only an initial, make it an initial for the other author as well.
3.2) When the names have been reduced to most complex comparable form, check if they are identical. If this check is true, mark the authors as potential duplicates

4) Go through a list of authors ordered in ascending order by how many potential duplicates they have excluding authors with zero potential duplicates
4.1) If an author has only one potential duplicate and that duplicate has higher score than the original author name, replace the original name with the name of the duplicate. Then refresh the list of potential duplicates and start iterating the list again from start (go back to 4).

5) After the step 4 completes, apply the current disambiguation rules.
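Steps 1 to 3 of the procedure above can be sketched in Python as follows. This is one interpretation only: the function names are hypothetical, "does not have a middle name" is read as "truncate both names to the shorter number of parts", and steps 4 and 5 (merging and then applying the style's rule) are left out.

```python
import re
from itertools import combinations

def parts(given):
    """Split a given-name string into parts, dropping periods."""
    return [p for p in re.split(r"[\s.]+", given) if p]

def score(given):
    """Step 1: +10 for each spelled-out name, +1 for each initial."""
    return sum(1 if len(p) == 1 else 10 for p in parts(given))

def comparable(given_a, given_b):
    """Step 3.1: reduce both names to the most complex comparable form."""
    a, b = parts(given_a), parts(given_b)
    n = min(len(a), len(b))          # drop parts the other name lacks
    out_a, out_b = [], []
    for x, y in zip(a[:n], b[:n]):
        if len(x) == 1 or len(y) == 1:
            x, y = x[0], y[0]        # one side is an initial: reduce both
        out_a.append(x.lower())
        out_b.append(y.lower())
    return out_a, out_b

def potential_duplicates(givens):
    """Steps 2, 3 and 3.2: visit pairs by descending combined score and
    keep the pairs whose comparable forms are identical."""
    pairs = sorted(combinations(givens, 2),
                   key=lambda p: score(p[0]) + score(p[1]),
                   reverse=True)
    flagged = []
    for a, b in pairs:
        form_a, form_b = comparable(a, b)
        if form_a == form_b:
            flagged.append((a, b))
    return flagged

for pair in potential_duplicates(["M", "M. V.", "Mikko", "Mikko V."]):
    print(pair)
```

With the four forms of the same surname discussed earlier in the thread, every pair is flagged as a potential duplicate, and the highest-scoring pair ("Mikko", "Mikko V.") comes first.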

fbennett · September 12, 2011

These two issues are already present in the current implementation: Adding one reference with initial where others are with full first name will cause the names to suddenly appear in all citations when one is added. Also the J Jackson and Janet Jackson issue is already present.

Yes, but the problem is not currently masked from view for a bit. The proposed solution wouldn't actually solve the problem, just somewhat reduce the number of cases where it is noticed, and slightly raise the number of steps needed to recover when it manifests itself.

fbennett · September 12, 2011

Also the J Jackson and Janet Jackson issue is already present.

The point there was that this might actually be the output that the user wants, but the masking algorithm would make it impossible to produce.

mronkko · September 12, 2011

Yes, but the problem is not currently masked from view for a bit. The proposed solution wouldn't actually solve the problem, just somewhat reduce the number of cases where it is noticed, and slightly raise the number of steps needed to recover when it manifests itself.

The problem that my proposed solution addresses is the issue of erroneously disambiguating the names of two authors (J Jackson and Janet Jackson) in the case that these are the same person.

I did not really understand the point about raising the number of steps needed to recover.

The point there was that this might actually be the output that the user wants, but the masking algorithm would make it impossible to produce.

The case where J Jackson and Janet Jackson is what the user really wants is extremely rare. My preferred solution would be to by default assume that these are the same person and leave the user an option to use the more strict name disambiguation rule if desired.

mronkko · October 3, 2011

Here is another thread where this problem is reported

http://forums.zotero.org/discussion/18213/slow/#Item_3

spaetz · December 13, 2011

Interesting thread. Suffering from a co-author who is sometimes Georg von Krogh, and sometimes Georg Fridrich von Krogh in the articles, but is really the same person, so I am interested in a solution too.

One more proposal to solve this:

Combine it with a sort-order field, and use that to determine name disambiguation. It would usually default to "family name, first name" if not specified. For 2 articles from M. Ronkko and Mikko V. Ronkko, one could set the sort-order field to "Ronkko, Mikko V." for both cases, so we know that this is really the same person. It would at the same time allow me to have my coauthor sorted under "k" where he prefers to be: I would set the field to "Krogh, von, Georg F." or something for all cases of this person.
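A minimal Python sketch of how such a field could double as an identity key is shown below. The `sort_key` field and the helper names are hypothetical; no field like this is described in the thread as existing in Zotero.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Creator:
    family: str
    given: str
    sort_key: Optional[str] = None   # hypothetical user-supplied field

def identity_key(creator):
    """Use the sort-order field if set, else default to "family, given"."""
    if creator.sort_key is not None:
        return creator.sort_key
    return f"{creator.family}, {creator.given}"

def same_author(a, b):
    """Two creators count as one person when their identity keys are equal."""
    return identity_key(a) == identity_key(b)

short = Creator("Ronkko", "M", sort_key="Ronkko, Mikko V.")
full = Creator("Ronkko", "Mikko V.")   # the default key already matches
print(same_author(short, full))

coauthor = Creator("von Krogh", "Georg", sort_key="Krogh, von, Georg F.")
print(sorted([coauthor, full], key=identity_key)[0].family)
```

The names as stored are left untouched, so each item can still be cited the way the author was credited; only the comparison and the sort position change.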

Of course the other proposed improvements to name disambiguation (like an option to turn it off) are also much appreciated and tangential to this proposal.

A_Hartmann · April 13, 2012

You all give interesting and sometimes amusing examples of names, naming and ambiguity of identity.

In the end, it's a problem of sensitivity and specificity of the algorithm. I think the Zotero team should implement options for the user to adjust the sensitivity. My suggestion is to add algorithms for:

(A) full name disambiguation
(B) title fuzziness (80 % of same characters in the first 7 words of the title - or similar)
(C) identification of duplicates as it is

And, let me choose (by right click on the symbol) which way to go.
Always remember, it will be you who clicks "merge"!

Anyhow, the current solution is a huge step forward in the usability of Zotero. Many thanks to the developers!

fbennett · April 13, 2012

@A_Hartmann,

Just in case there is a misunderstanding, this thread is about name disambiguation by the CSL processor, when citations are formatted. The discussion here isn't related to the detection of duplicate items in the Zotero database.

A_Hartmann · April 13, 2012

Oh, yes. You are right. I searched the forums for duplicate item detection and had this one on the list. Sry.

heidib · June 5, 2012

Regarding the inconsistent initials in the Zotero database: I have a *lot* of these and am likely to end up with a lot more for my dissertation. I don't want to take out the disambiguation option for APA, because there are times as noted above where it is necessary.

This is one of those cases where humans can manage this task so easily, especially for something like a dissertation where you get so familiar with the most common authors. Can't there be some way we can simply flag these for Zotero?

So, if I have Jesse Jackson who is sometimes cited as Jesse A. Jackson, and I don't want to change it in the library (because it seems people should be cited the way they were credited when they wrote the paper), wouldn't it be simple enough to bring up a pane in Zotero preferences that lets me specify:

Jesse Jackson = Jesse A. Jackson
Paul Nation = I.S.P. Nation

etc. I can catch these pretty easily as I'm going along. Then, when I click "Refresh" all the entries get checked against the list and fixed.
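A user-maintained alias list like the one above could be consulted along these lines. This is a hedged Python sketch with hypothetical names (`ALIAS_GROUPS`, `build_person_ids`, `same_person`); it only answers "are these two stored names the same person?" and leaves the stored names unchanged.

```python
# Each tuple lists name forms that the user has declared to be one person.
ALIAS_GROUPS = [
    ("Jesse Jackson", "Jesse A. Jackson"),
    ("Paul Nation", "I.S.P. Nation"),
]

def build_person_ids(alias_groups):
    """Map every listed name form to the index of its alias group."""
    ids = {}
    for index, group in enumerate(alias_groups):
        for name in group:
            ids[name] = index
    return ids

def same_person(name_a, name_b, ids):
    """Names are the same person if identical or declared as aliases."""
    if name_a == name_b:
        return True
    return name_a in ids and ids.get(name_a) == ids.get(name_b)

ids = build_person_ids(ALIAS_GROUPS)
print(same_person("Paul Nation", "I.S.P. Nation", ids))
print(same_person("Janet Jackson", "Jesse Jackson", ids))
```

A name that is not on the list is never merged with anything, so disambiguation would still apply in the cases where it is genuinely needed.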

Thoughts?

Heidi

fbennett · June 5, 2012

@heidib: The eventual solution to this will probably involve supplying ORCID data to Zotero or to the citation processor. I don't know of any plans short of such a move.