Option to disable disambiguation?
This is a continuation from the papercuts thread here:
http://forums.zotero.org/discussion/19556?page=1#Item_4
The issue is whether name disambiguation should be controlled through the UI, and whether it should be disabled by default.
I agree that it's a weird rule; the initial in this case contributes nothing to disambiguation, and including it is not natural, even if you know the general rule. (I have a patch for citeproc-js ready to go that would restrict primary-name disambiguation to clashes on primary names only, but I filed it in the attic when the APA response came through.)
What mronkko suggests sounds very possible: given the obscurity of the official APA position, we may have a de facto split between a "shy" and "aggressive" form of the rule. If that is the case, we should (additionally) support the "shy" rule in CSL.
That's a separate point from the main topic of discussion, but it's related: I think that disabling name disambiguation is not a good idea, because it can produce errors in the document (ambiguous cites) that are not visually obvious. If there are rules in the style, they should be applied, and if the rules need adjusting, the thing to do is to adjust them.
I'll commit a change that suppresses the initial on the primary author in this case.
(This discussion makes me feel all the more strongly that we shouldn't lightly adopt suppression of name disambiguation, since mronkko's clear and specific complaint has ferreted out a fault in the processor logic, and fixing it will benefit everyone.)
1. Do other people agree that's a good idea in general?
2. If so, what would be a practical way to implement it?
See, e.g., how the two different Smiths are cited in this article, which uses the APA style: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.6787&rep=rep1&type=pdf
Most of the problems caused by disambiguation result from the same author's name being stored in different forms for different items. For example, a paper written by me could have any of the following forms for the author name:
Ronkko, M
Ronkko, M. V.
Ronkko, Mikko
Ronkko, Mikko V.
The current version of Zotero treats these as different authors. If I cite four test articles, each using a different form, the result is this:
(M. Ronkko, 2001; M. V. Ronkko, 2002; Mikko Ronkko, 2003; Mikko V. Ronkko, 2004)
Because different databases store the same author in different ways, this case is common when you are just starting to build your reference database. It is very confusing to new Zotero users, who do not know why the author's first name is included in the citation. One way to solve what I think is a usability issue in name disambiguation would be to consider authors to be the same if one name can be an abbreviated version of the other.
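A minimal sketch of that rule in Python, just to make the idea concrete. The function names are made up for illustration and this is not how citeproc-js works. It assumes given names are plain strings, that a single letter counts as an initial, and that a missing trailing part (e.g. no middle name) is compatible with anything:

```python
def tokens(given):
    """Split a given-name string into lowercase parts, ignoring periods."""
    return given.replace(".", " ").lower().split()


def part_matches(a, b):
    """Parts match if equal, or if one is a one-letter initial of the other."""
    if a == b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return False


def could_be_same(given_a, given_b):
    """True if the shorter given name can be read as an abbreviation of the longer."""
    short, long_ = sorted((tokens(given_a), tokens(given_b)), key=len)
    # Compare position by position; parts missing from the shorter name are ignored.
    return all(part_matches(s, l) for s, l in zip(short, long_))
```

With this, could_be_same("M", "Mikko V.") holds for all four forms of my name above, while could_be_same("Janet", "Jesse") does not.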
What do other people think about this idea?
I marked number 8 on the papercuts list with strikethrough for now.
But the vast majority of author-date styles in Zotero do not specify a givenname disambiguation rule and thus default to disambiguating all (visible) names (the only available behavior under CSL 0.8.1), which may very well be required by some styles, but is certainly not the most common behavior.
I see mronkko's point about inconsistent databases. I don't have much to add, except that the problem would also become _much_ rarer with a "by-cite" default, since it would only affect citations by the same author(s) from the same year with inconsistent first names in Zotero.
https://bitbucket.org/fbennett/citeproc-js/issue/130/name-disambiguation-should-be-robust-for
I believe that these two separate issue reports will address all the reasons why I think that name disambiguation (as currently implemented in Zotero) should be a feature that the user can disable.
On the suggestion to disambiguate names only if the first-position initial differs, I suppose the obvious example would be "G. W. Bush" versus "G. H. W. Bush".
G. W. Bush = George W. Bush = George Walker Bush != George Bush != G. H. W. Bush
The effect would be to completely disable full-name disambiguation in these pairs, even in rule sets that call for it as the final step. Names would either disambiguate on the initials, or not at all. You can get a similar effect by applying "all-names-with-initials" or "primary-name-with-initials". If that is the desired result, those rules can be applied in the style without any changes to the processor. Thinking further ...
(1) Under either a "-with-initials" rule or the proposal, the following two names would not be distinguished:
Jesse Jackson -> J Jackson (final clash) -> Jackson
J Jackson -> J Jackson (final clash) -> Jackson
(2) Under the "-with-initials" rules, you would get the same result if the full names are present:
Janet Jackson -> J Jackson (final clash) -> Jackson
Jesse Jackson -> J Jackson (final clash) -> Jackson
(3) Putting the two samples together, and assuming all-names disambiguation, would this be the intended result under the proposal?
Jesse Jackson -> J Jackson (clash, but abort) -> Jackson
J Jackson -> J Jackson (clash, but abort) -> Jackson
(4) But:
Jesse Jackson -> J Jackson (clash) -> Jesse Jackson
Janet Jackson -> J Jackson (clash) -> Janet Jackson
J Jackson -> J Jackson (clash) -> J Jackson
You can see the difficulty. Whether expansion is limited to initials for the final comparison depends on whether more than one full (non-initials) name exists in the clash set. That's possible, but it would be very hard to get right, and explaining the logic when there are questions ("I add this one name, and suddenly initials and names appear all over my document!") would not be much fun. It would also suggest that having "J Jackson" and "Janet Jackson" coexist in the bibliography is acceptable, which it's not, in any style, unless the authors really are different people, and one of them is firm about always being represented by their initial only -- in which case the result at (3) would be incorrect, and the correct result impossible to produce.
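To make the dependency concrete, here is a rough Python sketch of the proposal as I read it from cases (3) and (4). It is not citeproc-js code; the function names are hypothetical, and it assumes one family name at a time, with given names as plain strings:

```python
from collections import defaultdict


def split_given(given):
    return given.replace(".", " ").split()


def initials(given):
    return " ".join(p[0].upper() for p in split_given(given))


def is_full(given):
    """A given name counts as 'full' if any part is more than an initial."""
    return any(len(p) > 1 for p in split_given(given))


def render_family_group(givens, family):
    """Map each given name to its rendered form under the proposed rule."""
    distinct = set(givens)
    if len(distinct) == 1:
        return {g: family for g in distinct}  # nothing to disambiguate
    groups = defaultdict(set)
    for g in distinct:
        groups[initials(g)].add(g)
    out = {}
    for ini, members in groups.items():
        if len(members) == 1:
            (g,) = members
            out[g] = f"{ini} {family}"      # initials are enough
        elif sum(is_full(g) for g in members) > 1:
            for g in members:               # case (4): expand to full names
                out[g] = f"{g} {family}"
        else:
            for g in members:               # case (3): clash, but abort
                out[g] = family
    return out
```

Calling render_family_group(["Jesse", "J"], "Jackson") gives plain "Jackson" for both, but adding "Janet" to the list flips all three to their expanded forms, which is exactly the "I add this one name" surprise.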
A less painful alternative would be to set "all-names-with-initials" as the default rule, except for styles that explicitly require full-name expansion (like Chicago).
Another route might be to provide a full rule-selection mechanism in Document Preferences (maybe even with "disable" as an option, I guess). That would at least embed guidance on the effect of the rules in the application where people are more likely to see it.
I like this idea. Not because it allows for disabling name disambiguation, but because it tells the user about name disambiguation and a little bit about how it works. Also, for the rare case of someone who prefers to use only an initial as their first name, there could be an option to use strict disambiguation. However, I think that these instances would be extremely rare.
I gave this a bit of thought while driving to work. I agree that it quickly gets complex and is difficult to get right. (I also took a look at the citeproc code, but since it has very few comments I was not able to really understand it, so the following is not based on the existing implementation.) One way to decide whether authors with the same last name should be disambiguated would be:
1) Give each author name a score: A name gets +10 points for each name that is spelled out and +1 point for each initial.
2) Form a list of all author pairs and sort them in descending order by the sum of the scores
3) Go through the list of author pairs one by one
3.1) Make the names comparable: if one of the authors does not have a middle name, remove it from the other as well, and if a name for one author is only an initial, reduce it to an initial for the other author as well.
3.2) When the names have been reduced to their most complex comparable form, check if they are identical. If this check is true, mark the authors as potential duplicates
4) Go through a list of authors sorted in ascending order by how many potential duplicates they have, excluding authors with zero potential duplicates
4.1) If an author has only one potential duplicate and that duplicate has a higher score than the original author name, replace the original name with the name of the duplicate. Then refresh the list of potential duplicates and start iterating the list again from the start (go back to 4).
5) After step 4 completes, apply the current disambiguation rules.
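The steps above can be sketched in Python roughly as follows. This is only an illustration of steps 1 to 4 under some assumptions: the input is a list of given-name strings that share one family name, a single letter counts as an initial, and all function names are made up. Step 5 (the existing disambiguation rules) is left out.

```python
from itertools import combinations


def parts(given):
    return given.replace(".", " ").split()


def score(given):
    # Step 1: +10 for each spelled-out name, +1 for each initial.
    return sum(1 if len(p) == 1 else 10 for p in parts(given))


def comparable(a, b):
    # Step 3.1: drop parts one side lacks, and reduce a part to an initial
    # when either side has only an initial there.
    ra, rb = [], []
    for x, y in zip(parts(a), parts(b)):
        if len(x) == 1 or len(y) == 1:
            x, y = x[0], y[0]
        ra.append(x.lower())
        rb.append(y.lower())
    return ra, rb


def potential_duplicates(names):
    # Steps 2 and 3: visit pairs in descending order of combined score.
    dups = {n: set() for n in names}
    pairs = sorted(combinations(names, 2),
                   key=lambda p: score(p[0]) + score(p[1]), reverse=True)
    for a, b in pairs:
        ra, rb = comparable(a, b)
        if ra == rb:  # Step 3.2
            dups[a].add(b)
            dups[b].add(a)
    return dups


def unify(names):
    """Step 4: map each original given name to the form it should be shown as."""
    mapping = {n: n for n in names}
    while True:
        current = sorted(set(mapping.values()))
        dups = potential_duplicates(current)
        candidates = sorted((n for n in current if dups[n]),
                            key=lambda n: len(dups[n]))
        for n in candidates:
            if len(dups[n]) == 1:
                (other,) = dups[n]
                if score(other) > score(n):  # Step 4.1
                    for k in list(mapping):
                        if mapping[k] == n:
                            mapping[k] = other
                    break  # refresh the duplicate list and start over
        else:
            return mapping
```

For example, unify(["M", "Mikko V."]) maps both forms to "Mikko V.", while unify(["J", "Janet", "Jesse"]) leaves all three alone because "J" has two potential duplicates. Read literally, step 4.1 only fires for names with exactly one potential duplicate, so three or more mutually compatible forms would be left to step 5 in this sketch.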
I did not really understand the point about raising the number of steps needed to recover. The case where the user really wants J Jackson and Janet Jackson to be treated as different people is extremely rare. My preferred solution would be to assume by default that these are the same person, and to leave the user an option to use the stricter name disambiguation rule if desired.
http://forums.zotero.org/discussion/18213/slow/#Item_3
One more proposal to solve this:
Combine it with a sort-order field, and use that to determine name disambiguation. It would usually default to "family name, first name" if not specified. For two articles by M. Ronkko and Mikko V. Ronkko, one could set the sort-order field to "Ronkko, Mikko V." in both cases, so we know that this is really the same person. At the same time, it would allow me to have my coauthor sorted under "k", where he prefers to be: I would set the field to "Krogh, von, Georg F." or something similar for all cases of this person.
Of course, the other proposed improvements to name disambiguation (like an option to turn it off) are also much appreciated and are independent of this proposal.
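A small Python sketch of what I have in mind. The sort_order field and the Creator class are hypothetical, not anything that exists in Zotero or CSL today:

```python
from dataclasses import dataclass


@dataclass
class Creator:
    family: str
    given: str
    sort_order: str = ""  # hypothetical optional field, empty if not set


def identity_key(creator):
    """Use the sort-order field if set, else fall back to 'family, given'."""
    return creator.sort_order or f"{creator.family}, {creator.given}"


def same_person(a, b):
    # Two creators with the same key would not be disambiguated against each other.
    return identity_key(a) == identity_key(b)


def sort_key(creator):
    # The same field doubles as the bibliography sort key.
    return identity_key(creator).lower()
```

So Creator("Ronkko", "M", "Ronkko, Mikko V.") and Creator("Ronkko", "Mikko V.") end up with the same key, and Creator("von Krogh", "Georg", "Krogh, von, Georg F.") sorts under "k".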
Finally, it's a problem of sensitivity and specificity of the algorithm. I think the Zotero team should implement options for the user to adjust the sensitivity. My suggestion is, add algorithms for:
(A) full name disambiguation
(B) title fuzziness (80% of same characters in the first 7 words of the title, or similar)
(C) identification of duplicates as it is
And, let me choose (by right click on the symbol) which way to go.
Always remember, it will be you who clicks "merge"!
Anyhow, the current solution is a huge step forward in the usability of Zotero. Many thanks to the developers!
Just in case there is a misunderstanding, this thread is about name disambiguation by the CSL processor, when citations are formatted. The discussion here isn't related to the detection of duplicate items in the Zotero database.
This is one of those cases where humans can manage this task so easily, especially for something like a dissertation where you get so familiar with the most common authors. Can't there be some way we can simply flag these for Zotero?
So, if I have Jesse Jackson, who is sometimes cited as Jesse A. Jackson, and I don't want to change it in the library (because it seems people should be cited the way they were credited when they wrote the paper), wouldn't it be simple enough to bring up a pane in Zotero preferences that lets me specify:
Jesse Jackson = Jesse A. Jackson
Paul Nation = I.S.P. Nation
etc. I can catch these pretty easily as I'm going along. Then, when I click "Refresh" all the entries get checked against the list and fixed.
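Something like this rough Python sketch is what I'm picturing. The function names and the "A = B" list format are just my own assumptions for illustration, not an existing Zotero feature:

```python
def build_alias_index(lines):
    """Parse 'Name A = Name B' lines into a dict: name -> shared identity key."""
    index = {}
    for line in lines:
        names = [n.strip() for n in line.split("=") if n.strip()]
        if len(names) < 2:
            continue  # ignore blank or malformed lines
        # If any name is already known, merge this line into the existing group(s).
        old_keys = {index[n] for n in names if n in index}
        key = min(old_keys) if old_keys else names[0]
        for n, k in list(index.items()):
            if k in old_keys:
                index[n] = key
        for n in names:
            index[n] = key
    return index


def same_person(a, b, index):
    """Names not on the list are only equal to themselves."""
    return index.get(a, a) == index.get(b, b)
```

The library entries stay as they were credited; the list is only consulted when deciding whether two names need to be told apart.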
Thoughts?
Heidi