Merging and Cleaning Author lists, Publishers, etc. to prevent overzealous disambiguation
*Summary*
We need a tool for merging multiple entries of authors, publishers, and book titles to allow for clean-up of unintentional duplicates.
*Explanation*
A number of users have incorrectly reported a bug with the Word and OpenOffice plugins. The problem they observe is the seemingly unpredictable inclusion of the first name or initials of the author in the inserted citation. This is usually because the first-name disambiguation feature is turned on in the style they are using and the user is unaware of the function of this feature (and may even be unaware that such style constraints were a requirement!).
But sometimes the disambiguation appears when there are, apparently, no ambiguous references. Upon closer examination, the Zotero users have, for the most part, discovered that some minuscule difference has appeared in two different versions of the same name. Correcting this problem turns out to be somewhat tricky -- sometimes requiring deletion of all instances of an author's name and then replacing these with a consistent, correct entry in each case.
The suggestion feature can help prevent this loss of database regularity during manual entry, but when sources are captured from the internet or shared from other users, then there is a necessary post-processing step to verify that any new references which share an author with an existing reference have the correct spelling, to prevent accidental disambiguation. Duplicate book titles, publisher names, etc. are relatively innocuous, but suffer from the same problem.
One particularly popular webmail client has a tool that could serve as a functional model -- if there are duplicate contacts, these may be merged and (I assume) all prior versions are combined and tied to a GUID. A similar tool could be used in Zotero to occasionally clean the database.
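To make the clean-up idea concrete, here is a rough Python sketch of how such a tool might find merge candidates. It is not how Zotero stores or compares creators; the functions `name_key` and `merge_candidates` and the (family, given) tuple layout are made up for illustration. It only proposes groups for a user to review, since a loose key like this will also group genuinely different people who share a surname and initial.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(family, given):
    """Build a loose comparison key: folded family name plus given-name initials."""
    def fold(text):
        # Strip accents and case so that accented and plain spellings compare equal.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

    # Split the given name on whitespace, periods and hyphens, keep first letters.
    initials = "".join(part[0] for part in re.split(r"[\s.\-]+", fold(given)) if part)
    return fold(family).strip(), initials


def merge_candidates(creators):
    """Return groups of distinct spellings that share a key and may be duplicates."""
    groups = defaultdict(set)
    for family, given in creators:
        groups[name_key(family, given)].add((family, given))
    # Only keys with more than one distinct spelling are worth showing to the user.
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]
```

Given entries for "Jane Doe", "J. Doe" and "J Doe", this returns one group containing all three spellings, which the user could then merge into a single preferred form or leave alone.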
Either way, when it comes to translator saving I'm not sure there's going to be a good solution—as a rule we don't interrupt the save process with user prompts, even if it means some clean-up is required later.
It seems to me the only real solution here is a full-blown data model (FRBR, or along the same lines) that models the "real world" of people, "works" and so on, rather than being citation-based. Particular representations of a person's name, as used by citations, would be built on top of that.
Or, it might be possible to tie into something like worldcat identities, but I don't know enough about that project to say.
I think this has come up in other threads ... but I don't seem to have any bookmarked.
John Doe
Jane Doe
If you have a style that initializes the family names, then you need the full given names to be printed. If you have some source that lists the first one as "J. Doe" but know in fact it's "John Doe" then you have to treat it as such. E.g. in answer to your question, I don't think it makes any difference: you'd no longer be comparing strings to figure out when to disambiguate.
What about works written under aliases?
(Multi-lingual would also need to be covered, but that's another bundle of issues.)
Another wrinkle to consider is the person that changes their name (through marriage, etc.).
But I still think these are different cases than the difference between "J. Doe", "J Doe" and "Jane Doe".
But I do believe that there's something really bizarre/wrong about adding additional complexity to a system to accommodate errors. I mean, as an author, if someone spells my name wrong in my publication, I sure as hell don't want people repeating that mistake in subsequent citations!
Copying mistakes straight from the source is very different from copying erroneous citations from someone else's bibliography (which also happens quite a bit). If a citation is meant to represent a particular work and make it easier to locate, then there is a good argument for using the data exactly as published (even if you know the publisher screwed up). It may bruise the ego a bit to have your name misspelled so much. However, at least in the physical sciences, it is a great ego boost to see that citations to your works are actually counted by the bibliographic databases.
http://www.lit-link.ch/
As you can see, they have a menu on the left with "persons", and each bibliographic item is linked to that person. The additional advantage is that you can go to a specific "person" and immediately see all the works he or she authored, but also edited, co-authored, was referenced in, etc.
For the problem of name changes, aliases, personas etc., a solution could be to integrate a command into each style which determines whether bibliographic items use the person's name or the name given in a specific bibliographic item. Or simply have an on/off switch which changes the respective values independently of styles.
This approach is hardly "adding complexity...to accommodate errors"; on the contrary, supporting any concept of "person" or "work" beyond the present citation-based model raises numerous modeling issues, some quite complex and difficult. Consider how, above, we immediately need to start talking about "aliases" and "personas." Or take a look at previous discussions of multi-lingual citations (here's one). From a more pragmatic perspective it implies quite dramatic changes to the UI and user interaction, and I suspect difficult tradeoffs between data integrity and ease of use.
Don't get me wrong, I love the idea of a rigorous model for information about writers and the things they produce (or even more ambitiously, creators and the things they create). But that's a pretty tall order. It also has all sorts of uses, way beyond the things Zotero does. Perhaps that's another project (it seems inevitable to me), one that Zotero will eventually integrate with. A more feasible way forward for Zotero might be to incorporate tools for identifying and correcting inconsistencies, within the existing citation-centric data model. See the OP above, or consider the frequently-discussed issue of duplicate detection.
About the question of whether to cite "as published" or not, certainly style guides have something to say about this. I don't have the Chicago Manual handy, but vaguely recall it wants you to make an author's names consistent. BUT I'm pretty sure not to the point of "correcting" e.g. names that change due to marriage or for other reasons. Another scenario is publications in a language where the name is rendered differently, often handled by having a separate bibliography. So the notion of "consistency" breaks down pretty quickly.
And, I wouldn't be surprised if there exists a style somewhere that requires citing precisely as published, with errors intact! There seems to be a style somewhere that requires anything you can imagine.
- solving the disambiguation problem that was the original subject of this thread
- support for multi-lingual (well, really cross-script) citation practices (not supported at all now)
- easier searching
- ability to attach additional information to these agents (notes, etc.)
To me, the added complexity is probably a reasonable cost.
Which of the two is desired (transliteration or "translation") should depend on the language of the source (which we don't track in Zotero records at the moment), and on the conventions adopted by the style.
In fact sometimes I do this in a hackish way, by creating a "document" item with a person's name and a title something like "[misc. notes]". More precisely that would be "notes not attached to a specific publication."
This makes me think about the issue in a different light. Perhaps it's more about the long-discussed enhancement of relations between items. We'd still need some sort of entity that can represent a person, but the big win seems to be the ability to define meaningful relationships to that entity, more so than creating a full-blown person object (a requirements black hole, I suspect). You could even generalize that entity so it wouldn't have to be a person, maybe just a named thing to which relationships could be defined, notes attached, etc. How to visualize and interact in the UI is a big question here.
Or maybe I just haven't had enough coffee yet...
Although now it occurs to me an alternative would be to keep the single field value per item and multiply the items, i.e. for a Japanese publication there might be the following items, each with at least the basics of author, title, publisher, date: (1) as published; (2) phonetic Japanese; (3) romaji (phonetic in roman characters); (4) translation (as many as languages being cited in).
The key requirement would be the ability to define relationships between these items properly. Along with presenting them elegantly in the UI and being able to "walk the tree" when generating citations. Maybe this idea has been discussed but I don't recall thinking about it that way myself.
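As a rough illustration of what "walking the tree" could look like, here is a small Python sketch. Everything in it is hypothetical: the `Item` structure, the `form` labels and the `find_form` helper are invented for the example and do not correspond to Zotero's actual data model or relations API.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One representation of a publication (hypothetical structure)."""
    key: str
    form: str    # e.g. "as-published", "romaji", "translation:en"
    author: str
    title: str
    related: list = field(default_factory=list)  # keys of sibling representations


def find_form(library, start_key, wanted_form):
    """Walk relations breadth-first from start_key; return the first matching item."""
    seen, queue = set(), [start_key]
    while queue:
        key = queue.pop(0)
        if key in seen:
            continue  # relations may point back at each other, so skip repeats
        seen.add(key)
        item = library[key]
        if item.form == wanted_form:
            return item
        queue.extend(item.related)
    return None  # no related item carries the requested form
```

A citation processor could then start from whichever item the user cited and ask for the form the style calls for, falling back to the cited item itself when `find_form` returns `None`.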
http://forums.zotero.org/discussion/829/
http://forums.zotero.org/discussion/1130/
http://forums.zotero.org/discussion/8561/
I would like to see multi-language added as an attribute that could be applied to certain fields (names, titles, places, and perhaps more). It would be possible to, say, right click on such a field and edit a list of representations for it in a dialog box, and each representation would be identified by a language code (as has been done for CSL 1.0).
The treatment of people (agents, broadly) and perhaps places as entities maintained by users and orthogonal to the publication records is a very attractive one. Since its point of contact with the bibliographic role of Zotero is perhaps small, I've thought that Person records could be handled in a Zotero plug-in. The user might simply assert that the misspelled, garbled name that is in the bibliographic data for a mid-17th-century book corresponds to a given Person. The Person data -- associated creator IDs, items, metadata -- could be maintained through a separate window, dialog box, or panel.
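A minimal Python sketch of what such a field might look like, purely as an assumption about one possible shape: the `MultilingualField` class, its `variants` mapping and the `render` method are invented names, not anything that exists in Zotero or CSL.

```python
from dataclasses import dataclass, field


@dataclass
class MultilingualField:
    """A field value plus alternative representations keyed by language code."""
    default: str                                   # the value as entered today
    variants: dict = field(default_factory=dict)   # e.g. {"ja-Latn": "...", "en": "..."}

    def render(self, preferred_codes):
        """Return the first variant matching a preferred code, else the default."""
        for code in preferred_codes:
            if code in self.variants:
                return self.variants[code]
        return self.default
```

A style that wants a romanized form followed by a translation would pass its preferences in order, and items without any extra representations would keep behaving exactly as they do now.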
Also, I've used the language of "agent" to point out that institutions can benefit from the same treatment. But you note this in your post as well :-)
Specifically: Zotero can clearly identify which characters in a first name field represent the author's initials. E.g., "John James Wilson" becomes "Wilson, J.J." in an APA reference list. Would it then be possible to disambiguate only when two authors with the same surname have different initials - disregarding any other differences in the first name field? It seems to me that this would cut down hugely on the false positive rate, and imply only the occasional false negative.
Examples:
J. J. Wilson vs J. J Wilson - no disambiguation (second author has no period after second initial, but this would be ignored)
John James Wilson vs J. J. Wilson - no disambiguation (full first names are included for the first author and not the second, but they can be reasonably assumed to be the same person)
J. J. Wilson vs J. G. Wilson - should be disambiguated (second initial differs)
John James Wilson vs Jack Jerry Wilson - no disambiguation (false negative).
Given that it'd presumably be pretty rare to have two authors with the same initials and surname publishing in the same area, false negatives would probably be rare, and the consequences of not disambiguating when it is technically required are hardly dramatic.
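To spell out the rule I have in mind, here is a small Python sketch. It is only an illustration of the comparison, not how Zotero or the citation processor actually decides; the functions `initials` and `needs_disambiguation` and the (family, given) tuples are hypothetical.

```python
import re


def initials(given):
    """Reduce a given-name field to its initials, ignoring periods and spacing."""
    parts = re.split(r"[\s.]+", given.strip())
    return "".join(part[0].upper() for part in parts if part)


def needs_disambiguation(author_a, author_b):
    """True only when the surnames match but the derived initials differ."""
    (family_a, given_a), (family_b, given_b) = author_a, author_b
    if family_a.casefold() != family_b.casefold():
        return False  # different surnames are already distinguishable
    return initials(given_a) != initials(given_b)
```

Run against the four examples above, this gives no disambiguation for "J. J." vs "J. J", none for "John James" vs "J. J.", disambiguation for "J. J." vs "J. G.", and none for "John James" vs "Jack Jerry" (the accepted false negative).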
Could this be a feasible approach?
(Edit: I neglected to mention that this would go against the CSL specification, which is a threshold sticking point. Amending the spec would require the agreement of the CSL list members, and that would be difficult to obtain for the reasons above.)
Few styles actually require the "radical" disambiguation that is currently the only option in Zotero. This will lead to a lot less disambiguation and a lot less of this coming up.
In my groups, I want to be able to find articles by Author. If I sort the Creator column, this only sorts by first name, but I want to find anything associated with a particular author.
If I do an Advanced Search, the articles aren't found. I have tried this several times with different authors and different articles. Occasionally one comes up the way I'd expect, but hardly ever. Sometimes a first name works but the last name does not.
http://forums.zotero.org/discussion/15123/