Merging and Cleaning Author lists, Publishers, etc. to prevent overzealous disambiguation

jameshalgren · May 4, 2010

*Summary*
We need a tool for merging of multiple entries of authors, publishers, and book titles to allow for clean-up of unintentional duplicates.

*Explanation*
A number of users have incorrectly reported a bug with the Word and OpenOffice plugins. The problem they observe is the seemingly unpredictable inclusion of the author's first name or initials in the inserted citation. This is usually because the first-name disambiguation feature is turned on in the style they are using and the user is unaware of what this feature does (and may even be unaware that the style requires it!)

But sometimes, the disambiguation appears when there are no ambiguous references -- apparently. Upon closer examination, the Zotero users have, for the most part, discovered that some minuscule difference has appeared between two versions of the same name. Correcting this problem turns out to be somewhat tricky -- sometimes requiring deletion of all instances of an author's name and then replacing these with a consistent, correct entry in each case.

The suggestion feature can help prevent this loss of database regularity during manual entry, but when sources are captured from the internet or shared from other users, then there is a necessary post-processing step to verify that any new references which share an author with an existing reference have the correct spelling, to prevent accidental disambiguation. Duplicate book titles, publisher names, etc. are relatively innocuous, but suffer from the same problem.

One particularly popular webmail client has a tool that could serve as a functional model -- if there are duplicate contacts, these may be merged and (I assume) all prior versions are combined and tied to a GUID. A similar tool could be used in Zotero to occasionally clean the database.
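As a rough illustration of what the detection half of such a clean-up tool might do, here is a minimal Python sketch. It is not Zotero code: the `(family, given)` tuples, `normalize`, and `find_merge_candidates` are hypothetical names, and the rule (ignore case, periods, and extra spaces) is just one assumed definition of a "minuscule difference". It only lists candidates for a user to review and merges nothing.

```python
import re
from collections import defaultdict


def normalize(text):
    """Fold away differences a reader would barely notice: case, periods, extra spaces."""
    without_periods = text.replace(".", " ")
    return re.sub(r"\s+", " ", without_periods).strip().casefold()


def find_merge_candidates(creators):
    """Group distinct spellings of (family, given) that normalize to the same key.

    Returns a list of groups; each group holds two or more spellings
    that a user may want to review and merge.
    """
    groups = defaultdict(set)
    for family, given in creators:
        groups[(normalize(family), normalize(given))].add((family, given))
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]
```

With the input `[("Doe", "J."), ("Doe", "J"), ("Doe", "Jane")]` this reports `("Doe", "J")` and `("Doe", "J.")` as one group and leaves `("Doe", "Jane")` alone, since deciding that "J." is really "Jane" needs a human.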

mattw · May 20, 2010

I would love to see this feature - searching for tiny differences in the format of an author's name across multiple items to prevent unnecessary disambiguation is very time-consuming.

bdarcus · May 20, 2010

Well, the bigger solution is the same as in a decent contact application: to treat agents (authors, publishers, etc.) as full objects, rather than dumb text strings. But that itself doesn't solve all problems; there'd still be details to work out.

dstillman · May 20, 2010

if there are duplicate contacts, these may be merged and (I assume) all prior versions are combined and tied to a GUID. A similar tool could be used in Zotero to occasionally clean the database.

This might not be quite as straightforward in Zotero, since some people may want/need to cite the author as listed in the particular work. We could still link that representation of the name to other representations via a first-class creator object (which actually exists already behind-the-scenes—it's just unsupported in the interface), but then what happens to disambiguation in the document?

Either way, when it comes to translator saving I'm not sure there's going to be a good solution—as a rule we don't interrupt the save process with user prompts, even if it means some clean-up is required later.

alexuw · May 20, 2010

As Dan points out the item you're citing is one thing, the person another. Basically the same issue arises with the "work" as opposed to particular editions, translations, and so on.

It seems to me the only real solution here is a full-blown data model (FRBR, or along the same lines) that models the "real world" of people, "works" and so on, rather than being citation-based. Particular representations of a person's name, as used by citations, would be built on top of that.

Or, it might be possible to tie into something like worldcat identities, but I don't know enough about that project to say.

I think this has come up in other threads ... but I don't seem to have any bookmarked.

bdarcus · May 20, 2010

This might not be quite as straightforward in Zotero, since some people may want/need to cite the author as listed in the particular work.

I think before we'd move forward on any particular solution to this very real problem, we'd want to interrogate this assumption. To me, allowing this creates more problems than it solves.

... then what happens to disambiguation in the document?

The only sane position is to say that a person is a person. So, hypothetically, you have two authors:

John Doe
Jane Doe

If you have a style that initializes the given names, then you need the full given names to be printed. If you have some source that lists the first one as "J. Doe" but know in fact it's "John Doe" then you have to treat it as such. So, in answer to your question, I don't think it makes any difference: you'd no longer be comparing strings to figure out when to disambiguate.
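To make the "a person is a person" idea concrete, here is a small Python sketch of disambiguation keyed on identity rather than on name strings. Everything in it is an assumption for illustration: `Person`, `person_id`, and `needs_full_given_name` are hypothetical names, not part of Zotero or any CSL processor.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    person_id: int   # hypothetical stable identifier for the real person
    family: str
    given: str       # canonical full given name(s)

    def short_form(self):
        """Initialized form, e.g. 'J. Doe' for John Doe."""
        initials = " ".join(part[0] + "." for part in self.given.split())
        return f"{initials} {self.family}"


def needs_full_given_name(cited):
    """Return the ids of people whose initialized form collides with a different person."""
    ids_by_short_form = defaultdict(set)
    for person in cited:
        ids_by_short_form[person.short_form()].add(person.person_id)
    return {pid for ids in ids_by_short_form.values() if len(ids) > 1 for pid in ids}
```

Citing John Doe twice triggers nothing, however his name was spelled in each source, because both citations point at the same `person_id`. Citing John Doe and Jane Doe flags both, since two different people share the form "J. Doe".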

dstillman · May 20, 2010

So you're saying that you wouldn't allow multiple representations for a given person in the database? Or only at the citation level, instead using whichever name was marked as canonical in the database? If the latter, would the canonical name be used only if a source using the canonical name was used in the document, or would the canonical name be used for any source by that person?

What about works written under aliases?

fbennett · May 20, 2010

People who change their name on marriage?

(Multi-lingual would also need to be covered, but that's another bundle of issues.)

bdarcus · May 20, 2010

So you're saying that you wouldn't allow multiple representations for a given person in the database?

At least as a first step, yes.

Or only at the citation level, instead using whichever name was marked as canonical in the database?

Not following you here.

If the latter, would the canonical name be used only if a source using the canonical name was used in the document, or would the canonical name be used for any source by that person?

If I understand you correctly, the latter.

What about works written under aliases?

I'd treat them as a special type of linked agent: a "persona." E.g. the persona named "Mark Twain" authored "Huck Finn." That persona links to the person named "Samuel Clemens."

Another wrinkle to consider is the person that changes their name (through marriage, etc.).

But I still think these are different cases than the difference between "J. Doe", "J Doe" and "Jane Doe".
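A minimal Python sketch of the "persona" link described above, under the assumption that a persona is just another agent record pointing at the person behind it. `Agent`, `persona_of`, and `real_identity` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    persona_of: Optional["Agent"] = None   # set when this name is a pen name or alias

    def real_identity(self):
        """Follow persona links until reaching an agent that is not a persona."""
        agent = self
        while agent.persona_of is not None:
            agent = agent.persona_of
        return agent


def same_person(a, b):
    """Two agents are the same person if their persona links end at the same record."""
    return a.real_identity() is b.real_identity()
```

Here the work keeps the name it was published under ("Mark Twain"), while `same_person` still lets the system know that Twain and Clemens should never be disambiguated against each other.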

noksagt · May 20, 2010

There are also cases of misspellings and mis-ordering of initials/names. I cite a few papers in which the publisher misprinted author names (and these errors are normally carried over to all bibliographic databases too). I usually want to cite these names exactly as published, as this is the best way to guarantee that my citations will be linked to the original references and that the author/article gets credit for a citation in the databases that store such information.

So you're saying that you wouldn't allow multiple representations for a given person in the database?
At least as a first step, yes.

If you're arguing we can make this "simple" initially (to get it working), then fine. But am I correct that you'd have no objection to eventually go to a more complex system like Dan describes?

bdarcus · May 20, 2010

If you're arguing we can make this "simple" initially (to get it working), then fine. But am I correct that you'd have no objection to eventually go to a more complex system like Dan describes?

Basically, yes. If we need to go there, so be it.

But I do believe that there's something really bizarre/wrong about adding additional complexity to a system to accommodate errors. I mean, as an author, if someone spells my name wrong in my publication, I sure as hell don't want people repeating that mistake in subsequent citations!

noksagt · May 20, 2010

But I do believe that there's something really bizarre/wrong about adding additional complexity to a system to accommodate errors. I mean, as an author, if someone spells my name wrong in my publication, I sure as hell don't want people repeating that mistake in subsequent citations!

Drifting off topic, but...

Copying mistakes straight from the source is very different than copying erroneous citations from someone else's bibliography (which also happens quite a bit). If a citation is meant to represent a particular work and make it easier to locate, then there is a good argument to use data exactly as-published (even if you knew the publisher screwed up). It may bruise the ego a bit to have your name misspelled so much. However, at least in the physical sciences, it is a great ego boost to see that citations to your works are actually counted by the bibliographic databases.

migugg · May 21, 2010

You may want to have a look at litlink, a Swiss citation manager based on a FileMaker database.
http://www.lit-link.ch/
As you can see, they have a menu on the left with "persons", and each bibliographic item is linked to a person. The additional advantage is that you can go to a specific "person" and immediately see all the works he or she has authored, and also those edited, co-authored, referenced, etc.

For the problem of name changes, aliases, personas etc., a solution could be to integrate a command into each style which determines whether bibliographic items use the person's name or the name given in a specific bibliographic item. Or simply have an on/off switch which changes the respective values independently of styles.

alexuw · May 21, 2010

Further to the exchange between bdarcus and noksagt, if we start from zotero today, we have a citation-centric model. We grab data from catalogs, indexes, electronic journals and so on, with attendant mutations, inconsistencies, and downright errors intact. Zotero happily takes in whatever we feed it.

This approach is hardly "adding complexity...to accommodate errors"; on the contrary, supporting any concept of "person" or "work" beyond the present citation-based model raises numerous modeling issues, some quite complex and difficult. Consider how, in the posts above, we immediately needed to start talking about "aliases" and "personas." Or take a look at previous discussions of multi-lingual citations (here's one). From a more pragmatic perspective, it implies quite dramatic changes to the UI and user interaction, and I suspect difficult tradeoffs between data integrity and ease of use.

Don't get me wrong, I love the idea of a rigorous model for information about writers and the things they produce (or even more ambitiously, creators and the things they create). But that's a pretty tall order. It also has all sorts of uses, way beyond the things zotero does. Perhaps that's another project (it seems inevitable to me), one that zotero will eventually integrate with. A more feasible way forward for zotero might be to incorporate tools for identifying and correcting inconsistencies, within the existing citation-centric data model. See the OP above, or consider the frequently-discussed issue of duplicate detection.

About the question of whether to cite "as published" or not, certainly style guides have something to say about this. I don't have the Chicago Manual handy, but vaguely recall it wants you to make an author's names consistent. BUT I'm pretty sure not to the point of "correcting" e.g. names that change due to marriage or for other reasons. Another scenario is publications in a language where the name is rendered differently, often handled by having a separate bibliography. So the notion of "consistency" breaks down pretty quickly.

And, I wouldn't be surprised if there exists a style somewhere that requires citing precisely as published, with errors intact! There seems to be a style somewhere that requires anything you can imagine.

bdarcus · May 21, 2010

OK, good points alexuw. I guess I'm arguing for reasonable complexity with good pay-offs. The pay-offs for treating authors and other contributors as linked contributors include:

solving the disambiguation problem that was the original subject of this thread

support for multi-lingual (well, really cross-script) citation practices (not supported at all now)

easier searching

ability to attach additional information to these agents (notes, etc.)

To me, the added complexity is probably a reasonable cost.

fbennett · May 21, 2010

Just a tiny note: "multi-lingual" is probably the right nomenclature. A name can be transliterated with no change to its original formatting characteristics (e.g. KUROSAWA Akira), or the formatting characteristics of the target script/language domain can be imposed on it, which is, in a sense, a type of translation (e.g. A. KUROSAWA).

Which of the two is desired (transliteration or "translation") should depend on the language of the source (which we don't track in Zotero records at the moment), and on the conventions adopted by the style.

alexuw · May 22, 2010

I'd like the ability to attach notes and tags to a "person."

In fact sometimes I do this in a hackish way, by creating a "document" item with a person's name and a title something like "[misc. notes]". More precisely that would be "notes not attached to a specific publication."

This makes me think about the issue in a different light. Perhaps it's more about the long-discussed enhancement of relations between items. We'd still need some sort of entity that can represent a person, but the big win seems to be the ability to define meaningful relationships to that entity, more so than creating a full-blown person object (a requirements black hole, I suspect). You could even generalize that entity so it wouldn't have to be a person, maybe just a named thing to which relationships could be defined, notes attached, etc. How to visualize and interact in the UI is a big question here.

Or maybe I just haven't had enough coffee yet...

alexuw · May 22, 2010

Also I want to reiterate Frank's point. Having a "person" entity could help with multi-lingual citations, but it doesn't address the issue of dealing with multiple representations of an author's name, title, etc. for an individual item. This is sometimes discussed as "multiple field values" or similar phraseology.

Although now it occurs to me an alternative would be to keep the single field value per item and multiply the items, i.e. for a Japanese publication there might be the following items, each with at least the basics of author, title, publisher, date: (1) as published; (2) phonetic Japanese; (3) romaji (phonetic in roman characters); (4) translation (as many as languages being cited in).

The key requirement would be the ability to define relationships between these items properly. Along with presenting them elegantly in the UI and being able to "walk the tree" when generating citations. Maybe this idea has been discussed but I don't recall thinking about it that way myself.
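A small Python sketch of what "walking the tree" could look like if each representation were its own item linked to the others. The `Item` class, the `form` labels, and `pick_form` are all assumptions made up for this illustration, and the fallback to the root item is only one possible policy.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    title: str
    author: str
    form: str                                     # e.g. "as-published", "romaji", "translation:en"
    related: list = field(default_factory=list)   # other Items describing the same publication


def pick_form(root, wanted_form):
    """Search the related items depth-first for the wanted form.

    Falls back to the root item when no related item has that form.
    Tracks visited items so that circular relations do not loop forever.
    """
    stack, seen = [root], set()
    while stack:
        item = stack.pop()
        if id(item) in seen:
            continue
        seen.add(id(item))
        if item.form == wanted_form:
            return item
        stack.extend(item.related)
    return root
```

A style citing in English would ask for `"translation:en"` and get the as-published item back when no English representation has been entered.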

ajlyon · May 22, 2010

The idea of a "person" type has been floated and has never received much support from Zotero Central:
http://forums.zotero.org/discussion/829/
http://forums.zotero.org/discussion/1130/
http://forums.zotero.org/discussion/8561/

I would like to see multi-language added as an attribute that could be applied to certain fields (names, titles, places, and perhaps more). It would be possible to, say, right click on such a field and edit a list of representations for it in a dialog box, and each representation would be identified by a language code (as has been done for CSL 1.0).
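One way to picture such a field is a default value plus a mapping from language code to representation. The Python below is only a sketch under that assumption; `MultiField`, `variants`, and the fallback from a full tag such as "ja-Latn" to its base language are hypothetical choices, not the citeproc-js or Zotero design.

```python
from dataclasses import dataclass, field


@dataclass
class MultiField:
    default: str                                    # the value as originally entered
    variants: dict = field(default_factory=dict)    # language code -> representation

    def get(self, lang=None):
        """Return the representation for a language code, else the default value."""
        if lang is None:
            return self.default
        if lang in self.variants:
            return self.variants[lang]
        base = lang.split("-")[0]    # e.g. "en-GB" falls back to "en"
        return self.variants.get(base, self.default)
```

For example, a name field holding "黒澤明" with variants `{"ja-Latn": "Kurosawa Akira", "en": "Akira Kurosawa"}` returns the English form for "en" or "en-GB" and the original for a language with no entry.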

The treatment of people (agents, broadly) and perhaps places as entities maintained by users and orthogonal to the publication records is a very attractive one. Since its point of contact with the bibliographic role of Zotero is perhaps small, I've thought that Person records could be handled in a Zotero plug-in. The user might simply assert that the misspelled, garbled name that is in the bibliographic data for a mid-17th-century book corresponds to a given Person. The Person data -- associated creator IDs, items, metadata -- could be maintained through a separate window, dialog box, or panel.

fbennett · May 22, 2010

(A very small follow-on note: the multilingual support in citeproc-js is a processor extension. Styles that make use of multilingual support will validate against the CSL 1.0 schema, but the multilingual behavior is not covered by the specification.)

bdarcus · May 22, 2010

Just to be clear:

The idea of a "person" type has been floated and has never received much support from Zotero Central ...

What we're talking about here is NOT a "person 'type'". It is a new kind of data and UI entity.

Also, I've used the language of "agent" to point out that institutions can benefit from the same treatment. But you note this in your post as well :-)

Matthew HudgensHaney · July 18, 2010

Either way, when it comes to translator saving I'm not sure there's going to be a good solution—as a rule we don't interrupt the save process with user prompts, even if it means some clean-up is required later.

Why exactly is this position taken "as a rule"? I would definitely prefer to be prompted to confirm or modify information rather than have to search back through my library to figure out which author name is different.

mattw · September 21, 2010

Folks, hope you don't mind me resuscitating this thread, but I'm wondering if there might be a simpler (at least temporary) solution than the "person" entity in terms of preventing overzealous disambiguation.

Specifically: Zotero can clearly identify which characters in a first name field represent the author's initials. I.e., "John James Wilson" becomes "Wilson, J.J." in an APA reference list. Would it then be possible to disambiguate only when two authors with the same surname have different initials - disregarding any other differences in the first name field? It seems to me that this would cut down hugely on the false positive rate, and imply only the occasional false negative.

Examples:
J. J. Wilson vs J. J Wilson - no disambiguation (second author has no period after second initial, but this would be ignored)
John James Wilson vs J. J. Wilson - no disambiguation (full first names are included for the first author and not the second, but they can be reasonably assumed to be the same person)
J. J. Wilson vs J. G. Wilson - should be disambiguated (second initial differs)
John James Wilson vs Jack Jerry Wilson - no disambiguation (false negative).

Given that it'd presumably be pretty rare to have two authors with the same initials and surname publishing in the same area, false negatives would probably be rare, and the consequences of not disambiguating when it is technically required are hardly dramatic.

Could this be a feasible approach?
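The rule implied by the four examples above can be sketched in a few lines of Python. This is an illustration of the proposal only, not how any CSL processor behaves; `initials` and `should_disambiguate` are hypothetical names, and hyphenated given names are not handled specially.

```python
import re


def initials(given):
    """Reduce a given-name field to its initials: 'John James', 'J. J' and 'J.J.' all give 'JJ'."""
    parts = re.split(r"[\s.]+", given.strip())
    return "".join(part[0].upper() for part in parts if part)


def should_disambiguate(a, b):
    """a and b are (family, given) pairs.

    Expand the names only when the surnames match and the initials differ;
    any other difference in the given-name field is ignored.
    """
    (family_a, given_a), (family_b, given_b) = a, b
    same_surname = family_a.casefold() == family_b.casefold()
    return same_surname and initials(given_a) != initials(given_b)
```

Run against the examples: "J. J." vs "J. J" and "John James" vs "J. J." are not disambiguated, "J. J." vs "J. G." is, and "John James" vs "Jack Jerry" is the false negative described above.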

fbennett · September 22, 2010

Good thinking, but I'd be reluctant to do this in the CSL processor. The error rate would be unpredictable, since it would depend on the distribution of given names among the authors being cited (which we can't predict or control). The failures would also become harder to detect. I know that the current situation can be annoying, when names expand unexpectedly. But at least that means that errors are calling attention to themselves, so they can be remedied.

(Edit: I neglected to mention that this would go against the CSL specification, which is a threshold sticking point. Amending the spec would require the agreement of the CSL list members, and that would be difficult to obtain for the reasons above.)

adamsmith · September 22, 2010

Also, with the better disambiguation implemented in CSL 1.0 this will come up a lot less often: few styles actually require the "radical" disambiguation that is currently the only option in Zotero. That should mean a lot less disambiguation overall.

amoss · November 15, 2010

Would anyone be willing to comment on a problem I've had, and can't seem to find enough information about in Zotero's Help, Forums, Documentation, etc?
In my groups, I want to be able to find articles by Author. If I sort the Creator column, this only sorts by first name, but I want to find anything associated with a particular author.
If I do an Advanced Search, the articles aren't found. I have tried this several times with different authors and different articles. Occasionally one comes up the way I'd expect, but hardly ever. Sometimes a first name works but the last name does not.

dstillman · November 15, 2010

amoss: Start new threads for new issues.

noksagt · November 15, 2010

But don't for this issue, since you already have a thread open:
http://forums.zotero.org/discussion/15123/

gregor · April 10, 2011

Mendeley has a workable solution to this problem (for multiple versions of the same author). It avoids the difficulty of automatic detection, and simply lists all the authors in a list according to surname. By virtue of this, various versions of the same author are grouped together and it is simply a matter of clicking and dragging one version onto another version to replace all instances of the undesired version in the database. It works. I am not a convert to Mendeley due to PDF handling/syncing issues, but this feature would be great to see in Zotero.
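The effect of dragging one version onto another amounts to a replace-all over the library. A minimal Python sketch of that step follows; the list-of-dicts library, the `"creators"` key, and `merge_creator` are assumptions for illustration and say nothing about how Mendeley or Zotero store creators internally.

```python
def merge_creator(items, unwanted, preferred):
    """Replace every occurrence of `unwanted` with `preferred` in each item's creator list.

    `items` is a list of dicts with a "creators" list of (family, given) tuples.
    Items are updated in place; the return value is the number of items changed.
    """
    changed = 0
    for item in items:
        creators = item["creators"]
        if unwanted in creators:
            item["creators"] = [preferred if c == unwanted else c for c in creators]
            changed += 1
    return changed
```

Because the user picks both versions by hand, no automatic matching is involved; the tool only has to apply the choice consistently.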

asplundj · February 15, 2012

I would also like the feature that gregor describes.

adamsmith · February 15, 2012

I understand from Dan that batch editing for Zotero is the top priority for the next major release.