Merging and Cleaning Author lists, Publishers, etc. to prevent overzealous disambiguation

DWL-SDCA · February 15, 2012

Not directly a Zotero issue but a problem (at least a puzzle)

I am concerned by this and similar discussions concerning author names. I think that there are larger issues here than disambiguating author names in a database and using the disambiguated name for a citation in a manuscript.

For example, if an author's name is Able Brown Word but the author has publications under "A Word" and "AB Word", should the "A Word" publications be cited as "AB Word"? Even though we may know that this is the same author, should a citation list a name that differs from that on the published document? What of a publication with several authors, with all author names published as "single initial last_name"? Should the first author be given a more complete name to facilitate automated disambiguation? Certainly, we do not want to edit our database to do the opposite -- omit a part of an author's name so that all versions in the database are at the same level of specificity.

Rintze · February 15, 2012

You might be interested in http://about.orcid.org/. Any solution that only uses names will always be limited in solving ambiguity.

DWL-SDCA · February 16, 2012

Thanks, Rintze. ORCID is certainly interesting. Implementation will likely take years. Look at ResearcherID. I think people and institutions resist ResearcherID for a number of reasons beyond the connection with Thomson. Maybe if publishers begin to mandate author clarity we will finally begin to solve this.

I will stop my way-off-topic digression.

Lubos · October 8, 2012

I have been facing the author name cleansing issue for some time now and drafted a SQL query that provides a list of duplicate-looking author names with the related Zotero items. The query looks for the same last names and initial character of the first name. It also provides some additional fields to make it easier to locate the respective items in Zotero.

I ran the query in SQLiteStudio but I am sure other SQLite client tools can run it as well. If it runs too slowly, try removing the four "upper()" functions.

The query may be helpful if you just want to find possible author name disambiguation issues and correct them manually. It sorts by author's last name, so you can go to My Library in Zotero, sort by Creator or use the Advanced Search function, and follow the query result to make your amendments.
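For anyone without a graphical SQLite client, the query can also be run from a short script. The following is only a minimal sketch using Python's standard `sqlite3` module. The helper name `run_query` and the file name in the usage note are hypothetical, and the sketch assumes you point it at a copy of your Zotero database rather than the live file. It opens the file read-only so nothing can be changed by accident.

```python
import sqlite3
from pathlib import Path


def run_query(db_path, sql):
    """Run one SQL statement against an SQLite file opened read-only.

    Returns a (column_names, rows) tuple. `db_path` is assumed to be a
    copy of the Zotero database, not the file Zotero is currently using.
    """
    # Build a file: URI and add mode=ro so the connection cannot write.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        cur = conn.execute(sql)
        columns = [d[0] for d in cur.description]
        return columns, cur.fetchall()
    finally:
        conn.close()
```

A call might look like `columns, rows = run_query("zotero-copy.sqlite", query_text)`, where `query_text` holds the SQL as a string and the file name is a placeholder for wherever you saved the copy.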

EDIT: Updated the query below. It now also takes into account the second and third character of the first name to minimize false positives.


-- List creators whose name resembles that of another creator record
-- (same last name, compatible first characters of the first name).
SELECT cD1.lastName,
       cD1.firstName,
       cD1.shortName,
       -- First creator of the item, to help locate it in Zotero
       (SELECT cD2.lastName || ' ' || cD2.firstName
        FROM   creatorData cD2, itemCreators iC2, creators c2
        WHERE  iC2.itemID=i.itemID AND
               iC2.creatorID=c2.creatorID AND
               cD2.creatorDataID=c2.creatorDataID AND
               iC2.orderIndex=0
       ) as firstCreator,
       -- Creator role, year, item type and title combined in one column
       cT.creatorType || ' of a ' || ifnull('' ||
         (SELECT substr(iDV.value, 1, 4)
          FROM   itemData iD, itemDataValues iDV, fields f
          WHERE  iD.itemID=i.itemID AND iDV.valueID=iD.valueID AND
                 iD.fieldID=f.fieldID AND f.fieldName='date'
          ), '') ||
         ' ' || iT.typeName || ': ' || ifnull('' ||
         (SELECT iDV.value
          FROM   itemData iD, itemDataValues iDV, fields f
          WHERE  iD.itemID=i.itemID AND iDV.valueID=iD.valueID AND
                 iD.fieldID=f.fieldID AND f.fieldName='title'
          ), '') as "Participated in Title",
       -- Number of item links held by other creator records with a look-alike name
       (SELECT count(*)
        FROM   creatorData cD2, itemCreators iC2, creators c2
        WHERE  cD2.creatorDataID!=c.creatorDataID AND
               cD2.creatorDataID=c2.creatorDataID AND
               iC2.creatorID=c2.creatorID AND
               upper(cD2.lastName)=upper(cD1.lastName) AND
               upper(substr(cD2.firstName,1,1))=upper(substr(cD1.firstName,1,1)) AND
               -- Second character: equal, or blank/space/period on either side
               (upper(substr(cD2.firstName,2,1)) IN (upper(substr(cD1.firstName,2,1)), '', ' ', '.') OR
                substr(cD1.firstName,2,1) IN ('', ' ', '.')
                ) AND
               -- Third character: same rule, skipped when either second character is blank
               (upper(substr(cD2.firstName,3,1)) IN (upper(substr(cD1.firstName,3,1)), '', ' ', '.') OR
                substr(cD1.firstName,3,1) IN ('', ' ', '.') OR
                substr(cD2.firstName,2,1) IN ('', ' ', '.') OR substr(cD1.firstName,2,1) IN ('', ' ', '.')
                )
       ) as alikeItems
FROM   creatorData cD1,
       creators c,
       itemCreators iC,
       creatorTypes cT,
       itemTypes iT,
       items i
WHERE  alikeItems > 0 AND  -- SQLite allows a result-column alias here
       c.creatorDataID=cD1.creatorDataID AND
       iC.creatorID=c.creatorID AND
       cT.creatorTypeID=iC.creatorTypeID AND
       i.itemID=iC.itemID AND
       iT.itemTypeID=i.itemTypeID
ORDER BY cD1.lastName, alikeItems DESC, cD1.firstName
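
The name-matching rule inside the alikeItems subquery can be hard to read in SQL, so here is a rough Python restatement of it. This is only an illustrative sketch: the names `BLANK`, `char_at` and `looks_alike` are made up for this example, missing first names are assumed to be empty strings, and Python's `upper()` handles non-ASCII letters where SQLite's built-in `upper()` does not, so results may differ for accented names.

```python
# Characters the query treats as "no information" at a given position.
BLANK = {"", " ", "."}


def char_at(name, pos):
    """Return the upper-cased character at 1-based position pos, or "" past the end."""
    return name[pos - 1:pos].upper()


def looks_alike(last1, first1, last2, first2):
    """Mirror the query's test for two creator names that may be the same person."""
    # Last names must match, ignoring case.
    if last1.upper() != last2.upper():
        return False
    # First characters of the first names must match, ignoring case.
    if char_at(first1, 1) != char_at(first2, 1):
        return False
    a2, b2 = char_at(first1, 2), char_at(first2, 2)
    a3, b3 = char_at(first1, 3), char_at(first2, 3)
    # Second characters must agree unless one side is blank, a space or a period.
    second_ok = a2 == b2 or a2 in BLANK or b2 in BLANK
    # Third characters are only compared when both second characters are informative.
    third_ok = a3 == b3 or a3 in BLANK or b3 in BLANK or a2 in BLANK or b2 in BLANK
    return second_ok and third_ok
```

With this rule, "A" and "AB" Word are flagged as look-alikes, and so are "Able Brown" and "A. B." Word, while "Able" and "Adam" Word are not, because their second characters differ and neither is blank.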

david_lindemann · October 17, 2012

I also miss a function for merging author names. Drupal Biblio, for example, has a merge function in its "authors list" that covered all these needs (basically with the aim of having an unambiguous "author page" for each author in Drupal Biblio). The one exception was different names for the same author, and that was covered once the "aka" field was included. See this thread: http://drupal.org/node/409670

It would be great if such a function were included in Zotero. I'm planning to export my data to Drupal Biblio for web presentation (this is possible as BibTeX or with the "import from Zotero" add-on to Drupal Biblio), so I have to perform the author merge there. As there is (still) no two-way sync between Drupal Biblio and Zotero, my Zotero database stays the same.

Another way would be to switch to Mendeley, but NO! I'd like to stay with Zotero...

pascal.martineau · June 28, 2013

Why not export from Drupal Biblio using RIS and then import the file back into Zotero?