Cleanup names of creators, publishers, cities, etc.?

edited August 1, 2018
I'm attempting to 'proofread' my whole library (4000+ entries currently), and hoping to standardize the formatting for the same entities. I'm not sure what the best (least time-consuming) strategy to do this is, and I'm wondering if there are any automated tools within Zotero (or ways to search the database) that could help.

Examples of entities to standardize:

1) Author names:
--Sometimes they have an initial, other times full first name.
--Sometimes the middle name/initial is included, other times it is omitted.
Ideally, I'd like to have the same name for the author in every instance so that sorting works out consistently. But for even more general consistency I'd like to also standardize this for editors, etc.
(An alternative philosophy would be to keep the names exactly as printed in each reference, but that would lead to inconsistency in, e.g., multiple publications in the same year by the same author.)

2) Publishers
Publishers often have abbreviations, and I get lots of different results when inserting via ISBN and other databases. For example, you might get: Penguin / Penguin Inc. / Penguin Incorporated / etc.

3) Locations:
Cities often have the state or country added, and the way these are abbreviated (or what information they include) is often inconsistent. It also might be worth standardizing the distinction between, for example, 'Cambridge, MA, USA' and 'Cambridge, England'.

Others include journal titles (e.g., which words are capitalized, "and" vs. "&", etc.), journal abbreviations, series titles, etc. I have a few journals/recurring conference proceedings that have changed their names slightly over the years but can easily be cited by the same name (as for author names above, this might vary based on your citation philosophy).

Out of all of those things, it looks like only "Publisher" and the FIRST "Creator" can be sorted automatically in Zotero, which would allow me to see a list. For the others, like Location, it would be great to be able to sort by them (or otherwise generate a list), and then standardize them manually from there. The trickiest and most important would be to see ALL Creators (not just the first) and make sure they're formatted the same way across entries.

I'm not sure how much time/effort I want to invest in going through all of my entries by hand. And it doesn't need to be perfect in the end. But having some tools or methods to search through and check these things would be great, so I can fix the most glaring inconsistencies.

Has anyone tried anything similar? Do you have any suggestions?

I suppose one way to approach this would be to create a custom style that sorts by the relevant fields, such as Location, but I wouldn't know how to do that for multiple authors and/or editors, etc. It would also be complicated to then go back through from that information and edit the relevant items in Zotero.

It seems that Zotero knows something about recurring names/titles, because they autofill. Is that information accessible anywhere?

  • Batch editing is planned for the relatively near future (I believe Zotero 5.2). For now, the only way to do batch editing is using Python and Pyzotero.
  • Thanks. I know a full (and easy) version of this would require major changes. But I wonder what can be done now, even awkwardly. Any tips for cleaning up a large number of references?
    Unfortunately Python isn't an accessible option for me at this point.
  • Probably the easiest now is to either sort on the field you are cleaning up or to make a saved search with the field to be cleaned up (e.g., an author’s last name that you want to make consistent) and use Copy-Paste to copy the text into the field for each item.
  • The trickiest part is actually finding when there are inconsistencies. If all creator names were just listed from A-Z I could skim through and find what appear to be obvious variants, and then deal with those. But since sort just works on first author, that's limiting. And search would require that I already know what I'm looking for.

    I suppose that the priority is for first authors to be the same (for sorting purposes) so I could just focus on that and hope the rest is relatively consistent.
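For anyone who can run a script, here is a minimal sketch of one way to surface candidate variants once you have a flat list of creator names from somewhere. It assumes names are written as "Last, First" and groups them by last name plus first initial; the function names are made up for illustration. It only flags candidates for manual review, since two different people (say, Mary and Michael Smith) will land in the same group.

```python
from collections import defaultdict


def variant_key(name):
    """Reduce 'Last, First Middle' to (last name, first initial), lowercased."""
    last, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].casefold() if given else ""
    return (last.strip().casefold(), initial)


def likely_variants(names):
    """Group distinct spellings that share a last name and first initial."""
    groups = defaultdict(set)
    for name in names:
        groups[variant_key(name)].add(name.strip())
    # Keep only groups with more than one distinct spelling.
    return [sorted(group) for _, group in sorted(groups.items()) if len(group) > 1]


# Made-up names, just to show the shape of the result.
names = ["Smith, Mary", "Smith, M.", "Smith, Mary A.", "Jones, Tom", "Smith, John"]
for group in likely_variants(names):
    print(group)
```

Each printed group is a set of spellings worth a second look; names that appear in only one form are left out.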
  • Practically, I recommend not worrying about fixing these until they actually produce a problem in citations (e.g., if an author has a full given name in one item, but initials in another, that would trigger author disambiguation with first names being added, etc.). At that point, go through and fix that inconsistency. But other inconsistencies really aren’t worth the effort to clean up initially.
  • You could probably get a sorted list of all authors by exporting to CSV, moving just the creators column to a new spreadsheet, splitting the cells on semicolons (I think that's what we use on CSV export), and sorting. That would mean, of course, that you'd have to re-match each creator to the actual item, so it's pretty laborious, and bwiernik's approach (which is also what I do) seems more pragmatic to me.
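The spreadsheet steps described here could also be scripted. The following is a rough sketch, not a tested recipe for any particular Zotero version: the column names ("Title", "Author", "Editor", and so on) and the semicolon separator are assumptions about the CSV export, so check them against the header row of your own file. It keeps the item title next to each name, which avoids having to re-match creators to items by hand.

```python
import csv
import io
from collections import defaultdict

# Assumed column names in the CSV export; adjust to match your file's header row.
CREATOR_COLUMNS = ("Author", "Editor", "Series Editor", "Translator")


def creators_index(csv_file, columns=CREATOR_COLUMNS, sep=";"):
    """Map each creator name to the titles of the items it appears in."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        title = row.get("Title", "")
        for column in columns:
            # Missing columns and empty cells are simply skipped.
            for name in (row.get(column) or "").split(sep):
                name = name.strip()
                if name:
                    index[name].append(title)
    return dict(index)


def sorted_creators(index):
    """Return creator names sorted case-insensitively, A to Z."""
    return sorted(index, key=str.casefold)


# Tiny made-up export to show the shape of the result.
sample = io.StringIO(
    'Title,Author,Editor\n'
    '"Book A","Smith, Mary; Jones, Tom",""\n'
    '"Book B","Smith, M.","Jones, Tom"\n'
)
index = creators_index(sample)
for name in sorted_creators(index):
    print(f"{name}\t{' | '.join(index[name])}")
```

To run it on a real export, replace `sample` with an opened file, for example `open("export.csv", newline="", encoding="utf-8-sig")`, where the file name is a placeholder and the encoding is a guess that tolerates a leading byte-order mark.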
  • edited August 14, 2018
    Thanks. Yes, that seems workable if I want to invest some time in it. There's also an argument for keeping the exact formatting (initials, full names, etc.) from the original publication. So I'll check this out and see what seems like a good compromise. A CSV is workable.

    Edit: just FYI for anyone who comes across this wanting a full answer, the CSV approach did work for almost everything I needed, except for viewing non-standard date formats, for which I had to actually inspect the database. That worked, but it was slow and difficult. It's an option if you need it though.
  • edited November 28, 2018
    @djross3 - The problem with authors also extends to pen names, e.g., "Samuel Clemens" aka "Mark Twain".

    I'm not sure how different journals do it, but if Mary Smith gets married and changes her name to Mary Jones, I'd like the citation to stay Mary Smith, since that is what is on the printed copy of the journal article (or whatever...). However, I would like to link them in some way in Zotero to say that Mary Smith and Mary Jones are the same person. Maybe a Zotero display field and a citation name field. If the Zotero name field is blank, then the citation display name is used. Disk space is cheap these days.

    This is also a problem for transgender authors. Yes, they do exist...
  • That's a complicated issue, but I'd say a somewhat separate one. That is, it's reasonable to treat the same name as the same name, even if some publications give only an initial and others write it out in full.

    But if a name is changed by marriage, as a pen name, or whatever, I'd default to citing them as different people. Sometimes I will use brackets if needed to clarify. In my field (Linguistics) I have noticed some inconsistency with this, such as citing an old paper from a now-married author with the new, well-known name, ignoring that it wasn't published that way, making it confusing and hard to find. So to me, the brackets can help with that, which is where I also put original names if I'm transliterating them from another alphabet. It's a somewhat arbitrary and messy process, but I'm not sure how it could be effectively automated.
  • I flipped the name fields around in my above comment. So Zotero could be changed to have two fields for each name, a "Display Name" and a "Citation Name".

    > User could manually, or by batch edit, add "Display Names" for "Citation Names"

    > If "Display Name" field is blank, then Zotero would use the "Citation Name" by default for views on the screen.

    So imports would go into the citation field, but Zotero's on-screen views would show the Display Name.

    The problem that this doesn't address is how to do bibliographic outputs automagically. I have no idea if any of the styles are set up to handle this.
  • While the idea of citation name vs display name has some merit, the Chicago and APA citation rules for using the fullest name for all references by an author don't argue for double author fields. As for pen names ("aka Twain") and other pseudonyms, the cataloging standards (AACR2, RDA, etc.) get around this by using double entry and a "pseudonym, see true name" reference. However, that is different from the best name to cite. For example, there has been (sometimes angry) disagreement in the public health field about whether to cite "Student" or "William Sealy Gosset" when referencing the original publications concerning the t-test. Some of this (in my opinion) depends on the audience of the manuscript you are writing (e.g., citing Lewis Carroll vs. Charles L. Dodgson).

    The issue of how to cite pseudonyms comes down to the requirements of the style standard you will use:

    There is a useful entry in the APA Style Blog concerning the "cite what you see" philosophy:

    This post also discusses the single-field Dr. Seuss vs the firstname lastname Theodor Geisel, and the Dalai Lama / Tenzin Gyatso issue.

    The pseudonym citation rules within other standards (Chicago, MLA, etc.) can differ especially when the pseudonym is for an author who wishes to remain anonymous.

    There are a couple of Zotero Forum posts about this:
