Overriding Automatic Abbreviations

aurimas · September 18, 2013

There's this interesting behavior that I've observed with automatic abbreviations.

According to the automatic abbreviations Zotero performs, Cell Biophysics is abbreviated as Cell Biophys. and Cell Cycle is not abbreviated. Both of these are right and would normally appear properly in bibliographies. However, if you have something entered into the Journal Abbreviation field in Zotero (say "Phys."), Cell Biophysics would still be displayed as "Cell Biophys.", but Cell Cycle would now be displayed as "Phys." My guess is that the Journal Abbreviation field only overrides the automatic abbreviation when the automatic result is effectively unabbreviated, i.e. identical to the full title.
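The behavior described above can be sketched roughly as follows. This is only an illustration of the observed outcome, not Zotero's actual code: the word list, `autoAbbreviate`, and `displayedAbbreviation` are all hypothetical names made up for this example.

```javascript
// Hypothetical word-level abbreviation list (not the data Zotero ships).
const WORD_ABBREVIATIONS = new Map([["biophysics", "Biophys."]]);

// Abbreviate each word that has an entry; leave all other words untouched.
function autoAbbreviate(title) {
  return title
    .split(/\s+/)
    .map((word) => WORD_ABBREVIATIONS.get(word.toLowerCase()) || word)
    .join(" ");
}

// Assumed rule matching the observation: the Journal Abbreviation field is
// used only when the automatic result is identical to the full title.
function displayedAbbreviation(title, journalAbbreviationField) {
  const automatic = autoAbbreviate(title);
  if (automatic === title && journalAbbreviationField) {
    return journalAbbreviationField;
  }
  return automatic;
}

console.log(displayedAbbreviation("Cell Biophysics", "Phys.")); // "Cell Biophys."
console.log(displayedAbbreviation("Cell Cycle", "Phys."));      // "Phys."
console.log(displayedAbbreviation("Cell Cycle", ""));           // "Cell Cycle"
```

The second call is the confusing one: "Cell Cycle" is already a correct abbreviation, yet the field value replaces it because the automatic result looks unabbreviated.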

For one, this behavior could be quite confusing to the user. Furthermore, the journal abbreviations that we automatically generate on import could sometimes result in incorrect abbreviations in bibliographies.

I'm not entirely sure what the proper solution to this should be. I think we should stop automatically populating the Journal Abbreviation field on import from the web (retain journal abbreviations imported from files, but don't auto-generate them; I don't recall if we currently do). I also think that, ideally, the Journal Abbreviation field should always override automatic abbreviations, but this would be quite problematic with all of the abbreviations that are already hanging out in people's libraries from our automatic imports.

fbennett · September 18, 2013

If I understand how things fit together, this behaviour originates from the processor, so I should be paying attention to these threads.

In the Abbreviation Filter plugin, there is a UI for the abbreviation lists. If the abbreviation is set to the full name of the journal ("Cell Cycle" in the example), the field value would be overridden. Without UI control, I guess the only alternative in the current setup would be to delete the field value and see what happens, which isn't ideal.

One approach might be to flag (with a field highlight and tooltip?) that the field has a preferred abbreviation in the current style. Not sure if that would make things more confusing or less so, but it would provide a little more information to the user, and ease everyone into an awareness of the layers involved.

adamsmith · September 18, 2013

The Zotero list seems to me to be written in a way that never requires a fallback, does it?

aurimas · September 18, 2013

I'm specifically referring to the Zotero implementation of getAbbreviation here.

I guess that in order to have a proper discussion of this, we need to understand from the citeproc-js side what is expected of the getAbbreviation function.

There are several outcomes of processing a given string:

1. An exact string is found and is replaced by the abbreviated form

2. A string is split into words and each word is abbreviated based on the list of supplied abbreviations

3. A string is split into words, but not all words have a match in the abbreviation list (so abbreviation may be incomplete)

4. A string is split into words, but none of the words can be abbreviated based on the list (likely to happen with foreign languages, although Zotero abbreviation list appears to be quite broad)

1, 2, and 3 can also result in the abbreviated string being identical to the input string. I'd say in the case of 1 and 2 ~~(2 is the case for Cell Cycle here)~~, these should certainly serve as valid abbreviations. Cell Cycle actually falls under 4 here, because short words are not abbreviated and are, thus, not part of the abbreviation list.
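To make the four outcomes concrete, here is a minimal sketch of a classifier. The lists and the function name `classifyAbbreviation` are hypothetical and exist only for illustration; the real lookup in Zotero or citeproc-js may be structured quite differently.

```javascript
// Hypothetical lists for illustration only.
const FULL_TITLES = new Map([["Journal of Cell Biology", "J. Cell Biol."]]);
const WORDS = new Map([
  ["biophysics", "Biophys."],
  ["journal", "J."],
]);

// Returns the outcome number (1-4, as listed above) and the resulting string.
function classifyAbbreviation(title) {
  // Outcome 1: the exact string is in the list.
  if (FULL_TITLES.has(title)) {
    return { outcome: 1, abbreviation: FULL_TITLES.get(title) };
  }
  const words = title.split(/\s+/).filter(Boolean);
  let matched = 0;
  const parts = words.map((word) => {
    const hit = WORDS.get(word.toLowerCase());
    if (hit) {
      matched += 1;
      return hit;
    }
    return word;
  });
  const abbreviation = parts.join(" ");
  if (matched === words.length) return { outcome: 2, abbreviation }; // every word matched
  if (matched > 0) return { outcome: 3, abbreviation };              // partial match
  return { outcome: 4, abbreviation };                               // nothing matched
}

console.log(classifyAbbreviation("Journal of Cell Biology")); // outcome 1
console.log(classifyAbbreviation("Biophysics Journal"));      // outcome 2
console.log(classifyAbbreviation("Cell Biophysics"));         // outcome 3
console.log(classifyAbbreviation("Cell Cycle"));              // outcome 4
```

Note that with this toy list "Cell Cycle" lands in outcome 4 and comes back unchanged, which is exactly the situation where the output cannot distinguish "correctly left alone" from "no data available".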

I see that getAbbreviation essentially manipulates the second argument to store the result. What should it be setting under each of these circumstances?

If the user-supplied Journal Abbreviation is to serve only as a fallback, I think it should certainly be a fallback for cases 3 and 4. It may also be a fallback for case 2 if we want to give more weight to the user-supplied data.
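As a rough sketch of that fallback policy, assuming the outcome number (1 to 4) is already known for a given title: `resolveAbbreviation` and its `fallbackOnCase2` flag are hypothetical names, not part of Zotero or citeproc-js.

```javascript
// Decide between the automatic result and the user-supplied field.
// outcome: 1-4, following the numbering of outcomes in this post.
// fallbackOnCase2: optionally give more weight to user-supplied data.
function resolveAbbreviation(outcome, automatic, userAbbreviation, fallbackOnCase2 = false) {
  const fallbackCases = fallbackOnCase2 ? [2, 3, 4] : [3, 4];
  if (userAbbreviation && fallbackCases.includes(outcome)) {
    return userAbbreviation;
  }
  return automatic;
}

console.log(resolveAbbreviation(1, "J. Cell Biol.", "JCB"));           // "J. Cell Biol."
console.log(resolveAbbreviation(3, "Cell Biophys.", "Cell Biophys"));  // "Cell Biophys"
console.log(resolveAbbreviation(2, "Biophys. J.", "Biophys J", true)); // "Biophys J"
```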

IMO, though, user-supplied data should take precedence over all of these, but as I mentioned in the first post, this is probably not an option at this point. This is also not an elegant solution because abbreviations are style specific and Journal Abbreviation field is document-independent.

Edit: Edited paragraph 4.

After some further thought: determining what is a valid abbreviation and what is not depends heavily on the list that is being used. Cell Cycle is certainly a correct abbreviation even though neither of those words nor the complete string is in the list. Of course, for less complete lists this is not the case.

fbennett · September 18, 2013

If the user cites a journal that is not covered by the list, she should have some direct way of getting correct output.

In the MLZ setup, it might actually make sense to ignore the journalAbbreviation field entirely when an abbreviation list is in force, since the user has the option of registering an abbrev in the list. Where the list can't be edited, though, the journalAbbreviation field provides a way to supply the abbreviation for a missed journal. (Without fallback, the user could set the abbreviation itself as the name of the journal, but that would lose information from the record.)

fbennett · September 18, 2013

@aurimas: I obviously need to study up a bit.

fbennett · September 18, 2013

We'll be digging into code logic. Shall we move this discussion to zotero-dev?

aurimas · September 18, 2013

I don't think we need to start picking apart code per se. It would suffice to discuss (A) the logic of determining what constitutes a valid _automatic_ abbreviation (I made a remark above that this may be abbreviation list-dependent), (B) when to fall back to or override with the user-supplied abbreviation, and (C) how such abbreviation overrides should be entered/maintained by the user.

(C) is the least clear to me. If abbreviations are style-specific, then it's quite illogical to be entering them into the Journal Abbreviation field in Zotero. On the other hand, it would be very tedious (not to mention having to remember to do so) to have to always enter them at the document level (i.e. in the Word/LO plugin).

I seem to recall some discussion on this, but I can't find it right now.

Edit:

> In the MLZ setup, it might actually make sense to ignore the journalAbbreviation field entirely when an abbreviation list is in force, since the user has the option of registering an abbrev in the list.

Sorry, I missed that. I'll have to take a closer look at how MLZ handles this.

fbennett · September 18, 2013

On (A), I assume you mean an abbreviation that is automatically generated from a set of fragment hints (hope I'm not misreading; correct me if I've slipped here). In the Abbreviation Filter layout, the hints are registered separately, and may consist of phrases or single words. Phrase matches are tried from the longest possible match to the shortest, to avoid false positives. There is no provision for editing the hint lists themselves at the client level: if they produce an incorrect result, the user needs to adjust the suggested abbrev through the UI. As abbreviations are pretty arbitrary, it seems like you would need some means of overriding the result, unless the abbreviation rules are rigorously defined. Even in that case, variation across styles would make things hard. Providing for user intervention seems simpler.
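For readers following along, longest-first phrase matching can be sketched like this. It is a simplified illustration under assumed data, not the Abbreviation Filter's actual code; the hint entries and the function name `abbreviateWithHints` are hypothetical.

```javascript
// Hypothetical hint list: keys are lowercase phrases or single words.
const HINTS = new Map([
  ["law review", "L. Rev."],
  ["law", "Law"],
  ["review", "Rev."],
  ["journal", "J."],
]);

function abbreviateWithHints(title, hints) {
  const words = title.split(/\s+/).filter(Boolean);
  const out = [];
  let i = 0;
  while (i < words.length) {
    let matched = false;
    // Try the longest phrase starting at word i first, then shrink it.
    for (let len = words.length - i; len >= 1; len--) {
      const phrase = words.slice(i, i + len).join(" ").toLowerCase();
      if (hints.has(phrase)) {
        out.push(hints.get(phrase));
        i += len; // skip every word consumed by the phrase
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(words[i]); // no hint applies: keep the word as-is
      i += 1;
    }
  }
  return out.join(" ");
}

console.log(abbreviateWithHints("Harvard Law Review", HINTS)); // "Harvard L. Rev."
console.log(abbreviateWithHints("Law Journal", HINTS));        // "Law J."
```

Matching word by word would give "Harvard Law Rev." for the first title; trying the two-word phrase before its parts is what prevents that kind of false positive.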

How (B) is handled depends a lot on whether you have a means of editing the abbreviation lists. I assume that that will be coming to Zotero at some point, but maybe Dan or Simon can speak to it?

On (C), there are two issues: whether the overrides are document-specific or take effect across all styles applying the list; and whether to provide the editing UI in the document, in Zotero, or both.

Overrides should probably not be limited to the document, regardless of how editing works. For a persistent list with editing access, you would want to set the abbreviations in an SQLite database, with a caching layer to provide quick access to abbrevs in use for a given document or bibliography. That's what I've done in the Abbreviation Filter, and it seems to be working pretty well.
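The general shape of such a setup, a persistent store fronted by a cache, might look like the sketch below. It does not reproduce the Abbreviation Filter's code: the class and method names are hypothetical, and an in-memory `Map` stands in for the SQLite database so the example runs on its own.

```javascript
// Build a composite key from list name and title (hypothetical helper).
function makeKey(listName, title) {
  return listName + "\u0000" + title;
}

// Stand-in for the persistent store; a real setup would query SQLite here.
class InMemoryStore {
  constructor() {
    this.rows = new Map();
    this.reads = 0; // counts lookups, to show the effect of the cache
  }
  lookup(listName, title) {
    this.reads += 1;
    const key = makeKey(listName, title);
    return this.rows.has(key) ? this.rows.get(key) : null;
  }
  save(listName, title, abbreviation) {
    this.rows.set(makeKey(listName, title), abbreviation);
  }
}

// Caching layer: each (list, title) pair hits the store at most once.
class AbbreviationCache {
  constructor(store) {
    this.store = store;
    this.cache = new Map();
  }
  get(listName, title) {
    const key = makeKey(listName, title);
    if (!this.cache.has(key)) {
      this.cache.set(key, this.store.lookup(listName, title));
    }
    return this.cache.get(key);
  }
  // A user edit is written through to the store and the cache together.
  set(listName, title, abbreviation) {
    this.store.save(listName, title, abbreviation);
    this.cache.set(makeKey(listName, title), abbreviation);
  }
}

const store = new InMemoryStore();
const abbrevs = new AbbreviationCache(store);
abbrevs.set("default", "Cell Cycle", "Cell Cycle");
abbrevs.get("default", "Cell Biophysics"); // first lookup goes to the store
abbrevs.get("default", "Cell Biophysics"); // second lookup is served from cache
console.log(store.reads); // 1
```

Because overrides live in the store rather than in the document, they apply to every document that uses the same list.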

For the editing UI, the Abbreviation Filter initially allowed edit access only through the document plugins, and only for items actually cited in the document. That was indeed pretty awkward, and I recently extended it to allow editing through MLZ itself. It required a small change to expose the citation processor in the XUL that provides MLZ-side edit access (the Export Preferences panel) -- I haven't checked, but that may prevent the latest iteration of the plugin from working with Zotero.

I don't think there's a clean solution without direct local edit access to the abbreviation lists. I guess an initial question would be whether to focus on how best to serve user needs with the current setup, or to look ahead to more ambitious extensions to the infrastructure.

aurimas · September 19, 2013

OK, seems like we're on the same page (as in, I pretty much have the same thoughts on A, B, and C).

Basically, in theory, getAbbreviation should do the best it can and citeproc-js will just use the result as an abbreviation. If that's incorrect, the user should override this for each abbreviation list.

This is not quite what Zotero does though. There is no abbreviation list-specific override and the global Journal Abbreviation field is often populated, so it cannot be used generally as a way to override. So I think that's why there's currently this hybrid of only overriding abbreviations that are effectively not abbreviated.

With the current setup, I think the most reasonable solution would be to ignore the Journal Abbreviation field all the time and never override, since the override does not work anyway most of the time. This part I can fix in https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/cite.js