Getting Journal Abbreviations from a repository

ben58 · September 19, 2009

http://forums.zotero.org/discussion/5370/wiley-and-science-direct-journal-ab/#Item_3

Strangely enough, this has not been widely discussed yet. EndNote offers this feature, and so does the open-source JabRef. Several publicly available repositories exist.

Would be nice to have the functionality to automatically insert/replace abbreviations in a selected collection in Zotero.

bdarcus · September 19, 2009

This has been discussed before:

http://forums.zotero.org/discussion/3877/automatic-generation-of-publication-abbreviated-names/

http://forums.zotero.org/discussion/8278/text-substitution/

http://forums.zotero.org/discussion/6954/bibtex-export-doesnt-contain-journal-abbreviations/

I doubt Endnote or JabRef covers the issues discussed in these threads. To my great regret, this issue is more complex and ugly than I would have thought.

adamsmith · September 19, 2009

But being able to import a dozen or so abbreviation lists would be pretty cool already.

Actually - which issue isn't covered by the jabref/Endnote approach?
I imagine (intuitively) a pretty simple routine: in the style, instead of just form="short", you write form="short-isi", and then the style automatically replaces journal names with abbreviations from the ISI list, leaving the ones it doesn't have an abbreviation for with their full name (which is what most styles request anyway).

I'm a social scientist and we don't do journal abbreviations much, but for medicine and other natural sciences this seems super standard and important. I've edited a whole bunch of biology and medicine styles over the last weeks, and "use journal abbreviations from such-and-such list" has always figured prominently in the style guide.

bdarcus · September 19, 2009

Whatever the solution is I really want it to be based around the new open periodicals db. Users shouldn't have to "import" anything; stuff should "just work."

In that case, I'd probably imagine using a URI to identify the abbreviation list.

adamsmith · September 19, 2009

but we agree that the abbreviation list should be called by the style, right?
Right now the periodicals db seems very far from providing a solution.
When you raised the issue in August, people really seemed at a bit of a loss as to how to deal with it:
http://groups.google.com/group/dataincubator/browse_thread/thread/1c89d436e07fe6b7

There is a limited number of such lists that are actually being used. Those could also just be included in Zotero and things would just work (that's what EndNote does, and, dare I say it, I think does well). People who do require an "obscure list" could import it separately (once again, as EndNote does).
I don't think a status quo that requires pretty much every "hard" scientist to manually edit 80% of his/her bibliography should exist for any longer than absolutely necessary.

fbennett · September 22, 2009

The rather simple mechanical problem of handling abbreviation lists has been solved in the new CSL processor. All that's needed is a means of acquiring the key/value list associated with a given style.

Rintze · September 22, 2009

Will there be a fallback to the abbreviations supplied with the item data, instead of to the full title?

fbennett · September 22, 2009

Will there be a fallback to the abbreviations supplied with the item data, instead of to the full title?

If I understand current Zotero correctly, it has a shortTitle variable, which is used to render <text variable="title" form="short"/>. As far as I know there is no corresponding variable to supply the short form of container-title, and if abbreviation lists are used, no such variable would be necessary. The only available abbreviations would be those supplied in the list tied to the style, and they would be applied always, regardless of whether the "short" form is set or not. If the journal title is not in the abbreviation list, the fallback would be to use the full title.

That's not to say that I am afflicted with a chronic belief that this is the right way for things to work, but it's the way things are set up in the test-bed processor at the moment.

PS: Everything in the first paragraph above, from "As far as I know ..." onward is mistaken, as Rintze points out. See my next post below for a corrected outline of processor behavior.

Rintze · September 22, 2009

As far as I know there is no corresponding variable to supply the short form of container-title

Then how exactly are journal abbreviations supplied currently, if not via their own variable?

and they would be applied always, regardless of whether the "short" form is set or not

Perhaps we should make the `form` and `abbreviation-list` attributes on cs:text mutually exclusive? Or you can allow both, and use the value of `form` to specify whether container-titles that do not occur in the abbreviation list should fall back to the full title or the abbreviated title that is in the input data.

fbennett · September 22, 2009

Then how exactly are journal abbreviations supplied currently, if not via their own variable?

Ooooops. There's a journalAbbreviation variable in Zotero that I missed. So scratch what I wrote above, it's based on a false assumption.

Perhaps we should make the `form` and `abbreviation-list` attributes on cs:text mutually exclusive?

The test-bench implementation doesn't rely on an abbreviation-list attribute; the use of lists is managed externally. The list is just installed in the processor whenever it is provided, and takes effect immediately.

Since, contrary to what I wrote, there is a field for journal abbreviations, and since form="short" does have meaning in current Zotero, the supplied abbreviation list should only be used when form="short" is set, and the fallback should be to journalAbbreviation, then to the full title. Sorry for my earlier confusion.

This only solves the mechanical problem of applying abbreviations from an external list. It leaves open the question of how to manage and select the lists themselves (whether by an abbreviation-list attribute or by some mechanism external to CSL markup).
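
A minimal sketch of the fallback order described above, written in Python for illustration only (it is not the test-bed processor's code). The function name `render_container_title` and the dict-shaped item are assumptions; `journalAbbreviation` is the Zotero field named above, and `container-title` is the CSL variable.

```python
from typing import Mapping, Optional


def render_container_title(
    item: Mapping[str, str],
    form: str = "long",
    abbreviations: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the journal title to print for one item (hypothetical helper)."""
    full_title = item.get("container-title", "")
    if form != "short":
        # The list is only consulted when the style asks for the short form.
        return full_title
    if abbreviations and full_title in abbreviations:
        return abbreviations[full_title]
    # Fall back to the abbreviation stored with the item, then the full title.
    return item.get("journalAbbreviation") or full_title
```

With a list that contains only "Journal A", an item for "Journal B" that carries its own journalAbbreviation prints that abbreviation, and an item with neither prints the full title.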

Rintze · September 22, 2009

whether by an abbreviation-list attribute or by some mechanism external to CSL markup

Abbreviation lists are often style-specific, so it would make sense to code it into the style. Some people have suggested URIs as attribute-values:
http://forums.zotero.org/discussion/8278/text-substitution/?Focus=37489#Comment_37489

bdarcus · September 22, 2009

My first hunch is that it might be best to leave this out of CSL ATM. I do very much like the idea of using a URI to specify the abbreviation list, but there's some extra work that would be needed to take advantage of this, and I'm not sure we have the time to worry about that right now.

Dan or Simon: any opinions on this?

Basic issue is that if you go with the approach Frank used here (a simple substitution list untied to CSL), you'd still need some way for a user to load or specify those lists. The user would be responsible for all of this.

OTOH, using a URI would allow lists to be defined by style, and so for it to be essentially transparent to users.

dstillman · September 22, 2009

I agree that a URI-based solution would make the most sense. This would obviously need to be handled by the UA, since the citation processor doesn't (and shouldn't) have networking support.

Two options that come to mind, both starting with the CSL file having some sort of processing instruction or embedded attribute/element that includes a URI:

1) The UA reads the URI, fetches the data, and provides it to the processor using the mechanism Frank already has in place.

2) The UA reads the URI, fetches the data, and preprocesses the CSL file, substituting in the data as XML data. This might make for easier testing and would allow styles to hard-code lists if they really wanted to, though there might not be much of an advantage to that. But I think Ant does something like this.

Rintze · September 22, 2009

CSL styles are already dependent on other files: the locales-xx-XX.xml localization files. Would the following make sense?:

3) Abbreviation lists are stored in a separate folder in the same directory as the styles folder. If bandwidth isn't an issue, standard abbreviation lists can be shipped along with the CSL-application (e.g. Zotero). New lists can be installed by hand by placing them in the abbreviation list folder. If a new style is installed that uses a new abbreviation list, that list is downloaded as well. The CSL-application takes care of automatic updating of the abbreviation lists (just like styles). In this scheme it would be enough if the CSL processor has access to the local abbreviation list folder.

adamsmith · September 22, 2009

I like Rintze's solution. Obviously I'm by far the least techie person involved in this discussion, so that's only worth so much,
but from a style-coding and using perspective that's exactly what I expect.

and if

If a new style is installed that uses a new abbreviation list, that list is downloaded as well.

is possible that will also be in line with Bruce's concern that the user shouldn't have to import any lists.

bdarcus · September 22, 2009

You all are using passive voice ("is installed" etc.). What agent is doing this work (downloading)? How?

adamsmith · September 22, 2009

My idea would be that Zotero comes with all abbr. lists for the pre-installed styles.

When a user installs a new style that requires an abbreviation list, the installation routine would automatically check if that list is already installed in Zotero (i.e. the list file is present in the directory) and otherwise download the list together with the style.

Maybe I'm naive about how easy that is from a programming point of view, but I think this would be ideal from a user point of view. I imagine the current install routine checks where the style folder is in Zotero, checks if the style already exists there (it distinguishes between install and update) and then installs the style. Writing an additional step that checks whether the style requires an abbreviation list, and which one, wouldn't seem that much harder.

Rintze · September 22, 2009

What agent is doing this work (downloading)? How?

The same agent that can handle automatic style updating (i.e. Zotero).

dstillman · September 22, 2009

The CSL processor doesn't have disk access, either—the locales are passed to the processor as XML by the application. So #1 and #2 already imply caching the lists on disk and updating them as necessary. The main question is how the lists are specified in the style and how the data is passed to the processor. We're also going to need a format for the abbreviation lists that includes a URI in it.
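
A rough sketch of option #1 with the on-disk cache it implies, in Python. Everything here is hypothetical: the function name, the cache layout (one file per percent-encoded URI), and the `fetch` callable that stands in for whatever networking the application already has. Update checks are left out.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import quote


def load_abbreviation_list(
    uri: str,
    cache_dir: Path,
    fetch: Callable[[str], str],
) -> str:
    """Return the raw list for `uri`, downloading it only if it is not cached."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Percent-encode the whole URI so it is safe to use as a file name.
    cached = cache_dir / quote(uri, safe="")
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    data = fetch(uri)
    cached.write_text(data, encoding="utf-8")
    return data
```

The application would then hand the returned data to the processor; the processor itself never touches the network or the disk.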

Rintze · September 22, 2009

Perhaps a similar structure as used for styles would make sense, e.g.:

<abbreviation-list>
  <info>
    <title>BIOSIS</title>
    <id>http://www.zotero.org/abbreviation-lists/biosis</id>
    <link href="http://www.zotero.org/abbreviation-lists/biosis"/>
  </info>
  <journal>
    <full-title>Journal A</full-title>
    <abbreviation>J. A</abbreviation>
  </journal>
  <journal>
    <full-title>Journal B</full-title>
    <abbreviation>J. B</abbreviation>
  </journal>
...
</abbreviation-list>

Storing the abbreviation lists in JSON might perhaps make more sense, as many lists contain close to 10,000 journals (http://www.library.illinois.edu/biotech/j-abbrev.html).
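
To compare the two representations, here is a small Python sketch that reads the XML structure proposed above and emits the same data as JSON. The JSON shape (an `id` plus a title-to-abbreviation map) and the function name are assumptions, not an agreed format.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the proposed structure, for illustration.
XML_SAMPLE = """\
<abbreviation-list>
  <info>
    <title>BIOSIS</title>
    <id>http://www.zotero.org/abbreviation-lists/biosis</id>
  </info>
  <journal>
    <full-title>Journal A</full-title>
    <abbreviation>J. A</abbreviation>
  </journal>
  <journal>
    <full-title>Journal B</full-title>
    <abbreviation>J. B</abbreviation>
  </journal>
</abbreviation-list>
"""


def parse_abbreviation_list(xml_text: str) -> dict:
    """Turn the XML list into a plain dict keyed by full journal title."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext("info/id"),
        "abbreviations": {
            journal.findtext("full-title"): journal.findtext("abbreviation")
            for journal in root.findall("journal")
        },
    }


parsed = parse_abbreviation_list(XML_SAMPLE)
print(json.dumps(parsed, indent=2))
```

Either way the processor ends up with a key/value map; the choice mostly affects file size and tooling.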

bdarcus · September 22, 2009

Two things:

1) a list of 10,000 items is insane. Why not just specify abbreviation algorithms as described in that illinois link ("List of Title Word Abbreviations")?

2) I don't see any great advantage using JSON here; the XML might be as simple as:

<substitutions>
  <substitute match="Test Journal" replace="TJ"/>
</substitutions>

noksagt · September 22, 2009

1) a list of 10,000 items is insane. Why not just specify abbreviation algorithms as described in that illinois link ("List of Title Word Abbreviations")?

Which includes 57,000 words/word fragments. And they must be handled in order, as the fragments contain wild cards. For example: words that start with "materia" are abbreviated to "mater." except for the specific cases of "materialhåndtering", "materialoveden*", "materialoznanie", "materialzinatn*", "materiałoznawst*", and "materiálovotechnolog*" (where "*" is a wild card).
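
To illustrate why order matters, here is a minimal Python sketch of first-match-wins rules for the "materia" example. The rule table layout and the function name are assumptions; `None` only marks "the general rule does not apply", because what each exception should produce instead is not stated here. Case handling is ignored.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Ordered rules: the first matching pattern wins, so the exceptions must be
# listed before the general prefix rule.
RULES: List[Tuple[str, Optional[str]]] = [
    ("materialhåndtering", None),
    ("materialoveden*", None),
    ("materialoznanie", None),
    ("materialzinatn*", None),
    ("materiałoznawst*", None),
    ("materiálovotechnolog*", None),
    ("materia*", "mater."),
]


def abbreviate_word(word: str) -> str:
    """Apply the first matching rule; leave the word alone otherwise."""
    lowered = word.lower()
    for pattern, replacement in RULES:
        if fnmatchcase(lowered, pattern):
            # None: an exception to the general rule, left unchanged here.
            return word if replacement is None else replacement
    return word
```

If the general "materia*" rule were moved to the top of the table, every exception would be shortened to "mater." as well, which is the ordering problem described above.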

I don't see any great advantage using JSON here

Compactness. We should see what other programs use, too. I don't know if there is already a common format, but this is a common problem. JabRef uses: Test Journal = TJ. I see few advantages to inventing our own XML unless we have specific ideas for taking advantage of extensibility.
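
For comparison, a line-based format like the JabRef example takes very little code to read. This Python sketch assumes one "Full title = Abbreviation" pair per line; skipping blank lines and "#" comment lines is an assumption, and the function name is hypothetical.

```python
def parse_jabref_list(text: str) -> dict:
    """Read 'Full title = Abbreviation' lines into a dict."""
    abbreviations = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with "#" carry no data.
        if not line or line.startswith("#"):
            continue
        # Split on the first "="; a title containing "=" would need escaping.
        full, sep, short = line.partition("=")
        if sep:
            abbreviations[full.strip()] = short.strip()
    return abbreviations
```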

fbennett · September 22, 2009

1) a list of 10,000 items is insane. Why not just specify abbreviation algorithms as described in that illinois link ("List of Title Word Abbreviations")?

The "algorithm" prescribed by that page just calls for per-word substitution based on another, even larger list: "The List includes 57,000 words in about 70 languages".

Since the list of actual titles for this case is already out there, and since other styles do have abbreviation requirements that cannot be covered by simple word lists alone, it seems like providing for word-substitution in this case, but for title-substitution in others would add unnecessary complexity.

PS: Uh, also what noksagt said.

bdarcus · September 22, 2009

What a PITA!

So for sake of argument, why should this be a matter for the CSL processor at all, except for the fact that it might be nice (though not essential) to be able to specify the list in the style? Seems that clients like Zotero could do the substitutions before sending over the data to process.

noksagt · September 22, 2009

If you're going to have a few that are 57,000 word dictionaries, you might as well share them between CSL-using applications.

Being able to specify a URI to a list of abbreviations to use & having a common file format for such a list seems very simple to put in the specification. Assembling those lists and, to a lesser extent, processing those lists seems harder. If Frank thinks that he can do the latter, I don't see any downsides to putting it in.

If we don't put it in the spec, how are the clients going to decide which abbreviation list to use with each style?

fbennett · September 22, 2009

Just to be clear, it would be simpler to avoid handling the 57,000-word substitution list in this example (and any other form of algorithmic substitution), and rely on simple title-key/title-value pairs like the 10,000-title list mentioned here, instead.

bdarcus · September 22, 2009

Being able to specify a URI to a list of abbreviations to use & having a common file format for such a list seems very simple to put in the specification.

But not as simple as putting nothing in at all (in CSL 1.0).

Assembling those lists and, to a lesser extent, processing those lists seems harder. If Frank thinks that he can do the latter, I don't see any downsides to putting it in.

The downside is that this is another last minute feature request without any significant experience to support this solution.

I also don't see why a CSL processor should be responsible for this. JabRef, for example, does the substitution; not BibTeX?

If we don't put it in the spec, how are the clients going to decide which abbreviation list to use with each style?

Leave it to the user to determine (which admittedly, puts the burden on the client app to support their making this choice).

It might be possible to add a simple attribute for this without adding significant risk or added work; will think on it.

adamsmith · September 22, 2009

but if I understand you correctly that flies in the face of

Users shouldn't have to "import" anything; stuff should "just work."

, right?

I think the reason this should be in the CSL is that the directions on which abbreviations to use are in the style guide, so when translating the style guide to a CSL that information should be included.

noksagt · September 22, 2009

I don't know if this should be a blocker for 1.0 (and couldn't care less about the schedule), but I think it would be a useful killer feature if CSL eventually had it. Other citation options do not do this right & that is exactly why CSL should!

JabRef, for example, does the substitution; not BibTeX?

Because, as you know, BibTeX has many poor decisions which applications must make up for. Doesn't mean CSL should as well. See http://forums.zotero.org/discussion/8278/text-substitution/#Item_9

Leave it to the user to determine (which admittedly, puts the burden on the client app to support their making this choice).

Worse: it puts the burden of implementation on the client app (and history has shown that this gives us crummy, clunky interfaces that can't be shared) & the burden of ensuring this aspect of the style is followed correctly on the user (who must select both a CSL file & a separate abbreviation list when writing a paper).

It might be possible to add a simple attribute for this without adding significant risk or added work; will think on it.

That's what I was aiming for.

Rintze · September 23, 2009

We should see what other programs use, too.

EndNote seems to use this tab-delimited format:

AACN Clinical Issues	AACN Clin. Issues	AACN Clin Issues
AADE Editors Journal	AADE Ed. J.	AADE Ed J

See http://www.library.uq.edu.au/endnote/journal_terms.html
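
A short Python sketch for reading such a file. Judging only from the two sample rows, the columns appear to be the full title, an abbreviation with periods, and an abbreviation without periods; that reading, the function name, and the `dotted` switch are assumptions.

```python
import csv
import io

ENDNOTE_SAMPLE = (
    "AACN Clinical Issues\tAACN Clin. Issues\tAACN Clin Issues\n"
    "AADE Editors Journal\tAADE Ed. J.\tAADE Ed J\n"
)


def parse_endnote_terms(text: str, dotted: bool = True) -> dict:
    """Map full titles to one of the two abbreviation columns."""
    column = 1 if dotted else 2
    abbreviations = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip rows that lack the requested column or have empty cells.
        if len(row) > column and row[0] and row[column]:
            abbreviations[row[0]] = row[column]
    return abbreviations
```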

I don't know if this should be a blocker for 1.0

Agreed. Nobody asked for this to be included in 1.0.

bdarcus · September 23, 2009

On the format, it doesn't necessarily follow that we should do what other programs use.

If we consider abbreviation part of CSL (and I'm not convinced we do), then the format should probably be XML, given that CSL is an XML format.

If we consider it part of the client app, then I don't care.