Text substitution

rfrench · August 13, 2009

Hi -

I'm designing a citation style for the Astrophysical Journal (ApJ). This style requires certain journal abbreviations (for example, ApJ!). However, most citation sources spell out the journal name (Astrophysical Journal). I don't want to change the Zotero database entries, because then they won't be correct for other styles. Instead, what I want to do is translate from the full names to the shortened names. Basically something like:

<if publication = "Astrophysical Journal"> <text value="ApJ">
...

But I can't find any way to actually compare a variable *value* with something. I can only check whether or not a variable has a value at all. Is there some way to do this?

adamsmith · August 13, 2009

Zotero has a separate field for "Journal Abbr." - the journal's short form (scroll down).
Check e.g. the Nature style to see how this is called in .csl.

noksagt · August 13, 2009

But this is a great example of why we must be smarter about abbreviations: abbreviations are style-specific! If rfrench only writes for ApJ, there'd be no problem. But, as soon as he writes for another journal, he'll have to update all of his abbreviations.

Style-specific abbreviations will almost certainly not make it into CSL-1.0. It'd be nice to come up with a solution to this.

adamsmith · August 13, 2009

OK - I thought periods vs. no periods in abbreviations are going to be supported, though, right?
For everything else - are Journal abbreviations really style dependent? The idea would be that there is some fixed rule to create abbreviations, I guess. Since we've seen a number of styles that also want that for authors (in sometimes bizarre ways) I guess allowing users to integrate algorithms to create abbreviations would be good. Maybe very hard?

noksagt · August 13, 2009

I thought periods vs. no periods in abbreviations are going to be supported, though, right?

Yes. But this is trivial and insufficient.

For everything else - are Journal abbreviations really style dependent?

Yes. See http://authors.iop.org/atom/help.nsf/0/60B8EF7A8573E173802574B40052FF65?OpenDocument&journalid=ApJ#_Toc5

If you write for ApJ, you'd abbreviate "The Astrophysical Journal" to "ApJ".
If you write for Nature, you'd abbreviate it as "Astrophys. J."

IIRC, Endnote handles this situation by allowing you to make a database of journal names, each of which can have three or so abbreviations. You then select which abbreviation to use. This is somewhat crude & not ideal. Zotero could potentially do the same, outside of CSL.

But I don't know why this couldn't be handled inside CSL, so that other CSL users could benefit.

I guess allowing users to integrate algorithms to create abbreviations would be good. Maybe very hard?

Why "hard?" I think one good design would be to point to some URI in the CSL file that had a list of substitution patterns to follow. Generating these initial lists could be done automatically for some journals (due, in part, to open data from the data incubator project & similar free use sources) & could be slowly cleaned up by those who use the styles.

bdarcus · August 13, 2009

But this is a great example of why we must be smarter about abbreviations: abbreviations are style-specific!

That may be, but it's a ridiculous state of affairs; almost comical that we have to worry about these stupid details.

So, right: a) it won't be in 1.0, and b) I can't promise it will ever be in CSL.

adamsmith · August 13, 2009

but Bruce - while that may be true, the entire fact that there are more than 10 different citation styles at all is almost comically inefficient and unnecessary.
Yet I doubt we'll ever change that either - or at least not in the medium future.
So we'll just have to do our best to accommodate as much as reasonably possible of the existing styles.
Admittedly, "reasonable" is open to interpretation, but journal abbreviations seem quite common in the hard sciences.
What does bibtex do with that btw?

noksagt · August 13, 2009

That may be, but it's a ridiculous state of affairs; almost comical that we have to worry about these stupid details.

I used to feel the same way (based mostly on all of the flat databases out there). But, far from being "stupid details," I realized that these specific rules do make a good bit of sense: Journal-abbreviations conserve precious page space & speed perusal. But they, like all jargon, are (at the least) field-specific. Writing for a narrow audience allows you to be more concise than writing for a broader one. Why ignore something that is not only common, but also useful? It is easy enough to apply these rules that some publishers will do it for you. But there's really no reason why our authoring tools shouldn't do it in the first place.

noksagt · August 13, 2009

What does bibtex do with that btw?

Nothing fantastic. Some people use a bunch of '@string' statements in their .bib file as a sort-of preamble to automatically assign text for their standardized publication names. These statements are replaced & a new .bib is made for each journal, but it can be based on a single .bib database apart from these replacements.

Others use a reference manager (e.g. JabRef & probably even EndNote) that has multiple abbreviation lists & will make a new .bib for each journal.

One could put the abbreviations in the .bst file. But those are somewhat hard to write & the input from a .bib file is not standard & those lists aren't maintained.

makebst does come with a few lists, so some newer styles written w/ makebst now come with style-specific abbreviations.

I'd have to check on BibLaTeX....

bdarcus · August 13, 2009

It does come down to the question of reasonableness: some balance between need and ease-of-implementation (both in CSL, and in its implementations).

Maybe this is a science thing; the range of possible journals is too large in many other fields to be having to account for every possible variation, and to do so would defeat the purpose anyway.

I don't actually think, though, that this has much of any benefit for readers (I find it really annoying to have to look up acronyms, for example); more likely a cost-saving measure for print publishers.

That said, feel free to submit a ticket to xbib, with a suggested solution.

But I'll say upfront that there are some devil-in-details issues here.

For example, I have always resisted adding regular expression support to CSL. I will almost surely continue to resist that.

Also, it might as well be a general abbreviation solution, considering that historians and others, for example, sometimes use these sorts of abbreviations for archival collections, or committees and such. Maybe:

<string-substitute match="some-science-journal">
  <text value="SSJ"/>
</string-substitute>

That might make sense for CSL, but might have some performance implications if all strings had to be run through this sort of a process.

noksagt · August 13, 2009

I'd have to check on BibLaTeX....

Nothing particularly clever. You can do many of the same things as in BibTeX (I also forgot that you can use a TeX command for your journal name & define these commands in your .tex file). They've added a 'shortjournal' field, which allows a single abbreviation for each reference (similar to Zotero). This field is not used by most styles, though.

noksagt · August 13, 2009

I don't actually think, though, that this has much of any benefit for readers (I find it really annoying to have to look up acronyms, for example)

Agreed that they are a pain to "newcomers" who don't recognize the abbreviations. But abbreviation is done everywhere, at least casually. "APL" is faster to type in an email or to read than "Appl. Phys. Lett." & I've seen ad hoc abbreviations for journals used by some non-scientists, especially when they're collaborating on a paper.

more likely a cost-saving measure for print publishers.

;-) They'd sell it as an author benefit, though: they limit you to a certain number of pages (based at least partially on their costs) & you can fill more of your allotment with your own content than with citations.

For example, I have always resisted adding regular expression support to CSL. I will almost surely continue to resist that.

I can see that (but I can also see that it'd be a natural feature request if string substitutions became available).

That might make sense for CSL, but might have some performance implications if all strings had to be run through this sort of a process.

I had thought this too, but it works fast enough that JabRef users seem happy enough (though with so many equally bad solutions out there, it doesn't take that much!). Implementers would probably want to be sure to cache abbreviations, such that we didn't have to re-run the replacement on all existing citations each time a new citation was added.
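The caching idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `CachingAbbreviator`, the `scans` counter, and the sample substitution list are all hypothetical and do not come from JabRef or any CSL processor.

```python
# Hypothetical substitution list: (full name, abbreviation) pairs.
SUBSTITUTIONS = [
    ("The Astrophysical Journal", "ApJ"),
    ("Applied Physics Letters", "APL"),
]


class CachingAbbreviator:
    """Resolve each distinct title once, then reuse the stored result."""

    def __init__(self, substitutions):
        self._substitutions = substitutions
        self._cache = {}
        self.scans = 0  # number of full list scans, kept for illustration

    def abbreviate(self, title):
        # Titles seen before are answered from the cache without a scan.
        if title in self._cache:
            return self._cache[title]
        self.scans += 1
        result = title  # unknown titles are passed through unchanged
        for full_name, short_name in self._substitutions:
            if full_name.lower() == title.lower():
                result = short_name
                break
        self._cache[title] = result
        return result
```

With this shape, adding a new citation only triggers a scan for a title that has not been seen before; existing citations are served from the cache.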

fbennett · August 13, 2009

I agree with Bruce that per-journal abbreviations don't belong in CSL markup, and agree with noksagt that they ought to be handled somewhere. Abbreviations are certainly important in law; the Bluebook has a list of mandatory BB-specific journal title abbreviations that stretches across several pages. As for worrying about "stupid details", well, right. But it's what adamsmith said, isn't it: if the CSL processor does the worrying, then we don't have to.

Hmm. Algorithms are never going to work for this one.

The processor API can be set up to accept a hashed list of abbreviations that is maintained elsewhere. Inside the processor, applying the substitution to journal titles is trivial, and would have no impact on the CSL schema and almost no impact on performance. So that part is easy. In fact, I'll implement it this afternoon. :)
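A minimal sketch of the kind of lookup described above, assuming the processor is handed the abbreviation list as a plain dictionary keyed by full journal title. The function name `abbreviate_title` and the sample entries are illustrative only and are not taken from any real processor API.

```python
# Hypothetical per-style abbreviation list, keyed by full journal title.
APJ_ABBREVIATIONS = {
    "The Astrophysical Journal": "ApJ",
    "Astrophysical Journal": "ApJ",
}


def abbreviate_title(title, abbreviations):
    """Return the style-specific abbreviation, or the title unchanged."""
    return abbreviations.get(title, title)
```

Passing an empty dictionary leaves every title as it is, which matches the "no list, no abbreviations" behaviour.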

The heavy lifting is in composing and maintaining the per-journal abbreviation lists themselves. That would require a smooth online revision mechanism and a process of review. After that's in place, you just need a means of delivering the lists to the processor together with journal CSL files. No list, no abbreviations. Simple.

noksagt · August 13, 2009

I agree with Bruce that per-journal abbreviations don't belong in CSL markup

Why? If there are style-specific formatting guidelines, I'd argue that CSL is the best place for them! Further, you wrote later about the possibility of implementing substitution into citeproc-js.

The heavy lifting is in composing and maintaining the per-journal abbreviation lists themselves.

Agree that it'd initially be susceptible to a high amount of change. But I'd expect it to stabilize.

That would require a smooth online revision mechanism and a process of review.

This is no different than any other part of a CSL file, though.

If we are really worried about the high rate of initial changes to abbreviation lists and/or the inherent overlap, perhaps that's one reason to think about linking to the substitution lists from the CSL file.

fbennett · August 13, 2009

This is no different than any other part of a CSL file, though.

True. Just to avoid possible misunderstanding, I think this is both doable and desirable.

The idea behind keeping it off CSL markup is that, if you have one list per journal, and the lists can change independently, then the link is implicit. There might be lots of shared lists and shared sublists and whatnot, but there's nothing gained by coding those relations into the CSL files. Basically you just need some mechanism for answering the question, "What abbreviations should I use with this style"?

If an archive or interface is set up someplace, I'll be happy to feed in the stuff for Bluebook.

bdarcus · August 13, 2009

@noksagt: just how variable are these abbreviations likely to be across styles? And what, if anything, structures the variations?

Am wondering if it's at all feasible to put in data, as with the linked periodical data stuff I've previously mentioned.

rfrench · August 13, 2009

Hi all -

Wow! I wasn't expecting such a lively debate on what I thought was a simple question. I will throw in my two cents since I started the whole thing:

1) Yes, it's annoying that journals use different citation styles, and yes, it's annoying that journals require different abbreviations. However, isn't that the whole *point* of CSL? If there were only one citation style, we wouldn't need CSL at all. As a random user, I would want CSL to be sophisticated enough that it could handle what I need to do to satisfy someone's writing guidelines. The guidelines aren't going to change just because I'm using CSL. If CSL can't do what I need to do as an author, I will just have to create/edit the bibliography by hand, which defeats the whole purpose. I need to emphasize that these abbreviations are NOT OPTIONAL when submitting papers.

2) I'm finding it hard to believe that performance is really a big issue here. What kind of bibliography is someone going to have to create that running a few dozen IF statements or regular expressions is going to cause a noticeable delay?

3) While I understand people are trying to solve the Big Picture problem, I just want to write a paper that's due in a few weeks :-) No one actually answered the original question - can I do a comparison of a variable against a constant string? That certainly works for me in the short term, and probably even in the long term. It also seems like a feature that could have many applications in various styles, and wouldn't be difficult to implement.

Thanks,
Rob

bdarcus · August 13, 2009

1) Yes, it's annoying that journals use different citation styles, and yes, it's annoying that journals require different abbreviations. However, isn't that the whole *point* of CSL?

Let's not be pedantic here. CSL already covers a ton of real-world styling issues, but there are all kinds of bizarre corner cases that it will never support, because doing so requires too much effort, for unclear benefit.

Am not saying this example necessarily applies, but we do have to balance competing issues here.

2) I'm finding it hard to believe that performance is really a big issue here. What kind of bibliography is someone going to have to create that running a few dozen IF statements or regular expressions is going to cause a noticeable delay?

Well, the solution I am contemplating here wouldn't involve any if statements: it would run all strings (maybe from a limited list of variables) through the function. But you're probably right.

3) While I understand people are trying to solve the Big Picture problem, I just want to write a paper that's due in a few weeks :-) No one actually answered the original question - can I do a comparison of a variable against a constant string?

No. You'll just have to run a manual search-and-replace (or script) at the end when you submit.
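For the script route, something along these lines might do as a stopgap. It is a rough sketch that assumes the bibliography has been exported as plain text; the `REPLACEMENTS` mapping and the function name `apply_replacements` are made up for illustration.

```python
# Hypothetical mapping from full journal names to ApJ-style abbreviations.
REPLACEMENTS = {
    "The Astrophysical Journal": "ApJ",
    "Astrophysical Journal": "ApJ",
}


def apply_replacements(text, replacements):
    """Replace every full journal name in text with its abbreviation."""
    # Longer names go first so "The Astrophysical Journal" is not
    # partially rewritten by the shorter "Astrophysical Journal" entry.
    for full_name in sorted(replacements, key=len, reverse=True):
        text = text.replace(full_name, replacements[full_name])
    return text
```

Because this is plain substring replacement, a longer title that merely contains one of the names would also be rewritten, so the output still needs a quick check by eye.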

rfrench · August 13, 2009

Thanks, bdarcus.

Don't get me wrong - I love Zotero and the CSL plugins, and it has really changed the way I work.

Rob

Rintze · August 14, 2009

it might as well be a general abbreviation ... That might make sense for CSL, but might have some performance implications if all strings had to be run through this sort of a process.

Perhaps an optional abbreviation attribute on cs:text would help reduce the performance hit, e.g.:

<text variable="container-title" form="short"/>
<text variable="container-title" abbreviation="BIOSIS"/>

The first line would use the abbreviated journal field. The second line can use any identifying information (ISSN, journal name, journal abbreviation) to look up the correct abbreviation in a list of journal abbreviations (or the journal abbreviation can be generated using an algorithm).
(BIOSIS is a commonly used abbreviation-list in the life sciences, which, very strangely, isn't made available online: http://www.library.illinois.edu/biotech/j-abbrev.html. BTW, can we expect any legal hassle when we'd ship these lists with CSL?)

bdarcus · August 14, 2009

On legal issues, this is part of the reason I'm wondering if this wouldn't be better handled in the periodicals data incubator effort, where there is some infrastructure to ensure the data is truly open. But for the science people here, it couldn't hurt to look into getting access to an open list of these abbreviations.

noksagt · August 14, 2009

I don't think the abbreviation attribute would work: some abbreviation lists are very journal-specific, so there'd be a huge number of values that attribute could take on.

I'm also relatively unconcerned about legal hassles: there are multiple open source projects that ship with abbreviation lists & there are lists that are maintained privately by libraries, at least one of which would likely grant permission for use. The more narrow, journal-specific lists are often too short and lack any sort of novelty to retain a US copyright & the journals would have a material benefit and no material harm from having them adopted. We won't be able to "copy" verbatim from some of the commercial/subscription lists, but I don't think we have to.

Rintze · August 14, 2009

I don't think the abbreviation attribute would work: some abbreviation lists are very journal-specific, so there'd be a huge number of values that attribute could take on.

What if the attribute values would be URIs (e.g. pointing to the periodicals data incubator project resources)?

noksagt · August 14, 2009

What if the attribute values would be URIs (e.g. pointing to the periodicals data incubator project resources)?

I think that should work & is similar to what I proposed.

fbennett · September 21, 2009

Me, writing on August 13th:

In fact, I'll implement it this afternoon. :)

That was a little optimistic, but the new processor does now recognize abbreviation lists. Tests are available here (see items with prefix "abbrevs_").

The tests illustrate the data format of the list (a simple JSON key/value list, with the full name of the journal as key). There's a trivial API for installing a list in the processor, which can be adapted to whatever scheme emerges from this discussion.
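As a rough illustration of the shape described here (a JSON key/value list with the full journal name as key), the sketch below parses such a list and installs it. It is written in Python rather than against the actual processor, and the class `AbbreviationRegistry`, its `install` and `lookup` methods, and the sample entries are all assumptions, not the real API or test data.

```python
import json

# Hypothetical list contents; the real test fixtures may differ.
ABBREVS_JSON = """
{
    "The Astrophysical Journal": "ApJ",
    "Applied Physics Letters": "APL"
}
"""


class AbbreviationRegistry:
    """Toy stand-in for a processor that accepts an abbreviation list."""

    def __init__(self):
        self._journals = {}

    def install(self, json_text):
        """Parse a JSON object of full-name/abbreviation pairs and store it."""
        data = json.loads(json_text)
        if not isinstance(data, dict):
            raise ValueError("abbreviation list must be a JSON object")
        self._journals.update(data)

    def lookup(self, title):
        """Return the installed abbreviation, or the title unchanged."""
        return self._journals.get(title, title)
```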

EDIT: Adjusted URL of link to tests.

mishas · December 8, 2009

I don't really see a problem with having a per-journal (or per-set-of-journals) abbreviation list. Host them right along with the CSL files and have a field in CSL that specifies the list you'd like to use. Have one generic list that is referenced by default if the style can't find the desired list (or don't abbreviate at all in this case). Most journals will be covered by 2 or 3 lists, I am guessing, and crowd-sourcing the problem will quickly generate the required lists.
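The fallback order suggested here (style-specific list, then a generic default, then no abbreviation) might look something like the sketch below. The function name `resolve_abbreviation` and both sample lists are hypothetical; only the "ApJ" and "Astrophys. J." forms come from earlier in this thread.

```python
# Hypothetical style-specific list (e.g. shipped alongside an ApJ style).
STYLE_LIST = {"The Astrophysical Journal": "ApJ"}

# Hypothetical generic list used when the style list has no entry.
DEFAULT_LIST = {
    "The Astrophysical Journal": "Astrophys. J.",
    "Applied Physics Letters": "Appl. Phys. Lett.",
}


def resolve_abbreviation(title, style_list=None, default_list=None):
    """Try the style list, then the default list, else keep the title."""
    for candidate in (style_list, default_list):
        if candidate and title in candidate:
            return candidate[title]
    return title
```

A style without its own list simply passes `None` for `style_list` and gets the generic abbreviations, and a title found in neither list is left unabbreviated.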

I know it's more work and someone has got to do it, but this really seems like the simplest solution that solves the most problems.

The way things are at the moment is a pretty terrible band-aid solution: without the ability to at least batch edit the abbreviation fields, it's tedious work making sure everything is proper even for a single journal!