Making sure language codes are captured where available

FHeimburger · February 20, 2012

There have been several threads recently where users were caught out by the recent change to several Chicago styles which uses title case for item titles unless language codes such as "fr-FR" are correctly entered in the 'language' field. Once explained, this isn't a big deal and the choice makes eminent sense, but it does leave people having to retro-fill the language field (of my 3000+ items, probably 2/3 are non-English and thus need a language code added).
While there is nothing much to be done about the retro-filling until batch-editing is available, I thought it might be worth making sure that translators fill in language codes whenever the source provides them, to at least reduce these issues going forward.
Here's a list, for starters, compiled from the sources I tend to use. I am sure others could contribute.

Translators which provide language codes correctly:
- revues.org - sample url : jsa.revues.org/index11990.html

Translators which fill in language fields, but not with correct codes:
- sudoc.abes.fr captures 'français' which won't work for the aforementioned styles. Sample url : http://www.sudoc.abes.fr/DB=2.1/SET=1/TTL=1/SHW?FRST=7
- WorldCat and thus ISBN-lookup captures 'French' which won't work either : sample ISBN : 9782354570163
- catalogue.bnf.fr for the French national library captures 'fre' which doesn't work either. Sample : http://catalogue.bnf.fr/servlet/biblio?idNoeud=1&ID=38906233&SN1=0&SN2=0&host=catalogue

Translators which do not fill in the language field but where the information seems to be available:
- Amazon.com doesn't seem to pick up the indications - sample url : http://www.amazon.com/Composition-Francaise-Retour-Enfance-Bretonne/dp/2070437884/ref=sr_1_9?ie=UTF8&qid=1329729444&sr=8-9
- Google Scholar doesn't seem to work either on any references I tested, but I can't tell whether the info is there in the first place
- no luck on JSTOR either, but again I can't tell if the info is actually there. Sample http://www.jstor.org/pss/40903264

Those are my main sources for references - can someone who understands translator architecture comment on what is possible as to filling in language codes? Especially for the cases where the info arrives in the wrong format for it to work properly?
In the meantime I'll train the next bunch of French students to fill in the language field manually.
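To make the "wrong format" cases concrete, here is a minimal sketch of how a value such as 'français', 'French' or 'fre' could be mapped to a two-letter code. It is not Zotero or translator code: the function name, the lookup table and the loose tag check are all assumptions for illustration, and the table covers only the French values reported above plus "fra", the other ISO 639-2 code for French.

```python
import re
from typing import Optional

# Hypothetical lookup table: only the French values reported in this thread,
# plus "fra", the second ISO 639-2 code for French.
KNOWN_VALUES = {
    "français": "fr",
    "francais": "fr",
    "french": "fr",
    "fre": "fr",
    "fra": "fr",
}

# Loose shape check for tags such as "fr" or "fr-FR".
# This is an assumption for the sketch, not a full RFC 5646 parser.
TAG_SHAPE = re.compile(r"^[a-z]{2}(-[a-z]{2})?$", re.IGNORECASE)


def normalize_language(value: str) -> Optional[str]:
    """Return a language code for a raw field value, or None if unknown."""
    cleaned = value.strip()
    if not cleaned:
        return None
    if TAG_SHAPE.match(cleaned):
        # Already looks like a code; pass it through unchanged.
        return cleaned
    # Case-insensitive lookup of language names and three-letter codes.
    return KNOWN_VALUES.get(cleaned.casefold())


for raw in ["français", "French", "fre", "fr-FR"]:
    print(raw, "->", normalize_language(raw))
```

A table like this would have to be maintained per language, which is the cost of doing the conversion in every translator.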

adamsmith · February 20, 2012

Nothing to be done about Google Scholar and JSTOR. The former doesn't have any language data at all; the latter doesn't have language in the RIS, which we currently use, and has the wrong language (English, in the case of your example) in the BibTeX.

I feel like this needs to be improved on the parsing side of things and not on the translator side of things. We can't really convert the field content to valid two-letter codes in all translators.
I would like this to work if people put in "French, french, francais, fre [which is afaik the ISO three letter code], FR or fr" and similarly for other languages.

I'd be interested in what Frank, who has spent a lot of time pondering multilingual questions, thinks about this going forward.

I currently see three possibilities:
1. Current behavior. Advantage: no false positives. Downside: we can't rely on auto-imported field content and it requires people to know the locale codes.
2. Zotero looks at the case-insensitive first two letters of the language field. Advantage: Will work for many more cases including most ISO codes and many languages input as whole words. Downside: This will still not work smoothly for many languages - e.g. for German the field needs to say de or Deutsch, not German (i.e. be in the native language), but e.g. for Finnish it will need to be Finnish or fi, not suomi, i.e. be in English. Both cases are common: es for español instead of Spanish, ja for Japanese instead of nippon etc.
3. Zotero treats any field that's not empty or starting with a case-insensitive "EN" as non-English. Advantage: Almost anything a user or translator puts in the field will work. Downside: lots of false positives for people with random stuff in the language field.

I favor 3. - I think the current behavior accommodates people with bad data in their database to the detriment of people who have good data, but input in a non-technical way.
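Options 2 and 3 can be sketched as follows. This is only an illustration of the two rules as described in this post, not code from Zotero or citeproc-js, and both function names are made up. Option 1 is left out because the exact matching rule of the current behavior isn't spelled out here.

```python
def option2_code(field: str) -> str:
    """Option 2: take the first two letters, ignoring case, as the code."""
    return field.strip()[:2].lower()


def option3_is_non_english(field: str) -> bool:
    """Option 3: any non-empty value not starting with 'en' (any case)."""
    value = field.strip().lower()
    return bool(value) and not value.startswith("en")


# Values taken from the examples in the post above.
for value in ["Deutsch", "German", "Finnish", "suomi", "español", "Spanish"]:
    print(value, option2_code(value), option3_is_non_english(value))
```

Under option 2, "German" yields "ge" and "suomi" yields "su", neither of which is the intended "de" or "fi". Under option 3 every one of these values simply counts as non-English.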

dstillman · February 20, 2012

Is this rule really everything-except-English?

The language code parsing is in citeproc-js, not Zotero, right? If so, is the title-casing behavior configured solely by the language code? Does the processor use the language value for anything else? (For example, is it usable by styles? I'm guessing not if it's expecting ISO codes.) Because if we were doing #3, either we'd have to pass some arbitrary code regardless of field content or citeproc-js would need to provide another option to disable title-casing.

Re: #2, the parsing could be a bit more sophisticated than that. We could probably parse most legitimate values fairly easily. If the processor might use the language codes for anything else in the future, that would be better.

I'm a bit concerned that neither option is sufficient to avoid confusion among users who don't spend time in these forums, but I'm not sure what else we can do.

Does the bibliography locale setting affect this? That is, if there's no value for Language but the locale is set to fr-FR, does the title casing still happen?

adamsmith · February 20, 2012

Is this rule really everything-except-English?

we did some relatively thorough research on this and couldn't find any other language that title-cases. So yes, this is indeed "everything except English".

The language code parsing is in citeproc-js, not Zotero, right?

correct.

If so, is the title-casing behavior configured solely by the language code?

Frank would have to say for sure, but yes, that's how I understood him.

Does the processor use the language value for anything else? (For example, is it usable by styles? I'm guessing not if it's expecting ISO codes.)

Yes, it is usable (though currently not used) in styles, another reason I don't think ISO codes are a great solution. On the other hand, Frank may have multilingual-related reasons to want to restrict field values - but then, of course, this shouldn't be a free-text field.

Because if we were doing #3, either we'd have to pass some arbitrary code regardless of field content or citeproc-js would need to provide another option to disable title-casing.

not sure I follow here. This option will disable title-casing in all instances it's disabled in by 1 and 2 and some more - in other words, 1) is a subset of 2) is a subset of 3). But I think I'm just missing your point.

I'm a bit concerned that neither option is sufficient to avoid confusion among users who don't spend time in these forums, but I'm not sure what else we can do.

agreed, but yes, I don't see a good way out.

Does the bibliography locale setting affect this? That is, if there's no value for Language but the locale is set to fr-FR, does the title casing still happen?

don't know - Frank will have to say.

dstillman · February 20, 2012

On the other hand, Frank may have multilingual-related reasons to want to restrict field values - but then, of course, this shouldn't be a free-text field.

It probably shouldn't have ever been a free-form field. I had a plan to make transitioning to (or at least introducing) a fixed list easier. That plan predates MLZ, so interpret "fixed list" as appropriate.

not sure I follow here. This option will disable title-casing in all instances it's disabled in by 1 and 2 and some more - in other words, 1) is a subset of 2) is a subset of 3). But I think I'm just missing your point.

I mean that if this is configurable in the processor just via the passed language field, then to do #3 and treat all non-"en" strings as non-English we'd have to pass an arbitrary code—say, "fr-FR"—for everything to trick citeproc-js into not applying title-casing. Ideally citeproc-js would provide another way of configuring this.

don't know - Frank will have to say.

If it's not the behavior now, we should probably consider the empty string to mean non-English when other locales are in effect. I suspect that would address this problem for most affected users.

fbennett · February 20, 2012

Yes, with the stream of queries over our first step into the multilingual domain, this needs to be sorted out. Here is a set of proposals:

1. Treat an empty Language field as non-English if the CSL locale is set to a non-English value. As Dan says, that will clear many cases without intervention.
2. Be indifferent to case when parsing out the field value. Case discrimination is discretionary in RFC 5646; the common convention of lowercase for the first element is not binding, so the processor should not trip over uppercase elements.
3. Treat non-conforming values as non-English across the board. Users who enter something in the field seem most often to be trying to disable title casing. Where they are trying to do something else, it's easy to explain what's going on, and unstructured content can be cleaned up with batch editing in a future version.
4. Provide a processor toggle for disabling text-case transforms. This would allow introduction of a UI option (similar to the URL toggle) if it's found to be necessary.

If that all makes sense, I can introduce the changes in a fresh processor release later today.

Another touch to help dispel confusion might be first-run guidance on the Language field, with a link to a list of ISO codes on the Web.
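Read together, the first three proposals amount to a small decision rule. The sketch below is one possible reading, not the citeproc-js implementation: the function names are hypothetical, and taking the "primary tag" to be everything before the first hyphen is an assumption.

```python
def primary_tag(tag: str) -> str:
    """Lowercased part of a tag before the first hyphen (assumed rule)."""
    return tag.strip().lower().split("-")[0]


def is_english_item(language_field: str, style_locale: str) -> bool:
    """Return True if the item would be treated as English (title-cased)."""
    if not language_field.strip():
        # Empty field: fall back to the CSL locale of the style.
        return primary_tag(style_locale) == "en"
    # Case-insensitive comparison; anything whose primary tag is not
    # exactly "en", including free text, counts as non-English.
    return primary_tag(language_field) == "en"


print(is_english_item("", "fr-FR"))
print(is_english_item("EN-gb", "de-DE"))
print(is_english_item("français", "en-US"))
```

Note that under this reading a free-text value such as "English" is non-conforming and would therefore also count as non-English.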

adamsmith · February 20, 2012

I have no opinion on 4, but 1-3 sound good to me.

dstillman · February 20, 2012

Same here. With those changes, I see no particular need for a toggle.

Rintze · February 20, 2012

Note that Frank's implementation also has an exception for styles with a non-English default-locale value. Relevant excerpt from the trunk CSL spec:

http://rst.projectfondue.com/api/v1/rst2html/?rst_url=https://raw.github.com/citation-style-language/documentation/master/specification.txt&css_url=http://citation-style-language.github.com/styles/css/screen.css&output_type=html&callback=&document_output=whole&highlight_style=manni#title-case-conversion

dstillman · February 20, 2012

One other thing: looking at some values on the server, "Anglais" is quite common, and "Inglês" appears as well. If this is meant to apply for English items in non-English bibliographies, the processor should probably check for "angl" and "ingl" as well.

fbennett · February 20, 2012

As Rintze notes, the bibliography (CSL) locale governs the behavior of items with no Language field value, so we're covered there.

One other thing: looking at some values on the server, "Anglais" is quite common, and "Inglês" appears as well. If this is meant to apply for English items in non-English bibliographies, the processor should probably check for "angl" and "ingl" as well.

I'd be a little more comfortable adopting a consistent policy in the processor, with exceptional remapping in the Zotero layer. Alternatively, I could introduce a Zotero-specific toggle in the processor, so that the field content can be passed through as-is, but with these particular strings recognized as equivalent to "en". First-run guidance on the field would also be good, to gently encourage people to use the ISO codes.

adamsmith · February 20, 2012

I'm sorry to report that both in German and Spanish, title casing of English titles is common in bibliographies.

Rintze · February 20, 2012

I think most people in the natural sciences don't necessarily need guidance on the language field. The vast majority of their literature is in English.

Rintze · February 20, 2012

@adamsmith, why is that a problem? The spec trunk currently reads:

"If default-locale is set to a locale code with a primary language tag other than "en", items are assumed to be non-English. An item is only considered to be English if the value of its language field starts with the "en" primary language tag."

So in a German or Spanish style, you could tag items as English and they will Title Case.

[edit: @fbennett, yes, I understand. I was just curious about @adamsmith's remark directly above]

fbennett · February 20, 2012

@Rintze: The logic is in there already; Dan's concern is just that users are having to learn the rules by coming to the forums for guidance, and some will not notice, or will give up before they get here. We're just exploring ways of smoothing the transition toward the regular inclusion of language hints in metadata. (It's not needed by all users or in all use cases, but we'll get network-effects benefit from the data once it becomes customary to include it, and it's a good thing to encourage.)

dstillman · February 20, 2012

I'd be a little more comfortable adopting a consistent policy in the processor, with exceptional remapping in the Zotero layer.

We can do this. It would just mean not sending those particular values through as is. That would only be an issue if people were using the field values in style output, but such usage would be obsoleted by an eventual switch to passing through proper locale codes anyway, so I'm not too concerned.
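A remapping step of this kind, applied before the field value is handed to the processor, might look like the sketch below. It is an illustration only: the function name is hypothetical, and the prefix list contains just the "angl" and "ingl" checks suggested earlier in the thread.

```python
# Prefixes suggested in this thread for values such as "Anglais" and "Inglês".
ENGLISH_PREFIXES = ("angl", "ingl")


def remap_for_processor(language_field: str) -> str:
    """Replace known free-text names for English with "en".

    Every other value is passed through unchanged.
    """
    if language_field.strip().casefold().startswith(ENGLISH_PREFIXES):
        return "en"
    return language_field


for raw in ["Anglais", "Inglês", "fr-FR"]:
    print(raw, "->", remap_for_processor(raw))
```

The trade-off is the one noted above: the original string no longer reaches the style, which only matters if a style prints the field value.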

adamsmith · February 20, 2012

@Rintze - right, of course.

fbennett · February 21, 2012

I've just checked in a release that addresses these items, as well as several things related to locale handling that had piled up.

For "Anglais" and "Inglês", I've added a processor method for setting arbitrary English overrides in the instantiated processor, in case it's needed. Usage is documented in the CHANGES.txt file (at version 1.0.287) in the citeproc-js source archive.

dstillman · February 21, 2012

Thanks, Frank. 1.0.287 is now available in the 3.0 Branch dev XPI. (No special handling for those two strings yet.)