Making sure language codes are captured where available
There have been several threads recently where users were caught out by the recent change to several Chicago styles, which now use title case for item titles unless a language code such as "fr-FR" is correctly entered in the 'language' field. Once explained, this isn't a big deal and the choice makes eminent sense, but it does leave people having to retro-fill the language field (of my 3000+ items, probably 2/3 are non-English and thus need a language code added).
While there is not much to be done about the retro-filling until batch-editing is available, I thought it might be worth making sure that translators fill in language codes whenever the source provides them, to at least reduce these kinds of issues from now on.
Here's a list compiled from the sources I tend to use, for starters. I am sure others could contribute.
Translators which provide language codes correctly:
- revues.org - sample url : jsa.revues.org/index11990.html
Translators which fill in language fields, but not with correct codes:
- sudoc.abes.fr captures 'français' which won't work for the aforementioned styles. Sample url : http://www.sudoc.abes.fr/DB=2.1/SET=1/TTL=1/SHW?FRST=7
- WorldCat and thus ISBN-lookup captures 'French' which won't work either : sample ISBN : 9782354570163
- catalogue.bnf.fr for the French national library captures 'fre' which doesn't work either. Sample : http://catalogue.bnf.fr/servlet/biblio?idNoeud=1&ID=38906233&SN1=0&SN2=0&host=catalogue
Translators which do not fill in the language field but where the information seems to be available:
- Amazon.com doesn't seem to pick up the indications - sample url : http://www.amazon.com/Composition-Francaise-Retour-Enfance-Bretonne/dp/2070437884/ref=sr_1_9?ie=UTF8&qid=1329729444&sr=8-9
- Google Scholar doesn't seem to work either on any references I tested, but I can't tell whether the info is there in the first place
- no luck on JSTOR either, but again I can't tell if the info is actually there. Sample http://www.jstor.org/pss/40903264
Those are my main sources for references - can someone who understands translator architecture comment on what is possible as to filling in language codes? Especially for the cases where the info arrives in the wrong format for it to work properly?
In the meantime I'll train the next bunch of French students to fill in the language field manually.
I feel like this needs to be improved on the parsing side of things and not on the translator side of things. We can't really convert the field content to valid two-letter codes in all translators.
I would like this to work if people put in "French, french, francais, fre [which is afaik the ISO three letter code], FR or fr" and similarly for other languages.
I'd be interested in what Frank, who has spent a lot of time pondering multilingual questions, thinks about this going forward.
I currently see three possibilities:
1. Current behavior. Advantage: no false positives. Downside: we can't rely on auto-imported field content, and it requires people to know the locale codes.
2. Zotero looks at the first two letters of the language field, case-insensitively. Advantage: this will work for many more cases, including most ISO codes and many languages entered as whole words. Downside: it will still not work smoothly for many languages. For German, for example, the field needs to say de or Deutsch, not German (i.e. be in the native language), whereas for Finnish it needs to be Finnish or fi, not suomi (i.e. be in English). Both cases are common: es matches español but not Spanish, while ja matches Japanese but not nihongo.
3. Zotero treats any field that is neither empty nor starting with a case-insensitive "EN" as non-English. Advantage: almost anything a user or translator puts in the field will work. Downside: lots of false positives for people with random stuff in the language field.
I favor 3. - I think the current behavior accommodates people with bad data in their database to the detriment of people who have good data, but input in a non-technical way.
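To make the trade-offs concrete, here is a rough Python sketch of the three options. It is only an illustration, not how Zotero or citeproc-js actually implement this: the function names are hypothetical, and the strict pattern used for option 1 is my assumption about what "correctly entered" means.

```python
import re

# Assumption: option 1 ("current behavior") is modelled as accepting only a
# two-letter lowercase code with an optional two-letter uppercase region,
# e.g. "fr" or "fr-FR". The real processor's rule may differ.
_STRICT_CODE = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")


def option1_strict(field: str) -> bool:
    """Non-English only if the field is a well-formed, non-'en' locale code."""
    return bool(_STRICT_CODE.match(field)) and not field.startswith("en")


def option2_first_two_letters(field: str) -> str:
    """Guess a language code from the first two letters, ignoring case."""
    return field.strip()[:2].lower()


def option3_anything_but_en(field: str) -> bool:
    """Non-English if the field is non-empty and does not start with 'en'."""
    value = field.strip().lower()
    return bool(value) and not value.startswith("en")


if __name__ == "__main__":
    for sample in ["fr-FR", "French", "fre", "German", "suomi", "Anglais", ""]:
        print(repr(sample),
              option1_strict(sample),
              option2_first_two_letters(sample),
              option3_anything_but_en(sample))
```

With these definitions, "German" yields the guess "ge" and "suomi" yields "su" under option 2, which is the mismatch described above, while option 3 flags "Anglais" as non-English, which is the kind of false positive it trades for convenience.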
The language code parsing is in citeproc-js, not Zotero, right? If so, is the title-casing behavior configured solely by the language code? Does the processor use the language value for anything else? (For example, is it usable by styles? I'm guessing not if it's expecting ISO codes.) Because if we were doing #3, either we'd have to pass some arbitrary code regardless of field content or citeproc-js would need to provide another option to disable title-casing.
Re: #2, the parsing could be a bit more sophisticated than that. We could probably parse most legitimate values fairly easily. If the processor might use the language codes for anything else in the future, that would be better.
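As a rough idea of what slightly more sophisticated parsing could look like, here is a Python sketch that maps the kinds of values mentioned in this thread ("French", "francais", "fre", "FR", "fr-FR") to a two-letter code. The lookup table and the function names are hypothetical and deliberately tiny; a real version would need a complete list of language names and ISO 639 codes.

```python
import unicodedata
from typing import Optional

# Hypothetical, deliberately incomplete table: folded input -> two-letter code.
_KNOWN = {
    "fr": "fr", "fre": "fr", "fra": "fr", "french": "fr", "francais": "fr",
    "en": "en", "eng": "en", "english": "en",
    "de": "de", "ger": "de", "deu": "de", "german": "de", "deutsch": "de",
}


def _fold(text: str) -> str:
    """Lowercase and strip accents so 'Français' and 'francais' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_language(value: str) -> Optional[str]:
    """Return a two-letter code for a recognised value, or None otherwise."""
    folded = _fold(value)
    if not folded:
        return None
    # Only the primary subtag matters here, so "fr-FR" and "fr_FR" become "fr".
    primary = folded.replace("_", "-").split("-")[0]
    return _KNOWN.get(primary)


if __name__ == "__main__":
    for sample in ["French", "français", "fre", "FR", "fr-FR", "Klingon"]:
        print(repr(sample), "->", normalize_language(sample))
```

Returning None for unrecognised values leaves the decision about what to do with them (treat as English, treat as non-English, warn the user) to the caller.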
I'm a bit concerned that neither option is sufficient to avoid confusion among users who don't spend time in these forums, but I'm not sure what else we can do.
Does the bibliography locale setting affect this? That is, if there's no value for Language but the locale is set to fr-FR, does the title casing still happen?
- Treat an empty Language field as non-English if the CSL locale is set to a non-English value. As Dan says, that will clear many cases without intervention.
- Be indifferent to case when parsing out the field value. Case discrimination is discretionary in RFC 5646; the common convention of lowercase for the first element is not binding, so the processor should not trip over uppercase elements.
- Treat non-conforming values as non-English across the board. Users who enter something in the field seem most often to be trying to disable title casing. Where they are trying to do something else, it's easy to explain what's going on, and unstructured content can be cleaned up with batch editing in a future version.
- Provide a processor toggle for disabling text-case transforms. This would allow introduction of a UI option (similar to the URL toggle) if it's found to be necessary.
If that all makes sense, I can introduce the changes in a fresh processor release later today. Another touch to help dispel confusion might be first-run guidance on the Language field, with a link to a list of ISO codes on the Web.
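For anyone who wants to see the four rules in the list above as one decision, here is a small Python sketch. It is not the citeproc-js code: the function and parameter names are hypothetical, and it assumes that only a primary subtag of exactly "en" counts as English, so a free-text value such as "English" falls under the non-conforming rule. Whether that is the right call for free-text English values is not settled in this thread.

```python
def primary_subtag(tag: str) -> str:
    """Lowercased first element of a language tag ('EN-us' -> 'en')."""
    return tag.strip().lower().replace("_", "-").split("-")[0]


def should_title_case(language_field: str,
                      default_locale: str = "en-US",
                      disable_text_case: bool = False) -> bool:
    """Decide whether an item's title gets title-cased (hypothetical helper)."""
    # Rule 4: a toggle that switches text-case transforms off entirely.
    if disable_text_case:
        return False
    # Rule 1: an empty field follows the style's default locale.
    if not language_field.strip():
        return primary_subtag(default_locale) == "en"
    # Rules 2 and 3: compare case-insensitively; anything whose primary
    # subtag is not "en" (including non-conforming text) is non-English.
    return primary_subtag(language_field) == "en"


if __name__ == "__main__":
    print(should_title_case("", "en-US"))
    print(should_title_case("", "fr-FR"))
    print(should_title_case("EN-us", "fr-FR"))
    print(should_title_case("français", "en-US"))
```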
http://rst.projectfondue.com/api/v1/rst2html/?rst_url=https://raw.github.com/citation-style-language/documentation/master/specification.txt&css_url=http://citation-style-language.github.com/styles/css/screen.css&output_type=html&callback=&document_output=whole&highlight_style=manni#title-case-conversion
"If default-locale is set to a locale code with a primary language tag other than "en", items are assumed to be non-English. An item is only considered to be English if the value of its language field starts with the "en" primary language tag."
So in a German or Spanish style, you could tag items as English and they will Title Case.
[edit: @fbennett, yes, I understand. I was just curious about @adamsmith's remark directly above]
For "Anglais" and "Inglês", I've added a processor method for setting arbitrary English overrides in the instantiated processor, in case it's needed. Usage is documented in the CHANGES.txt file (at version 1.0.287) in the citeproc-js source archive.