Invalid language codes in language field

Mvolz-WMF · March 21, 2015

So, my understanding is that the language field is not validated, and that translators don't have to return valid ISO language codes.

Recently we've encountered translators that return values that LOOK like valid language codes but are not (example: http://www.pbs.org/newshour/making-sense/care-peoples-kids/ has language field value en_US)

Should these things just be reported/fixed as bugs, or is the general policy that since the field isn't validated, valid language codes don't have to be used?

aurimas · March 21, 2015

Correct, there's currently no validation or requirement for the language field. Where possible, we try to supply valid ISO codes, but more often than not the data is simply scraped from the page in whatever format it is presented. In the future, the field will likely get parsed into a correct ISO code, but that probably won't be guaranteed (i.e. API consumers can expect a valid code, but shouldn't break if the value is not a valid code).
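For API consumers, a minimal sketch of that kind of tolerant handling might look like the following. The helper name `normalize_language` is hypothetical and not part of Zotero or any translator; it only checks the shape of the value (two letters, optionally followed by a two-letter region), not whether the code is actually registered in ISO 639-1.

```python
import re

# Hypothetical helper for illustration; not part of Zotero's API.
# Accepts "en", "en-US", or "en_US" style values (shape check only).
_TAG = re.compile(r"([A-Za-z]{2})(?:[-_]([A-Za-z]{2}))?")


def normalize_language(value):
    """Return 'xx' or 'xx-YY' if value looks like a language code, else None."""
    if not value:
        return None
    match = _TAG.fullmatch(value.strip())
    if match is None:
        # Scraped free text such as "English": do not break, just give up.
        return None
    lang, region = match.groups()
    lang = lang.lower()
    return f"{lang}-{region.upper()}" if region else lang
```

With this, a value like `en_US` would come back as `en-US`, while free text such as `English` would come back as `None`, so the caller can fall back to the raw string.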

Mvolz-WMF · March 21, 2015

Okay, thanks! Do you have any idea what standard you might use?

adamsmith · March 21, 2015

Yeah, we're likely going to use the ISO two-letter language code followed by a two-letter country code, the way Mozilla abbreviates locales, i.e. en-US.
(That's what CSL already uses/understands for locales, and citeproc-js actually does understand it in the language field.) It's possible Dan will prefer a separation between display and database (the way it's done, e.g., with date added, which is stored as ISO but displayed as text), but given the complexity of language that seems tricky.
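To illustrate what a display/database separation could look like, here is a rough sketch that assumes the stored value is already in `en-US` form. The lookup tables and the `display_language` function are hypothetical and deliberately tiny; a real version would need full language and country data, which is part of why this gets tricky.

```python
# Hypothetical lookup tables for illustration only; real data would need
# to cover the full ISO 639-1 and ISO 3166-1 lists.
LANGUAGE_NAMES = {"en": "English", "de": "German", "fr": "French"}
REGION_NAMES = {"US": "United States", "GB": "United Kingdom", "DE": "Germany"}


def display_language(stored):
    """Turn a stored code like 'en-US' into text for display."""
    lang, _, region = stored.partition("-")
    name = LANGUAGE_NAMES.get(lang)
    if name is None:
        # Unknown or non-code value: show whatever is stored.
        return stored
    region_name = REGION_NAMES.get(region)
    return f"{name} ({region_name})" if region_name else name
```

Under these assumptions, `display_language("en-US")` would give `English (United States)`, and anything the tables don't know is shown unchanged.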