Language-specific case associations and sort orders

sahin · April 11, 2010

Hi,
This should probably also be posted as a "new request", but I'll let the moderators decide.

I think a child element such as “sort-order” is needed in CSL to define language-specific case associations and sort orders.

Examples of where this would be needed:

sorting German characters with umlauts (ä, ö and ü),

Turkish case associations such as
- uppercase of U+0131 (LATIN SMALL LETTER DOTLESS I) is U+0049 (LATIN CAPITAL LETTER I) and
- uppercase of "i" (i.e. U+0069 LATIN SMALL LETTER I) is NOT "I" but U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
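
The two Turkish associations above can be sketched in JavaScript roughly as follows. This is only an illustration and not part of CSL: the helper names `turkishUpper` and `turkishLower` are hypothetical, and the sketch special-cases nothing but the dotted/dotless i pairs before falling back to the default case mapping.

```javascript
// Hypothetical helpers (not a CSL or browser API): handle the Turkish
// dotted/dotless i pairs first, then apply the default case mapping.
function turkishUpper(s) {
  return s
    .replace(/i/g, '\u0130')   // i -> U+0130 (capital I with dot above)
    .replace(/\u0131/g, 'I')   // U+0131 (dotless i) -> U+0049 (I)
    .toUpperCase();
}

function turkishLower(s) {
  return s
    .replace(/\u0130/g, 'i')   // U+0130 -> i
    .replace(/I/g, '\u0131')   // I -> U+0131 (dotless i)
    .toLowerCase();
}

console.log(turkishUpper('istanbul'));   // Turkish-style uppercase
console.log('istanbul'.toUpperCase());   // default mapping, for comparison
```

Where the engine supports the ECMAScript Internationalization API, passing a locale to the built-in method, as in `'i'.toLocaleUpperCase('tr')`, is likely the more robust route.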

Hint: It is worth taking a look at how Toolbox solves this problem with language encoding (lng) files.

Rintze · April 11, 2010

This seems somewhat related to http://forums.zotero.org/discussion/11498/accented-characters-not-captured-correctly/?Focus=55467#Comment_55467

Are these sort orders and case associations strongly language-specific, or would a language-agnostic character mapping suffice? It's (currently) not possible in CSL to set the language on a per-item basis, the only exception being the sorting of names based on the script used: http://citationstyles.org/downloads/specification.html#name-part-order.

fbennett · April 11, 2010

The new CSL processor will use the JavaScript localeCompare() function as the basis of sorts (see ECMA-262 [large PDF, relevant discussion is in section 15.5.4.9, see esp. Note 2 to that section]). In Firefox, this function uses the current locale collation to determine sort position, rather than the raw value of the Unicode character number, and should produce correct or near-correct sorts in most cases. If there are discrepancies between "generic" and language-specific collations, running the browser in the locale of the dominant language should tighten things up a bit.
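
As a rough illustration of the difference between sorting on raw character numbers and sorting with localeCompare(), here is a small sketch. The word list is made up for the example, and the explicit `'de'` locale argument is an assumption: it requires an engine with ECMAScript Internationalization support, which the processor described above does not necessarily rely on.

```javascript
// Made-up sample; \u00C4 is "Ä" and \u00D6 is "Ö".
const words = ['Zebra', '\u00C4pfel', 'Ofen', 'Apfel', '\u00D6l'];

// The default sort compares UTF-16 code units, so accented capitals
// land after "Z".
const byCodeUnit = [...words].sort();

// Locale-aware sort; the 'de' argument is an assumption of this sketch.
const byLocale = [...words].sort((a, b) => a.localeCompare(b, 'de'));

console.log(byCodeUnit.join(', '));
console.log(byLocale.join(', '));
```

The first list puts Äpfel and Öl after Zebra; with collation data available, the second should interleave them next to Apfel and Ofen.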

The one place where this will break down pretty comprehensively is with Asian ideographic scripts (Chinese, Japanese). To cope with those, a shadow sort field of some kind will be required (as is true of any software that must sort terms in those languages).
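
A shadow sort field could look something like the sketch below. The field name `sortKey`, the item shape, and the romanized readings are all assumptions made for illustration; nothing here reflects an existing CSL or Zotero field.

```javascript
// `sortKey` is a hypothetical shadow field holding a reading
// (romanized here) that is used for sorting instead of the title.
const items = [
  { title: '東京', sortKey: 'toukyou' },
  { title: '大阪', sortKey: 'oosaka' },
  { title: 'Kobe' },                     // no shadow field: sort on the title
  { title: '京都', sortKey: 'kyouto' },
];

// Prefer the shadow field when present, otherwise fall back to the title.
const keyOf = (item) => item.sortKey ?? item.title;

const sortedItems = [...items].sort((a, b) =>
  keyOf(a).localeCompare(keyOf(b), 'en')
);

console.log(sortedItems.map((item) => item.title).join(', '));
```

The displayed title stays untouched; only the comparison uses the reading.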

EDIT: Just in case that explanation seems complacent, here is partial output from a torture test for localeCompare() that I built before concluding that it basically dispenses with the sorting issue. The torture-test program produces 1,000 strings of random, mixed-script garbage, and then sorts them:

ŏ
ŏąęĄĻđęũŬŶčůłŌěĝšİŁťĎĮĩ
ŎāĒıũŴŅĿŻĩūİ
ŏăřęŗŤġŔăąĵŶŝŲōēŋĲĞēśāźő
őčŒűŲŅŦŁđŃŵģśĸĿįŗĔďĦŭĚğŬ
ŌĉŒŹŕřĴŚĶňŵŎŦĤ
œĂŇŶĕĶŵŻųĖāœĚċ
œčċŹ
œĈňŬĩĲĆŎĢĬĠŏűĄŴŽūŜĭħ
ŎĕċŜŏıŉđżĺĶěŤĽőŉŎģ
őęĤīŚŐŵŵĹŷĄń
œĭňļŷŷģńŅğĨņĜņŞ
œĪŞČŭňŒĬġĻŴňĸŌŅĝŹŋı
œĹĔĆŋĠĻ
œŏńşĿŎģŲŀĒňų
ŒŔĠĨĄĲĒąŷīīĞĸěďğŜ
ŐęŕŰŵċťŘŉĉĈĸŚžĜĂĞġĝăŞĕœ
œŘŶŮĲĤēĺĳ
ŒŨĤŞĦŖ
œŪľőň
œůŸĻĔĤŏĆśŃńėŸŤźğčľ
ŎĚżĮť
œĦņĬĞĐ
œŉŧťųŜŞńŠŵĶŹĎĥįűĭşĵįġőů
ŎĤ
ōĥŽņŔũ

... which seems about right.
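
For anyone who wants to try something similar, a torture test along these lines might be sketched as follows. This is not the program that produced the output above: the character range (U+0100 to U+017F), the maximum length, and the seeded generator are assumptions chosen to make the sketch small and repeatable.

```javascript
// Tiny seeded generator (a linear congruential generator) so that
// repeated runs produce the same strings.
function makeRandom(seed) {
  let state = seed >>> 0;
  return function next(n) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return (state >>> 16) % n; // integer in [0, n)
  };
}

// Build `count` random strings of 1..maxLen characters drawn from
// U+0100..U+017F (an assumed range).
function randomStrings(count, maxLen, seed) {
  const next = makeRandom(seed);
  const out = [];
  for (let i = 0; i < count; i++) {
    const len = 1 + next(maxLen);
    let s = '';
    for (let j = 0; j < len; j++) {
      s += String.fromCharCode(0x0100 + next(0x80));
    }
    out.push(s);
  }
  return out;
}

// Sort with the engine's default locale collation.
const collator = new Intl.Collator();
const sorted = randomStrings(1000, 24, 42).sort(collator.compare);

console.log(sorted.slice(0, 10).join('\n'));
```

Eyeballing the output is still needed to judge whether the order is plausible; the code only guarantees that the list is consistent with the collator.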

ajlyon · April 11, 2010

The issue with case transformations for Turkish (and some other Turkic languages in their Latin variants) is an open bug in Firefox (https://bugzilla.mozilla.org/show_bug.cgi?id=231162).

The sorting is a tough one: the sorting rules for Cyrillic-derived Turkic scripts have changed frequently, sometimes mandating that the language-specific characters (Tatar-specific ones, say) come after all of the Russian characters (especially pre-1991), and sometimes that they be sorted immediately after their graphically (and usually phonetically) similar counterparts in the standard Russian Cyrillic alphabet. I don't think that it's necessary to provide support for radically different and poorly defined sorting conventions, but keep in mind that such conventions do exist, and that publications could conceivably demand certain behavior with regard to them.
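
To make the two conventions concrete, here is a small sketch of a comparator built from an explicit alphabet. The helper name `alphabetComparator` is hypothetical, and the two Tatar orderings are illustrative assumptions rather than authoritative historical sequences.

```javascript
// Build a comparator from an explicit alphabet string (lowercase letters).
// Characters missing from the alphabet sort after it, by code point.
function alphabetComparator(alphabet) {
  const rank = new Map([...alphabet].map((ch, i) => [ch, i]));
  const rankOf = (ch) =>
    rank.has(ch) ? rank.get(ch) : alphabet.length + ch.codePointAt(0);
  return (a, b) => {
    const as = [...a.toLowerCase()];
    const bs = [...b.toLowerCase()];
    for (let i = 0; i < Math.min(as.length, bs.length); i++) {
      const diff = rankOf(as[i]) - rankOf(bs[i]);
      if (diff !== 0) return diff;
    }
    return as.length - bs.length; // shorter string first on a tie
  };
}

const russian = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя';
// Illustrative orderings only; the exact sequences are assumptions.
const extrasLast = russian + 'әөүҗңһ';
const interleaved = 'аәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщъыьэюя';

const sample = ['ят', 'үт', 'ут'];
console.log([...sample].sort(alphabetComparator(extrasLast)).join(', '));
console.log([...sample].sort(alphabetComparator(interleaved)).join(', '));
```

Under the first ordering the word starting with ү sorts last; under the second it sorts directly after the word starting with у.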

fbennett · April 11, 2010

I think we're on the same page. My only point was that sort ordering is a feature of locales on which CSL can depend. Where a locale collation is broken or missing, it needs to be fixed upstream.