Language-specific case associations and sort orders

Hi,
This should probably also be posted as a "new request", but I will let the moderators decide.

I think a child element like “sort-order” is necessary in CSL to define language-specific case associations and sort orders.

Examples of what would need to be covered:

sorting German characters with umlauts (ä, ö and ü),

Turkish case associations like
- uppercase of U+0131 (LATIN SMALL LETTER DOTLESS I) is U+0049 (LATIN CAPITAL LETTER I) and
- uppercase of "i" (U+0069 LATIN SMALL LETTER I) is NOT "I" but U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
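
To make the Turkish case concrete, here is a minimal JavaScript sketch. It assumes an engine whose toLocaleUpperCase() honours the locale argument (that is, one built with internationalization support); `upperFor` and `codePoints` are made-up helper names for illustration, not part of CSL or any processor.

```javascript
// Hypothetical helper: uppercase `text` using the case rules of `lang`.
function upperFor(lang, text) {
  return text.toLocaleUpperCase(lang);
}

// Render a string as code points so dotted and dotless forms are easy to tell apart.
function codePoints(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log("tr, dotless i:", codePoints(upperFor("tr", "\u0131")));
console.log("tr, i:        ", codePoints(upperFor("tr", "i")));
console.log("en, i:        ", codePoints(upperFor("en", "i")));
```

Under Turkish rules the second call should report U+0130, while the same input under English rules should report U+0049, which is exactly the language dependence described above.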

Hint: It is worth taking a look at how Toolbox solves this problem with language encoding (lng) files.
  • edited April 11, 2010
    This seems somewhat related to http://forums.zotero.org/discussion/11498/accented-characters-not-captured-correctly/?Focus=55467#Comment_55467

    Are these sort orders and case associations strongly language-specific, or would a language-agnostic character mapping suffice? It is currently not possible in CSL to set the language on a per-item basis; the only exception is the sorting of names based on the script used: http://citationstyles.org/downloads/specification.html#name-part-order.
  • edited April 11, 2010
    The new CSL processor will use the JavaScript localeCompare() function as the basis for sorting (see ECMA-262 [large PDF; the relevant discussion is in section 15.5.4.9, see esp. Note 2 to that section]). In Firefox, this function uses the collation of the current locale to determine sort position, rather than the raw Unicode code point values, and should produce correct or near-correct sorts in most cases. If there are discrepancies between "generic" and language-specific collations, running the browser in the locale of the dominant language should tighten things up a bit.
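
    A rough illustration of the difference, not the processor's actual code: the sketch below sorts a few German words once by raw code unit value and once with a locale collation. It passes "de" explicitly through Intl.Collator so the run is repeatable, which is an assumption on my part; the approach described above relies on the browser's current locale instead. The word list is made up.

```javascript
const germanWords = ["Zebra", "Öl", "Apfel", "Ofen", "Äpfel"];

// Default sort() compares UTF-16 code units, so Ä and Ö land after Z.
const byCodeUnit = [...germanWords].sort();

// Locale-aware sort: the collator decides where Ä and Ö belong.
const deCollator = new Intl.Collator("de");
const byLocale = [...germanWords].sort(deCollator.compare);

console.log("code units:", byCodeUnit.join(", "));
console.log("collation: ", byLocale.join(", "));
```

    Calling a.localeCompare(b, "de") inside the comparator should give the same ordering; a shared collator object just avoids setting up the collation on every comparison.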

    The one place where this will break down pretty comprehensively is with Asian ideographic scripts (Chinese, Japanese). To cope with those, a shadow sort field of some kind will be required (as is true of any software that must sort terms in those languages).
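
    A minimal sketch of what such a shadow sort field could look like. The `sortKey` property name, the `shadowKey` helper and the sample records are all hypothetical; nothing like this is defined in CSL as far as this thread goes. Each item carries a phonetic reading, and the sort uses that reading instead of the ideographic title.

```javascript
// Hypothetical records: `sortKey` is the shadow field holding a kana reading.
const cityItems = [
  { title: "東京", sortKey: "とうきょう" },
  { title: "大阪", sortKey: "おおさか" },
  { title: "京都", sortKey: "きょうと" },
];

// Use the shadow field when present, otherwise fall back to the title itself.
function shadowKey(item) {
  return item.sortKey !== undefined ? item.sortKey : item.title;
}

const jaCollator = new Intl.Collator("ja");

function sortByShadowKey(list) {
  return [...list].sort((a, b) => jaCollator.compare(shadowKey(a), shadowKey(b)));
}

console.log(sortByShadowKey(cityItems).map((item) => item.title).join(", "));
```

    The fallback keeps items without a reading sortable, so the shadow field only has to be filled in where the displayed text cannot be collated sensibly on its own.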

    EDIT: Just in case that explanation seems complacent, here is partial output from a torture test for localeCompare() that I built before concluding that it basically dispenses with the sorting issue. The torture-test program produces 1,000 strings of random, mixed-script garbage, and then sorts them:
    ŏ
    ŏąęĄĻđęũŬŶčůłŌěĝšİŁťĎĮĩ
    ŎāĒıũŴŅĿŻĩūİ
    ŏăřęŗŤġŔăąĵŶŝŲōēŋIJĞēśāźő
    őčŒűŲŅŦŁđŃŵģśĸĿįŗĔďĦŭĚğŬ
    ŌĉŒŹŕřĴŚĶňŵŎŦĤ
    œĂŇŶĕĶŵŻųĖāœĚċ
    œčċŹ
    œĈňŬĩIJĆŎĢĬĠŏűĄŴŽūŜĭħ
    ŎĕċŜŏıʼnđżĺĶ썼őʼnŎģ
    őęĤīŚŐŵŵĹŷĄń
    œĭňļŷŷģńŅğĨņĜņŞ
    œĪŞČŭňŒĬġĻŴňĸŌŅĝŹŋı
    œĹĔĆŋĠĻ
    œŏńşĿŎģŲŀĒňų
    ŒŔĠĨĄIJĒąŷīīĞĸěďğŜ
    ŐęŕŰŵċťŘʼnĉĈĸŚžĜĂĞġĝăŞĕœ
    œŘŶŮIJĤēĺij
    ŒŨĤŞĦŖ
    œŪľőň
    œůŸĻĔĤŏĆśŃńėŸŤźğčľ
    ŎĚżĮť
    œĦņĬĞĐ
    œʼnŧťųŜŞńŠŵĶŹĎĥįűĭşĵįġőů
    ŎĤ
    ōĥŽņŔũ


    ... which seems about right.
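
    For anyone who wants to try something similar, here is a sketch of a torture test along these lines. It is a reconstruction under assumptions, not the program that produced the output above: the choice of scripts, the maximum string length, the seed and all function names are made up, and a small seeded generator replaces Math.random() so that a run can be repeated.

```javascript
// Tiny linear congruential generator so a run can be repeated from a seed.
function makeRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Code point ranges to draw from (an illustrative choice of scripts).
const RANGES = [
  [0x0100, 0x017f], // Latin Extended-A
  [0x03b1, 0x03c9], // Greek small letters
  [0x0410, 0x044f], // Cyrillic
  [0xac00, 0xd7a3], // Hangul syllables
];

// Build one string of 1..maxLength characters picked at random from RANGES.
function randomString(rand, maxLength) {
  const length = 1 + Math.floor(rand() * maxLength);
  let out = "";
  for (let i = 0; i < length; i++) {
    const [lo, hi] = RANGES[Math.floor(rand() * RANGES.length)];
    out += String.fromCodePoint(lo + Math.floor(rand() * (hi - lo + 1)));
  }
  return out;
}

// Generate `count` garbage strings and sort them with the default locale collation.
function tortureTest(count, seed) {
  const rand = makeRandom(seed);
  const strings = Array.from({ length: count }, () => randomString(rand, 24));
  return strings.sort((a, b) => a.localeCompare(b));
}

const tortureSorted = tortureTest(1000, 42);
console.log(tortureSorted.slice(0, 10).join("\n"));
```

    Eyeballing the first lines of output, as above, is the real check: strings that start with variants of the same base letter should cluster together rather than being scattered by code point.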
  • The issue with case transformations for Turkish (and some other Turkic languages in their Latin variants) is an open bug in Firefox (https://bugzilla.mozilla.org/show_bug.cgi?id=231162).

    Sorting is a tough one: the sorting rules for Cyrillic-derived Turkic scripts have changed frequently, sometimes mandating that the language-specific characters (the Tatar-specific ones, say) come after all of the Russian characters (especially pre-1991), and sometimes that they be sorted immediately after their graphically (and usually phonetically) similar counterparts in the standard Russian Cyrillic alphabet. I don't think it's necessary to support radically different and poorly defined sorting conventions, but keep in mind that such conventions do exist, and that publications could conceivably demand certain behavior with regard to them.
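
    To show how far apart the two conventions are, here is a sketch that builds both orderings from the Russian alphabet plus the six Tatar-specific Cyrillic letters and sorts the same words under each. The relative order of the Tatar letters when they are placed at the end is an assumption made for illustration, not a cited standard, and the helper names are invented.

```javascript
const RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";

// Tatar-specific letter -> the Russian letter it resembles.
const TATAR_EXTRA = { "ә": "а", "җ": "ж", "ң": "н", "ө": "о", "ү": "у", "һ": "х" };

// Convention A: every Tatar-specific letter follows the whole Russian alphabet.
const extrasLast = RUSSIAN + Object.keys(TATAR_EXTRA).join("");

// Convention B: each Tatar-specific letter directly follows its counterpart.
const extrasInterleaved = [...RUSSIAN]
  .map((ch) => ch + Object.keys(TATAR_EXTRA).filter((x) => TATAR_EXTRA[x] === ch).join(""))
  .join("");

// Build a comparator that ranks characters by their position in `alphabet`.
function compareBy(alphabet) {
  const rank = new Map([...alphabet].map((ch, i) => [ch, i]));
  // Characters outside the alphabet sort after it, by code point.
  const rankOf = (ch) => (rank.has(ch) ? rank.get(ch) : alphabet.length + ch.codePointAt(0));
  return (a, b) => {
    const x = [...a.toLowerCase()];
    const y = [...b.toLowerCase()];
    const shared = Math.min(x.length, y.length);
    for (let i = 0; i < shared; i++) {
      const diff = rankOf(x[i]) - rankOf(y[i]);
      if (diff !== 0) return diff;
    }
    return x.length - y.length;
  };
}

const tatarWords = ["ят", "әни", "бал", "ана"];
const sortedExtrasLast = [...tatarWords].sort(compareBy(extrasLast));
const sortedInterleaved = [...tatarWords].sort(compareBy(extrasInterleaved));

console.log("extras last:       ", sortedExtrasLast.join(", "));
console.log("extras interleaved:", sortedInterleaved.join(", "));
```

    Under the first convention "әни" moves to the very end of the list, under the second it sits right after "ана", so a single fixed character mapping cannot serve both.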
  • I think we're on the same page. My only point was that sort ordering is a feature of locales on which CSL can depend. Where a locale collation is broken or missing, it needs to be fixed upstream.