Language-specific case associations and sort orders
Hi,
This should probably also be posted as a "new request", but I'll let the moderators decide.
I think a child element like “sort-order” is necessary in CSL to define language-specific case associations and sort orders.
Examples of what it would need to cover:
sorting German characters with umlauts (ä, ö and ü),
Turkish case associations like
- uppercase of U+0131 (LATIN SMALL LETTER DOTLESS I) is U+0049 (LATIN CAPITAL LETTER I) and
- uppercase of "i" (i.e. U+0069 LATIN SMALL LETTER I) is NOT "I" but U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
Hint: It is worth taking a look at how Toolbox solves this problem with language encoding (lng) files.
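For anyone who wants to check the Turkish mappings listed above, here is a minimal JavaScript sketch. It assumes an engine with Intl support (a current browser or Node.js) and is only an illustration of the expected behaviour, not a proposal for how CSL should implement it. The variable names are made up for the example.

```javascript
// Locale-independent mapping: "i" (U+0069) uppercases to "I" (U+0049).
const plainUpper = "i".toUpperCase();

// Turkish tailoring: "i" (U+0069) uppercases to U+0130 (capital I with dot above).
const turkishUpper = "i".toLocaleUpperCase("tr");

// Turkish tailoring: dotless "ı" (U+0131) uppercases to "I" (U+0049).
const dotlessUpper = "\u0131".toLocaleUpperCase("tr");

// And the reverse direction: "I" (U+0049) lowercases to dotless "ı" (U+0131).
const turkishLower = "I".toLocaleLowerCase("tr");

// Print the code points so the difference is visible even in fonts
// that render the dotted and dotless forms similarly.
for (const [label, value] of Object.entries({
  plainUpper,
  turkishUpper,
  dotlessUpper,
  turkishLower,
})) {
  const hex = value.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  console.log(`${label}: U+${hex}`);
}
```

The point of the sketch is that the same input character needs a different result depending on the language, so a single language-independent case table cannot cover both Turkish and, say, English text.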
Are these sort orders and case associations strongly language-specific, or would a language-agnostic character mapping suffice? It's (currently) not possible in CSL to set the language on a per-item basis. The only exception is the sorting of names based on the script used: http://citationstyles.org/downloads/specification.html#name-part-order.
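As one data point on that question, here is a small sketch comparing how the same words sort under German and Swedish collation rules. It assumes a JavaScript engine that ships collation data for both locales (recent Node.js builds and browsers normally do); the word list is invented for the example.

```javascript
// The same list, sorted with two different locale-aware collators.
const words = ["Zebra", "Äpfel", "Apfel", "Ofen", "Öl"];

// German dictionary order treats ä/ö as variants of a/o.
const german = [...words].sort(new Intl.Collator("de").compare);

// Swedish treats ä and ö as separate letters that come after z.
const swedish = [...words].sort(new Intl.Collator("sv").compare);

console.log("de:", german.join(", "));
// de: Apfel, Äpfel, Ofen, Öl, Zebra   (with full locale data)
console.log("sv:", swedish.join(", "));
// sv: Apfel, Ofen, Zebra, Äpfel, Öl   (with full locale data)
```

Since the identical characters land in different places depending on the language, at least the sort order looks language-specific rather than something a single character mapping could express.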
The one place where this will break down pretty comprehensively is with Asian ideographic scripts (Chinese, Japanese). To cope with those, a shadow sort field of some kind will be required (as is true of any software that must sort terms in those languages).
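To make the "shadow sort field" idea concrete, here is a rough sketch. The `sortKey` property and the item shape are hypothetical (nothing like this is defined in CSL); the keys are romanized readings typed in by hand for the example.

```javascript
// Each item carries the displayed title plus an optional, hand-supplied
// shadow key (here a romanized reading) that is used only for sorting.
const items = [
  { title: "東京", sortKey: "toukyou" },
  { title: "大阪", sortKey: "oosaka" },
  { title: "京都", sortKey: "kyouto" },
];

// Fall back to the title itself when no shadow key has been provided.
const keyOf = (item) => item.sortKey ?? item.title;

const sortedItems = [...items].sort((a, b) =>
  keyOf(a).localeCompare(keyOf(b))
);

console.log(sortedItems.map((item) => item.title).join(", "));
// 京都, 大阪, 東京
```

The displayed text never changes; only the comparison looks at the shadow key, which is why the key has to be stored alongside the item rather than derived from the ideographs.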
EDIT: Just in case that explanation seems complacent, here is partial output from a torture test for localeCompare() that I built before concluding that it basically dispenses with the sorting issue. The torture-test program produces 1,000 strings of random, mixed-script garbage, and then sorts them:
ŏ
ŏąęĄĻđęũŬŶčůłŌěĝšİŁťĎĮĩ
ŎāĒıũŴŅĿŻĩūİ
ŏăřęŗŤġŔăąĵŶŝŲōēŋIJĞēśāźő
őčŒűŲŅŦŁđŃŵģśĸĿįŗĔďĦŭĚğŬ
ŌĉŒŹŕřĴŚĶňŵŎŦĤ
œĂŇŶĕĶŵŻųĖāœĚċ
œčċŹ
œĈňŬĩIJĆŎĢĬĠŏűĄŴŽūŜĭħ
ŎĕċŜŏıʼnđżĺĶ썼őʼnŎģ
őęĤīŚŐŵŵĹŷĄń
œĭňļŷŷģńŅğĨņĜņŞ
œĪŞČŭňŒĬġĻŴňĸŌŅĝŹŋı
œĹĔĆŋĠĻ
œŏńşĿŎģŲŀĒňų
ŒŔĠĨĄIJĒąŷīīĞĸěďğŜ
ŐęŕŰŵċťŘʼnĉĈĸŚžĜĂĞġĝăŞĕœ
œŘŶŮIJĤēĺij
ŒŨĤŞĦŖ
œŪľőň
œůŸĻĔĤŏĆśŃńėŸŤźğčľ
ŎĚżĮť
œĦņĬĞĐ
œʼnŧťųŜŞńŠŵĶŹĎĥįűĭşĵįġőů
ŎĤ
ōĥŽņŔũ
... which seems about right.
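The torture-test program itself is not shown in this thread. A hypothetical reconstruction along the same lines might look like the sketch below. The character range (Latin Extended-A, U+0100 to U+017F), the maximum string length, and all names are assumptions made for the example, and it draws from a single block rather than truly mixed scripts.

```javascript
// Build a random string of 1..maxLength characters from Latin Extended-A.
function randomString(maxLength) {
  const length = 1 + Math.floor(Math.random() * maxLength);
  let s = "";
  for (let i = 0; i < length; i++) {
    // 0x80 code points starting at U+0100, i.e. U+0100..U+017F.
    s += String.fromCharCode(0x0100 + Math.floor(Math.random() * 0x80));
  }
  return s;
}

// 1,000 strings of random garbage, as in the description above.
const samples = Array.from({ length: 1000 }, () => randomString(24));

// Sort a copy with the engine's default locale-aware comparison.
const sortedSamples = [...samples].sort((a, b) => a.localeCompare(b));

// Sanity check: every adjacent pair must be in non-descending order.
const inOrder = sortedSamples.every(
  (s, i) => i === 0 || sortedSamples[i - 1].localeCompare(s) <= 0
);

console.log(sortedSamples.slice(0, 25).join("\n"));
console.log("consistently ordered:", inOrder);
```

This only shows that `localeCompare()` yields a consistent ordering on awkward input; whether that ordering matches what a given language community expects still has to be judged by eye, as with the output above.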
The sorting is a tough one. The sorting rules for Cyrillic-derived Turkic scripts have changed frequently: sometimes they mandate that, say, the Tatar-specific characters come after all of the Russian characters (especially pre-1991), and sometimes that the Tatar-specific characters be sorted immediately after their graphically (and usually phonetically) similar counterparts in the standard Russian Cyrillic alphabet. I don't think that it's necessary to provide support for radically different and poorly defined sorting conventions, but keep in mind that such conventions do exist, and that publications could conceivably demand certain behavior with regard to them.
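To illustrate the two conventions, here is a sketch of a comparator driven by an explicit alphabet string. The helper name `makeAlphabetComparator` is made up, and both alphabet strings are written out by hand for illustration only; they should be checked against an authoritative source before being relied on.

```javascript
// Build a comparator that orders strings by position in a given alphabet.
// Characters missing from the alphabet sort after it, by code point.
function makeAlphabetComparator(alphabet) {
  const letters = [...alphabet];
  const rank = new Map(letters.map((ch, i) => [ch, i]));
  const rankOf = (ch) =>
    rank.has(ch) ? rank.get(ch) : letters.length + ch.codePointAt(0);

  return (a, b) => {
    const x = [...a.toLowerCase()];
    const y = [...b.toLowerCase()];
    const n = Math.min(x.length, y.length);
    for (let i = 0; i < n; i++) {
      const diff = rankOf(x[i]) - rankOf(y[i]);
      if (diff !== 0) return diff;
    }
    // Equal up to the shorter length: the shorter string sorts first.
    return x.length - y.length;
  };
}

const russian = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";

// Convention 1 (assumed): Tatar-specific letters after all Russian letters.
const extrasLast = russian + "әөүҗңһ";

// Convention 2 (assumed): each Tatar-specific letter follows its counterpart.
const interleaved = "аәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщъыьэюя";

const tatarWords = ["әни", "бал", "ана", "яз"];

const byExtrasLast = [...tatarWords].sort(makeAlphabetComparator(extrasLast));
const byInterleaved = [...tatarWords].sort(makeAlphabetComparator(interleaved));

console.log(byExtrasLast.join(", "));  // ана, бал, яз, әни
console.log(byInterleaved.join(", ")); // ана, әни, бал, яз
```

Because the alphabet is just data passed to the comparator, supporting a publication's preferred convention becomes a matter of supplying a different string, which is roughly what a “sort-order” definition would have to carry.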