[MLZ] Sorting problem in Chinese transliteration

isabel-h1 · May 31, 2014

Hello,

I am using Multilingual Zotero very successfully for my papers. However, I ran into an issue related to sorting Chinese entries in the bibliography.
The bibliography entries are sorted according to the transliteration (pinyin) of the authors' names or, if the name is only known in alphabetic characters, by the normal author field. Pinyin transliteration uses tone marks that look like accents, e.g. ōǒóò, on some characters (aeiouü) to specify their tone. Unfortunately, ǔ gets sorted before a and e, and è gets sorted after e.
With Chinese pinyin, I would like to sort independently of the tones used. So èéēě should all be sorted as if they were a plain "e".

What I got automatically:
Chǔ Jīnqiáo 楚金桥.
Chang, Hui-Ching,
Chen, Sylvia Xiaohua
Cheng, Simone C. L.
Cheng, Y. H.
Chén Xiàngmíng 陈向明.

What I want:
Chang, Hui-Ching,
Chen, Sylvia Xiaohua
Chén Xiàngmíng 陈向明.
Cheng, Simone C. L.
Cheng, Y. H.
Chǔ Jīnqiáo 楚金桥.
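
As a rough illustration of tone-independent sorting, the sketch below strips the tone marks before comparing. The helper names stripToneMarks and sortIgnoringTones are made up for this example, it assumes a JavaScript engine that provides String.prototype.normalize, and it is not how MLZ or the style file actually sorts. Note that it also turns ü into u, which may or may not be wanted.

```javascript
// Hypothetical helper: decompose to NFD, then drop combining marks
// (U+0300 to U+036F), so "Chén" and "Chen" yield the same key.
function stripToneMarks(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical helper: sort a copy of the list by the stripped keys,
// using a plain code-point comparison so the result is deterministic.
function sortIgnoringTones(names) {
  return names.slice().sort((a, b) => {
    const ka = stripToneMarks(a);
    const kb = stripToneMarks(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}

console.log(sortIgnoringTones(["Chǔ", "Cheng", "Chén", "Chang"]));
```

Full entries with commas and spaces would need a more careful comparison than the plain one used here.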

I use the following style: https://github.com/j-4/styles/blob/master/vienna-journal-east-asian-studies.csl

Can I specify the sorting method in the style-file?

Thanks for your help!
Isabel

fbennett · May 31, 2014

Well, that's not very satisfactory, is it.

I can't reproduce that sort failure with those names here; they come out as desired.

There are a couple of possible causes for the result you're seeing. One is that your platform may have a broken Unicode locale. For a spot-check, try pointing your browser at this page, and let us know what you see:

http://gsl-nagoya-u.net/http/pub/UNICODE-SORT-TEST.html

If the result is not "1", the problem is definitely in your browser. In either case, check the Firefox version (under Help -> About Firefox). There have been improvements in locale sort handling, so it would be best to use the most recent version (v.29).

The other possible problem is that your content may contain non-precomposed characters, which I believe can still cause sorting problems. There is a forum discussion here:

https://forums.zotero.org/discussion/12684/special-character-search/

But we'll cross that bridge only if we have to.

Addendum:

This link suggests that non-precomposed characters may creep into input more easily on Apple systems:

https://developer.apple.com/library/ios/qa/qa1235/_index.html

So if you are on a Mac, this is a possible problem. As the ticket linked in the forum post above has been open for six years, maybe it's time for a solution.
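
To make the precomposed versus non-precomposed distinction concrete, here is a minimal sketch. It assumes a JavaScript engine that provides String.prototype.normalize, and the variable names are only for illustration.

```javascript
// The same visible text, "Chén", encoded in two different ways.
const precomposed = "Ch\u00E9n"; // é as the single code point U+00E9
const decomposed = "Che\u0301n"; // e followed by combining acute U+0301

// A raw comparison treats them as different strings.
const rawEqual = precomposed === decomposed;

// After NFC normalization both use the precomposed form.
const nfcEqual =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");

console.log(precomposed.length, decomposed.length);
console.log(rawEqual, nfcEqual);
```

The two strings look identical on screen, which is why this kind of problem is hard to spot in item data.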

fbennett · May 31, 2014

(Meanwhile ... I have prepared a change to MLZ and the citation processor that should work to normalize non-precomposed characters. If your browser is up to date and the test page above passes okay, let me know and I'll get the revised version online so you can test it.)

adamsmith · May 31, 2014

(Frank - the non-precomposed characters, as you know, have been haunting us for vanilla Zotero, too. Having the processor improvement will be great, but if the MLZ change could extend to regular Zotero that'd be fantastic--I know your last attempt to get an MLZ patch into Zotero wasn't successful, but maybe under-the-hood stuff is easier?)

fbennett · May 31, 2014

I've added a hook to the processor for a function to normalize the string content of sort keys. I haven't tested that it will actually run yet, but the code for it is here:

MLZ:
https://github.com/fbennett/zotero/commit/2feca308c2e0980bd7cb95d7e0ee8b50597ae93e

citeproc-js:
https://bitbucket.org/fbennett/citeproc-js/commits/d9320700817f2482b9ab8b36bb6b6b336bf90980
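
For readers who want the general idea without reading the commits, here is a rough sketch of normalizing sort keys before comparison. It is not the code in the commits above: the names normalizeSortKey and sortEntries and the sortKey property are hypothetical, and it assumes String.prototype.normalize is available.

```javascript
// Hypothetical hook: bring a sort key into NFC so that precomposed and
// non-precomposed spellings of the same name produce identical keys.
function normalizeSortKey(key) {
  return typeof key === "string" ? key.normalize("NFC") : key;
}

// Hypothetical sorter: normalize each key once, sort by the normalized
// keys, and return the original entries in the new order.
// The comparator defaults to the platform's locale-aware comparison.
function sortEntries(entries, compare = (a, b) => a.localeCompare(b)) {
  return entries
    .map((entry) => ({ entry, key: normalizeSortKey(entry.sortKey) }))
    .sort((a, b) => compare(a.key, b.key))
    .map((wrapped) => wrapped.entry);
}

const sorted = sortEntries([
  { sortKey: "Chu\u030C" },  // decomposed ǔ
  { sortKey: "Chang" },
  { sortKey: "Che\u0301n" }, // decomposed é
]);
console.log(sorted.map((entry) => entry.sortKey));
```

The resulting order still depends on the platform's locale support, so normalization complements an up-to-date browser rather than replacing it.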

aurimas · June 1, 2014

Oooo, in other great news, it looks like String.prototype.normalize() is coming to Firefox and is already in Chrome! https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize#Browser_compatibility

dstillman · June 1, 2014

(We've had a function (Zotero.Utilities.Internal.normalize) that does this (using NFC) for years, though. We just don't use it anywhere, as far as I know.)

dstillman · June 1, 2014

We could pretty easily normalize all item field inputs going forward. Normalizing all existing data is a bit trickier. (We can normalize on-demand for things like CSL, but not for things like searching.) Once we have async DB, we could maybe do it in the background.
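
As a sketch of what normalizing field input going forward might look like: the function name normalizeItemFields and the flat item shape are assumptions for illustration only, not Zotero's actual data layer, and the code relies on String.prototype.normalize rather than the internal function mentioned above.

```javascript
// Hypothetical helper: return a copy of an item whose string fields
// are normalized to NFC; other values are copied unchanged.
function normalizeItemFields(item) {
  const result = {};
  for (const [field, value] of Object.entries(item)) {
    result[field] =
      typeof value === "string" ? value.normalize("NFC") : value;
  }
  return result;
}

const saved = normalizeItemFields({
  title: "Che\u0301n Xia\u0300ngmi\u0301ng", // decomposed input
  year: 2014,
});
console.log(saved.title === "Ch\u00E9n Xi\u00E0ngm\u00EDng");
```

Applying this at input time leaves existing data untouched, which is the trickier part described above.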

aurimas · June 1, 2014

I think that (at some point) we should normalize the whole DB, but my concern was always this outstanding ticket (https://bugzilla.mozilla.org/show_bug.cgi?id=728180). If we normalize with the current nsIUnicodeNormalizer and Firefox upgrades it, we have to re-normalize the data or comparisons will break again (though going from Unicode 4.1 forward, this may not be an issue if Firefox keeps up with Unicode updates). My hope was that Firefox would figure out what to do with this bug when it came time to implement String.prototype.normalize for ES6, but I don't think they did.

In any case, it doesn't look like there's much activity for that bug, so maybe it's not worth worrying about having to redo the DB normalization whenever a new version arrives (and it seems that if the new version comes, it will not override the old version, so we'll have time to migrate).

Edit: I looked more into the changes to the normalization algorithm in 4.1 and, as they say, these should not apply to anything found in meaningful text. The changes to the actual decomposition mappings are also quite minimal. So the only differences that would be introduced upon update are compositions for character sets added since 3.2 (I think that's what the current implementation in Firefox is).

Edit 2: after some more reading, it seems that String.prototype.normalize() uses the ICU library, which should support the most up to date version of Unicode. So that's great. I think that's what we should plan to use for normalization.

isabel-h1 · June 1, 2014

Hi everyone,

Thanks a lot for your fast and great help!

I initially used Iceweasel 24.5.0, and the website you pointed me to showed 0 as the result. So I downloaded Firefox 29.0.1, which gives me the result 1, and with it the sorting in ML Zotero is now correct automatically.

It would be fantastic if ML Zotero was integrated into regular Zotero! I was using regular Zotero before, without knowing of the multilingual version, and always had trouble with creating Chinese entries that require characters as well as transliterations. I only found out about ML Zotero through much googling and specific forum posts, but I think that a lot of people still don't know about the great possibilities that are out there!

All the best,

Isabel

fbennett · June 1, 2014

That's great to hear. I'll retain the Unicode normalizer fix for sort keys in the next MLZ release, just in case, but it's good to know that everything is working for you.

Thanks for your kind words about MLZ. The project has kept a low profile in part to avoid a backlash from disappointment over bugs in the early releases. One of the main aims was to get it working for our staff and students, on the assumption that once it proved itself useful, it would begin to percolate onto people's desktops. It's great to see that starting to happen.

MLZ also has pretty good support for legal referencing, which is another growth area. There have been a lot of code changes against mainstream Zotero to get things working, and migrating them to the main project may have to wait quite a while; but the prospects for eventual merger and decommissioning of the MLZ variant will rise as the tool finds favour with new users. Meanwhile, local support for our students (who hail from a bunch of jurisdictions whose languages I do not understand) gives me an incentive to support the tool and maintain parity with changes in mainstream Zotero.

Onward and upward. :)