Citing Cyrillic transliterations
Recently we had a longish conversation on Twitter about citing metadata that's been transliterated from Cyrillic.
Background
There are a number of ways to transliterate Cyrillic text into Latin script.
It seems that many catalogs serve metadata that's been transliterated using the ALA-LC standard (e.g. it includes character combinations like i︠e︡, ĭ, T︠S︡, etc.).
Issue
1. Combining marks currently make it difficult to find items in the Zotero database.
2. When citing, ligatures/combining marks are not used (though apparently they may be used in some styles?)
Possible solutions
1. This is a general issue and I think it will be addressed when we implement text normalization throughout the database and then strip all special Unicode marks when comparing strings.
2A. We _could_ replace ALA-LC transliterations with the "standard" form (I assume this refers to the BGN/PCGN system) on import from websites. It seems to be a pretty straightforward 1:1 mapping, though it only works one way. The issue here is that these ligatures are not specific to Cyrillic transliterations and could be used for other scripts, so we would have to make sure that we're only doing it for Cyrillic transliterations. Unfortunately, from what I saw, many catalogs do not include any indication of the language/script, so I would rather leave this under user control. Additionally, it seems that the ligatures are actually informative, and Avram has suggested that Zotero should _not_ remove them on import.
2B. The other option I see is that these are cleaned up at citation time in citeproc-js. This would allow the user to use the language field to specify what kind of transliteration this is (more on that below*), and the original metadata would remain undisturbed. Additionally, if some styles do want to use the ALA-LC system, there could be a way to specify this in the style.
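To make solution 1 a bit more concrete, here is a minimal sketch of mark-insensitive comparison. It is not Zotero's actual search code; the function names `foldForSearch` and `matchesQuery` are made up for illustration. It decomposes the string (NFD), drops every Unicode combining mark (which covers both the breve in ĭ and the ligature half marks U+FE20/U+FE21), and lowercases the result.

```javascript
// Hypothetical helper: fold a string into a form suitable for comparison.
function foldForSearch(text) {
  return text
    .normalize("NFD")        // split precomposed letters, e.g. ĭ -> i + U+0306
    .replace(/\p{M}/gu, "")  // drop all combining marks, incl. U+FE20/U+FE21
    .toLowerCase();
}

// Hypothetical helper: does the stored field contain the query, ignoring marks?
function matchesQuery(fieldValue, query) {
  return foldForSearch(fieldValue).includes(foldForSearch(query));
}

// A plain-ASCII query finds the ALA-LC form stored in the database.
console.log(matchesQuery("T\uFE20S\uFE21vetaeva", "tsvetaeva"));
console.log(matchesQuery("Dostoevski\u012D", "dostoevskii"));
```

Because both sides are folded the same way, the stored metadata keeps its marks and only the comparison ignores them.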
* There is a "t" extension to the BCP 47 language tag system that allows specifying the source language for transliteration and the system that was used to transliterate. This could allow the character substitution to be fine-tuned based on the style requirements and the language/script of the metadata.
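As an illustration of what the "t" extension carries, a tag such as `ru-Latn-t-ru-m0-alaloc` would, as I understand the extension, mean "Russian in Latin script, transformed from Russian, mechanism ALA-LC". The sketch below pulls those parts out of a tag. The function name `parseTransformExtension` is hypothetical, and the parsing is deliberately simplified: it ignores private-use subtags and any other extension that precedes `t`.

```javascript
// Hypothetical helper: extract the "t" (transformed content) extension
// from a BCP 47 language tag. Returns null if the tag has no "t" extension.
function parseTransformExtension(tag) {
  const subtags = tag.toLowerCase().split("-");
  const tIndex = subtags.indexOf("t");
  if (tIndex === -1) return null;

  const sourceParts = [];
  const fields = {};
  let currentKey = null;

  for (const subtag of subtags.slice(tIndex + 1)) {
    if (/^[a-z][0-9]$/.test(subtag)) {
      // Field key such as "m0" (the transliteration mechanism).
      currentKey = subtag;
      fields[currentKey] = [];
    } else if (subtag.length === 1) {
      // Another singleton starts a different extension; stop here.
      break;
    } else if (currentKey) {
      fields[currentKey].push(subtag);
    } else {
      // Everything before the first field key is the source language tag.
      sourceParts.push(subtag);
    }
  }

  return {
    target: subtags.slice(0, tIndex).join("-"),
    source: sourceParts.join("-"),
    fields: fields
  };
}

const parsed = parseTransformExtension("ru-Latn-t-ru-m0-alaloc");
console.log(parsed.target, parsed.source, parsed.fields.m0);
```

A citation processor could then pick a substitution table based on `source` and `fields.m0`, and leave items without a "t" extension alone.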
Off topic: in the long long long run, I can see Zotero taking advantage of the ICU project to transliterate metadata on-the-fly.
1) No ligatures.
2) i instead of ĭ for й.
3) Capitalized letters which are two roman letters but one in Russian (Ц -> Ts) are rendered with standard English capitalization (i.e. Ts instead of TS with the ligature).
4) Russian old orthography letter ѣ (yat') is transliterated "e" instead of "i︠e︡."
I believe this might be an older version of the LC system, but at any rate it is the standard form used in publications such as the Russian Review. It's also routinely called the LC system; I don't think people are actually aware of what the strict form entails.
Speaking only for myself, it's irrelevant to me that the ligatures are displayed in Zotero itself (as long as the search is made to work properly). What really bugs me and makes Zotero very hard to use in final products is the presence of these forms in the citations. So a citation-level fix would be fine for me as a historian.
We should be able to solve this with a plugin, if I provide a hook in citeproc-js for an unconditional transform function, applied to CSL items before the abbreviation mechanism gets ahold of them. All we would need is a set of JSON mappings for character clusters to be transformed, and a small plugin to attach a function that makes use of them to the processor. Something like:
{
  "ru": {
    "[ligature chars]": "e",
    "[ligature chars]": "i"
  }
}
Maybe.
citeproc.sys.stripLigatures = function (Item) {
  // Do stuff to Item
};
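Fleshing that out a little, here is a rough sketch of what such a function might do. Everything in it is an assumption: the hook does not exist yet, the mapping entries are only an illustrative subset (lowercase clusters plus one capitalized form, following the conventions listed earlier in the thread), and the lists of fields to touch are guesses.

```javascript
// Illustrative, incomplete mapping keyed by primary language subtag.
const LIGATURE_MAPPINGS = {
  ru: {
    "i\uFE20e\uFE21": "e",   // yat', per the convention described above
    "i\uFE20u\uFE21": "iu",
    "i\uFE20a\uFE21": "ia",
    "t\uFE20s\uFE21": "ts",
    "T\uFE20S\uFE21": "Ts",  // standard English capitalization
    "\u012D": "i"            // ĭ
  }
};

// Assumed CSL fields worth transforming; a real plugin may need more.
const TEXT_FIELDS = ["title", "container-title", "publisher", "publisher-place"];
const NAME_FIELDS = ["author", "editor", "translator"];

function replaceClusters(text, table) {
  // NFC so that a decomposed i + breve matches the precomposed key.
  let result = text.normalize("NFC");
  for (const [cluster, replacement] of Object.entries(table)) {
    result = result.split(cluster).join(replacement);
  }
  return result;
}

// Returns a transformed copy; the stored item is left untouched.
function stripLigatures(item, mappings = LIGATURE_MAPPINGS) {
  const primaryLanguage = (item.language || "").split("-")[0].toLowerCase();
  const table = mappings[primaryLanguage];
  if (!table) return item; // no language, or no mapping: do nothing

  const copy = JSON.parse(JSON.stringify(item));
  for (const field of TEXT_FIELDS) {
    if (typeof copy[field] === "string") {
      copy[field] = replaceClusters(copy[field], table);
    }
  }
  for (const field of NAME_FIELDS) {
    for (const name of copy[field] || []) {
      if (name.family) name.family = replaceClusters(name.family, table);
      if (name.given) name.given = replaceClusters(name.given, table);
    }
  }
  return copy;
}

const cleaned = stripLigatures({
  language: "ru",
  title: "Brat\u02B9i\uFE20a\uFE21 Karamazovy",
  author: [{ family: "Dostoevski\u012D", given: "Fedor" }]
});
console.log(cleaned.title, cleaned.author[0].family);
```

Keying the table on the language field means items with no language set pass through unchanged, which keeps the transform under user control as discussed above.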