Citing Cyrillic transliterations

aurimas · July 23, 2014

Recently we had a longish conversation on Twitter about citing metadata that's been transliterated from Cyrillic.

Background
There are a number of ways to transliterate Cyrillic text into Latin script.

It seems that many catalogs serve metadata that's been transliterated using the ALA-LC standard (i.e. includes character combinations like i︠e︡, ĭ, T︠S︡, etc.).

Issue
1. Combining marks currently make it difficult to find items in the Zotero database.
2. When citing, ligatures/combining marks are not used (though apparently they may be used in some styles?)

Possible solutions
1. This is a general issue and I think it will be addressed when we implement text normalization throughout the database and then strip all special Unicode marks when comparing strings.
2A. We _could_ replace ALA-LC transliterations with "standard" form (I assume this refers to BGN/PCGN system) on import from websites. It seems to be a pretty straightforward 1:1 mapping, though it only works one way. The issue here is that these ligatures are not specific to Cyrillic transliterations and could be used in other scripts, so we would have to make sure that we're only doing it for Cyrillic transliterations. Unfortunately, from what I saw, many catalogs do not include any indication of the language/script, so I would rather leave this in user control. Additionally, it seems that the ligatures are actually informative and Avram has suggested that Zotero should _not_ remove them on import.
2B. The other option I see is that these are cleaned up when citing in citeproc-js. This would allow the user to use the language field to specify what kind of transliteration this is (more on that below*) and we would not have to worry about messing up metadata. Additionally, if some styles do want to use ALA-LC system, there could be a way to specify this in the style. Finally, the original metadata would remain undisturbed.
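For the search side (solution 1), a minimal sketch of what "normalize, then strip marks when comparing" could look like. The function names `searchKey` and `matches` are hypothetical and not part of Zotero; the sketch assumes a JavaScript engine with Unicode property escapes.

```javascript
// Hypothetical helper: fold a string to a mark-free, lower-case search key.
function searchKey(text) {
  return text
    .normalize("NFD")        // split precomposed letters such as ĭ into i + U+0306
    .replace(/\p{M}/gu, "")  // drop combining marks, including the U+FE20/U+FE21 ligature halves
    .toLowerCase();
}

// Compare on the folded keys only; the stored metadata is left untouched.
function matches(query, fieldValue) {
  return searchKey(fieldValue).includes(searchKey(query));
}
```

With this, a query typed as plain "rossiia" would match a title stored with the ligature over "ia".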

* There is a "t" extension to the BCP 47 language tag system that allows specifying the source language for transliteration and the system that was used to transliterate. This could allow the character substitution to be fine-tuned based on the style requirements and the language/script of the metadata.
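As a rough illustration of how the "t" extension could be read from the language field, here is a small parser sketch. The function name `parseTransformExtension` is hypothetical, the parsing is simplified, and the example tag `ru-Latn-t-ru-cyrl-m0-alaloc` assumes that `alaloc` is the mechanism value registered for ALA-LC, which should be checked against the registry.

```javascript
// Simplified sketch: extract the source language and the m0 (mechanism)
// field from the "t" extension of a BCP 47 tag. Not a full validator.
function parseTransformExtension(tag) {
  const subtags = tag.toLowerCase().split("-");
  const start = subtags.indexOf("t");
  if (start < 1) return null; // no "t" singleton after the primary language subtag

  const source = [];
  const fields = {};
  let key = null;
  for (const subtag of subtags.slice(start + 1)) {
    if (subtag.length === 1) break;        // another singleton ends the extension
    if (/^[a-z][0-9]$/.test(subtag)) {     // field key such as "m0"
      key = subtag;
      fields[key] = [];
    } else if (key) {
      fields[key].push(subtag);            // value subtag of the current field
    } else {
      source.push(subtag);                 // still inside the source language tag
    }
  }
  return {
    source: source.join("-") || null,
    mechanism: fields.m0 ? fields.m0.join("-") : null
  };
}
```

A style or processor could then choose a substitution table based on the `source` and `mechanism` values rather than guessing from the characters alone.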

Off topic: in the long long long run, I can see Zotero taking advantage of the ICU project to transliterate metadata on-the-fly.

slawkenbergius · July 23, 2014

I should clarify: the "standard" form I had in mind is not the BGN system but the LC system, with the following differences:
1) No ligatures.
2) i instead of ĭ for й.
3) Capitalized letters which are two roman letters but one in Russian (Ц -> Ts) are rendered with standard English capitalization (i.e. Ts instead of TS with the ligature).
4) Russian old orthography letter ѣ (yat') is transliterated "e" instead of "i︠e︡."
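The four differences above could be sketched as a single function. This is only an illustration under stated assumptions: `simplifyAlaLc` is a hypothetical name, the ligature is assumed to be encoded either as the U+FE20/U+FE21 half marks or as U+0361, and all-caps words are not handled specially (rule 3 would turn an all-caps "TS" cluster into "Ts").

```javascript
// Sketch: strict ALA-LC romanization -> the simplified "LC" form described above.
function simplifyAlaLc(text) {
  return text
    .normalize("NFC") // make i + U+0306 a single precomposed ĭ
    // Rule 4: yat' (ligatured "ie") becomes plain e, keeping the case of the first letter.
    .replace(/([Ii])(?:\uFE20[Ee]\uFE21|\u0361[Ee])/g,
      (match, first) => (first === "I" ? "E" : "e"))
    // Rule 3: ligatured capital pairs (TS, IU, IA) get English capitalization (Ts, Iu, Ia).
    .replace(/([A-Z])(?:\uFE20([A-Z])\uFE21|\u0361([A-Z]))/g,
      (match, first, half, dbl) => first + (half || dbl).toLowerCase())
    // Rule 1: drop any remaining ligature marks.
    .replace(/[\uFE20\uFE21\u0361]/g, "")
    // Rule 2: i-breve becomes plain i.
    .replace(/\u012D/g, "i")
    .replace(/\u012C/g, "I");
}
```

The order matters: the yat' and capitalization rules have to run while the ligature marks are still present, since the marks are what identify those clusters.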

I believe this might be an older version of the LC system, but at any rate it is the standard form used in publications such as the Russian Review. It's also routinely called the LC system; I don't think people are actually aware of what the strict form entails.

Speaking only for myself, it's irrelevant to me that the ligatures are displayed in Zotero itself (as long as the search is made to work properly). What really bugs me and makes Zotero very hard to use in final products is the presence of these forms in the citations. So a citation-level fix would be fine for me as a historian.

aurimas · July 24, 2014

Here's some info about this from CMoS:

Journals of Slavic studies generally prefer a “linguistic” system that makes free use of diacritics and ligatures. In works intended for a general audience, however, diacritics and ligatures should be avoided. For general use, Chicago recommends the system of the United States Board on Geographic Names.

So it seems that maintaining ligatures in Zotero would be preferable. (Technically, we would probably want to maintain titles in the original language and transliterate as required, but we are very very very far from being able to handle that.)

slawkenbergius · July 24, 2014

I don't know which journals they're talking about, but a quick scan through the footnotes of the Slavic Review, Russian Review, or Kritika--three of the leading journals for this kind of material--will demonstrate pretty clearly that nobody uses ligatures and diacritics in citations. All three journals say they use the LoC system, but actual usage implies that they do not mean the strict form.

aurimas · July 24, 2014

Yes, so it seems. I have not yet found a journal that includes ligatures. I suggest that we simply strip out ligatures (and do other conversions as noted on Twitter) when citing Russian sources if Frank is ok with this.

fbennett · July 24, 2014

It would still be best to preserve the transliteration taken from the original metadata source, since it will be more likely to succeed when used in an openSearch query.

We should be able to solve this with a plugin, if I provide a hook in citeproc-js for an unconditional transform function, applied to CSL items before the abbreviation mechanism gets ahold of them. All we would need is a set of JSON mappings for character clusters to be transformed, and a small plugin to attach a function that makes use of them to the processor. Something like:

{
  "ru": {
     "[ligature chars]": "e",
     "[ligature chars]": "i"
  }
}

Maybe.
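A rough sketch of how a plugin might consume such a mapping. Everything here is illustrative: the two clusters in `mappings` are placeholders rather than an agreed table, and `buildTransform` and `transformForLanguage` are hypothetical names.

```javascript
// Illustrative table only: language code -> { character cluster: replacement }.
const mappings = {
  ru: {
    "i\uFE20e\uFE21": "e", // ligatured "ie" (yat')
    "\u012D": "i"          // i-breve
  }
};

// Escape a literal string for use inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compile one table into a function that replaces every listed cluster.
function buildTransform(table) {
  // Longest clusters first, so multi-character clusters win over shorter ones.
  const keys = Object.keys(table).sort((a, b) => b.length - a.length);
  if (keys.length === 0) return (text) => text;
  const pattern = new RegExp(keys.map(escapeRegExp).join("|"), "g");
  return (text) => text.replace(pattern, (cluster) => table[cluster]);
}

// Look up the table by primary language subtag; leave other languages alone.
function transformForLanguage(text, language) {
  const primary = (language || "").toLowerCase().split("-")[0];
  const table = mappings[primary];
  return table ? buildTransform(table)(text) : text;
}
```

Keeping the clusters in data rather than code would let the table be extended per language without touching the plugin itself.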

aurimas · July 24, 2014

> It would still be best to preserve the transliteration taken from the original metadata source, since it will be more likely to succeed when used in an openSearch query.

Precisely, which is why I suggest doing this in citeproc.

> We should be able to solve this with a plugin

Sure, though I don't see why this can't be integrated, since we haven't been able to find any use case for Russian citations with ligatures (is your main concern code organization?). Zotero could just include another .js file in its source, so it's not a big deal.

> if I provide a hook in citeproc-js for an unconditional transform function, applied to CSL items before the abbreviation mechanism gets ahold of them.

Not entirely sure what you mean by "unconditional" here. Clearly this would only be applied to items in Russian. I haven't looked into how hooks work in citeproc, could you link to some documentation or relevant code?

fbennett · July 24, 2014

> Precisely, which is why I suggest doing this in citeproc.

Yep, we're definitely on the same page; we just crossed in the post.

> Sure, though I don't see why this can't be integrated.

No objection to integration into Zotero. A plugin would provide a playground in which to test it all. Abbreviation support unfolded that way too.

> Not entirely sure what you mean by "unconditional" here.

Sorry, that wasn't very clear. I only meant that it would be similar to an abbreviation or text-case transform, but would not depend on a triggering CSL attribute.

> I haven't looked into how hooks work in citeproc, could you link to some documentation or relevant code?

The processor manual needs to be updated, I should get around to that this summer. Meanwhile, a hook would just be a (documented!) external function added to the sys object before data is loaded into the processor:

citeproc.sys.stripLigatures = function (Item) {
    // Do stuff to Item
}
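To make the idea a little more concrete, here is one way the body of such a function might look. This is a sketch under assumptions, not the processor's actual API: the hook did not exist at the time of writing, `sys` below is a stand-in object, and whether the processor would expect the item to be returned or modified in place is an open question (this version returns a modified copy).

```javascript
// Stand-in for the processor's sys object; the real one is supplied by the host application.
const sys = {};

const LIGATURE_MARKS = /[\uFE20\uFE21\u0361]/g;
const strip = (value) =>
  typeof value === "string" ? value.replace(LIGATURE_MARKS, "") : value;

sys.stripLigatures = function (Item) {
  // Only touch items whose language field marks them as Russian.
  if (!/^ru(-|$)/i.test(Item.language || "")) return Item;

  const copy = { ...Item };
  // Plain text fields (an illustrative selection of CSL variables).
  for (const field of ["title", "container-title", "publisher"]) {
    if (field in copy) copy[field] = strip(copy[field]);
  }
  // Name fields are arrays of objects with family/given/literal parts.
  for (const role of ["author", "editor"]) {
    if (Array.isArray(copy[role])) {
      copy[role] = copy[role].map((name) =>
        Object.fromEntries(Object.entries(name).map(([k, v]) => [k, strip(v)])));
    }
  }
  return copy;
};
```

Because the function works on a copy, the stored metadata keeps its original transliteration for lookups and only the data handed to the citation output is changed.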

aurimas · July 24, 2014

OK, sounds good. Let me know when you add the hook.

dwoodruff · April 18, 2019

Hi folks. Did anything ever happen with this?

krnl0138 · November 6, 2019

Bump thread. Is there anything new?