Multiple versions of a field in a record

Bruce Rusk · December 14, 2007

I'm trying Zotero out again, seeing if it will meet my needs, and there's a huge barrier that is probably insurmountable with the current design but I figured I'd ask about anyhow. So here goes.

I study Chinese literature, which means that I need to cite works in Chinese and Japanese frequently. It's standard scholarly practice to give the titles of texts in three ways: in Romanization, in characters, and in English translation. In a database, of course, these should be kept separate; they are also formatted differently (the Romanization is italicized, the characters and translation are not).

Is there any chance Zotero will one day be able to handle this kind of situation? As it works now, for example when importing from OCLC WorldCat, Zotero does an inconsistent job of handling Chinese titles: the problem seems to be that records tend to just have the two forms one after the other with no distinction between them (sometimes, though, it mixes up author and title info, and puts author data information into name fields). Or is there already some way of, for example, adding new fields that I haven't noticed?

Thanks.

bdarcus · December 14, 2007

Right now, Zotero is not designed to be terribly international-friendly in this way. But I don't think it'd be that hard to extend to support this. It basically needs support for different kinds of titles, and ways to assign them language tags.

It will take some thought to get the details right, though, particularly as Zotero moves into collaborative server functionality. What does a "title" mean, for example, when we're talking about, say, scholars collaborating and sharing data across languages?

Thinking off the top of my head, maybe: title (original language and script), transliteratedTitle (alternative script), translatedTitle (alternative language)?

Bruce Rusk · December 14, 2007

Thanks for your reply.

This issue has been dealt with somewhat by librarians, and the MARC specification has a system for handling variants ($6 and tag 880). A related problem is that some databases list only translated titles (some of the social science indexes for example), and it would be helpful to have Zotero recognize that this is a translation, and in many cases an unofficial one created by the indexer.

I could see other advantages to opening up the category of "title." For example, some citations formats (e.g., Chicago) use "short titles" for citations after the first. If there were a way to input a preferred short title, this would be useful for citation. Also, some early modern books have very long titles (The Chicken, Being an Inquiry into the Origin and Uses of Domestic Fowl, their Eggs & their Feathers, from Antiquity to the Present, based on Sources Scriptural & Profane, with Passages from the Choicest Authors, Englished & Annotated, & An Inquiry into the Eleven Spices & Flavorings....), that one would want to have recorded but that are not usually cited in full.

bdarcus · December 14, 2007

Zotero already supports abbreviated titles.

dstillman · December 14, 2007

Ticket created, thanks.

dstillman · February 10, 2008

Thinking off the top of my head, maybe: title (original language and script), transliteratedTitle (alternative script), translatedTitle (alternative language)?

So, to return to this:

1) Would these three "language tags" be sufficient, or should Zotero allow an arbitrary number of variants per field, with a way to designate the variant containing the primary version? An example of the latter: Traditional Chinese, Simplified Chinese, and Latin transliteration, with Traditional Chinese marked as primary. Are there any examples that would require more than three variants?

2) Is it necessary to have script identifiers (e.g., to say, this is Cyrillic)? Is it desirable to have a fixed set of script identifiers, even if certain languages won't be covered? (MARC has some, but it seems like a rather culturally limited list.)

3) What fields need this support? Do they have different requirements? (For example, creators would need multiple script support but probably wouldn't need translation, so translatedTitle may just need to be a data field separate from this issue of script variants.)

I haven't fully reviewed MARC's multiscript support, but I'm hoping someone with experience using MARC or just with integrating multiple languages into their research can comment on what they see as necessary/desirable for Zotero to support.
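
To make the "arbitrary number of variants" option in question 1 concrete, here is a minimal Python sketch. It is not Zotero's data model: the class names `Variant` and `MultiField`, the `is_primary` flag, the tag labels, and the sample title are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    """One representation of a field value (hypothetical structure)."""
    text: str
    tag: str                 # illustrative label for language and script
    is_primary: bool = False


@dataclass
class MultiField:
    """A field with any number of variants, at most one of them primary."""
    variants: List[Variant] = field(default_factory=list)

    def add(self, text: str, tag: str, is_primary: bool = False) -> None:
        # Refuse a second primary so "the original" stays unambiguous.
        if is_primary and self.get_primary() is not None:
            raise ValueError("a primary variant is already set")
        self.variants.append(Variant(text, tag, is_primary))

    def get_primary(self) -> Optional[Variant]:
        for variant in self.variants:
            if variant.is_primary:
                return variant
        return None


# The example from question 1: three variants, Traditional Chinese primary.
title = MultiField()
title.add("紅樓夢", "zh-Hant", is_primary=True)
title.add("红楼梦", "zh-Hans")
title.add("Hong lou meng", "zh-Latn")

print(title.get_primary().text)
```

A list of variants with one flag avoids fixing the number of slots in advance, at the cost of needing some label on each variant so that a style can ask for a particular one.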

dsajdi · April 16, 2008

First, thanks to the creators of Zotero. This is seriously amazing. Beats Endnote anytime!
But, I am having problems with transliterated titles (mainly Arabic and Persian). They appear correctly in the library but not when I cite in a Word doc. I am wondering if the problem has to do with my font settings (and not with Zotero). I suppose my new Mac does not support Unicode. Or does it?
Thanks,
D

bdarcus · November 7, 2008

1) Would these three "language tags" be sufficient, or should Zotero allow an arbitrary number of variants per field, with a way to designate the variant containing the primary version? An example of the latter: Traditional Chinese, Simplified Chinese, and Latin transliteration, with Traditional Chinese marked as primary. Are there any examples that would require more than three variants?

First, I don't think we should assume that MARC has the right solution to this issue. Might be better to look to ISO or the W3C?

That aside, my concern is to ensure whatever solution can work well in the context of RDF (bibo, dc, and such; data needs to be able to be reliably imported and exported) and CSL (a processor needs to be able to figure out how to format it).

For sake of argument (since this isn't my expertise), what if all strings get a language tag. RDF supports this out-of-box; e.g.:

<http://ex.net/1> dcterms:title "foo"@en

Is there a way to indicate script with a language tag, such that one could, say, have a transliteratedTitle property that included such tag to indicate the script?

If yes, then it seems to me this would likely be the simplest and most reliable solution.

2) Is it necessary to have script identifiers (e.g., to say, this is Cyrillic)? Is it desirable to have a fixed set of script identifiers, even if certain languages won't be covered? (MARC has some, but it seems like a rather culturally limited list.)

Quick googling turns up this. A quick read seems to suggest the answer to my above question is "yes"?

dstillman · November 28, 2008

For reference, threads pointed here:

http://forums.zotero.org/discussion/3214/more-fields-translation-info/
http://forums.zotero.org/discussion/4641/multilingual-citations/

fbennett · December 10, 2008

The proposed scheme should work for Japanese, but I would request that the multilingual layers be applied to author, translator, and editor names as well, because the phonetic readings of the Chinese characters in Japanese names vary, and are normally included in metadata. Full(ish) data for a Japanese author's family name might be:

lastName: "夏目"
transliteratedLastName: "なつめ" @ ja
transliteratedLastName: "Natsume" @ en
translatedLastName: nil

Full(ish) data for a Japanese title might be:

title: "吾輩は猫である"
transliteratedTitle: "わがはいはねこである" @ ja
transliteratedTitle: "wagahai ha neko de aru" @ en
translatedTitle: "I am a Cat" @ en

Hope this helps.

tucabib · March 3, 2009

I am having similar problems entering Chinese titles (with original, transliterated and translated titles) and I'm glad not to be the only one.
I have a few remarks to the discussion:
1. As fbennett has pointed out, one field for transliteration is probably not enough. Though with Chinese titles the problem seems to be slightly different: A lot of older books use old/false transliteration and so this transliteration plus the new/official transliteration have to be included. For example: 周策纵, Chow Tse-Tsung ("old" transliteration), Zhou Cezong (Pinyin transliteration (which I want to use for alphabetical ordering in a bibliography)).
2. As far as I can tell, these three additional fields (transliteration 1+2, translation) are mostly needed for title and author/editor, but nevertheless (sometimes) they are also necessary for other fields like publisher and place.
3. And a little off topic: For many Chinese Classics there is a commentator which could be included in the roles.

rickus · March 18, 2009

I have the same problem as Bruce Rusk. I've got a database with data in English, Chinese, Japanese, and a bunch of other languages.

1) Would these three "language tags" be sufficient, or should Zotero allow an arbitrary number of variants per field, with a way to designate the variant containing the primary version? An example of the latter: Traditional Chinese, Simplified Chinese, and Latin transliteration, with Traditional Chinese marked as primary. Are there any examples that would require more than three variants?

Three language tags would probably be a bad idea for people with multilingual bibliographies, because (1) it forces you to mix languages in one field and (2) it forces you to choose a main language. Suppose you have Chinese, Japanese, French and English titles in your DB. For some journals, you might want to have the Chinese and Japanese transliteration. When you publish in a Chinese journal, you don't need the transliterations for Chinese, but you'd want them for Japanese, and you might want translations for French...

The most complete solution would probably be to have three sub-fields for all translatable fields (main, transliteration, translation) and the additional option to activate one or more language codes for each field. Actually, why wouldn't you be able to activate more than one language for each text field?

For the language codes, it's maybe best to allow everything in ISO 639-2, which should cover most languages of the world.

fbennett · August 28, 2009

I'm keen to work out a solution to this, for use among our overseas students. Their requirements are similar to what rickus describes: they write in English or in Japanese, citing resources in any of several languages (English, Russian, Chinese, Mongolian, Vietnamese, Khmer, Lao, Korean and Uzbek are all fairly common). Styled citation output should be able to place citations in one of four forms (original, original with translation, transliteration (in appropriate script), or transliteration (in appropriate script) with translation). Japanese and Chinese names and titles need to be sortable on their phonetic representation.
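
The four output forms listed above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `render_title`, the form labels, and the bracketed-translation layout are hypothetical and are not taken from any CSL style or processor.

```python
from typing import Optional

# Hypothetical labels for the four forms described in the post.
FORMS = (
    "original",
    "original+translation",
    "transliteration",
    "transliteration+translation",
)


def render_title(original: str,
                 transliteration: Optional[str] = None,
                 translation: Optional[str] = None,
                 form: str = "original") -> str:
    """Render a title in one of the four forms, degrading gracefully."""
    if form not in FORMS:
        raise ValueError(f"unknown form: {form}")
    main = original
    # Use the transliteration only if it was requested and actually exists.
    if form.startswith("transliteration") and transliteration:
        main = transliteration
    # Append the translation in brackets only if requested and available.
    if form.endswith("+translation") and translation:
        return f"{main} [{translation}]"
    return main


print(render_title("民法", "Minpō", "Civil Law", "transliteration+translation"))
```

When a requested representation is missing, this sketch falls back to the original text instead of failing, which is one possible policy among several.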

Thinking the issues through this morning, I realized that a trial solution can be implemented entirely in the CSL processor, without any modification to Zotero, and without any extension to the CSL language. I'm now toying with the idea of giving it a whirl. If anyone would like to provide critical feedback while I build tests and play around with possibilities, please post to this thread. If several people are interested, maybe we can set up a google group or something to use for discussion.

alexuw · August 31, 2009

I'm interested. This is one of two or three bits of functionality that prevent me from using "live" Zotero citations.

I already had this thread bookmarked and will keep an eye on it...or feel free to email (see profile).

A few comments:

I can imagine scenarios that would require more translations/forms. For instance, some languages have multiple transliteration standards.

For output, I have recently seen a few bibliographies with Japanese texts in three forms: original, rōmaji (transliterated), and translated.

I find it a little tricky to discuss this without mixing up the concept/requirements with implementation options. Or maybe just hard to come up with the right vocabulary - I'm thinking of terms in this thread like "tags," "sub-fields," "layers." Anyhow when I say "translations/forms" above I mean the various representations of title/author/publisher/etc. - whether translations, transliterations, or phonetic renderings.

bdarcus · August 31, 2009

I think whatever solution really needs to be standards-based, and to consider the issues I've raised re: import/export and such. I also think it's best to start simple and not worry about every possible corner case upfront.

As a practical matter, for example, consider this new periodicals dataset, which I am hoping to see Zotero use. How would alternate language versions of the title be represented here? Here's an example, in the turtle format:

<http://periodicals.dataincubator.org/journal/electrical-engineering-japan>
    dc:subject "Electric engineering" ;
    dct:publisher <http://periodicals.dataincubator.org/organization/wiley-blackwell-john-wiley-and-sons> ;
    dct:subject <http://id.loc.gov/authorities/sh85041666> ;
    dct:title "Electrical Engineering in Japan" ;
    bibo:doi "10.1002/(ISSN)1520-6416" ;
    bibo:eissn "1520-6416" ;
    bibo:issn "0424-7760" ;
    a bibo:Journal ;
    owl:sameAs <http://periodicals.dataincubator.org/eissn/1520-6416>, <http://periodicals.dataincubator.org/issn/0424-7760> ;
    foaf:homepage <http://dx.doi.org/10.1002/(ISSN)1520-6416> .

The simplest is to give each title a language tag, it seems to me. So, for example we could change the title property to:

dct:title "Electrical Engineering in Japan"@en ;

This says, of course, "this string is English." For sake of argument, why not just repeat the title properties in different languages and scripts, and use the relevant lang tag? Presumably, so long as you know the original language, things should work?

fbennett · August 31, 2009

@Bruce,

It goes beyond language alone. For the Asian scripts, we need to have access to multiple methods of transliteration, as well as translations. For example, consider the cite:

我妻栄「民法」有斐閣：1948

In a bibliography, this would be sorted on the author using the katakana character form of the author's name (ワガツマサカエ), so we need to deliver that transliteration to the sort key. In another style, the same citation data might render like this:

Wagatsuma Sakae, Minpō [Civil Law] (1948).

This uses revised Hepburn for both the author name and the title, with a bracketed translation of the title. In this case, the romanized author name should be used for the sort key. If there are standards to cover this, we should certainly be using them. Is there a standard method of specifying transliteration rules and scripts, as well as languages?
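
A rough Python sketch of the sort-key point: the same records sort differently depending on which representation feeds the key. The record layout, the key names `kana` and `latin`, and the two extra authors are assumptions added for illustration. Plain code-point comparison is used here as a stand-in for real collation, which it is not.

```python
from typing import Dict, List

# Hypothetical records: each name carries its original form plus two readings.
names: List[Dict[str, str]] = [
    {"original": "我妻栄", "kana": "ワガツマサカエ", "latin": "Wagatsuma Sakae"},
    {"original": "夏目漱石", "kana": "ナツメソウセキ", "latin": "Natsume Sōseki"},
    {"original": "川島武宜", "kana": "カワシマタケヨシ", "latin": "Kawashima Takeyoshi"},
]


def sort_names(items: List[Dict[str, str]], key_form: str) -> List[Dict[str, str]]:
    """Sort on the requested representation, falling back to the original."""
    return sorted(items, key=lambda name: name.get(key_form, name["original"]))


# Sorting on the katakana reading gives phonetic order.
print([n["original"] for n in sort_names(names, "kana")])
# Sorting on the kanji themselves gives code-point order, which is not phonetic.
print([n["original"] for n in sort_names(names, "original")])
```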

These are not corner cases, this is normal citation practice when working with Asian language materials. As alexuw points out, the ability to handle these cases is a minimum threshold requirement for a reference manager in these languages.

@alexuw,

I put in a little time on this yesterday, and there is now trial code for handling the basic possibilities. There is a set of working notes here. You can check out the tests here:

(1) Name: Katakana for name sort
(2) Name: Use Hepburn transliteration and native ordering if available, otherwise English with Western ordering, otherwise fallback to default
(3) Title: Use arbitrary representation for primary display (normally should be a transliteration, if specified)
(4) Title: Fallback to default if requested primary representation does not exist
(5) Title: Include secondary representation in display (will normally be a translation)
(6) Title: Do not add secondary representation if not available.
(7) Title: Sort title on specified alternative text, if available (title variable called from CSL macro).
(8) Title: Sort title on specified alternative text, if available (title variable called from CSL sort key directly).
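
The fallback behaviour that tests (2), (4) and (6) describe can be pictured with the Python sketch below. It is not the processor's code: the function name `pick_representation` and the keys `hepburn` and `english` are hypothetical labels chosen for the example.

```python
from typing import Dict, List


def pick_representation(variants: Dict[str, str],
                        preferred: List[str],
                        default: str) -> str:
    """Return the first preferred representation that exists, else the default."""
    for key in preferred:
        if key in variants:
            return variants[key]
    return default


# Preference order in the spirit of test (2): Hepburn, then English, then default.
order = ["hepburn", "english"]

print(pick_representation({"hepburn": "Minpō", "english": "Civil Law"}, order, "民法"))
print(pick_representation({"english": "Civil Law"}, order, "民法"))
print(pick_representation({}, order, "民法"))
```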

I'll confess that I tend to drive people nuts by mixing up implementation details with design considerations; I don't have any formal CS training, it just kind of happens. At the moment, though, I'm just interested in getting something out there that interested parties can look at and take for a test drive (with appropriate caveats), to see how well a given approach addresses use cases in the wild. But I do think we have a good start toward a solution with the above.

bdarcus · August 31, 2009

@Frank: I know all that, but my question remains: why not just tag each string with a language tag? AFAIK, those tags can include information about transliteration (e.g. script); see my earlier link.

If that doesn't go far enough, then we can consider alternate title properties (the previously mentioned transliterated and translated).

fbennett · August 31, 2009

dct:title "日本における電子工学"@ja ;
dct:title "ニホンニオケルデンシコウガク"@ja ;
dct:title "nihon ni okeru denshi kougaku"@en ;
dct:title "nihon ni okeru denshi kōgaku"@en ;
dct:title "Electrical Engineering in Japan"@en ;

There are multiple representations for each language, including multiple forms of transliteration. How do I tell the processor which one to use on the basis of the language tag alone?

(EDIT: Sorry, reviewing linked doc now.)

bdarcus · August 31, 2009

To quote from the linked document:

Most language tags consist of a two- or three-letter language subtag. Sometimes this is followed by a two-letter or three-digit region subtag. RFC 4646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include script, variant, extension and private-use subtags.

So the tag can include the script (and maybe the transliteration rule?).

Does that address the problem?

fbennett · August 31, 2009

This is great as a source of language and subtag names. We'd want to use only a subset of the available specifiers, though, to keep things simple.

alexuw · August 31, 2009

The language tags might be sufficient.

* The script subtag would allow you to distinguish between the two phonetic Japanese scripts.

* The variant subtag is defined in a way that it could be used for the transliteration rule. In fact this subtag registry contains Wade-Giles and Pinyin, but I don't see any others.
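
As a rough illustration of the two points above, the Python sketch below splits a tag into language, script, region and variant subtags. It is a deliberately simplified reading of the tag syntax and assumes well-formed input; it ignores extended-language, extension and private-use subtags, and the function name `parse_tag` is hypothetical.

```python
from typing import Dict, List, Optional, Union

Parsed = Dict[str, Union[Optional[str], List[str]]]


def parse_tag(tag: str) -> Parsed:
    """Split a simple language tag into its subtags (simplified sketch)."""
    parts = tag.split("-")
    language = parts[0].lower()
    rest = parts[1:]
    script: Optional[str] = None
    region: Optional[str] = None
    # A four-letter alphabetic subtag right after the language is the script.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()
    # A two-letter or three-digit subtag in the next position is the region.
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0).upper()
    # Whatever remains is treated as variant subtags in this sketch.
    variants = [subtag.lower() for subtag in rest]
    return {"language": language, "script": script,
            "region": region, "variants": variants}


# The two phonetic Japanese scripts differ only in the script subtag.
print(parse_tag("ja-Hira"))
print(parse_tag("ja-Kana"))
# Two romanizations of Chinese differ only in the variant subtag.
print(parse_tag("zh-Latn-pinyin"))
print(parse_tag("zh-Latn-wadegile"))
```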

It's hard to think of all the use cases though...

fbennett · August 31, 2009

The IANA is open to new submissions. I might test the waters by trying to get modified Hepburn registered. The use cases are an empirical thing. Once we have something running, user complaints will light the way. :)

ajlyon · September 23, 2009

Is this still on its way? I have a computer-hobby background, and I'd love to test and tweak this promising development if there's something to build on. Most of my work is in Slavic and Turkic, and I'm running into the same messes as the Japanese/Chinese/Korean scholars. This would be a superb feature to roll out in Zotero.

fbennett · September 23, 2009

ajlyon: welcome!

A trial implementation has been written into the new CSL processor. The source archive is here (you'll need Mercurial installed to work with it). This minimal implementation provides multilingual support for the title variable and names only. Test fixtures that illustrate how it works are in the archive, or can be viewed online here (see the tests that start with the prefix multilingual_).

The implementation is intended to be somehow-operable from vanilla Zotero, even if there is no specific UI support for multilingual data entry and maintenance. When the processor is integrated into Z (probably by early next year) we'll be able to try this out in real-world documents. Feedback from that experience will be used to refine the way things are handled in the processor and (when the Zotero team turn to support multilingual) in the data layer, and hopefully serve as a basis for designing a proper UI.

fbennett · October 9, 2009

New ja-Latn-hepburn and ja-Latn-hepburn-heploc subtags have worked their way through the review process set up by RFC 4646, and are now available in the IANA Language Subtag Registry.

If anyone has the need and would like to carry forward with other registrations, feel free to get in touch. The process is open, but as it's a contact point between a well-defined standard and the chaotic scrum of real-world practice, it can throw off a few sparks before producing a result. Would be nice to get a registration(s) in place for Korean (which, unfortunately, I neither speak nor read).

ajlyon · November 8, 2009

I'd like to make sure that proper latinizations of my non-Latin research languages, specifically Russian and Tatar, can be used when Zotero starts using this aspect of the new processor. Is it important to attempt to register such transliteration systems?

In particular, I'd like to submit the main transliteration systems used for Russian and the Latin transliteration often used for Turkic languages written in Cyrillic scripts (slightly different from the baku1926 variant already registered).

Is this a process worth undertaking-- does the "official" status of a transliteration system matter for the practical matter of using it with Zotero?

(edited for clarity)

fbennett · November 8, 2009

It's definitely worth doing, because it will help keep the tagging of transliteration methods consistent across different user databases. Ideally, several people collaborating through a group should be able to copy items from their existing personal databases into the group, and trust that the multilingual stuff will just work, for everyone else as well.

My own registration filing took a bit of work and raised quite a bit of discussion on the reviewer's mailing list, but Japanese transliteration is particularly fragmented and ill-documented, and so (looking back and discounting a bit my mid-process exasperation) some doubt and disagreement was inevitable. With that word of encouragement, here is a link to the start of the mailings over Japanese Hepburn, to give you a flavor of how it works. Normally it takes a couple of weeks for a filing to work its way through review.

The procedure for registering new tags is all handled by the RFC 4646 Reviewer (the process is described fully, including the Reviewer list mail address, in section 3.5 of RFC 4646); all we need to do at our end is make the filing and answer any queries or requests for modification that come back. It's best to take your time responding to comments; as you can see from the archive threads, the list is fairly active when new filings arrive, and there can be conflicting opinions among the membership. If you give them time to sort through any issues, and then post a request for guidance after things quieten down a bit, it will save you extra work, and make it easier to sidestep the politics of the list.

The critical item for the filing is a cite to a stable document somewhere that unambiguously defines the method of transliteration. With that in hand, the only remaining issue, really, is what the tag should look like, and that's something that the folks on the list will help you to settle.

ajlyon · November 8, 2009

Thanks for the detailed advice. I've made my first post, proposing the ALA-LC romanization that dominates in English-language Slavics (at the very least, French and German have distinct romanizations). Judging from my reading of the list archives, this could be an interesting process. As far as I know, this is a less hairy issue than the CJK romanizations that that list discussed at such great length.

For their own sake, I wonder if they ought not add the entirety of the ALA-LC guide to the registry, as it is a set of clearly documented and quite obviously used (at least in bibliography) variants.

ajlyon · November 29, 2009

Barring any unexpected complaints, the variant subtag "alalc97" will be approved in the near future. Since we're in the world of bibliography here, that subtag should cover many transliterations that are likely to be needed with Zotero. It should soon be possible to say "ru-Latn-alalc97", "ja-Latn-alalc97", "ko-Latn-alalc97" and so on, for all transliterations described in http://www.loc.gov/catdir/cpso/roman.html, with the exceptions as listed on that page.

fbennett · November 29, 2009

@ajlyon: I've been following the list traffic; well done, this is a very good result. I see that my "own" tag for ja-Latn-hepburn (and ja-Latn-hepburn-heploc) will now be deprecated (already!), but that is definitely the right way to go, if the alalc97 generic subtag goes through. Quite happy with this.

ajlyon · November 30, 2009

@fbennett: Have you been using some sort of workaround with Zotero to meet your multilingual needs? I haven't been able to do more than read the citeproc-js documentation and wish that it worked today; it looks good, and I can only wish that I had a way to start maintaining multiple versions of fields today.