Multiple versions of a field in a record
I'm trying Zotero out again, seeing if it will meet my needs, and there's a huge barrier that is probably insurmountable with the current design but I figured I'd ask about anyhow. So here goes.
I study Chinese literature, which means that I need to cite works in Chinese and Japanese frequently. It's standard scholarly practice to give the titles of texts in three ways: in Romanization, in characters, and in English translation. In a database, of course, these should be kept separate; they are also formatted differently (the Romanization is italicized, the characters and translation are not).
Is there any chance Zotero will one day be able to handle this kind of situation? As it works now, for example importing from OCLC WorldCat, Zotero does an inconsistent job of handling Chinese titles: the problem seems to be that records tend to just have the two forms one after the other and no distinction between them (sometimes, though, it mixes up author and title info, and puts author data information into name fields). Or is there already some way of, for example, adding new fields that I haven't noticed?
Thanks.
It will take some thought to get the details right, though, particularly as Zotero moves into collaborative server functionality. What does a "title" mean, for example, when we're talking about, say, scholars collaborating and sharing data across languages?
Thinking off the top of my head, maybe: title (original language and script), transliteratedTitle (alternative script), translatedTitle (alternative language)?
This issue has been dealt with somewhat by librarians, and the MARC specification has a system for handling variants ($6 and tag 880). A related problem is that some databases list only translated titles (some of the social science indexes for example), and it would be helpful to have Zotero recognize that this is a translation, and in many cases an unofficial one created by the indexer.
I could see other advantages to opening up the category of "title." For example, some citation formats (e.g., Chicago) use "short titles" for citations after the first. If there were a way to input a preferred short title, this would be useful for citation. Also, some early modern books have very long titles (The Chicken, Being an Inquiry into the Origin and Uses of Domestic Fowl, their Eggs & their Feathers, from Antiquity to the Present, based on Sources Scriptural & Profane, with Passages from the Choicest Authors, Englished & Annotated, & An Inquiry into the Eleven Spices & Flavorings....) that one would want to have recorded but that are not usually cited in full.
1) Would these three "language tags" be sufficient, or should Zotero allow an arbitrary number of variants per field, with a way to designate the variant containing the primary version? An example of the latter: Traditional Chinese, Simplified Chinese, and Latin transliteration, with Traditional Chinese marked as primary. Are there any examples that would require more than three variants?
2) Is it necessary to have script identifiers (e.g., to say, this is Cyrillic)? Is it desirable to have a fixed set of script identifiers, even if certain languages won't be covered? (MARC has some, but it seems like a rather culturally limited list.)
3) What fields need this support? Do they have different requirements? (For example, creators would need multiple script support but probably wouldn't need translation, so translatedTitle may just need to be a data field separate from this issue of script variants.)
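To make question 1 concrete, here is a minimal Python sketch of the "arbitrary number of variants, one marked primary" option. The class names (`Variant`, `MultiField`), the `primary` flag, and the sample title are all hypothetical and chosen only for illustration; nothing here reflects how Zotero actually stores data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    text: str
    tag: str               # hypothetical label, here a language-script tag
    primary: bool = False  # True for the form treated as the main version


@dataclass
class MultiField:
    variants: List[Variant] = field(default_factory=list)

    def primary(self) -> Optional[Variant]:
        """Return the variant marked primary, else the first one, else None."""
        for v in self.variants:
            if v.primary:
                return v
        return self.variants[0] if self.variants else None

    def get(self, tag: str) -> Optional[Variant]:
        """Return the first variant carrying exactly this tag, if any."""
        for v in self.variants:
            if v.tag == tag:
                return v
        return None


# Traditional Chinese (primary), Simplified Chinese, Latin transliteration.
title = MultiField([
    Variant("紅樓夢", "zh-Hant", primary=True),
    Variant("红楼梦", "zh-Hans"),
    Variant("Hong lou meng", "zh-Latn"),
])
```

With this shape a fourth or fifth variant costs nothing extra, whereas a fixed trio of fields would need a schema change for each new form.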
I haven't fully reviewed MARC's multiscript support, but I'm hoping someone with experience using MARC or just with integrating multiple languages into their research can comment on what they see as necessary/desirable for Zotero to support.
But I am having problems with transliterated titles (mainly Arabic and Persian). They appear correctly in the library but not when I cite in a Word document. I am wondering if the problem has to do with my font settings (and not with Zotero). I suppose my new Mac does not support Unicode. Or does it?
Thanks,
D
That aside, my concern is to ensure whatever solution can work well in the context of RDF (bibo, dc, and such; data needs to be able to be reliably imported and exported) and CSL (a processor needs to be able to figure out how to format it).
For the sake of argument (since this isn't my expertise), what if all strings get a language tag? RDF supports this out of the box; e.g.:
<http://ex.net/1> dcterms:title "foo"@en .
Is there a way to indicate script with a language tag, such that one could, say, have a transliteratedTitle property that included such tag to indicate the script?
If yes, then it seems to me this would likely be the simplest and most reliable solution. Quick googling turns up this. A quick read seems to suggest the answer to my above question is "yes"?
http://forums.zotero.org/discussion/3214/more-fields-translation-info/
http://forums.zotero.org/discussion/4641/multilingual-citations/
lastName: "夏目"
transliteratedLastName: "なつめ" @ ja
transliteratedLastName: "Natsume" @ en
translatedLastName: nil
Full(ish) data for a Japanese title might be:
title: "我輩は猫である"
transliteratedTitle: "わがはいはねこである" @ ja
transliteratedTitle: "wagahai ha neko de aru" @ en
translatedTitle: "I am a Cat" @ en
Hope this helps.
I have a few remarks to the discussion:
1. As fbennett has pointed out, one field for transliteration is probably not enough. With Chinese titles, though, the problem seems to be slightly different: a lot of older books use an old or incorrect transliteration, so both this transliteration and the new, official one have to be included. For example: 周策纵, Chow Tse-Tsung ("old" transliteration), Zhou Cezong (Pinyin transliteration, which I want to use for alphabetical ordering in a bibliography).
2. As far as I can tell, these three additional fields (transliteration 1+2, translation) are mostly needed for title and author/editor, but nevertheless (sometimes) they are also necessary for other fields like publisher and place.
3. And a little off topic: For many Chinese Classics there is a commentator which could be included in the roles.
The most complete solution would probably be to have three sub-fields for all translatable fields (main, transliteration, translation) and the additional option to activate one or more language codes for each field. Actually, why wouldn't you be able to activate more than one language for each text field?
For the language codes, it's maybe best to allow everything in ISO 639-2, which should cover most languages of the world.
Thinking the issues through this morning, I realized that a trial solution can be implemented entirely in the CSL processor, without any modification to Zotero, and without any extension to the CSL language. I'm now toying with the idea of giving it a whirl. If anyone would like to provide critical feedback while I build tests and play around with possibilities, please post to this thread. If several people are interested, maybe we can set up a google group or something to use for discussion.
I already had this thread bookmarked and will keep an eye on it...or feel free to email (see profile).
A few comments:
I can imagine scenarios that would require more translations/forms. For instance, some languages have multiple transliteration standards.
For output, I have recently seen a few bibliographies with Japanese texts in three forms: original, rōmaji (transliterated), and translated.
I find it a little tricky to discuss this without mixing up the concept/requirements with implementation options. Or maybe just hard to come up with the right vocabulary - I'm thinking of terms in this thread like "tags," "sub-fields," "layers." Anyhow when I say "translations/forms" above I mean the various representations of title/author/publisher/etc. - whether translations, transliterations, or phonetic renderings.
As a practical matter, for example, consider this new periodicals dataset, which I am hoping to see Zotero use. How would alternate language versions of the title be represented here? Here's an example, in the turtle format:
<http://periodicals.dataincubator.org/journal/electrical-engineering-japan>
    dc:subject "Electric engineering" ;
    dct:publisher <http://periodicals.dataincubator.org/organization/wiley-blackwell-john-wiley-and-sons> ;
    dct:subject <http://id.loc.gov/authorities/sh85041666> ;
    dct:title "Electrical Engineering in Japan" ;
    bibo:doi "10.1002/(ISSN)1520-6416" ;
    bibo:eissn "1520-6416" ;
    bibo:issn "0424-7760" ;
    a bibo:Journal ;
    owl:sameAs <http://periodicals.dataincubator.org/eissn/1520-6416>, <http://periodicals.dataincubator.org/issn/0424-7760> ;
    foaf:homepage <http://dx.doi.org/10.1002/(ISSN)1520-6416> .
The simplest is to give each title a language tag, it seems to me. So, for example we could change the title property to:
dct:title "Electrical Engineering in Japan"@en ;
This says, of course, "this string is English." For sake of argument, why not just repeat the title properties in different languages and scripts, and use the relevant lang tag? Presumably, so long as you know the original language, things should work?
It goes beyond language alone. For the Asian scripts, we need to have access to multiple methods of transliteration, as well as translations. For example, consider the cite:
我妻栄 「民法 」有斐閣:1948
In a bibliography, this would be sorted on the author using the katakana character form of the author's name (ワガツマサカエ), so we need to deliver that transliteration to the sort key. In another style, the same citation data might render like this:
Wagatsuma Sakae, Minpō [Civil Law] (1948).
This uses revised Hepburn for both the author name and the title, with a bracketed translation of the title. In this case, the romanized author name should be used for the sort key. If there are standards to cover this, we should certainly be using them. Is there a standard method of specifying transliteration rules and scripts, as well as languages?
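The sort-key requirement can be sketched in a few lines of Python. This is only an illustration of the idea: the form labels ("default", "kana", "hepburn") and the `sort_key` helper are made up for the example, and a real processor would also need locale-aware collation rather than plain string comparison.

```python
from typing import Dict, List

# A name is a mapping from a (hypothetical) form label to its representation.
Name = Dict[str, str]


def sort_key(name: Name, preferred: List[str]) -> str:
    """Return the first preferred form that exists, else the default form."""
    for form in preferred:
        if form in name:
            return name[form]
    return name["default"]


wagatsuma: Name = {
    "default": "我妻栄",
    "kana": "ワガツマサカエ",
    "hepburn": "Wagatsuma Sakae",
}

# A Japanese-language style sorts on kana; a Western style on the romanization.
kana_key = sort_key(wagatsuma, ["kana"])
roman_key = sort_key(wagatsuma, ["hepburn"])
```

The same citation data yields a different key per style, which is the point: the style, not the record, decides which representation feeds the sort.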
These are not corner cases; this is normal citation practice when working with Asian language materials. As alexuw points out, the ability to handle these cases is a minimum threshold requirement for a reference manager in these languages.
@alexuw,
I put in a little time on this yesterday, and there is now trial code for handling the basic possibilities. There is a set of working notes here. You can check out the tests here:
(1) Name: Katakana for name sort
(2) Name: Use Hepburn transliteration and native ordering if available, otherwise English with Western ordering, otherwise fallback to default
(3) Title: Use arbitrary representation for primary display (normally should be a transliteration, if specified)
(4) Title: Fallback to default if requested primary representation does not exist
(5) Title: Include secondary representation in display (will normally be a translation)
(6) Title: Do not add secondary representation if not available.
(7) Title: Sort title on specified alternative text, if available (title variable called from CSL macro).
(8) Title: Sort title on specified alternative text, if available (title variable called from CSL sort key directly).
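The title behaviour in tests (3) to (6) amounts to a small fallback rule. The Python sketch below is not the trial code from the processor; `render_title`, the form labels, and the bracket formatting are assumptions made purely to show the logic.

```python
from typing import Dict, Optional


def render_title(forms: Dict[str, str], primary: str,
                 secondary: Optional[str] = None) -> str:
    """Render a title from a dict of representations.

    Uses the requested primary form, falling back to "default" if it is
    missing, and appends the secondary form in brackets only if it exists.
    """
    main = forms.get(primary, forms["default"])
    extra = forms.get(secondary) if secondary is not None else None
    if extra is not None and extra != main:
        return f"{main} [{extra}]"
    return main


minpo = {"default": "民法", "hepburn": "Minpō", "en": "Civil Law"}

full = render_title(minpo, "hepburn", "en")         # primary plus translation
fallback = render_title(minpo, "pinyin")            # no such form: use default
no_secondary = render_title(minpo, "hepburn", "fr") # translation unavailable
```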
I'll confess that I tend to drive people nuts by mixing up implementation details with design considerations; I don't have any formal CS training, it just kind of happens. At the moment, though, I'm just interested in getting something out there that interested parties can look at and take for a test drive (with appropriate caveats), to see how well a given approach addresses use cases in the wild. But I do think we have a good start toward a solution with the above.
If that doesn't go far enough, then we can consider alternate title properties (the previously mentioned transliterated and translated).
dct:title "日本における電子工学"@ja ;
dct:title "ニホンニオケルデンシコウガク"@ja ;
dct:title "nihon ni okeru denshi kougaku"@en ;
dct:title "nihon ni okeru denshi kōgaku"@en ;
dct:title "Electrical Engineering in Japan"@en ;
There are multiple representations for each language, including multiple forms of transliteration. How do I tell the processor which one to use on the basis of the language tag alone?
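A quick Python illustration of the ambiguity, using the five titles as plain (text, tag) pairs. The `by_language` helper is hypothetical; it simply groups strings by their tag the way a naive processor might.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

titles: List[Tuple[str, str]] = [
    ("日本における電子工学", "ja"),
    ("ニホンニオケルデンシコウガク", "ja"),
    ("nihon ni okeru denshi kougaku", "en"),
    ("nihon ni okeru denshi kōgaku", "en"),
    ("Electrical Engineering in Japan", "en"),
]


def by_language(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group title strings by their bare language tag."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, lang in pairs:
        grouped[lang].append(text)
    return dict(grouped)


candidates = by_language(titles)
# Asking for "en" returns two romanizations and one translation, and asking
# for "ja" returns both the original and the kana reading; the bare tag gives
# the processor no basis for choosing among them.
```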
(EDIT: Sorry, reviewing linked doc now.)
Does that address the problem?
* The script subtag would allow you to distinguish between the two phonetic Japanese scripts.
* The variant subtag is defined in a way that it could be used for the transliteration rule. In fact this subtag registry contains Wade-Giles and Pinyin, but I don't see any others.
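As a rough sketch of how a processor might pull the script and variant subtags out of such a tag, here is a deliberately simplified Python parser. It handles only the language-script-region-variant shape and ignores extended language subtags, extensions, and private use; `parse_tag` is a hypothetical helper, and the sample tags are illustrative, so the exact subtag spellings should be checked against the registry.

```python
from typing import Dict, List, Optional, Union

Parsed = Dict[str, Union[Optional[str], List[str]]]


def parse_tag(tag: str) -> Parsed:
    """Split a language tag into language, script, region, and variants.

    Simplified: a 4-letter subtag is a script, a 2-letter or 3-digit subtag
    is a region, and anything after that is treated as a variant.
    """
    parts = tag.split("-")
    script: Optional[str] = None
    region: Optional[str] = None
    variants: List[str] = []
    for part in parts[1:]:
        is_ascii_alpha = part.isascii() and part.isalpha()
        if (script is None and region is None and not variants
                and len(part) == 4 and is_ascii_alpha):
            script = part.title()
        elif (region is None and not variants
                and ((len(part) == 2 and is_ascii_alpha)
                     or (len(part) == 3 and part.isdigit()))):
            region = part.upper()
        else:
            variants.append(part.lower())
    return {"language": parts[0].lower(), "script": script,
            "region": region, "variants": variants}


kana = parse_tag("ja-Kana")             # script distinguishes the kana form
wade = parse_tag("zh-Latn-wadegile")    # variant names the romanization rule
```

With tags like these, two strings that would both have been plain "ja" or "zh" become distinguishable by script, and two romanizations in the same script become distinguishable by variant.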
It's hard to think of all the use cases though...
A trial implementation has been written into the new CSL processor. The source archive is here (you'll need Mercurial installed to work with it). This minimal implementation provides multilingual support for the title variable and names only. Test fixtures that illustrate how it works are in the archive, or can be viewed online here (see the tests that start with the prefix multilingual_).
The implementation is intended to be somehow-operable from vanilla Zotero, even if there is no specific UI support for multilingual data entry and maintenance. When the processor is integrated into Z (probably by early next year) we'll be able to try this out in real-world documents. Feedback from that experience will be used to refine the way things are handled in the processor and (when the Zotero team turn to support multilingual) in the data layer, and hopefully serve as a basis for designing a proper UI.
If anyone has the need and would like to carry forward with other registrations, feel free to get in touch. The process is open, but as it's a contact point between a well-defined standard and the chaotic scrum of real-world practice, it can throw off a few sparks before producing a result. Would be nice to get a registration(s) in place for Korean (which, unfortunately, I neither speak nor read).
In particular, I'd like to submit the main transliteration systems used for Russian and the Latin transliteration often used for Turkic languages written in Cyrillic scripts (slightly different from the baku1926 variant already registered).
Is this a process worth undertaking? Does the "official" status of a transliteration system matter for the practical matter of using it with Zotero?
(edited for clarity)
My own registration filing took a bit of work and raised quite a bit of discussion on the reviewer's mailing list, but Japanese transliteration is particularly fragmented and ill-documented, and so (looking back and discounting a bit my mid-process exasperation) some doubt and disagreement was inevitable. With that word of encouragement, here is a link to the start of the mailings over Japanese Hepburn, to give you a flavor of how it works. Normally it takes a couple of weeks for a filing to work its way through review.
The procedure for registering new tags is all handled by the RFC 4646 Reviewer (the process is described fully, including the Reviewer list mail address, in section 3.5 of RFC 4646); all we need to do at our end is make the filing and answer any queries or requests for modification that come back. It's best to take your time responding to comments; as you can see from the archive threads, the list is fairly active when new filings arrive, and there can be conflicting opinions among the membership. If you give them time to sort through any issues, and then post a request for guidance after things quieten down a bit, it will save you extra work, and make it easier to sidestep the politics of the list.
The critical item for the filing is a cite to a stable document somewhere that unambiguously defines the method of transliteration. With that in hand, the only remaining issue, really, is what the tag should look like, and that's something that the folks on the list will help you to settle.
For their own sake, I wonder if they ought not add the entirety of the ALA-LC guide to the registry, as it is a set of clearly documented and quite obviously used (at least in bibliography) variants.