Multilingual Zotero: mixed RTL-LTR language input problem

seadeer · July 27, 2012

I am trying to enter a title in Farsi that contains some Roman characters in parentheses. As soon as I enter the Roman characters the word order in the paragraph changes, so that words in the title get all mixed up and I cannot put those Roman characters where they should be.

Is there a solution for this problem? The whole reason I've started using Zotero was to have something that could handle a RTL-LTR bibliography...

fbennett · July 27, 2012

This is with an in-text citation, and the bad formatting extends beyond the Zotero citation?

seadeer · July 27, 2012

This happens as I'm editing a field of an entry in Zotero (when one field contains mostly Farsi but with some Roman script included). So, the word order in that field is reflected in the citation. Other than that, the text in citations is exported correctly.
Of course, the same happens if I just start typing Farsi in this window: برراسی تب کریمه-کنگو CCHF
The word "CCHF" really should have been "after" کریمه-کنگو (i.e. to the left of it).

fbennett · July 27, 2012

I have an inkling of where the problem may lie, from this W3C document on bi-directional text.

Probably the way to handle this will be to apply directional markup to the field content in the output, for the RTL languages. We already have the basic facilities in place for doing that, from recent work on title-casing (which is restricted to English titles).

That's the idea in theory, but as I don't know any RTL languages, two things would be very helpful. First would be a screenshot of the bad entry as it appears in your document, with an explanatory note or manually created counter-example that shows what the text should look like. With that, I can identify when I've come up with a fix that works.

The other thing is sample data. If you can export your bad entry as Zotero RDF, paste the code to http://gist.github.com, and post the URL back here, I can import the entry locally for testing.

Thanks for reporting this -- feedback on things that run beyond my own use and knowledge really is invaluable.

fbennett · July 27, 2012

(We crossed in the post. Your explanation is very clear, there is no need for a screenshot. If you can put up sample data, I'll play around a bit and see if I can come up with a fix.)

fbennett · July 27, 2012

Ah, I follow. The problem occurs in the UI (in MLZ itself), not when citations are generated? We'll deal with that first, and then see whether further hints are needed in the citation layer. I'll have a look at a possible adjustment. More soon.

seadeer · July 27, 2012

Frank, thank you so much for responding so rapidly! I'll make up a few entries with some variation (similar to the problems described in the bidi documentation file - by the way thanks for the link), and post the links back in here. It seems to me that one potential solution might be to have a method of entering those Unicode symbols for RTL-LTR formatting (like the document described, U+200E and U+200F) in the MLZ UI.
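
For reference, here is a minimal Python sketch of the two marks mentioned above and of how the Unicode character database classifies them. It is not MLZ code, and the helper name `bidi_classes` is made up for illustration. It only shows that a mostly-Farsi title with an embedded acronym mixes strong right-to-left, strong left-to-right, and neutral characters, which is what the reordering algorithm has to sort out.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, strong left-to-right
RLM = "\u200f"  # RIGHT-TO-LEFT MARK: invisible, strong right-to-left


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# A Farsi letter, a space, an opening parenthesis, and a Latin letter.
sample = "\u06a9 (C"
print(bidi_classes(sample))     # ['AL', 'WS', 'ON', 'L']
print(bidi_classes(LRM + RLM))  # ['L', 'R']
```

The parenthesis and the space have no direction of their own, so where they end up depends on the characters around them and on the base direction of the field.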

fbennett · July 27, 2012

If an entry is tagged in the Language field as being an RTL language (such as "fa" for Farsi), a first-cut solution will be to set dir="rtl" on the data entry page as a whole. If that works for primary entries, we can then look at more fine-grained use of the dir attribute on the alternate entries, which have explicit language tags of their own, and may represent LTR languages. If that works, then tagging the entry for language is all that will be required; the UI will just adapt automatically and behave as expected in most cases. A hint might be required for the punctuation examples given in that page, we'll see how it goes.

seadeer · July 27, 2012

I've made a gist with 3 examples. The first one is the one I've discussed above (in the title, (CCHF) should be just to the left of کریمه-کنگو); example 2: CCHF should be on the very right (in the beginning of the RTL line); and example 3 should have اصفهن، ازربیجن on the very right, then "and", then برجند , and then the word "respectively" (on the very left).

git://gist.github.com/3190606.git

fbennett · July 27, 2012

Great, thanks. Testing suggests that the approach outlined above will work. It looks like RTL languages will consist of:

Arabic (ar)
Hebrew (he)
Farsi (fa)
Urdu (ur)
Yiddish (yi)
Pashto (ps)

Any of these might be cast in another script, but the tagging scheme used in MLZ can express the script together with the target language, so we should be able to catch these reliably without causing confusion. I'll brew up an initial fix, and look into the multilingual subfield issues a bit later. More again soon.
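
As a rough illustration of the idea (not the actual MLZ code), the direction could be derived from the language tag along these lines. The function name `direction_for_tag`, the script list, and the treatment of the `alalc97` variant as Latin script are all assumptions made for this sketch.

```python
# Hypothetical helper, not the MLZ implementation.
RTL_LANGUAGES = {"ar", "he", "fa", "ur", "yi", "ps"}  # list from the post above
RTL_SCRIPTS = {"arab", "hebr"}  # assumed script subtags (ISO 15924 codes)


def direction_for_tag(tag):
    """Guess "rtl" or "ltr" from a language tag such as "fa" or "fa-Latn"."""
    subtags = [s.lower() for s in tag.replace("_", "-").split("-") if s]
    if not subtags:
        return "ltr"
    language, rest = subtags[0], subtags[1:]
    # An explicit four-letter script subtag overrides the language default.
    for sub in rest:
        if len(sub) == 4 and sub.isalpha():
            return "rtl" if sub in RTL_SCRIPTS else "ltr"
    # Assumption: the ALA-LC romanization variant means Latin script.
    if "alalc97" in rest:
        return "ltr"
    return "rtl" if language in RTL_LANGUAGES else "ltr"


for tag in ("fa", "en", "fa-Latn", "ur-Arab"):
    print(tag, direction_for_tag(tag))
```

The result would then be used to set the dir attribute of the field being edited, so that tagging the entry for language is the only thing the user has to do.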

fbennett · July 27, 2012

I've put up a fresh release of the MLZ client, you should be able to acquire it either by updating or by reinstalling. If you set "fa" in the Language field for an entry in Farsi, mixed-text editing should behave normally. Changing the language tag to "en" or some other LTR language should produce the messed-up behaviour again, for primary-RTL text.

(Edit: If this works, it may be enough, actually. Translated or transliterated sub-fields will most often be in a uniform script, and should not be affected by mixed-text issues. The only case where problems arise will be where an original title is in English (say), and contains embedded acronyms and whatnot that are reproduced verbatim in a translation into an RTL language, for use in publications directed at an RTL-language audience. We can cross that bridge when we come to it.)

seadeer · July 27, 2012

Is the fresh release version 3.0m197? I've reinstalled the MLZ client, then tried to update it several times through Firefox add-on management, but I see the same behavior. The field tag in the respective fields is set to "fa", and I've also changed the "Language" field in the entry to "fa". None of the fields appear to be displayed as RTL, either. Maybe I'm downloading the old version somehow?

fbennett · July 27, 2012

No, that's the right version. It just hasn't yet solved the problem because I'm stumbling around in the dark. :-)

It looks like providing a means of inserting the RTL/LTR strong hint characters as you suggest is what's needed. Back in a bit ...

seadeer · July 27, 2012

Well, I've been doing an extensive search for software with these features over the past month, trying out all kinds of programs, and it looks like there is very little bidi support out there in bibliography software. MLZ might be the first program among its user-friendly peers to do it. Certainly, EndNote isn't doing it!

fbennett · July 28, 2012

If you update again, you should get something closer to the mark. This gives you RTL field content across the board on entries that have an RTL language code in the Language field. It can be overridden with left-click + select on individual fields (but I would recommend relying on the Language field except where overrides are necessary, as it requires less editing to change).

Alternative forms also adapt to their language, so an English translation of an Urdu field will be LTR.

The one unhappiness is that the centre panel listing is all set in the mode of the browser locale (I think), so RTL titles come out in LTR mode on my system here, and mixed text that is meant to be primarily RTL is reordered unpleasantly. I've tried various things, but controlling this at the cell level in a XUL tree (the engine that generates the listing) does not currently seem to be possible in Firefox or XULRunner (the platform that Standalone runs on).

Let me know how it looks. I do think we've gotten closer.

fbennett · July 28, 2012

One further point of info. The direction switch only applies to ordinary fields: creators and dates are unaffected. If anyone has occasion to enter mixed-text institutional names or dates, this can be revisited -- it's not a huge problem to implement, but let's wait until there is call for it to be sure it's necessary.

fbennett · July 28, 2012

Take two. In that release, editing of creators was actually broken, so I've gone ahead and set up RTL for both creators and dates. The cosmetics for creators are just a little bumpy, but it all seems to be working now.

seadeer · July 28, 2012

It works! I haven't had time to play around with it much yet, but I can see that the Farsi fields are right-aligned. I also see that the rest of the fields (date, issue, volume, etc.) become RTL as soon as you type "fa" in the Language field. Well, I think that is cosmetic, and I see that it does not affect the way the references are exported. Thank you very much again!

fbennett · July 28, 2012

Great to hear!

It should be smart enough that a supplementary field in "romanized Farsi" (fa-alalc97) or in one of the phonetic scripts or whatever will be handled as LTR.

There has been a parallel report about RTL parens in exported citations, which can come out half-reversed, and sometimes in the wrong location. This could be an issue particularly in multilingual citations that provide supplementary information. I've been thinking of how to fix it, and I think I understand: we want to tag field output runs for text direction, and separately force parens to the dominant direction of the surrounding text. This can definitely be done, but I'll wait until I have some test data from real-world use to start on the solution.

Thanks for your patience. It's nice to have this working.

seadeer · July 28, 2012

Uh-oh, I'm afraid something is not working quite right. If I add a new entry, I can't edit the field "Author" in it (the old entries that I had added before the update are fine). It seems to be the only field that has this problem. Sorry about that!

fbennett · July 28, 2012

No problem, sorry for the glitch. More soon.

fbennett · July 28, 2012

An update should fix it for you now. Thanks for the early report.

seadeer · July 29, 2012

Okay, now it is definitely working! Thank you very much again Frank. Is there any way I could donate to this project?

Best regards,
Anna

fbennett · July 29, 2012

Anna,

Glad to hear the good news. I just issued a small update that attempts to solve the center-panel display problem for mixed-text titles. Any existing mixed-text entries you have will still display incorrectly following the update, but each will correct itself when the title field is opened for editing and resaved.

What the fix does is wrap the title field in RTL-language entries with a right-to-left embedding (RLE, or U+202B) character at the front and a pop directional formatting (PDF, or U+202C) character at the back. The small disadvantage of doing this is that the RTL behaviour is slightly "sticky" -- if you change the language of the item to an LTR language, the dominant directionality inside the title string will remain RTL until the field is opened and saved.
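
In Python terms, the wrapping amounts to something like the sketch below. This is only an illustration of the technique, not the code in the MLZ client; the function name `wrap_rtl` and the check that avoids wrapping twice are assumptions.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_rtl(title):
    """Wrap title in RLE ... PDF, unless it is empty or already wrapped."""
    if not title or (title.startswith(RLE) and title.endswith(PDF)):
        return title
    return RLE + title + PDF


title = "برراسی تب کریمه-کنگو CCHF"
wrapped = wrap_rtl(title)
print(len(wrapped) - len(title))     # 2: two invisible characters were added
print(wrap_rtl(wrapped) == wrapped)  # True: saving again does not wrap twice
```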

The other small downside is that this introduces some extraneous (but invisible) characters into the field that could mess up automagic behaviour that depends on sniffing the character set of the field. I can't think of cases where this would be an issue, though, for the title field.
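
If some later processing step does turn out to be sensitive to these characters, one option would be to strip them before comparing or inspecting the text. A minimal sketch, assuming a hypothetical helper `strip_bidi_controls` that is not part of MLZ:

```python
# Explicit directional formatting characters: marks, embeddings,
# overrides, and isolates.
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
_STRIP_TABLE = {ord(ch): None for ch in BIDI_CONTROLS}


def strip_bidi_controls(text):
    """Return text with invisible directional formatting characters removed."""
    return text.translate(_STRIP_TABLE)


stored = "\u202bCCHF \u06a9\u202c"  # a title as saved, wrapped in RLE ... PDF
print(stored == "CCHF \u06a9")                       # False
print(strip_bidi_controls(stored) == "CCHF \u06a9")  # True
```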

Thanks for your thoughts on contributing. I'm working on a book about the project that should be ready for public release by late September. The plan for the moment is to see how it does for sales, and think about other channels for project support if that seems necessary.

The text will be available as a PDF free of charge under a Creative Commons license, but the printed version (and eventually an ebook) will come with a pretty cover, and sales will help support the project. Watch for it at an Amazon search listing near you!