Zotero > BibTeX :: transliteration of Arabic

romanov-umich · October 31, 2011

I have a problem with exporting to BibTeX format. After a day or two, I have figured out how this whole thing works, but one problem still remains.

I have a lot of Unicode characters in my Zotero DB; the translator to BibTeX format does a great job converting those Unicode symbols into LaTeX code, but when it creates a BibTeX key, it throws away all those Unicode letters, so the keys become completely unintelligible, for example abar_victory_1990 (whereas the author's name is Ṭabarī). So, the question is: is there any way to add a similar conversion for authors' names and titles? I.e., Ṭ is converted into T and ī is converted into i, so that the key would look like tabari_victory_1990.

romanov-umich · October 31, 2011

NB: See the update below...

Strangely enough, I have figured it out myself (with the help of a friend). Here is the updated BibTeX translator to suit those needs (every change I made is marked); I have not changed anything in the code itself, as it turned out that I just needed to add a few more options:

http://www-personal.umich.edu/~romanov/BibTeX.js

What it does:

1. BibTeX keys: Changes transliterated names and titles into simplified versions without dots and macrons (e.g.: "Ṭabarī" becomes "tabari"), thus the keys are more intelligible (tabari_victory_1990 instead of abar_victory_1990).
Symbols added: ṭ ū ī ā ṣ ḍ ḥ ḳ ẓ Ṭ Ū Ī Ō ō Ā Ṣ Ḍ Ḥ Ẓ. The symbols "ʾ" and "ʿ", used for hamzas and ayns, are deleted.

2. Two more transliteration symbols, for ayns and hamzas, are now converted into LaTeX codes: "ʾ" is converted into "\Alif" and "ʿ" into "\Ayn" (use \usepackage{semtrans} to activate their conversion).

3. The error with "i with macron" is also fixed: the translator now produces the code for a dotless "i" with a macron, \={\i}, instead of \={i}.
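The key simplification described in point 1 can be sketched roughly as follows. This is a minimal standalone illustration, not the code in BibTeX.js: the names PAIRS, simplify, keyPart and buildKey are hypothetical, the table covers only the symbols listed above, the author_title_year layout is inferred from the example key, and the title used in the usage note is made up.

```javascript
// Hypothetical lookup table: only the symbols listed in the post.
const PAIRS = [
  ["ṭ", "t"], ["ū", "u"], ["ī", "i"], ["ā", "a"], ["ṣ", "s"],
  ["ḍ", "d"], ["ḥ", "h"], ["ḳ", "k"], ["ẓ", "z"], ["ō", "o"],
  ["Ṭ", "T"], ["Ū", "U"], ["Ī", "I"], ["Ō", "O"], ["Ā", "A"],
  ["Ṣ", "S"], ["Ḍ", "D"], ["Ḥ", "H"], ["Ẓ", "Z"],
  ["\u02BE", ""], // hamza marker, dropped from keys
  ["\u02BF", ""]  // ayn marker, dropped from keys
];

// Normalize the table keys to NFC so they match precomposed input.
const SIMPLIFY = new Map(PAIRS.map(([from, to]) => [from.normalize("NFC"), to]));

// Replace each mapped character with its plain counterpart.
function simplify(text) {
  return Array.from(text.normalize("NFC"), (ch) =>
    SIMPLIFY.has(ch) ? SIMPLIFY.get(ch) : ch
  ).join("");
}

// Lowercase and keep only ASCII letters and digits.
function keyPart(text) {
  return simplify(text).toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Assumed key layout: author_firstTitleWord_year.
function buildKey(lastName, title, year) {
  const firstWord = title.trim().split(/\s+/)[0] || "";
  return [keyPart(lastName), keyPart(firstWord), String(year)].join("_");
}
```

With a made-up title, buildKey("Ṭabarī", "Victory of Islam", 1990) returns "tabari_victory_1990", whereas stripping non-ASCII letters without the table would leave "abar" for the author part.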

I hope somebody else finds it helpful.

dstillman · November 1, 2011

There's a Zotero.Utilities.removeDiacritics() function in 3.0 (with many more mappings than above). The translator should use that.

romanov-umich · November 1, 2011

I think it does: there is an extensive list in BibTeX.js, but the symbols I need the most were not translated properly. Or am I missing something? How does this Zotero.Utilities.removeDiacritics() remove diacritics, and where (I do not want it to touch my records in Zotero)? Could you refer me to a description of how it works?

adamsmith · November 1, 2011

I don't believe it's documented beyond the code. Zotero uses this script:
http://lehelk.com/2011/05/06/script-to-remove-diacritics/
it's included in the utilities.js file in Zotero:
https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/utilities.js

It's a function that can be called from within the translator - if I understand correctly, you could just use it instead of the tidyAccents function (which you wrote?).
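For readers wondering how a table-driven remover of this kind works, here is a minimal sketch in the base/letters style of the script linked above. It is not Zotero's code: the table is a tiny illustrative subset, and REMOVAL_MAP and removeDiacriticsSketch are hypothetical names.

```javascript
// Illustrative, heavily shortened table: each entry maps a set of
// accented characters to a plain base string.
const REMOVAL_MAP = [
  { base: "A", letters: /[\u00C0\u00C1\u00C2\u0100]/g },
  { base: "a", letters: /[\u00E0\u00E1\u00E2\u0101]/g },
  { base: "I", letters: /[\u00CC\u00CD\u012A]/g },
  { base: "i", letters: /[\u00EC\u00ED\u012B]/g },
  { base: "T", letters: /[\u1E6C\u0164]/g },
  { base: "t", letters: /[\u1E6D\u0165]/g },
  { base: "AE", letters: /[\u00C6]/g }, // two-letter expansion
  { base: "ae", letters: /[\u00E6]/g }
];

// Apply every entry in turn; the input string itself is never modified,
// a new string is returned.
function removeDiacriticsSketch(str) {
  return REMOVAL_MAP.reduce(
    (out, entry) => out.replace(entry.letters, entry.base),
    str
  );
}

const original = "\u1E6Cabar\u012B";            // "Ṭabarī"
const plain = removeDiacriticsSketch(original); // "Tabari"
```

Because JavaScript strings are immutable, a function of this shape only returns a cleaned copy; whatever string was passed in stays as it was.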

romanov-umich · November 1, 2011

Thanks, adamsmith. I understand better now. No, I did not write the tidyAccents function; it was already there. I just added the symbols I needed for my purposes. But lehel's script looks much better: it seems to cover all symbols.

romanov-umich · November 2, 2011

Something strange happened to that modified file; I have no idea what, but Zotero suddenly stopped recognizing it. So, we had to redo it. We have applied lehel's list for the removal of diacritics. So here is the updated file (the above mentioned corrections are implemented; the list of diacritics removal is expanded), but no comments on what has been changed. I have re-uploaded the file:

http://www-personal.umich.edu/~romanov/BibTeX.js

noksagt · November 3, 2011

I don't understand the reason for some of the changes in the mappingTable and for removing the special treatment of corporate authors when writing out creator names. Also, there's no reason to copy/paste code from utilities.js: you should be able to use the function from within the translator, right?

ajlyon · November 3, 2011

I just cleaned up romanov-umich's code and removed the duplicated code, essentially as noksagt says: http://github.com/ajlyon/translators/raw/master/BibTeX.js

This would be good to cover with unit tests, but we don't have support for unit tests for export translators yet. But it looks to work fine.

ajlyon · November 3, 2011

I also commented out the bit that requires the semtrans package, so that we preserve compatibility with vanilla BibTeX workflows. As it stands, ZU.removeDiacritics is only applied in Zotero 3.0; in Zotero 2.1.x, diacritics will not be removed.

dstillman · November 3, 2011

There shouldn't be a need to run those RegExp lines at all unless removeDiacritics() isn't available, right?

ajlyon · November 3, 2011

Right -- I hadn't read removeDiacritics() carefully enough to confirm that it has proper two-letter expansions where appropriate. Updated on GitHub accordingly.
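The arrangement discussed here, where the regular-expression lines run only when removeDiacritics() is unavailable, might look something like the sketch below. It is an assumption-laden illustration rather than the committed code: fallbackTidy and stripAccents are hypothetical names, the fallback covers only a handful of characters, and the utilities object is passed in as a parameter so the snippet can run outside Zotero (inside a translator one would presumably pass ZU).

```javascript
// Hypothetical fallback: a few regex replacements, used only when the
// host does not provide removeDiacritics().
function fallbackTidy(s) {
  return s
    .replace(/[\u0101\u00E0\u00E1]/g, "a")
    .replace(/[\u012B\u00EC\u00ED]/g, "i")
    .replace(/[\u016B\u00F9\u00FA]/g, "u")
    .replace(/[\u1E6D]/g, "t");
}

// Prefer the host-provided function; fall back otherwise.
function stripAccents(s, utilities) {
  if (utilities && typeof utilities.removeDiacritics === "function") {
    return utilities.removeDiacritics(s);
  }
  return fallbackTidy(s);
}

// Outside Zotero there is no utilities object, so the fallback runs.
const viaFallback = stripAccents("\u1E6Dabar\u012B", undefined); // "tabari"
```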

noksagt · November 3, 2011

While I haven't actually tried the modified translator (perhaps I should, but...), I did look at the diff. The items that still concern me:

1. The Unicode markup for both Latin capital and small letter O with a stroke has changed from using the proper '\u' to '\U'.
2. The LaTeX entities for both of the macroned I's ('\\={I}') now have escaped I's ('\\={\\I}').
3. The closing brace of the creator string (ca. line 2060) has been removed.

Otherwise, this looks ok.
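As a quick illustration of why the '\u' versus '\U' difference matters in a JavaScript string literal (the variable names are made up for this example):

```javascript
// "\u00D8" is a Unicode escape for LATIN CAPITAL LETTER O WITH STROKE.
const correct = "\u00D8";

// "\U" is not an escape sequence in JavaScript; the backslash is simply
// dropped, leaving the five literal characters U, 0, 0, D, 8.
const broken = "\U00D8";

console.log(correct.length);     // 1
console.log(broken.length);      // 5
console.log(broken === "U00D8"); // true
```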

ajlyon · November 3, 2011

1 and 3 fixed. The macroned I change was requested by romanov-umich, apparently to remove the dot before adding the macron. Should a different approach be taken here?

noksagt · November 3, 2011

Ah, I missed that point. I don't think it is needed for the capital I. It is fine for the lowercase one. An entry needs to be added to the reversemappingTable too, right (no reason to replace it, there should be a two-to-one mapping of \={i} and \={\i} to \u012B)?
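The two-to-one mapping suggested here can be sketched as below. The table names follow the ones mentioned in the thread, but their contents are reduced to a single character for illustration, and exportLatex and importLatex are hypothetical helper names, not functions from the translator.

```javascript
// Export direction: one character, one preferred LaTeX form.
const mappingTable = {
  "\u012B": "\\={\\i}" // i with macron, written with a dotless i
};

// Import direction: both spellings map back to the same character.
const reversemappingTable = {
  "\\={\\i}": "\u012B",
  "\\={i}": "\u012B"
};

// Replace every mapped character with its LaTeX form.
function exportLatex(text) {
  return Array.from(text, (ch) =>
    Object.prototype.hasOwnProperty.call(mappingTable, ch) ? mappingTable[ch] : ch
  ).join("");
}

// Replace every known LaTeX form with its character.
function importLatex(text) {
  let out = text;
  for (const [latex, ch] of Object.entries(reversemappingTable)) {
    out = out.split(latex).join(ch);
  }
  return out;
}
```

With these tables, both importLatex("Tabar\\={i}") and importLatex("Tabar\\={\\i}") give "Tabarī", while export always writes the dotless form.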

ajlyon · November 3, 2011

Right. I'll fix the mappings and put up a new version. I won't aim to land this on the main repository until the detection / crash issue on viewing PDFs is taken care of, but I think we're just about there, although it would be good if romanov-umich could test with his data.

ajlyon · November 4, 2011

Mappings fixed.

romanov-umich · November 4, 2011

Thanks a lot for your input and for cleaning up the script. It seems to be working fine on my data, except for the screened characters. As far as they go, I do not think it is necessary to screen them, since those not dealing with Arabic (or Arabographic) texts will hardly encounter those characters; otherwise they will have to use {semtrans}, since it seems to be the only package that supports these characters. So, ultimately, what I am trying to say is that there is no need to screen them.

As to utilities.js, I had no idea how it works exactly. I am quite new to both Zotero (switched a month or two ago) and LaTeX (started just a week or a week and a half ago), so pardon my screw-ups.

As to "removing the special treatment of corporate authors when writing out creator names," I do not think I touched anything like that at all.

Again, thanks a lot, ajlyon! The translator works perfectly well for me at the moment.

ajlyon · November 4, 2011

I see your point on the semtrans characters, but vanilla BibTeX is still our primary target; hopefully we can point people to those lines and have them uncomment them when issues arise.

As something of an aside, is it still not possible to handle these in a native Unicode LaTeX processor, avoiding the whole morass of character substitutions? It's been some time since I last used LaTeX, but I thought these approaches were on their way out.

romanov-umich · November 4, 2011

I have been trying to find a way to make LaTeX process Unicode characters (that would be much easier), but no luck so far. Then again, I have just started using it, so there is a lot of uncharted terrain there for me.

ajlyon · November 4, 2011

I think that the XeTeX engine is relatively Unicode-safe, but my experience is limited and fairly dated at this point.

romanov-umich · November 5, 2011

Yeah, that is what I read about it. I am planning to give it a try as soon as I wrap my head around the basic stuff.