Export of Unicode characters

dvs0826 · December 15, 2009

I'm trying to export a library in BibTeX to Unicode. Certain characters are not exported correctly. For example: ß -> ÃŸ (0xC39F), and ü -> Ã¼ (0xC3BC), among others. The only thing I could find via a search was from 2007, which seemed to indicate that it was being actively worked on. I can verify that the file is saved as UTF-8: opening it in a hex editor, I can see the correct BOM at the beginning of the file. I know BibTeX technically doesn't support UTF-8, but I can get it to work by pasting the characters ß, ü, etc. directly into the text editor and re-saving the file. In that case the two characters above are stored as 0xDF and 0xFC respectively, with no UTF-8 BOM at the beginning of the file. That works perfectly but requires manual editing of the exported file.

Is there a way to make this work correctly automatically?
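
For reference, the two byte patterns described above can be reproduced with a short Python sketch. It is only an illustration and not part of Zotero or BibTeX. The `byte_forms` helper is a made-up name, and the sketch assumes the hand-edited file was saved as ISO-8859-1 (Latin-1), which is consistent with the 0xDF and 0xFC values mentioned.

```python
def byte_forms(ch: str) -> dict:
    """Return the hex bytes of one character in UTF-8 and in ISO-8859-1.

    Hypothetical helper, for illustration only.
    """
    return {
        "utf-8": ch.encode("utf-8").hex().upper(),
        "latin-1": ch.encode("latin-1").hex().upper(),
    }

# U+00DF is the sharp s (ß), U+00FC is u with diaeresis (ü).
for ch in ("\u00df", "\u00fc"):
    print(f"U+{ord(ch):04X}", byte_forms(ch))
```

Each character takes two bytes in UTF-8 and one byte in Latin-1, so a reader that assumes one byte per character will show two wrong characters for every non-ASCII letter in a UTF-8 file.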

dstillman · December 15, 2009

Works fine for me via Quick Copy and right-click, Export.

@book{last_test_2009,
	title = {Test},
	publisher = {ßü},
	author = {First Last},
	year = {2009}
}

Are you sure you're opening the file correctly as UTF-8? (In a text editor—I have no idea what BibTeX itself does, but Zotero's output is correct for me.)

dvs0826 · December 15, 2009

Hmm.. Unicode issues are always the most confusing.

I'm opening the file correctly, so maybe it's a font issue? If that's the case, though, I'm not sure why it would display correctly in Zotero itself. Here's what I get when opening the exported .bib file in Notepad++ (the problem occurs in the author field):

@inproceedings{bartz_extending_1998,
address = {Lisbon, Portugal},
title = {Extending graphics hardware for occlusion queries in {OpenGL}},
url = {http://portal.acm.org/citation.cfm?doid=285305.285317},
doi = {10.1145/285305.285317},
booktitle = {Proceedings of the {ACM} {SIGGRAPH/EUROGRAPHICS} workshop on Graphics hardware - {HWWS} '98},
author = {Dirk Bartz and Michael Mei{9FC3}ner and Tobias H{BCC3}ttner},
year = {1998},
pages = {97--ff.}
},

Since it displays the {9FC3} and {BCC3} in a special format, I assume it recognizes them as Unicode characters and just can't find the right glyph to draw. That leaves me wondering why Firefox can draw them.

I'm using x64 Windows 7 Ultimate, no additional language packs or anything installed.

dstillman · December 15, 2009

Rename the file to .txt and drag it into Firefox, and make sure Firefox's encoding is set to Unicode.

dvs0826 · December 15, 2009

Doing that, it displays incorrectly in Firefox also with encoding set to Unicode (UTF-8) (and everything else I tried).

So the only place it apparently does display correctly is in the Info pane of Zotero.

Maybe I need to go into Windows and install some kind of extended text services functionality.

dstillman · December 15, 2009

A UTF-8 file displayed as ISO-8859-1 would show ÃŸ and Ã¼.
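
That effect is easy to reproduce. Below is a minimal Python sketch, assuming the viewer falls back to Windows-1252, which is how Windows tools and browsers commonly treat text labelled ISO-8859-1 (in strict ISO-8859-1 the byte 0x9F is an invisible control character rather than Ÿ). The function names are hypothetical.

```python
def misread_utf8(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_misread(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse misread_utf8, assuming no bytes were lost along the way."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# \u00df is the sharp s and \u00fc is u with diaeresis, as in the example names.
garbled = misread_utf8("Mei\u00dfner and H\u00fcttner")
restored = undo_misread(garbled)
print(garbled)
print(restored)
```

If the garbled form only appears on screen, the file itself is fine and only the viewer's encoding setting needs to change. The reverse step is useful only when the garbled text has actually been saved back to disk.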

You can reset your translators and styles from the Advanced pane of the Zotero prefs to make sure you have the latest BibTeX translator, though you almost certainly have it already. (You can also just check the date in translators/BibTeX.js, which should be "2009-08-21 15:00:00".)

Have you tried Quick Copy?

dstillman · December 15, 2009

Basically, though, if you open the exported file in a hex editor and see "EF BB BF" at the beginning and "C3 9F" and "C3 BC" for the characters, the file has been correctly exported as UTF-8, and the problem lies elsewhere.
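
The same check can be scripted instead of done by eye in a hex editor. This is a rough Python sketch, not anything Zotero ships. `inspect_export` and the keys it returns are names invented for this example, and the sample bytes stand in for a real exported file.

```python
import codecs


def inspect_export(data: bytes) -> dict:
    """Report whether raw file bytes look like a UTF-8 export with a BOM."""
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {
        "has_bom": data.startswith(codecs.BOM_UTF8),  # EF BB BF
        "has_eszett": b"\xc3\x9f" in data,            # C3 9F
        "has_u_umlaut": b"\xc3\xbc" in data,          # C3 BC
        "valid_utf8": valid,
    }


# Stand-in for open("export.bib", "rb").read(); the file name is hypothetical.
sample = codecs.BOM_UTF8 + (
    "author = {Michael Mei\u00dfner and Tobias H\u00fcttner}".encode("utf-8")
)
report = inspect_export(sample)
print(report)
```

If every value comes back true, the bytes on disk are valid UTF-8, and whatever displays them incorrectly is decoding them with a different encoding.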

dvs0826 · December 15, 2009

Yea, that is indeed what I see. So I guess the problem is not with the exporting. Guess I'll have to figure out a way to determine where the problem actually is.