Export to BibTeX (Non-latin characters)

pstupin · January 14, 2007

I was very glad to discover Zotero and I'd like to express my gratitude to its authors. However, I've encountered a problem that is really critical for me: I can't export to BibTeX appropriately, i.e. after exporting I get the file where Cyrillic characters that I need are replaced by question marks like this:

% BibTeX export generated by Zotero 1.0.0b3.r1

@book{_<8D>;>38O_1980,
address = {??????},
title = {????????? ??????? ??????},
publisher = {???????? ?????? ?? ?? ????},
year = {1980},
keywords = {????,??????,?????????},
pages = {96}
}

The same turns out to be true for other non-Latin characters.

For example:

% BibTeX export generated by Zotero 1.0.0b3.r1

@book{fukuda_hiroko_jazz_2003,
address = {??},
title = {Jazz Up Your Japanese with Onomatopoeia},
publisher = {Kodansha International},
author = {Fukuda Hiroko},
year = {2003}
}

The "??" should be the characters "東京" (Tokyo).
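
For illustration, question marks like these are what one would get if the text were pushed through a character encoding that cannot represent it, with a replacement character substituted for every failure. The Python sketch below is only an assumption about the mechanism, not a description of Zotero's actual export code; the function name `lossy_export` is made up for the example.

```python
def lossy_export(text, encoding="latin-1"):
    """Round-trip text through a legacy encoding, replacing anything
    the encoding cannot represent with "?" (hypothetical helper)."""
    return text.encode(encoding, errors="replace").decode(encoding)


print(lossy_export("東京"))      # neither character exists in Latin-1
print(lossy_export("Kodansha"))  # plain ASCII survives unchanged
print(lossy_export("š"))         # U+0161 is outside Latin-1 as well
```

If this is what happens, the original characters cannot be recovered from the exported file, since every unrepresentable character collapses to the same "?".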

Is there a workaround? If necessary, I can provide any other information that might help solve the issue (I'm using the Japanese version of Zotero).

Thank you in advance, Pavel.

Robert Samal · August 4, 2007

The same is happening for "latin-extended" characters too (e.g. scaron, U+0161, and many more). What is wrong? Shouldn't it just write the file in Unicode?

This functionality is crucial for me - I can try and help, if someone points me to the right place.

notiX · September 6, 2007

As far as I can see, only BibTex and RIS export is affected, Endnote/Refer/BibX export is in UTF-8.
Is this a bug that could be corrected easily?

dstillman · September 6, 2007

As far as I can see, only BibTex and RIS export is affected, Endnote/Refer/BibX export is in UTF-8.
Is this a bug that could be corrected easily?

It's not a bug. BibTeX and RIS technically don't support UTF-8.

However, as noksagt notes, many programs that handle BibTeX do at least provide the option of working with UTF-8, so we should probably at least offer UTF-8 output as an option.

We also need to hard-code more entity mappings for better import.

In short, we're working on it.

lemur · December 11, 2007

It is a bug. BibTeX as shipped in recent distributions of TeX supports UTF8. Refworks can do it without a hitch but Zotero can't. Zotero is buggy.

bdarcus · December 11, 2007

"Zotero is buggy"??? WTF lemur; are you serious, or a troll?

You told me on another thread to stop being defensive; I'm telling you here to stop using inflammatory language like this. This project is a free software project and (I hope) a collaborative community of users and developers who are working to create a free next-generation tool for scholars and students. So let's try to ensure that the level of discourse here reflects that collaborative spirit.

Anyway, I missed this earlier, but Dan, what do you mean by the statement that RIS doesn't "support" UTF8? Is the spec even explicit enough anywhere to talk about encodings at all? I know RefDB works with UTF8 RIS files at least.

I'd say that you ought to assume UTF8 as default.

noksagt · December 11, 2007

It was a design decision & is NOT a bug (having the OPTION to allow bibtex to be exported as UTF-8 is a current feature request, though).

MOST versions of bibtex do not fully support UTF-8. See Phillip Lehman's post about multi-byte UTF-8 in bibtex.

lemur · December 11, 2007

bdarcus, I'm totally serious.

Fact: I export all my bibliographical entries from RefWorks into Bibtex without a problem. Everything is handled properly even if bibtex does not have full support for Unicode.

Fact: If I export all my bibliographical entries from Zotero into Bibtex using the default Bibtex filter, a substantial percentage of my bibliography is turned to minced meat because accented characters are turned into question marks.

End result: I can use RefWorks for research but I cannot use Zotero because Zotero cannot convert accents properly.

Corollary: I cannot recommend Zotero to any of my colleagues.

Replacing an accented character by a question mark is a bug no matter how you cut it. Calling it a "design decision" amounts to Microsoft's practice of deciding that bugs are features.

Here's an example of what I'm talking about. Bibtex accepts both of these (and RefWorks produces the first line):

Brahmasūtra Śāṃkara Bhāṣya
Brahmas\={u}tra \'{S}\=a\d{m}kara Bh\={a}\d{s}ya

But here is what I get out of the Bibtex filter bundled with Zotero:

Brahmas?tra ???kara Bh??ya
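
To make the difference concrete: the second, ASCII-only form above can be generated mechanically by decomposing each accented character and wrapping the base letter in the matching LaTeX accent command. What follows is a rough Python sketch, not the filter discussed in this thread; the table `COMBINING_TO_LATEX` and the function `to_latex_ascii` are hypothetical and cover only a handful of accents. Characters the table does not know are passed through unchanged rather than replaced.

```python
import unicodedata

# Combining mark -> LaTeX accent command (small illustrative subset).
COMBINING_TO_LATEX = {
    "\u0300": "\\`",  # grave
    "\u0301": "\\'",  # acute
    "\u0304": "\\=",  # macron
    "\u0308": '\\"',  # diaeresis
    "\u030c": "\\v",  # caron
    "\u0323": "\\d",  # dot below
}


def to_latex_ascii(text):
    """Rewrite accented letters as LaTeX accent macros where possible."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # already ASCII
            continue
        # Canonical decomposition: base letter followed by combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if ord(base) < 128 and marks and all(m in COMBINING_TO_LATEX for m in marks):
            piece = base
            for mark in marks:
                piece = COMBINING_TO_LATEX[mark] + "{" + piece + "}"
            out.append(piece)
        else:
            out.append(ch)  # no mapping known: keep the character as-is
    return "".join(out)


print(to_latex_ascii("Brahmasūtra Śāṃkara Bhāṣya"))
```

The sketch always braces its argument, so it emits \={a} where the line above has \=a; LaTeX treats the two the same. Accented i and j may additionally need the dotless forms \i and \j, which this sketch ignores.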

bdarcus · December 11, 2007

Corollary: I cannot recommend Zotero to any of my colleagues.

As I said, this is a free software project: stop complaining and help fix it. You did with your careful bug report, which is why I'm disappointed to see this kind of rhetoric.

Recognize that noksagt is not some marketing hack simply trying to make Zotero look good, but is a programmer (on another project) and a longtime BibTeX user. He might actually understand a thing or two about the details that you don't. Or you may simply have different, but equally valid, positions.

None of us here are making money or fame on Zotero; we're mostly scholars contributing what we can to improve a tool that we believe has a lot of potential. If you want to go back to RefWorks, so be it; but I'm not sure that's really in the best long-term interests of you or your colleagues. Just consider this: if this problem was fixed for you tomorrow, would you still recommend your colleagues not use Zotero?

lemur · December 11, 2007

The fact that the project is free software or open source or whatever is neither here nor there. I tend to support free (as in "speech") software and open source but open source software does not get a "free pass" just by the fact that it is open source. And when you have multiple open source projects (e.g. Connotea), open-sourcedness is no longer a distinguishing factor. You can't say "pick me because I'm open source". Much more important is how feature-full the software is and how project members respond to bug reports.

The bug report which corresponds to the issue at hand is here:

https://www.zotero.org/trac/ticket/749

First, it is considered an "enhancement" rather than a "bug" when in fact it is a show stopper. Second, it was opened 3 months ago and last updated 2 months ago, yet the problem is still not fixed. It took me all of 15 minutes yesterday to create a filter that would do what I need, and I'm not familiar with Zotero's architecture, I've never written plugins for Firefox, nor am I a full-time programmer. (But I used to be a software engineer (worked professionally for 5 years) who switched to the humanities.)

As for money or fame. Well, I did research in Computer Engineering in an academic environment, I've contributed to open source projects, I've worked as a professional software engineer and now I'm in a Ph.D. program in the humanities. I know what the motivations are in all those domains and my experience is that people don't do things that don't profit them somehow. The form it takes can vary and can be as unspectacular as fulfilling a degree requirement or being able to appear technologically sophisticated in a humanities department.

As for what my opinion would be if this specific problem was fixed, I will first say that I did fix it for myself. (See above.) But that was not enough. Shortly after I fixed it, I found another problem. (Again something Refworks has no problem with.) I've noted the problem here:

http://groups.google.com/group/zotero-dev/browse_frm/thread/6f6d5c2eec1cc9ae

So in only a few hours I found two bugs in Zotero, both of which are show stoppers for me. One of those bugs I fixed but the other one is just too deep for me to deal with. I'm sorry but at this stage opening up Zotero and trying to make deep modifications just does not benefit me in any way. I have papers to produce, grants to apply for, etc. I can't drop everything and work on Zotero just because it has potential. If I blow a deadline or if I write a crappy grant proposal (and thus lose it) because I was too busy with fixing Zotero, I can't ask for a free pass. So RefWorks will have to do for the time being.

dstillman · December 11, 2007

Nobody's asking for a free pass—just a collaborative rather than combative attitude, plus some accuracy in describing the problem.

Incomplete mapping to BibTeX's ASCII representations of Unicode characters is a bug. A design decision not to output UTF-8 because most versions of BibTeX don't support it, which is what both noksagt and I were referring to above, is not, regardless of what other software or some BibTeX implementations might do. That doesn't mean it's not an important feature request for some people, but precision of language is important if for no other reason than that it helps us address the issue correctly.

As you discovered, it's not as simple as replacing the current translator. We need a pref to control the output behavior, and that has to be integrated into the core translation code separate from the translator.

Thanks for your patch. We'll take a look at it and try to address this soon.

Anyway, I missed this earlier, but Dan, what do you mean by the statement that RIS doesn't "support" UTF8? Is the spec even explicit enough anywhere to talk about encodings at all? I know RefDB works with UTF8 RIS files at least.

Bruce: The RIS character encoding issue is discussed on another thread, but it's basically the same issue. The spec says either "IBM-850" or, in some (presumably more recent) versions, "Windows ANSI character set", but some implementations choose to use UTF-8. For output we'd need a pref similar to the BibTeX one (probably defaulting to UTF-8), but as discussed on that thread, import might be a little trickier if some software and sites are still exporting Windows-1252.
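
One common heuristic for the import side, sketched here in Python as an assumption rather than anything Zotero is confirmed to do: try strict UTF-8 first, and fall back to Windows-1252 only if the bytes are not valid UTF-8. The function name `decode_ris_bytes` is invented for the example.

```python
def decode_ris_bytes(raw):
    """Return (text, encoding_used) for the raw bytes of a RIS file.

    Hypothetical helper: strict UTF-8 is attempted first because text
    saved as Windows-1252 with accented letters is rarely valid UTF-8.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence errors="replace".
        return raw.decode("cp1252", errors="replace"), "cp1252"


print(decode_ris_bytes("café".encode("utf-8")))
print(decode_ris_bytes("café".encode("cp1252")))
```

Pure ASCII input decodes identically either way, so the guess only matters for files that actually contain non-ASCII bytes.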

Codec · December 12, 2007

The 1.0.2 development version now includes many more mappings for exporting and importing BibTeX. It also optionally allows UTF8 output/input, enabled by setting this variable in about:config:
extensions.zotero.export.unicodeBibTeX

dstillman · December 12, 2007

BibTeX folks: What's the more reasonable default setting for outputting UTF-8, on or off? Codec's patch has it off, but if most implementations will be able to handle UTF-8 going forward, outputting UTF-8 by default seems reasonable.

noksagt · December 12, 2007

Given that bibtex/latex won't have true 8-bit character support for some time, I think the default should be ISO8859_1.

The default in JabRef, refbase, and Referencer is ISO8859_1. I think this is the case for kbib too.

danmackinlay · March 16, 2008

I'm a "BibTeX folk".

AFAICT, BibTeX often relies (even today) on those quirky LaTeX diacritical macros to shoehorn even 8-bit ISO8859-1 characters into the ASCII 7-bit space, since encoding problems are so rife across legacy packages. This is not to do with the technical requirements of BibTeX, but merely habit amongst its users, or perhaps a desire to avoid any cross-machine encoding problems. ASCII is the only one that works flawlessly without having to have any awareness of character sets across a variety of European code pages (e.g. across legacy Mac AND PC machines).

That said, I vote for UTF-8 as the default encoding. My reasoning runs thusly: While BibTeX might technically be ASCII-centric, we don't care, since it handles files with UTF-8 in them with no difficulties, provided it plugs into a LaTeX distro that does. The default TeX distro (TeX Live) now includes XeTeX, which is a UTF-8-aware LaTeX distro, which I use all the time for typesetting documents with arbitrary languages' character sets in them. Occasionally I need to use some badly-written legacy macro that hates Unicode; then I may freely convert my UTF-8 BibTeX file using the LaTeX-macro-aware recode (http://www.gnu.org/software/recode/), which does just fine for that purpose.

Any other default (e.g. ISO8859-1) will throw out needed character encoding information. I guess the most complete representation (although still missing many characters for minor languages, I would be prepared to bet) would be ASCII with LaTeX macros; but this would require a massive character translation table which duplicates the effort in projects like GNU recode. Therefore: UTF-8.