Strange characters stored in zotero DB

There are apparently strange characters stored in (at least) one of my references in Zotero. I noticed this because I had problems after exporting to a UTF-8 encoded BibTeX file: the exported file apparently contained non-UTF-8 characters. Or there is something I don't get about that reference or about UTF-8...

The strange characters apparently lie in the "pages" property of one of my references. It reads "221 – 236". The hyphen is apparently encoded in a strange manner. If I copy and paste that content into a new document (with gedit) and save the file, I should (it seems to me) obtain UTF-8 content, as my locale is UTF-8.

But if I type hexdump -C document to examine its content, I see the following:
00000000 32 32 31 20 e2 80 93 20 32 33 36 0a |221 ... 236.|
0000000c

Thus apparently the hyphen is encoded with e2 80 93. Which is not correct UTF-8, is it?

When exporting to a UTF-8 encoded Bibtex file from zotero, the relevant part is as follows (which does not seem UTF-8 valid either):
00000400 09 70 61 67 65 73 20 3d 20 7b 32 32 31 c2 a0 e2 |.pages = {221...|
00000410 80 93 c2 a0 32 33 36 7d 0a 7d |....236}.}|

I guess the problem was created when importing the reference (I imported it from a big bibtex file, which was possibly itself faulty).

Is it intended that zotero stores such "strange" characters? Shouldn't it refuse them at import time?

Is there some way that I can now check my entire database to verify that I don't have more invalid characters (and future potential problems)?
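
One possible way to check for such characters, sketched here under assumptions: rather than inspecting the Zotero database directly, export the library to a UTF-8 BibTeX file and list every non-ASCII character in it. The function name `find_non_ascii` and the sample string are hypothetical, chosen for illustration only.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, code point, Unicode name) for each non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits

# Sample mimicking the "pages" field from the export (assumed content).
sample = "pages = {221\u00a0\u2013\u00a0236}"
for hit in find_non_ascii(sample):
    print(hit)
```

To scan a real export, the text could be read with `open(path, encoding="utf-8").read()`; that call raises `UnicodeDecodeError` if the file is not valid UTF-8, which answers the validity question separately from the "which characters are in there" question.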

The problematic reference is stored in the public group "MCDA", reference Voogd, 1982.

Thanks for the help.
  • edited July 13, 2010
    This is valid UTF-8. It is an en dash (LaTeX entity: '--'):
    http://www.fileformat.info/info/unicode/char/2013/index.htm
    An en dash is a valid (and, indeed, typographically preferred) way to specify a numeric range. There has been some discussion on standardizing dashes in the page field of Zotero. I don't know what came of it, but you might search.
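
A quick way to confirm this, as a minimal sketch: decode the bytes from the first hexdump in the question with Python's strict UTF-8 decoder and look up the character at the dash position. The variable names are illustrative only.

```python
import unicodedata

# Bytes copied from the hexdump of the gedit file in the question.
raw = bytes.fromhex("32 32 31 20 e2 80 93 20 32 33 36 0a")

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

dash = text[4]  # the character between "221 " and " 236"
print(repr(text), f"U+{ord(dash):04X}", unicodedata.name(dash))
```

The decode succeeds, and the three bytes e2 80 93 come out as the single code point U+2013.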
  • Stupid me, you are right about the en dash; that is not where the problem lies. Both latex and pdflatex in fact complain about the no-break space (U+00A0, UTF-8 0xC2 0xA0) that comes before and after the dash. Although it is valid UTF-8, the inputenc package does not seem to accept no-break spaces (at least in bib files). Or is it again my mistake?
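
For reference, a small sketch that decodes the "pages" bytes from the BibTeX export hexdump in the question and names each non-ASCII character. It also shows the UTF-8 byte sequence of U+00A0. Variable names are illustrative.

```python
import unicodedata

# Bytes of "221...236" taken from the export hexdump (offsets 0x40a to 0x416).
exported = bytes.fromhex("32 32 31 c2 a0 e2 80 93 c2 a0 32 33 36")
decoded = exported.decode("utf-8")  # strict: fails on invalid UTF-8

# Code point and Unicode name of every non-ASCII character, in order.
labels = [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in decoded if ord(c) > 127]
print(labels)

# UTF-8 encoding of the no-break space on its own.
nbsp_bytes = "\u00a0".encode("utf-8")
print(nbsp_bytes.hex(" "))
```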

    If anybody wants to try to confirm the problem, this is the XMCDA-dm.bib file:

    @article{voogd_multicriteria_1982,
    title = {Multicriteria evaluation with mixed qualitative and quantitative data},
    volume = {9},
    url = {http://www.envplan.com/abstract.cgi?id=b090221},
    doi = {10.1068/b090221},
    number = {2},
    journal = {Environment and Planning B: Planning and Design},
    author = {H. Voogd},
    year = {1982},
    pages = {221 – 236}
    }

    (NB: the pages endash should be surrounded by U+00A0 no-break spaces, I don't know if they will be preserved on the zotero forum).

    And this is the document.tex file:
    \documentclass{article}
    \usepackage[utf8]{inputenc}
    \bibliographystyle{plain}
    \begin{document}
    \section{Title}
    A citation:\cite{voogd_multicriteria_1982}.
    \bibliography{XMCDA-dm}
    \end{document}

    After running latex document, bibtex document, latex document, LaTeX reports: "! Package inputenc Error: Unicode char \u8:  not set up for use with LaTeX."

    If the problem is confirmed on other installations, I would suggest changing the Zotero BibTeX export filter so that it replaces Unicode no-break spaces with the LaTeX-friendly "~" sign.
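
Until then, a workaround could be to post-process the exported file. The following is only a sketch of that idea, not Zotero's export code; the function name `nbsp_to_tie` is hypothetical.

```python
def nbsp_to_tie(bibtex_text):
    """Replace every Unicode no-break space (U+00A0) with the LaTeX tie "~"."""
    return bibtex_text.replace("\u00a0", "~")

# The "pages" field as it appears in the export (assumed content).
field = "pages = {221\u00a0\u2013\u00a0236}"
print(nbsp_to_tie(field))
```

The en dash is left untouched here; only the no-break spaces that inputenc complained about are rewritten.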