Strange characters stored in zotero DB
There appear to be strange characters stored in (at least) one of my references in Zotero. I noticed this because I had problems after exporting to a UTF-8 encoded BibTeX file: the exported file apparently contained non-UTF-8 characters. Or there is something I don't understand about that reference or about UTF-8...
The strange characters seem to be in the "pages" field of one of my references. It reads "221 – 236". The hyphen seems to be encoded in a strange manner. If I copy and paste that content into a new document (with gedit) and save the file, I should (it seems to me) obtain UTF-8 content, as my locale is UTF-8.
But if I type hexdump -C document to examine its content, I see the following:
00000000 32 32 31 20 e2 80 93 20 32 33 36 0a |221 ... 236.|
0000000c
So the hyphen is apparently encoded as e2 80 93, which is not correct UTF-8, is it?
When exporting to a UTF-8 encoded BibTeX file from Zotero, the relevant part is as follows (which does not seem to be valid UTF-8 either):
00000400 09 70 61 67 65 73 20 3d 20 7b 32 32 31 c2 a0 e2 |.pages = {221...|
00000410 80 93 c2 a0 32 33 36 7d 0a 7d |....236}.}|
I guess the problem was created when importing the reference (I imported it from a big bibtex file, which was possibly itself faulty).
Is it intended that zotero stores such "strange" characters? Shouldn't it refuse them at import time?
Is there some way that I can now check my entire database to verify that I don't have more invalid characters (and future potential problems)?
The problematic reference is stored in the public group "MCDA", reference Voogd, 1982.
Thanks for the help.
http://www.fileformat.info/info/unicode/char/2013/index.htm
An en dash is a valid (and, indeed, typographically preferred) way to specify a numeric range. There has been some discussion on standardizing dashes in the page field of Zotero. I don't know what came of it, but you might search.
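To check this against the hexdump of the exported file, the bytes of the pages field can be decoded directly. The following is only a minimal sketch in Python; the byte string is copied from the dump posted above, and nothing here is specific to Zotero:

```python
import unicodedata

# Bytes of the "pages" value, copied from the hexdump of the exported file.
raw = bytes.fromhex("32 32 31 c2 a0 e2 80 93 c2 a0 32 33 36")

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

# Print the code point and Unicode name of every non-ASCII character.
for ch in text:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The decode succeeds, and the non-ASCII characters it reports are U+00A0 (NO-BREAK SPACE) and U+2013 (EN DASH), so the byte sequences c2 a0 and e2 80 93 are well-formed UTF-8.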
If anybody wants to try to confirm the problem, this is the XMCDA-dm.bib file:
@article{voogd_multicriteria_1982,
title = {Multicriteria evaluation with mixed qualitative and quantitative data},
volume = {9},
url = {http://www.envplan.com/abstract.cgi?id=b090221},
doi = {10.1068/b090221},
number = {2},
journal = {Environment and Planning B: Planning and Design},
author = {H. Voogd},
year = {1982},
pages = {221 – 236}
}
(NB: the en dash in the pages field should be surrounded by U+00A0 no-break spaces; I don't know whether they will be preserved on the Zotero forum.)
And this is the document.tex file:
\documentclass{article}
\usepackage[utf8]{inputenc}
\bibliographystyle{plain}
\begin{document}
\section{Title}
A citation:\cite{voogd_multicriteria_1982}.
\bibliography{XMCDA-dm}
\end{document}
After running latex document, bibtex document, latex document, LaTeX reports: "! Package inputenc Error: Unicode char \u8: not set up for use with LaTeX."
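To see which characters in an exported .bib file might trigger this kind of error, one option is to list every non-ASCII character with its position and Unicode name. This is a rough sketch in Python; the function name find_non_ascii is made up for illustration, and which of the reported characters LaTeX actually rejects depends on the installation:

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, code point, name) for each non-ASCII character."""
    hits = []
    # Split on "\n" only, so unusual Unicode line separators are still reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits

# Sample line mirroring the pages field of the entry above.
sample = "pages = {221\u00a0\u2013\u00a0236}"
for hit in find_non_ascii(sample):
    print(hit)
```

For a whole file, the text could be read with open("XMCDA-dm.bib", encoding="utf-8").read() and passed to the same function.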
If the problem is confirmed on other installations, I would suggest changing the Zotero BibTeX export filter so that it replaces Unicode no-break spaces with the LaTeX-friendly "~" sign.
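Until something like that exists in the export filter, the same replacement can be done as a post-processing step on the exported file. A minimal sketch in Python, with a hypothetical helper name nbsp_to_tie; it is not part of Zotero and only touches U+00A0, leaving the en dash as it is:

```python
NBSP = "\u00a0"  # Unicode no-break space

def nbsp_to_tie(text):
    """Replace each U+00A0 with the LaTeX tie character '~'."""
    return text.replace(NBSP, "~")

# Sample line mirroring the pages field of the exported entry.
entry = "pages = {221\u00a0\u2013\u00a0236}"
print(nbsp_to_tie(entry))
```

Whether the remaining en dash is accepted by inputenc depends on the LaTeX setup, so this may not be sufficient on every installation.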