Bug: handling of accented characters in IEEE Xplore translator

jpjacobs · November 5, 2012

Hi!

I just wanted to note that there's a bug in the IEEE Xplore translator, which causes all accented characters (such as those in authors' names) to be displayed as HTML-escaped hex codes, e.g.:

What should be:
J. Muñoz-Marí

Gets imported as:
J. Mu&#x000F1 andoz-Mar&#x000ED and

If I knew more JavaScript I would fix it myself, but sadly I can't.

Kind regards,

Jan-Pieter

adamsmith · November 5, 2012

could you provide a URL to such an entry in IEEE Xplore, please?

jpjacobs · November 5, 2012

Sure!

for instance this one:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6221978&contentType=Early+Access+Articles&queryText%3Dgraph+matching+remote+sensing

adamsmith · November 5, 2012

great, I can replicate that.
At first glance, this looks dodgy on the IEEE Xplore side of things, but we may be able to fix it by adjusting the character encoding settings. It might take a little while; we'll let you know.
Thanks for reporting.

dstillman · November 5, 2012

It's not a character set issue. The site is returning the BibTeX as HTML, with hexadecimal entities for extended characters, but the translator is just doing a fixed replacement of "<br>" rather than running the BibTeX through Zotero.Utilities.unescapeHTML(). Running it through unescapeHTML() fixes the problem (at least for this item).

dstillman · November 5, 2012

Well, it partly fixes the problem. The HTML BibTeX actually has spaces where the semicolons should be:

author={Tuia, D. and Mu&#x000F1 andoz-Mar&#x000ED and, J. and G&#x000F3 andmez-Chova, L. and Malo, J.},

unescapeHTML still works without the semicolons, but the extra spaces remain in the imported names. A regexp before the unescapeHTML could fix that, assuming this is universal.

aurimas · November 6, 2012

Looks like the IEEE Xplore BibTeX-generating algorithm is doing something like
authors.replace(/;/g, " and")

This is obviously not compatible with HTML-encoded special characters. IEEE Xplore should be notified of this issue, but in the meantime we might want to consider switching to RIS.
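
A minimal sketch of how such a replacement would produce the author field quoted earlier in the thread. The function name `joinAuthors` and the "; "-separated input string are assumptions made for illustration; only the `replace` call comes from the post, and the actual IEEE Xplore code is not known.

```javascript
// Hypothetical stand-in for the suspected server-side step; the real
// IEEE Xplore implementation is unknown.
function joinAuthors(authors) {
  // Every ";" becomes " and", including those that terminate HTML entities.
  return authors.replace(/;/g, " and");
}

// Assumed input: authors separated by "; ", names already HTML-encoded.
const encodedAuthors =
  "Tuia, D.; Mu&#x000F1;oz-Mar&#x000ED;, J.; G&#x000F3;mez-Chova, L.; Malo, J.";

console.log(joinAuthors(encodedAuthors));
// Tuia, D. and Mu&#x000F1 andoz-Mar&#x000ED and, J. and G&#x000F3 andmez-Chova, L. and Malo, J.
```

Under these assumptions the result matches the broken author list character for character, which is consistent with the entity semicolons being swallowed by the separator replacement.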

dstillman · November 6, 2012

Ah, good catch. We should notify them. We could also switch to RIS, but fixing the entities with a regexp would be trivial (and shouldn't break anything when they fix this).

aurimas · November 6, 2012

Good call. A regex replace was indeed trivial.

Edit: Oh, and I sent them a message about this bug.

@jpjacobs The translator should automatically update within 24 hours, or you can update manually from Preferences -> General -> Update Now.

dstillman · November 6, 2012

I'd go with the entire "&#x[0-9A-F]{4} and" for good measure. Your "(&\S+) and" could in theory break when they add semicolons back (in the extremely unlikely event that a legitimate "and" appeared in a name field and was preceded by a word that ended in an extended character).

(Though if we hear back from them we can also just get rid of this. Thanks for contacting them.)
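
To illustrate the edge case, here is a small sketch comparing the patterns on two inputs: the current broken output and a hypothetical future output with the semicolons restored. The `"$1;"` replacement string, the sample strings, and the `hexAny` variant are assumptions for illustration. Note that the entities in the sample above carry five hex digits (`000F1`), so a literal `{4}` quantifier would not match them.

```javascript
// Patterns under discussion; the "$1;" replacement is an assumption.
const loose = /(&\S+) and/g;            // any entity-like token
const hex4 = /(&#x[0-9A-F]{4}) and/g;   // exactly four hex digits, as written
const hexAny = /(&#x[0-9A-F]+) and/g;   // hypothetical variant: any number of hex digits

const fix = (text, pattern) => text.replace(pattern, "$1;");

// Current output: the entity's semicolon has been replaced by " and".
const broken = "Mu&#x000F1 andoz-Mar&#x000ED and, J.";
// Hypothetical future output: semicolons restored, followed by a real "and".
const restored = "Mar&#x000ED; and Malo, J.";

console.log(fix(broken, loose));    // Mu&#x000F1;oz-Mar&#x000ED;, J.
console.log(fix(broken, hex4));     // unchanged: the entities have five hex digits
console.log(fix(broken, hexAny));   // Mu&#x000F1;oz-Mar&#x000ED;, J.

console.log(fix(restored, loose));  // Mar&#x000ED;; Malo, J.  (the real "and" is lost)
console.log(fix(restored, hexAny)); // unchanged
```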

aurimas · November 6, 2012

That's assuming that they don't use other forms of HTML special characters. How about "(&[^\s;]+) and"? I guess I don't have a problem with the hex-only form if you feel that it's good enough.

dstillman · November 6, 2012

Sure, "(&[^\s;]+) and" should work.
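
For reference, a minimal sketch of the repair step with the agreed pattern, followed by entity decoding. `repairEntities` and `decodeHexEntities` are hypothetical helper names, and the `"$1;"` replacement is an assumption. In the translator the decoding would be done by Zotero.Utilities.unescapeHTML(); the stand-in below only handles hexadecimal entities so that the example runs on its own.

```javascript
// Restore the semicolon that was turned into " and" after an entity.
// [^\s;]+ refuses to cross an existing ";", so already-correct entities
// are left alone.
function repairEntities(bibtex) {
  return bibtex.replace(/(&[^\s;]+) and/g, "$1;");
}

// Stand-in for Zotero.Utilities.unescapeHTML(): decodes only &#x...; entities.
function decodeHexEntities(text) {
  return text.replace(/&#x([0-9A-Fa-f]+);/g, (match, hex) =>
    String.fromCodePoint(parseInt(hex, 16))
  );
}

const rawField =
  "author={Tuia, D. and Mu&#x000F1 andoz-Mar&#x000ED and, J. and G&#x000F3 andmez-Chova, L. and Malo, J.},";

const repaired = repairEntities(rawField);
console.log(repaired);
// author={Tuia, D. and Mu&#x000F1;oz-Mar&#x000ED;, J. and G&#x000F3;mez-Chova, L. and Malo, J.},

console.log(decodeHexEntities(repaired));
// author={Tuia, D. and Muñoz-Marí, J. and Gómez-Chova, L. and Malo, J.},
```

Running `repairEntities` on the already repaired string leaves it unchanged, so the step should be harmless once the site restores the semicolons.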