Bug in handling of accented characters in IEEE Xplore translator

Hi!

I just wanted to note that there's a bug in the IEEE Xplore translator that causes all accented characters (such as those in authors' names) to be displayed as HTML-escaped hex codes, e.g.:

What should be:
J. Muñoz-Marí

Gets imported as:
J. Mu&#x000F1 andoz-Mar&#x000ED and

If I knew more JavaScript I would fix it myself, but sadly I can't.

Kind regards,

Jan-Pieter
  • Could you provide a URL to such an entry in IEEE Xplore, please?
  • Sure!

    For instance, this one:
    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6221978&contentType=Early+Access+Articles&queryText%3Dgraph+matching+remote+sensing
  • Great, I can replicate that.
    At first glance, this looks dodgy on the IEEE Xplore side of things, but we may be able to fix it by adjusting the character encoding settings. It might take a little while; we'll let you know.
    Thanks for reporting.
  • It's not a character set issue. The site returns the BibTeX as HTML, with hexadecimal entities for extended characters, but the translator only does a fixed replacement of "<br>" instead of running the BibTeX through Zotero.Utilities.unescapeHTML(), which fixes the problem (at least for this item).
  • edited November 6, 2012
    Well, it partly fixes the problem. The HTML BibTeX actually has spaces where the semicolons should be:

    author={Tuia, D. and Mu&amp;#x000F1 andoz-Mar&amp;#x000ED and, J. and G&amp;#x000F3 andmez-Chova, L. and Malo, J.},

    unescapeHTML still works without the semicolons, but the extra spaces remain in the imported names. A regexp before the unescapeHTML could fix that, assuming this is universal.
  • Looks like the IEEE Xplore BibTeX-generating algorithm is doing something like
    authors.replace(/;/g, " and")

    This is obviously not compatible with HTML-encoded special characters. IEEE Xplore should be notified of this issue, but in the meantime we might want to consider switching to RIS.
  • Ah, good catch. We should notify them. We could also switch to RIS, but fixing the entities with a regexp would be trivial (and shouldn't break anything when they fix this).
  • edited November 6, 2012
    Good call. A regex replace was indeed trivial.

    Edit: Oh, and I sent them a message about this bug.

    @jpjacobs The translator should automatically update within 24 hours, or you can update manually from Preferences -> General -> Update Now.
  • edited November 6, 2012
    I'd go with the entire "&#x[0-9A-F]{4} and" for good measure. Your "(&\S+) and" could in theory break when they add semicolons back (in the extremely unlikely event that a legitimate "and" appeared in a name field and was preceded by a word that ended in an extended character).

    (Though if we hear back from them we can also just get rid of this. Thanks for contacting them.)
  • That's assuming that they don't use other forms of HTML special characters. How about "(&[^\s;]+) and"? I guess I don't have a problem with the hex-only form if you feel that it's good enough.
  • Sure, "(&[^\s;]+) and" should work.
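
For anyone reading this later, the fix discussed above can be sketched roughly as follows. This is only an illustration, not the actual translator code: `repairEntities`, `decodeNumericEntities`, and `cleanBibtex` are made-up names, the `"$1;"` replacement string is an assumption (the thread only gives the pattern), and `decodeNumericEntities` is a simplified stand-in for Zotero.Utilities.unescapeHTML() that handles numeric entities only. The sample input assumes the raw response contains entities in the `&#x000F1 and` form seen in the imported name at the top of the thread.

```javascript
// Restore the semicolon that was turned into " and" right after an entity.
// "[^\s;]+" stops at whitespace or ";", so an entity that already ends in ";"
// (i.e. after IEEE Xplore fixes its output) is left untouched.
function repairEntities(bibtex) {
  return bibtex.replace(/(&[^\s;]+) and/g, "$1;"); // "$1;" is an assumed replacement
}

// Hypothetical, simplified stand-in for Zotero.Utilities.unescapeHTML():
// decodes hex and decimal numeric character references only.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);?/gi, (match, code) => {
    const isHex = code[0].toLowerCase() === "x";
    const codePoint = isHex ? parseInt(code.slice(1), 16) : parseInt(code, 10);
    return String.fromCodePoint(codePoint);
  });
}

// Full clean-up: turn <br> into newlines, repair entities, then decode them.
function cleanBibtex(htmlBibtex) {
  const withNewlines = htmlBibtex.replace(/<br\s*\/?>/gi, "\n");
  return decodeNumericEntities(repairEntities(withNewlines));
}

const broken =
  "author={Tuia, D. and Mu&#x000F1 andoz-Mar&#x000ED and, J. and " +
  "G&#x000F3 andmez-Chova, L. and Malo, J.},";
console.log(cleanBibtex(broken));
```

For the sample author line, this should give back Muñoz-Marí and Gómez-Chova with the " and" separators between authors intact. A name such as "Mar&#x000ED; and Malo" with a proper semicolon passes through `repairEntities` unchanged, which is the edge case that motivated "[^\s;]" over "\S".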
