Intercepting RIS files: diacritics and phantom items
When intercepting RIS files from EBSCOhost (using the ATLA database), I have the following two problems:
1. characters with umlauts and accents don't get imported correctly.
2. phantom web items are created in the database, unlinked to any parent, and without a title, with only the URL of the database entry.
Also, the URL of the database entry is included in the URL field on Zotero's 'Info' pane. This is apparently contrary to the CSL spec, as Bruce mentions here.
I don't know whether the problem lies with the RIS import or the export, but even if it's the export, could Zotero perhaps be made to guess the encoding? Vim and Notepad++ seem to 'get it right' by default; at any rate, both display the saved RIS file correctly, even before any explicit conversion.
One solution might be to ask each time whether and how RIS tags should be imported. An intelligent fallback would probably be to treat keywords in intercepted RIS files as 'automatic' tags.
There is no guarantee that the source app for your RIS file is using Windows ANSI anyway; it may well be using UTF-8.
1) ASCII is just 128 characters, so "Windows ANSI" isn't just standard ASCII—it's ASCII plus one of a bunch of different international sets for the remaining 128 characters. Presumably they mean Windows-1252.
2) ISI says "IBM Extended Character Set", which is why we went with IBM850. Since that's an older DOS character set, that may be an older version of the specs, but it does seem to be the most authoritative source.
3) All of this is largely irrelevant if large sites are using different encodings. It may be that EBSCOhost is using Windows-1252 or UTF-8 and Scot's text editor is adding a UTF-8 BOM at the beginning of the file when saving it, which makes Zotero ignore any character set specified by the translator and use UTF-8 instead.
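(For reference, the UTF-8 BOM is just the three bytes EF BB BF at the very start of the file, so the check itself is trivial. Here is a minimal Python sketch of such a check, with a made-up file name; this is not Zotero's actual import code:)

```python
# Minimal sketch: does a saved RIS file start with a UTF-8 BOM?
# The file name is just an example; this is not Zotero's import code.
def has_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"

if has_utf8_bom("export.ris"):
    print("UTF-8 BOM present; an importer would likely treat the file as UTF-8.")
else:
    print("No BOM; the importer has to rely on a declared or guessed charset.")
```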
Scot, if there are extended characters in the RIS file, your text editor (or, at least, some text editor) should give you some indication as to what charset the file is in when you first open it. Could you check that? Otherwise, if you upload it somewhere, we could download it and take a look.
Unfortunately we don't have a good way of detecting file character sets at the moment from within Zotero. (Firefox obviously can do this internally, but we haven't yet found a non-bad way of accessing that routine.) We may be able to hard-code different character sets for certain sites into the translator itself, though it might require a small architectural change to allow that. I'd have to check with Simon, who coded it.
I think Zotero should interpret the RIS format as UTF-8. Since the RIS standard specifies 7-bit ASCII, which is a subset of UTF-8, there shouldn't be a problem as far as backwards compatibility goes.
Modern applications (which handle Unicode) could then export these characters in their RIS files and not worry about character set conversion, which essentially means either losing the high-bit characters (as described above) or converting them to numbered HTML/XML entities. The latter is how I handled it in my RIS exporter, and the characters are displayed correctly in Zotero.
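Roughly, the entity approach just replaces every character above 0x7F with its numeric code point. A small sketch of the idea (this is not the exporter mentioned above, just an illustration):

```python
# Sketch: export text as pure-ASCII RIS by replacing non-ASCII characters
# with numeric XML/HTML entities. Not an actual exporter, just the idea.
def to_ascii_with_entities(text):
    return "".join(
        ch if ord(ch) < 0x80 else "&#x{:X};".format(ord(ch))
        for ch in text
    )

print(to_ascii_with_entities("Müller, José"))  # -> M&#xFC;ller, Jos&#xE9;
```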
Now, we could interpret those particular code points as Windows-1252, which many web browsers and mail clients do. But we still need to check what the popular sites and programs are exporting to be sure there aren't a sizable number of IBM850-encoded documents out there.
On the other hand, it occurs to me that we could probably write some code that could at least differentiate between IBM850, Windows-1252, and ISO-8859-1/UTF-8, even if we didn't have full charset detection...
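Something along these lines, say. The byte-range heuristics below are assumptions about typical RIS content, not tested code:

```python
# Rough sketch of a charset guesser for RIS data: try strict UTF-8 first,
# then fall back to byte-range heuristics. The thresholds are assumptions.
def guess_ris_encoding(data):
    try:
        data.decode("utf-8")          # pure ASCII also validates as UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        pass
    high = [b for b in data if b >= 0x80]
    # 0x80-0x9F are unprintable C1 controls in ISO-8859-1, but printable
    # punctuation in Windows-1252 and accented letters in IBM850.
    if any(0x80 <= b <= 0x9F for b in high):
        # In IBM850 the common accented Latin letters sit around 0x80-0xA5,
        # while in Windows-1252 they sit in 0xC0-0xFF. Crude, but a signal.
        upper = sum(1 for b in high if b >= 0xC0)
        return "windows-1252" if upper >= len(high) / 2 else "ibm850"
    return "iso-8859-1"

with open("export.ris", "rb") as f:   # example file name
    print(guess_ris_encoding(f.read()))
```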
I agree it would need to sniff out those ambiguous characters and deal with them smartly.
For what it's worth, here is the same list when imported directly from EBSCO into RefWorks and exported as RIS. It shows the same results in Firefox and my two text editors.
And this is a sample of what Google Scholar gives if you do an 'Export to EndNote' and save the result. It's not RIS and may be irrelevant to the discussion; it was just an experiment. The GS entry is imported fine by the Zotero translator/scraper, but this little segment doesn't import correctly if you do it manually.
So that leaves 1) hard-coding charsets for particular sites, 2) implementing a simple detection algorithm, or 3) deciding that everybody should be exporting UTF-8 RIS these days and just using that. I'd be a bit wary about the last option and would want to do a pretty large sampling of programs and sites to see how commonplace UTF-8 RIS was. For what it's worth, from Scot's examples it does seem EBSCO and RefWorks are both exporting UTF-8 RIS.
So from my selfish point of view, assuming UTF-8 would work!
But I think option 2 would be a better solution. It's a shame about not being able to access the Mozilla/FF character set conversion libraries ("in a non-bad way"); that would be a very simple solution to a tricky problem. Maybe some lobbying of the FF people could remedy this in a future FF release?
FWIW, I had a look at the RefMan user manual, where it describes the RIS output as using the "Windows ANSI character set" (Appendix D). Further up, the manual has a section on entering special characters (p. 108), which also mentions the ANSI character set and includes a (high-bit) character table that looks like Windows-1252. So I think we can at least discount IBM-850 (assuming IBM-850 differs from Windows-1252, which I'm not sure about).
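For what it's worth, the two do differ for the common accented letters; a quick check with Python's built-in codecs:

```python
# Quick check that IBM850 (cp850) and Windows-1252 (cp1252) put common
# accented letters at different byte values.
for ch in "äöüé":
    print(ch, "cp850:", hex(ch.encode("cp850")[0]),
              "cp1252:", hex(ch.encode("cp1252")[0]))
# cp850:  ä=0x84, ö=0x94, ü=0x81, é=0x82  (all in the 0x80-0x9F range)
# cp1252: ä=0xE4, ö=0xF6, ü=0xFC, é=0xE9  (same positions as Latin-1)
```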