Intercepting RIS files: diacritics and phantom items
When intercepting RIS files from EBSCOhost (using the ATLA database), I have the following two problems:
1. characters with umlauts and accents don't get imported correctly.
2. phantom web items are created in the database, unlinked to any parent, and without a title, with only the URL of the database entry.
Also, the URL of the database entry is included in the URL field on Zotero's 'Info' pane. This is apparently contrary to the CSL spec, as Bruce mentions here.
I don't know whether the problem lies with the RIS import or the export, but even if it's the export, could Zotero perhaps be made to guess the encoding? Vim and Notepad++ seem to 'get it right' by default; at any rate, both display the saved RIS file correctly, even before any explicit conversion.
One solution might be to ask each time whether and how RIS tags should be imported. An intelligent fallback would probably be to treat keywords in intercepted RIS files as 'automatic' tags.
There is no guarantee that the source app for your RIS file is using Windows ANSI anyway; it may well be using UTF-8.
1) ASCII is just 128 characters, so "Windows ANSI" isn't just standard ASCII—it's ASCII plus one of a bunch of different international sets for the remaining 128 characters. Presumably they mean Windows-1252.
2) ISI says "IBM Extended Character Set", which is why we went with IBM850. Since that's an older DOS character set, that may be an older version of the specs, but it does seem to be the most authoritative source.
3) All of this is largely irrelevant if large sites are using different encodings. It may be that EBSCOhost is using Windows-1252 or UTF-8 and Scot's text editor is adding a UTF-8 BOM at the beginning of the file when saving it, which makes Zotero ignore any character set specified by the translator and use UTF-8 instead.
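(For reference, the UTF-8 BOM is just the three bytes EF BB BF at the very start of the file, so the check itself is trivial. Here is a minimal Python sketch of such a check, with a made-up file name; this is not Zotero's actual import code:)

```python
# Minimal sketch: does a saved RIS file start with a UTF-8 BOM?
# The file name is just an example; this is not Zotero's import code.
def has_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"

if has_utf8_bom("export.ris"):
    print("UTF-8 BOM present; an importer would likely treat the file as UTF-8.")
else:
    print("No BOM; the importer has to rely on a declared or guessed charset.")
```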
Scot, if there are extended characters in the RIS file, your text editor (or, at least, some text editor) should give you some indication as to what charset the file is in when you first open it. Could you check that? Otherwise, if you upload it somewhere, we could download it and take a look.
Unfortunately we don't have a good way of detecting file character sets at the moment from within Zotero. (Firefox obviously can do this internally, but we haven't yet found a non-bad way of accessing that routine.) We may be able to hard-code different character sets for certain sites into the translator itself, though it might require a small architectural change to allow that. I'd have to check with Simon, who coded it.
I think Zotero should interpret the RIS format as UTF-8. Since the RIS standard specifies 7-bit ASCII, which is a subset of UTF-8, there shouldn't be a problem as far as backwards compatibility goes.
Modern applications (which handle Unicode) could then export these characters in their RIS files and not worry about character set conversion, which essentially means either losing the high-bit characters (as described above) or converting them to numbered HTML/XML entities. The latter is how I handled it in my RIS exporter, and the characters are displayed correctly in Zotero.
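Roughly, the entity approach just replaces every character above 0x7F with its numeric code point. A small sketch of the idea (this is not the exporter mentioned above, just an illustration):

```python
# Sketch: export text as pure-ASCII RIS by replacing non-ASCII characters
# with numeric XML/HTML entities. Not an actual exporter, just the idea.
def to_ascii_with_entities(text):
    return "".join(
        ch if ord(ch) < 0x80 else "&#x{:X};".format(ord(ch))
        for ch in text
    )

print(to_ascii_with_entities("Müller, José"))  # -> M&#xFC;ller, Jos&#xE9;
```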
Now, we could interpret those particular code points as Windows-1252, which many web browsers and mail clients do. But we still need to check what the popular sites and programs are exporting to be sure there aren't a sizable number of IBM850-encoded documents out there.
On the other hand, it occurs to me that we could probably write some code that could at least differentiate between IBM850, Windows-1252, and ISO-8859-1/UTF-8, even if we didn't have full charset detection...
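Something along these lines, say. The byte-range heuristics below are assumptions about typical RIS content, not tested code:

```python
# Rough sketch of a charset guesser for RIS data: try strict UTF-8 first,
# then fall back to byte-range heuristics. The thresholds are assumptions.
def guess_ris_encoding(data):
    try:
        data.decode("utf-8")          # pure ASCII also validates as UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        pass
    high = [b for b in data if b >= 0x80]
    # 0x80-0x9F are unprintable C1 controls in ISO-8859-1, but printable
    # punctuation in Windows-1252 and accented letters in IBM850.
    if any(0x80 <= b <= 0x9F for b in high):
        # In IBM850 the common accented Latin letters sit around 0x80-0xA5,
        # while in Windows-1252 they sit in 0xC0-0xFF. Crude, but a signal.
        upper = sum(1 for b in high if b >= 0xC0)
        return "windows-1252" if upper >= len(high) / 2 else "ibm850"
    return "iso-8859-1"

with open("export.ris", "rb") as f:   # example file name
    print(guess_ris_encoding(f.read()))
```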
I agree it would need to sniff out those ambiguous characters and deal with them smartly.
For what it's worth, here is the same list when imported directly from EBSCO into RefWorks and exported as RIS. It shows the same results in Firefox and my two text editors.
And this is a sample of what Google Scholar gives if you do an 'Export to EndNote' and save the result. It's not RIS and may be irrelevant to the discussion; it was just an experiment. The GS entry is imported fine by the Zotero translator/scraper, but this little segment doesn't import correctly if you do it manually.
So that leaves 1) hard-coding charsets for particular sites, 2) implementing a simple detection algorithm, or 3) deciding that everybody should be exporting UTF-8 RIS these days and just using that. I'd be a bit wary about the last option and would want to do a pretty large sampling of programs and sites to see how commonplace UTF-8 RIS was. For what it's worth, from Scot's examples it does seem EBSCO and RefWorks are both exporting UTF-8 RIS.
So from my selfish point of view, assuming UTF-8 would work!
But I think option 2 would be a better solution. It's a shame about not being able to access the Mozilla/FF character set conversion libraries ("in a non-bad way"); that would be a very simple solution to a tricky problem. Maybe some lobbying of the FF people could remedy this in a future FF release?
FWIW, I had a look at the RefMan user manual, where it describes the RIS output as using the "Windows ANSI character set" (Appendix D). Further up, the manual has a section on entering special characters (p. 108), which also mentions the ANSI character set and includes a (high-bit) character table that looks like Windows-1252. So I think we can at least discount IBM-850 (assuming IBM-850 differs from Windows-1252, which I'm not sure about).
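For what it's worth, the two do differ for the common accented letters; a quick check with Python's built-in codecs:

```python
# Quick check that IBM850 (cp850) and Windows-1252 (cp1252) put common
# accented letters at different byte values.
for ch in "äöüé":
    print(ch, "cp850:", hex(ch.encode("cp850")[0]),
              "cp1252:", hex(ch.encode("cp1252")[0]))
# cp850:  ä=0x84, ö=0x94, ü=0x81, é=0x82  (all in the 0x80-0x9F range)
# cp1252: ä=0xE4, ö=0xF6, ü=0xFC, é=0xE9  (same positions as Latin-1)
```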