"malformed URI sequence" when capturing a web page

sonik · April 23, 2011

When capturing some web pages (such as ),
I got the following error:

Error: malformed URI sequence
Source file: chrome://zotero/content/xpcom/attachments.js
Line: 1169

(Submitted as report ID 878181380)

Line 1169 of chrome://zotero/content/xpcom/attachments.js reads:

function _getFileNameFromURL(url, mimeType){
/* ... */
// Pass unencoded name to getValidFileName() so that '%20' isn't stripped to '20'
nsIURL.fileBaseName = Zotero.File.getValidFileName(decodeURIComponent(nsIURL.fileBaseName)); // <-- HERE

I think the problem is that "%A4%CF%A4%C6%A4%CA" is not a percent-encoded UTF-8 string, which is what decodeURIComponent() expects.
(It is actually a percent-encoded EUC-JP string.)
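
The failure is easy to reproduce in any JavaScript console, outside Zotero. A minimal sketch using only standard JavaScript; the variable names are mine and are not taken from Zotero's code:

```javascript
// Percent-encoded EUC-JP bytes from the failing URL's final component.
const eucJpName = "%A4%CF%A4%C6%A4%CA";

// decodeURIComponent() treats percent-escapes as UTF-8 bytes. 0xA4 is a UTF-8
// continuation byte and cannot start a character, so decoding throws.
let caught = null;
try {
  decodeURIComponent(eucJpName);
} catch (e) {
  caught = e;
}
console.log(caught instanceof URIError); // true

// For comparison, a name that was percent-encoded as UTF-8 round-trips fine.
const utf8Name = encodeURIComponent("日本語");
console.log(decodeURIComponent(utf8Name) === "日本語"); // true
```

The exact error message differs between JavaScript engines, so only the error type is checked here.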

Perhaps deriving the file name from the URI needs some more work.

Thanks in advance.

fbennett · April 23, 2011

I'll just offer a quiet, despairing one-line rant over the persistence of EUC-JP and Shift-JIS.

ajlyon · April 23, 2011

So we're trying to get a file name by decoding the URI's final component... And that gives us gibberish, since it's not UTF-8. I feel like we should be able to detect this and handle it correctly, but I'm not sure.

This is certainly due to EUC-JP, though, not a general UTF-8 issue.

sonik · April 23, 2011

Yes, this is not a general UTF-8 issue.

But I have some concerns about getting a file name by decoding the URI's final component:

- what if we get an error while decoding the URI's final component?
- what if the decoded final component is not appropriate for a file name?
(for example, accented letters on Japanese-locale Windows,
or CJKV characters on European-locale Windows)

ajlyon · April 23, 2011

Any modern OS should be perfectly happy with any UTF-8 filename, so long as it doesn't include any explicitly reserved characters (and I think Mozilla will handle those automatically). A French Zotero user won't get Chinese filenames unless she's saving from pages with Chinese URI components. Which seems to mean that she's fine with having Chinese around.
[Edit: I know that there are people using operating systems that don't qualify as "modern" by that definition, but even XP is pretty much OK with Unicode filenames in most cases.]

There is a problem to fix here, but it's that we're failing to account for non-UTF-8 content in URIs. This is probably somewhat related to an issue that we have with non-UTF-8 COinS (which are also URL-encoded). In both cases, it's hard to know what character encoding is in use (since URIs don't have any way to mark it explicitly, of course).
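
Even without knowing the real encoding, we can at least tell whether the escaped bytes are valid UTF-8. A rough sketch, assuming a runtime that provides the standard TextDecoder and an ASCII-only input string; percentDecodeToBytes and isValidUtf8 are hypothetical helper names, not existing Zotero functions:

```javascript
// Turn a percent-encoded component into raw bytes without assuming an encoding.
// Assumption: unescaped characters in the input are plain ASCII.
function percentDecodeToBytes(component) {
  const bytes = [];
  for (let i = 0; i < component.length; i++) {
    const hex = component.slice(i + 1, i + 3);
    if (component[i] === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(component.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// A decoder created with fatal: true throws on invalid input instead of
// silently substituting U+FFFD, which makes it usable as a validity check.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidUtf8(percentDecodeToBytes("%A4%CF%A4%C6%A4%CA"))); // false
console.log(isValidUtf8(percentDecodeToBytes("%E3%81%AF%E3%81%A6%E3%81%AA"))); // true
```

This only answers "is it UTF-8 or not". Deciding between EUC-JP, Shift-JIS and others would still need a guess or a hint from the page itself, which this sketch does not attempt.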

sonik · April 23, 2011

> Any modern OS should be perfectly happy with any UTF-8 filename,
> so long as it doesn't include any explicitly reserved characters
> (and I think Mozilla will handle those automatically).

I see.
(Some time ago I got an error with accented letters in a file name,
but that was not with a Mozilla product, so it is not the problem here.)

I would like some fixes here.
- At least, if saving a file fails, notify me.
(Currently Zotero finishes its work **quietly** even if it failed to save a file.)
- If possible, when saving a file under a certain name fails, try an alternate name.
The alternate name could be a SHA-1 hash of the URI's path,
or even the raw final component of the URI, I think.
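
To make the two alternate names concrete, here is a small sketch. It assumes Node.js (for the built-in crypto module and the global URL class) rather than Zotero's own environment, and hashedBaseName, rawLastComponent and the example.com URL are made up for illustration:

```javascript
const crypto = require("crypto");

// Option 1: SHA-1 hash of the URI's path. Always 40 hex characters, so it is
// safe as a file name whatever bytes the path contains.
function hashedBaseName(url) {
  const path = new URL(url).pathname; // percent-escapes are kept as-is
  return crypto.createHash("sha1").update(path).digest("hex");
}

// Option 2: the raw (still percent-encoded) final component of the path.
function rawLastComponent(url) {
  const parts = new URL(url).pathname.split("/");
  return parts[parts.length - 1];
}

const exampleUrl = "http://example.com/wiki/%A4%CF%A4%C6%A4%CA";
console.log(hashedBaseName(exampleUrl).length); // 40
console.log(rawLastComponent(exampleUrl)); // %A4%CF%A4%C6%A4%CA
```

The hash is opaque but never fails; the raw component stays recognizable but keeps its '%' characters.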

ajlyon · April 23, 2011

I agree on the fixes. Mozilla gives us tools to fall back on a safe file name-- mainly we need to figure out ahead of time that the name is no good, probably by wrapping decodeURIComponent in a try-catch block to catch the error. Since this is happening within Zotero's main code, I'm going to leave this to Dan, but I imagine this can be fixed pretty easily.
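
Roughly what I have in mind for the try-catch, as a sketch only. decodeBaseName and its fallbackName parameter are hypothetical, and this is not how Zotero's code is actually structured:

```javascript
// Decode a percent-encoded file base name, falling back to a caller-supplied
// safe name when the escapes are not valid UTF-8.
function decodeBaseName(encodedName, fallbackName) {
  try {
    return decodeURIComponent(encodedName);
  } catch (e) {
    if (e instanceof URIError) {
      return fallbackName; // e.g. non-UTF-8 escapes such as EUC-JP
    }
    throw e; // anything else is unexpected, so don't hide it
  }
}

console.log(decodeBaseName("my%20file", "attachment")); // my file
console.log(decodeBaseName("%A4%CF%A4%C6%A4%CA", "attachment")); // attachment
```

What the fallback name should be is a separate question; the point is only that the decode error gets caught before it aborts the save.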

In general, there are a few cases where we could use more user notification-- a similar case is when PDF attachments fail to be attached, and people would often want to know that, especially if they're used to saves always succeeding.

sonik · April 23, 2011

Thanks.

I've made a quick (and dirty) patch for this <URL:https://gist.github.com/938927>.
I posted it to zotero-dev, but it needs some time for approval.