"malformed URI sequence" when capturing a web page

When capturing some web pages (such as ),
I got the following error:

Error: malformed URI sequence
Source file: chrome://zotero/content/xpcom/attachments.js
Line: 1169

(Submitted as report ID 878181380)


Line 1169 of chrome://zotero/content/xpcom/attachments.js reads:

function _getFileNameFromURL(url, mimeType){
/* ... */
// Pass unencoded name to getValidFileName() so that '%20' isn't stripped to '20'
nsIURL.fileBaseName = Zotero.File.getValidFileName(decodeURIComponent(nsIURL.fileBaseName)); // <-- HERE

I think this problem comes from the fact that "%A4%CF%A4%C6%A4%CA" is not a percent-encoded UTF-8 string, although decodeURIComponent() assumes it is.
(Actually, "%A4%CF%A4%C6%A4%CA" is a percent-encoded EUC-JP string.)

Perhaps getting a file name from the URI when saving needs some more work.
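
For reference, the failure can be reproduced outside Zotero with plain JavaScript. This is only a minimal sketch: the variable names are mine, and the exact error message differs between engines (Firefox reports "malformed URI sequence").

```javascript
// The final path component quoted in the report above.
const component = "%A4%CF%A4%C6%A4%CA";

let caught = null;
try {
  // decodeURIComponent() treats the percent-encoded bytes as UTF-8.
  decodeURIComponent(component);
} catch (e) {
  caught = e;
}

// 0xA4 cannot start a UTF-8 sequence, so decoding fails with a URIError.
console.log(caught && caught.name);

// A component that really is percent-encoded UTF-8 decodes without error.
const utf8Decoded = decodeURIComponent("%E3%81%AF");
console.log(utf8Decoded);
```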

Thanks in advance.
  • I'll just offer a quiet, despairing one-line rant over the persistence of EUC-JP and Shift-JIS.
  • So we're trying to get a file name by decoding the URI's final component... And that gives us gibberish, since it's not UTF-8. I feel like we should be able to detect this and handle it correctly, but I'm not sure.

    This is certainly due to EUC-JP, though, not a general UTF-8 issue.
  • Yes, this is not a general UTF-8 issue.

    But I have some concerns about getting a file name by decoding the URI's final component:

    - What if we get an error while decoding the URI's final component?
    - What if the decoded final component is not appropriate for a file name?
    (For example, accented letters under a Japanese locale on Windows,
    or CJKV characters under a European locale on Windows.)
  • edited April 23, 2011
    Any modern OS should be perfectly happy with any UTF-8 filename, so long as it doesn't include any explicitly reserved characters (and I think Mozilla will handle those automatically). A French Zotero user won't get Chinese filenames unless she's saving from pages with Chinese URI components. Which seems to mean that she's fine with having Chinese around.
    [Edit: I know that there are people using operating systems that don't qualify as "modern" by the definition, but even XP is pretty much OK with Unicode filenames in most cases.]

    There is a problem to fix here, but it's that we're failing to account for non-UTF-8 content in URIs. This is probably somewhat related to an issue that we have with non-UTF-8 COinS (which are also URL-encoded). In both cases, it's hard to know what character encoding is in use (since URIs don't have any way to mark it explicitly, of course).
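
    One rough way to cope with the unknown encoding is to try a few candidates in strict mode and take the first that decodes cleanly. The sketch below is only a heuristic, not how Zotero does it: the helper names `percentDecodeToBytes` and `guessDecode` and the candidate list are assumptions, and which legacy labels `TextDecoder` accepts depends on the runtime.

    ```javascript
    // Hypothetical helper: turn a percent-encoded component into raw bytes.
    function percentDecodeToBytes(component) {
      const encoder = new TextEncoder();
      const bytes = [];
      // split() with a capture group keeps the "%XX" escapes as separate parts.
      for (const part of component.split(/(%[0-9A-Fa-f]{2})/)) {
        if (/^%[0-9A-Fa-f]{2}$/.test(part)) {
          bytes.push(parseInt(part.slice(1), 16));
        } else {
          // Literal characters are kept as their UTF-8 bytes.
          bytes.push(...encoder.encode(part));
        }
      }
      return Uint8Array.from(bytes);
    }

    // Hypothetical helper: return the first candidate encoding that decodes
    // the bytes without error, or null if none does.
    function guessDecode(component, candidates = ["utf-8", "euc-jp", "shift_jis"]) {
      const bytes = percentDecodeToBytes(component);
      for (const label of candidates) {
        try {
          const text = new TextDecoder(label, { fatal: true }).decode(bytes);
          return { encoding: label, text };
        } catch (e) {
          // Either the bytes are invalid in this encoding or the runtime
          // does not know this label; try the next candidate.
        }
      }
      return null;
    }

    console.log(guessDecode("%E3%81%AF"));
    console.log(guessDecode("%A4%CF%A4%C6%A4%CA"));
    ```

    The same bytes can be valid in more than one legacy encoding, so the order of the candidates is itself a guess. A page-level hint (such as the document's declared charset) would be a better first candidate where one is available.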
  • > Any modern OS should be perfectly happy with any UTF-8 filename,
    > so long as it doesn't include any explicitly reserved characters
    > (and I think Mozilla will handle those automatically).

    I see.
    (Some time ago, I got an error with accented letters in a file name,
    but that was not with a Mozilla product, so it is not the problem here.)

    I'd like some fixes here.
    - At the least, if saving a file fails, notify me about it.
    (Currently Zotero finishes its work **quietly** even if it failed to save a file.)
    - If possible, when saving a file under a certain name fails, try an alternate name.
    The alternate name could be a SHA-1 hash of the URI's path,
    or even the raw final component of the URI, I think.
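
    A minimal sketch of the SHA-1 idea, assuming Node.js and its built-in `crypto` module as a stand-in (Zotero runs on Mozilla's platform and would use different APIs). The function name `fallbackFileName` and the example URL are hypothetical.

    ```javascript
    const crypto = require("crypto");

    // Hypothetical helper: derive an ASCII-only file name from a URL's path.
    function fallbackFileName(url) {
      // pathname keeps the percent-escapes as-is, so no decoding is needed.
      const path = new URL(url).pathname;
      return crypto.createHash("sha1").update(path, "utf8").digest("hex");
    }

    // Prints 40 hexadecimal characters; the same path always gives the same name.
    console.log(fallbackFileName("http://example.com/%A4%CF%A4%C6%A4%CA"));
    ```

    A hash is safe on any file system but tells the user nothing about the page, so it probably fits best as a last resort after the raw component.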
  • I agree on the fixes. Mozilla gives us tools to fall back on a safe file name; mainly we need to figure out ahead of time that the name is no good, probably by wrapping decodeURIComponent in a try-catch block to catch the error. Since this is happening within Zotero's main code, I'm going to leave this to Dan, but I imagine this can be fixed pretty easily.

    In general, there are a few cases where we could use more user notification. A similar case is when PDF attachments fail to be attached; people would often want to know that, especially if they're used to saves always succeeding.
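
    The try-catch idea could look roughly like this. It is a sketch rather than the actual Zotero fix: `tryDecodeComponent` is a hypothetical name, and falling back to the raw component is just one of the options raised in this thread.

    ```javascript
    // Hypothetical helper: decode a URI component, returning null instead of
    // throwing when the escapes are not valid UTF-8.
    function tryDecodeComponent(rawComponent) {
      try {
        return decodeURIComponent(rawComponent);
      } catch (e) {
        if (e instanceof URIError) {
          return null;
        }
        throw e; // Anything else is unexpected, so let it propagate.
      }
    }

    // The caller decides what to do when decoding fails.
    const raw = "%A4%CF%A4%C6%A4%CA";
    const decoded = tryDecodeComponent(raw);
    const baseName = decoded !== null ? decoded : raw;
    console.log(baseName);
    ```

    Returning null rather than a substitute name keeps the failure visible, so the caller can also notify the user that a fallback name was used.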
  • Thanks.

    I've made a quick (and dirty) patch for this <URL:https://gist.github.com/938927>.
    I've posted it to zotero-dev, but it needs some time for approval.