UTF Encoding Issue with Filenames

enozkan · March 17, 2018

I just realized that some of the PDFs I have attached to items in my Zotero storage have been going "unfound". The reason appears to be that their filenames contain non-ASCII characters, usually from author names with characters such as é or á. Zotero renames the files, but the names cause Dropbox, through which my storage (and only the storage, not the database) gets synced, to append "(Unicode Encoding Conflict)" to the filename before the extension ".pdf". Of course, Zotero can no longer find the file.

Just in case, I am attaching a debugging log where I attach a pdf and then it goes missing for Zotero: D188568239. However, I am certain the issue is due to the name change after Dropbox is unhappy about encoding. I am working on a Mac.

The question, then, is what encoding Zotero is using for file names with non-ASCII characters. They do not appear to be UTF-8? For example, I copied an á from one of these file names and checked the hexdump:

%echo á > funnychar.txt
%hexdump -C funnychar.txt
00000000 61 cc 81 0a |a...|
00000004

It appears to be two characters, an "a" and a "´", even though it displays as one. I am not sure where the problem is coming from. It's not the parent data for the item: that one looks good. The PDFs indeed have these odd characters, and this may be happening because the new Zotero server extracts them as is, rather than using the metadata? I am not sure.
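For reference, the bytes in that hexdump can be reproduced outside the shell. This is a minimal Python sketch (the variable names are illustrative only, not anything from Zotero or Dropbox) comparing the two spellings of á. Both are valid UTF-8; they simply encode different code point sequences.

```python
import unicodedata

# Two ways to write the same visible character.
decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # the single code point U+00E1

for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    # Show the UTF-8 bytes and the official name of each code point.
    code_point_names = [unicodedata.name(ch) for ch in text]
    print(label, text.encode("utf-8").hex(" "), code_point_names)
```

The decomposed form should come out as 61 cc 81, matching the hexdump above (the trailing 0a there is the newline added by echo), while the precomposed form is c3 a1.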

Overall, I'd love to know how to make sure non-ASCII characters especially in attachment file names are saved in UTF-8 or something normal. Thank you!

noksagt · March 17, 2018

UTF-8 allows compounded characters like that.

If the filename is valid when written to your OS X filesystem, it'd seem like there may be no zotero issue.

Ask Dropbox support if they support syncing of these without changing the filename, or use a sync method that will support it.

enozkan · March 18, 2018

Ahh, I did not know UTF-8/Unicode had a second way of defining those characters. Seems useful in odd cases, but I did not expect Zotero to use them when the simpler way exists, and when the sources of bibliography also use the regular (for lack of a better word) characters.

I'll try the Dropbox support way, but that may be a long shot.

If I actually use Rename File from Parent Metadata, it uses the non-compounded characters, since the sources I use, such as Pubmed, appear to encode those characters the regular way and Zotero imports the citation as is. Similarly, Crossref (which I believe is the resource for the new pdf naming server) also had these characters as simple UTF-8 characters, at least on their webpages. I may be completely wrong (and clearly way over my head) but it appears that the new auto-naming server might be converting the non-ASCII UTF-8 characters to their compounded versions? I must be wrong on that (happy to be corrected). Or maybe from the PDFs themselves? (that can't be right either)

I hope to not waste any developer hours on this, but I'd still suggest the pdf renaming scheme keeps the original UTF-8 characters as used in PubMed/Crossref/Google Scholar/metadata on publisher webpage (I checked them all!), if that is not being done so. It may help avoid complaints from Dropbox users, and also avoid the weird experience of having to hit Backspace twice to delete what is a single character (at least in a Mac terminal, Microsoft Word and possibly elsewhere).

enozkan · March 18, 2018

Also for those users who would like to know of a workaround, use Spotlight (or however you search for files on Mac) with a part of the filename, remove the (Unicode Encoding Conflict) and also replace the non-ASCII character by typing it again (Macs use the regular/simpler UTF-8 characters, so that fixes the issue for Dropbox).

If you want to gain access to all such files, search for filenames containing (Unicode Encoding Conflict), rinse and repeat as above. The issue seems to exist for only recently added files (since the file renaming server?), so there shouldn't be that many to deal with.
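For anyone with more than a handful of affected files, the manual steps above could be scripted. The following is a rough sketch only, under these assumptions: the marker text is exactly "(Unicode Encoding Conflict)" and sits just before the extension, as described earlier in this thread, and the helper names `cleaned_name` and `plan_renames` are made up for illustration. It only lists proposed renames and changes nothing on disk.

```python
import os
import unicodedata
from pathlib import Path

# Marker text as reported in this thread; adjust if Dropbox words it differently.
CONFLICT_MARKER = "(Unicode Encoding Conflict)"

def cleaned_name(filename: str) -> str:
    """Drop the conflict marker and re-compose accented characters (NFC)."""
    stem, ext = os.path.splitext(filename)
    stem = stem.replace(CONFLICT_MARKER, "").rstrip()
    return unicodedata.normalize("NFC", stem + ext)

def plan_renames(root: str):
    """Return (current_path, proposed_path) pairs without renaming anything."""
    plans = []
    for path in Path(root).rglob("*"):
        if path.is_file() and CONFLICT_MARKER in path.name:
            plans.append((path, path.with_name(cleaned_name(path.name))))
    return plans

# Pure example: a decomposed e + combining accent plus the marker.
print(cleaned_name("Andre\u0301 - 2018 - Title (Unicode Encoding Conflict).pdf"))
```

Before acting on the output of `plan_renames`, it would be wise to check that the proposed path does not already exist, since Dropbox describes these files as conflicted copies.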

dstillman · March 18, 2018

I did not know UTF-8/Unicode had a second way of defining those characters. Seems useful in odd cases, but I did not expect Zotero to use them when the simpler way exists, and when the sources of bibliography also use the regular (for lack of a better word) characters.

That's not what's happening, though it's not totally clear to me what is.

Zotero normalizes all item fields and all filenames to NFC (what you're calling the "simpler way"), which is what's used for filenames on Linux and Windows (or is at least preferred on the latter, though I'm not sure of the specifics there). So when it saves a file, it's NFC.

HFS+, the legacy Mac filesystem, used NFD (decomposed characters), a perfectly valid choice 30 years ago, but not what the rest of the world ended up using for most things. So when Zotero (or anything else) saved an NFC filename to HFS+, it became NFD.
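To make the NFC/NFD distinction concrete, here is a small Python sketch using the standard `unicodedata` module. It is not Zotero's code, and the function and variable names are hypothetical. HFS+ is also known to use its own variant of NFD, so standard NFD here is only an approximation of what the filesystem stored.

```python
import unicodedata

def to_nfc(name: str) -> str:
    """Compose base letter + combining mark into one code point where possible."""
    return unicodedata.normalize("NFC", name)

def to_nfd(name: str) -> str:
    """Split precomposed characters into base letter + combining mark."""
    return unicodedata.normalize("NFD", name)

# What an application that normalizes to NFC would hand to the filesystem.
saved_by_app = to_nfc("Rene\u0301 2018.pdf")
# Roughly what an NFD-normalizing filesystem would store instead.
stored_as_nfd = to_nfd(saved_by_app)

print([hex(ord(ch)) for ch in saved_by_app[:4]])   # ends with 0xe9
print([hex(ord(ch)) for ch in stored_as_nfd[:5]])  # ends with 0x65, 0x301
print(saved_by_app == stored_as_nfd)               # the strings differ
```

The two names render identically but are different strings, which is why a byte-for-byte comparison treats them as different files.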

APFS, the new Mac filesystem in High Sierra, is normalization-preserving and normalization-insensitive by default, meaning that you can save and access files using either NFC or NFD and it'll find files as expected.
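As a toy model of "normalization-preserving and normalization-insensitive" (this is not how APFS is actually implemented, and `find_insensitive` is a made-up helper), the lookup below compares names after normalizing both sides but returns the stored name exactly as it was saved.

```python
import unicodedata
from typing import Iterable, Optional

def _key(name: str) -> str:
    # Comparison key only; the stored name itself is never rewritten.
    return unicodedata.normalize("NFC", name)

def find_insensitive(wanted: str, stored_names: Iterable[str]) -> Optional[str]:
    """Return the stored name matching `wanted` regardless of NFC/NFD form."""
    wanted_key = _key(wanted)
    for name in stored_names:
        if _key(name) == wanted_key:
            return name  # preserved exactly as it was saved
    return None

stored = ["Rene\u0301.pdf"]                  # saved in decomposed form
hit = find_insensitive("Ren\u00e9.pdf", stored)  # looked up in composed form
print(hit is not None, hit == stored[0])
```

A tool that instead compares raw names, without normalizing either side, would see the composed and decomposed spellings as two different files.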

Finder, even with APFS in High Sierra, appears to normalize to NFD, such that if you paste an NFC character into a filename in Finder and press Return, the filename will be encoded using NFD. (This also means that you have to be careful when you want to test a filename. If you press Return, Cmd-C, and then press Esc, the filename will stay as is, but if you press Return, Cmd-C, and then either press Return or switch to another window, Finder will force the filename to NFD.)

ls normalizes to NFC, so the best way to test a filename is by using Cmd-C in Finder (followed by Esc), which you can then paste into echo é | hexdump.

What Dropbox is doing for you is a bit of a mystery to me. This is what they say:

In some instances, there are several ways to create the same character on your keyboard. Although the characters may look the same, they are not the same to operating systems and Dropbox. When Dropbox notices these encoding conflicts, it will create a conflicted copy of the file and save it in the same folder appended with Unicode Encoding Conflict.

I'm not sure how that would come into play here, since Zotero is only saving files as NFC.

I'd be curious to know whether you're running the latest version of Dropbox (v45.4.92), if you're using HFS+ or APFS, and whether you have another computer syncing (and, if so, if disabling it helps). But basically Zotero is doing exactly what you would expect here, so as far as I know there's nothing for us to change.

enozkan · March 18, 2018

Thank you for that explanation. I appreciate the time you spent on this. Yes, Zotero is not writing the filenames as NFD, but that's how the filename is ending up after Dropbox syncs. When I disable Dropbox syncing, the filename stays as NFC. So, I have to take this up with Dropbox.

I am indeed on latest Dropbox (v45.4.92) and I am using APFS. I have other computers syncing, and I'll check what disabling does at a more opportune time. I should also test this on a computer still on HFS+ and/or on OS <=10.12.

bp_216 · April 5, 2018

I had the same issue and have found a workaround, which is to switch off "Automatically rename attachment files using parent metadata". Then just ensure to avoid special characters in PDF names.

enozkan · April 6, 2018

Well, despite the criticism the developers have been getting about the renaming feature, I find the auto-renaming feature quite sensible and useful.

But, yes, that would solve the issue.

enozkan · April 17, 2018

I finally got around to disabling all the other computers actively syncing at multiple locations. When I had an item in Zotero and tried to attach a local pdf, the file got lost to Zotero because the filename got a "(Unicode Encoding Conflict)" appended. So syncing other computers makes no difference.

This issue has some weird but expected side effects. With Dropbox syncing on for this computer (regardless of other computers syncing or not), when I took one of these pdf files and fed it to Zotero directly (no parent item), the pdf naming/metadata retrieval server fails to identify the publication and cannot create a parent item. If I remove the offending character, metadata retrieval is successful. I guess the local file is immediately added to the storage folder on Dropbox before being sent to the metadata retrieval server, and before it makes it to the server, its name is changed, and fails to upload.

But, here is the good news. If I remove the offending character from the original local pdf, and let Zotero retrieve bibliographic data and create a parent item, all goes well. So, if I upload files with a name like a.pdf, everything goes smoothly. Renaming works without Dropbox messing up the filename. I can even name it back to a.pdf in Zotero and re-rename from parent metadata, no problem. But, if the local pdf with an é in its filename was added to Zotero, it immediately gets lost and cannot be recovered (due to Dropbox name change).

It appears that the issue arises at the moment of initial pdf being saved in the storage folder. The metadata retrieval server or the local rename from parent metadata don't cause any trouble. And when Zotero renames files by adding special characters, Dropbox has no problem with that, either.

This provides a workaround for me: when I want to add a pdf with a filename that contains special characters, I can rename it to whatever.pdf and proceed.

Thanks.

enozkan · November 2, 2018

The issue seems to be resolved. I am not getting "(Unicode Encoding Conflict)" appended to filenames any more. Probably as a result of an update on Dropbox. Thanks to all that chimed in.