UTF Encoding Issue with Filenames
I just realized that some of the PDFs I have attached to items in my Zotero storage have been going "unfound". The reason appears to be that their filenames contain non-ASCII characters, usually from author names with characters such as é or á. Zotero renames the files, but the names cause Dropbox, through which my storage (and only the storage, not the database) gets synced, to append "(Unicode Encoding Conflict)" to the filename before the extension ".pdf". Of course, Zotero can then no longer find the file.
Just in case, I am attaching a debugging log where I attach a pdf and then it goes missing for Zotero: D188568239. However, I am certain the issue is due to the name change after Dropbox is unhappy about encoding. I am working on a Mac.
The question is then: what encoding is Zotero using for filenames with non-ASCII characters? They do not appear to be UTF-8? For example, I copied an á from one of these filenames and checked the hexdump:
%echo á > funnychar.txt
%hexdump -C funnychar.txt
00000000 61 cc 81 0a |a...|
00000004
It appears to be two characters, an "a" and a "´", even though it displays as one. I am not sure where the problem is coming from. It's not the parent data for the item: that one looks good. The PDFs indeed have these odd characters, and this may be happening because the new Zotero server extracts them as is, rather than using the metadata? I am not sure.
Overall, I'd love to know how to make sure non-ASCII characters especially in attachment file names are saved in UTF-8 or something normal. Thank you!
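The bytes in the hexdump above can be checked with a short Python sketch. It uses only the standard library and is not Zotero code; it simply decodes the three bytes (leaving out the trailing newline) and compares them with the composed form of the same visible character:

```python
import unicodedata

# Bytes from the hexdump above, minus the trailing newline (0a).
raw = bytes.fromhex("61cc81")
text = raw.decode("utf-8")  # decodes without error, so this is valid UTF-8

# The decoded text holds two code points: a base letter and a combining accent.
names = [unicodedata.name(ch) for ch in text]

# NFC composes the pair into the single code point U+00E1.
composed = unicodedata.normalize("NFC", text)
composed_bytes = composed.encode("utf-8")

print(names)                     # ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
print(len(text), len(composed))  # 2 1
print(composed_bytes.hex(" "))   # c3 a1
```

So both forms are UTF-8; they differ in Unicode normalization (decomposed vs. composed), not in encoding.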
If the filename is valid when written to your OS X filesystem, it seems there may be no Zotero issue.
Ask Dropbox support whether they can sync these files without changing the filename, or use a sync method that supports it.
I'll try the Dropbox support way, but that may be a long shot.
If I actually use Rename File from Parent Metadata, it uses the non-compounded characters, since the sources I use, such as Pubmed, appear to encode those characters the regular way and Zotero imports the citation as is. Similarly, Crossref (which I believe is the resource for the new pdf naming server) also had these characters as simple UTF-8 characters, at least on their webpages. I may be completely wrong (and clearly way over my head) but it appears that the new auto-naming server might be converting the non-ASCII UTF-8 characters to their compounded versions? I must be wrong on that (happy to be corrected). Or maybe from the PDFs themselves? (that can't be right either)
I hope to not waste any developer hours on this, but I'd still suggest the pdf renaming scheme keeps the original UTF-8 characters as used in PubMed/Crossref/Google Scholar/metadata on publisher webpage (I checked them all!), if that is not being done so. It may help avoid complaints from Dropbox users, and also avoid the weird experience of having to hit Backspace twice to delete what is a single character (at least in a Mac terminal, Microsoft Word and possibly elsewhere).
If you want to gain access to all such files, search for filenames containing (Unicode Encoding Conflict), rinse and repeat as above. The issue seems to exist for only recently added files (since the file renaming server?), so there shouldn't be that many to deal with.
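A rough Python sketch of that search, for anyone who prefers a script. The marker format (optional whitespace followed by "(Unicode Encoding Conflict)") is an assumption based on the description in this thread, and find_conflicts is a hypothetical helper name. It only reports files and changes nothing on disk:

```python
import re
from pathlib import Path

# Assumed marker format, based on the filenames described in this thread:
# "name (Unicode Encoding Conflict).pdf"
CONFLICT = re.compile(r"\s*\(Unicode Encoding Conflict\)")

def find_conflicts(storage_dir):
    """Yield (conflicted_path, name_without_marker) pairs under storage_dir."""
    for path in Path(storage_dir).rglob("*"):
        if path.is_file() and CONFLICT.search(path.name):
            # Strip the marker to show what the filename was presumably meant to be.
            yield path, CONFLICT.sub("", path.name)
```

Call it with the path to your own storage folder, for example `for path, name in find_conflicts("/path/to/storage"): print(path, "->", name)`, and check the results by hand before renaming anything.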
Zotero normalizes all item fields and all filenames to NFC (what you're calling the "simpler way"), which is what's used for filenames on Linux and Windows (or is at least preferred on the latter, though I'm not sure of the specifics there). So when it saves a file, it's NFC.
HFS+, the legacy Mac filesystem, used NFD (decomposed characters), which was a perfectly valid choice 30 years ago, but not what the rest of the world ended up using for most things. So when Zotero (or anything else) saved an NFC filename to HFS+, it became NFD.
APFS, the new Mac filesystem in High Sierra, is normalization-preserving and normalization-insensitive by default, meaning that you can save and access files using either NFC or NFD and it'll find files as expected.
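To make the NFC/NFD distinction concrete, here is a small Python sketch. It only illustrates the idea of normalization-insensitive matching; same_filename is a hypothetical helper, the filenames are made up, and nothing here reflects how APFS, HFS+, or Zotero actually implement it:

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfc_name = "Jos\u00e9.pdf"    # é as a single code point (NFC)
nfd_name = "Jose\u0301.pdf"   # e followed by a combining acute accent (NFD)

# A plain string comparison treats the two names as different.
print(nfc_name == nfd_name)                                # False

# A normalization-insensitive comparison treats them as the same file.
print(same_filename(nfc_name, nfd_name))                   # True

# Converting the NFC name to NFD gives the decomposed spelling.
print(unicodedata.normalize("NFD", nfc_name) == nfd_name)  # True
```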
Finder, even with APFS in High Sierra, appears to normalize to NFD, such that if you paste an NFC character into a filename in Finder and press Return, the filename will be encoded using NFD. (This also means that you have to be careful when you want to test a filename. If you press Return, Cmd-C, and then press Esc, the filename will stay as is, but if you press Return, Cmd-C, and then either press Return or switch to another window, Finder will force the filename to NFD.)
ls normalizes to NFC, so the best way to test a filename is by using Cmd-C in Finder (followed by Esc), which you can then paste into echo é | hexdump.
What Dropbox is doing for you is a bit of a mystery to me. This is what they say: I'm not sure how that would come into play here, since Zotero is only saving files as NFC.
I'd be curious to know whether you're running the latest version of Dropbox (v45.4.92), if you're using HFS+ or APFS, and whether you have another computer syncing (and, if so, if disabling it helps). But basically Zotero is doing exactly what you would expect here, so as far as I know there's nothing for us to change.
I am indeed on latest Dropbox (v45.4.92) and I am using APFS. I have other computers syncing, and I'll check what disabling does at a more opportune time. I should also test this on a computer still on HFS+ and/or on OS <=10.12.
But, yes, that would solve the issue.
This issue has some weird but expected side effects. With Dropbox syncing on for this computer (regardless of other computers syncing or not), when I took one of these pdf files and fed it to Zotero directly (no parent item), the pdf naming/metadata retrieval server fails to identify the publication and cannot create a parent item. If I remove the offending character, metadata retrieval is successful. I guess the local file is immediately added to the storage folder on Dropbox before being sent to the metadata retrieval server, and before it makes it to the server, its name is changed, and fails to upload.
But, here is the good news. If I remove the offending character from the original local pdf, and let Zotero retrieve bibliographic data and create a parent item, all goes well. So, if I upload files with a name like a.pdf, everything goes smoothly. Renaming works without Dropbox messing up the filename. I can even name it back to a.pdf in Zotero and re-rename from parent metadata, no problem. But, if the local pdf with an é in its filename was added to Zotero, it immediately gets lost and cannot be recovered (due to Dropbox name change).
It appears that the issue arises at the moment of initial pdf being saved in the storage folder. The metadata retrieval server or the local rename from parent metadata don't cause any trouble. And when Zotero renames files by adding special characters, Dropbox has no problem with that, either.
This provides a workaround for me: when I want to add a PDF whose filename contains special characters, I can rename it to whatever.pdf first and proceed.
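If renaming by hand gets tedious, something like the following Python sketch could generate an ASCII-only name before the file is added. ascii_safe_name is a hypothetical helper and the example filenames are made up; it strips accents and drops any character that has no ASCII base letter:

```python
import unicodedata

def ascii_safe_name(filename: str) -> str:
    """Return filename with accents stripped and other non-ASCII characters dropped."""
    decomposed = unicodedata.normalize("NFKD", filename)
    # Drop combining marks (the accents), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard anything still outside ASCII (e.g. ø, which has no decomposition).
    return stripped.encode("ascii", "ignore").decode("ascii")

# The result is the same whether the input is composed or decomposed.
print(ascii_safe_name("Jos\u00e9 Mart\u00ednez 2017.pdf"))    # Jose Martinez 2017.pdf
print(ascii_safe_name("Jose\u0301 Marti\u0301nez 2017.pdf"))  # Jose Martinez 2017.pdf
```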
Thanks.