Solution: existing files cannot be found even though they exist

I managed to solve a problem in a way that I didn't see anyone else come up with. To skip the details: cmd/ctrl-F in this page for "touch". That's where the action starts.

This solution worked for 2 different causes:
- Screwed-up filename encoding causes attachments with special characters in the file names not to be recognized (they look like they are properly named, even though they are not)
- File exists in the correct subdirectory of `storage/` but just had the completely wrong name

The (apparent) problem is that zotero doesn't see attachments even though they are (apparently) present. When you try to open the attachment, you get this error:
"The attached file could not be found at the following path:
/path/to/an/actual.pdf
It may have been moved or deleted outside of Zotero, or, if the file was added on another computer, it may not yet have been synced to zotero.org."
When you go to `/path/to/an/`, you see the file `actual.pdf` sitting there plain as day.

The workaround is to manually relink the file. But I had hundreds of files. (The problem itself is my fault due to not keeping a good handle on my storage.)

I don't use Zotero sync, Dropbox, Google Drive etc. However, I have recently moved from Mac to Linux, and in this case the change of file system was likely the problem.

Abridged troubleshooting

Installed the storage scanner plugin. It tags the affected items #broken_attachments. I made a collection with all of the ones that looked straightforward. I only included those with PDF attachments; I will deal with other stuff later, but mostly they are PDFs.

Then I went through the export formats to find one that included the full path to the attachment; it was XML-formatted. I used a regex to extract the paths to the missing attachments, then sorted the list, eyeballed it to see that nothing looked weird, and removed a few.
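
A minimal sketch of the path extraction in Python. The export format is not named above, so this assumes only that the file contains plain absolute paths of the form `.../storage/<8-character key>/<name>.pdf` somewhere in the XML text; the function name and the regex are illustrative, not what was actually used.

```python
import html
import re

# Assumption: plain absolute POSIX paths that pass through a "storage" folder,
# then an 8-character item folder, then a PDF file name.
PATH_RE = re.compile(r'/[^<>"\n]*?/storage/[A-Z0-9]{8}/[^<>"\n/]+\.[Pp][Dd][Ff]')


def extract_attachment_paths(xml_text: str) -> list[str]:
    """Return the sorted, de-duplicated attachment paths found in the export text."""
    # html.unescape turns XML entities such as &amp; back into real characters.
    found = {html.unescape(m.group(0)) for m in PATH_RE.finditer(xml_text)}
    return sorted(found)
```

Sorting the result makes it easier to eyeball the list and prune anything that looks off before acting on it.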

So now I have a list of "missing" files. First by manual spot test, then by a simple script, I confirm: yes, they are all present. A few have different file names (my fault I'm sure) but mostly the names are just as Zotero is looking for. In the end both were solved the same way. :)
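
The presence check can be sketched roughly as follows. This is not the script that was actually run; it goes one step further and also reports files whose names only match the expected name after Unicode normalization, which is an assumption about what "present but invisible" means here. The function names are hypothetical.

```python
import os
import unicodedata
from typing import Optional


def find_equivalent(wanted: str, names: list[str]) -> Optional[str]:
    """Return the first name that equals `wanted` once both are NFC-normalized."""
    wanted_nfc = unicodedata.normalize("NFC", wanted)
    for name in names:
        if unicodedata.normalize("NFC", name) == wanted_nfc:
            return name
    return None


def classify(expected_path: str) -> str:
    """Return 'exact', 'equivalent' or 'missing' for one expected attachment path."""
    if os.path.exists(expected_path):
        return "exact"
    folder, wanted = os.path.split(expected_path)
    if os.path.isdir(folder) and find_equivalent(wanted, os.listdir(folder)) is not None:
        return "equivalent"
    return "missing"
```

An "equivalent" result means a file that looks identical is sitting in the folder, but its name is stored with a different sequence of code points than the one being asked for.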

I noticed something about the unseen attachments: they all have special characters in the filenames. Accents, non-Latin scripts, weird punctuation, emojis etc.

Eventually in the terminal I realized that all these attachments actually had some sort of encoding issue in the filename. I still can't say what it is exactly. I used a cli tool called `chardetect` to attempt to ascertain what was going on; it was inconsistent. I think basically the filenames are (were?) using "mac roman" whereas the database was looking for UTF-8.

As an example, "ä" and this "ä" are two different characters according to the computer. If you copy and paste them into a text editor, place the cursor to the right and hit delete, the first one turns into a regular "a" whereas the second one is deleted entirely. What I had was attachments named, eg, "Fileä.pdf", but zotero was looking for "Fileä.pdf". If these 2 filenames are copy/pasted into a terminal, there is different results, respectively:

"Filea<0308>.pdf"
"Fileä.pdf"

(All the above examples of course subject to change if the forum normalizes things, if your machine is set up differently etc.)
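
The `<0308>` in the terminal output is the code point U+0308, COMBINING DIAERESIS. That suggests the two names are the same text in two Unicode normalization forms (decomposed, NFD, versus composed, NFC), both of which are valid UTF-8, rather than two different encodings. A small Python sketch of the difference, using escapes so that nothing depends on how this page stores the characters (the helper names are illustrative):

```python
import unicodedata

composed = "File\u00e4.pdf"     # "ä" as a single code point, U+00E4
decomposed = "Filea\u0308.pdf"  # "a" followed by U+0308 COMBINING DIAERESIS


def code_points(name: str) -> list[str]:
    """List the code points in a name, so look-alike strings can be told apart."""
    return [f"U+{ord(ch):04X}" for ch in name]


def same_after_normalizing(a: str, b: str) -> bool:
    """True when two names are the same text once both are brought to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here `composed == decomposed` is false, while `same_after_normalizing(composed, decomposed)` is true. A filesystem that compares names byte for byte treats them as two different files.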

Without messing around in the terminal I have no idea how I would ever have figured this out.

In addition to `chardetect`, some other terminal tools I tried were `iconv`, `uconv`, `hexdump`, and `convmv`. This last one seemed like the perfect tool because it is actually purpose-built for this situation: it converts filename character encoding. Unfortunately it threw an error every time I tried to use it... Probably there has been some additional fkup in all the moving around between filesystems from Mac to Linux. But it might work for someone else!! Try it.
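
For anyone who wants to try the direct route first, here is a hedged sketch of what a filename-normalization pass could look like in Python. It assumes the mismatch really is decomposed versus composed Unicode and that Zotero expects the composed (NFC) names; it was not part of the process described in this post, the function names are hypothetical, and it should only ever be run on a backup copy.

```python
import os
import unicodedata


def plan_nfc_renames(root: str) -> list[tuple[str, str]]:
    """Dry run: list (old_path, new_path) for files whose names are not already NFC."""
    plan = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            nfc = unicodedata.normalize("NFC", name)
            if nfc != name:
                plan.append((os.path.join(folder, name), os.path.join(folder, nfc)))
    return plan


def apply_renames(plan: list[tuple[str, str]]) -> int:
    """Rename each planned file, skipping any whose target already exists."""
    done = 0
    for old, new in plan:
        if os.path.exists(new):
            continue  # never overwrite an existing file
        os.rename(old, new)
        done += 1
    return done
```

Reading the names from the directory listing, instead of typing or pasting them, avoids having to reproduce the problematic characters by hand.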

This thread describes the underlying issue in detail by someone who knows what they are talking about: UTF Encoding Issue with Filenames. This is how I came to understand it had to do with moving from Mac to Linux.

Solution

First, I made a backup. And I made more backups as I iterated. And I even restored them a few times. So it was worth the time. And check the work at every step etc.

Using the list of broken files created above (from the XML file), make a separate directory with symlinks to the affected subdirectories of `storage/`. Put this totally outside the Zotero directory so as to only operate on the desired files.
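
A sketch of the symlink step in Python, assuming the list of broken files is a list of full paths and that each attachment lives one level below `storage/`. The function name and the working directory are placeholders.

```python
import os


def link_affected_folders(broken_paths: list[str], workdir: str) -> list[str]:
    """Create one symlink in `workdir` per storage subfolder that holds a broken file."""
    os.makedirs(workdir, exist_ok=True)
    links = []
    # A set collapses several broken files in the same folder into one link.
    for folder in sorted({os.path.dirname(p) for p in broken_paths}):
        link = os.path.join(workdir, os.path.basename(folder))
        if not os.path.lexists(link):
            os.symlink(folder, link, target_is_directory=True)
        links.append(link)
    return links
```

Because the links point at the real folders, anything renamed through the working directory is renamed in `storage/` itself, so the backup still matters.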

Find/replace (in the terminal, `rename` or `fd`/`find` followed by `-exec`; I also used Thunar bulk rename in the GUI for parts) to rename all the existing invisible PDFs, leaving them in their same folder. For example: `X0X0X0X0/Filea<0308>.pdf` --> `X0X0X0X0/Filea<0308>-original.pdf`
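
The same rename could be sketched in Python instead of `rename` or `find`. This is an illustration under the assumption that every PDF in the folder should get the `-original` suffix; the function name is made up.

```python
import os


def add_original_suffix(folder: str) -> list[str]:
    """Rename every PDF in `folder` to <stem>-original.pdf and return the new names."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf" or stem.endswith("-original"):
            continue  # leave non-PDFs and already-renamed files alone
        new_name = f"{stem}-original{ext}"
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        renamed.append(new_name)
    return renamed
```

The names come straight from `os.listdir`, so the odd characters never have to be typed or matched by hand.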

Use `touch` to create empty files where Zotero is expecting to find the files (again with the list of expected files obtained from storage-scanner). For example: `touch X0X0X0X0/Fileä.pdf`
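
For a long list, the `touch` step might look like this in Python. It is a sketch that assumes the expected paths are available as a list of strings exactly as Zotero reports them; the function name is hypothetical.

```python
from pathlib import Path


def create_placeholders(expected_paths: list[str]) -> list[Path]:
    """Create an empty file at each expected path that does not exist yet."""
    created = []
    for path in map(Path, expected_paths):
        # Only touch inside folders that already exist, and never replace a real file.
        if path.parent.is_dir() and not path.exists():
            path.touch()
            created.append(path)
    return created
```

The returned list doubles as a record of which placeholders were made, which helps when cleaning them up later.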

Now that there is a file in the right spot, even though it is an empty file, zotero can work with it. So once you have done this for the whole lot of them, go back in to Zotero. Verify it can see the new empty attachments.

Use ZotFile's attachment renaming to force-rename the files to the same generic name with no metadata. Like just "attachment.pdf", not author, year etc. NOTE: this is why I excluded items with multiple attachments, as I don't know how this would go.

Now, instead of looking for `X0X0X0X0/Fileä.pdf`, Zotero is looking for a file called `X0X0X0X0/attachment.pdf`. Search in your storage folder for `attachment.pdf`. Double check they are all 0 kb in size in case something else slipped in, and delete them.
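
The size check and the delete can be combined in a short Python sketch, assuming the usual one-folder-per-item layout under the directory you pass in (either `storage/` or the directory of symlinks). The function name is illustrative.

```python
from pathlib import Path


def remove_empty_placeholders(root: str, name: str = "attachment.pdf"):
    """Delete zero-byte files called `name` one level below `root`.

    Returns (removed, skipped); skipped holds same-named files that are NOT empty.
    """
    removed, skipped = [], []
    for folder in sorted(Path(root).iterdir()):
        candidate = folder / name
        if not candidate.is_file():
            continue
        if candidate.stat().st_size == 0:
            candidate.unlink()
            removed.append(candidate)
        else:
            skipped.append(candidate)  # something real slipped in; leave it alone
    return removed, skipped
```

Anything that ends up in `skipped` deserves a look by hand before going further.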

Use your rename utility to rename all the remaining PDFs to `attachment.pdf`, always keeping them in their own folder.
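
A possible Python version of this last rename. It assumes exactly one PDF is left in each folder, which is why items with multiple attachments were excluded earlier; folders that break that assumption are left untouched. The function name is made up.

```python
from pathlib import Path
from typing import Optional


def rename_sole_pdf(folder: str, new_name: str = "attachment.pdf") -> Optional[Path]:
    """Rename the only PDF in `folder` to `new_name`; return None if there is not exactly one."""
    base = Path(folder)
    pdfs = [p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
    if len(pdfs) != 1:
        return None  # zero or several PDFs: do nothing and let a human decide
    target = base / new_name
    if pdfs[0].name != new_name:
        pdfs[0].rename(target)
    return target
```

Run it once per affected folder and check the `None` results by hand.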

Then you can go back into zotero, check it worked, and change your renaming rules to how you like. Rename them as you please.




I apologize that this is both really long and possibly lacking in the details that may be required for some. I'm a terrible writer. In particular, I don't know how to generalize the advice to be cross-platform and suit a broad audience. Judging by the various other posts I read while trying to solve this, and by my basic understanding, this is a problem that mono-platform, mono-device users are unlikely to have. There are lots of permutations.



  • However I have recently moved from Mac to Linux and in this case the change of file system was likely the problem.
    How did you copy the files between systems? This wouldn't happen via Zotero sync, and shouldn't happen when copying the data directory to macOS (since APFS is normalization-insensitive), but I could see it happening when copying the directory from macOS to Linux, perhaps depending on the Linux filesystem used.
    this is a problem that mono-platform, mono-device users are unlikely to have
    To be clear, Zotero itself handles this properly when it syncs — Zotero has always been designed to be completely cross-platform compatible. But if you manually copy the data directory between different filesystems such that the filenames are changed, and the target filesystem isn't normalization-insensitive, the filenames would no longer match Zotero's database and it wouldn't be able to find the files.

    We probably could've given you an easier fix that would've relinked these automatically, and we could potentially try to add code to auto-detect this, so I'd encourage others finding this thread to just ask us for help before trying any complicated process like the one described above.
  • No I didn't use sync. I moved manually. I have a giant collection of files that Zotero is part of. And I think there was some kind of problem with sync like for a minute a different account or database or something was used on my machine and it was impossible to get them unmingled.

    The real solution/prevention, that would have saved me all this headache if I had done it in the first place, isn't Zotero sync. It is investing in one of the fast cheap external SSDs to consolidate all the files in one place that can run speedy file operations. Rather than having my files spread between internal SSD and external HDDs over the LAN... that's where I got into trouble with inconsistencies, duplication, uncertainty.

    But since I didn't do that for a long time I had all kinds of issues with files being all over the place. Like presumably the reason I have a bunch of files with flatly the wrong name is due to mismatching the profile and data directory. What's done is done.

    I read a bunch of threads of people with this problem and the best solution I saw was manually rematching the files, assuming no underlying problem with sync/sharing that can be solved.

    I think the problem is with the file names on the filesystem, not with the database or the software. I wasn't able to reliably use any of the applicable cli tools. They all threw strange errors. I had a hard time constructing scripts because the files evade being called specifically. And the solution was to align the filenames with the expectations of the software. So it's hard to blame Zotero. It's just more Mac BS; don't know when I will ever be free of it.

    The basic idea is to create empty files which have the name Zotero is looking for, rename them to be generic, then replace them with the real files once Zotero knows about them. (Directly renaming was impossible due to the file names being corrupted or whatever.) The post is so long because I was trying to explain how to discern if this is even possibly a relevant track to go down. For a lot of people having the generic problem, if it is caused by Dropbox then it's better to resolve that of course. But if everything else is exhausted, then ultimately this might point to the source of a problem and/or the solution. I read a lot of threads, issues, blogs etc of people just being puzzled that their attachments are invisible.
  • edited April 1, 2024
    It's just more Mac BS
    No, again, if you're just talking about Unicode characters in filenames, it's modern macOS that handles this in a user-friendly way, by being insensitive to Unicode normalization. You had a problem because Linux filesystems (apparently) don't do that.

    Again, none of this was likely necessary. If you have a problem in Zotero, and you're going to post here anyway, I'd strongly encourage you to just report it and let us help you.
    I read a lot of threads, issues, blogs etc of people just being puzzled that their attachments are invisible.
    You're misunderstanding those threads. You posted about a very specific issue regarding Unicode normalization in filenames when manually copying files between filesystems — that's not a normal problem that people experience. Most people are just out of storage space and missed the warning on the computer where they added the file.
  • I have also run into poor unicode support in *other* applications when handling diacritics in PDF filenames, which then impacted on Zotero's correct handling. For example when *external* custom code searches for orphaned PDF attachments in a list of PDF filenames. In my case that problem was largely fixed/avoided by using Zotfile's renaming rule 'Remove special characters (diacritics) from filename' in its Advanced Settings. So now none of my (linked/single-folder-location) PDF filenames have diacritics.

    While it's obviously not up to Zotero to make up for all the shortcomings of other software, copying the data folder is recommended by Zotero as one means of transferring libraries between computers, which may have triggered @titusp's issue. Copying will remain a good alternative option over syncing for some use cases (eg fast movement of large libraries, poor internet, exceeded online storage quota, linked attachments), so maybe a caveat/workaround is warranted?
    https://www.zotero.org/support/kb/transferring_a_library