Find all PDF files that are not linked to any Zotero item

b0c5 · August 13, 2023

I am using ZotFile to link PDFs to my Zotero library. All PDFs are stored in a directory, with some renaming rules.

Sometimes some of these PDFs are left dangling, with no Zotero item linking to them. Is there a way to find all dangling PDFs in a directory? I would like to either delete them or move them to some other location.

tim820 · August 13, 2023

Some PDFs in Zotfile's Custom Location can indeed become 'orphaned' if one uses Zotero's Move Item to Bin on the item to which they belong. Zotero does not delete linked PDFs. So one should always use the delitem addon to actually delete such items, or individual linked PDFs.
https://github.com/redleafnew/delitemwithatt

The first step in cleaning up any orphaned PDFs that have inadvertently occurred is to find them. AFAIK there is no direct way to do that, as Zotero no longer has any record of them. So the first step is instead to find the linked files that Zotero DOES know about. That is done by getting a list of all item attachments in the Zotero database that are NOT in local Zotero storage. You can run the code below under Tools > Developer > Run Javascript to get such a list:

var filepathnames = await Zotero.DB.columnQueryAsync('SELECT path AS filepathnames FROM itemAttachments WHERE path IS NOT NULL AND path NOT LIKE ? ORDER BY path','storage:%');
return filepathnames.join('\n');

If you have a Linked Attachment Base Directory set, each file name will be listed with the prefix 'attachments:' instead of its actual full path (that is how it is stored in the Zotero database). But if you replace that prefix by the Linked Attachment Base Directory location, you get the full path at which Zotero thinks the PDF is located.
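The prefix replacement described above could be scripted roughly as follows. This is only a sketch: the function name `expand_attachment_paths` and the file names in the usage line are made up for illustration, and it assumes the list was saved with one path per line and that the base directory is joined to the relative path with a forward slash.

```shell
# Hypothetical helper: turn 'attachments:relative/path.pdf' into a full path.
# $1    = the Linked Attachment Base Directory (assumed to be known to you)
# stdin = the list returned by Run Javascript, one path per line
expand_attachment_paths() {
  base_dir="${1%/}"   # drop one trailing slash, if present
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      # Lines with the prefix: swap the prefix for the base directory.
      attachments:*) printf '%s/%s\n' "$base_dir" "${line#attachments:}" ;;
      # Anything else is passed through unchanged.
      *)             printf '%s\n' "$line" ;;
    esac
  done
}
```

It might be used as `expand_attachment_paths "/path/to/base_directory" < zotero_paths.txt > full_paths.txt`, where both file names are placeholders.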

If OTOH you don't have a Linked Attachment Base Directory set, the actual full path where Zotero thinks the linked file is located will be listed (that path will be the Zotfile Custom Location, ie where Zotfile moved the file to).

Note that the list may also contain 'dead' links - files that Zotero thinks are at the linked location, but are not actually there (eg because outside of Zotero you inadvertently moved or deleted them). We are not considering those here.

The hard part is now determining which files are in your linked folder, but NOT in the list you just got - the actual orphaned linked PDF files. You can do that with code (a batch file for example), that compares the list just created to all the files that are in the linked folder; and then returns files in that folder but NOT in the list. Or the code can just copy the listed files found in the folder to a new folder. Files that don't get copied are the orphans. You can then rename the new folder to be the new linked folder (and the old linked folder to some other name). Once you are happy that the new folder is 'working' in Zotero, you can delete the old linked folder.
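As one possible shape for that comparison, here is a small shell sketch rather than a batch file. It is not the script referred to above; `find_orphans` is a hypothetical name, and it assumes the list holds paths relative to the linked folder (the 'attachments:' prefix or the folder path already removed), one per line, with Unix line endings and no newlines inside file names.

```shell
# Hypothetical helper: print files that exist under the linked folder
# but do not appear in the list of paths that Zotero knows about.
# $1 = text file with one relative path per line
# $2 = the linked folder to scan
find_orphans() {
  list_file="$1"
  linked_dir="${2%/}"
  # List every regular file relative to the linked folder, then keep only
  # the lines that match no line of the list exactly (-F fixed strings,
  # -x whole line, -v invert the match).
  ( cd "$linked_dir" && find . -type f | sed 's|^\./||' | sort ) |
    grep -Fxv -f "$list_file"
}
```

Something like `find_orphans list.txt "/path/to/linked_folder" > orphans.txt` would then give a list to review before deleting or moving anything. Matching is byte-for-byte, so a name stored with a different encoding of a diacritic would be reported as an orphan.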

danielvartan · February 4, 2024

For those versed in R, here's a function that allows you to identify unlinked files.

https://gist.github.com/danielvartan/924817b7e4b69212beb217f339c37a3f

```r
# Required packages (called with explicit namespaces below):
# library(checkmate)
# library(magrittr)
# library(purrr)
# library(readr)
# library(stringr)

# First, export the Zotero library to a CSV file and pass it as `lib_file`.

# List the attachment paths recorded in the "File Attachments" column.
list_linked_files <- function(lib_file = file.choose(),
                              basename = TRUE) {
  checkmate::assert_file_exists(lib_file, access = "r")
  checkmate::assert_flag(basename)

  out <-
    lib_file |>
    readr::read_csv(col_types = readr::cols(.default = "c")) |>
    magrittr::extract2("File Attachments") |>
    stringr::str_split("; ") |> # One item can have several attachments.
    unlist() |>
    stringr::str_squish() |>
    purrr::discard(is.na)

  # Return file names only, or the full paths as exported.
  if (isTRUE(basename)) {
    basename(out)
  } else {
    out
  }
}

# Return the files in `file_folder` that no Zotero item links to.
# Files are compared by name only; the folder is not scanned recursively.
find_orphan_files <- function(lib_file = file.choose(),
                              file_folder = "G:\\Meu Drive\\Zotero\\files") {
  checkmate::assert_file_exists(lib_file, access = "r")
  checkmate::assert_directory_exists(file_folder, access = "rw")

  linked_files <- list_linked_files(lib_file, basename = TRUE)
  real_files <- list.files(file_folder) |> basename()

  real_files[!real_files %in% linked_files]
}
```

nicolas.lienart · October 7, 2024

How to Clean Orphaned Linked Attachments in Zotero (Linux, Windows, MacOS)

Based on @tim820’s solution, here is a simplified approach I use to manage and clean orphaned linked attachment files in Zotero. This is especially useful if you sync your attachments using a cloud service (like Mega) and want to avoid bloating your account with files that are no longer linked to Zotero items.

Steps:

Get a list of linked attachments:
Open Zotero's Javascript Console by going to Tools > Developer > Run Javascript.
Paste the following two lines of code and click Run:


var filepathnames = await Zotero.DB.columnQueryAsync('SELECT path AS filepathnames FROM itemAttachments WHERE path IS NOT NULL AND path NOT LIKE ? ORDER BY path','storage:%');
return filepathnames.join('\n');

A list of file paths will appear in the Return value pane on the right.
Select all of this content, then copy and paste it into a text editor.

Process the file paths:

If you have a Base Directory set (found in Edit > Preferences > Advanced > Files and Folders > Linked Attachment Base Directory), the file names will appear with the prefix attachments: instead of their full path.

Keep all lines that start with attachments: (these are the files Zotero knows about).

Replace "attachments:" with nothing (using find and replace) to remove the prefix, leaving just the relative file paths.
Save this list as a text file (e.g., list_of_file_paths.txt).
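If you would rather not do the find and replace by hand, the same processing could be sketched in the shell. The function name `make_relative_list` and the input file name in the usage line are hypothetical; the sketch assumes the pasted output was saved as-is, one path per line, and it also drops carriage returns in case the file was saved by a Windows editor.

```shell
# Hypothetical helper: keep only lines starting with 'attachments:'
# and strip that prefix, leaving paths relative to the base directory.
# $1 = text file holding the pasted output from Run Javascript
make_relative_list() {
  # tr removes Windows carriage returns; sed prints only the lines
  # on which the prefix substitution succeeded.
  tr -d '\r' < "$1" | sed -n 's/^attachments://p'
}
```

For example: `make_relative_list pasted_output.txt > list_of_file_paths.txt`.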

Clean up orphaned files:

Close Zotero and stop any cloud syncing services (like Mega) to avoid conflicts.
Make a backup of the Zotero base directory using your file explorer.
Delete the internal contents of the current base directory but keep the root directory.

Repopulate the base directory:

Use a bash terminal, available on MacOS, Linux or Windows 10 and above via WSL.
Run the following rsync command to copy only the files listed in list_of_file_paths.txt (created earlier) from your backup to the now-empty Zotero base directory. The paths in the list are read relative to the backup directory:


rsync -a --files-from="/path/to/my/list_of_file_paths.txt" "/path/to/my/backup_base_directory/" "/path/to/zotero_base_directory/"

Final checks:

Reopen Zotero and ensure the attachments are working.
Restart any cloud syncing services you paused.
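As an optional extra check, a short script could confirm that every listed file actually arrived in the base directory. This is only a sketch under assumptions: `report_missing` is a hypothetical name, and the list is expected to contain relative paths, one per line, with Unix line endings.

```shell
# Hypothetical helper: print every listed path that is NOT present
# as a regular file under the base directory.
# $1 = list of relative paths (e.g. list_of_file_paths.txt)
# $2 = the Zotero base directory that was repopulated
report_missing() {
  list_file="$1"
  base_dir="${2%/}"
  while IFS= read -r rel || [ -n "$rel" ]; do
    [ -n "$rel" ] || continue                     # skip blank lines
    [ -f "$base_dir/$rel" ] || printf '%s\n' "$rel"
  done < "$list_file"
}
```

Running `report_missing list_of_file_paths.txt "/path/to/zotero_base_directory/"` should print nothing if all listed files were copied. Anything it does print is either a file rsync skipped or a link that was already dead before the cleanup.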

tim820 · October 8, 2024

@nicolas.lienart did you encounter any issues with diacritics in filenames when using rsync? I struck that problem with the different coding approach that I used (a DOS batch file). I eventually realized I had to tell it to use a different character set from its default in order for diacritics to be recognized. I notice some internet chatter that rsync may have some issues with handling diacritics correctly.

In order that I would not strike that problem again, I then turned ON Zotfile's option to remove diacritics from attachment file names. Which is one of several reasons why I think Zotero v7's new file renaming scheme should have the option to remove diacritics like Zotfile did.
https://forums.zotero.org/discussion/comment/469757

nicolas.lienart · October 8, 2024

@tim820, I haven't had any issues so far.
Files with accented characters were copied successfully in my case, as were files with the following characters in their names (among many others): ê “ ° ’ ( % , « é _ à

Maybe that's because I used rsync's --files-from= option to provide the list of files to be copied.