Find all PDF files that are not linked to any Zotero item

I am using ZotFile to link PDFs to my Zotero library. All PDFs are stored in a directory, with some renaming rules.

Sometimes some of these PDFs are left dangling, with no Zotero item linking to them. Is there a way to find all dangling PDFs in a directory? I would like to either delete them or move them to some other location.
  • Some PDFs in ZotFile's Custom Location can indeed become 'orphaned' if you use Zotero's Move Item to Bin on the item they belong to, because Zotero does not delete linked PDFs. To actually delete such items, or individual linked PDFs, together with their files, always use the delitem add-on instead:
    https://github.com/redleafnew/delitemwithatt

    The first step in cleaning up any orphaned PDFs that have inadvertently occurred is to find them. AFAIK there is no direct way to do that, as Zotero no longer has any record of them. So instead, start from the linked files that Zotero DOES know about: get a list of all item attachments in the Zotero database that are NOT in local Zotero storage. You can get such a list by running the code below under Tools > Developer > Run JavaScript:

    var filepathnames = await Zotero.DB.columnQueryAsync('SELECT path AS filepathnames FROM itemAttachments WHERE path IS NOT NULL AND path NOT LIKE ? ORDER BY path','storage:%');
    return filepathnames.join('\n');

    If you have a Linked Attachment Base Directory set, files under that directory will be listed with the prefix 'attachments:' instead of their actual full path (that is how they are stored in the Zotero database). If you replace that prefix with the Linked Attachment Base Directory location, you get the full path at which Zotero thinks the PDF is located.

    If OTOH you don't have a Linked Attachment Base Directory set, the list will show the actual full path where Zotero thinks each linked file is located (that path will be in the ZotFile Custom Location, i.e. where ZotFile moved the file to).
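
    As a rough illustration of the prefix replacement described above, here is a minimal Node.js sketch (run outside Zotero). The helper name `resolveLinkedPath` and the `baseDir` argument are hypothetical; pass whatever your own Linked Attachment Base Directory is.

    ```javascript
    const path = require('path');

    // Prefix Zotero stores in place of the Linked Attachment Base Directory.
    const BASE_PREFIX = 'attachments:';

    // Hypothetical helper: turn one line of the list into a full path.
    // baseDir is an assumption: your own Linked Attachment Base Directory.
    function resolveLinkedPath(storedPath, baseDir) {
      if (storedPath.startsWith(BASE_PREFIX)) {
        // Swap the prefix for the base directory.
        return path.join(baseDir, storedPath.slice(BASE_PREFIX.length));
      }
      // No prefix: the entry is already a full path, so return it unchanged.
      return storedPath;
    }
    ```

    You would call it once per line of the list, e.g. resolveLinkedPath(line, baseDir).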

    Note that the list may also contain 'dead' links: files that Zotero thinks are at the linked location, but are not actually there (e.g. because you inadvertently moved or deleted them outside of Zotero). We are not considering those here.

    The hard part is now determining which files are in your linked folder but NOT in the list you just got. Those are the actual orphaned linked PDF files. You can do that with code (a batch file, for example) that compares the list to all the files in the linked folder and returns the files that are in the folder but NOT in the list. Alternatively, the code can copy every listed file it finds in the linked folder to a new folder; the files that don't get copied are the orphans. You can then rename the new folder to be the new linked folder (and give the old linked folder some other name). Once you are happy that the new folder is 'working' in Zotero, you can delete the old linked folder.
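
    The comparison could look something like the Node.js sketch below (run outside Zotero; it is not a tested tool). It assumes you saved the list to a text file with one full path per line, with any 'attachments:' prefix already replaced as described above. The function names and both arguments are hypothetical. Paths are compared as exact strings, so differences in letter case or slash direction will make a linked file look like an orphan.

    ```javascript
    const fs = require('fs');
    const path = require('path');

    // Recursively collect the full paths of all files under dir.
    function listFiles(dir) {
      const out = [];
      for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
        const full = path.join(dir, entry.name);
        if (entry.isDirectory()) out.push(...listFiles(full));
        else if (entry.isFile()) out.push(full);
      }
      return out;
    }

    // Return the files on disk that are not in the list of linked paths.
    // Dead links (listed but not on disk) are simply ignored.
    function findOrphans(linkedPaths, filesOnDisk) {
      const linked = new Set(linkedPaths.map((p) => path.resolve(p)));
      return filesOnDisk.filter((f) => !linked.has(path.resolve(f)));
    }

    // Read the saved list (one path per line) and report orphans in linkedFolder.
    function reportOrphans(listFile, linkedFolder) {
      const linkedPaths = fs
        .readFileSync(listFile, 'utf8')
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
      return findOrphans(linkedPaths, listFiles(linkedFolder));
    }
    ```

    Calling reportOrphans with your list file and linked folder returns an array of orphan paths. It only reports; review the result before moving or deleting anything.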
  • For those versed in R, here's a function that allows you to identify unlinked files.

    https://gist.github.com/danielvartan/924817b7e4b69212beb217f339c37a3f

    ```r
    # Packages used below (all calls are namespaced, so attaching is optional):
    # library(checkmate)
    # library(magrittr)
    # library(purrr)
    # library(readr)
    # library(stringr)

    # First export the Zotero library to a CSV file, then pass it as `lib_file`.
    # Returns the attachment paths (or just the file names) listed in the CSV.
    list_linked_files <- function(lib_file = file.choose(),
                                  basename = TRUE) {
      checkmate::assert_file_exists(lib_file, access = "r")
      checkmate::assert_flag(basename)

      out <-
        lib_file |>
        readr::read_csv(col_types = readr::cols(.default = "c")) |>
        magrittr::extract2("File Attachments") |>
        stringr::str_split("; ") |> # One cell may hold several attachments.
        unlist() |>
        stringr::str_squish() |>
        purrr::discard(is.na)

      if (isTRUE(basename)) {
        basename(out)
      } else {
        out
      }
    }

    # Returns the names of files in `file_folder` (top level only) that are
    # not linked to any item in the exported library. Files are matched by name.
    find_orphan_files <- function(lib_file = file.choose(),
                                  file_folder = "G:\\Meu Drive\\Zotero\\files") {
      checkmate::assert_file_exists(lib_file, access = "r")
      checkmate::assert_directory_exists(file_folder, access = "rw")

      linked_files <- list_linked_files(lib_file, basename = TRUE)
      real_files <- list.files(file_folder) |> basename()

      real_files[!real_files %in% linked_files]
    }
    ```