Orphaned attachment - files

komrade · September 27, 2009

When I search for all PDFs in the storage folder, I get about 750, but if I copy all those results into a new folder, there are 150+ duplicates (and some duplicates under different, earlier names for the files).

I understand roughly how these dupes could have been made
http://forums.zotero.org/discussion/3562/orphaned-items/
though I didn't expect there to be that many... (there must be more than before...)

Is there a way to find and remove them from the storage folder? It's about 300MB bigger than it needs to be!

Thanks as always.

dstillman · September 27, 2009

That's an old thread that shouldn't apply anymore.

Are you sure the duplicates are actually orphaned? Have any of them been created recently? Have you copied items with file attachments into group libraries (which would duplicate the files)?

komrade · September 27, 2009

Yeah I'm pretty sure. My database is beautifully clean (has taken days!) - with mostly one PDF attachment per item (no dupes).
But I wanted to just back up all the PDFs from the storage folder, so did a search for *.pdf, then copied them all into another folder. Had to click "do not overwrite" for duplicates about 150 times!
There were also quite a few duplicates with the "old" filename - clearly duplicates of current, (beautifully auto renamed!), files.

Haven't had anything to do with Group libraries.

Almost all have been "created" recently - I exported these (in a few batches) via RDF from another FF profile into a nice clean new profile.
(had a few minor issues with the import/export - but that's in my other thread!)

Is there a way to identify the orphans?

Heckscher · November 3, 2011

I definitely have several hundred files in my storage folder that are duplicates of files in the zotero database but are not linked to zotero themselves. I have no idea how this happened -- perhaps with a Sugarsync operation gone wrong. But as komrade asks, is there any way to identify which files are orphans, and get rid of them? (I'm thinking of exporting the library and re-importing it, but I gather that may lose data.)

dstillman · November 3, 2011

Don't export. Unless you have some extreme need for disk space, just ignore it until there's a better way to find them. Someone will write a plugin/patch/tool for this eventually.

krueschan · November 3, 2011

@Dan: what do you think about this approach to solving this:
- use read API to get all itemKeys of the user's library
- from those generate a batch program/shell script that moves all the subfolders in the zotero storage folder that have a corresponding itemKey into a subfolder "not orphaned"
- everything that's left in the storage folder can be deemed orphaned and dealt with accordingly
- move everything from "not orphaned" back into place

This should only take minutes to code (I am offering) but I am not sure if this would work because
- would this deal with storage from group libraries/collections? (i.e. does the API return items from group libraries when run with a user code?)
- are there possible settings users might have chosen where attachment storage is not organised into subfolders named according to the itemKey?

ajlyon · November 3, 2011

Do this in Python, preferably building off of the libzotero library exposed by Qnotero. It should give you group library access, and be much faster than using the server API. Since libzotero reads from the SQLite database, it knows exactly what Zotero thinks it has-- this can also be used to find missing attachments.

Also, don't do this by moving the directories-- there are file system functions you can use to iterate through the directories without moving them.
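
A minimal sketch of that idea, assuming the item keys have already been collected into a Python set (from the API or from the SQLite database). The function name `find_orphan_dirs` and its parameters are made up for illustration. It only lists subfolders whose names match no known key; nothing is moved or deleted.

```python
import os


def find_orphan_dirs(storage_dir, known_keys):
    """Return names of subfolders of storage_dir that match no known item key."""
    orphans = []
    with os.scandir(storage_dir) as entries:
        for entry in entries:
            # Only folders are compared; loose files in storage are ignored.
            if entry.is_dir() and entry.name not in known_keys:
                orphans.append(entry.name)
    return sorted(orphans)
```

Called as `find_orphan_dirs("/path/to/storage", keys)`, it returns a sorted list that can be reviewed before anything is removed.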

mronkko · November 3, 2011

Here is a solution using Perl. Comes with absolutely no warranty.


#!/usr/bin/perl

use DBI;
use File::Path;

# Update this to match your data directory
my $zoterostorage="/Users/mronkko/Documents/Research/Zotero";
my $zoterostoragefiles="$zoterostorage/storage";
my $dbh = DBI->connect("dbi:SQLite:dbname=$zoterostorage/zotero.sqlite","","");

# Query all PDF attachments

my $sth = $dbh->prepare('SELECT key FROM items' ) or die "Couldn't prepare statement: " . $dbh->errstr;


$sth->execute() or die "Couldn't execute statement: " . $sth->errstr;

print "Fetching all dirs as a array\n";  

opendir DIR,  $zoterostoragefiles or die "cannot open dir $zoterostoragefiles: $!";

my @files = grep { !/^\.\.?$/ } readdir DIR;  # skip "." and ".." so they are never treated as orphans
closedir DIR;

while (my @data = $sth->fetchrow_array()) {
	# Remove the item from the file array, if it is there
	print "Checking $data[0] \n";
	my $index = 0;
	$index++ until $index > $#files || $files[$index] eq $data[0];
	splice(@files, $index, 1) if $index <= $#files;
}

# Loop over the non-existing files
foreach (@files){
	print "Deleting orphaned directory $_ \n";
	# Uncomment the following line to actually delete things
	# rmtree(["$zoterostoragefiles/$_"]);
}

Heckscher · November 6, 2011

Wow! This is the advantage of open source. But I hope someone will explain to non-programmers (like me) how to use this.

mronkko · November 6, 2011

Heckscher: How you use it depends on your operating system.

You can download the script at

https://github.com/mronkko/ZoteroCleanOrphans/raw/master/ZoteroCleanOrphanedFiles.pl

Then you need to open that in a text editor and change line 7 to point to the path of your Zotero data directory.

On Mac, Linux and other Unixes you then run the following command in a terminal:


perl ~/Downloads/ZoteroCleanOrphanedFiles.pl

After you have tested that it works, you remove the hash mark (#) from line 37.

I do not know if Windows comes with Perl, or if you would need to install it yourself.

Also you should note that this script has gone through only a limited amount of testing, so you should take a backup of your zotero data directory before using the script. Please let me know if there are any issues with the script so that I know to fix them.
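
For the backup step, here is a minimal sketch in Python, assuming the whole data directory fits comfortably into one zip archive. The name `backup_data_dir` and both of its parameters are hypothetical. It is probably safest to close Zotero first so the database is not being written to, and to point `backup_parent` somewhere outside the data directory.

```python
import os
import shutil
import time


def backup_data_dir(data_dir, backup_parent):
    """Zip data_dir into backup_parent and return the path of the archive."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = os.path.join(backup_parent, "zotero-backup-" + stamp)
    # make_archive appends ".zip" to base_name and returns the final path.
    return shutil.make_archive(base_name, "zip", root_dir=data_dir)
```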

Heckscher · November 6, 2011

Thanks -- I do use Windows, so I need to figure out how to use perl (it does not appear to be on my computer). That will have to wait till I get a bit of a break -- unless someone is willing to walk a non-perling Windows user through using this script?

Also a query: the script appears to indicate that it is looking only for orphaned PDF files, though I can't find a command that actually does that. Is that the case? Does it then move the non-orphaned PDFs to a new directory, as krueschan had suggested? Or does it just produce a list of orphaned pdfs, then delete them in place (once you delete the line 37 hash mark)?

mronkko · November 6, 2011

The comment says that it is looking for PDF files, but in the actual code the type of the file does not matter. The code will delete the files that are orphaned and will leave the other files where they are.

Heckscher · November 9, 2011

Thanks so much, mronkko, that sounds great. Now I just need to get it to run. (I will be careful and back up first!)

naught101 · April 17, 2012

Hey mronkko, nice script. I appear to have a number of files that aren't in the database, but that are in directories that should still exist. This isn't zotero's fault - I've used zotfile to rename files, and then synched two computers' zotero storage directories using rsync. But I was wondering if there was an easy way to find these files, and not just directories? It's not a huge issue, but it would be nice to clean up my storage directories.

mronkko2 · April 17, 2012

"But I was wondering if there was an easy way to find these files, and not just directories?"
I have not studied the ZotFile code much myself, so do not know. It might be possible.
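
One way this might be approached, as a rough sketch only: read the stored file name of each attachment from the database and compare it with what is actually inside the matching folder. This assumes the `items` table has `itemID` and `key` columns, that `itemAttachments` has `itemID` and `path`, and that stored files are recorded as `storage:<filename>`; check those assumptions against your own zotero.sqlite first. The function name `find_unlinked_files` is made up. Web snapshots keep many helper files in one folder, so they would show up as false positives here, and the Unicode problems discussed elsewhere in this thread can do the same.

```python
import os
import sqlite3


def find_unlinked_files(conn, storage_dir):
    """Return paths of files in known item folders that no attachment row names."""
    # Assumed schema: items(itemID, key) and itemAttachments(itemID, path),
    # with stored files recorded as "storage:<filename>".
    rows = conn.execute(
        "SELECT items.key, itemAttachments.path "
        "FROM itemAttachments JOIN items USING (itemID) "
        "WHERE itemAttachments.path LIKE 'storage:%'"
    )
    expected = {}
    for key, path in rows:
        expected.setdefault(key, set()).add(path[len("storage:"):])

    unlinked = []
    for key, names in expected.items():
        folder = os.path.join(storage_dir, key)
        if not os.path.isdir(folder):
            continue
        for name in os.listdir(folder):
            # Skip hidden helper files (names starting with a dot).
            if name.startswith("."):
                continue
            if name not in names:
                unlinked.append(os.path.join(folder, name))
    return sorted(unlinked)
```

It could be run against a copy of the database, for example with `find_unlinked_files(sqlite3.connect("copy-of-zotero.sqlite"), "/path/to/storage")`, and the resulting list reviewed by hand.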

asokolov · February 8, 2015

For those who have Python, here is a rewrite of mronkko's script:

#!/usr/bin/python

import sqlite3
import os

# Update this to match your data directory
zoterostorage = "/home/......./zotero"

zoterostoragefiles = zoterostorage +  "/storage"
dbh = sqlite3.connect(zoterostorage + "/zotero.sqlite")

# Query all attachments
c = dbh.cursor()

c.execute("SELECT key FROM items")

# Fetching all dirs as a set  
files = set(os.listdir(zoterostoragefiles))

for key in c.fetchall():
    if key[0] in files:
        files.remove(key[0])

# Loop over the non-existing files
print("\n".join(files))

The script only prints the orphaned dirnames. To actually purge them, run this in the zotero/storage directory:
python path-to-script/ZoteroCleanOrphanedFiles.py | xargs rm -r
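
For anyone wary of piping straight into rm -r, a more cautious variant might move the listed folders aside instead of deleting them, so they can be restored if something turns out not to be orphaned. This is only a sketch; `quarantine` and its parameters are hypothetical names, not part of the script above.

```python
import os
import shutil


def quarantine(storage_dir, names, quarantine_dir):
    """Move the named entries out of storage_dir into quarantine_dir."""
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    for name in names:
        source = os.path.join(storage_dir, name)
        # Names that no longer exist in storage are silently skipped.
        if os.path.exists(source):
            shutil.move(source, os.path.join(quarantine_dir, name))
            moved.append(name)
    return moved
```

The quarantine folder can be deleted by hand once Zotero has been used for a while without any missing attachments.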

nealeyoung · October 24, 2016

1) Doesn't handle unicode in file names very well.
2) Doesn't handle pdf files stored as links (attachments outside of zotero storage).

I adapted the above script to handle these two issues. FWIW, here is my script. You will have to modify it for your purposes.

#!/usr/bin/env python3
 
import sqlite3
import os
import sys
import unicodedata

# Update this appropriately (another one below)
db = u"~/Library/Application Support/Zotero/Profiles/ss0s3rzk.default/zotero/zotero.sqlite"

c = sqlite3.connect(os.path.expanduser(db)).cursor()

c.execute('select path from itemAttachments where mimetype = "application/pdf" and linkMode = 2')


def normalized(s):
    return unicodedata.normalize('NFKD', s)


def clean_sqlite(f):
    # return normalized(os.path.realpath(f.encode("latin-1").decode("utf-8")))
    return normalized(os.path.realpath(f.encode(sys.getfilesystemencoding())).decode("utf-8"))

def clean_path(f):
    return os.path.realpath(f)


attachments = set(clean_sqlite(key[0]) for key in c.fetchall())

#update this appropriately
files = [clean_path(os.path.join(root, f))
         for root, dirs, files in os.walk(os.path.expanduser(u"~/Zotero"))
         for f in files]

denorm = {normalized(f): f for f in files}

unattached_files = [denorm[n]
                    for n in set(denorm.keys()).difference(attachments)]

missing_attachments = attachments.difference(set(denorm.keys()))

args = sys.argv[1:]

assert len(args) < 2, "usage"

a = dict(
    attachments=attachments,
    files=files,
    unattached=unattached_files,
    missing=missing_attachments)

if not args:
    for k in sorted(a.keys()):
        print(len(a[k]), k)
else:
    for k, x in a.items():
        flags = ('-' + k[0], '-' + k, k)
        if args[0] in flags:
            print("\n".join(sorted(x)))
            break
    else:
        print("usage")

See here for some more discussion of the unicode issues:
http://stackoverflow.com/questions/17457427/curious-about-unicode-string-encoding-in-python-3
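
A small illustration of why the normalization step matters, as I understand it: the same visible file name can be stored as different sequences of code points, and some file systems (HFS+ on Macs is the usual example) hand back the decomposed form. The file name below is invented for the example.

```python
import unicodedata

composed = "caf\u00e9.pdf"      # "é" as a single code point
decomposed = "cafe\u0301.pdf"   # "e" followed by a combining acute accent

# The two strings look identical on screen but are different code point sequences.
print(composed == decomposed)


def normalized(s):
    return unicodedata.normalize("NFKD", s)


# Once both sides are normalized to the same form, they compare equal.
print(normalized(composed) == normalized(decomposed))
```

The first comparison prints False and the second prints True, which is why the script normalizes both the database paths and the paths found on disk before taking set differences.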

mammadkarbist · March 20, 2018

If you are using ZotFile and saving files in another folder, you can just search for PDF files in the storage directory using software like Directory Opus and either delete them all or manually link them to Zotero.
I changed the above code a little bit to save the contents in a CSV file. I resolved the Unicode problem in the Python code above for myself, which may work for you too.

#!/usr/bin/env python3

import csv
import os
import sqlite3
import unicodedata

# Update this appropriately (another one below)
db = "C:\\Users\\User\\Zotero\\zotero.sqlite"

c = sqlite3.connect(os.path.expanduser(db)).cursor()

c.execute('select path from itemAttachments where contentType = "application/pdf" and linkMode = 2')

def normalized(s):
    return unicodedata.normalize('NFKD', s)

def clean_sqlite(f, i):
    # print(i, f)  # uncomment to see which row a bad path comes from
    # Earlier attempts, kept for reference:
    # return normalized(os.path.realpath(f.encode("latin-1").decode("utf-8")))
    # return normalized(os.path.realpath(f.encode("latin-1").decode("ISO-8859-1")))
    # return normalized(os.path.realpath(f.encode("latin-1").decode("cp1252")))
    return normalized(os.path.realpath(f))

def clean_path(f):
    return os.path.realpath(f)

attachments = set(clean_sqlite(key[0], i) for i, key in enumerate(c.fetchall()))
c.close()

# update this appropriately
files = [clean_path(os.path.join(root, f))
         for root, dirs, files in os.walk(os.path.expanduser("Zotero"))
         for f in files]

denorm = {normalized(f): f for f in files}

unattached_files = [denorm[n]
                    for n in set(denorm.keys()).difference(attachments)]

missing_attachments = attachments.difference(set(denorm.keys()))

# Write one missing attachment path per row
with open("missing.csv", 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([path] for path in sorted(missing_attachments))

nealeyoung · January 17, 2019

I edited my script (two comments up) to change the line

return normalized(os.path.realpath(f.encode("latin-1").decode("utf-8")))

to

return normalized(os.path.realpath(f.encode(sys.getfilesystemencoding())).decode("utf-8"))