Orphaned attachment files
When I search for all PDFs in my storage folder, I get about 750, but if I copy all those results into a new folder, there are 150+ duplicates (and some duplicates with different, previous names for the files).
I understand roughly how these dupes could have been made
http://forums.zotero.org/discussion/3562/orphaned-items/
though I didn't expect there to be that many... (there must be more than before...)
Is there a way to find and remove them from the storage folder? It's about 300MB bigger than it needs to be!
Thanks as always.
Are you sure the duplicates are actually orphaned? Have any of them been created recently? Have you copied items with file attachments into group libraries (which would duplicate the files)?
But I wanted to just back up all the PDFs from the storage folder, so did a search for *.pdf, then copied them all into another folder. Had to click "do not overwrite" for duplicates about 150 times!
There were also quite a few duplicates with the "old" filename, clearly duplicates of current (beautifully auto-renamed!) files.
Haven't had anything to do with Group libraries.
Almost all have been "created" recently - I exported these (in a few batches) via RDF from another FF profile into a nice clean new profile.
(had a few minor issues with the import/export - but that's in my other thread!)
Is there a way to identify the orphans?
- use read API to get all itemKeys of the user's library
- from those generate a batch program/shell script that moves all the subfolders in the zotero storage folder that have a corresponding itemKey into a subfolder "not orphaned"
- everything that's left in the storage folder can be deemed orphaned and dealt with accordingly
- move everything from "not orphaned" back into place
This should only take minutes to code (I am offering), but I am not sure whether it would work, because:
- would this deal with storage from group libraries/collections? (i.e. does the API return items from group libraries when run with a user code?)
- are there possible settings users might have chosen where attachment storage is not organised into subfolders named according to the itemKey?
Also, don't do this by moving the directories; there are file system functions you can use to iterate through the directories without moving them.
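As a rough illustration of that suggestion, here is a minimal sketch that lists the subfolders of the storage folder and compares their names against a set of known item keys, without moving anything. The function name find_orphaned_dirs and its parameters are made up for this example, and it assumes that each attachment lives in a subfolder named after its item key, which is exactly the point questioned above.

```python
import os

def find_orphaned_dirs(storage_dir, known_keys):
    """Return the names of subfolders of storage_dir that match no known key.

    Hypothetical helper: known_keys is any iterable of item keys, however
    they were obtained (read API, database query, ...).
    """
    known = set(known_keys)
    orphans = []
    # os.scandir iterates over the folder in place; nothing is moved or changed
    with os.scandir(storage_dir) as entries:
        for entry in entries:
            # Only real subfolders are considered; loose files are ignored
            if entry.is_dir(follow_symlinks=False) and entry.name not in known:
                orphans.append(entry.name)
    return sorted(orphans)
```

Because the function only reads the directory listing, it can be run repeatedly on a live storage folder before deciding what to do with the result.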
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use File::Path;
# Update this to match your data directory
my $zoterostorage="/Users/mronkko/Documents/Research/Zotero";
my $zoterostoragefiles="$zoterostorage/storage";
my $dbh = DBI->connect("dbi:SQLite:dbname=$zoterostorage/zotero.sqlite","","") or die "Couldn't connect to database: $DBI::errstr";
# Query the keys of all items (not only PDF attachments)
my $sth = $dbh->prepare('SELECT key FROM items') or die "Couldn't prepare statement: " . $dbh->errstr;
$sth->execute() or die "Couldn't execute statement: " . $sth->errstr;
print "Fetching all dirs as an array\n";
opendir(my $dir, $zoterostoragefiles) or die "cannot open dir $zoterostoragefiles: $!";
# Skip the "." and ".." entries so they are never treated as orphans
my @files = grep { $_ ne '.' && $_ ne '..' } readdir $dir;
closedir $dir;
# Collect the known item keys in a hash for fast lookup
my %known;
while (my @data = $sth->fetchrow_array()) {
    print "Checking $data[0]\n";
    $known{$data[0]} = 1;
}
# Keep only the entries that have no matching item key
@files = grep { !exists $known{$_} } @files;
# Loop over the orphaned directories
foreach (@files) {
    print "Deleting orphaned directory $_\n";
    # Uncomment the following line to actually delete things
    # rmtree(["$zoterostoragefiles/$_"]);
}
You can download the script at
https://github.com/mronkko/ZoteroCleanOrphans/raw/master/ZoteroCleanOrphanedFiles.pl
Then you need to open it in a text editor and change line 7 to point to the path of your Zotero data directory.
On Mac, Linux and other Unixes you then run the following command in terminal
perl ~/Downloads/ZoteroCleanOrphanedFiles.pl
After you have tested that it works, remove the hash mark (#) from line 37.
I do not know if Windows comes with Perl, or if you would need to install it yourself.
Also you should note that this script has gone through only a limited amount of testing, so you should take a backup of your zotero data directory before using the script. Please let me know if there are any issues with the script so that I know to fix them.
Also a query: the script appears to indicate that it is looking only for orphaned PDF files, though I can't find a command that actually does that. Is that the case? Does it then move the non-orphaned PDFs to a new directory, as krueschan had suggested? Or does it just produce a list of orphaned pdfs, then delete them in place (once you delete the line 37 hash mark)?
I have not studied the ZotFile code much myself, so do not know. It might be possible.
#!/usr/bin/python
import sqlite3
import os
# Update this to match your data directory
zoterostorage = "/home/......./zotero"
zoterostoragefiles = zoterostorage + "/storage"
dbh = sqlite3.connect(zoterostorage + "/zotero.sqlite")
# Query the keys of all items
c = dbh.cursor()
c.execute("SELECT key FROM items")
# Fetching all dirs as a set
files = set(os.listdir(zoterostoragefiles))
for key in c.fetchall():
    if key[0] in files:
        files.remove(key[0])
# Print the remaining (orphaned) directory names
print("\n".join(files))
The script only prints the orphaned directory names. To actually purge them, run the following in the zotero/storage directory:
python path-to-script/ZoteroCleanOrphanedFiles.py | xargs rm -r
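Since piping straight into rm -r leaves no room for a second look, a more cautious purge step can be sketched in Python. This is only an illustration under assumptions: purge_dirs and its dry_run flag are names invented here, and the list of names is taken to be the output of a script like the one above.

```python
import os
import shutil

def purge_dirs(storage_dir, names, dry_run=True):
    """Delete the named subfolders of storage_dir; with dry_run, only report them.

    Hypothetical helper. Returns the names that were (or would be) removed.
    """
    storage_real = os.path.realpath(storage_dir)
    removed = []
    for name in names:
        target = os.path.realpath(os.path.join(storage_dir, name))
        # Refuse anything that is not a direct child folder of the storage
        # folder, so entries such as "." or ".." can never be deleted
        if os.path.dirname(target) != storage_real or not os.path.isdir(target):
            continue
        if not dry_run:
            shutil.rmtree(target)
        removed.append(name)
    return removed
```

Calling it first with dry_run=True gives a list to inspect; only a second call with dry_run=False deletes anything. Taking a backup of the data directory beforehand is still advisable.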
2) Doesn't handle pdf files stored as links (attachments outside of zotero storage).
I adapted the above script to handle these two issues. FWIW, here is my script. You will have to modify it for your purposes. See here for some more discussion of the unicode issues:
http://stackoverflow.com/questions/17457427/curious-about-unicode-string-encoding-in-python-3
I changed the above code a little bit to save the contents in a CSV file. I resolved the Unicode problem in the python code above for myself, which may work for you too.
#!/usr/bin/env python3
import csv
import os
import sqlite3
import unicodedata
# Update this appropriately (another one below)
db = "C:\\Users\\User\\Zotero\\zotero.sqlite"
c = sqlite3.connect(os.path.expanduser(db)).cursor()
# linkMode = 2 selects linked files (attachments stored outside Zotero storage)
c.execute("select path from itemAttachments where contentType = 'application/pdf' and linkMode = 2")

def normalized(s):
    # Normalize Unicode so that equivalent paths compare equal
    return unicodedata.normalize('NFKD', s)

def clean_sqlite(f):
    # Alternative decodings tried for paths read from the database:
    #return normalized(os.path.realpath(f.encode("latin-1").decode("utf-8")))
    #return normalized(os.path.realpath(f.encode("latin-1").decode("ISO-8859-1")))
    #return normalized(os.path.realpath(f.encode("latin-1").decode("cp1252")))
    return normalized(os.path.realpath(f))

def clean_path(f):
    return os.path.realpath(f)

attachments = set(clean_sqlite(key[0]) for key in c.fetchall())
c.close()
# update this appropriately
files = [clean_path(os.path.join(root, f))
         for root, dirs, names in os.walk(os.path.expanduser("Zotero"))
         for f in names]
denorm = {normalized(f): f for f in files}
unattached_files = [denorm[n]
                    for n in set(denorm.keys()).difference(attachments)]
missing_attachments = attachments.difference(set(denorm.keys()))
# Write one missing attachment path per row
with open("missing.csv", 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([p] for p in sorted(missing_attachments))
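To show why the normalized() helper in the script above matters, here is a small self-contained example. The file name is made up; the point is that the same visible name can be stored as different code point sequences (for instance by the database and by the file system), and the two only compare equal after normalization.

```python
import unicodedata

# The same visible file name in two Unicode forms (hypothetical example name)
composed = "caf\u00e9.pdf"     # 'é' as a single code point
decomposed = "cafe\u0301.pdf"  # 'e' followed by a combining acute accent

def normalized(s):
    # Same normalization form as in the script above
    return unicodedata.normalize("NFKD", s)

print(composed == decomposed)                          # False
print(normalized(composed) == normalized(decomposed))  # True
```

Without this step, a path read from the database and the same path read from disk could be reported as a mismatch even though they refer to the same file.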