Finding duplicates using a Python script

Not the most elegant way but it works for me.
1) Export the Zotero database (or the collection you want to test) to a MODS XML file.
2) Run the Python script "findDuplicateZoteroMods.py" (pasted below).
3) Manually remove the duplicate entries in Zotero itself.

Please consider the script a first attempt. As I am ignorant of the details of XML and the DOM, there may be a lot to improve upon. My reference list is several hundred entries long; for very large lists the DOM approach might be too memory intensive.
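On the memory point: a streaming parse avoids building the whole tree at once. As a sketch only (not part of the script below, and written for modern Python 3 with `xml.etree.ElementTree`; the namespace URI is the standard MODS v3 one, assumed to match the export):

```python
# Sketch: stream over <mods> entries with iterparse instead of a full DOM
import xml.etree.ElementTree as ET
from io import BytesIO

MODS_NS = "{http://www.loc.gov/mods/v3}"  # assumed MODS v3 namespace

def iter_mods(source):
    """Yield one <mods> element at a time, clearing each after use."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == MODS_NS + "mods":
            yield elem
            elem.clear()  # free the subtree so memory stays bounded

# tiny self-contained demo document
doc = b"""<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>First</title></titleInfo></mods>
  <mods><titleInfo><title>Second</title></titleInfo></mods>
</modsCollection>"""

titles = [m.findtext(MODS_NS + "titleInfo/" + MODS_NS + "title")
          for m in iter_mods(BytesIO(doc))]
print(titles)
```

Because each `<mods>` subtree is cleared after it is processed, peak memory stays roughly one entry rather than the whole file.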

It was tested with Python 2.5 on Windows XP; I run it from within the Python GUI IDLE.

As indentation is significant in Python, the copy-paste into this message may have mutilated the file. Please mail me (see the sixth comment line in the script) if you're interested in the original file.

(is there a more elegant method for this forum than pasting sources in a message?)

# Find duplicates in a MODS export of a Zotero database
# Version 6 Oct 2007
# Tested on Python 2.5
# Janwillem van Dijk
# Amersfoort, The Netherlands
# jwevandijk at xs4all dot nl

from xml.dom import minidom # building and parsing DOM
import hashlib # calculating a hash
from Tkinter import * # for a file-open dialog
from tkFileDialog import askopenfilename

fullcheck = False  # False: compare on authors+title only
                   # True: compare the entire MODS record

def authors(reference):  # all authors into a string
    names = reference.getElementsByTagName("name")
    npers = names.length
    i = 0
    s = ""
    for name in names:
        i += 1
        parts = name.getElementsByTagName("namePart")
        for part in parts:
            try:
                s += part.lastChild.nodeValue
                if part.getAttribute("type") == "family":
                    s += ", "
            except AttributeError:  # name part empty
                pass
        if i < npers:
            s += "and "
    return s

def title(reference):  # first title into a string
    s = ""
    titles = reference.getElementsByTagName("title")
    if titles.length > 0:
        s = titles[0].lastChild.nodeValue
    return s

fname=""
#fname="My Library.xml" # uncomment for testing with default filename

# start a file-open dialog
if fname == "":
    root = Tk()
    root.withdraw()
    fname = askopenfilename(filetypes=[("MODS-files", "*.xml"), ("All files", "*")])

reftree = minidom.parse(fname) # create the DOM-tree
treenodes = reftree.childNodes # make list of the nodes

# make list of the MODS entries in the references file
reflist=treenodes[0].getElementsByTagName("mods")

n=reflist.length
print "%d entries in database" % (n)

# make a hash list of the mods entries
print "Starting hash"
i = 0
hashlist = []
for nref in reflist:
    if fullcheck:
        xmlstr = nref.toxml()
    else:
        xmlstr = authors(nref) + title(nref)
    xmlstr = xmlstr.encode("utf-8")  # hashlib does not accept unicode strings
    xmlstr = xmlstr.lower()  # some refs might be identical but for the case
    xmlhash = hashlib.sha256(xmlstr).hexdigest()
    hashlist.append([xmlhash, i])  # save hash and index in database
    i += 1

#find duplicate hash entries in the list
print "Starting scan"
hashlist.sort()  # sort so that identical hashes become adjacent
ndup = 0
i = 1
while i < n:
    if hashlist[i][0] == hashlist[i-1][0]:
        print "=> record %d equals %d" % (hashlist[i-1][1], hashlist[i][1])
        r = reflist[hashlist[i-1][1]]  # print info of 1st duplicate
        print "%5d %s" % (hashlist[i-1][1], authors(r))
        print "      %s" % title(r)
        r = reflist[hashlist[i][1]]  # print info of 2nd duplicate
        print "%5d %s" % (hashlist[i][1], authors(r))
        print "      %s" % title(r)
        ndup += 1
    i += 1

print "\n%d duplicates in %s" % (ndup,fname)
print "Scan ready"
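For anyone on a newer system: the core idea of the script (hash authors+title case-insensitively, sort the hashes, and scan adjacent entries) ports to Python 3 roughly as follows. This is a sketch only, with made-up sample records, not the original script; note that hashlib there likewise wants bytes:

```python
# Sketch of the hash-sort-scan duplicate check, in Python 3
import hashlib

def ref_key(authors, title):
    """Case-insensitive SHA-256 key over authors+title, as the script does."""
    return hashlib.sha256((authors + title).lower().encode("utf-8")).hexdigest()

# hypothetical sample records: (author string, title string)
refs = [
    ("Smith, J.", "A Study of Ducks"),
    ("Jones, A.", "Pond Ecology"),
    ("Smith, J.", "a study of ducks"),  # same reference, different case
]

# sort (hash, index) pairs so identical hashes become adjacent
hashlist = sorted((ref_key(a, t), i) for i, (a, t) in enumerate(refs))
dups = [(hashlist[i - 1][1], hashlist[i][1])
        for i in range(1, len(hashlist))
        if hashlist[i][0] == hashlist[i - 1][0]]
print(dups)  # pairs of record indices flagged as duplicates
```

Sorting the hashes turns the duplicate search into a single linear scan, which is why the script above does the same.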