find duplicates, merge databases

jwevandijk · November 13, 2007

As a workaround for Zotero's lack of features for finding duplicates and merging databases, I have written two Python scripts that, for my purposes, solve the problem. They are freely available to anyone at ftp://ftp.ecn.nl/pub/nrg/im/ in janwillem/pyzotero, on the condition that you tell me if you can improve on them. I have little experience with Python and none at all with XML and the DOM, so any suggestions will be a learning experience.

The scripts work on MODS/xml exports of zotero.sqlite and are based on parsing the DOM trees of the xml files. This means that these workarounds only work for fields that are common to Zotero and MODS, and not, for example, for "Extra".

The script findDuplicateZoteroMODS.py scans the exported MODS/xml database for possibly identical records, based on an exact match of authors and title. The results are listed.
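
For readers who want to see the exact-match idea in code, here is a minimal sketch in Python 3. It is not the code of findDuplicateZoteroMODS.py: it uses xml.etree.ElementTree instead of the DOM, the names record_key and find_duplicates are made up for illustration, the sample records are invented, and it assumes that the export uses the MODS v3 namespace with authors in name/namePart and the title in titleInfo/title.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the export uses the MODS v3 namespace.
NS = {"m": "http://www.loc.gov/mods/v3"}


def record_key(mods):
    """Build the exact-match key (authors, title) for one <mods> record."""
    authors = []
    for name in mods.findall("m:name", NS):
        parts = [p.text.strip() for p in name.findall("m:namePart", NS) if p.text]
        authors.append(" ".join(parts))
    title = mods.findtext("m:titleInfo/m:title", default="", namespaces=NS).strip()
    return tuple(authors), title


def find_duplicates(root):
    """Map each key that occurs more than once to the positions of its records."""
    groups = defaultdict(list)
    for position, mods in enumerate(root.findall("m:mods", NS)):
        groups[record_key(mods)].append(position)
    return {key: where for key, where in groups.items() if len(where) > 1}


# Invented sample data: records 0 and 2 have the same authors and title.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <titleInfo><title>Merging databases</title></titleInfo>
    <name><namePart type="family">van Dijk</namePart><namePart type="given">J.</namePart></name>
  </mods>
  <mods>
    <titleInfo><title>Something else</title></titleInfo>
    <name><namePart type="family">van Dijk</namePart><namePart type="given">J.</namePart></name>
  </mods>
  <mods>
    <titleInfo><title>Merging databases</title></titleInfo>
    <name><namePart type="family">van Dijk</namePart><namePart type="given">J.</namePart></name>
  </mods>
</modsCollection>"""

for (authors, title), where in find_duplicates(ET.fromstring(SAMPLE)).items():
    print(title, "by", "; ".join(authors), "at records", where)
```

Because the key is compared exactly, a difference in capitalisation or a missing initial is enough to hide a duplicate.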

The script findNewZoteroMODS.py compares a master MODS/xml database, preferably the largest one, say MODS_0.xml, with one or more other MODS/xml databases, MODS_1.xml ... MODS_n.xml. It creates a MODS/xml database, say MODS_new.xml, of the entries that are in one of the other databases but not in the master. This MODS_new.xml can then be imported with Zotero into the zotero.sqlite that produced MODS_0.xml. With this method, zotero.sqlite is only ever updated by Zotero itself.
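
The comparison against a master database could look roughly like the following Python 3 sketch. Again this is an illustration and not the code of findNewZoteroMODS.py: find_new, record_key and write_new are hypothetical names, the match is on authors plus title (an assumption carried over from the duplicate finder), and the MODS v3 namespace is assumed.

```python
import xml.etree.ElementTree as ET

# Assumption: the exports use the MODS v3 namespace.
MODS_URI = "http://www.loc.gov/mods/v3"
NS = {"m": MODS_URI}

# Write MODS as the default namespace instead of an "ns0:" prefix.
ET.register_namespace("", MODS_URI)


def record_key(mods):
    """Build the exact-match key (authors, title) for one <mods> record."""
    authors = []
    for name in mods.findall("m:name", NS):
        parts = [p.text.strip() for p in name.findall("m:namePart", NS) if p.text]
        authors.append(" ".join(parts))
    title = mods.findtext("m:titleInfo/m:title", default="", namespaces=NS).strip()
    return tuple(authors), title


def find_new(master_root, other_roots):
    """Collect the records of the other databases that the master lacks."""
    seen = {record_key(m) for m in master_root.findall("m:mods", NS)}
    new_root = ET.Element("{%s}modsCollection" % MODS_URI)
    for root in other_roots:
        for mods in root.findall("m:mods", NS):
            key = record_key(mods)
            if key not in seen:
                seen.add(key)  # keep only one copy if several files share it
                new_root.append(mods)
    return new_root


def write_new(master_path, other_paths, out_path):
    """File-based wrapper, e.g. write_new("MODS_0.xml", ["MODS_1.xml"], "MODS_new.xml")."""
    master = ET.parse(master_path).getroot()
    others = [ET.parse(path).getroot() for path in other_paths]
    tree = ET.ElementTree(find_new(master, others))
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

The set of seen keys is extended while scanning, so an entry that occurs in both MODS_1.xml and MODS_2.xml ends up in the output only once.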

The scripts are meant to be run from a Python IDE (e.g. IDLE or PythonWin, which come with Python 2.5 distributions) but can of course also be run from the command prompt. They both import defsForZotero.py.

An improvement would be to use a fuzzy search for the duplicate finding, e.g. one based on n-gram similarity. For my modest problems, however, this is not worth the effort.
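
To give an idea of what such a fuzzy search could look like, here is a small Python 3 sketch. It is only one possible reading of "n-gram similarity": it compares the sets of character trigrams of two titles with the Jaccard index. The function names, the trigram size and the 0.8 threshold are all arbitrary choices for illustration, and none of this is in the scripts on the ftp site.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Return the set of character n-grams of a case- and space-normalised text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard index of the two n-gram sets: 1.0 is identical, 0.0 is disjoint."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # both texts are shorter than n characters
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def near_duplicates(titles, threshold=0.8, n=3):
    """List every pair of titles whose similarity reaches the threshold."""
    return [(a, b) for a, b in combinations(titles, 2)
            if similarity(a, b, n) >= threshold]


titles = [
    "Finding duplicates in Zotero",
    "Finding duplicate in Zotero",
    "Merging databases",
]
for a, b in near_duplicates(titles):
    print("possible duplicate:", a, "/", b)
```

Comparing every pair is quadratic in the number of records, which is fine for a personal library but would need some form of indexing for a very large one.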

A fourth script in the ftp dir adds, in a MODS/xml export, a reference code [Author year], like [van Dijk 2007], to the zotero.sqlite field "Call Number". It is included as an example of how the database can be manipulated without directly accessing the sqlite base itself, which all too easily gets converted to zotero.sqlite.damaged.
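
The reference code idea can be sketched in Python 3 as follows. This is not the script from the ftp dir, and it rests on several assumptions that should be checked against a real export: that the first author's family name is in namePart with type="family", that the year can be taken from the first dateIssued element, and that "Call Number" corresponds to the MODS element classification. The names reference_code and add_reference_codes are made up.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the export uses the MODS v3 namespace.
MODS_URI = "http://www.loc.gov/mods/v3"
NS = {"m": MODS_URI}


def reference_code(mods):
    """Build "[Family year]" from the first author and the first 4-digit year."""
    family = mods.findtext("m:name/m:namePart[@type='family']",
                           default="", namespaces=NS).strip()
    date = mods.findtext(".//m:dateIssued", default="", namespaces=NS)
    match = re.search(r"\d{4}", date)
    year = match.group(0) if match else ""
    return "[%s]" % " ".join(part for part in (family, year) if part)


def add_reference_codes(root, field="classification"):
    """Store the code in `field` (assumed to map to "Call Number") of each record."""
    for mods in root.findall("m:mods", NS):
        element = mods.find("m:" + field, NS)
        if element is None:
            element = ET.SubElement(mods, "{%s}%s" % (MODS_URI, field))
        element.text = reference_code(mods)


# Invented sample record.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <titleInfo><title>Merging databases</title></titleInfo>
    <name><namePart type="family">van Dijk</namePart><namePart type="given">J.</namePart></name>
    <originInfo><dateIssued>2007-11-13</dateIssued></originInfo>
  </mods>
</modsCollection>"""

root = ET.fromstring(SAMPLE)
add_reference_codes(root)
print(root.findtext("m:mods/m:classification", namespaces=NS))
```

The modified tree would then be written out and imported with Zotero, so that zotero.sqlite itself is never touched by the script.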