Zotero RDF for data housekeeping

Using Zotero RDF as a backup is strongly discouraged, e.g.:

Documentation:
Warning — Import/Export: Zotero allows you to export your Zotero library as a Zotero RDF file. However, exporting and importing your library via RDF won't result in an exact copy of your library, and it isn't recommended as a backup strategy.
and frequently on the Forum:

adamsmith Jan 25th 2012:
Exporting and re-importing your library to clean it up is not recommended. There may be marginal data-loss, but more importantly it breaks item IDs, which will break connections to Word documents and will cause havoc if/when you use the sync feature. I would strongly recommend you stop doing that.
I wonder if it is documented what exactly the "marginal data-loss" involves beyond item IDs? Is any of the metadata and/or the notes and files in danger of silent modification?

A plain text dump of all data is extremely useful for many purposes, the most important (for me) being easy data housekeeping. Editing raw data in a text editor with Regular Expressions is much easier, and allows more sophisticated edits, than risking JavaScript API scripts (especially if one is not a JS person, like myself); a rough sketch of what I mean follows the quote below. As I can see from searching the Forum archives, a find & replace feature has been requested for years (sometimes called "batch editing"), and recent information is that it is planned for the next version (4.0?).

adamsmith Apr 29th 2012
we don't have search & replace yet, no - good chance it's going to be in the next Zotero version - it's one of the top two priorities. (It _can_ already be done via the JavaScript API if you really know what you're doing.)
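
For illustration, this is the kind of text-editor pass I mean on the exported file -- a rough Node.js sketch only, where the file name, the dc:title element and the single replacement rule are just placeholders:

    // regex-housekeeping.js -- illustrative sketch only; the file name and the
    // replacement rules are placeholders. It writes a new file so the result
    // can be diffed against the original export before re-importing.
    const fs = require('fs');

    const input  = 'My Library.rdf';            // assumed name of the RDF export
    const output = 'My Library.cleaned.rdf';

    let text = fs.readFileSync(input, 'utf8');

    // Example rule: collapse a run of doubled spaces inside <dc:title> elements.
    const rules = [
      { pattern: /(<dc:title>[^<]*?) {2,}([^<]*<\/dc:title>)/g, replacement: '$1 $2' },
    ];

    for (const { pattern, replacement } of rules) {
      const matches = text.match(pattern) || [];
      console.log(`${pattern} -> ${matches.length} match(es)`);
      text = text.replace(pattern, replacement);
    }

    fs.writeFileSync(output, text, 'utf8');
    console.log(`Wrote ${output}; review the diff before re-importing.`);

One would run it with node, diff the output against the original export, and only re-import after checking every change.
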
On the other hand, while I understand the problem of breaking IDs, I do not see why any other data should be lost. Unlike the other export/import formats, where the mapping between fields and allowed content obviously must cause data loss, Zotero RDF is Zotero's own format and should not leave anything out. Or am I wrong?

I would think that doing data housekeeping in this way should be possible provided that: (1) I make sure all my MSWord/OpenOffice documents have Zotero codes removed; (2) syncing is switched off; (3) the database is emptied and synced before importing.

Would that work and be safe?
  • One obvious thing you're losing is the date added, which I think is very useful information. I don't think there is anything else, but Dan would have to say for sure. (if he knows - one reason to put a disclaimer like this on non-supported/recommended practices is so that you can't complain if something doesn't work).

    Other issues to think about: at some size of library this may just crash, since Zotero loads the entire RDF into memory before exporting it. A couple of thousand items works, but ten thousand might not, depending on your computer.

    RDF import is very reliable, but it _does_ break occasionally (one wrong character, a file link that your OS doesn't like, etc.), and because of the structure of XML that's harder to deal with: you can't just split the file into chunks like with BibTeX or RIS.

    And just so we're clear on this: none of your old Word documents created with Zotero would work anymore - that seems like a very steep price to pay.

    There may be other issues I'm not thinking of. I really wouldn't do it and if you do you're pretty much on your own.
  • Thanks again. This still does not sound that bad. "Date added" may be useful but certainly is not crucial to data integrity, so I may well sacrifice it. BTW: if it were exported, it would then be easy to move it to an unused field or create a note with it in the text file. But I am not much concerned with this.

    Memory problems with exporting seem to be an issue. Searching the Forum I see people had to face it with large libraries. I do not expect tens of thousands of items in the particular Group project I am starting with Zotero, but thinking ahead it may mean there will be no way to move data relatively losslessly from Zotero to other software (and while I like Zotero very much and wish it all the best, I would not like my data to be "arrested" within it). I had to move my bibliographic data from one program to another three times in my life, and it was always a horrific experience, as the foreign formats never included everything and truncated or otherwise changed data (for the obvious reason of different data models). So the only solution was always to use the native export format and write scripts to reformat the plain text file into something acceptable to the new program. And besides, it is always nice to have a human-readable version (the Report is fine, too, for that purpose).

    Splitting the RDF should not be that difficult, as I see all items, memos and attachments are top-level elements (a rough sketch of the idea follows at the end of this post). The problem would be ascertaining that multiple links all end up within the same chunk.

    OK. Thank you for indicating where the dangers are. I have been warned :-)
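
    Something along these lines is what I imagine for the splitting -- a rough Node.js sketch only: it assumes the third-party @xmldom/xmldom package, assumes that items, notes and attachments sit directly under the rdf:RDF root (as I see in my exports), and makes no attempt to keep linked items in the same chunk, which is the hard part:

        // split-rdf.js -- rough sketch: writes the export out again as files of
        // CHUNK_SIZE top-level elements each. It does NOT try to keep
        // cross-referenced items, notes and attachments in the same chunk.
        const fs = require('fs');
        const { DOMParser, XMLSerializer } = require('@xmldom/xmldom');

        const CHUNK_SIZE = 500;                        // arbitrary
        const xml = fs.readFileSync('My Library.rdf', 'utf8');   // assumed file name
        const doc = new DOMParser().parseFromString(xml, 'text/xml');
        const root = doc.documentElement;              // <rdf:RDF ...>
        const serializer = new XMLSerializer();

        // Collect the element children of the root: items, notes, attachments.
        const children = [];
        for (let i = 0; i < root.childNodes.length; i++) {
          const n = root.childNodes[i];
          if (n.nodeType === 1) children.push(n);
        }

        // Reproduce the root tag, with its namespace declarations, in every chunk.
        let attrs = '';
        for (let i = 0; i < root.attributes.length; i++) {
          const a = root.attributes.item(i);
          attrs += ' ' + a.nodeName + '="' + a.nodeValue + '"';
        }

        for (let i = 0; i < children.length; i += CHUNK_SIZE) {
          const body = children.slice(i, i + CHUNK_SIZE)
            .map(n => serializer.serializeToString(n)).join('\n');
          const chunk = '<?xml version="1.0" encoding="UTF-8"?>\n' +
            '<' + root.tagName + attrs + '>\n' + body + '\n</' + root.tagName + '>\n';
          fs.writeFileSync('chunk-' + (i / CHUNK_SIZE) + '.rdf', chunk, 'utf8');
        }

    Joining the chunks back for clean-up would be the reverse: strip the per-chunk root tags and wrap everything in a single rdf:RDF element again. Note that this still reads the whole export into memory, so it only helps on the import/chunking side, not with an export that already fails.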
  • (To get the data for a huge library out of Zotero in Zotero RDF, you can just export in smaller batches, e.g. one collection at a time. That's not a big help if you want to do data clean-up, but for data portability it's perfectly sufficient.)
  • This is quite enough! One can even make temporary collections just for the purpose, if needed. And the resulting files may be joined for cleaning up and then split again in the same places. Sounds nice! Thanks a lot!
  • You can't currently import into a group library, so you'd have to import the items back into your personal library and drag them to the group library from there.

    If this is a group library shared with other people, you really shouldn't do this—along with the other problems, you'll be requiring all other members of the group to resync all items. Handling of deleted items via sync is also somewhat non-optimal at the moment, so the group will permanently carry around a long delete history (for each time you do this) that could cause syncing problems.

    I assume you've seen this?
  • Thank you. Wonderful support and a nice place!

    Maybe I will never need such a procedure (now I hope so), but working on Group Libraries with many people, as I expect to, some data maintenance will surely be needed. And certainly not often, perhaps once every few months or so. But I will keep in mind that this is risky.

    Yes, I saw the JavaScript API find/replace script -- it may be helpful in simple situations of replacing text in a given field. What I would like, however, is more complex functionality with Regular Expressions. It is also safer to do this in a text file, as one may proceed semi-manually, actually seeing the string to be replaced and deciding what to do.

    About the syncing problems with the Group: would this work? (1) move all items from the Group into an empty Old Collection in the Personal Library; (2) export to RDF; (3) import the processed RDF into an empty New Collection (to be sure); (4) create a new Zotero Group; (5) move all items from the Personal Library to the New Group Library; (6) delete the Old Group; (7) tell all members to join; (8) delete the Old Collection from the Personal Library. Quite convoluted, but should it work? Say, twice a year? :-)
  • edited June 14, 2012
    You can use JS regular expressions in a batch script; by tweaking the code, you should be able to confirm the list of affected entries before running an actual edit (a sketch follows at the end of this post).

    As Dan says, it's not a good idea to do mass deletes and RDF imports in a library or group. Apart from the efficiency problems that Dan mentions, it will break all documents that your users have built using the content of the library before the change (as adamsmith points out). Keeping local RDF exports of the content as a supplementary backup might make sense, but a wholesale delete and replace of the running library content would be asking for trouble.
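
    For reference, a minimal sketch of such a batch script follows -- dry run by default, so it only prints what would change until APPLY is flipped. The calls shown (Zotero.Items.getAll, getField, setField, save) reflect the client API as commonly used in such scripts and may differ between Zotero versions, and the field name and pattern are placeholders; test on a copy of your database:

        // Batch regex edit sketch -- run from a Zotero JavaScript context.
        // Dry run by default: review the debug output, then set APPLY = true.
        // API names (Zotero.Items.getAll, getField, setField, save) may vary
        // between Zotero versions; the field and pattern are placeholders.
        var APPLY   = false;
        var FIELD   = 'publicationTitle';        // placeholder field
        var PATTERN = /J\. of /g;                // placeholder regular expression
        var REPLACE = 'Journal of ';

        var items = Zotero.Items.getAll();       // every item in the local library
        for (var i = 0; i < items.length; i++) {
          var item = items[i];
          if (!item.isRegularItem()) continue;   // skip notes and attachments
          var oldValue = item.getField(FIELD);
          if (!oldValue) continue;
          PATTERN.lastIndex = 0;                 // reset the global regex per item
          if (!PATTERN.test(oldValue)) continue;
          PATTERN.lastIndex = 0;
          var newValue = oldValue.replace(PATTERN, REPLACE);
          Zotero.debug(item.id + ': "' + oldValue + '" -> "' + newValue + '"');
          if (APPLY) {
            item.setField(FIELD, newValue);
            item.save();                         // persist the change
          }
        }

    Run it once with APPLY left false, review the printed list, and only run it again with APPLY set to true once every reported change looks right.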
  • Thanks! I think I am now persuaded not to do that and to experiment with JS instead. Sigh... Maybe it will not be needed after all, and I am just overcautious. I don't expect the Group Library to exceed 1000 items, so applying the changes manually may not be such a stress. Anyway, it is certainly good to know all the potential risks involved.