Implementing import from new simple "ZoteroXml" format - help!

HSteeb · August 25, 2010

I'm a software developer and would like to use Zotero as a knowledge base for personal notes, bookmarks etc. I need to import some existing data. I found the RDF import format too complex, and discussions about Zotero have sometimes asked for additional ways to import data, so I thought it would be nice to implement an import from a simple, straightforward "ZoteroXml" format, like

<zotero-xml>
<document>
<title>My document</title>
etc.
(roughly: main elements are itemType names, child elements are field names).
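To make the intended shape concrete, here is a minimal sketch that serializes plain objects into that layout. The `{ itemType, fields }` input shape and the helper names `escapeXml` and `toZoteroXml` are assumptions made for illustration, not part of the plugin, and the code is written in present-day JavaScript so it can run outside Zotero.

```javascript
// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Serialize items as: root element > one element per item (named after its
// itemType) > one child element per field (named after the field).
// The { itemType, fields } input shape is an assumption made for this sketch.
function toZoteroXml(items) {
  const lines = ["<zotero-xml>"];
  for (const item of items) {
    lines.push("<" + item.itemType + ">");
    for (const [field, value] of Object.entries(item.fields)) {
      lines.push("<" + field + ">" + escapeXml(value) + "</" + field + ">");
    }
    lines.push("</" + item.itemType + ">");
  }
  lines.push("</zotero-xml>");
  return lines.join("\n");
}

console.log(toZoteroXml([
  { itemType: "document", fields: { title: "My document" } }
]));
```

The sketch does not check that a field is valid for its item type; that would need the itemType/field combinations from system.sql.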

First I created a plugin (modification of the HelloWorldPlugin). Basically, it works for all item types except notes + attachments, but takes about 1 sec. per item - I guess there is something wrong with the concept (is Zotero.Items.add too high-level an entry point, too many transactions or GUI updates...?).

Second, I tried creating an import translator (type 1). Now I'm really stuck: the sandbox in which doImport() is called hides DOMParser(), so how can I get a DOM from the file I want to import?

As a side benefit, I wrote a small HTML+JS page that accepts a version of system.sql as text input and generates an HTML listing or a "ZoteroXML" import file with all valid itemType/field combinations.

Any help? I'll be glad to provide the existing sources to anybody interested, in case one of the Zotero experts would like to take over :-)

dstillman · August 25, 2010

Basically, it works for all item types except notes + attachments, but takes about 1 sec. per item - I guess there is something wrong with the concept (is Zotero.Items.add too high-level an entry point, too many transactions or GUI updates...?).

Did you try wrapping everything in Zotero.DB.beginTransaction() and Zotero.DB.commitTransaction()?
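A minimal sketch of that suggestion, assuming a `db` object that offers the two calls named above. `rollbackTransaction()` is an assumption and is only used if present, `addItem` is a hypothetical per-item insert callback, and `fakeDb` is a stand-in for Zotero.DB so the sketch runs outside Zotero.

```javascript
// Run every insert inside one transaction instead of one per item.
// `db` is expected to provide beginTransaction() and commitTransaction();
// rollbackTransaction() is an assumption and is only called if it exists.
// `addItem` is a hypothetical callback that inserts a single item.
function importInOneTransaction(db, items, addItem) {
  db.beginTransaction();
  try {
    for (const item of items) {
      addItem(item);
    }
  } catch (err) {
    if (typeof db.rollbackTransaction === "function") {
      db.rollbackTransaction();
    }
    throw err;
  }
  db.commitTransaction();
  return items.length;
}

// Stand-in for Zotero.DB that only records which calls were made.
const fakeDb = {
  log: [],
  beginTransaction() { this.log.push("begin"); },
  commitTransaction() { this.log.push("commit"); },
  rollbackTransaction() { this.log.push("rollback"); }
};

const added = [];
importInOneTransaction(fakeDb, ["a", "b", "c"], (item) => added.push(item));
console.log(fakeDb.log.join(",")); // begin,commit
```

The point of the design is that the commit happens once for the whole batch rather than once per item.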

dstillman · August 25, 2010

Second, I tried creating an import translator (type 1). Now I'm really stuck: the sandbox in which doImport() is called hides DOMParser(), so how can I get a DOM from the file I want to import?

You can use E4X. Search existing translators for "new XML".

But if you already wrote a plugin you might as well use that if this is just for your personal use. I suspect using a single transaction will solve your performance problems.

dstillman · August 25, 2010

(And, just for future reference, zotero-dev is usually a better place for technical development questions like this.)

HSteeb · August 25, 2010

- with the single transaction, now the performance seems ok!
- I'll have a look at E4X
- I'll follow up on zotero-dev (I followed the 5-step support recipe on zotero.org, which ends with posting in the forum).
Many thanks!

HSteeb · September 17, 2010

Just for the record: the plugin now works for me (imports of ca. 2500 and 100 items). In case anybody needs such an import, I made the plugin available at https://bitbucket.org/hsteeb/zoteroxmlimporter .

I also isolated the XML parsing so that creating a translator should be easy, but I don't need it (and didn't have a look at E4X, either).

bdarcus · September 17, 2010

Have you created a schema for the XML? If not, please do so. If you need help, let me know. Relax NG Compact preferred.

On the details, I suggest sticking more closely to the Zotero model; e.g.:

start = library | records
records = element records { record+ }
library = element library { metadata, records }
...

HSteeb · September 18, 2010

From system.sql, I generated the Relax NG Compact schema https://bitbucket.org/hsteeb/zoteroxmlimporter/src/tip/design/zotero-import.rnc looking like

zotero-import = element zotero-import {
( artwork
| audioRecording
| bill ...

For the items part (including notes, tags and links to webpages), this should be as close to the Zotero model as possible! For importing collections of items from other sources, this was enough (and I have no idea about the model and API needed to import more). It would be fine to have a complete import as you suggest; anyone is welcome to pick that up.

One restriction of the schema: it describes the leaf elements (e.g. title) as "text", not covering the inline HTML.

bdarcus · September 18, 2010

Nice job!

Dan might want to comment on whether it's a good idea to model the types as elements. An alternate approach would be something like:

artwork = element item {
   attribute type { "artwork" },
   ...
}

Note: am not advocating this ATM; it really depends how stable those types are likely to be long-term.

One restriction of the schema: it describes the leaf elements (e.g. title) as "text", not covering the inline HTML.

Did you do this because a) you weren't sure how, b) you didn't have time (yet), c) you worry it's premature, or d) other?

HSteeb · September 20, 2010

Thanks!

From my point of view, types as elements makes the model easier to use. The same holds for the item properties being child elements. I'm aware of the element vs attributes discussions. In this case (data import), the model should just be simple.

Types being stable: I don't think the import format needs to stay constant when the types change. The import format should reflect the data model directly. The current format should actually carry the fixed version number "2.0"; I just left that out. I would not expect a need to import "Zotero XML 2.0" into Zotero 3.0. For long-term data storage, I would use RDF, since I assume this format will be maintained by the experts.

Since the import code just throws the XML elements into Zotero, the code can remain unchanged - just the docs must be adapted depending on what the current Zotero version accepts (the version numbers in install.rdf must be adapted, anyway).
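As a rough illustration of why the code need not change, here is a sketch of a mapping that hard-codes no item type or field name. The `{ name, text, children }` node shape and the function `elementToRecord` are assumptions for this sketch, not the plugin's actual code.

```javascript
// Turn one parsed item element into a generic record. No item type or field
// name is hard-coded, so a changed type list needs no code change here.
// The { name, text, children } node shape is an assumption for this sketch.
function elementToRecord(element) {
  const fields = {};
  for (const child of element.children) {
    fields[child.name] = child.text;
  }
  return { itemType: element.name, fields };
}

const record = elementToRecord({
  name: "document",
  children: [{ name: "title", text: "My document" }]
});
console.log(JSON.stringify(record));
// {"itemType":"document","fields":{"title":"My document"}}
```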

Inline HTML not covered: a), b), d). I did not need it, I'd have to look up how to do it, and I have no idea what exactly Zotero accepts in which item properties - this is left for the experts.

As it is, the code works for me, and it should be useful to anybody. Maybe the Zotero team likes the idea and builds a maintained plug-in or translator for a "simple XML" within Zotero. (Continue on zotero-dev?)

bdarcus · September 20, 2010

That's fine. We're talking about the inline stuff over on the xbib-dev list, so we can add it once that's resolved, essentially copying-and-pasting the definition from the csl-data.rnc schema.

But yeah, it might be good to float on zotero-dev.