Hierarchical Item Relationships

Josh · February 5, 2007

I'd like to corral a few different threads of discussion here:

http://forums.zotero.org/discussion/385/
http://forums.zotero.org/discussion/78/
...and Bruce's useful prodding

Right now, Zotero has what's called a "flat" data model - there are a set number of item types, and each one has a fixed number of fields. This makes for easy coding, and is the model that Endnote and virtually every other research management tool use. As much as anything, this model grows out of the bibliographic citation; according to MLA or Chicago or whatever, there are a certain number of citation types, and if you're writing software that is ultimately about facilitating bibliography-creation, then the smart thing to do is to just implement those item types in a flat model.

The problem, though, comes when you want to do more interesting things with your data than simply create references; there are *relationships* between items that aren't expressed in a flat model. If you have several chapters from an edited volume in your library, for example, you'd want the data model to reflect the fact that, while each chapter is its own discrete object, they also share a parent (the book itself). That way, you wouldn't have to repeatedly enter the same information for the book each time you added a new chapter to your library; you'd just link each new "Edited Volume Chapter" item to its parent collection. By adding parent-child relationships, you move from a flat to a "hierarchical" data model, which is more elegant, cleaner, and just all around more appealing.
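To make the chapter/book example concrete, here is a minimal Python sketch of a parent-child item. It is not Zotero's actual schema; the `Item` class, its `lookup` method, and all field names and sample values are made up for illustration. The idea shown is only that a chapter stores its own fields and inherits the rest from its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    # Hypothetical structure, not Zotero's real data model.
    item_type: str
    title: str
    fields: dict
    parent: Optional["Item"] = None

    def lookup(self, name):
        """Return a field from this item, falling back to its parent chain."""
        item = self
        while item is not None:
            if name in item.fields:
                return item.fields[name]
            item = item.parent
        return None


# The book's details are entered once and shared by both chapters.
book = Item("book", "An Edited Volume", {"publisher": "Example Press", "year": "2006"})
chapter_one = Item("chapter", "First Chapter", {"pages": "1-20"}, parent=book)
chapter_two = Item("chapter", "Second Chapter", {"pages": "21-45"}, parent=book)

print(chapter_one.lookup("publisher"))  # Example Press
print(chapter_two.lookup("pages"))      # 21-45
```

In a flat model, by contrast, `publisher` and `year` would have to be copied into every chapter record.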

It also raises a few really big issues, which is why Zotero's still using a flat data model.

*Interface*

Right now, we're pretty happy with the iTunes-like Zotero interface; there's a library, and items in a list within it. That list can be sorted in a few different ways, but it's fundamentally a flat list. There is a hierarchical relationship between items and snapshots/links/notes, but even that's been problematic. (Should notes only be listed underneath the items to which they're attached? What about freestanding notes? What if you want to browse your notes - what's the intuitive interface for toggling between modes?)

A hierarchical data model raises huge interface questions which I haven't really seen solved anywhere; that may just mean that Zotero's going to be the app to do so, but it's going to be a long slog (we were lucky that Apple had spent huge amounts of time refining the iTunes interface to the point where we could just kludge many of its conventions), especially if we're committed to creating software that's intuitive and understandable not just for us power users, but for the broader population of researchers who are less willing to approach a more outside-the-box interface.

*Sharing*

Bound up with many of the conversations around data models has been a second, important-to-distinguish issue: user-created fields/types. Right now, the Zotero data model is immutably fixed; there are a set number of item types, each with set fields. We're still cleaning these up and adding new ones (the "Blog Post" entry, for example, has a "Website Type" field which is about as useful as the human appendix, and which has a similarly vestigial explanation: the "Blog Post" item type grew out of the "Webpage" item type).

Flat data models lend themselves more easily to data exchange - if the ontology of types/fields is the same across all Zotero users, I can share my library with you seamlessly. If I wanted to customize things a bit, I could do what Endnote does and add in a few "Custom Fields" which are essentially containers for any extra information I might want to tack on at the end of an item. There's a reason why all the popular social software sites (del.icio.us, cite-u-like, flickr, youTube, etc.) operate with flat, immutable data models - to make different users' collections interoperable, you've got to have the same ontology.

If it's not clear, interoperability is extremely important to us; we're planning a server that'll allow users to share, exchange and collaborate on Zotero collections, and we're incredibly excited by the possibility of visualization, recommendation and other newfangled bells and whistles that will transform the way that we as scholars relate to both our research and our colleagues. However, this only works if the data is interoperable.

Now, I'm not trying to conflate user-defined fields with flat vs. hierarchical data models - they're two separate issues, and should be treated as such. However, if you want to enable users to tailor the software to their particular needs (which we do) because you think they know better than you what they'd need (which we do), a flat data model is *much* easier to deal with, because you're talking about a few extra, easily-excludable fields. If you start letting users create their own item types, interoperability becomes problematic, and if you enable the fitting together of existing and user-created item types in parent-child hierarchies, the likelihood of normalization decreases exponentially. It's an awkward analogy, but you go from fitting square pegs of different lengths into square holes to dealing with pegs and holes of any size and shape imaginable.

*Where we're at*

When we started building Zotero, we had a few pragmatic goals: to build a tool that was more open and robust than Endnote and other bibliographic reference managers, to make its interface accessible to as many users as possible, and to build it with an eye toward enabling social-software-like functionality. When we ran into the question of what sort of data model to use, we made a tough decision - though it might not seem like it, we think *a lot* about the data model, and we spent quite a bit of time debating whether to go with a hierarchical model from the start.

Ultimately, we decided to start off with a flat, fixed model, and focus our energies on the infrastructure that would allow users to get data into the system, as well as mechanisms for getting data back out (either in citations or via utilities, about which I'll post more elsewhere). We knew from the start that this was a compromise, and one that we'd have to address later on; right now, our plans are to get version 1.0 out the door, then tackle this problem headlong.

There are two distinct pieces here: the hierarchical data model and user-defined types/fields. The current plan is to do hierarchical types first, with a defined ontology of types and fields. That's going to occupy a good amount of time, less for the implementation in the software itself (that'll be a pretty straightforward process) than for the interface design; figuring out exactly how to represent and work with hierarchical items isn't easy, and we've got a lot of work to do.

On the second piece (user-created types/fields), we want these as badly as everyone else, but the moment that Zotero opens up the data ontology to users we irrevocably lose the ability to cleanly aggregate and share item data. One of the basic rules of software development is that you can always add features but never take them away, so we're going to tread cautiously and carefully to make sure we enable users to customize their personal data without threatening the interoperability of Zotero collections.

With all of that said, this would be a good place to start a discussion on the strategy as I've laid it out, as well as the more focused questions raised elsewhere on the forums and dev list. I've also created a page on the Zotero Trac wiki (https://www.zotero.org/trac/wiki/HierarchicalOntology) on which to flesh out a new, hierarchical model - if we can come to a consensus of what this should look like, we'll be able to roll it into the post-1.0 Zotero much more quickly.

mkbergman · February 5, 2007

Hi Josh,

I'd like to humbly suggest that the project should distinguish between 1) the creation of standard data models and 2) canonical data formats. For example, within microformats, each format (XOXO, hCard, rel-tag, etc.) has a process for the community to agree upon data labels to promote interoperability, but all formats follow the same XML canonical format.

For hierarchical relationships in Zotero the first question should thus be: what canonical form? OPML, XOXO or perhaps even an OWL variant, among others, could be candidates. I think it is always best to adopt the simplest canonical representation, because at later times converters can be written for translating to any exotic real-world variant. By this measure (and without further analysis or investigations), my first reaction would be to look strongly at OPML.

The second matter, standard data models, I think is more central to the core of your posting. Here, within the bibliographic research and citation community (actually, there may be many depending on discipline!), it seems to me establishing specific standards efforts per your wiki posting is the right way to go. Only domain experts can settle internally on the semantics necessary for their domain. But, it is also likely the case that each domain will also prefer its own semantics (and not nearly so constraining or "global" in intent as, for example, the existing microformats standards).

However, even in the case of different standard data models for different domains or communities, should such arise as they inevitably will, interoperability is not out of the question. Semantic mediation and resolving of semantic heterogeneities is an area of active research with many tools and promising approaches extant.

Please note that by acknowledging semantic heterogeneity I am NOT encouraging it, and where specific communities can avoid it via standards and consensus, that is an easier choice for you and the team as developers. Thus, your call for contributions and dialog on this question is right on. But what I am saying is that once you as developers place such useful tools into the marketplace, others will embrace them and take them in different directions. Semantic heterogeneities leading to challenges in interoperability are unavoidable, especially if you as developers have done a great job (as you have done!) to provide a powerful framework useful to many constituencies.

As the development stewards, then, focus on canonical formats. As domain or community stewards, focus on standard data models and semantic consistency.

Since I first encountered Zotero I saw it as having very broad applicability beyond its current audience. I hope that the question of canonical data formats and format conversion tools can be kept separate from the setting of standard data models (by community or domain).

I'd like to think more on this question and provide further input at a later time.

Thanks, Mike

Matthias · February 5, 2007

Josh, thanks for your detailed post. I fully agree with you that a hierarchical data model poses many interface & compatibility issues which are far from trivial. Your post very much describes the reasoning why we haven't adopted a hierarchical model for refbase yet (although this was discussed with Bruce and others and has been planned for a long time, see e.g. http://sourceforge.net/forum/forum.php?thread_id=1135397&forum_id=218757).

While a hierarchical model has many advantages, a flat model definitely has some advantages that go beyond easy implementation & sharing. As an example, a flat model allows different users to use different variants for a particular container, e.g. while some of my colleagues want (or need) to cite a non-English book or journal with its original (say, German) title, others always cite the title translated into English. Yet another (French) researcher may need to cite the very same resource with a French translation of the title (e.g. for a French grant application). Surely, these things can all be modelled with a hierarchical model but they require *a lot* of thought. In addition, each use case has to be accounted for in the interface. By contrast, a flat model lets users use custom container values for their own resources.

That said (and not thinking of any implementation details), I'd personally favour a solution that would let me choose at item level whether I'd like to link a particular resource (like a book chapter) to its container item OR whether I'd like to enter custom container values for that particular resource. I.e., by default hierarchical relations would be encouraged by the software but I'd also be able to "unlink" individual items (of course, a link to a container item could still be stored by the software for unlinked items). With this in mind, we've thought about using a flat model alongside a hierarchical model for the refbase project. This would mean partially redundant data in the database but would allow us to use the best model for a given use case.
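A rough Python sketch of that per-item choice might look like the following. It is neither refbase nor Zotero code; the `Container` and `Chapter` classes, the `unlink` method, and the sample titles are assumptions made purely for illustration. An unlinked item keeps its reference to the container but reads its container values from its own redundant copy.

```python
class Container:
    def __init__(self, title):
        self.title = title


class Chapter:
    def __init__(self, title, container=None):
        self.title = title
        self.container = container   # reference to the shared container item
        self.linked = True           # hierarchical behaviour is the default
        self.custom_container = {}   # item-level values, used only when unlinked

    def unlink(self, **custom_values):
        """Stop using the shared container values, but keep the reference."""
        self.linked = False
        self.custom_container = dict(custom_values)

    def container_title(self):
        if self.linked and self.container is not None:
            return self.container.title
        return self.custom_container.get("title")


volume = Container("Ein Sammelband")
original = Chapter("Erstes Kapitel", container=volume)
translated = Chapter("Erstes Kapitel", container=volume)
translated.unlink(title="An Edited Volume")

print(original.container_title())    # Ein Sammelband
print(translated.container_title())  # An Edited Volume
print(translated.container is volume)  # True: the link is still stored
```

This covers the translated-title use case at the cost of the partially redundant data mentioned above.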

Just some thoughts... I very much appreciate this discussion. Matthias

Matthias · February 5, 2007

Regarding the interface question on how to implement a GUI that facilitates a hierarchical data model: Consider how a relational database such as FileMaker is doing it. You can easily add data fields from a parent record to a child record's layout (think Zotero item types). Then, when editing a parent's field *within* one of the child records, the parent record is updated accordingly (unless prohibited by a pref setting). Similarly, any editing of the parent record will affect all of its child records.

So, speaking of interface refinements to Zotero required for a hierarchical data model, I could imagine that most of the current GUI layout could stay exactly as it is now -- just include a means to link an item to its parent container (plus a means to create the container item if it doesn't exist yet), then distinguish the related container/parent fields with a clear visual clue within the item's type layout (maybe a different background color?).

A lock icon (or something like that) could determine on item level whether these related container/parent fields could be edited to change the parent item (and thus the parent fields of any other items within that container!). In addition, there could be a prominently placed checkbox (and notification message) that lets the user specify whether the editing of "container fields" will be applied globally or not. This could even solve the issues outlined in my previous post (flat & hierarchical data stored alongside each other).
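As a sketch of how the lock and the "apply globally" checkbox could interact, assuming hypothetical `Parent` and `Child` classes and an invented `edit_container_field` method (none of this is FileMaker or Zotero behaviour):

```python
class Parent:
    def __init__(self, **fields):
        self.fields = dict(fields)


class Child:
    def __init__(self, parent, locked=True):
        self.parent = parent
        self.locked = locked   # the "lock icon": may this item change its parent?
        self.overrides = {}    # container values that apply to this item only

    def get(self, name):
        # A local override wins; otherwise read through to the parent.
        return self.overrides.get(name, self.parent.fields.get(name))

    def edit_container_field(self, name, value, apply_globally=False):
        if apply_globally:      # the "checkbox"
            if self.locked:
                raise PermissionError("container fields are locked for this item")
            self.parent.fields[name] = value
            self.overrides.pop(name, None)
        else:
            self.overrides[name] = value


journal = Parent(title="Journal of Examples")
first = Child(journal, locked=False)
second = Child(journal)  # locked by default

first.edit_container_field("title", "J. Examples")  # local edit only
print(second.get("title"))  # Journal of Examples

first.edit_container_field("title", "Journal of Worked Examples", apply_globally=True)
print(second.get("title"))  # Journal of Worked Examples
```

A global edit reaches every sibling through the shared parent, while a local edit stays on the one item, which is the flat-plus-hierarchical mix described in the previous post.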

bdarcus · February 5, 2007

Josh,

A few comments first:

"As much as anything, this model grows out of the bibliographic citation; according to MLA or Chicago or whatever, there are a certain number of citation types, and if you're writing software that is ultimately about facilitating bibliography-creation, then the smart thing to do is to just implement those item types in a flat model."

I thank you for laying this out and starting the discussion, in part because it highlights some differences.

This probably stems from us handling different ends of the development puzzle (you more about GUI, me about formatting configuration), and from the fact that as a user, I have quite demanding needs (in fact, not unlike an historian!).

First, the style guides are not at all as focused on types as you assume. A whole lot of them are about classes of documents and relationships. They use types as illustrations.

Second, while it might be easier for you to code, flat models also actually don't match a whole lot of real world citation needs. It is only a convenience for programmers (and because relational databases and RDF didn't exist when most current bibliographic applications were designed). The thread about primary sources illustrates this really well, so I won't repeat those arguments.

Third, if you assume a flat list of types that can be extended, how do you handle a) data exchange, and b) formatting? In the latter case, do you require a style author to create templates for all 65 or so of the types that one can identify in Chicago?

That's been my focus. CSL is designed around a mix of a hierarchical and flat model. Part of the reason for this is that the styles become both more compact and easier to debug, as well as more reliable. If there is no template for a given style, the formatting system is supposed to use one of the three generic fallbacks. These are determined based on the structural characteristics of the data.
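The fallback idea can be sketched in a few lines of Python. This is not CSL syntax and not how any CSL processor is implemented. The post does not name the three fallbacks, so the keys `standalone`, `in_collection` and `in_periodical`, the template strings, and the `container_is_periodical` flag are all invented here to show selection by structure when no type-specific template exists.

```python
# Hypothetical type-specific templates that a style author chose to write.
TEMPLATES = {
    "book": "{author}. {title}. {publisher}, {year}.",
}

# Hypothetical generic fallbacks, keyed by structure rather than by type.
FALLBACKS = {
    "standalone": "{author}. {title}. {year}.",
    "in_collection": '{author}. "{title}." In {container}. {year}.',
    "in_periodical": '{author}. "{title}." {container} ({year}).',
}


def structural_class(item):
    """Classify an item by its relationships, ignoring its declared type."""
    if item.get("container") is None:
        return "standalone"
    if item.get("container_is_periodical"):
        return "in_periodical"
    return "in_collection"


def format_item(item):
    template = TEMPLATES.get(item["type"]) or FALLBACKS[structural_class(item)]
    return template.format(**item)


# No "interview" template exists, so the item's structure picks the fallback.
interview = {
    "type": "interview",
    "author": "A. Author",
    "title": "A Conversation",
    "container": "Example Magazine",
    "container_is_periodical": True,
    "year": "2006",
}
print(format_item(interview))
```

With this arrangement a style author would not need a template for every one of the 65 or so types, only for those that differ from the generic structural cases.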

"The problem, though, comes when you want to do more interesting things with your data than simply create references; there are *relationships* between items that aren't expressed in a flat model."

This is only one small part of the problem. It is also about the need to have both flexible and reliable data exchange and formatting.

"Now, I'm not trying to conflate user-defined fields with flat vs. hierarchical data models - they're two separate issues, and should be treated as such. However, if you want to enable users to tailor the software to their particular needs (which we do) because you think they know better than you what they'd need (which we do), a flat data model is *much* easier to deal with, because you're talking about a few extra, easily-excludable fields."

It might be easier for your programmers. But that's not to say that it's easier for formatting, or data exchange.

Here's the practical issue:

1. Zotero has its internal data model, and then another one it presents to its users
2. CSL has its own data model
3. the RDF model I designed -- and which you are partly using -- has still another data model, though designed in conjunction with CSL

It is critical to future interoperability of all these pieces for the three models to be in sympathy. A CSL formatter has to know what to do with a citation source. It has to know which template to use. Otherwise, the citations won't get formatted correctly. And to do that, the data models MUST be consistent.

Likewise with data exchange.

Now, I'd like to complicate this a little by pointing out (as Matthias does) that one need not necessarily choose flat vs. relational. There can be ways to have both. For example, maybe you use a little configuration language to set up a "type" for a GUI, but keep a hierarchical structure for formatting and exchange. Just emphasizing that this is no black-or-white choice.

bdarcus · February 5, 2007

Oh, also Josh, you do realize I've spent a lot of time on an RDF model for this; right?

http://purl.org/net/biblio

Also, it's not "easy" at all!

This is what I want us to agree on: what should this RDF model look like (for typing that can be used in RDF, and also elsewhere, like Atom), and how do we map it to CSL (and the GUI)?

CloudofDust · February 5, 2007

First, thanks to Josh for initiating this discussion, and also for linking to that thread from last October, which helped answer some of my questions. This should be an interesting discussion.

One of the reasons that the inadequacies of biblio programs became my hobby horse is that before I entered academia, I worked a bit with MS Access and learned some of the rudiments of relational database design. At one time, I wanted to use Access to design a relational database that would handle my notes and citations - but that's another story. The point is that when I started shopping for biblio/notetaking software, I was shocked to find that it was unheard of for such programs to have an authors table, titles table, publishers table, etc. Instead, as you've all noted, the standard is one big table with lots of repetitive data and data entry.

These days, I do most of my academic writing with Nota Bene, which I like because of the sophistication of its word processor and the simplicity and user-friendliness of its text retrieval module, which is great for organizing research notes. And its biblio database module is adequate on its own terms, and very well integrated with the rest of the program. But I've also played around some with Library Master, which is the most user-customizable biblio program I've been able to find. Like everything else, it works on a flat, one big table model, but I found a way to make it create the kind of compound citations I needed in a way that approximated a relational database solution.

I'm pasting in below an old email I sent to the Nota Bene users list, and forwarded to the LM users list, describing this method. I think that the underlying concept (inserting record markers for "containers" within the "container" fields of other records) could be a fairly straightforward way to address many of my own complaints about flat biblio databases, and I suspect that one could create a user-friendly interface for it without too much trouble (along the lines that Matthias mentions above). It would also help stop the proliferation of record types that bdarcus points to: Instead of 100 different record types, we could have 10 document types, and 10 container types, and achieve the same (or better) result. Consider that the only difference between citations of a journal article and an anthology chapter is the "container." Have a single "titled short work" type for the actual items, and then a journal type and an anthology type as choices for the container. Lots of neat possibilities.
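To illustrate the "document type plus container type" idea, here is a small Python sketch. The classes, the `cite` function, and the output format are invented for the example; the format is not MLA, Chicago, or any real style, and this is not how Library Master or Nota Bene works internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    kind: str      # e.g. "journal" or "anthology"
    title: str
    details: str   # volume, publisher, year and so on, already formatted


@dataclass
class Document:
    kind: str      # e.g. "titled short work"
    author: str
    title: str
    container: Optional[Container] = None


def cite(doc):
    """Build a citation from the document part and, if present, the container part."""
    parts = [f'{doc.author}, "{doc.title}"']
    if doc.container is not None:
        prefix = "in " if doc.container.kind == "anthology" else ""
        parts.append(f"{prefix}{doc.container.title} ({doc.container.details})")
    return ", ".join(parts) + "."


journal = Container("journal", "Journal of Examples", "vol. 1, 2006")
anthology = Container("anthology", "Collected Examples", "Example Press, 2006")

# The same document type is used in both cases; only the container differs.
article = Document("titled short work", "A. Author", "A Short Work", journal)
chapter = Document("titled short work", "A. Author", "A Short Work", anthology)

print(cite(article))
print(cite(chapter))
```

Because the two parts are formatted independently, ten document types and ten container types would need twenty definitions rather than one hundred combined record types.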

For the sake of making this thread more readable, I've replaced the full text of the email with the URL for the original:

http://www.hamline.edu/~wnk/notabene/2003/msg05806.html
(scroll down to the part labelled "Multi-level shortened citations")

Here's another URL for a more recent commentary of mine (more specifically directed towards improving Nota Bene, but relevant to the more general problem of achieving these complex citations without a completely relational database):

http://wnk.hamline.edu/mhonarc/notabene/msg00373.html

I hope you find these comments interesting.

raf · February 6, 2007

To be perfectly honest, I sometimes find it hard to understand many of the issues outlined in the above. Nevertheless, I think I understood the gist of CloudofDust's initial posts, before the discussion migrated to the feature request threads.

I am rather familiar with a few bibliographic databases and have come to use Zotero a lot of late. For the work I’m doing now, I have about 1400 entries of pre-1800 records.

I just wanted to say that the concerns of CloudofDust are shared by others as well – and if Zotero would manage to address those concerns it would make my life much easier as well.

Thanks to all the people who devote their time and energy to this,

r

paregorios · February 6, 2007

It's hard to describe how delighted I was to see this discussion rubric pop up the very day I subscribed to the forum. The issues of hierarchy and type are at the core of a bibliographic data quest that I hope Zotero can help us solve. Please forgive me if what follows seems idiosyncratic; rather, I hope it will serve as an interesting use case. Note: I've had to divide my comments into multiple posts, since there's evidently a character limit on comments that's irritatingly not advertized in advance.

Context: Collaborative Ancient Geography

I direct the Ancient World Mapping Center's Pleiades Project, which aims to create an online workspace for ancient geography. Our springboard is a legacy data collection assembled by the American Philological Association's Classical Atlas Project over the twelve-year period that culminated in the publication of the Barrington Atlas of the Greek and Roman World (R. Talbert, ed., Princeton, 2000). With startup funding from the National Endowment for the Humanities, our 2-person development team is crafting the supporting web application on top of the Plone content management system. Plone provides us with a customizable framework for the creation of structured content types, scriptable editorial workflow and an extensible user portal. Our goal is a compelling, intuitive environment in which community approaches will combine with academic-style editorial review, to enable anyone — from university professors to casual students of antiquity — to suggest updates to geographic names, descriptive essays, bibliographic references and geographic coordinates. Once vetted for accuracy and pertinence, these suggestions become a permanent, author-attributed part of future publications and data services.

Issue: Bibliography in Pleiades

The Ancient World Mapping Center has developed a bibliographic database that contains over 4,000 records deriving from the bibliographic citations that supported each mapped feature in the Barrington Atlas. These records have been augmented with data drawn from MARC records and added by hand, so that they now include alternate and translated titles (each coded for language), holdings information, standard numbers and more. This database must be migrated from its current, stand-alone configuration to operate seamlessly within the Pleiades web application as live content. Not only do we wish to harness the community's effort to capture new bibliographic information, we also require that each suggestion for an addition, deletion or change to a name, place or location be supported by either an explanatory essay or a bibliographic citation. Obviously, the accuracy of these citations, and the ease with which they can be created, are key concerns.

To be continued ...

paregorios · February 6, 2007

... continued from previous post

Plone Plugins for Bibliography

There is an established plugin for bibliographic content in Plone: CMFBibliographyAT. It was primarily designed to facilitate the construction and management of personal citation lists in Plone. Its data model derives from BibTeX. After long consideration, we have elected not to use it. There are a couple of key reasons:

No support for, or developer interest in, hierarchical relationships and associated authority control (e.g., information about a journal must be re-entered in the record for each article; we would prefer the appropriate journal be picked from a list in the GUI or auto-assigned to incoming data on the basis of an ISSN or other standard identifier)
CMFBibliographyAT developer concern that the existing code base will not scale to a dataset the size and complexity of ours

It looks as if we will need to develop our own Plone plugin for bibliography, just as we are doing for the geography and toponymy.

The Pleiades Bibliographic Data Model

The data model realized in the Mapping Center's bibliographic database is hierarchical, relational and idiosyncratic. In designing it (ca. 2000), we deliberately eschewed flat models (and those based directly on formats like USMARC) because we felt we needed the control of having a single record for each discrete work, whether it was an article, journal volume, journal, etc. Somehow, we missed the existence of the Functional Requirements for Bibliographic Records (FRBR) Final Report, published by the International Federation of Library Associations and Institutions in 1998. This report deliberately avoids recommending a data format. Rather, it lays out an entity-relationship model that aims to:

provide a clear, precisely stated, and commonly shared understanding of what it is that the bibliographic record aims to provide information about, and what it is that we expect the record to achieve in terms of answering user needs ... and recommend a basic level of functionality and basic data requirements for records created by national bibliographic agencies.

In comparing the FRBR model to our own, we find a high degree of commonality. In particular, FRBR's four main entities (works, expressions, manifestations and items) and their relationships with each other align almost perfectly with the semantic, hierarchical distinctions made in our model. Recently, we have come to the conclusion that we should adopt FRBR terminology for entities, attributes and relationships in our bibliographic plugin for Plone. This conclusion has been reinforced by the recent development of a Dublin Core application profile for scholarly works (i.e., e-prints) that takes as its starting point an application profile based on FRBR. See also: OCLC's FRBR Research Page and the FRBR Blog.

Zotero and Pleiades

Zotero is a killer app. After just a few minutes of Zotero use, it was clear to me that the Pleiades bibliographic plugin for Plone would need to integrate tightly with it. Data flows from Zotero to Pleiades, and vice versa, will add value to our content creation process and to our users' individual research agendas. We are taking a similar approach with Google Earth, to which we can already push our geographic data. Soon, users will be able to recommend new coordinates for sites by moving or creating placemarks in Google Earth and pushing these back to Pleiades. On the formats front, we would also be happy to see something like Zotero's RDF schema (building, as it does, on so many other useful efforts) evolve to handle hierarchy and other types, so that we can serialize feeds of bibliographic data for third-party syndication, just as we are doing for geographic data by way of geoRSS and KML.

Questions for the Zotero Community

In light of all this, I'd like to ask a few questions that are hopefully relevant to the current thread:

For hierarchy, has there been any consideration of FRBR? Should there be?
I'd like to hear more about current thinking with respect to Josh's comment: we're planning a server that'll allow users to share, exchange and collaborate on Zotero collections
Within the library cataloging community, there are descriptive vocabularies for types of works in bibliographic records (e.g., for data elements 06 and 07 in USMARC). Shouldn't these, as existing standards already embodied in many of the resources that Zotero scrapes, all be represented (or crosswalked to something) in the list of Zotero item types? (For example, our data model sees USMARC "serial" as a superclass for "series" and "journal", but I don't see corresponding items in the Zotero list.)

bdarcus · February 6, 2007

"For hierarchy, has there been any consideration of FRBR? Should there be?"

I've spent a lot of time looking into and prototyping FRBR-related data stuff. It's a killer data model that is really sound. But you pay for that in complexity. My conclusion is that it's too complicated for these use cases.

That said, I helped a bit with the FRBR RDF work that Ian Davis (from Talis) and Rich Newman did, and more recently I've been referencing it from my RDF model. I blogged about this here:

http://netapps.muohio.edu/blogs/darcusb/darcusb/archives/2006/09/11/plugging-into-frbr-killing-marc

In my data model, BTW, I have journal as a subclass of periodical, which is a subclass of collection. MARC is good to learn from, but less-than-ideal otherwise. Library people think about bibliographic metadata differently, reflecting different priorities.
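As a loose analogy only, that subclass chain can be written as Python classes. The real model is an RDF vocabulary, not Python, and the `is_container` helper is an invented name used to show why the hierarchy is handy.

```python
class Collection:
    def __init__(self, title):
        self.title = title


class Periodical(Collection):
    pass


class Journal(Periodical):
    pass


def is_container(thing):
    """Anything that is a Collection, at any depth, can hold other items."""
    return isinstance(thing, Collection)


journal = Journal("Journal of Examples")
print(issubclass(Journal, Periodical))  # True
print(issubclass(Journal, Collection))  # True
print(is_container(journal))            # True
```

Code (or a style) written against the general class then applies to every more specific type without listing each one.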

CloudofDust · February 6, 2007

A quick thought - it might help facilitate this conversation (and the inclusion in the discussion of non-techies, or even techies with different areas of expertise) to have a basic glossary of acronyms (CSL, RDF, etc.) and other key terms relevant to this issue. Maybe there is one already, and I just haven't seen it.

I've gotten a handle on at least some of this vocabulary by browsing some of the developer documentation, where many of these concepts are explained, but the documentation is written for developers, and so many of the explanations remain rather opaque to the rest of us.

So far, this has been a very interesting exchange of perspectives - let's keep it up.

bdarcus · February 6, 2007

Glossary:

XML (eXtensible Markup Language) - a way to tag plain-text with structured information

RDF (Resource Description Framework) - a standard (basically relational) data model for representing just about anything; builds on XML and the web (resources are named using URIs). Think of something like a relational database for the web. Well-suited to needs of Zotero.

CSL (Citation Style Language) - an XML language to describe citation styles; written by me, with some help from one of the Zotero guys, and used in Zotero. There's some interest in standardizing it and using it within OpenDocument.

CloudofDust · February 6, 2007

The following question may fall into the "obvious" category for many or most of you, but perhaps answering it will help familiarize us non-techies with some of the technical difficulties we're facing, and maybe kick-start some brainstorming about alternatives.

Is an ordinary relational database a viable solution for biblio stuff? Why or why not?

I'm picturing related tables for "people," "titles," "journals," "publishers," "places," etc. A GUI would present a data entry form more or less like current biblio programs, but the data would go into the various tables instead of into one big one. There would be an autocomplete feature where appropriate, both to facilitate data entry and to prevent duplication of data (when adding additional works by an author already in the database, the name would pop up, ensuring that you don't create multiple records for the same author).

The record entry form would also contain tags to qualify how specific data would be used. For example, names from the "People" table could be tagged as author, editor, recipient, etc. These tags would then determine the appropriate formatting when it comes time to spit out citations. Similarly, titles could be tagged to designate their relation to the item being entered: item title, collection title, etc. We'd also need some way within the titles table to distinguish long work (i.e., underlined/italicized) titles from short work (quotation marks) titles from simple descriptions (with no special formatting at all) - or perhaps this could be handled by giving each of these title-types its own table.

Different forms could be created for different record types, labelling the fields so as to make the process intuitive for the user - but this is basically what record types do now. The user experience would be more or less the same as at present, but the underlying relational structure would allow us to do a lot of neat stuff with the data (such as using "collection" records within citations of "document" records, etc.)

I'm sure there are plenty of reasons why the above wouldn't work, but I don't know what they are, and would like to find out. I imagine the programming requirements are daunting, but it sounds like this will be true no matter what structure is ultimately chosen.

I also realize that bdarcus is suggesting some middle ground between flat and relational solutions - I'd love to hear more about these ideas, and why they are more appealing than what I've just outlined.

So fire away - I have no investment whatsoever in this scheme, nor am I advocating it as a surefire solution - think of it more as a starting point for a conversation that will, with any luck, go off in some unforeseen direction.

dstillman · February 7, 2007

I mentioned this in one of the earlier threads, but to address CloudOfDust's question (and to clarify the discussion for others), the internal database structure used by Zotero is pretty much irrelevant to the issue at hand. How we store data internally is mostly an issue of disk space efficiency and performance. I'll just reproduce what I said on that other thread, as it's germane to this discussion:

There's nothing inherent in the current database design that prevents there from being dependencies/relations and automatically incorporating parent fields in dependent items without duplicating data. Normalizing publications out to a separate table might make a few things easier (and other things more convoluted), but it doesn't avoid any of the issues that make this difficult. You still need a way to select a parent/publication from within another item, still need to prompt the user for what to do on deletes, still need to determine whether the change to publication data in one item changes it in the parent, etc. And you need a mechanism for relating items in complex ways regardless of whether publication data exists independently.

The UI suggestions from Matthias above are a great start to answering these sorts of questions. More ideas along those lines would be most welcome.

Aside from UI issues, the other major requirement is a hierarchical map of reference types of the sort Bruce has been working on, which would allow not only for proper CSL fallback but also for certain internal logic within Zotero—for example, I would imagine that certain communication-based types might share the same logic for auto-generating a proper title field from the values of the creator types designated in the database as equivalent to "author" and "recipient" (whatever the type-specific names might be). A planning document that arises from this discussion would ideally include descriptions of such logic.

Bruce, does your RDF model account for actual fields or just types? (I don't see any mention of fields on the RDF schema page, so perhaps you haven't integrated them yet.) In particular, I'm wondering whether child reference types inherently contain all the fields of their parent types (though perhaps with different names).

(CloudOfDust: We already do exactly what you suggest for creators and creator types, since, among other things, they require many-to-many relationships with items. Separating out the other fields into separate tables wouldn't provide any benefit and would simply make the code much more convoluted, as it would require adding custom logic for every single field (and there are quite a lot of fields). One thing that would be appropriate, from a data normalization perspective, would be to add an itemDataValues table with all the unique values and simply reference them in itemData—it would make working with the data somewhat more tedious and possibly make certain operations slower, but the overall increase in storage efficiency (at least for large libraries) would probably make it worth it. But, again, a separate matter from the issues at hand.)

bdarcus · February 7, 2007

Dan:

"Bruce, does your RDF model account for actual fields or just types? (I don't see any mention of fields on the RDF schema page, so perhaps you haven't integrated them yet.)"

The SVN version does include all the properties, though I think we still need some inter-project discussion of whether we define these all in the same namespace (as I have) or whether we reuse existing vocabularies (like DC) for that.

"In particular, I'm wondering whether child reference types inherently contain all the fields of their parent types (though perhaps with different names)."

Not sure what you mean here. Can you rephrase?

In general, when we're talking about RDF and RDF Schema, we're talking about classes and properties. Properties can in turn define their domain, which is a class. In that case, it allows inferencers to come across a property (independent of any type statement) and say "ah ha, this property belongs to X class".

Here's an example from the sbo.n3 file in my project SVN:

sbo:title a owl:DatatypeProperty ;
    rdfs:label "title"@en ;
    rdfs:isDefinedBy sbo: ;
    owl:equivalentProperty dc:title ;
    rdfs:range xs:string .

bdarcus · February 7, 2007

Re: this question:

"I'm picturing related tables for "people," "titles," "journals," "publishers," "places," etc."

As Dan says, it's all really a question of trade-offs, but when I've worked on such models, I've tended to have tables for "agents" (includes publishers), "collections" (includes journals, archival collections, etc.), and "resources." I've always had "title" as a column for resources and collections, rather than a separate table (seems like overkill).

dstillman · February 7, 2007

Bruce — re: fields:

For example, if EMail is a subclass of PersonalCommunication, are all properties defined by PersonalCommunication valid for EMail as well, or is there some mechanism for specifying within EMail that some property defined by PersonalCommunication is not valid? I assume the former...

bdarcus · February 7, 2007

OK, Dan.

Validation in RDF is different than in XML. See:

http://danbri.org/words/2005/07/30/114

In short, RDF has this "open world assumption" which expects merging of data, and so to say something is "invalid" per se makes no sense.

But I'm pretty sure that subclasses inherit the characteristics of their parents, including restrictions about the occurrence of properties (if you define them of course). In other words, in answer to your question "yes, the former."
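To make the "yes, the former" concrete, here is a toy sketch in plain Python (not an RDF library) of how a property declared on a parent class remains usable on its subclasses. EMail and PersonalCommunication come from Dan's question; the Communication class, the property names, and both helper functions are assumptions made up for illustration. Note that this only models inheritance, not validation, since under the open world assumption nothing is "invalid" per se.

```python
# Toy class hierarchy: subclass -> direct superclass.
# "Communication" is a hypothetical top-level class.
SUBCLASS_OF = {
    "EMail": "PersonalCommunication",
    "PersonalCommunication": "Communication",
}

# rdfs:domain-style declarations: property -> class it is declared on.
# All three property names are hypothetical.
DOMAIN = {
    "recipient": "PersonalCommunication",
    "medium": "Communication",
    "messageID": "EMail",
}


def superclasses(cls):
    """Return cls followed by all of its superclasses, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def usable_properties(cls):
    """Properties declared on cls or on any of its superclasses."""
    lineage = superclasses(cls)
    return sorted(p for p, owner in DOMAIN.items() if owner in lineage)


def inferred_classes(prop):
    """Classes an inferencer could assign to anything carrying prop."""
    return superclasses(DOMAIN[prop])
```

In this sketch an EMail picks up "recipient" and "medium" from its parents, and seeing a "recipient" property on some resource is enough to conclude that it is a PersonalCommunication (and therefore a Communication).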

To me an ideal model to follow for what we need here is DOAP: it's good, clean RDF, and there's a nice RELAX NG schema that validates nice, clean, XML (that is RDF, but most people wouldn't know it).

CloudofDust · February 7, 2007

Off the top of my head, some brainstorming re: Dan's UI needs:

"You still need a way to select a parent/publication from within another item"

In the current model, to enter a new citation in my word processor I open my biblio database, find the desired record, and then insert the citation referencing that record. If no record exists for the material I want to cite, I can create a new record within that same process.

Couldn't the selection of a "parent" item within a "child" item work along the same lines? I'm creating a new record for the "child" item; a command opens up a window that lets me search the database for the desired "parent"; if necessary, I can then add a new "parent" record into the database, and use that instead of a preexisting one.

"still need to prompt the user for what to do on deletes"

A warning message - "are you sure you want to delete this record?" - followed by another - "are you REALLY REALLY sure about this?" - followed by another - "this deletion will be IRREVERSIBLE and could screw up a lot of stuff - please turn back now before it's too late!"

More seriously, I would think that something to the effect of "this record is used by other records to create citations. Deleting it will render those other records inoperable. Do you really want to do this?" would do the trick.

[Edit: you could also offer the option to delete all child records along with the parent record - accompanied by even louder warnings]

"still need to determine whether the change to publication data in one item changes it in the parent"

This may be the crux of the matter. In a relational or semirelational model, the data for the parent item within the child item IS the original record for the parent. Or to be more precise, the child item record does not actually contain parent-specific data, but merely a marker telling the database which record to grab the parent data from when making citations.

There are different options for handling this in the UI: you could have a subform, such as you might find in an Access or FileMaker database, where you actually edit the parent record from within the form displaying the child record. Or you could have the parent-specific data appear in the child record in read-only fashion, with a command available to open the parent record for editing if desired. In either case, changes to the parent data (including the publication data) would be universal, because all manifestations of the parent data would refer back to the same original record.

Dan and bdarcus's comments about multiple tables are well taken. At the moment, I wish I had a separate "Place" table, so that I could correct "Cambridge, Mass." to "Cambridge, MA" universally, but I can appreciate that that small convenience might not be worth the extra labor required to make it happen (after all, how often do standard formats for place names change?)

One more question. Dan writes: "I would imagine that certain communication-based types might share the same logic for auto-generating a proper title field from the values of the creator types designated in the database as equivalent to "author" and "recipient" (whatever the type-specific names might be)."

Could you please explain what you mean by "auto-generating a proper title field" from the values for "author" and "recipient"? I'm not sure I see a need for a title field for communication-based types at all. The document is identified not by a title, but by the designation of author and recipient, and in some cases by a description: "memorandum," etc.

Or is your point that various reference types use a pattern like "[author] to [recipient]," and therefore this pattern could be built into a higher-level umbrella type, which would greatly expedite the creation of additional lower-level reference types that use that pattern?

dstillman · February 7, 2007

CloudOfDust — a couple quick things:

Re: "Cambridge, MA": Just to be clear, my point was that having a separate "Place" table in the database is wholly unrelated to correcting "Cambridge, Mass." to "Cambridge, MA" universally. That can be done at the moment in one line of SQL. It could be done for items with specific item types, for items created in the last two weeks, for items from a particular publisher...in other words, for items matching any arbitrary set of criteria. It's just a question of writing the SQL. But the current database design is quite capable of all of that. It just requires implementing the batch editing and search/replace features to let you do it in the interface. How it's stored is more or less beside the point.
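As a rough illustration of the "one line of SQL" point, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (one row per item and field, with fieldName and value columns) is an assumption for the example and is not meant to describe Zotero's actual schema; replace_field_value is a hypothetical helper.

```python
import sqlite3

# Hypothetical flat layout: one row per (item, field) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE itemData (itemID INTEGER, fieldName TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO itemData VALUES (?, ?, ?)",
    [
        (1, "place", "Cambridge, Mass."),
        (2, "place", "Cambridge, Mass."),
        (3, "place", "New York"),
        (1, "title", "Cambridge, Mass."),  # same text, different field
    ],
)


def replace_field_value(conn, field, old, new):
    """Replace exact matches of `old` in one field; return the rows changed."""
    cur = conn.execute(
        "UPDATE itemData SET value = ? WHERE fieldName = ? AND value = ?",
        (new, field, old),
    )
    conn.commit()
    return cur.rowcount


changed = replace_field_value(conn, "place", "Cambridge, Mass.", "Cambridge, MA")
```

Narrowing the change to a particular item type, date range, or publisher would just mean adding conditions to the WHERE clause; no separate "Place" table is involved.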

Re: "auto-generating a proper title field": I meant the latter. By title field for the communication-based types, I was merely referring to what is displayed, say, in the items list.
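A rough sketch of what such shared display logic might look like, in Python. The role mapping, the item layout, and the display_title function are all hypothetical; the only thing taken from the discussion is the "[author] to [recipient]" pattern and the idea that type-specific creator names map onto those two roles.

```python
# Hypothetical mapping from type-specific creator names to shared roles.
ROLE_EQUIVALENTS = {
    "author": "author",
    "recipient": "recipient",
    "interviewee": "author",
    "interviewer": "recipient",
}


def display_title(item):
    """Return the item's own title, or build one from its creators."""
    if item.get("title"):
        return item["title"]
    roles = {}
    for creator_type, name in item.get("creators", []):
        role = ROLE_EQUIVALENTS.get(creator_type)
        if role and role not in roles:
            roles[role] = name  # keep the first creator found for each role
    if "author" in roles and "recipient" in roles:
        return f"{roles['author']} to {roles['recipient']}"
    return roles.get("author", "[untitled]")


letter = {
    "itemType": "letter",
    "creators": [("author", "Theodor W. Adorno"), ("recipient", "Walter Benjamin")],
}
```

Because the function only looks at roles, any communication-based type whose creator names are mapped in ROLE_EQUIVALENTS would get the same behavior for free.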

CloudofDust · February 7, 2007

I see: so whether "Cambridge, Mass." or any other bit of data to be manipulated exists in a single record in a related table, or in lots of different records in an unrelated table, a script can be written that makes a universal change possible, and a UI feature can be designed that makes this function intuitive for the user. And more generally, in a flat database with lots of repetitive data, scripts can be created to duplicate many of the effects possible in a relational database, and to render them intuitive for the user.

Having said all that, it seems to me that, whether you put "parent" and "child" records in separate tables or one big table, the issues surrounding edits, deletion, etc. could be more easily addressed by inserting references to parent records within child records, rather than duplicating parent-specific data in each new child record (along the lines I described above, and in one of the emails I linked to above). But I'm not a developer, so that's easy for me to say.

dstillman · February 7, 2007

Yes, your first paragraph is correct.

For the second, you don't need separate tables to have parent/child relationships, and having separate tables doesn't make the edit/delete issues any easier, since the difficulties are in the UI. The parent data wouldn't be stored with the child—most simply, you'd just have a parent column in the items table, and on retrieval, the data layer would grab the data for both items.
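A minimal sketch of that idea, using Python's sqlite3 module. The column names (parentID, fieldName, value) and the get_item helper are assumptions for illustration, not the actual Zotero schema or data layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    itemID   INTEGER PRIMARY KEY,
    itemType TEXT NOT NULL,
    parentID INTEGER REFERENCES items(itemID)  -- NULL for top-level items
);
CREATE TABLE itemData (itemID INTEGER, fieldName TEXT, value TEXT);
""")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    (1, "book", None),
    (2, "bookSection", 1),
    (3, "bookSection", 1),
])
conn.executemany("INSERT INTO itemData VALUES (?, ?, ?)", [
    (1, "title", "An Edited Volume"),
    (1, "publisher", "Some Press"),
    (2, "title", "Chapter One"),
    (3, "title", "Chapter Two"),
])


def get_item(conn, item_id):
    """Load an item's own fields and, recursively, its parent's."""
    row = conn.execute(
        "SELECT itemType, parentID FROM items WHERE itemID = ?", (item_id,)
    ).fetchone()
    if row is None:
        raise KeyError(item_id)
    item_type, parent_id = row
    fields = dict(conn.execute(
        "SELECT fieldName, value FROM itemData WHERE itemID = ?", (item_id,)
    ))
    parent = get_item(conn, parent_id) if parent_id is not None else None
    return {"itemID": item_id, "itemType": item_type,
            "fields": fields, "parent": parent}
```

Both chapters see the same publisher through their parent, and a change to the book's row shows up in every child on the next retrieval, with books and chapters living in the same table.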

The other, more fundamental, issue is that, in our case, parents and children aren't necessarily different types of entities as far as an editing interface is concerned. You want to create a record for an edited book the same way you want to create a chapter—making them fundamentally different entities would just add unnecessary complexity to the code.

In a relational database, data doesn't need to be in separate tables for you to relate it.

bdarcus · February 7, 2007

Dan wrote: "In a relational database, data doesn't need to be in separate tables for you to relate it."

I'm not really an RDBMS expert (not even close), but while this might be true in the sense that you can do parent-child relationships within a table (indeed, that's how I'd prefer to model chapter-book relations), there is a general principle that one should avoid data duplication.

dstillman · February 7, 2007

Yes, but again, data duplication is still a completely separate issue from what we're (theoretically) discussing. As I said above, a separate table containing all unique values, linked to itemIDs in the itemData table, would remove all redundant data. It'd save a few bytes per associated item for longer field values (the difference between the integer and the string). It'd save no space whatsoever (or take up more space) for fields that were already integers or short strings. But you certainly don't need separate tables for each type of field or item type. And even for those rows in itemData linking the unique values to the individual items, you certainly don't need to duplicate them in the child items, as seems to be an assumption.
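For anyone trying to picture the unique-values table, here is a small sketch with Python's sqlite3 module. The itemDataValues and itemData table names come from the discussion above; the column names and the set_field helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE itemDataValues (
    valueID INTEGER PRIMARY KEY,
    value   TEXT UNIQUE
);
CREATE TABLE itemData (
    itemID    INTEGER,
    fieldName TEXT,
    valueID   INTEGER REFERENCES itemDataValues(valueID),
    PRIMARY KEY (itemID, fieldName)
);
""")


def set_field(conn, item_id, field, value):
    """Store a field by reference, reusing the value row if it already exists."""
    conn.execute("INSERT OR IGNORE INTO itemDataValues (value) VALUES (?)", (value,))
    (value_id,) = conn.execute(
        "SELECT valueID FROM itemDataValues WHERE value = ?", (value,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO itemData VALUES (?, ?, ?)",
        (item_id, field, value_id),
    )


for item_id in (1, 2, 3):
    set_field(conn, item_id, "publisher", "Harvard University Press")
```

Three items now share a single stored string; each itemData row carries only an integer. As noted above, that saves space for long values and nothing for short ones, and it is independent of whether items have parents.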

It would be beneficial to get back to the issues of UI design and data mapping.

CloudofDust · February 7, 2007

Agreed. In the interests of doing so, I'll repost the comments I made earlier about UI design, which have nothing to do with relational tables, etc.:

"You still need a way to select a parent/publication from within another item"

In the current model, to enter a new citation in my word processor I open my biblio database, find the desired record, and then insert the citation referencing that record. If no record exists for the material I want to cite, I can create a new record within that same process.

Couldn't the selection of a "parent" item within a "child" item work along the same lines? I'm creating a new record for the "child" item; a command opens up a window that lets me search the database for the desired "parent"; if necessary, I can then add a new "parent" record into the database, and use that instead of a preexisting one.

"still need to prompt the user for what to do on deletes"

A warning message - "are you sure you want to delete this record?" - followed by another - "are you REALLY REALLY sure about this?" - followed by another - "this deletion will be IRREVERSIBLE and could screw up a lot of stuff - please turn back now before it's too late!"

More seriously, I would think that something to the effect of "this record is used by other records to create citations. Deleting it will render those other records inoperable. Do you really want to do this?" would do the trick.

[Edit: you could also offer the option to delete all child records along with the parent record - accompanied by even louder warnings]

"still need to determine whether the change to publication data in one item changes it in the parent"

This may be the crux of the matter. In a relational or semirelational model [for "semirelational," please substitute whatever non-relational way of relating data you prefer], the data for the parent item within the child item IS the original record for the parent. Or to be more precise, the child item record does not actually contain parent-specific data, but merely a marker telling the database which record to grab the parent data from when making citations.

There are different options for handling this in the UI: you could have a subform, such as you might find in an Access or FileMaker database, where you actually edit the parent record from within the form displaying the child record. Or you could have the parent-specific data appear in the child record in read-only fashion, with a command available to open the parent record for editing if desired. In either case, changes to the parent data (including the publication data) would be universal, because all manifestations of the parent data would refer back to the same original record.

As a follow up, I'm wondering whether any of the above conflicts with Dan's statement that:

"The other, more fundamental, issue is that, in our case, parents and children aren't necessarily different types of entities as far as an editing interface is concerned. You want to create a record for an edited book the same way you want to create a chapter—making them fundamentally different entities would just add unnecessary complexity to the code."

This sounds more or less like what I described above - or at least what I thought I was describing. Or am I missing something?

Edit: to use Dan's example, creating the record for a chapter is basically the same process as creating a record for the book as a whole, except that creating the chapter record requires either creating a new (related) record for the book, or else selecting a pre-existing "book" record. There is a difference, certainly, but in my eyes it's not one that would complicate the UI very much - the UI for the chapter entry could look more or less like it does now, except that there would be an option either to insert data for a pre-existing book record or to enter a new book record from scratch.

bdarcus · February 7, 2007

OK, my two cents on Dan's UI questions (probably duplicating some of what CloudofDust says):

"You still need a way to select a parent/publication from within another item"

Auto-complete?

I'm imagining the more complicated case of entering book chapters. Let's say I start not with having an already-entered edited book, but just decide to enter the chapters.

So I choose a "chapter" type and get the fields. I get to the "book title" field and get auto-complete options. Since I've not yet entered the book, I won't see what I'm looking for. So I then enter the additional fields.

Alternately, perhaps the first field is the ISBN, and the data can then be auto-populated by pinging a server (I suppose a title field could do that too).

"still need to prompt the user for what to do on deletes"

I'd say deleting a child should never delete a parent. But deletion of a parent should bring a serious warning message.

"still need to determine whether the change to publication data in one item changes it in the parent"

Maybe if a user selects already-entered data (an existing parent record) the UI for those fields somehow indicates that, and one needs to explicitly choose to edit it (maybe by switching to the parent entry)?

CloudofDust · February 7, 2007

bdarcus writes: "So I choose a "chapter" type and get the fields. I get to the "book title" field and get auto-complete options. Since I've not yet entered the book, I won't see what I'm looking for. So I then enter the additional fields."

So by entering the additional fields in the "chapter" entry type, the user effectively creates a new record (or entry, or whatever you want to call it) for the book? So that the book in future can also be cited on its own, rather than as the "container" of a chapter? And so that when entering additional chapters, the book will appear with all the other books in the autocomplete list?

I think we're on the same page here, but I'm just double-checking for clarity.

erazlogo · February 7, 2007

Just to brainstorm on various UI issues, to follow from what bdarcus and CloudofDust are proposing--in theory would it be possible to set up the following interface? (this doesn't address hierarchy models, only UI).

(the following is relevant to this issue: "You still need a way to select a parent/publication from within another item")

Let's say I'm entering a new item--a letter published in a book:

Theodor W. Adorno to Walter Benjamin, 7 March 1938, in The Complete Correspondence, 1928-1940, ed. Henri Lonitz (Cambridge, Mass.: Harvard University Press, 1999), 240-242.

I will start by entering the fields available in the current "Letter" Zotero item type: Author, Recipient, Date of Letter (fields necessary in other cases would include type of letter, place sent, and place received).

Then instead of a repository field for archive I would get a line where I could choose a possible parent item type (let's call it a "container" type): Repository (as in archive), Book, Journal, Magazine, Newspaper, etc. The container type field would be a pulldown field like the menu currently reserved for author type. I choose "Book".

Then I would start typing a title, which would autocomplete from the book title field. If the book already exists, typing "The Complete Correspondence, 192..." would bring up the full title. If the book doesn't exist, I enter the entire title.

I then hit the return key. The interface expands to reveal all the fields for the "Book" item type. If the book already exists in the database, Zotero shows data from existing fields. (If two identical titles exist, I would get a prompt to choose one) If not, I go ahead and enter the new data. (A new item then should be created in Zotero just for that book.)

When I come back to this item later, all I see in "Info" is the letter fields plus the title for the book in the "container" line. Clicking on the title reveals all the fields for the book, which I can then edit.

This would be a great interface for entering letters, interviews, and artworks, because all of these have few item-specific fields (i.e. type for letter, medium for interview, medium and size for artwork) but all of these can be found in archives or published or reproduced in books, journals, magazines. This method could also be applied to book sections & articles in periodicals.

This does not resolve the question of what I'd do if I want to also record archival information for this same letter--do I enter it as a separate item or (better yet) is there a second or third "container" option available within the same record (just as I can now add more authors by clicking the "plus" on the right of the author's name)?

This interface also calls for a separate "Archival Collection" item type, which I think Zotero needs anyway because researchers may need to record information about a collection or archive in advance of actually going there and entering specific sources.

On these issues:

"still need to prompt the user for what to do on deletes" and "still need to determine whether the change to publication data in one item changes it in the parent"

I think users should be able to add and edit parent items directly from the child record, with some warning ("editing this field will alter the container item and all linked items")--it seems easier than opening a separate window. They also should be warned on deleting the item that has "child" items in the same way they are now prompted about deleting notes and files--just as currently when we choose not to delete a note we just end up with a standalone note, we'd end up with a letter/interview without a "container" item linked to it.
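The delete behavior described in this thread (a child never takes its parent with it; deleting a parent prompts first and, if confirmed, leaves the children standing alone) could be sketched roughly like this in Python. The dictionary layout, the delete_item function, and the confirm callback are all hypothetical stand-ins for whatever the real data layer and prompt would be.

```python
# Hypothetical in-memory library: itemID -> item, with an optional parent link.
library = {
    1: {"itemType": "book", "parent": None},
    2: {"itemType": "letter", "parent": 1},
    3: {"itemType": "letter", "parent": 1},
}


def delete_item(items, item_id, confirm):
    """Delete an item, asking `confirm(item_id, child_ids)` only if it has children.

    Returns True if the item was deleted, False if the user backed out.
    Children of a deleted parent are kept and become standalone items.
    """
    children = [i for i, item in items.items() if item.get("parent") == item_id]
    if children and not confirm(item_id, children):
        return False
    for child_id in children:
        items[child_id]["parent"] = None
    del items[item_id]
    return True
```

Offering the "delete the children too" option mentioned earlier would just be a second branch after the prompt.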

By the way, sometimes letters do have titles--newspapers and magazines often title their letters to the editor, and business memos often include a "subject" which would be useful to enter into a "letter title" field.

Josh · February 8, 2007

Over the past day and a half, Sean and I hammered out a first proposal for a new data model that might address most of the concerns expressed here. I've gone ahead and updated the Trac wiki page, so take a look. It bears a strong similarity to Bruce's current model, but with some important (to us) distinctions (more on that in a sec).

The Model

Rather than starting with abstract categories, we began with the current universe of Zotero item types and worked backwards. Basically, there are a few top-level item types (Document, Communication, Event, Agent), as well as several ancillary item types; at no point would we envision a user either creating a top-level or ancillary item type from scratch, nor would they show up in the middle "List" pane. Instead, users would create items from the "Item Type" list, each of which inherits properties of various kinds from parent elements (including the top-level and ancillary types).

A few important things to note:

We've created two catchall types for citable things, Document and Communication...the more we hammered at this, the more it seemed that they were fundamentally different animals: a Document is essentially a broadcast of some form, whether via electromagnetic waves or paper, and thus only has an author; a Communication has an intended audience (whether the interviewer or recipient) and thus has both a Sender and Receiver (nod to Shannon and Weaver here).

Every citable thing has two complex fields: Location and Reproduced in. These do the following:
- Location: A thing's Location is a place where you can find an instantiation of it. This can be a call number in a specific library, a URL for a webpage, or collection/series/box in an archive. An item can have multiple locations, as copies can be found in multiple places.
- Reproduced in: This lets users draw a particular kind of parent-child relationship. It's intended for cases where a given item is reproduced within another, e.g. a letter reprinted in a book, or an image embedded in a blog post.

Our goal was less philosophical purity than pragmatic functionalism, so we hardcoded a few things into the model (using the ancillary item types) rather than making sure that everything fit into a ur-type. For example, there's no collection item type (unlike Bruce's schema)...in the case of each complex relationship (i.e. journal-issue-article), we decided what the core item was, and then figured that things like book series and journal could just be other item types, rather than collections. Our metaphor was of things and then things that are pieces of things, rather than collections and things in them (hopefully that statement wasn't uselessly vague). While we know that this isn't a solution in a broad ontology sense, it would massively improve Zotero's architecture without precluding a more sophisticated (and perfect) system down the road.
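To give a feel for the shape of this, here is a very rough Python sketch of the two catchall types and the two complex fields. It is only an illustration of the description above, not the schema on the wiki page: the Citable base class, the Location fields, and all attribute names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    """One place where a copy of an item can be found."""
    kind: str   # e.g. "library", "url", "archive" (hypothetical labels)
    value: str  # call number, URL, collection/series/box, ...


@dataclass
class Citable:
    """Hypothetical common base: every citable thing has both complex fields."""
    title: str = ""
    locations: List[Location] = field(default_factory=list)  # many allowed
    reproduced_in: Optional["Citable"] = None                # at most one


@dataclass
class Document(Citable):
    """A broadcast of some form: it only has authors."""
    authors: List[str] = field(default_factory=list)


@dataclass
class Communication(Citable):
    """Something with an intended audience: a sender and a receiver."""
    sender: str = ""
    receiver: str = ""


book = Document(title="The Complete Correspondence, 1928-1940")
letter = Communication(
    sender="Theodor W. Adorno",
    receiver="Walter Benjamin",
    reproduced_in=book,
)
```

Concrete item types (Letter, Journal Article, and so on) would then inherit from these, which is the "users create items from the Item Type list" part of the proposal.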

Interface stuff

In terms of how this will be implemented in a UI, we're thinking along very similar lines to Elena. For example, imagine that I want to add a journal article: when I click the green new item button, moving over the "Article" option brings up a submenu of types: Newspaper, Magazine, or Journal. Choosing one creates a new item, with the standard Document fields (title, author(s), language, location(s), reproduced in) as well as two special fields, "publication" and "issue". These might be dropdowns, they might be auto-complete fields; regardless, "Issue" is grayed out initially, nudging the user to choose a publication. If the publication already exists within Zotero, its metadata populates the item pane; if not, then empty fields come up for its type (pre-filled with the current item type) and ISSN (if applicable). Once there's a publication, the "Issue" field is editable, and the same process applies.

For the "Location" information, it seems that the most familiar interface is the one we use in contact management programs; when I go into my Address Book, I have the option of adding multiple addresses (work, home, etc.) for a given contact; the "Locations" for an item should work the same way. One thing worth mentioning is that by separating out "Location" as a separate kind of data associated with an item, we open the door to custom location types; while we're wary of opening up user-customized item types for the time being, we're much less concerned about custom locations (which are less vital to interoperability, and which are more obviously esoteric to users and contexts), and so this opens a path by which we could comfortably enable custom locations without threatening the normalization of the data we care most about (the properties of an item itself).

As for the "Reproduced in" button, in UI it'll be equivalent to the "Add" button under the "Related" tab, except that a given item can only be reproduced in one item (if you've got multiple reproductions of it, then you've got multiple items, and should have a separate record for each reproduction, much the same way as we treat editions of books).

Okay, there's lots more to talk about, but this should at least offer the gist of our current thoughts - have at it, and feel free to add to the schema on the Trac wiki page if there are things we've missed...

bdarcus · February 8, 2007

Ugh, I just spent 30 minutes on a reply, but it got eaten! I don't have the energy to repeat it all, but will boil it down to:

1. we need to decide the policies for when something becomes a formalized "type."

I suggest we focus on a smart core with some hierarchy, and leave room for customization with a "genre" or "type" column/property. Some of your types are too specific in my view.

Maybe another wiki page for this?

Also, it's important to me for the RDF that we have a hierarchy of types (per my blog post).

2. I think ditching the collection and using the "Ancillary item types" is a bad idea. Why are you doing this again?