
    • CommentAuthorJosh
    • CommentTimeFeb 5th 2007
     
    I'd like to corral a few different threads of discussion here:

    http://forums.zotero.org/discussion/385/
    http://forums.zotero.org/discussion/78/
    ...and Bruce's useful prodding

    Right now, Zotero has what's called a "flat" data model - there are a set number of item types, and each one has a fixed number of fields. This makes for easy coding, and is the model that Endnote and virtually every other research management tool uses. As much as anything, this model grows out of the bibliographic citation; according to MLA or Chicago or whatever, there are a certain number of citation types, and if you're writing software that is ultimately about facilitating bibliography-creation, then the smart thing to do is to just implement those item types in a flat model.

    The problem, though, comes when you want to do more interesting things with your data than simply create references; there are *relationships* between items that aren't expressed in a flat model. If you have several chapters from an edited volume in your library, for example, you'd want the data model to reflect the fact that, while each chapter is its own discrete object, they also share a parent (the book itself). That way, you wouldn't have to repeatedly enter the same information for the book each time you added a new chapter to your library; you'd just link each new "Edited Volume Chapter" item to its parent collection. By adding parent-child relationships, you move from a flat to a "hierarchical" data model, which is more elegant, cleaner, and just all around more appealing.
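
    As a rough illustration of the difference, here is a minimal Python sketch. It is not Zotero's actual schema; the class names, field names and sample values are all invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


# Flat model: every chapter record repeats the book's details.
@dataclass
class FlatChapter:
    title: str
    author: str
    book_title: str
    book_editor: str
    publisher: str
    year: int


# Hierarchical model: the book is entered once and chapters link to it.
@dataclass
class Book:
    title: str
    editor: str
    publisher: str
    year: int


@dataclass
class Chapter:
    title: str
    author: str
    parent: Optional[Book] = None  # link to the shared parent item


volume = Book("An Edited Volume", "E. Editor", "Example Press", 2006)
chapters = [
    Chapter("First Chapter", "A. Author", parent=volume),
    Chapter("Second Chapter", "B. Author", parent=volume),
]

# Correcting the parent once is visible from every child.
volume.year = 2007
years = [c.parent.year for c in chapters]
```

    In the flat version, the same correction would have to be made by hand in every `FlatChapter` record.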

    It also raises a few really big issues, which is why Zotero's still using a flat data model.

    *Interface*

    Right now, we're pretty happy with the iTunes-like Zotero interface; there's a library, and items in a list within it. That list can be sorted in a few different ways, but it's fundamentally a flat list. There is a hierarchical relationship between items and snapshots/links/notes, but even that's been problematic. (Should notes only be listed underneath the items to which they're attached? What about freestanding notes? What if you want to browse your notes - what's the intuitive interface for toggling between modes?)

    A hierarchical data model raises huge interface questions which I haven't really seen solved anywhere. That may just mean that Zotero's going to be the app to do so, but it's going to be a long slog (we were lucky that Apple had spent huge amounts of time refining the iTunes interface to the point where we could just kludge many of its conventions). That's especially true if we're committed to creating software that's intuitive and understandable not just for us power users, but for the broader population of researchers who are less willing to approach a more outside-the-box interface.

    *Sharing*

    Bound up with many of the conversations around data models has been a second, important-to-distinguish issue: user-created fields/types. Right now, the Zotero data model is immutably fixed; there are a set number of item types, each with set fields. We're still cleaning these up and adding new ones (the "Blog Post" entry, for example, has a "Website Type" field which is about as useful as the human appendix, and which has a similarly vestigial explanation: the "Blog Post" item type grew out of the "Webpage" item type).

    Flat data models lend themselves more easily to data exchange - if the ontology of types/fields is the same across all Zotero users, I can share my library with you seamlessly. If I wanted to customize things a bit, I could do what Endnote does and add in a few "Custom Fields" which are essentially containers for any extra information I might want to tack on at the end of an item. There's a reason why all the popular social software sites (del.icio.us, CiteULike, Flickr, YouTube, etc.) operate with flat, immutable data models - to make different users' collections interoperable, you've got to have the same ontology.

    If it's not clear, interoperability is extremely important to us; we're planning a server that'll allow users to share, exchange and collaborate on Zotero collections, and we're incredibly excited by the possibility of visualization, recommendation and other newfangled bells and whistles that will transform the way that we as scholars relate to both our research and our colleagues. However, this only works if the data is interoperable.

    Now, I'm not trying to conflate user-defined fields with flat vs. hierarchical data models - they're two separate issues, and should be treated as such. However, if you want to enable users to tailor the software to their particular needs (which we do) because you think they know better than you what they'd need (which we do), a flat data model is *much* easier to deal with, because you're talking about a few extra, easily-excludable fields. If you start letting users create their own item types, interoperability becomes problematic, and if you enable the fitting together of existing and user-created item types in parent-child hierarchies, the likelihood of normalization drops off sharply. It's an awkward analogy, but you go from fitting square pegs of different lengths into square holes to dealing with pegs and holes of any size and shape imaginable.

    *Where we're at*

    When we started building Zotero, we had a few pragmatic goals: to build a tool that was more open and robust than Endnote and other bibliographic reference managers, to make its interface accessible to as many users as possible, and to build it with an eye toward enabling social-software-like functionality. When we ran into the question of what sort of data model to use, we made a tough decision - though it might not seem like it, we think *a lot* about the data model, and we spent quite a bit of time debating whether to go with a hierarchical model from the start.

    Ultimately, we decided to start off with a flat, fixed model, and focus our energies on the infrastructure that would allow users to get data into the system, as well as mechanisms for getting data back out (either in citations or via utilities, about which I'll post more elsewhere). We knew from the start that this was a compromise, and one that we'd have to address later on; right now, our plans are to get version 1.0 out the door, then tackle this problem headlong.

    There are two distinct pieces here: the hierarchical data model and user-defined types/fields. The current plan is to do hierarchical types first, with a defined ontology of types and fields. That's going to occupy a good amount of time, less for the implementation in the software itself (that'll be a pretty straightforward process) than for the interface design; figuring out exactly how to represent and work with hierarchical items isn't easy, and we've got a lot of work to do.

    On the second piece (user-created types/fields), we want these as badly as everyone else, but the moment that Zotero opens up the data ontology to users we irrevocably lose the ability to cleanly aggregate and share item data. One of the basic rules of software development is that you can always add features but never take them away, so we're going to tread cautiously and carefully to make sure we enable users to customize their personal data without threatening the interoperability of Zotero collections.

    With all of that said, this would be a good place to start a discussion on the strategy as I've laid it out, as well as the more focused questions raised elsewhere on the forums and dev list. I've also created a page on the Zotero Trac wiki (https://www.zotero.org/trac/wiki/HierarchicalOntology) on which to flesh out a new, hierarchical model - if we can come to a consensus of what this should look like, we'll be able to roll it into the post-1.0 Zotero much more quickly.
    • CommentAuthormkbergman
    • CommentTimeFeb 5th 2007
     
    Hi Josh,

    I'd like to humbly suggest that the project should distinguish between 1) the creation of standard data models and 2) canonical data formats. For example, within microformats, each format (XOXO, hCard, rel-tag, etc.) has a process for the community to agree upon data labels to promote interoperability, but all formats follow the same XML canonical format.

    For hierarchical relationships in Zotero the first question should thus be: what canonical form? OPML, XOXO or perhaps even an OWL variant, among others, could be candidates. I think it is always best to adopt the simplest canonical representation, because at later times converters can be written for translating to any exotic real-world variant. By this measure (and without further analysis or investigations), my first reaction would be to look strongly at OPML.

    The second matter, standard data models, I think is more central to the core of your posting. Here, within the bibliographic research and citation community (actually, there may be many depending on discipline!), it seems to me establishing specific standards efforts per your wiki posting is the right way to go. Only domain experts can settle internally on the semantics necessary for their domain. But, it is also likely the case that each domain will also prefer its own semantics (and not nearly so constraining or "global" in intent as, for example, the existing microformats standards).

    However, even in the case of different standard data models for different domains or communities, should such arise as they inevitably will, interoperability is not out of the question. Semantic mediation and resolving of semantic heterogeneities is an area of active research with many tools and promising approaches extant.

    Please note that by acknowledging semantic heterogeneity I am NOT encouraging it, and where specific communities can avoid it via standards and consensus, that is an easier choice for you and the team as developers. Thus, your call for contributions and dialog on this question is right on. But what I am saying is that once you as developers place such useful tools into the marketplace, others will embrace them and take them in different directions. Semantic heterogeneities leading to challenges in interoperability are unavoidable, especially if you as developers have done a great job (as you have done!) to provide a powerful framework useful to many constituencies.

    As the development stewards, then, focus on canonical formats. As domain or community stewards, focus on standard data models and semantic consistency.

    Since I first encountered Zotero I saw it as having very broad applicability beyond its current audience. I hope that the question of canonical data formats and format conversion tools can be kept separate from the setting of standard data models (by community or domain).

    I'd like to think more on this question and provide further input at a later time.

    Thanks, Mike
    • CommentAuthorMatthias
    • CommentTimeFeb 5th 2007
     
    Josh, thanks for your detailed post. I fully agree with you that a hierarchical data model poses many interface & compatibility issues which are far from being trivial. Your post very much describes the reasoning why we haven't adopted a hierarchical model for refbase yet (although this was discussed with Bruce and others and has been planned for a long time, see e.g. http://sourceforge.net/forum/forum.php?thread_id=1135397&forum_id=218757).

    While a hierarchical model has many advantages, a flat model definitely has some advantages that go beyond easy implementation & sharing. As an example, a flat model allows different users to use different variants for a particular container, e.g. while some of my colleagues want (or need) to cite a non-English book or journal with its original (say, German) title, others always cite the title translated into English. Yet another (French) researcher may need to cite the very same resource with a French translation of the title (e.g. for a French grant application). Surely, these things can all be modelled with a hierarchical model but they require *a lot* of thought. In addition, each use case has to be accounted for in the interface. By contrast, a flat model lets users use custom container values for their own resources.

    That said (and not thinking of any implementation details), I'd personally favour a solution that would let me choose at the item level whether I'd like to link a particular resource (like a book chapter) to its container item OR whether I'd like to enter custom container values for that particular resource. I.e., by default hierarchical relations would be encouraged by the software but I'd also be able to "unlink" individual items (of course, a link to a container item could still be stored by the software for unlinked items). With this in mind, we've thought about using a flat model alongside a hierarchical model for the refbase project. This would mean partially redundant data in the database but would allow us to use the best model for a given use case.

    Just some thoughts... I very much appreciate this discussion. Matthias
    • CommentAuthorMatthias
    • CommentTimeFeb 5th 2007
     
    Regarding the interface question on how to implement a GUI that facilitates a hierarchical data model: Consider how a relational database such as FileMaker is doing it. You can easily add data fields from a parent record to a child record's layout (think Zotero item types). Then, when editing a parent's field *within* one of the child records, the parent record is updated accordingly (unless prohibited by a pref setting). Similarly, any editing of the parent record will affect all of its child records.

    So, speaking of interface refinements to Zotero required for a hierarchical data model, I could imagine that most of the current GUI layout could stay exactly as it is now -- just include a means to link an item to its parent container (plus a means to create the container item if it doesn't exist yet), then distinguish the related container/parent fields with a clear visual cue within the item's type layout (maybe a different background color?).

    A lock icon (or something like that) could determine at the item level whether these related container/parent fields could be edited to change the parent item (and thus the parent fields of any other items within that container!). In addition, there could be a prominently placed checkbox (and notification message) that lets the user specify whether the editing of "container fields" will be applied globally or not. This could even solve the issues outlined in my previous post (flat & hierarchical data stored alongside each other).
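
    One way to picture the link/unlink behaviour described above is the rough Python sketch below. It is an assumption-laden illustration, not refbase or Zotero code: `Container`, `Item`, `linked`, `overrides` and both method names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Container:
    fields: Dict[str, str]


@dataclass
class Item:
    title: str
    container: Optional[Container] = None
    linked: bool = True  # hypothetical stand-in for the "lock"/"unlink" switch
    overrides: Dict[str, str] = field(default_factory=dict)

    def container_field(self, name: str) -> Optional[str]:
        # An unlinked item prefers its own custom value for a container field.
        if not self.linked and name in self.overrides:
            return self.overrides[name]
        # Otherwise read the shared value from the container, if there is one.
        if self.container is not None:
            return self.container.fields.get(name)
        return self.overrides.get(name)

    def edit_container_field(self, name: str, value: str, apply_globally: bool) -> None:
        if apply_globally and self.linked and self.container is not None:
            # Global edit: change the parent, and so every linked sibling.
            self.container.fields[name] = value
        else:
            # Local edit: unlink this item and store a custom value.
            # The reference to the container itself is kept.
            self.linked = False
            self.overrides[name] = value


book = Container({"title": "Ein Beispielband", "publisher": "Beispielverlag"})
first = Item("Chapter One", container=book)
second = Item("Chapter Two", container=book)

# One user cites the translated title for this item only.
second.edit_container_field("title", "An Example Volume", apply_globally=False)

# A global edit on a still-linked item updates the shared container.
first.edit_container_field("publisher", "Example Press", apply_globally=True)
```

    Here `second` keeps its own title but still picks up the new publisher, because only the fields it has overridden are detached from the container.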
    • CommentAuthorbdarcus
    • CommentTimeFeb 5th 2007
     
    Josh,

    A few comments first:

    "As much as anything, this model grows out of the bibliographic citation; according to MLA or Chicago or whatever, there are a certain number of citation types, and if you're writing software that is ultimately about facilitating bibliography-creation, then the smart thing to do is to just implement those item types in a flat model."

    I thank you for laying this out and starting the discussion, in part because it highlights some differences.

    This probably stems from us handling different ends of the development puzzle (you more about GUI, me about formatting configuration), and from the fact that as a user, I have quite demanding needs (in fact, not unlike an historian!).

    First, the style guides are not at all as focused on types as you assume. A whole lot of them are about classes of documents and relationships. They use types as illustrations.

    Second, while it might be easier for you to code, flat models also actually don't match a whole lot of real world citation needs. It is only a convenience for programmers (and because relational databases and RDF didn't exist when most current bibliographic applications were designed). The thread about primary sources illustrates this really well, so I won't repeat those arguments.

    Third, if you assume a flat list of types that can be extended, how do you handle a) data exchange, and b) formatting? In the latter case, do you require a style author to create templates for all 65 or so of the types that one can identify in Chicago?

    That's been my focus. CSL is designed around a mix of a hierarchical and flat model. Part of the reason for this is that the styles become both more compact and easier to debug, as well as more reliable. If there is no template for a given style, the formatting system is supposed to use one of the three generic fallbacks. These are determined based on the structural characteristics of the data.

    "The problem, though, comes when you want to do more interesting things with your data than simply create references; there are *relationships* between items that aren't expressed in a flat model."

    This is only one small part of the problem. It is also about the need to have both flexible and reliable data exchange and formatting.

    "Now, I'm not trying to conflate user-defined fields with flat vs. hierarchical data models - they're two separate issues, and should be treated as such. However, if you want to enable users to tailor the software to their particular needs (which we do) because you think they know better than you what they'd need (which we do), a flat data model is *much* easier to deal with, because you're talking about a few extra, easily-excludable fields."

    It might be easier for your programmers. But that's not to say that it's easier for formatting, or data exchange.

    Here's the practical issue:

    1. Zotero has its internal data model, and then another one it presents to its users
    2. CSL has its own data model
    3. the RDF model I designed -- and which you are partly using -- has still another data model, though designed in conjunction with CSL

    It is critical to future interoperability of all these pieces for the three models to be in sympathy. A CSL formatter has to know what to do with a citation source. It has to know which template to use. Otherwise, the citations won't get formatted correctly. And to do that, the data models MUST be consistent.

    Likewise with data exchange.

    Now, I'd like to complicate this a little by pointing out (as Matthias does) that one need not necessarily choose flat vs. relational. There can be ways to have both. For example, maybe you use a little configuration language to set up a "type" for a GUI, but keep a hierarchical structure for formatting and exchange. Just emphasizing this is no black-or-white choice.
    • CommentAuthorbdarcus
    • CommentTimeFeb 5th 2007 edited
     
    Oh, also Josh, you do realize I've spent a lot of time on an RDF model for this, right?

    http://purl.org/net/biblio

    Also, it's not "easy" at all!

    This is what I want us to agree on: what should this RDF model look like (for typing that can be used in RDF, and also elsewhere, like Atom), and how do we map it to CSL (and the GUI)?
    • CommentAuthorCloudofDust
    • CommentTimeFeb 5th 2007 edited
     
    First, thanks to Josh for initiating this discussion, and also for linking to that thread from last October, which helped answer some of my questions. This should be an interesting discussion.

    One of the reasons that the inadequacies of biblio programs became my hobby horse is that before I entered academia, I worked a bit with MS Access and learned some of the rudiments of relational database design. At one time, I wanted to use Access to design a relational database that would handle my notes and citations - but that's another story. The point is that when I started shopping for biblio/notetaking software, I was shocked to find that it was unheard of for such programs to have an authors table, titles table, publishers table, etc. Instead, as you've all noted, the standard is one big table with lots of repetitive data and data entry.

    These days, I do most of my academic writing with Nota Bene, which I like because of the sophistication of its word processor and the simplicity and user-friendliness of its text retrieval module, which is great for organizing research notes. And its biblio database module is adequate on its own terms, and very well integrated with the rest of the program. But I've also played around some with Library Master, which is the most user-customizable biblio program I've been able to find. Like everything else, it works on a flat, one big table model, but I found a way to make it create the kind of compound citations I needed in a way that approximated a relational database solution.

    I'm pasting in below an old email I sent to the Nota Bene users list, and forwarded to the LM users list, describing this method. I think that the underlying concept (inserting record markers for "containers" within the "container" fields of other records) could be a fairly straightforward way to address many of my own complaints about flat biblio databases, and I suspect that one could create a user-friendly interface for it without too much trouble (along the lines that Matthias mentions above). It would also help stop the proliferation of record types that bdarcus points to: Instead of 100 different record types, we could have 10 document types, and 10 container types, and achieve the same (or better) result. Consider that the only difference between citations of a journal article and an anthology chapter is the "container." Have a single "titled short work" type for the actual items, and then a journal type and an anthology type as choices for the container. Lots of neat possibilities.
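
    To give a feel for the record-marker idea in a flat table, here is a small Python sketch. It is only an approximation of the concept: the `@@id` marker syntax, the record ids and the field names are invented for this example and are not Library Master's or Nota Bene's actual conventions.

```python
import re

# Hypothetical marker syntax: "@@<record id>" typed into a container field.
MARKER = re.compile(r"^@@(\w+)$")

# One flat table of records; containers are just ordinary records.
records = {
    "anth1": {"type": "anthology", "title": "An Example Anthology", "container": ""},
    "jrnl1": {"type": "journal", "title": "Example Journal", "container": ""},
    "ch1": {"type": "short work", "title": "A Chapter", "container": "@@anth1"},
    "art1": {"type": "short work", "title": "An Article", "container": "@@jrnl1"},
    "ch2": {"type": "short work", "title": "Another Chapter", "container": "Some Book"},
}


def container_chain(record_id, seen=None):
    """Return titles from the item outward to its outermost container."""
    seen = set() if seen is None else seen
    if record_id in seen:
        raise ValueError(f"circular container reference at {record_id}")
    seen.add(record_id)

    record = records[record_id]
    chain = [record["title"]]
    match = MARKER.match(record["container"])
    if match:
        # The field holds a marker: follow it to the container's own record.
        chain += container_chain(match.group(1), seen)
    elif record["container"]:
        # The field holds plain text typed in by hand: use it as-is.
        chain.append(record["container"])
    return chain
```

    Note that `ch1` and `art1` share the single "short work" type; only the record their marker points to differs.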

    For the sake of making this thread more readable, I've replaced the full text of the email with the URL for the original:

    http://www.hamline.edu/~wnk/notabene/2003/msg05806.html
    (scroll down to the part labelled "Multi-level shortened citations")

    Here's another URL for a more recent commentary of mine (more specifically directed towards improving Nota Bene, but relevant to the more general problem of achieving these complex citations without a completely relational database):

    http://wnk.hamline.edu/mhonarc/notabene/msg00373.html

    I hope you find these comments interesting.
    • CommentAuthorraf
    • CommentTimeFeb 6th 2007
     
    To be perfectly honest, I sometimes find it hard to understand many of the issues outlined above. Nevertheless, I think I understood the gist of CloudofDust's initial posts, before the discussion migrated to the feature request threads.

    I am rather familiar with a few bibliographic databases and have come to use Zotero a lot of late. For the work I’m doing now, I have about 1400 entries of pre-1800 records.

    I just wanted to say that the concerns of CloudofDust are shared by others as well – and if Zotero would manage to address those concerns it would make my life much easier as well.

    Thanks to all the people who devote their time and energy to this,

    r
  1.  

    It's hard to describe how delighted I was to see this discussion rubric pop up the very day I subscribed to the forum. The issues of hierarchy and type are at the core of a bibliographic data quest that I hope Zotero can help us solve. Please forgive me if what follows seems idiosyncratic; rather, I hope it will serve as an interesting use case. Note: I've had to divide my comments into multiple posts, since there's evidently a character limit on comments that's irritatingly not advertized in advance.


    Context: Collaborative Ancient Geography


    I direct the Ancient World Mapping Center's Pleiades Project, which aims to create an online workspace for ancient geography. Our springboard is a legacy data collection assembled by the American Philological Association's Classical Atlas Project over the twelve-year period that culminated in the publication of the Barrington Atlas of the Greek and Roman World (R. Talbert, ed., Princeton, 2000). With startup funding from the National Endowment for the Humanities, our 2-person development team is crafting the supporting web application on top of the Plone content management system. Plone provides us with a customizable framework for the creation of structured content types, scriptable editorial workflow and an extensible user portal. Our goal is a compelling, intuitive environment in which community approaches will combine with academic-style editorial review, to enable anyone — from university professors to casual students of antiquity — to suggest updates to geographic names, descriptive essays, bibliographic references and geographic coordinates. Once vetted for accuracy and pertinence, these suggestions become a permanent, author-attributed part of future publications and data services.


    Issue: Bibliography in Pleiades


    The Ancient World Mapping Center has developed a bibliographic database that contains over 4,000 records deriving from the bibliographic citations that supported each mapped feature in the Barrington Atlas. These records have been augmented with data drawn from MARC records and added by hand, so that they now include alternate and translated titles (each coded for language), holdings information, standard numbers and more. This database must be migrated from its current, stand-alone configuration to operate seamlessly within the Pleiades web application as live content. Not only do we wish to harness the community's effort to capture new bibliographic information, we also require that each suggestion for an addition, deletion or change to a name, place or location be supported by either an explanatory essay or a bibliographic citation. Obviously, the accuracy of these citations, and the ease with which they can be created, are key concerns.


    To be continued ...

  2.  

    ... continued from previous post


    Plone Plugins for Bibliography


    There is an established plugin for bibliographic content in Plone: CMFBibliographyAT. It was primarily designed to facilitate the construction and management of personal citation lists in Plone. Its data model derives from BibTeX. After long consideration, we have elected not to use it. There are a couple of key reasons:


    • No support for, or developer interest in, hierarchical relationships and associated authority control (e.g., information about a journal must be re-entered in the record for each article; we would prefer the appropriate journal be picked from a list in the GUI or auto-assigned to incoming data on the basis of an ISSN or other standard identifier)
    • CMFBibliographyAT developer concern that the existing code base will not scale to a dataset the size and complexity of ours

    It looks as if we will need to develop our own Plone plugin for bibliography, just as we are doing for the geography and toponymy.


    The Pleiades Bibliographic Data Model


    The data model realized in the Mapping Center's bibliographic database is hierarchical, relational and idiosyncratic. In designing it (ca. 2000), we deliberately eschewed flat models (and those based directly on formats like USMARC) because we felt we needed the control of having a single record for each discrete work, whether it was an article, journal volume, journal, etc. Somehow, we missed the existence of the Functional Requirements for Bibliographic Records (FRBR) Final Report, published by the International Federation of Library Associations and Institutions in 1998. This report deliberately avoids recommending a data format. Rather, it lays out an entity-relationship model that aims to:


    provide a clear, precisely stated, and commonly shared understanding of what it is that the bibliographic record aims to provide information about, and what it is that we expect the record to achieve in terms of answering user needs ... and recommend a basic level of functionality and basic data requirements for records created by national bibliographic agencies.


    In comparing the FRBR model to our own, we find a high degree of commonality. In particular, FRBR's four main entities (works, expressions, manifestations and items) and their relationships with each other align almost perfectly with the semantic, hierarchical distinctions made in our model. Recently, we have come to the conclusion that we should adopt FRBR terminology for entities, attributes and relationships in our bibliographic plugin for Plone. This conclusion has been reinforced by the recent development of a Dublin Core application profile for scholarly works (i.e., e-prints) that takes as its starting point an application profile based on FRBR. See also: OCLC's FRBR Research Page and the FRBR Blog.


    Zotero and Pleiades


    Zotero is a killer app. After just a few minutes of Zotero use, it was clear to me that the Pleiades bibliographic plugin for Plone would need to integrate tightly with it. Data flows from Zotero to Pleiades, and vice versa, will add value to our content creation process and to our users' individual research agendas. We are taking a similar approach with Google Earth, to which we can already push our geographic data. Soon, users will be able to recommend new coordinates for sites by moving or creating placemarks in Google Earth and pushing these back to Pleiades. On the formats front, we would also be happy to see something like Zotero's RDF schema (building, as it does, on so many other useful efforts) evolve to handle hierarchy and other types, so that we can serialize feeds of bibliographic data for third-party syndication, just as we are doing for geographic data by way of geoRSS and KML.


    Questions for the Zotero Community


    In light of all this, I'd like to ask a few questions that are hopefully relevant to the current thread:


    • For hierarchy, has there been any consideration of FRBR? Should there be?
    • I'd like to hear more about current thinking with respect to Josh's comment: we're planning a server that'll allow users to share, exchange and collaborate on Zotero collections
    • Within the library cataloging community, there are descriptive vocabularies for types of works in bibliographic records (e.g., for data elements 06 and 07 in USMARC). Shouldn't these, as existing standards already embodied in many of the resources that Zotero scrapes, all be represented (or crosswalked to something) in the list of Zotero item types (for example, our data model sees USMARC "serial" as a superclass for "series" and "journal", but I don't see corresponding items in the Zotero list)?
    • CommentAuthorbdarcus
    • CommentTimeFeb 6th 2007
     
    "For hierarchy, has there been any consideration of FRBR? Should there be?"

    I've spent a lot of time looking into and prototyping FRBR-related data stuff. It's a killer data model that is really sound. But you pay for that in complexity. My conclusion is that it's too complicated for these use cases.

    That said, I helped a bit with the FRBR RDF work that Ian Davis (from Talis) and Rich Newman did, and more recently I've been referencing it from my RDF model. I blogged about this here:

    http://netapps.muohio.edu/blogs/darcusb/darcusb/archives/2006/09/11/plugging-into-frbr-killing-marc

    In my data model, BTW, I have journal as a subclass of periodical, which is a subclass of collection. MARC is good to learn from, but less-than-ideal otherwise. Library people think about bibliographic metadata differently, reflecting different priorities.
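
    The journal/periodical/collection subclassing can be pictured with a short Python sketch. This is an illustration only, not the RDF model or how CSL works: the `Newspaper` class, the `TEMPLATES` table and `template_for` are assumptions made up here to show why a class hierarchy helps a formatter fall back to a more generic template.

```python
class Collection:
    """Most generic container type."""


class Periodical(Collection):
    """A collection issued in successive parts."""


class Journal(Periodical):
    """A periodical with its own formatting template below."""


class Newspaper(Periodical):
    """Hypothetical type with no template of its own."""


# Hypothetical style definition: templates exist for only some types.
TEMPLATES = {
    Journal: "journal template",
    Collection: "generic collection template",
}


def template_for(container_type):
    # Walk from the most specific class up through its ancestors
    # and use the first one that has a template defined.
    for cls in container_type.__mro__:
        if cls in TEMPLATES:
            return TEMPLATES[cls]
    raise LookupError(f"no template for {container_type.__name__}")
```

    With this arrangement a style author would not need a template for every type; `template_for(Newspaper)` falls back to the generic collection template.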
  3.  
    A quick thought - it might help facilitate this conversation (and the inclusion in the discussion of non-techies, or even techies with different areas of expertise) to have a basic glossary of acronyms (CSL, RDF, etc.) and other key terms relevant to this issue. Maybe there is one already, and I just haven't seen it.

    I've gotten a handle on at least some of this vocabulary by browsing some of the developer documentation, where many of these concepts are explained, but the documentation is written for developers, and so many of the explanations remain rather opaque to the rest of us.

    So far, this has been a very interesting exchange of perspectives - let's keep it up.
    • CommentAuthorbdarcus
    • CommentTimeFeb 6th 2007
     
    Glossary:

    XML (eXtensible Markup Language) - a way to tag plain-text with structured information

    RDF (Resource Description Framework) - a standard (basically relational) data model for representing just about anything; builds on XML and the web (resources are named using URIs). Think of something like a relational database for the web. Well-suited to needs of Zotero.

    CSL (Citation Style Language) - an XML language to describe citation styles; written by me, with some help from one of the Zotero guys, and used in Zotero. There's some interest in standardizing it and using it within OpenDocument.
  4.  
    The following question may fall into the "obvious" category for many or most of you, but perhaps answering it will help familiarize us non-techies with some of the technical difficulties we're facing, and maybe kick-start some brainstorming about alternatives.

    Is an ordinary relational database a viable solution for biblio stuff? Why or why not?

    I'm picturing related tables for "people," "titles," "journals," "publishers," "places," etc. A GUI would present a data entry form more or less like current biblio programs, but the data would go into the various tables instead of into one big one. There would be an autocomplete feature where appropriate, both to facilitate data entry and to prevent duplication of data (when adding additional works by an author already in the database, the name would pop up, ensuring that you don't create multiple records for the same author).

    The record entry form would also contain tags to qualify how specific data would be used. For example, names from the "People" table could be tagged as author, editor, recipient, etc. These tags would then determine the appropriate formatting when it comes time to spit out citations. Similarly, titles could be tagged to designate their relation to the item being entered: item title, collection title, etc. We'd also need some way within the titles table to distinguish long work (i.e., underlined/italicized) titles from short work (quotation marks) titles from simple descriptions (with no special formatting at all) - or perhaps this could be handled by giving each of these title-types its own table.

    Different forms could be created for different record types, labelling the fields so as to make the process intuitive for the user - but this is basically what record types do now. The user experience would be more or less the same as at present, but the underlying relational structure would allow us to do a lot of neat stuff with the data (such as using "collection" records within citations of "document" records, etc.)

    I'm sure there are plenty of reasons why the above wouldn't work, but I don't know what they are, and would like to find out. I imagine the programming requirements are daunting, but it sounds like this will be true no matter what structure is ultimately chosen.

    I also realize that bdarcus is suggesting some middle ground between flat and relational solutions - I'd love to hear more about these ideas, and why they are more appealing than what I've just outlined.

    So fire away - I have no investment whatsoever in this scheme, nor am I advocating it as a surefire solution - think of it more as a starting point for a conversation that will, with any luck, go off in some unforeseen direction.
  5.  
    I mentioned this in one of the earlier threads, but to address CloudOfDust's question (and to clarify the discussion for others), the internal database structure used by Zotero is pretty much irrelevant to the issue at hand. How we store data internally is mostly an issue of disk space efficiency and performance. I'll just reproduce what I said on that other thread, as it's germane to this discussion:

    There's nothing inherent in the current database design that prevents there from being dependencies/relations and automatically incorporating parent fields in dependent items without duplicating data. Normalizing publications out to a separate table might make a few things easier (and other things more convoluted), but it doesn't avoid any of the issues that make this difficult. You still need a way to select a parent/publication from within another item, still need to prompt the user for what to do on deletes, still need to determine whether the change to publication data in one item changes it in the parent, etc. And you need a mechanism for relating items in complex ways regardless of whether publication data exists independently.


    The UI suggestions from Matthias above are a great start to answering these sorts of questions. More ideas along those lines would be most welcome.

    Aside from UI issues, the other major requirement is a hierarchical map of reference types of the sort Bruce has been working on, which would allow not only for proper CSL fallback but also for certain internal logic within Zotero—for example, I would imagine that certain communication-based types might share the same logic for auto-generating a proper title field from the values of the creator types designated in the database as equivalent to "author" and "recipient" (whatever the type-specific names might be). A planning document that arises from this discussion would ideally include descriptions of such logic.

    Bruce, does your RDF model account for actual fields or just types? (I don't see any mention of fields on the RDF schema page, so perhaps you haven't integrated them yet.) In particular, I'm wondering whether child reference types inherently contain all the fields of their parent types (though perhaps with different names).

    (CloudOfDust: We already do exactly what you suggest for creators and creator types, since, among other things, they require many-to-many relationships with items. Separating out the other fields into separate tables wouldn't provide any benefit and would simply make the code much more convoluted, as it would require adding custom logic for every single field (and there are quite a lot of fields). One thing that would be appropriate, from a data normalization perspective, would be to add an itemDataValues table with all the unique values and simply reference them in itemData—it would make working with the data somewhat more tedious and possibly make certain operations slower, but the overall increase in storage efficiency (at least for large libraries) would probably make it worth it. But, again, a separate matter from the issues at hand.)
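    As a rough illustration of the itemDataValues idea, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and the PLACE field ID are assumptions made up for this example and are not Zotero's actual schema; the point is only that each unique string is stored once and referenced by integer ID.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE itemDataValues (
        valueID INTEGER PRIMARY KEY,
        value   TEXT UNIQUE NOT NULL
    );
    CREATE TABLE itemData (
        itemID  INTEGER NOT NULL,
        fieldID INTEGER NOT NULL,
        valueID INTEGER NOT NULL REFERENCES itemDataValues(valueID),
        PRIMARY KEY (itemID, fieldID)
    );
""")

def set_field(item_id, field_id, value):
    """Store a field value, reusing an existing itemDataValues row if one exists."""
    conn.execute("INSERT OR IGNORE INTO itemDataValues (value) VALUES (?)", (value,))
    (value_id,) = conn.execute(
        "SELECT valueID FROM itemDataValues WHERE value = ?", (value,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO itemData (itemID, fieldID, valueID) VALUES (?, ?, ?)",
        (item_id, field_id, value_id),
    )

PLACE = 1  # hypothetical field ID
set_field(101, PLACE, "Cambridge, Mass.")
set_field(102, PLACE, "Cambridge, Mass.")
set_field(103, PLACE, "New York")

# Three item/field rows, but only two distinct strings are stored.
unique_values = conn.execute("SELECT COUNT(*) FROM itemDataValues").fetchone()[0]
rows = conn.execute("SELECT COUNT(*) FROM itemData").fetchone()[0]
```

    The saving only matters for longer strings; for integers or very short values the extra indirection costs as much as it saves, which matches the trade-off described above.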
    • CommentAuthorbdarcus
    • CommentTimeFeb 7th 2007
     
    Dan:

    "Bruce, does your RDF model account for actual fields or just types? (I don't see any mention of fields on the RDF schema page, so perhaps you haven't integrated them yet.)"

    The SVN version does include all the properties, though I think we still need some inter-project discussion of whether we define these all in the same namespace (as I have) or whether we reuse existing vocabularies (like DC) for that.

    "In particular, I'm wondering whether child reference types inherently contain all the fields of their parent types (though perhaps with different names)."

    Not sure what you mean here. Can you rephrase?

    In general, when we're talking about RDF and RDF Schema, we're talking about classes and properties. Properties can in turn define their domain, which is a class. In that case, it allows inferencers to come across a property (independent of any type statement) and say "ah ha, this property belongs to X class".

    Here's an example from the sbo.n3 file in my project SVN:

    sbo:title a owl:DatatypeProperty ;
        rdfs:label "title"@en ;
        rdfs:isDefinedBy sbo: ;
        owl:equivalentProperty dc:title ;
        rdfs:range xs:string .
    • CommentAuthorbdarcus
    • CommentTimeFeb 7th 2007
     
    Re: this question:

    "I'm picturing related tables for "people," "titles," "journals," "publishers," "places," etc."

    As Dan says, it's all really a question of trade-offs, but when I've worked on such models, I've tended to have tables for "agents" (includes publishers), "collections" (includes journals, archival collections, etc.), and "resources." I've always had "title" as a column for resources and collections, rather than a separate table (seems like overkill).
  6.  
    Bruce — re: fields:

    For example, if EMail is a subclass of PersonalCommunication, are all properties defined by PersonalCommunication valid for EMail as well, or is there some mechanism for specifying within EMail that some property defined by PersonalCommunication is not valid? I assume the former...
    • CommentAuthorbdarcus
    • CommentTimeFeb 7th 2007
     
    OK, Dan.

    Validation in RDF is different than in XML. See:

    http://danbri.org/words/2005/07/30/114

    In short, RDF has this "open world assumption" which expects merging of data, and so to say something is "invalid" per se makes no sense.

    But I'm pretty sure that subclasses inherit the characteristics of their parents, including restrictions about the occurrence of properties (if you define them of course). In other words, in answer to your question "yes, the former."

    To me an ideal model to follow for what we need here is DOAP: it's good, clean RDF, and there's a nice RELAX NG schema that validates nice, clean, XML (that is RDF, but most people wouldn't know it).
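    To make the "subclasses inherit from their parents" reading concrete, here is a small plain-Python analogy. It is not an RDF inferencer and does not model the open world assumption; the type map and the property names (sender, recipient, date, subject) are hypothetical and chosen only to mirror the EMail/PersonalCommunication example.

```python
# Hypothetical type map: child type -> parent type (None marks a root type).
SUBCLASS_OF = {
    "PersonalCommunication": None,
    "EMail": "PersonalCommunication",
}

# Properties declared directly on each type (illustrative names only).
DECLARED = {
    "PersonalCommunication": {"sender", "recipient", "date"},
    "EMail": {"subject"},
}

def properties_of(type_name):
    """Collect properties declared on a type and on every ancestor type."""
    props = set()
    while type_name is not None:
        props |= DECLARED.get(type_name, set())
        type_name = SUBCLASS_OF[type_name]
    return props

email_props = properties_of("EMail")
```

    In this sketch there is no way for EMail to drop a property it gets from PersonalCommunication, which corresponds to "the former" in the exchange above.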
    • CommentAuthorCloudofDust
    • CommentTimeFeb 7th 2007 edited
     
    Off the top of my head, some brainstorming re: Dan's UI needs:

    "You still need a way to select a parent/publication from within another item"

    In the current model, to enter a new citation in my word processor I open my biblio database, find the desired record, and then insert the citation referencing that record. If no record exists for the material I want to cite, I can create a new record within that same process.

    Couldn't the selection of a "parent" item within a "child" item work along the same lines? I'm creating a new record for the "child" item; a command opens up a window that lets me search the database for the desired "parent"; if necessary, I can then add a new "parent" record into the database, and use that instead of a preexisting one.

    "still need to prompt the user for what to do on deletes"

    A warning message - "are you sure you want to delete this record?" - followed by another - "are you REALLY REALLY sure about this?" - followed by another - "this deletion will be IRREVERSIBLE and could screw up a lot of stuff - please turn back now before it's too late!"

    More seriously, I would think that something to the effect of "this record is used by other records to create citations. Deleting it will render those other records inoperable. Do you really want to do this?" would do the trick.

    [Edit: you could also offer the option to delete all child records along with the parent record - accompanied by even louder warnings]

    "still need to determine whether the change to publication data in one item changes it in the parent"

    This may be the crux of the matter. In a relational or semirelational model, the data for the parent item within the child item IS the original record for the parent. Or to be more precise, the child item record does not actually contain parent-specific data, but merely a marker telling the database which record to grab the parent data from when making citations.

    There are different options for handling this in the UI: you could have a subform, such as you might find in an Access or FileMaker database, where you actually edit the parent record from within the form displaying the child record. Or you could have the parent-specific data appear in the child record in read-only fashion, with a command available to open the parent record for editing if desired. In either case, changes to the parent data (including the publication data) would be universal, because all manifestations of the parent data would refer back to the same original record.

    Dan and bdarcus's comments about multiple tables are well taken. At the moment, I wish I had a separate "Place" table, so that I could correct "Cambridge, Mass." to "Cambridge, MA" universally, but I can appreciate that that small convenience might not be worth the extra labor required to make it happen (after all, how often do standard formats for place names change?)

    One more question. Dan writes: "I would imagine that certain communication-based types might share the same logic for auto-generating a proper title field from the values of the creator types designated in the database as equivalent to "author" and "recipient" (whatever the type-specific names might be)."

    Could you please explain what you mean by "auto-generating a proper title field" from the values for "author" and "recipient"? I'm not sure I see a need for a title field for communication-based types at all. The document is identified not by a title, but by the designation of author and recipient, and in some cases by a description: "memorandum," etc.

    Or is your point that various reference types use a pattern like "[author] to [recipient]," and therefore this pattern could be built into a higher-level umbrella type, which would greatly expedite the creation of additional lower-level reference types that use that pattern?
  7.  
    CloudOfDust — a couple quick things:

    Re: "Cambridge, MA": Just to be clear, my point was that having a separate "Place" table in the database is wholly unrelated to correcting "Cambridge, Mass." to "Cambridge, MA" universally. That can be done at the moment in one line of SQL. It could be done for items with specific item types, for items created in the last two weeks, for items from a particular publisher...in other words, for items matching any arbitrary set of criteria. It's just a question of writing the SQL. But the current database design is quite capable of all of that. It just requires implementing the batch editing and search/replace features to let you do it in the interface. How it's stored is more or less beside the point.

    Re: "auto-generating a proper title field": I meant the latter. By title field for the communication-based types, I was merely referring to what is displayed, say, in the items list.
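    For illustration, the "one line of SQL" might look roughly like the UPDATE below, run here through Python's sqlite3 module. The itemData layout with a fieldName column is an assumption for this sketch, not the real Zotero schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat layout: one row per (item, field) pair.
conn.execute("CREATE TABLE itemData (itemID INTEGER, fieldName TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO itemData VALUES (?, ?, ?)",
    [
        (1, "place", "Cambridge, Mass."),
        (2, "place", "Cambridge, Mass."),
        (3, "place", "Chicago"),
        (3, "title", "Cambridge, Mass."),  # not a place, so it must stay untouched
    ],
)

# A single UPDATE restricted to the place field does the universal correction.
cur = conn.execute(
    "UPDATE itemData SET value = 'Cambridge, MA' "
    "WHERE fieldName = 'place' AND value = 'Cambridge, Mass.'"
)
changed = cur.rowcount

places = [row[0] for row in conn.execute(
    "SELECT value FROM itemData WHERE fieldName = 'place' ORDER BY itemID"
)]
```

    Narrowing the change to a given item type, date range, or publisher would just mean adding conditions to the WHERE clause.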
  8.  
    I see: so whether "Cambridge, Mass." or any other bit of data to be manipulated exists in a single record in a related table, or in lots of different records in an unrelated table, a script can be written that makes a universal change possible, and a UI feature can be designed that makes this function intuitive for the user. And more generally, in a flat database with lots of repetitive data, scripts can be created to duplicate many of the effects possible in a relational database, and to render them intuitive for the user.

    Having said all that, it seems to me that, whether you put "parent" and "child" records in separate tables or one big table, the issues surrounding edits, deletion, etc. could be more easily addressed by inserting references to parent records within child records, rather than duplicating parent-specific data in each new child record (along the lines I described above, and in one of the emails I linked to above). But I'm not a developer, so that's easy for me to say.
  9.  
    Yes, your first paragraph is correct.

    For the second, you don't need separate tables to have parent/child relationships, and having separate tables doesn't make the edit/delete issues any easier, since the difficulties are in the UI. The parent data wouldn't be stored with the child—most simply, you'd just have a parent column in the items table, and on retrieval, the data layer would grab the data for both items.

    The other, more fundamental, issue is that, in our case, parents and children aren't necessarily different types of entities as far as an editing interface is concerned. You want to create a record for an edited book the same way you want to create a chapter—making them fundamentally different entities would just add unnecessary complexity to the code.

    In a relational database, data doesn't need to be in separate tables for you to relate it.
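    A minimal sketch of the "parent column in the items table" idea, assuming a hypothetical items table with a parentID column (the names and the load_item helper are invented for this example). Books and chapters live in the same table, and the data layer fetches both with a self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical single table: parents and children are the same kind of row.
    CREATE TABLE items (
        itemID   INTEGER PRIMARY KEY,
        itemType TEXT NOT NULL,
        title    TEXT NOT NULL,
        parentID INTEGER REFERENCES items(itemID)
    );
    INSERT INTO items VALUES (1, 'book', 'An Edited Volume', NULL);
    INSERT INTO items VALUES (2, 'chapter', 'First Chapter', 1);
    INSERT INTO items VALUES (3, 'chapter', 'Second Chapter', 1);
""")

def load_item(item_id):
    """Return an item together with its parent's data, if it has a parent."""
    row = conn.execute(
        "SELECT c.title, c.itemType, p.title, p.itemType "
        "FROM items AS c LEFT JOIN items AS p ON p.itemID = c.parentID "
        "WHERE c.itemID = ?",
        (item_id,),
    ).fetchone()
    title, item_type, parent_title, parent_type = row
    item = {"title": title, "type": item_type, "parent": None}
    if parent_title is not None:
        item["parent"] = {"title": parent_title, "type": parent_type}
    return item

chapter = load_item(2)  # carries the book's data without storing a copy of it
book = load_item(1)     # a top-level item simply has no parent
```

    The book's title is stored once; both chapters pick it up at retrieval time, so editing row 1 changes what every chapter reports.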
    • CommentAuthorbdarcus
    • CommentTimeFeb 7th 2007
     
    Dan wrote: "In a relational database, data doesn't need to be in separate tables for you to relate it."

    I'm not really a RDBMS expert (not even close), but while this might be true in the sense that you can do parent-child relationships within a table (indeed, that's how I'd prefer to model chapter-book relations), there is a general principle that one should avoid data duplication.
  10.  
    Yes, but again, data duplication is still a completely separate issue from what we're (theoretically) discussing. As I said above, a separate table containing all unique values, linked to itemIDs in the itemData table, would remove all redundant data. It'd save a few bytes per associated item for longer field values (the difference between the integer and the string). It'd save no space whatsoever (or take up more space) for fields that were already integers or short strings. But you certainly don't need separate tables for each type of field or item type. And even for those rows in itemData linking the unique values to the individual items, you certainly don't need to duplicate them in the child items, as seems to be an assumption.

    It would be beneficial to get back to the issues of UI design and data mapping.
    • CommentAuthorCloudofDust
    • CommentTimeFeb 7th 2007 edited
     
    Agreed. In the interests of doing so, I'll repost the comments I made earlier about UI design, which have nothing to do with relational tables, etc.:

    "You still need a way to select a parent/publication from within another item"

    In the current model, to enter a new citation in my word processor I open my biblio database, find the desired record, and then insert the citation referencing that record. If no record exists for the material I want to cite, I can create a new record within that same process.

    Couldn't the selection of a "parent" item within a "child" item work along the same lines? I'm creating a new record for the "child" item; a command opens up a window that lets me search the database for the desired "parent"; if necessary, I can then add a new "parent" record into the database, and use that instead of a preexisting one.

    "still need to prompt the user for what to do on deletes"

    A warning message - "are you sure you want to delete this record?" - followed by another - "are you REALLY REALLY sure about this?" - followed by another - "this deletion will be IRREVERSIBLE and could screw up a lot of stuff - please turn back now before it's too late!"

    More seriously, I would think that something to the effect of "this record is used by other records to create citations. Deleting it will render those other records inoperable. Do you really want to do this?" would do the trick.

    [Edit: you could also offer the option to delete all child records along with the parent record - accompanied by even louder warnings]

    "still need to determine whether the change to publication data in one item changes it in the parent"

    This may be the crux of the matter. In a relational or semirelational model [for "semirelational," please substitute whatever non-relational way of relating data you prefer], the data for the parent item within the child item IS the original record for the parent. Or to be more precise, the child item record does not actually contain parent-specific data, but merely a marker telling the database which record to grab the parent data from when making citations.

    There are different options for handling this in the UI: you could have a subform, such as you might find in an Access or FileMaker database, where you actually edit the parent record from within the form displaying the child record. Or you could have the parent-specific data appear in the child record in read-only fashion, with a command available to open the parent record for editing if desired. In either case, changes to the parent data (including the publication data) would be universal, because all manifestations of the parent data would refer back to the same original record.

    As a follow up, I'm wondering whether any of the above conflicts with Dan's statement that:

    "The other, more fundamental, issue is that, in our case, parents and children aren't necessarily different types of entities as far as an editing interface is concerned. You want to create a record for an edited book the same way you want to create a chapter—making them fundamentally different entities would just add unnecessary complexity to the code."

    This sounds more or less like what I described above - or at least what I thought I was describing. Or am I missing something?

    Edit: to use Dan's example, creating the record for a chapter is basically the same process as creating a record for the book as a whole, except that creating the chapter record requires either creating a new (related) record for the book, or else selecting a pre-existing "book" record. There is a difference, certainly, but in my eyes it's not one that would complicate the UI very much - the UI for the chapter entry could look more or less like it does now, except that there would be an option either to insert data for a pre-existing book record or to enter a new book record from scratch.
    • CommentAuthorbdarcus
    • CommentTimeFeb 7th 2007
     
    OK, my two cents on Dan's UI questions (probably duplicating some of what CloudofDust says):

    "You still need a way to select a parent/publication from within another item"

    Auto-complete?

    I'm imagining the more complicated case of entering book chapters. Let's say I start not with having an already-entered edited book, but just decide to enter the chapters.

    So I choose a "chapter" type and get the fields. I get to the "book title" field and get auto-complete options. Since I've not yet entered the book, I won't see what I'm looking for. So I then enter the additional fields.

    Alternately, perhaps the first field is the ISBN, and the data can then be auto-populated by pinging a server (I suppose a title field could do that too).

    "still need to prompt the user for what to do on deletes"

    I'd say deleting a child should never delete a parent. But deletion of a parent should bring a serious warning message.

    "still need to determine whether the change to publication data in one item changes it in the parent"

    Maybe if a user selects already-entered data (an existing parent record) the UI for those fields somehow indicates that, and one needs to explicitly choose to edit it (maybe by switching to the parent entry)?
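    One possible reading of the delete rules above, sketched in Python. The dictionary-based library, the ConfirmationRequired exception, and the choice to orphan children rather than delete them are all assumptions for illustration; the thread has not settled on a policy.

```python
class ConfirmationRequired(Exception):
    """Raised when a delete needs explicit confirmation from the user."""

def delete_item(parents, item_id, confirmed=False):
    """Delete an item from a hypothetical {itemID: parentID} library.

    Deleting a child never touches its parent. Deleting a parent requires
    confirmation, and its children are kept as standalone items.
    """
    children = [i for i, p in parents.items() if p == item_id]
    if children and not confirmed:
        raise ConfirmationRequired(
            f"item {item_id} is the parent of {len(children)} item(s)"
        )
    for child in children:
        parents[child] = None  # orphan the child rather than deleting it
    del parents[item_id]

library = {1: None, 2: 1, 3: 1}  # item 1 is a book; 2 and 3 are its chapters
delete_item(library, 2)          # deleting a child leaves the parent alone

try:
    delete_item(library, 1)      # parent with a remaining child: warn first
    warned = False
except ConfirmationRequired:
    warned = True

delete_item(library, 1, confirmed=True)  # user accepted the warning
```

    In a real interface the exception would become the warning dialog, and "delete the children too" could be offered as a second, louder option.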
  11.  
    bdarcus writes: "So I choose a "chapter" type and get the fields. I get to the "book title" field and get auto-complete options. Since I've not yet entered the book, I won't see what I'm looking for. So I then enter the additional fields."

    So by entering the additional fields in the "chapter" entry type, the user effectively creates a new record (or entry, or whatever you want to call it) for the book? So that the book in future can also be cited on its own, rather than as the "container" of a chapter? And so that when entering additional chapters, the book will appear with all the other books in the autocomplete list?

    I think we're on the same page here, but I'm just double-checking for clarity.
    • CommentAuthorerazlogo
    • CommentTimeFeb 7th 2007 edited
     
    Just to brainstorm on various UI issues, to follow from what bdarcus and CloudofDust are proposing--in theory would it be possible to set up the following interface? (this doesn't address hierarchy models, only UI).

    (the following is relevant to this issue: "You still need a way to select a parent/publication from within another item")

    Let's say I'm entering a new item--a letter published in a book:

    Theodor W. Adorno to Walter Benjamin, 7 March 1938, in The Complete Correspondence, 1928-1940, ed. Henri Lonitz (Cambridge, Mass.: Harvard University Press, 1999), 240-242.

    I will start by entering the fields available in the current "Letter" Zotero item type: Author, Recipient, Date of Letter (fields necessary in other cases would include type of letter, place sent, and place received).

    Then instead of a repository field for archive I would get a line where I could choose a possible parent item type (let's call it a "container" type): Repository (as in archive), Book, Journal, Magazine, Newspaper, etc. The container type field would be a pulldown field like the menu currently reserved for author type. I choose "Book".

    Then I would start typing a title, which would autocomplete from the book title field. If the book already exists, typing "The Complete Correspondence, 192..." would bring up the full title. If the book doesn't exist, I enter the entire title.

    I then hit the return key. The interface expands to reveal all the fields for the "Book" item type. If the book already exists in the database, Zotero shows data from existing fields. (If two identical titles exist, I would get a prompt to choose one) If not, I go ahead and enter the new data. (A new item then should be created in Zotero just for that book.)

    When I come back to this item later, all I see in "Info" is the letter fields plus the title for the book in the "container" line. Clicking on the title reveals all the fields for the book, which I can then edit.

    This would be a great interface for entering letters, interviews, and artworks, because all of these have few item-specific fields (i.e. type for letter, medium for interview, medium and size for artwork) but all of these can be found in archives or published or reproduced in books, journals, magazines. This method could also be applied to book sections & articles in periodicals.

    This does not resolve the question of what I'd do if I want to also record archival information for this same letter--do I enter it as a separate item or (better yet) is there a second or third "container" option available within the same record (just as I can now add more authors by clicking the "plus" on the right of the author's name)?

    This interface also calls for a separate "Archival Collection" item type, which I think Zotero needs anyway because researchers may need to record information about a collection or archive in advance of actually going there and entering specific sources.

    On these issues:

    "still need to prompt the user for what to do on deletes" and "still need to determine whether the change to publication data in one item changes it in the parent"

    I think users should be able to add and edit parent items directly from the child record, with some warning ("editing this field will alter the container item and all linked items")--it seems easier than opening a separate window. They also should be warned on deleting the item that has "child" items in the same way they are now prompted about deleting notes and files--just as currently when we choose not to delete a note we just end up with a standalone note, we'd end up with a letter/interview without a "container" item linked to it.

    By the way, sometimes letters do have titles--newspapers and magazines often title their letters to the editor, and business memos often include a "subject" which would be useful to enter into a "letter title" field.
    • CommentAuthorJosh
    • CommentTimeFeb 8th 2007
     
    Over the past day and a half, Sean and I hammered out a first proposal for a new data model that might address most of the concerns expressed here. I've gone ahead and updated the Trac wiki page, so take a look. It bears a strong similarity to Bruce's current model, but with some important (to us) distinctions (more on that in a sec).

    The Model



    Rather than starting with abstract categories, we began with the current universe of Zotero item types and worked backwards. Basically, there are a few top-level item types (Document, Communication, Event, Agent), as well as several ancillary item types; at no point would we envision a user either creating a top-level or ancillary item type from scratch, nor would they show up in the middle "List" pane. Instead, users would create items from the "Item Type" list, each of which inherits properties of various kinds from parent elements (including the top-level and ancillary types).

    A few important things to note:
    • We've created two catchall types for citable things, Document and Communication...the more we hammered at this, the more it seemed that they were fundamentally different animals: a Document is essentially a broadcast of some form, whether via electromagnetic waves or paper, and thus only has an author; a Communication has an intended audience (whether the interviewer or recipient) and thus has both a Sender and Receiver (nod to Shannon and Weaver here).

    • Every citable thing has two complex fields: Location and Reproduced in. These do the following:

      • Location: A thing's Location is a place where you can find an instantiation of it. This can be a call number in a specific library, a URL for a webpage, or collection/series/box in an archive. An item can have multiple locations, as copies can be found in multiple places.

      • Reproduced in lets users draw a particular kind of parent-child relationship. It's intended to be used in cases where a given item is reproduced within another, e.g. a letter reprinted in a book, or an image embedded in a blog post. Among other things, this makes the generation of citations for such items a straightforward recursive exercise.

    • Our goal was less philosophical purity than pragmatic functionalism, so we hardcoded a few things into the model (using the ancillary item types) rather than making sure that everything fit into an ur-type. For example, there's no collection item type (unlike Bruce's schema)...in the case of each complex relationship (i.e. journal-issue-article), we decided what the core item was, and then figured that things like book series and journal could just be other item types, rather than collections. Our metaphor was of things and then things that are pieces of things, rather than collections and things in them (hopefully that statement wasn't uselessly vague). While we know that this isn't a solution in a broad ontology sense, it would massively improve Zotero's architecture without precluding a more sophisticated (and perfect) system down the road.
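    To show what "a straightforward recursive exercise" could mean for Reproduced in, here is a minimal Python sketch. The Item class, its fields, and the ", in " joining rule are assumptions for illustration only; real output would be driven by the citation style. The example data is the Adorno letter cited earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    """Hypothetical citable thing with at most one 'Reproduced in' parent."""
    label: str
    reproduced_in: Optional["Item"] = None

def cite(item):
    """Build a citation by recursing through the 'Reproduced in' chain."""
    if item.reproduced_in is None:
        return item.label
    return f"{item.label}, in {cite(item.reproduced_in)}"

book = Item("The Complete Correspondence, 1928-1940")
letter = Item(
    "Theodor W. Adorno to Walter Benjamin, 7 March 1938",
    reproduced_in=book,
)
citation = cite(letter)
```

    Because each item points to at most one container, the recursion always terminates as long as no item is (directly or indirectly) reproduced in itself.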


    Interface stuff



    In terms of how this will be implemented in a UI, we're thinking along very similar lines to Elena. For example, imagine that I want to add a journal article: when I click the green new item button, moving over the "Article" option brings up a submenu of types: Newspaper, Magazine, or Journal. Choosing one creates a new item, with the standard Document fields (title, author(s), language, location(s), reproduced in) as well as two special fields, "publication" and "issue". These might be dropdowns, they might be auto-complete fields; regardless, "Issue" is grayed out initially, nudging the user to choose a publication. If the publication already exists within Zotero, its metadata populates the item pane; if not, then empty fields come up for its type (pre-filled with the current item type) and ISSN (if applicable). Once there's a publication, the "Issue" field is editable, and the same process applies.

    For the "Location" information, it seems that the most familiar interface is the one we use in contact management programs; when I go into my Address Book, I have the option of adding multiple addresses (work, home, etc.) for a given contact; the "Locations" for an item should work the same way. One thing worth mentioning is that by separating out "Location" as a separate kind of data associated with an item, we open the door to custom location types; while we're wary of opening up user-customized item types for the time being, we're much less concerned about custom locations (which are less vital to interoperability, and which are more obviously esoteric to users and contexts), and so this opens a path by which we could comfortably enable custom locations without threatening the normalization of the data we care most about (the properties of an item itself).

    As for the "Reproduced in" button, in UI it'll be equivalent to the "Add" button under the "Related" tab, except that a given item can only be reproduced in one item (if you've got multiple reproductions of it, then you've got multiple items, and should have a separate record for each reproduction, much the same way as we treat editions of books).

    Okay, there's lots more to talk about, but this should at least offer the gist of our current thoughts - have at it, and feel free to add to the schema on the Trac wiki page if there are things we've missed...
    • CommentAuthorbdarcus
    • CommentTimeFeb 8th 2007
     
    Ugh, I just spent 30 minutes on a reply, but it got eaten! I don't have the energy to repeat it all, but will boil it down to:

    1. we need to decide the policies for when something becomes a formalized "type."

    I suggest we focus on a smart core with some hierarchy, and leave room for customization with a "genre" or "type" column/property. Some of your types are too specific in my view.

    Maybe another wiki page for this?

    Also, it's important to me for the RDF that we have a hierarchy of types (per my blog post).

    2. I think ditching the collection and using the "Ancillary item types" is a bad idea. Why are you doing this again?
  12.  
    A couple of questions re: Josh & Sean's schema. First, how does this model address the fact that in many cases a "reproduced in"-type item (such as a collection of documents) is also a specific document in its own right (such as a book, or journal article, or blog, etc.)? Will the "reproduced in" designation simply point to the original document record for the book, article, etc.? And does the distinction between regular and "ancillary" types complicate this at all (since some "reproduced in" types are ancillary, and some are not)?
    • CommentAuthorMatthias
    • CommentTimeFeb 8th 2007 edited
     
    Thanks Josh and Sean for working on this! Like Bruce I have problems with the distinction between ancillary item types and actual item types. There are use cases where it would be handy to enter the ancillary item types as actual items (i.e. stand-alone database entries). So I would simply take away this distinction and treat all item types as equal, but then allow to relate these item types freely with each other.

    Also, I must admit that, as a third-party developer, I'd like to agree upon a hierarchical model that would not only work for Zotero but also for other bibliographic applications.

    That said, and not thinking of any Zotero implementation details, please allow me to go back to a more conceptual model for brainstorming purposes. Personally, it helps me to think of relationships like this:

    First of all, all items are on the same level and can be *freely related* with each other (this is very important if the model wants to address all different kind of needs).

    Speaking of relationships, I think of "classes", "subclasses", "items" and "item properties" (more about properties below).

    In case of "classes", I'm thinking of the main basic elements that occur in every bibliographic citation/reference. "Subclasses" are contained within these basic classes and would usually work as fallback elements when generating citations. "Items" would be contained within subclasses and would (in case of "resources" & "events") represent the actual database entries (think Zotero item types).

    So, following this markup scheme:

    class:
    - subclass: item1, item2, item3, ...

    I think of these classes, subclasses and items:

    Agents:
    - Person: author, editor, translator, inventor, contributor, recipient, ...
    - Organization: publisher, authority, ...
    - ...

    Resources:
    - Collection: periodical, series, archive, internet site, proceeding, ...
    - Document: article, book, section, chapter, thesis, statute, map, image, blog, ...
    - Communication: letter, email, instant message, interview, ...
    - Broadcast: radio, television, podcast, ...
    - ...

    Events:
    - Conference
    - Legal Case: brief, decision
    - Hearing
    - Expedition
    - ...

    Places:
    - Country
    - City
    - ...

    Dates:
    - Year
    - Month
    - Day
    - ...

    Note that there is no dedicated "Collections" class, since collections are IMHO simply a resource that can contain other resources, so basically collections are resources as well. Therefore, I'd consider them as a subclass of the "Resources" class. Take a book as an example - would this be a collection or a (document) resource? The answer is, of course: it depends. A book can be considered as a stand-alone document as well as a container of book chapters -- or, as a container of book sections which in turn are containers of book chapters. I.e., depending on the situation, a book can be considered as a collection or as a document, but it's in any case a resource.
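
To make the class / subclass / item outline above concrete, here is a small Python sketch that encodes part of it as nested dictionaries. The `TAXONOMY` and `locate` names are hypothetical, and only a few entries from the outline are included.

```python
# Hypothetical encoding of part of the class -> subclass -> items outline above.
TAXONOMY = {
    "Agents": {
        "Person": ["author", "editor", "translator"],
        "Organization": ["publisher", "authority"],
    },
    "Resources": {
        "Collection": ["periodical", "series", "archive"],
        "Document": ["article", "book", "chapter", "thesis"],
        "Communication": ["letter", "email", "interview"],
    },
}


def locate(item_name):
    """Return every (class, subclass) pair under which an item name is listed."""
    return [
        (cls, sub)
        for cls, subclasses in TAXONOMY.items()
        for sub, items in subclasses.items()
        if item_name in items
    ]


print(locate("book"))
```

If "book" were also listed under "Collection", `locate` would return two pairs, both under "Resources", which is one way to read the point that a book is a resource whichever way you look at it.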

    Speaking of "items", all items could have multiple "properties", such as "title", "name", "language", "locator", "identifier" or "descriptor". A qualifier (i.e. the elements after the colon below) could specify in more detail the nature of that property, e.g.:

    - name: given, family, suffix, display, sort, ...
    - title: long, short, abbreviated, translated, alternate, descriptive, ...
    - language: fulltext, summary, ...
    - locator: volume, issue, pages, edition, code, patent number, ...
    - identifier: url, doi, issn, isbn, pmid, lccn, archive id, local call number, ...
    - descriptor: keyword/tag, category/group, ...
    - ...
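
One possible shape for these qualified properties, sketched in Python. The `Property` and `Item` classes and the `get` helper are assumptions for illustration, and the sample values are placeholders rather than real records.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Property:
    kind: str        # e.g. "title", "identifier", "locator"
    qualifier: str   # e.g. "short", "doi", "pages"
    value: str


@dataclass
class Item:
    item_type: str
    properties: list = field(default_factory=list)

    def get(self, kind, qualifier=None):
        """Return values for a property kind, optionally narrowed by qualifier."""
        return [
            p.value
            for p in self.properties
            if p.kind == kind and (qualifier is None or p.qualifier == qualifier)
        ]


# Placeholder data, not a real article or DOI.
article = Item("article", [
    Property("title", "long", "A Long Example Title"),
    Property("title", "short", "Example Title"),
    Property("identifier", "doi", "10.1000/example"),
])
print(article.get("title", "short"))
```

Keeping the qualifier as data rather than as a separate field per variant means a new qualifier (say, another kind of identifier) needs no schema change.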

    In addition, a universal "type" property could always specify in more detail the actual nature of the item. For example, a periodical could have one of these type properties:

    - journal
    - court reporter
    - magazine
    - newspaper
    - ...

    Or a thesis item could have one of following type properties:

    - dissertation
    - master thesis
    - bachelor thesis
    - ...

    And all place and date items could have a type property such as:

    - published
    - presented
    - sent
    - received
    - ...

    In theory, all items from all classes could be related freely with each other to form a citation that suits the user's needs.

    More practically, however, the software could provide "relationship templates" for typical use cases (say, a letter within an archive, or a book within a series, etc). This would be the equivalent to Zotero's current item types. The user would select one of these templates from a dropdown menu. Zotero would prefill the edit mask with all necessary fields but would visually distinguish between item-specific and container-specific fields. As discussed by others earlier in this thread, the contents of container-specific fields would make a separate database record.

    Note that it should also be possible to nest a container within a higher-level container. This would be required to properly cite e.g. a book chapter within a book within a book series.
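
A minimal Python sketch of what such a "relationship template" might look like, including the nested case of a chapter within a book within a series. The template name, field lists, and `build_records` helper are all hypothetical.

```python
# Hypothetical templates: each lists the item's own fields first, then one entry
# per container level whose fields would be stored as a separate record.
TEMPLATES = {
    "chapter-in-book-in-series": [
        ("chapter", ["title", "author", "pages"]),
        ("book", ["title", "editor", "publisher"]),
        ("series", ["title", "series editor"]),
    ],
}


def build_records(template_name, values):
    """Create one record per level and link each record to its container."""
    records = []
    for level, fields in TEMPLATES[template_name]:
        record = {"type": level, "container": None}
        for name in fields:
            # Fields the user left empty are prefilled as blank strings.
            record[name] = values.get((level, name), "")
        records.append(record)
    # Link innermost to outermost: chapter -> book -> series.
    for child, parent in zip(records, records[1:]):
        child["container"] = parent
    return records[0]


chapter = build_records("chapter-in-book-in-series", {
    ("chapter", "title"): "An Example Chapter",
    ("book", "title"): "An Example Book",
    ("series", "title"): "An Example Series",
})
print(chapter["container"]["container"]["title"])
```

The template only drives which fields are shown and how the records are linked; the records themselves are ordinary items that could be reused by other chapters in the same book.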

    I hope that my thoughts make sense to you.
    • CommentAuthorMatthias
    • CommentTimeFeb 8th 2007
     
    Bruce said:
    "we need to decide the policies for when something becomes a formalized 'type.'"

    As per my previous post, I'd prefer if item types would merely be a pre-made set of relationships. I.e. Zotero would offer the most used citation cases as named sets of relationships (as Zotero does now via its item types). Choosing one of these sets from a dropdown menu would fill the edit mask accordingly.

    Ideally, these relationships (and the corresponding GUI layouts) would be established on the basis of some type definition files (I guess RDF?) defining all the relationships. While Zotero wouldn't offer a GUI to change its own pre-made type sets (or to make new ones), power users could go ahead and edit the underlying RDF files to modify existing type sets or to create new ones. Then, offer a repository where these type definition files could be shared within the community. This would be a very flexible and powerful setup that would be easily adjustable in the future.

    "I suggest we focus on a smart core with some hierarchy, and leave room for customization with a 'genre' or 'type' column/property."

    How would the data model that I outlined above, fit your imagined model? (I've tried to include much of your biblio scheme and posts)
    • CommentAuthorsean
    • CommentTimeFeb 8th 2007 edited
     
    CoD- "Reproduced in" is not an item type. It's just a UI element pointing at a back-end association between one item and another. Functionally, this means that before entering a letter, you'll first want to enter the book where the letter is reproduced. Then enter the letter's details, and click "Reproduced in" to point back to the parent item.

    "Reproduced in" should never point to an ancillary type, since a journal article is not likely to be "reproduced in" an issue. That's its original form.
    • CommentAuthorsean
    • CommentTimeFeb 8th 2007 edited
     
    Matthias- Josh and I agree completely re: typing items to reduce the total number of item types. As you note, periodicals could be collapsed, as could dictionary/encyclopedia, etc.

    As for "ancillary" items, we think it's important not to allow the creation of a "periodical" or a "periodical issue" because doing so will severely complicate the UI. At the end of the day, we need to provide an interface that meets the needs of the vast majority of users. These people don't cite a journal or an issue. They cite articles in the issue of the periodical. Same with book series, etc. By separating these items out, it will be very easy to add and cite these kinds of sources, since we can auto-populate or provide some kind of selection UI for existing journals, issues, series.

    I understand the attraction of a perfect model, but it's also important to keep usability front and center. How important is the use-case for citing a single archive, and how much development time are you willing to invest to make it happen? What kind of usability trade-offs are you willing to make?
    • CommentAuthorJosh
    • CommentTimeFeb 8th 2007
     
    On the question of "ancillary" vs. other item types, I made a mistake in not being clear - this is a UI distinction, rather than an ontological one in the data model. From the data perspective, there's no difference between a journal issue and an article; the difference is that in the Zotero interface, you'd never be able to create a journal issue on its own (we'd be hacking in a bunch of UI decisions like this in order to make the software more accessible to people who don't think about this stuff as much as we all do).

    On the "collection" type, Matthias hit it on the head when he said above: "collections are IMHO simply a resource that can contain other resources, so basically collections are resources as well." That's how we see it, and the "collection" type seems an unnecessary elaboration.
    • CommentAuthorJosh
    • CommentTimeFeb 8th 2007
     
    I'll also add one more point - I see the evolving taxonomy on the Trac page as Zotero-specific; as I see it, the primary agenda is to come up with something that we can implement in Zotero as soon as a few weeks from now, and to do so in a way that doesn't preclude compatibility with more universal standards (hence the "ancillary item types" thing)...

    So, my personal concerns are naturally more specific and seemingly-limited than one would need in a universal ontology; I'd like to have my cake and eat it too, but given constraints it seems that the best we might be able to do is have our cake and keep the option of eating it open for the future. The great thing is that Zotero can spit out an RDF in whatever more generalized format we want later on, while still maintaining a more tailored and domain-specific model inside its black box.

    Regardless, this is clearly a useful discussion to have (both in the concrete and more abstract forms), so let's continue.

    One question I'd put out: any ideas on what to do with what we called a "Communication"? It seemed that this was fundamentally a different thing than a "Document", in a sort of broadcast vs. interpersonal communication way; any thoughts on this?
    • CommentAuthorbdarcus
    • CommentTimeFeb 8th 2007
     
    "On the "collection" type, Matthias hit it on the head when he said above: "collections are IMHO simply a resource that can contain other resources, so basically collections are resources as well." That's how we see it, and the "collection" type seems an unnecessary elaboration."

    But the notion of collection is significant WRT the GUI (as you both mention), the citation formatting, and data exchange. If you were to change "ancillary types" to collection and be more precise about it, would you really be deviating from your intentions?

    FWIW, I created the main collection class because that notion is commonly used in bibliographic data, and because it groups together a number of similar structures: series, archival collections, tv and radio shows, web sites, etc.
    • CommentAuthorbdarcus
    • CommentTimeFeb 8th 2007
     
    "One question I'd put out: any ideas on what to do with what we called a "Communication"? It seemed that this was fundamentally a different thing than a "Document", in a sort of broadcast vs. interpersonal communication way; any thoughts on this?"

    Part of my long-post-that-got-eaten did address this. I had thought about this earlier too, but it's a little tricky. For example, how would you deal with an interview published in a book, or broadcast on the internet?
    • CommentAuthorbdarcus
    • CommentTimeFeb 8th 2007
     
    Matthias --

    "As per my previous post, I'd prefer if item types would merely be a pre-made set of relationships."

    Yes. Well, look at how I do it in CSL. I have a convention that says you concatenate the primary level type with its container type. So you can do "article-periodical" or "article-magazine" and so forth. I could imagine the same for these assembled GUI types.
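
The naming convention described here (primary type joined to its container type) can be sketched in a few lines of Python. The function names are hypothetical, the code assumes type names contain no hyphens themselves, and it only mirrors the convention as stated in this post, not CSL's actual schema or any CSL API.

```python
def compound_type(primary, container=None):
    """Join a primary type and its container type with a hyphen."""
    return primary if container is None else f"{primary}-{container}"


def split_compound(name):
    """Split a compound name back into (primary, container or None)."""
    primary, _, container = name.partition("-")
    return primary, (container or None)


print(compound_type("article", "periodical"))
print(split_compound("article-magazine"))
```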
    • CommentAuthorsean
    • CommentTimeFeb 8th 2007
     
    Bruce --

    "Reproduced in" a book or a web page is what you would do with your hypothetical interview.

    That said, in our model a "Communication" is really just a "Document" with the addition of a specified recipient or recipients. So maybe we should make it one.
    • CommentAuthorMatthias
    • CommentTimeFeb 8th 2007
     
    Please note that I didn't ditch the "collection" hierarchy entirely (I consider it useful), I was just emphasizing that a "collection" can also be a valid resource (similar to an article, letter, etc). Therefore, all of its members should be treated equal compared to members of the "document" tree (i.e. all these elements being "items" in my view).

    Josh, I understand (and like) the idea of a communication being different from a document. Also, I think that a broadcast (one to many) is different from a communication (one to one).

    Bruce said:
    "For example, how would you deal with an interview published in a book, or broadcast on the internet?"

    This is a valid point and it highlights my point of a book being a "collection" OR a "document" depending on the situation. With this in mind, I'd argue that the relationships between items and its higher-level categories (subclasses) should not be dogmatically fixed. Why shouldn't it be possible to relate an interview to a book in one instance and to relate it to an internet broadcast in another instance?

    How about if we'd merely regard subclasses as some means to provide meaning between items and classes, i.e. for example, a book could belong to subclass "collection" OR subclass "document" depending on the situation? (just thinking out loud here) Would this be feasible in an RDF model?
  13.  
    Matthias: "I'd argue that the relationships between items and its higher-level categories (subclasses) should not be dogmatically fixed. Why shouldn't it be possible to relate an interview to a book in one instance and to relate it to an internet broadcast in another instance?"

    It should be possible, just as it should be possible to relate a letter to a book, or a journal, or an article in a journal, or a reel of microfilm, or a website, or an article in a journal on a website, or just a box in an archive. Clearly, one point of a hierarchical model is to make possible all kinds of relationships, including the ones that we don't currently anticipate needing.

    That said, it sounds to me like the model posted on the wiki could handle this pretty well. The key is that everything is a type, and can be related in various ways to other types. The most complicated documents I have to cite are a bunch of letters appended to an annotated diary published as a titled article in a journal. So I create a new entry for a letter, note that it's "reproduced in" the article, which is found in the journal, etc. Works for me.

    I do have another question about the model, however:

    Why the distinction between "reproduced in"-type parents and original parents? What is the functional difference between the relationship between a letter and a book and the relationship between a letter and an archive? Or an article's relationship to the journal that first published it vs. a book in which it was reprinted? I don't see what difference this makes either for the data structure or the UI (especially for the UI - the average user entering a letter/book in a database only cares about recording the information for that particular incarnation of the letter, and not about the fact that the original letter exists somewhere else). In each case, we're still just talking about types relating to types, right?
    • CommentAuthorJosh
    • CommentTimeFeb 8th 2007 edited
     
    On the "reproduced in" idea, it seems to me that the relationship between a journal article and its parent "Journal" item is different from that of a journal article and a book in which it's reproduced; this might just come down to a Benjamin-like notion of the "aura" of originality, but in every case we could come up with, there is an original and authentic item (of course, we're thinking as historians, so that might also play a role here). It seems useful to avoid conflating an authentic original with its reproductions - we cite them differently, and treat them differently in the context of research.

    As for "collections", if a book can be either a collection or a document depending solely on perspective, then it fundamentally can't be described in an objective way (which is crucial to data portability and interoperability). Since one could *always* make a gestalt shift between collection/document, I don't see the usefulness of maintaining them as somehow ontologically distinct. Why not just level the distinction between them, and simply talk about items that can also be pieces of other items (and in turn be composed of smaller items as well)?
    • CommentAuthorbdarcus
    • CommentTimeFeb 8th 2007
     
    It might make sense to disentangle version relations from part-container relations, Josh. We also deal with republished versions of books a little differently. Going back to FRBR, BTW, it provides a useful way to think of these distinctions (works vs. expressions vs. manifestations vs. items).

    On collections, I don't really agree with Matthias that a book can be either a document or a collection. OK, yes, I do see what he's saying. But I am seeing this as a distinction between a standalone item, and a set of them. One never cites the latter. So an edited book would be a subclass of a document, and its series would be a collection. Likewise, a multi-volume book would also be a collection I guess.

    The reason why not to "level the distinction" in the exchange and formatting system (CSL) is that I *think* it's important. In CSL, for example, I have a relation attribute for things like titles. It helps there to be able to say use a container title here, and a collection title there.

    In any case, I'm not really religious about my position; it's worth discussing more. I've just always had a separate collection class/table/etc.

    To Matthias's question on RDF subclassing, yes, a class can subclass multiple parent classes.
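
A small Python sketch of what multiple subclassing looks like when the class hierarchy is stored as plain triples. `rdfs:subClassOf` is the real RDF Schema property; the `ex:` class names and the `superclasses` helper are made up for the example, and no RDF library is used.

```python
# Triples of (subject, predicate, object); the "ex:" names are hypothetical.
TRIPLES = [
    ("ex:Book", "rdfs:subClassOf", "ex:Document"),
    ("ex:Book", "rdfs:subClassOf", "ex:Collection"),
    ("ex:Document", "rdfs:subClassOf", "ex:Resource"),
    ("ex:Collection", "rdfs:subClassOf", "ex:Resource"),
]


def superclasses(cls, triples=TRIPLES):
    """Return all direct and inherited superclasses of cls."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found


print(sorted(superclasses("ex:Book")))
```

Here a book is both a document and a collection at once, rather than one or the other depending on the situation, and it is a resource through either path.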
  14.  
    I've just had a long post erased by a connection blip. A pox on fuzzy wireless connections.

    I had two points, I think: first, re: Josh's comment: "It seems useful to avoid conflating an authentic original with its reproductions - we cite them differently, and treat them differently in the context of research."

    We cite them differently? The citation gives the information for the item and tells the reader where to find it. I don't see how it makes any difference to the bibliographic software whether the "container" is the original or a reproduction.

    Second, at some points in this discussion we've misunderstood each other because distinctions between the data structure, the GUI, and hierarchies of entry types aren't clear. It seems to me that these are three quite different (though obviously interrelated) things. My dearly departed post spelled out how I saw the differences among them, but I won't try to recreate that. The key thing is that there seems to be widespread support for the idea of a hierarchy-free data structure - records are records, and books can be parents, children, or stand-alone items, depending on how they're related to other records in a citation. The framework for organizing citation styles and entry types is more rigidly hierarchical - the category "communication" contains sub-categories "letter," "email message," etc. These hierarchies of types will help expedite the creation of new citation styles and entry types, and eventually make it possible to open these up to user customization without devolving into utter chaos. The GUI will also be hierarchical, but in a different sense - it will display the hierarchical relationships between items and their parents (and potentially grandparents).

    I hope that's clear - a more carefully composed version is gone forever.
    • CommentAuthorerazlogo
    • CommentTimeFeb 8th 2007 edited
     
    In response to this point from Sean way above:

    "As for "ancillary" items, we think it's important not to allow the creation of a "periodical" or a "periodical issue" because doing so will severely complicate the UI. At the end of the day, we need to provide an interface that meets the needs of the vast majority of users. These people don't cite a journal or an issue. They cite articles in the issue of the periodical."

    It is not clear to me that "periodical" or "archive" are not useful as item types. Yes, people don't cite periodicals or archives in footnotes, but then Zotero is a research tool rather than just a bibliographic citation tool. I can imagine someone doing research on, say, Dwight MacDonald, and amassing a list of archival collections where his letters or relevant documents are located. Or, one might work on a history of audio engineering from the seventeenth century to the present and compile a list of relevant periodicals to be perused. Eventually, one may need to export this list into a bibliography--many dissertation and book bibliographies include sections for "Archives" and "Periodicals". Why not make it easy to import periodicals from online library catalogs, or online archival finding aids (such as this one), into Zotero?
    • CommentAuthorJosh
    • CommentTimeFeb 9th 2007 edited
     
    CoD: "We cite them differently? The citation gives the information for the item and tells the reader where to find it. I don't see how it makes any difference to the bibliographic software whether the "container" is the original or a reproduction."

    When I'm citing an artwork, I do need to indicate in the citation whether I'm working off of the original or a reproduction; this is part of how I lay out an evidentiary chain for my argument. The question of reproduction *does* matter, because if I'm not working off of the original, errors or noise might have been introduced in the process of reproduction, and a later reader needs to be able to track down the particular reproduction on which I based my claims (hence the need for a particular kind of "Reproduced in" relationship).

    CoD, on your second point, I think I agree: in principle, the more flexible the model the better, but the implementation of these concepts in a particular piece of software tailored to a particular set of practices (i.e. Zotero) will hard-code much more rigid and explicit hierarchical relationships into both Zotero's UI and internal data model (neither of which preclude exporting said data in the more abstract and generic RDF form as needed). Because my first concern is Zotero (with the broader utility of Zotero data in other contexts a close second), I tend to slip back into the latter two modes, rather than instinctively staying at the more generalized level.
    • CommentAuthorMatthias
    • CommentTimeFeb 9th 2007 edited
     
    I agree wholeheartedly with Sean & Josh that one has to draw a line somewhere between an ideal model and the reality of implementation, and I'll face the same trouble when trying to implement such a hierarchical model in my own bibliographic application. However, my point is that it doesn't make sense to adopt a new model (which takes *a lot* of time to implement!) if it's again fairly limiting. Sure, it's always a tradeoff, but I really think that a hierarchical model should:

    - allow me to freely relate any items with any other items, or, with Josh's words: "items that can also be pieces of other items (and in turn be composed of smaller items as well)"

    - allow me to cite any kind of container on its own. Sure it's a less common case but there are many cases (even in hard sciences) where you need to cite an entire book as well as some chapters from the same book within the same work. And I imagine that some people (such as an editor in a preface) will definitely need to cite a book series on its own. So I agree with erazlogo here that from a user point of view it should be possible to cite a container on its own.

    Bruce said:
    "On collections, I don't really agree with Matthias that a book can be either a document or a collection. OK, yes, I do see what he's saying. But I am seeing this as a distinction between a standalone item, and a set of them. One never cites the latter. So an edited book would be a subclass of a document, and its series would be a collection. Likewise, a multi-volume book would also be a collection I guess."

    Maybe my confusion is that I've always viewed a "container" as being a synonym for a "collection", but in case of books, it's more like that a book can be a container for something while not being a collection in the sense that it's a set of multiple items that are usually physically distinct from each other.

    If we regard collections as a set of multiple things (so a standalone book being a document), how about if the characteristic of "being a container" would simply be a property that could be assigned to a relationship.

    The same logic could be applied to the distinction of "original" vs "reproduction": I understand the usefulness of this distinction, but it's only one of many useful relationships. So couldn't this be just another property of the relationship between two items?

    In other words: there are many other useful relationships besides "reproduced in", e.g. "presented at" comes to mind. Btw, w.r.t. "reproduced in", I would also favour a more general relationship such as "contained within". Even better, what if we could simply establish *any* kind of relationship between two items? Zotero could offer a dropdown menu with a "relationship qualifier" such as "contained within", "reproduced in", "presented at", etc. Would this be possible? Also, this would make it rather easy to expand such a system in the future to add a new type of relationship.
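
A minimal Python sketch of relationships that carry a qualifier, so that "reproduced in" becomes one value among several rather than a special case. The `Relation` class, the `QUALIFIERS` set, the helper functions, and the item identifiers are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical qualifier list; extending it adds a new relationship type.
QUALIFIERS = {"contained within", "reproduced in", "presented at"}


@dataclass(frozen=True)
class Relation:
    source: str      # item identifier
    qualifier: str
    target: str      # item identifier


def relate(relations, source, qualifier, target):
    """Append a relation after checking the qualifier is a known one."""
    if qualifier not in QUALIFIERS:
        raise ValueError(f"unknown relationship qualifier: {qualifier!r}")
    relation = Relation(source, qualifier, target)
    relations.append(relation)
    return relation


def related(relations, source, qualifier):
    """Return the targets linked from source by the given qualifier."""
    return [
        r.target
        for r in relations
        if r.source == source and r.qualifier == qualifier
    ]


relations = []
relate(relations, "letter-1", "reproduced in", "book-1")
relate(relations, "paper-1", "presented at", "conference-1")
print(related(relations, "letter-1", "reproduced in"))
```

A dropdown in the interface would then simply list the entries of `QUALIFIERS`, and a constraint such as "only one reproduced-in parent per item" could be enforced as a rule on top of this table rather than built into the data structure.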
