Specifications for Zotero's data fields

scot · September 10, 2007

User-level documentation for the finer points of Zotero's data fields would be handy. I assume that some of the following are CSL matters, but functionally, they are Zotero matters.

For example:

[1] Should page numbers be:
6-8
or
6 - 8 (with spaces)

[2] Should I type:
Becker, Scot C.
and: Harrisburg, Pa.

or: Becker, Scot C
and: Harrisburg, Pa (w/o the period)

[2b] Do I need to ensure consistency in my imported data in both of these things, or might the citation style be clever enough to create consistency in its output?

[3] What in fact should go in the 'Extra' field? Some translators used to stuff leftover metadata in it, but I suspect it has another aim.

[4] Is there currently a place for storing the bibliographic data for the original edition of a work I'm citing (for the time when the chronology of the academic discussion is important), or are we waiting for more clever nesting or association of items?

[5] Can I assume that Zotero's two ways of entering names (two fields, surname first; and one field, first name first) will both parse the same in my output?

These issues might make a good Wiki page, perhaps?

bdarcus · September 10, 2007

[4] Is there currently a place for storing the bibliographic data for the original edition of a work I'm citing (for the time when the chronology of the academic discussion is important) ...?

No; this is the kind of thing that a new—more relational—data model would be able to handle with ease.

[5] Can I assume that Zotero's two ways of entering names (two fields, surname first; and one field, first name first) will both parse the same in my output?

AFAIK, the single-field option will not parse the name (indeed, should not).

The problem with the first/last model is that it's not international-friendly. For example, in "Mao Zedong" the first name is in fact the family name and the second the given name. So if you sort or display using Western rules, the results are wrong.

So the single name field is a bit of a hack that says "treat this like a dumb string" (i.e., do not parse).
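To make the difference concrete, here is a minimal Python sketch of the two entry modes. The `Creator` record and the `display` and `sort_key` helpers are hypothetical names made up for illustration, not Zotero's actual schema or its CSL processor; the only point is that a two-field name can be reordered, while a single-field name is passed through untouched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Creator:
    # Hypothetical record for illustration only (not Zotero's real schema).
    last: str = ""
    first: str = ""
    literal: Optional[str] = None  # single-field mode: kept verbatim


def display(c: Creator) -> str:
    """Name as it might appear in running text."""
    if c.literal is not None:
        return c.literal  # "dumb string": never reordered
    return f"{c.first} {c.last}".strip()


def sort_key(c: Creator) -> str:
    """Name as it might be sorted in a bibliography."""
    if c.literal is not None:
        return c.literal  # sorted exactly as typed
    return f"{c.last}, {c.first}".strip(", ")


two_field = Creator(last="Becker", first="Scot C.")
one_field = Creator(literal="Mao Zedong")

print(display(two_field), "|", sort_key(two_field))
print(display(one_field), "|", sort_key(one_field))
```

With the single field, "Mao Zedong" sorts under "Mao" only because that is how it was typed; nothing in the sketch knows which part is the family name.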

dstillman · September 10, 2007

I don't know how our CSL processor handles most of these things (though it seems easy enough to check), but in general I think we try to follow the usual mantra of being liberal in what we accept and conservative in what we emit. This is especially important for Zotero, since our data input often comes from third-party sites (with their own idiosyncratic ways of encoding data) rather than individual users (who might be better able to follow guidelines that we posted). A good example of this is the Date field, which will accept any freeform string and do its best to parse out a standard date.

Now, dates are probably among the easiest values to parse, and there may be fields where we require a particular format, but in general, I would say that if there are logical ways of entering data that produce unexpected style output in Zotero, let us know, and we'll try to improve the parsing algorithms.
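As a rough illustration of "liberal in what we accept, conservative in what we emit" for a date field, here is a small Python sketch. It is not Zotero's parsing algorithm; the function names `parse_loose_date` and `find_month` and the choice of an ISO-style output string are assumptions made for the example, and it only understands English month names.

```python
import re
from typing import Optional

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def find_month(text: str) -> Optional[int]:
    """Return 1-12 for the first word that abbreviates an English month name."""
    for word in re.findall(r"[A-Za-z]+", text):
        w = word.lower()
        if len(w) >= 3:
            for number, full in enumerate(MONTHS, start=1):
                if full.startswith(w):
                    return number
    return None


def parse_loose_date(text: str) -> Optional[str]:
    """Accept a freeform string; emit YYYY, YYYY-MM, YYYY-MM-DD, or None."""
    # Numeric year-month(-day) forms such as "2007-9-12".
    iso = re.search(r"\b(\d{4})-(\d{1,2})(?:-(\d{1,2}))?\b", text)
    if iso and 1 <= int(iso.group(2)) <= 12:
        out = f"{iso.group(1)}-{int(iso.group(2)):02d}"
        if iso.group(3) and 1 <= int(iso.group(3)) <= 31:
            out += f"-{int(iso.group(3)):02d}"
        return out
    # Otherwise look for a four-digit year, then an optional month and day.
    year = re.search(r"\b(1\d{3}|20\d{2})\b", text)
    if not year:
        return None
    month = find_month(text)
    if month is None:
        return year.group(1)
    out = f"{year.group(1)}-{month:02d}"
    day = re.search(r"\b(\d{1,2})\b", text)
    if day and 1 <= int(day.group(1)) <= 31:
        out += f"-{int(day.group(1)):02d}"
    return out


for raw in ["September 10, 2007", "10 Sept. 2007", "2007-9-12", "c. 1987", "n.d."]:
    print(repr(raw), "->", parse_loose_date(raw))
```

Anything the sketch cannot interpret comes back as `None`, so the original string can be kept as typed instead of being replaced by a wrong guess.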

scot · September 12, 2007

Regarding punctuation in fields, I did a few tests, and it seems that at the moment it's nearly what-you-see-is-what-you-get. "6 - 8" and "6-8" are each handled verbatim. Same with trailing periods on titles, spaces before the semicolon dividing title and subtitle, and missing or present periods on name initials and place abbreviations.

So (based only on the 1000 records I have collected from various sources) here are a few possibilities for improving parsing:

(1) Page (and other locator) numbers: a little tricky, since you have to deal with things like 12-15, 31, and 123-9. They make nice candidates for parsing, though, since you can then regularize spacing and punctuation, as well as abbreviated ranges such as “54-9”. A lot depends on whether CSL parses the whole thing out or not, but I guess that it does.
(2) Trailing punctuation generally. I find this everywhere: titles with trailing periods, places with colons, publishers with commas. This might be an easy one, and, come to think of it, probably belongs at the translator level, since then you have clean data to begin with and don’t have to guess whether your citation style is clever enough to axe it.
(3) Initials in names. A little tricky even in European languages, since a few people have middle initials which don’t stand for anything. Still, something like 30-40% of the items I import come in without a period on the middle or first initial. This is probably a matter for global search and replace rather than parsing.
(4) Spaces before semicolons in titles. This should be someone else’s call. At the moment we don’t parse title from subtitle (though MODS does, for example). It works pretty well as-is for me, but I can imagine cases where you would like to have them parsed. German style guides in the humanities tend to use a period to separate the two. (And the examples and guides I've seen keep the title-separating punctuation consistent, no matter what language the original was written in. That is, German bibliographies cite English works using the period to set off the subtitle; English works cite German works using the semicolon.)
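For what it's worth, points (1) and (2) can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Zotero or a translator actually does; `normalize_pages`, `expand_end`, and `strip_trailing` are hypothetical names, and whether an abbreviated range should be expanded (or left for the citation style to decide) is an assumption made here for the sake of the example.

```python
import re


def expand_end(start: str, end: str) -> str:
    """Fill in an abbreviated range end: ("123", "9") -> "129"."""
    if len(end) < len(start):
        end = start[:len(start) - len(end)] + end
    return end


def normalize_pages(text: str) -> str:
    """Regularize spacing in comma-separated locators and expand short ends."""
    parts = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        # A numeric range written with a hyphen or an en dash.
        m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", chunk)
        if m:
            start, end = m.groups()
            parts.append(f"{start}-{expand_end(start, end)}")
        elif chunk:
            parts.append(chunk)  # anything else is passed through verbatim
    return ", ".join(parts)


def strip_trailing(value: str, chars: str = ".,:;") -> str:
    """Drop trailing punctuation (and spaces) from a field value."""
    return value.rstrip(" \t" + chars)


print(normalize_pages("6 - 8"))
print(normalize_pages("12-15, 31, 123-9"))
print(strip_trailing("New York :"))
print(strip_trailing("Brill,", chars=","))
```

Note that the default `chars` would also remove the period from "Harrisburg, Pa.", which is exactly the sort of case where blanket stripping goes wrong; passing a narrower `chars` per field (say, only ":" for places) is one way around it.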

And that 'Extra' field? Is it a catch-all, or is it for "unparsed things tacked on to the end of an item in a bibliography"?