Can't cite web site?

wdmartin · November 10, 2006

I've been exploring Zotero for a couple of days now, and I don't seem to be able to create bibliographic entries from web sites. Here's what I'm doing:

1) Find a web site. Say this one: http://www.slate.com/id/2152830/
2) Open Zotero and click "Create New Item from Current Page."
3) Fill out any missing information (e.g., correct the author's name, add a publication name).
4) Right-click the entry in the center pane and select "Create Bibliography from Selected Item."

After doing all this, the created bibliography is empty. I've tried exporting to HTML, to RTF, and to the clipboard with all three citation styles (MLA, APA, Chicago). In every case, it doesn't actually give me a bibliographic entry. In the case of exporting HTML or RTF, the resulting file contains the appropriate formatting commands for the bibliography, but doesn't actually have any data. What gives? Am I doing something wrong, or is this a bug in Zotero?

Incidentally, when I export the item to an RDF file, all the data appears to be present. I can import the RDF file and have all the data appear (except that the RDF apparently records both the title of the article and the title of the publication as dc:title, and then discards the title of the publication on import).

Suggestions?

EDIT: It'd be nice to have some kind of user guide for the forum. Like, does HTML work? Which tags? Is it a BBCode setup? The Lussomo web site has developer docs but no user guide.

bdarcus · November 11, 2006

This seems to be a bug in Zotero's formatting code. All records should get formatted more-or-less correctly regardless of whether there's any particular type template in the CSL file.

It also exposes the need for a clearer type model in Zotero. What you're citing in your example above is not a "site," but rather a page within a site. Moreover, the fact that it's on the web is insignificant. It's an "Article" (or if you like a "Document") that happens to have an online location.

dstillman · November 11, 2006

Well, it's not a bug, per se—we just haven't mapped many of the item types to CSL fallbacks, so they don't show up in bibliographic export (as noted on the Known Issues page). But yes, this will be helped by the addition of generic types and item type templating, both of which are on our list.

Also, while a "web page" is indeed an article and should fall back to that for CSL purposes, that doesn't necessarily obviate the need for or utility of a "Web Page" item type, since, as you've noted elsewhere, item types have various purposes. I'd argue that one is to provide users with clear and relatively descriptive designations for their sources (which can also be used for, say, searching) without having to consider the implications for citation styling. A "Web Page" item type might seem quaint 5 or 10 years from now, but I think its omission now would just be unsettling for many users.

It's a fair discussion, though, and it comes more into focus with types like "Forum Post" and "Blog Post." I personally think the latter, especially, is problematic, since I think we've already reached the point where the word "blog" has ceased to have much meaning, but some people felt strongly that it represented a unique format that has become integral to online research.

Ultimately, the biggest help will likely come from custom item type functionality, which will let users create item types for organizational purposes that fall back to standard types for export. That has its own problems, like what happens to custom fields, but that's a whole other discussion...

wdmartin · November 11, 2006

Fair enough, it's a page, not a site.

I spent some time wading through the CSL schema, to see if I could fix this myself. I must confess - I didn't really understand it. I'm familiar with DTDs, but I've never worked with other schema languages before. Is this an XML Schema per W3C rec, or a RELAX NG schema, or something else? I need to go read up on whatever schema language you're using.

It'd be nice to have that "Key Files" page in the wiki fleshed out. I think I've figured out where you keep most of the code, but it's rather a lot to wrap your head around all at once. Also, storing the translator JS inside the SQLite db is weird. What are the advantages of doing it that way? There've got to be some, or you'd never have done it, 'cause it makes it harder to edit the code.

dstillman · November 12, 2006

You can read up more on CSL here: http://xbiblio.sourceforge.net/csl/

Bruce can comment more on it if necessary, as it's his creation—we (or, more accurately, Simon, another dev on our team) implemented an engine and helped refine the styles a bit. In this case, though, it's our code that needs to be adjusted, not the CSL. Basically, in CSL there are book, article, and chapter base types, and item types use those—this is a case of some item types not being mapped to those three types.
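
A minimal sketch of the mapping problem described above, not Zotero's actual code: the item type names, the `ITEM_TYPE_TO_BASE` table, and the output format are all hypothetical. It only illustrates how an item type with no entry for one of the three base types (book, article, chapter) can end up with an empty bibliography entry, and how a default fallback avoids that.

```python
# Hypothetical table mapping item types to the three CSL base types.
ITEM_TYPE_TO_BASE = {
    "book": "book",
    "bookSection": "chapter",
    "journalArticle": "article",
    "magazineArticle": "article",
}

DEFAULT_BASE = "article"  # assumed fallback for unmapped item types


def format_item(item, use_fallback=True):
    """Format one item, optionally falling back for unmapped item types."""
    base = ITEM_TYPE_TO_BASE.get(item["type"])
    if base is None:
        if not use_fallback:
            # No mapping and no fallback: nothing is emitted, which mirrors
            # the empty bibliography reported at the top of this thread.
            return ""
        base = DEFAULT_BASE
    return f'{item["author"]}. "{item["title"]}." [{base}]'


page = {"type": "webpage", "author": "Doe, Jane", "title": "An Example Page"}
without_fallback = format_item(page, use_fallback=False)
with_fallback = format_item(page)
```

With the fallback enabled, the unmapped "webpage" item is formatted as if it were an article; without it, the result is an empty string.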

I'll try to flesh out Key Files later today. As for storing the translator and CSL code within the DB, you're right that it's a bit weird and unwieldy. They could be stored as individual files, but we decided to put them in the DB mainly because the metadata had to go in there anyway, individual files would need to be transferred from the XPI to the data directory on install/update, it all syncs with a central repository, we didn't really care to deal with making sure the DB (which gives us transactions) and the filesystem (which doesn't) were in sync, etc. Also, the idea (and Bruce's goal) is that eventually they'll be created using a GUI tool, so they won't have to be edited much manually anyhow. In the meantime, I believe Simon edited them just by keeping a text editor open and running sqlite3 zotero.sqlite < scrapers.sql after making changes.
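
To illustrate the point about transactions, here is a small sketch using Python's standard `sqlite3` module. The `translators` table and its columns are hypothetical and are not Zotero's real schema; the only claim is that a batch of updates inside one transaction either lands completely or not at all, which a directory of loose files cannot guarantee.

```python
import sqlite3


def store_translators(conn, translators):
    """Insert or update a batch of translators atomically."""
    with conn:  # commits on success, rolls back if an exception escapes
        for label, code in translators:
            conn.execute(
                "INSERT OR REPLACE INTO translators (label, code) VALUES (?, ?)",
                (label, code),
            )


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE translators (label TEXT PRIMARY KEY, code TEXT NOT NULL)"
)
store_translators(conn, [("Example Site", "function doWeb() {}")])

# A batch containing an invalid row is rolled back as a whole.
try:
    store_translators(conn, [("Another Site", "x"), ("Broken", None)])
except sqlite3.IntegrityError:
    pass

row_count = conn.execute("SELECT COUNT(*) FROM translators").fetchone()[0]
```

After the failed batch, only the row from the first, successful call remains in the table.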

bdarcus · November 12, 2006

@wdmartin -- the schema is the RELAX NG compact syntax. But the problem isn't in the CSL as Dan notes; they haven't implemented the fallback behavior in their code yet.

@Dan -- my argument for why "web page" is "quaint" or just wrong has little to do with CSL actually. Well, it does, but in a somewhat different way. I just think it's a poor model for how we actually deal with documents.

What *is* a web page, anyway? It's nothing more than a document on the web (to me, having the Document fallback is more critical even than Article).

The confusion comes up in particular when you're dealing with exactly these sorts of items: articles published on Slate, the New York Times, etc. web sites. And these are the single most commonly cited documents on the web. To focus on these as "web pages" is almost like saying the book I have in my hand is fundamentally a "PrintedText" rather than a Book.

In other words, it's my contention that having a "web site" type is more confusing to users -- today -- than not.

But in any case, it highlights the need for a clear logic and policy for this. In my RDF model, for example, I'd be willing to include an InternetDocument subclass of Document *if* someone can provide a clear explanation of how we distinguish them. Is a PDF that I put on the web a Document, or an InternetDocument? If the latter, what makes it so?

This issue is so important that it really deserves a page on the dev wiki, and perhaps a discussion on the mailing list (and/or maybe IRC). Ideally, in fact, Zotero and CSL share the same type model.

Finally, yes, my goal is a robust infrastructure of CSL files, seamlessly added and updated as users need. CSL files ought to be then stored in one or more online repositories, and then cached locally as needed.

jankoc · November 13, 2006

This is a feature I have been spending a lot of time searching for as well.
Hope it will be added soon!

wdmartin · November 13, 2006

@bdarcus - Interesting. I'm not sure I follow you, though. Currently, bibliographic categories (book, article, etc) are organized mostly around the medium of the document. If I've understood correctly, it sounds like you want to replace that with a system organized around ... well, not content exactly. Perhaps more around form. So an article that appears in the NY Times and also on the NY Times web site is still an article regardless, and should be classified thus.

I'm not sure that eliminates the need for a category called "web page," though. I can cite something like a post in a forum.[1] It hasn't got a print equivalent; it doesn't fall easily into an existing genre, not being an article or what have you. But it is definitely part of a web page, so if I have a "web page" category, I can use that. If there's no web page category, then I have to sit there and figure out what other category it might fit into, when I'd really rather get on with life.

Basically, I'm arguing that the data we need in order to retrieve a document is inherently connected to the medium in which the document exists. "Web page" is a medium that basically everyone can recognize, even if perhaps they don't necessarily agree on the details of what constitutes a web page. Why get rid of a useful category like that?

[1] http://earthsongsaga.com/forum/viewtopic.php?p=129372#129372 <-- this one, for example.

bdarcus · November 13, 2006

@wdmartin -- yes, an article is an article; makes no difference where or how it's published, or what form it's in (print, PDF, HTML).

In terms of the upshot of the argument, fundamentally, I'm saying there ought to be robust, generic, fallback types: Document, Image, Communication, etc. One then adds further types as refinements of those superclasses/types. So Document --> Article --> JournalArticle and so forth.

I'm not (necessarily) saying get rid of web page, but that it MUST be part of a comprehensive model. Otherwise, not only will it be confusing and limiting to users from a UI perspective (really, there's a lot of online stuff I cite that I cannot fit in Zotero's type scheme; your Slate article, press releases, legal briefs, etc., and adding "web page" does me no good), but it will also cause problems for the formatting system (not just the code, but also the style files).
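
The refinement chain described in this post (Document --> Article --> JournalArticle) can be sketched as a walk up a parent table until a type the citation style knows about is found. This is only an illustration under stated assumptions: the `PARENT` table, the placement of `WebPage` under `Document`, and the `resolve_type` helper are hypothetical and are not part of CSL or Zotero.

```python
# Hypothetical type hierarchy: each type points at its more generic parent.
PARENT = {
    "JournalArticle": "Article",
    "Article": "Document",
    "WebPage": "Document",  # assumption: one possible place for "web page"
    "Document": None,
    "Image": None,
    "Communication": None,
}


def resolve_type(item_type, supported):
    """Return the nearest type in `supported`, walking up from item_type.

    Returns None when neither the type nor any of its ancestors is supported.
    """
    current = item_type
    while current is not None:
        if current in supported:
            return current
        current = PARENT.get(current)
    return None


# A style that only defines generic templates still handles refined types.
generic_style = {"Document", "Image", "Communication"}
resolved_generic = resolve_type("JournalArticle", generic_style)

# A style with a more specific template gets the closest match instead.
resolved_specific = resolve_type("JournalArticle", {"Article", "Document"})
```

Under this scheme a new refined type never produces an empty entry as long as one of its ancestors has a template in the style.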

erazlogo · November 14, 2006

Well, there are web pages that are just web pages - any personal home page for example. These, it seems to me, should be cited as web pages/web sites.

According to the Chicago Manual of Style, an online article is cited as a regular article with a url instead of page numbers, which seems to suggest that a Slate article should be entered into Zotero as a magazine article.

I agree that "Document" seems to be a great fallback type, but Communication and Image seem to be confusing. Is a letter found in an archive a "document" (I think most archival researchers think of it that way) or a "communication" simply because it was sent from one person to another? Is a digital scan of a press release a "document" (which would make sense for citation purposes) or an "image" simply because its medium is a digital image file? When you search for "images" in Spotlight on a Mac you find all file types usually classified as an image, and I think most people are used to thinking of "image" in terms of file type rather than item/reference type.

bdarcus · November 14, 2006

Good questions erazlogo. You hit on exactly the tricky areas.

Having done some archival work, I'd probably be happy with categorizing a letter as a document and leave it at that.

Keep in mind, though, that RDF (as well as object-oriented languages) allows for hierarchical typing. So generic classes like image and document and communication are in part to serve as fallbacks, and also in part to group related types. For image, then, it can include paintings or photographs or diagrams or graphs. Communication is the more typical "personal communication" and typically includes emails, phone conversations, and memos.

Also, I'd want to separate out the content of the material (that it's primarily visual/graphical rather than textual) from its medium (paper vs. digital image, etc.). And as it's often been modeled (say in FOAF), an image is a subclass of a document.

But in any case, yes, this stuff is hard!

Here, BTW, is an example of a document I was just reading that I really need to be able to capture well, but cannot yet in Zotero:

http://jurist.law.pitt.edu/pdf/al-marrimotiontodismissforlackofjurisdiction.pdf

homunq · May 12, 2009

This has gotten a little off into philosophical discussions of what is a web page. I don't care. I want to be able to take a snapshot of a webpage, right click on that and say "create entry", choose "blog post", and have a bibliography with reasonable guesses for title (web page title), blog title (title of web page at blog root), URL (duh), creation date (scraped from article text), and access date (duh) filled in for me, with the snapshot as a sub-item. I can deal with filling in and correcting this data manually, but tools exist to take on this simple grunt-work.

This is a clearly defined feature that addresses the original need as fully as I can see is possible, and it has nothing to do with most of the discussion above.

homunq · May 12, 2009

Oh, BTW, one reason this is so annoying currently is that fields that can't be edited can't be copied, so I can't use copy-paste within Zotero to get the URL and access date. I can't even set access date to "today". Just those features would get 50% of this feature done for about 10% of the work.

noksagt · May 12, 2009

@homunq:
Why is "Create New Item from Current Page" inadequate? This fills in title, url, and access/added date & will create a snapshot for you.

From what I can tell, it only misses:

"blog title (title of web page at blog root)...creation date (scraped from article text)"

Neither of these seems trivial to scrape. How do you identify the blog root? It is neither always one directory up nor is it always the root of the web url. And there are often lots of things on a webpage that look like dates & not all blogs actually have dates posted.

noksagt · May 12, 2009

"fields that can't be edited can't be copied, so I can't use copy-paste within zotero to get the url and access date."

This seems unrelated to the topic of the current thread.

Both "accessed" and "URL" are editable (at least in v1.5). You can copy/paste anything from/into the url field. The "Accessed" field is a datetime field, so it requires that you input data in a certain way. Others have already requested better handling of date input.

homunq · May 12, 2009

You're absolutely right. I can give no good account for having missed that function. I thought it was too obvious an oversight to be missing, and I was right.

Please ignore me, except for the "blog root" and "date" ideas. I agree, these are not trivial to code; however, you could cover probably 80% of the blogs out there with a handful of recipes. Still, I consider my basic question answered, and I thank you abashedly.
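
A rough sketch of what a "handful of recipes" for the date idea might look like, assuming the page text has already been extracted. The two patterns and the `guess_date` helper are hypothetical and deliberately naive; as noted earlier in the thread, real pages often contain several date-like strings, so a guess like this would still need manual checking.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Each recipe pairs a pattern with a function that builds a date from a match.
RECIPES = [
    # ISO style, e.g. 2009-05-12
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),
     lambda m: date(int(m[1]), int(m[2]), int(m[3]))),
    # Long US style, e.g. May 12, 2009
    (re.compile(r"\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b"),
     lambda m: date(int(m[3]), MONTHS[m[1].lower()], int(m[2]))),
]


def guess_date(text):
    """Return the first plausible date found by any recipe, or None."""
    for pattern, build in RECIPES:
        for match in pattern.finditer(text):
            try:
                return build(match)
            except (KeyError, ValueError):
                # Not a real month name or not a valid calendar date.
                continue
    return None


guessed = guess_date("Posted on May 12, 2009 by the author")
```

Recipes are tried in order, so a page containing both styles yields the ISO-style date first; a fuller version would rank candidates by where they appear on the page.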

Dean88 · August 18, 2012

Accessed Date is not included when I Insert Bibliography. I am using Firefox and Open Office. I have searched for a solution for over a week so I definitely need an answer.

adamsmith · August 18, 2012

Dean - start a new thread, please, this is unrelated. Specify which citation style you're using - many don't require access dates anymore (for good reasons).

Also, even if you were born in 1988, I'd strongly suggest not using 88 in your username (too late to change on Zotero, so this is for future reference):
http://en.wikipedia.org/wiki/88_%28number%29#As_a_Neo-Nazi_symbol

Dean88 · August 19, 2012

Yeah funny. Mate I can speak a heap more Mandarin than German so 88 to me means Good Fortune.