URIs (again)

bdarcus · May 14, 2009

So I'm still finding the URI scheme for zotero.org/2.0 strange, after having complained about it months ago and having Dan assure me it was just temporary.

Why am I still seeing trailing ids after the natural language slugs? E.g.:

http://www.zotero.org/bdarcus/34

Or:

https://www.zotero.org/groups/critical_geopolitics/23

You already preview the URI in group creation, so you can at the same time ensure the group names are unique (which is important anyway so that people don't create groups with large but unintentional overlap).

I would say if you want to include disambiguation ids somewhere in the URI, they might be GUIDs, and they should certainly go before the slug.

Also, this is an aesthetic thing, but can you change your slugifying code to use dashes instead of underscores to replace spaces?

noksagt · May 14, 2009

No comment on what the best avenue forward is, but...

Why am I still seeing trailing ids after the natural language slugs?

The trailing ids are what is actually used & they seem to make for permalinks. You are allowed to change your username and group name, and it seems reassuring to me that:

http://www.zotero.org/bdarcus/34
http://www.zotero.org/bruce/34
http://www.zotero.org/darcus/34

will all point to the same thing. Note that the shorter http://www.zotero.org/bdarcus also points there. But if you change your username, it will become a dead link.

I would say if you want to include disambiguation ids somewhere in the URI, they might be GUIDs, and they should certainly go before the slug.

Why? Putting human-readable ids first makes the URI look "friendlier" on a first parse to me & it is probably more search-engine friendly to put content at the start of a URI.

Also, this is an aesthetic thing, but can you change your slugifying code to use dashes instead of underscores to replace spaces?

Having created groups that have hyphenated words, I don't know if I agree with your aesthetic. I wouldn't have a strong objection to it, but I like

https://www.zotero.org/groups/atom-probe_tomography

Wikipedia uses the same URI scheme, so maybe we all just like what we know ;-).

bdarcus · May 14, 2009

On the URIs; I don't care about being able to change my user name. I care about having dead stable URIs that can be used to link data together. This kind of thing is not encouraging when you start to scale it out.

On the slugifying, it certainly is aesthetic, and so in that sense there probably is no objectively correct position. FWIW, I take my cue here from frameworks like Django, blog engines like WordPress, Drupal, etc., all of which use dashes (Django actually has a built-in slugify field type which does it all automatically).
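For illustration, a minimal sketch of dash-based slugifying. This is not Django's or Zotero's actual code; the `slugify` function and its `separator` parameter are assumptions made for the example.

```python
import re
import unicodedata


def slugify(name: str, separator: str = "-") -> str:
    """Lowercase a display name and join its words with `separator`."""
    # Fold accented characters to plain ASCII where possible.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one separator.
    slug = re.sub(r"[^a-z0-9]+", separator, ascii_name.lower())
    # Drop separators left over at either end.
    return slug.strip(separator)


print(slugify("Critical Geopolitics"))
```

Note that this sketch also flattens hyphens that were already in the name ("Atom-Probe Tomography" becomes "atom-probe-tomography"), which is exactly the loss of distinction raised above for hyphenated group names.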

noksagt · May 14, 2009

I care about having dead stable URIs that can be used to link data together.

But the current URIs are stable, right? They just might not be "pretty." If we wanted to drop the serial number, then we'd have to disallow name changes to make stable URIs, no?

dstillman · May 14, 2009

On the URIs; I don't care about being able to change my user name. I care about having dead stable URIs that can be used to link data together.

The id is optional, and, as noksagt says, allows for a permalink despite name changes. That http://www.zotero.org/bdarcus/items/10049 is currently a 404 is a bug. Deeper URLs should redirect the same way that http://zotero.org/bdarcus redirects.

dstillman · May 14, 2009

Actually, http://www.zotero.org/anything/34 is supposed to redirect to http://www.zotero.org/bdarcus/34, but that doesn't seem to be working at the moment either.

Also, as far as the API is concerned, the item URI is just http://zotero.org/users/34/items/10049, and that will redirect to http://www.zotero.org/bdarcus/34/items/10049

This may not be the ideal scheme, but it does have its advantages and avoids certain problems with other approaches.
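As a rough sketch of the redirect behavior described here, assuming the id is the real key and the leading path segment is only a label. The `USERNAMES` table and the `resolve` function are hypothetical names for the example, not the site's implementation.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup table: user id -> current username.
USERNAMES = {34: "bdarcus"}

# Matches /<label>/<numeric id> with an optional deeper path.
PATH_RE = re.compile(r"^/([^/]+)/([0-9]+)(/.*)?$")


def resolve(path: str) -> Optional[Tuple[int, str]]:
    """Return (status, path) for an id-bearing path, or None if no id is present."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    label = match.group(1)
    user_id = int(match.group(2))
    rest = match.group(3) or ""
    current = USERNAMES.get(user_id)
    if current is None:
        return (404, path)
    target = f"/{current}/{user_id}{rest}"
    # Serve the canonical form directly; permanently redirect everything else.
    if label == current:
        return (200, target)
    return (301, target)


print(resolve("/anything/34"))
print(resolve("/users/34/items/10049"))
```

Under these assumptions the leading segment is never trusted: any label paired with id 34 ends up at the current username, including deeper item paths.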

bdarcus · May 14, 2009

note: posted before I read Dan's note; but still applies ;-)

OK, there are two separate issues here. There's the somewhat anal aesthetics of pretty URIs. This is admittedly not that important (though obviously something I care about).

There's also the really important issue about URI stability. Can we please have one identifier for each user/group/item in zotero, rather than many? This isn't just about serving web pages to users; it's also about being able to access structured data (e.g. RDF).

As an example, what possible real advantage is there to allowing users to change their user names that outweighs the problems on the data end? If I use that URI above for an item to cite something, does it end up in a 404 when I want to grab the data after later changing my user name? Or are you going to put in redirects, and added RDF triples, just to account for those changes?

bdarcus · May 14, 2009

Also, another issue we've periodically chatted about forever: have you worked out how you're going to distinguish URIs for user items (which I submit are just glorified bookmarks) and the sources they are referencing?

noksagt · May 14, 2009

If underscores are used, it'd be nice if the spaced version also worked. The (url-encoded) addresses:

http://www.zotero.org/groups/atom-probe%20tomography
https://www.zotero.org/groups/atom-probe%20tomography/11

should point to the right page.
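A minimal sketch of how that could work, assuming the server normalizes the path segment before looking the group up. `normalize_group_slug` is a hypothetical helper, not Zotero's code.

```python
from urllib.parse import unquote


def normalize_group_slug(raw_segment: str) -> str:
    """Decode a URL path segment and map whitespace to underscores."""
    # "%20" becomes a literal space after decoding.
    decoded = unquote(raw_segment)
    # Join the whitespace-separated words with underscores; hyphens are kept.
    return "_".join(decoded.split())


print(normalize_group_slug("atom-probe%20tomography"))
```

With this in place, the spaced and underscored forms of a group name would map to the same lookup key.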

noksagt · May 14, 2009

Actually, http://www.zotero.org/anything/34 is supposed to redirect to http://www.zotero.org/bdarcus/34, but that doesn't seem to be working at the moment either.

Works-for-me assuming that "anything" doesn't have a space in it (haven't tried other escaped characters).

Can we please have one identifier for each user/group/item in zotero, rather than many?

The unique permanent identifier is the serial number!

what possible real advantage is there to allowing users to change their user names that outweighs the problems on the data end?

What problems are on the data end in URIs that have serial numbers?

If I use that URI above for an item to cite something, does it end up in a 404 when I want to grab the data after later changing my user name?

Not if you use the serial number too.

noksagt · May 14, 2009

When searching groups, it'd be nice if a space was used as a keyword separator, rather than part of a substring. That is, these all work:

https://www.zotero.org/search#group/atom
https://www.zotero.org/search#group/probe
https://www.zotero.org/search#group/probe%20tomograph

But it is probably not user friendly that this also works:

https://www.zotero.org/search#group/e%20t

while this does not:

http://www.zotero.org/search#group/atom%20probe
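To make the difference concrete, here is a small sketch comparing plain substring matching with keyword matching. It assumes each keyword must be a prefix of some word in the group name; that prefix rule is an assumption of the example (splitting on spaces alone would still let single letters match), and both function names are hypothetical.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Split text into lowercase alphanumeric words."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def substring_match(query: str, group_name: str) -> bool:
    """Substring behavior: the whole query, spaces included, is one substring."""
    return query.lower() in group_name.lower()


def keyword_match(query: str, group_name: str) -> bool:
    """Keyword behavior: every keyword must start some word of the name."""
    name_tokens = tokens(group_name)
    return all(
        any(tok.startswith(word) for tok in name_tokens)
        for word in tokens(query)
    )


name = "atom-probe tomography"
for query in ["atom", "probe tomograph", "e t", "atom probe"]:
    print(query, substring_match(query, name), keyword_match(query, name))
```

In this sketch "atom probe" matches and "e t" does not, which is the reverse of the substring behavior on those two queries.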

bdarcus · May 14, 2009

The unique permanent identifier is the serial number!

Come on Rick; this is a hack. You have a URI which is effectively opaque, and a bunch of black magic redirects.

noksagt · May 14, 2009

Perhaps; but the hack is somewhat common (a similar scheme is used in these Vanilla forums to list thread IDs). I think the way WordPress handles it is more opaque; how is anyone to know that:

http://community.muohio.edu/blogs/darcusb/?p=585
http://community.muohio.edu/blogs/darcusb/archives/2009/05/10/html-5-microdata-use-cases

are the same thing?

It seems the choice is "pretty/opaque urls that are surprising on name changes" vs "ugly/magic urls that don't surprise."

bdarcus · May 15, 2009

Well, first, I don't consider WordPress a model; I was just referring to their slugifying.

As for the choice, this presumes there's a compelling need for users to change their usernames. My position is there is not.

The issue here is that Zotero.org can become a big, open, linked database of scholarly data. These data can be linked to Library of Congress data, and/or this in-development periodical data, etc., etc.

If they follow the principles of linked data, this is easy to do. But a fundamental prerequisite is a sane and stable URI scheme. It seems doubtful to me one exists ATM, which has me worried.

PS - What I meant by opaque is essentially that it's nowhere visible in the interface.

fbennett · May 15, 2009

I don't really see a technical problem here. If a user changes their name, and a call comes in on the old URI, the response will be something like a 301 or a 307 not a 404. If the calling application can't handle the redirect, it's fragile and needs fixing anyway.

As I understand it, while URIs are mutable by design, it is URNs that are meant to be immutable. That seems to be why the former can be written off the cuff and provide a redirection mechanism, while the latter are issued by a standards authority and do not.

(This is unrelated to aesthetics and readability, of course.)

bdarcus · May 15, 2009

I don't really see a technical problem here. If a user changes their name, and a call comes in on the old URI, the response will be something like a 301 or a 307 not a 404. If the calling application can't handle the redirect, it's fragile and needs fixing anyway.

Sure, you can do that; but why? I'm just asking for a plan, and a strategy, here. Right now, the URI infrastructure isn't working as I'd expect. Consider something I've been hoping for:

$ curl -I -H "Accept: application/rdf+xml" http://www.zotero.org/bdarcus/34/items/10051
HTTP/1.1 200 OK
Date: Sat, 16 May 2009 00:55:57 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.4
Set-Cookie: zotero_www_session=deleted; expires=Fri, 16-May-2008 00:55:56 GMT; path=/; domain=www.zotero.org
Set-Cookie: zotero_www_session=deleted; expires=Fri, 16-May-2008 00:55:56 GMT; path=/; domain=.zotero.org
Set-Cookie: lussumocookieone=deleted; expires=Fri, 16-May-2008 00:55:56 GMT; path=/; domain=.forums.zotero.org
Set-Cookie: zotero_www_session_v2=ppu887h5s6plr76ja3eu5sb2n2; path=/; domain=.zotero.org
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8

If I actually request that RDF, I get back HTML.

As I understand it, while URIs are mutable by design, it is URNs that are meant to be immutable.

Just because you can change URIs doesn't mean you should. You certainly shouldn't without a good reason. See http://www.w3.org/TR/cooluris/

noksagt · May 15, 2009

Right now, the URI infrastructure isn't working as I'd expect.

Would you expect any app to emit RDF based on a header request, alone, on the first beta rollout? Tough crowd ;-)

fbennett · May 16, 2009

The response to the Accept header is a completely separate thing, unrelated to the form of the URI.

I read the document, which proved to be very interesting. The reservations it expresses about changing URIs are based on the assumption that changing the URI leaves a dangling link. See also:

http://www.w3.org/Provider/Style/URI

No dangling links here, though, so long as the server continues to perform redirects according to the same scheme. If it were to stop doing so (let's say in connection with the abrupt introduction of a policy that users should not be able to change their usernames ...), that would be a clear breach of the guidelines.

If anything, the Tim Berners-Lee document linked immediately above would favor using the ID number alone, with the username dropped altogether (viz: "What to leave out? Everything!"). It's a question of whether we trust the current redirection scheme to remain in place. If we do, then it boils down to readability and aesthetics (which is the sort of coolness that TBL argues against).

dstillman · May 16, 2009

Yeah, what Rick and Frank said. There's no reason we can't do content negotiation eventually, but it's unrelated, and we don't have RDF data on the server. Once we have a Zotero-to-BIBO mapping, we can offer RDF via content negotiation and the API (separately and/or with RDFa).

As for the username changing policy, since that dictates the current URI scheme, I don't have a strong opinion (at least once the forums are changed to display full names rather than usernames). Username changes are probably rare enough that it could either not be allowed or be allowed via special request, in which case the user IDs could be removed from the URLs (or, more accurately, they would be a redirected alternative to the username, since not everything that generates URIs will have the usernames). I suspect we'd get a handful of change requests a year, and we could easily keep redirects for those to keep old links working.

For user items, I don't see a better option than the item id. (Yes, in an ideal world there wouldn't be item editing either, but there is.) We can, however, append a more user-friendly slug to the end and ignore it, as this forum does with discussion titles. (Or, better, we could redirect if a request didn't match the current slug.)

(URLs generated in the client and embedded in documents will use the item's secondary key, which is a random 8-character string, instead of the id, which is now server-generated and no longer synced to clients, but, e.g., http://zotero.org/bdarcus/items/2H8FWNL3 will redirect to http://zotero.org/bdarcus/items/12345. We use ids for items on the site because they're less unwieldy.)

The current plan is to loosely associate user items with "abstract" items, which would aim to be the canonical representations of user items. I say "loosely" because this sort of aggregation will be inexact and subject to user data changes or algorithm improvements. Abstract items will probably be available at http://zotero.org/items/12345, with the same sort of optional, ignored, human-friendly suffix as discussed above. There'll be a link to the abstract item URI available on the user item web page and in API responses.

Finally, eventually we'll also get around to switching to zotero.org instead of www for site URLs. Generated URIs already all use zotero.org.

bdarcus · May 16, 2009

OK, Dan; but what about my point that what you call a user item is actually two separate things: the metadata about the source, and the user's referencing of it (notes, tags, etc., etc.)? This is more of a modeling point, but it does have implications for URIs too.

My point on content-negotiation wasn't clear, but I was just meaning to say the server was returning incorrect information (I believe; could be wrong).

On the periodicals dataset, I believe there's a bug with the PHP framework they're using (which I've reported).

bdarcus · May 16, 2009

Would you expect any app to emit RDF based on a header request, alone, on the first beta rollout? Tough crowd ;-)

Well, it's not like this is the first time I've raised this issue. And as I said in the previous post, I wasn't so much expecting the RDF at this point (something I didn't make clear).

noksagt · May 16, 2009

My point on content-negotiation wasn't clear, but I was just meaning to say the server was returning incorrect information (I believe; could be wrong).

I believe you are wrong. HTTP-1.0 doesn't specify a behavior (and that is the protocol that the Z server states it uses) & even HTTP-1.1 states:

If an Accept header field is present, and if the server cannot send a response which is acceptable according to the combined Accept field value, then the server SHOULD send a 406 (not acceptable) response.

(note "SHOULD" is different than "MUST" in RFCs: in neither HTTP-1.0 nor 1.1 can a client expect to only receive a media type from the list of types they claim to be able to read).

curl -I -H "Accept: application/rdf+xml" http://www.google.com

Leads to similar behavior.
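A simplified sketch of the two server behaviors at issue. It assumes a toy Accept parser that only understands q-values and ignores the specificity rules real servers apply; `parse_accept`, `negotiate`, and the `strict` flag are names invented for the example.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def matches(media_range: str, offered: str) -> bool:
    """True if an offered type falls within a media range such as text/*."""
    if media_range in ("*/*", offered):
        return True
    main_type, _, subtype = media_range.partition("/")
    return subtype == "*" and offered.startswith(main_type + "/")


def negotiate(accept: str, available: List[str], strict: bool = False) -> Tuple[int, Optional[str]]:
    """Pick a response type; `available` lists server types, default first."""
    best, best_q = None, 0.0
    for offered in available:
        for media_range, q in parse_accept(accept):
            if matches(media_range, offered) and q > best_q:
                best, best_q = offered, q
    if best is not None:
        return (200, best)
    # Nothing acceptable: a strict server answers 406, a lenient one
    # falls back to its default representation.
    return (406, None) if strict else (200, available[0])


print(negotiate("application/rdf+xml", ["text/html"]))
print(negotiate("application/rdf+xml", ["text/html"], strict=True))
```

The lenient branch is the behavior seen in the curl output earlier in the thread: the request asks for RDF and HTML comes back with a 200.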

dstillman · May 16, 2009

OK, Dan; but what about my point that what you call a user item is actually two separate things: the metadata about the source, and the user's referencing of it (notes, tags, etc., etc.)? This is more of a modeling point, but it does have implications for URIs too.

What sort of implications would you say it has? As long as users can edit their own items, user items may contain different data from the associated abstract items and need to provide that data. Is there a difference other than that some user items will link to canonical abstract items and others won't?

bdarcus · May 16, 2009

To me, a zotero item is two separate things, with two separate URIs. The item per se is effectively a bookmark, containing who bookmarked it (me), when they did it, the tags they assigned it, and the notes and attachments they associate with it.

The source proper is then linked/associated from that bookmark.

The tricky/awkward bit is modeling/identifying that linked resource. Maybe you have two separate relation properties: one to what you call the "canonical" version, and another to the user's version (which I would hope would generally be the same).

To jot it down in RDF, something like (warning: strawman):

<http://zotero.org/jdoe/items/1> a a:Bookmark ;
     dct:creator <http://zotero.org/jdoe> ;
     bm:recalls <http://www.nytimes.com/2008/06/25/business/25exurbs.html> ;
     z:udata <http://zotero.org/jdoe/items/1/udata> .

# here we represent the "canonical" data (would be much more verbose normally)
# note: could also use a zotero uri, and add owl:sameAs link to the nytimes URI
<http://www.nytimes.com/2008/06/25/business/25exurbs.html> a bibo:Article ;
    dct:title "Fuel Prices Shift Math for Life in Far Suburbs"@en .

# here we represent the user data, maybe only if it differs from the above
# note: still need a way to merge and disambiguate these data
<http://zotero.org/jdoe/items/1/udata> a bibo:Article ;
    dct:title "Fuel Prices Shift Math for Life in Far Suburbs"@en .

dstillman · May 22, 2009

OK, we're going to fix the forum name display, lock usernames, and remove the user ids from URIs. The API will also return URIs with usernames rather than user ids.

For group names, we're considering removing group ids for public groups and allowing a fixed number of group name changes within a given period (say, twice in six months), with automatic, permanent redirects. It seems there might be legitimate reasons to change a group name, and having a human-readable slug in a URL is valuable, and, if we're going to support post-name-change redirects from /groups/group_name anyway—which we should—there's no need to use the group id. Private groups will continue to use just the group id.
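A sketch of what the rename limit with permanent redirects might look like, purely as an illustration. The `GroupNames` class, its method names, and the 182-day window standing in for "six months" are all assumptions, not a description of what will be built.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


class GroupNames:
    """Track group slugs, limit renames, and keep old slugs as redirects."""

    def __init__(self, max_changes: int = 2, window: timedelta = timedelta(days=182)):
        self.max_changes = max_changes
        self.window = window
        self.owner: Dict[str, int] = {}      # every slug ever used -> group id
        self.current: Dict[int, str] = {}    # group id -> current slug
        self.changes: Dict[int, List[datetime]] = {}  # group id -> rename times

    def create(self, group_id: int, slug: str) -> bool:
        if slug in self.owner:
            return False  # slug already taken (currently or historically)
        self.owner[slug] = group_id
        self.current[group_id] = slug
        self.changes[group_id] = []
        return True

    def rename(self, group_id: int, new_slug: str, now: datetime) -> bool:
        # A slug may only be reused by the group that held it before.
        if self.owner.get(new_slug, group_id) != group_id:
            return False
        # Count only the renames that fall inside the window.
        recent = [t for t in self.changes[group_id] if now - t < self.window]
        if len(recent) >= self.max_changes:
            return False
        self.owner[new_slug] = group_id
        self.current[group_id] = new_slug
        self.changes[group_id] = recent + [now]
        return True

    def resolve(self, slug: str) -> Optional[Tuple[int, str]]:
        group_id = self.owner.get(slug)
        if group_id is None:
            return None
        latest = self.current[group_id]
        # Old slugs redirect permanently to the current one.
        return (200, latest) if slug == latest else (301, latest)


groups = GroupNames()
groups.create(11, "atom-probe_tomography")
groups.rename(11, "apt", datetime(2009, 5, 22))
print(groups.resolve("atom-probe_tomography"))
```

Because old slugs stay reserved for the group that used them, a retired name can never be claimed by a different group and silently hijack existing links.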

Bruce, I don't really see the need for separate entities (bookmark and user data) for user items. Having a single user item with owl:sameAs pointing to an abstract item (if one exists yet, which it may not until some asynchronous processing has occurred) seems perfectly sufficient.

bdarcus · May 22, 2009

Dan, on your last point, I'm a little unclear: are you suggesting collapsing all three of those entities I outline in the example RDF, or just the last two? The latter seems less problematic than the former.

dstillman · May 22, 2009

I'm suggesting that the first and last parts—the user-specific parts—would be combined and would point to a global abstract item if one existed.

And yes, the abstract items I'm referring to would be abstract Zotero items, with Zotero URIs such as http://zotero.org/items/13245. (These URIs would have associated web pages with statistics and links to users.) Abstract items would in turn point to external resources.

As you note, a user item might only include the original data if different from the abstract item's data.

bdarcus · May 22, 2009

I'm suggesting that the first and last parts—the user-specific parts—would be combined and would point to a global abstract item if one existed.

So you're telling me that you want to mix what is exclusively user data (date added/updated and tags being the most obvious) and what is not (the data about the source)?

What about notes and attachments?

dstillman · May 22, 2009

OK, that's a fair point, but what's the solution for something like RDFa, assuming the udata doesn't have its own web page separate from the "bookmark" (which we want to avoid)? I ask not having looked at RDFa in depth.

The same problem exists in the API, which we currently use internally but haven't yet made public. In the API, we put user data (user, date added, date modified) into the containing Atom entry fields, and the item data into the <content>. We currently just use custom XML in <content> but will likely switch to RDFa once the BIBO mapping is complete. But other than just using the Atom fields for the user data, how, then, do you model the two types of data in the RDFa response without requiring separate API requests for each item returned by the original request?

bdarcus · May 22, 2009

The key is just to focus on the URIs as identifiers, and so to consider the entities in the model. Dealing with the syntax issues isn't hard; the rdfa "about" attribute allows you to specify a subject URI apart from the document proper in the same way that you can do the same in the RDF as serialized into XML or turtle.

Might it be worth jotting down the ideas on the trac wiki (sean put a page up for the bibo mapping; could be there, or linked from there)?

bdarcus · May 26, 2009

OK, I created this page, and will post a note on the BIBO list:

https://www.zotero.org/trac/wiki/URIScheme

Dan, please adjust/comment as you see fit.