scrape copyright using extant tool

HLHJ · July 27, 2014

It would be very useful to have Zotero scrape an item's license and put it in a copyright field (CC-BY? CC-ND-NC-BY? All rights reserved?).

Case:
This would save time for librarians, repository managers, textbook writers, and anyone wanting to find usable excerpts from items in a Zotero database. It would also allow automated checking that items published open-access in a closed-access journal are correctly labelled (publishers make mistakes, and some universities apparently check manually). It would help anyone posting a Zotero-derived bibliography online and wanting to indicate which items are available to all readers. It would also help the Wikipedia Signalling-OAness project.

Code:
The scraper already exists. It's called Open Article Gauge (http://oag.cottagelabs.com/). It's licensed under a Modified BSD License, which is GPL-compatible. The host says the project has run out of funding. Cameron Neylon, who works for PLOS, the copyright holder, said they'd be happy to see it incorporated into Zotero, so relicensing is probably possible if needed.

The code is available on GitHub:
https://github.com/CottageLabs/OpenArticleGauge/

The code is in Python and Bash scripts, with a C dependency, not in Zotero's JavaScript. The key item, though, is a database of contexts in which to find the license statement, given the URL; this seems as if it would be readily portable.
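
A minimal sketch of what such a lookup could look like, assuming a plain table keyed by hostname. This is not taken from OAG's code; the table layout, the `LICENSE_STATEMENTS` name and the `license_for` helper are all hypothetical, and the single entry uses the Wiley wording quoted later in this thread.

```python
from urllib.parse import urlparse

# Hypothetical table: hostname -> list of (statement fragment, license label).
# The fragment is the Wiley wording quoted in this thread; the real OAG data
# may be organised quite differently.
LICENSE_STATEMENTS = {
    "onlinelibrary.wiley.com": [
        ("under the terms of the Creative Commons "
         "Attribution-NonCommercial-NoDerivs License", "CC BY-NC-ND"),
    ],
}


def license_for(url, page_text):
    """Return a license label if a known statement for this host is in the text."""
    host = urlparse(url).hostname or ""
    # Collapse whitespace and lowercase so line breaks in the page do not matter.
    normalised = " ".join(page_text.split()).lower()
    for statement, label in LICENSE_STATEMENTS.get(host, []):
        if statement.lower() in normalised:
            return label
    return None
```

A table like this is only data plus a substring test, which is why it looks portable to another language.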

Does anyone with more knowledge of the Zotero codebase have any comments?

aurimas · July 27, 2014

In general we try to pick up copyright information when scraping metadata from a page. It's not in any standardized format, which could be an issue, and not a lot of pages provide proper information in that regard. Improving scraping of copyright information would be the first thing to do (do you have some examples where we could be doing a better job of picking up this information?)

Zotero is currently unable to update/supplement metadata from other providers after it is imported into the database. This feature is pretty high on the to-do list though, and once it's implemented, we could probably take advantage of the Open Article Gauge tool. This won't happen very soon, unfortunately.

In the meantime, what information exactly are you expecting from the copyright field? Currently we typically store something like "©1999 <copyright holder>", but as I said, it's not very consistent and AFAIK not used anywhere.

HLHJ · July 27, 2014

I'd love to be able to update metadata, but I'm not suggesting it here. I'm suggesting lifting the scraping algorithms from OAG and using them to improve Zotero's scraping.

For examples of copyright data, in my own collection I have:

copyright = {Approved for public release; distribution is unlimited.}

This is a work of a U.S. civil servant, so I think it's public domain, but that phrase is from the security classification (not Zotero's fault)!

Then there are open access papers:

copyright = {©2013. The Authors., This is an open access article under the terms of the Creative Commons Attribution-{NonCommercial-NoDerivs} License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.}

The website says just that (with a para break instead of the first comma).

I'm suggesting separate fields for the copyright holder and the rights which are reserved. So one field could say "2013 CC-NC-ND", "Public Domain" or "2012 All rights reserved" and one could say "The Authors", "American Geophysical Union", "Nature Publishing Group", or "The government of the United States".

Better yet, a separate field for the copyright year would make it easier to identify things that were in the public domain because the copyright had lapsed in a given country. Example:

copyright = {Copyright © 1929 Swedish Society for Anthropology and Geography}

Becomes:
Rights = All rights reserved
Copyright_year = 1929
Copyright_holder = Swedish Society for Anthropology and Geography
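
A rough sketch of how that split might be done on the raw string, assuming the year is the first four-digit number and that anything without an explicit license defaults to "All rights reserved". The `split_copyright` helper and its patterns are hypothetical, not anything Zotero does today.

```python
import re

# First plausible four-digit year (1500-2099) in the string.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
# Boilerplate words and symbols that are not part of the holder's name.
NOISE_RE = re.compile(r"copyright|©|\(c\)|all rights reserved", re.IGNORECASE)


def split_copyright(raw):
    """Split a raw copyright string into rights, year and holder."""
    year_match = YEAR_RE.search(raw)
    year = int(year_match.group(1)) if year_match else None
    holder = YEAR_RE.sub(" ", raw, count=1)      # drop the year
    holder = NOISE_RE.sub(" ", holder)           # drop boilerplate
    holder = re.sub(r"\s+", " ", holder).strip(" .,")
    # "Copyright 2009 by the X" -> "X"; a bare leading "The" is kept.
    holder = re.sub(r"^by\s+(?:the\s+)?", "", holder, flags=re.IGNORECASE)
    return {
        # Assumption: no license detection here, so rights always default.
        "Rights": "All rights reserved",
        "Copyright_year": year,
        "Copyright_holder": holder or None,
    }
```

Called on the 1929 string above, this yields the three values shown; it makes no attempt to recognise open licenses.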

This data would be easier to extract were it not for the hideous mess of copyright holder names. Many of these pairs are downloaded from the same site (years changed), but show no consistency:
copyright = {© 2010 Nature Publishing Group}
copyright = {© 2011 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.}
copyright = {©2008. American Geophysical Union. All Rights Reserved.}
copyright = {Copyright 2009 by the American Geophysical Union.}
copyright = {©1999 Springer-Verlag Berlin Heidelberg}
copyright = {©2000 Springer {Science+Business} Media {B.V.}}

This is totally the publisher's responsibility, not Zotero's, but splitting the other data off from this mess at the time of scraping would make it easier to use.

There are also many publishers whose copyright info Zotero cannot currently scrape. I don't know if breaking it up would help, or how heavily the OAG algorithms would overlap with those Zotero already has. It seems like it's worth a look, though.

fbennett · July 27, 2014

Probably the most important copyright terms to capture accurately are the Creative Commons licenses, since the cost of reusing CC content is very low, once the license is known. Aren't CC licenses designed to be machine-readable (for that reason)?
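
Where a page does mark its license the way Creative Commons recommends (a link carrying rel="license" that points at the license URL), picking it up could be as simple as the sketch below. It uses only Python's standard library; the class and function names and the label format ("CC BY-NC-ND 3.0") are illustrative assumptions, and whether a given publisher page carries this markup at all is a separate question.

```python
import re
from html.parser import HTMLParser

# Matches e.g. creativecommons.org/licenses/by-nc-nd/3.0
CC_URL_RE = re.compile(
    r"creativecommons\.org/(licenses|publicdomain)/([a-z\-]+)/(\d+\.\d+)",
    re.IGNORECASE,
)


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def cc_label(url):
    """Turn a Creative Commons URL into a short label, or None if not CC."""
    m = CC_URL_RE.search(url)
    if not m:
        return None
    kind, code, version = m.groups()
    if kind.lower() == "publicdomain":
        return "CC0 " + version if code.lower() == "zero" else "Public Domain"
    return "CC {} {}".format(code.upper(), version)


def find_cc_licenses(html):
    """Return labels for every CC license link found in the HTML."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    return [label for label in map(cc_label, finder.hrefs) if label]
```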

HLHJ · July 27, 2014

Do publishers generally use CC licenses in a readily machine-readable form? Is it always easy to scrape a CC license? If so, it definitely deserves its own field, and a standard format, uncluttered with non-machine-readable stuff.

aurimas · July 27, 2014

I don't think we'll get multiple fields for this in Zotero (at least not in the database or not any time soon).

What we can do is store this information in a single field in some well-defined parsable format. E.g. ©YEAR HOLDER (LICENSE). (a Zotero add-on could even expand this into multiple display fields)

Once we agree on a good format, the next step would be to try and improve the scraping to fix the messy data from the publisher. For that, links to where all of your sample data came from would be very useful.

(I'll take a closer look at OAG. I'm not sure I understand how it works)

HLHJ · July 27, 2014

A well-defined format would be pretty much just as good. I can certainly give you the URLs for any or all of the examples I gave.

©YEAR HOLDER (LICENSE) looks good, although I'm not sure if the © is needed or implied. Would it still be automatically-parsable if some fields were missing? Or would you add "Unknown copyright holder" etc.?

I have no first-hand knowledge of how OAG works, I'm afraid; I've only glanced at it and talked to someone who worked on it.

aurimas · July 27, 2014

I think the field would remain parsable if any of the metadata was missing. Though maybe I would use braces instead of parentheses for the license, in case the copyright holder's name contains parentheses.
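
To make the "©YEAR HOLDER {LICENSE}" idea concrete, here is a minimal sketch of a formatter and parser in which every part is optional. Nothing here is an agreed Zotero format; the regular expression and the `format_rights` and `parse_rights` names are assumptions for illustration only.

```python
import re

# Optional ©, optional 4-digit year, optional holder (no braces allowed),
# optional {LICENSE} at the end.
FIELD_RE = re.compile(
    r"^©?\s*(?P<year>\d{4})?\s*(?P<holder>[^{}]*?)\s*"
    r"(?:\{(?P<license>[^{}]*)\})?\s*$"
)


def format_rights(year=None, holder=None, license_name=None):
    """Build a single-field string; missing parts are simply left out."""
    parts = []
    if year or holder:
        parts.append("©" + " ".join(str(p) for p in (year, holder) if p))
    if license_name:
        parts.append("{" + license_name + "}")
    return " ".join(parts)


def parse_rights(field):
    """Parse the single-field string back into its parts (None if missing)."""
    m = FIELD_RE.match(field.strip())
    if m is None:
        return None  # e.g. braces inside the holder name
    year = m.group("year")
    return {
        "year": int(year) if year else None,
        "holder": m.group("holder") or None,
        "license": m.group("license") or None,
    }
```

With braces reserved for the license, a holder such as "Foo (Bar) Ltd" round-trips unchanged, and in this sketch a string with no © at all still parses.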

Feel free to paste those links here. I know we can improve the Nature translator, since they provide license info.

HLHJ · July 27, 2014

The public-domain one I downloaded manually from a website that I might be able to find again, but I don't have the URL. Here are the rest:

copyright = {Copyright © 1929 Swedish Society for Anthropology and Geography},
url = {http://www.jstor.org/stable/519394}

copyright = {©2013. American Geophysical Union. All Rights Reserved.},
url = {http://onlinelibrary.wiley.com/doi/10.1002/2013JC008797/abstract}

copyright = {Copyright 2000 by the American Geophysical Union.},
url = {http://onlinelibrary.wiley.com/doi/10.1029/2000GL011771/abstract}

copyright = {© 1999 Nature Publishing Group},
url = {http://www.nature.com/nature/journal/v399/n6735/abs/399429a0.html},

copyright = {© 2014 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.},
url = {http://www.nature.com/nature/journal/v506/n7489/full/nature12991.html}

copyright = {©2012 Springer {Science+Business} Media {B.V.}},
url = {http://link.springer.com/chapter/10.1007/978-94-007-2027-5_3}

copyright = {©1999 Springer-Verlag Berlin Heidelberg},
url = {http://link.springer.com/chapter/10.1007/978-3-642-60134-7_13}

copyright = {©2013. The Authors., This is an open access article under the terms of the Creative Commons Attribution-{NonCommercial-NoDerivs} License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.},
url = {http://onlinelibrary.wiley.com/doi/10.1002/2013GL058479/abstract}

Journal of Raman Spectroscopy annoyingly puts a copyright notice in the abstract text. http://onlinelibrary.wiley.com/doi/10.1002/jrs.4464/abstract