scrape copyright using extant tool
It would be very useful to have Zotero scrape an item's license and put it in a copyright field (CC-BY? CC-BY-NC-ND? All rights reserved?).
Case:
This would save time for librarians, repository managers, textbook writers, and anyone wanting to find usable excerpts from items in a Zotero database. It would also allow automated checking that items published open-access in a closed-access journal are correctly labelled (publishers make mistakes, and some universities apparently check manually). It would help anyone posting a Zotero-derived bibliography online and wanting to indicate which items are available to all readers. It would also help the Wikipedia Signalling-OAness project.
Code:
The scraper already exists. It's called Open Article Gauge (http://oag.cottagelabs.com/). It's licensed under a Modified BSD License, which is GPL-compatible. The host says the project has run out of funding. Cameron Neylon, who works for PLOS, the copyright holder, said they'd be happy to see it incorporated into Zotero, so relicensing is probably possible if needed.
The code is available on GitHub:
https://github.com/CottageLabs/OpenArticleGauge/
The code is in Python and Bash scripts, with a C dependency, not in Zotero's JavaScript. The key item, though, is a database of contexts in which to find the license statement, given the URL; this seems as if it would be readily portable.
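To illustrate why a URL-keyed database of license contexts looks portable, here is a minimal sketch of the general idea. It is not OAG's actual data structure or code: the table `LICENSE_CONTEXTS`, the function `detect_license`, and the phrases in the table are all assumptions made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: host name -> list of (phrase to look for, license label).
# The entries are illustrative only and are NOT taken from OAG.
LICENSE_CONTEXTS = {
    "onlinelibrary.wiley.com": [
        ("Creative Commons Attribution-NonCommercial-NoDerivs License", "CC-BY-NC-ND"),
        ("All Rights Reserved", "All rights reserved"),
    ],
}


def detect_license(url, page_text):
    """Return a license label if a known phrase for this host appears in the page text."""
    host = urlparse(url).hostname or ""
    lowered = page_text.lower()
    # Try each known phrase for this host, in order; first hit wins.
    for phrase, label in LICENSE_CONTEXTS.get(host, []):
        if phrase.lower() in lowered:
            return label
    return None  # unknown host or no recognised statement
```

Because the table is plain data, the same entries could in principle be re-expressed in JavaScript without touching the Python, Bash, or C parts.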
Does anyone with more knowledge of the Zotero codebase have any comments?
Zotero is currently unable to update or supplement an item's metadata from other providers after the item has been imported into the database. This feature is pretty high on the to-do list, though, and once it's implemented we could probably take advantage of the Open Article Gauge tool. Unfortunately, this won't happen very soon.
In the meantime, what information exactly are you expecting from the copyright field? Currently we typically store something like "©1999 <copyright holder>", but as I said, it's not very consistent and AFAIK not used anywhere.
For examples of copyright data, in my own collection I have:
copyright = {Approved for public release; distribution is unlimited.}
This is a work of a U.S. civil servant, so I think it's public domain, but that phrase is from the security classification (not Zotero's fault)!
Then there are open access papers:
copyright = {©2013. The Authors., This is an open access article under the terms of the Creative Commons Attribution-{NonCommercial-NoDerivs} License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.}
The website says just that (with a paragraph break instead of the first comma).
I'm suggesting separate fields for the copyright holder and the rights which are reserved. So one field could say "2013 CC-BY-NC-ND", "Public Domain" or "2012 All rights reserved" and one could say "The Authors", "American Geophysical Union", "Nature Publishing Group", or "The government of the United States".
Better yet, a separate field for the copyright year would make it easier to identify works that are in the public domain because their copyright has lapsed in a given country. Example:
copyright = {Copyright © 1929 Swedish Society for Anthropology and Geography}
Becomes:
Rights = All rights reserved
Copyright_year = 1929
Copyright_holder = Swedish Society for Anthropology and Geography
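A rough sketch of how such a split might be done for plain copyright notices follows. The function name `split_copyright` and the regular expressions are my own assumptions, not anything Zotero does today. It also assumes, as in the example above, that a bare notice with no license statement means all rights are reserved, and it does not try to handle long Creative Commons statements.

```python
import re

# Leading "Copyright", "(c)" and "©" markers, in any combination.
LEADING_MARKS = re.compile(r"^(?:\s*(?:copyright|\(c\)|©))+\s*", re.IGNORECASE)
YEAR = re.compile(r"\b(\d{4})\b")
ALL_RIGHTS = re.compile(r"[\s.,;]*all rights reserved[\s.]*", re.IGNORECASE)
OTHER_LICENSE = re.compile(r"creative commons|public domain|open access", re.IGNORECASE)


def split_copyright(raw):
    """Split a plain copyright notice into rights, year and holder."""
    # Assumption: no explicit license statement means "All rights reserved".
    rights = None if OTHER_LICENSE.search(raw) else "All rights reserved"

    text = ALL_RIGHTS.sub("", raw).strip()
    text = LEADING_MARKS.sub("", text)

    year = None
    match = YEAR.search(text)
    if match:
        year = int(match.group(1))
        text = text[:match.start()] + text[match.end():]

    # Tidy what is left: stray punctuation and a leading "by".
    holder = text.strip(" .,;")
    holder = re.sub(r"^by\s+", "", holder, flags=re.IGNORECASE)

    return {
        "Rights": rights,
        "Copyright_year": year,
        "Copyright_holder": holder or None,
    }
```

Calling `split_copyright("Copyright © 1929 Swedish Society for Anthropology and Geography")` gives the three values shown above, with the year as an integer. The holder name is left as the publisher wrote it.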
This data would be easier to extract were it not for the hideous mess of copyright holder names. Many of these pairs are downloaded from the same site (years changed), but show no consistency:
copyright = {© 2010 Nature Publishing Group}
copyright = {© 2011 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.}
copyright = {©2008. American Geophysical Union. All Rights Reserved.}
copyright = {Copyright 2009 by the American Geophysical Union.}
copyright = {©1999 Springer-Verlag Berlin Heidelberg}
copyright = {©2000 Springer {Science+Business} Media {B.V.}}
This is totally the publisher's responsibility, not Zotero's, but splitting the other data off from this mess at the time of scraping would make it easier to use.
There are also many publishers whose copyright info Zotero cannot currently scrape. I don't know if breaking it up would help, or how heavily the OAG algorithms would overlap with those Zotero already has. It seems like it's worth a look, though.
What we can do is store this information in a single field in some well-defined, parsable format, e.g. ©YEAR HOLDER (LICENSE). A Zotero add-on could even expand this into multiple display fields.
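To make the single-field idea concrete, here is a minimal sketch of writing and reading a ©YEAR HOLDER (LICENSE) string, including the case where some parts are missing. The names `format_rights` and `parse_rights` are hypothetical, and the convention that a missing part is simply left out is an assumption, not an agreed format.

```python
import re

# "©", optional 4-digit year, optional holder, optional "(LICENSE)" at the end.
RIGHTS_RE = re.compile(
    r"^©\s*(?P<year>\d{4})?\s*(?P<holder>.*?)\s*(?:\((?P<license>[^()]*)\))?$"
)


def format_rights(year=None, holder=None, license=None):
    """Build a '©YEAR HOLDER (LICENSE)' string, omitting unknown parts."""
    parts = ["©" + (str(year) if year else "")]
    if holder:
        parts.append(holder)
    if license:
        parts.append("(" + license + ")")
    return " ".join(parts)


def parse_rights(value):
    """Return (year, holder, license); any part may be None."""
    match = RIGHTS_RE.match(value.strip())
    if not match:
        return (None, None, None)
    year = int(match.group("year")) if match.group("year") else None
    holder = match.group("holder") or None
    return (year, holder, match.group("license") or None)
```

For example, `format_rights(2013, "The Authors", "CC-BY-NC-ND")` produces `©2013 The Authors (CC-BY-NC-ND)`, and `parse_rights` turns that back into the three parts. One known weakness of this sketch: a holder name that itself ends in parentheses would be mistaken for a license.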
Once we agree on a good format, the next step would be to try and improve the scraping to fix the messy data from the publisher. For that, links to where all of your sample data came from would be very useful.
(I'll take a closer look at OAG. I'm not sure I understand how it works)
©YEAR HOLDER (LICENSE) looks good, although I'm not sure if the © is needed or implied. Would it still be automatically parsable if some fields were missing? Or would you add "Unknown copyright holder" etc.?
I have no first-hand knowledge of how OAG works, I'm afraid; I've only glanced at it and talked to someone who worked on it.
Feel free to paste those links here. I know we can improve the Nature translator, since they provide license info.
copyright = {Copyright © 1929 Swedish Society for Anthropology and Geography},
url = {http://www.jstor.org/stable/519394}
copyright = {©2013. American Geophysical Union. All Rights Reserved.},
url = {http://onlinelibrary.wiley.com/doi/10.1002/2013JC008797/abstract}
copyright = {Copyright 2000 by the American Geophysical Union.},
url = {http://onlinelibrary.wiley.com/doi/10.1029/2000GL011771/abstract}
copyright = {© 1999 Nature Publishing Group},
url = {http://www.nature.com/nature/journal/v399/n6735/abs/399429a0.html},
copyright = {© 2014 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.},
url = {http://www.nature.com/nature/journal/v506/n7489/full/nature12991.html}
copyright = {©2012 Springer {Science+Business} Media {B.V.}},
url = {http://link.springer.com/chapter/10.1007/978-94-007-2027-5_3}
copyright = {©1999 Springer-Verlag Berlin Heidelberg},
url = {http://link.springer.com/chapter/10.1007/978-3-642-60134-7_13}
copyright = {©2013. The Authors., This is an open access article under the terms of the Creative Commons Attribution-{NonCommercial-NoDerivs} License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.},
url = {http://onlinelibrary.wiley.com/doi/10.1002/2013GL058479/abstract}
Journal of Raman Spectroscopy annoyingly puts a copyright notice in the abstract text. http://onlinelibrary.wiley.com/doi/10.1002/jrs.4464/abstract