Wikified copyleft bibliographic database

HLHJ · April 20, 2014

When I automatically generate an entry from a PDF, the request goes off to Google, and frequently it comes back "No OCR text". Worse, it sometimes comes back with an INCORRECT citation. The most common error is only listing the first one or two authors when there are in fact more. So, while I've been saved a lot of work, I still have to proofread my bibliography manually, and correct my database manually. I hate manually correcting bibliographies :).

Solution: a Zoterowiki citation database. I click to upload a complete citation (including the URL from which I got the information and/or the fulltext). I merge it with any duplicates, automatically suggested as in the current Zotero. I check it against the fulltext and mark it as proofread by a human (authors and publishers can stamp it with a higher grade of verification).

The neat bit: if I download some fulltext or a partial citation, I can choose to run it past Google, the copyleft database, or (default) both. Even if the OCR is stumped, I still get back a (proofread!) citation, indexed through the download location and filename (and maybe a file hash?). If that fails, perhaps because some publisher set all their PDFs to download with the same name from a dynamic location, I can manually search the copyleft database.
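A rough sketch of the lookup described above, in Python. Everything here is an assumption for illustration: the key types, the function names, and the idea that the shared database is a simple mapping from keys to citations. The lookup tries the file hash first, then the exact download URL, then the host plus filename.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlsplit


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lookup_keys(path, download_url):
    """Build candidate lookup keys for a downloaded file, most specific first."""
    host = urlsplit(download_url).netloc
    return [
        ("sha256", file_sha256(path)),
        ("url", download_url),
        ("host+filename", host + "/" + Path(path).name),
    ]


def find_citation(index, keys):
    """Return the first citation stored under any of the keys, or None.

    `index` is a hypothetical stand-in for the shared database:
    a dict mapping (key_type, value) tuples to citation records.
    """
    for key in keys:
        if key in index:
            return index[key]
    return None
```

If every key misses (for example, a publisher that serves all its PDFs under one filename from a dynamic location, with per-download watermarks that change the hash), `find_citation` returns `None` and the manual search is the fallback.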

The database could also help find fulltexts, since it lists multiple URLs where they can be found. This would help users locate the gap between the paywall and the findwall. Often the easy-to-find, DOI-linked copy is charged for, even if the author is legally offering the paper for free download on their personal website or an institutional repository. It should also be possible to add a DOI/URL for downloading a preprint and supplementary material.

The database makes life cheaper for the Zotero servers. By default, the citations users upload are just stored as database indexes; if 200 users each have a copy of the abstract of a popular paper, you are storing 200 copies of the DOI and one copy of the abstract, rather than 200 copies of both.
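To make the storage argument concrete, here is a minimal sketch using an in-memory SQLite database. The table names, column names, and the DOI `10.1234/example` are all made up for illustration; this is not Zotero's actual server schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (doi TEXT PRIMARY KEY, abstract TEXT);
    CREATE TABLE user_items (
        user_id INTEGER,
        doi TEXT REFERENCES records(doi),
        PRIMARY KEY (user_id, doi)
    );
    """
)


def add_item(conn, user_id, doi, abstract):
    """Store the shared record once, then link the user to it."""
    conn.execute(
        "INSERT OR IGNORE INTO records (doi, abstract) VALUES (?, ?)",
        (doi, abstract),
    )
    conn.execute(
        "INSERT OR IGNORE INTO user_items (user_id, doi) VALUES (?, ?)",
        (user_id, doi),
    )


# 200 users all add the same popular paper.
for user_id in range(200):
    add_item(conn, user_id, "10.1234/example", "Abstract of a popular paper.")

shared = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
links = conn.execute("SELECT COUNT(*) FROM user_items").fetchone()[0]
print(shared, links)  # prints "1 200": one abstract, 200 DOI references
```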

If I am stressed by the lack of control (someone else could correct errors in my citations!), I can lock my citation to the database index and date, so I always use the stable archived version. Finally, I can always set it to ignore the database and use the current facilities.
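Locking to a date could work roughly like this, assuming the database keeps every revision of a citation with the date it was made. The revision list format and the function name are hypothetical.

```python
from datetime import date


def version_as_of(revisions, pinned_date):
    """Return the newest metadata revised on or before pinned_date.

    `revisions` is a list of (date, metadata) pairs sorted by date,
    oldest first. Returns None if nothing existed yet on that date.
    """
    chosen = None
    for revised_on, metadata in revisions:
        if revised_on > pinned_date:
            break
        chosen = metadata
    return chosen


revisions = [
    (date(2014, 1, 5), {"authors": ["A. Author"]}),
    (date(2014, 3, 9), {"authors": ["A. Author", "B. Author"]}),
]
# A citation pinned in February keeps the January version,
# even though someone corrected the author list in March.
pinned = version_as_of(revisions, date(2014, 2, 1))
```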

I realise there are already copyleft citation databases, like CiteSeerX. I'd suggest working with them, and building Zotero tools to interact with them, rather than starting from scratch.

------------
Bonus:
There are plenty of authors named "J. Smith". The database could contain authors, indexing their publications and their personal websites.
------------

I'm willing to put in some time on this, say an hour a week. I can program, write basic HTML and CSS, and have some experience with Postgres (but not SQLite) and regular expressions. I have no familiarity with Mozilla technologies or Git (as listed at https://www.zotero.org/support/dev/start). I have only a user's familiarity with Zotero. Who's interested?

aurimas · April 20, 2014

The most common error is only listing the first one or two authors when there are in fact more.

This should be fairly trivial to fix by re-retrieving better metadata from the publisher's website and/or CrossRef.
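As an illustration only (this is not Zotero's retrieval code), re-fetching the full author list for a known DOI might look like the sketch below. It assumes the public Crossref REST API at `https://api.crossref.org/works/` and its JSON layout, where authors sit under `message.author` with `given` and `family` fields; the function names are made up.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the lookup URL for a DOI, percent-encoding unsafe characters."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def author_names(payload):
    """Extract 'Given Family' names from a parsed Crossref works response."""
    names = []
    for author in payload.get("message", {}).get("author", []):
        parts = (author.get("given"), author.get("family"))
        name = " ".join(part for part in parts if part)
        # Organisational authors carry a single "name" field instead.
        names.append(name or author.get("name", ""))
    return names


def fetch_authors(doi, timeout=10):
    """Fetch the record for a DOI over the network and return its authors."""
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        return author_names(json.load(response))
```

Comparing `len(fetch_authors(doi))` with the number of authors in the imported item would flag the truncated-author-list case described above.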

Some slightly related (and very long) discussion about improving the accuracy of metadata imports can be found here: https://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/ (Note that the issue with Google Scholar lockouts, which was the biggest driver for moving to alternative sources of metadata, has mostly been resolved, though we still recognize that GS metadata is poor.)

For reference, there has been some discussion on curated metadata repositories previously: https://forums.zotero.org/discussion/26501/crowd-sourcing-bibliographic-errors/ I feel like curating such a huge repository of metadata is a HUGE amount of work. I think CrossRef is in the best position to make this happen.

There are plenty of authors named "J. Smith". The database could contain authors, indexing their publications and their personal websites.

This would probably be delegated to a third party (ORCID is in the best position for this IMO). Some recent discussions about this include https://forums.zotero.org/discussion/36088/orcid/ and https://forums.zotero.org/discussion/3913/single-author-with-two-different-name-spellings-short-vs-longhand

HLHJ · April 20, 2014

Curating the metadata would be a huge job. I don't suggest Zotero does it. It would be a really useful job, though, and Zotero would make a great tool for anyone doing it, even without integration. Let's see if we can figure out who might do it, and how to encourage them.

For crowdsourcing assemblies of data, the champion has to be Wikipedia. This project wouldn't be too different from things like their taxonomic database.

For the actual database, CrossRef would be ideal. And it seems they do indeed offer such services, but only to authorized partners, and they charge for it (http://www.crossref.org/cms/index.html). I understand from one of your links (automating-massimport-from-pdfs) that they make an exception for Zotero; when you look something up by DOI they give you the results. I would be happy to improve their database, if they would be willing to release all the citations I proofread under a good copyleft or into the public domain. I already e-mail individual publishers when the metadata on their pages is wrong; this would automate the task. Do you think there's a chance?

CiteSeerX is already openly sharing its data; is it used in the Zotero automatic metadata retrieval?

I would really like to be able to see, and choose, what metadata sources are used.

Wrong data is fairly trivial to fix, usually, if there is a DOI. Just tedious, as related here: https://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/

I do take DWL-SDCA's comments, to the effect that I should be reading all the fulltexts, to heart. Note that in my original comment I did not object to proofreading my own bibliographies -- but I want to reduce the number of errors I find and fix, so I'd be happy to have the crowd copyedit for me first.

I would also note that I have papers in Zotero which I will probably never read thoroughly or cite; I downloaded them, skimmed them, decided they were of no immediate use, and just didn't delete them. I might want them someday, and then I will suffer the tedium of proofreading the data. Zotero can also be used as a sort of bookmarking tool.

A copyleft metadata database would be useful for other things, too, especially if more publishers get on board with ORCIDs and bibliographies that are as freely reproducible as abstracts.
https://forums.zotero.org/discussion/22523/zotero-and-citation-analysis/

adamsmith · April 20, 2014

Doesn't Mendeley already do this to some degree (and release the data under CC-BY)? I understand their data quality is still pretty mediocre.
As for CrossRef: their DOI lookup API is open and unrestricted, and anyone can use it. You only have to pay for more advanced functions and bulk data (as well as the ability to deposit).

No, Zotero doesn't use CiteSeer to retrieve metadata. There isn't a compelling reason to: it's pretty incomplete and the data isn't very good.

HLHJ · April 20, 2014

It would be nice to be able to choose the sources of your metadata, including the Mendeley database. Could OAI-PMH be used?

Mendeley’s database is freely accessible under a Creative Commons license, but is it forkable? Can someone mirror it, like Genbank?

adamsmith · April 20, 2014

For Mendeley: technically, CC-BY gives you the right to fork. In practice, I believe the only way to access it is via an API that throttles requests after a while, so you'd be at it for, uhm, a while. I'm not 100% sure about this, though.

I'm not denying that having an open database would be nice, but absent a major grant or the like, I don't see it happening, quite honestly.

dstillman · April 20, 2014

Yeah, I think providing collaboratively edited authoritative metadata is beyond our scope at the current time, though it may become increasingly unavoidable for us.

In the short term, though, to address the original problem described by HLHJ, I think the most realistic thing would be to make metadata requests to the Zotero API before Google Scholar, use Zotero's synced full-text content to find identifiers, and then use CrossRef and the like to get authoritative data from those. This doesn't provide all of the benefits of a collaboratively edited database, but it would allow for a wider range of materials to be recognized and also decrease our reliance on Google Scholar (which is less of a big deal since the rate-limit improvements but would still be nice in general). And by just extracting identifiers, it would — in my view, at least — avoid privacy concerns with exposing potentially private metadata. (When this was suggested previously I think the idea was that it would just use the metadata directly.)
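The "extract identifiers from full text" step could be sketched as below. The regular expression is a common heuristic for DOIs, not an official grammar, and the function name is made up; real DOIs can end in characters such as `)` that this sketch strips as trailing punctuation.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant prefix, "/", then a suffix
# running up to the next whitespace, quote, or angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def extract_dois(text):
    """Return the distinct DOI-like strings in text, in order of appearance."""
    found = []
    seen = set()
    for match in DOI_PATTERN.finditer(text):
        doi = match.group(0).rstrip(".,;:)]}")
        key = doi.lower()  # DOIs are case-insensitive
        if key not in seen:
            seen.add(key)
            found.append(doi)
    return found
```

Only the extracted identifiers would then be sent on to CrossRef or a similar source, so none of the user's own metadata has to leave the synced library.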

The biggest question, I'd say, is whether there are enough documents with identifiers and authoritative metadata available elsewhere that aren't detected by our current retrieval process to make this worthwhile.

HLHJ · April 22, 2014

Dan Stillman's idea sounds like a good one to me. It has the advantage that the papers used by Zotero users are more likely to resemble those used by other Zotero users than those used by users of Google Scholar. It also gives an incentive to get other people in one's field using Zotero :). If it were clearly indicated what sources you were using, with some configurability, it might also incentivize people to improve the sources.

I would not be concerned about privacy issues arising from Zotero using the URLs and file hashes of my synced data to index identifiers. Since Zotero is not in the business of profiling me, it might even be considered a privacy improvement.

Mendeley is based in London; if it's useful, I volunteer to talk to them and physically go and get their database, so that bandwidth cost is not a problem. Just the metadata on even hundreds of millions of items shouldn't be too big to sneakernet.

Data from Zotero sync, Crossref, Mendeley, and CiteSeerX would at least decrease the number of documents for which I have to enter all the data from scratch, though most of those documents lack modern identifiers and would have to be identified by URL or file hash.

CiteSeerX has some funding; perhaps they'd be interested in a collaboration? Shall I e-mail and ask?

Wikipedia has a strong interest in a good index of source metadata. Does anyone mind if I moot the idea there? They certainly have the collaborative editing infrastructure that Zotero currently lacks.

HLHJ · July 27, 2014

This is happening! We are currently making a database of the metadata of everything with a DOI in WikiData (the database sidekick of Wikipedia). It is CC-0.

I think this addresses most of the problems discussed here. Zotero does not have to host or maintain the database, but Zotero users are welcome to use and contribute to it. The database has a few ORCIDs, and as these become common it will gain more.

We built a CSL (*edit) importer last weekend. This is definitely still in beta; the data schema, authorship, and version control are not yet sorted out! But a prototype Zotero importer (a button in Zotero to upload part or all of a collection to the bot, which would then insert it into Wikidata) would already be very useful. A Zotero downloader (where Zotero can get proofread metadata from Wikidata) is the next step after that.

Comments from those familiar with the Zotero codebase are very welcome.

aurimas · July 27, 2014

Do you have a link for more details? What data exactly is being collected? Is there an API?

HLHJ · July 27, 2014

Example scholarly article metadata (work in progress): https://www.wikidata.org/wiki/Q15625490

We're collecting any metadata on anything with a DOI which seems likely to be useful to Wikipedia or be important to scholarly activity; https://www.wikidata.org/wiki/Wikidata:Notability. We're starting with the standard bibliographic information on academic periodicals and books, since people cite those a lot. We are including publications not yet mentioned in Wikipedia.

Once the schema and imports work well, the lowest-hanging fruit will likely go first (figuring out what existing datasets we can import). We have already been using files exported manually from Zotero for sandbox work; hopefully, in the future there will be a way for users to contribute edits through Zotero, and we'll do something intelligent to make sure that they can get credit for their edits and that repeated uploads of incorrect metadata (say because the publisher or Google Scholar or some repository made a mistake) get repeatedly ignored. If Zotero is developing the ability to update/supplement metadata from other providers after it is imported into the Zotero database, that might involve similar work; maybe some of it could be shared.

Apart from the standard bibliography fields, we can add others (is this article a review? what Wikiquote pages include quotes from this item?). Some of these fields will be pretty useless to Zotero, but any data useful to Zotero will probably be included, because Zotero gets used a lot by Wikipedia editors.

What fields to include is currently being discussed, and help from the Zotero community would be very useful.

https://www.wikidata.org/wiki/Wikidata:Periodicals_task_force

https://www.wikidata.org/wiki/Wikidata:Books_task_force

There is an API, with documentation and user support:

https://www.wikidata.org/wiki/Wikidata:API

Downloading data is fairly unrestricted, but the Wikidata devs ask that no-one automatically upload anything (except into an allocated sandbox) without some testing and review. Getting testing space is fairly simple (register a Wikimedia account, then follow the steps at https://www.wikidata.org/wiki/Wikidata:Bots). The importer mentioned in the post above is at https://github.com/mitar/csl2wikidata.
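For anyone who wants to try a read-only download, here is a minimal sketch. It assumes the standard MediaWiki endpoint at `https://www.wikidata.org/w/api.php` and its `wbgetentities` action; the function names and the User-Agent string are placeholders, and nothing here writes to Wikidata.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def entity_url(entity_id, languages="en"):
    """Build a read-only wbgetentities request URL for one item."""
    query = urlencode({
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels|claims",
        "languages": languages,
        "format": "json",
    })
    return API_ENDPOINT + "?" + query


def english_label(payload, entity_id):
    """Pull the English label out of a parsed wbgetentities response."""
    entity = payload.get("entities", {}).get(entity_id, {})
    return entity.get("labels", {}).get("en", {}).get("value")


def fetch_entity(entity_id, timeout=10):
    """Download one item; the User-Agent here is a placeholder to replace."""
    request = Request(
        entity_url(entity_id),
        headers={"User-Agent": "zotero-forum-example/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `english_label(fetch_entity("Q15625490"), "Q15625490")` would fetch the example item linked above; the bibliographic fields themselves live under the item's `claims`.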

BTW, I did e-mail CiteSeerX, but got no response.

HLHJ · July 27, 2014

Did you know about this use of Zotero?

https://www.mediawiki.org/wiki/Citoid

The person mostly working on Citoid (Mvolz) is also working on bibliographic data in Wikidata (and is friendly and helpful).

HLHJ · August 4, 2014

Discussion of Zotero in relation to this on Github:

https://github.com/mitar/csl2wikidata/issues/1

I notice there appears to be no metadata on the specific reviews accessible through the DOI lookup (testing using the Cochrane reviews cited at https://en.wikipedia.org/wiki/Malaria). If you paste a DOI into the finder, it gives you generic metadata about Cochrane Reviews in general. I've e-mailed Cochrane about this.

Maybe related to https://forums.zotero.org/discussion/22784/cochrane-library/ ?

aurimas · August 5, 2014

It's not related to the other thread and it doesn't seem to be an issue with Cochrane's submitted data. The problem is on our end and I'm working on a solution. Will let you know when it's done.