Export to Schema.org RDFa and/or Microdata

westurner · April 12, 2014

How would I go about adding HTML + RDFa [1] and/or HTML + Microdata [2] export templates with Schema.org classes and properties to Zotero?

References

[1] http://www.w3.org/TR/xhtml-rdfa-primer/
[2] http://www.w3.org/TR/microdata/
[3] https://en.wikipedia.org/wiki/Schema.org
[4] http://schema.org/docs/full.html
[5] http://lists.w3.org/Archives/Public/public-vocabs/2014Apr/0104.html

adamsmith · April 12, 2014

Do you want Zotero to be able to import these (and if so from where specifically) or for Zotero to be able to generate them?

There's an open ticket for import: https://github.com/zotero/translators/issues/366
but we don't have much in terms of a use case at this time (i.e. we're not seeing this on many sites).

In terms of "how" - the starting point should be the "Embedded Metadata" Translator. It may just be easiest to add RDFa and/or microdata support in there.

westurner · April 12, 2014

Once I have stored citations in Zotero, I would like to generate (export) the citations to an HTML page with Schema.org RDFa.

* Select a few citations
* Right-click: "Export selected items"
* [Request]: Select 'HTML (Schema.org RDFa)' format
* Click 'Ok'

(This may be difficult, as there would be no citation style selected.)

And/or

* Select a few citations
* Right-click: "Create Bibliography from Selected Items"
* Select a citation style
* [Request]: Select Save as 'HTML + [Schema.org] RDFa'

In terms of source files, from github search, I see:

* https://github.com/zotero/zotero/blob/master/chrome/content/zotero/bibliography.js
* https://github.com/zotero/zotero/blob/master/chrome/content/zotero/bibliography.xul

But I'm not sure where mappings between:

* Zotero types <--> http://schema.org/CreativeWork subclasses
* CSL fields <--> Schema.org properties (see [5])

would need to be.
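For illustration, here is a minimal sketch of what such mapping tables might look like. The specific pairings are assumptions made for the example, not an agreed CSL <--> Schema.org crosswalk, and the helper names (`schema_class_for`, `schema_properties_for`) are hypothetical.

```python
# Hypothetical mapping tables; the pairings below are illustrative assumptions,
# not an established CSL <-> Schema.org crosswalk.
CSL_TYPE_TO_SCHEMA_CLASS = {
    "article-journal": "ScholarlyArticle",
    "article-newspaper": "NewsArticle",
    "post-weblog": "BlogPosting",
    "book": "Book",
    "webpage": "WebPage",
}

CSL_FIELD_TO_SCHEMA_PROPERTY = {
    "title": "name",
    "author": "author",
    "issued": "datePublished",
    "publisher": "publisher",
    "URL": "url",
}


def schema_class_for(csl_type: str) -> str:
    """Return a Schema.org class name, falling back to CreativeWork."""
    return CSL_TYPE_TO_SCHEMA_CLASS.get(csl_type, "CreativeWork")


def schema_properties_for(csl_item: dict) -> dict:
    """Rename the CSL fields that have a mapping; drop the rest."""
    return {
        CSL_FIELD_TO_SCHEMA_PROPERTY[key]: value
        for key, value in csl_item.items()
        if key in CSL_FIELD_TO_SCHEMA_PROPERTY
    }
```

Unmapped types fall back to CreativeWork, and unmapped fields are silently dropped, which is one place where fidelity would be lost.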

westurner · April 12, 2014

There's an example of an http://schema.org/Article (base type of BlogPosting, NewsArticle, ScholarlyArticle, TechArticle) represented in RDFa on that page, under 'Examples' (in the RDFa tab).

adamsmith · April 12, 2014

Frankly, that'd be a nightmare to implement since citations are generated using citeproc.js which is a project run separately from Zotero. Adding different microdata fields to citation components would entail mucking with the citeproc code which I don't think anyone is excited about. Also, not every component of a citation is on a different html element, so I wouldn't even know where to put the rdfa. HTML bibliographies already include COinS, which seems great for the purpose of exposing citation information.
Moreover, I'd question whether embedding microdata with citations actually makes sense. The point as I understand it is to add a universal metadata structure to webpages - if you add microdata to citations _on_ a webpage, if anything, that would seem to be misleading for search engines, no?

edit: so, more generally speaking, I'm puzzled what Zotero has to do with this (except that it should eventually be able to read it). This would seem to be something to integrate into WordPress and similar systems that you use to generate actual webpages, not into a reference manager.

westurner · April 12, 2014

> Frankly, that'd be a nightmare to implement since citations are generated using citeproc.js which is a project run separately from Zotero.

https://bitbucket.org/fbennett/citeproc-js/wiki/Home

http://citationstylist.org/docs/citeproc-js-csl.html

> Adding different microdata fields to citation components would entail mucking with the citeproc code which I don't think anyone is excited about.

Thank you for the feedback.

> Also, not every component of a citation is on a different html element, so I wouldn't even know where to put the rdfa.

In RDFa [1], there would be divs with spans, metas, and <a>s.

> HTML bibliographies already include COinS, which seems great for the purpose of exposing citation information.

I suppose RDFa could be added to the 'See Also' section of https://en.wikipedia.org/wiki/COinS

> Moreover, I'd question whether embedding microdata with citations actually make sense. The point as I understand it is to add a universal metadata structure to webpages - if you add microdata to citations _on_ a webpage, if anything, that would seem to be misleading for search engines, no?

AFAIU, RDFa and microdata metadata are distinct from the HTML page in which they're located. For example, a directory service may host a page which includes information about an Organization with one or more LocalBusinesses. [4]

There is also a Thing > CreativeWork > WebPage type.

As a data model for graphs of resources with URIs and URLs, there are lots of practical uses for RDF ( http://www.w3.org/wiki/ConverterToRdf )

RDF[a] supports links. http://catalogablog.blogspot.com/2010/02/rdf-coins-and-microformats.html

https://www.zotero.org/support/dev/exposing_metadata#using_an_open_standard_for_exposing_metadata

adamsmith · April 12, 2014

> In RDFa [1], there would be divs with spans, metas, and <a>s.

Exactly - and they don't exist in current citations, so you'd have to code all of this into citeproc.

> As a data model for graphs of resources with URIs and URLs, there are lots of practical uses for RDF ( http://www.w3.org/wiki/ConverterToRdf )

right, but RDF and RDFa aren't the same thing. Zotero already supports (at least) two RDF export formats.

But let me step back a bit. What's your larger vision here? I don't understand where you're trying to go with this. Zotero principally generates bibliographies/citations. Is your idea to generate a bibliography with each item containing RDFa? I'd like to see any documentation that suggests such a use of RDFa - the links you posted all suggest using RDFa/Microdata to add structured data to a given page.

westurner · April 12, 2014

> But let me step back a bit. What's your larger vision here? I don't understand where you're trying to go with this. Zotero principally generates bibliographies/citations. Is your idea to generate a bibliography with each item containing RDFa?

Objective: Produce an HTML page with bibliographic citation metadata that can be parsed and extracted back into RDF.

Yes.

Personally, I like Sphinx (reStructuredText) and bibtex.

PDFs print well, but 'most' of the time, they don't contain enough information to generate their own bibliographic citation (necessitating journal HTML parsers, which Zotero does so well).

> Exactly - and they don't exist in current citations, so you'd have to code all of this into citeproc.

https://bitbucket.org/fbennett/citeproc-js/src/tip/src/

> I'd like to see any documentation that suggest such a use of RDFa - the links you posted all suggest using RDFa/Microdata to add structured data to a given page.

https://en.wikipedia.org/wiki/RDFa :

> RDFa (or Resource Description Framework in Attributes[1]) is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The RDF data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.

There is a lot of support for COinS. Is there anything that can be done with COinS that cannot be done with RDFa?

adamsmith · April 12, 2014

> Objective: Produce an HTML page with bibliographic citation metadata that can be parsed and extracted back into RDF.

sorry, I still don't understand. The "bibliographic citation metadata" - is that for the content of that page - i.e. one item per page - or for every item cited on the page? For the former, I'd argue Zotero is simply the wrong tool. Zotero doesn't generate HTML pages, why should it generate the metadata describing them? Why not do that in WordPress or whatever else you like to use to generate HTML? So I guess I'm asking - why should this be something Zotero does?
Zotero also doesn't generate the other metadata formats it can read from a page, such as Google/Highwire or DC meta tags.

If it's the latter, I still don't see how that's even supposed to look.

What would be helpful is if you could provide a very specific, entirely non-abstract use case. I feel like we're talking past each other, so feel free to talk to me like I'm stupid, I won't take offense.

> There is a lot of support for COinS. Is there anything that can be done with COinS that cannot be done with RDFa?

Content-wise I don't know, probably not. Structurally, COinS has the major advantage of being contained in a single span tag with no displayed text, which makes it very easy to generate/implement: just put all the info into a span in the right format. It's trivial for Zotero to generate this along with or entirely separate from citations. For RDFa et al., content and metadata are mixed. That makes a lot of sense structurally, but I don't really see how Zotero would usefully generate it, since Zotero isn't used to generate content, just citations.
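To make the "single span" point concrete, here is a rough sketch of building a COinS span for a journal article. The function name and the choice of fields are assumptions for the example; this is not how Zotero itself generates COinS.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(atitle: str, jtitle: str, date: str, aulast: str) -> str:
    """Build an empty COinS <span> for a journal article (illustrative only)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", atitle),
        ("rft.jtitle", jtitle),
        ("rft.date", date),
        ("rft.aulast", aulast),
    ]
    # The whole record is one URL-encoded string in the title attribute;
    # the span has no displayed text of its own.
    return '<span class="Z3988" title="{}"></span>'.format(escape(urlencode(fields)))
```

Because the span is self-contained, it can be emitted next to a formatted citation without touching the citation markup at all.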

westurner · April 12, 2014

Problem: Wastefully un-structured HTML bibliographies

* Enter/collect structured data citations into Zotero [in: structured data]
* Generate bibliography with Zotero [out: unstructured textual data**]

** Bibliographies: RTF/HTML/TXT
** Exports: various RDF and non-RDF formats

Solution: Generate HTML + RDFa bibliography with Zotero (with whichever CSL style)

Scope: Zotero generates bibliographies in a number of output formats, with a number of citation styles

Value:

* Structured data
* Make 'round-trip' feasible (Citations -> Zotero -> Bibliography as RDFa -> Citations)
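
As a rough sketch of the requested output, the function below renders one item as an HTML + RDFa Lite bibliography entry. It assumes a simplified, CSL-JSON-like dict (the flat `year` key is an assumption; real CSL JSON uses `issued` with `date-parts`), hard-codes ScholarlyArticle as the type, and hard-codes the punctuation rather than applying a CSL style, so it only illustrates the markup, not the citeproc integration.

```python
from html import escape


def rdfa_entry(item: dict) -> str:
    """Render one simplified item as an HTML + RDFa Lite bibliography entry."""
    authors = ", ".join(
        '<span property="author" typeof="Person">'
        '<span property="name">{}</span></span>'.format(
            escape("{} {}".format(a["given"], a["family"]))
        )
        for a in item["author"]
    )
    # Punctuation is fixed here; a real implementation would follow a CSL style.
    return (
        '<div vocab="http://schema.org/" typeof="ScholarlyArticle">'
        "{authors}. "
        '<cite property="name">{title}</cite> '
        '(<span property="datePublished">{year}</span>). '
        '<a property="url" href="{url}">{url}</a>'
        "</div>"
    ).format(
        authors=authors,
        title=escape(item["title"]),
        year=escape(str(item["year"])),
        url=escape(item["URL"]),
    )
```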

...

A COinS parser that outputs RDF triples would also be great.
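
A minimal sketch of such a parser, using only the Python standard library, might look like the following. The class name is hypothetical, and the "triples" use the raw `rft.*` keys as predicates and a blank-node label per span as subject; a real converter would still need to map those keys to IRIs in some vocabulary.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl


class CoinsParser(HTMLParser):
    """Collect (subject, predicate, object) tuples from COinS spans."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "span" or "Z3988" not in (attrs.get("class") or "").split():
            return
        self._count += 1
        subject = "_:coins{}".format(self._count)  # one blank node per span
        # HTMLParser has already unescaped &amp; in the attribute value.
        for key, value in parse_qsl(attrs.get("title") or ""):
            if key.startswith("rft."):
                self.triples.append((subject, key, value))
```

Usage: create a `CoinsParser`, call `feed()` with the page's HTML, then read `triples`.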

westurner · April 12, 2014

There's an example of Schema.org RDFa for an MLA-style citation in [5]

[5] http://lists.w3.org/Archives/Public/public-vocabs/2014Apr/0104.html

westurner · April 12, 2014

I suppose the title of this request should be "Generate Schema.org HTML + RDFa bibliographies";

though an additional RDF export format with http://schema.org classes and properties could also be helpful.

adamsmith · April 12, 2014

ok that thread is helpful. So yeah - the way to go would be to code that into citeproc directly, following the general template of how HTML bibs are generated in citeproc (and working with CSL variables taken from CSL-JSON). That's going to be a _huge_ undertaking, so I'd wait until we have more of an emerging standard - the thread from this week seems to suggest that it's not at all clear how this should actually look - and I doubt Frank from the citeproc side - or anyone at Zotero - is going to touch this any time in the next couple of years. It's possible this may be easier to do with one of the other citeproc implementations (e.g. the -hs version integrated with Pandoc) but I don't know, you'd have to ask them.
Obviously, feel free to take a stab at this yourself, but the citeproc-js code is massive.

westurner · April 12, 2014

Thanks again for your help!

1. Map from CSL Types and attributes to Schema.org classes and properties

* https://en.wikipedia.org/wiki/CiteProc
* http://citationstyles.org/downloads/specification.html
* https://en.wikipedia.org/wiki/Separation_of_presentation_and_content
* https://github.com/citation-style-language/styles/blob/master/bibtex.csl
* https://github.com/brechtm/citeproc-py/blob/master/citeproc/source/bibtex/bibtex.py
* https://github.com/brechtm/citeproc-py/blob/master/citeproc/source/json.py

2. Output RDFa:

* https://bitbucket.org/fbennett/citeproc-js/src/tip/src/formats.js (html, text, rtf)
It appears that the output formatters are not schema-aware.

* https://github.com/citation-style-editor/csl-editor/wiki/User-guide-for-the-CSL-Editor
One could generate Schema.org HTML + RDFa copies of requisite CSL styles with a really gnarly XSL workalike.

* http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#generating-bibliographies

(Seems like a lot of work to punctuate triples out of nested JSON form.)

It would be relatively easy to create a JSON-LD context [6] for CSL JSON, but that wouldn't satisfy the output requirements of [CSL Style X] as HTML+RDFa structured data readable by Zotero.

[6] http://www.w3.org/TR/json-ld/#the-context
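
A sketch of what such a context might look like, written as a Python dict for convenience. The term-to-property pairings are illustrative assumptions, and nested values such as CSL's `issued` date-parts are not handled.

```python
import json

# Hypothetical context: the term-to-property pairings are illustrative assumptions.
CSL_JSONLD_CONTEXT = {
    "@context": {
        "schema": "http://schema.org/",
        "title": "schema:name",
        "author": "schema:author",
        "family": "schema:familyName",
        "given": "schema:givenName",
        "publisher": "schema:publisher",
        "URL": {"@id": "schema:url", "@type": "@id"},
    }
}


def with_context(csl_item: dict) -> str:
    """Attach the context to a CSL JSON item and serialize it as JSON-LD."""
    doc = dict(CSL_JSONLD_CONTEXT)
    doc.update(csl_item)
    return json.dumps(doc, indent=2)
```

This only changes how the JSON is interpreted as RDF; it does not produce a styled HTML bibliography.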

aurimas · April 12, 2014

> Make 'round-trip' feasible (Citations -> Zotero -> Bibliography as RDFa -> Citations)

If this is the problem you are trying to solve, then I think RDFa is the wrong approach. Formatted bibliographies often contain less metadata than necessary to generate citations in other formats (e.g. initials instead of full names, "et al" instead of the full list of authors, abbreviated journal titles, no DOI etc.), which would make re-importing the metadata next to useless.

IMO, the only way to accomplish this is to expand a standard like COinS (one that would embed complete metadata) to support a richer set of metadata.

fbennett · April 13, 2014

There has been a lot of discussion about integrating RDFa in citation and bibliography output over the past few years. Apart from the need for parsers (now presumably cleared), the sticking point is the problem noted by aurimas: the elements displayed in citations are incomplete, so supplementary information would need to be embedded in the document in any case. It's not clear that providing RDFa wrapping on the bits that do show through would be worth the candle.

That said, embedding metadata for items cited in a bibliography or in footnotes could be useful in some contexts. As one example, a few of us built a tool last year for U.S. legal texts that implements a similar concept, but based on reverse-parsing plain text citations (possible because U.S. legal citation conventions are more or less consistent, necessary because U.S. legal publishers do not expose structured metadata). The code is used in a plugin for use with Multilingual Zotero, and features in a node.js package for server-side applications.

On the output side, there is a hook in the citeproc-js processor (@bibliography/entry) that can be used to wrap a bibliography entry in arbitrary markup. That doesn't give you element-level granularity for linking, but it could be used (for example) to add a reveal of underlying metadata in an HTML page.

The first step would be to work out a sample page or PDF document that works as you would like. There are some potentially conflicting desiderata -- links to embedded metadata, cross-linking of citations and bibliography entries, external links to full-text source via DOIs or URLs, ORCID links -- and a sample document would force consideration of the design tradeoffs, before looking into how citeproc-js or another CSL processor could be adapted to help make it happen.

If such pages became common (i.e. documents containing both self-referencing and cited metadata), Zotero would need UI for handling extraction and filing of both classes of cite details. There wouldn't be resistance to that, I think, but I doubt it will happen before demand is stimulated by a volume of document data to feed on.

(Edit: A further hurdle to clear would be the mapping of the JSON input to the citeproc-js processor into schema.org [or whatever] structures. The CSL input format serves as an intermediate layer between well-defined formats designed for data exchange between machines, and printed formats designed for human consumption. The CSL input format itself is not designed with data exchange in mind, and you would need to do some work [probably a significant amount of work] on mapping conventions to get things working correctly.)

westurner · April 13, 2014


To step back a bit, there are multiple reasons for including
a bibliography of structured citations:

1. To give credit where credit is due
2. To allow for the verification of logically inductive premises
(to support scientific reproducibility)

The wider objective here is to share bibliographies of structured citations
as structured data.

Reproducible Science (logically inductive argument verification):

* discover a graph of supporting premises (resources)
    * (Zotero helps discover the metadata for one or more resources)
* for each resource
    * retrieve peer review comment threads
    * retrieve meta-analysis metadata in re: validity and reproducibility
    * retrieve supporting data [7]
        * validate stated transformations
    * validate logical conclusions
    * retrieve relevant annotations
* generalize to red/green per resource


The citation lookup overhead seems wastefully inefficient.
How much time is spent, in academia, manually parsing
and disambiguating citations and the resources which they describe?

URIs and URLs are the solution.

The irony here, with respect to citation graph discoverability
 and the 6,992 citation styles, is that despite the intricate punctuational
 variation from journal to journal, none of the textual citation styles
 support looking up the supporting premises of the supporting premises
 without lots of complex text parsing.

RDF (and RDFa) presents a solution to this,
 with regard to the wider problem of verifying cited resources as premises.

* A resource is a Thing.
    * For which there can be multiple representations (each with MIME type)
        * HTML
        * LaTeX
        * PDF
        * RDF
            * RDF/XML
            * TriX
            * Turtle/N3
            * RDFa (HTML + RDF)
* URIs are designed to uniquely identify resources.
    * DOI URNs are URIs.
        * Most citations do have a DOI.
    * URLs are URIs.
* URLs are designed to be dereferenceable [8]
    * Graphs of URLs form the 'Giant Global Graph'
* RDF is designed to describe resource graphs of URIs and URLs
  with infinite fidelity
 and well-defined parsing semantics.
    * Of what use is a citation style without a field for
      a URI (e.g. a DOI URN) and/or a URL?


In the western world, we tend to record names as first, middle, and last.

* Bibliographic name granularity (name as FML) is preserved with high fidelity
    * With Zotero RDF
    * With COinS HTML
        * We must parse for URIs and URLs
    * With CSL JSON
* Bibliographic name granularity is not preserved
  (must parse name fields -> FML)
    * With Schema.org RDF (name)
    * With DCTERMs RDF (name)
    * With almost every CSL (structured data -> text)
        * We must parse for URIs and URLs


So, one could run a COinS parser and a (Zotero) RDFa parser
 on every resource in a graph of supporting premises.

To promote efficiency:

* A recommendation like
  "complete bibliographic data SHOULD be included in a resource"
* Identify loss of fidelity
    * Unstructured data -> Structured Data (Zotero RDF) -> CSL
    * RDF, RDFa -> RDF (Zotero RDF) -> RDFa
    * RDF (Zotero RDF) -> COinS HTML
* Work with COinS to produce an RDF schema
* Work with Schema.org (major search engines)
    * Understand that western FML name patterns are one way to express names
         * https://en.wikipedia.org/wiki/Surname
         * https://en.wikipedia.org/wiki/Unicode_collation_algorithm

westurner · April 13, 2014

[7] http://www.plosone.org/static/policies#sharing
[8] https://en.wikipedia.org/wiki/Dereferenceable_Uniform_Resource_Identifier

westurner · April 13, 2014

> the sticking point is the problem noted by aurimas: the elements displayed in citations are incomplete, so supplementary information would need to be embedded in the document in any case. It's not clear that providing RDFa wrapping on the bits that do show through would be worth the candle.

> (Edit: A further hurdle to clear would be the mapping of the JSON input to the citeproc-js processor into schema.org [or whatever] structures. The CSL input format serves as an intermediate layer between well-defined formats designed for data exchange between machines, and printed formats designed for human consumption. The CSL input format itself is not designed with data exchange in mind, and you would need to do some work [probably a significant amount of work] on mapping conventions to get things working correctly.)

There do seem to be architectural limitations to how CSL JSON (at least in current citeproc-js) is formatted.

Expressing (more complete) bibliographic metadata as structured RDFa data which 'validates' as a particular CSL style could be accomplished through the use of @content: http://www.w3.org/TR/rdfa-core/#object-resolution
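
A small sketch of the @content idea: the displayed text is whatever the citation style produces (for example, initials), while the attribute carries the fuller value that an RDFa parser would extract. The function name is hypothetical, and the snippet assumes an enclosing element has already set `vocab="http://schema.org/"`.

```python
from html import escape


def name_with_content(full_name: str, displayed: str) -> str:
    """Show the style's abbreviated name but carry the full name in @content."""
    # Assumes an ancestor element declares vocab="http://schema.org/".
    return (
        '<span property="author" typeof="Person">'
        '<span property="name" content="{full}">{shown}</span>'
        "</span>"
    ).format(full=escape(full_name), shown=escape(displayed))
```

For instance, `name_with_content("Jane Q. Doe", "Doe, J. Q.")` displays "Doe, J. Q." while the extracted name would be "Jane Q. Doe".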

westurner · April 13, 2014

...

* https://en.wikipedia.org/wiki/Linked_data
* http://www.w3.org/TR/ld-glossary/
* http://5stardata.info/

fbennett · April 13, 2014

@westurner,

You're preaching to the converted. People involved in Zotero development see the benefits of linked data, and of passing structured metadata from document to citing document to newly authored document.

It's only a question of implementation, and that's something for document designers to come up with, in the first instance. I think it's fair to say that the Zotero crew are just (reasonably) waiting to see what will emerge at that end.

If you have a concrete example of a published document or sample to show, I'm sure people will be happy to take a look and comment on the possibilities.

aurimas · April 13, 2014

I'm not sure that the above is helping very much. Most of us here are fairly well aware of the benefits of embedding structured metadata, RDF(a), linking it to other resources, etc. No one is questioning the need (or at least usefulness) of embedding metadata into references. The question is how to best achieve this. After reading all of the above, I'm not really sure any more what you are proposing. I would suggest that we consider a concrete example (post it as a public gist) and see what we can come up with using various formats. Otherwise, I think this discussion is not going to go anywhere (at least not at a considerable pace).

Edit: Frank, lol...

aurimas · April 13, 2014

I also think that embedding actual metadata into citations is a thing of the past. Looking forward, I think it makes the most sense to simply link a reference to its metadata, which would be hosted in a central location (and could be easily updated).

In a sense we already have this with DOIs. Unfortunately, the DOI RAs do not always provide metadata, the metadata provided is not always complete, and/or is not presented in a consistent format. Additionally, while in an ideal world any one resource should be described by a single set of metadata (I'm thinking a central curated database. Maybe crowd-sourced?), it seems to me that there will always be a need for customizability.

To this end, we have zotero.org which could (and already does) also serve as a central repository of metadata. The nice part about this is that zotero.org can serve metadata in a number of formats.

So what I imagine is that one could simply add

<link rel="meta" type="application/rdf+xml" href="https://api.zotero.org/groups/183462/items/3DXJRRCD?format=rdf_zotero"/>

to a span encompassing a particular reference and be done with this. This probably isn't that useful to search engines, but, in terms of metadata, I think this is as good as it can get.
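
A sketch of how that might be generated for a group item, reusing the URL pattern from the example above. The function name and the `class="reference"` attribute are assumptions made for illustration.

```python
from html import escape


def reference_with_meta_link(citation_html: str, group_id: int, item_key: str) -> str:
    """Wrap an already formatted reference in a span that links to its metadata."""
    href = "https://api.zotero.org/groups/{}/items/{}?format=rdf_zotero".format(
        group_id, item_key
    )
    # citation_html is assumed to be trusted, already formatted HTML.
    return (
        '<span class="reference">'
        '<link rel="meta" type="application/rdf+xml" href="{}"/>'
        "{}</span>"
    ).format(escape(href), citation_html)
```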

dunning · April 21, 2014

This is all very interesting. The core issue here is simply that we're currently going to the trouble of maintaining nice, structured information for our citations, but it simply ends up in lightly formatted text, because there really aren't any good writing tools that will export citations using COinS and such where appropriate as well as formats suitable for print. In this regard, I don't think that the state of the art has really progressed since this 2007 Zotero blog post.

I wonder whether pandoc's version of citeproc could be extended to automatically spit out COinS or the meta links suggested by aurimas when exporting to HTML; I see that someone suggested this a few years back (where, incidentally, it was also thought that RDFa would in theory be the best mechanism).

The proposal by aurimas for links does seem more sensible in many ways, if it could be implemented; this is a bit like a suggestion by Martin Fenner.

fbennett · April 21, 2014

Just in case it was missed upthread:

> On the output side, there is a hook in the citeproc-js processor (@bibliography/entry) that can be used to wrap a bibliography entry in arbitrary markup.

So the infrastructure is in place, as far as the Zotero CSL processor goes. It just needs for data to speak to it.

dylan_k · March 4, 2015

I had an interest in doing something similar so I put together a simple example. Using JSON that I exported from Zotero, I converted that into YAML and then used Jekyll to generate the HTML.

I've shared what I built, along with some documentation here: https://github.com/dylan-k/biblio