Conversion of < jats:sub > to plain text not good

The JATS/HTML tag < sub > appears in abstracts provided by Crossref. Two examples are:

1) 10.1371/journal.pgen.1009241
2) 10.1101/2024.09.24.614506

In these cases it's used for a well-known math concept/name from population genetics.

Currently the Zotero codebase is probably doing something like replacing certain JATS tags with newlines. In the case of these abstracts it seems to make the text more difficult to understand, especially the 2nd case.

Is this intended behavior? I suspect that replacing the < sub > (and < sup >) tags with spaces is a more graceful degradation of rendering (there is no ideal solution).

I was going to include the actual underlying JATS data from Crossref and bioRxiv, but this forum software seems to dumb-delete anything that looks like an HTML tag. Can I post this kind of stuff in an editor that is code-friendly? ... like, say, GitHub?

In the 1st case, Zotero has "F", "ST", and the following English text all on separate lines with confusing indentation.

In the 2nd case the plain text rendering is much worse and more confusing. The abstract appears to be getting butchered at multiple stages. For some reason the Crossref JSON has lost some important spacing compared to the original JATS XML from bioRxiv.

The plain text rendering in Zotero has many lines of very confusing butchered math.

If it's helpful, I can copy some of the Zotero JavaScript and play around with some tweaks to see whether space-replacement looks less terrible than newline replacement. I imagine that some tags make sense to be replaced with newlines, others with spaces, and others just removed.
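
A minimal sketch of that per-tag idea, not Zotero's actual code. The `TAG_POLICY` table and the `tagsToPlainText` name are hypothetical, and the choice of which tag gets a newline, a space, or nothing is only an assumption for illustration:

```javascript
// Hypothetical policy table (an assumption, not taken from Zotero):
// what each tag is replaced with when converting to plain text.
const TAG_POLICY = {
  p: "\n",     // block element: line break
  title: "\n", // block element: line break
  sub: " ",    // inline, but a space keeps "F" and "ST" apart
  sup: " ",
  italic: "",  // purely presentational: just remove
  bold: "",
};

function tagsToPlainText(markup) {
  // Match opening and closing tags, with an optional "jats:" prefix.
  const withoutTags = markup.replace(
    /<\/?(?:jats:)?([A-Za-z][\w-]*)[^>]*>/g,
    (match, name) => {
      const replacement = TAG_POLICY[name.toLowerCase()];
      // Tags missing from the table are simply removed.
      return replacement === undefined ? "" : replacement;
    }
  );
  // Collapse runs of spaces/tabs created by adjacent tags, then trim.
  return withoutTags.replace(/[ \t]+/g, " ").trim();
}
```

With this table, a made-up fragment such as `the <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> statistic` comes out on one line as `the F ST statistic`.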

I can maybe help document/specify an expected conversion, for general use, of how mildly formatted JATS is ideally converted to plain text. I'd also like to establish what JATS is not expected to be handled well in plain text at all ... e.g. images, tables, MathML, ... others probably.
  • You can post the code on GitHub and link to it here, yes (you can also put short snippets into HTML <code> tags, but please don't use that for more than ~25 lines).

    I think you're misunderstanding some things, both about Zotero and about how publishing metadata works.

    Most importantly:
    1. Zotero never interacts with JATS. It could be made to read some metadata from JATS, as from any other XML format, but for the publishers & sites that do make JATS directly available (PLOS, PubMed), there's virtually always a more lightweight option with better or equally good citation metadata available.

    2. CrossRef doesn't use JATS. CrossRef has its own deposit format (UNIXREF Deposit is how I think they refer to it). It may have some overlap with JATS, but it's very much not the same -- see the linked page. CrossRef then also does some internal processing on that, which determines how it's served via their APIs, which in different configurations can output two different XML formats as well as JSON.

    3. Zotero actually uses two different CrossRef APIs. In most cases (e.g. when using add by identifier) it gets Unixref XML via content negotiation, the equivalent of
    curl -LH "Accept: application/vnd.crossref.unixref+xml" https://doi.org/10.1371/journal.pgen.1009241
    In some other cases it gets JSON via the REST API (you can find both options in the Zotero translators).
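
    To make the two routes concrete, here is a rough sketch that only builds the two requests, without sending them. The function names are hypothetical and this is not how the Zotero translators are written; the Accept header is the one from the curl line above, and `https://api.crossref.org/works/` is Crossref's public REST endpoint:

```javascript
// Encode each DOI path segment but keep the "/" separators.
function encodeDoiPath(doi) {
  return doi.split("/").map(encodeURIComponent).join("/");
}

// Route 1: content negotiation on doi.org, asking for Unixref XML
// (same request as the curl command above).
function unixrefRequest(doi) {
  return {
    url: "https://doi.org/" + encodeDoiPath(doi),
    headers: { Accept: "application/vnd.crossref.unixref+xml" },
  };
}

// Route 2: the Crossref REST API, which answers with JSON.
function restJsonRequest(doi) {
  return {
    url: "https://api.crossref.org/works/" + encodeDoiPath(doi),
    headers: { Accept: "application/json" },
  };
}

// Possible use where fetch() is available (it follows redirects by
// default, like curl -L):
//   const req = unixrefRequest("10.1371/journal.pgen.1009241");
//   const xml = await (await fetch(req.url, { headers: req.headers })).text();
```

    Fetching the same DOI both ways is a quick way to compare how the abstract whitespace differs between the XML and the JSON output.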

    Specifically, the bad spacing and newlines come from the XML as deposited with CrossRef. Here's what the API returns:

    https://s3.amazonaws.com/zotero.org/images/forums/u2433/q6rae634tcd20bvt3jnq.png

    You'll note that it's not Zotero that's adding the newlines around the HTML tag. This is just bad XML deposited by PLOS. I'm afraid you'll find a fair amount of issues like this as you work with journal metadata.
  • Ah, now I see where the newlines are coming from. I had only seen the Crossref JSON data and not the UNIXREF. Thx for the curl line.

    The href attribute got stripped out of the anchor tag around your text "CrossRef doesn't use JATS". Can you add the raw URL?

    So it sounds like Zotero never replaces HTML-like tags with newlines. Am I correct that Zotero just replaces these HTML-like tags with empty strings? Or are some tags replaced with a space and some tags replaced with an empty string?

    Apart from making a point about entire JATS files, I'm not on the same page as you on "CrossRef doesn't use JATS" and "it's very much not the same".

    I see your point about Zotero not reading entire JATS files. Perhaps we can distinguish whether we mean entire JATS files or entire JATS < article > elements vs JATS tags. < italic > and < abstract > are not from the set of tags found in HTML. They are in the set of tags found in JATS. I'm fine if you want to say "HTML-like" for tags like < italic >. I find it a bit odd to not call them JATS because that is the "Tag Suite"/"Tag Set" from which they come. They are definitely not "HTML tags".

    The documented metadata XML schema of Crossref uses the JATS schema for the abstracts. See the abstract element documentation in:

    https://data.crossref.org/reports/help/schema_doc/5.4.0/index.html

    In addition, the XSL transform file that Crossref suggests publishers use (or at least did at some point in time) just copies whatever JATS elements are in the abstract over to Crossref's metadata deposit XML:

    https://gitlab.com/crossref/schema/-/blob/master/JATS/JATS_to_Crossref.v2.xsl

    I suspect the "internal processing" that Crossref does is pretty much just copying entire JATS elements for the abstract and not really doing any processing.
  • Link fixed. Crossref recommends against using the XSL you link to. There's going to be some overlap, but the crucial part is that CrossRef never sees the JATS file, and you can't just assume how UNIXREF is going to look based on JATS because you don't know how journals generate it.
  • I wonder if part of the problem here is that Crossref has engineered their UNIXREF XML to treat all whitespace around UNIXREF XML as meaningless, while publishers are stuffing entire abstracts into it with < p > elements and inline HTML-like elements, where whitespace around the inline elements has HTML-like meaning. Zotero then just treats all that whitespace like the preformatted plain text of an HTML < pre > tag (or JATS < preformat > tag).

    I think I'm going to hop over to the Crossref forum and ask some questions. I suspect that UNIXREF XML as currently engineered might be fundamentally not designed for HTML-like XML with a mix of block and inline elements (like < p > and < sub > and < italic >).
  • I've got at least two questions for Crossref. First one is https://community.crossref.org/t/a-way-per-doi-to-get-original-metadata-deposit-xml/14529

    Regardless of what Crossref and the publishers are doing incorrectly, Zotero is also incorrectly preserving newlines between inline HTML/JATS elements inside the paragraph < p > elements. In HTML this is inline (phrasing) content, which flows to the right without hard line breaks. Removing tags while preserving newlines is an incorrect rendering to plain text for inline/phrasing content.
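
    A small sketch of what flowing phrasing content could look like once the tags are gone, under the assumption that the text comes from inside a single paragraph element. `flattenInlineText` is a hypothetical name, and the sample string is made up to mimic the "F" / "ST" layout described earlier:

```javascript
// Treat any run of spaces, tabs, and line breaks as a single space,
// roughly the way HTML lays out inline content outside <pre>.
function flattenInlineText(paragraphText) {
  return paragraphText.replace(/[ \t\r\n\f]+/g, " ").trim();
}

// Made-up paragraph text after tag removal, with the newlines and
// indentation that surrounded the inline elements still in place.
const sample = "differentiation measured by\n          F\n          ST\n          between populations";

console.log(flattenInlineText(sample));
```

    This only makes sense per paragraph: line breaks between separate paragraph elements would still need to be kept.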

    I raise these issues primarily because I'm trying to figure out what XML in abstracts will actually get processed properly if submitted to Crossref. I'm implementing code soon to generate Crossref deposit XML from Baseprint XML.

    As best I can tell, literally nobody is actually using these tags in abstracts to display richly formatted HTML for human reading. It mostly seems like the motivation for publishers in sending abstracts is to enable text-mining and search. And the real reason the tags are included is because that's the easier path to take: just copy whatever is in their JATS files and not worry about what tags do or do not get handled properly downstream.

    Cycling back to the initial issue of this discussion, what are your thoughts on handling HTML-like tags in abstracts? Is this out-of-scope? In other words, is the abstract box in Zotero really only intended to show XML with whitespace/newlines preserved but with all XML tags stripped out and replaced with empty strings?

    This does imply that sending HTML-like XML into Zotero may look unexpectedly bad given the way HTML is supposed to flow. But it's not clear anybody needs to or should be sending HTML-like tags in abstracts to Zotero.
  • "Cycling back to the initial issue of this discussion, what are your thoughts on handling HTML-like tags in abstracts?"
    There are actually two questions:
    1. How should Zotero handle whitespace in CrossRef abstracts?
    CrossRef didn't use to have very many abstracts, so this probably didn't get much attention for a while, but looking at this now, a lot of them are looking quite bad. Zotero should handle this better (though this may not be trivial given that the type of data that journals put into CrossRef abstracts is also likely bad and, worse, highly inconsistent).

    2. Should Zotero try to preserve markup in abstracts?

    This is less clear. Zotero titles unambiguously take HTML mark-up. It would obviously be possible to convert CrossRef markup into HTML for abstracts.
    For abstracts, however, the Zotero interface doesn't render HTML markup, so they will look quite messy with HTML tags. On the other hand, citation styles that do include abstracts will render HTML tags in them, so the right behavior just isn't clear here. If Zotero devs have an opinion on this (and, e.g., plans to render HTML in abstract fields) that'd be helpful to know @dstillman


    The Unixref import code already has some code that cleans up tags in titles, so it likely wouldn't be hard to extend that to abstracts:
    https://github.com/zotero/translators/blob/master/Crossref Unixref XML.js#L437
  • One quick point on the possibility of Zotero rendering markup into HTML for abstracts: I'm still waiting to confirm a few questions with Crossref, but it seems likely that Crossref doesn't really offer markup in UNIXREF ... as in HTML/XML with mixed content ... as in XML where there is a mix of child elements and text and whitespace that matters to proper rendering. So as things currently stand, I think it might be impossible for CrossRef to render HTML in abstracts via Crossref UNIXREF because their UNIXREF API loses information in the HTML (namely the original whitespace in mixed content).
  • Looking at the Crossref Unixref XML.js code, it looks like y'all largely avoid this issue because the translator replaces newlines with an empty string and very few articles are showing up with newlines in their titles.

    As a side note, replacing newline with an empty string is a bug in that the HTML "My hot\ndog!" should render as three words "My hot dog!", not two "My hotdog!"

    Also HTML/XML/SGML numerical entities are not getting handled (e.g. &#8364; and &#x20AC; for the Euro sign, etc...).
  • On 2nd thought, the HTML/XML/SGML numerical entities are probably a moot point. The XML data has already been parsed into an XML tree. I'm guessing the reason some of the code is converting the named entities is for the rare cases where publishers have "over-double-escaped" say an ampersand and so you're double decoding them on the assumption that no title intentionally is going to want to show named XML entities.
  • "I'm guessing the reason some of the code is converting the named entities is for the rare cases where publishers have "over-double-escaped" say an ampersand and so you're double decoding them on the assumption that no title intentionally is going to want to show named XML entities."
    Yup, this exactly -- not an assumption, unfortunately, but rather bitter experience.
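
    For illustration only, a sketch of that second decoding pass. It is not the translator's actual code: `decodeNamedEntities` is a hypothetical name, the entity list is an assumption limited to the five predefined XML entities, and the title is made up:

```javascript
// The five entities predefined by XML; which ones a real translator
// handles is an assumption here.
const NAMED_ENTITIES = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&apos;": "'",
};

// One decoding pass over a string. Each match is replaced once and the
// result is not rescanned, so every call undoes exactly one level of
// escaping.
function decodeNamedEntities(text) {
  return text.replace(/&(?:amp|lt|gt|quot|apos);/g, (entity) => NAMED_ENTITIES[entity]);
}

// A publisher deposits "Salt &amp;amp; Pepper"; the XML parser has
// already decoded it once, leaving the string below as the title.
const parsedTitle = "Salt &amp; Pepper";
console.log(decodeNamedEntities(parsedTitle)); // second decoding pass
```

    The trade-off is the one mentioned above: a title that really is meant to display a named entity gets decoded too.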
  • Not to nerd-snipe (xkcd.com/356) ourselves here, but the double decoding didn't seem to do the trick for 10.1542/peds.2023-062391. I followed up with Crossref on this and it does appear to be a case of the publisher over-escaping. I don't see my recently updated Zotero install undoing the publisher's mistake. I think it's fine though. The publisher clearly screwed up. It's an issue between the publisher and the author. Besides, what's a nerd to do if they actually WANT an XML named entity to appear in the title! :-)

    https://community.crossref.org/t/jats-elements-in-abstract-is-this-the-expected-behaviour/6069/6
  • As best I can tell both the Crossref UNIXREF XML API and the REST JSON API are incorrectly modifying the XML from publishers and introducing semantically incorrect changes to the text.

    For the UNIXREF XML API problem I have posted a fun beer ale yeast repro case:

    https://community.crossref.org/t/unixref-xml-api-inserting-incorrect-whitespace-between-html-like-elements/14539

    and for tracking the REST JSON API problem I have posted this topic:

    https://community.crossref.org/t/meaningful-whitespace-removed-from-json-rest-api/14533

    Crossref seems to have really covered their bases by messing with the XML in both ways possible: in the XML API they inject invalid whitespace and in the JSON API they invalidly remove whitespace! :-)