Import of journal articles with a lot of HTML formatting tags in the abstract

Lately I've found that some (major and minor) journal publishers are including HTML tags in the metadata they provide. As in the example below, this occurs with the Embedded Metadata translator, but I've also found a few Springer/Nature journals that sometimes have these formatting tags.

See: DOI 10.1016/j.jsams.2020.05.008

I find that this is undesirable but others may disagree.

Is there something simple that could be done with translators to strip the tags? I admit that this is more of an annoyance to me and that others may actually like them. Thus, this may be more of a paper cut than a real problem. I get around this by going to the article webpage and copying/pasting the abstract. Thanks.

  • It's definitely undesirable, but this is not trivial on a technical level because the html tags are just escaped html entities in the metadata, i.e. the above starts like this:
    content="&lt;h2&gt;Abstract&lt;/h2&gt;&lt;h3&gt;Objective&lt;/h3&gt;&lt;p&gt;

    So we'd have to import it and then use some sort of heuristic to see if there are things that look like HTML tags and, if so, remove them. That's not impossible, but it's also not simple to do without risking the removal of other things.

    If @dstillman has thoughts on this on a conceptual or technical level that'd be helpful.
  • Thanks for the quick reply. Yeah, I was thinking that the escaped HTML entities would be a problem.

    Another issue is that several highwire-translator journals are providing "graphical abstracts" before the text, and that leads to similar trouble:

    10.1016/j.jaccas.2019.11.070
  • I've added a ticket to track this, but no promises that anything will happen any time soon: https://github.com/zotero/translators/issues/2250
  • Brill, another publisher with escaped html entities in header metadata:

    10.1163/15685306-12341596
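
For anyone who wants to clean up abstracts by hand in the meantime, the heuristic discussed above (look for things that resemble HTML tags and remove only those) could be sketched roughly as follows. This is Python rather than translator JavaScript, and it is not how Zotero or any translator handles this. The names `strip_known_tags`, `BLOCK_TAGS` and `INLINE_TAGS` are made up for illustration, and the allow-list of tag names is an assumption about what publishers tend to embed.

```python
import html
import re

# Hypothetical allow-list: only these tag names are treated as markup.
BLOCK_TAGS = r"h[1-6]|p|div|br|li|ul|ol|section"
INLINE_TAGS = r"b|i|em|strong|sub|sup|span|u"

# A tag is "<", optional "/", an allow-listed name, optional attributes, ">".
BLOCK_RE = re.compile(rf"</?(?:{BLOCK_TAGS})(?:\s[^<>]*)?/?>", re.IGNORECASE)
INLINE_RE = re.compile(rf"</?(?:{INLINE_TAGS})(?:\s[^<>]*)?/?>", re.IGNORECASE)


def strip_known_tags(abstract: str) -> str:
    """Remove allow-listed HTML tags and leave all other text alone."""
    text = html.unescape(abstract)   # turn &lt;h2&gt; into <h2>
    text = BLOCK_RE.sub(" ", text)   # block-level tags become a separator
    text = INLINE_RE.sub("", text)   # inline tags are dropped in place
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


if __name__ == "__main__":
    raw = "&lt;h2&gt;Abstract&lt;/h2&gt;&lt;h3&gt;Objective&lt;/h3&gt;&lt;p&gt;Results at p &lt; 0.05.&lt;/p&gt;"
    print(strip_known_tags(raw))
```

Restricting the match to known tag names is what keeps ordinary text such as "p < 0.05 and n > 30" intact. It is still only a heuristic: contrived text like "a<b and c>d" would be misread as a tag, which is exactly the risk of removing other things mentioned above.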