Import of journal articles with a lot of HTML formatting tags in the abstract
I've found that some journal publishers (major and minor) have recently started including HTML tags in the metadata they provide. In the example below this occurs with the Embedded Metadata translator, but I've found a few Springer/Nature journals that sometimes have these formatting tags as well.
See: DOI 10.1016/j.jsams.2020.05.008
I find that this is undesirable but others may disagree.
Is there something simple that could be done in the translators to strip the tags? I admit that this is more of an annoyance to me and that others may actually like them, so this may be more of a paper cut than a real problem. I get around it by going to the article webpage and copying and pasting the abstract. Thanks
content="<h2>Abstract</h2><h3>Objective</h3><p>
So we'd have to import it and then use some sort of heuristic to see whether there are things that look like HTML tags and, if so, remove them. That's not impossible, but it's also not simple without risking removing other things.
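To make the trade-off concrete, here is a rough sketch of what such a heuristic could look like. It is written in Python purely for illustration and is not how any Zotero translator actually works. The function name `strip_known_tags` and the two tag lists are assumptions made up for this example. The idea is to remove only things that look like tags and whose name is on a short allow-list, so that text such as "p < 0.05" is left alone.

```python
import html
import re

# Hypothetical allow-lists (assumptions for this sketch, not taken from Zotero).
# Block-level tags are replaced by a space so adjacent words do not run together.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li"}
# Inline tags are simply dropped.
INLINE_TAGS = {"i", "b", "em", "strong", "sub", "sup", "span", "a"}

# Matches "<name ...>" or "</name>", where name starts with a letter.
# "<" followed by a space or a digit (as in "p < 0.05") never matches.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def strip_known_tags(text: str) -> str:
    """Remove allow-listed HTML tags from text, leaving everything else untouched."""

    def replace(match: re.Match) -> str:
        name = match.group(1).lower()
        if name in BLOCK_TAGS:
            return " "
        if name in INLINE_TAGS:
            return ""
        # Unknown name: assume it is not markup and keep it as written.
        return match.group(0)

    cleaned = TAG_RE.sub(replace, text)
    # Decode entities only after stripping, so an escaped "&lt;p&gt;" stays literal text.
    cleaned = html.unescape(cleaned)
    # Collapse the whitespace left behind by removed block tags.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_known_tags("<h2>Abstract</h2><h3>Objective</h3><p>To test.</p>"))
    print(strip_known_tags("Significant at p < 0.05 and n > 10"))
```

Even with an allow-list this can still remove real text: something like "a <b and c> d" looks like a `<b>` tag with attributes and would be stripped, which is exactly the risk described above.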
If @dstillman has thoughts on this on a conceptual or technical level that'd be helpful.
Another issue is that several highwire-translator journals are providing "graphical abstracts" before the text, which leads to a similar problem:
10.1016/j.jaccas.2019.11.070
10.1163/15685306-12341596