Normalizing HTML markup in web translators

Rintze · May 22, 2012

I have several papers from SpringerLink that have titles with a lot of HTML markup, in various forms, e.g.:

C& lt;sub& gt;1& lt;/sub& gt; compounds ... (without spaces)

and

... <span style="font-variant:small-caps"><small>l</small> ...

It would be nice if titles (and short titles) were cleaned up to use the CSL markup tags (http://citationstyles.org/downloads/upgrade-notes.html#rich-text-markup-within-fields).
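To make the request concrete, here is a rough sketch of the kind of cleanup I have in mind. It is written in Python only for illustration (the real translators are JavaScript), and it is not how any existing translator works. The names `normalize_title`, `_TitleNormalizer`, `ALLOWED` and `SMALL_CAPS_OPEN` are made up for this example, and the "-alanine" sample title in the usage note is hypothetical. It assumes the target tags are `<i>`, `<b>`, `<sub>`, `<sup>` and the small-caps span.

```python
from html import unescape
from html.parser import HTMLParser

# Tags passed through unchanged (assumed target set for this sketch).
ALLOWED = {"i", "b", "sub", "sup"}
SMALL_CAPS_OPEN = '<span style="font-variant:small-caps;">'


class _TitleNormalizer(HTMLParser):
    """Keeps allowed tags, normalizes small-caps spans, drops other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.span_stack = []  # True where the matching <span> was kept

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")
        elif tag == "span":
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            keep = "font-variant:small-caps" in style
            self.span_stack.append(keep)
            if keep:
                self.out.append(SMALL_CAPS_OPEN)
        # Any other tag (e.g. <small>) is dropped; its text is still kept.

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")
        elif tag == "span" and self.span_stack:
            if self.span_stack.pop():
                self.out.append("</span>")

    def handle_data(self, data):
        self.out.append(data)


def normalize_title(raw):
    """Unescape entity-encoded markup, then keep only the allowed tags."""
    parser = _TitleNormalizer()
    parser.feed(unescape(raw))
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

With this, the entity-escaped form of the first title comes out as `C<sub>1</sub> compounds`, and a title such as `<span style="font-variant:small-caps"><small>l</small></span>-alanine` keeps the small-caps span but loses the `<small>` wrapper. Unclosed tags in the input are not repaired.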

Also, the SpringerLink translator extracts an incorrect short title, because it treats the colon after "font-variant" in the style attribute as the separator between title and subtitle.
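One possible way around the colon problem, again only as a Python sketch and not the translator's actual code: mask the tags before searching for the colon, so that only colons in the visible text count. The function name `short_title` and the sample titles in the usage note are hypothetical, and I am assuming the short title is simply everything before the first colon.

```python
import re

# Matches any tag, including its attributes.
TAG_RE = re.compile(r"<[^>]*>")


def short_title(title):
    """Return the text before the first colon outside a tag, or None."""
    # Replace each tag with spaces of the same length so that
    # character positions still line up with the original string.
    masked = TAG_RE.sub(lambda m: " " * len(m.group()), title)
    pos = masked.find(":")
    if pos == -1:
        return None
    return title[:pos].strip()
```

For example, `short_title('<span style="font-variant:small-caps">l</span>-Alanine uptake: a study')` cuts at the colon after "uptake" and not at the one inside the style attribute. If the colon falls inside a formatted span, the result will contain an unclosed tag, so this would be best run together with some tag cleanup.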

DWL-SDCA · May 22, 2012

Springer is not the only publisher with this problem.

There are many journal records on PubMed that are also affected by misplaced HTML tags. In addition to Springer journals, there are worse offenders -- particularly BMJ Group and LWW. Usually, the article titles survive intact, but the abstracts are frequently truncated and the author affiliation can end up being one word (usually "From"). These HTML tags really shouldn't be included in the metadata the publishers send to bibliographic databases. The tags are in the files that they send to me. I have had email exchanges with folks at each of these publishers and they are aware of the problem. My contacts say that the metadata is being taken from the data on their fileserver -- the same data that drives the text on their websites. Apparently, they are doing very little formatting using CSS and format with inline tags instead. Each of the publishers I've contacted says that they will eventually fix this. No hint at how long we will need to wait.