Additional spaces in long item titles when importing citations

daniel_esser · March 21, 2013

Recently, Zotero started adding extra spaces to the item's title when saving citations. On Web of Knowledge it adds 2 spaces to an existing space roughly every 60 characters (at about 60, 120, 180 ... characters, not counting spaces). Example:
http://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=38&SID=T1Cl9La9JiBiDO1kkFL&page=2&doc=15
On other sites a similar error may occur, but with different amounts of spaces and at different positions. These sites are very rare, though. Example:
http://han.sub.uni-goettingen.de/han/62323_1/onlinelibrary.wiley.com/doi/10.1111/1574-6941.12040/abstract
(Frerichs et al. (2013), FEMS Microbiology Ecology 84-1, pp. 60-74)

noksagt · March 21, 2013

In the case of Web of Knowledge, there are extra spaces in their export file. It might be the case we can work around their bad export; I don't know. I don't have access to the second example, but you could check the export files for the same kind of problem.

daniel_esser · March 25, 2013

Thanks noksagt for the quick answer. I checked several WoK export files and the additional spaces could all be found there, too. Guess I will avoid WoK for this purpose and get the citations from the journal pages directly. For the second example, the export file of FEMS Microbiology Ecology is OK, but the 2 in 'CO2' is put between 15 spaces on either side when saving it to Zotero in the usual way. Funnily enough, the WoK export file for this paper is, despite its length, alright.

adamsmith · March 25, 2013

For the Wiley article (FEMS Microbiol): when you hover your mouse over the URL bar icon, what does it say? I can't replicate the problem there.
I'll take a look at WoK - we should be able to clean that up on import, thanks for letting us know.
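
As a rough illustration of the kind of import-time cleanup mentioned here, the following Python sketch collapses runs of whitespace in a title. It is not the Zotero translator code; the function name `collapse_spaces` and the sample title are made up for this example.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical title with the kind of padding seen in the export files.
padded = "Microbial community   changes at a terrestrial   vent"
print(collapse_spaces(padded))
```

Note that this also folds tabs and line breaks into single spaces, which is usually what you want for a title field but may be too aggressive for multi-paragraph abstracts.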

daniel_esser · March 28, 2013

It says "In Zotero speichern (DOI)" ("Save to Zotero (DOI)"). If I save it, the entry is shown correctly in the library list, but if I hover the mouse over the entry or try to edit the title (and only then), I see the additional spaces.

adamsmith · March 28, 2013

yeah - the problem is the Goettingen proxy there - because of the way the Goettingen UB inserts itself in the URL, Zotero doesn't recognize the Wiley site as such and so you're getting the (incomplete - e.g. w/o an abstract or full-text) data from CrossRef via DOI.
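
To make the proxy issue concrete: in the example URL above, the publisher's host name ends up inside the path, so anything that matches on the host sees the proxy instead of Wiley. The sketch below recovers the underlying URL under the assumption that the layout is always `/han/<id>/<real-host>/<path>`; that pattern is inferred from the one URL in this thread, and `unproxy_han_url` is a hypothetical helper, not something Zotero provides.

```python
from urllib.parse import urlsplit


def unproxy_han_url(url: str) -> str:
    """Return the publisher URL hidden inside a HAN-style proxied URL.

    Assumes the layout http://<proxy-host>/han/<id>/<real-host>/<path>.
    URLs that do not match this layout are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # For the thread's example, segments is
    # ['', 'han', '62323_1', 'onlinelibrary.wiley.com', 'doi', ...]
    if len(segments) < 4 or segments[1] != "han" or "." not in segments[3]:
        return url
    result = f"{parts.scheme}://{segments[3]}/" + "/".join(segments[4:])
    if parts.query:
        result += "?" + parts.query
    return result


proxied = ("http://han.sub.uni-goettingen.de/han/62323_1/"
           "onlinelibrary.wiley.com/doi/10.1111/1574-6941.12040/abstract")
print(unproxy_han_url(proxied))
```

The rewritten URL is only useful if you actually have unproxied access to the publisher site.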
If you can access the resource unproxied (e.g. with a VPN connection or from campus) it'd work better.

That said, we should be able to fix the DOI import as well, even preserving the subscript 2.

aurimas · March 31, 2013

Technical note:

> That said, we should be able to fix the DOI import as well, even preserving the subscript 2.

I was looking into this a bit and I'm not sure if we will be able to do this correctly. Particularly, I'm not sure how to properly remove the spaces presented in the XML output from CrossRef.

I.e. http://www.crossref.org/openurl/?pid=zter:zter321&url_ver=Z39.88-2004&&rft_id=info:doi/10.1111/1574-6941.12040&noredirect=true&format=unixref is the data we get from CrossRef for that Wiley item. If you look at the title, it looks like:


            <title>
              Microbial community changes at a terrestrial volcanic CO
              <sub>2</sub>
              vent induced by soil acidification and anaerobic microhabitats within the soil column
            </title>

We can get the inner XML from the title, but if we simply remove all the redundant spaces, we end up with "volcanic CO ₂ vent". Note the space between CO and 2. I'm not sure if CrossRef has some sort of rules about handling this, but I don't see any obvious way. I have a feeling that this is simply the way the metadata was deposited with CrossRef.
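
The effect can be reproduced outside Zotero. This Python sketch parses the title element shown above with the standard library, takes its inner XML, and collapses whitespace; the helper name `inner_xml` is made up here, and the actual translator code may work quite differently.

```python
import re
import xml.etree.ElementTree as ET

# The title element as it appears in the CrossRef response quoted above.
UNIXREF_TITLE = """
<title>
  Microbial community changes at a terrestrial volcanic CO
  <sub>2</sub>
  vent induced by soil acidification and anaerobic microhabitats within the soil column
</title>
"""


def inner_xml(elem: ET.Element) -> str:
    """Serialize the text and children of elem, without the outer tag."""
    pieces = [elem.text or ""]
    for child in elem:
        # ET.tostring also emits the text that follows the child (its tail).
        pieces.append(ET.tostring(child, encoding="unicode"))
    return "".join(pieces)


title = ET.fromstring(UNIXREF_TITLE.strip())
collapsed = re.sub(r"\s+", " ", inner_xml(title)).strip()
print(collapsed)
```

The line break before `<sub>` is whitespace like any other, so after collapsing there is still one space left between "CO" and the subscript. Nothing in the XML itself says whether that whitespace is meaningful or just pretty-printing.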

adamsmith · April 1, 2013

I see - that makes sense. I think "volcanic CO ₂ vent" seems pretty decent, though. It's certainly better than the status quo with multiple spaces and without the subscript, and we should go ahead and implement that.

adamsmith · April 18, 2013

Double spaces in titles and abstracts are now removed for ISI WoK.
Your version of Zotero will automatically update within 24 hours, or you can update manually using the "Update Now" button in the "General" tab of the Zotero preferences.

If you run into any further problems, let us know, and thanks for reporting.

aurimas · April 27, 2013

Rich markup is now retained in titles retrieved from CrossRef.

Your version of Zotero will automatically update within 24 hours, or you can update manually using the "Update Now" button in the "General" tab of the Zotero preferences.

Rintze · April 30, 2013

@aurimas, any chance I could interest you in improving Zotero's support for processing rich markup? Some things on my wishlist:

- being able to apply markup with shortcuts (see https://forums.zotero.org/discussion/3875/rich-text-in-titles/?Focus=110229#Comment_110229 )
- have automatic normalization of markup that comes in via translators. E.g. some of my items come in as "<I>...</I>", and it would be nice to consistently have lowercase ("<i>...</i>"). We moved from using <sc/> to <span style="font-variant:small-caps;"> for smallcaps, so the latter is the preferred markup. And some of my metadata comes in with escaped tags, like "& lt;i& gt;Candida& lt;/i& gt;" (without the spaces).
- have Zotero parse rich text markup when displaying titles in the center column, and when displaying titles and the abstract in the info panel (unless the field is active for editing).
- (have the option to) have Zotero ignore rich text markup when searching my library
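
For the last wish, one possible approach is to strip the markup from a copy of the title before comparing it with the search term. This is only a sketch of the idea, not how Zotero's search works; `strip_markup` and `title_matches` are hypothetical names.

```python
import re

# Matches opening and closing tags such as <i>, </sub> or <span style="...">.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_markup(title: str) -> str:
    """Return the title with rich text tags removed."""
    return TAG_RE.sub("", title)


def title_matches(title: str, query: str) -> bool:
    """Case-insensitive substring search that ignores rich text tags."""
    return query.lower() in strip_markup(title).lower()


print(title_matches("Growth of <i>Candida</i> albicans", "candida albicans"))
print(title_matches("A volcanic CO<sub>2</sub> vent", "co2 vent"))
```

Because the pattern requires a letter directly after `<`, a plain comparison such as "a < b" in a title is left alone.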

aurimas · April 30, 2013

1, 3, and 4 I've thought about quite a bit and I have some ideas, so I'll give it a shot, maybe before 4.1.

2, what translator are these coming from? I'll have to think about how to generalize that, but an example would be helpful.

Rintze · April 30, 2013

I get "Characterization of the sol Operon in Butanol-Hyperproducing Clostridium saccharoperbutylacetonicum Strain N1-4 and Its Degeneration Mechanism" when saving from https://www.jstage.jst.go.jp/article/bbb/71/1/71_60370/_article (J-Stage translator)

adamsmith · April 30, 2013

@aurimas - I'll leave this for you, but I had a quick look and the uppercase html tags are in the bibtex (yes, really - html tags in bibtex).

aurimas · April 30, 2013

For that one, that's the way the title is presented in the BibTeX. We can probably normalize this on import for all translators. Though I'm wondering if there is any advantage to this other than aesthetics.
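
A generic normalization step could be as small as lowercasing tag names on import. The sketch below assumes simple tags and leaves attributes and text untouched; `lowercase_tags` is a hypothetical helper and the sample title is invented.

```python
import re

# Captures an optional slash and the tag name directly after "<".
TAG_NAME_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tags(text: str) -> str:
    """Lowercase tag names such as <I> or </SUB>, leaving the rest as is."""
    return TAG_NAME_RE.sub(
        lambda m: "<" + m.group(1) + m.group(2).lower(), text
    )


print(lowercase_tags("Characterization of the <I>sol</I> Operon"))
```

Only the tag name is rewritten, so capital letters in the surrounding title text are preserved.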

Do you have a link for this?

> And some of my metadata comes in with escaped tags, like "& lt;i& gt;Candida& lt;/i& gt;" (without the spaces).

Sounds like that should be fixed at the translator level.

Rintze · April 30, 2013

I have 19 papers in my library (of 320 items) that have unescaped tags. All came from "SpringerLink"/"Springer Link (old)", and I can't reproduce it from the new Springer website. It would be nice if this could be retroactively cleaned up in people's libraries, though.
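
A retroactive cleanup would essentially have to turn escaped tags back into real ones while leaving other escaped characters alone. The sketch below does that for a small whitelist; the whitelist (`i`, `b`, `sub`, `sup`) and the name `unescape_known_tags` are assumptions for illustration, and this is not an existing Zotero feature.

```python
import re

# Only these tag names are restored; the list is an assumption for this sketch.
ESCAPED_TAG_RE = re.compile(r"&lt;(/?(?:i|b|sub|sup))&gt;", re.IGNORECASE)


def unescape_known_tags(text: str) -> str:
    """Turn escaped tags like &lt;i&gt; back into <i>, nothing else."""
    return ESCAPED_TAG_RE.sub(r"<\1>", text)


print(unescape_known_tags("&lt;i&gt;Candida&lt;/i&gt; albicans"))
```

Restricting the replacement to known tags avoids touching a legitimately escaped "&lt;" in a title, for example in a comparison like "p &lt; 0.05".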