Diacritic removal leads to wrong results for languages like Swedish

Hi,
I noticed the update about diacritic removal in the note search and wanted to point out that diacritics do not always just modify the sound of a base letter. In some languages they denote wholly different letters, like Swedish å, ä, and ö.
https://en.wikipedia.org/wiki/Swedish_alphabet

The problem is that a search for any one of, for example, kår/kär/kar would match all three, even though they are different words (corps, in love, tub).
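To make the collision concrete, here is a minimal Python sketch of the usual way diacritics get stripped (Unicode decomposition, then dropping combining marks). I am assuming this is roughly what happens; it is not taken from Zotero's code, and `strip_diacritics` is just a name I made up for illustration.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Hypothetical helper: fold letters with diacritics to their base letters."""
    # NFD splits e.g. "å" into "a" followed by a combining ring above.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["kår", "kär", "kar"]
# All three distinct Swedish words collapse to the same folded key.
folded = {strip_diacritics(w) for w in words}
print(folded)
```

After folding, the three words are indistinguishable, so any search that compares folded strings cannot tell them apart.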

Solving this in general is of course hard, since the diacritic removal would have to depend on knowing which language a text is written in.

This is perhaps a minor problem for notes in Zotero, but a bigger one for search in Zotero in general.

Here is what I noticed when searching notes (on a build from master today):
Searching for any of kår, kär, or kar matches all of kår, kär, and kar.

When searching for documents with the Zotero search bar, by contrast:
kär only matches kär, kår only kår, but kar matches kår, kär, and kar.

This implies that diacritics are stripped from items (titles in this case) but not from the search string.
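One model that reproduces the search bar behaviour I saw is sketched below in Python. It is purely a guess on my part, not Zotero's actual implementation: the query is used as typed and compared against both the original title and a diacritic-stripped copy of it. The names `strip_diacritics` and `search` are hypothetical.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Hypothetical helper: fold letters with diacritics to their base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search(query: str, titles: list[str]) -> list[str]:
    """Assumed model: the query is NOT folded, the titles are matched raw and folded."""
    q = unicodedata.normalize("NFC", query)
    hits = []
    for title in titles:
        raw = unicodedata.normalize("NFC", title)
        # Match on the title as stored, or on its diacritic-stripped form.
        if q in raw or q in strip_diacritics(raw):
            hits.append(title)
    return hits

titles = ["kår", "kär", "kar"]
for query in titles:
    print(query, "->", search(query, titles))
```

Under this assumption a query with diacritics only finds exact matches, while the plain query kar finds all three titles, which is the asymmetry described above.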


This isn’t something that has been a big problem for me in Zotero. Ten years ago I highlighted the same problem for the full-text indexer recoll, where many false results were more problematic for me:
https://www.freelists.org/post/recoll-user/Problems-with-character-substitution

The easy solution would be to implement user-configurable exceptions to diacritic removal. This is what was added to recoll after that discussion, and what I have also configured for the similar machinery in Emacs. If I have a lot of text in Swedish, I could set Å, Ä, Ö, å, ä, ö to never have their diacritics removed. If I also have text in German, where doing ä→a makes sense, I still have a problem. But the literature in my field is mostly in English, so in Zotero I had actually never thought about this until now.
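A rough Python sketch of what I mean by configurable exceptions follows. It is only an illustration of the idea, not how recoll, Emacs, or Zotero implement it; the function name and the `keep` parameter are made up, and the default set is just my Swedish example.

```python
import unicodedata

# Assumed user setting: characters whose diacritics must never be removed.
SWEDISH_EXCEPTIONS = frozenset(unicodedata.normalize("NFC", "ÅÄÖåäö"))

def strip_diacritics_except(text: str, keep: frozenset = SWEDISH_EXCEPTIONS) -> str:
    """Hypothetical helper: fold diacritics except for characters listed in `keep`."""
    out = []
    # Work on precomposed (NFC) characters so "å" is one unit we can look up.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # protected letter, leave untouched
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(strip_diacritics_except("kår kär kar café"))
```

With the Swedish letters protected, kår, kär, and kar stay distinct while something like café still folds to cafe. It also shows the limitation: with this setting a German ä is kept as well, so a single global exception list cannot serve both languages.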