Ignoring diacritics (accents): local and web searches behave differently

I am new to this forum, but I did search for this particular issue and did not find anything. If (as I hope) this is a long settled issue, please point me to relevant threads.

I have seen numerous threads about search taking into account accents and other diacritics: this seems to be acknowledged as undesirable, with no ETA for the fix.

What I just found out is that, when done through the web, searches DO behave properly (that is, they ignore diacritics). So, local searches and web searches (on the same fields with same search terms) seem to return different results.

1- Is something wrong in my local config, or is it really the case that web searches ignore diacritics and local searches do not?

2- If the problem has been solved for web searches, is it so difficult to port the solution to the standalone program doing local searches?

3- Is there any hope of getting at least an ETA soon for a fix for local searches?

Thank you very much.
  • edited April 10, 2019
    Nothing wrong on your end, but also not a simple fix — the web library uses a more sophisticated database that can do accent-insensitive searches automatically.

    No ETA here, but this should become more possible after we move Zotero to a new architecture, which we're hoping to do this year or early next year.
  • edited April 11, 2019
    Sorry if this is a bit out of place, but while the topic of diacritics has been brought up, I've noticed confusing behavior for unicode characters versus ascii characters plus unicode modifier diacritics. That is, you can form é either as a special character, or as e+´=é. The former is probably 'best', but the latter very often gets into my library from importing metadata (especially from Worldcat), and this means that authors have visually identical names but aren't actually (always?) treated as the same, at least not when searching. (I'm not sure whether they're treated as equivalent otherwise, e.g., in sorting a bibliography.) I've tried to manually correct most of this in my own library, but I just thought I'd mention it in the discussion here.

    (Interestingly, I notice that now in my Firefox browser, which I believe is also the software backbone of Zotero, searching for either of the characters in the post above identifies only the former [single character] and not the latter [character+modifier] regardless of which one I copy and paste into the 'search' box. Maybe that's the source of this inconsistency.)
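
    If it helps to make the difference concrete, here is a tiny Python illustration of the two forms (nothing Zotero-specific, just standard Unicode normalization):

    import unicodedata

    precomposed = "\u00e9"    # 'é' as a single code point
    decomposed = "e\u0301"    # 'e' followed by a combining acute accent

    print(precomposed == decomposed)                                 # False: different code point sequences
    print(precomposed == unicodedata.normalize("NFC", decomposed))   # True once both sides are normalized

    Either NFC or NFD would do for comparison purposes, as long as both strings get the same treatment.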
  • @djross3 Zotero should normalize precomposed & decomposed unicode on import (and export) since version 4.0.25.1, January 2015. Are you sure this is happening with pairs of items imported since then?

    If it is, that's probably better in a new thread.
  • edited April 11, 2019
    Ah, thanks. That may be the case (and some of my entries are older than that, mostly manually fixed now). Since it's hard to find these, and I haven't noticed them often (but do import names with diacritics often), you may be right. But I will keep an eye out for it and then report as a bug in a new thread if I find that it is still happening.
  • @dstillman Is it not possible to use a lookup table to ignore diacritics during searches? For example a lookup table that substitutes any instance of é with e, just during the search, and create a temporary string without any accented characters? Are you saying that the current architecture cannot do any string manipulation of this kind? I am hard pressed to find a programming language that cannot do this, but I don't know what's Zotero's architecture.
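
    To illustrate the kind of lookup table I mean, a rough Python sketch (the mapping below is deliberately tiny and would need to cover far more characters in practice):

    # substitute each accented character with its base letter, just for the comparison
    ACCENT_MAP = str.maketrans("áàâãäéèêëíìîïóòôõöúùûüçñ",
                               "aaaaaeeeeiiiiooooouuuucn")

    def fold(term):
        return term.lower().translate(ACCENT_MAP)

    print(fold("Martínez") == fold("Martinez"))   # True

    Applied to both the search term and the field being searched, the two compare equal without touching the stored data.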
  • edited January 29, 2022
    This seems to be a problem that has existed for a long time and has been reported over and over again,

    starting in 2010 with:
    https://forums.zotero.org/discussion/11498

    https://forums.zotero.org/discussion/34898
    https://forums.zotero.org/discussion/40530
    https://forums.zotero.org/discussion/74866
    https://forums.zotero.org/discussion/69992
    https://forums.zotero.org/discussion/67934
    https://forums.zotero.org/discussion/comment/276774

    It's probably not a problem for people in countries that do not use many accented characters, but here in France, for example, it's a real problem. Especially for author names, as some journals will print them with the accented characters and others without. It is also difficult to type accented characters from other languages on some keyboards. Try typing these author names into Zotero: Kardošová, Klápště, Truţǎ, Marcińska, or some subtle ones such as Martínez. Typing them simply as Kardosova, Klapste, Truta, Marcinska and Martinez would be so much easier.
  • This problem bothers me very regularly. Even English speakers have this problem, for exactly the reasons you describe.
  • @xxtraloud Zotero search translates the search query into a SQL query and executes that, and SQLite (the database engine that Zotero uses) does not have anything like SQL Server's accent-insensitive collations.

    It could technically be made to work either by saving a transliterated version of every field that can be searched, which would complicate Zotero, or by doing the transliteration manually, constructing the queries with replace:

    SELECT YOUR_COLUMN FROM YOUR_TABLE
    WHERE replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
        lower(YOUR_COLUMN),
        'á','a'), 'ã','a'), 'â','a'), 'é','e'), 'ê','e'), 'í','i'), 'ó','o'), 'õ','o'), 'ô','o'), 'ú','u'), 'ç','c')
    LIKE '%SEARCHTERM%'

    but that's going to make search massively slower, and it might run into query-length limitations once the full range of transliteration options is added (romanization tables are massive).
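
    For what it's worth, the first option (a stored, pre-transliterated copy of each searchable field) could look roughly like this, sketched here in Python against a made-up table rather than Zotero's actual schema:

    import sqlite3
    import unicodedata

    def transliterate(s):
        # decompose, drop the combining marks, lower-case
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fields (value TEXT, value_search TEXT)")
    for name in ("Kardošová", "Klápště", "Marcińska"):
        conn.execute("INSERT INTO fields VALUES (?, ?)", (name, transliterate(name)))

    # the user's term is folded the same way before the LIKE
    term = "%" + transliterate("Klapste") + "%"
    print(conn.execute("SELECT value FROM fields WHERE value_search LIKE ?",
                       (term,)).fetchall())   # [('Klápště',)]

    The cost is exactly what's described above: every searchable value has to be stored, and kept in sync, twice.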
  • @emilianoeheyns Thanks for the suggestion.

    Wouldn't another option simply be to have an additional column with the author names without problematic characters, filled in when an article is added to the database?

    Or even better, simply having the choice to completely ignore accents when saving an article from the web? I personally wouldn't mind not having accents at all. That way the problem is solved at its source.
  • edited January 30, 2022

    Wouldn't another option simply be to have an additional column with the author names without problematic characters, filled in when an article is added to the database?

    That's effectively the first option I mentioned.

    Or even better, simply having the choice to completely ignore accents when saving an article from the web? I personally wouldn't mind not having accents at all. That way the problem is solved at its source.

    That would be the worst possible option from my POV - it creates an unfixable problem at the source. Stripped names would end up in the bibliography, and neither you nor I get to choose how the author you are referring to spells their name.

    Edit: but if the latter option is what you want, even though I would strongly recommend against doing so, as it amounts to purposeful data loss that will be (nigh on) impossible to undo, that's easy to automate with a plugin. I can create that for you, but the data loss that this plugin would cause -- by design -- is on you.
  • Huh - I could implement the first option as a plugin, I think, by adding pseudo-fields. These would still need to be saved to a separate database (or startup would become massively slower as the database grows), which has some risk for race conditions, and I would have to add a pseudo-field for every field that would be searchable this way. It'd be a bit of work though.
  • edited January 31, 2022
    It is the weekend so I can't easily ask my web developers for details on how they accomplished it. It only required a 1.5 hour charge. But an author name search of the MySQL+Sphinx database (with well over 3 million unique authors attached to almost 700,000 publication records) using either a name with properly "accented" characters or characters without accents will return exactly the same list. An author query for Martínez or Martinez will return the same inclusive result. Same with ß vs ss. Likewise ö=oe, ü=ue, etc. Searches require less than 2 seconds to return results. The actual search script execution time is fractions of a millisecond. Is this a difference between the capabilities of MySQL and SQLite? Maybe the difference is that the search runs on a dedicated server? All of the server and web technical "stuff" far exceeded my own knowledge, skills, and abilities 15 years ago.

    We've just about completed the handling of names with particles -- optionally a search for a last name with or without the particle will return the same listing. This is (I think) being done with an author equivalency table. We did this to allow searchers living in different places using different knowledge and keyboards to nonetheless perform a successful query.

    (When the record is downloaded to Zotero the metadata contains the accented-characters version of the name unless it is quite clear that the author prefers the undecorated characters. We export names with particles properly placed in the given name field.)
  • edited January 31, 2022
    ...when the full range of transliteration options is added (romanization tables are massive)
    1. Not implementing this because it would be difficult to implement in the most extreme cases doesn't make sense as an argument to me. If Japanese (and some other languages) can't be supported, then that doesn't mean that all other languages must also wait, including those that would be much simpler to implement.

    2. From my perspective (as a linguist) there is a fundamental difference between treating "a" and "á" as the same character in search and treating Japanese glyphs as variants of romaji. While the latter could also be helpful for searching, it's less transparent and expected. A similar comparison can be made for, say, German umlauts: we should be able to search for "ü" by typing just "u", even though a standard alternative for writing "ü" on a keyboard without umlauts is "ue". If you want to implement both, that's fine. But "ß" and "ss" are not so obviously the "same character" in the way that "ü/u" or "á/a" are. Letters with obvious diacritics should be searched as if they were the base form, while anything beyond that would be a bonus but, as you said, much more complicated to implement.

    There is also a secondary issue, which is that if you include variant spellings, then it should be possible to look them up either way, which makes it even more complicated (and won't work with the proposed method of a secondary search table with diacritics removed). In that case, "ß" should search for "ss" and "ss" should search for "ß", and for that matter "ü" should search for "u", etc., but that's again going beyond the basic functionality corresponding to the simple "á" and "a" case. I'm not even sure that more complex functionality would be transparent to everyone searching, although in principle it could be useful if, say, you wanted to search for the same author's name whether written in Japanese or transliterated in English/Romaji. A secondary problem, of course, is that there are many different ways of transcribing some languages (this is especially the case with Russian, for example), so it would be too complex to implement fully.

    In short, ignoring (during search) the most common diacritics from basic Latin letters would be helpful.

    (I will also add that I strongly agree with you that Zotero should keep the original form of the name in metadata. This should only apply to search.)
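
    Unicode itself draws roughly the line I'm describing: letters like "ü" and "á" decompose into a base letter plus a combining mark, while "ß" has no such decomposition, because ß/ss is an orthographic convention rather than a diacritic. A quick Python check, just for illustration:

    import unicodedata

    for ch in ("ü", "á", "ß"):
        decomposed = unicodedata.normalize("NFD", ch)
        print(ch, [hex(ord(c)) for c in decomposed])

    # ü ['0x75', '0x308']   base letter + combining diaeresis
    # á ['0x61', '0x301']   base letter + combining acute
    # ß ['0xdf']            no decomposition

    So "strip the combining marks" handles the á/a and ü/u cases, but ß/ss and other spelling equivalences would need a separate mapping.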
  • We're obviously not going to throw away diacritics on save.

    I look into this with whatever current build of SQLite we have from Firefox every few years, and the last time I did it still didn't seem to be possible, but I'll look again after Zotero 6 is out. Storing a second normalized copy of all strings is an option, but a bad one.
  • It is the weekend so I can't easily ask my web developers for details on how they accomplished it. It only required a 1.5 hour charge.

    They changed the collation, and that 1.5 hours was probably mostly spent finding the configuration file and reading the docs. The actual technical change will have been one or two lines in a config file, plus a restart. SQLite has the plumbing to add new collations, but no accent-insensitive collation is baked into the version that ships with Firefox, and we can't simply re-bake the SQLite in Firefox.
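
    To illustrate what that plumbing looks like: the host application can register its own collation with SQLite and then use it in comparisons. A minimal sketch in Python, whose sqlite3 module exposes this via create_collation; this only shows the mechanism with made-up table and column names, not how Zotero would actually wire it up:

    import sqlite3
    import unicodedata

    def fold(s):
        # decompose, drop combining marks, case-fold
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()

    def noaccent(a, b):
        a, b = fold(a), fold(b)
        return (a > b) - (a < b)   # collations must return -1, 0 or 1

    conn = sqlite3.connect(":memory:")
    conn.create_collation("noaccent", noaccent)
    conn.execute("CREATE TABLE creators (lastName TEXT)")
    conn.executemany("INSERT INTO creators VALUES (?)", [("Martínez",), ("Klápště",)])
    print(conn.execute(
        "SELECT lastName FROM creators WHERE lastName = 'Martinez' COLLATE noaccent"
    ).fetchall())   # [('Martínez',)]

    The catch is what's said above: the collation has to be registered by the host application every time the database is opened, because it isn't baked into the SQLite build that Firefox ships.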

  • edited January 31, 2022
    Off topic anecdote that doesn't really apply to Zotero citation but only to searching:

    When searchers don't know how to "properly" enter a name (or even that there is a "proper" way) they query using what they think is the name.

    I'm especially sensitive to the name-particle issue (searchers being able to find names with particles whether or not they enter the particle) because of a SNAFU that happened in my home town when I was about 10 years old. I grew up in south Louisiana "Cajun country". The telephone company switched from operator-connected "number please" calls to dial. Everyone received a new 7-digit number; my old home phone number was 21 and my best friend's number was 121. The name problem was with the new telephone book's alphabetization. Almost every family whose name had even a hint of a particle had it dropped. Leblanc (the most common name in town) became listed under B instead of L. Debellvue, another common family name, was listed under B. The large Mexican population with Spanish surnames suffered similarly. That would have caused serious confusion under normal circumstances, but when everyone was also receiving a new phone number it became a disaster. edit: The type was also converted from all upper case to mixed case, so the name was listed as leBlanc, under B, instead of the old Leblanc under "L". People went crazy (with good reason).

    Looking at the server logs listing author-name queries it is clear that, depending upon where the user resides, there is a pattern of particle-related name searching. Searchers use or don't use a dropping or non-dropping particle. We received complaints from users who used a form of the name that didn't comply with cataloging rules or various style guides. In fact, for many names more people queried using the "wrong" name-form than the standard one. Thus, our efforts to allow multiple entry formats.
