Ignoring diacritics (accents): local and web searches behave differently
I am new to this forum, but I did search for this particular issue and did not find anything. If (as I hope) this is a long settled issue, please point me to relevant threads.
I have seen numerous threads about search taking into account accents and other diacritics: this seems to be acknowledged as undesirable, with no ETA for the fix.
What I just found out is that, when done through the web, searches DO behave properly (that is, they ignore diacritics). So, local searches and web searches (on the same fields with same search terms) seem to return different results.
1- Is something wrong in my local config, or is it really the case that web searches ignore diacritics and local searches do not?
2- If the problem has been solved for web searches, is it so difficult to port the solution to the standalone program doing local searches?
3- Is there any hope of soon getting at least an ETA for a fix for local searches?
Thank you very much.
No ETA here, but this should become more possible after we move Zotero to a new architecture, which we're hoping to do this year or early next year.
(Interestingly, I notice that now in my Firefox browser, which I believe is also the software backbone of Zotero, searching for either of the characters in the post above identifies only the former [single character] and not the latter [character+modifier] regardless of which one I copy and paste into the 'search' box. Maybe that's the source of this inconsistency.)
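The single-character versus character+modifier distinction mentioned above can be reproduced outside the browser. A minimal Python sketch, assuming the two forms are the precomposed and decomposed encodings of the same letter (the letter chosen here is only an example, not the one from the earlier post):

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but differ as code point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2


def nfc(text: str) -> str:
    """Normalize to NFC so both encodings become the precomposed form."""
    return unicodedata.normalize("NFC", text)


print(nfc(composed) == nfc(decomposed))  # True
```

A search that compares raw code points would treat these as different strings, which may be what the browser's find box is doing.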
If it is, that's probably better in a new thread.
starting in 2010 with :
https://forums.zotero.org/discussion/11498
https://forums.zotero.org/discussion/34898
https://forums.zotero.org/discussion/40530
https://forums.zotero.org/discussion/74866
https://forums.zotero.org/discussion/69992
https://forums.zotero.org/discussion/67934
https://forums.zotero.org/discussion/comment/276774
It's probably not a problem for people in countries that do not use many accented characters, but here in France, for example, it's a real problem, especially for author names, as some journals will print them with or without the accented characters. It is also difficult to type accented characters from other countries on some keyboards. Try typing these authors' names into Zotero: Kardošová, Klápště, Truţǎ, Marcińska, or some subtle ones such as Martínez. Typing these in as just Kardosova, Klapste, Truta, Marcinska and Martinez would be so much easier.
It can technically be made to work either by saving a transliterated version of any field that can be searched, which would complicate Zotero, or by doing the transliteration manually by constructing the queries using replace:
SELECT YOUR_COLUMN FROM YOUR_TABLE WHERE replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace( lower(YOUR_COLUMN), 'á','a'), 'ã','a'), 'â','a'), 'é','e'), 'ê','e'), 'í','i'), 'ó','o') ,'õ','o') ,'ô','o'),'ú','u'), 'ç','c') LIKE '%SEARCHTERM%'
but that's going to make search massively slower, and might run into query-length limitations when the full range of transliteration options is added (romanization tables are massive).
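For comparison, here is a rough sketch of the same query-time idea done with a function instead of nested replace calls. It uses Python's sqlite3 module and Unicode decomposition, which is an assumption for illustration only and not how Zotero talks to its database; the `creators` table, its `last_name` column and the `fold` function are hypothetical names.

```python
import sqlite3
import unicodedata


def fold(text):
    """Lowercase the text and drop combining marks such as accents."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name fold().
conn.create_function("fold", 1, fold)
conn.execute("CREATE TABLE creators (last_name TEXT)")
names = ["Kardošová", "Klápště", "Truţǎ", "Marcińska", "Martínez"]
conn.executemany("INSERT INTO creators (last_name) VALUES (?)", [(n,) for n in names])


def search(term):
    """Return stored names whose folded form contains the folded search term."""
    rows = conn.execute(
        "SELECT last_name FROM creators "
        "WHERE fold(last_name) LIKE '%' || fold(?) || '%'",
        (term,),
    )
    return [row[0] for row in rows]


print(search("martinez"))
print(search("Klapste"))
```

This avoids the query-length issue, but it still calls the function on every row, so the concern about search speed still applies. It also only handles letters that decompose into a base letter plus combining marks.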
Wouldn't another option be simply to add a column holding the author names without the problematic characters when an article is added to the database?
Or even better, simply having the choice to completely ignore accents when saving an article from the web? I personally wouldn't mind not having accents at all. That way the problem is solved at its source.
That's effectively the first option I mentioned
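A minimal sketch of what that first option (a stored, transliterated copy used only for searching) could look like, again in Python with sqlite3 purely for illustration. The table, column and index names are hypothetical, and the original spelling is kept untouched in its own column:

```python
import sqlite3
import unicodedata


def fold(text):
    """Lowercase the text and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


conn = sqlite3.connect(":memory:")
# last_name keeps the original spelling; last_name_folded exists only for search.
conn.execute("CREATE TABLE creators (last_name TEXT, last_name_folded TEXT)")
conn.execute("CREATE INDEX creators_folded ON creators (last_name_folded)")


def add_creator(name):
    """Store the name as given, plus a folded copy computed once at insert time."""
    conn.execute(
        "INSERT INTO creators (last_name, last_name_folded) VALUES (?, ?)",
        (name, fold(name)),
    )


def find(term):
    """Exact match on the folded column; returns the original spellings."""
    rows = conn.execute(
        "SELECT last_name FROM creators WHERE last_name_folded = ?",
        (fold(term),),
    )
    return [row[0] for row in rows]


for name in ["Kardošová", "Martínez", "Martin"]:
    add_creator(name)

print(find("Kardosova"))
print(find("Martínez"))
```

The cost is the one already mentioned: every searchable field needs a second copy that has to be kept in sync whenever the original is edited.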
That would be the worst possible option from my POV: it creates an unfixable problem at the source. That would end up in the bibliography, and neither you nor I get to choose how the author you are referring to is named.
Edit: but if the latter option is what you want, even though I would strongly recommend against doing so, as it amounts to purposeful data loss that will be (nigh on) impossible to undo, that's easy to automate with a plugin. I can create that for you, but the data loss that this plugin would cause (by design) is on you.
We've just about completed the handling of names with particles: optionally, a search for a last name with or without the particle will return the same listing. This is (I think) being done with an author equivalency table. We did this to allow searchers living in different places, with different knowledge and keyboards, to nonetheless perform a successful query.
(When the record is downloaded to Zotero the metadata contains the accented-characters version of the name unless it is quite clear that the author prefers the undecorated characters. We export names with particles properly placed in the given name field.)
2. From my perspective (as a linguist) there is a fundamental difference between searching "a" and "á" as the same character and treating Japanese glyphs as variants of romaji. While the latter could also be helpful for searching, it's less transparent and less expected. A similar comparison can be made for, say, German umlauts: we should be able to search for "ü" by typing just "u", even though a standard alternative for writing "ü" on a keyboard without umlauts is "ue". If you want to implement both, that's fine. But similarly for "ß" and "ss", that's not so obviously the "same character" as in the case of "ü/u" or "á/a", etc. Letters with obvious diacritics should be searched as if they were the base form, while anything beyond that would be a bonus but, as you said, much more complicated to implement.
There is also a secondary issue, which is that if you include variant spellings, then it should be possible to look them up either way, which makes it even more complicated (and won't work with the proposed method of a secondary search table with diacritics removed). In that case, "ß" should search for "ss" and "ss" should search for "ß", and for that matter "ü" should search for "u", etc., but that's again going beyond the basic functionality corresponding to the simple "á" and "a" case. I'm not even sure that more complex functionality would be transparent to everyone searching, although in principle it could be useful if, say, you wanted to search for the same author's name whether written in Japanese or transliterated in English/Romaji. A secondary problem, of course, is that there are many different ways of transcribing some languages (this is especially the case with Russian, for example), so it would be too complex to implement fully.
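To make the two-way matching concrete, here is a rough Python sketch in which each string gets more than one search key and two strings match when their key sets overlap. The tiny German transcription table is a hypothetical illustration and nowhere near complete, and the comparison is on whole strings only:

```python
import unicodedata

# Hypothetical, deliberately tiny table of German keyboard transcriptions.
GERMAN_TRANSCRIPTION = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_marks(text):
    """Drop combining marks so that "ü" becomes "u" and "á" becomes "a"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(text):
    """Return the set of keys under which this string should be findable."""
    # casefold() lowercases and also maps "ß" to "ss".
    lowered = unicodedata.normalize("NFC", text).casefold()
    plain = strip_marks(lowered)
    transcribed = strip_marks(
        "".join(GERMAN_TRANSCRIPTION.get(ch, ch) for ch in lowered)
    )
    return {plain, transcribed}


def matches(query, stored):
    """True when the two strings share at least one search key."""
    return bool(search_keys(query) & search_keys(stored))


print(matches("Muller", "Müller"))    # True
print(matches("Mueller", "Müller"))   # True
print(matches("Müller", "Mueller"))   # True
print(matches("Strasse", "Straße"))   # True
print(matches("Muller", "Mueller"))   # False
```

Even this small case needs a set of keys per string rather than a single folded copy, which hints at how quickly the approach grows once full transliteration schemes are involved.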
In short, ignoring (during search) the most common diacritics from basic Latin letters would be helpful.
(I will also add that I strongly agree with you that Zotero should keep the original form of the name in metadata. This should only apply to search.)
I look into this with whatever current build of SQLite we have from Firefox every few years, and the last time I did it still didn't seem to be possible, but I'll look again after Zotero 6 is out. Storing a second normalized copy of all strings is an option, but a bad one.
They changed the collation, and that 1.5 hours was probably mostly spent finding the configuration file and reading the docs. The actual technical change will have been one or two lines of text in a config file, and a restart. SQLite has the plumbing to do new collations, but doesn't have any baked in in the version that's in Firefox, and we can't simply re-bake the SQLite in Firefox.
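As an illustration of that plumbing, here is a minimal sketch that registers a custom collation through Python's sqlite3 module. This only shows the SQLite mechanism in an environment where the application can register its own collation; it assumes nothing about what is possible with the SQLite build inside Firefox. The collation name `nodiacritics` and the `creators` table are hypothetical.

```python
import sqlite3
import unicodedata


def fold(text):
    """Lowercase the text and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def compare_folded(a, b):
    """Collation callback: return -1, 0 or 1 after folding both strings."""
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


conn = sqlite3.connect(":memory:")
conn.create_collation("nodiacritics", compare_folded)
conn.execute("CREATE TABLE creators (last_name TEXT)")
conn.executemany(
    "INSERT INTO creators (last_name) VALUES (?)",
    [("Martínez",), ("Martin",), ("Klápště",)],
)

# Equality comparison under the custom collation ignores accents and case.
equal_rows = conn.execute(
    "SELECT last_name FROM creators WHERE last_name = ? COLLATE nodiacritics",
    ("martinez",),
).fetchall()
print(equal_rows)

# Sorting under the same collation orders names by their folded form.
sorted_rows = conn.execute(
    "SELECT last_name FROM creators ORDER BY last_name COLLATE nodiacritics"
).fetchall()
print(sorted_rows)
```

Note that a collation affects comparisons and sorting; SQLite's LIKE operator does not consult it, so substring search would still need another approach.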
When searchers don't know how to "properly" enter a name (or even that there is a "proper" way) they query using what they think is the name.
I'm especially sensitive to the name particle issue (searchers being able to find names with particles with or without entering the particle) because of a SNAFU that happened in my home town when I was about 10 years old. I grew up in south Louisiana "Cajun country". The telephone company switched from operator-connected "number please" calls to dial. Everyone received a new 7-digit number. My old home phone number was 21 and my best friend's number was 121. The name problem was with the new telephone book's alphabetization. Almost every family with a name that had even a hint of a particle had it dropped. Leblanc (the most common name in town) became listed under B instead of L. Debellvue, another common family name, was listed under B. The large Mexican population with Spanish surnames similarly suffered. That would have caused serious confusion under normal circumstances, but when everyone was receiving a new phone number it became a disaster. Edit: The type was also converted from all upper case to mixed case; the name was listed as leBlanc and under B instead of the old Leblanc under "L". People went crazy (with good reason).
Looking at the server logs listing author-name queries it is clear that, depending upon where the user resides, there is a pattern of particle-related name searching. Searchers use or don't use a dropping or non-dropping particle. We received complaints from users who used a form of the name that didn't comply with cataloging rules or various style guides. In fact, for many names more people queried using the "wrong" name-form than the standard one. Thus, our efforts to allow multiple entry formats.
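A very small sketch of the idea of accepting a name with or without its particle. This is a rule-based stand-in written in Python for illustration, not the equivalency table described above; the particle list is hypothetical and incomplete, and it only handles particles written as separate words (so it would not cover "Leblanc").

```python
# Hypothetical, incomplete list of particles written as separate words.
PARTICLES = ("van der", "de la", "van", "von", "de", "du", "le", "la")


def name_keys(last_name):
    """Return the lowercased name plus variants with a leading particle removed."""
    lowered = " ".join(last_name.lower().split())
    keys = {lowered}
    for particle in PARTICLES:
        prefix = particle + " "
        if lowered.startswith(prefix) and len(lowered) > len(prefix):
            keys.add(lowered[len(prefix):])
    return keys


def same_listing(query, stored):
    """True when the query and the stored name share at least one key."""
    return bool(name_keys(query) & name_keys(stored))


print(same_listing("Gogh", "van Gogh"))             # True
print(same_listing("van Gogh", "Gogh"))             # True
print(same_listing("Waals", "van der Waals"))       # True
print(same_listing("Gogh", "van Bergh"))            # False
```

A real equivalency table would presumably be curated per author rather than derived from a prefix rule, since the same particle can be dropping in one name and non-dropping in another.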
There doesn't seem to have been any movement on this, as it's still a problem on my local search. Are there any updates?
Diacritic-insensitive searching is essential for those of us with libraries that deal with non-English-language material, which is often handled differently by publishers.
Along these lines, perhaps a fuzzy search for your library would be a nice adjacent feature?
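For what it's worth, a rough sketch of how a fuzzy lookup could sit on top of diacritic folding, using Python's standard difflib module. Nothing here reflects an existing Zotero feature; the function names and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib
import unicodedata


def fold(text):
    """Lowercase the text and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def fuzzy_search(query, names, cutoff=0.8):
    """Return original names whose folded form is close to the folded query."""
    # Map folded form back to the original; names that fold to the same
    # string would overwrite each other in this simple sketch.
    folded_to_original = {fold(name): name for name in names}
    hits = difflib.get_close_matches(
        fold(query), list(folded_to_original), n=5, cutoff=cutoff
    )
    return [folded_to_original[hit] for hit in hits]


library = ["Kardošová", "Klápště", "Marcińska", "Martínez"]
print(fuzzy_search("Martines", library))   # misspelled last letter
print(fuzzy_search("Kardosva", library))   # one letter missing
```

Folding first means an unaccented query is not penalised for the missing diacritics, so the similarity score only has to absorb genuine typos.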