Ignoring diacritics (accents): local and web searches behave differently

I am new to this forum, but I did search for this particular issue and did not find anything. If (as I hope) this is a long-settled issue, please point me to the relevant threads.

I have seen numerous threads about search taking into account accents and other diacritics: this seems to be acknowledged as undesirable, with no ETA for the fix.

What I just found out is that, when done through the web, searches DO behave properly (that is, they ignore diacritics). So, local searches and web searches (on the same fields with same search terms) seem to return different results.

1- Is something wrong in my local config, or is it really the case that web searches ignore diacritics and local searches do not?

2- If the problem has been solved for web searches, is it so difficult to port the solution to the standalone program doing local searches?

3- Is there any hope of getting at least an ETA soon for a fix for local searches?

Thank you very much.
  • edited April 10, 2019
    Nothing wrong on your end, but also not a simple fix — the web library uses a more sophisticated database that can do accent-insensitive searches automatically.

    No ETA here, but this should become more possible after we move Zotero to a new architecture, which we're hoping to do this year or early next year.
  • edited April 11, 2019
    Sorry if this is a bit out of place, but while the topic of diacritics has been brought up, I've noticed confusing behavior for precomposed Unicode characters versus ASCII characters followed by Unicode combining diacritics. That is, you can form é either as a single character, or as e+´=é. The former is probably 'best', but the latter very often gets into my library when importing metadata (especially from WorldCat), and this means that authors have visually identical names that aren't actually (always?) treated as the same, at least not when searching. (I'm not sure whether they're treated as equivalent otherwise, e.g., when sorting a bibliography.) I've tried to manually correct most of this in my own library, but I just thought I'd mention it in the discussion here.

    (Interestingly, I notice that now in my Firefox browser, which I believe is also the software backbone of Zotero, searching for either of the characters in the post above identifies only the former [single character] and not the latter [character+modifier] regardless of which one I copy and paste into the 'search' box. Maybe that's the source of this inconsistency.)
  • @djross3 Zotero should normalize precomposed and decomposed Unicode on import (and export) since version 4.0.25.1, January 2015. Are you sure this is happening with pairs of items imported since then?

    If it is, that's probably better in a new thread.
  • edited April 11, 2019
    Ah, thanks. That may be the case (and some of my entries are older than that, mostly manually fixed now). Since it's hard to find these, and I haven't noticed them often (but do import names with diacritics often), you may be right. But I will keep an eye out for it and then report as a bug in a new thread if I find that it is still happening.
  • @dstillman Is it not possible to use a lookup table to ignore diacritics during searches? For example, a lookup table that substitutes any instance of é with e, just during the search, to create a temporary string without any accented characters? Are you saying that the current architecture cannot do any string manipulation of this kind? I am hard pressed to find a programming language that cannot do this, but I don't know what Zotero's architecture is.
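
For readers wondering what the "temporary string without accents" idea and the precomposed/decomposed distinction look like in practice, here is a minimal sketch in Python. It is not Zotero's code and says nothing about how Zotero's local search or the web library actually work; the function names `fold_diacritics` and `matches` are made up for illustration. Instead of a hand-written lookup table, it relies on Unicode normalization from the standard `unicodedata` module: decompose each character (NFD), drop the combining marks, and compare the results.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return a case-folded copy of text with combining marks removed.

    Hypothetical helper for illustration only.
    """
    # NFD splits a precomposed character such as U+00E9 into a base
    # letter plus a combining mark (e + U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, field: str) -> bool:
    """Accent-insensitive substring match on temporary folded copies."""
    return fold_diacritics(query) in fold_diacritics(field)


precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two forms look identical but are different strings until normalized.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# Either form of the name is found by an unaccented query.
print(matches("bronte", "Charlotte Bront\u00eb"))
print(matches("bronte", "Charlotte Bronte\u0308"))
```

One caveat with this approach: letters such as ø or ł have no canonical decomposition, so they pass through unchanged and would still need a small lookup table. Doing the folding on a single string is also the easy part; applying it efficiently to every field in a database query is a separate problem, which may be part of why the fix depends on the underlying architecture.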