special character search

realtime99 · May 16, 2010

Often, names are entered on the web or in metadata with special characters converted to similar English characters, so Zotero sometimes considers two papers by one author to be by different authors.

I often can remember an author's name but not the diacritics of the non-English characters. Currently, if I want to find all papers by one author, my options are to either convert all names to English characters by hand (which gives citations with an incorrect spelling of the name) or to convert all names to the original language characters by hand (which prevents me from successfully searching).

I would like search to be able to treat an English character as equivalent to similar special characters, to make searching easier. For example, I would like a search for "Noel Carroll" to find articles written by "Noël Carroll", and a search for "Laszlo Halasz" to find articles by "László Halász".

This would help me immensely. Can it be implemented?

alexuw · May 16, 2010

This would be a nice feature. Very common problem when using non-English and translated sources, particularly with certain languages.

Note that differing diacriticals or spellings are not necessarily "incorrect" - different publications may have different versions of the author's name. I suppose how to correctly cite in this situation will depend on the citation style. This is a case where Zotero has not entirely decoupled data from style.

ajlyon · May 16, 2010

This is a question of collations in SQLite, and it has been discussed. My understanding is that the newest versions of SQLite allow user-defined collation functions, so Zotero could, at its option, ignore combining diacritical marks.
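The collation idea above can be sketched with Python's built-in sqlite3 module, which exposes SQLite's user-defined collations through `create_collation`. This is only an illustration under assumptions: the `creators` table, the `nodiacritics` collation name, and the helper functions are hypothetical and are not Zotero's schema or code.

```python
import sqlite3
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def nodiacritics(a: str, b: str) -> int:
    """Collation callback: compare with diacritics and case ignored."""
    ka, kb = strip_marks(a).casefold(), strip_marks(b).casefold()
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("nodiacritics", nodiacritics)

# Hypothetical one-column table, not Zotero's real schema.
conn.execute("CREATE TABLE creators (name TEXT)")
conn.executemany(
    "INSERT INTO creators VALUES (?)",
    [("Noël Carroll",), ("László Halász",), ("Lewis Carroll",)],
)


def find_creator(query: str) -> list[str]:
    """Exact-name lookup that ignores diacritics via the custom collation."""
    rows = conn.execute(
        "SELECT name FROM creators WHERE name = ? COLLATE nodiacritics",
        (query,),
    )
    return [row[0] for row in rows]
```

With this in place, `find_creator("Noel Carroll")` returns the stored "Noël Carroll" row. A collation only affects comparisons and sorting; substring search would need a different mechanism.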

I run into this frequently with Library of Congress Cyrillic transliterations, which use many diacritical marks that are nigh impossible to enter manually; most people use a diacritical-free simplification. Such simplifications match against the diacritical-adorned versions in bibliographic databases, and it would be great for them to match against adorned entries in Zotero.

My test case:
Adorned: Kazanskiĭ retro-leksikon: pervyĭ opyt rodoslovno-biograficheskoĭ i istoriko-kraevedcheskoĭ Ėnt︠s︡iklopedii
Unadorned: Kazanskii retro-leksikon: pervyi opyt rodoslovno-biograficheskoi i istoriko-kraevedcheskoi Entsiklopedii
I would like for a search for Entsiklopedii to match either form; right now it won't, since the characters in the adorned case are E*nt-s-iklopedii; where the * and - represent combining diacriticals that are not available on most (any?) keyboards.

When it becomes possible to have multiple versions of a single field (see http://forums.zotero.org/discussion/1798), as Frank has discussed and as the CSL 1.0 processor supports, then it should become possible to address alexuw's concern, so long as each version of a name or title can be described by a specific transliteration scheme, and that scheme is issued a subtag in the IANA Language Subtag registry (as Frank and I have already done).

ajlyon · October 23, 2010

I ran into a similar issue with Unicode equivalence that is making author autocompletion rather inconvenient -- the names "Dündar" and "Dündar" are treated as distinct with regard to completion. The difference is, as with the combining ligatures above, that the "ü" in the first example is precomposed and the one in the second example is not. The two should be treated as equivalent for search and sorting.

Appropriate resources for handling this might be found in the ICU toolkit: http://userguide.icu-project.org/collation/icu-string-search-service, available for Sqlite (http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt).
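The precomposed/decomposed mismatch is easy to reproduce. The sketch below uses Python's standard unicodedata module purely for illustration (it is not what Zotero runs); the extra names in the list are made up to show the effect on sorting.

```python
import unicodedata

precomposed = "D\u00fcndar"  # "ü" as the single code point U+00FC
decomposed = "Du\u0308ndar"  # "u" followed by combining diaeresis U+0308


def nfc(text: str) -> str:
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# Raw comparison sees two different strings; NFC makes them identical.
same_raw = precomposed == decomposed            # False
same_nfc = nfc(precomposed) == nfc(decomposed)  # True

# A plain code-point sort separates the two spellings;
# sorting on the NFC form keeps them next to each other.
names = ["Dvorak", decomposed, "Dumas", precomposed]
sorted_raw = sorted(names)
sorted_nfc = sorted(names, key=nfc)
```

In `sorted_raw`, "Dvorak" lands between the two spellings of the same name; in `sorted_nfc` they are adjacent. Neither order is locale-aware; the point is only that normalization removes the spurious difference.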

ajlyon · October 23, 2010

The use or non-use of precomposed characters has ramifications for disambiguation and sorting in CSL as well. Frank generally has a test for everything under the sun, but maybe precomposed characters have slipped by.

fbennett · October 23, 2010

Well, that's ugly. These certainly have slipped by. The localeCompare() string method used for sorting does not see them as equivalents, so sorting will be messed up. The JavaScript sloppy compare operator == (as opposed to strict compare with ===) also sees the two characters as not equal. I don't see any way to cope with this case in JavaScript, apart from normalization of the strings on input.

There must be a table of equivalents in the system somewhere, since the two display with the same glyph in the browser. Not sure how you get at that facility, though, or whether it can be used for reverse mapping to a canonical character code. Looks like a tough one.

fbennett · October 23, 2010

Here's a W3C doc on Unicode normalization: http://www.w3.org/TR/charmod-norm/#sec-ChoiceNFC

fbennett · October 23, 2010

Here's a tutorial that might be relevant. Can we access java.text.Normalizer in Firefox JS I wonder?

http://download.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

dstillman · October 23, 2010

We can't rely on something from Java in any case.

fbennett · October 23, 2010

Here is a comprehensive test suite, from which data for a normalizing function could be extracted:

http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Looks like there are about 18,000 of these things. Per-character SQL lookups are a little frightening performance-wise ... this is outside my skill envelope. Would certainly be useful to have access to an NFD -> NFC Unicode conversion facility in JS though, if nothing exists yet and someone can be tempted to work on it. Summer of Code candidate?
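For anyone wanting to check a normalizer against that test file, here is a rough sketch in Python (standing in for whatever runtime is actually used). It assumes the file's semicolon-separated layout of c1;c2;c3;c4;c5 (source, NFC, NFD, NFKC, NFKD), with code points in hex. The sample lines are written out by hand in that layout rather than copied from the file, and `find_failures` is a hypothetical helper.

```python
import unicodedata

# Hand-written sample in the assumed c1;c2;c3;c4;c5; layout.
SAMPLE = """\
@Part0 # Specific cases
00C5;00C5;0041 030A;00C5;0041 030A; # A WITH RING ABOVE
0075 0308;00FC;0075 0308;00FC;0075 0308; # u + COMBINING DIAERESIS
"""


def parse_field(field: str) -> str:
    """Turn a field like '0041 030A' into the string it encodes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def find_failures(text: str) -> list[str]:
    """Return the data lines where NFC(c1) != c2 or NFD(c1) != c3."""
    failures = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data or data.startswith("@"):
            continue  # skip blank lines, comments, and part headers
        c1, c2, c3 = (parse_field(f) for f in data.split(";")[:3])
        ok = (unicodedata.normalize("NFC", c1) == c2
              and unicodedata.normalize("NFD", c1) == c3)
        if not ok:
            failures.append(data)
    return failures
```

Run against the sample, `find_failures(SAMPLE)` returns an empty list. The same loop could be pointed at the full file to validate a hand-rolled normalizer.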

dstillman · October 23, 2010

As discovered by Simon:

http://www.oxymoronical.com/experiments/apidocs/interface/nsIUnicodeNormalizer

We haven't tested it yet, but we might be able to normalize strings when saving to the database.

fbennett · October 23, 2010

Aha! From Richard Ishida:

http://rishida.net/blog/?p=222

Looks like this may do the trick. (Edit: One more solution into the mix, anyway!)

ajlyon · October 24, 2010

Thanks for looking into this. Normalization, however, only solves the problems of precomposed and non-precomposed letters.

Matching of adorned text (/Entsiklopedii/ matches against "Ėnt︠s︡iklopedii") and matching by base glyph (/Dundar/ matches against "Dundär"), as presented above, is also important. The precomposed problem might still be more urgent, since it is likely to cause hard-to-diagnose problems in sorting and disambiguation.
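Matching by base glyph can be sketched as folding both the query and the stored text before comparing, while leaving the stored text itself untouched. The Python below is an illustration only; `fold` and `search` are hypothetical names, and the titles are shortened versions of the test case earlier in the thread, written with escapes so the combining characters are unambiguous.

```python
import unicodedata


def fold(text: str) -> str:
    """Search key: NFD-decompose, drop combining marks (Mn), ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return bare.casefold()


TITLES = [
    # Adorned: i-breve (U+012D), E-dot (U+0116), ligature halves (U+FE20/FE21)
    "Kazanski\u012d retro-leksikon: \u0116nt\ufe20s\ufe21iklopedii",
    # Unadorned simplification
    "Kazanskii retro-leksikon: Entsiklopedii",
]


def search(query: str, titles: list[str]) -> list[str]:
    """Return the original titles whose folded form contains the folded query."""
    folded_query = fold(query)
    return [title for title in titles if folded_query in fold(title)]
```

Here `search("Entsiklopedii", TITLES)` returns both titles, adorned and unadorned, in their original spelling.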

ajlyon · November 1, 2010

For reference, this has a ticket in Trac: https://www.zotero.org/trac/ticket/865

ajlyon · February 21, 2011

This was just raised again: http://forums.zotero.org/discussion/16534/

This is an issue that will hopefully only get worse-- the multilingual branch is all about encouraging/allowing people to be faithful to original data, so I anticipate seeing more confusion wrought by the two Unicode wrinkles noted in this thread. Maybe we can aim for Zotero 2.2 with this?

ajlyon · February 25, 2011

Mentioned again at http://forums.zotero.org/discussion/16617/

dstillman · December 17, 2011

Another request re: composition: http://forums.zotero.org/discussion/20990/problem-with-umlauts/

On the trunk I've added a function to normalize Unicode strings using Mozilla's built-in NFC function.

Now we just need to actually call it somewhere. As noted on the ticket, I think it makes the most sense to normalize all incoming data in the data layer. Not sure how best to handle existing data, though.

(Also, this doesn't address the original issue on this thread, but we have ways to do that now too.)
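As a rough picture of "normalize incoming data in the data layer" plus a one-time pass over existing data, here is a Python/sqlite3 sketch. Everything in it is an assumption made for illustration: the `item_values` table, `save_value`, and `normalize_existing` are hypothetical and do not reflect Zotero's actual schema or functions.

```python
import sqlite3
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for wherever field values are stored.
conn.execute("CREATE TABLE item_values (id INTEGER PRIMARY KEY, value TEXT)")


def save_value(value: str) -> int:
    """Data-layer entry point: every incoming string is stored as NFC."""
    cur = conn.execute(
        "INSERT INTO item_values (value) VALUES (?)", (nfc(value),)
    )
    return cur.lastrowid


def normalize_existing() -> int:
    """One-time migration: rewrite stored values that are not yet NFC."""
    rows = conn.execute("SELECT id, value FROM item_values").fetchall()
    changed = [(nfc(value), row_id) for row_id, value in rows if nfc(value) != value]
    conn.executemany("UPDATE item_values SET value = ? WHERE id = ?", changed)
    return len(changed)
```

The migration is idempotent, so running it twice changes nothing the second time. It does not attempt to merge rows that become identical after normalization, which a real upgrade step would probably have to consider.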

realtime99 · May 28, 2012

dstillman wrote: "(Also, this doesn't address the original issue on this thread, but we have ways to do that now too.)"

Does this mean that there are ways to do the type of search I described in the first post? If so, is this being worked on? (Not a complaint, just curious). The last update I see related to the issue is 13 months ago at:

https://www.zotero.org/trac/ticket/865

but I'm not sure that will address the search issue.

Thanks for any updates!

takowl · June 10, 2013

+1 on the original point - I too have author citations with accented characters, e.g. Gómez, which I can't easily search for from the LibreOffice plugin, as my keyboard doesn't have an ó key. It would be great if it could recognise what I'm after when I search for 'Gomez'.

jessriedel · July 12, 2014

I would also greatly appreciate this feature.

jirving · December 13, 2014

+1
This is still critical! Ideally, if there was a way we could just convert other characters to the native keyboard equivalent, that would be amazing. I'm having trouble because in some journals an author's name contains no accents, but in other journals it does, so the entries are treated as being by two different authors and searching for the citations is problematic. I'm writing my dissertation now and this would be a godsend.

takowl · December 13, 2014

There are packages for Python and Perl called Unidecode, which are basically just big data tables mapping letter characters to a reasonable ASCII approximation. As far as I know, there is no official standard covering these (unlike Unicode normalisation), and the mapping has just been built by whoever needed it first. Presumably, a JS port of this could be used to build search indices with simplified versions of names.

Here's the Python package, which is just under 200KB compressed:
https://pypi.python.org/pypi/Unidecode

If that's too complex, there is a trick you can do to remove accents from characters: normalise to the decomposed form, and then strip out the combining marks, as determined by Unicode character categories.
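A minimal Python sketch of that trick follows. It drops only combining marks (category Mn) after decomposition, so spaces, digits, and punctuation survive. It also shows where the trick stops working: letters such as "ø" and "ł" have no canonical decomposition, so they pass through unchanged, which is where a Unidecode-style table earns its keep. The tiny `EXTRA` table here is a hypothetical stand-in for such a table, not an excerpt from Unidecode.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Normalize to NFD, then remove combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical fallback for letters that do not decompose.
EXTRA = {"\u00f8": "o", "\u0142": "l", "\u00df": "ss"}  # ø, ł, ß


def to_ascii_key(text: str) -> str:
    """Strip accents, then apply the small fallback table."""
    return "".join(EXTRA.get(ch, ch) for ch in strip_accents(text))
```

For example, `strip_accents("Gómez")` gives "Gomez", while `strip_accents("Søren")` is still "Søren"; `to_ascii_key("Søren")` gives "Soren".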

kaktux · June 25, 2015

+1 from me - I have some authors that are stored under "different" names, because some journals use special characters in the name while others don't.
Also, as I don't have those characters on my keyboard, I only search for the name with a-z letters.

In my case it is, for example, an ř in the name: I search for the name with a plain "r", which doesn't find the spelling with the ř.

As a non-programmer, maybe a too-easy idea:
let Zotero know to use r instead of ř,
or, like on my keyboard, ö = o, ä = a, etc. - so you basically break special characters down to those included in a-z, and you get all contents when searching.

rameloot · July 8, 2016

+1 from me too

Is there an easy fix for this?
It is annoying to miss papers in your own database because of this.

Thanks!

Dani_Bodor · August 23, 2016

+1 for me too.
Even when I know the correct spelling, I often don't know how to type it quickly (and need to try out a bunch of alt codes first), or I start typing from the middle of someone's name (which is what I do now, but it is inconvenient).

Any updates on this issue?

DWL-SDCA · August 23, 2016

Just a comment on how this isn't a simple one-to-one substitution. For example, "ü" -- especially in author names -- can be represented at least 3 ways: ü, u, or ue. The character(s) used for the name of the same person often differ across time and publisher. There are other, more complex character-representation examples. Thus, even if a 1-to-1 substitution were implemented, it would be difficult to ensure that every item would be found. To continue the example, should a search using words or names containing the letter pair "ue" also search for matches containing the character ü?

This is further clouded by the multitude of languages of Zotero users and the base language settings of their computer and perhaps their keyboard. There are languages wherein a decorated basic character and a nondecorated character both exist and both types can appear in the same word.

Know that database designers and name authority committees are well aware of this issue and that there is not yet consensus on the best ways to address the problem.
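One way to picture the "more than one representation" problem is to give each stored name several search keys instead of one. The Python sketch below is only an illustration: the `GERMAN_EXPANSIONS` table and the function names are hypothetical, and it covers just the German-style digraph convention (ü written as "ue").

```python
import unicodedata

# Hypothetical, deliberately tiny table of digraph spellings.
GERMAN_EXPANSIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}  # ä, ö, ü


def strip_marks(text: str) -> str:
    """Drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def search_keys(name: str) -> set[str]:
    """Keys for one name: accents stripped, and umlauts expanded to digraphs."""
    lowered = unicodedata.normalize("NFC", name).lower()
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in lowered)
    return {strip_marks(lowered), strip_marks(expanded)}


def matches(query: str, stored: str) -> bool:
    """True when the query and the stored name share at least one key."""
    return bool(search_keys(query) & search_keys(stored))
```

With these keys, both "Muller" and "Mueller" find a stored "Müller". The reverse direction stays ambiguous: "Muller" does not find a stored "Mueller", because nothing in the plain string says whether "ue" stands for "ü" or is simply "u" followed by "e". Any such expansion also trades precision for recall.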

adamsmith · August 23, 2016

The perfect solution is, indeed, not easy, but a solution that strips diacritics for search purposes (which would go a long way here) is definitely feasible. Not sure where Zotero's plans are with search more generally speaking -- whether they're planning on moving to other search technology would affect the degree to which it makes sense to tweak details (albeit important ones) like this.

DWL-SDCA · August 23, 2016

Know that I was not suggesting that a system to strip diacritical marks or added glyphs would be a bad thing to do.

There are a number of scripts and utilities that do exactly that. I use one of these on my site. A brief warning is justified so that everyone understands that using this when processing a query will affect precision and recall in unknown ways.

Dani_Bodor · August 24, 2016

I wouldn't want to strip/edit the names in the library, but I would like the search function to be a bit more inclusive. This means that when typing something like Muller, it would also find Müller (and perhaps even Mueller, but this one will be tougher, I understand).
Even if this leads to more 'false positives,' it would be preferable to not finding the people you are looking for.

adamsmith · August 24, 2016

Right, stripping them refers to the index only; obviously you want Dr. Müller in the bibliography and your database with the proper umlaut.

DWL-SDCA · May 23, 2017

Here is something else to consider. I've had, over the years, complaints concerning alphabetical placement of not only author names (that have accented characters) but also titles when the initial word is an article (or the like). Our work in this issue may be a bit extreme for Zotero because in my case I'm working with an online database where 2/3 of users do not visit from English language browsers. We are still experimenting with this but where implemented we have had a positive response.

With names, English speakers can be happy with an alphabetical order that simply drops the accents or decorations to the character. However, especially with the first character of a name, some languages do not place these words in alphabetical position as though the accented character had no accent. We are working at a system where a name with decorated characters will have multiple synonymous entries in the author table -- each entry tied to an entry in the language table. We use a localization system and detect the users' language thus we can adjust the alphabetical order of lists based upon the conventions of the user's language. A, Å, À, Ã, Â, etc. will appear appropriately placed.

With titles of items, articles, books, journals we have now at least 2 title fields -- the actual title and a second field for the title without the non-sorting word or abbreviation. We are testing this with localization so that if a searcher arrives with (for example) a German-language browser, a title beginning with "Der" will be placed in order based upon the first character of the second word. The innovation is that users with browsers of other languages will receive a browsable list with the item listed _both_ under the first word (article) _and_ under the second word of the title. Records are included in their original language and in English.
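The two-title-field idea might look roughly like the Python sketch below. It is not the SafetyLit implementation: the `LEADING_ARTICLES` table, its language codes, and both function names are hypothetical, and a real system would need far more complete per-language rules.

```python
# Hypothetical table of non-sorting leading words per language code.
LEADING_ARTICLES = {
    "de": ("der", "die", "das"),
    "en": ("the", "a", "an"),
    "fr": ("le", "la", "les"),
}


def sort_title(title: str, lang: str) -> str:
    """Second title field: the title without a leading non-sorting word."""
    words = title.split(maxsplit=1)
    if len(words) == 2 and words[0].casefold() in LEADING_ARTICLES.get(lang, ()):
        return words[1]
    return title


def browse_entries(title: str, lang: str) -> list[str]:
    """Headings to list an item under: the full title and the article-less one."""
    headings = {title, sort_title(title, lang)}
    return sorted(headings, key=str.casefold)
```

For a German title, `sort_title("Der Zauberberg", "de")` gives "Zauberberg", and `browse_entries("Der Zauberberg", "de")` lists the item under both "Der Zauberberg" and "Zauberberg", which is the dual listing described above for users browsing in another language.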

SafetyLit is designed to give non-experts access to scholarly material from many professional disciplines and nations. For example, a local school board employee who wants information about bicycle safety or bullying. These users will perform a search and then browse the results for items that seem to fit their needs. They are typically quite literate in their own language and have competent English skills. These folks do not seek the help of a librarian and have rudimentary information-seeking skills. I see my job as helping these novices to have access to the most complete literature listing, their naive search confidence notwithstanding.