special character search
Often, names are entered on the web or in metadata with special characters converted to similar English characters, so Zotero sometimes considers two papers by one author to be by different authors.
I can often remember an author's name but not the diacritics of the non-English characters. Currently, if I want to find all papers by one author, my options are either to convert all names to English characters by hand (which gives citations with an incorrect spelling of the name) or to convert all names to the original-language characters by hand (which prevents me from searching successfully).
I would like search to treat an English character as equivalent to similar special characters, to make searching easier. For example, I would like a search for "Noel Carroll" to find articles written by "Noël Carroll", and a search for "Laszlo Halasz" to find articles by "László Halász".
This would help me immensely. Can it be implemented?
Note that differing diacritics or spellings are not necessarily "incorrect": different publications may print different versions of an author's name. I suppose how to cite correctly in this situation will depend on the citation style. This is a case where Zotero has not entirely decoupled data from style.
I run into this frequently with Library of Congress Cyrillic transliterations, which use many diacritical marks that are nigh impossible to enter manually; most people use a diacritical-free simplification. Such simplifications match against the diacritical-adorned versions in bibliographic databases, and it would be great for them to match against adorned entries in Zotero.
My test case:
Adorned: Kazanskiĭ retro-leksikon: pervyĭ opyt rodoslovno-biograficheskoĭ i istoriko-kraevedcheskoĭ Ėnt︠s︡iklopedii
Unadorned: Kazanskii retro-leksikon: pervyi opyt rodoslovno-biograficheskoi i istoriko-kraevedcheskoi Entsiklopedii
I would like a search for Entsiklopedii to match either form; right now it won't, since the characters in the adorned case are E*nt-s-iklopedii, where the * and - represent combining diacriticals that are not available on most (any?) keyboards.
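For anyone curious what those invisible marks actually are, here is a minimal sketch in Python (stdlib unicodedata only) that lists the code points in the adorned prefix and shows why a plain substring search fails:

```python
import unicodedata

# The adorned prefix, written with explicit escapes:
adorned_prefix = "\u0116nt\uFE20s\uFE21i"

for ch in adorned_prefix:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0116 LATIN CAPITAL LETTER E WITH DOT ABOVE
# U+006E LATIN SMALL LETTER N
# U+0074 LATIN SMALL LETTER T
# U+FE20 COMBINING LIGATURE LEFT HALF
# U+0073 LATIN SMALL LETTER S
# U+FE21 COMBINING LIGATURE RIGHT HALF
# U+0069 LATIN SMALL LETTER I

# A naive substring test for the unadorned form fails:
print("Entsi" in adorned_prefix)  # False
```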
When it becomes possible to have multiple versions of a single field (see http://forums.zotero.org/discussion/1798), as Frank has discussed and as the CSL 1.0 processor supports, it should become possible to address alexuw's concern, so long as each version of a name or title can be described by a specific transliteration scheme and that scheme is issued a subtag in the IANA Language Subtag Registry (as Frank and I have already done).
Appropriate resources for handling this might be found in the ICU toolkit (http://userguide.icu-project.org/collation/icu-string-search-service), which is available as an SQLite extension (http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt).
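Absent the ICU extension, the same idea can be approximated at the SQL layer by registering a custom scalar function. A minimal sketch in Python with the stdlib sqlite3 module (the unaccent name and the items table are made up for illustration, not Zotero's schema):

```python
import sqlite3
import unicodedata

def unaccent(s):
    """Decompose to NFD, then drop combining marks (category Mn)."""
    if s is None:
        return None
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

conn = sqlite3.connect(":memory:")
conn.create_function("unaccent", 1, unaccent)
conn.execute("CREATE TABLE items (title TEXT)")
conn.execute("INSERT INTO items VALUES ('László Halász')")

# An accent-insensitive LIKE, applied to both sides of the match:
rows = conn.execute(
    "SELECT title FROM items WHERE unaccent(title) LIKE unaccent(?)",
    ("%Laszlo%",),
).fetchall()
print(rows)  # [('László Halász',)]
```

Calling the function per row defeats any index, so a real implementation would more likely store a pre-normalized shadow column and search against that.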
There must be a table of equivalents in the system somewhere, since the two display with the same glyph in the browser. Not sure how you get at that facility, though, or whether it can be used for reverse mapping to a canonical character code. Looks like a tough one.
http://download.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
Looks like there are about 18,000 of these things. Per-character SQL lookups are a little frightening performance-wise... this is outside my skill envelope. It would certainly be useful to have access to an NFD -> NFC Unicode conversion facility in JS, though, if nothing exists yet and someone can be tempted to work on it. Summer of Code candidate?
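For reference, the composed/decomposed distinction such a facility would bridge, in a minimal Python sketch (the stdlib unicodedata module exposes the normalization forms directly):

```python
import unicodedata

precomposed = "\u00EB"    # "ë" as a single code point
decomposed  = "e\u0308"   # "e" + COMBINING DIAERESIS

# Canonically equivalent, yet not equal as strings:
print(precomposed == decomposed)                                # False

# NFC composes, NFD decomposes; either gives a canonical form to compare on:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```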
http://www.oxymoronical.com/experiments/apidocs/interface/nsIUnicodeNormalizer
We haven't tested it yet, but we might be able to normalize strings when saving to the database.
http://rishida.net/blog/?p=222
Looks like this may do the trick. (Edit: One more solution into the mix, anyway!)
Matching of adorned text (/Entsiklopedii/ matches against "Ėnt︠s︡iklopedii") and matching by base glyph (/Dundar/ matches against "Dundär"), as presented above, is also important. The precomposed problem might still be more urgent, since it is likely to cause hard-to-diagnose problems in sorting and disambiguation.
This is an issue that will (hopefully!) only get worse: the multilingual branch is all about encouraging and allowing people to be faithful to original data, so I anticipate seeing more confusion wrought by the two Unicode wrinkles noted in this thread. Maybe we can aim for Zotero 2.2 with this?
On the trunk I've added a function to normalize Unicode strings using Mozilla's built-in NFC function.
Now we just need to actually call it somewhere. As noted on the ticket, I think it makes the most sense to normalize all incoming data in the data layer. Not sure how best to handle existing data, though.
(Also, this doesn't address the original issue on this thread, but we have ways to do that now too.)
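For the existing data, a one-time pass over the affected columns would do it. A rough sketch in Python (note: the items/itemID/title schema here is invented for illustration and is not Zotero's actual schema):

```python
import sqlite3
import unicodedata

conn = sqlite3.connect("zotero.sqlite")

# Fetch everything first, then update, so we aren't mutating
# the table while a cursor is still iterating over it.
rows = conn.execute("SELECT itemID, title FROM items").fetchall()
for item_id, title in rows:
    if title is None:
        continue
    normalized = unicodedata.normalize("NFC", title)
    if normalized != title:
        conn.execute("UPDATE items SET title = ? WHERE itemID = ?",
                     (normalized, item_id))
conn.commit()
```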
https://www.zotero.org/trac/ticket/865
but I'm not sure that will address the search issue.
Thanks for any updates!
This is still critical! Ideally, if there were a way to just convert other characters to their native-keyboard equivalents, that would be amazing. I'm having trouble because in some journals an author's name contains no accents, while other journals include them, so the entries appear to be by two different authors and searching for the citations is problematic. I'm writing my dissertation now and this would be a godsend.
Here's the Python package, which is just under 200KB compressed:
https://pypi.python.org/pypi/Unidecode
If that's too complex, there is a trick you can do to remove accents from characters: normalise to the decomposed form, and then strip out the combining marks, as determined by Unicode character categories.
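That trick, sketched in Python with the stdlib unicodedata module (no dependency on Unidecode), applied to the examples from earlier in the thread:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose precomposed characters (NFD), then drop combining
    # marks, which fall in the Unicode category "Mn".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")

print(strip_accents("Noël Carroll"))                    # Noel Carroll
print(strip_accents("László Halász"))                   # Laszlo Halasz
print(strip_accents("\u0116nt\uFE20s\uFE21iklopedii"))  # Entsiklopedii
```

Note that this handles decomposable accents only; letters like ł or đ have no canonical decomposition and pass through unchanged, which is where a transliteration table like Unidecode's earns its 200KB.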
Also, since I don't have those characters on my keyboard, I only search for names using the letters a-z.
In my case it is, for example, an ř in the name, while I search for the name with a plain "r", which doesn't match the spelling with the ř.
As a non-programmer, maybe a too-simple idea:
let Zotero know to use r in place of ř,
or, as on my keyboard, ö = o, ä = a, etc., so that special characters are broken down to those included in a-z and searching finds everything.
Is there an easy fix for this?
It is annoying to miss papers in your own database because of this.
Thanks!
Even when I know the correct spelling, I often don't know how to type it quickly (I'd need to try out a bunch of Alt codes first), so I start typing from the middle of someone's name instead (which is what I do now, but it's inconvenient).
Any updates on this issue?
This is further clouded by the multitude of languages among Zotero users and the base language settings of their computers and perhaps their keyboards. There are languages in which a decorated basic character and an undecorated character both exist, and both can appear in the same word.
Know that database designers and name-authority committees are well aware of this issue, and that there is not yet consensus on the best ways to address the problem.
There are a number of scripts and utilities that do exactly that; I use one of them on my site. A brief warning is justified, so that everyone understands that using such a conversion when processing a query will affect precision and recall in unknown ways.
Even if this leads to more 'false positives,' it would be preferable to not finding the people you are looking for.
With names, English speakers can be happy with an alphabetical order that simply drops the accents or decorations from the characters. However, especially with the first character of a name, some languages do not alphabetize these words as though the accented character had no accent. We are working on a system where a name with decorated characters will have multiple synonymous entries in the author table, each entry tied to an entry in the language table. We use a localization system and detect the user's language, so we can adjust the alphabetical order of lists based upon the conventions of the user's language. A, Å, À, Ã, Â, etc. will appear appropriately placed.
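The same effect can be had from locale-aware collation. A minimal sketch in Python using the stdlib locale module (this assumes the named locales are installed; the ordering shown is what a glibc system gives):

```python
import locale

names = ["Ölander", "Olsson", "Ångström", "Andersson"]

# English collation treats accents as secondary: Å sorts with A, Ö with O.
locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
print(sorted(names, key=locale.strxfrm))
# ['Andersson', 'Ångström', 'Ölander', 'Olsson']

# Swedish collation treats Å and Ö as distinct letters sorted after Z.
locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
print(sorted(names, key=locale.strxfrm))
# ['Andersson', 'Olsson', 'Ångström', 'Ölander']
```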
With titles of items, articles, books, and journals, we now have at least two title fields: the actual title and a second field for the title without the non-sorting word or abbreviation. We are testing this with localization, so that if a searcher arrives with (for example) a German-language browser, a title beginning with "Der" will be placed in order based upon the first character of the second word. The innovation is that users with browsers in other languages will receive a browsable list with the item listed _both_ under the first word (the article) _and_ under the second word of the title. Records are included in their original language and in English.
SafetyLit is designed to give non-experts access to scholarly material from many professional disciplines and nations. For example, a local school board employee who wants information about bicycle safety or bullying. These users will perform a search and then browse the results for items that seem to fit their needs. They are typically quite literate in their own language and have competent English skills. These folks do not seek the help of a librarian and have rudimentary information-seeking skills. I see my job as helping these novices to have access to the most complete literature listing, their naive search confidence notwithstanding.