I'm currently trying to make Solr index a lot of library data. This library data, for example, contains author names spelled in different ways and with local letters (such as ä, ü, ø, ö, etc.). I'd like my users to be able to search for Østersøen and get results such as Österssöen and Østersøen.
My question is: how do I achieve this with Solr? It seems to me that mappings won't cut it, since I'd like one character to be able to count as several others:
u -> u, ü, ù, ú
å -> å, aa
ø -> ø, ö, o
but also the other way around (with some of them), so that
aa -> å
Is this possible, and if so how?
Look at the tips here. Basically there are two things to do:
proper stemming/filters depending on the language
ASCIIFoldingFilterFactory/ICUFoldingFilterFactory
You want Unicode folding (ICUFoldingFilterFactory), which does all the normalization.
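To make the effect concrete, here is a minimal Python sketch of what folding does conceptually; the character table is a tiny illustrative subset of my own, not Solr's actual ICU data. Both indexed text and query get folded to the same form, which is why variant spellings start matching (the remaining s/ss difference in the example is a spelling variation that folding alone cannot bridge, which is where the stemming/filters point above comes in).

```python
import unicodedata

# Tiny illustrative subset of what ICU folding covers; the real ICU
# tables handle far more characters (this is NOT Solr's mapping data).
EXTRA_FOLDS = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"}

def fold(text: str) -> str:
    # Replace characters that have no Unicode decomposition (e.g. ø).
    text = "".join(EXTRA_FOLDS.get(ch, ch) for ch in text)
    # Decompose accented characters (ö -> o + combining diaeresis),
    # drop the combining marks, then lowercase.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(fold("Østersøen"))    # ostersoen
print(fold("Österssöen"))   # osterssoen
```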
You also want to read through the whole 12-post series on using Solr in a library setting, with CJK issues as a focus. It will probably answer questions you don't even have yet. Finally, if you haven't looked at Project Blacklight yet, it is worth doing so. It's a community of people using Solr specifically for libraries, and they may have already hit the same problems and found solutions.
I have a list of product names, written in a mixture of English letters, numbers, and Chinese characters, stored in my database.
There is a table called products with the fields name_en and name_zh, among others.
E.g.
AB 10"机翼
Peter Norvig has a fantastic algorithm for spell checking, but it only works for English.
I was wondering if there's a way to do something similar for a narrow list of terms containing Chinese characters?
E.g. misspellings such as
A10机翼
AB 10鸡翼
AB 10鸡一
AB 10木几翼
should all suggest AB 10"机翼 as the correct spelling.
How do I do this?
You have a much more complex problem than Norvig's:
Chinese Input-method
The misspellings in your case (at least in your example) are mostly caused by the pinyin input method. The same typed input, "jiyi" (English: airplane wing), can lead to different Chinese phrases (see the sketch after the list):
机翼
鸡翼
鸡一
几翼
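A quick way to see why these collide is to romanize them; here is a small sketch using the third-party pypinyin package (my choice purely for illustration, any pinyin library would do):

```python
# pip install pypinyin  (third-party package, used here only for illustration)
from pypinyin import lazy_pinyin

for phrase in ["机翼", "鸡翼", "鸡一", "几翼"]:
    print(phrase, "".join(lazy_pinyin(phrase)))
# All four romanize to "jiyi", which is why the pinyin input method can
# produce any of them from the same keystrokes.
```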
Chinese Segmentation
Also, in Chinese, to break a long sentence up into small tokens with semantic meaning, you need to do segmentation (a sketch follows the example). For example:
飞机模型零件 -> before segmentation
飞机-模型-零件 -> after segmentation, you get three phrases separated by '-'.
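With the third-party jieba segmenter, for instance (an assumption on my part; any Chinese word segmenter would work), that looks roughly like this:

```python
# pip install jieba  (third-party Chinese segmentation library)
import jieba

print("-".join(jieba.cut("飞机模型零件")))
# Expected output (exact segmentation may vary with the dictionary used):
# 飞机-模型-零件
```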
Work at the token level
You can probably experiment starting from a list of misspellings; I guess you can collect a bunch of them from your user logs. Take one misspelling at a time, using your example:
AB 10鸡翼
First break it into tokens:
A-B-10-鸡翼
(here you probably need a Chinese segmentation algorithm to realize that 鸡翼 should be treated as a single token).
Then you should try to find its nearest neighbor in your product DB using the edit-distance idea (a sketch follows the notes below). Note that:
you do not remove/edit/replace one character at a time, but remove/edit/replace one token at a time.
when editing/replacing, we should limit our candidates to the near neighbors of the original token. For example, 鸡翼 -> 机翼, 几翼, 机一
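Here is a minimal sketch of that token-level edit distance. The same_pinyin helper and the token lists are illustrative assumptions of mine, not a full implementation of the neighbor check:

```python
def token_edit_distance(src, dst, is_near_neighbor):
    """Levenshtein distance computed over tokens instead of characters.

    A substitution costs 1 only if the two tokens are near neighbors
    (e.g. same pinyin); otherwise it costs 2, so unrelated swaps are
    discouraged.
    """
    m, n = len(src), len(dst)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if src[i - 1] == dst[j - 1]:
                sub = 0
            elif is_near_neighbor(src[i - 1], dst[j - 1]):
                sub = 1
            else:
                sub = 2
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a token
                           dp[i][j - 1] + 1,        # insert a token
                           dp[i - 1][j - 1] + sub)  # substitute a token
    return dp[m][n]

# Illustrative stand-in: pretend these tokens share the same pinyin.
def same_pinyin(a, b):
    return {a, b} <= {"鸡翼", "机翼", "几翼", "机一"}

query   = ["A", "B", "10", "鸡翼"]    # tokens of the misspelled input
product = ["AB", "10\"", "机翼"]      # tokens of a correct product name
print(token_edit_distance(query, product, same_pinyin))
```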
Build Lucene index
You can also try to tackle the problem in a different way, starting from your correct product names. Treat each product name as a document and pre-build a Lucene index from them. Then, for each user query, the query-matching problem is converted into a search problem: we issue a query to the search engine to find the best-matching documents in our DB. In this case, I believe Lucene would probably take care of segmentation and tokenization for you (if not, you would need to extend its functionality to suit your own needs).
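Stripped to its essentials (a toy Python stand-in for what Lucene does, assuming names have already been segmented into tokens), the idea is just: index the correct names, then score each one by overlap with the query tokens:

```python
from collections import defaultdict

# Pre-segmented correct product names (segmentation assumed done elsewhere).
PRODUCTS = {
    'AB 10"机翼': ["AB", '10"', "机翼"],
    "飞机模型零件": ["飞机", "模型", "零件"],
}

# Tiny inverted index: token -> names containing that token.
index = defaultdict(set)
for name, tokens in PRODUCTS.items():
    for tok in tokens:
        index[tok].add(name)

def search(query_tokens):
    scores = defaultdict(int)
    for tok in query_tokens:
        for name in index.get(tok, ()):
            scores[name] += 1          # crude token-overlap score
    return sorted(scores, key=scores.get, reverse=True)

print(search(["AB", "10", "鸡翼"]))    # 'AB 10"机翼' is the only/best hit
```

A real Lucene setup adds fuzzy matching, proper scoring, and (with the right analyzer) the segmentation itself; the near-neighbor expansion described above can be layered on top as query-time synonyms.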
We are working on expanding an implementation of Solr to international markets. The mapping-ISOLatin1Accent.txt file only supports one mapping per accented character, for example ä => a. However, we want to map ä to both a and ae. Is there a way to map one accented character to multiple non-accented representations in the existing ISO mapping, or do we need a custom mapper?
Thanks
This feels a little bit like reinventing the wheel. If you are planning to deal with international markets, switch from the simple mapping file to ICU Unicode mapping.
Solr has full support for Unicode normalization, decomposition, and mapping components (search for the ones whose names start with ICU). It takes a little bit of exploring, but it is well worth it.
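If you genuinely need one character to yield several searchable variants (ä matching both a and ae), one pragmatic approach, sketched below as an idea rather than an existing Solr filter, is to expand each token into all of its folded spellings at index time, much as a synonym filter would:

```python
from itertools import product

# Illustrative mapping of my own: each special character expands
# to several acceptable transliterations.
VARIANTS = {"ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue"), "ß": ("ss",)}

def expand(token):
    """Return every spelling of `token` produced by the variant table."""
    choices = [VARIANTS.get(ch, (ch,)) for ch in token.lower()]
    return {"".join(combo) for combo in product(*choices)}

print(sorted(expand("jäger")))   # ['jaeger', 'jager']
```

In Solr terms you would do this in a custom token filter, or by pre-expanding the values into a separate search field, so that a query for either variant matches the document.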
I'm implementing an alphabetic search based on a telephone keypad (the standard phone keypad layout).
When the user types, say, 2, I get {A, B, C} as the combinations. When the user types 23, I get {AD, AE, AF, BD, BE, BF, CD, CE, CF}, and so on. If I keep typing and building combinations, I get thousands of combinations, which makes the search process quite slow. So now I want to implement an algorithm that deletes illogical combinations like CF, BD, CD; logically, no one's name starts with these, perhaps because they are two consonants without a vowel. This way I want to narrow down my search. Does anyone know of such a state machine, implemented in C?
You could build a trie of valid prefixes based on the dataset you're searching. Matching partial inputs against that should be pretty easy.
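A minimal sketch of that idea (in Python for brevity; a C version would follow the same shape with an explicit node struct): build the trie once from your dataset, then prune any keypad combination whose prefix never occurs in it.

```python
# Build a trie of valid name prefixes, then prune keypad combinations
# whose prefixes never occur in the dataset.
def build_trie(names):
    root = {}
    for name in names:
        node = root
        for ch in name.upper():
            node = node.setdefault(ch, {})
    return root

def is_valid_prefix(trie, prefix):
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return False
    return True

KEYPAD = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}

def expand(digits, trie):
    """Expand digits to letter combinations, pruning invalid prefixes early."""
    prefixes = [""]
    for d in digits:
        prefixes = [p + ch for p in prefixes for ch in KEYPAD[d]
                    if is_valid_prefix(trie, p + ch)]
    return prefixes

trie = build_trie(["ADAM", "BETTY", "CECIL"])   # your actual name dataset
print(expand("23", trie))                       # ['AD', 'BE', 'CE']
```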
Keep in mind that when it comes to linguistic data "illogical" is not a good proxy for "unlikely." This is particularly true when it comes to names. As an example, according to a standard definition of "consonant" in English, my last name starts with four consonants. If it were to be written after a German fashion, it would start with five. When thinking about such issues it is useful to keep in mind that:
Sounds are not letters, and letters are not sounds: in most orthographic systems, the mapping of letters to sounds is not 1:1
Many languages have unexpected syllabic nuclei: Tamazight Berber, for instance, allows syllables where the sound m plays the role of the syllabic nucleus, as vowels generally do in English. So a Berber name can look like CCmC (where C stands for a consonant) and be perfectly fine in that language. It is not unlikely that a person of Berber origin would then use a similar orthography in English, which a naive system would rule out as "illogical"
Finally, many systems for writing foreign names and words in English use digraphs or trigraphs (two-letter and three-letter combinations) to represent the sounds of the foreign language: this can create what look like illicit consonant clusters. We know that English does this itself (sh represents one sound, see point 1), but it is particularly common when transcribing foreign words.
So unless you know very well the orthographic rules for the names you are expecting, you are likely to rule out legitimate names using a naive system.
I'm wondering, is there an algorithm or a library which helps me identify the components of an English sentence that have no meaning, e.g., a very serious grammar error? If so, could you explain how it works, because I would really like to implement that or use it for my own projects.
Here's a random example:
In the sentence: "I closed so etc page hello the door."
As a human, we can quickly identify that [so etc page hello] does not make any sense. Is it possible for a machine to point out that the string does not make any sense and also contains grammar errors?
If there's such a solution, how precise can that be? Is it possible, for example, given a clip of an English sentence, the algorithm returns a measure, indicating how meaningful, or correct that clip is? Thank you very much!
PS: I've looked at CMU's Link Grammar as well as the NLTK library. But I'm still not sure how to use, for example, the Link Grammar parser to do what I'd like to do: if the parser doesn't accept the sentence, I don't know how to tweak it to tell me which part is not right. And I'm not sure whether NLTK supports that.
Another thought I had towards solving the problem is to look at the frequencies of word combinations, since I'm currently interested in correcting only very serious errors. I would define a "serious error" as a case where the words in a clip of a sentence are rarely used together, i.e., the frequency of the combination is much lower than that of the other combinations in the sentence.
For instance, in the above example, the four words [so etc page hello] really seldom occur together. One intuition behind my idea comes from the fact that when I type such a combination into Google, no related results come up. So is there any library that provides me with such frequency information, the way Google does? Such frequencies may give a good hint about the correctness of a word combination.
I think that what you are looking for is a language model. A language model assigns a probability to each sentence of k words appearing in your language. The simplest kind of language models are n-gram models: given the first i words of your sentence, the probability of observing the (i+1)-th word depends only on the n-1 previous words.
For example, for a bigram model (n=2), the probability of the sentence w1 w2 ... wk is equal to
P(w1 ... wk) = P(w1) P(w2 | w1) ... P(wk | w(k-1)).
To compute the probability P(wi | w(i-1)), you just have to count the number of occurrences of the bigram w(i-1) wi and of the word w(i-1) in a large corpus, and take their ratio: P(wi | w(i-1)) = count(w(i-1) wi) / count(w(i-1)).
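Here is a minimal bigram-model sketch (the toy corpus and the add-one smoothing are illustrative choices of mine) that scores a sentence exactly as in the formula above; a rare combination like "so etc page hello" ends up with a much lower score than an ordinary word sequence:

```python
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams, vocab_size):
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        # P(cur | prev) with add-one smoothing so unseen bigrams are not zero
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

corpus = ["I closed the door", "I opened the door", "she closed the window"]
uni, bi = train_bigram(corpus)
V = len(uni)

print(sentence_prob("I closed the door", uni, bi, V))                    # relatively high
print(sentence_prob("I closed so etc page hello the door", uni, bi, V))  # orders of magnitude lower
```

In practice you would train on a large corpus (and use a higher-order model with better smoothing), then flag clips whose per-word probability falls far below the rest of the sentence.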
Here is a good tutorial paper on the subject: A Bit of Progress in Language Modeling, by Joshua Goodman.
Yes, such things exist.
You can read about it on Wikipedia.
You can also read about some of the precision issues here.
As far as determining which part is not right after determining the sentence has a grammar issue, that is largely impossible without knowing the author's intended meaning. Take, for example, "Over their, dead bodies" and "Over there dead bodies". Both are incorrect, and could be fixed either by adding/removing the comma or swapping their/there. However, these result in very different meanings (yes, the second one would not be a complete sentence, but it would be acceptable/understandable in context).
Spell checking works because there are a limited number of words against which you can check a word to determine if it is valid (spelled correctly). However, there are infinite sentences that can be constructed, with infinite meanings, so there is no way to correct a poorly written sentence without knowing what the meaning behind it is.
I think what you are looking for is a well-established library that can process natural language and extract the meanings.
Unfortunately, there's no such library. Natural language processing, as you probably can imagine, is not an easy task. It is still a very active research field. There are many algorithms and methods in understanding natural language, but to my knowledge, most of them only work well for specific applications or words of specific types.
And those libraries, such as the CMU one, seem to be still quite rudimentary. They can't do what you want (like identifying errors in an English sentence) out of the box. You have to develop an algorithm to do that using the tools they provide (such as a sentence parser).
If you want to learn about it, check out ai-class.com. They have some sections that talk about processing language and words.
Let's say I've got a database full of music artists. Consider the following artists:
The Beatles -
"The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.
Beyoncé -
We need to allow the user to search for "Beyonce" (without the diacritic mark) and get the proper results back. No user is going to know how, or take the time, to type the special diacritical character on the last "e" when searching, yet we obviously want to display it correctly when we need to output it.
What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.
The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.
You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.
Generally, you transform the title into a normalized form, "Beatles, The", keep it that way, and then sort on it.
You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules
For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
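One common compromise, sketched here as an idea rather than a prescription, is to store only the official display name and derive the sort and search forms from it automatically (as computed columns or at write time), so the extra fields cost nothing to maintain by hand:

```python
import unicodedata

LEADING_ARTICLES = ("The ", "A ", "An ")   # extend per the ALA filing rules as needed

def sort_name(name):
    """'The Beatles' -> 'Beatles, The' (leading article moved to the end)."""
    for article in LEADING_ARTICLES:
        if name.startswith(article):
            return f"{name[len(article):]}, {article.strip()}"
    return name

def search_name(name):
    """'Beyoncé' -> 'beyonce' (diacritics stripped, lowercased)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

for artist in ["The Beatles", "Beyoncé"]:
    print(artist, "|", sort_name(artist), "|", search_name(artist))
# The Beatles | Beatles, The | the beatles
# Beyoncé | Beyoncé | beyonce
```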
You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html
http://www.siao2.com/2005/02/19/376617.aspx