Forgive me for the lack of official phrasing; this is a problem given orally in class, as opposed to being written in a problem set. Using the English alphabet with no spaces, commas, periods, etc. (and thus only twenty-six possible letters), how many possible strings of fifty characters contain the combination "Johndoe" at some location in the string?
This is more suited to something like the math or stats Stack Exchange. Having said that, there are at most 26^(50-7)*(50-7+1) = 44*26^43 combinations. To see why, ask yourself: how many 50-letter strings over the 26 letters exist? Now we reduce this set by adding the restriction that a 7-letter contiguous word must exist within any candidate string. This has the effect of "fixing" 7 letters and making them unable to vary, leaving 43 free letters. However, we can place this 7-letter string anywhere, and there are 44 positions to place it ("johndoe" at position 0, "johndoe" at position 1, all the way to position 43, since "johndoe" will not fit at position 44). This is an upper bound rather than the exact count, because a string that contains "johndoe" more than once is counted once per occurrence.
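Counting "placements times free letters" counts a string once for every occurrence of the word it contains, so strings with two or more occurrences are overcounted. Because "johndoe" cannot overlap itself, inclusion-exclusion over sets of disjoint placements gives an exact count. The following is a minimal sketch under that assumption; the function name `count_containing` is made up for illustration.

```python
from math import comb

def count_containing(n, word_len, alphabet=26):
    """Count length-n strings that contain a fixed word of length word_len.

    Assumes the word cannot overlap itself (true for "johndoe"), so any
    set of occurrences consists of disjoint blocks.
    """
    total = 0
    k = 1
    while k * word_len <= n:
        # Ways to place k disjoint blocks of length word_len in n slots.
        placements = comb(n - k * (word_len - 1), k)
        # The remaining n - k*word_len letters are free.
        free = alphabet ** (n - k * word_len)
        total += (-1) ** (k + 1) * placements * free
        k += 1
    return total

exact = count_containing(50, 7)
upper_bound = 44 * 26 ** 43  # the first inclusion-exclusion term
print(exact, upper_bound - exact)
```

The first term of the sum is the 44*26^43 figure; the later terms subtract and add back the strings that were counted more than once.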
I have a list of product names written in mixture of English letters and numbers and Chinese characters stored in my database.
There is a table called products with the fields name_en, name_zh amongst others.
AB 10"机翼
Peter Norvig has a fantastic algorithm for spell check but it only works for English.
I was wondering if there's a way to do something similar for a narrow list of terms containing Chinese characters?
E.g. misspellings such as
AB 10鸡翼
AB 10鸡一
AB 10木几翼
all will prompt AB 10"机翼 as the correct spelling
How do I do this?
You have a much more complex problem than Norvig's:
Chinese Input-method
The misspellings in your case (at least in your example) are mostly caused by the pinyin input method. The same typed input "jiyi" (English: airplane wings) can lead to several different Chinese phrases.
Chinese Segmentation
Also in Chinese to break up a long sentence into small tokens with semantic meaning, you would need to do segmentation. For example:
飞机模型零件 -> Before segmentation
飞机-模型-零件 -> After segmentation: you get three phrases separated by '-'.
Work on the token-level
You probably can experiment starting from a list of mis-spellings. I guess you can collect a bunch of them from your user logs. Take out one misspelling at a time, using your example:
AB 10鸡翼
First break it into tokens:
(here you probably need a Chinese segmentation algorithm to realize that 鸡翼 should be treated together).
Then you should try to find its nearest neighbor in your product db using the edit distance idea. Note that:
you do not remove/edit/replace one character at a time, but remove/edit/replace one token at a time.
when edit/replace, we should limit our candidates to be those near neighbors of the original token. For example, 鸡翼 -> 机翼,几翼,机一
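The token-level idea above could be sketched roughly as follows. This assumes the query and the product names have already been segmented into token lists, and the `CONFUSABLE` table is a hypothetical stand-in for whatever near-neighbour data you build from pinyin or user logs. Instead of strictly limiting candidates, this variant simply makes a swap between near neighbours cheaper than an arbitrary replacement.

```python
# Hypothetical confusion sets: tokens an input method might produce for
# the same keystrokes. In practice this table would be built from data.
CONFUSABLE = [
    {"机翼", "鸡翼", "几翼", "机一"},
]

def near(a, b):
    """True if two tokens belong to the same confusion set."""
    return any(a in group and b in group for group in CONFUSABLE)

def token_distance(query, name):
    """Levenshtein distance over token lists (not characters).

    Replacing a token by a near neighbour costs 0.5, any other
    insert, delete or replace costs 1.
    """
    rows, cols = len(query) + 1, len(name) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = float(i)
    for j in range(cols):
        dist[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = query[i - 1], name[j - 1]
            sub = 0.0 if a == b else (0.5 if near(a, b) else 1.0)
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete a token
                             dist[i][j - 1] + 1,        # insert a token
                             dist[i - 1][j - 1] + sub)  # keep or replace
    return dist[-1][-1]

def best_match(query, product_names):
    """Return the product name (token list) closest to the query."""
    return min(product_names, key=lambda name: token_distance(query, name))

products = [["AB", "10", "机翼"], ["AB", "10", "模型"]]
print(best_match(["AB", "10", "鸡翼"], products))
```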
Build Lucene index
You can also try to tackle the problem in a different way, starting from your correct product names. Treat each product name as a document and pre-build a Lucene index from them. Then, for each user query, the query-matching problem becomes a search problem in which we issue a query to the search engine to find the best-matching documents in our db. In this case, I believe Lucene would probably take care of the segmentation and tokenization for you (if not, you would need to extend its functionality to suit your own needs).
I'm looking for a hint towards a solution of the problem:
Suppose there's an array with some numbers in ascending order and some in descending, for example [1,2,5,9,6,3,2,4,7,8] has sequences asc [1,2,5,9], desc [(9),6,3,2], asc [(2),4,7,8].
Now this isn't a problem: I could simply loop through the array and add the numbers to some data structure, and when the direction changes, store this structure somewhere and start filling the next one.
What I've found tricky is if I want to have threshold of some sort. For example: [0,50,100,99,98,97,105,160]
So the sequence in descending order [(100), 99, 98, 97] could be neglected, because overall change is -3, whereas the sequence was increasing much more dramatically (+100) and as a result, the algorithm identifies only one sequence in ascending order.
I have tried the same method as above, simply adding all sequences to a data structure and then comparing the change in values of two consecutive items (100 vs -3 means -3 can be neglected). But then the problem is if I have, say, this situation:
(example only in change of values from start to end of sequence)
[+100, -3, +1, -50]
in this situation I cannot neglect descending movement, because the numbers start to descend, then slightly ascend and again go down pretty significantly.
and it gets really confusing with stuff like that:
[+100, -3, +3, -3, +3, -50]
this is quick sketch of representation of what I am trying to achieve:
black lines represent initial data in an array, red thin lines are desired resulting output
Could somebody point me out in right direction? How would I approach this situation? Compare multiple sequences at a time slowly combining sequences together? Maybe I would need to go through sequences multiple times?
I'm not sure if I've come across a problem like this before, and I don't know of a working algorithm. This is a problem I've faced myself while trying to analyse some data.
If I understand correctly, you expect your curve to be a succession of alternatively increasing and decreasing sequences, with a bit of added noise.
The usual way to get rid of noise is to filter data. There are millions of ways to do that, most of them requiring frequency analysis, but in your case you could probably get good enough results with something simple.
The main point is that the relevant variable is not the values in the array, but their variations.
Given N values, consider the array of N-1 elements holding the differences between two consecutive values.
[0,50,100,99,98,97,105,160] -> 50,50,-1,-1,-1,8,55
Now eliminate all values whose absolute value is below a given threshold (say 10 for instance)
-> 50,50,0,0,0,0,55
you can then detect a rising sequence by looking at streaks of all positive or null values (and the same for decreasing sequences, considering zero or negative values).
As for all filtering processes, you will have to find a sweet spot for your threshold. Too low and it will fail to eliminate insignificant variations, too high and it will wipe out significant slope inversions.
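A minimal sketch of this difference-and-threshold filter might look like the following. The function name `trend_segments` and the `(direction, start_index, end_index)` output format are assumptions made for illustration. Note that it filters each step on its own, so a long run of small steps in one direction would also be zeroed out.

```python
def trend_segments(values, threshold):
    """Split values into rising/falling runs, ignoring small steps.

    Returns a list of (direction, start_index, end_index) tuples where
    direction is +1 for a rising run and -1 for a falling run.
    """
    # Differences between consecutive values (N-1 elements).
    diffs = [b - a for a, b in zip(values, values[1:])]
    # Zero out every step whose magnitude is below the threshold.
    filtered = [d if abs(d) >= threshold else 0 for d in diffs]

    segments = []
    direction = 0  # 0 means "no significant step seen yet"
    start = 0
    for i, d in enumerate(filtered):
        sign = (d > 0) - (d < 0)
        if sign == 0 or sign == direction:
            continue  # zero steps extend whatever run is in progress
        if direction != 0:
            # Direction flipped: the current run ends at values[i].
            segments.append((direction, start, i))
            start = i
        direction = sign
    if direction != 0:
        segments.append((direction, start, len(values) - 1))
    return segments

print(trend_segments([0, 50, 100, 99, 98, 97, 105, 160], 10))
print(trend_segments([1, 2, 5, 9, 6, 3, 2, 4, 7, 8], 1))
```

With a threshold of 10 the first array comes out as a single rising run, and with a threshold of 1 the second array splits into the three runs given in the question.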
I don't know if I understand your problem correctly, but I had to do this kind of dimensionality reduction many times before, so I wrote a small javascript library to do so. It uses the Perceptually Important Points algorithm.
In the algorithm you can define a custom metric of the distance between three consecutive points (to measure how much a single point adds in entropy).
Here is a demonstration (in JS). It works kind of like a heap, where you remove points that do not contribute much to the overall entropy:
for(var i=0; i<data.length; i++)
while(heap.minValue() < threshold)
And here is the library.
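For readers without access to the library, here is a rough Python sketch of the same general idea. It is not the library's implementation: the importance metric (vertical distance of a point from the straight line joining its two neighbours) is just one possible choice, the name `simplify` is made up, and a plain linear scan stands in for the heap to keep the code short.

```python
def simplify(points, threshold):
    """Drop the least important interior points until all remaining
    ones have importance >= threshold.

    points: list of (x, y) tuples with strictly increasing x.
    """
    pts = list(points)

    def importance(i):
        (x0, y0), (x1, y1), (x2, y2) = pts[i - 1], pts[i], pts[i + 1]
        # Height of the chord between the two neighbours at x1.
        y_on_chord = y0 + (y2 - y0) * (x1 - x0) / (x2 - x0)
        return abs(y1 - y_on_chord)

    while len(pts) > 2:
        # Find the interior point that contributes the least.
        i = min(range(1, len(pts) - 1), key=importance)
        if importance(i) >= threshold:
            break
        del pts[i]
    return pts

data = [0, 50, 100, 99, 98, 97, 105, 160]
print(simplify(list(enumerate(data)), 20))
```

The first and last points are always kept; a real implementation would keep the importances in a heap and update only the neighbours of each removed point.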
I want a quick and dirty way of determining what language the user is writing in. I know that there is a Google API which will detect the difference between French and Spanish (even though they both use mostly the same alphabet), but I don't want the latency. Essentially, I know that the Latin alphabet has a lot of confusion as to what language it is using. Other alphabets, however, don't. For example, if there is a character using hiragana (part of the Japanese writing system) there is no confusion as to the language. Therefore, I don't need to ask Google.
Therefore, I would like to be able to do something simple like say that שלום uses the Hebrew alphabet and こんにちは uses Japanese characters. How do I get that alphabet string?
"Bonjour", "Hello", etc. should return "Latin" or "English" (Then I'll ask Google for the real language). "こんにちは" should return "Hiragana" or "Japanese". "שלום" should return "Hebrew".
I'd suggest looking at the Unicode "Script" property. The latest database can be found here.
For a quick and dirty implementation, I'd try scanning all of the characters in the target text and looking up the script name for each one. Pick whichever script has the most characters.
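As a rough sketch of this "count the scripts" approach: Python's standard `unicodedata` module does not expose the Script property directly, so the code below uses the first word of each character's Unicode name (for example "HEBREW" in "HEBREW LETTER SHIN") as an approximation. That shortcut is an assumption of this example, not the real Script lookup, and it will mislabel some characters.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the most common script in text.

    Approximates the Unicode Script property with the first word of
    each letter's character name. Returns None if no letters are found.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

for sample in ["Bonjour", "Hello", "こんにちは", "שלום"]:
    print(sample, dominant_script(sample))
```

For an exact answer you would parse the Scripts.txt file from the Unicode Character Database instead of relying on character names.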
Use an N-gram model and then give a sufficiently large set of training data. A full example describing this technique is to be found at this page, among others:
Although the article assumes you are implementing in PHP and by "language" you mean something like English, Italian, etc... the description may be implemented in C if you require this, and instead of using "language" as in English, etc. for the training, just use your notion of "alphabet" for the training. For example, look at all of your "Latin alphabet" strings together and consider their n-grams for n=2:
Bonjour: "Bo", "on", "nj", "jo", "ou", "ur"
Hello: "He", "el", "ll", "lo"
With enough training data, you will discover dominant combinations that are likely for all Latin text, for example, perhaps "Bo" and "el" are quite probable for text written in the "Latin alphabet". Likewise, these combinations are probably quite rare in text that is written in the "Hiragana alphabet". Similar discoveries will be made with any other alphabet classification for which you can provide sufficient training data.
An n-gram model of this kind is a Markov chain (hidden Markov models are a related but different technique); searching for these keywords will give more ideas for implementation. For "quick and dirty" I would use n=2 and gather just enough training data such that the least common letter from each alphabet is encountered at least once, e.g. at least one 'z' and at least one 'ぅ' (the small hiragana u).
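To make the bigram idea concrete, here is a small sketch. The training strings are toy data invented for the example, and the add-one smoothing is one common way (an assumption here, not something the linked article is known to use) of handling bigrams that never appeared in training.

```python
import math
from collections import Counter

def bigrams(text):
    """All overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """samples: dict mapping a label to a list of example strings.
    Returns a dict mapping each label to a Counter of bigram counts."""
    return {label: Counter(g for s in strings for g in bigrams(s))
            for label, strings in samples.items()}

def classify(text, models):
    """Return the label whose bigram counts make text most likely."""
    def score(counts):
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 leaves room for unseen bigrams
        return sum(math.log((counts[g] + 1) / (total + vocab))
                   for g in bigrams(text))
    return max(models, key=lambda label: score(models[label]))

# Toy training data, far too small for real use.
models = train({
    "Latin": ["Bonjour", "Hello"],
    "Hiragana": ["こんにちは", "ありがとう"],
})
print(bigrams("Bonjour"))
print(classify("Hellou", models), classify("こんにち", models))
```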
For a simpler solution than N-grams, use only basic statistical tests (min, max and average) to compare your Input (a string given by the user) with an Alphabet (a string of all characters in one of the alphabets you are interested in).
Step 1. Place all the numerical values of the Alphabet (e.g. utf8 codes) in an array. For example, if the Alphabet to be tested against is "Basic Latin", make an array DEF := {32, 33, 34, ..., 122}.
Step 2. Place all the numerical values of the Input into an array, for example, make an array INP := {73, 102, 32, ...}.
Step 3. Calculate a score for the input based on INP and DEF. If INP really comes from the same alphabet as DEF, then I would expect the following statements to be true:
min(INP) >= min(DEF)
max(INP) <= max(DEF)
avg(INP) - avg(DEF) < EPS, where EPS is a suitable constant
If all statements are true, the score should be close to 1.0. If all are false, the score should be close to 0.0. After this "Score" routine is defined, all that's left is to repeat it on each alphabet you are interested in and choose the one which gives the highest score for a given Input.
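The three steps could be sketched as below. The score here is simply the fraction of the three statements that hold, the average test uses the absolute difference, and the code-point ranges and the EPS value of 50 are illustrative assumptions rather than tuned values.

```python
def alphabet_score(text, alphabet_codes, eps):
    """Fraction of the min/max/average checks that text passes
    against the code points of one alphabet (DEF in the steps above)."""
    inp = [ord(ch) for ch in text]  # INP in the steps above
    if not inp:
        return 0.0
    avg_inp = sum(inp) / len(inp)
    avg_def = sum(alphabet_codes) / len(alphabet_codes)
    checks = [
        min(inp) >= min(alphabet_codes),
        max(inp) <= max(alphabet_codes),
        abs(avg_inp - avg_def) < eps,
    ]
    return sum(checks) / len(checks)

# Illustrative ranges: the "Basic Latin" example from the text, and the
# Hiragana letters U+3041..U+3096.
ALPHABETS = {
    "Basic Latin": list(range(32, 123)),
    "Hiragana": list(range(0x3041, 0x3097)),
}

def best_alphabet(text, eps=50):
    return max(ALPHABETS, key=lambda name: alphabet_score(text, ALPHABETS[name], eps))

print(best_alphabet("Hello"), best_alphabet("こんにちは"))
```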
I'm wondering whether there is an algorithm or a library that can help me identify the components of an English sentence that have no meaning, e.g., very serious grammar errors? If so, could you explain how it works, because I would really like to implement that or use it in my own projects.
Here's a random example:
In the sentence: "I closed so etc page hello the door."
As a human, we can quickly identify that [so etc page hello] does not make any sense. Is it possible for a machine to point out that the string does not make any sense and also contains grammar errors?
If there's such a solution, how precise can that be? Is it possible, for example, given a clip of an English sentence, the algorithm returns a measure, indicating how meaningful, or correct that clip is? Thank you very much!
PS: I've looked at CMU's link grammar as well as the NLTK library. But I'm still not sure how to use, for example, the link grammar parser to do what I would like to do: if the parser doesn't accept the sentence, I don't know how to tweak it to tell me which part is not right, and I'm not sure whether NLTK supports that.
Another thought I had towards solving the problem is to look at the frequencies of word combinations, since I'm currently interested in correcting very serious errors only. I could define a "serious error" as a case where the words in a clip of a sentence are rarely used together, i.e., the frequency of the combo is much lower than those of the other combos in the sentence.
For instance, in the above example: [so etc page hello] these four words really seldom occur together. One intuition of my idea comes from when I type such combo in Google, no related results jump out. So is there any library that provides me such frequency information like Google does? Such frequencies may give a good hint on the correctness of the word combo.
I think that what you are looking for is a language model. A language model assigns a probability to each sentence of k words appearing in your language. The simplest kind of language models are n-grams models: given the first i words of your sentence, the probability of observing the i+1th word only depends on the n-1 previous words.
For example, for a bigram model (n=2), the probability of the sentence w1 w2 ... wk is equal to
P(w1 ... wk) = P(w1) P(w2 | w1) ... P(wk | w(k-1)).
To compute the probabilities P(wi | w(i-1)), you just have to count the number of occurrences of the bigram w(i-1) wi and of the word w(i-1) in a large corpus.
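As an illustration of the counting step, here is a small sketch of a bigram model. The three-sentence corpus is invented for the example and far too small for real use, and the add-one smoothing (so that unseen pairs get a small non-zero probability) is an assumption on top of the plain counts described above.

```python
from collections import Counter

def train_bigram(sentences):
    """Count words and adjacent word pairs in a list of sentences."""
    unigrams, pairs = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        pairs.update(zip(words, words[1:]))
    return unigrams, pairs

def bigram_probs(sentence, unigrams, pairs):
    """Return [((prev, word), P(word | prev)), ...] for a sentence,
    using add-one smoothing over the training vocabulary."""
    words = sentence.lower().split()
    vocab = len(unigrams)
    return [((prev, word), (pairs[(prev, word)] + 1) / (unigrams[prev] + vocab))
            for prev, word in zip(words, words[1:])]

# Toy corpus, made up for this example.
corpus = ["I closed the door", "I opened the door", "she closed the page"]
unigrams, pairs = train_bigram(corpus)

for pair, p in bigram_probs("I closed so etc page hello the door", unigrams, pairs):
    print(pair, round(p, 3))
```

Pairs with unusually low probability compared with their neighbours are candidates for the "does not make sense" region; with a realistic corpus the contrast is much sharper than in this toy run.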
Here is a good tutorial paper on the subject: A Bit of Progress in Language Modeling, by Joshua Goodman.
Yes, such things exist.
You can read about it on Wikipedia.
You can also read about some of the precision issues here.
As far as determining which part is not right after determining the sentence has a grammar issue, that is largely impossible without knowing the author's intended meaning. Take, for example, "Over their, dead bodies" and "Over there dead bodies". Both are incorrect, and could be fixed either by adding/removing the comma or swapping their/there. However, these result in very different meanings (yes, the second one would not be a complete sentence, but it would be acceptable/understandable in context).
Spell checking works because there are a limited number of words against which you can check a word to determine if it is valid (spelled correctly). However, there are infinite sentences that can be constructed, with infinite meanings, so there is no way to correct a poorly written sentence without knowing what the meaning behind it is.
I think what you are looking for is a well-established library that can process natural language and extract the meanings.
Unfortunately, there's no such library. Natural language processing, as you probably can imagine, is not an easy task. It is still a very active research field. There are many algorithms and methods in understanding natural language, but to my knowledge, most of them only work well for specific applications or words of specific types.
And those libraries, such as the CMU one, seem to be still quite rudimentary. They can't do what you want (like identifying errors in an English sentence). You have to develop an algorithm to do that using the tools they provide (such as the sentence parser).
If you want to learn about it check out They have some sections that talks about processing language and words.
Write a function that given a string of digits and a target value, prints where to put +'s and *'s between the digits so they combine exactly to the target value. Note there may be more than one answer, it doesn't matter which one you print.
"1231231234",11353 -> "12*3+1+23*123*4"
"3456237490",1185 -> "3*4*56+2+3*7+490"
"3456237490",9191 -> "no solution"
If you have an N-digit value, there are N-1 possible slots for the + or * operators, and each slot can hold "+", "*" or nothing. So by brute force there are 3^(N-1) possibilities. Testing all of these is inefficient.
Your examples are all 10 digits. 3^9 = 19683, so brute force is FINE! No need to get any fancier.
So all you need to do is iterate through all 19683 cases, each time building a string for that case, and evaluating the expression. Evaluating the expression is a straightforward task. Iterating is straightforward (just use an incrementing value, you can read the state of the first slot by (i%3), which gives you "no operator" "+" or "*", the state of the second slot is (i/3)%3, the state of the third slot is (i/9)%3 and so on.)
Even with crude parsing code, CPUs are fast.
The brute force option starts becoming ugly after about 20 digits, and you'd have to switch to be more clever.
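The base-3 counter described above might be sketched like this. The names `evaluate` and `solve` are made up for the example, and the evaluator relies on the fact that the expressions contain only "+" and "*", so each "+"-separated term is just a product of numbers.

```python
from math import prod

def evaluate(expr):
    """Evaluate an expression made only of digits, '+' and '*'."""
    return sum(prod(int(factor) for factor in term.split("*"))
               for term in expr.split("+"))

def solve(digits, target):
    """Try every way of filling the slots between digits with
    nothing, '+' or '*', and return the first one that hits target."""
    ops = ["", "+", "*"]
    slots = len(digits) - 1
    for i in range(3 ** slots):
        expr = digits[0]
        state = i
        for d in digits[1:]:
            # state % 3 is this slot's operator; then move to the next slot.
            expr += ops[state % 3] + d
            state //= 3
        if evaluate(expr) == target:
            return expr
    return "no solution"

print(solve("1231231234", 11353))
print(solve("3456237490", 1185))
```

The expression printed may differ from the ones in the question, since any valid answer is acceptable.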
If this is for the gaming programmer position, do not use the brute force approach. I did that and failed this question a couple of years ago. I later heard from someone inside that the dynamic programming approach is the one that gets the job.
This can be solved either by backtracking or by dynamic programming.
The "cleverer" approach (using dynamic programming) is basically this:
For each substring of the original string, figure out all possible values it can create. (e.g. in your first example "12" can stay 12, or become either 1+2=3 or 1*2=2)
There may be a lot of different combinations, but many of them will be duplicates. (Also, you should ignore all combinations that are greater than the target).
Thus, when you add a "+" or a "*", you can envision it as combining two substrings of the string. (and since you have the possible values for each substring, you can see if such a combination is possible)
These values can be generated similarly: try splitting the substring in all possible ways, and combine the different values in each half of the substring.
The total number of "states", then, is something like |S|^2 * target - for your example case, it's worse than the brute-force method. But if you had a string of length 1000 and a target of say 5000, then the problem would be solvable with dynamic programming.
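For comparison, here is one way a memoised search could look. It is not the substring-splitting scheme described above but a simpler left-to-right variant, chosen because it handles operator precedence directly: the state is the position in the string, the sum of the finished "+" terms, and the product currently being built. It does not prune values above the target, since a later "*0" could bring a large product back down. All names are made up for the sketch.

```python
from functools import lru_cache

def solve_memo(digits, target):
    """Find an expression over digits using '+' and '*' that equals target."""
    n = len(digits)

    @lru_cache(maxsize=None)
    def search(pos, total, term):
        # total: sum of completed '+' terms; term: product in progress.
        if pos == n:
            return "" if total + term == target else None
        for end in range(pos + 1, n + 1):
            chunk = digits[pos:end]
            num = int(chunk)
            # Option 1: '*' extends the product in progress.
            rest = search(end, total, term * num)
            if rest is not None:
                return "*" + chunk + rest
            # Option 2: '+' closes the product and starts a new term.
            rest = search(end, total + term, num)
            if rest is not None:
                return "+" + chunk + rest
        return None

    # The first number has no operator in front of it.
    for end in range(1, n + 1):
        rest = search(end, 0, int(digits[:end]))
        if rest is not None:
            return digits[:end] + rest
    return "no solution"

print(solve_memo("1231231234", 11353))
```

The cache pays off when many different operator choices lead to the same (position, total, term) state, which becomes more likely as the string gets longer relative to the range of values involved.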
Google Code Jam had an extended version of this problem last year (in Round 1C), called Ugly Numbers. You can visit that link and click "Contest Analysis" for some approaches to that problem, when extended to large numbers of digits.