I'm wondering is there an algorithm or a library which helps me identify the components in an English which has no meaning? e.g., very serious grammar error? If so, could you explain how it works, because I would really like to implement that or use that for my own projects.
Here's a random example:
In the sentence: "I closed so etc page hello the door."
As a human, we can quickly identify that [so etc page hello] does not make any sense. Is it possible for a machine to point out that the string does not make any sense and also contains grammar errors?
If there's such a solution, how precise can that be? Is it possible, for example, given a clip of an English sentence, the algorithm returns a measure, indicating how meaningful, or correct that clip is? Thank you very much!
PS: I've looked at CMU's link grammar as well as the NLTK library. But still I'm not sure how to use for example link grammar parser to do what I would like to do as the if the parser doesn't accept the sentence, I don't know how to tweak it to tell me which part it is not right.. and I'm not sure whether NLTK supported that.
Another thought I had towards solving the problem is to look at the frequencies of the word combination. Since I'm currently interested in correcting very serious errors only. If I define the "serious error" to be the cases where words in a clip of a sentence are rarely used together, i.e., the frequency of the combo should be much lower than those of the other combos in the sentence.
For instance, in the above example: [so etc page hello] these four words really seldom occur together. One intuition of my idea comes from when I type such combo in Google, no related results jump out. So is there any library that provides me such frequency information like Google does? Such frequencies may give a good hint on the correctness of the word combo.
I think that what you are looking for is a language model. A language model assigns a probability to each sentence of k words appearing in your language. The simplest kind of language models are n-grams models: given the first i words of your sentence, the probability of observing the i+1th word only depends on the n-1 previous words.
For example, for a bigram model (n=2), the probability of the sentence w1 w2 ... wk is equal to
P(w1 ... wk) = P(w1) P(w2 | w1) ... P(wk | w(k-1)).
To compute the probabilities P(wi | w(i-1)), you just have to count the number of occurrence of the bigram w(i-1) wi and of the word w(i-1) on a large corpus.
Here is a good tutorial paper on the subject: A Bit of Progress in Language Modeling, by Joshua Goodman.
Yes, such things exist.
You can read about it on Wikipedia.
You can also read about some of the precision issues here.
As far as determining which part is not right after determining the sentence has a grammar issue, that is largely impossible without knowing the author's intended meaning. Take, for example, "Over their, dead bodies" and "Over there dead bodies". Both are incorrect, and could be fixed either by adding/removing the comma or swapping their/there. However, these result in very different meanings (yes, the second one would not be a complete sentence, but it would be acceptable/understandable in context).
Spell checking works because there are a limited number of words against which you can check a word to determine if it is valid (spelled correctly). However, there are infinite sentences that can be constructed, with infinite meanings, so there is no way to correct a poorly written sentence without knowing what the meaning behind it is.
I think what you are looking for is a well-established library that can process natural language and extract the meanings.
Unfortunately, there's no such library. Natural language processing, as you probably can imagine, is not an easy task. It is still a very active research field. There are many algorithms and methods in understanding natural language, but to my knowledge, most of them only work well for specific applications or words of specific types.
And those libraries, such as the CMU one, seems to be still quite rudimental. It can't do what you want to do (like identifying errors in English sentence). You have to develop algorithm to do that using the tools that they provide (such as sentence parser).
If you want to learn about it check out ai-class.com. They have some sections that talks about processing language and words.
Related
I am automating a process which asks questions (via SMS but shouldn't matter) to real people. The questions have yes/no answers, but the person might respond in a number of ways such as: sure, not at this time, yeah, never or in any other way that they might. I would like to attempt to parse this text and determine if it was a yes or no answer (of course it might not always be right).
I figured the ideas and concepts to do this might already exist as it seems like a common task for an AI, but don't know what it might be called so I can't find information on how I might implement it. So my questions is, have algorithms been developed to do this kind of parsing and if so where can I find more information on how to implement them?
This can be viewed as a binary (yes or no) classification task. You could write a rule-based model to classify or a statistics-based model.
A rule-based model would be like if answer in ["never", "not at this time", "nope"] then answer is "no". When spam filters first came out they contained a lot of rules like these.
A statistics-based model would probably be more suitable here, as writing your own rules gets tiresome and does not handle new cases as well.
For this you need to label a training dataset. After a little preprocessing (like lowercasing all the words, removing punctuation and maybe even a little stemming) you could get a dataset like
0 | never in a million years
0 | never
1 | yes sir
1 | yep
1 | yes yes yeah
0 | no way
Now you can run classification algorithms like Naive Bayes or Logistic Regression over this set (after you vectorize the words in either binary, which means is the word present or not, word count, which means the term frequency, or a tfidf float, which prevent bias to longer answers and common words) and learn which words more often belong to which class.
In the above example yes would be strongly correlated to a positive answer (1) and never would be strongly related to a negative answer (0). You could work with n-grams so a not no would be treated as a single token in favor of the positive class. This is called the bag-of-words approach.
To combat spelling errors you can add a spellchecker like Aspell to the pre-processing step. You could use a charvectorizer too, so a word like nno would be interpreted as nn and no and you catch errors like hellyes and you could trust your users to repeat spelling errors. If 5 users make the spelling error neve for the word never then the token neve will automatically start to count for the negative class (if labeled as such).
You could write these algorithms yourself (Naive Bayes is doable, Paul Graham has wrote a few accessible essays on how to classify spam with Bayes Theorem and nearly every ML library has a tutorial on how to do this) or make use of libraries or programs like Scikit-Learn (MultinomialNB, SGDclassifier, LinearSVC etc.) or Vowpal Wabbit (logistic regression, quantile loss etc.).
Im thinking on top of my head, if you get a response which you dont know if its yes / no, you can keep the answers in a DB like unknown_answers and 2 more tables as affirmative_answers / negative_answers, then in a little backend system, everytime you get a new unknown_answer you qualify them as yes or no, and there the system "learns" about it and with time, you will have a very big and good database of affirmative / negative answers.
I'm implementing an alphabetic search based on telephone keypad, like Phone keypad1
When user types , say 2, I get {A, B, C} in the combination. When user types 23, I get {AD, AE, AF, BD, BE, BF, CD, CE, CF} in the combinations, and so on. If I keep typing and make combinations I get thousands of combinations which make search process quite slow. So now I want to implement an algorithm which delete illogical combinations like CF BD CD, I mean logically no one's name starts with these combinations, perhaps two consonants without vowel. So this way I want to narrow down my search. Anyone knowing about such state machine, implemented in C?
You could build a trie of valid prefixes based on the dataset you're searching. Matching partial inputs against that should be pretty easy.
Keep in mind that when it comes to linguistic data "illogical" is not a good proxy for "unlikely." This is particularly true when it comes to names. As an example, according to a standard definition of "consonant" in English, my last name starts with four consonants. If it were to be written after a German fashion, it would start with five. When thinking about such issues it is useful to keep in mind that:
Sounds are not letters, and letters are not sounds: in most
orthographic systems, the mapping of letters to sounds is not 1:1
Many languages have unexpected syllabic nuclei: Tamazight Berber, for instance, allows syllables where the sound m plays the role of the syllabic nucleous, like the vowel generally do in English. So a Berber name can look like CCmC (where C stands for consonants) and be perfect in that language. It is not unlikely that a person of Berber origin would then use the similar orthography in English, which a naive system would rule out as "illogical"
Finally, many systems for writing foreign names and words in English use di-graphs or tri-graphs (two letter and three letter combinations) for representing the sounds of the foreign language in English: this can create what looks like illicit consontant clusters. We know that English does that (sh represents one sound, see point 1), but this is particularly true when transcribing foreign words.
So unless you know very well the orthographic rules for the names you are expecting, you are likely to rule out legitimate names using a naive system.
I'm trying to build a collection English words that are difficult to pronounce.
I was wondering if there is an algorithm of some kind or a theory, that can be used to show how difficult a word is to pronounce.
Does this appear to you as something that can be computed?
As this seems to be a very subjective thing, let me make it more objective, let's say hardest words to pronounce by text to speech technologies.
One approach would be to build a list with two versions of each word. One the correct spelling, and the other being the word spelled using the simplest of phonetic spelling. Apply a distance function on the two words (like Levenshtein distance http://en.wikipedia.org/wiki/Levenshtein_distance). The greater the distance between the two words, the harder the word would be to pronounce.
Great problem! Off the top of my head you could create a system which contains all the letters from the phonetic alphabet and with connected weights betweens every combination based on difficulty (highly specific so may need multiple people testing and take averages etc) then have a list of all words from the English dictionary stored on disk and call a script which cycles through each entry and performs web scraping on wikipedia for the phonetic spelling and ranks their difficulty. This could take into consideration the length of the word as well as the difficulty between joining phonetics then order the list based on the difficulty.
Thats what I would try and do :P
To a certain extent...
Speech programs for example use a system of phonetics to try and pronounce words.
For example, "grasp" would be split into:
Gr-A-Sp
However, for foreign words (or words that don't follow this pattern), exception lists have to be kept e.g. Yacht
Suggestion
Fortunately Pronunciation as a process is dependent on a two factors these include
the phones making up the words and the location of vowels and semi vowels i.e
/a/,/ae/,/e/,/i/,/o/,/u/,/w/,/j/...
length of the word.
the first relates to the mechanics of phone sound production as the velum, cheeks tongue have to be altered to produce various sounds related to individual phones i.e nasal etc. this makes some words more difficult to pronounce as the movement required may be a lot. Refer to books about phonetics to find positions of pronouncing each phone.
Algorithm
a weighted spanning tree with weight being the difficulty of pronouncing two consecutive phones i.e l and r or /sh/ and /s/
good luck.
I'm trying to implement application that can determine meaning of sentence, by dividing it to smaller pieces. So I need to know what words are subject, object etc. so that my program can know how to handle this sentence.
This is an open research problem. You can get an overview on Wikipedia, http://en.wikipedia.org/wiki/Natural_language_processing. Consider phrases like "Time flies like an arrow, fruit flies like a banana" - unambiguously classifying words is not easy.
You should look at the Natural Language Toolkit, which is for exactly this sort of thing.
See this section of the manual: Categorizing and Tagging Words - here's an extract:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
"Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective."
I guess there is not "simple" way to do this. You have to build a linguistic analyzer (which is quite possible), however, a language as a lot of exceptional cases. And that is what makes implementing a linguistic analyzer that hard.
The specific problem you mention, the identification of the subject and objects of a clause, is accomplished by syntactic parsing. You can get a good idea of how parsing works by using this demo of parsing software developed by Stanford University.
However, syntactic parsing does not determine the meanining of a sentence, only its structure. Determining meaning (semantics) is a very hard problem in general and there is no technology that can really 'understand' a sentence in the same way that a human would. Although there is no general solution, you may be able to do something in a very restricted subject domain. For example, is the data you want to analyse about a narrow topic with a limited set of 'things' that people talk about?
StompChicken has given the right answer to this question, but I'd like to add that the concepts of subject and object are known as grammatical relations, and that Briscoe and Carroll's RASP is a parser that can go the extra step of deducing a list of relations from the parse.
Here's some example output from their demo page. It's an extract from the output for a sentence that begins "We describe a robust accurate domain-independent approach...":
(|ncsubj| |describe:2_VV0| |We:1_PPIS2| _)
(|dobj| |describe:2_VV0| |approach:7_NN1|)
Let's say I've got a database full of music artists. Consider the following artists:
The Beatles -
"The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.
Beyoncé -
We need to allow the user to be able to search for "Beyonce" (without the diacritic mark)and get the proper results back. No user is going to know how or take the time to type the special diacritcal character on the last "e" when searching, yet we obviously want to display it correctly when we need to output it.
What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.
The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.
You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.
Generally, you transform the title to a normalized form: "Beatles, The". Generally, you leave it that way. Then sort.
You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules
For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html
http://www.siao2.com/2005/02/19/376617.aspx