Good pedagogical resources on suffix arrays - suffix-array

I simply cannot find any good pedagogical resource explaining suffix arrays. Even the "bible" doesn't cover it.
Where can I find a clear and thorough explanation of suffix arrays and their uses? (A video course would be ideal, because I'm lazy.)

Prof. Dan Gusfield gave a lecture on this topic: http://www.cs.ucdavis.edu/~gusfield/cs222f07/lineartimesuffixarray.wmv . You might find it useful.

Many of the things you can do with a suffix array were originally described in terms of suffix trees. A great textbook covering that is Dan Gusfield's Algorithms on Strings, Trees, and Sequences.
A great resource on suffix array search, representation, and compression is the survey paper by Navarro and Mäkinen (DOI 10.1145/1216370.1216372).
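For orientation (this sketch is not from the answers above): a suffix array is simply the list of starting positions of a string's suffixes in lexicographic order. A naive Python construction, fine for small inputs, makes the definition concrete; the resources above cover how to build it in linear time.

def naive_suffix_array(s):
    # Sort suffix start positions by the suffix they begin with; O(n^2 log n)
    # overall, because each comparison may scan a whole suffix.
    return sorted(range(len(s)), key=lambda i: s[i:])

print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]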

Related

Can a language have multiple solutions as a DFA diagram?

What I mean is: can there be multiple different diagrams for the same language? Can it be drawn in multiple ways, or does each language have only one solution as a DFA? I had a pop quiz today, drew a solution, and tried multiple strings. Each of them was accepted, but I didn't get any points for it, and I got no feedback from my TA as to why it was considered wrong.
The question was: let L = {w | w contains an odd number of 0s or at least two 1s}.
This is what I did (sorry, I had to use MS Paint).
If you look a bit more carefully, 0101 is a string in your language, but it is not accepted by your automaton. Also, to answer your other question: yes, there can be multiple DFAs that accept the same language. A trivial example would be the language 0* (think about it if you are still interested, haha!).
P.S. I just noticed a comment that already pointed out the counterexample, but I went ahead anyway. Sorry!
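To make the point about 0* concrete, here is a small Python sketch (the encoding and state names are made up for illustration) showing two structurally different DFAs that accept exactly the same language:

def accepts(dfa, word):
    state = dfa["start"]
    for symbol in word:
        if (state, symbol) not in dfa["delta"]:
            return False  # no transition defined: reject
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accept"]

# One-state DFA for 0*.
dfa_a = {"start": "q0", "accept": {"q0"}, "delta": {("q0", "0"): "q0"}}
# Two-state DFA for the same language, bouncing between two accepting states.
dfa_b = {"start": "p0", "accept": {"p0", "p1"},
         "delta": {("p0", "0"): "p1", ("p1", "0"): "p0"}}

for w in ["", "0", "00", "000", "1", "01"]:
    # The machines look different but agree on every input string.
    assert accepts(dfa_a, w) == accepts(dfa_b, w)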

LDA: Why sampling for inference of a new document?

Given a standard LDA model with a few thousand topics and a few million documents, trained with Mallet's collapsed Gibbs sampler:
When inferring a new document, why not just skip sampling and simply use the model's term-topic counts to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes into account the topic mixture of the new document, which in turn influences how topics are composed (beta, the term-frequency distributions). However, as the topics are kept fixed when inferring a new document, I don't see why this should be relevant.
An issue with sampling is its probabilistic nature: sometimes the topic assignments inferred for a document vary greatly across repeated invocations. Therefore I would like to understand the theoretical and practical value of sampling vs. just using a deterministic method.
Thanks Ben
Just using term topic counts of the last Gibbs sample is not a good idea. Such an approach doesn't take into account the topic structure: if a document has many words from one topic, it's likely to have even more words from that topic [1].
For example, say two words have equal probabilities in two topics. The topic assignment of the first word in a given document affects the topic probability of the other word: the other word is more likely to be in the same topic as the first one. The relation works the other way also. The complexity of this situation is why we use methods like Gibbs sampling to estimate values for this sort of problem.
As for your comment on topic assignments varying, that can't be helped, and could be taken as a good thing: if a word's topic assignment varies, you can't rely on it. What you're seeing is that the posterior distribution over topics for that word has no clear winner, so you should take any particular assignment with a grain of salt :)
[1] assuming the Dirichlet prior on document-topic distributions encourages sparsity, as is usually chosen for topic models.
The real issue is computational complexity. If each of N tokens in a document can have K possible topics, there are K^N possible configurations of topics. With just two topics and a document the size of this answer, you already have an astronomical number of possibilities.
Sampling from this search space is, however, quite efficient, and usually gives consistent results if you average over three to five consecutive Gibbs sweeps. You get to approximate something computationally intractable, and what it costs you is some uncertainty.
As was noted, you can get a "deterministic" result by setting a fixed random seed, but that doesn't actually solve anything.
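As a rough illustration of what held-out inference with fixed topics looks like (a minimal sketch, not Mallet's actual implementation; phi, alpha, and the function name are assumptions for this example):

import numpy as np

def infer_document(doc_word_ids, phi, alpha, n_sweeps=50, seed=None):
    # phi: fixed topic-word probability matrix of shape (K, V) from training.
    # alpha: symmetric document-topic prior. The topics are never updated here.
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    z = rng.integers(K, size=len(doc_word_ids))        # random initial assignments
    topic_counts = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_sweeps):
        for i, w in enumerate(doc_word_ids):
            topic_counts[z[i]] -= 1                    # remove token i's assignment
            # The conditional couples tokens through the document's own topic
            # counts, which is the "rich get richer" effect described above.
            p = (topic_counts + alpha) * phi[:, w]
            p /= p.sum()
            z[i] = rng.choice(K, p=p)
            topic_counts[z[i]] += 1
    # Smoothed topic proportions; averaging over several final sweeps (as noted
    # above) makes this estimate more stable.
    return (topic_counts + alpha) / (topic_counts.sum() + K * alpha)

Dropping the sampling loop and assigning each token independently from phi alone would ignore the document-topic counts, which is exactly the deterministic shortcut the answers argue against.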

Identifying the components in an English sentence that do not make sense

I'm wondering: is there an algorithm or a library that helps me identify the components of an English sentence that have no meaning, e.g., very serious grammar errors? If so, could you explain how it works? I would really like to implement it or use it in my own projects.
Here's a random example:
In the sentence: "I closed so etc page hello the door."
As humans, we can quickly identify that [so etc page hello] does not make any sense. Is it possible for a machine to point out that the string does not make any sense and also contains grammar errors?
If there is such a solution, how precise can it be? Is it possible, for example, given a clip of an English sentence, for the algorithm to return a measure indicating how meaningful or correct that clip is? Thank you very much!
PS: I've looked at CMU's Link Grammar as well as the NLTK library. But I'm still not sure how to use, for example, the Link Grammar parser to do what I want: if the parser doesn't accept a sentence, I don't know how to tweak it to tell me which part is not right. And I'm not sure whether NLTK supports that.
Another thought I had for solving the problem is to look at the frequencies of word combinations, since I'm currently interested in correcting only very serious errors. I could define a "serious error" as a case where the words in a clip of a sentence are rarely used together, i.e., the frequency of the combination is much lower than those of the other combinations in the sentence.
For instance, in the above example, the four words [so etc page hello] very seldom occur together. Part of my intuition comes from the fact that when I type such a combination into Google, no related results come up. So is there any library that provides such frequency information, like Google does? Such frequencies may give a good hint about the correctness of a word combination.
I think that what you are looking for is a language model. A language model assigns a probability to each sentence of k words appearing in your language. The simplest kind of language model is the n-gram model: given the first i words of your sentence, the probability of observing the (i+1)-th word depends only on the previous n-1 words.
For example, for a bigram model (n=2), the probability of the sentence w1 w2 ... wk is equal to
P(w1 ... wk) = P(w1) P(w2 | w1) ... P(wk | w(k-1)).
To compute the probabilities P(wi | w(i-1)), you just have to count the number of occurrences of the bigram w(i-1) wi and of the word w(i-1) in a large corpus.
Here is a good tutorial paper on the subject: A Bit of Progress in Language Modeling, by Joshua Goodman.
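A minimal sketch of such a bigram model (with add-one smoothing; the function and variable names are illustrative, not taken from a specific library):

from collections import Counter

def train_bigram_model(corpus):
    # corpus: a list of tokenized sentences, e.g. [["i", "closed", "the", "door"], ...]
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def sentence_probability(sentence):
        tokens = ["<s>"] + sentence
        p = 1.0
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one smoothing keeps unseen pairs like ("page", "hello")
            # at a small nonzero probability instead of zero.
            p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        return p

    return sentence_probability

A stretch like "so etc page hello" would contribute a run of very low bigram probabilities, which is one way to flag the implausible part of the sentence.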
Yes, such things exist.
You can read about it on Wikipedia.
You can also read about some of the precision issues here.
As far as determining which part is not right after determining the sentence has a grammar issue, that is largely impossible without knowing the author's intended meaning. Take, for example, "Over their, dead bodies" and "Over there dead bodies". Both are incorrect, and could be fixed either by adding/removing the comma or swapping their/there. However, these result in very different meanings (yes, the second one would not be a complete sentence, but it would be acceptable/understandable in context).
Spell checking works because there are a limited number of words against which you can check a word to determine if it is valid (spelled correctly). However, there are infinite sentences that can be constructed, with infinite meanings, so there is no way to correct a poorly written sentence without knowing what the meaning behind it is.
I think what you are looking for is a well-established library that can process natural language and extract the meanings.
Unfortunately, there's no such library. Natural language processing, as you probably can imagine, is not an easy task. It is still a very active research field. There are many algorithms and methods in understanding natural language, but to my knowledge, most of them only work well for specific applications or words of specific types.
And those libraries, such as the CMU one, still seem quite rudimentary. They can't do what you want (like identifying errors in an English sentence) out of the box; you have to develop an algorithm to do that using the tools they provide (such as a sentence parser).
If you want to learn about it, check out ai-class.com. They have some sections that talk about processing language and words.

How to find a path to go home - algorithm

[image of the maze] (source: blogcu.com)
Assume there is a rabbit at position (1,1), and its home is at position (7,7). How can it reach that position?
The home position is not a fixed place.
The real question: I am trying to solve a problem from a book for practicing C. What algorithm should I apply to find a solution?
Should I use a linked list to store the data?
The data is (1,1), (1,2), ..., (3,3), ..., (7,7).
Places marked with black are walls.
Use A*. It is the classic go-to algorithm for path-finding (that article lists many other algorithms you can consider too).
By using A* you learn an algorithm that you might actually need in your normal programming career later ;)
[image: an example A* evaluation of a maze similar to the one in the question]
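A rough Python sketch of A* on such a grid (illustrative only; it assumes the maze is given as a list of strings with '#' for walls and uses Manhattan distance as the heuristic, which is admissible for 4-way movement):

import heapq

def astar(maze, start, goal):
    rows, cols = len(maze), len(maze[0])
    def h(cell):
        # Manhattan distance to the goal.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
    open_heap = [(h(start), 0, start)]        # entries are (f = g + h, g, cell)
    best_g = {start: 0}
    parent = {start: None}
    while open_heap:
        f, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk back through parents to reconstruct the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best_g.get(cell, float("inf")):
            continue                           # stale heap entry
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != '#':
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    parent[(nr, nc)] = cell
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None  # no path exists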
There are a bunch of search algorithms you can use. The easiest to implement will be either breadth-first search or depth-first search.
Algorithms like A* are likely to be more efficient but are a little harder to code.
Check out the Wikipedia "Search algorithms" page. It has links to a number of well-known algorithms.
Breadth-first search is always a good one.
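For comparison, here is a minimal breadth-first search sketch under the same assumed maze representation (a list of strings with '#' for walls); on an unweighted grid, BFS also returns a shortest path:

from collections import deque

def bfs_path(maze, start, goal):
    rows, cols = len(maze), len(maze[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            # Walk back through parents to reconstruct the path.
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != '#' and (nr, nc) not in parent:
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # no path exists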
http://www.codeproject.com/KB/recipes/mazesolver.aspx

How to determine subject, object and other words?

I'm trying to implement an application that can determine the meaning of a sentence by dividing it into smaller pieces. So I need to know which words are the subject, the object, etc., so that my program knows how to handle the sentence.
This is an open research problem. You can get an overview on Wikipedia, http://en.wikipedia.org/wiki/Natural_language_processing. Consider phrases like "Time flies like an arrow, fruit flies like a banana" - unambiguously classifying words is not easy.
You should look at the Natural Language Toolkit, which is for exactly this sort of thing.
See this section of the manual: Categorizing and Tagging Words - here's an extract:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
"Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective."
I guess there is no "simple" way to do this. You have to build a linguistic analyzer (which is quite possible); however, a language has a lot of exceptional cases, and that is what makes implementing a linguistic analyzer so hard.
The specific problem you mention, the identification of the subject and objects of a clause, is accomplished by syntactic parsing. You can get a good idea of how parsing works by using this demo of parsing software developed by Stanford University.
However, syntactic parsing does not determine the meaning of a sentence, only its structure. Determining meaning (semantics) is a very hard problem in general, and there is no technology that can really 'understand' a sentence in the same way a human would. Although there is no general solution, you may be able to do something in a very restricted subject domain. For example, is the data you want to analyse about a narrow topic with a limited set of 'things' that people talk about?
StompChicken has given the right answer to this question, but I'd like to add that the concepts of subject and object are known as grammatical relations, and that Briscoe and Carroll's RASP is a parser that can go the extra step of deducing a list of relations from the parse.
Here's some example output from their demo page. It's an extract from the output for a sentence that begins "We describe a robust accurate domain-independent approach...":
(|ncsubj| |describe:2_VV0| |We:1_PPIS2| _)
(|dobj| |describe:2_VV0| |approach:7_NN1|)
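As a more readily installable alternative (not mentioned in the answers above, so treat it as a suggestion): a dependency parser such as spaCy labels these grammatical relations directly. A small sketch, assuming the en_core_web_sm model has been installed:

import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
doc = nlp("We describe a robust accurate domain-independent approach.")
for token in doc:
    if token.dep_ in ("nsubj", "dobj"):
        # token.head is the verb the relation attaches to.
        print(token.dep_, token.head.text, token.text)
# Expected output, roughly: "nsubj describe We" and "dobj describe approach"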
