Understand the pumping lemma - theory

I am relatively new to the pumping lemma, and I have a problem here that I think I answered correctly. Can anyone tell me if this works, and if not, why not?
The problem: {www | w ∈ {a,b}*}
My approach:
L = www
u (v^k) w has to be a subset of L

w w w
| | |
u v w

uvw = www
(u)(v^2)(w) = wwww
wwww is not part of the language www, and therefore L is not regular
Edit: According to my understanding, the pumping lemma takes the "test string" we are looking at and splits it into a portion that stays the same, followed by a portion that is repeatable, and then lastly another portion that remains the same. In my "approach" I took the test string www and split it into u, v, and w, each respectively holding a single w, with v being the repeatable section and the other two being the ones that remain the same. I double the v section and end up with a resulting uvvw, which translates to wwww, which appears not to be part of the language www. I have a good feeling that I am wrong because of the condition "w is in {a,b}*", which I think includes the empty string; since the empty string makes wwww and www coincide, my pumping argument is faulty. I would just like to know what approach I would have to take to tackle such a problem; it's just a practice problem.

I do not believe your answer works, because there is no way to be sure that wwww is not in the language.
For example, let |w| be a multiple of 3 (i.e. 3k for some k).
Then your original string has length
3k + 3k + 3k = 9k.
If you append another copy of w, of length 3k,
the length is now 12k, which is ALSO a multiple of 3.
Try something like:
Let w = 10...01, where you have p zeros enclosed by 1s. Then no matter how you split

10..0110..0110..01

into u v w and pump, you will be out of the language.
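To spell that out (a sketch, keeping the 0/1 alphabet from this answer; I'll write the decomposition as x y z so it doesn't collide with the w from the language definition, and p is the pumping length):

Pick s = 1 0^p 1 1 0^p 1 1 0^p 1, i.e. s = www for w = 1 0^p 1, with |s| >= p.
Any split s = x y z with |xy| <= p and |y| >= 1 places y inside the prefix 1 0^(p-1), so there are two cases.
Case 1: y contains the leading 1. Then x y^2 z contains seven 1s, but every string of the form uuu contains a multiple of three 1s.
Case 2: y = 0^k for some k >= 1. Then x y^2 z = 1 0^(p+k) 1 1 0^p 1 1 0^p 1. If this were uuu, its length 3(p+2)+k would force k to be a multiple of 3, and its first third would then be 1 0^(p+1+k/3), which contains a single 1, while each third of a six-1 string uuu must contain two.
Either way x y^2 z leaves the language, so the language is not regular.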


Find repeating patterns in string

In today's Advent of Code question, users were tasked with finding repeating patterns in a string and outputting them.
For example, my string was:
R8L10L12R4R8L12R4R4R8L10L12R4R8L10R8R8L10L12R4R8L12R4R4R8L10R8R8L12R4R4R8L10R8R8L12R4R4
This consists of 3 repeating patterns:
A = R8L10L12R4
B = R8L12R4R4
C = R8L10R8
which means my string could be encoded to ABACABCBCB
How would I do this by algorithm? I did this by hand, and all solutions of other people I looked at were also done by hand. What is the programmatic way to solve this?
Edit:
Information we had:
maximum of 3 patterns
R8L10L12R4 would be transformed to R,8,L,10,L,12,R,4. The transformed pattern could not be longer than 20 characters
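For what it's worth, here is a hedged sketch of one programmatic approach (Python; the function and variable names are mine). The key observation is that pattern A must be a prefix of the whole string, so a small backtracking search over prefixes works at this input size:

import re

def encode(path, max_patterns=3, max_len=20):
    # Tokenize "R8L10L12R4..." into moves such as "R8" and "L10".
    moves = tuple(re.findall(r'[RL]\d+', path))

    def fits(pattern):
        # The 20-character limit applies to the comma-separated form,
        # e.g. R8L10L12R4 -> "R,8,L,10,L,12,R,4" (17 characters).
        flat = ','.join(c for move in pattern for c in (move[0], move[1:]))
        return len(flat) <= max_len

    def search(rest, patterns, sequence):
        if not rest:                      # everything consumed: success
            return patterns, sequence
        # Try to consume the front of `rest` with an existing pattern.
        for name, pat in zip('ABC', patterns):
            if rest[:len(pat)] == pat:
                found = search(rest[len(pat):], patterns, sequence + name)
                if found:
                    return found
        # Otherwise introduce a new pattern, which must be a prefix of `rest`.
        if len(patterns) < max_patterns:
            for i in range(1, len(rest) + 1):
                candidate = rest[:i]
                if not fits(candidate):
                    break                 # longer prefixes only get longer
                found = search(rest[i:], patterns + [candidate],
                               sequence + 'ABC'[len(patterns)])
                if found:
                    return found
        return None

    return search(moves, [], '')

On the example string this finds three patterns and an encoding such as ABACABCBCB. The search is exponential in the worst case, but the 20-character cap keeps candidate patterns short; if the main A/B/C sequence is also length-capped, that would be one extra check inside search.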

How to Sort By using J

I'm used to the sort-by operation which many languages provide. It takes some comparator and sorts by it.
What I want to do is sort the following words first by length and then by letter order. Help me please.
I didn't find anything about it in the Phrases or Dictionary on jsoftware, apart from sorting and grading numerical values.
words=: >;:'CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER'
] alpha=: a. {~ (i.26) + a.i.'A'
ABCDEFGHIJKLMNOPQRSTUVWXYZ
;/ words /: alpha i. words
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│CLIENT  │CLOUD   │FIREWIRE│LAN     │NETWORK │PEER    │SERVER  │USB     │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
My first crazy idea is to shift each word to the right boundary of the array, e.g.
 ABC
DEFG
  XY
Then assign whitespace the extreme ranking (in the y argument of the sorting primitive), and then shift each word back :D. It would be highly inefficient, and I can't see another J-way.
Update
Here is the Wolfram Language code for my problem:
StringSplit @ "CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER"
~SortBy~
(Reverse @ ComposeList[{StringLength}, #] &)
If I want to prioritize longer words, I just prepend Minus@* to StringLength.
Basically my sorting order here is {{5, "CLOUD"}, {3, "USB"}, {7, "NETWORK"}, ...}.
I can make the same array in J using (,.~ #&.>) applied to boxed words, but how do I use the sorting primitives then? Maybe this is the right first step? I'm still not sure, but it sounds much better than my first guess :).
As requested, I have promoted answers suggested in the comments to the main body of the answer.
First of all, I don't think it is a good idea to apply > to the list of words because that will add fill which destroys information about the length of each word. So start with
words=: ;:'CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER'
Then you need a function foo that turns each word into a value that will sort in the order you wish using \: or /:
words \: foo words
will do the trick. Or, depending on whether you think hooks are pretty or ugly, you could express that as
(/: foo) words
My original suggestion for foo used #. to express each word as a single number:
foo=: (128&#.)@(a.&i.)@>
foo words
18145864388 1403330 345441182148939 1253582 2870553715410 39730390339053893 2322657797972 168911570
(/: foo) words
┌───┬───┬────┬─────┬──────┬──────┬───────┬────────┐
│LAN│USB│PEER│CLOUD│CLIENT│SERVER│NETWORK│FIREWIRE│
└───┴───┴────┴─────┴──────┴──────┴───────┴────────┘
In a comment, Danylo Dubinin pointed out that instead of encoding the word, we can simply catenate the word length to the front of the index vector and sort using that:
(/: (# , a.&i.)@>) words
┌───┬───┬────┬─────┬──────┬──────┬───────┬────────┐
│LAN│USB│PEER│CLOUD│CLIENT│SERVER│NETWORK│FIREWIRE│
└───┴───┴────┴─────┴──────┴──────┴───────┴────────┘
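(For comparison with the Wolfram version above: catenating the length in front of the index vector is exactly the pair key a key-function sort would use in other languages. For illustration only, in Python:)

words = "CLOUD USB NETWORK LAN SERVER FIREWIRE CLIENT PEER".split()
print(sorted(words, key=lambda w: (len(w), w)))
# ['LAN', 'USB', 'PEER', 'CLOUD', 'CLIENT', 'SERVER', 'NETWORK', 'FIREWIRE']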

What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?

I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. I've thought about this, and the only thing I can think of is to parse the entire CSV file into one long list, and then feed the entire list into Accelerate.
My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there is a CSV Enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the Array, or at least the number of elements, must be known before generating an array using Accelerate.
Has anyone dealt with this kind of problem before?
Thanks!
I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:
-- Imports assumed for this snippet (conduit and vector packages):
import Control.Monad.Primitive (PrimMonad)
import Control.Monad.Trans.Class (lift)
import Data.Conduit (ConduitM, await)
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Generic.Mutable as GMV

-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go,
    -- and also exactly the number of elements in v so far
    go i v = do
        res <- await
        case res of
            Nothing -> do
                v' <- lift $ GV.freeze $ GMV.slice 0 i v
                return $! (i, v')
            Just x -> do
                v' <- if GMV.length v == i
                          then lift $ GMV.grow v by
                          else return v
                lift $ GMV.write v' i x
                go (i+1) v'
It basically allocates "by" empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector once again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see whether there will be other, more efficient answers here.
Hope this helps in some way. I do see there's a fromVector function in repa and perhaps that's your golden ticket in combination with this method.
I haven't tried reading CSV files into repa, but I recommend using cassava (http://hackage.haskell.org/package/cassava). IIRC I had a 1.5 GB file which I used to create my stats. With cassava, my program ran in a surprisingly small amount of memory. Here's an extended example of usage:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do) then one would hope the space usage would also grow incrementally. It certainly is worth an experiment. And possibly also contacting the repa folks. Please report back on your results :-)

Find the minimum number of needed transformations given input and target string

Given that I have an input string, for example: aab
And I am given a target string, for example: bababa
And then I am given a set of transformation rules. For example:
ab -> bba
b -> ba
How could I write, in C, an algorithm that finds the minimum number of transformations that need to be applied to the input string to get the target string?
In this example, the number would be 3, because we would do:
1 - Apply rule 1 (abba)
2 - Apply rule 1 again (bbaba)
3 - Apply rule 2 (bababa)
It could happen that for a given input and target there is no solution, and that should be detected too.
I am pretty much lost on strategies for doing this. Creating an automaton comes to mind, but I am not sure how I would apply it in this situation. I think it is an interesting problem, and I have been researching online, but all I can find is how to apply transformations given rules, not how to ensure the result is a minimum.
EDIT: As one of the answers suggested, we could do a graph starting from the initial string and create nodes that are the result of applying transformations to the previous node. However, this brings some problems, from my point of view:
Imagine that I have a transformation that looks like a --> ab, my initial string is 'a', and my target string is 'c'. So I keep doing transformations (growing the graph) like this:
a -> ab
ab -> abb
abb -> abbb
...
How would I know when I need to stop building the graph?
Say I have the following string aaaa, and I have a transformation rule like aa->b. How would I create the new nodes? I mean, how would I find the substrings in that input string and remember them?
I don't think there is an efficient solution for this. I think you have to do breadth-first search. By doing that, you will know that as soon as you have a solution, it is a shortest solution.
EDIT:
(Image: modifying the string breadth-first, layer by layer.)
Every layer is made from the previous one by applying all possible rules at all possible positions. For example, the b -> ba rule can be applied to abba once for each b. It is important to apply only a single rule at a single position, and then remember each resulting string (e.g. ababa and abbaa) in a list. You have to have each layer completely in a list in your program before you start the next layer (= breadth-first).
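Here is a hedged sketch of that search in Python (the rule format is mine). Note how a find loop enumerates every occurrence of a left-hand side, which also answers your "how would I find the substrings" question:

from collections import deque

def min_transformations(start, target, rules):
    # rules are (lhs, rhs) pairs, e.g. [("ab", "bba"), ("b", "ba")]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, depth = queue.popleft()
        if current == target:
            return depth              # BFS: the first hit is a shortest one
        for lhs, rhs in rules:
            i = current.find(lhs)
            while i != -1:            # every occurrence, not just the first
                child = current[:i] + rhs + current[i + len(lhs):]
                # Pruning strings longer than the target is only sound
                # when no rule shrinks the string, as is the case here.
                if len(child) <= len(target) and child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
                i = current.find(lhs, i + 1)
    return -1                         # target not reachable

min_transformations("aab", "bababa", [("ab", "bba"), ("b", "ba")]) returns 3, matching the example in the question.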
EDIT 2:
You write that you now have a target c. For this you obviously need a rule of the form XX -> c. So say you have the rule aaa -> c. Now in layer 2 you will have the string aaa, which came from some a -> aa rules. You will then apply a -> aa again and get aaaa; that is OK. Since you go breadth-first, you will THEN apply the aaa -> c rule to aaa, and now layer 3 consists of aaaa, c and others. You do not continue modifying aaaa, because that would belong to layer 4; you already found the target c in layer 3, so you can stop.
EDIT 3:
You now ask how you can decide, for an unspecified set of rules, when to stop layering. In general it is impossible; this is the halting problem (https://en.wikipedia.org/wiki/Halting_problem).
BUT for specific rules you can tell whether you can ever reach the target.
Example 1: the target contains an atom that no rule can produce (your 'c' example), so it is unreachable.
Example 2: your rules all either increase the string's length or keep it the same (no rule decreases the length); then you can stop layering as soon as every string in a layer is longer than the target.
Example 3: you can drop certain rules if you found by algorithm that they are cyclic.
Other examples exist.

How do spell checkers work? [closed]

I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.
I'd kind of like to write this myself, though I really don't know where to begin.
Read up on Tree Traversal. The basic concept is as follows:
Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online, such as Oracle's example dictionary.
Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter), plus a flag to indicate whether or not the current node is the end of a valid word.
Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and test each of them against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.
A really short example:
Dictionary: apex apple appoint appointed
Tree: (* indicates valid end of word)
A -> P -> E -> X*
      \-> P -> L -> E*
             \-> O -> I -> N -> T* -> E -> D*
update: Thank you to Curt Sampson for pointing out that this data structure is called a Patricia Tree.
Document: apple appint ape
Results:
"apple" will be found in the tree, so it is considered correct.
"appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
"ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.
edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of changes that must be made to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.
Given you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell
(LGPL licensed). If you really have to implement it yourself, please tell us why.
One should look at prefixes and suffixes.
suddenly = sudden + ly.
By removing the "ly" you can get away with storing just the root word.
Likewise preallocate = pre + allocate.
And lovingly = love + ing + ly gets a bit more complex, as the English rules for "ing" get invoked.
There is also the possibility of using some sort of hashing function to map a root word to a specific bit in a large bit map, as a constant-time method of determining whether the root word is spelled correctly.
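A hedged sketch of that bit-map idea in Python (this is essentially a Bloom filter; the map size, hash choice and names here are arbitrary):

import hashlib

BITS = 1 << 20                           # 1 Mibit map, arbitrary size

def positions(word, k=3):
    # Derive k bit positions from k salted hashes of the root word.
    for salt in range(k):
        digest = hashlib.sha256(f"{salt}:{word}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % BITS

def add(bitmap, word):
    for p in positions(word):
        bitmap[p // 8] |= 1 << (p % 8)

def probably_correct(bitmap, word):
    # Constant time; may give false positives, never false negatives.
    return all(bitmap[p // 8] & (1 << (p % 8)) for p in positions(word))

bitmap = bytearray(BITS // 8)
add(bitmap, "sudden")
print(probably_correct(bitmap, "sudden"))   # True
print(probably_correct(bitmap, "suden"))    # almost certainly False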
You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.
I would advise prototyping with a small set of words. Do a lot of testing, then scale up.
It is a wonderful educational problem.
Splitting a word into root and suffix is known as the "Porter stemming algorithm"; it's a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so "spell checker" will also find "spelling check" and "spell checking".
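As a toy illustration of the suffix stripping (Python; this is NOT the real Porter algorithm, which has many more rules and conditions):

def naive_stem(word, suffixes=("ing", "ly", "ed", "s")):
    # Strip one known suffix if what remains is a plausible root.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

print(naive_stem("suddenly"))   # sudden
print(naive_stem("checking"))   # check
print(naive_stem("spell"))      # spell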
I've done this in class.
You should consider the Python Natural Language Toolkit (NLTK), which is made specifically to handle this.
It also allows you to create text interpreters such as chatbots.
The OpenOffice spell checker Hunspell can be a good starting point. Here is the homepage:
Hunspell at Sourceforge
E James gives a great answer for how to tell whether a word is valid. How they determine likely misspellings probably depends on the spell checker.
One such method, and the one that I would use, is the Levenshtein distance, which looks at how many letters must be added, removed, or swapped in one word in order to make another word.
Say you spelled "country" as "contry". The Levenshtein distance would be 1, since you only have to add one letter to transform "contry" into "country".
You could then loop through all possible correct spellings of words (only about 171,000 English words, and 3,000 of those account for 95% of text), determine those with the lowest Levenshtein distance, and then return the top X words that are most similar to the misspelled word.
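For reference, a standard dynamic-programming implementation of that distance (Python; a sketch, not tuned for speed):

def levenshtein(a, b):
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("contry", "country"))   # 1, as in the example above
# Ranking suggestions is then e.g.:
#   sorted(dictionary, key=lambda w: levenshtein(misspelled, w))[:5]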
There's a great Python package called FuzzyWuzzy which implements this efficiently and generates a % similarity between two words or sentences based on this formula.
