Find repeating patterns in a string

In today's Advent of Code puzzle, users were tasked with finding repeating patterns in a string and outputting them.
For example, my string was:
R8L10L12R4R8L12R4R4R8L10L12R4R8L10R8R8L10L12R4R8L12R4R4R8L10R8R8L12R4R4R8L10R8R8L12R4R4
This consists of 3 repeating patterns:
A = R8L10L12R4
B = R8L12R4R4
C = R8L10R8
which means my string could be encoded as ABACABCBCB.
How would I do this algorithmically? I did it by hand, and all the other solutions I looked at were also done by hand. What is the programmatic way to solve this?
Edit:
Information we had:
a maximum of 3 patterns
R8L10L12R4 would be transformed to R,8,L,10,L,12,R,4; the transformed pattern could not be longer than 20 characters
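For what it's worth, these constraints make a plain backtracking search practical. A minimal Python sketch (the function name and tokenizing regex are my own, not from the thread): walk the move list left to right and, at each position, either reuse an already chosen pattern or, if fewer than three exist, try every new pattern that fits the 20-character limit.

    import re

    def decompose(path, max_patterns=3, max_chars=20):
        tokens = re.findall(r"[RL]\d+", path)    # ["R8", "L10", ...]

        def csv_len(pat):
            # Length in the transformed form, e.g. "R,8,L,10,L,12,R,4".
            return len(",".join(t[0] + "," + t[1:] for t in pat))

        def solve(pos, patterns):
            if pos == len(tokens):
                return [], patterns
            for name, pat in patterns:           # reuse a known pattern
                if tokens[pos:pos + len(pat)] == pat:
                    found = solve(pos + len(pat), patterns)
                    if found:
                        return [name] + found[0], found[1]
            if len(patterns) < max_patterns:     # or introduce a new one
                name = "ABC"[len(patterns)]
                for end in range(pos + 1, len(tokens) + 1):
                    pat = tokens[pos:end]
                    if csv_len(pat) > max_chars:
                        break
                    found = solve(end, patterns + [(name, pat)])
                    if found:
                        return [name] + found[0], found[1]
            return None                          # dead end: backtrack

        return solve(0, [])

    path = ("R8L10L12R4R8L12R4R4R8L10L12R4R8L10R8R8L10L12R4"
            "R8L12R4R4R8L10R8R8L12R4R4R8L10R8R8L12R4R4")
    print(decompose(path))   # one valid (encoding, patterns) pair

Because the search is exhaustive, it finds a valid decomposition whenever one exists; it may label the patterns differently than a hand-made solution.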

Related

How to do Norvig spell check for chinese characters mixed with english letters?

I have a list of product names written in a mixture of English letters, numbers, and Chinese characters stored in my database.
There is a table called products with the fields name_en, name_zh amongst others.
E.g.
AB 10"机翼
Peter Norvig has a fantastic algorithm for spell check but it only works for English.
I was wondering if there's a way to do something similar for a narrow list of terms containing Chinese characters?
E.g., misspellings such as
A10机翼
AB 10鸡翼
AB 10鸡一
AB 10木几翼
should all suggest AB 10"机翼 as the correct spelling.
How do I do this?
You have a much more complex problem than Norvig's:
Chinese Input-method
The misspellings in your case (at least in your example) are mostly caused by the pinyin input method. The same typed pinyin "jiyi" (English: airplane wing) can produce different Chinese phrases (the sketch after the list makes the collision concrete):
机翼
鸡翼
鸡一
几翼
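A pinyin library makes the collision visible. A small sketch with pypinyin (my choice of library; the answer itself names no tool):

    # pip install pypinyin
    from pypinyin import lazy_pinyin

    for phrase in ["机翼", "鸡翼", "鸡一", "几翼"]:
        print(phrase, lazy_pinyin(phrase))
    # All four print ['ji', 'yi']: indistinguishable under tone-less
    # pinyin input.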
Chinese Segmentation
Also, in Chinese, breaking a long sentence into small tokens with semantic meaning requires segmentation. For example:
飞机模型零件 (before segmentation; "airplane model parts")
飞机-模型-零件 (after segmentation you get three phrases separated by '-')
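In code, a segmenter such as jieba (a widely used Python library; my pick, the answer does not prescribe one) produces that split:

    # pip install jieba
    import jieba

    print(jieba.lcut("飞机模型零件"))
    # expected: ['飞机', '模型', '零件'], i.e. 飞机-模型-零件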
Work on the token-level
You can probably experiment starting from a list of misspellings; I guess you can collect a bunch of them from your user logs. Take one misspelling at a time. Using your example:
AB 10鸡翼
First break it into tokens:
A-B-10-鸡翼
(here you probably need a Chinese segmentation algorithm to realize that 鸡翼 should be treated as a single token).
Then you should try to find its nearest neighbor in your product DB using the edit-distance idea. Note that:
you do not remove/edit/replace one character at a time, but one token at a time;
when editing/replacing, limit the candidates to near neighbors of the original token, for example 鸡翼 -> 机翼, 几翼, 机一 (see the sketch below).
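A sketch of that token-level edit distance: the usual Levenshtein recurrence, but over token lists, with substitution allowed only between near neighbors. The near() predicate is a hypothetical helper you would back with a pinyin table like the one above:

    def token_edit_distance(a, b, near):
        # a, b: token lists, e.g. ["AB", "10", "鸡翼"] vs ["AB", "10", "机翼"].
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                      # delete i tokens
        for j in range(n + 1):
            d[0][j] = j                      # insert j tokens
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    sub = 0
                elif near(a[i - 1], b[j - 1]):
                    sub = 1                  # replace by a near neighbor
                else:
                    sub = float("inf")       # arbitrary replacements banned
                d[i][j] = min(d[i - 1][j] + 1,         # delete a token
                              d[i][j - 1] + 1,         # insert a token
                              d[i - 1][j - 1] + sub)   # substitute
        return d[m][n]

Rank your product names by this distance against the segmented query and suggest the closest one.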
Build Lucene index
You can also try to tackle the problem in a different way, starting from your correct product names. Treat each product name as a document and pre-build a Lucene index from them. Then, for each user query, the query-matching problem becomes a search problem: issue a query to the search engine to find the best-matching documents in your DB. In this case, I believe Lucene would probably take care of the segmentation (if not, you would need to extend its functionality to suit your own needs) and tokenization for you.

Find the minimum number of needed transformations given input and target string

Given that I have an input string, for example: aab
And I am given a target string, for example: bababa
And then I am given a set of transformation rules. For example:
ab -> bba
b -> ba
How could I write, in C, an algorithm that finds the minimum number of transformations that need to be applied to the input string to get the target string?
In this example the number would be 3, because we would do:
1 - Apply rule 1 (abba)
2 - Apply rule 1 again (bbaba)
3 - Apply rule 2 (bababa)
It could happen that, given an input and a target, there is no solution, and that should be detected too.
I am pretty much lost on strategies for doing this. Building an automaton comes to mind, but I am not sure how I would apply it in this situation. I think it is an interesting problem; I have been researching online, but all I can find is how to apply transformations given rules, not how to ensure the count is minimal.
EDIT: As one of the answers suggested, we could build a graph starting from the initial string, creating nodes that are the result of applying transformations to the previous node. However, this brings some problems, from my point of view:
Imagine that I have a transformation that looks like a -> ab, my initial string is 'a', and my target string is 'c'. I keep applying transformations (growing the graph) like this:
a -> ab
ab -> abb
abb -> abbb
...
How would I know when I need to stop building the graph?
Say I have the following string aaaa, and I have a transformation rule like aa->b. How would I create the new nodes? I mean, how would I find the substrings in that input string and remember them?
I don't think there is an efficient solution for this; I think you have to do breadth-first search. By doing that, you will know that as soon as you find a solution, it is a shortest solution.
EDIT:
[Image: modifying the string breadth-first, layer by layer]
Every layer is made from the previous one by applying every possible rule at every possible position. For example, the b->ba rule can be applied to abba once for each b. It is important to apply only a single rule and then remember the resulting string (e.g. ababa and abbaa) in a list. You have to have each layer completely in a list in your program before you start the next layer (= breadth first).
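A minimal Python sketch of this layered search, using the strings and rules from the question (the length cap that makes it terminate is only sound when no rule shrinks the string; see EDIT 3 below):

    from collections import deque

    def min_transformations(start, target, rules):
        seen = {start}
        queue = deque([(start, 0)])          # (string, layer number)
        while queue:
            s, depth = queue.popleft()
            if s == target:
                return depth
            for lhs, rhs in rules:
                # Apply the rule once at every position where it matches.
                pos = s.find(lhs)
                while pos != -1:
                    t = s[:pos] + rhs + s[pos + len(lhs):]
                    if len(t) <= len(target) and t not in seen:
                        seen.add(t)
                        queue.append((t, depth + 1))
                    pos = s.find(lhs, pos + 1)
        return None                          # unreachable under the cap

    print(min_transformations("aab", "bababa",
                              [("ab", "bba"), ("b", "ba")]))   # 3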
EDIT 2:
You write that you now have an output c. For this you obviously need a rule XX->c. So say you have the rule aaa->c. Now in layer 2 you will have the string aaa, which came from some a->aa rules. You will then apply a->aa again and get aaaa; that is OK. Since you go breadth first, you will THEN apply the aaa->c rule to aaa, and layer 3 now consists of aaaa, c, and others. You do not continue modifying aaaa, because that would belong to layer 4; you already found the target c in layer 3, so you can stop.
EDIT 3:
You now ask how, for an arbitrary set of rules, you can decide when to stop adding layers. In general it is impossible; this is essentially the halting problem: https://en.wikipedia.org/wiki/Halting_problem
BUT for specific rule sets you can tell whether you can ever reach the output.
Example 1: the target contains an atom that no rule can provide (your 'c' example).
Example 2: your rules all either increase the string's length or keep it the same (no rule decreases the length); then you can prune every string longer than the target, and the search terminates.
Example 3: you can drop certain rules if an algorithm finds that they are cyclic.
Other examples exist

Most efficient way to store a big DNA sequence?

I want to ship a giant DNA sequence with an iOS app (about 3,000,000,000 base pairs). Each base pair can have the value A, C, T or G. Storing each base pair in one byte would give a file of 3 GB, which is way too much. :)
Now I thought of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base-pair sequences on disk? Memory is not a problem, as I read in chunks.
I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
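The two-bits-per-base layer itself is only a few lines. A sketch with my own encoding table (four bases per byte, which is exactly the question's 750 MB figure before any further compression):

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASE = "ACGT"

    def pack(seq):
        out = bytearray()
        for i in range(0, len(seq), 4):        # four bases per byte
            chunk = seq[i:i + 4]
            byte = 0
            for base in chunk:
                byte = (byte << 2) | CODE[base]
            byte <<= 2 * (4 - len(chunk))      # left-align a short tail
            out.append(byte)
        return bytes(out)                      # store len(seq) separately

    def unpack(data, length):
        seq = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                seq.append(BASE[(byte >> shift) & 3])
        return "".join(seq[:length])

    s = "ACGTACCT"
    assert unpack(pack(s), len(s)) == s

Further compression, as the paper describes, would then be applied on top of this packed representation.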
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.
If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specific applications can lead to different storage strategies. For example, the last paper I cited deals with lossy compression of DNA...
Base pairs always pair up, so you should only have to store one strand. Now, I doubt that this works if there are certain mutations in the DNA (like a thymine dimer) that cause the opposite strand not to be the exact complement of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea, if it's an iOS app, is just putting a reader on the device and reading the sequence from a web service.
Use a diff from a reference genome. From the size (3 Gbp) that you post, it looks like you want to include a full human sequence. Since sequences don't differ much from person to person, you should be able to compress massively by storing only a diff.
That could help a lot, unless your goal is to store the reference sequence itself; then you're stuck.
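A toy illustration of the diff idea, covering substitutions only (real genomes also need insertions and deletions, and real pipelines use formats such as VCF; everything here is made up for the sketch):

    def make_diff(reference, genome):
        # Assumes equal lengths; record only the differing positions.
        return [(i, b) for i, (a, b) in enumerate(zip(reference, genome))
                if a != b]

    def apply_diff(reference, diff):
        seq = list(reference)
        for i, base in diff:
            seq[i] = base
        return "".join(seq)

    ref    = "ACGTACGTACGT"
    sample = "ACGTACCTACGT"              # one substitution
    diff = make_diff(ref, sample)        # [(6, 'C')]
    assert apply_diff(ref, diff) == sample

Human genomes differ in roughly 0.1% of positions, which is where the massive savings come from.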
Consider this: how many different combinations can you get out of the 4 bases? Taking them two at a time there are 16 (4 x 4), so each pair of bases can be mapped to a single symbol:
aa = 1
ac = 2
ag = 3, and so on, so that
you can turn the sequence into an array like [1, 2, 3]. Then you can go one step further:
check if 1 is often followed by 2, and convert the pair 12 to a, 13 to b, and so on...
If I understand DNA a bit, the two strands constrain each other (a must be matched with t, and c with g, or something like that), which reduces your options,
so basically you can look for a sequence and give it a code that you can also convert back...
You want to look into a 3D space-filling curve. A 3D SFC reduces 3D complexity to 1D complexity; it's a little like an octree or an R-tree. If you can store your full DNA in an SFC, you can look for similar tiles in the tree, although an SFC is most likely to be used with lossy compression. Maybe you can use a block-sorting transform like the BWT (Burrows-Wheeler transform) if you know the size of the tiles, and then try an entropy coder such as Huffman coding or a Golomb code?
You can use tools like MFCompress, Deliminate, or Comrad. These tools achieve an entropy of less than 2 bits per symbol; that is, storing each symbol takes fewer than 2 bits.

Minimum Number of Operations needed

I have a problem. Suppose I have a given string, "best", and a target string, say "beast". I have to determine the number of operations needed to convert the given string into the target string, where the allowed operations are:
1. add a character to string.
2. delete a character.
3. swap two characters' positions (use wisely: we have only one chance to swap).
In the above case it is 1.
How do we solve this kind of problem, and what kind of problem is it?
I am a newbie learner.
One widely-used measure of this kind of thing is called the Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance
The WP page also mentions/links to other similar concepts. It is essentially a metric of the number of edits needed to turn one word into another.
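The question's operation set is slightly different from plain Levenshtein (no substitution, and at most one swap), but the same dynamic-programming style applies. A sketch of the closely related optimal-string-alignment variant, which also counts transpositions of adjacent characters (it does not enforce the one-swap budget; that would need an extra dimension in the table):

    def osa_distance(a, b):
        # Levenshtein plus adjacent transpositions.
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # delete
                              d[i][j - 1] + 1,          # insert
                              d[i - 1][j - 1] + cost)   # substitute
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap
        return d[m][n]

    print(osa_distance("best", "beast"))   # 1 (insert 'a')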
Levenshtein distance

number combination algorithm

Write a function that, given a string of digits and a target value, prints where to put +'s and *'s between the digits so they combine exactly to the target value. Note there may be more than one answer; it doesn't matter which one you print.
Examples:
"1231231234",11353 -> "12*3+1+23*123*4"
"3456237490",1185 -> "3*4*56+2+3*7+490"
"3456237490",9191 -> "no solution"
If you have an N-digit value, there are N-1 possible slots for a + or * operator. So brute force, there are 3^(N-1) possibilities. Testing all of these is inefficient.
BUT
Your examples are all 10 digits. 3^9 = 19683, so brute force is FINE! No need to get any fancier.
So all you need to do is iterate through all 19683 cases, each time building a string for that case and evaluating the expression. Evaluating the expression is a straightforward task. Iterating is straightforward too: just use an incrementing counter; the state of the first slot is (i%3), which gives you "no operator", "+", or "*"; the state of the second slot is (i/3)%3, the state of the third slot is (i/9)%3, and so on.
Even with crude parsing code, CPUs are fast.
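A sketch of exactly this enumeration, with a tiny evaluator instead of a real parser (the grammar is just sums of products, and avoiding eval() sidesteps problems with operands that have leading zeros):

    def evaluate(expr):
        total = 0
        for term in expr.split("+"):        # '+' binds loosest
            product = 1
            for factor in term.split("*"):
                product *= int(factor)
            total += product
        return total

    def solve(digits, target):
        for i in range(3 ** (len(digits) - 1)):
            expr, code = digits[0], i
            for d in digits[1:]:            # read slots as base-3 digits
                expr += ["", "+", "*"][code % 3] + d
                code //= 3
            if evaluate(expr) == target:
                return expr
        return "no solution"

    print(solve("1231231234", 11353))   # e.g. 12*3+1+23*123*4
    print(solve("3456237490", 9191))    # no solution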
The brute force option starts becoming ugly after about 20 digits, and you'd have to switch to be more clever.
If this is for the gaming programmer position, do not use the brute-force approach. I did that and failed a couple of years ago. I later heard from someone inside that the dynamic programming approach is the one that gets the job.
This can be solved either by backtracking or by dynamic programming.
The "cleverer" approach (using dynamic programming) is basically this:
For each substring of the original string, figure out all possible values it can create. (E.g., in your first example, "12" can stay 12, or become 1+2=3 or 1*2=2.)
There may be a lot of different combinations, but many of them will be duplicates. (Also, you should ignore all combinations that are greater than the target).
Thus, when you add a "+" or a "*", you can envision it as combining two substrings of the string. (and since you have the possible values for each substring, you can see if such a combination is possible)
These values can be generated similarly: try splitting the substring in all possible ways, and combine the different values in each half of the substring.
The total number of "states", then, is something like |S|^2 * target; for your example case that's worse than the brute-force method, but if you had a string of length 1000 and a target of, say, 5000, the problem would be solvable with dynamic programming.
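One subtlety when coding this up: the printed expression has no parentheses, so '*' binds tighter than '+', and splitting a substring into arbitrary halves would also count values like (2+3)*4 = 20 that no flat expression over "234" can produce. Treating the expression as a sum of product terms avoids that. A sketch along those lines (the pruning against the target is valid because every operand is non-negative):

    from functools import lru_cache

    def find_expression(digits, target):
        n = len(digits)

        @lru_cache(maxsize=None)
        def terms(i, j):
            # Every value digits[i:j] can make with '*' only: value -> expr.
            # Not pruned: a large product can still drop back to 0 when a
            # later factor contains the digit 0.
            out = {int(digits[i:j]): digits[i:j]}
            for k in range(i + 1, j):
                for v, e in terms(k, j).items():
                    out.setdefault(int(digits[i:k]) * v,
                                   digits[i:k] + "*" + e)
            return out

        @lru_cache(maxsize=None)
        def sums(i):
            # Every value digits[i:] can make as term(+term)*: value -> expr.
            # Pruning above the target is safe here: summands only add up.
            out = {}
            for j in range(i + 1, n + 1):
                for tv, te in terms(i, j).items():
                    if j == n:
                        if tv <= target:
                            out.setdefault(tv, te)
                        continue
                    for sv, se in sums(j).items():
                        if tv + sv <= target:
                            out.setdefault(tv + sv, te + "+" + se)
            return out

        return sums(0).get(target, "no solution")

    print(find_expression("1231231234", 11353))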
Google Code Jam had an extended version of this problem last year (in Round 1C), called Ugly Numbers. You can visit that link and click "Contest Analysis" for some approaches to that problem, when extended to large numbers of digits.
