I am looking for an efficient algorithm that allows mismatches (at most 3) when comparing a pattern with a text. The original KMP algorithm does the exact-matching job efficiently on my data, but I was considering how to extend it to accommodate mismatches.
For my case: GACCCT is considered a match with GGGGGAGGTTTTTT, with start position 4 (0-based) in the second sequence, i.e. exactly three mismatches.
I need to do pairwise comparisons between two files, each containing approximately 500,000 sequences. The sequences in one file are relatively short (~50 bases), while those in the other are longer (~200).
I tried the regex package in Python, the Levenshtein algorithm, and edit distances, but they are slow and I would have to wait a couple of weeks for the work to finish.
I think your data isn't too large, so maybe this will work:
I think you should create a suffix tree for your data. Once you have done this, finding substrings will be very easy, whether or not you want to count mismatches: you just traverse the tree with the characters you're looking for until you've either found a substring or hit the maximum number of mismatches you can tolerate.
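To illustrate the traversal step, here is a minimal sketch assuming a simplified suffix trie over the text (one character per edge); the node type and report_match are hypothetical placeholders, not from any library:

    #include <stdio.h>

    #define ALPHABET 4            /* hypothetical: A, C, G, T mapped to 0..3 */
    #define MAX_MISMATCHES 3

    /* Hypothetical suffix-trie node: one character per edge. */
    struct trie_node {
        struct trie_node *child[ALPHABET];
        /* ... plus whatever you store to recover text positions ... */
    };

    /* Placeholder: called when the whole pattern has been consumed. */
    static void report_match(const struct trie_node *node, int mismatches)
    {
        (void)node;
        printf("pattern found with %d mismatches\n", mismatches);
    }

    /* Walk the trie with the pattern, branching on every child and
     * charging one mismatch whenever the edge character differs. */
    void search(const struct trie_node *node, const int *pattern, size_t len,
                size_t i, int mismatches)
    {
        if (node == NULL || mismatches > MAX_MISMATCHES)
            return;
        if (i == len) {                 /* consumed the whole pattern */
            report_match(node, mismatches);
            return;
        }
        for (int c = 0; c < ALPHABET; c++)
            search(node->child[c], pattern, len, i + 1,
                   mismatches + (c != pattern[i]));
    }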
If you want at most three mismatches, there's a simple but kind of daft algorithm that'll work on most real cases. Break your pattern into four contiguous parts, arbitrarily. (It is probably useful for them to match a random text location with roughly the same probability.) By the pigeonhole principle, any occurrence with at most three mismatches must contain at least one of the four parts matching exactly. So find all exact matches of your four parts in the text, and check by brute force which of those candidates completes to an at-most-three-mismatches match.
Mehrdad's solution of using a suffix tree is better in general, but it requires more programming effort.
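For what it's worth, here is a minimal C sketch of the partition idea, using a naive exact search for the pieces (swap in strstr or KMP in practice). The helper names and the example strings are just for illustration:

    #include <stdio.h>
    #include <string.h>

    #define MAX_MISMATCHES 3

    /* Count mismatches of pattern against text starting at pos. */
    static int mismatches_at(const char *text, size_t tlen,
                             const char *pat, size_t plen, size_t pos)
    {
        if (pos + plen > tlen)
            return MAX_MISMATCHES + 1;          /* pattern does not fit here */
        int mm = 0;
        for (size_t i = 0; i < plen && mm <= MAX_MISMATCHES; i++)
            mm += (text[pos + i] != pat[i]);
        return mm;
    }

    /* Pigeonhole filter: split the pattern into 4 pieces; any <=3-mismatch
     * occurrence must contain at least one piece matching exactly.
     * Note: the same start can be reported once per matching piece;
     * a real implementation would deduplicate candidate starts. */
    static void find_approx(const char *text, const char *pat)
    {
        size_t tlen = strlen(text), plen = strlen(pat);
        size_t piece = plen / 4;
        if (piece == 0) return;                 /* pattern too short to split */

        for (int k = 0; k < 4; k++) {
            size_t off = k * piece;
            size_t len = (k == 3) ? plen - off : piece;
            /* naive exact search for this piece */
            for (size_t pos = 0; pos + len <= tlen; pos++) {
                if (memcmp(text + pos, pat + off, len) != 0)
                    continue;
                if (pos < off)                  /* candidate start would be negative */
                    continue;
                size_t start = pos - off;       /* candidate start of the full pattern */
                int mm = mismatches_at(text, tlen, pat, plen, start);
                if (mm <= MAX_MISMATCHES)
                    printf("match at %zu with %d mismatches\n", start, mm);
            }
        }
    }

    int main(void)
    {
        find_approx("GGGGGAGGTTTTTT", "GACCCT");  /* expects a hit at 0-based position 4 */
        return 0;
    }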
I saw this coding challenge posted somewhere on an Elixir forum and have not quite figured out how to solve it. I have generalized the problem to make it more understandable.
Given an input of a random sequence of numbers, compute the M most common K-length sequences. M and K are constants. For example, compute the 10 most common 3-number sequences from the input.
The input could be potentially very large, so the solution should scale to any size.
I know that storing the sequences in a hash table in a higher-level language is potentially the simplest, most efficient solution, but I'd like to find another solution that can be done in C without any hash functions.
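For concreteness, one hash-free direction is to materialize every K-length window, sort the windows, and count runs of equal windows. A minimal sketch under the assumption that the input fits in memory (the constants, the toy input, and the names are made up for the example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 3          /* window length (made-up constant for the example) */
    #define M 10         /* how many top windows to report */

    /* Byte-wise ordering is fine here: we only need equal windows to be adjacent. */
    static int cmp_window(const void *a, const void *b)
    {
        return memcmp(a, b, K * sizeof(int));
    }

    int main(void)
    {
        int input[] = {1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 2, 1};
        size_t n = sizeof(input) / sizeof(input[0]);
        size_t nwin = n - K + 1;

        /* Materialize every K-length window. */
        int (*win)[K] = malloc(nwin * sizeof(*win));
        if (!win) return 1;
        for (size_t i = 0; i < nwin; i++)
            memcpy(win[i], &input[i], K * sizeof(int));

        /* Sort so equal windows become adjacent, then count runs. */
        qsort(win, nwin, sizeof(*win), cmp_window);

        struct { int w[K]; size_t count; } top[M] = {0};
        for (size_t i = 0; i < nwin; ) {
            size_t j = i + 1;
            while (j < nwin && cmp_window(win[i], win[j]) == 0)
                j++;
            size_t count = j - i;
            /* Insert into the small, fixed-size top-M list. */
            for (int t = 0; t < M; t++) {
                if (count > top[t].count) {
                    memmove(&top[t + 1], &top[t], (M - t - 1) * sizeof(top[0]));
                    memcpy(top[t].w, win[i], sizeof(top[t].w));
                    top[t].count = count;
                    break;
                }
            }
            i = j;
        }

        for (int t = 0; t < M && top[t].count > 0; t++)   /* printing assumes K == 3 */
            printf("%d %d %d -> %zu times\n", top[t].w[0], top[t].w[1], top[t].w[2], top[t].count);

        free(win);
        return 0;
    }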
For uploading a file to a service, I was calculating the md5 based on the whole content of the file.
I was asked to do it in a different way: compute the md5 of the whole file, and then also of a few more parts: 2% from the start of the file, 2% starting at 1/3 of the file, 2% starting at 2/3 of the file, and 2% at the end of the file; then hash those results together with the file's size, with the file size in bytes appended at the end.
Apparently this solves hash collisions between files. To me it seems like a waste of time, since you're not increasing the size of the md5. So for a huge number of files, you're still going to have, statistically, the same number of collisions.
Please help me understand the reasoning behind this.
EDIT: we are then hashing the resulting hashes.
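To make the scheme concrete, here is a rough sketch of one possible reading of it, over an in-memory buffer and using OpenSSL's MD5() (deprecated since 3.0 but still available). The function name and the layout of the combined buffer are my guesses, not the service's actual specification:

    #include <stdint.h>
    #include <string.h>
    #include <openssl/md5.h>

    void combined_hash(const unsigned char *data, size_t n,
                       unsigned char out[MD5_DIGEST_LENGTH])
    {
        size_t chunk = n / 50;                    /* 2% of the file */
        size_t offsets[4] = { 0, n / 3, 2 * n / 3, n > chunk ? n - chunk : 0 };

        unsigned char buf[5 * MD5_DIGEST_LENGTH + sizeof(uint64_t)];
        MD5(data, n, buf);                        /* digest of the whole file */
        for (int i = 0; i < 4; i++)               /* digests of the four 2% slices */
            MD5(data + offsets[i], chunk, buf + (i + 1) * MD5_DIGEST_LENGTH);

        uint64_t size = (uint64_t)n;              /* append the file size in bytes */
        memcpy(buf + 5 * MD5_DIGEST_LENGTH, &size, sizeof(size));

        MD5(buf, sizeof(buf), out);               /* "hashing the resulting hashes" */
    }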
A good cryptographically strong hashing algorithm is already designed with the goal to make it infeasible to intentionally find two different pieces of data with the same hash, let alone by accident. Therefore, just hashing the file is sufficient. Extra hashing of parts of the file is pointless.
This may seem unintuitive because obviously there must exist collisions if the length of the hash is shorter than the length of the data. However, it is not feasible to find these collisions because an MD5 hash is an unpredictable 128-bit number and the amount of possible 128-bit numbers (2^128) is mind-boggling. If you could count at a rate of a trillion trillion per second, counting through all 128-bit numbers would still take (2^128 / 1e24) seconds, which is about 10 million years. This is probably a good lower limit on the amount of time it would take to find a hash collision the brute-force way without custom hardware.
That said, this is all assuming that there are no weaknesses in the hashing algorithm that allow you to do better than brute force. MD5 is broken in this regard, so you should not use it if you need to defend against attackers that would try to create collisions. It would be better to use a newer hashing algorithm like SHA-2 or SHA-3. (These also support even larger outputs such as 256 bits.)
Sounds like a dangerous practice, because you're re-hashing without factoring in a lot of the data. The advantage, however, is that by running the extra hashes you are effectively winding up with a hash signature consisting of "more bits" (i.e. you are getting several MD5 hashes as a result).
If you want to do this - and are in fact okay with having more (larger) hash data to store/compare - you would be MUCH better advised to simply run a different hash function (other than MD5) that is more secure and/or uses a larger number of bits.
MD5 is an "older" algorithm and is known to have cryptographic weaknesses. I'd recommend one of the "SHA" algorithms - like SHA-256 or SHA-512. The advantages are that it is a stronger algorithm, you'd only have to hash the data ONCE, and you'd get more bits than an MD5; and since you're running it only once, it would be faster than the multi-hash scheme.
Note that the possibility of hash collisions always exists. Even "high end" storage products which use hashes for duplicate detection will compare the actual buffers to verify an exact match, even when two hashes match.
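For example, a single pass with OpenSSL's one-shot SHA256() already gives you 256 bits. A minimal sketch over an in-memory buffer (streaming a real file is left out for brevity):

    #include <stdio.h>
    #include <string.h>
    #include <openssl/sha.h>

    int main(void)
    {
        const unsigned char data[] = "file contents would go here";
        unsigned char digest[SHA256_DIGEST_LENGTH];        /* 32 bytes, vs 16 for MD5 */

        SHA256(data, strlen((const char *)data), digest);  /* one pass over the data */

        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf("\n");
        return 0;
    }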
I want to scan the numbers in a big interval wisely until I find the one I need.
But I don't have any clue where this number might be, and I will not get any clue during the search process.
Let me give an example to make my question easier to state:
Assume I am searching for a number between 100000000000000 and 999999999999999.
The naive approach would be to start from 100000000000000 and count up towards 999999999999999 one by one,
but this is not wise, because the number could be at the far end if I am not lucky.
So, what is the best approach to this problem? I am not looking for the mathematically best answer; I need a technique which is easy to implement in the C programming language.
Thanks in advance.
There is no solution to your problem other than more knowledge. If you don't know anything about the number, any strategy for enumerating the candidates is equally good (or bad).
If you suppose that you are fighting against an adversary who is trying to hide the number from you, one strategy would be to make your next move unguessable: randomly pick numbers in the range and test them. (To avoid repetitions, you'd have to use a random permutation of the range.) You'd then find your number after an expected number of probes of about half the total count; that is, you'd gain a factor of two over the worst case. But as said, all of this depends on the assumptions you can make.
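You obviously can't store a random permutation of ~9e14 numbers, but as a cheap stand-in, an affine map x -> (a*x + b) mod N with gcd(a, N) = 1 visits every offset exactly once in a scrambled order. A sketch (the constants are arbitrary, the target test is a placeholder, and this is not a cryptographically random permutation; the 128-bit multiply uses GCC/Clang's __uint128_t):

    #include <stdio.h>
    #include <stdint.h>

    #define LO 100000000000000ULL
    #define HI 999999999999999ULL

    /* Placeholder for the actual test ("is this the number I need?"). */
    static int is_target(uint64_t x) { return x == 123456789012345ULL; }

    int main(void)
    {
        const uint64_t N = HI - LO + 1;          /* 9e14 values */
        /* N = 9e14 = 2^14 * 3^2 * 5^14, so a must be odd and not divisible by 3 or 5. */
        const uint64_t a = 736094280987421ULL;   /* arbitrary, coprime to N */
        const uint64_t b = 40362417491721ULL;    /* arbitrary offset */

        for (uint64_t i = 0; i < N; i++) {
            /* (a*i + b) mod N needs 128-bit intermediate math to avoid overflow. */
            uint64_t x = (uint64_t)(((__uint128_t)a * i + b) % N);
            if (is_target(LO + x)) {
                printf("found %llu after %llu probes\n",
                       (unsigned long long)(LO + x), (unsigned long long)(i + 1));
                break;
            }
        }
        return 0;
    }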
Use bisection search. First see if your number is above or below the middle of the range. Depending on the answer, repeat the process for the upper or lower half of the range, respectively.
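This assumes you actually have some way of asking whether the target is above or below a probe (otherwise bisection cannot work). A minimal sketch with a made-up stand-in for that oracle:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical oracle: >0 if the target is above x, <0 if below, 0 if found. */
    static int compare_to_target(uint64_t x)
    {
        const uint64_t secret = 123456789012345ULL;   /* stand-in for the unknown number */
        if (x < secret) return 1;
        if (x > secret) return -1;
        return 0;
    }

    int main(void)
    {
        uint64_t lo = 100000000000000ULL, hi = 999999999999999ULL;

        while (lo <= hi) {
            uint64_t mid = lo + (hi - lo) / 2;        /* avoids overflow */
            int c = compare_to_target(mid);
            if (c == 0) {
                printf("found %llu\n", (unsigned long long)mid);
                return 0;
            }
            if (c > 0) lo = mid + 1;                  /* target is in the upper half */
            else       hi = mid - 1;                  /* target is in the lower half */
        }
        printf("not in range\n");
        return 0;
    }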
As you already know, there is no algorithmic strategy that reduces the number of candidates you must test. All you can do is speed up the search itself by testing candidates in parallel. So the technically best approach might be to implement the algorithm in OpenCL (which is fairly similar to C and can be used through a C library) and run several hundred tests in parallel, depending on your hardware (GPU).
I hope this isn't too arbitrary a question, but I have been looking through the source code of Faile and TSCP and have been playing them against each other. As far as I can see, the engines have a lot in common, yet Faile searches ~1.3 million nodes per second while TSCP searches only ~300k nodes per second.
The source code for faile can be found here: http://faile.sourceforge.net/download.php. TSCP source code can be found here: http://www.tckerrigan.com/Chess/TSCP.
After looking through them I see some similarities: both use an array board representation (although Faile uses a 144-square board), both use an alpha-beta search with some sort of transposition table, and both have very similar evaluation functions. The main difference I can find is that Faile uses a redundant representation of the board, also keeping arrays of the piece locations. This means that when moves are generated (by very similar functions in both programs), Faile only has to loop over the actual pieces rather than scanning the whole board, while maintaining these arrays costs comparatively few resources.
My question is: why is there a 4x difference in the speed of these two programs? Also, why does Faile consistently beat TSCP (I estimate a ~200 ELO difference just by watching their moves)? For the latter, it seems to be because Faile searches several plies deeper.
Short answer: TSCP is very simple (as you can guess from its name). Faile is more advanced, and its developers spent some time optimizing it. So it is only reasonable for Faile to be faster, which also means a deeper search and a higher ELO.
Long answer: As far as I remember, the most important part of a program using alpha-beta search (the part which influences performance the most) is the move generator. TSCP's move generator does not generate moves in any particular order. Faile's generator (as you noticed) uses a piece list, which is sorted in order of decreasing piece value. This means it generates more important moves first. This allows alpha-beta pruning to cut more unneeded moves and makes the search tree less branchy. And a less branchy tree may be deeper while still having the same number of nodes, which allows a deeper search.
Here is a very simplified example of how move ordering allows a faster search. Suppose White's last move was silly: they moved some piece to an unprotected square. If we find some Black move that captures this piece, we can ignore all other, not-yet-evaluated moves and return to processing White's move list. A queen controls much more space than a pawn, so it has a better chance of capturing this piece; if we look at the queen's moves first, we are more likely to skip more unneeded moves.
I didn't compare other parts of these programs, but most likely Faile optimizes them better as well. Things like the alpha-beta algorithm itself, variable search depth, and static position evaluation may also be better optimized.
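To make the ordering point concrete, here is a heavily simplified negamax alpha-beta sketch; it is not taken from Faile or TSCP, and all types and helper functions are hypothetical. Sorting captures of valuable pieces first is what lets the beta cutoff fire early:

    #include <stdlib.h>

    /* Hypothetical engine interface: none of this is from Faile or TSCP. */
    typedef struct { int from, to, captured_value; } Move;
    typedef struct Position Position;

    int  generate_moves(const Position *pos, Move out[256]);
    void make_move(Position *pos, Move m);
    void unmake_move(Position *pos, Move m);
    int  evaluate(const Position *pos);       /* static evaluation, side to move */

    /* Order captures of valuable pieces first: better ordering -> earlier cutoffs. */
    static int by_victim_value(const void *a, const void *b)
    {
        return ((const Move *)b)->captured_value - ((const Move *)a)->captured_value;
    }

    int alphabeta(Position *pos, int depth, int alpha, int beta)
    {
        if (depth == 0)
            return evaluate(pos);

        Move moves[256];
        int n = generate_moves(pos, moves);
        qsort(moves, n, sizeof(Move), by_victim_value);

        for (int i = 0; i < n; i++) {
            make_move(pos, moves[i]);
            int score = -alphabeta(pos, depth - 1, -beta, -alpha);
            unmake_move(pos, moves[i]);

            if (score >= beta)
                return beta;          /* cutoff: the rest of the moves can be skipped */
            if (score > alpha)
                alpha = score;
        }
        return alpha;
    }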
TSCP has no hash tables (-75 ELO).
TSCP has no killer moves for move ordering (-50 ELO).
TSCP has no null-move pruning (-100 ELO).
TSCP has a bad attack-function design (-25 ELO).
These four things account for a difference of about 250 ELO points. They also affect the number of nodes per second, but you cannot compare nodes per second between different engines, as programmers may use different interpretations of what counts as a node.
I'm trying to evaluate different substring search (à la strstr) algorithms and implementations, and am looking for some well-crafted needle and haystack strings that will catch worst-case performance and possible corner-case bugs. I suppose I could work them out myself, but I figure someone has to have a good collection of test cases sitting around somewhere...
Some thoughts and a partial answer to myself:
Worst case for brute force algorithm:
a^(n+1) b in (a^n b)^m
e.g. aaab in aabaabaabaabaabaabaab
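A small generator for that family, if it helps (the parameters are arbitrary and error checking is omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build needle = a^(n+1) b and haystack = (a^n b)^m,
     * the classic quadratic case for the naive search. */
    static void build_case(size_t n, size_t m, char **needle, char **haystack)
    {
        *needle = malloc(n + 3);
        memset(*needle, 'a', n + 1);
        (*needle)[n + 1] = 'b';
        (*needle)[n + 2] = '\0';

        *haystack = malloc(m * (n + 1) + 1);
        for (size_t i = 0; i < m; i++) {
            memset(*haystack + i * (n + 1), 'a', n);
            (*haystack)[i * (n + 1) + n] = 'b';
        }
        (*haystack)[m * (n + 1)] = '\0';
    }

    int main(void)
    {
        char *needle, *haystack;
        build_case(3, 7, &needle, &haystack);          /* "aaaab" in "aaab" x 7 */
        printf("%s in %s -> %s\n", needle, haystack,
               strstr(haystack, needle) ? "found" : "not found");
        free(needle);
        free(haystack);
        return 0;
    }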
Worst case for SMOA:
Something like yxyxyxxyxyxyxx in (yxyxyxxyxyxyxy)^n. Needs further refinement. I'm trying to ensure that each advancement is only half the length of the partial match, and that maximal suffix computation requires the maximal amount of backtracking. I'm pretty sure I'm on the right track because this type of case is the only way I've found so far to make my implementation of SMOA (which is asymptotically 6n+5) run slower than glibc's Two-Way (which is asymptotically 2n-m but has moderately painful preprocessing overhead).
Worst case for anything rolling-hash based:
Whatever sequence of bytes causes hash collisions with the hash of the needle. For any reasonably fast hash and a given needle, it should be easy to construct a haystack whose hash collides with the needle's hash at every point. However, it seems difficult to simultaneously create long partial matches, which are the only way to get the worst-case behavior. Naturally, for worst-case behavior the needle must have some periodicity, and there must be a way of emulating the hash by adjusting just the final characters.
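For context, the kind of rolling hash being attacked is the textbook Rabin-Karp scheme below (a generic version, not any particular implementation); forcing collisions means forcing the window hash to equal the needle hash so the verification step runs at every position:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BASE 257ULL            /* multiplier for the polynomial hash (mod 2^64) */

    /* Textbook Rabin-Karp: the hash of each m-length window is updated in O(1). */
    static const char *rk_search(const char *hay, const char *needle)
    {
        size_t n = strlen(hay), m = strlen(needle);
        if (m == 0 || m > n) return m == 0 ? hay : NULL;

        uint64_t hn = 0, hw = 0, pow = 1;       /* needle hash, window hash, BASE^(m-1) */
        for (size_t i = 0; i < m; i++) {
            hn = hn * BASE + (unsigned char)needle[i];
            hw = hw * BASE + (unsigned char)hay[i];
            if (i + 1 < m) pow *= BASE;
        }
        for (size_t i = 0; ; i++) {
            /* Only verify character-by-character when the hashes collide. */
            if (hw == hn && memcmp(hay + i, needle, m) == 0)
                return hay + i;
            if (i + m >= n) return NULL;
            hw = (hw - (unsigned char)hay[i] * pow) * BASE + (unsigned char)hay[i + m];
        }
    }

    int main(void)
    {
        const char *hit = rk_search("aabaaabaab", "aaab");
        if (hit) printf("found at offset %ld\n", (long)(hit - "aabaaabaab" + 0), 0);
        else     printf("not found\n");
        return 0;
    }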
Worst case for Two-Way:
Seems to be very short needle with nontrivial MS decomposition - something like bac - where the haystack contains repeated false positives in the right-half component of the needle - something like dacdacdacdacdacdacdac. The only way this algorithm can be slow (other than by glibc authors implementing it poorly...) is by making the outer loop iterate many times and repeatedly incur that overhead (and making the setup overhead significant).
Other algorithms:
I'm really only interested in algorithms that are O(1) in space and have low preprocessing overhead, so I haven't looked at their worst cases so much. At least Boyer-Moore (without the modifications to make it O(n)) has a nontrivial worst-case where it becomes O(nm).
Doesn't answer your question directly, but you may find the algorithms in the book Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology interesting (it has many novel algorithms for substring search). Additionally, it is also a good source of special and complex cases.
A procedure that might give interesting statistics, though I have no time to test right now (a rough sketch follows after the list):
Randomize over string length,
then randomize over string contents of that length,
then randomize over offset/length of a substring (possibly something not in the string),
then randomly clobber the substring (possibly not at all),
repeat.
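A rough sketch of that procedure, using plain strstr as a stand-in for the implementation under test and a naive reference to compare against (all the size limits and the alphabet are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Naive reference search to compare the candidate implementation against. */
    static const char *ref_search(const char *hay, const char *needle)
    {
        size_t n = strlen(hay), m = strlen(needle);
        if (m > n) return NULL;
        for (size_t i = 0; i + m <= n; i++)
            if (memcmp(hay + i, needle, m) == 0)
                return hay + i;
        return NULL;
    }

    int main(void)
    {
        srand(12345);                                  /* fixed seed: reproducible */
        char hay[257], needle[65];

        for (int iter = 0; iter < 100000; iter++) {
            /* Randomize string length and contents over a tiny alphabet. */
            size_t n = 1 + rand() % 256;
            for (size_t i = 0; i < n; i++) hay[i] = 'a' + rand() % 3;
            hay[n] = '\0';

            /* Randomize offset/length of a substring of the haystack... */
            size_t len = 1 + rand() % (n < 64 ? n : 64);
            size_t off = rand() % (n - len + 1);
            memcpy(needle, hay + off, len);
            needle[len] = '\0';

            /* ...and possibly clobber part of it so it may no longer occur. */
            if (rand() % 2)
                needle[rand() % len] = 'a' + rand() % 3;

            const char *got = strstr(hay, needle);     /* candidate under test */
            const char *want = ref_search(hay, needle);
            if (got != want) {
                printf("MISMATCH: hay=%s needle=%s\n", hay, needle);
                return 1;
            }
        }
        printf("all iterations agreed\n");
        return 0;
    }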
You can generate container strings (resp., contained test values) recursively by:
Starting with the empty string, generate all strings obtained by augmenting a string currently in the set with a character from an alphabet, added on the left or on the right (both).
The alphabet for generating container strings is chosen by you.
You test two alphabets for contained strings: one is the alphabet that makes up the container strings, the other is its complement.
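One way to read that recipe in code, with an arbitrary alphabet and depth cap (duplicates are generated and would need deduplication in practice):

    #include <stdio.h>
    #include <string.h>

    #define MAX_LEN 4                       /* arbitrary cap on generated length */

    static const char *alphabet = "ab";     /* container alphabet, chosen by you */

    /* Emit s, then recursively extend it by one character on each side. */
    static void gen(const char *s)
    {
        size_t len = strlen(s);
        printf("%s\n", s);
        if (len == MAX_LEN)
            return;

        char buf[MAX_LEN + 2];
        for (const char *c = alphabet; *c; c++) {
            buf[0] = *c;                    /* add the character on the left */
            memcpy(buf + 1, s, len + 1);
            gen(buf);

            memcpy(buf, s, len);            /* add the character on the right */
            buf[len] = *c;
            buf[len + 1] = '\0';
            gen(buf);
        }
    }

    int main(void)
    {
        gen("");                            /* duplicates will appear; dedupe if needed */
        return 0;
    }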