Fast string search using bitwise operators - c

What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?
e.g. find all positions of "GCAGCTGAAAACA" sequence in a human genome http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770MB)
*the alphabet consists of 4 symbols ('G','C',T,'A') represented using 2 bits:
'G':00, 'A':01, 'T':10, 'C':11
*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters
*by fastest I mean not including any pre-processing/indexing time
*the file is going to be loaded into memory after pre-processing, basically there will be billions of short strings to be searched for in a larger string, all in-memory.
*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.
*KMP wouldn't work well as the alphabet is small
*C code, x86 machine code would all be interesting.
Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html
Related:
Fastest way to scan for bit pattern in a stream of bits
Algorithm help! Fast algorithm in searching for a string with its partner
http://www.arstdesign.com/articles/fastsearch.html
http://en.wikipedia.org/wiki/Bitap_algorithm

If you're just looking through a file, you're pretty much guaranteed to be io-bound. Use of a large buffer (~16K), and strstr() should be all you need. If the file is encoded in ascii,search just for "gcagctgaaaaca". If it actually is encoded in bits; just permute the possible accepted strings(there should be ~8; lop off the first byte), and use memmem() plus a tiny overlapping bit check.
I'll note here that glibc strstr and memmem already use Knuth-Morris-Pratt to search in linear time, so test that performance. It may surprise you.

If you first encode/compress the DNA string with a lossless coding method (e.g. Huffman, exponential Golumb, etc.) then you get a ranked probability table ("coding tree") for DNA tokens of various combinations of nucleotides (e.g., A, AA, CA, etc.).
What this means is that, once you compress your DNA:
You'll probably be using fewer bits to store GCAGCTGAAAACA and other subsequences, than the "unencoded" approach of always using two bits per base.
You can walk through the coding tree or table to build an encoded search string, which will usually be shorter than the unencoded search string.
You can apply the same family of exact search algorithms (e.g. Boyer-Moore) to locate this shorter, encoded search string.
As for a parallelized approach, split the encoded target string up into N chunks and run the search algorithm on each chunk, using the shortened, encoded search string. By keeping track of the bit offsets of each chunk, you should be able to generate match positions.
Overall, this compression approach would be useful if you plan on doing millions of searches on sequence data that won't change. You'd be searching fewer bits — potentially many fewer, in aggregate.

Boyer-More is a technique used to search for substrings in plain strings. The basic idea is that if your substring is, say, 10 characters long, you can look at the character at position 9 in the string to search. If that character is not part of your search string, you could simply start the search after that character. (If that character is, indeed, in your string, the Boyer-More algorithm use a look-up table to skip the optimal number of characters forward.)
It might be possible to reuse this idea for your packed representation of the genome string. After all, there are only 256 different bytes, so you could safely pre-calculate the skip-table.

The benefit of encoding the alphabet into bit fields is compactness: one byte holds the equivalent of four characters. This is similar to some of the optimizations Google performs searching for words.
This suggests four parallel executions, each with the (transformed) search string offset by one character (two bits). A quick-and-dirty approach might be to just look for the first or second byte of the search string and then check extra bytes before and after matching the rest of the string, masking off the ends if necessary. The first search is handily done by the x86 instruction scasb. Subsequent byte matches can build upon the register values with cmpb.

You could create a state machine. In this topic,
Fast algorithm to extract thousands of simple patterns out of large amounts of text
, I used [f]lex to create the state machine for me. It would require some hackery to use the 4 letter ( := two bit) alphabet, but it can be done using the same tables as generated by [f]lex. (you could even create your own fgetc() like function which extracts two bits at a time from the input stream, and keeps the other six bits for consecutive calls. Pushback will be a bit harder, but not undoable).
BTW: I seriously doubt if there is any gain in compressing the data to two bits per nucleotide, but that is a different matter.

Okay, given your parameters, the problem isn't that hard, just not one you'd approach like a traditional string search problem. It more resembles a database table-join problem, where the tables are much larger than RAM.
select a good rolling hash function aka buzhash. If you have billions of strings, you're looking for a hash with 64-bit values.
create a hash table based on each 127-element search string. The table in memory only needs to store (hash,string-id), not the whole strings.
scan your very large target string, computing the rolling hash and looking up each value of the hash in your table. Whenever there's a match, write the (string-id, target-offset) pair to a stream, possibly a file.
reread your target string and the pair stream, loading search strings as needed to compare them against the target at each offset.
I am assuming that loading all pattern strings into memory at once is prohibitive. There are ways to segment the hash table into something that is larger than RAM but not a traditional random-access hash file; if you're interested, search for "hybrid hash" and "grace hash", which are more common in the database world.
I don't know if it's worth your while, but your pair stream gives you the perfect predictive input to manage a cache of pattern strings -- Belady's classic VM page replacement algorithm.

Related

Fastest string searching algorithm for fixed text and a lot of substrings

I am trying to find an algorithm to search for binary strings of fixed size (64 bit) in a large binary buffer (100 MB). The buffer is always the same and i have got lots and lots of strings to search for (2^500 maybe).
I have to find all the occurrences of any given string, not only the first one.
What algorithm can i choose from? Maybe one that benefits from the constant buffer in which i search.
Links to C source code for such algorithm is appreciated.
Assuming your string are 8-bit aligned, from 100MB buffer you'll get 100 millions different strings, which can be put into the hash table approximately 800MB in size with constant (O(1)) access time.
This will allow you to make the search as fast as possible, because once you have your 8 byte string, you immediately know where this string was seen in your buffer.

Efficient algorithm to search a buffer for any string from a list

I am looking for an efficient search algorithm, that, for a given set of strings searches a large buffer for any one match from the set of strings.
Currently i know a few efficient single-string algorithms (i have used the Knuth before), but i don't know if they really help.
Here is what i am actually doing:
I have around 6-10 predefined strings, each around 200-300 characters (actually bytes, since i`m processing binary data)
The input is a large, sometimes few megabyte buffer
I would like to process the buffer, and when i have a match, i would like to stop the search
I have looked for multiple-string searching algorithms using a finite set of predefined patterns, but they all seem to revolve around matching ALL of the predefined strings in the buffer.
This post: Fast algorithm for searching for substrings in a string, suggested using the Aho–Corasick or the Rabin–Karp alogirthm.
I thought, that since i only need one match, i could find other methods, that are similar to the mentioned algorithms, but the constrains given by the problem can improve the performance.
Aho-Corasick is a good choice here. After building an automaton the input string is traversed from left to right so it is possible to stop immediately after the first match is found. The time complexity is O(sum of lengths of all patterns + the position of the first occurrence). It is optimal because it is not possible to find the first match without reading all patterns and all the bytes from the buffer before the first occurrence.

Precalculating MD5

I've got MD5 hash of one million symbols password and I've got first 999,992 symbols. I need to bruteforce last 8 digits. Can I precount first symbols' hash (let's call it base hash) and then just brute 8 chars length string and add its hash to base hash to make finding right pass faster? What algorithm should I use or what software can help me?
Yes, that's possible. MD5 is based on the Merkle-Damgård construction, which performs the hashing in blocks. You can hash a number of blocks, then save the state of the hash function and use it as the starting point to try different possibilities for the remaining blocks.
Based on the documentation (I haven't tested), I think calling clone() on a Java MessageDigest will copy the current state of the hash function. You could use that to build your partial hash from the known characters, then create a clone for each guess. That's assuming that the MD5 implementation actually supports cloning. There's a chance (depending on what language and library you use) that you might have to write your own MD5 implementation.
Note that MD5's block size is 512 bits (64 characters), and the length of your password (one million) is a whole multiple of that. That means your password characters will completely fill up the last block of data, and the hash function will need an additional block for padding. So you'll precompute the partial hash of the first 999,936 characters that you know, then produce the final data block from the remaining 56 characters that you know plus the 8 that you're guessing, then append the padding block after that.
An implementation like Java's MessageDigest should take care of the details of dividing things into blocks, though. You can probably (again, I haven't tested) just create a MessageDigest, call digest(byte[]) with your 999,992 known bytes, and then call clone().

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
I think you can use a Radix Tree. It costs some memory because of pointers to leafs and parents, but it is easy to match up strings (O(k) (where k is the longest string size).
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words under consideration of these prefixes and postfixes. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint* there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. There has already gone an enormous amount of manpower into compression algorithms, why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit bitter compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what do you want to do.
Do you want a data structure that lets to you store in a memory-conscious manner the strings while letting operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression techinique, like that:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress like that:
0aaa
3b
2sd
1baco
2ad
The number is the length of the largest common prefix with the preceding string.
You can tweak that schema, for ex. planning a "restart" of the common prefix after just K words, for a fast reconstruction

Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution, or is takes too long. Here is a list of the situation I'm working with:
All byte strings will be at least
two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs. (I.E. unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:
smaller dictionary means smaller files, you have to write the dictionary to your file
smaller number of items can free up space to hold other, more rare sequences if you add a population cap and you hit it with a large file.

Resources