Univocal hash function for a string 76 chars long - c

Here's my problem (I'm programming in C):
I have some huge text files containing DNA sequences (each file has something like 65 million rows and a size of about 4~5 GB). In these files there are many duplicates (don't know how many yet, but there should be many millions of them) and I want to return in output a file with only distinct values. Each string has a quality value associated, so if e.g I have 5 equal strings with different quality values I'll hold the best one and discard the other 4.
Reducing memory requirements and improving speed efficiency as far as I can is VITAL.
My idea was to create a JudyHS array using an hash function in order to convert the String DNA sequence (which is 76 letters long and has 7 possible characters) into an integer to reduce memory usage (4 or 8 bytes instead of 76 bytes on many millions of entries should be quite an achievement). This way I could use the integer as index and store only the best quality value for that index. The problem is that I can't find an hash function that UNIVOCALLY defines such a long string and produces a value that can be stored inside an integer or even a long long!
My first idea for an hash function was something like the default string hash function in Java: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], but I could obtain a maximal value of 8,52*10^59.. way tooo big.
What about doing the same thing and store it in a double? Would the computation become a lot slower?
Please note that I'd like a way to UNIVOCALLY define a string, avoiding collisions (or at least they should be extremely rare, because I would have to access the disk at every collision, quite a costly operation...)

You have 7^76 possible DNA sequences and want to map them to 2^32 hashes without collisions? Not possible.
You need a least log2(7^76) = 214 bits to do that, about 27 bytes.
I you can live with some collisions I would recommend to stick to CRC32 or md5 instead of inventing a new wheel again.

The "simple" way to get a collision-free hash function for N elements is to use a good mixing function (say, a cryptographic hash function) and to truncate the size, so that the hash results live in a space of size at least N2. Here, you have 65 million rows -- this fits on 26 bits (226 is close to 65 millions) so 52 bits "ought to be enough".
You can try using a fast cryptographic hash function, even a "broken" one since this is not a security-related problem. MD4, MD5, SHA-1... then truncate the result to the first (or last) 64 bits, store that in a 64-bit integer type. Chances are that you will not get any collision among your 65 million rows; and if you get some, they will be very rare.
For optimized C implementations of hash functions, lookup sphlib. Use the provided sph_dec64le() function to "decode" a sequence of 8 bits into a 64-bit unsigned integer value.

Related

lightweight (quasi-random) integer fingerprint of C string

I would like to generate a nicely-mixed-up integer fingerprint of an arbitrary C string (s). Most C strings will consist of ASCII text characters:
I want very different fingerprints for similar strings, esp such similar strings as "ab" and "ba"
I want it to be difficult to invert back from the fingerprint to the string (well, my string is typically longer than 32 bits, which means that many strings would map into the same integer), which means again that I want similar strings to yield very different codes;
I want to use the 32 bits available to me efficiently in the integer result,
I want the function source to be small
I want the function to be fast.
one usage is security (but not encryption) related. I can ask a user for a text password, convert it into an integer for storage and later test whether this integer is correct. (I know I could store strings, but I don't want to. guessing a 32-bit integer correctly is impossible if my program can slow down incorrect attempts to the point where brute force cannot work faster than password guessing. another use of this function is as the start of a hash index function (mod array length) into an array.)
alas, I am probably reinventing the wheel here. such functions have probably been written a million times, and by people who are much more versed in cryptography. I don't need AES, of course, but something much more lightweight. the use is different.
my first thinking was
mod 64 each character to take advantage of the ASCII text aspect. now I have 6 bits. call this x.
I can place a 6bit string into 5 locations in a 32-bit space, leaving 2 bits over.
take the current string index position (0, 1, 2...), mod5 it to determine where I want to start to place my x into my running integer result code. XOR my x into this running-result integer.
use the remaining 2 bits to increment a counter [mod 4 to prevent overflow] for each character processed.
then I thought that bit operations may be computer-fast but take more source code. I can think of other choices. take each index position i and multiply it by an ascii representation of each character [or the x from above], and call this y[i]. now do the following:
calculate the natural logarithm of the sums of the y (or this sum plus the running result), and just pretend that the first 32 bits of this result [maybe leaving off the first few bits], which are really a double, are an integer representation. I can XOR each bitint(log(y[i])) into the running integer result.
do it even cheaper. just add the y's, and then do the logarithm with 32-bit pickoff just once at the end. alternatively, run a sum-y through srand as a seed and grab a rand.
there are probably a few other ways to do it, too. in sum, the function should map strings into very different integers, be short to code, and be very fast.
Any pointers?
A common method of generating a non-reversible digest or hash of a string is to generate a Cyclic Redundancy Checksum (CRC).
Source for CRC is widely available, in this case you should use a common CRC-32 such as that used by Ethernet. Different CRCs work on the same principle, buy use different polynomials. Do not be tempted to invent your own polynomial; the distribution is likely to be sub-optimal.
What you're looking for is called a "hash". Two examples of hash functions I'm aware of that return short integers are MurmurHash and SipHash. MurmurHash, as I recall, is not designed to be a cryptographic hash, while SipHash, on the other hand, is indeed designed with security in mind, as stated on its homepage. MurmurHash has 2 versions that return a 32-bit and a 64-bit output. SipHash returns a 64-bit output.

Fastest way to compare one byte array with many others?

I have a loop with the following structure :
Calculate a byte array with length k (somewhere slow)
Find if the calculated byte array matches any in a list of N byte arrays I have.
Repeat
My loop is to be called many many times (it's the main loop of my program), and I want the second step to be as fast as possible.
The naive implementation for the second step would be using memcmp:
char* calc;
char** list;
int k, n, i;
for(i = 0; i < n; i++) {
if (!memcmp(calc, list[i], k)) {
printf("Matches array %d", i);
}
}
Can you think of any faster way ? A few things :
My list is fixed at the start of my program, any precomputation on it is fine.
Let's assume that k is small (<= 64), N is moderate (around 100-1000).
Performance is the goal here, and portability is a non issue. Intrinsics/inline assembly is fine, as long as it's faster.
Here are a few thoughts that I had :
Given k<64 and I'm on x86_64, I could sort my lookup array as a long array, and do a binary search on it. O(log(n)). Even if k was big, I could sort my lookup array and do this binary search using memcmp.
Given k is small, again, I could compute a 8/16/32 bits checksum (the simplest being folding my arrays over themselves using a xor) of all my lookup arrays and use a built-in PCMPGT as in How to compare more than two numbers in parallel?. I know SSE4.2 is available here.
Do you think going for vectorization/sse is a good idea here ? If yes, what do you think is the best approach.
I'd like to say that this isn't early optimization, but performance is crucial here, I need the outer loop to be as fast as possible.
Thanks
EDIT1: It looks like http://schani.wordpress.com/tag/c-optimization-linear-binary-search-sse2-simd/ provides some interesting thoughts about it. Binary search on a list of long seems the way to go..
The optimum solution is going to depend on how many arrays there are to match, the size of the arrays, and how often they change. I would look at avoiding doing the comparisons at all.
Assuming the list of arrays to compare it to does not change frequently and you have many such arrays, I would create a hash of each array, then when you come to compare, hash the thing you are testing. Then you only need compare the hash values. With a hash like SHA256, you can rely on this both as a positive and negative indicator (i.e. the hashes matching is sufficient to say the arrays match as well as the hashes not matching being sufficient to say the arrays differ). This would work very well if you had (say) 1,000,000 arrays to compare against which hardly ever change, as calculating the hash would be faster than 1,000,000 array comparisons.
If your number of arrays is a bit smaller, you might consider a faster non-crytographic hash. For instance, a 'hash' which simply summed the bytes in an array module 256 (this is a terrible hash and you can do much better) would eliminate the need to compare (say) 255/256ths of the target array space. You could then compare only those where the so called 'hash' matches. There are well known hash-like things such as CRC-32 which are quick to calculate.
In either case you can then have a look up by hash (modulo X) to determine which arrays to actually compare.
You suggest k is small, N is moderate (i.e. about 1000). I'm guessing speed will revolve around memory cache. Not accessing 1,000 small arrays here is going to be pretty helpful.
All the above will be useless if the arrays change with a frequency similar to the comparison.
Addition (assuming you are looking at 64 bytes or similar). I'd look into a very fast non-cryptographic hash function. For instance look at: https://code.google.com/p/smhasher/wiki/MurmurHash3
It looks like 3-4 instructions per 32 bit word to generate the hash. You could then truncate the result to (say) 12 bits for a 4096 entry hash table with very few collisions (each bucket being linked list to the target arrays). This means you would look at something like about 30 instructions to calculate the hash, then one instruction per bucket entry (expected value 1) to find the list item, then one manual compare per expected hit (that would be between 0 and 1). So rather than comparing 1000 arrays, you would compare between 0 and 1 arrays, and generate one hash. If you can't compare 999 arrays in 30-ish instructions (I'm guessing not!) this is obviously a win.
We can assume that my stuff fits in 64bits, or even 32bits. If it
wasn't, I could hash it so it could. But now, what's the fastest way
to find whether my hash exists in the list of precomputed hashes ?
This is sort of a meta-answer, but... if your question boils down to: how can I efficiently find whether a certain 32-bit number exists in a list of other 32-bit numbers, this is a problem IP routers deal with all the time, so it might be worth looking into the networking literature to see if there's something you can adapt from their algorithms. e.g. see http://cit.mak.ac.ug/iccir/downloads/SREC_07/K.J.Poornaselvan1,S.Suresh,%20C.Divya%20Preya%20and%20C.G.Gayathri_07.pdf
(Although, I suspect they are optimized for searching through larger numbers of items than your use case..)
can you do an XOR instead of memcmp ?
or caclulate hash of each element in the array and sort it search for the hash
but hash will take more time .unless you can come up with a faster hash
Another way is to pre-build a tree from your list and use tree search.
for examples, with list:
aaaa
aaca
acbc
acca
bcaa
bcca
caca
we can get a tree like this
root
-a
--a
---a
----a
---c
----a
--c
---b
----c
---c
----a
-b
--c
---a
----a
---c
----a
-c
--a
---c
----a
Then do binary search on each level of the tree

Why are digits used as indentification instead of letters?

Why are numbers used as ID in databases (think primary key + AI) and pretty much everywhere instead of letters?
There are 10 digits available, while the English alphabet has as much as 26 letters.
Let's say that each letter/digits takes a spot. 98 takes two spots and 1202 takes four, etc. In four spots, you can store up to 10 000 IDs, but if you use letters instead, you could store as much as 456 976 IDs with the same amount of spots. Even more if you'd use case-sensitivity. That's almost 50 times more.
I do realize that this most likely doesn't matter for the regular user, but why aren't larger databases using letters instead of digits as IDs?
You are confusing characters for numeric values.
An ID column that uses an integer (say a 32bit Integer) as the data type will only take 4 bytes per row. It will also be a native value in memory and can be acted on natively in the CPU (as a binary representation).
This is not the same for characters - even if one assumes ASCII (8bit) is used, the moment you go over 4 characters, you are using more space. You also need to translate between the values in order to make valid comparisons.
Numbers pack better. You suppose that because the numbers are displayed in decimal, that they are stored as decimal, but they're really binary. Optimized for computers :).
If you want to represent one of 26 letters, you need 25 binary digits. You lose 32-26=8 possible digits per block of 5 bits.
There's no hard and fast rule that says you can't use alphanumeric fileds as IDs in databases. People do it all the time.
As for why it's more common to use numbers...
Most database systems are designed with auto-increment ability on numbers. (Yeah I know, that's a chicken/egg scenario)
Numbers can/usually do take up less bytes of storage. (Yes, you can store large numbers and shorter strings to overcome this, but as a general rule...)
I was going to expand on this, but everyone else beat me to the punch with accurate descriptions of the difference between bytes needed to store an int compared to a varchar. It would be silly to add it now. ;-)
On every system I've worked with, sorting on numbers is different than sorting on strings:
The values 1, 12,3, 2, 20 are sorted numerically as 1,2,3,12,20, but when sorted alphanumerically: 1,12,2,20,3
More computational power is needed to overcome the previous point, so numbers are simply more efficient to work with.
This is the answer to why most databases are designed with auto-incrementing numbers as opposed to auto-incrementing strings in the first bullet. Whether it's the chicken or the egg, I'll leave up to you.
Because computers only work with numbers. Even characters are treated by the computer as numbers.
Also, using character strings is much LESS efficient than numbers.
Any number smaller than 4,294,967,296 (2^32) can be stored in only 4 bytes, whereas even a 5 character (and each character takes up one byte) alphabetic string allows only 11,881,376 possibilities.
Computers don't story a base-10 numerical digit in one byte. Each byte actually can hold 256 different possible values.

How to find 0 in an integer array of size 100 ,having 99 elements as 1 and only one element 0 in most efficient way

I need to find the position( or index ) say i of an integer array A of size 100, such that A[i]=0. 99 elements of array A are 1 and only one element is 0. I want the most efficient way solving this problem.(So no one by one element comparison).
Others have already answered the fundamental question - you will have to check all entries, or at least, up until the point where you find the zero. This would be a worst case of 99 comparisons. (Because if the first 99 are ones then you already know that the last entry must be the zero, so you don't need to check it)
The possible flaw in these answers is the assumption that you can only check one entry at a time.
In reality we would probably use direct memory access to compare several integers at once. (e.g. if your "integer" is 32 bits, then processors with SIMD instructions could compare 128 bits at once to see if any entry in a group of 4 values contains the zero - this would make your brute force scan just under 4 times faster. Obviously the smaller the integer, the more entries you could compare at once).
But that isn't the optimal solution. If you can dictate the storage of these values, then you could store the entire "array" as binary bits (0/1 values) in just 100 bits (the easiest would be to use two 64-bit integers (128 bits) and fill the spare 28 bits with 1's) and then you could do a "binary chop" to find the data.
Essentially a "binary chop" works by chopping the data in half. One half will be all 1's, and the other half will have the zero in it. So a single comparison allows you to reject half of the values at once. (You can do a single comparison because half of your array will fit into a 64-bit long, so you can just compare it to 0xffffffffffffffff to see if it is all 1's). You then repeat on the half that contains the zero, chopping it in two again and determining which half holds the zero... and so on. This will always find the zero value in 7 comparisons - much better than comparing all 100 elements individually.
This could be further optimised because once you get down to the level of one or two bytes you could simply look up the byte/word value in a precalculated look-up table to tell you which bit is the zero. This would bring the algorithm down to 4 comparisons and one look-up (in a 64kB table), or 5 comparisons and one look-up (in a 256-byte table).
So we're down to about 5 operations in the worst case.
But if you could dictate the storage of the array, you could just "store" the array by noting down the index of the zero entry. There is no need at all to store all the individual values. This would only take 1 byte of memory to store the state, and this byte would already contain the answer, giving you a cost of just 1 operation (reading the stored value).
You cannot do it better then linear scan - unless the data is sorted or you have some extra data on it. At the very least you need to read all data, since you have no clue where this 0 is hiding.
If it is [sorted] - just access the relevant [minimum] location.
Something tells me that the expected answer is "compare pairs":
while (a[i] == a[i+1]) i += 2;
Although it looks better that the obvious approach, it's still O(n),
Keep track of it as you insert to build the array. Then just access the stored value directly. O(1) with a very small set of constants.
Imagine 100 sea shells, under one is a pearl. There is no more information.
There is really no way to find it faster than trying to turn them all over. The computer can't do any better with the same knowledge. In other words, a linear scan is the best you can do unless you save the position of the zero earlier in the process and just use that.
More trivia than anything else, but if you happen to have a quantum computer this can be done faster than linear.
Grover's algortithm

Will MD5 ever return the same output as its input? [duplicate]

Is there a fixed point in the MD5 transformation, i.e. does there exist x such that md5(x) == x?
Since an MD5 sum is 128 bits long, any fixed point would necessarily also have to be 128 bits long. Assuming that the MD5 sum of any string is uniformly distributed over all possible sums, then the probability that any given 128-bit string is a fixed point is 1/2128.
Thus, the probability that no 128-bit string is a fixed point is (1 − 1/2128)2128, so the probability that there is a fixed point is 1 − (1 − 1/2128)2128.
Since the limit as n goes to infinity of (1 − 1/n)n is 1/e, and 2128 is most certainly a very large number, this probability is almost exactly 1 − 1/e ≈ 63.21%.
Of course, there is no randomness actually involved – either there is a fixed point or there isn't. But, we can be 63.21% confident that there is a fixed point. (Also, notice that this number does not depend on the size of the keyspace – if MD5 sums were 32 bits or 1024 bits, the answer would be the same, so long as it's larger than about 4 or 5 bits).
My brute force attempt found a 12 prefix and 12 suffix match.
prefix 12:
54db1011d76dc70a0a9df3ff3e0b390f -> 54db1011d76d137956603122ad86d762
suffix 12:
df12c1434cec7850a7900ce027af4b78 -> b2f6053087022898fe920ce027af4b78
Blog post:
https://plus.google.com/103541237243849171137/posts/SRxXrTMdrFN
Since the hash is irreversible, this would be very hard to figure out. The only way to solve this, would be to calculate the hash on every possible output of the hash, and see if you came up with a match.
To elaborate, there are 16 bytes in an MD5 hash. That means there are 2^(16*8) = 3.4 * 10 ^ 38 combinations. If it took 1 millisecond to compute a hash on a 16 byte value, it would take 10790283070806014188970529154.99 years to calculate all those hashes.
While I don't have a yes/no answer, my guess is "yes" and furthermore that there are maybe 2^32 such fixed points (for the bit-string interpretation, not the character-string intepretation). I'm actively working on this because it seems like an awesome, concise puzzle that will require a lot of creativity (if you don't settle for brute force search right away).
My approach is the following: treat it as a math problem. We have 128 boolean variables, and 128 equations describing the outputs in terms of the inputs (which are supposed to match). By plugging in all of the constants from the tables in the algorithm and the padding bits, my hope is that the equations can be greatly simplified to yield an algorithm optimized to the 128-bit input case. These simplified equations can then be programmed in some nice language for efficient search, or treated abstractly again, assigning single bits at a time, watching out for contraditions. You only need to see a few bits of the output to know that it is not matching the input!
Probably, but finding it would take longer than we have or would involve compromising MD5.
There are two interpretations, and if one is allowed to pick either, the probability of finding a fixed point increases to 81.5%.
Interpretation 1: does the MD5 of a MD5 output in binary match its input?
Interpretation 2: does the MD5 of a MD5 output in hex match its input?
Strictly speaking, since the input of MD5 is 512 bits long and the output is 128 bits, I would say that's impossible by definition.

Resources