I have a loop with the following structure:
1. Calculate a byte array of length k (somewhere slow).
2. Find out whether the calculated byte array matches any in a list of N byte arrays I have.
3. Repeat.
My loop is to be called many many times (it's the main loop of my program), and I want the second step to be as fast as possible.
The naive implementation for the second step would be using memcmp:
char* calc;
char** list;
int k, n, i;

for (i = 0; i < n; i++) {
    if (!memcmp(calc, list[i], k)) {
        printf("Matches array %d", i);
    }
}
Can you think of any faster way? A few things:
My list is fixed at the start of my program, any precomputation on it is fine.
Let's assume that k is small (<= 64), N is moderate (around 100-1000).
Performance is the goal here, and portability is a non issue. Intrinsics/inline assembly is fine, as long as it's faster.
Here are a few thoughts that I had :
Given k <= 8 bytes (so each array fits in 64 bits) and I'm on x86_64, I could sort my lookup list as an array of 64-bit integers and do a binary search on it: O(log(n)). Even if k were bigger, I could sort my lookup list and do this binary search using memcmp.
Given k is small, again, I could compute an 8/16/32-bit checksum (the simplest being folding my arrays over themselves with an XOR) of all my lookup arrays and use the PCMPGT instruction as in How to compare more than two numbers in parallel?. I know SSE4.2 is available here.
Do you think going for vectorization/SSE is a good idea here? If so, what do you think is the best approach?
I'd like to say that this isn't early optimization, but performance is crucial here, I need the outer loop to be as fast as possible.
Thanks
EDIT1: It looks like http://schani.wordpress.com/tag/c-optimization-linear-binary-search-sse2-simd/ provides some interesting thoughts about it. Binary search on a list of longs seems the way to go.
The optimum solution is going to depend on how many arrays there are to match, the size of the arrays, and how often they change. I would look at avoiding doing the comparisons at all.
Assuming the list of arrays to compare it to does not change frequently and you have many such arrays, I would create a hash of each array, then when you come to compare, hash the thing you are testing. Then you only need compare the hash values. With a hash like SHA256, you can rely on this both as a positive and negative indicator (i.e. the hashes matching is sufficient to say the arrays match as well as the hashes not matching being sufficient to say the arrays differ). This would work very well if you had (say) 1,000,000 arrays to compare against which hardly ever change, as calculating the hash would be faster than 1,000,000 array comparisons.
If your number of arrays is a bit smaller, you might consider a faster non-cryptographic hash. For instance, a 'hash' which simply summed the bytes in an array modulo 256 (this is a terrible hash and you can do much better) would eliminate the need to compare (say) 255/256ths of the target array space. You could then compare only those where the so-called 'hash' matches. There are well-known hash-like functions such as CRC-32 which are quick to calculate.
In either case you can then do a lookup by hash (modulo X) to determine which arrays to actually compare.
You suggest k is small, N is moderate (i.e. about 1000). I'm guessing speed will revolve around memory cache. Not accessing 1,000 small arrays here is going to be pretty helpful.
All the above will be useless if the arrays change with a frequency similar to the comparison.
Addition (assuming you are looking at 64 bytes or similar). I'd look into a very fast non-cryptographic hash function. For instance look at: https://code.google.com/p/smhasher/wiki/MurmurHash3
It looks like 3-4 instructions per 32-bit word to generate the hash. You could then truncate the result to (say) 12 bits for a 4096-entry hash table with very few collisions (each bucket being a linked list to the target arrays). This means you would look at something like 30 instructions to calculate the hash, then one instruction per bucket entry (expected value 1) to find the list item, then one manual compare per expected hit (that would be between 0 and 1). So rather than comparing 1000 arrays, you would compare between 0 and 1 arrays, and generate one hash. If you can't compare 999 arrays in 30-ish instructions (I'm guessing not!), this is obviously a win.
We can assume that my stuff fits in 64bits, or even 32bits. If it
wasn't, I could hash it so it could. But now, what's the fastest way
to find whether my hash exists in the list of precomputed hashes ?
This is sort of a meta-answer, but... if your question boils down to: how can I efficiently find whether a certain 32-bit number exists in a list of other 32-bit numbers, this is a problem IP routers deal with all the time, so it might be worth looking into the networking literature to see if there's something you can adapt from their algorithms. e.g. see http://cit.mak.ac.ug/iccir/downloads/SREC_07/K.J.Poornaselvan1,S.Suresh,%20C.Divya%20Preya%20and%20C.G.Gayathri_07.pdf
(Although, I suspect they are optimized for searching through larger numbers of items than your use case..)
Can you do an XOR instead of memcmp?
Or calculate a hash of each element in the list, sort the hashes, and search for your hash.
But hashing will take more time, unless you can come up with a fast hash.
Another way is to pre-build a tree from your list and use tree search.
For example, with the list:
aaaa
aaca
acbc
acca
bcaa
bcca
caca
we can get a tree like this
root
-a
--a
---a
----a
---c
----a
--c
---b
----c
---c
----a
-b
--c
---a
----a
---c
----a
-c
--a
---c
----a
Then do binary search on each level of the tree
Related
Suppose I have an array of strings of different lengths.
It can be assumed that the strings have no repeating characters.
Using a brute-force algorithm, I can find the pair of strings that have the most number of identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
Is there a more efficient way to do this?
Attempt: I've tried to think of a solution using the trie data structure, but this won't work since a trie would only group together strings with similar prefixes.
I can find the pair of strings that have the most number of identical
letters (order does not matter - for example, "ABCDZFW" and "FBZ" have
3 identical letters) in n-squared time.
I think you can't, as string comparison itself is O(max(length(s1), length(s2))) on top of the O(n^2) loop for checking all pairs. However, you can optimize the comparison of strings to some extent.
As you mentioned, the strings have no duplicate characters, and I am assuming from your input that they consist only of uppercase letters. So each string can be at most 26 characters long.
For each string, we can use a bitmask. And for each character of a string, we can set the corresponding bit 1. For example:
ABCGH
11000111 (from MSB to LSB)
Thus, we have n bit-masks for n strings.
Way #1
Now you can check all possible pairs of strings using an O(n^2) loop and compare two strings by ANDing their corresponding masks and counting the number of set bits (the Hamming weight). Obviously this is an improvement over your version because the string comparison is optimized now: only an AND operation between two 32-bit integers, which is an O(1) operation.
For example for any two strings comparison will be:
ABCDG
ABCEF
X1 = mask(ABCDG) => 1001111
X2 = mask(ABCEF) => 0110111
X1 AND X2 => 0000111
hamming weight(0000111) => 3 // number of set bits
Way #2
Now, one observation is that ANDing two set bits gives a set bit. So for the pair of strings with the most matched characters, the same bits are 1 in both masks, and ANDing the two masks keeps those bits 1: we want to maximize the Hamming weight (total number of set bits) of the AND of two strings' masks.
Now build a Trie with all masks: every node of the trie holds 0 or 1 based on whether the corresponding bit is set. Insert each mask from MSB to LSB. Before inserting the ith mask into the Trie (already holding i - 1 masks), query it to maximize the Hamming weight of the AND recursively, going down the same bit's branch (to make that bit 1 in the final AND value) but also down the opposite bit's branch, because in later levels you might get more set bits there.
Regarding this Trie part, for nice pictorial explanation, you can find a similar thread here (this works with XOR).
In the worst case, we will need to traverse many branches of the trie to maximize the Hamming weight, taking around 6 * 10^6 operations (which will take about a second on a typical machine), and we also need additional space for building the trie. But say the total number of strings is 10^5; then the O(n^2) algorithm takes 10^10 operations, which is far too much, so the trie approach is still far better.
Let me know if you're having problems with the implementation. Unfortunately I can only help you with code if you're a C/C++ or Java person.
Thanks @JimMischel for pointing out a major flaw; I slightly misunderstood the statement at first.
Basically, I saw a video on YouTube that visualized sorting algorithms, and they provided the program so that we can play with it. The program counts two main things (comparisons, array accesses). I wanted to see which of merge sort and quicksort is faster.
for 100 random numbers
quick sort:
comparisons 1000
array accesses 1400
merge sort:
comparisons 540
array accesses 1900
So quicksort uses fewer array accesses while merge sort uses fewer comparisons, and the difference increases with the number of elements. Which one of those is harder for the computer to do?
The numbers are off. Here are results from actual runs with 100 random numbers. Note that the quicksort compare count is affected by the implementation; Hoare partitioning uses fewer compares than Lomuto.
quick sort (Hoare partition scheme)
pivot reads 87 (average)
compares 401 (average)
array accesses 854 (average)
merge sort:
compares 307 (average)
array accesses 1400 (best, average, worst)
Since numbers are being sorted, I'm assuming they fit in registers, which reduces the array accesses.
For quick sort, the compares are done versus a pivot value, which should be read just once per recursive instance of quick sort and placed in a register, then one read for each value compared. An optimizing compiler may keep the values used for compare in registers so that swaps already have the two values in registers and only need to do two writes.
For merge sort, the compares add almost zero overhead to the array accesses, since the compared values will be read into registers, compared, then written from the registers instead of reading from memory again.
Sorting performance depends on many conditions, I think answering your exact question won't lead to a helpful answer (you can benchmark it easily yourself).
Sorting a small number of elements is usually not time critical, benchmarking makes sense for larger lists.
Also it is a rare case to sort an array of integers, it is much more common to sort a list of objects, comparing one or more of their properties.
If you head for performance, think about multithreading.
MergeSort is stable (equal elements keep their relative position), QuickSort is not, so you are comparing different results.
In your example, the quicksort algorithm is probably faster most of the time. If the comparison is more complex, e.g. strings instead of ints, or multiple fields, merge sort becomes more and more effective because it needs fewer (expensive) comparisons. And if you want to parallelize the sorting, merge sort is a natural fit because of the structure of the algorithm itself.
I wrote a heuristic algorithm for the bin packing problem using a best-fit approach, with items S=(i1,...,in) and bin size T. Now I want to create a truly exact exponential algorithm that calculates the optimal solution (the minimum number of bins needed to pack all the items), but I have no idea how to check every possible packing. I'm doing it in C.
Can somebody give me ideas about which structs I should use? How can I test all the combinations of items? Does it have to be a recursive algorithm? Is there a book or article that may help me?
Sorry for my bad English.
The algorithm given will find one packing, usually one that is quite good, but not necessarily optimal, so it does not solve the problem.
For NP-complete problems, algorithms that solve them are usually easiest to describe recursively (iterative descriptions mostly end up making explicit all the book-keeping that is hidden by recursion). For bin packing, you may start with a minimum number of bins (the ceiling of the sum of object sizes divided by the bin size, but you can even start with 1), try all combinations of assignments of objects to bins, check for each such assignment that it is legal (the sum of content sizes <= bin size for each bin), return accepting (or outputting the found assignment) if it is, or increase the number of bins if no legal assignment was found.
You asked for structures, here is one idea: Each bin should somehow describe the objects contained (list or array) and you need a list (or array) of all your bins. With these fairly simple structures, a recursive algorithm looks like this: To try out all possible assignments you run a loop for each object that will try assigning it to each available bin. Either you wait for all objects to be assigned before checking legality, or (as a minor optimization) you only assign an object to the bins it fits in before going on to the next object (that's the recursion that ends when the last object has been assigned), going back to the previous object if no such bin is found or (for the first object) increasing the number of bins before trying again.
Hope this helps.
I need to find the position (or index), say i, in an integer array A of size 100, such that A[i] = 0. 99 elements of array A are 1 and only one element is 0. I want the most efficient way of solving this problem (so no one-by-one element comparison).
Others have already answered the fundamental question - you will have to check all entries, or at least, up until the point where you find the zero. This would be a worst case of 99 comparisons. (Because if the first 99 are ones then you already know that the last entry must be the zero, so you don't need to check it)
The possible flaw in these answers is the assumption that you can only check one entry at a time.
In reality we would probably use direct memory access to compare several integers at once. (e.g. if your "integer" is 32 bits, then processors with SIMD instructions could compare 128 bits at once to see if any entry in a group of 4 values contains the zero - this would make your brute force scan just under 4 times faster. Obviously the smaller the integer, the more entries you could compare at once).
But that isn't the optimal solution. If you can dictate the storage of these values, then you could store the entire "array" as binary bits (0/1 values) in just 100 bits (the easiest would be to use two 64-bit integers (128 bits) and fill the spare 28 bits with 1's) and then you could do a "binary chop" to find the data.
Essentially a "binary chop" works by chopping the data in half. One half will be all 1's, and the other half will have the zero in it. So a single comparison allows you to reject half of the values at once. (You can do a single comparison because half of your array will fit into a 64-bit long, so you can just compare it to 0xffffffffffffffff to see if it is all 1's). You then repeat on the half that contains the zero, chopping it in two again and determining which half holds the zero... and so on. This will always find the zero value in 7 comparisons - much better than comparing all 100 elements individually.
This could be further optimised because once you get down to the level of one or two bytes you could simply look up the byte/word value in a precalculated look-up table to tell you which bit is the zero. This would bring the algorithm down to 4 comparisons and one look-up (in a 64kB table), or 5 comparisons and one look-up (in a 256-byte table).
So we're down to about 5 operations in the worst case.
But if you could dictate the storage of the array, you could just "store" the array by noting down the index of the zero entry. There is no need at all to store all the individual values. This would only take 1 byte of memory to store the state, and this byte would already contain the answer, giving you a cost of just 1 operation (reading the stored value).
You cannot do better than a linear scan, unless the data is sorted or you have some extra information about it. At the very least you need to read all the data, since you have no clue where this 0 is hiding.
If it is sorted, just access the first (minimum) location.
Something tells me that the expected answer is "compare pairs":
int i = 0;
while (a[i] == a[i+1]) i += 2;
Although it looks better than the obvious approach, it's still O(n), and after the loop exits you still need one more check to see which element of the mismatched pair is the zero.
Keep track of it as you insert to build the array. Then just access the stored value directly. O(1) with a very small set of constants.
Imagine 100 sea shells, under one is a pearl. There is no more information.
There is really no way to find it faster than trying to turn them all over. The computer can't do any better with the same knowledge. In other words, a linear scan is the best you can do unless you save the position of the zero earlier in the process and just use that.
More trivia than anything else, but if you happen to have a quantum computer this can be done faster than linear.
Grover's algorithm
Many algorithms work by using the merge algorithm to merge two different sorted arrays into a single sorted array. For example, given as input the arrays
1 3 4 5 8
and
2 6 7 9
The merge of these arrays would be the array
1 2 3 4 5 6 7 8 9
Traditionally, there seem to be two different approaches to merging sorted arrays (note that the case for merging linked lists is quite different). First, there are out-of-place merge algorithms that work by allocating a temporary buffer for storage, then storing the result of the merge in the temporary buffer. Second, if the two arrays happen to be part of the same input array, there are in-place merge algorithms that use only O(1) auxiliary storage space and rearrange the two contiguous sequences into one sorted sequence. These two classes of algorithms both run in O(n) time, but the out-of-place merge algorithm tends to have a much lower constant factor because it does not have such stringent memory requirements.
My question is whether there is a known merging algorithm that can "interpolate" between these two approaches. That is, the algorithm would use somewhere between O(1) and O(n) memory, but the more memory it has available to it, the faster it runs. For example, if we were to measure the absolute number of array reads/writes performed by the algorithm, it might have a runtime of the form n g(s) + f(s), where s is the amount of space available to it and g(s) and f(s) are functions derivable from that amount of space available. The advantage of this function is that it could try to merge together two arrays in the most efficient way possible given memory constraints - the more memory available on the system, the more memory it would use and (ideally) the better the performance it would have.
More formally, the algorithm should work as follows. Given as input an array A consisting of two adjacent, sorted ranges, rearrange the elements in the array so that the elements are completely in sorted order. The algorithm is allowed to use external space, and its performance should be worst-case O(n) in all cases, but should run progressively more quickly given a greater amount of auxiliary space to use.
Is anyone familiar with an algorithm of this sort (or know where to look to find a description of one?)
At least according to the documentation, the in-place merge function in the SGI STL is adaptive and "its run-time complexity depends on how much memory is available". The source code is available, so you could at least check this one.
EDIT: STL has inplace_merge, which will adapt to the size of the temporary buffer available. If the temporary buffer is at least as big as one of the sub-arrays, it's O(N). Otherwise, it splits the merge into two sub-merges and recurses. The split takes O(log N) to find the right part of the other sub array to rotate in (binary search).
So it goes from O(N) to O(N log N) depending on how much memory you have available.