ARM NEON: Sort an array of 16 bytes

tl;dr: What is the fastest way to sort a uint8x16_t?
I need to sort many arrays of exactly 16 unsigned bytes (in descending order, although the direction doesn't matter, of course), and I'm trying to optimize the sorting with ARM NEON vectorization.
I find it quite a fancy puzzle, as it seems that there "must" exist a short combination of NEON instructions (such as vmax/vpmax/vmin/vpmin, vzip/vuzp) that reliably results in a sorted array.
For example, if we transform a pair (A, B) of two 8-byte arrays into (vpmax(A,B), vpmin(A,B)), we obtain the same 16 values, just in a different order. If we repeat this operation four times, we reliably have the array maximum in the first cell and the array minimum in the last cell; we cannot be sure about the middle elements, though.
Another example: if we first do (C,D)=(vmax(A,B),vmin(A,B)), then (E,F)=(vpmax(C,D),vpmin(C,D)), then (G,H)=vzip(E,F), we get our array split into four parts of four bytes each, and in each part we already know the largest and the smallest element. Probably the next naive step would be to deinterleave this array so that the top byte of each group sits at the start of the array (these won't necessarily be the top 4 elements of the array, just the top bytes of their respective groups) and repeat; I'm not yet sure where that leads in the end.
Is there any known method for this particular problem or for other similar problems (for different array sizes or whatever)? Any ideas are appreciated :)
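One known family of methods for fixed-size inputs like this is the sorting network: a fixed sequence of compare-exchange steps that does not depend on the data, which is the same min/max building block the question experiments with. The sketch below is a bitonic network written in portable scalar C, not NEON code; the function names are made up for illustration. It only shows which pairs get compared. Mapping each round onto vmin/vmax plus shuffles is left open here.

```c
#include <stdint.h>

/* Compare-exchange: afterwards *hi holds the larger value, *lo the smaller.
   This is the scalar counterpart of one max plus one min. */
static void cmpxchg_desc(uint8_t *hi, uint8_t *lo)
{
    uint8_t a = *hi, b = *lo;
    *hi = a > b ? a : b;
    *lo = a > b ? b : a;
}

/* Bitonic sorting network for exactly 16 bytes, descending order.
   The sequence of index pairs is fixed: 10 rounds of 8 compare-exchanges. */
void sort16_desc(uint8_t v[16])
{
    for (unsigned k = 2; k <= 16; k *= 2) {         /* size of the runs being merged */
        for (unsigned j = k / 2; j > 0; j /= 2) {   /* distance to the partner */
            for (unsigned i = 0; i < 16; i++) {
                unsigned l = i ^ j;
                if (l <= i)
                    continue;                       /* handle each pair once */
                if ((i & k) == 0)
                    cmpxchg_desc(&v[i], &v[l]);     /* descending block */
                else
                    cmpxchg_desc(&v[l], &v[i]);     /* ascending block */
            }
        }
    }
}
```

Within one round the 8 pairs are independent of each other, which is what makes this kind of network a candidate for vectorization.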

Related

Optimal way for storing compressed sets

As title says, I am searching for the optimal way of storing sets in memory. I am only interested in sets of bytes (array of integers from 0 to 255 where order is not important). It is not required that encoding/decoding be fast. The only necessary thing is that sets should require as little memory as possible.
The first method I came up with is to allocate an array of 256 bits (32 bytes) for each set, where the bit at position n tells whether n is in the set. The problem with this approach is that it requires the same amount of memory even if the set is mostly empty (has only a few elements).
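A minimal sketch of that first method, assuming a hypothetical ByteSet type and helper names that are not from the question:

```c
#include <stdint.h>
#include <string.h>

/* 256 bits, one per possible byte value: always 32 bytes per set. */
typedef struct { uint8_t bits[32]; } ByteSet;

static void byteset_clear(ByteSet *s)
{
    memset(s->bits, 0, sizeof s->bits);
}

static void byteset_add(ByteSet *s, uint8_t n)
{
    s->bits[n >> 3] |= (uint8_t)(1u << (n & 7));
}

static int byteset_has(const ByteSet *s, uint8_t n)
{
    return (s->bits[n >> 3] >> (n & 7)) & 1;
}
```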
The second approach I tried is to store sets as regular arrays. So, if a set contains n elements, it requires n + 1 bytes to be stored. The first byte represents the number of elements and the other bytes represent the elements. But, as we know, order in sets is not important, so something strongly tells me that there must be a way to improve this.
My third attempt is to enumerate all possible sets and then just store the index of the set (an integer representing its index in the list of all possible sets of bytes). But it turned out that this is absolutely equivalent to the first approach. Basically, I still need 32 bytes to store any set, so it is not very useful.
The fourth attempt is based on my second approach. I noticed that if the set contains n elements it will, of course, require n + 1 bytes (if I use my second method). But if, for example, element k has appeared in the set (actually in the array, because in my second attempt I store sets as arrays), then it cannot appear again. Basically, if k appears again, then it must mean something different (maybe k - 1). So I did some optimizations and noticed that I can save some bytes if I encode each next element differently (for example, [3, 3, 5, 7] is interpreted as a set of 3 elements whose elements are {3, 4, 5} (every next element is decreased by its index), and [3, 3, 5, 6] is interpreted as {3, 4, 2} (notice that 3 and 4 already exist, so 6 is decreased by 2 and becomes 4, but 4 exists and 3 exists, so it must be 2)). But how can this approach actually save bytes? I experimented and realized that I can order the elements in the array so that, in some cases, the high bit is not needed to encode an element, so I saved 1 bit per element, which is about n / 16 bytes saved (that is, n / 2 * 1 / 8).
The fifth approach is similar to my second approach, but it interprets the number of elements differently. If the number of elements is less than 128, it reads all the elements normally from the following array in memory. But if the number of elements is greater than 128, it creates a full set and then removes the elements listed in the following array. On average, it saves a lot of bytes, but it is still far from optimal.
My last attempt (sixth attempt) is to enumerate just some sets (for example, create a list of sets containing: the full set, the set with only even numbers, the set with only odd numbers, the set with elements less than 128, the set with elements greater than 128, etc.) and then to use elements from that list and basic set operations (union, intersection, etc.) to reconstruct the original set. It requires a few bytes for each base set used from the list, a few bits for each union or intersection operation, and of course one byte for the length of the sequence. It depends heavily on the number of elements in the base set list, which should be hardcoded, but it seems hard to properly create that list and properly choose the elements in it. Anyway, something tells me that this is not a very clever approach.
But what is actually the optimal way? Something tells me that my fourth attempt is not so bad, but can we do better? The sets I operate on have a random number of elements, so on average 128 elements per set, so I am looking for a way to allocate 128 bits (16 bytes) per set. The best I did so far uses my fourth approach, which is far from my goal.
Just to mention again, speed is not important. Encoding/decoding may be extremely slow; the only important thing is that sets require as little memory as possible. When I said "in memory" I meant encoded in memory (compressed). Also, I am interested in as few bits as possible (not only bytes) because I want to store billions of compressed sets on my HDD, so it is important to calculate the average number of bits I need for each set so I know how many resources are available for what I want to achieve.
P.S. If you want some code (but I really don't see why you would) I can post the solutions I wrote in C for all of these approaches. Anyway, I am not asking for code or technical details of how to implement this in a specific programming language; I am just asking for a method/algorithm for compressing sets.
Thank you in advance.
Your first method (and the third method, which is equivalent) is already optimal. It cannot be improved.
There are 2^256 possible sets of numbers you're working with. By the pigeonhole principle, you need 2^256 numbers to identify them all, and you'll need 256 bits to represent those numbers. Any method of identifying the sets which used fewer than 256 bits would leave at least one pair (and probably many pairs) of sets sharing the same identifier.
There are 2^256 possible sets of bytes.
If all sets are equally likely, then the best you can do is to use a constant 256 bits (32 bytes) to indicate which of the 2^256 possibilities you have.
You seem not to like this idea, because you think that sets with only a few elements should take fewer bits. But if they are no more likely to occur than any other sets, then that would not be optimal.
If sets with fewer elements are more likely, then using a constant 32-bytes is not optimal, but the optimal encoding depends on the precise probability distribution of possible sets, which you haven't given. The relevant concept from information theory is "entropy": https://en.wikipedia.org/wiki/Entropy_(information_theory)
Succinctly, in an optimal encoding, the average number of bits required will be the sum of Pᵢ * -log₂(Pᵢ) over all 2^256 possible sets, where each Pᵢ is the probability of having to encode a particular set (all the Pᵢ must sum to 1).
If the number of elements is the only thing that you think should affect the size of the encoding, then you can't go too far wrong with something like this:
1) Use 1 byte to write out the number of elements in the set. There are 257 possible set sizes, but you can use 0 for both 0 and 256 elements.
2) Write out the index of the set in an enumeration of all sets with that length. (If you wrote a 0 then you need 1 bit to indicate the empty or full set.) If the set is known to have N elements, then the number of bits required for this number will be log₂(256!/(N!*(256-N)!)), rounded up.
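To get a feel for the sizes in step 2, the index width can be computed exactly. The sketch below is an illustration, not part of the answer: it builds the binomial coefficient C(256, N) with a small fixed-width big integer (the helper names and the 10-limb size are choices made here) and returns the number of bits needed to index one of that many sets.

```c
#include <stdint.h>

enum { LIMBS = 10 };   /* 320 bits, enough for C(256, N) * 256 */

/* x *= m, where x is a little-endian array of 32-bit limbs. */
static void big_mul(uint32_t x[LIMBS], uint32_t m)
{
    uint64_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t t = (uint64_t)x[i] * m + carry;
        x[i] = (uint32_t)t;
        carry = t >> 32;
    }
}

/* x /= d, discarding the remainder (every division below is exact). */
static void big_div(uint32_t x[LIMBS], uint32_t d)
{
    uint64_t rem = 0;
    for (int i = LIMBS - 1; i >= 0; i--) {
        uint64_t t = (rem << 32) | x[i];
        x[i] = (uint32_t)(t / d);
        rem = t % d;
    }
}

/* Bits needed to index one of the C(256, n) sets with n elements,
   i.e. ceil(log2(C(256, n))), for 0 <= n <= 256. */
unsigned index_bits(unsigned n)
{
    uint32_t c[LIMBS] = {1};
    for (unsigned i = 0; i < n; i++) {     /* C(256, i+1) = C(256, i) * (256-i) / (i+1) */
        big_mul(c, 256 - i);
        big_div(c, i + 1);
    }
    for (int i = 0; i < LIMBS; i++) {      /* c -= 1, with borrow */
        if (c[i] != 0) { c[i]--; break; }
        c[i] = UINT32_MAX;
    }
    for (int i = LIMBS - 1; i >= 0; i--) { /* bit length of c - 1 */
        if (c[i] != 0) {
            unsigned bits = 32u * (unsigned)i;
            for (uint32_t t = c[i]; t != 0; t >>= 1)
                bits++;
            return bits;
        }
    }
    return 0;
}
```

For example, index_bits(1) is 8 and index_bits(128) is 252. Adding the count byte, a 128-element set costs 8 + 252 = 260 bits under this scheme, slightly more than the flat 256-bit bitmap, while small sets cost far less.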

Splitting number into bit halves

I'm implementing Karatsuba's method as part of an exercise. Karatsuba's method itself isn't terribly difficult, but one part of it is confusing me. Both numbers being multiplied have to be split into two halves, the high and the low bits. But I can't find much information about how this split is done.
I noticed most Karatsuba implementations use strings to represent huge numbers, but I'm doing something a bit different. I'm representing them as an array of ints, where each element is the next 30 bits of the huge number. Note that this means these arrays may be odd-length. If the huge number's size is not a multiple of 30, it gets leading zeros so it can still be represented as such.
So how can this be split into high and low halves? The main problem I'm running into is that since it can be odd-length, that means I can't just divide the arrays by their elements. Basically, how can I select the first and last bit halves of these int arrays so I can continue recursing in Karatsuba's method?
As long as I can retrieve the bits, I can create two smaller int arrays from them.
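One way to look at it: the Karatsuba identity holds for any split point m, since x = high * B^m + low with B = 2^30 is true whatever m is, so the two parts do not have to be the same size. That allows splitting at a limb boundary, for instance at m = (n + 1) / 2, and an odd-length array simply yields parts of unequal length. The sketch below assumes the least significant limb is stored first; the LimbView type and split_at name are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* A read-only view of consecutive 30-bit limbs; limb[0] is the least
   significant one (this limb order is an assumption of the sketch). */
typedef struct {
    const uint32_t *limb;
    size_t len;
} LimbView;

/* Split x (n limbs) at limb index m so that
   x = high * 2^(30*m) + low. No bits are copied or shifted. */
void split_at(const uint32_t *x, size_t n, size_t m,
              LimbView *low, LimbView *high)
{
    if (m > n)
        m = n;              /* a short operand has an empty high part */
    low->limb = x;
    low->len = m;
    high->limb = x + m;
    high->len = n - m;
}
```

Both operands should be split at the same m, for example half the length of the longer one, so that the partial products line up.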

Fastest way to compare one byte array with many others?

I have a loop with the following structure :
Calculate a byte array with length k (somewhere slow)
Find if the calculated byte array matches any in a list of N byte arrays I have.
Repeat
My loop is to be called many many times (it's the main loop of my program), and I want the second step to be as fast as possible.
The naive implementation for the second step would be using memcmp:
    char *calc;
    char **list;
    int k, n, i;
    for (i = 0; i < n; i++) {
        if (!memcmp(calc, list[i], k)) {
            printf("Matches array %d", i);
        }
    }
Can you think of any faster way ? A few things :
My list is fixed at the start of my program, any precomputation on it is fine.
Let's assume that k is small (<= 64), N is moderate (around 100-1000).
Performance is the goal here, and portability is a non issue. Intrinsics/inline assembly is fine, as long as it's faster.
Here are a few thoughts that I had :
Given k<64 and I'm on x86_64, I could sort my lookup array as a long array, and do a binary search on it. O(log(n)). Even if k was big, I could sort my lookup array and do this binary search using memcmp.
Given k is small, again, I could compute an 8/16/32-bit checksum (the simplest being folding my arrays over themselves using a xor) of all my lookup arrays and use a built-in PCMPGT, as in "How to compare more than two numbers in parallel?". I know SSE4.2 is available here.
Do you think going for vectorization/sse is a good idea here ? If yes, what do you think is the best approach.
I'd like to say that this isn't early optimization, but performance is crucial here, I need the outer loop to be as fast as possible.
Thanks
EDIT1: It looks like http://schani.wordpress.com/tag/c-optimization-linear-binary-search-sse2-simd/ provides some interesting thoughts about it. Binary search on a list of long seems the way to go.
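The sorted-list idea for arbitrary k can be sketched with the standard qsort and bsearch, using memcmp as the ordering. This is only an illustration: the record length K is fixed at compile time here, and the records are assumed to be stored back to back in one flat buffer rather than as a char** list.

```c
#include <stdlib.h>
#include <string.h>

enum { K = 16 };   /* record length in bytes; fixed here for illustration */

static int cmp_record(const void *a, const void *b)
{
    return memcmp(a, b, K);
}

/* One-time precomputation: sort n records of K bytes each. */
void sort_records(unsigned char *records, size_t n)
{
    qsort(records, n, K, cmp_record);
}

/* Returns the index of the matching record in the sorted buffer, or -1. */
long find_record(const unsigned char *records, size_t n,
                 const unsigned char *calc)
{
    const unsigned char *hit = bsearch(calc, records, n, K, cmp_record);
    return hit ? (long)((hit - records) / K) : -1;
}
```

The returned index refers to the sorted order; if the original position matters, it would have to be stored alongside each record.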
The optimum solution is going to depend on how many arrays there are to match, the size of the arrays, and how often they change. I would look at avoiding doing the comparisons at all.
Assuming the list of arrays to compare it to does not change frequently and you have many such arrays, I would create a hash of each array, then when you come to compare, hash the thing you are testing. Then you only need compare the hash values. With a hash like SHA256, you can rely on this both as a positive and negative indicator (i.e. the hashes matching is sufficient to say the arrays match as well as the hashes not matching being sufficient to say the arrays differ). This would work very well if you had (say) 1,000,000 arrays to compare against which hardly ever change, as calculating the hash would be faster than 1,000,000 array comparisons.
If your number of arrays is a bit smaller, you might consider a faster non-cryptographic hash. For instance, a 'hash' which simply summed the bytes in an array modulo 256 (this is a terrible hash and you can do much better) would eliminate the need to compare (say) 255/256ths of the target array space. You could then compare only those where the so-called 'hash' matches. There are well-known hash-like things such as CRC-32 which are quick to calculate.
In either case you can then have a look up by hash (modulo X) to determine which arrays to actually compare.
You suggest k is small, N is moderate (i.e. about 1000). I'm guessing speed will revolve around memory cache. Not accessing 1,000 small arrays here is going to be pretty helpful.
All the above will be useless if the arrays change with a frequency similar to the comparison.
Addition (assuming you are looking at 64 bytes or similar). I'd look into a very fast non-cryptographic hash function. For instance look at: https://code.google.com/p/smhasher/wiki/MurmurHash3
It looks like 3-4 instructions per 32-bit word to generate the hash. You could then truncate the result to (say) 12 bits for a 4096-entry hash table with very few collisions (each bucket being a linked list of the target arrays). This means you would look at something like 30 instructions to calculate the hash, then one instruction per bucket entry (expected value 1) to find the list item, then one manual compare per expected hit (that would be between 0 and 1). So rather than comparing 1000 arrays, you would compare between 0 and 1 arrays, and generate one hash. If you can't compare 999 arrays in 30-ish instructions (I'm guessing not!) this is obviously a win.
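A minimal sketch of that layout, with two substitutions named plainly: it uses 32-bit FNV-1a instead of MurmurHash3 to stay short, and the bucket chains are index arrays rather than pointer-linked lists. The Table type, function names, and the MAX_N limit are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BUCKETS = 4096, MAX_N = 1024 };   /* 12-bit table, up to 1024 arrays */

typedef struct {
    int head[BUCKETS];                   /* first entry of each bucket, -1 if empty */
    int next[MAX_N];                     /* next entry in the same bucket, -1 at end */
    const unsigned char *item[MAX_N];    /* the precomputed arrays */
    size_t k;                            /* length of every array */
    int n;
} Table;

/* 32-bit FNV-1a, standing in for a faster hash such as MurmurHash3. */
static uint32_t fnv1a(const unsigned char *p, size_t k)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < k; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Precomputation, done once at program start. */
void table_build(Table *t, const unsigned char *const *list, int n, size_t k)
{
    for (int b = 0; b < BUCKETS; b++)
        t->head[b] = -1;
    t->k = k;
    t->n = n < MAX_N ? n : MAX_N;
    for (int i = 0; i < t->n; i++) {
        uint32_t b = fnv1a(list[i], k) & (BUCKETS - 1);   /* truncate to 12 bits */
        t->item[i] = list[i];
        t->next[i] = t->head[b];
        t->head[b] = i;
    }
}

/* Returns the index of the matching array in the original list, or -1. */
int table_find(const Table *t, const unsigned char *calc)
{
    uint32_t b = fnv1a(calc, t->k) & (BUCKETS - 1);
    for (int i = t->head[b]; i >= 0; i = t->next[i])
        if (memcmp(calc, t->item[i], t->k) == 0)          /* confirm the hit */
            return i;
    return -1;
}
```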
We can assume that my stuff fits in 64bits, or even 32bits. If it
wasn't, I could hash it so it could. But now, what's the fastest way
to find whether my hash exists in the list of precomputed hashes ?
This is sort of a meta-answer, but... if your question boils down to: how can I efficiently find whether a certain 32-bit number exists in a list of other 32-bit numbers, this is a problem IP routers deal with all the time, so it might be worth looking into the networking literature to see if there's something you can adapt from their algorithms. e.g. see http://cit.mak.ac.ug/iccir/downloads/SREC_07/K.J.Poornaselvan1,S.Suresh,%20C.Divya%20Preya%20and%20C.G.Gayathri_07.pdf
(Although, I suspect they are optimized for searching through larger numbers of items than your use case..)
Can you do an XOR instead of memcmp?
Or calculate a hash of each element in the array, sort the hashes, and search for the hash.
But hashing will take more time, unless you can come up with a faster hash.
Another way is to pre-build a tree from your list and use tree search.
For example, with the list:
aaaa
aaca
acbc
acca
bcaa
bcca
caca
we can get a tree like this:
root
-a
--a
---a
----a
---c
----a
--c
---b
----c
---c
----a
-b
--c
---a
----a
---c
----a
-c
--a
---c
----a
Then do a binary search on each level of the tree.

x86-64 integer vectorisation optimise

I am trying to vectorize a logical validation problem to run on Intel 64.
I will first try to describe the problem:
I have a static array v[] of 70-bit integers (appx 400,000 of them) which are all known at compile time.
A producer creates 70-bit integers a, a lot of them, very quickly.
For each a I need to find out if there exists an element from v for which (v[i] & a) == 0.
So far my implementation in C is something like this (simplified):
    for (; *v; v++) {
        if (!(a & *v))
            return FOUND;
    }
    // a had no matching element in v
    return NOT_FOUND;
I am looking into optimizing this using SSE/AVX to speed up the process and do more of those tests in parallel. I got as far as loading a and *v into an XMM register each and calling the PTEST instruction to do the validation.
I am wondering if there is a way to expand this to use all 256 bits of the new YMM registers?
Maybe packing 3x70 bits into a single register?
I can't quite figure out, though, how to pack/unpack them efficiently enough to justify not just using one register per test.
A couple things that we know about the nature of the input:
All elements in v[] have very few bits set
It is not possible to permute/compress v[] in any way to make it use less than 70 bits
The FOUND condition is expected to be satisfied after checking approximately 20% of v[] on average.
It is possible to buffer more than one a before checking them in a batch.
I do not necessarily need to know which element of v[] matched, only that one did or not.
Producing a requires very little memory, so anything left in L1 from the previous call is likely to still be there.
The resulting code is intended to be run on the newest generation of Intel Xeon processors supporting SSE4.2, AVX instructions.
I will be happy to accept assembly or C that compiles with Intel C compiler or at least GCC.
This sounds like what you really need is a better data structure to store the v[], so that searches take less than linear time.
Consider that if (v[0] & v[1]) & a is not zero, then neither (v[0] & a) nor (v[1] & a) can be zero. This means it is possible to create a tree structure where the v[] are the leaves, and the parent nodes are the AND combination of their children. Then, if parentNode & a gives you a non-zero value, you can skip looking at the children.
However, this isn't necessarily helpful - the parent node only ends up testing the bits common between the children, so if there are only a few of those, you still end up testing lots of leaf nodes. But if you can find clusters in your data set and group many similar v[] under a common parent, this may drastically reduce the number of comparisons you have to do.
On the other hand, such a tree search involves a lot of conditional branches (expensive), and would be hard to vectorize. I'd first check whether you can get away with just two levels: first do a vectorized search among the cluster parent nodes, then for each match do a search for the entries in that cluster.
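The two-level variant can be sketched in scalar C as follows. The Mask70 struct (two 64-bit words, with the top 6 bits in hi), the fixed cluster size GROUP, and the explicit length n are assumptions of this example; it also groups consecutive entries, whereas the benefit in practice depends on clustering similar v[] together first.

```c
#include <stddef.h>
#include <stdint.h>

/* A 70-bit value held in two words; hi carries the top 6 bits. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Mask70;

enum { GROUP = 8 };   /* cluster size; an arbitrary choice for this sketch */

static int disjoint(Mask70 x, Mask70 a)
{
    return ((x.lo & a.lo) | (x.hi & a.hi)) == 0;
}

/* parents[g] = AND of all entries in cluster g.
   parents must have room for (n + GROUP - 1) / GROUP elements. */
void build_parents(const Mask70 *v, size_t n, Mask70 *parents)
{
    for (size_t g = 0; g * GROUP < n; g++) {
        Mask70 p = { ~(uint64_t)0, ~(uint64_t)0 };
        for (size_t i = g * GROUP; i < n && i < (g + 1) * GROUP; i++) {
            p.lo &= v[i].lo;
            p.hi &= v[i].hi;
        }
        parents[g] = p;
    }
}

/* Returns 1 if some v[i] shares no set bit with a, else 0. */
int find_disjoint(const Mask70 *v, size_t n, const Mask70 *parents, Mask70 a)
{
    for (size_t g = 0; g * GROUP < n; g++) {
        if (!disjoint(parents[g], a))
            continue;   /* a bit common to the whole cluster hits a: skip it */
        for (size_t i = g * GROUP; i < n && i < (g + 1) * GROUP; i++)
            if (disjoint(v[i], a))
                return 1;
    }
    return 0;
}
```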
Actually here's another idea, to help with the fact that 70 bits don't fit well into registers:
You could split v[] into 64 (=2^6) different arrays. Of the 70 bits in the original v[], the 6 most significant bits are used to determine which array will contain the value, and only the remaining 64 bits are actually stored in the array.
By testing the mask a against the array indices, you will know which of the 64 arrays to search (in the worst case, if a doesn't have any of the 6 highest bits set, that'll be all of them), and each individual array search deals only with 64 bits per element (much easier to pack).
In fact this second approach could be generalized into a tree structure as well, which would give you some sort of trie.
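A sketch of the split into 64 arrays, assuming a hypothetical Buckets layout in which low[t] points at the stored low 64 bits of every entry whose top 6 bits equal t, and a is passed as its low 64 bits plus its top 6 bits:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint64_t *low[64];   /* low[t]: low 64 bits of entries with top bits == t */
    size_t count[64];          /* number of entries in each array */
} Buckets;

/* Returns 1 if some stored 70-bit entry shares no set bit with a, else 0.
   a_low is the low 64 bits of a, a_top its 6 most significant bits. */
int find_disjoint_bucketed(const Buckets *b, uint64_t a_low, unsigned a_top)
{
    for (unsigned t = 0; t < 64; t++) {
        if (t & a_top)
            continue;          /* the top bits alone already collide with a */
        for (size_t i = 0; i < b->count[t]; i++)
            if ((b->low[t][i] & a_low) == 0)
                return 1;
    }
    return 0;
}
```

The inner loop now works on plain 64-bit values, so four of them fit in one 256-bit register.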

How to find 0 in an integer array of size 100 ,having 99 elements as 1 and only one element 0 in most efficient way

I need to find the position (or index), say i, in an integer array A of size 100 such that A[i]=0. 99 elements of array A are 1 and only one element is 0. I want the most efficient way of solving this problem (so no one-by-one element comparison).
Others have already answered the fundamental question - you will have to check all entries, or at least, up until the point where you find the zero. This would be a worst case of 99 comparisons. (Because if the first 99 are ones then you already know that the last entry must be the zero, so you don't need to check it)
The possible flaw in these answers is the assumption that you can only check one entry at a time.
In reality we would probably use direct memory access to compare several integers at once. (e.g. if your "integer" is 32 bits, then processors with SIMD instructions could compare 128 bits at once to see if any entry in a group of 4 values contains the zero - this would make your brute force scan just under 4 times faster. Obviously the smaller the integer, the more entries you could compare at once).
But that isn't the optimal solution. If you can dictate the storage of these values, then you could store the entire "array" as binary bits (0/1 values) in just 100 bits (the easiest would be to use two 64-bit integers (128 bits) and fill the spare 28 bits with 1's) and then you could do a "binary chop" to find the data.
Essentially a "binary chop" works by chopping the data in half. One half will be all 1's, and the other half will have the zero in it. So a single comparison allows you to reject half of the values at once. (You can do a single comparison because half of your array will fit into a 64-bit long, so you can just compare it to 0xffffffffffffffff to see if it is all 1's). You then repeat on the half that contains the zero, chopping it in two again and determining which half holds the zero... and so on. This will always find the zero value in 7 comparisons - much better than comparing all 100 elements individually.
This could be further optimised because once you get down to the level of one or two bytes you could simply look up the byte/word value in a precalculated look-up table to tell you which bit is the zero. This would bring the algorithm down to 4 comparisons and one look-up (in a 64kB table), or 5 comparisons and one look-up (in a 256-byte table).
So we're down to about 5 operations in the worst case.
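The plain binary chop (without the look-up table) can be sketched like this, assuming element i is stored in bit i of lo for i < 64 and in bit i - 64 of hi otherwise, with the spare bits filled with 1's as described above. The function name is made up for the example.

```c
#include <stdint.h>

/* Returns the index (0..127) of the single zero bit in the 128-bit value
   hi:lo. Assumes exactly one bit is zero. Uses 7 comparisons. */
unsigned find_zero(uint64_t lo, uint64_t hi)
{
    unsigned pos = 0;
    uint64_t w = lo;
    if (lo == UINT64_MAX) {     /* comparison 1: the zero is in the high word */
        w = hi;
        pos = 64;
    }
    for (unsigned width = 32; width > 0; width /= 2) {   /* comparisons 2..7 */
        uint64_t mask = (UINT64_C(1) << width) - 1;
        if ((w & mask) == mask) {   /* lower half is all ones: keep the upper half */
            w >>= width;
            pos += width;
        }
    }
    return pos;
}
```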
But if you could dictate the storage of the array, you could just "store" the array by noting down the index of the zero entry. There is no need at all to store all the individual values. This would only take 1 byte of memory to store the state, and this byte would already contain the answer, giving you a cost of just 1 operation (reading the stored value).
You cannot do better than a linear scan - unless the data is sorted or you have some extra information about it. At the very least you need to read all the data, since you have no clue where this 0 is hiding.
If it is [sorted] - just access the relevant [minimum] location.
Something tells me that the expected answer is "compare pairs":
while (a[i] == a[i+1]) i += 2;
Although it looks better than the obvious approach, it's still O(n).
Keep track of it as you insert to build the array. Then just access the stored value directly. O(1) with a very small set of constants.
Imagine 100 sea shells, under one is a pearl. There is no more information.
There is really no way to find it faster than trying to turn them all over. The computer can't do any better with the same knowledge. In other words, a linear scan is the best you can do unless you save the position of the zero earlier in the process and just use that.
More trivia than anything else, but if you happen to have a quantum computer this can be done faster than linear.
Grover's algorithm
