Can I use SIMD to bucket sort / categorize? - c

I'm curious about SIMD and wondering if it can handle this use case.
Let's say I have an array of 2048 integers, like
[0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]
Note how they all start with either 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays:
One for all the integers starting with 0x00
One for all the integers starting with 0x01
One for all the integers starting with 0x02
One for all the integers starting with 0x03
I imagine I would have code like this:
#include <stdint.h>

int main() {
    uint16_t in[2048] = ...;
    // 4 arrays, one for each category
    uint16_t out[4][2048];
    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };
    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {
        (*** magic simd instructions here ***)
        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            categories[i] = nextIn[i] & 0xFF00; // 0x0000, 0x0100, 0x0200 or 0x0300
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i] >> 8; // bucket index 0..3
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}
I imagine that my first inner loop doesn't need SIMD, it can be just a (x & 0xFF00FF00FF00FF00) instruction, but I wonder if we can make that second inner loop into a SIMD instruction.
Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?
The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.
If not, does anything come close?
Thanks!

You can do it with SIMD, but how fast it is will depend on exactly what instruction sets you have available, and how clever you are in your implementation.
One approach is to take the array and "sift" it to separate out elements that belong in different buckets. For example, grab 32 bytes from your array, which will hold 16 16-bit elements. Use some cmpgt instructions to get a mask which determines whether each element falls into the 00 + 01 bucket or the 02 + 03 bucket. Then use some kind of "compress" or "filter" operation to move all masked elements contiguously into one end of a register, and then the same for the unmasked elements.
Then repeat this one more time to sort out 00 from 01 and 02 from 03.
With AVX2 you could start with this question for inspiration on the "compress" operation. With AVX512 you could use the vcompress family of instructions to help out: they do exactly this operation, but only at 32-bit or 64-bit granularity (vpcompressd / vpcompressq), so you'd need to do at least a couple per vector.
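If newer extensions are on the table, here is a minimal sketch of the whole bucketing loop (names are mine; it assumes AVX512BW plus AVX512-VBMI2 for the 16-bit compress-store, n a multiple of 32, and out buffers big enough to hold everything). Note it does one masked compress-store per bucket rather than the two-level sift described above:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void bucket4_avx512(const uint16_t *in, size_t n,
                    uint16_t *out[4], size_t counts[4])
{
    for (size_t i = 0; i < n; i += 32) {                /* 32 x 16-bit per zmm */
        __m512i v   = _mm512_loadu_si512((const void *)(in + i));
        __m512i cat = _mm512_srli_epi16(v, 8);           /* high byte = bucket 0..3 */
        for (int c = 0; c < 4; c++) {
            __mmask32 m = _mm512_cmpeq_epi16_mask(cat, _mm512_set1_epi16((short)c));
            /* pack the selected elements contiguously at the bucket's tail */
            _mm512_mask_compressstoreu_epi16(out[c] + counts[c], m, v);
            counts[c] += (size_t)_mm_popcnt_u32((uint32_t)m);
        }
    }
}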
You could also try a vertical approach, where you load N vectors and then swap elements between them so that the 0th vector has the smallest elements, etc. At this point, you can use a more optimized algorithm for the compress stage (e.g., if you vertically sort enough vectors, the vectors at the ends may consist entirely of elements starting with 0x00, etc.).
Finally, you might also consider organizing your data differently, either at the source or as a pre-processing step: separating out the "category" byte which is always 0-3 from the payload byte. Many of the processing steps only need to happen on one or the other, so you can potentially increase efficiency by splitting them out that way. For example, you could do the comparison operation on 32 bytes that are all categories, and then do the compress operation on the 32 payload bytes (at least in the final step where each category is unique).
This would lead to arrays of byte elements, not 16-bit elements, where the "category" byte is implicit. You've cut your data size in half, which might speed up everything else you want to do with the data in the future.
If you can't produce the source data in this format, you could use the bucketing as an opportunity to remove the tag byte as you put the payload into the right bucket, so the output is uint8_t out[4][2048];. If you're doing a SIMD left-pack with a pshufb byte-shuffle as discussed in comments, you could choose a shuffle control vector that packs only the payload bytes into the low half.
(Until AVX512BW, x86 SIMD doesn't have any variable-control word shuffles, only byte or dword, so you already needed a byte shuffle which can just as easily separate payloads from tags at the same time as packing payload bytes to the bottom.)
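As a concrete starting point, here is a minimal scalar sketch of that pre-processing split (the function name is mine; the tag-in-the-high-byte layout is taken from the question's data):

#include <stdint.h>
#include <stddef.h>

/* Sketch: split each 16-bit element into a "tag" plane (always 0-3 here)
   and a "payload" plane, halving the data the later steps must touch. */
void split_tag_payload(const uint16_t *in, size_t n,
                       uint8_t *tags, uint8_t *payloads)
{
    for (size_t i = 0; i < n; i++) {
        tags[i]     = (uint8_t)(in[i] >> 8);    /* category byte */
        payloads[i] = (uint8_t)(in[i] & 0xFF);  /* payload byte  */
    }
}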

Related

Vectorize equality test without AVX

I would like to vectorize an equality test in which all elements in a vector are compared against the same value, and the results are written to an array of 8-bit words. Each 8-bit word in the resulting array should be zero or one. (This is a little wasteful, but bit packing the booleans is not an important detail in this problem). This function can be written as:
#include <stdint.h>
void vecEq (uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
    for(int i = 0; i < len; i++) {
        results[i] = numbers[i] == target;
    }
}
If we knew that both vectors were 256-bit aligned, we could start by broadcasting target into an AVX register and then using SIMD's _mm256_cmpeq_epi8 to perform 32 equality tests at a time. However, in the setting I'm working in, both numbers and results have been allocated by a runtime (the GHC runtime, but this is irrelevant). They are both guaranteed to be 64-bit aligned. Is there any way to vectorize this operation, preferably without using AVX registers?
The approach I've considered is broadcasting the 8-bit word to a 64-bit word up front and then XORing it with 8 elements at a time. This doesn't work though, because I cannot find a vectorized way to convert the result of the XOR (zero means equal, anything else means unequal) to the equality test result I need (0 means unequal, 1 means equal, nothing else should ever exist). Roughly, the sketch I have is:
void vecEq (uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
    uint64_t targetA = (uint64_t)target;
    uint64_t targetB = targetA<<56 | targetA<<48 | targetA<<40 | targetA<<32
                     | targetA<<24 | targetA<<16 | targetA<<8 | targetA;
    for(int i = 0; i < len; i++) {
        uint64_t tmp = numbers[i] ^ targetB;
        results[i] = ... something with tmp ...;
    }
}
Further to the comments above (the code will vectorise just fine). If you are using AVX, the best strategy is usually just to use unaligned load/store intrinsics. They have no extra cost if your data does happen to be aligned, and are as cheap as the HW can make them for cases of misalignment. (On Intel CPUs, there's still a penalty for loads/stores that span two cache lines, aka a cache-line split).
Ideally you can still align your buffers by 32, but if your data has to come from L2 or L3 or RAM, misalignment often doesn't make a measurable difference. And the best strategy for dealing with possible misalignment is usually just to let the HW handle it, instead of scalar up to an alignment boundary or something like you'd do with SSE, or with AVX512 where alignment matters again (any misalignment leads to every load/store being a cache-line split).
Just use _mm256_loadu_si256 / _mm256_storeu_si256 and forget about it.
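To make that concrete, here is a minimal sketch with unaligned AVX2 loads/stores and a scalar tail (the function name is mine; it assumes only AVX2):

#include <immintrin.h>
#include <stdint.h>

void vecEq_avx2(const uint8_t *numbers, uint8_t *results, int len, uint8_t target)
{
    __m256i t = _mm256_set1_epi8((char)target);
    int i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(numbers + i));
        __m256i eq = _mm256_cmpeq_epi8(v, t);               /* 0xFF where equal */
        /* AND with 1 to turn 0x00/0xFF into the 0/1 bytes the question wants */
        _mm256_storeu_si256((__m256i *)(results + i),
                            _mm256_and_si256(eq, _mm256_set1_epi8(1)));
    }
    for (; i < len; i++)                                     /* scalar remainder */
        results[i] = (numbers[i] == target);
}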
As an interesting aside, Visual C++ will no longer emit aligned loads or stores, even if you request them.
https://godbolt.org/z/pL9nw9 (e.g. vmovups instead of vmovaps)
If compiling with GCC, you probably want to use -march=haswell or -march=znver1 not just -mavx2, or at least -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store so 256-bit unaligned loads compile to single instructions. The CPUs that benefit from those tune=generic defaults don't support AVX2, for example Sandybridge and Piledriver.

Copy from one memory to another skipping constant bytes in C

I am working on an embedded system application. I want to copy from source to destination, skipping a constant number of bytes. For example: source[6] = {0,1,2,3,4,5} and I want the destination to be {0,2,4}, skipping one byte each time. Unfortunately memcpy cannot fulfil this requirement. How can I achieve this in C without using a loop? I have a large amount of data to process, so the loop overhead matters.
My current implementation is something like this, which takes up to 5-6 milliseconds to copy 1500 bytes:
unsigned int len_actual = 1500;
/* Fill in the SPI DMA buffer. */
while (len_actual-- != 0)
{
    *(tgt_handle->spi_tx_buff ++) = ((*write_irp->buffer ++)) | (2 << 16) | DSPI_PUSHR_CONT;
}
You could write a "cherry picker" function
void * memcpk(void * destination, const void * source,
              size_t num, size_t size,
              int (*test)(const void * item));
which copies at most num "objects", each having size size from
source to destination. Only the objects that satisfy the test are copied.
Then with
int oddp(const void * intptr) { return (*((int *)intptr))%2; }
int evenp(const void * intptr) { return !oddp(intptr); }
you could do
int destination[6];
memcpk(destination, source, 6, sizeof(int), evenp);
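A minimal sketch of such a memcpk might look like this (the function is this answer's own invention, not a standard one):

#include <stddef.h>
#include <string.h>

void * memcpk(void * destination, const void * source,
              size_t num, size_t size,
              int (*test)(const void * item))
{
    char *dst = destination;
    const char *src = source;
    for (size_t i = 0; i < num; i++) {
        const void *item = src + i * size;
        if (test(item)) {              /* copy only objects passing the test */
            memcpy(dst, item, size);
            dst += size;
        }
    }
    return destination;
}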
Almost all CPUs have caches; which means that (e.g.) when you modify one byte the CPU fetches an entire cache line from RAM, modifies the byte in the cache, then writes the entire cache line back to RAM. By skipping small pieces you add overhead (more instructions for the CPU to execute) and won't reduce the amount of data transferred between cache and RAM.
Also, typically memcpy() is optimised to copy larger pieces. For example, if you copy an array of bytes but the CPU is capable of copying 32-bits (4 bytes) at once, then memcpy() will probably do the majority of the copying as a loop with 4 bytes per iteration (to reduce the number of reads and writes and reduce the number of loop iterations).
In other words, code to avoid copying specific bytes will be significantly slower than memcpy() for multiple reasons.
To avoid that, you really want to separate the data that needs to be copied from the data that doesn't - e.g. put everything that doesn't need to be copied at the end of the array and only copy the first part of the array (so that it remains "copy a contiguous area of bytes").
If you can't do that, the next alternative to consider would be masking. For example, if you have an array of bytes where some bytes shouldn't be copied, then you'd also have an array of "mask bytes" and do something like dest[i] = (dest[i] & mask[i]) | (src[i] & ~mask[i]); in a loop. This sounds horrible (and is horrible) until you optimise it by operating on larger pieces: if the CPU can copy 32-bit pieces, masking allows you to do 4 bytes per iteration by pretending all of the arrays are arrays of uint32_t. Note that for this technique wider is better - e.g. if the CPU supports operations on 256-bit pieces (AVX on 80x86) you'd be able to do 32 bytes per iteration of the loop. It also helps if you can make guarantees about the size and alignment (e.g. if the CPU can operate on 32 bits/4 bytes at a time, ensure that the size of the arrays is always a multiple of 4 bytes and that the arrays are always 4-byte aligned; even if it means adding unused padding at the end).
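Here's a minimal sketch of that 32-bit masked copy (names are mine; it assumes both arrays are 4-byte aligned and len is a multiple of 4):

#include <stdint.h>
#include <stddef.h>

void masked_copy32(uint8_t *dest, const uint8_t *src,
                   const uint8_t *mask, size_t len)
{
    uint32_t *d = (uint32_t *)dest;
    const uint32_t *s = (const uint32_t *)src;
    const uint32_t *m = (const uint32_t *)mask;
    for (size_t i = 0; i < len / 4; i++)
        d[i] = (d[i] & m[i]) | (s[i] & ~m[i]);  /* keep dest where mask bits are set */
}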
Also note that depending on which CPU it actually is, there might be special support in the instruction set. For one example, modern 80x86 CPUs (that support SSE2) have a maskmovdqu instruction that is designed specifically for selectively writing some bytes but not others. In that case, you'd need to resort to intrinsics or inline assembly because "pure C" has no support for this type of thing (beyond bitwise operators).
Having overlooked your speed requirements:
You may try to find a way which solves the problem without copying at all.
Some ideas here:
If you want to iterate the destination array, you could define
a kind of "picky iterator" for source that advances to the next number you allow: instead of iter++ do iter = advance_source(iter), as sketched below.
If you want to search the destination array then wrap a function around bsearch() that searches source and inspects the result. And so on.
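For instance, a sketch of such an iterator step, reusing the oddp predicate from the answer above (advance_source and the end argument are my names):

const int *advance_source(const int *iter, const int *end)
{
    do {
        ++iter;                          /* step past the current element */
    } while (iter < end && oddp(iter));  /* skip the numbers we don't allow */
    return iter;
}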
Depending on your processor's memory width and number of internal registers, you might be able to speed this up by using shift operations.
You need to know whether your processor is big-endian or little-endian.
Let's say you have a 32-bit processor and bus, and at least 4 spare registers that the compiler can use for optimisation. This means you can write 4 bytes into the same target word, having read 2 source words. Note that you are reading the bytes you are going to discard.
You can also improve the speed by making sure that everything is word aligned, and ignoring the gaps between the buffers, so not having to worry about the odd counts of bytes.
So, for little-endian:
inline unsigned long CopyEven(unsigned long a, unsigned long b)
{
    unsigned long c = a & 0xff;   /* a's byte 0 -> c's byte 0 */
    c |= (a>>8)  & 0xff00;        /* a's byte 2 -> c's byte 1 */
    c |= (b<<16) & 0xff0000;      /* b's byte 0 -> c's byte 2 */
    c |= (b<<8)  & 0xff000000;    /* b's byte 2 -> c's byte 3 */
    return c;
}

unsigned long* d = (unsigned long*)dest;
unsigned long* s = (unsigned long*)source;
for (int count = 0; count < sourceLenBytes; count += 8)
{
    *d = CopyEven(s[0], s[1]);
    d++;
    s += 2;
}

Hash function for 64 bit to 10 bits

I want a hash function that takes a long number (64 bits) and produces result of 10 bits. What is the best hash function for such purpose. Inputs are basically addresses of variables (Addresses are of 64 bits or 8 bytes on Linux), so my hash function should be optimized for that purpose.
I would say something like this:
uint32_t hash(uint64_t x)
{
    x >>= 3;
    return (x ^ (x>>10) ^ (x>>20)) & 0x3FF;
}
The least significant 3 bits are not very useful, as most variables are 4-byte or 8-byte aligned, so we remove them.
Then we take the next 30 bits and mix them together (XOR) in blocks of 10 bits each.
Naturally, you could also fold in (x>>30) ^ (x>>40) ^ (x>>50), but I'm not sure if they'll make any difference in practice.
I wrote a toy program to see some real addresses on the stack, data area, and heap. Basically I declared 4 globals, 4 locals and did 2 mallocs. I dropped the last two bits when printing the addresses. Here is an output from one of the runs:
20125e8
20125e6
20125e7
20125e4
3fef2131
3fef2130
3fef212f
3fef212c
25e4802
25e4806
What this tells me:
The LSB in this output (3rd bit of the address) is frequently 'on' and 'off'. So I wouldn't drop it when calculating the hash. Dropping 2 LSBs seems enough.
We also see that there is more entropy in the lower 8-10 bits. We must use that when calculating the hash.
We know that on a 64 bit machine, virtual addresses are never more than 48 bits wide.
What I would do next:
/* Drop two LSBs. */
a >>= 2;
/* Get rid of the MSBs. Keep 46 bits. */
a &= 0x3fffffffffff;
/* Get the 14 MSBs and fold them in to get a 32 bit integer.
The MSBs are mostly 0s anyway, so we don't lose much entropy. */
msbs = (a >> 32) << 18;
a ^= msbs;
Now we pass this through a decent 'half avalanche' hash function, instead of rolling our own. 'Half avalanche' means each bit of the input gets a chance to affect bits at the same position and higher:
uint32_t half_avalanche( uint32_t a)
{
    a = (a+0x479ab41d) + (a<<8);
    a = (a^0xe4aa10ce) ^ (a>>5);
    a = (a+0x9942f0a6) - (a<<14);
    a = (a^0x5aedd67d) ^ (a>>3);
    a = (a+0x17bea992) + (a<<7);
    return a;
}
For a 10-bit hash, use the 10 MSBs of the uint32_t returned. The hash function continues to work fine if you pick the N MSBs for an N-bit hash, effectively doubling the bucket count with each additional bit.
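For example (a sketch; a here is the folded 32-bit value from the fragment above):

uint32_t h = half_avalanche((uint32_t)a);
int bucket = h >> (32 - 10);   /* keep the 10 MSBs: a bucket index 0..1023 */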
I was a little bored, so I wrote a toy benchmark for this. Nothing fancy, it allocates a bunch of memory on the heap and tries out the hash I described above. The source can be had from here. An example result:
1024 buckets, 256 values generated, 29 collisions
1024 buckets, 512 values generated, 103 collisions
1024 buckets, 1024 values generated, 370 collisions
Next: I tried out the other two hashes answered here. They both have similar performance. Looks like: Just pick the fastest one ;)
Best for most distributions is mod by a prime; 1021 is the largest 10-bit prime. There's no need to strip low bits.
static inline int hashaddress(void *v)
{
    return (uintptr_t)v % 1021;
}
If you think performance might be a concern, have a few alternates on hand and race them in your actual program. Microbenchmarks are a waste; a difference of a few cycles is almost certain to be swamped by cache effects, and size matters.

optimized byte array shifter

I'm sure this has been asked before, but I need to implement a shift operator on a byte array of variable length. I've looked around a bit but I have not found any standard way of doing it. I came up with an implementation which works, but I'm not sure how efficient it is. Does anyone know of a standard way to shift an array, or at least have any recommendations on how to boost the performance of my implementation?
char* baLeftShift(const char* array, size_t size, signed int displacement, char* result)
{
    memcpy(result, array, size);
    short shiftBuffer = 0;
    char carryFlag = 0;
    char* byte;
    if (displacement > 0)
    {
        for (; displacement--;)
        {
            for (byte = &result[size - 1]; byte >= result; byte--)
            {
                shiftBuffer = (unsigned char)*byte; /* avoid sign extension */
                shiftBuffer <<= 1;
                *byte = carryFlag | (char)shiftBuffer;
                carryFlag = ((char*)&shiftBuffer)[1]; /* assumes little-endian */
            }
        }
    }
    else
    {
        char* end = result + size;
        displacement = -displacement;
        for (; displacement--;)
        {
            for (byte = result; byte < end; byte++)
            {
                shiftBuffer = (unsigned char)*byte; /* avoid sign extension */
                shiftBuffer <<= 7;
                *byte = carryFlag | ((char*)&shiftBuffer)[1]; /* assumes little-endian */
                carryFlag = (char)shiftBuffer;
            }
        }
    }
    return result;
}
If I can just add to what @dwelch is saying, you could try this.
Just move the bytes to their final locations. Then you are left with a shift count such as 3, for example, if each byte still needs to be left-shifted 3 bits into the next higher byte. (This assumes in your mind's eye the bytes are laid out in ascending order from right to left.)
Then rotate each byte to the left by 3. A lookup table might be faster than individually doing an actual rotate. Then, in each byte, the 3 bits to be shifted are now in the right-hand end of the byte.
Now make a mask M, which is (1<<3)-1, which is simply the low order 3 bits turned on.
Now, in order, from high order byte to low order byte, do this:
c[i] ^= M & (c[i] ^ c[i-1])
That will copy bits to c[i] from c[i-1] under the mask M.
For the last byte, just use a 0 in place of c[i-1].
For right shifts, same idea.
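A minimal sketch of that masked-merge stage (names are mine; it assumes the whole-byte moves are already done, byte 0 is the lowest-order byte, and bits is 0..7):

#include <stdint.h>
#include <stddef.h>

static void shift_left_bits(uint8_t *buf, size_t n, unsigned bits)
{
    if (n == 0)
        return;
    uint8_t M = (uint8_t)((1u << bits) - 1);   /* low `bits` bits turned on */
    for (size_t i = 0; i < n; i++)             /* rotate each byte left */
        buf[i] = (uint8_t)((buf[i] << bits) | (buf[i] >> (8 - bits)));
    for (size_t i = n - 1; i >= 1; i--)        /* high byte down to byte 1 */
        buf[i] ^= M & (buf[i] ^ buf[i - 1]);   /* pull low bits from the byte below */
    buf[0] &= (uint8_t)~M;                     /* "use a 0 in place of c[i-1]" */
}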
My first suggestion would be to eliminate the for loops around the displacement. You should be able to do the necessary shifts without the for(;displacement--;) loops. For displacements of magnitude greater than 7, things get a little trickier because your inner loop bounds will change and your source offset is no longer 1. i.e. your input buffer offset becomes magnitude / 8 and your shift becomes magnitude % 8.
It does look inefficient and perhaps this is what Nathan was referring to.
Assuming a char is 8 bits where this code is running, there are two things to do. First move the whole bytes: for example, if your input array is 0x00,0x00,0x12,0x34 and you shift left 8 bits then you get 0x00,0x12,0x34,0x00, and there is no reason to do that in a loop 8 times one bit at a time. So start by shifting the whole chars in the array by (displacement>>3) locations and pad the holes created with zeros, some sort of for(ra=(displacement>>3);ra<size;ra++) result[ra-(displacement>>3)] = array[ra];. Then shift the remaining (displacement&7) bits, combining each byte with its neighbour: result[ra] = (result[ra]<<(displacement&7)) | (result[ra+1]>>(8-(displacement&7))). A good compiler will precompute (displacement>>3), displacement&7 and 8-(displacement&7), and a good processor will have enough registers to keep all of those values. You might help the compiler by making separate variables for each of those items, but depending on the compiler and how you are using it, it could make it worse too.
The bottom line though is: time the code. Perform a thousand 1-bit shifts, then a thousand 2-bit shifts, etc., time the whole thing, then try a different algorithm, time it the same way and see if the optimizations make a difference, make it better or worse. If you know ahead of time this code will only ever be used for single-bit or less-than-8-bit shifts, adjust the timing test accordingly.
Your use of the carry flag implies that you are aware that many processors have instructions specifically for chaining infinitely long shifts using the standard register length (for a single bit at a time): rotate through carry, basically. The C language does not support that directly. For chaining single-bit shifts you could consider assembler and likely outperform the C code; at least the single-bit shifts are faster than C code can do them. A hybrid: move the whole bytes, then if the number of bits to shift (displacement&7) is, say, less than 4, use the assembler, else use a C loop. Again, the timing tests will tell you where the optimizations are.

A Perfect Hashing Function for an 8 by 8 board?

I'm implementing a board with only 2 types of pieces, and was looking for a function to map from that board to a Long Integer (64 bits). I was thinking this should not be so hard, since a long integer contains more available information than an 8 by 8 array (call it grid[x][y]) with only 3 possible elements in each spot including the empty element. I tried the following:
(1) Zobrist hashing with Longs rather than ints (Just to test - I didn't actually expect that to work perfectly)
(2) Translated the grid into a 64 character string of a base 3 number, and then took that number and parsed it into a long. I think this should work, but it took a very very long time.
Is there some simpler solution to (2) involving bit operations of shifting or something like that?
Edit: Please don't give me actual code, as this is for a class project, and that would probably be considered unethical in our department (or at least not in Java).
Edit2: Basically, there are only 10 whites and 10 blacks on the board at any given time, of which no two pieces of the same color can be neighbors, either in the horizontal, vertical, or diagonal direction. Also, there are 12 spaces for each color where only that color may place pieces.
If each tile in the game can be in 1 of 3 states at any point in the game, then the minimum amount of storage required for a "perfect hash", when hashing every possible state of the game board at any given moment, will be
= power(3, 8*8) individual hashes
= log2(3^64) bits
= approx. 101.4 bits, so you will need at least 102 bits to store this info
At this point, you may as well just say there are 4 states for each tile, which will bring you to needing 128 bits.
Once you do this, it's rather easy to make a fast hashing algorithm for the board.
E.g. (written as C++; you may need to alter the code if the platform doesn't support 128-bit numbers)
uint_128 CreateGameBoardHash(int (&game_board)[8][8])
{
    uint_128 board_hash = 0;
    for(int i = 0; i < 8; ++i)
    {
        for(int j = 0; j < 8; ++j)
        {
            /* cast before shifting so the shift happens at 128-bit width */
            board_hash |= (uint_128)game_board[i][j] << ((i * 8 + j) * 2);
        }
    }
    return board_hash;
}
This method will only waste 26 bits (a little more than 3 bytes) over the optimal solution of 102 bits, but you will save a LOT of processing time that would otherwise be spent doing base-3 math.
Edit: Here's a version that doesn't require 128 bits and should work on any 16-bit (or better) processor.
#include <cstdint>

struct GameBoardHash
{
    uint16_t row[8];
};

GameBoardHash CreateGameBoardHash(int (&game_board)[8][8])
{
    GameBoardHash board_hash;
    for(int i = 0; i < 8; ++i)
    {
        board_hash.row[i] = 0;
        for(int j = 0; j < 8; ++j)
        {
            board_hash.row[i] |= game_board[i][j] << (j*2);
        }
    }
    return board_hash;
}
It won't fit into a 64-bit integer. You have 64 squares and you need more than 1 bit to record each square. Why do you need it to fit into a 64-bit int? Are you targeting the ZX81?
How about a 16-byte array containing the bits? Each 2 bits represent a position's value, so that given a position in the 8x8 board (pos = 0-63), you can figure out the byte index by dividing pos by 4, and you can get the value with bit manipulation: the two bits sit at offset (pos mod 4) * 2 within that byte. The two bits can be either 00, 01, or 10.
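A sketch of that packed layout (the type and helper names are mine):

#include <stdint.h>

typedef struct { uint8_t b[16]; } PackedBoard;  /* 64 squares x 2 bits */

static int get_square(const PackedBoard *pb, int pos)      /* pos = 0..63 */
{
    return (pb->b[pos / 4] >> ((pos % 4) * 2)) & 0x3;      /* 00, 01 or 10 */
}

static void set_square(PackedBoard *pb, int pos, int val)  /* val = 0..2 */
{
    int shift = (pos % 4) * 2;
    pb->b[pos / 4] = (uint8_t)((pb->b[pos / 4] & ~(0x3 << shift))
                               | (val << shift));
}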
Reading your comments to David, it doesn't seem like you really need a perfect hash value. You just need a hashable object.
Make it simple for yourself... Make some hash for your position in your override of GetHashCode(), and then do the rest of the work in the Equals function.
If you REALLY need it to be perfect, then you have to use a GUID to encode your data and make your own hash that can use 128-bit keys. But that is a huge investment of time for little benefit.
