I'm trying to figure out how to generate a conditional Store in ARM neon. What I would like to do is the equivalent of this SSE instruction:
void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p);
which conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte of d will be stored.
Any suggestion on how to do it with NEON intrinsics?
Thank you
This is what I did:
uint8x16_t store_mask = {0,0,0,0,0,0,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff};
uint8x16_t tmp_dest = vld1q_u8(p_dest);
tmp_dest = vbslq_u8(store_mask, source, tmp_dest); // mask comes first; the result must be kept
vst1q_u8(p_dest, tmp_dest);
Assuming vectors of 16 x 1 byte elements, you would set up a mask vector where each element is either all 0s (0x00) or all 1s (0xff) to determine whether the element should be stored or not. Then you need to do the following (pseudo code):
init mask vector = 0x00/0xff in each element
init source vector = data to be selectively stored
load dest vector from dest location
apply `vbslq_u8` (`vbit` instruction) with dest vector, source vector and mask vector
store dest vector back to dest location
I have a math and programming related question about CRC calculations: how to avoid recomputing the full CRC of a block when only a small portion of it has changed.
My problem is the following: I have a 1K block of 4-byte structures, each one representing a data field. The 1K block has a CRC16 at the end, computed over the full 1K. When I change only one 4-byte structure, I would have to recompute the CRC of the full block, but I'm looking for a more efficient solution. Something where:
I take the full 1K block current CRC16
I compute something on the old 4 byte block
I "subtract" something obtained at step 2 from the full 1K CRC16
I compute something on the new 4 byte block
I "add" something obtained at step 4 to the result obtained at step 3
To summarize, I am thinking about something like this:
CRC(new-full) = [CRC(old-full) - CRC(block-old) + CRC(block-new)]
But I'm missing the math behind this and what to do to obtain the result, ideally as a general formula.
Thanks in advance.
Take your initial 1024-byte block A and your new 1024-byte block B. Exclusive-or them to get block C. Since you only changed four bytes, C will be a bunch of zeros, four bytes that are the exclusive-or of the previous and new four bytes, and a bunch more zeros.
Now compute the CRC-16 of block C, but without any pre- or post-processing. We will call that CRC-16'. (I would need to see the specific CRC-16 you're using to know what that processing is, if anything.) Exclusive-or the CRC-16 of block A with the CRC-16' of block C, and you now have the CRC-16 of block B.
At first glance, this may not seem like much of a gain compared to just computing the CRC of block B. However there are tricks to rapidly computing the CRC of a bunch of zeros. First off, the zeros preceding the four bytes that were changed give a CRC-16' of zero, regardless of how many zeros there are. So you just start computing the CRC-16' with the exclusive-or of the previous and new four bytes.
Now you continue to compute the CRC-16' on the remaining n zeros after the changed bytes. Normally it takes O(n) time to compute a CRC on n bytes. However if you know that they are all zeros (or all some constant value), then it can be computed in O(log n) time. You can see an example of how this is done in zlib's crc32_combine() routine, and apply that to your CRC.
Given your CRC-16/DNP parameters, the zeros() routine below will apply the requested number of zero bytes to the CRC in O(log n) time.
// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial, reflected. For speed, this requires that a not be zero.
uint16_t multmodp(uint16_t a, uint16_t b) {
    uint16_t m = (uint16_t)1 << 15;
    uint16_t p = 0;
    for (;;) {
        if (a & m) {
            p ^= b;
            if ((a & (m - 1)) == 0)
                break;
        }
        m >>= 1;
        b = b & 1 ? (b >> 1) ^ 0xa6bc : b >> 1;
    }
    return p;
}

// Table of x^2^n modulo p(x).
uint16_t const x2n_table[] = {
    0x4000, 0x2000, 0x0800, 0x0080, 0xa6bc, 0x55a7, 0xfc4f, 0x1f78,
    0xa31f, 0x78c1, 0xbe76, 0xac8f, 0xb26b, 0x3370, 0xb090
};

// Return x^(n*2^k) modulo p(x).
uint16_t x2nmodp(size_t n, unsigned k) {
    k %= 15;
    uint16_t p = (uint16_t)1 << 15;
    for (;;) {
        if (n & 1)
            p = multmodp(x2n_table[k], p);
        n >>= 1;
        if (n == 0)
            break;
        if (++k == 15)
            k = 0;
    }
    return p;
}

// Apply n zero bytes to crc.
uint16_t zeros(uint16_t crc, size_t n) {
    return multmodp(x2nmodp(n, 3), crc);
}
CRC actually makes this an easy thing to do.
When looking into this, I'm sure you've started to read that CRCs are calculated with polynomials over GF(2), and probably skipped over that part to the immediately useful information. Well, it sounds like it's probably time for you to go back over that stuff and reread it a few times so you can really understand it.
But anyway...
Because of the way CRCs are calculated, they have the property that, for two equal-length blocks A and B, CRC(A xor B) = CRC(A) xor CRC(B) (for a raw CRC, i.e. without the init/final-xor processing).
So the first simplification you can make is that you just need to calculate the CRC of the changed bits. You could actually precalculate the CRC of each bit position in the block, so that when you change a bit you can just xor its CRC into the block's CRC.
CRCs also have the property that CRC(A * B) = CRC(A * CRC(B)), where that * is polynomial multiplication over GF(2). If you stuff the block with zeros at the end, then don't do that for CRC(B).
This lets you get away with a smaller precalculated table. "Polynomial multiplication over GF(2)" is binary convolution, so multiplying by 1000 is the same as shifting left by 3 bits. With this rule, you can precalculate the CRC of the offset of each field. Then just multiply (convolve) the changed bits by the offset CRC (calculated without zero stuffing), calculate the CRC of those 8 bytes, and xor them into the block CRC.
The CRC is the remainder of dividing the long integer formed by the input stream by the short integer corresponding to the polynomial, say p.
If you change some bits in the middle, this amounts to a perturbation of the dividend by n·2^k, where n has the length of the perturbed section and k is the number of bits that follow it.
Hence, you need to compute the perturbation of the remainder, (n·2^k) mod p. You can address this using
(n·2^k) mod p = [(n mod p) · (2^k mod p)] mod p
The first factor is just the CRC16 of n. The other factor can be obtained efficiently in O(log k) operations by exponentiation based on repeated squaring.
The CRC state at any point depends on the CRC of all the data before it.
So one simple optimization is to logically split the data into N segments and store the computed CRC state at the end of each segment.
Then, if e.g. segment 6 (of 0..9) is modified, take the stored CRC state after segment 5 and continue calculating the CRC from segment 6 through segment 9.
Anyway, CRC calculations are very fast, so consider whether it is worth it.
I'm curious about SIMD and wondering if it can handle this use case.
Let's say I have an array of 2048 integers, like
[0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]
Note how they all start with either 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays:
One for all the integers starting with 0x00
One for all the integers starting with 0x01
One for all the integers starting with 0x02
One for all the integers starting with 0x03
I imagine I would have code like this:
int main() {
    uint16_t in[2048] = ...;
    // 4 arrays, one for each category
    uint16_t out[4][2048];
    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };
    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {
        (*** magic simd instructions here ***)
        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            categories[i] = nextIn[i] & 0xFF00;
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i] >> 8; // 0, 1, 2 or 3
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}
I imagine that my first inner loop doesn't need SIMD, it can be just a (x & 0xFF00FF00FF00FF00) instruction, but I wonder if we can make that second inner loop into a SIMD instruction.
Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?
The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.
If not, does anything come close?
Thanks!
You can do it with SIMD, but how fast it is will depend on exactly what instruction sets you have available, and how clever you are in your implementation.
One approach is to take the array and "sift" it to separate out elements that belong in different buckets. For example, grab 32 bytes from your array, which will hold 16 16-bit elements. Use some cmpgt instructions to get a mask which determines whether each element falls into the 00 + 01 bucket or the 02 + 03 bucket. Then use some kind of "compress" or "filter" operation to move all masked elements contiguously into one end of a register, and then do the same for the unmasked elements.
Then repeat this one more time to sort out 00 from 01 and 02 from 03.
With AVX2 you could start with this question for inspiration on the "compress" operation. With AVX512 you could use the vcompress instruction to help out: it does exactly this operation but only at 32-bit or 64-bit granularity so you'd need to do a couple at least per vector.
You could also try a vertical approach, where you load N vectors and then swap elements between them so that the 0th vector has the smallest elements, etc. At this point, you can use a more optimized algorithm for the compress stage (e.g., if you vertically sort enough vectors, the vectors at the ends may be entirely one category, such as all starting with 0x00).
Finally, you might also consider organizing your data differently, either at the source or as a pre-processing step: separating out the "category" byte which is always 0-3 from the payload byte. Many of the processing steps only need to happen on one or the other, so you can potentially increase efficiency by splitting them out that way. For example, you could do the comparison operation on 32 bytes that are all categories, and then do the compress operation on the 32 payload bytes (at least in the final step where each category is unique).
This would lead to arrays of byte elements, not 16-bit elements, where the "category" byte is implicit. You've cut your data size in half, which might speed up everything else you want to do with the data in the future.
If you can't produce the source data in this format, you could use the bucketing as an opportunity to remove the tag byte as you put the payload into the right bucket, so the output is uint8_t out[4][2048];. If you're doing a SIMD left-pack with a pshufb byte-shuffle as discussed in comments, you could choose a shuffle control vector that packs only the payload bytes into the low half.
(Until AVX512BW, x86 SIMD doesn't have any variable-control word shuffles, only byte or dword, so you already needed a byte shuffle which can just as easily separate payloads from tags at the same time as packing payload bytes to the bottom.)
How to load uint8_t *src to uint16x8_t?
For example, we can only do the following:
uint8_t *src;
--->
uint8x8_t mysrc = vld1_u8(src);
It seems that I cannot use vreinterpret_*() or (uint16x8_t)mysrc to transform mysrc to uint16x8_t. Is that right?
Load the 8 first values as 8-bit values:
uint8x8_t mysrc8x8 = vld1_u8(src);
Then use the "move long" instruction (VMOVL, via the vmovl_u8 intrinsic) to widen these values to 16 bits, zero-extending each lane (the upper 8 bits are filled with zeroes):
uint16x8_t mysrc16x8 = vmovl_u8(mysrc8x8);
Assuming that after some operations on these values, you obtain your output myoutput16x8 in an uint16x8_t format and want to convert them back to uint8x8_t, then you can use the vmovn_u16 instruction, bearing in mind that it will indeed truncate the values if they are bigger than 255:
uint8x8_t myoutput8x8 = vmovn_u16(myoutput16x8);
Hope this helps!
I am trying to accelerate my code using SSE, and the following code works well.
Basically a __m128 variable should point to 4 floats in a row, in order to do 4 operations at once.
This code is equivalent to computing c[i]=a[i]+b[i] with i from 0 to 3.
float *data1, *data2, *data3;
// ... code ... allocating data1-2-3 which are very long.
__m128* a = (__m128*) (data1);
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
However, when I want to shift a bit the data that I use (see below), in order to compute c[i]=a[i+1]+b[i] with i from 0 to 3, it crashes at execution time.
__m128* a = (__m128*) (data1+1); // <-- +1
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
My guess is that it is related to the fact that __m128 is 128 bits and my float data are 32 bits, so it may be impossible for a __m128 pointer to point to an address that is not 16-byte aligned.
Anyway, do you know what the problem is and how I could go around it?
Instead of using implicit aligned loads/stores like this:
__m128* a = (__m128*) (data1+1); // <-- +1
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
use explicit aligned/unaligned loads/stores as appropriate, e.g.:
__m128 va = _mm_loadu_ps(data1+1); // <-- +1 (NB: use unaligned load)
__m128 vb = _mm_load_ps(data2);
__m128 vc = _mm_add_ps(va, vb);
_mm_store_ps(data3, vc);
Same amount of code (i.e. same number of instructions), but it won't crash, and you have explicit control over which loads/stores are aligned and which are unaligned.
Note that recent CPUs have relatively small penalties for unaligned loads, but on older CPUs there can be a 2x or greater hit.
Your problem here is that dereferencing a __m128* performs an aligned 16-byte load (movaps), which requires the address to be a multiple of 16; data1+1 is only 4-byte aligned, so the load faults. Note also that a ends up pointing at the last 96 bits of one __m128 plus 32 bits beyond it, so when you reach the last __m128 in the memory block, those extra 32 bits fall outside your allocation.
Our OS professor mentioned that to assign a process ID to a new process, the kernel incrementally searches for the first zero bit in an array whose size equals the maximum number of processes (~32,768 by default), where an allocated process ID has a 1 stored in it.
As far as I know, there is no bit data type in C. Obviously, there's something I'm missing here.
Is there any such special construct from which we can build up a bit array? How is this done exactly?
More importantly, what are the operations that can be performed on such an array?
Bit arrays are simply byte arrays where you use bitwise operators to read the individual bits.
Suppose you have a 1-byte char variable. This contains 8 bits. You can test if the lowest bit is true by performing a bitwise AND operation with the value 1, e.g.
char a = /*something*/;
if (a & 1) {
/* lowest bit is true */
}
Notice that this is a single ampersand. It is completely different from the logical AND operator &&. This works because a & 1 will "mask out" all bits except the first, and so a & 1 will be nonzero if and only if the lowest bit of a is 1. Similarly, you can check if the second lowest bit is true by ANDing with 2, and the third with 4, and so on through the powers of two.
So a 32,768-element bit array would be represented as a 4096-element byte array, where the first byte holds bits 0-7, the second byte holds bits 8-15, etc. To perform the check, the code would select the byte from the array containing the bit that it wanted to check, and then use a bitwise operation to read the bit value from the byte.
As far as what the operations are, like any other data type, you can read values and write values. I explained how to read values above, and I'll explain how to write values below, but if you're really interested in understanding bitwise operations, read the link I provided in the first sentence.
How you write a bit depends on if you want to write a 0 or a 1. To write a 1-bit into a byte a, you perform the opposite of an AND operation: an OR operation, e.g.
char a = /*something*/;
a = a | 1; /* or a |= 1 */
After this, the lowest bit of a will be set to 1 whether it was set before or not. Again, you could write this into the second position by replacing 1 with 2, or into the third with 4, and so on for powers of two.
Finally, to write a zero bit, you AND with the inverse of the position you want to write to, e.g.
char a = /*something*/;
a = a & ~1; /* or a &= ~1 */
Now, the lowest bit of a is set to 0, regardless of its previous value. This works because ~1 will have all bits other than the lowest set to 1, and the lowest set to zero. This "masks out" the lowest bit to zero, and leaves the remaining bits of a alone.
A struct can assign members bit-sizes, but that's the extent of a "bit-type" in 'C'.
struct int_sized_struct {
    int foo:4;
    int bar:4;
    int baz:24;
};
The rest of it is done with bitwise operations. For example, searching that PID bitmap can be done with:
extern uint32_t *process_bitmap;
uint32_t *p = process_bitmap;
uint32_t bit_offset = 0;
uint32_t bit_test;
uint32_t pid;

/* Scan pid bitmap 32 entries per cycle. */
while (*p == 0xffffffff) {
    p++;
}

/* Scan the 32-bit word that has an open slot for the free PID. */
bit_test = 0x80000000;
while ((*p & bit_test) == bit_test) {
    bit_test >>= 1;
    bit_offset++;
}

pid = (p - process_bitmap)*32 + bit_offset; /* 32 bits per word */
This is roughly 32x faster than a simple for loop scanning an array with one byte per PID. (Actually more than 32x, since more of the bitmap will stay in CPU cache.)
see http://graphics.stanford.edu/~seander/bithacks.html
There is no bit type in C, but bit manipulation is fairly straightforward. Some processors have bit-specific instructions which the code below would optimize nicely for, but even without those it should be pretty fast. It may or may not be faster to use an array of 32-bit words instead of bytes; inlining instead of functions would also help performance.
If you have the memory to burn, you can just use a whole byte to store one bit (or a whole 32-bit number, etc.), greatly improving performance at the cost of the memory used.
unsigned char data[SIZE];

unsigned char get_bit ( unsigned int offset )
{
    //TODO: limit check offset
    if(data[offset>>3]&(1<<(offset&7))) return(1);
    else return(0);
}

void set_bit ( unsigned int offset, unsigned char bit )
{
    //TODO: limit check offset
    if(bit) data[offset>>3]|=1<<(offset&7);
    else data[offset>>3]&=~(1<<(offset&7));
}