Fast method to copy memory with translation - ARGB to BGR - c

Overview
I have an image buffer that I need to convert to another format. The original image buffer is four channels, 8 bits per channel: Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel: Blue, Green, and Red.
So the brute force method is:
// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB orig[IMAGESIZE];
BGR dest[IMAGESIZE];
int x;
for(x = 0; x < IMAGESIZE; x++)
{
    dest[x].Red = orig[x].Red;
    dest[x].Green = orig[x].Green;
    dest[x].Blue = orig[x].Blue;
}
However, I need more speed than is provided by a loop and three byte copies. I'm hoping there might be a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32 bit machine.
Additional info
Every image is a multiple of at least 4 pixels in size, so we could read 16 ARGB bytes and move them into 12 BGR bytes per loop iteration. Perhaps this fact can be used to speed things up, especially as it falls nicely onto 32-bit boundaries.
I have access to OpenCL - and while that requires moving the entire buffer into the GPU memory, then moving the result back out, the fact that OpenCL can work on many portions of the image simultaneously, and the fact that large memory block moves are actually quite efficient may make this a worthwhile exploration.
While I've given the example of small buffers above, I really am moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around, so while a 32x32 situation may be trivial, copying 8.3MB of image data byte by byte is really, really bad.
Running on Intel processors (Core 2 and above) and thus there are streaming and data processing commands I'm aware exist, but don't know about - perhaps pointers on where to look for specialized data handling instructions would be good.
This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.
Pseudo-code is fine - I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages.
Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on memory pointed-to by pointer inputs in registers. Apparently this worked ok for the benchmark. I fixed the constraints to be properly safe for all callers. This should not affect benchmark numbers, only make sure the surrounding code is safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-byte-at-a-time scalar, but the biggest benefits are when data is hot in cache (work in smaller blocks, or on smaller total sizes).
In 2020, your best bet is to use the portable _mm_loadu_si128 intrinsics version that will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.
Also note that all of these over-write 1 (scalar) or 4 (SIMD) bytes past the end of the output, so do the last 3 bytes separately if that's a problem.
--- #PeterCordes
The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).
void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        // warning: strict-aliasing UB. Use memcpy for unaligned loads/stores
    }
}
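(Editor's note: for reference, a strict-aliasing-safe variant of the same idea would do the unaligned load/store through memcpy, which GCC and Clang compile to the same movl/bswapl pair. This is only a sketch, not part of the benchmark; swap1_safe is a hypothetical name, and OSSwapInt32 comes from libkern/OSByteOrder.h on OS X.)
#include <stdint.h>
#include <string.h>
#include <libkern/OSByteOrder.h> // OSSwapInt32 on OS X
void swap1_safe(const ARGB *orig, BGR *dest, unsigned imageSize) {
    const uint8_t *in = (const uint8_t *)orig;
    uint8_t *out = (uint8_t *)dest;
    for (unsigned x = 0; x < imageSize; x++) {
        uint32_t pixel;
        memcpy(&pixel, in + x * 4, 4);   // unaligned load without aliasing UB
        pixel = OSSwapInt32(pixel);      // A R G B -> B G R A in memory order
        memcpy(out + x * 3, &pixel, 4);  // still writes 1 byte past the last pixel
    }
}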
The second method performs the same operation, but uses an inline assembly loop instead of a C loop.
void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
"0:\n\t" // do {
"movl (%1),%%eax\n\t"
"bswapl %%eax\n\t"
"movl %%eax,(%0)\n\t" // copy a dword byte-reversed
"add $4,%1\n\t" // orig += 4 bytes
"add $3,%0\n\t" // dest += 3 bytes
"dec %2\n\t"
"jnz 0b" // }while(--imageSize)
: "+r" (dest), "+r" (orig), "+r" (imageSize)
: // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
: "flags", "eax", "memory"
);
}
The third version is a modified version of just a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only P4 ever benefited from lddqu; on later CPUs it decodes the same as movdqu, so either is fine and there's no downside.)
typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0xFF};
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}
Finally, the fourth version is the inline assembly equivalent of the third.
void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
asm volatile (
"lddqu %3,%%xmm1\n\t"
"0:\n\t"
"lddqu (%1),%%xmm0\n\t"
"pshufb %%xmm1,%%xmm0\n\t"
"movdqu %%xmm0,(%0)\n\t"
"add $16,%1\n\t"
"add $12,%0\n\t"
"sub $4,%2\n\t"
"jnz 0b"
: "+r" (dest), "+r" (orig), "+r" (imagesize)
: "m" (mask) // whole array as a memory operand. "x" would get the compiler to load it
: "flags", "xmm0", "xmm1", "memory"
);
}
(These all compile fine with GCC9.3, but clang10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8.)
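(Editor's note: and here is roughly what that portable-intrinsics version looks like; a sketch in the style of swap3, using _mm_loadu_si128 / _mm_shuffle_epi8 / _mm_storeu_si128 so it should build with any compiler that ships the Intel intrinsic headers. swap3_intrin is a hypothetical name, and it still stores 4 bytes past the end of dest like the other SIMD versions.)
#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h> // SSSE3
void swap3_intrin(const uint8_t *orig, uint8_t *dest, size_t imagesize) {
    // same shuffle control as swap3/swap2_2: B,G,R of each pixel reversed, Alpha dropped
    const __m128i mask = _mm_setr_epi8(3, 2, 1, 7, 6, 5, 11, 10, 9, 15, 14, 13, -1, -1, -1, -1);
    const uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __m128i px = _mm_loadu_si128((const __m128i *)orig);
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(px, mask));
    }
}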
On my 2010 MacBook Pro, 2.4 GHz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:
Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3: 9.3163 milliseconds
Version 4: 9.3584 milliseconds
As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel macs, which didn't support SSSE3.
Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.
Average | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1: 11.13120 ms | 0.81076 ms (7.3%)
Version 2: 11.27092 ms | 0.66209 ms (5.9%)
Version 3: 9.29184 ms | 0.27851 ms (3.0%)
Version 4: 9.40948 ms | 0.32702 ms (3.5%)
Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.
Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314

The obvious, using pshufb.
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}
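If you can't give dest those 4 trailing scratch bytes, one option (a sketch only; convert_tail_safe is a hypothetical wrapper name) is to stop the SIMD loop one group early and finish the last 4 pixels with scalar code:
// assumes imagesize is a multiple of 4 and >= 4
void convert_tail_safe(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    size_t body = imagesize - 4;      // pixels handled by the SIMD loop above
    convert(orig, body, dest);
    orig += body * 4;
    dest += body * 3;
    for (size_t i = 0; i < 4; i++, orig += 4, dest += 3) {
        dest[0] = orig[3];  // Blue
        dest[1] = orig[2];  // Green
        dest[2] = orig[1];  // Red
    }
}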

Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.
EDIT2: More detail about the underlying code structure:
With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3-byte pixels only come back to a 16-byte boundary every 16 pixels, we batch up 16 pixels at a time, using a combination of shuffles, masks, and ors on 16 input pixels.
From LSB to MSB, the inputs look like this, ignoring the specific components:
s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333
and the outputs look like this:
d[0]: 000 000 000 000 111 1
d[1]: 11 111 111 222 222 22
d[2]: 2 222 333 333 333 333
So to generate those outputs, you need to do the following (I will specify the actual transformations later):
d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))
Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:
combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))
where (1 means select the left pixel, 0 means select the right pixel):
mask(0)= 111 111 111 111 000 0
mask(1)= 11 111 111 000 000 00
mask(2)= 1 111 000 000 000 000
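In SSE terms, that combine step is just an AND, an ANDNOT and an OR (a sketch; mask is a __m128i whose bytes are all-ones where the left input should win):
#include <emmintrin.h>
// combine_x(left, right) = (left & mask) | (right & ~mask)
static inline __m128i combine(__m128i left, __m128i right, __m128i mask) {
    return _mm_or_si128(_mm_and_si128(left, mask),
                        _mm_andnot_si128(mask, right)); // andnot computes (~mask) & right
}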
But the actual transformations (f_<x>_low, f_<x>_high) are not that simple. Since we are reversing and removing bytes from the source pixels, the actual transformation is (for the first destination, for brevity):
d[0]=
s[0][0].Blue s[0][0].Green s[0][0].Red
s[0][1].Blue s[0][1].Green s[0][1].Red
s[0][2].Blue s[0][2].Green s[0][2].Red
s[0][3].Blue s[0][3].Green s[0][3].Red
s[1][0].Blue s[1][0].Green s[1][0].Red
s[1][1].Blue
If you translate the above into byte offsets from source to dest, you get:
d[0]=
&s[0]+3 &s[0]+2 &s[0]+1
&s[0]+7 &s[0]+6 &s[0]+5
&s[0]+11 &s[0]+10 &s[0]+9
&s[0]+15 &s[0]+14 &s[0]+13
&s[1]+3 &s[1]+2 &s[1]+1
&s[1]+7
(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)
Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):
f_0_low= 3 2 1 7 6 5 11 10 9 15 14 13 X X X X
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
f_1_high= X X X X X X X X 3 2 1 7 6 5 11 10
f_2_low= 9 15 14 13 X X X X X X X X X X X X
f_2_high= X X X X 3 2 1 7 6 5 11 10 9 15 14 13
We can further optimize this by looking at the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_<x>, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.
EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't want to needlessly pollute the cache. Plus it has to be aligned anyway, so you get the perf for free!
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig and dest are 16-byte aligned
// imagesize is a multiple of 16 pixels
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert((uintptr_t)dest % 16 == 0);
    assert(imagesize % 16 == 0);
    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first output vector
    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first output vector
        -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to the second output vector
    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to the second output vector
        -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to the third output vector
    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to the third output vector
        -128, -128, -128, -128); // unused
    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0, 0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0, 0, 0, -1);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
        _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
        _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
        _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
    }
}

I am coming a little late to the party, seeing that the community has already decided for poseur's pshufb answer, but distributing 2000 reputation is so extremely generous that I have to give it a try.
Here's my version without platform-specific intrinsics or machine-specific asm. I have included some cross-platform timing code showing a 4x speedup if you do both the bit-twiddling like me AND activate compiler optimization (register optimization, loop unrolling):
#include "stdlib.h"
#include "stdio.h"
#include "time.h"
#define UInt8 unsigned char
#define IMAGESIZE (1920*1080)
int main() {
time_t t0, t1;
int frames;
int frame;
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
if(!orig) {printf("nomem1"); return 1;}
BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
if(!dest) {printf("nomem2"); return 1;}
printf("to start original hit a key\n");
getch();
t0 = time(0);
frames = 1200;
for(frame = 0; frame<frames; frame++) {
int x; for(x = 0; x < IMAGESIZE; x++) {
dest[x].Red = orig[x].Red;
dest[x].Green = orig[x].Green;
dest[x].Blue = orig[x].Blue;
}
}
t1 = time(0);
printf("finished original of %u frames in %u seconds\n", frames, t1-t0);
// on my core 2 subnotebook the original took 16 sec
// (8 sec with compiler optimization -O3) so at 60 FPS
// (instead of the 1200) this would be faster than realtime
// (if you disregard any other rendering you have to do).
// However if you either want to do other/more processing
// OR want faster than realtime processing for e.g. a video-conversion
// program then this would have to be a lot faster still.
printf("to start alternative hit a key\n");
getch();
t0 = time(0);
frames = 1200;
unsigned int* reader;
unsigned int* writer;
unsigned int* end = (unsigned int*)orig + IMAGESIZE; // one 32-bit read per ARGB pixel
unsigned int cur; // your question guarantees 32 bit cpu
unsigned int next;
unsigned int temp;
for(frame = 0; frame<frames; frame++) {
reader = (void*)orig;
writer = (void*)dest;
while(reader < end) {
// each pass reads 4 ARGB pixels (4 words) and writes 3 words (12 BGR bytes).
// On a little-endian 32-bit load of the ARGB struct, Alpha sits in bits 0-7,
// Red in bits 8-15, Green in bits 16-23 and Blue in bits 24-31.
cur = *reader;
reader++;
next = *reader;
temp = (cur>>24) | ((cur&0xFF0000)>>8) | ((cur&0xFF00)<<8) | (next&0xFF000000); // B0 G0 R0 B1
*writer = temp;
reader++;
writer++;
cur = next;
next = *reader;
temp = ((cur&0xFF0000)>>16) | (cur&0xFF00) | ((next&0xFF000000)>>8) | ((next&0xFF0000)<<8); // G1 R1 B2 G2
*writer = temp;
reader++;
writer++;
cur = next;
next = *reader;
temp = ((cur&0xFF00)>>8) | ((next&0xFF000000)>>16) | (next&0xFF0000) | ((next&0xFF00)<<16); // R2 B3 G3 R3
*writer = temp;
reader++;
writer++;
}
}
t1 = time(0);
printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);
// on my core 2 subnotebook this alternative took 10 sec
// (4 sec with compiler optimization -O3)
}
The results are these (on my core 2 subnotebook):
F:\>gcc b.c -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds
F:\>gcc b.c -O3 -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds

You want to use Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It also works in JavaScript. This post, however, is a bit funny to read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff's device with 512 Kbytes of moves.
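For illustration, a Duff's-device-style unrolling of the byte copy could look like the sketch below (hypothetical code, not benchmarked; with -O3 a plain loop usually unrolls just as well):
// copies 'count' pixels, 4 per pass; assumes count > 0 and the ARGB/BGR structs from the question
void bgr_from_argb_duff(const ARGB *src, BGR *dst, unsigned count) {
    unsigned n = (count + 3) / 4;
    switch (count % 4) {
    case 0: do { dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
    case 3:      dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
    case 2:      dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
    case 1:      dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
            } while (--n > 0);
    }
}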

In combination with one of the fast conversion functions here, given access to Core 2s it might be wise to split the translation into threads, each working on, say, a quarter of the data, as in this pseudocode:
void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
thread threads[] = {
create_thread(bgrFromArgb, dest, src, n/4),
create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
}
join_threads(threads);
}
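A minimal sketch of that idea with POSIX threads (bgrFromArgb stands for whichever single-threaded conversion function you pick; assumes n is a multiple of 4):
#include <pthread.h>
void bgrFromArgb(BGR *dest, const ARGB *src, int n); // any of the conversion loops above
typedef struct { BGR *dest; const ARGB *src; int n; } job_t;
static void *worker(void *arg) {
    job_t *j = (job_t *)arg;
    bgrFromArgb(j->dest, j->src, j->n);
    return NULL;
}
void bulk_bgrFromArgb(BGR *dest, const ARGB *src, int n) {
    pthread_t threads[4];
    job_t jobs[4];
    int chunk = n / 4;
    for (int i = 0; i < 4; i++) {
        jobs[i] = (job_t){ dest + i * chunk, src + i * chunk, chunk };
        pthread_create(&threads[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
}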

This assembly function should do the job; however, I don't know if you want to keep the old data or not - this function overwrites it in place.
The code is for MinGW GCC with Intel assembly flavour; you will have to modify it to suit your compiler/assembler.
extern "C" {
int convertARGBtoBGR(uint buffer, uint size);
__asm(
".globl _convertARGBtoBGR\n"
"_convertARGBtoBGR:\n"
" push ebp\n"
" mov ebp, esp\n"
" sub esp, 4\n"
" mov esi, [ebp + 8]\n"
" mov edi, esi\n"
" mov ecx, [ebp + 12]\n"
" cld\n"
" convertARGBtoBGR_loop:\n"
" lodsd ; load value from [esi] (4byte) to eax, increment esi by 4\n"
" bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
" stosd ; store 4 bytes to [edi], increment edi by 4\n"
" sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
" loop convertARGBtoBGR_loop\n"
" leave\n"
" ret\n"
);
}
You should call it like so:
convertARGBtoBGR( &buffer, IMAGESIZE );
This function accesses memory only twice per pixel/packet (1 read, 1 write) compared to your brute force method that had (at least / assuming it was compiled to registers) 3 read and 3 write operations. The method is the same but the implementation makes it more efficient.

You can do it in chunks of 4 pixels, moving 32 bits at a time with unsigned long pointers. Just think that with 4 32-bit pixels you can construct, by shifting and AND/OR, 3 words representing 4 24-bit pixels, like this:
//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR 32 bits writing (4 pixels)
Shift operations are always done in 1 instruction cycle in all modern 32/64-bit processors (barrel shifter), so it's the fastest way of constructing those 3 words for writing; bitwise AND and OR are also blazing fast.
Like this:
//assuming we have 4 ARGB1 ... ARGB4 pixels and 3 32-bit words, W1, W2 and W3, to write
// and dest is an unsigned long pointer to the destination
W1 = ((ARGB1 & 0xFF) << 24) | ((ARGB1 & 0xFF00) << 8) | ((ARGB1 & 0xFF0000) >> 8) | (ARGB2 & 0xFF);
*dest++ = W1;
and so on.... with next pixels in a loop.
You'll need some adjusting for images that are not a multiple of 4 pixels, but I bet this is the fastest approach of all without using assembler.
And by the way, forget about using structs and indexed access; those are the slowest ways of all to move data - just take a look at a disassembly listing of a compiled C++ program and you'll agree with me.

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
Aside from assembly or compiler intrinsics, I might try doing the following, while very carefully verifying the end behavior, as some of it (where unions are concerned) is likely to be compiler implementation dependent:
union uARGB
{
ARGB argb;
UInt32 x;
};
union uBGRA
{
struct
{
BGR bgr;
UInt8 Alpha;
} bgra;
UInt32 x;
};
and then for your code kernel, with whatever loop unrolling is appropriate:
inline void argb2bgr(BGR* pbgr, ARGB* pargb)
{
    uARGB* puargb = (uARGB*)pargb;
    uBGRA ubgra;
    ubgra.x = __byte_reverse_32(puargb->x);
    *pbgr = ubgra.bgra.bgr;
}
where __byte_reverse_32() assumes the existence of a compiler intrinsic that reverses the bytes of a 32-bit word.
To summarize the underlying approach:
view ARGB structure as a 32-bit integer
reverse the 32-bit integer
view the reversed 32-bit integer as a (BGR)A structure
let the compiler copy the (BGR) portion of the (BGR)A structure
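On this particular setup __byte_reverse_32() doesn't have to be hypothetical: GCC and Clang provide __builtin_bswap32, and OS X has OSSwapInt32. A sketch of the same kernel using memcpy instead of unions (sidestepping the implementation-defined punning; argb2bgr_bswap is a hypothetical name) could be:
#include <stdint.h>
#include <string.h>
static inline void argb2bgr_bswap(BGR* pbgr, const ARGB* pargb) {
    uint32_t x;
    memcpy(&x, pargb, 4);          // view the ARGB struct as a 32-bit integer
    x = __builtin_bswap32(x);      // A R G B -> B G R A in memory order (little-endian)
    memcpy(pbgr, &x, 3);           // copy only the B, G, R bytes
}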

Although you can use some tricks based on CPU features, this kind of operation can be done faster on a GPU.
It seems that you use C/C++... so your alternatives for GPU programming may be (on the Windows platform):
DirectCompute (DirectX 11) - see this video
Microsoft Research Project Accelerator - check this link
Cuda
"google" GPU programming ...
In short, use the GPU for this kind of array operation to make calculations faster; they are designed for it.

I haven't seen anyone showing an example of how to do it on the GPU.
A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.
I started out with the smooth_opengl3.c example from the freeglut distribution.
The data is copied as YUV into the texture and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all macs nowadays and it will be significantly faster than all the CPU approaches.
Note that I have no experience on how you get the data back. In theory glReadPixels should read the data back but I never measured its performance.
OpenCL might be the easier approach, but then I will only start developing for that when I have a notebook that supports it.
(defparameter *vertex-shader*
"void main(){
gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
gl_FrontColor = gl_Color;
gl_TexCoord[0] = gl_MultiTexCoord0;
}
")
(progn
(defparameter *fragment-shader*
"uniform sampler2D textureImage;
void main()
{
vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
float v=q.z;
if(int(gl_FragCoord.x)%2 == 0)
v=q.x;
float x=0; // 1./255.;
v-=.278431;
v*=1.7;
if(v>=(1.0-x))
gl_FragColor = vec4(255,0,0,255);
else if (v<=x)
gl_FragColor = vec4(0,0,255,255);
else
gl_FragColor = vec4(v,v,v,255);
}
")

Related

_mm_stream_si128 is 2000% slower than _mm_store_si128 [duplicate]

I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
__m128i in = _mm_loadu_si128(inptr);
__m128i out = in; // real code does more than this, but I've simplified it
_mm_stream_si128(outptr,out);
inptr += 12;
outptr += 16;
}
This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.
We upgraded to the latest microcode on the Skylake,
and also stuck in vzeroupper commands to avoid any AVX issues. Neither fix had any effect.
outptr is aligned to 16 bytes, so the stream command should be writing to aligned addresses. (I put in checks to verify this statement.) inptr is not aligned by design. Commenting out the loads has no effect; the limiting commands are the stores. outptr and inptr point to different memory regions; there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from @Mystical made me check that the outputs were all cache-aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).
So changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third edit: purposeful misalignment
I tried the test suggested by @HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 or 32 bytes inside the allocator such that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, and ran the test 3 times and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):
template <class T >
struct Mallocator {
typedef T value_type;
Mallocator() = default;
template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept
{}
T* allocate(std::size_t n) {
if(n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
if(! p1) throw std::bad_alloc();
p1 += 32; // misalign on purpose
return reinterpret_cast<T*>(p1);
}
void deallocate(T* p, std::size_t) noexcept {
uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
p1 -= 32;
std::free(p1); }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
// std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
__m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
__m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
__m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
__m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));
__m128i tileVal0 = value0;
__m128i tileVal1 = value1;
__m128i tileVal2 = value2;
__m128i tileVal3 = value3;
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);
ptr += diffBytes * 4;
count -= diffBytes * 4;
tile += diffPixels * 4;
ipixel += diffPixels * 4;
if (ipixel == 32)
{
// go to next tile
ipixel = 0;
tileIter++;
tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
}
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
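To illustrate the full-line point: for NT stores to work well, four consecutive 16-byte stores should cover one 64-byte-aligned line so the write-combining buffer can flush a complete line at once (a sketch; stream_line is a hypothetical helper and dst is assumed 64-byte aligned):
#include <emmintrin.h>
#include <stdint.h>
static inline void stream_line(uint8_t *dst, __m128i v0, __m128i v1, __m128i v2, __m128i v3) {
    _mm_stream_si128((__m128i *)dst + 0, v0); // dst      .. dst+15
    _mm_stream_si128((__m128i *)dst + 1, v1); // dst+16 .. dst+31
    _mm_stream_si128((__m128i *)dst + 2, v2); // dst+32 .. dst+47
    _mm_stream_si128((__m128i *)dst + 3, v3); // dst+48 .. dst+63: line complete
}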

Extract 10bits words from bitstream

I need to extract all 10-bit words from a raw bitstream which is built as ABACABACABAC...
It already works with a naive C implementation like
for(uint8_t *ptr = in_packet; ptr < max; ptr += 5){
const uint64_t val =
(((uint64_t)(*(ptr + 4))) << 32) |
(((uint64_t)(*(ptr + 3))) << 24) |
(((uint64_t)(*(ptr + 2))) << 16) |
(((uint64_t)(*(ptr + 1))) << 8) |
(((uint64_t)(*(ptr + 0))) << 0) ;
*a_ptr++ = (val >> 0);
*b_ptr++ = (val >> 10);
*a_ptr++ = (val >> 20);
*c_ptr++ = (val >> 30);
}
But performance is inadequate for my application so I would like to improve this using some AVX2 optimisations.
I visited the website https://software.intel.com/sites/landingpage/IntrinsicsGuide/# to find any functions that can help, but it seems there is nothing that works with 10-bit words, only 8- or 16-bit. That seems logical since 10 bits is not native for a processor, but it makes things hard for me.
Is there any way to use AVX2 to solve this problem?
Your scalar loop does not compile efficiently. Compilers do it as 5 separate byte loads. You can express an unaligned 8-byte load in C++ with memcpy:
#include <stdint.h>
#include <string.h>
// do an 8-byte load that spans the 5 bytes we want
// clang auto-vectorizes using an AVX2 gather for 4 qwords. Looks pretty clunky but not terrible
void extract_10bit_fields_v2calar(const uint8_t *__restrict src,
uint16_t *__restrict a_ptr, uint16_t *__restrict b_ptr, uint16_t *__restrict c_ptr,
const uint8_t *max)
{
for(const uint8_t *ptr = src; ptr < max; ptr += 5){
uint64_t val;
memcpy(&val, ptr, sizeof(val));
const unsigned mask = (1U<<10) - 1; // unused in original source!?!
*a_ptr++ = (val >> 0) & mask;
*b_ptr++ = (val >> 10) & mask;
*a_ptr++ = (val >> 20) & mask;
*c_ptr++ = (val >> 30) & mask;
}
}
ICC and clang auto-vectorize your 1-byte version, but do a very bad job (lots of insert/extract of single bytes). Here's your original and this function on Godbolt (with gcc and clang -O3 -march=skylake)
None of those 3 compilers are really close to what we can do manually.
Manual vectorization
My current AVX2 version of this answer forgot a detail: there are only 3 kinds of fields ABAC, not ABCD like 10-bit RGBA pixels. So I have a version of this which unpacks to 4 separate output streams (which I'll leave in because of the packed-RGBA use-case if I ever add a dedicated version for the ABAC interleave).
The existing version could use vpunpcklwd to interleave the two A parts instead of storing them with separate vmovq; that should work for your case. There might be something more efficient, IDK.
BTW, I find it easier to remember and type instruction mnemonics, not intrinsic names. Intel's online intrinsics guide is searchable by instruction mnemonic.
Observations about your layout:
Each field spans one byte boundary, never two, so it's possible to assemble any 4 pairs of bytes in a qword that hold 4 complete fields.
Or with a byte shuffle, to create 2-byte words that each have a whole field at some offset. (e.g. for AVX512BW vpsrlvw, or for AVX2 2x vpsrld + word-blend.) A word shuffle like AVX512 vpermw would not be sufficient: some individual bytes need to be duplicated with the start of one field and end of another. I.e the source positions aren't all aligned words, especially when you have 2x 5 bytes inside the same 16-byte "lane" of a vector.
00-07|08-15|16-23|24-31|32-39 byte boundaries (8-bit)
00...09|10..19|20...29|30..39 field boundaries (10-bit)
Luckily 8 and 10 have a GCD of 2 which is >= 10-8=2. 8*5 = 4*10 so we don't get all possible start positions, e.g. never a field starting at the last bit of 1 byte, spanning another byte, and including the first bit of a 3rd byte.
Possible AVX2 strategy: unaligned 32-byte load that leave 2x 5 bytes at the top of the low lane, and 2x 5 bytes at the bottom of the high lane. Then vpshufb in-lane shuffle to set up for 2x vpsrlvd variable-count shifts, and a blend.
Quick summary of a new idea I haven't expanded yet.
Given an input of xxx a0B0A0C0 a1B1A1C1 | a2B2A2C2 a3B3A3C3 from our unaligned load, we can get a result of
a0 A0 a1 A1 B0 B1 C0 C1 | a2 A2 a3 A3 B2 B3 C2 C3 with the right choice of vpshufb control.
Then a vpermd can put all of those 32-bit groups into the right order, with all the A elements in the high half (ready for a vextracti128 to memory), and the B and C in the low half (ready for vmovq / vmovhps stores).
Use different vpermd shuffles for adjacent pairs so we can vpblendd to merge them for 128-bit B and C stores.
Old version, probably worse than unaligned load + vpshufb.
With AVX2, one option is to broadcast the containing 64-bit element to all positions in a vector and then use variable-count right shifts to get the bits to the bottom of a dword element.
You probably want to do a separate 64-bit broadcast-load for each group (thus partially overlapping with the previous), instead of trying to pick apart a __m256i of contiguous bits. (Broadcast-loads are cheap, shuffling is expensive.)
After _mm256_srlv_epi64, then AND to isolate the low 10 bits in each qword.
Repeat that 4 times for 4 vectors of input, then use _mm256_packus_epi32 to do in-lane packing down to 32-bit then 16-bit elements.
That's the simple version. Optimizations of the interleaving are possible, e.g. by using left or right shifts to set up for vpblendd instead of a 2-input shuffle like vpackusdw or vshufps. _mm256_blend_epi32 is very efficient on existing CPUs, running on any port.
This also allows delaying the AND until after the first packing step because we don't need to avoid saturation from high garbage.
Design notes:
shown as 32-bit chunks after variable-count shifts
[0 d0 0 c0 | 0 b0 0 a0] # after an AND mask
[0 d1 0 c1 | 0 b1 0 a1]
[0 d1 0 c1 0 d0 0 c0 | 0 b1 0 a1 0 b0 0 a0] # vpackusdw
shown as 16-bit elements but actually the same as what vshufps can do
---------
[X d0 X c0 | X b0 X a0] even the top element is only garbage right shifted by 30, not quite zero
[X d1 X c1 | X b1 X a1]
[d1 c1 d0 c0 | b1 a1 b0 a0 ] vshufps (can't do d1 d0 c1 c0 unfortunately)
---------
[X d0 X c0 | X b0 X a0] variable-count >> qword
[d1 X c1 X | b1 X a1 0] variable-count << qword
[d1 d0 c1 c0 | b1 b0 a1 a0] vpblendd
This last trick extends to vpblendw, allowing us to do everything with interleaving blends, no shuffle instructions at all, resulting in the outputs we want contiguous and in the right order in qwords of a __m256i.
x86 SIMD variable-count shifts can only be left or right for all elements, so we need to make sure that all the data is either left or right of the desired position, not some of each within the same vector. We could use an immediate-count shift to set up for this, but even better is to just adjust the byte-address we load from. For loads after the first, we know it's safe to load some of the bytes before the first bitfield we want (without touching an unmapped page).
# as 16-bit elements
[X X X d0 X X X c0 | ...] variable-count >> qword
[X X d1 X X X c1 X | ...] variable-count >> qword from an offset load that started with the 5 bytes we want all to the left of these positions
[X d2 X X X c2 X X | ...] variable-count << qword
[d3 X X X c3 X X X | ...] variable-count << qword
[X d2 X d0 X c2 X c0 | ...] vpblendd
[d3 X d1 X c3 X c1 X | ...] vpblendd
[d3 d2 d1 d0 c3 c2 c1 c0 | ...] vpblendw (Same behaviour in both high and low lane)
Then mask off the high garbage inside each 16-bit word
Note: this does 4 separate outputs, like ABCD or RGBA->planar, not ABAC.
// potentially unaligned 64-bit broadcast-load, hopefully vpbroadcastq. (clang: yes, gcc: no)
// defeats gcc/clang folding it into an AVX512 broadcast memory source
// but vpsllvq's ymm/mem operand is the shift count, not data
static inline
__m256i bcast_load64(const uint8_t *p) {
// hopefully safe with strict-aliasing since the deref is inside an intrinsic?
__m256i bcast = _mm256_castpd_si256( _mm256_broadcast_sd( (const double*)p ) );
return bcast;
}
// UNTESTED
// unpack 10-bit fields from 4x 40-bit chunks into 16-bit dst arrays
// overreads past the end of the last chunk by 1 byte
// for ABCD repeating, not ABAC, e.g. packed 10-bit RGBA
void extract_10bit_fields_4output(const uint8_t *__restrict src,
uint16_t *__restrict da, uint16_t *__restrict db, uint16_t *__restrict dc, uint16_t *__restrict dd,
const uint8_t *max)
{
// FIXME: cleanup loop for non-whole-vectors at the end
while( src<max ){
__m256i bcast = bcast_load64(src); // data we want is from bits [0 to 39], last starting at 30
__m256i ext0 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place at bottom of each qword
bcast = bcast_load64(src+5-2); // data we want is from bits [16 to 55], last starting at 30+16 = 46
__m256i ext1 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place it at bit 16 in each qword element
bcast = bcast_load64(src+10); // data we want is from bits [0 to 39]
__m256i ext2 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 32 in each qword element
bcast = bcast_load64(src+15-2); // data we want is from bits [16 to 55], last field starting at 46
__m256i ext3 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 48 in each qword element
__m256i blend20 = _mm256_blend_epi32(ext0, ext2, 0b10101010); // X d2 X d0 X c2 X c0 | X b2 ...
__m256i blend31 = _mm256_blend_epi32(ext1, ext3, 0b10101010); // d3 X d1 X c3 X c1 X | b3 X ...
__m256i blend3210 = _mm256_blend_epi16(blend20, blend31, 0b10101010); // d3 d2 d1 d0 c3 c2 c1 c0
__m256i res = _mm256_and_si256(blend3210, _mm256_set1_epi16((1U<<10) - 1) );
__m128i lo = _mm256_castsi256_si128(res);
__m128i hi = _mm256_extracti128_si256(res, 1);
_mm_storel_epi64((__m128i*)da, lo); // movq store of the lowest 64 bits
_mm_storeh_pi((__m64*)db, _mm_castsi128_ps(lo)); // movhps store of the high half of the low 128. Efficient: no shuffle uop needed on Intel CPUs
_mm_storel_epi64((__m128i*)dc, hi);
_mm_storeh_pi((__m64*)dd, _mm_castsi128_ps(hi)); // clang pessimizes this to vpextrq :(
da += 4;
db += 4;
dc += 4;
dd += 4;
src += 4*5;
}
}
This compiles (Godbolt) to about 21 front-end uops (on Skylake) in the loop, per 4 groups of 4 fields. (Including a useless register copy for _mm256_castsi256_si128 instead of just using the low half of ymm0 = xmm0.) This will be very good on Skylake. There's a good balance of uops for different ports, and variable-count shift is 1 uop for either p0 or p1 on SKL (vs. more expensive previously). The bottleneck might be just the front-end limit of 4 fused-domain uops per clock.
Replays of cache-line-split loads will happen because the unaligned loads will sometimes cross a 64-byte cache-line boundary. But that's just in the back-end, and we have a few spare cycles on ports 2 and 3 because of the front-end bottleneck (4 loads and 4 stores per set of results, with indexed stores which thus can't use port 7). If dependent ALU uops have to get replayed as well, we might start seeing back-end bottlenecks.
Despite the indexed addressing modes, there won't be unlamination because Haswell and later can keep indexed stores micro-fused, and the broadcast loads are a single pure uop anyway, not micro-fused ALU+load.
On Skylake, it can maybe come close to 4x 40-bit groups per 5 clock cycles, if memory bandwidth isn't a bottleneck. (e.g. with good cache blocking.) Once you factor in overhead and cost of cache-line-split loads causing occasional stalls, maybe 1.5 cycles per 40 bits of input, i.e. 6 cycles per 20 bytes of input on Skylake.
On other CPUs (Haswell and Ryzen), the variable-count shifts will be a bottleneck, but you can't really do anything about that. I don't think there's anything better. On HSW it's 3 uops: p5 + 2p0. On Ryzen it's only 1 uop, but it only has 1 per 2 clock throughput (for the 128-bit version), or per 4 clocks for the 256-bit version which costs 2 uops.
Beware that clang pessimizes the _mm_storeh_pi store into vpextrq [mem], xmm, 1: 2 uops, shuffle + store. (Instead of vmovhps: pure store on Intel, no ALU). GCC compiles it as written.
I used _mm256_broadcast_sd even though I really want vpbroadcastq just because there's an intrinsic that takes a pointer operand instead of __m256i (because with AVX1, only the memory-source version existed. But with AVX2, register-source versions of all the broadcast instructions exist). To use _mm256_set1_epi64, I'd have to write pure C that didn't violate strict aliasing (e.g. with memcpy) to do an unaligned uint64_t load. I don't think it will hurt performance to use an FP broadcast load on current CPUs, though.
I'm hoping _mm256_broadcast_sd allows its source operand to alias anything without C++ strict-aliasing undefined behaviour, the same way _mm256_loadu_ps does. Either way it will work in practice if it doesn't inline into a function that stores into *src, and maybe even then. So maybe a memcpy unaligned load would have made more sense!
I've had bad results in the past with getting compilers to emit pmovzxdw xmm0, [mem] from code like _mm_cvtepu16_epi32( _mm_loadu_si64(ptr) ); you often get an actual movq load + reg-reg pmovzx. That's why I didn't try that _mm256_broadcastq_epi64(__m128i).
Old idea; if we already need a byte shuffle we might as well use plain word shifts instead of vpmultishift.
With AVX512VBMI (IceLake, CannonLake), you might want vpmultishiftqb. Instead of broadcasting / shifting one group at a time, we can do all the work for a whole vector of groups after putting the right bytes in the right places first.
You'd still need/want a version for CPUs with some AVX512 but not AVX512VBMI (e.g. Skylake-avx512). Probably vpermd + vpshufb can get the bytes we need into the 128-bit lanes we want.
I don't think we can get away with using only dword-granularity shifts to allow merge-masking instead of dword blend after qword shift. We might be able to merge-mask a vpblendw though, saving a vpblendd
IceLake has 1/clock vpermw and vpermb, single-uop. (It has a 2nd shuffle unit on another port that handles some shuffle uops). So we can load a full vector that contains 4 or 8 groups of 4 elements and shuffle every byte into place efficiently. I think every CPU that has vpermb has it single-uop. (But that's only Ice Lake and the limited-release Cannon Lake).
vpermt2w (to combine 16-bit element from 2 vectors into any order) is one per 2 clock throughput. (InstLatx64 for IceLake-Y), so unfortunately it's not as efficient as the one-vector shuffles.
Anyway, you might use it like this:
64-byte / 512-bit load (includes some over-read at the end from 8x 8-byte groups instead of 8x 5-byte groups. Optionally use a zero-masked load to make this safe near the end of an array thanks to fault suppression)
vpermb to put the 2 bytes containing each field into desired final destination position.
vpsrlvw + vpandq to extract each 10-bit field into a 16-bit word
That's about 4 uops, not including the stores.
You probably want the high half containing the A elements for a contiguous vextracti64x4 and the low half containing the B and C elements for vmovdqu and vextracti128 stores.
Or for 2x vpblenddd to set up for 256-bit stores. (Use 2 different vpermb vectors to create 2 different layouts.)
You shouldn't need vpermt2w or vpermt2d to combine adjacent vectors for wider stores.
Without AVX512VBMI, probably a vpermd + vpshufb can get all the necessary bytes into each 128-bit chunk instead of vpermb. The rest of it only requires AVX512BW which Skylake-X has.

Intel SSE Intrinsics _mm_load_si128 segmentation fault,

I'm currently working with a 5 x 5 matrix using SSE features.
I'm trying to load x4 128bit integer values to the xmm registers as follows,
#include <emmintrin.h>
#include <smmintrin.h>
//===================================== Initialising matrix
int* init_matrix(void) { // allocate and fill the 5x5 matrix
    int* aligned_matrix;
    posix_memalign((void **)&aligned_matrix, 16, sizeof(int) * 25);
    for (ssize_t i = 0; i < 25; i++) {
        aligned_matrix[i] = 2; // uniform value of 2 assigned
    }
    return aligned_matrix;
}
//===================================== then,
__m128i xmm8, xmm9;
xmm8 = _mm_load_si128((__m128i *)(aligned_matrix)); // read 4 from first row
// this line below is where the segmentation fault occurs...
xmm9 = _mm_load_si128((__m128i *)(aligned_matrix + 5)); // 4 from next row
I've got a feeling that this might be related to memory alignment or something but ...I can't pin point where I'm going wrong with this...
I'm using the following,
clang -msse -msse2 -msse4.1
*Note - the reason why I'm adding aligned_matrix + 5 is to read the next 4 elements from the second row of the 5x5 matrix.
For the second load you need to use _mm_loadu_si128 because the source data is misaligned. Explanation: an offset of +5 ints from a base address which is 16 byte aligned will no longer be 16 byte aligned.
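In other words, something like this for the second row (a one-line sketch):
xmm9 = _mm_loadu_si128((__m128i *)(aligned_matrix + 5)); // movdqu: no 16-byte alignment requirement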

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue Fast corners optimisation and got stuck at the
_mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.
const uint8_t __attribute__ ((aligned (16))) _Powers[16]=
{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);
// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));
// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, (uint8x16_t)Mask, 0);
vst1q_lane_u8((uint8_t*)&Output + 1, (uint8x16_t)Mask, 8);
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
Similarly to Michael, the trick is to form the powers of the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.
Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.
The obvious solution seems to be completely missed here.
// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
// Example input (half scale):
// 0x89 FF 1D C0 00 10 99 33
// Shift out everything but the sign bits
// 0x01 01 00 01 00 00 01 00
uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
// Merge the even lanes together with vsra. The '??' bytes are garbage.
// vsri could also be used, but it is slightly slower on aarch64.
// 0x??03 ??02 ??00 ??01
uint32x4_t paired16 = vreinterpretq_u32_u16(
vsraq_n_u16(high_bits, high_bits, 7));
// Repeat with wider lanes.
// 0x??????0B ??????04
uint64x2_t paired32 = vreinterpretq_u64_u32(
vsraq_n_u32(paired16, paired16, 14));
// 0x??????????????4B
uint8x16_t paired64 = vreinterpretq_u8_u64(
vsraq_n_u64(paired32, paired32, 28));
// Extract the low 8 bits from each lane and join.
// 0x4B
return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}
This question deserves a newer answer for aarch64. The addition of new capabilities to Armv8 allows the same function to be implemented in fewer instructions. Here's my version:
uint32_t _mm_movemask_aarch64(uint8x16_t input)
{
const int8_t __attribute__ ((aligned (16))) ucShift[] = {-7,-6,-5,-4,-3,-2,-1,0,-7,-6,-5,-4,-3,-2,-1,0};
int8x16_t vshift = vld1q_s8(ucShift);
uint8x16_t vmask = vandq_u8(input, vdupq_n_u8(0x80));
uint32_t out;
vmask = vshlq_u8(vmask, vshift);
out = vaddv_u8(vget_low_u8(vmask));
out += (vaddv_u8(vget_high_u8(vmask)) << 8);
return out;
}
after some tests it looks like following code works correct:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
uint8x8_t mask_and = vdup_n_u8(0x80);
int8x8_t mask_shift = vld1_s8(xr);
uint8x8_t lo = vget_low_u8(input);
uint8x8_t hi = vget_high_u8(input);
lo = vand_u8(lo, mask_and);
lo = vshl_u8(lo, mask_shift);
hi = vand_u8(hi, mask_and);
hi = vshl_u8(hi, mask_shift);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
return ((hi[0] << 8) | (lo[0] & 0xFF));
}
Note that I haven't tested any of this, but something like this might work:
X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB... (i.e. 0,-1,-2,-3,...)
X = vand_u8(X, A); // Keep d7 of each byte in X
X = vshl_u8(X, B); // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X[0]
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X[0] should now contain the mask. Clear the remaining bytes if necessary
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.
I know this question has been here for 8 years already, but let me give you an answer that might solve all the performance problems with emulation. It's based on the blog post Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.
Most usages of movemask instructions come from comparisons, where each of the 16 bytes of the vector is 0xFF or 0x00. After that, the most common uses of the movemask are to check if none/all match, find the leading/trailing match, or iterate over the set bits.
If this is the case, which it often is, then you can use the shrn reg1, reg2, #4 instruction. This Shift-Right-then-Narrow instruction reduces a 128-bit byte mask to a 64-bit nibble mask (by alternating low and high nibbles in the result). This allows the mask to be extracted to a 64-bit general purpose register.
const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
const uint8x8_t res = vshrn_n_u16(equalMask, 4);
const uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
return matches;
After that you can use all bit operations you typically use on x86 with very minor tweaks like shifting by 2 or doing a scalar AND.
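For example, finding the index of the first matching byte only needs a scalar count-trailing-zeros divided by 4, because every byte of the 128-bit mask became a nibble (a sketch; assumes matches != 0):
int first_match = __builtin_ctzll(matches) >> 2; // bit index / 4 = byte index within the 16-byte chunk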

128-bit rotation using ARM Neon intrinsics

I'm trying to optimize my code using NEON intrinsics. I have a 24-bit rotation over a 128-bit array (8 uint16_t elements).
Here is my c code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for(j = 0; j < 8; j++)
{
//Rotation <<< 24 over 128 bits (x << shift) | (x >> (16 - shift)
rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the gcc documentation about Neon Intrinsics and it doesn't have instruction for vector rotations. Moreover, I've tried to do this using vshlq_n_u16(temp, 8) but all the bits shifted outside a uint16_t word are lost.
How to achieve this using neon intrinsics ? By the way is there a better documentation about GCC Neon Intrinsics ?
After some reading on Arm Community Blogs, I've found this :
VEXT: Extract
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
The following Neon GCC Intrinsic does the same as the assembly provided in the picture :
uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)
So the the 24bit rotation over a full 128bit vector (not over each element) could be done by the following:
uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;
t0 = vextq_u16(input, input, 1);
t0 = vshlq_n_u16(t0, 8);
t1 = vextq_u16(input, input, 2);
t1 = vshrq_n_u16(t1, 8);
rotated = vorrq_u16(t0, t1);
Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case offset by 3 bytes).
Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:
#include <arm_neon.h>
uint16x8_t byterotate3(uint16x8_t input) {
uint8x16_t tmp = vreinterpretq_u8_u16(input);
uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
return vreinterpretq_u16_u8(rotated);
}
g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:
byterotate3(__simd128_uint16_t):
vext.8 q0, q0, q0, #13
bx lr
A count of 16-3 means we left-rotate by 3 bytes. (It means we take 13 bytes from the left vector and 3 bytes from the right vector, so it's also a right-rotate by 13).
Related: x86 also has instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).
Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:
VEXT pseudo-instruction
You can specify a datatype of 16, 32, or 64 instead of 8. In this
case, #imm refers to halfwords, words, or doublewords instead of
referring to bytes, and the permitted ranges are correspondingly
reduced.
I'm not 100% sure but I don't think NEON has rotate instructions.
You can compose the rotation operation you require with a left shift, a right shift and an or, e.g.:
uint8_t ror(uint8_t in, int rotation)
{
return (in >> rotation) | (in << (8-rotation));
}
Just do the same with the NEON intrinsics for left shift, right shift and or.
uint16x8_t temp;
uint8_t rot;
uint16x8_t rotated = vorrq_u16 ( vshlq_n_u16(temp, rot) , vshrq_n_u16(temp, 16 - rot) );
See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."
This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.
