Seeking maximum bitmap (aka bit array) performance with C/Intel assembly - c

Following on from my two previous questions, How to improve memory performance/data locality of 64-bit C/intel assembly program and Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?, I have further reduced the running time of the test program mentioned in these questions from 150 seconds down to 62 seconds, as I will describe below.
The 64-bit program has five 4 GB lookup tables (bytevecM, bytevecD, bytevecC, bytevecL, bytevecX). To reduce the (huge) number of cache misses, analysed in my last question, I added five 4 MB bitmaps, one per lookup table.
Here is the original inner loop:
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
// ... rinse and repeat for bytevecD, bytevecC, bytevecL, bytevecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
The idea behind this simple "pre-check" was to avoid the expensive inner loop if all 128 bytes were zero. However, profiling showed that this pre-check was the primary bottleneck due to huge numbers of cache misses, as discussed last time. So I created a 4 MB bitmap to do the pre-check. (BTW, around 36% of 128-byte blocks are zero, not 98% as I mistakenly reported last time).
Here is the code I used to create a 4 MB bitmap from a 4 GB lookup table:
// Last chunk index (bitmap size=((LAST_CHUNK_IDX+1)>>3)=4,194,304 bytes)
#define LAST_CHUNK_IDX 33554431
void make_bitmap(
const unsigned char* bytevec, // in: byte vector
unsigned char* bitvec // out: bitmap
)
{
unsigned int uu;
unsigned int ucnt = 0;
unsigned int byte;
unsigned int bit;
const size_t* psz;
for (uu = 0; uu <= LAST_CHUNK_IDX; ++uu)
{
psz = (size_t*)&bytevec[uu << 7];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
++ucnt;
byte = uu >> 3;
bit = (uu & 7);
bitvec[byte] |= (1 << bit);
}
printf("ucnt=%u hits from %u\n", ucnt, LAST_CHUNK_IDX+1);
}
Suggestions for a better way to do this are welcome.
With the bitmaps created via the function above, I then changed the "pre-check" to use the 4 MB bitmaps, instead of the 4 GB lookup tables, like so:
if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
// ... rinse and repeat for bitvecD, bitvecC, bitvecL, bitvecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
This was "successful" in that the running time was reduced from 150 seconds down to 62 seconds in the simple single-threaded case. However, VTune still reports some pretty big numbers, as shown below.
I profiled a more realistic test with eight simultaneous threads running across different ranges. The VTune output of the inner loop check for zero blocks is shown below:
> m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
> if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
0x1400025c7 Block 15:
mov eax, r15d 1.058s
mov edx, ebx 0.109s
xor eax, ecx 0.777s
imul eax, eax, 0xf4243 1.088s
mov r9d, eax 3.369s
shr eax, 0x7 0.123s
and eax, 0x7 1.306s
movzx ecx, al 1.319s
mov eax, r9d 0.156s
shr rax, 0xa 0.248s
shl edx, cl 1.321s
test byte ptr [rax+r10*1], dl 1.832s
jz 0x140007670 2.037s
> d7 = (unsigned int)( (s6.m128i_i32[0] ^ q7) * H_PRIME );
> if ( (bitvecD[d7 >> 10] & (1 << ((d7 >> 7) & 7))) == 0 ) continue;
0x1400025f3 Block 16:
mov eax, dword ptr [rsp+0x30] 104.983s
mov edx, ebx 1.663s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.513s
mov edi, eax 1.172s
shr eax, 0x7 0.140s
and eax, 0x7 0.062s
movzx ecx, al 0.575s
mov eax, edi 0.689s
shr rax, 0xa 0.016s
shl edx, cl 0.108s
test byte ptr [rax+r11*1], dl 1.591s
jz 0x140007670 1.087s
> c7 = (unsigned int)( (s6.m128i_i32[1] ^ q7) * H_PRIME );
> if ( (bitvecC[c7 >> 10] & (1 << ((c7 >> 7) & 7))) == 0 ) continue;
0x14000261f Block 17:
mov eax, dword ptr [rsp+0x34] 75.863s
mov edx, 0x1 1.097s
xor eax, r15d 0.031s
imul eax, eax, 0xf4243 0.265s
mov ebx, eax 0.512s
shr eax, 0x7 0.016s
and eax, 0x7 0.233s
movzx ecx, al 0.233s
mov eax, ebx 0.279s
shl edx, cl 0.109s
mov rcx, qword ptr [rsp+0x58] 0.652s
shr rax, 0xa 0.171s
movzx ecx, byte ptr [rax+rcx*1] 0.126s
test cl, dl 77.918s
jz 0x140007667
> l7 = (unsigned int)( (s6.m128i_i32[2] ^ q7) * H_PRIME );
> if ( (bitvecL[l7 >> 10] & (1 << ((l7 >> 7) & 7))) == 0 ) continue;
0x140002655 Block 18:
mov eax, dword ptr [rsp+0x38] 0.980s
mov edx, 0x1 0.794s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.187s
mov r11d, eax 0.278s
shr eax, 0x7 0.062s
and eax, 0x7 0.218s
movzx ecx, al 0.218s
mov eax, r11d 0.186s
shl edx, cl 0.031s
mov rcx, qword ptr [rsp+0x50] 0.373s
shr rax, 0xa 0.233s
movzx ecx, byte ptr [rax+rcx*1] 0.047s
test cl, dl 55.060s
jz 0x14000765e
In addition to that, large amounts of time were (confusingly to me) attributed to this line:
> for (q6 = 1; q6 < 128; ++q6) {
0x1400075a1 Block 779:
inc edx 0.124s
mov dword ptr [rsp+0x10], edx
cmp edx, 0x80 0.031s
jl 0x140002574
mov ecx, dword ptr [rsp+0x4]
mov ebx, dword ptr [rsp+0x48]
...
0x140007575 Block 772:
mov edx, dword ptr [rsp+0x10] 0.699s
...
0x14000765e Block 789 (note: jz in l7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 1.169s
jmp 0x14000757e 0.791s
0x140007667 Block 790 (note: jz in c7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 2.261s
jmp 0x140007583 1.461s
0x140007670 Block 791 (note: jz in m7/d7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 108.355s
jmp 0x140007588 6.922s
I don't fully understand the big numbers in the VTune output above. If anyone can shed more light on these numbers, I'm all ears.
It seems to me that my five 4 MB bitmaps are bigger than my Core i7 3770 processor can fit into its 8 MB L3 cache, leading to many cache misses (though far fewer than before). If my CPU had a 30 MB L3 cache (as the upcoming Ivy Bridge-E has), I speculate that this program would run a lot faster because all five bitmaps would comfortably fit into the L3 cache. Is that right?
Further to that, since the code to test the bitmaps, namely:
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0
now appears five times in the inner loop, any suggestions for speeding up this code are very welcome.

Within the core bits of the loop, using the _bittest() MSVC intrinsic for the bitmap check combines the shl/test combo the compiler creates into a single instruction with (on SandyBridge) no latency/throughput penalty, i.e. it should shave a few cycles off.
Beyond that, can only think of calculating the bitmaps by map-reducing bit sets via recursive POR, as a variation on your zero testing that might be worth benchmarking:
for (int i = 0; i < MAX_IDX; i++) {
__m128i v[8];
__m128i* ptr = ...[i << ...];
v[0] = _mm_load_si128(ptr[0]);
v[1] = _mm_load_si128(ptr[1]);
v[2] = _mm_load_si128(ptr[2]);
v[3] = _mm_load_si128(ptr[3]);
v[4] = _mm_load_si128(ptr[4]);
v[5] = _mm_load_si128(ptr[5]);
v[6] = _mm_load_si128(ptr[6]);
v[7] = _mm_load_si128(ptr[7]);
v[0] = _mm_or_si128(v[0], v[1]);
v[2] = _mm_or_si128(v[2], v[3]);
v[4] = _mm_or_si128(v[4], v[5]);
v[6] = _mm_or_si128(v[6], v[7]);
v[0] = _mm_or_si128(v[0], v[2]);
v[2] = _mm_or_si128(v[4], v[6]);
v[0] = _mm_or_si128(v[0], v[2]);
if (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_setzero_si128(), v[0]))) {
// the contents aren't all zero
}
...
}
At this point, the pure load / accumulate-OR / extract mask might be better than a tight loop of SSE4.2 PTEST because there's no flags dependency and no branches.

For the 128-byte buffer, do the comparisons with larger integers.
unsigned char cbuf[128];
unsigned long long *lbuf = cbuf;
int i;
for (i=0; i < 128/sizeof(long long); i++) {
if (lbuf[i]) return false; // something not a zero
}
return true; // all zero

Related

count number of unique values in a 128bit avx vector, or detecting if all elements are equal?

I'm optimizing a hot path in my codebase and i have turned to vectorization. Keep in mind, I'm still quite new to all of this SIMD stuff. Here is the problem I'm trying to solve, implemented using non-SIMD
inline int count_unique(int c1, int c2, int c3, int c4)
{
return 4 - (c2 == c1)
- ((c3 == c1) || (c3 == c2))
- ((c4 == c1) || (c4 == c2) || (c4 == c3));
}
the assembly output after compiling with -O3:
count_unique:
xor eax, eax
cmp esi, edi
mov r8d, edx
setne al
add eax, 3
cmp edi, edx
sete dl
cmp esi, r8d
sete r9b
or edx, r9d
movzx edx, dl
sub eax, edx
cmp edi, ecx
sete dl
cmp r8d, ecx
sete dil
or edx, edi
cmp esi, ecx
sete cl
or edx, ecx
movzx edx, dl
sub eax, edx
ret
How would something like this be done when storing c1,c2,c3,c4 as a 16byte integer vector?
For your simplified problem (test all 4 lanes for equality), I would do it slightly differently, here’s how. This way it only takes 3 instructions for the complete test.
// True when the input vector has the same value in all 32-bit lanes
inline bool isSameValue( __m128i v )
{
// Rotate vector by 4 bytes
__m128i v2 = _mm_shuffle_epi32( v, _MM_SHUFFLE( 0, 3, 2, 1 ) );
// The XOR outputs zero for equal bits, 1 for different bits
__m128i xx = _mm_xor_si128( v, v2 );
// Use PTEST instruction from SSE 4.1 set to test the complete vector for all zeros
return (bool)_mm_testz_si128( xx, xx );
}
Ok, I have "simplified" the problem, because the only case when i was using the unique count, was if it was 1, but that is the same as checking if all elements are the same, which can be done by comparing the input with itself, but shifted over by one element (4 bytes) using the _mm_alignr_epi8 function.
inline int is_same_val(__m128i v1) {
__m128i v2 = _mm_alignr_epi8(v1, v1, 4);
__m128i vcmp = _mm_cmpeq_epi32(v1, v2);
return ((uint16_t)_mm_movemask_epi8(vcmp) == 0xffff);
}

NASM x86 core dump writing on memory [duplicate]

This question already has answers here:
How can I pass parameters in assembler x86 function call?
(3 answers)
Can't pass parameter from C to Assembly code
(3 answers)
Why does IA-32 have a non-intuitive caller and callee register saving convention?
(4 answers)
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64
(4 answers)
Closed 3 years ago.
I am learning assembly NASM and trying to do a LFSR code and call it on a C program to evaluate the execution time diference, but failed to figure out the problem with my code.
My LFSR C version works just fine and is defined as follows:
int lfsr(){
int cont = 0;
uint32_t start_state = SEED;
uint32_t lfsr = start_state;
uint32_t bit;
while (cont != 16777216) {
bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1;
lfsr = (lfsr >> 1) | (bit << 23);
lfsr_nums[cont] = lfsr;
cont++;
}
return cont;
}
My NASM x86 was based on the C version and it generates the numbers the same way the C code does. It should take a pointer to an array as parameter, and return (as reference) the same array with the numbers created and return (as value) the amount of the numbers. The LFSR logic works just fine, I checked the numbers created, but the code still give me a SegFault Core Dump error.
With gdb the message is that the error is in the do procedure. While I tried to debug my code I found out that the error was in the mov dword [esi + 4 * eax], ebx, if I comment it out the code doesn't output a segfault.
section .text
global lfsr_nasm
lfsr_nasm:
push dword ebx;
mov esi, edi ; vec
mov eax, 0 ;Cont = 0
mov ebx, 0x1313 ; lfst = start_state = seed
do:
mov ecx, ebx ; ecx = lfst
shr ecx, 1 ; lfsr >> 1
mov edx, ebx ; edx = lfst
shr edx, 5; lfst >> 5
xor ecx, edx ; lfst >> 1 ^ lfsr >> 5
mov edx, ebx ; edx = lfsr
shr edx, 7 ; edx = lfst >> 7
xor ecx, edx; lfst >> 1 ^ lfsr >> 5 ^ lfsr >> 7
mov edx, ebx ; edx = lfsr
shr edx, 13 ; edx = lfst >> 13
xor ecx, edx; lfst >> 1 ^ lfsr >> 5 ^ lfsr >> 7 ^ lfsr >> 13
and ecx, 1 ;ecx = bit
shr ebx, 1 ;lfsr >> 1
shl ecx, 23 ; bit << 23
or ebx, ecx ; lfsr = (lfsr >> 1) | (bit << 23);
mov dword [esi + 4 * eax], ebx
inc eax ; cont++
cmp eax, 16777216; cont != 16777216
jne do ;
pop dword ebx;
ret
The way I make the call in C, and declare my vector and NASM function:
extern int lfsr_nasm (uint32_t *vec);
uint32_t lfsr_nums[16777216];
int main(int argc, char *argv[]){
int cont;
cont = lfsr_nasm(lfsr_nums);
for(int i = 0; i < 16777216; i++){
printf("%d ", lfsr_nums[i]);
}
}
I believe that the vector is too big for the NASM or C and maybe the program is trying to access bad memory, but I couldn't find anything to confirm my believes neither a fix to the problem. Already tried with malloc and calloc.

Is this function ok to convert binary to integer in c

I'm using this function to convert an 8-bit binary number represented as a boolean array to an integer. Is it efficient? I'm using it in an embedded system. It performs ok but I'm interested in some opinions or suggestions for improvement (or replacement) if there are any.
uint8_t b2i( bool *bs ){
uint8_t ret = 0;
ret = bs[7] ? 1 : 0;
ret += bs[6] ? 2 : 0;
ret += bs[5] ? 4 : 0;
ret += bs[4] ? 8 : 0;
ret += bs[3] ? 16 : 0;
ret += bs[2] ? 32 : 0;
ret += bs[1] ? 64 : 0;
ret += bs[0] ? 128 : 0;
return ret;
}
It is not possible to say without a specific system in mind. Disassemble the code and see what you got. Benchmark your code on a specific system. This is the key to understanding manual optimization.
Generally, there are lots of considerations. The CPU's data word size, instruction set, compiler optimizer performance, branch prediction (if any), data cache (if any) etc etc.
To make the code perform optimally regardless of data word size, you can change uint8_t to uint_fast8_t. That is unless you need exactly 8 bits, then leave it as uint8_t.
Cache use may or may not be more efficient if given an up-counting loop. At any rate, loop unrolling is an old kind of manual optimization that we shouldn't use in modern programming - the compiler is more capable of making that call than the programmer.
The worst problem with the code is the numerous branches. These might cause a bottleneck.
Your code results in the following x86 machine code gcc -O2:
b2i:
cmp BYTE PTR [rdi+6], 0
movzx eax, BYTE PTR [rdi+7]
je .L2
add eax, 2
.L2:
cmp BYTE PTR [rdi+5], 0
je .L3
add eax, 4
.L3:
cmp BYTE PTR [rdi+4], 0
je .L4
add eax, 8
.L4:
cmp BYTE PTR [rdi+3], 0
je .L5
add eax, 16
.L5:
cmp BYTE PTR [rdi+2], 0
je .L6
add eax, 32
.L6:
cmp BYTE PTR [rdi+1], 0
je .L7
add eax, 64
.L7:
lea edx, [rax-128]
cmp BYTE PTR [rdi], 0
cmovne eax, edx
ret
Whole lot of potentially inefficient branching. We can make the code both faster and more readable by using a loop:
uint8_t b2i (const bool bs[8])
{
uint8_t result = 0;
for(size_t i=0; i<8; i++)
{
result |= bs[8-1-i] << i;
}
return result;
}
(ideally the bool array should be arranged from LSB first but that would change the meaning of the code compared to the original)
Which gives this machine code instead:
b2i:
lea rsi, [rdi-8]
mov rax, rdi
xor r8d, r8d
.L2:
movzx edx, BYTE PTR [rax+7]
mov ecx, edi
sub ecx, eax
sub rax, 1
sal edx, cl
or r8d, edx
cmp rax, rsi
jne .L2
mov eax, r8d
ret
More instructions but less branching. It will likely perform better than your code on x86 and other high end CPUs with branch prediction and instruction cache. But worse than your code on a 8 bit microcontroller where only the total number of instructions count.
You can also do this with a loop and bit shifts to reduce code repetition:
int b2i(bool *bs) {
int ret = 0;
for (int i = 0; i < 8; i++) {
ret = ret << 1;
ret += bs[i];
}
return ret;
}

Did I find a bug in Cygwin GCC (v 4.5.3)?

Within the following block of code, num_insts is re-assigned to 0 following the first iteration of the loop.
inst_t buf[5] = {0};
num_insts = 10;
int i = 5;
for( ; i > 0; i-- )
{
buf[i] = buf[i-1];
}
buf[0] = next;
I cannot think of any possible valid reason for this behavior, but I'm also sleep deprived so a second opinion would be appreciated.
The assembly being executed for the buf shift is this:
004017ed: mov 0x90(%esp),%eax
004017f4: lea -0x1(%eax),%ecx
004017f7: mov 0x90(%esp),%edx
004017fe: mov %edx,%eax
00401800: shl $0x2,%eax
00401803: add %edx,%eax
00401805: shl $0x2,%eax
00401808: lea 0xa0(%esp),%edi
0040180f: lea (%edi,%eax,1),%eax
00401812: lea -0x7c(%eax),%edx
00401815: mov %ecx,%eax
00401817: shl $0x2,%eax
0040181a: add %ecx,%eax
0040181c: shl $0x2,%eax
0040181f: lea 0xa0(%esp),%ecx
And the register contents prior to executing the first assembly instruction above is this:
eax 0
ecx 0
edx 0
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665432
eip 0x4017ed <main+1593>
Following those instructions, this:
eax 0
ecx 0
edx 2665432
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665456
eip 0x401848 <main+1684>
I don't know nearly enough assembly to make sense of any of this, but maybe someone answering this will benefit from it.
For first iteration with i = 5 you code:
for( ; i > 0; i-- ) // i = 5 > 0 = true
{
buf[i] = buf[i-1]; // b[5] = b [5 - 1]
}
Is buf[5] = buf[4]; because buf is just of size 5, maximum index value can be 4 so bug in your code = array out of index problem => rhs buf[5].

Fast 24-bit array -> 32-bit array conversion?

Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32-bits before I can load them into the GPU.
I really don't care what I set the remaining 8 bits to, or where the incoming 24-bits are in that 32-bit value - I can fix all that in a pixel shader. But I need to do the conversion from 24-bit to 32-bit really quickly.
I'm not terribly familiar with SIMD SSE operations, but from my cursory glance it doesn't look like I can do the expansion using them, given my reads and writes aren't the same size. Any suggestions? Or am I stuck sequentially massaging this data set?
This feels so very silly - I'm using the pixel shaders for parallelism, but I have to do a sequential per-pixel operation before that. I must be missing something obvious...
The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits.
uint32_t *src = ...;
uint32_t *dst = ...;
for (int i=0; i<num_pixels; i+=4) {
uint32_t sa = src[0];
uint32_t sb = src[1];
uint32_t sc = src[2];
dst[i+0] = sa;
dst[i+1] = (sa>>24) | (sb<<8);
dst[i+2] = (sb>>16) | (sc<<16);
dst[i+3] = sc>>8;
src += 3;
}
Edit:
Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration. The source and destination pointers Must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, but this will be slower.
#include <emmintrin.h>
#include <tmmintrin.h>
__m128i *src = ...;
__m128i *dst = ...;
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (int i=0; i<num_pixels; i+=16) {
__m128i sa = _mm_load_si128(src);
__m128i sb = _mm_load_si128(src+1);
__m128i sc = _mm_load_si128(src+2);
__m128i val = _mm_shuffle_epi8(sa, mask);
_mm_store_si128(dst, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
_mm_store_si128(dst+1, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
_mm_store_si128(dst+2, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
_mm_store_si128(dst+3, val);
src += 3;
dst += 4;
}
SSSE3 (not to be confused with SSE3) will require a relatively new processor: Core 2 or newer, and I believe AMD doesn't support it yet. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.
SSE3 is awesome, but for those who can't use it for whatever reason, here's the conversion in x86 assembler, hand-optimized by yours truly. For completeness, I give the conversion in both directions: RGB32->RGB24 and RGB24->RGB32.
Note that interjay's C code leaves trash in the MSB (the alpha channel) of the destination pixels. This might not matter in some applications, but it matters in mine, hence my RGB24->RGB32 code forces the MSB to zero. Similarly, my RGB32->RGB24 code ignores the MSB; this avoids garbage output if the source data has a non-zero alpha channel. These features cost almost nothing in terms of performance, as verified by benchmarks.
For RGB32->RGB24 I was able to beat the VC++ optimizer by about 20%. For RGB24->RGB32 the gain was insignificant. Benchmarking was done on an i5 2500K. I omit the benchmarking code here, but if anyone wants it I'll provide it. The most important optimization was bumping the source pointer as soon as possible (see the ASAP comment). My best guess is that this increases parallelism by allowing the instruction pipeline to prefetch sooner. Other than that I just reordered some instructions to reduce dependencies and overlap memory accesses with bit-bashing.
void ConvRGB32ToRGB24(const UINT *Src, UINT *Dst, UINT Pixels)
{
#if !USE_ASM
for (UINT i = 0; i < Pixels; i += 4) {
UINT sa = Src[i + 0] & 0xffffff;
UINT sb = Src[i + 1] & 0xffffff;
UINT sc = Src[i + 2] & 0xffffff;
UINT sd = Src[i + 3];
Dst[0] = sa | (sb << 24);
Dst[1] = (sb >> 8) | (sc << 16);
Dst[2] = (sc >> 16) | (sd << 8);
Dst += 3;
}
#else
__asm {
mov ecx, Pixels
shr ecx, 2 // 4 pixels at once
jz ConvRGB32ToRGB24_$2
mov esi, Src
mov edi, Dst
ConvRGB32ToRGB24_$1:
mov ebx, [esi + 4] // sb
and ebx, 0ffffffh // sb & 0xffffff
mov eax, [esi + 0] // sa
and eax, 0ffffffh // sa & 0xffffff
mov edx, ebx // copy sb
shl ebx, 24 // sb << 24
or eax, ebx // sa | (sb << 24)
mov [edi + 0], eax // Dst[0]
shr edx, 8 // sb >> 8
mov eax, [esi + 8] // sc
and eax, 0ffffffh // sc & 0xffffff
mov ebx, eax // copy sc
shl eax, 16 // sc << 16
or eax, edx // (sb >> 8) | (sc << 16)
mov [edi + 4], eax // Dst[1]
shr ebx, 16 // sc >> 16
mov eax, [esi + 12] // sd
add esi, 16 // Src += 4 (ASAP)
shl eax, 8 // sd << 8
or eax, ebx // (sc >> 16) | (sd << 8)
mov [edi + 8], eax // Dst[2]
add edi, 12 // Dst += 3
dec ecx
jnz SHORT ConvRGB32ToRGB24_$1
ConvRGB32ToRGB24_$2:
}
#endif
}
void ConvRGB24ToRGB32(const UINT *Src, UINT *Dst, UINT Pixels)
{
#if !USE_ASM
for (UINT i = 0; i < Pixels; i += 4) {
UINT sa = Src[0];
UINT sb = Src[1];
UINT sc = Src[2];
Dst[i + 0] = sa & 0xffffff;
Dst[i + 1] = ((sa >> 24) | (sb << 8)) & 0xffffff;
Dst[i + 2] = ((sb >> 16) | (sc << 16)) & 0xffffff;
Dst[i + 3] = sc >> 8;
Src += 3;
}
#else
__asm {
mov ecx, Pixels
shr ecx, 2 // 4 pixels at once
jz SHORT ConvRGB24ToRGB32_$2
mov esi, Src
mov edi, Dst
push ebp
ConvRGB24ToRGB32_$1:
mov ebx, [esi + 4] // sb
mov edx, ebx // copy sb
mov eax, [esi + 0] // sa
mov ebp, eax // copy sa
and ebx, 0ffffh // sb & 0xffff
shl ebx, 8 // (sb & 0xffff) << 8
and eax, 0ffffffh // sa & 0xffffff
mov [edi + 0], eax // Dst[0]
shr ebp, 24 // sa >> 24
or ebx, ebp // (sa >> 24) | ((sb & 0xffff) << 8)
mov [edi + 4], ebx // Dst[1]
shr edx, 16 // sb >> 16
mov eax, [esi + 8] // sc
add esi, 12 // Src += 12 (ASAP)
mov ebx, eax // copy sc
and eax, 0ffh // sc & 0xff
shl eax, 16 // (sc & 0xff) << 16
or eax, edx // (sb >> 16) | ((sc & 0xff) << 16)
mov [edi + 8], eax // Dst[2]
shr ebx, 8 // sc >> 8
mov [edi + 12], ebx // Dst[3]
add edi, 16 // Dst += 16
dec ecx
jnz SHORT ConvRGB24ToRGB32_$1
pop ebp
ConvRGB24ToRGB32_$2:
}
#endif
}
And while we're at it, here are the same conversions in actual SSE3 assembly. This only works if you have an assembler (FASM is free) and have a CPU that supports SSE3 (likely but it's better to check). Note that the intrinsics don't necessarily output something this efficient, it totally depends on the tools you use and what platform you're compiling for. Here, it's straightforward: what you see is what you get. This code generates the same output as the x86 code above, and it's about 1.5x faster (on an i5 2500K).
format MS COFF
section '.text' code readable executable
public _ConvRGB32ToRGB24SSE3
; ebp + 8 Src (*RGB32, 16-byte aligned)
; ebp + 12 Dst (*RGB24, 16-byte aligned)
; ebp + 16 Pixels
_ConvRGB32ToRGB24SSE3:
push ebp
mov ebp, esp
mov eax, [ebp + 8]
mov edx, [ebp + 12]
mov ecx, [ebp + 16]
shr ecx, 4
jz done1
movupd xmm7, [mask1]
top1:
movupd xmm0, [eax + 0] ; sa = Src[0]
pshufb xmm0, xmm7 ; sa = _mm_shuffle_epi8(sa, mask)
movupd xmm1, [eax + 16] ; sb = Src[1]
pshufb xmm1, xmm7 ; sb = _mm_shuffle_epi8(sb, mask)
movupd xmm2, xmm1 ; sb1 = sb
pslldq xmm1, 12 ; sb = _mm_slli_si128(sb, 12)
por xmm0, xmm1 ; sa = _mm_or_si128(sa, sb)
movupd [edx + 0], xmm0 ; Dst[0] = sa
psrldq xmm2, 4 ; sb1 = _mm_srli_si128(sb1, 4)
movupd xmm0, [eax + 32] ; sc = Src[2]
pshufb xmm0, xmm7 ; sc = _mm_shuffle_epi8(sc, mask)
movupd xmm1, xmm0 ; sc1 = sc
pslldq xmm0, 8 ; sc = _mm_slli_si128(sc, 8)
por xmm0, xmm2 ; sc = _mm_or_si128(sb1, sc)
movupd [edx + 16], xmm0 ; Dst[1] = sc
psrldq xmm1, 8 ; sc1 = _mm_srli_si128(sc1, 8)
movupd xmm0, [eax + 48] ; sd = Src[3]
pshufb xmm0, xmm7 ; sd = _mm_shuffle_epi8(sd, mask)
pslldq xmm0, 4 ; sd = _mm_slli_si128(sd, 4)
por xmm0, xmm1 ; sd = _mm_or_si128(sc1, sd)
movupd [edx + 32], xmm0 ; Dst[2] = sd
add eax, 64
add edx, 48
dec ecx
jnz top1
done1:
pop ebp
ret
public _ConvRGB24ToRGB32SSE3
; ebp + 8 Src (*RGB24, 16-byte aligned)
; ebp + 12 Dst (*RGB32, 16-byte aligned)
; ebp + 16 Pixels
_ConvRGB24ToRGB32SSE3:
push ebp
mov ebp, esp
mov eax, [ebp + 8]
mov edx, [ebp + 12]
mov ecx, [ebp + 16]
shr ecx, 4
jz done2
movupd xmm7, [mask2]
top2:
movupd xmm0, [eax + 0] ; sa = Src[0]
movupd xmm1, [eax + 16] ; sb = Src[1]
movupd xmm2, [eax + 32] ; sc = Src[2]
movupd xmm3, xmm0 ; sa1 = sa
pshufb xmm0, xmm7 ; sa = _mm_shuffle_epi8(sa, mask)
movupd [edx], xmm0 ; Dst[0] = sa
movupd xmm4, xmm1 ; sb1 = sb
palignr xmm1, xmm3, 12 ; sb = _mm_alignr_epi8(sb, sa1, 12)
pshufb xmm1, xmm7 ; sb = _mm_shuffle_epi8(sb, mask);
movupd [edx + 16], xmm1 ; Dst[1] = sb
movupd xmm3, xmm2 ; sc1 = sc
palignr xmm2, xmm4, 8 ; sc = _mm_alignr_epi8(sc, sb1, 8)
pshufb xmm2, xmm7 ; sc = _mm_shuffle_epi8(sc, mask)
movupd [edx + 32], xmm2 ; Dst[2] = sc
palignr xmm3, xmm3, 4 ; sc1 = _mm_alignr_epi8(sc1, sc1, 4)
pshufb xmm3, xmm7 ; sc1 = _mm_shuffle_epi8(sc1, mask)
movupd [edx + 48], xmm3 ; Dst[3] = sc1
add eax, 48
add edx, 64
dec ecx
jnz top2
done2:
pop ebp
ret
section '.data' data readable writeable align 16
label mask1 dqword
db 0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1
label mask2 dqword
db 0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1
The different input/output sizes are not a barrier to using simd, just a speed bump. You would need to chunk the data so that you read and write in full simd words (16 bytes).
In this case, you would read 3 SIMD words (48 bytes == 16 rgb pixels), do the expansion, then write 4 SIMD words.
I'm just saying you can use SIMD, I'm not saying you should. The middle bit, the expansion, is still tricky since you have non-uniform shift sizes in different parts of the word.
SSE 4.1 .ASM:
PINSRD XMM0, DWORD PTR[ESI], 0
PINSRD XMM0, DWORD PTR[ESI+3], 1
PINSRD XMM0, DWORD PTR[ESI+6], 2
PINSRD XMM0, DWORD PTR[ESI+9], 3
PSLLD XMM0, 8
PSRLD XMM0, 8
MOVNTDQ [EDI], XMM1
add ESI, 12
add EDI, 16

Resources