I'm optimizing a hot path in my codebase and I have turned to vectorization. Keep in mind, I'm still quite new to all of this SIMD stuff. Here is the problem I'm trying to solve, implemented without SIMD:
inline int count_unique(int c1, int c2, int c3, int c4)
{
return 4 - (c2 == c1)
- ((c3 == c1) || (c3 == c2))
- ((c4 == c1) || (c4 == c2) || (c4 == c3));
}
The assembly output after compiling with -O3:
count_unique:
xor eax, eax
cmp esi, edi
mov r8d, edx
setne al
add eax, 3
cmp edi, edx
sete dl
cmp esi, r8d
sete r9b
or edx, r9d
movzx edx, dl
sub eax, edx
cmp edi, ecx
sete dl
cmp r8d, ecx
sete dil
or edx, edi
cmp esi, ecx
sete cl
or edx, ecx
movzx edx, dl
sub eax, edx
ret
How would something like this be done when storing c1, c2, c3, c4 as a 16-byte integer vector?
For your simplified problem (testing all 4 lanes for equality), I would do it slightly differently; this way the complete test only takes 3 instructions.
// True when the input vector has the same value in all 32-bit lanes
inline bool isSameValue( __m128i v )
{
// Rotate vector by 4 bytes
__m128i v2 = _mm_shuffle_epi32( v, _MM_SHUFFLE( 0, 3, 2, 1 ) );
// The XOR outputs zero for equal bits, 1 for different bits
__m128i xx = _mm_xor_si128( v, v2 );
// Use PTEST instruction from SSE 4.1 set to test the complete vector for all zeros
return (bool)_mm_testz_si128( xx, xx );
}
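If it helps, here is how that could be wired up to the original four scalars; a minimal usage sketch (assumes SSE4.1 and the <smmintrin.h> header):
#include <smmintrin.h> // SSE4.1 intrinsics, including _mm_testz_si128
// Hypothetical wrapper: pack the four scalars into a vector and run the test above
inline bool allFourEqual( int c1, int c2, int c3, int c4 )
{
    __m128i v = _mm_setr_epi32( c1, c2, c3, c4 );
    return isSameValue( v );
}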
OK, I have "simplified" the problem: the only case where I was using the unique count was when it was 1, but that is the same as checking whether all elements are the same, which can be done by comparing the input with itself shifted over by one element (4 bytes) using the _mm_alignr_epi8 function.
inline int is_same_val(__m128i v1) {
__m128i v2 = _mm_alignr_epi8(v1, v1, 4);
__m128i vcmp = _mm_cmpeq_epi32(v1, v2);
return ((uint16_t)_mm_movemask_epi8(vcmp) == 0xffff);
}
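A quick sanity check against the scalar version, in case anyone wants to verify the two agree (hypothetical test code; count_unique is the function from the question):
#include <assert.h>
void test_is_same_val() {
    // all equal -> unique count is 1
    assert(is_same_val(_mm_setr_epi32(7, 7, 7, 7)) == (count_unique(7, 7, 7, 7) == 1));
    // one element differs -> unique count is greater than 1
    assert(is_same_val(_mm_setr_epi32(7, 7, 7, 8)) == (count_unique(7, 7, 7, 8) == 1));
}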
This question was closed 3 years ago as a duplicate of:
How can I pass parameters in assembler x86 function call? (3 answers)
Can't pass parameter from C to Assembly code (3 answers)
Why does IA-32 have a non-intuitive caller and callee register saving convention? (4 answers)
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64 (4 answers)
I am learning NASM assembly and trying to write an LFSR routine and call it from a C program to evaluate the difference in execution time, but I have failed to figure out the problem with my code.
My LFSR C version works just fine and is defined as follows:
int lfsr(){
int cont = 0;
uint32_t start_state = SEED;
uint32_t lfsr = start_state;
uint32_t bit;
while (cont != 16777216) {
bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1;
lfsr = (lfsr >> 1) | (bit << 23);
lfsr_nums[cont] = lfsr;
cont++;
}
return cont;
}
My NASM x86 version was based on the C version and it generates the numbers the same way the C code does. It should take a pointer to an array as a parameter, fill (by reference) that same array with the numbers created, and return (by value) the amount of numbers. The LFSR logic works just fine, I checked the numbers created, but the code still gives me a segfault (core dumped) error.
With gdb the message is that the error is in the do procedure. While trying to debug my code I found out that the faulting instruction was mov dword [esi + 4 * eax], ebx; if I comment it out the code doesn't segfault.
section .text
global lfsr_nasm
lfsr_nasm:
push dword ebx;
mov esi, edi ; vec
mov eax, 0 ; cont = 0
mov ebx, 0x1313 ; lfsr = start_state = SEED
do:
mov ecx, ebx ; ecx = lfsr
shr ecx, 1 ; lfsr >> 1
mov edx, ebx ; edx = lfsr
shr edx, 5 ; lfsr >> 5
xor ecx, edx ; lfsr >> 1 ^ lfsr >> 5
mov edx, ebx ; edx = lfsr
shr edx, 7 ; edx = lfsr >> 7
xor ecx, edx ; lfsr >> 1 ^ lfsr >> 5 ^ lfsr >> 7
mov edx, ebx ; edx = lfsr
shr edx, 13 ; edx = lfsr >> 13
xor ecx, edx ; lfsr >> 1 ^ lfsr >> 5 ^ lfsr >> 7 ^ lfsr >> 13
and ecx, 1 ; ecx = bit
shr ebx, 1 ; lfsr >> 1
shl ecx, 23 ; bit << 23
or ebx, ecx ; lfsr = (lfsr >> 1) | (bit << 23)
mov dword [esi + 4 * eax], ebx
inc eax ; cont++
cmp eax, 16777216; cont != 16777216
jne do ;
pop dword ebx;
ret
The way I make the call in C and declare my vector and the NASM function:
extern int lfsr_nasm (uint32_t *vec);
uint32_t lfsr_nums[16777216];
int main(int argc, char *argv[]){
int cont;
cont = lfsr_nasm(lfsr_nums);
for(int i = 0; i < 16777216; i++){
printf("%d ", lfsr_nums[i]);
}
}
I believe that the vector is too big for NASM or C and maybe the program is trying to access bad memory, but I couldn't find anything to confirm my belief, nor a fix for the problem. I already tried with malloc and calloc.
I'm using this function to convert an 8-bit binary number represented as a boolean array to an integer. Is it efficient? I'm using it in an embedded system. It performs ok but I'm interested in some opinions or suggestions for improvement (or replacement) if there are any.
uint8_t b2i( bool *bs ){
uint8_t ret = 0;
ret = bs[7] ? 1 : 0;
ret += bs[6] ? 2 : 0;
ret += bs[5] ? 4 : 0;
ret += bs[4] ? 8 : 0;
ret += bs[3] ? 16 : 0;
ret += bs[2] ? 32 : 0;
ret += bs[1] ? 64 : 0;
ret += bs[0] ? 128 : 0;
return ret;
}
It is not possible to say without a specific system in mind. Disassemble the code and see what you got. Benchmark your code on a specific system. This is the key to understanding manual optimization.
Generally, there are lots of considerations. The CPU's data word size, instruction set, compiler optimizer performance, branch prediction (if any), data cache (if any) etc etc.
To make the code perform optimally regardless of data word size, you can change uint8_t to uint_fast8_t. That is unless you need exactly 8 bits, then leave it as uint8_t.
Cache use may or may not be more efficient if given an up-counting loop. At any rate, loop unrolling is an old kind of manual optimization that we shouldn't use in modern programming - the compiler is more capable of making that call than the programmer.
The worst problem with the code is the numerous branches. These might cause a bottleneck.
Your code results in the following x86 machine code with gcc -O2:
b2i:
cmp BYTE PTR [rdi+6], 0
movzx eax, BYTE PTR [rdi+7]
je .L2
add eax, 2
.L2:
cmp BYTE PTR [rdi+5], 0
je .L3
add eax, 4
.L3:
cmp BYTE PTR [rdi+4], 0
je .L4
add eax, 8
.L4:
cmp BYTE PTR [rdi+3], 0
je .L5
add eax, 16
.L5:
cmp BYTE PTR [rdi+2], 0
je .L6
add eax, 32
.L6:
cmp BYTE PTR [rdi+1], 0
je .L7
add eax, 64
.L7:
lea edx, [rax-128]
cmp BYTE PTR [rdi], 0
cmovne eax, edx
ret
Whole lot of potentially inefficient branching. We can make the code both faster and more readable by using a loop:
uint8_t b2i (const bool bs[8])
{
uint8_t result = 0;
for(size_t i=0; i<8; i++)
{
result |= bs[8-1-i] << i;
}
return result;
}
(Ideally the bool array would be arranged LSB first, but that would change the meaning of the code compared to the original; a sketch of that variant follows the disassembly below.)
Which gives this machine code instead:
b2i:
lea rsi, [rdi-8]
mov rax, rdi
xor r8d, r8d
.L2:
movzx edx, BYTE PTR [rax+7]
mov ecx, edi
sub ecx, eax
sub rax, 1
sal edx, cl
or r8d, edx
cmp rax, rsi
jne .L2
mov eax, r8d
ret
More instructions but less branching. It will likely perform better than your code on x86 and other high-end CPUs with branch prediction and an instruction cache, but worse than your code on an 8-bit microcontroller where only the total number of instructions counts.
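For reference, here is a sketch of the LSB-first variant mentioned above; note the bit order differs from the original, so it is not a drop-in replacement:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
uint8_t b2i_lsb_first (const bool bs[8])
{
    uint8_t result = 0;
    for(size_t i=0; i<8; i++)
    {
        result |= (uint8_t)(bs[i] << i);   /* bs[0] is now the least significant bit */
    }
    return result;
}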
You can also do this with a loop and bit shifts to reduce code repetition:
int b2i(bool *bs) {
int ret = 0;
for (int i = 0; i < 8; i++) {
ret = ret << 1;
ret += bs[i];
}
return ret;
}
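A quick usage example (hypothetical values; bs[0] is the most significant bit, as in the original):
bool bits[8] = {1, 0, 0, 0, 0, 0, 0, 1};  // MSB first
int value = b2i(bits);                    // 10000001 in binary = 129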
Within the following block of code, num_insts is re-assigned to 0 following the first iteration of the loop.
inst_t buf[5] = {0};
num_insts = 10;
int i = 5;
for( ; i > 0; i-- )
{
buf[i] = buf[i-1];
}
buf[0] = next;
I cannot think of any possible valid reason for this behavior, but I'm also sleep deprived so a second opinion would be appreciated.
The assembly being executed for the buf shift is this:
004017ed: mov 0x90(%esp),%eax
004017f4: lea -0x1(%eax),%ecx
004017f7: mov 0x90(%esp),%edx
004017fe: mov %edx,%eax
00401800: shl $0x2,%eax
00401803: add %edx,%eax
00401805: shl $0x2,%eax
00401808: lea 0xa0(%esp),%edi
0040180f: lea (%edi,%eax,1),%eax
00401812: lea -0x7c(%eax),%edx
00401815: mov %ecx,%eax
00401817: shl $0x2,%eax
0040181a: add %ecx,%eax
0040181c: shl $0x2,%eax
0040181f: lea 0xa0(%esp),%ecx
And the register contents prior to executing the first assembly instruction above are:
eax 0
ecx 0
edx 0
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665432
eip 0x4017ed <main+1593>
And following those instructions:
eax 0
ecx 0
edx 2665432
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665456
eip 0x401848 <main+1684>
I don't know nearly enough assembly to make sense of any of this, but maybe someone answering this will benefit from it.
For the first iteration, with i = 5, your code:
for( ; i > 0; i-- ) // i = 5 > 0 = true
{
    buf[i] = buf[i-1]; // buf[5] = buf[5 - 1]
}
executes buf[5] = buf[4];. Because buf has only 5 elements, the maximum valid index is 4, so the bug in your code is an array-out-of-bounds write on the left-hand side, buf[5]. That out-of-bounds write lands in adjacent stack memory, in your case evidently where num_insts is stored, which is why num_insts becomes 0 (the value of buf[4]) after the first iteration.
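A minimal sketch of a fix, assuming the intent is to keep only the 5 most recent entries (so the oldest value in buf[4] is simply discarded); inst_t, next and num_insts are from your snippet:
inst_t buf[5] = {0};
num_insts = 10;
for( int i = 4; i > 0; i-- )   // start at 4: buf[4] = buf[3], ..., buf[1] = buf[0]
{
    buf[i] = buf[i-1];
}
buf[0] = next;                 // no write to buf[5], so num_insts is no longer clobbered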
Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32-bits before I can load them into the GPU.
I really don't care what I set the remaining 8 bits to, or where the incoming 24-bits are in that 32-bit value - I can fix all that in a pixel shader. But I need to do the conversion from 24-bit to 32-bit really quickly.
I'm not terribly familiar with SIMD SSE operations, but from my cursory glance it doesn't look like I can do the expansion using them, given my reads and writes aren't the same size. Any suggestions? Or am I stuck sequentially massaging this data set?
This feels so very silly - I'm using the pixel shaders for parallelism, but I have to do a sequential per-pixel operation before that. I must be missing something obvious...
The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits.
uint32_t *src = ...;
uint32_t *dst = ...;
for (int i=0; i<num_pixels; i+=4) {
uint32_t sa = src[0];
uint32_t sb = src[1];
uint32_t sc = src[2];
dst[i+0] = sa;
dst[i+1] = (sa>>24) | (sb<<8);
dst[i+2] = (sb>>16) | (sc<<16);
dst[i+3] = sc>>8;
src += 3;
}
Edit:
Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration. The source and destination pointers must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, but this will be slower.
#include <emmintrin.h>
#include <tmmintrin.h>
__m128i *src = ...;
__m128i *dst = ...;
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (int i=0; i<num_pixels; i+=16) {
__m128i sa = _mm_load_si128(src);
__m128i sb = _mm_load_si128(src+1);
__m128i sc = _mm_load_si128(src+2);
__m128i val = _mm_shuffle_epi8(sa, mask);
_mm_store_si128(dst, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
_mm_store_si128(dst+1, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
_mm_store_si128(dst+2, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
_mm_store_si128(dst+3, val);
src += 3;
dst += 4;
}
SSSE3 (not to be confused with SSE3) will require a relatively new processor: Core 2 or newer, and I believe AMD doesn't support it yet. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.
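If you have to support CPUs without SSSE3, you can also select the code path at runtime; a minimal sketch assuming GCC or Clang, where the two helper names are hypothetical placeholders for the scalar and SSSE3 loops above:
#include <stdint.h>
/* Hypothetical wrappers around the two loops shown above. */
void expand_pixels_scalar(const uint8_t *src, uint8_t *dst, int num_pixels);
void expand_pixels_ssse3(const uint8_t *src, uint8_t *dst, int num_pixels);
void expand_pixels(const uint8_t *src, uint8_t *dst, int num_pixels)
{
    if (__builtin_cpu_supports("ssse3"))
        expand_pixels_ssse3(src, dst, num_pixels);   // the PSHUFB/PALIGNR version
    else
        expand_pixels_scalar(src, dst, num_pixels);  // the plain 32-bit shift version
}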
SSSE3 is awesome, but for those who can't use it for whatever reason, here's the conversion in x86 assembler, hand-optimized by yours truly. For completeness, I give the conversion in both directions: RGB32->RGB24 and RGB24->RGB32.
Note that interjay's C code leaves trash in the MSB (the alpha channel) of the destination pixels. This might not matter in some applications, but it matters in mine, hence my RGB24->RGB32 code forces the MSB to zero. Similarly, my RGB32->RGB24 code ignores the MSB; this avoids garbage output if the source data has a non-zero alpha channel. These features cost almost nothing in terms of performance, as verified by benchmarks.
For RGB32->RGB24 I was able to beat the VC++ optimizer by about 20%. For RGB24->RGB32 the gain was insignificant. Benchmarking was done on an i5 2500K. I omit the benchmarking code here, but if anyone wants it I'll provide it. The most important optimization was bumping the source pointer as soon as possible (see the ASAP comment). My best guess is that this increases parallelism by allowing the instruction pipeline to prefetch sooner. Other than that I just reordered some instructions to reduce dependencies and overlap memory accesses with bit-bashing.
void ConvRGB32ToRGB24(const UINT *Src, UINT *Dst, UINT Pixels)
{
#if !USE_ASM
for (UINT i = 0; i < Pixels; i += 4) {
UINT sa = Src[i + 0] & 0xffffff;
UINT sb = Src[i + 1] & 0xffffff;
UINT sc = Src[i + 2] & 0xffffff;
UINT sd = Src[i + 3];
Dst[0] = sa | (sb << 24);
Dst[1] = (sb >> 8) | (sc << 16);
Dst[2] = (sc >> 16) | (sd << 8);
Dst += 3;
}
#else
__asm {
mov ecx, Pixels
shr ecx, 2 // 4 pixels at once
jz ConvRGB32ToRGB24_$2
mov esi, Src
mov edi, Dst
ConvRGB32ToRGB24_$1:
mov ebx, [esi + 4] // sb
and ebx, 0ffffffh // sb & 0xffffff
mov eax, [esi + 0] // sa
and eax, 0ffffffh // sa & 0xffffff
mov edx, ebx // copy sb
shl ebx, 24 // sb << 24
or eax, ebx // sa | (sb << 24)
mov [edi + 0], eax // Dst[0]
shr edx, 8 // sb >> 8
mov eax, [esi + 8] // sc
and eax, 0ffffffh // sc & 0xffffff
mov ebx, eax // copy sc
shl eax, 16 // sc << 16
or eax, edx // (sb >> 8) | (sc << 16)
mov [edi + 4], eax // Dst[1]
shr ebx, 16 // sc >> 16
mov eax, [esi + 12] // sd
add esi, 16 // Src += 4 (ASAP)
shl eax, 8 // sd << 8
or eax, ebx // (sc >> 16) | (sd << 8)
mov [edi + 8], eax // Dst[2]
add edi, 12 // Dst += 3
dec ecx
jnz SHORT ConvRGB32ToRGB24_$1
ConvRGB32ToRGB24_$2:
}
#endif
}
void ConvRGB24ToRGB32(const UINT *Src, UINT *Dst, UINT Pixels)
{
#if !USE_ASM
for (UINT i = 0; i < Pixels; i += 4) {
UINT sa = Src[0];
UINT sb = Src[1];
UINT sc = Src[2];
Dst[i + 0] = sa & 0xffffff;
Dst[i + 1] = ((sa >> 24) | (sb << 8)) & 0xffffff;
Dst[i + 2] = ((sb >> 16) | (sc << 16)) & 0xffffff;
Dst[i + 3] = sc >> 8;
Src += 3;
}
#else
__asm {
mov ecx, Pixels
shr ecx, 2 // 4 pixels at once
jz SHORT ConvRGB24ToRGB32_$2
mov esi, Src
mov edi, Dst
push ebp
ConvRGB24ToRGB32_$1:
mov ebx, [esi + 4] // sb
mov edx, ebx // copy sb
mov eax, [esi + 0] // sa
mov ebp, eax // copy sa
and ebx, 0ffffh // sb & 0xffff
shl ebx, 8 // (sb & 0xffff) << 8
and eax, 0ffffffh // sa & 0xffffff
mov [edi + 0], eax // Dst[0]
shr ebp, 24 // sa >> 24
or ebx, ebp // (sa >> 24) | ((sb & 0xffff) << 8)
mov [edi + 4], ebx // Dst[1]
shr edx, 16 // sb >> 16
mov eax, [esi + 8] // sc
add esi, 12 // Src += 12 (ASAP)
mov ebx, eax // copy sc
and eax, 0ffh // sc & 0xff
shl eax, 16 // (sc & 0xff) << 16
or eax, edx // (sb >> 16) | ((sc & 0xff) << 16)
mov [edi + 8], eax // Dst[2]
shr ebx, 8 // sc >> 8
mov [edi + 12], ebx // Dst[3]
add edi, 16 // Dst += 16
dec ecx
jnz SHORT ConvRGB24ToRGB32_$1
pop ebp
ConvRGB24ToRGB32_$2:
}
#endif
}
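Usage is straightforward; a hypothetical example for the RGB32->RGB24 direction (Pixels should be a multiple of 4, and the buffers are treated as UINT arrays):
UINT src32[256 * 4];                    // 1024 RGB32 pixels (4096 bytes), filled elsewhere
UINT dst24[256 * 3];                    // the same 1024 pixels packed as RGB24 (3072 bytes)
ConvRGB32ToRGB24(src32, dst24, 1024);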
And while we're at it, here are the same conversions in actual SSSE3 assembly. This only works if you have an assembler (FASM is free) and a CPU that supports SSSE3 (likely, but it's better to check). Note that the intrinsics don't necessarily output something this efficient; it totally depends on the tools you use and what platform you're compiling for. Here, it's straightforward: what you see is what you get. This code generates the same output as the x86 code above, and it's about 1.5x faster (on an i5 2500K).
format MS COFF
section '.text' code readable executable
public _ConvRGB32ToRGB24SSE3
; ebp + 8 Src (*RGB32, 16-byte aligned)
; ebp + 12 Dst (*RGB24, 16-byte aligned)
; ebp + 16 Pixels
_ConvRGB32ToRGB24SSE3:
push ebp
mov ebp, esp
mov eax, [ebp + 8]
mov edx, [ebp + 12]
mov ecx, [ebp + 16]
shr ecx, 4
jz done1
movupd xmm7, [mask1]
top1:
movupd xmm0, [eax + 0] ; sa = Src[0]
pshufb xmm0, xmm7 ; sa = _mm_shuffle_epi8(sa, mask)
movupd xmm1, [eax + 16] ; sb = Src[1]
pshufb xmm1, xmm7 ; sb = _mm_shuffle_epi8(sb, mask)
movupd xmm2, xmm1 ; sb1 = sb
pslldq xmm1, 12 ; sb = _mm_slli_si128(sb, 12)
por xmm0, xmm1 ; sa = _mm_or_si128(sa, sb)
movupd [edx + 0], xmm0 ; Dst[0] = sa
psrldq xmm2, 4 ; sb1 = _mm_srli_si128(sb1, 4)
movupd xmm0, [eax + 32] ; sc = Src[2]
pshufb xmm0, xmm7 ; sc = _mm_shuffle_epi8(sc, mask)
movupd xmm1, xmm0 ; sc1 = sc
pslldq xmm0, 8 ; sc = _mm_slli_si128(sc, 8)
por xmm0, xmm2 ; sc = _mm_or_si128(sb1, sc)
movupd [edx + 16], xmm0 ; Dst[1] = sc
psrldq xmm1, 8 ; sc1 = _mm_srli_si128(sc1, 8)
movupd xmm0, [eax + 48] ; sd = Src[3]
pshufb xmm0, xmm7 ; sd = _mm_shuffle_epi8(sd, mask)
pslldq xmm0, 4 ; sd = _mm_slli_si128(sd, 4)
por xmm0, xmm1 ; sd = _mm_or_si128(sc1, sd)
movupd [edx + 32], xmm0 ; Dst[2] = sd
add eax, 64
add edx, 48
dec ecx
jnz top1
done1:
pop ebp
ret
public _ConvRGB24ToRGB32SSE3
; ebp + 8 Src (*RGB24, 16-byte aligned)
; ebp + 12 Dst (*RGB32, 16-byte aligned)
; ebp + 16 Pixels
_ConvRGB24ToRGB32SSE3:
push ebp
mov ebp, esp
mov eax, [ebp + 8]
mov edx, [ebp + 12]
mov ecx, [ebp + 16]
shr ecx, 4
jz done2
movupd xmm7, [mask2]
top2:
movupd xmm0, [eax + 0] ; sa = Src[0]
movupd xmm1, [eax + 16] ; sb = Src[1]
movupd xmm2, [eax + 32] ; sc = Src[2]
movupd xmm3, xmm0 ; sa1 = sa
pshufb xmm0, xmm7 ; sa = _mm_shuffle_epi8(sa, mask)
movupd [edx], xmm0 ; Dst[0] = sa
movupd xmm4, xmm1 ; sb1 = sb
palignr xmm1, xmm3, 12 ; sb = _mm_alignr_epi8(sb, sa1, 12)
pshufb xmm1, xmm7 ; sb = _mm_shuffle_epi8(sb, mask);
movupd [edx + 16], xmm1 ; Dst[1] = sb
movupd xmm3, xmm2 ; sc1 = sc
palignr xmm2, xmm4, 8 ; sc = _mm_alignr_epi8(sc, sb1, 8)
pshufb xmm2, xmm7 ; sc = _mm_shuffle_epi8(sc, mask)
movupd [edx + 32], xmm2 ; Dst[2] = sc
palignr xmm3, xmm3, 4 ; sc1 = _mm_alignr_epi8(sc1, sc1, 4)
pshufb xmm3, xmm7 ; sc1 = _mm_shuffle_epi8(sc1, mask)
movupd [edx + 48], xmm3 ; Dst[3] = sc1
add eax, 48
add edx, 64
dec ecx
jnz top2
done2:
pop ebp
ret
section '.data' data readable writeable align 16
label mask1 dqword
db 0,1,2,4, 5,6,8,9, 10,12,13,14, -1,-1,-1,-1
label mask2 dqword
db 0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1
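To call these from 32-bit C code you would declare them with matching cdecl prototypes; a hypothetical sketch (the leading underscore in the FASM labels is the usual 32-bit C name decoration):
/* Hypothetical prototypes for the FASM routines above; Pixels is processed
   16 at a time and the buffers must be 16-byte aligned. */
extern void ConvRGB32ToRGB24SSE3(const unsigned *Src, unsigned *Dst, unsigned Pixels);
extern void ConvRGB24ToRGB32SSE3(const unsigned *Src, unsigned *Dst, unsigned Pixels);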
The different input/output sizes are not a barrier to using SIMD, just a speed bump. You would need to chunk the data so that you read and write in full SIMD words (16 bytes).
In this case, you would read 3 SIMD words (48 bytes == 16 RGB pixels), do the expansion, then write 4 SIMD words.
I'm just saying you can use SIMD, I'm not saying you should. The middle bit, the expansion, is still tricky since you have non-uniform shift sizes in different parts of the word.
SSE 4.1 ASM (loop body; it gathers four 24-bit pixels, clears the garbage top byte of each 32-bit lane, then streams out 16 bytes):
PINSRD XMM0, DWORD PTR[ESI], 0    ; pixel 0 (plus 1 stray byte of pixel 1)
PINSRD XMM0, DWORD PTR[ESI+3], 1  ; pixel 1
PINSRD XMM0, DWORD PTR[ESI+6], 2  ; pixel 2
PINSRD XMM0, DWORD PTR[ESI+9], 3  ; pixel 3 (reads 1 byte past the 12-byte group)
PSLLD XMM0, 8                     ; shift the stray byte out of each lane...
PSRLD XMM0, 8                     ; ...and shift back, leaving the top byte zero
MOVNTDQ [EDI], XMM0               ; non-temporal store of 4 expanded pixels
add ESI, 12
add EDI, 16
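For reference, here is roughly the same loop expressed with intrinsics; a sketch only, assuming num_pixels is a multiple of 4, a 16-byte-aligned destination (the non-temporal store requires it), and that reading one byte past the last 12-byte group is acceptable, just like the PINSRD at ESI+9 above. The dword loads here may or may not compile to PINSRD:
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2: _mm_setr_epi32, shifts, _mm_stream_si128 */
/* Hypothetical helper: expand 24-bit pixels into 32-bit lanes with the top byte zeroed. */
void expand_rgb24_to_rgb32(const uint8_t *src, uint8_t *dst, int num_pixels)
{
    for (int i = 0; i < num_pixels; i += 4) {
        uint32_t a, b, c, d;
        memcpy(&a, src + 0, 4);                       /* pixel 0 + 1 stray byte */
        memcpy(&b, src + 3, 4);                       /* pixel 1 + 1 stray byte */
        memcpy(&c, src + 6, 4);                       /* pixel 2 + 1 stray byte */
        memcpy(&d, src + 9, 4);                       /* pixel 3 + 1 byte overread */
        __m128i v = _mm_setr_epi32((int)a, (int)b, (int)c, (int)d);
        v = _mm_srli_epi32(_mm_slli_epi32(v, 8), 8);  /* clear the top byte of each lane */
        _mm_stream_si128((__m128i *)dst, v);          /* like MOVNTDQ */
        src += 12;
        dst += 16;
    }
}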