I'm using this function to convert an 8-bit binary number represented as a boolean array to an integer. Is it efficient? I'm using it in an embedded system. It performs OK, but I'm interested in opinions or suggestions for improvement (or replacement) if there are any.
uint8_t b2i( bool *bs ){
    uint8_t ret = 0;
    ret = bs[7] ? 1 : 0;
    ret += bs[6] ? 2 : 0;
    ret += bs[5] ? 4 : 0;
    ret += bs[4] ? 8 : 0;
    ret += bs[3] ? 16 : 0;
    ret += bs[2] ? 32 : 0;
    ret += bs[1] ? 64 : 0;
    ret += bs[0] ? 128 : 0;
    return ret;
}
It is not possible to say without a specific system in mind. Disassemble the code and see what you get, then benchmark it on the specific system. That is the key to understanding manual optimization.
Generally, there are lots of considerations. The CPU's data word size, instruction set, compiler optimizer performance, branch prediction (if any), data cache (if any) etc etc.
To make the code perform optimally regardless of data word size, you can change uint8_t to uint_fast8_t. That is unless you need exactly 8 bits, then leave it as uint8_t.
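For example, only the types change (a minimal sketch; keep uint8_t if the result must be exactly 8 bits):

#include <stdint.h>
#include <stdbool.h>

/* uint_fast8_t is at least 8 bits wide but may be wider, letting the
   compiler pick the fastest type for the target's word size. */
uint_fast8_t b2i (const bool *bs)
{
    uint_fast8_t ret = 0;
    ret += bs[7] ? 1 : 0;
    ret += bs[6] ? 2 : 0;
    ret += bs[5] ? 4 : 0;
    ret += bs[4] ? 8 : 0;
    ret += bs[3] ? 16 : 0;
    ret += bs[2] ? 32 : 0;
    ret += bs[1] ? 64 : 0;
    ret += bs[0] ? 128 : 0;
    return ret;
}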
Cache use may or may not be more efficient if given an up-counting loop. At any rate, loop unrolling is an old kind of manual optimization that we shouldn't use in modern programming - the compiler is more capable of making that call than the programmer.
The worst problem with the code is the numerous branches. These might cause a bottleneck.
Your code results in the following x86 machine code with gcc -O2:
b2i:
cmp BYTE PTR [rdi+6], 0
movzx eax, BYTE PTR [rdi+7]
je .L2
add eax, 2
.L2:
cmp BYTE PTR [rdi+5], 0
je .L3
add eax, 4
.L3:
cmp BYTE PTR [rdi+4], 0
je .L4
add eax, 8
.L4:
cmp BYTE PTR [rdi+3], 0
je .L5
add eax, 16
.L5:
cmp BYTE PTR [rdi+2], 0
je .L6
add eax, 32
.L6:
cmp BYTE PTR [rdi+1], 0
je .L7
add eax, 64
.L7:
lea edx, [rax-128]
cmp BYTE PTR [rdi], 0
cmovne eax, edx
ret
Whole lot of potentially inefficient branching. We can make the code both faster and more readable by using a loop:
uint8_t b2i (const bool bs[8])
{
    uint8_t result = 0;
    for(size_t i=0; i<8; i++)
    {
        result |= bs[8-1-i] << i;
    }
    return result;
}
(Ideally the bool array would be arranged LSB first, but that would change the meaning of the code compared to the original.)
Which gives this machine code instead:
b2i:
lea rsi, [rdi-8]
mov rax, rdi
xor r8d, r8d
.L2:
movzx edx, BYTE PTR [rax+7]
mov ecx, edi
sub ecx, eax
sub rax, 1
sal edx, cl
or r8d, edx
cmp rax, rsi
jne .L2
mov eax, r8d
ret
More instructions but less branching. It will likely perform better than your code on x86 and other high-end CPUs with branch prediction and instruction caches, but worse than your code on an 8-bit microcontroller, where only the total number of instructions counts.
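For illustration, the hypothetical LSB-first layout mentioned above would make the loop body trivial (a sketch; note the bit order differs from the original):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical variant where bs[0] holds the LSB instead of the MSB. */
uint8_t b2i_lsb (const bool bs[8])
{
    uint8_t result = 0;
    for (unsigned i = 0; i < 8; i++)
        result |= (uint8_t)(bs[i] << i);
    return result;
}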
You can also do this with a loop and bit shifts to reduce code repetition:
int b2i(bool *bs) {
    int ret = 0;
    for (int i = 0; i < 8; i++) {
        ret = ret << 1;
        ret += bs[i];
    }
    return ret;
}
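A minimal usage sketch, assuming one of the b2i definitions above is in scope (the array is MSB first, matching the original; the input bits are just an assumed test value):

#include <stdio.h>
#include <stdbool.h>

int main(void)
{
    /* 10100101 in binary, bits[0] = MSB: expected output 165 */
    bool bits[8] = {1, 0, 1, 0, 0, 1, 0, 1};
    printf("%d\n", b2i(bits));
    return 0;
}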
In short: try switching the foos pointer from 0 to 1 in this godbolt (Compiler Explorer) link - what is happening?
I was surprised at how many instructions came out of clang when I compiled the following C code, and I noticed that it only happens when the pointer foos is zero (x86-64 clang 12.0.1 with -O2 or -O3).
#include <stdint.h>
typedef uint8_t u8;
typedef uint32_t u32;
typedef struct {
    u32 x;
    u32 y;
} Foo;
u32 count = 500;
int main()
{
    u8 *foos = (u8 *)0;
    u32 element_size = 8;
    u32 offset = 0;
    for(u32 i=0;i<count;i++)
    {
        u32 *p = (u32 *)(foos + element_size*i);
        *p = i;
    }
    return 0;
}
This is the output when the pointer is zero.
main: # #main
mov r8d, dword ptr [rip + count]
test r8, r8
je .LBB0_6
lea rcx, [r8 - 1]
mov eax, r8d
and eax, 3
cmp rcx, 3
jae .LBB0_7
xor ecx, ecx
jmp .LBB0_3
.LBB0_7:
and r8d, -4
mov esi, 16
xor ecx, ecx
.LBB0_8: # =>This Inner Loop Header: Depth=1
lea edi, [rsi - 16]
and edi, -32
mov dword ptr [rdi], ecx
lea edi, [rsi - 8]
and edi, -24
lea edx, [rcx + 1]
mov dword ptr [rdi], edx
mov edx, esi
and edx, -16
lea edi, [rcx + 2]
mov dword ptr [rdx], edi
lea edx, [rsi + 8]
and edx, -8
lea edi, [rcx + 3]
mov dword ptr [rdx], edi
add rcx, 4
add rsi, 32
cmp r8, rcx
jne .LBB0_8
.LBB0_3:
test rax, rax
je .LBB0_6
lea rdx, [8*rcx]
.LBB0_5: # =>This Inner Loop Header: Depth=1
mov esi, edx
and esi, -8
mov dword ptr [rsi], ecx
add rdx, 8
add ecx, 1
add rax, -1
jne .LBB0_5
.LBB0_6:
xor eax, eax
ret
count:
.long 500 # 0x1f4
Can you please help me understand what is happening here? I don't know assembly very well. The AND with 3 suggests to me that there's some alignment branching. The top part of LBB0_8 looks very strange to me...
This is loop unrolling.
The code first checks if count is greater than 3, and if so, branches to LBB0_7, which sets up loop variables and drops into the loop at LBB0_8. This loop does 4 steps per iteration, as long as there are still 4 or more to do. Afterwards it falls through to the "slow path" at LBB0_3/LBB0_5 that just does one step per iteration.
That slow path is also very similar to what you get when you compile the code with a non-zero value for that pointer.
As for why this happens, I don't know. Initially I was thinking that the compiler proves that a NULL dereference will happen inside the loop and optimises on that, but usually that's akin to replacing the loop contents with __builtin_unreachable();, which makes it throw out the loop entirely. I still can't rule it out, but I've seen the compiler throw out code like that many times, so it seems at least unlikely that UB causes this.
Then I thought it might be the fact that 0 requires no additional calculation, but all it would have to change is mov esi, 16 to mov esi, 17, so it would be the same number of instructions.
What's also interesting is that on x86_64, it generates a loop with 4 steps per iteration, whereas on arm64 it generates one with 2 steps per iteration.
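For comparison, a hypothetical well-defined version of the same loop, writing into real storage instead of address 0, avoids the UB entirely (a sketch; the codegen should then look like the ordinary, possibly unrolled, loop):

#include <stdint.h>
#include <stdlib.h>

typedef uint8_t u8;
typedef uint32_t u32;

u32 count = 500;

int main(void)
{
    /* Real allocation instead of (u8 *)0, so every store is well defined. */
    u8 *foos = malloc((size_t)count * 8);
    if (!foos) return 1;
    u32 element_size = 8;
    for (u32 i = 0; i < count; i++) {
        u32 *p = (u32 *)(foos + element_size * i);
        *p = i;
    }
    free(foos);
    return 0;
}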
I don't understand what the problem is: the result is right, but there is something wrong in my code and I don't get what it is.
1. This is the x86 code I have to convert to C:
%include "io.inc"
SECTION .data
mask DD 0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555
SECTION .text
GLOBAL CMAIN
CMAIN:
GET_UDEC 4, EAX
MOV EBX, mask
ADD EBX, 16
MOV ECX, 1
.L:
MOV ESI, DWORD [EBX]
MOV EDI, ESI
NOT EDI
MOV EDX, EAX
AND EAX, ESI
AND EDX, EDI
SHL EAX, CL
SHR EDX, CL
OR EAX, EDX
SHL ECX, 1
SUB EBX, 4
CMP EBX, mask - 4
JNE .L
PRINT_UDEC 4, EAX
NEWLINE
XOR EAX, EAX
RET
2. This is my converted C code. When I input 0 it outputs the right answer, but there is still something wrong in it and I don't understand what:
#include "stdio.h"
int main(void)
{
    int mask [5] = {0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555};
    int eax;
    int esi;
    int ebx;
    int edi;
    int edx;
    char cl = 0;
    scanf("%d",&eax);
    ebx = mask[4];
    ebx = ebx + 16;
    int ecx = 1;
L:
    esi = ebx;
    edi = esi;
    edi = !edi;
    edx = eax;
    eax = eax && esi;
    edx = edx && edi;
    eax = eax << cl;
    edx = edx >> cl;
    eax = eax || edx;
    ecx = ecx << 1;
    ebx = ebx - 4;
    if(ebx == mask[1]) //mask - 4
    {
        goto L;
    }
    printf("%d",eax);
    return 0;
}
Assembly AND is C bitwise &, not logical &&. (Same for OR). So you want eax &= esi.
(Using &= "compound assignment" makes the C even look like x86-style 2-operand asm so I'd recommend that.)
NOT is also bitwise flip-all-the-bits, not booleanize to 0/1. In C that's edi = ~edi;
Read the manual for x86 instructions like https://www.felixcloutier.com/x86/not, and for C operators like ~ and ! to check that they are / aren't what you want. https://en.cppreference.com/w/c/language/expressions https://en.cppreference.com/w/c/language/operator_arithmetic
You should be single-stepping your C and your asm in a debugger so you notice the first divergence, and know which instruction / C statement to fix. Don't just run the whole thing and look at one number for the result! Debuggers are massively useful for asm; don't waste your time without one.
CL is the low byte of ECX, not a separate C variable. You could use a union between uint32_t and uint8_t in C, or just use eax <<= ecx&31; since you don't have anything that writes CL separately from ECX. (x86 shifts mask their count; that C statement could compile to shl eax, cl. https://www.felixcloutier.com/x86/sal:sar:shl:shr). The low 5 bits of ECX are also the low 5 bits of CL.
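A hypothetical union modelling that register overlap (it works here because x86 is little-endian, so the first byte of the 32-bit member is the low byte):

#include <stdint.h>

/* CL aliases the low byte of ECX; on a little-endian target the
   first byte of the uint32_t member is that low byte. */
typedef union {
    uint32_t ecx;
    uint8_t  cl;
} Reg;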
SHR is a logical right shift, not arithmetic, so you need to be using unsigned not int at least for the >>. But really just use it for everything.
You're handling EBX completely wrong; it's a pointer.
MOV EBX, mask
ADD EBX, 16
This is like unsigned int *ebx = mask+4;
The size of a dword is 4 bytes, but C pointer math scales by the type size, so +1 is a whole element, not 1 byte. So 16 bytes is 4 dwords = 4 unsigned int elements.
MOV ESI, DWORD [EBX]
That's a load using EBX as an address. This should be easy to see if you single-step the asm in a debugger: It's not just copying the value.
CMP EBX, mask - 4
JNE .L
This is NASM syntax; it's comparing against the address of the dword before the start of the array. It's effectively the bottom of a fairly normal do{}while loop. (Why are loops always compiled into "do...while" style (tail jump)?)
do { // .L
...
} while(ebx != &mask[-1]); // cmp/jne
It's looping from the end of the mask array, stopping when the pointer goes past the end.
Equivalently, the compare could be ebx != mask - 1. I wrote it with unary & (address-of) cancelling out the [] to make it clear that it's the address of what would be one element before the array.
Note that it's jumping on not equal; you had your if()goto backwards, jumping only on equality. This is a loop.
unsigned mask[] should be static because it's in section .data, not on the stack. And not const, because again it's in .data, not .rodata (Linux) or .rdata (Windows).
This one doesn't affect the logic, only that detail of decompiling.
There may be other bugs; I didn't try to check everything.
if(ebx != mask[1]) //mask - 4
{
    goto L;
}
//JNE IMPLIES a !=
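Putting all of those fixes together, the whole translation might look something like this (an untested sketch; ebx kept as a pointer and all values unsigned):

#include <stdio.h>

int main(void)
{
    static unsigned mask[5] = {0xffff, 0xff00ff, 0xf0f0f0f,
                               0x33333333, 0x55555555};
    unsigned eax, esi, edx;
    unsigned ecx = 1;
    unsigned *ebx = mask + 4;      /* MOV EBX, mask / ADD EBX, 16 */

    scanf("%u", &eax);             /* GET_UDEC 4, EAX */
    do {                           /* .L: */
        esi = *ebx;                /* MOV ESI, DWORD [EBX] */
        edx = eax;                 /* MOV EDX, EAX */
        eax &= esi;                /* AND EAX, ESI */
        edx &= ~esi;               /* AND EDX, EDI where EDI = NOT ESI */
        eax <<= ecx & 31;          /* SHL EAX, CL (x86 masks the count) */
        edx >>= ecx & 31;          /* SHR EDX, CL (logical: unsigned) */
        eax |= edx;                /* OR EAX, EDX */
        ecx <<= 1;                 /* SHL ECX, 1 */
        ebx--;                     /* SUB EBX, 4 (one dword element) */
    } while (ebx != &mask[-1]);    /* CMP EBX, mask - 4 / JNE .L
                                      (mirrors the asm; strictly out of
                                      bounds in ISO C) */
    printf("%u\n", eax);           /* PRINT_UDEC 4, EAX / NEWLINE */
    return 0;
}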
While studying I came across the use of (i + 1) mod SIZE to cycle through an array of elements.
So I wondered if this method was more efficient than an if-statement...
For example:
#define SIZE 15
int main(int argc, char *argv[]) {
    int items[SIZE];
    for(int i = 0; items[0] < 5; i = (i + 1) % SIZE) items[i] += 1;
    return 0;
}
It is more efficient than(?):
#define SIZE 15
int main(int argc, char *argv[]) {
    int items[SIZE];
    for(int i = 0; items[0] < 5; i++) {
        if(i == SIZE) i = 0;
        items[i] += 1;
    }
    return 0;
}
Thanks for the answers and your time.
You can check the assembly online (e.g. here). The result depends on the architecture and the optimization level, but without optimization and for x64 with GCC, you get this code (as a simple example).
Example 1:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
mov eax, DWORD PTR [rbp-4]
add eax, 1
movsx rdx, eax
imul rdx, rdx, -2004318071
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 3
cdq
sub ecx, edx
mov edx, ecx
mov DWORD PTR [rbp-4], edx
mov ecx, DWORD PTR [rbp-4]
mov edx, ecx
sal edx, 4
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
jmp .L3
.L2:
mov eax, 0
pop rbp
ret
Example 2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L4:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
cmp DWORD PTR [rbp-4], 15
jne .L3
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, 0
pop rbp
ret
You can see that, for this specific x86 case, the solution without the modulo is much shorter.
Although you are only asking about mod vs branch, there are probably more like five cases depending on the actual implementation of the mod and branch:
Modulus-based
Power-of-two
If the value of SIZE is known to the compiler and is a power of 2, the mod will compile into a single and like this and will be very efficient in performance and code size. The and is still part of the loop increment dependency chain though, putting a speed limit on the performance of 2 cycles per iteration unless the compiler is clever enough to unroll it and keep the and out of the carried chain (gcc and clang weren't).
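For example (a sketch, using a hypothetical power-of-two size of 16; the SIZE of 15 in the question is not a power of two):

/* For a power-of-two size, (i + 1) % 16u and the AND form compile
   to the same single-instruction mask on unsigned operands. */
static unsigned wrap_pow2(unsigned i)
{
    return (i + 1) & 15u;   /* equivalent to (i + 1) % 16u */
}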
Known, not power-of-two
On the other hand, if the value of SIZE is known but not a power of two, then you are likely to get a multiplication-based implementation of the fixed modulus value, like this. This generally takes something like 4-6 instructions, which end up part of the dependency chain. So this will limit your performance to something like 1 iteration every 5-8 cycles, depending exactly on the latency of the dependency chain.
Unknown
In your example SIZE is a known constant[1], but in the more general case where it is not known at compile time, you will get a division instruction on platforms that support it. Something like this.
That is good for code size, since it's a single instruction, but probably disastrous for performance because now you have a slow division instruction as part of the carried dependency for the loop. Depending on your hardware and the type of the SIZE variable, you are looking at 20-100 cycles per iteration.
Branch-based
You put a branch in your code, but the compiler may decide to implement it either as a conditional jump or as a conditional move. At -O2, gcc decides on a jump and clang on a conditional move.
Conditional Jump
This is the direct interpretation of your code: use a conditional branch to implement the i == SIZE condition.
It has the advantage of making the condition a control dependency, not a data dependency, so your loop will mostly run at full speed when the branch is not taken.
However, performance could be seriously impacted if the branch mispredicts often. That depends heavily on the value of SIZE and on your hardware. Modern Intel should be able to predict nested loops like this up to 20-something iterations, but beyond that it will mispredict once every time the inner loop is exited. Of course, if SIZE is very large then the single mispredict won't matter much anyway, so the worst case is SIZE just large enough to mispredict.
Conditional Move
clang uses a conditional move to update i. This is a reasonable option, but it does mean a carried data flow dependency of 3-4 cycles.
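Written out in C, the conditional-move-friendly form looks like this (a sketch; whether you actually get a cmov depends on the compiler, as noted above):

#define SIZE 15

/* Wrap-around increment as a ternary; at -O2 clang compiles this
   to a conditional move, gcc to a conditional jump. */
static int wrap_cmov(int i)
{
    i++;
    return (i == SIZE) ? 0 : i;
}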
[1] Either actually a constant like your example or effectively a constant due to inlining and constant propagation.
My question is basic, but I had a hard time finding anything on the internet.
Let's say I want to write a function in C that calls an external NASM function written in x86_64 assembly.
I want to pass the external function two char* of numbers, perform some arithmetic operations on the two, and return a char* of the result. My idea was to iterate over [rdi] and [rsi] somehow, saving the result in rax (i.e. add rax, [rdi], [rsi]), but I'm having a hard time actually doing so. What would be the right way to go over each character? Increasing [rsi] and [rdi]? Also, would I only need to move the value of the first character to rax?
Thanks in advance!
If you could post your assembly/C code, it would be easier to suggest changes.
For any assembly, I would start with C code (since I think in C :)), convert it to assembly using a compiler, and then optimize the assembly as needed. Assuming you need to write a function which takes two strings and adds them, returning the result as an int, like the following:
int ext_asm_func(unsigned char *arg1, unsigned char *arg2, int len)
{
    int i, result = 0;
    for(i=0; i<len; i++) {
        result += arg1[i] + arg2[i];
    }
    return result;
}
Here is assembly (generated by gcc https://godbolt.org/g/1N6vBT):
ext_asm_func(unsigned char*, unsigned char*, int):
test edx, edx
jle .L4
lea r9d, [rdx-1]
xor eax, eax
xor edx, edx
add r9, 1
.L3:
movzx ecx, BYTE PTR [rdi+rdx]
movzx r8d, BYTE PTR [rsi+rdx]
add rdx, 1
add ecx, r8d
add eax, ecx
cmp r9, rdx
jne .L3
rep ret
.L4:
xor eax, eax
ret
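A hypothetical caller for the external version (assuming the NASM file exports ext_asm_func with the same System V x86-64 signature as the C version above, i.e. arguments in rdi, rsi, edx):

#include <stdio.h>

/* Implemented in the external NASM file; same signature as the C
   reference version above. */
extern int ext_asm_func(unsigned char *arg1, unsigned char *arg2, int len);

int main(void)
{
    unsigned char a[] = "123";
    unsigned char b[] = "456";
    /* Sums the byte values of both strings: here the ASCII codes
       '1'+'2'+'3'+'4'+'5'+'6' = 309. */
    printf("%d\n", ext_asm_func(a, b, 3));
    return 0;
}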
Consider the following C code (assuming 80-bit long double) (note, I do know of memcmp, this is just an experiment):
enum { sizeOfFloat80=10 }; // NOTE: sizeof(long double) != sizeOfFloat80
_Bool sameBits1(long double x, long double y)
{
    for(int i=0;i<sizeOfFloat80;++i)
        if(((char*)&x)[i]!=((char*)&y)[i])
            return 0;
    return 1;
}
All compilers I checked (gcc, clang, icc on gcc.godbolt.org) generate similar code, here's an example for gcc with options -O3 -std=c11 -fomit-frame-pointer -m32:
sameBits1:
movzx eax, BYTE PTR [esp+16]
cmp BYTE PTR [esp+4], al
jne .L11
movzx eax, BYTE PTR [esp+17]
cmp BYTE PTR [esp+5], al
jne .L11
movzx eax, BYTE PTR [esp+18]
cmp BYTE PTR [esp+6], al
jne .L11
movzx eax, BYTE PTR [esp+19]
cmp BYTE PTR [esp+7], al
jne .L11
movzx eax, BYTE PTR [esp+20]
cmp BYTE PTR [esp+8], al
jne .L11
movzx eax, BYTE PTR [esp+21]
cmp BYTE PTR [esp+9], al
jne .L11
movzx eax, BYTE PTR [esp+22]
cmp BYTE PTR [esp+10], al
jne .L11
movzx eax, BYTE PTR [esp+23]
cmp BYTE PTR [esp+11], al
jne .L11
movzx eax, BYTE PTR [esp+24]
cmp BYTE PTR [esp+12], al
jne .L11
movzx eax, BYTE PTR [esp+25]
cmp BYTE PTR [esp+13], al
sete al
ret
.L11:
xor eax, eax
ret
This looks ugly, has branch on every byte and in fact doesn't seem to have been optimized at all (but at least the loop is unrolled). It's easy to see though that this could be optimized to the code equivalent to the following (and in general for larger data to use larger strides):
#include <string.h>
_Bool sameBits2(long double x, long double y)
{
    long long X=0; memcpy(&X,&x,sizeof x);
    long long Y=0; memcpy(&Y,&y,sizeof y);
    short Xhi=0; memcpy(&Xhi,sizeof x+(char*)&x,sizeof Xhi);
    short Yhi=0; memcpy(&Yhi,sizeof y+(char*)&y,sizeof Yhi);
    return X==Y && Xhi==Yhi;
}
And this code now gets a much nicer compilation result:
sameBits2:
sub esp, 20
mov edx, DWORD PTR [esp+36]
mov eax, DWORD PTR [esp+40]
xor edx, DWORD PTR [esp+24]
xor eax, DWORD PTR [esp+28]
or edx, eax
movzx eax, WORD PTR [esp+48]
sete dl
cmp WORD PTR [esp+36], ax
sete al
add esp, 20
and eax, edx
ret
So my question is: why is none of the three compilers able to do this optimization? Is it something very uncommon to see in C code?
Firstly, it is unable to do this optimization because you completely obfuscated the meaning of your code by overloading it with an undue amount of memory reinterpretation. Code like this justly makes the compiler react with "I don't know what on Earth this is, but if that's what you want, that's what you'll get". Why you expect the compiler to even bother to transform one kind of memory reinterpretation into another kind of memory reinterpretation (!) is completely unclear to me.
Secondly, it could probably be made to do it in theory, but it is probably not very high on the list of its priorities. Remember that code optimization is usually done by a pattern-matching algorithm, not by some kind of A.I. And this is just not one of the patterns it recognizes.
Most of the time your manual attempts to perform low-level optimization of the code will defeat compiler's effort to do the same. If you want to optimize it yourself, then go all the way. Don't expect to be able to start and then hand it over to the compiler to finish the job for you.
Comparison of two long double values x and y can be done very easily: x == y. If you want a bit-to-bit memory comparison, you will probably make the compiler's job easier by just using memcmp in a compiler that inherently knows what memcmp is (built-in, intrinsic function).
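For reference, the memcmp version suggested above might look like this (a sketch, still assuming the 80-bit format has 10 significant bytes, as in the question's enum):

#include <string.h>

enum { sizeOfFloat80 = 10 };

/* Bit-to-bit comparison of the 10 significant bytes; compilers that
   treat memcmp as a builtin can expand this to a few wide compares. */
_Bool sameBits3(long double x, long double y)
{
    return memcmp(&x, &y, sizeOfFloat80) == 0;
}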