passing an array of chars to external assembly function - c

So my question is basic, but I had a hard time finding anything on the internet.
Let's say I want to write a function in C that calls an external NASM function written in x86_64 assembly.
I want to pass two char* of numbers to the external function, perform some arithmetic operations on the two, and return a char* with the result. My idea was to iterate over [rdi] and [rsi] somehow and save the result in rax (i.e. add rax, [rdi], [rsi]), but I'm having a hard time actually doing so. What would be the right way to go over each character? Incrementing [rsi] and [rdi]? And also: would I only need to move the value of the first character into rax?
Thanks in advance!

If you could post assembly/C code, it would be easier to suggest changes.
For any assembly, I would start with C code (since I think in C :)), convert it to assembly using a compiler, and then optimize the assembly as needed. Assuming you need to write a function which takes two strings, adds them element by element, and returns the result as an int, like the following:
int ext_asm_func(unsigned char *arg1, unsigned char *arg2, int len)
{
    int i, result = 0;
    for (i = 0; i < len; i++) {
        result += arg1[i] + arg2[i];
    }
    return result;
}
Here is assembly (generated by gcc https://godbolt.org/g/1N6vBT):
ext_asm_func(unsigned char*, unsigned char*, int):
test edx, edx
jle .L4
lea r9d, [rdx-1]
xor eax, eax
xor edx, edx
add r9, 1
.L3:
movzx ecx, BYTE PTR [rdi+rdx]
movzx r8d, BYTE PTR [rsi+rdx]
add rdx, 1
add ecx, r8d
add eax, ecx
cmp r9, rdx
jne .L3
rep ret
.L4:
xor eax, eax
ret

Why is the first actual parameter printing as output in C?

#include <stdio.h>
int add(int a, int b)
{
    if (a > b)
        return a * b;
}
int main(void)
{
    printf("%d", add(3, 7));
    return 0;
}
Output:
3
In the above code, I am calling the function inside the print. In the function, the if condition is not true, so it won't execute. Then why am I getting 3 as output? I tried changing the first parameter to some other value, but it prints the same thing when the if condition is not satisfied.
What happens here is called undefined behaviour.
When (a <= b), you don't return any value (and your compiler probably told you so). But if you use the return value of the function anyway, even though the function didn't return anything, that value is garbage. In your case it is 3, but with another compiler or with other compiler flags it could be something else.
If your compiler didn't warn you, add the corresponding compiler flags. If your compiler is gcc or clang, use the -Wall compiler flag.
Jabberwocky is right: this is undefined behavior. You should turn your compiler warnings on and listen to them.
However, I think it can still be interesting to see what the compiler was thinking. And we have a tool to do just that: Godbolt Compiler Explorer.
We can plug your C program into Godbolt and see what assembly instructions it outputs. Here's the direct Godbolt link, and here's the assembly that it produces.
add:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
imul eax, DWORD PTR [rbp-8]
jmp .L1
.L2:
.L1:
pop rbp
ret
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
mov esi, 7
mov edi, 3
call add
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
pop rbp
ret
Again, to be perfectly clear, what you've done is undefined behavior. With different compiler flags or a different compiler version or even just a compiler that happens to feel like doing things differently on a particular day, you will get different behavior. What I'm studying here is the assembly output by gcc 12.2 on Godbolt with optimizations disabled, and I am not representing this as standard or well-defined behavior.
The generated code uses the System V AMD64 calling convention, common on Linux machines. In System V, the first two integer or pointer arguments are passed in the rdi and rsi registers, and integer values are returned in rax. Since everything we work with here is either an int or a char*, this is good enough for us. Note that the compiler seems to have been smart enough to figure out that it only needs edi, esi, and eax, the lower 32-bit halves of those registers, so I'll use edi, esi, and eax from this point on.
Our main function works fine. It does everything we'd expect. Our two function calls are here.
mov esi, 7
mov edi, 3
call add
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
To call add, we put 3 in the edi register and 7 in the esi register and then we make the call. We get the return value back from add in eax, and we move it to esi (since it will be the second argument to printf). We put the address of the static memory containing "%d" in edi (the first argument), and then we call printf. This is all normal. main knows that add was declared to return an integer, so it has the right to assume that, after calling add, there will be something useful in eax.
Now let's look at add.
add:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
imul eax, DWORD PTR [rbp-8]
jmp .L1
.L2:
.L1:
pop rbp
ret
The rbp and rsp shenanigans are standard function call fare and aren't specific to add. First, we store our two arguments on the call stack as local variables. Now here's where the undefined behavior comes in. Remember that I said eax holds the return value of our function. Whatever happens to be in eax when the function returns is the returned value.
We want to compare a and b. To do that, we need a to be in a register (lots of assembly instructions require their left-hand argument to be a register, while the right-hand can be a register, reference, immediate, or just about anything). So we load a into eax. Then we compare the value in eax to the value b on the call stack. If a > b, then the jle does nothing. We go down to the next two lines, which are the inside of your if statement. They correctly set eax and return a value.
However, if a <= b, then the jle instruction jumps to the end of the function without doing anything else to eax. Since the last thing in eax happened to be a (because we happened to use eax as our comparison register in cmp), that's what gets returned from our function.
But this really is just random. It's what the compiler happened to have put in that register previously. If I turn optimizations up (with -O3), then gcc inlines the whole function call and ends up printing out 0 rather than a. I don't know exactly what sequence of optimizations led to this conclusion, but since they started out by hinging on undefined behavior, the compiler is free to make what assumptions it chooses.

Efficiency difference between an if-statement and mod(SIZE)

While studying, I came across the use of (i + 1) mod SIZE to cycle through an array of elements.
So I wondered whether this method is more efficient than an if statement...
For example:
#define SIZE 15
int main(int argc, char *argv[]) {
    int items[SIZE];
    for (int i = 0; items[0] < 5; i = (i + 1) % SIZE)
        items[i] += 1;
    return 0;
}
Is it more efficient than this(?):
#define SIZE 15
int main(int argc, char *argv[]) {
    int items[SIZE];
    for (int i = 0; items[0] < 5; i++) {
        if (i == SIZE) i = 0;
        items[i] += 1;
    }
    return 0;
}
Thanks for the answers and your time.
You can check the assembly online (e.g. here). The result depends on the architecture and the optimization level, but without optimization and for x64 with GCC, you get this code (as a simple example).
Example 1:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
mov eax, DWORD PTR [rbp-4]
add eax, 1
movsx rdx, eax
imul rdx, rdx, -2004318071
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 3
cdq
sub ecx, edx
mov edx, ecx
mov DWORD PTR [rbp-4], edx
mov ecx, DWORD PTR [rbp-4]
mov edx, ecx
sal edx, 4
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
jmp .L3
.L2:
mov eax, 0
pop rbp
ret
Example 2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L4:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
cmp DWORD PTR [rbp-4], 15
jne .L3
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, 0
pop rbp
ret
You can see that, for this specific x86 case, the solution without the modulo is much shorter.
Although you are only asking about mod vs branch, there are really more like five cases, depending on the actual implementation of the mod and the branch:
Modulus-based
Power-of-two
If the value of SIZE is known to the compiler and is a power of 2, the mod will compile into a single and like this and will be very efficient in performance and code size. The and is still part of the loop increment dependency chain though, putting a speed limit on the performance of 2 cycles per iteration unless the compiler is clever enough to unroll it and keep the and out of the carried chain (gcc and clang weren't).
Known, not power-of-two
On the other hand, if the value of SIZE is known but not a power of two, then you are likely to get a multiplication-based implementation of the fixed modulus value, like this. This generally takes something like 4-6 instructions, which end up part of the dependency chain. So this will limit your performance to something like 1 iteration every 5-8 cycles, depending exactly on the latency of the dependency chain.
Unknown
In your example SIZE is a known constant, but in the more general case, where it is not known at compile time, you will get a division instruction on platforms that support it. Something like this.
That is good for code size, since it's a single instruction, but probably disastrous for performance because now you have a slow division instruction as part of the carried dependency for the loop. Depending on your hardware and the type of the SIZE variable, you are looking at 20-100 cycles per iteration.
Branch-based
You put a branch in your code, but the compiler may decide to implement it either as a conditional jump or as a conditional move. At -O2, gcc decides on a jump and clang on a conditional move.
Conditional Jump
This is the direct interpretation of your code: use a conditional branch to implement the i == SIZE condition.
It has the advantage of making the condition a control dependency, not a data dependency, so your loop will mostly run at full speed when the branch is not taken.
However, performance could be seriously impacted if the branch mispredicts often. That depends heavily on the value of SIZE and on your hardware. Modern Intel should be able to predict nested loops like this up to 20-something iterations, but beyond that it will mispredict once every time the inner loop is exited. Of course, if SIZE is very large then the single mispredict won't matter much anyway, so the worst case is a SIZE just large enough to mispredict.
Conditional Move
clang uses a conditional move to update i. This is a reasonable option, but it does mean a carried data flow dependency of 3-4 cycles.
1 Either actually a constant like your example or effectively a constant due to inlining and constant propagation.

Flip an image in assembly code

I'm working on a C-based program that uses assembly for image flipping. The pseudocode that is supposed to work is this one (always using images of 240x320):
voltearHorizontal(imgO, imgD){
    dirOrig = imgO;
    dirDest = imgD;
    dirOrig = dirOrig + 239*320; //bring the pointer to the first pixel of the last row
    for(f=0; f<240; f++){
        for(c=0; c<320; c++){
            [dirDest]=[dirOrig];
            dirOrig++;
            dirDest++;
        }
        dirOrig=dirOrig+640;//move the pointer to the first pixel of the upper row
    }
}
But when translated to assembly, in the result the first rows are not read, leaving that space black.
https://gyazo.com/7a76f147da96ae2bc27e109593ed6df8
this is the code I've written, that's supposed to work, and this one is what really happens to the image:
https://gyazo.com/2e389248d9959a786e736eecd3bf1531
Why, with this code, are the upper lines of pixels of the origin image not written/read into the second image? What part of the code did I get wrong?
I think I have no tags left to put for my problem; thanks for any help that can be given (on where I am wrong). Also, the horizontal flip (the one above is the vertical) simply ends the program unexpectedly:
https://gyazo.com/a7a18cf10ac3c06fc73a93d9e55be70c
Any special reason why you write it as slow assembler?
Why don't you just keep it in fast C++? https://godbolt.org/g/2oIpzt
#include <cstring>
void voltearHorizontal(const unsigned char* imgO, unsigned char* imgD) {
imgO += 239*320; //bring the pointer to the first pixel of the last row
for(unsigned f=0; f<240; ++f) {
memcpy(imgD, imgO, 320);
imgD += 320;
imgO -= 320;
}
}
It will be compiled with gcc 6.3 -O3 to:
voltearHorizontal(unsigned char const*, unsigned char*):
lea rax, [rdi+76480]
lea r8, [rdi-320]
mov rdx, rsi
.L2:
mov rcx, QWORD PTR [rax]
lea rdi, [rdx+8]
mov rsi, rax
sub rax, 320
and rdi, -8
mov QWORD PTR [rdx], rcx
mov rcx, QWORD PTR [rax+632]
mov QWORD PTR [rdx+312], rcx
mov rcx, rdx
add rdx, 320
sub rcx, rdi
sub rsi, rcx
add ecx, 320
shr ecx, 3
cmp rax, r8
rep movsq
jne .L2
rep ret
I.e. something like 800% more efficient than your inline asm.
Anyway, in your question the problem is:
dirOrig=dirOrig+640;//move the pixel to the first one of the upper row
You need to do -= 640 to return two lines up.
About that inline asm in the screenshots: put it as text into the question. But from a quick look at it, I would simply rewrite it in C++ and leave it to the compiler; you are doing many performance-wrong things in your asm, so I don't see any point in keeping it. Besides, inline asm is ugly, hard to maintain, and hard to write correctly.
I did check the asm in the picture. You keep the line counter in eax, but you use al to copy the pixel, so the copy destroys the line counter value.
Use a debugger next time.
BTW, your pictures are 320x240, not 240x320.

C pointers and references

I would like to know what's really happening calling & and * in C.
Does it cost a lot of resources? Should I use & each time I want to get the address of the same given variable, or keep it in memory, i.e. in a cache variable? Same for *, i.e. when I want to get a pointer's value?
Example
void bar(char *str)
{
    check_one(*str);
    check_two(*str);
    // ... Could be replaced by
    char c = *str;
    check_one(c);
    check_two(c);
}
I would like to know what's really happening calling & and * in C.
There's no such thing as "calling" & or *. They are the address operator, or the dereference operator, and instruct the compiler to work with the address of an object, or with the object that a pointer points to, respectively.
And C is not C++, so there's no references; I think you just misused that word in your question's title.
In most cases, that's basically two ways to look at the same thing.
Usually, you'll use & when you actually want the address of an object. Since the compiler needs to handle objects in memory with their address anyway, there's no overhead.
For the specific implications of using the operators, you'll have to look at the assembler your compiler generates.
Example: consider this trivial code, disassembled via godbolt.org:
#include <stdio.h>
#include <stdlib.h>

void check_one(char c)
{
    if (c == 'x')
        exit(0);
}

void check_two(char c)
{
    if (c == 'X')
        exit(1);
}

void foo(char *str)
{
    check_one(*str);
    check_two(*str);
}

void bar(char *str)
{
    char c = *str;
    check_one(c);
    check_two(c);
}

int main()
{
    char msg[] = "something";
    foo(msg);
    bar(msg);
}
The compiler output can vary wildly depending on the vendor and optimization settings.
clang 3.8 using -O2
check_one(char): # #check_one(char)
movzx eax, dil
cmp eax, 120
je .LBB0_2
ret
.LBB0_2:
push rax
xor edi, edi
call exit
check_two(char): # #check_two(char)
movzx eax, dil
cmp eax, 88
je .LBB1_2
ret
.LBB1_2:
push rax
mov edi, 1
call exit
foo(char*): # #foo(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB2_3
movzx eax, al
cmp eax, 120
je .LBB2_2
pop rax
ret
.LBB2_3:
mov edi, 1
call exit
.LBB2_2:
xor edi, edi
call exit
bar(char*): # #bar(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB3_3
movzx eax, al
cmp eax, 120
je .LBB3_2
pop rax
ret
.LBB3_3:
mov edi, 1
call exit
.LBB3_2:
xor edi, edi
call exit
main: # #main
xor eax, eax
ret
Notice that foo and bar are identical. Do other compilers do something similar? Well...
gcc x64 5.4 using -O2
check_one(char):
cmp dil, 120
je .L6
rep ret
.L6:
push rax
xor edi, edi
call exit
check_two(char):
cmp dil, 88
je .L11
rep ret
.L11:
push rax
mov edi, 1
call exit
bar(char*):
sub rsp, 8
movzx eax, BYTE PTR [rdi]
cmp al, 120
je .L16
cmp al, 88
je .L17
add rsp, 8
ret
.L16:
xor edi, edi
call exit
.L17:
mov edi, 1
call exit
foo(char*):
jmp bar(char*)
main:
sub rsp, 24
movabs rax, 7956005065853857651
mov QWORD PTR [rsp], rax
mov rdi, rsp
mov eax, 103
mov WORD PTR [rsp+8], ax
call bar(char*)
mov rdi, rsp
call bar(char*)
xor eax, eax
add rsp, 24
ret
Well, if there were any doubt that foo and bar are equivalent, at least to the compiler, I think this:
foo(char*):
jmp bar(char*)
is a strong argument they indeed are.
In C, there's no hidden runtime cost associated with either the unary & or * operator; the address arithmetic is resolved at compile time, and no function call is involved. So there's no difference in runtime between
check_one(*str)
check_two(*str)
and
char c = *str;
check_one( c );
check_two( c );
ignoring the overhead of the assignment.
That's not necessarily true in C++, since you can overload those operators.
tldr;
If you are programming in C, then the & operator is used to obtain the address of a variable, and * is used to get the value of that variable, given its address.
This is also the reason why in C, when you pass a string to a function, you must also pass its length; otherwise, if someone unfamiliar with your logic sees the function signature, they cannot tell whether the function is called as bar(&some_char) or bar(some_cstr).
To conclude, if you have a variable x of type someType, then &x will result in a someType* (say, addressOfX), and *addressOfX will give back the value of x. Function parameters in C are passed by value or as pointers; you cannot declare a parameter of type &x or &&x, because C has no references.
Also your examples can be rewritten as:
check_one(str[0])
check_two(str[0])
AFAIK, in x86 and x64 your variables are stored in memory (if not declared with the register keyword) and accessed through their addresses.
const int foo = 5 is equivalent to foo dd 5, and check_one(*foo) is equivalent to push dword [foo]; call check_one.
If you create additional variable c, then it looks like:
c resd 1
...
mov eax, [foo]
mov dword [c], eax ; Variable foo just copied to c
push dword [c]
call check_one
And nothing changed, except additional copying and memory allocation.
I think the compiler's optimizer deals with this and makes both cases as fast as possible, so you can use the more readable variant.

Why does no compiler appear able to optimize this code?

Consider the following C code (assuming 80-bit long double) (note, I do know of memcmp, this is just an experiment):
enum { sizeOfFloat80 = 10 }; // NOTE: sizeof(long double) != sizeOfFloat80
_Bool sameBits1(long double x, long double y)
{
    for (int i = 0; i < sizeOfFloat80; ++i)
        if (((char*)&x)[i] != ((char*)&y)[i])
            return 0;
    return 1;
}
All compilers I checked (gcc, clang, icc on gcc.godbolt.org) generate similar code, here's an example for gcc with options -O3 -std=c11 -fomit-frame-pointer -m32:
sameBits1:
movzx eax, BYTE PTR [esp+16]
cmp BYTE PTR [esp+4], al
jne .L11
movzx eax, BYTE PTR [esp+17]
cmp BYTE PTR [esp+5], al
jne .L11
movzx eax, BYTE PTR [esp+18]
cmp BYTE PTR [esp+6], al
jne .L11
movzx eax, BYTE PTR [esp+19]
cmp BYTE PTR [esp+7], al
jne .L11
movzx eax, BYTE PTR [esp+20]
cmp BYTE PTR [esp+8], al
jne .L11
movzx eax, BYTE PTR [esp+21]
cmp BYTE PTR [esp+9], al
jne .L11
movzx eax, BYTE PTR [esp+22]
cmp BYTE PTR [esp+10], al
jne .L11
movzx eax, BYTE PTR [esp+23]
cmp BYTE PTR [esp+11], al
jne .L11
movzx eax, BYTE PTR [esp+24]
cmp BYTE PTR [esp+12], al
jne .L11
movzx eax, BYTE PTR [esp+25]
cmp BYTE PTR [esp+13], al
sete al
ret
.L11:
xor eax, eax
ret
This looks ugly, has a branch on every byte, and in fact doesn't seem to have been optimized at all (though at least the loop is unrolled). It's easy to see, though, that this could be optimized to code equivalent to the following (and in general, for larger data, to use larger strides):
#include <string.h>
_Bool sameBits2(long double x, long double y)
{
    long long X = 0; memcpy(&X, &x, sizeof X);
    long long Y = 0; memcpy(&Y, &y, sizeof Y);
    short Xhi = 0; memcpy(&Xhi, sizeof X + (char*)&x, sizeof Xhi);
    short Yhi = 0; memcpy(&Yhi, sizeof Y + (char*)&y, sizeof Yhi);
    return X == Y && Xhi == Yhi;
}
And this code now gets much nicer compilation result:
sameBits2:
sub esp, 20
mov edx, DWORD PTR [esp+36]
mov eax, DWORD PTR [esp+40]
xor edx, DWORD PTR [esp+24]
xor eax, DWORD PTR [esp+28]
or edx, eax
movzx eax, WORD PTR [esp+48]
sete dl
cmp WORD PTR [esp+36], ax
sete al
add esp, 20
and eax, edx
ret
So my question is: why is none of the three compilers able to do this optimization? Is it something very uncommon to see in C code?
Firstly, it is unable to do this optimization because you completely obfuscated the meaning of your code by overloading it with an undue amount of memory reinterpretation. Code like this justly makes the compiler react with "I don't know what on Earth this is, but if that's what you want, that's what you'll get". Why you expect the compiler to even bother to transform one kind of memory reinterpretation into another kind of memory reinterpretation (!) is completely unclear to me.
Secondly, it can probably be made to do it in theory, but it is probably not very high on the list of its priorities. Remember, that code optimization is usually done by a pattern matching algorithm, not by some kind of A.I. And this is just not one of the patterns it recognizes.
Most of the time your manual attempts to perform low-level optimization of the code will defeat compiler's effort to do the same. If you want to optimize it yourself, then go all the way. Don't expect to be able to start and then hand it over to the compiler to finish the job for you.
Comparison of two long double values x and y can be done very easily: x == y. If you want a bit-to-bit memory comparison, you will probably make the compiler's job easier by just using memcmp in a compiler that inherently knows what memcmp is (built-in, intrinsic function).
