I am interested in trying F# in a high-performance application. I do not want to have a large array's bounds checked during iteration and the lack of break/return statements is concerning.
This is a contrived example that will break upon finding a value, but can someone tell me if bounds checking is also elided?
let innerExists (item: Char) (items: Char array): bool =
    let mutable state = false
    let mutable i = 0
    while not state && i < items.Length do
        state <- item = items.[i]
        i <- i + 1
    state
let exists (input: Char array)(illegalChars: Char array): bool =
    let mutable state = false
    let mutable i = 0
    while not state && i < input.Length do
        state <- innerExists input.[i] illegalChars
        i <- i + 1
    state
exists [|'A'..'z'|] [|'.';',';';'|]
Here is the relevant disassembly:
while not state && i < input.Length do
000007FE6EB4237A cmp dword ptr [rbp-14h],0
000007FE6EB4237E jne 000007FE6EB42383
000007FE6EB42380 nop
000007FE6EB42381 jmp 000007FE6EB42386
000007FE6EB42383 nop
000007FE6EB42384 jmp 000007FE6EB423A9
000007FE6EB42386 nop
000007FE6EB42387 mov r8d,dword ptr [rbp-18h]
000007FE6EB4238B mov rdx,qword ptr [rbp+18h]
000007FE6EB4238F cmp r8d,dword ptr [rdx+8]
000007FE6EB42393 setl r8b
000007FE6EB42397 movzx r8d,r8b
000007FE6EB4239B mov dword ptr [rbp-24h],r8d
000007FE6EB4239F mov r8d,dword ptr [rbp-24h]
000007FE6EB423A3 mov dword ptr [rbp-1Ch],r8d
000007FE6EB423A7 jmp 000007FE6EB423B1
000007FE6EB423A9 nop
000007FE6EB423AA xor r8d,r8d
000007FE6EB423AD mov dword ptr [rbp-1Ch],r8d
000007FE6EB423B1 cmp dword ptr [rbp-1Ch],0
000007FE6EB423B5 je 000007FE6EB42409
state <- innerExists input.[i] illegalChars
000007FE6EB423B7 mov r8d,dword ptr [rbp-18h]
000007FE6EB423BB mov rdx,qword ptr [rbp+18h]
000007FE6EB423BF cmp r8,qword ptr [rdx+8]
000007FE6EB423C3 jb 000007FE6EB423CA
000007FE6EB423C5 call 000007FECD796850
000007FE6EB423CA lea rdx,[rdx+r8*2+10h]
000007FE6EB423CF movzx r8d,word ptr [rdx]
000007FE6EB423D3 mov rdx,qword ptr [rbp+10h]
000007FE6EB423D7 mov rdx,qword ptr [rdx+8]
000007FE6EB423DB mov r9,qword ptr [rbp+20h]
000007FE6EB423DF mov rcx,7FE6EEE0640h
000007FE6EB423E9 call 000007FE6EB41E40
000007FE6EB423EE mov dword ptr [rbp-20h],eax
000007FE6EB423F1 mov eax,dword ptr [rbp-20h]
000007FE6EB423F4 movzx eax,al
000007FE6EB423F7 mov dword ptr [rbp-14h],eax
i <- i + 1
000007FE6EB423FA mov eax,dword ptr [rbp-18h]
Others pointed out that existing functions in FSharp.Core achieve the same result, but I think the OP is asking whether the array bounds checks are elided in loops like these (since the index is already checked in the loop condition).
For simple code like the above, the jitter should be able to elide the checks. Checking the assembly code is the right way to see this, but it is important not to run with the VS debugger attached, because the jitter doesn't optimize the code then. The reason is that optimized code can make it impossible for the debugger to show correct values.
First let's look at exists optimized x64:
; not state?
00007ff9`1cd37551 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd37553 7521 jne 00007ff9`1cd37576
; i < input.Length?
00007ff9`1cd37555 395e08 cmp dword ptr [rsi+8],ebx
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd37558 0f9fc1 setg cl
00007ff9`1cd3755b 0fb6c9 movzx ecx,cl
00007ff9`1cd3755e 85c9 test ecx,ecx
; if we have reached end of the array then exit
00007ff9`1cd37560 7414 je 00007ff9`1cd37576
; mov i in ebx to rcx, unnecessary but moves like these are very cheap
00007ff9`1cd37562 4863cb movsxd rcx,ebx
; input.[i] (note we don't check the boundary again)
00007ff9`1cd37565 0fb74c4e10 movzx ecx,word ptr [rsi+rcx*2+10h]
; move illegalChars pointer to rdx
00007ff9`1cd3756a 488bd7 mov rdx,rdi
; call innerExists
00007ff9`1cd3756d e8ee9affff call 00007ff9`1cd31060
; i <- i + 1
00007ff9`1cd37572 ffc3 inc ebx
; Jump top of loop
00007ff9`1cd37574 ebdb jmp 00007ff9`1cd37551
; We are done!
00007ff9`1cd37576
So the code looks a bit too complex for what it should need to be but it seems it only checks the array condition once.
Now let's look at innerExists optimized x64:
; let mutable state = false
00007ff9`1cd375a0 33c0 xor eax,eax
; let mutable i = 0
00007ff9`1cd375a2 4533c0 xor r8d,r8d
; not state?
00007ff9`1cd375a5 85c0 test eax,eax
; if state is true then exit the loop
00007ff9`1cd375a7 752b jne 00007ff9`1cd375d4
; i < items.Length
00007ff9`1cd375a9 44394208 cmp dword ptr [rdx+8],r8d
; Seems overly complex but perhaps this is as good as it gets?
00007ff9`1cd375ad 410f9fc1 setg r9b
00007ff9`1cd375b1 450fb6c9 movzx r9d,r9b
00007ff9`1cd375b5 4585c9 test r9d,r9d
; if we have reached end of the array then exit
00007ff9`1cd375b8 741a je 00007ff9`1cd375d4
; mov i in r8d to rax, unnecessary but moves like these are very cheap
00007ff9`1cd375ba 4963c0 movsxd rax,r8d
; items.[i] (note we don't check the boundary again)
00007ff9`1cd375bd 0fb7444210 movzx eax,word ptr [rdx+rax*2+10h]
; mov item in cx to r9d, unnecessary but moves like these are very cheap
00007ff9`1cd375c2 440fb7c9 movzx r9d,cx
; item = items.[i]?
00007ff9`1cd375c6 413bc1 cmp eax,r9d
00007ff9`1cd375c9 0f94c0 sete al
; state <- ?
00007ff9`1cd375cc 0fb6c0 movzx eax,al
; i <- i + 1
00007ff9`1cd375cf 41ffc0 inc r8d
; Jump top of loop
00007ff9`1cd375d2 ebd1 jmp 00007ff9`1cd375a5
; We are done!
00007ff9`1cd375d4 c3 ret
So this too looks overly complex for what it does, but at least it looks like it only checks the array bound once per iteration.
So finally, it looks like the jitter eliminates the array boundary checks because it can prove this has already been checked successfully in the loop condition which I believe is what the OP wondered.
The x64 code doesn't look as clean as it could but from my experimentation x64 code that is cleaned up doesn't perform that much better, I suspect the CPU vendors optimize the CPU for the crappy code jitters produce.
An interesting exercise would be to code up an equivalent program in C++ and run it through https://godbolt.org/, choose x86-64 gcc (trunk) (gcc seems to do best right now) and specify the options -O3 -march=native and see the resulting x64 code.
Update
The code rewritten for https://godbolt.org/ to let us see the assembly code generated by a C++ compiler:
template<int N>
bool innerExists(char item, char const (&items)[N]) {
for (auto i = 0; i < N; ++i) {
if (item == items[i]) return true;
}
return false;
}
template<int N1, int N2>
bool exists(char const (&input)[N1], char const (&illegalCharacters)[N2]) {
for (auto i = 0; i < N1; ++i) {
if (innerExists(input[i], illegalCharacters)) return true;
}
return false;
}
char const separators[] = { '.', ',', ';' };
char const str[58] = { };
bool test() {
return exists(str, separators);
}
With x86-64 gcc (trunk) and the options -O3 -march=native, the following code is generated:
; Load the string to test into edx
mov edx, OFFSET FLAT:str+1
.L2:
; Have we reached the end?
cmp rdx, OFFSET FLAT:str+58
; If yes, then jump to the end
je .L7
; Load a character
movzx ecx, BYTE PTR [rdx]
; The comparisons against the 3 separators are encoded directly in the
; instructions because the compiler detected the array is always the same
mov eax, ecx
and eax, -3
cmp al, 44
sete al
cmp cl, 59
sete cl
; increase outer i
inc rdx
; Did we find a match?
or al, cl
; If no then loop to .L2
je .L2
; We are done!
ret
.L7:
; No match found, clear result
xor eax, eax
; We are done!
ret
Looks pretty good, but what I am missing in the code above is the use of AVX to test multiple characters at once.
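The and eax, -3 / cmp al, 44 pair in the gcc listing can look puzzling at first. Here is a minimal C sketch of what gcc appears to have done, assuming ASCII: '.' is 46 and ',' is 44, which differ only in bit 1, so clearing that bit lets one comparison cover both. The function names are made up for illustration.

```c
#include <stdbool.h>

/* Straightforward version: one comparison per separator. */
bool is_separator_naive(unsigned char c) {
    return c == '.' || c == ',' || c == ';';
}

/* Folded version mirroring the generated code (assumes ASCII):
   '.' (46) and ',' (44) differ only in bit 1, so (c & ~2) maps both to 44.
   ';' (59) still needs its own comparison. */
bool is_separator_folded(unsigned char c) {
    bool dot_or_comma = (c & ~2u) == 44u;  /* and eax, -3 ; cmp al, 44 ; sete al */
    bool semicolon = c == 59;              /* cmp cl, 59 ; sete cl */
    return dot_or_comma | semicolon;       /* or al, cl */
}
```

Both functions agree for every byte value; the folded form simply trades one comparison and a branch for a mask.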
Bounds checks are eliminated by the JIT compiler, so it works the same for F# as for C#. You can expect elimination for code as in your example, as well as for
for i = 0 to data.Length - 1 do
...
and also for tail-recursive functions, which compile down to loops.
The built in Array.contains and Array.exists (source code) are written so that the JIT compiler can eliminate bounds checks.
What's wrong with the Array.contains and Array.exists functions?
let exists input illegalChars =
input |> Array.exists (fun c -> illegalChars |> Array.contains c)
First of all, I am a student and do not yet have extensive knowledge of C, C++ and assembler, so I am making an extreme effort to understand it.
I have this piece of assembly code from an Intel x86-32 bit processor.
My goal is to transform it to source code.
0x80483dc <main>: push ebp
0x80483dd <main+1>: mov ebp,esp
0x80483df <main+3>: sub esp,0x10
0x80483e2 <main+6>: mov DWORD PTR [ebp-0x8],0x80484d0
0x80483e9 <main+13>: lea eax,[ebp-0x8]
0x80483ec <main+16>: mov DWORD PTR [ebp-0x4],eax
0x80483ef <main+19>: mov eax,DWORD PTR [ebp-0x4]
0x80483f2 <main+22>: mov edx,DWORD PTR [eax+0xc]
0x80483f5 <main+25>: mov eax,DWORD PTR [ebp-0x4]
0x80483f8 <main+28>: movzx eax,WORD PTR [eax+0x10]
0x80483fc <main+32>: cwde
0x80483fd <main+33>: add edx, eax
0x80483ff <main+35>: mov eax,DWORD PTR [ebp-0x4]
0x8048402 <main+38>: mov DWORD PTR [eax+0xc],edx
0x8048405 <main+41>: mov eax,DWORD PTR [ebp-0x4]
0x8048408 <main+44>: movzx eax,BYTE PTR [eax]
0x804840b <main+47>: cmp al,0x4f
0x804840d <main+49>: jne 0x8048419 <main+61>
0x804840f <main+51>: mov eax,DWORD PTR [ebp-0x4]
0x8048412 <main+54>: movzx eax,BYTE PTR [eax]
0x8048415 <main+57>: cmp al,0x4b
0x8048417 <main+59>: je 0x804842d <main+81>
0x8048419 <main+61>: mov eax,DWORD PTR [ebp-0x4]
0x804841c <main+64>: mov eax,DWORD PTR [eax+0xc]
0x804841f <main+67>: mov edx, eax
0x8048421 <main+69>: and edx,0xf0f0f0f
0x8048427 <main+75>: mov eax,DWORD PTR [ebp-0x4]
0x804842a <main+78>: mov DWORD PTR [eax+0x4],edx
0x804842d <main+81>: mov eax,0x0
0x8048432 <main+86>: leave
0x8048433 <main+87>: ret
This is what I understand from the code:
There are 4 variables:
a = [ebp-0x8] ebp
b = [ebp-0x4] eax
c = [eax + 0xc] edx
d = [eax + 0x10] eax
Values:
0x4 = 4
0x8 = 8
0xc = 12
0x10 = 16
0x4b = 75
0x4f = 79
Types:
char (8 bits) = 1 BYTE
short (16 bits) = WORD
int (32 bit) = DWORD
long (32 bits) = DWORD
long long (64 bits) = QWORD
This is what I was able to create:
#include <stdio.h>
int main (void)
{
int a = 0x80484d0;
int b
short c;
int d;
c + b?
if (79 <= al) {
instructions
} else {
instructions
}
return 0
}
But I'm stuck. Nor can I understand what the sentence "cmp al .." compares to, what is "al"?
How do these instructions work?
EDIT1:
That said, as you comment, the assembly seems to be wrong or, as someone says in the comments, insane!
The code and the exercise are from the following book called: "Reversing, Reverse Engineering" on page 140 (3.8 Proposed Exercises). It would never have occurred to me that it was wrong, if so, this clearly makes it difficult for me to learn ...
So it is not possible to do a reversing to get the source code because it is not a good assembly? Maybe I am not oppressed? Is it possible to optimize it?
EDIT2:
Hi!
I did ask and finally she says this should be the c code:
int foo(void){
char *string;//ebp-0x8
unsigned int *pointerstring;//[ebp-0x4]
unsigned int *position;
*position = *(pointerstring+0xc);
unsigned char character;
character=(unsigned char) string[*position];
if ((character != 0x4f)||(character != 0x4b))
{
*(position+0x4)=(unsigned int)(*position & 0x0f0f0f0f);
}
return(0);
}
Does it make any sense at all to you? Could someone please explain this to me?
Does anyone really program like this?
Thanks very much!
Your assembly is completely insane. This is roughly equivalent C:
int main() {
int i = 0x80484d0; // in ebp-8
int *p = &i; // in ebp-4
p[3] += (short)p[4]; // add argc to the return address(!)
if((char)*p != 0x4f || (char)*p != 0x4b) // always true because of || instead of &&
p[1] = p[3] & 0xf0f0f0f; // note that p[1] is p
return 0;
}
It should be immediately obvious that this is horrifically bad code that almost certainly won't do what the programmer intended.
The x86 assembly language follows a long legacy and has mostly kept compatibility. We need to go back to the 8086/8088 chip where that story starts. These were 16-bit processors, which means that their registers had a word size of 16 bits. The general-purpose registers were named AX, BX, CX and DX. The 8086 had instructions to manipulate the upper and lower 8-bit parts of these registers, which were named AH, AL, BH, BL, CH, CL, DH and DL. This Wikipedia page describes this, please take a look.
The 32 bit versions of these registers have an E in front: EAX, EBX, ECX, etc.
The particular instruction you mention, e.g, cmp al,0x4f is comparing the lower byte of the AX register with 0x4f. The comparison is effectively the same as a subtraction, but does not save the result, only sets the flags.
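To make the register naming concrete, here is a small C sketch that models EAX as a 32-bit integer and extracts AL, AH and AX from it, plus a model of what cmp does to the zero flag. The helper names are hypothetical and only illustrate the relationship; they are not part of any real API.

```c
#include <stdbool.h>
#include <stdint.h>

/* AL is bits 0..7 of EAX. */
uint8_t low_byte_al(uint32_t eax) {
    return (uint8_t)(eax & 0xFFu);
}

/* AH is bits 8..15 of EAX. */
uint8_t high_byte_ah(uint32_t eax) {
    return (uint8_t)((eax >> 8) & 0xFFu);
}

/* AX is bits 0..15 of EAX. */
uint16_t low_word_ax(uint32_t eax) {
    return (uint16_t)(eax & 0xFFFFu);
}

/* cmp al, imm8: subtract, throw the result away, keep only the flags.
   This models the zero flag, which je/jne test afterwards. */
bool cmp_sets_zero_flag(uint8_t al, uint8_t imm) {
    return (uint8_t)(al - imm) == 0;
}
```

With EAX holding 0x1234564F, AL is 0x4F, so cmp al,0x4f would set the zero flag and a following jne would not be taken.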
For the 8086 instruction set, there is a nice reference here. Your program is 32-bit code, so you will need at least an 80386 instruction reference.
You have analyzed variables, and that's a good place to start. You should try to add type annotations to them, size, as you started, and, when used as pointers (like b), pointers to what kind/size.
I might update your variable chart as follows, knowing that [ebp-4] is b:
c = [b + 0xc]
d = [b + 0x10]
e = [b + 0], size = byte
Another thing to analyze is the control flow. For most instructions control flow is sequential, but certain instructions purposefully alter it. Broadly speaking, when the pc is moved forward, it skips some code and when the pc is moved backward it repeats some code it already ran. Skipping code is used to construct if-then, if-then-else, and statements that break out of loops. Jumping back is used to continue looping.
Some instructions, called conditional branches, test a dynamic condition: if it is true, they skip forward (or backward); if it is false, execution simply advances to the next instruction (sometimes called conditional branch fall-through).
The control sequences here:
...
0x8048405 <main+41>: mov eax,DWORD PTR [ebp-0x4] b
0x8048408 <main+44>: movzx eax,BYTE PTR [eax] b->e
0x804840b <main+47>: cmp al,0x4f b->e <=> 'O'
0x804840d <main+49>: jne 0x8048419 <main+61> b->e != 'O' skip to 61
** we know that the letter, b->e, must be 'O' here
0x804840f <main+51>: mov eax,DWORD PTR [ebp-0x4] b
0x8048412 <main+54>: movzx eax,BYTE PTR [eax] b->e
0x8048415 <main+57>: cmp al,0x4b b->e <=> 'K'
0x8048417 <main+59>: je 0x804842d <main+81> b->e == 'K' skip to 81
** we know that the letter, b->e, must not be 'K' here if we fall thru the above je
** this line can be reached by taken branch jne or by fall thru je
0x8048419 <main+61>: mov eax,DWORD PTR [ebp-0x4] ******
...
When the flow of control reaches this last line (tagged ******), we know that the letter is either not 'O' or not 'K'.
The construct where the jne instruction is used to skip another test is a short-circuit || operator. Thus the control construct is:
if ( b->e != 'O' || b->e != 'K' ) {
then-part
}
As these two conditional branches are the only flow control modifications in the function, there is no else part of the if, and there are no loops or other if's.
This code appears to have a slight problem.
If the value is not 'O', the then-part will fire from the first test. However, if we reach the 2nd test, we already know the letter is 'O', so testing it for 'K' is silly and will be true ('O' is not 'K').
Thus, this if-then will always fire.
It is either very inefficient or a bug: perhaps it is the next letter along in the (presumably) string that should be tested for 'K', not the very same letter.
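The point can be checked directly in C. The sketch below, with function names invented for illustration, contrasts the condition as the assembly implements it with two plausible intended versions; which one the original author meant is a guess.

```c
#include <stdbool.h>

/* Condition as found in the disassembly: true for every possible character. */
bool test_as_compiled(char c) {
    return c != 'O' || c != 'K';
}

/* Possible intent 1 (assumption): the character is neither 'O' nor 'K'. */
bool test_with_and(char c) {
    return c != 'O' && c != 'K';
}

/* Possible intent 2 (assumption): the string does not start with "OK". */
bool test_next_letter(const char *s) {
    return s[0] != 'O' || s[1] != 'K';
}
```

Only the first function matches the listing; the other two show how small the source-level change would be for the test to mean something.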
Studying I found the use of the (i+1)mod(SIZE) to perform a cycle in an array of elements.
So I wondered if this method was more efficient than an if-statement...
For example:
#define SIZE 15
int main(int argc, char *argv[]) {
int items[SIZE];
for(int i = 0; items[0] < 5; i = (i + 1) % SIZE) items[i] += 1;
return 0;
}
It is more efficient than(?):
#define SIZE 15
int main(int argc, char *argv[]) {
int items[SIZE];
for(int i = 0; items[0] < 5; i++) {
if(i == SIZE) i = 0;
items[i] += 1;
}
return 0;
}
Thanks for the answers and your time.
You can check the assembly online (i. e. here). The result depends on the architecture and the optimization, but without optimization and for x64 with GCC, you get this code (as a simple example).
Example 1:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
mov eax, DWORD PTR [rbp-4]
add eax, 1
movsx rdx, eax
imul rdx, rdx, -2004318071
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 3
cdq
sub ecx, edx
mov edx, ecx
mov DWORD PTR [rbp-4], edx
mov ecx, DWORD PTR [rbp-4]
mov edx, ecx
sal edx, 4
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
jmp .L3
.L2:
mov eax, 0
pop rbp
ret
Example 2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L4:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
cmp DWORD PTR [rbp-4], 15
jne .L3
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, 0
pop rbp
ret
You can see that, for this specific case on x86, the solution without modulo is much shorter.
Although you are only asking about mod vs branch, there are probably more like five cases depending on the actual implementation of the mod and branch:
Modulus-based
Power-of-two
If the value of SIZE is known to the compiler and is a power of 2, the mod will compile into a single and instruction like this and will be very efficient in performance and code size. The and is still part of the loop increment dependency chain though, which limits performance to 2 cycles per iteration unless the compiler is clever enough to unroll it and keep the and out of the carried chain (gcc and clang weren't).
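A minimal sketch of the power-of-two case, assuming a hypothetical ring size of 16 and an unsigned index: the modulo and the bit mask give the same result, which is why the compiler can emit a single and.

```c
#define RING_SIZE 16u  /* hypothetical size; must be a power of two for the mask to work */

/* Wrap-around increment written with the modulo operator. */
unsigned next_index_mod(unsigned i) {
    return (i + 1u) % RING_SIZE;
}

/* Same increment written with a mask; valid only because RING_SIZE is a power of two. */
unsigned next_index_mask(unsigned i) {
    return (i + 1u) & (RING_SIZE - 1u);
}
```

For a signed index the compiler generally needs a few extra instructions to get the sign of the result right, unless it can prove the value is non-negative.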
Known, not power-of-two
On the other hand, if the value of SIZE is known but not a power of two, then you are likely to get a multiplication-based implementation of the fixed modulus value, like this. This generally takes something like 4-6 instructions, which end up part of the dependency chain. So this will limit your performance to something like 1 iteration every 5-8 cycles, depending exactly on the latency of the dependency chain.
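For SIZE = 15, the constant -2004318071 in the unoptimized Example 1 listing earlier in this thread is 0x88888889 read as a signed 32-bit value, the usual reciprocal constant for division by 15. Below is a hedged sketch of the unsigned variant of the trick (the signed sequence gcc emitted has extra fix-up steps); the function names are made up.

```c
#include <stdint.h>

/* Unsigned x / 15 without a division instruction:
   0x88888889 is ceil(2^35 / 15), so multiplying in 64 bits and shifting
   right by 35 gives the exact quotient for every 32-bit x. */
uint32_t div15(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0x88888889u) >> 35);
}

/* x % 15 derived from the quotient: x - 15 * (x / 15). */
uint32_t mod15(uint32_t x) {
    return x - 15u * div15(x);
}
```

The multiply, shift, multiply-back and subtract all sit on the loop-carried dependency chain, which is where the 5-8 cycles per iteration come from.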
Unknown
In your example SIZE is a known constant, but in the more general case where it is not known at compile time you will get a division instruction on platforms that support it. Something like this.
That is good for code size, since it's a single instruction, but probably disastrous for performance because now you have a slow division instruction as part of the carried dependency for the loop. Depending on your hardware and the type of the SIZE variable, you are looking at 20-100 cycles per iteration.
Branch-based
You put a branch in your code, but the compiler may decide to implement that as either a conditional jump or a conditional move. At -O2, gcc decides on a jump and clang on a conditional move.
Conditional Jump
This is the direct interpretation of your code: use a conditional branch to implement the i == SIZE condition.
It has the advantage of making the condition a control dependency, not a data dependency, so your loop will mostly run at full speed when the branch is not taken.
However, performance could be seriously impacted if the branch mispredicts often. That depends heavily on the value of SIZE and on your hardware. Modern Intel should be able to predict nested loops like this up to 20-something iterations, but beyond that it will mispredict once every time the inner loop is exited. Of course, if SIZE is very large then the single mispredict won't matter much anyway, so the worst case is a SIZE just large enough to mispredict.
Conditional Move
clang uses a conditional move to update i. This is a reasonable option, but it does mean a carried data flow dependency of 3-4 cycles.
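Both branch-based variants come from the same source-level idea; whether the compiler emits a jump or a cmov is its choice, not something C lets you dictate. A sketch with hypothetical function names, assuming SIZE is 15 as in the question:

```c
#define SIZE 15

/* Written as an if statement, as in the question. */
int next_index_if(int i) {
    i++;
    if (i == SIZE) i = 0;
    return i;
}

/* Written as a conditional expression; compilers often (not always) turn this into cmov. */
int next_index_select(int i) {
    int next = i + 1;
    return (next == SIZE) ? 0 : next;
}
```

Both forms compute the same value as (i + 1) % SIZE for indices in 0..SIZE-1, so the only way to know which instruction sequence you got is to look at the generated assembly.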
1 Either actually a constant like your example or effectively a constant due to inlining and constant propagation.
I would like to know what's really happening calling & and * in C.
Does it cost a lot of resources? Should I call & each time I want to get the address of the same variable, or keep it in memory, i.e. in a cache variable? Same for *, i.e. when I want to get the value a pointer points to?
Example
void bar(char *str)
{
check_one(*str);
check_two(*str);
//... Could be replaced by
char c = *str;
check_one(c);
check_two(c);
}
I would like to know what's really happening calling & and * in C.
There's no such thing as "calling" & or *. They are the address operator, or the dereference operator, and instruct the compiler to work with the address of an object, or with the object that a pointer points to, respectively.
And C is not C++, so there's no references; I think you just misused that word in your question's title.
In most cases, that's basically two ways to look at the same thing.
Usually, you'll use & when you actually want the address of an object. Since the compiler needs to handle objects in memory with their address anyway, there's no overhead.
For the specific implications of using the operators, you'll have to look at the assembler your compiler generates.
Example: consider this trivial code, disassembled via godbolt.org:
#include <stdio.h>
#include <stdlib.h>
void check_one(char c)
{
if(c == 'x')
exit(0);
}
void check_two(char c)
{
if(c == 'X')
exit(1);
}
void foo(char *str)
{
check_one(*str);
check_two(*str);
}
void bar(char *str)
{
char c = *str;
check_one(c);
check_two(c);
}
int main()
{
char msg[] = "something";
foo(msg);
bar(msg);
}
The compiler output can vary wildly depending on the vendor and optimization settings.
clang 3.8 using -O2
check_one(char): # #check_one(char)
movzx eax, dil
cmp eax, 120
je .LBB0_2
ret
.LBB0_2:
push rax
xor edi, edi
call exit
check_two(char): # #check_two(char)
movzx eax, dil
cmp eax, 88
je .LBB1_2
ret
.LBB1_2:
push rax
mov edi, 1
call exit
foo(char*): # #foo(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB2_3
movzx eax, al
cmp eax, 120
je .LBB2_2
pop rax
ret
.LBB2_3:
mov edi, 1
call exit
.LBB2_2:
xor edi, edi
call exit
bar(char*): # #bar(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB3_3
movzx eax, al
cmp eax, 120
je .LBB3_2
pop rax
ret
.LBB3_3:
mov edi, 1
call exit
.LBB3_2:
xor edi, edi
call exit
main: # #main
xor eax, eax
ret
Notice that foo and bar are identical. Do other compilers do something similar? Well...
gcc x64 5.4 using -O2
check_one(char):
cmp dil, 120
je .L6
rep ret
.L6:
push rax
xor edi, edi
call exit
check_two(char):
cmp dil, 88
je .L11
rep ret
.L11:
push rax
mov edi, 1
call exit
bar(char*):
sub rsp, 8
movzx eax, BYTE PTR [rdi]
cmp al, 120
je .L16
cmp al, 88
je .L17
add rsp, 8
ret
.L16:
xor edi, edi
call exit
.L17:
mov edi, 1
call exit
foo(char*):
jmp bar(char*)
main:
sub rsp, 24
movabs rax, 7956005065853857651
mov QWORD PTR [rsp], rax
mov rdi, rsp
mov eax, 103
mov WORD PTR [rsp+8], ax
call bar(char*)
mov rdi, rsp
call bar(char*)
xor eax, eax
add rsp, 24
ret
Well, if there were any doubt that foo and bar are equivalent, at least as far as the compiler is concerned, I think this:
foo(char*):
jmp bar(char*)
is a strong argument they indeed are.
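In C, the unary & operator itself has essentially no runtime cost: the address is either a link-time constant or a fixed offset from the stack or frame pointer. The unary * operator may compile to a memory load, but an optimizing compiler will usually keep the loaded value in a register rather than load it twice. So in practice there's no difference in runtime between
check_one(*str);
check_two(*str);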
and
char c = *str;
check_one( c );
check_two( c );
ignoring the overhead of the assignment.
That's not necessarily true in C++, since you can overload those operators.
tldr;
If you are programming in C, then the & operator is used to obtain the address of a variable and * is used to get the value of that variable, given its address.
This is also why, in C, a function that takes a string should make the expected input clear (for example by also taking a length): from the signature alone, someone unfamiliar with your logic cannot tell whether the function is meant to be called as bar(&some_char) or bar(some_cstr).
To conclude, if you have a variable x of type someType, then &x gives you a someType* (call it addressOfX), and *addressOfX gives you the value of x. C has no reference parameters: a function takes either a value or a pointer, i.e. you cannot create a function where the parameter type is someType& or someType&& as you can in C++.
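A minimal sketch of that round trip, with hypothetical helper names: &x produces a pointer to x, and * applied to that pointer reads or writes x itself.

```c
/* Reads the object that the pointer refers to. */
int read_through_pointer(const int *address_of_x) {
    return *address_of_x;
}

/* Writes to the object that the pointer refers to; the caller's variable changes. */
void write_through_pointer(int *address_of_x, int value) {
    *address_of_x = value;
}
```

Given int x = 5;, calling read_through_pointer(&x) yields 5, and after write_through_pointer(&x, 7) the variable x holds 7.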
Also your examples can be rewritten as:
check_one(str[0]);
check_two(str[0]);
AFAIK, in x86 and x64 your variables are stored in memory (if not stated with register keyword) and accessed by pointers.
const int foo = 5 is equivalent to foo dd 5, and check_one(foo) is equivalent to push dword [foo]; call check_one.
If you create additional variable c, then it looks like:
c resd 1
...
mov eax, [foo]
mov dword [c], eax ; Variable foo just copied to c
push dword [c]
call check_one
And nothing changed, except additional copying and memory allocation.
I think that compiler's optimizer deals with it and makes both cases as fast as it is possible. So you can use more readable variant.
I am implementing a function for a bubblesort algorithm in assembly language (Linux, 64-bit, yasm). The function is called from within a C file where the array and the array size are passed through to assembly via rdi and rsi respectively.
xor rax, rax
xor rbx, rbx
xor r14, r14 ; r14 : int j = 0
xor r15, r15 ; r15 : boolean swapped
inc r15 ; swapped = true (=> swapped = 1)
while:
cmp r15, 1 ; while (swapped) (=> check if swapped == 1)
jne end_while
dec r15 ; swapped = false (=> swapped = 0)
inc r14 ; j++
mov rdx, rsi ; rdx = size
sub rdx, r14 ; size - j
xor rcx, rcx ; int i = 0
for:
cmp rcx, rdx ; i < size - j
je end_for
mov rax, [rdi+rcx*4+4] ; rax = rdi+rcx*4+4 => arr[i+1]
mov rbx, [rdi+rcx*4] ; rbx = rdi+rcx*4 => temp = arr[i]
cmp rbx, rax ; if(arr[i] > arr[i+1])
jng done_if
mov [rdi+rcx*4], rax ; arr[i] = arr[i+1]
mov [rdi+rcx*4+4], rbx ; arr[i+1] = temp
inc r15 ; swapped = true (=> swapped = 1)
done_if:
inc rcx ; ++i
jmp for
end_for:
end_while:
ret
The array sorts integers only. I coded the bubblesort in Java and tested it there - it works fine. However, when I pass the array {9,8,7,6,5,4,3,2,1,0} via the C file the output is {8,8,8,8,8,8,8,8,8,9}. I debugged with gdb but still can't see where the issue is. The for-loop construction works fine (rcx and rdx function correctly). It seems that there might be an issue with the way the array elements are accessed.
Any advice would be appreciated.
Your problem is that you are using quadwords (64-bit integers) everywhere but your array is full of doublewords (32-bit integers). In particular, mov rax, [rdi+rcx*4+4] should be changed to mov eax, [rdi+rcx*4+4], and the other loads, the stores and the comparison should similarly use the 32-bit registers eax and ebx. In yasm's Intel syntax the operand size follows from the register name; the movl mnemonic belongs to AT&T syntax.
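For reference, here is a C rendering of the algorithm that the assembly comments describe, written as a sketch. It assumes the C caller passes an array of 32-bit int values, which is what the 4-byte scale factor in [rdi+rcx*4] implies; the function name is invented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bubble sort over 32-bit elements; each load and store touches exactly 4 bytes. */
void bubble_sort_i32(int32_t *arr, size_t size) {
    size_t j = 0;
    bool swapped = true;
    while (swapped) {
        swapped = false;
        j++;
        /* i + j < size is the same bound as i < size - j, without unsigned underflow */
        for (size_t i = 0; i + j < size; i++) {
            if (arr[i] > arr[i + 1]) {
                int32_t temp = arr[i];  /* 32-bit temporary, like ebx rather than rbx */
                arr[i] = arr[i + 1];
                arr[i + 1] = temp;
                swapped = true;
            }
        }
    }
}
```

Compiling this with gcc -S and comparing the output with the hand-written version is one way to spot where 64-bit registers crept in.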
I need to translate what is commented within the method to assembler. I have a rough idea, but can't get it to work.
Can anyone help me, please? It is for an Intel x32 architecture:
int
secuencia ( int n, EXPRESION * * o )
{
int a, i;
//--- Translate from here ...
for ( i = 0; i < n; i++ ){
a = evaluarExpresion( *o );
o++;
}
return a ;
//--- ... until here.
}
Translated code must be within __asm as:
__asm {
translated code
}
Thank you,
FINAL UPDATE:
This is the final version, working and commented, thanks to all for your help :)
int
secuencia ( int n, EXPRESION * * o )
{
int a = 0, i;
__asm
{
mov dword ptr [i],0 ; int i = 0
jmp salto1
ciclo1:
mov eax,dword ptr [i]
add eax,1 ; increment in 1 the value of i
mov dword ptr [i],eax ; i++
salto1:
mov eax,dword ptr [i]
cmp eax,dword ptr [n] ; Compare i and n
jge final ; If i >= n, go to 'final'
mov eax,dword ptr [o]
mov ecx,dword ptr [eax] ; Recover * o (its value)
push ecx ; Make push of * o (At the stack, its value)
call evaluarExpresion ; call evaluarExpresion( * o )
add esp,4 ; Recover memory from the stack (4 bytes corresponding to the * o pointer)
mov dword ptr [a],eax ; Save the result of evaluarExpresion as the value of a
mov eax,dword ptr [o] ; extract the pointer to o
add eax,4 ; increment the pointer by a factor of 4 (next of the actual pointed by *o)
mov dword ptr [o],eax ; o++
jmp ciclo1 ; repeat
final: ; for's final
mov eax,dword ptr [a] ; return a - it saves the return value in the eax register (by convention this is where the result must be stored)
}
}
Essentially in assembly languages, strictly speaking there isn't a notion of a loop the same way there would be in a higher level language. It's all implemented with jumps (eg. as a "goto"...)
That said, x86 has some instructions with the assumption that you'll be writing "loops", implicitly using the register ECX as a loop counter.
Some examples:
mov ecx, 5 ; ecx = 5
.label:
; Loop body code goes here
; ECX will start out as 5, then 4, then 3, then 2, then 1...
loop .label ; if (--ecx) goto .label;
Or:
jecxz .loop_end ; if (!ecx) goto .loop_end;
.loop_start:
; Loop body goes here
loop .loop_start ; if (--ecx) goto .loop_start;
.loop_end:
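In C terms, the jecxz/loop pair behaves roughly like the sketch below, which just counts how often the body would run. The function name is hypothetical. Without the guard, starting with ECX = 0 would make loop decrement it to 0xFFFFFFFF and run the body 2^32 times.

```c
#include <stdint.h>

/* Counts how many times a body guarded by jecxz + loop would execute. */
uint32_t iterations_with_jecxz_guard(uint32_t ecx) {
    uint32_t runs = 0;
    if (ecx == 0) {
        return runs;          /* jecxz .loop_end */
    }
    do {
        runs++;               /* loop body goes here */
    } while (--ecx != 0);     /* loop .loop_start */
    return runs;
}
```

The do/while shape matters: loop tests the counter after the body, so the body always runs at least once unless jecxz skips it.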
And, if you don't like this loop instruction thing counting backwards... You can write something like:
xor ecx, ecx ; ecx = 0
.loop_start:
cmp ecx, 5 ; do (ecx-5) discarding result, then set FLAGS
jz .loop_end ; if (ecx-5) was zero (eg. ecx == 5), jump to .loop_end
; Loop body goes here.
inc ecx ; ecx++
jmp .loop_start
.loop_end:
This would be closer to the typical for (int i=0; i<5; ++i) { }
Note that
for (init; cond; advance) {
...
}
is essentially syntactic sugar for
init;
while(cond) {
...
advance;
}
which should be easy enough to translate into assembly language if you've been paying any attention in class.
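As a sketch of that translation, the three functions below compute the same sum; the goto version lines up almost one-to-one with the labels and jumps of an assembly loop. The function names and the summing body are made up for illustration.

```c
/* The loop as you would normally write it. */
int sum_with_for(int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += i;
    }
    return total;
}

/* The same loop with the for sugar removed. */
int sum_with_while(int n) {
    int total = 0;
    int i = 0;                      /* init */
    while (i < n) {                 /* cond */
        total += i;
        i++;                        /* advance */
    }
    return total;
}

/* The same loop with only labels and jumps, as assembly sees it. */
int sum_with_goto(int n) {
    int total = 0;
    int i = 0;                      /* mov dword ptr [i], 0 */
loop_start:
    if (!(i < n)) goto loop_end;    /* cmp + jge */
    total += i;                     /* loop body */
    i++;                            /* add 1 to i */
    goto loop_start;                /* jmp */
loop_end:
    return total;
}
```

The condition is inverted in the goto version because the jump leaves the loop, which is why the hand-written assembly uses jge for a C condition of i < n.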
Use gcc to generate the assembly code
gcc -S -c sample.c
man gcc is your friend
For that you would probably use the loop instruction, which decrements ecx (often called the extended counter) on each iteration and exits when ecx reaches zero. But why use inline asm for it anyway? I'm pretty sure something as simple as that will be optimized correctly by the compiler...
(We say x86 architecture, because it's based on 80x86 computers, but it's an "ok" mistake =p)