Design a function in C from assembly code

I need to design a C function that does what the machine code below does. I can follow the assembly operations step by step, but my function is said to be implemented incorrectly, and I am confused.
This is the disassembled code of the function.
(Hand transcribed from an image, typos are possible
especially in the machine-code. See revision history for the image)
0000000000000000 <ex3>:
0: b9 00 00 00 00 mov $0x0,%ecx
5: eb 1b jmp L2 // 22 <ex3+0x22>
7: 48 63 c1 L1: movslq %ecx,%rax
a: 4c 8d 04 07 lea (%rdi,%rax,1),%r8
e: 45 0f b6 08 movzbl (%r8),%r9d
12: 48 01 f0 add %rsi,%rax
15: 44 0f b6 10 movzbl (%rax),%r10d
19: 45 88 10 mov %r10b,(%r8)
1c: 44 88 08 mov %r9b,(%rax)
1f: 83 c1 01 add $0x1,%ecx
22: 39 d1 L2: cmp %edx,%ecx
24: 7c e1 jl L1 // 7 <ex3+0x7>
26: f3 c3 repz retq
My code (the signature of the function is not given or settled):
#include <assert.h>
int
ex3(int rdi, int rsi, int edx, int r8, int r9) {
    int ecx = 0;
    int rax;
    if (ecx > edx) {
        rax = ecx;
        r8 = rdi + rax;
        r9 = r8;
        rax = rsi;
        int r10 = rax;
        r8 = r10;
        rax = r9;
        ecx += 1;
    }
    return rax;
}
Please explain what causes the bugs if you recognize any.

I am pretty sure it is this: it swaps two areas of memory:
void memswap(unsigned char *rdi, unsigned char *rsi, int edx) {
    int ecx;
    for (ecx = 0; ecx < edx; ecx++) {
        unsigned char r9 = rdi[ecx];
        unsigned char r10 = rsi[ecx];
        rdi[ecx] = r10;
        rsi[ecx] = r9;
    }
}

(Editor's note: this is a partial answer that only addresses the loop structure. It doesn't cover the movzbl byte loads, or the fact that some of these variables are pointers, or type widths. There's room for other answers to cover other parts of the question.)
C supports goto, and even though its use is often frowned upon, it is very useful here. Use it to make the C code as similar to the assembly as possible. This lets you make sure that the code works before you start introducing more proper control-flow mechanisms, like while loops. So I would do something like this:
goto L2;
L1:
    rax = ecx;
    r8 = rdi + rax;
    r9 = r8;
    rax = rsi;
    int r10 = rax;
    r8 = r10;
    rax = r9;
    ecx += 1;
L2:
    if (edx < ecx)
        goto L1;
You can easily transform the above code to:
while (edx < ecx) {
    rax = ecx;
    r8 = rdi + rax;
    r9 = r8;
    rax = rsi;
    int r10 = rax;
    r8 = r10;
    rax = r9;
    ecx += 1;
}
Note that I have not checked if the code within the L1-block and then later the while block is correct or not. (Editor's note: it's missing all the memory accesses). But your jumping was wrong and is now corrected.
What you can do from here (again, assuming that this is correct) is to start trying to see patterns. It seems like ecx is used as some kind of index variable, and the variable rax can be substituted away at the beginning. We can do a few other similar changes. This gives us:
int i = 0;
while (edx < i) {
    // rax = ecx;
    // r8 = rdi + i;
    // r9 = rdi + i;  // r9 = r8
    // rax = rsi;
    int r10 = rsi;    // int r10 = rax;
    r8 = r10;
    rax = r9 = rdi + i;
    i++;
}
Here it clearly seems like something is a bit iffy. The while condition is edx<i, but i is incremented, not decremented, each iteration, so the body would never run. That's a good indication that something is wrong. In fact, in AT&T syntax cmp %edx,%ecx computes ecx - edx, so the following jl is taken when ecx < edx; the comparison was read backwards. At least this is a method you can use.
Just take it step by step.
add $0x1,%ecx is AT&T syntax for incrementing ecx by 1. According to this site (which uses Intel syntax), the result is stored in the first operand; in AT&T syntax, that's the last operand.
One interesting thing to notice is that if we removed the goto L2 statement, this would instead be equivalent to
do {
// Your code
} while(edx<ecx);
A while-loop can be compiled to a do-while-loop with an additional goto. (See Why are loops always compiled into "do...while" style (tail jump)?). It's pretty easy to understand.
In assembly, loops are made with gotos that jump backward in the code. You test and then decide if you want to jump back. So in order to test before the first iteration, you need to jump to the test first. (Compilers also sometimes compile while loops with an if()break at the top and a jmp at the bottom. But only with optimization disabled. See While, Do While, For loops in Assembly Language (emu8086))
Forward jumping is often the result of compiling if statements.
I also just realized that I now have three good uses for goto. The first two are breaking out of nested loops and releasing resources in the opposite order of allocation. And now the third is this: reverse engineering assembly.

For those that prefer a .S format for GCC, I used:
ex3:
mov $0x0, %ecx
jmp lpe
lps:
movslq %ecx, %rax
lea (%rdi, %rax, 1), %r8
movzbl (%r8), %r9d
add %rsi, %rax
movzbl (%rax), %r10d
mov %r10b, (%r8)
mov %r9b, (%rax)
add $0x1, %ecx
lpe:
cmp %edx, %ecx
jl lps
repz retq
.data
.text
.global _main
_main:
mov $0x111111111111, %rdi
mov $0x222222222222, %rsi
mov $0x5, %rdx
mov $0x333333333333, %r8
mov $0x444444444444, %r9
call ex3
xor %eax, %eax
ret
you can then compile it with gcc main.S -o main and run objdump -d -M intel main to see it in Intel format, OR run the resulting main executable through a decompiler.. but meh.. Let's do some manual work..
First I would convert the AT&T syntax to the more commonly known Intel syntax.. so:
ex3:
mov ecx, 0
jmp lpe
lps:
movsxd rax, ecx
lea r8, [rdi + rax]
movzx r9d, byte ptr [r8]
add rax, rsi
movzx r10d, byte ptr [rax]
mov byte ptr [r8], r10b
mov byte ptr [rax], r9b
add ecx, 0x1
lpe:
cmp ecx, edx
jl lps
rep ret
Now I can see clearly that the code from lps (loop start) to lpe (loop end) is a for loop.
How? Because first it sets the counter register (ecx) to 0. Then it checks whether ecx < edx by doing a cmp ecx, edx followed by a jl (jump if less than). If it is, it runs the body and increments ecx by 1 (add ecx, 1); if not, it exits the block.
Thus it looks like: for (int32_t ecx = 0; ecx < edx; ++ecx).. (note that edx is the lower 32-bits of rdx).
So now we translate the rest with the knowledge that:
r10 is a 64-bit register. r10d is its lower 32 bits, and r10b is its lowest 8 bits.
r9 is a 64-bit register. The same logic as for r10 applies.
So we can represent a register as I have below:
typedef union Register
{
uint64_t reg;
struct
{
/* x86 is little-endian: the first member overlays the lowest bytes */
uint32_t lower32;
uint32_t upper32;
};
struct
{
uint16_t llower16;
uint16_t lupper16;
uint16_t ulower16;
uint16_t uupper16;
};
struct
{
uint8_t lllower8; /* lowest byte, i.e. r10b when reg holds r10 */
uint8_t llupper8;
uint8_t lulower8;
uint8_t luupper8;
uint8_t ullower8;
uint8_t ulupper8;
uint8_t uulower8;
uint8_t uuupper8;
};
} Register;
Whichever is better.. you can choose for yourself..
Now we can start looking at the instructions themselves..
movsxd or movslq moves a 32-bit register into a 64-bit register with a sign extension.
Now we can write the code:
#include <stdint.h>

uint8_t* ex3(uint8_t* rdi, uint64_t rsi, int32_t edx)
{
uintptr_t rax = 0;
for (int32_t ecx = 0; ecx < edx; ++ecx)
{
rax = ecx;
uint8_t* r8 = rdi + rax;
Register r9 = { .reg = *r8 }; //zero extend into the upper half of the register
rax += rsi;
Register r10 = { .reg = *(uint8_t*)rax }; //zero extend into the upper half of the register
*r8 = r10.lllower8;
*(uint8_t*)rax = r9.lllower8;
}
return (uint8_t *)rax;
}
Hopefully I didn't screw anything up..

Related

Why would gcc -O3 generate multiple ret instructions? [duplicate]

This question already has an answer here:
Why does GCC emit a repeated `ret`?
(1 answer)
Closed 2 months ago.
I was looking at some recursive function from here:
int get_steps_to_zero(int n)
{
if (n == 0) {
// Base case: we have reached zero
return 0;
} else if (n % 2 == 0) {
// Recursive case 1: we can divide by 2
return 1 + get_steps_to_zero(n / 2);
} else {
// Recursive case 2: we can subtract by 1
return 1 + get_steps_to_zero(n - 1);
}
}
I checked the disassembly in order to check if gcc managed tail-call optimization/unrolling. Looks like it did, though with x86-64 gcc 12.2 -O3 I get a function like this, ending with two ret instructions:
get_steps_to_zero:
xor eax, eax
test edi, edi
jne .L5
jmp .L6
.L10:
mov edx, edi
shr edx, 31
add edi, edx
sar edi
test edi, edi
je .L9
.L5:
add eax, 1
test dil, 1
je .L10
sub edi, 1
test edi, edi
jne .L5
.L9:
ret
.L6:
ret
Godbolt example.
What's the purpose of the multiple returns? Is it a bug?
EDIT
Seems like this appeared in gcc 11.x. When compiling with gcc 10.x, the function ends like:
.L1:
mov eax, r8d
ret
.L6:
xor r8d, r8d
mov eax, r8d
ret
As in: store the result in eax. The 11.x version instead zeroes eax at the beginning of the function and then modifies it in the function body, eliminating the need for the extra mov instruction.
This is a manifestation of a pass-ordering problem. At some point in the optimization pipeline, the two basic blocks ending in ret are not equivalent; then some pass makes them equivalent, but no following pass is capable of collapsing the two equivalent blocks into one.
On Compiler Explorer, you can see how compiler optimization pipeline works by inspecting snapshots of internal representation between passes. For GCC, select "Add New > GCC Tree/RTL" in the compiler pane. Here's your example, with a snapshot immediately preceding the problematic transformation pre-selected in the new pane: https://godbolt.org/z/nTazM5zGG
Towards the end of the dump, you can see the two basic blocks:
65: NOTE_INSN_BASIC_BLOCK 8
77: use ax:SI
66: simple_return
and
43: NOTE_INSN_BASIC_BLOCK 9
5: ax:SI=0
38: use ax:SI
74: NOTE_INSN_EPILOGUE_BEG
75: simple_return
Basically the second block is different in that it sets eax to zero before returning. If you look at the next pass (called "jump2"), you see that it lifts the ax:SI=0 instruction from basic block 9 and basic block 3 to basic block 2, making BB 9 equivalent to BB 8.
If you disable this optimization with -fno-crossjumping, the difference will be carried to the end, making the resulting assembly less surprising.
Conclusion first: This is a deliberate optimization choice by GCC.
If you use GCC locally (gcc -O3 -S) instead of on Godbolt, you can see that there are alignment directives between the two ret instructions:
; top part omitted
.L9:
ret
.p2align 4,,10
.p2align 3
.L6:
ret
.cfi_endproc
The object file, when disassembled, includes a NOP in that padding area:
8: 75 13 jne 1d <get_steps_to_zero+0x1d>
a: eb 24 jmp 30 <get_steps_to_zero+0x30>
c: 0f 1f 40 00 nopl 0x0(%rax)
<...>
2b: 75 f0 jne 1d <get_steps_to_zero+0x1d>
2d: c3 ret
2e: 66 90 xchg %ax,%ax
30: c3 ret
The second ret instruction is aligned to a 16-byte boundary whereas the first one isn't. This allows the processor to load the instruction faster when used as a jump target from a distant source. Subsequent C return statements, however, are close enough to the first ret instruction such that they will not benefit from jumping to aligned targets.
This alignment is even more noticeable on my Zen 2 CPU with -mtune=native, with more padding bytes added:
29: 75 f2 jne 1d <get_steps_to_zero+0x1d>
2b: c3 ret
2c: 0f 1f 40 00 nopl 0x0(%rax)
30: c3 ret

How can I store abc(x, y), which is a function returning a pointer, into an array, as per the following code sample?

Function 1.
It is a function returning a pointer.
char *abc(unsigned int a, unsigned int b)
{
//do something here ...
}
Function 2
Function 1 is leveraged inside function 2.
I am trying to store the abc function into an array; however, I am getting the error: error: assignment to expression with array type.
fun2()
{
unsigned int x, y;
x= 5, y=6;
char *array1;
char array2;
for(i=0; i<3; i++)
{
array2[i] = abc(x, y);
}
}
You can't store the invocation of a function in C. It would defeat many popular optimizations involving passing parameters in registers: normally, parameters are assigned their argument values immediately before the execution flow is transferred to the call site. Compilers may choose registers to hold those values, but those registers are volatile, so if the actual call were delayed they would be overwritten in the meantime, possibly even by another call to some function that also has its arguments passed in registers. A solution, which I've personally implemented, is to have a function simulate the call for you by re-assigning the proper registers and pushing any further arguments onto the stack; in this case you store the argument values in flat memory. But this must be done in assembly, written exclusively for this purpose and specific to your target architecture. If your architecture does not use any such optimizations, it could be quite a bit easier, but hand-written assembly would still be required.
In any case, this is not a feature that standard C (or even pre-standard C, as far as I know) has ever implemented.
For example, this is an implementation for x86-64 I wrote some time ago (for MSVC's masm assembler):
PUBLIC makeuniquecall
.data
makeuniquecall_jmp_table dq zero_zero, one_zero, two_zero, three_zero ; ordinary
makeuniquecall_jmp_table_one dq zero_one, one_one, two_one, three_one ; single precision
makeuniquecall_jmp_table_two dq zero_two, one_two, two_two, three_two ; double precision
.code
makeuniquecall PROC
;rcx - function pointer
;rdx - raw argument data
;r8 - a byte array specifying each register parameter if it's float and the last qword is the size of the rest
push r12
push r13
push r14
mov r12, rcx
mov r13, rdx
mov r14, r8
; first store the stack vars
mov rax, [r14 + 4] ; retrieve size of stack
sub rsp, rax
mov rdi, rsp
xor rdx, rdx
mov r8, 8
div r8
mov rcx, rax
mov rsi, r13
;add rsi, 32
rep movs qword ptr [rdi], qword ptr [rsi]
xor r10,r10
cycle:
mov rax, r14
add rax, r10
movzx rax, byte ptr [rax]
test rax, rax
jnz jmp_one
lea rax, makeuniquecall_jmp_table
jmp qword ptr[rax + r10 * 8]
jmp_one:
cmp rax, 1
jnz jmp_two
lea rax, makeuniquecall_jmp_table_one
jmp qword ptr[rax + r10 * 8]
jmp_two:
lea rax, makeuniquecall_jmp_table_two
jmp qword ptr[rax + r10 * 8]
zero_zero::
mov rcx, qword ptr[r13+r10*8]
jmp continue
one_zero::
mov rdx, qword ptr[r13+r10*8]
jmp continue
two_zero::
mov r8, qword ptr[r13+r10*8]
jmp continue
three_zero::
mov r9, qword ptr[r13+r10*8]
jmp continue
zero_one::
movss xmm0, dword ptr[r13+r10*8]
jmp continue
one_one::
movss xmm1, dword ptr[r13+r10*8]
jmp continue
two_one::
movss xmm2, dword ptr[r13+r10*8]
jmp continue
three_one::
movss xmm3, dword ptr[r13+r10*8]
jmp continue
zero_two::
movsd xmm0, qword ptr[r13+r10*8]
jmp continue
one_two::
movsd xmm1, qword ptr[r13+r10*8]
jmp continue
two_two::
movsd xmm2, qword ptr[r13+r10*8]
jmp continue
three_two::
movsd xmm3, qword ptr[r13+r10*8]
continue:
inc r10
cmp r10, 4
jb cycle
mov r14, [r14 + 4] ; retrieve size of stack
call r12
add rsp, r14
pop r14
pop r13
pop r12
ret
makeuniquecall ENDP
END
And your code will look something like this:
#include <stdio.h>
#include <string.h>
char* abc(unsigned int a, unsigned int b)
{
printf("a - %d, b - %d\n", a, b);
return "return abc str\n";
}
extern void makeuniquecall(void);
int main()
{
unsigned int x, y;
x = 5, y = 6;
#pragma pack(push, 4)
struct {
struct { char maskargs[4]; unsigned long long szargs; } invok;
char *(*pfunc)();
unsigned long long args[2], shadow[2];
} array2[3];
#pragma pack(pop)
for (int i = 0; i < 3; i++)
{
memset(array2[i].invok.maskargs, 0, sizeof array2[i].invok.maskargs); // standard - no floats passed
array2[i].invok.szargs = 8 * 4; //consider shadow space
array2[i].pfunc = abc;
array2[i].args[0] = x;
array2[i].args[1] = y;
}
//now do the calls
for (int i = 0; i < 3; i++)
printf("%s\n", ((char *(*)())makeuniquecall)(array2[i].pfunc, array2[i].args, &array2[i].invok));
}
You probably won't need that; for your specific case you can get away with simply storing each argument and calling the function directly, i.e. (plus, this method isn't x86-64 specific):
//now do the calls
for (int i = 0; i < 3; i++)
printf("%s\n", array2[i].pfunc(array2[i].args[0], array2[i].args[1]));
But my implementation gives you the flexibility to store a different number of arguments for each call.
Note: consider this guide for running the above examples on MSVC (since it requires adding an asm file for the assembly code).
I love such noob questions, since they make you think about why some feature doesn't actually exist in the language.

Understanding the decompilation of an object to source code

First of all, I am a student. I do not yet have extensive knowledge of C, C++ and assembler, so I am making an extreme effort to understand this.
I have this piece of assembly code from an Intel x86-32 bit processor.
My goal is to transform it to source code.
0x80483dc <main>: push ebp
0x80483dd <main+1>: mov ebp,esp
0x80483df <main+3>: sub esp,0x10
0x80483e2 <main+6>: mov DWORD PTR [ebp-0x8],0x80484d0
0x80483e9 <main+13>: lea eax,[ebp-0x8]
0x80483ec <main+16>: mov DWORD PTR [ebp-0x4],eax
0x80483ef <main+19>: mov eax,DWORD PTR [ebp-0x4]
0x80483f2 <main+22>: mov edx,DWORD PTR [eax+0xc]
0x80483f5 <main+25>: mov eax,DWORD PTR [ebp-0x4]
0x80483f8 <main+28>: movzx eax,WORD PTR [eax+0x10]
0x80483fc <main+32>: cwde
0x80483fd <main+33>: add edx, eax
0x80483ff <main+35>: mov eax,DWORD PTR [ebp-0x4]
0x8048402 <main+38>: mov DWORD PTR [eax+0xc],edx
0x8048405 <main+41>: mov eax,DWORD PTR [ebp-0x4]
0x8048408 <main+44>: movzx eax,BYTE PTR [eax]
0x804840b <main+47>: cmp al,0x4f
0x804840d <main+49>: jne 0x8048419 <main+61>
0x804840f <main+51>: mov eax,DWORD PTR [ebp-0x4]
0x8048412 <main+54>: movzx eax,BYTE PTR [eax]
0x8048415 <main+57>: cmp al,0x4b
0x8048417 <main+59>: je 0x804842d <main+81>
0x8048419 <main+61>: mov eax,DWORD PTR [ebp-0x4]
0x804841c <main+64>: mov eax,DWORD PTR [eax+0xc]
0x804841f <main+67>: mov edx, eax
0x8048421 <main+69>: and edx,0xf0f0f0f
0x8048427 <main+75>: mov eax,DWORD PTR [ebp-0x4]
0x804842a <main+78>: mov DWORD PTR [eax+0x4],edx
0x804842d <main+81>: mov eax,0x0
0x8048432 <main+86>: leave
0x8048433 <main+87>: ret
This is what I understand from the code:
There are 4 variables:
a = [ebp-0x8] ebp
b = [ebp-0x4] eax
c = [eax + 0xc] edx
d = [eax + 0x10] eax
Values:
0x4 = 4
0x8 = 8
0xc = 12
0x10 = 16
0x4b = 75
0x4f = 79
Types:
char (8 bits) = 1 BYTE
short (16 bits) = WORD
int (32 bit) = DWORD
long (32 bits) = DWORD
long long (64 bits) = QWORD
This is what I was able to create:
#include <stdio.h>
int main (void)
{
   int a = 0x80484d0;
   int b;
   short c;
   int d;
   c + b?
if (79 <= al) {
instructions
} else {
instructions
}
   return 0;
}
But I'm stuck. Nor can I understand what the sentence "cmp al .." compares to, what is "al"?
How do these instructions work?
EDIT1:
That said, as you comment, the assembly seems to be wrong, or, as someone commented, it is insane!
The code and the exercise are from the book "Reversing, Reverse Engineering", page 140 (3.8 Proposed Exercises). It would never have occurred to me that it was wrong; if so, this clearly makes it difficult for me to learn...
So it is not possible to reverse it back to source code because it is not good assembly? Or maybe I am just not up to it? Is it possible to make sense of it?
EDIT2:
Hi!
I did ask, and finally she says this should be the C code:
int foo(void){
char *string; //ebp-0x8
unsigned int *pointerstring; //[ebp-0x4]
unsigned int *position;
*position = *(pointerstring+0xc);
unsigned char character;
character=(unsigned char) string[*position];
if ((character != 0x4)||(character != 0x4b))
{
*(position+0x4)=(unsigned int)(*position & 0x0f0f0f0f);
}
return(0);
}
Does it make any sense at all to you? Could someone please explain this to me?
Does anyone really program like this?
Thanks very much!
Your assembly is completely insane. This is roughly equivalent C:
int main() {
int i = 0x80484d0; // in ebp-8
int *p = &i; // in ebp-4
p[3] += (short)p[4]; // add argc to the return address(!)
if((char)*p != 0x4f || (char)*p != 0x4b) // always true because of || instead of &&
p[1] = p[3] & 0xf0f0f0f; // note that p[1] is p
return 0;
}
It should be immediately obvious that this is horrifically bad code that almost certainly won't do what the programmer intended.
The x86 assembly language follows a long legacy and has mostly kept compatibility. We need to go back to the 8086/8088 chips, where that story starts. These were 16-bit processors, which means that their registers had a word size of 16 bits. The general purpose registers were named AX, BX, CX and DX. The 8086 had instructions to manipulate the upper and lower 8-bit parts of these registers, which were then named AH, AL, BH, BL, CH, CL, DH and DL. This Wikipedia page describes this; please take a look.
The 32 bit versions of these registers have an E in front: EAX, EBX, ECX, etc.
The particular instruction you mention, e.g. cmp al,0x4f, is comparing the lower byte of the AX register with 0x4f. The comparison is effectively the same as a subtraction, but does not save the result; it only sets the flags.
For the 8086 instruction set, there is a nice reference here. Your program is 32 bit code, so you will need at least a 80386 instruction reference.
You have analyzed variables, and that's a good place to start. You should try to add type annotations to them, size, as you started, and, when used as pointers (like b), pointers to what kind/size.
I might update your variable chart as follows, knowing that [ebp-4] is b:
c = [b + 0xc]
d = [b + 0x10]
e = [b + 0], size = byte
Another thing to analyze is the control flow. For most instructions control flow is sequential, but certain instructions purposefully alter it. Broadly speaking, when the pc is moved forward, it skips some code and when the pc is moved backward it repeats some code it already ran. Skipping code is used to construct if-then, if-then-else, and statements that break out of loops. Jumping back is used to continue looping.
Some instructions, called conditional branches, on some dynamic condition being true: skip forward (or backwards) and on being false do the simple sequential advancement to the next instruction (sometimes called conditional branch fall through).
The control sequences here:
...
0x8048405 <main+41>: mov eax,DWORD PTR [ebp-0x4] b
0x8048408 <main+44>: movzx eax,BYTE PTR [eax] b->e
0x804840b <main+47>: cmp al,0x4f b->e <=> 'O'
0x804840d <main+49>: jne 0x8048419 <main+61> b->e != 'O' skip to 61
** we know that the letter, b->e, must be 'O' here
0x804840f <main+51>: mov eax,DWORD PTR [ebp-0x4] b
0x8048412 <main+54>: movzx eax,BYTE PTR [eax] b->e
0x8048415 <main+57>: cmp al,0x4b b->e <=> 'K'
0x8048417 <main+59>: je 0x804842d <main+81> b->e == 'K' skip to 81
** we know that the letter, b->e, must not be 'K' here if we fall thru the above je
** this line can be reached by taken branch jne or by fall thru je
0x8048419 <main+61>: mov eax,DWORD PTR [ebp-0x4] ******
...
The flow of control reaches this last line (tagged ******) when we know that either the letter is not 'O' or it is not 'K'.
The construct where the jne instruction is used to skip another test is a short-circuit || operator. Thus the control construct is:
if ( b->e != 'O' || b->e != 'K' ) {
then-part
}
As these two conditional branches are the only control-flow modifications in the function, there is no else part to the if, and there are no loops or other ifs.
This code appears to have a slight problem.
If the value is not 'O', the then-part will fire from the first test. However, if we reach the 2nd test, we already know the letter is 'O', so testing it for 'K' is silly and will be true ('O' is not 'K').
Thus, this if-then will always fire.
It is either very inefficient, or there is a bug: perhaps the next letter along in the (presumably) string should be tested for 'K', not the same exact letter.

C pointers and references

I would like to know what's really happening when calling & and * in C.
Does it cost a lot of resources? Should I call & each time I want to get the address of the same given variable, or keep it in memory, i.e. in a cache variable? Same for *, i.e. when I want to get the pointed-to value?
Example
void bar(char *str)
{
check_one(*str);
check_two(*str);
//... Could be replaced by
char c = *str;
check_one(c);
check_two(c);
}
I would like to know what's really happening calling & and * in C.
There's no such thing as "calling" & or *. They are the address-of operator and the dereference operator, and they instruct the compiler to work with the address of an object, or with the object that a pointer points to, respectively.
And C is not C++, so there are no references; I think you just misused that word in your question's title.
In most cases, that's basically two ways to look at the same thing.
Usually, you'll use & when you actually want the address of an object. Since the compiler needs to handle objects in memory with their address anyway, there's no overhead.
For the specific implications of using the operators, you'll have to look at the assembler your compiler generates.
Example: consider this trivial code, disassembled via godbolt.org:
#include <stdio.h>
#include <stdlib.h>
void check_one(char c)
{
if(c == 'x')
exit(0);
}
void check_two(char c)
{
if(c == 'X')
exit(1);
}
void foo(char *str)
{
check_one(*str);
check_two(*str);
}
void bar(char *str)
{
char c = *str;
check_one(c);
check_two(c);
}
int main()
{
char msg[] = "something";
foo(msg);
bar(msg);
}
The compiler output can vary wildly depending on the vendor and optimization settings.
clang 3.8 using -O2
check_one(char): # #check_one(char)
movzx eax, dil
cmp eax, 120
je .LBB0_2
ret
.LBB0_2:
push rax
xor edi, edi
call exit
check_two(char): # #check_two(char)
movzx eax, dil
cmp eax, 88
je .LBB1_2
ret
.LBB1_2:
push rax
mov edi, 1
call exit
foo(char*): # #foo(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB2_3
movzx eax, al
cmp eax, 120
je .LBB2_2
pop rax
ret
.LBB2_3:
mov edi, 1
call exit
.LBB2_2:
xor edi, edi
call exit
bar(char*): # #bar(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB3_3
movzx eax, al
cmp eax, 120
je .LBB3_2
pop rax
ret
.LBB3_3:
mov edi, 1
call exit
.LBB3_2:
xor edi, edi
call exit
main: # #main
xor eax, eax
ret
Notice that foo and bar are identical. Do other compilers do something similar? Well...
gcc x64 5.4 using -O2
check_one(char):
cmp dil, 120
je .L6
rep ret
.L6:
push rax
xor edi, edi
call exit
check_two(char):
cmp dil, 88
je .L11
rep ret
.L11:
push rax
mov edi, 1
call exit
bar(char*):
sub rsp, 8
movzx eax, BYTE PTR [rdi]
cmp al, 120
je .L16
cmp al, 88
je .L17
add rsp, 8
ret
.L16:
xor edi, edi
call exit
.L17:
mov edi, 1
call exit
foo(char*):
jmp bar(char*)
main:
sub rsp, 24
movabs rax, 7956005065853857651
mov QWORD PTR [rsp], rax
mov rdi, rsp
mov eax, 103
mov WORD PTR [rsp+8], ax
call bar(char*)
mov rdi, rsp
call bar(char*)
xor eax, eax
add rsp, 24
ret
Well, if there were any doubt that foo and bar are equivalent, at least to the compiler, I think this:
foo(char*):
jmp bar(char*)
is a strong argument they indeed are.
In C, there's no extra runtime cost associated with either the unary & or * operators: & resolves to a simple address computation, and * is just the memory access you were going to perform anyway. So there's no difference in runtime between
check_one(*str);
check_two(*str);
and
char c = *str;
check_one( c );
check_two( c );
ignoring the overhead of the assignment.
That's not necessarily true in C++, since you can overload those operators.
tldr;
If you are programming in C, then the & operator is used to obtain the address of a variable, and * is used to get the value of that variable, given its address.
This is also the reason why, in C, when you pass a string to a function you should also convey its length: a char * parameter alone does not say what it points at, so someone unfamiliar with your logic who sees only the function signature cannot tell whether it is called as bar(&some_char) or bar(some_cstr).
To conclude, if you have a variable x of type someType, then &x results in a someType * holding the address of x, and applying * to that address gives back the value of x. Functions in C only take values and pointers as parameters; you cannot create a function whose parameter type is a reference like C++'s &x or &&x.
Also your examples can be rewritten as:
check_one(str[0])
check_two(str[0])
AFAIK, in x86 and x64 your variables are stored in memory (unless declared with the register keyword) and accessed via their addresses.
const int foo = 5 corresponds to foo dd 5, and check_one(*foo) to push dword [foo]; call check_one.
If you create additional variable c, then it looks like:
c resd 1
...
mov eax, [foo]
mov dword [c], eax ; Variable foo just copied to c
push dword [c]
call check_one
And nothing changed, except additional copying and memory allocation.
I think the compiler's optimizer deals with it and makes both cases as fast as possible. So you can use the more readable variant.

Difference in for loops of old and new GCC's generated assembly code

I am reading a chapter about assembly code, which has an example. Here is the C program:
#include <stdio.h>

int main()
{
int i;
for(i=0; i < 10; i++)
{
puts("Hello, world!\n");
}
return 0;
}
Here is the assembly code provided in the book:
0x08048384 <main+0>: push ebp
0x08048385 <main+1>: mov ebp,esp
0x08048387 <main+3>: sub esp,0x8
0x0804838a <main+6>: and esp,0xfffffff0
0x0804838d <main+9>: mov eax,0x0
0x08048392 <main+14>: sub esp,eax
0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>: jle 0x80483a3 <main+31>
0x080483a1 <main+29>: jmp 0x80483b6 <main+50>
0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>: call 0x80482a8 <_init+56>
0x080483af <main+43>: lea eax,[ebp-4]
0x080483b2 <main+46>: inc DWORD PTR [eax]
0x080483b4 <main+48>: jmp 0x804839b <main+23>
Here is part of my version:
0x0000000000400538 <+8>: mov DWORD PTR [rbp-0x4],0x0
=> 0x000000000040053f <+15>: jmp 0x40054f <main+31>
0x0000000000400541 <+17>: mov edi,0x4005f0
0x0000000000400546 <+22>: call 0x400410 <puts#plt>
0x000000000040054b <+27>: add DWORD PTR [rbp-0x4],0x1
0x000000000040054f <+31>: cmp DWORD PTR [rbp-0x4],0x9
0x0000000000400553 <+35>: jle 0x400541 <main+17>
My question is: why does the book's version assign 0 to the variable (mov DWORD PTR [ebp-4],0x0) and compare right after that with cmp, while my version assigns and then does jmp 0x40054f <main+31> to where the cmp is?
It seems more logical to assign and compare without any jump, because that is the order inside the for loop.
Why did your compiler do something different than a different compiler that was used in the book? Because it's a different compiler. No two compilers will compile all code the same, even very trivial code can be compiled vastly different by two different compilers or even two versions of the same compiler. And it's quite obvious both were compiled without any optimization, with optimization the results would be even more different.
Let's reason about what the for loop does.
for (i = 0; i < 10; i++) {
code;
}
Let's write it a little bit closer to the assembler that the first compiler generated.
i = 0;
start: if (i > 9) goto out;
code;
i++;
goto start;
out:
Now the same thing for "my version":
i = 0;
goto cmp;
start: code;
i++;
cmp: if (i < 10) goto start;
The clear difference here is that in "my version" only one jump executes per loop iteration, while the book version has two. It's quite a common way to generate loops in more modern compilers, because of how sensitive CPUs are to branches. Many compilers generate code like this even without any optimization, because it performs better in most cases. Older compilers didn't, either because they didn't think of it or because this trick lived in an optimization stage that wasn't enabled when the code in the book was compiled.
Notice that a compiler with any kind of optimization enabled wouldn't even do that first goto cmp because it would know that it was unnecessary. Try compiling your code with optimization enabled (you say you use gcc, give it the -O2 flag) and see how vastly different it will look after that.
You didn't quote the full assembly-language body of the function from your textbook, but my psychic powers tell me that it looked something like this (also, I've replaced literal addresses with labels, for clarity):
# ... establish stack frame ...
mov DWORD PTR [rbp-4],0x0
cmp DWORD PTR [rbp-4],0x9
jg .L0
.L1:
mov rdi, .Lconst0
call puts
add DWORD PTR [rbp-0x4],0x1
cmp DWORD PTR [rbp-0x4],0x9
jle .L1
.L0:
# ... return from function ...
GCC has noticed that it can eliminate the initial cmp and conditional jump by replacing them with an unconditional jmp down to the cmp at the bottom of the loop, so that is what it did. This is a standard optimization called loop inversion. Apparently it does this even with the optimizer off; with optimization on, it would also have noticed that the initial exit test must be false, hoisted out the address load, placed the loop index in a register, and converted it to a count-down loop so it could eliminate the cmp altogether; something like this:
# ... establish stack frame ...
mov ebx, 10
mov r14, .Lconst0
.L1:
mov rdi, r14
call puts
dec ebx
jne .L1
# ... return from function ...
(The above was actually generated by Clang. My version of GCC did something else, equally sensible but harder to explain.)
