I need to understand HOW the longjmp function works; I know what it does, but I need to know how it does it.
I tried to disassemble the code in gdb, but I can't understand some of the steps. The code is:
0xb7ead420 <siglongjmp+0>: push %ebp
0xb7ead421 <siglongjmp+1>: mov %esp,%ebp
0xb7ead423 <siglongjmp+3>: sub $0x18,%esp
0xb7ead426 <siglongjmp+6>: mov %ebx,-0xc(%ebp)
0xb7ead429 <siglongjmp+9>: call 0xb7e9828f <_Unwind_Find_FDE@plt+119>
0xb7ead42e <siglongjmp+14>: add $0x12bbc6,%ebx
0xb7ead434 <siglongjmp+20>: mov %esi,-0x8(%ebp)
0xb7ead437 <siglongjmp+23>: mov 0xc(%ebp),%esi
0xb7ead43a <siglongjmp+26>: mov %edi,-0x4(%ebp)
0xb7ead43d <siglongjmp+29>: mov 0x8(%ebp),%edi
0xb7ead440 <siglongjmp+32>: mov %esi,0x4(%esp)
0xb7ead444 <siglongjmp+36>: mov %edi,(%esp)
0xb7ead447 <siglongjmp+39>: call 0xb7ead4d0
0xb7ead44c <siglongjmp+44>: mov 0x18(%edi),%eax
0xb7ead44f <siglongjmp+47>: test %eax,%eax
0xb7ead451 <siglongjmp+49>: jne 0xb7ead470 <siglongjmp+80>
0xb7ead453 <siglongjmp+51>: test %esi,%esi
0xb7ead455 <siglongjmp+53>: mov $0x1,%eax
0xb7ead45a <siglongjmp+58>: cmove %eax,%esi
0xb7ead45d <siglongjmp+61>: mov %esi,0x4(%esp)
0xb7ead461 <siglongjmp+65>: mov %edi,(%esp)
0xb7ead464 <siglongjmp+68>: call 0xb7ead490
0xb7ead469 <siglongjmp+73>: lea 0x0(%esi,%eiz,1),%esi
0xb7ead470 <siglongjmp+80>: lea 0x1c(%edi),%eax
0xb7ead473 <siglongjmp+83>: movl $0x0,0x8(%esp)
0xb7ead47b <siglongjmp+91>: mov %eax,0x4(%esp)
0xb7ead47f <siglongjmp+95>: movl $0x2,(%esp)
0xb7ead486 <siglongjmp+102>: call 0xb7ead890 <sigprocmask>
0xb7ead48b <siglongjmp+107>: jmp 0xb7ead453 <siglongjmp+51>
Can someone briefly explain the code to me, or point out where I can find the source code on my system?
Here is the i386 code for longjmp, in the standard i386 ABI, without any crazy extensions for interaction with C++, exceptions, cleanup functions, signal mask, etc.:
mov 4(%esp),%edx
mov 8(%esp),%eax
test %eax,%eax
jnz 1f
inc %eax
1:
mov (%edx),%ebx
mov 4(%edx),%esi
mov 8(%edx),%edi
mov 12(%edx),%ebp
mov 16(%edx),%ecx
mov %ecx,%esp
mov 20(%edx),%ecx
jmp *%ecx
Mostly, it restores the registers and the stack to the state they were in at the time of the corresponding setjmp(). There is some additional cleanup required (fixing up signal handling and unwinding pending stack handlers), as well as returning a different value as the apparent return value of setjmp, but restoring the saved state is the essence of the operation.
For it to work, the frame in which setjmp was called must still be live; you can only jump back up the call stack, never into a function that has already returned. longjmp is a brutish way to forget everything that has been called below that level of the call stack (or function-call nesting sequence), mostly by simply resetting the stack pointer to the frame that was current when setjmp was called.
For it to work cleanly, whatever cleanup would normally happen when the intermediate functions return still has to be done somewhere; plain C longjmp does not do this for you. Resetting the stack to a shallower point releases all the automatic variables, but if one of those was, say, a FILE *, the file still needs to be closed and its I/O buffer freed.
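A minimal, standard C example of the visible behaviour (setjmp returns 0 on the direct call, and appears to return again with the value passed to longjmp):
#include <setjmp.h>
#include <stdio.h>
static jmp_buf env;
static void fail(void)
{
    /* Abandons this frame and resumes in main() as if setjmp
       had just returned 42. */
    longjmp(env, 42);
}
int main(void)
{
    int rc = setjmp(env);          /* 0 on the direct call */
    if (rc == 0) {
        puts("calling fail()...");
        fail();                    /* never returns normally */
    }
    printf("came back via longjmp, setjmp returned %d\n", rc);
    return 0;
}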
I think you need to read about procedure activation records and call stacks, and look at the structure of setjmp.h's jmp_buf.
Quoted from Expert C Programming: Deep C Secrets:
Setjmp saves a copy of the program counter and the current pointer to the top of the stack. This saves some initial values, if you like. Then longjmp restores these values effectively transferring control and resetting the state back to where you were when you did the save. It's termed "unwinding the stack", because you unroll activation records from the stack until you get to the saved one.
Also have a look at page 153 here.
The stack frame layout is highly dependent on the machine and the executable, but the idea is the same.
In Windows x64 MASM:
.code
my_jmp_buf STRUCT
_Frame QWORD ?;
_Rbx QWORD ?;
_Rsp QWORD ?;
_Rbp QWORD ?;
_Rsi QWORD ?;
_Rdi QWORD ?;
_R12 QWORD ?;
_R13 QWORD ?;
_R14 QWORD ?;
_R15 QWORD ?;
_Rip QWORD ?;
_MxCsr DWORD ?;
_FpCsr WORD ?;
_Spare WORD ?;
_Xmm6 XMMWORD ?;
_Xmm7 XMMWORD ?;
_Xmm8 XMMWORD ?;
_Xmm9 XMMWORD ?;
_Xmm10 XMMWORD ?;
_Xmm11 XMMWORD ?;
_Xmm12 XMMWORD ?;
_Xmm13 XMMWORD ?;
_Xmm14 XMMWORD ?;
_Xmm15 XMMWORD ?;
my_jmp_buf ENDS
;extern "C" int my_setjmp(jmp_buf env);
public my_setjmp
my_setjmp PROC
mov rax, [rsp] ;save ip
mov (my_jmp_buf ptr[rcx])._Rip, rax
lea rax, [rsp + 8] ;save sp before call this function
mov (my_jmp_buf ptr[rcx])._Rsp, rax
mov (my_jmp_buf ptr[rcx])._Frame, rax
;save gprs
mov (my_jmp_buf ptr[rcx])._Rbx,rbx
mov (my_jmp_buf ptr[rcx])._Rbp,rbp
mov (my_jmp_buf ptr[rcx])._Rsi,rsi
mov (my_jmp_buf ptr[rcx])._Rdi,rdi
mov (my_jmp_buf ptr[rcx])._R12,r12
mov (my_jmp_buf ptr[rcx])._R13,r13
mov (my_jmp_buf ptr[rcx])._R14,r14
mov (my_jmp_buf ptr[rcx])._R15,r15
;save fp and xmm
stmxcsr (my_jmp_buf ptr[rcx])._MxCsr
fnstcw (my_jmp_buf ptr[rcx])._FpCsr
movdqa (my_jmp_buf ptr[rcx])._Xmm6,xmm6
movdqa (my_jmp_buf ptr[rcx])._Xmm7,xmm7
movdqa (my_jmp_buf ptr[rcx])._Xmm8,xmm8
movdqa (my_jmp_buf ptr[rcx])._Xmm9,xmm9
movdqa (my_jmp_buf ptr[rcx])._Xmm10,xmm10
movdqa (my_jmp_buf ptr[rcx])._Xmm11,xmm11
movdqa (my_jmp_buf ptr[rcx])._Xmm12,xmm12
movdqa (my_jmp_buf ptr[rcx])._Xmm13,xmm13
movdqa (my_jmp_buf ptr[rcx])._Xmm14,xmm14
movdqa (my_jmp_buf ptr[rcx])._Xmm15,xmm15
xor rax,rax
ret
my_setjmp ENDP
;extern "C" void my_longjmp(jmp_buf env, int value);
public my_longjmp
my_longjmp PROC
;restore fp and xmm
movdqa xmm15,(my_jmp_buf ptr[rcx])._Xmm15
movdqa xmm14,(my_jmp_buf ptr[rcx])._Xmm14
movdqa xmm13,(my_jmp_buf ptr[rcx])._Xmm13
movdqa xmm12,(my_jmp_buf ptr[rcx])._Xmm12
movdqa xmm11,(my_jmp_buf ptr[rcx])._Xmm11
movdqa xmm10,(my_jmp_buf ptr[rcx])._Xmm10
movdqa xmm9,(my_jmp_buf ptr[rcx])._Xmm9
movdqa xmm8,(my_jmp_buf ptr[rcx])._Xmm8
movdqa xmm7,(my_jmp_buf ptr[rcx])._Xmm7
movdqa xmm6,(my_jmp_buf ptr[rcx])._Xmm6
fldcw (my_jmp_buf ptr[rcx])._FpCsr
ldmxcsr (my_jmp_buf ptr[rcx])._MxCsr
;restore gprs
mov r15, (my_jmp_buf ptr[rcx])._R15
mov r14, (my_jmp_buf ptr[rcx])._R14
mov r13, (my_jmp_buf ptr[rcx])._R13
mov r12, (my_jmp_buf ptr[rcx])._R12
mov rdi, (my_jmp_buf ptr[rcx])._Rdi
mov rsi, (my_jmp_buf ptr[rcx])._Rsi
mov rbp, (my_jmp_buf ptr[rcx])._Rbp
mov rbx, (my_jmp_buf ptr[rcx])._Rbx
;restore sp
mov rsp, (my_jmp_buf ptr[rcx])._Rsp
;restore ip
mov rcx, (my_jmp_buf ptr[rcx])._Rip ;must be the last load from the buffer, since rcx is overwritten here
;return value
mov rax, rdx
jmp rcx
my_longjmp ENDP
END
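A quick C driver to exercise these routines (a sketch only: it assumes the .asm above is assembled and linked into an MSVC x64 build, and it uses a raw aligned byte buffer in place of jmp_buf; 16-byte alignment is needed because of the movdqa stores):
#include <stdio.h>
/* Prototypes follow the comments in the .asm source above. */
extern int  my_setjmp(void *env);
extern void my_longjmp(void *env, int value);
/* Oversized, 16-byte-aligned stand-in for the my_jmp_buf STRUCT. */
static __declspec(align(16)) unsigned char jb[256];
int main(void)
{
    int rc = my_setjmp(jb);
    if (rc == 0) {
        puts("direct return from my_setjmp");
        my_longjmp(jb, 7);     /* comes back here with rc == 7 */
    }
    printf("returned via my_longjmp with %d\n", rc);
    return 0;
}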
You pass setjmp() a buffer parameter. It then stores the current register info etc. into this buffer. The call to longjmp() then restores these values from the buffer. Furthermore, what wallyk said.
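Conceptually (an illustration only: real jmp_buf layouts are opaque, differ per ABI, and glibc additionally mangles the saved SP/PC), you can picture the buffer as a struct of the callee-saved machine state:
/* Illustrative layout matching the i386 code earlier in this page;
   NOT the actual libc definition. */
struct fake_jmp_buf_i386 {
    unsigned int ebx, esi, edi;  /* callee-saved registers           */
    unsigned int ebp;            /* caller's frame pointer           */
    unsigned int esp;            /* stack pointer at the setjmp call */
    unsigned int eip;            /* setjmp's return address          */
};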
Here are versions of setjmp and longjmp I wrote for a small C library subset (written and tested in Visual Studio 2008). The assembly code lives in a separate .asm file.
.586
.MODEL FLAT, C ; Flat memory model, C calling conventions.
;.STACK ; Not required for this example.
;.DATA ; Not required for this example.
.code
; Simple version of setjmp (x86-32 bit).
;
; Saves ebp, ebx, edi, esi, esp and eip in that order.
;
setjmp_t proc
push ebp
mov ebp, esp
push edi
mov edi, [ebp+8] ; Pointer to jmpbuf struct.
mov eax, [ebp] ; Save ebp, note we are saving the stored version on the stack.
mov [edi], eax
mov [edi+4], ebx ; Save ebx
mov eax, [ebp-4]
mov [edi+8], eax ; Save edi, note we are saving the stored version on the stack.
mov [edi+12], esi ; Save esi
mov eax, ebp;
add eax, 8
mov [edi+16], eax ; Save sp, note saving sp pointing to last item on stack just before call to setjmp.
mov eax, [ebp+4]
mov [edi+20], eax ; Save return address (will be used as the jump address in longjmp()).
xor eax, eax ; return 0;
pop edi
pop ebp
ret
setjmp_t endp
; Simple version of longjmp (x86-32 bit).
;
; Restores ebp, ebx, edi, esi, esp and eip.
;
longjmp_t proc
mov edi, [esp+4] ; Pointer to jmpbuf struct.
mov eax, [esp+8] ; Get return value (value passed to longjmp).
mov ebp, [edi] ; Restore ebp.
mov ebx, [edi+4] ; Restore ebx.
mov esi, [edi+12] ; Restore esi.
mov esp, [edi+16] ; Restore stack pointer.
mov ecx, [edi+20] ; Original return address to setjmp.
mov edi, [edi+8] ; Restore edi, note, done last as we were using edi up to this point.
jmp ecx ; Wing and a prayer...
longjmp_t endp
end
A C code snippet to test it:
extern "C" int setjmp_t( int *p);
extern "C" int longjmp_t( int *p, int n);
jmp_buf test2_buff;
void DoTest2()
{
int x;
x = setjmp_t( test2_buff);
printf( "setjmp_t return - %d\n", x);
switch (x)
{
case 0:
printf( "About to do long jump...\n");
longjmp_t( test2_buff, 99);
break;
default:
printf( "Here becauuse of long jump...\n");
break;
}
printf( "Test2 passed!\n");
}
Note that I used the declaration from 'setjmp.h' for the buffer but if you want you can use an array of ints (minimum of 6 ints).
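For example (a hypothetical extra test, with the routines above linked in), a plain int array can stand in for the jmp_buf:
extern "C" int setjmp_t( int *p);
extern "C" int longjmp_t( int *p, int n);
// Six slots: ebp, ebx, edi, esi, esp, return address.
int raw_buf[6];
void DoTest3()
{
    int x = setjmp_t( raw_buf);
    if (x == 0)
        longjmp_t( raw_buf, 42);      // comes back with x == 42
    printf( "raw buffer test returned %d\n", x);
}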
Here is a basic program I wrote on the Godbolt compiler explorer, and it's as simple as:
#include<stdio.h>
void main()
{
int a = 10;
int *p = &a;
printf("%d", *p);
}
The result I get after compilation:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-12], 10
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
Question: I got the part about pushing rbp, making the stack frame by reserving a 16-byte block, how a value is moved from a register to a stack location and vice versa, and how LEA's job is to figure out the address.
Problem:
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
LEA gets the address of [rbp-12] into rax,
then that value (the address of a) is stored into [rbp-8],
but the next line says to move the value at [rbp-8] right back into rax, which seems redundant. Then rax is dereferenced into eax. I don't understand the amount of work here. Why couldn't it have just been
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov eax, QWORD PTR [rbp-8]
and be done with it? Because in the original listing, the address of [rbp-12] is put into rax, then rax is stored to [rbp-8], then [rbp-8] is loaded back into rax, and only then is rax dereferenced into eax. Couldn't [rbp-8] have just been copied directly into eax? I guess not, but my question is why.
I know pointers involve dereferencing, and I understand how LEA grabs the address of [rbp-12], but I completely lost track of where it switches from addresses to grabbing values from addresses. After that, I didn't understand any of the asm lines.
You're seeing very un-optimized code. Here's my line-by-line interpretation:
.LC0:
.string "%d" ; Format string for printf
main:
push rbp ; Save original base pointer
mov rbp, rsp ; Set base pointer to beginning of stack frame
sub rsp, 16 ; Allocate space for stack frame
mov DWORD PTR [rbp-12], 10 ; Initialize variable 'a'
lea rax, [rbp-12] ; Load effective address of 'a'
mov QWORD PTR [rbp-8], rax ; Store address of 'a' in 'p'
mov rax, QWORD PTR [rbp-8] ; Load 'p' into rax (even though it's already there - heh!)
mov eax, DWORD PTR [rax] ; Load 32-bit value of '*p' into eax
mov esi, eax ; Load value to print into esi
mov edi, OFFSET FLAT:.LC0 ; Load format string address into edi
mov eax, 0 ; Zero out eax (printf is variadic: AL holds the number of vector registers used for arguments, none here)
call printf ; Make the printf call
nop ; No-op (not sure why)
leave ; Remove the stack frame
ret ; Return
Compilers, when not optimizing, generate code like this as they parse the code you gave them. It's doing a lot of unnecessary stuff, but it is quicker to generate and makes using a debugger easier.
Compare this with the optimized code (-O2):
.LC0:
.string "%d" ; Format string for printf
main:
mov esi, 10 ; Don't need those variables -- just a 10 to pass to printf!
mov edi, OFFSET FLAT:.LC0 ; Load format string address into edi
xor eax, eax ; xor-zeroing is the idiomatic way to load 0 (smaller encoding, at least as fast); it also tells the variadic printf that no vector registers are used
jmp printf ; Just jmp to printf -- it will handle the return
The optimizer found that the variables weren't necessary, so no stack frame is created. Nothing is left but the printf call! And it's done as a jmp (a tail call), since nothing else needs to happen here once printf is complete.
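Put differently, at -O2 the compiler has proved that a and p have no observable effect of their own, so the whole program collapses to roughly this hand-written equivalent (my paraphrase, not the questioner's code):
#include <stdio.h>
int main(void)
{
    /* 'a' and '*p' were proven to be just the constant 10. */
    printf("%d", 10);
    return 0;
}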
I'm working on a calculator, and at the part where operations are actually executed, I have a big long switch block that looks something like this (the cases go up linearly starting at 1):
switch(operator) {
case 1:
a = a + b;
break;
case 2:
a = a - b;
break;
...
case 20:
a = sin(a);
break;
I have quite a few operators and functions at this point and testing each case one at a time doesn't seem like it would be the fastest option.
Is there a way to use a table (such as an array of goto labels, or an array of function pointers) so that the "operator" variable would cause the program to jump to the appropriate operation without the program having to test for each of the cases? If so, how would I go about doing this, given the above code?
It cannot be faster than a switch, and I will show you why. Here is my version of your function with a switch:
void switch_fun(int operator) {
switch(operator) {
case 0:
a = a + b;
break;
case 1:
a = a - b;
break;
case 2:
a = a * b;
break;
}
}
Which translates to this assembly (with comments for non-assembly people):
switch_fun:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi ## copy 'operator'
cmp DWORD PTR [rbp-4], 2 ## compare 'operator' with 2
je .L2 ## if 'operator' equals 2, jump to .L2 (a few lines below)
cmp DWORD PTR [rbp-4], 0 ## same comparison for 0
je .L4
cmp DWORD PTR [rbp-4], 1 ## and for 1
je .L5
jmp .L6 ## if nothing matches, jump to the end of the function
.L4:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
add eax, edx ## addition, like in case 0:
mov DWORD PTR a[rip], eax
jmp .L3 ## jump to the end
.L5:
mov eax, DWORD PTR a[rip]
mov edx, DWORD PTR b[rip]
sub eax, edx ## subtraction
mov DWORD PTR a[rip], eax
jmp .L3
.L2:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
imul eax, edx ## multiplication
mov DWORD PTR a[rip], eax
nop
.L3:
.L6:
nop
pop rbp
ret
It's a pretty simple construction and probably impossible to optimize much further, but let's try. Here is a version of the same function that uses an array of addresses of goto labels instead.
void array_fun(int operator) {
void* labels[] = {&&addition, &&subtraction, &&mul};
goto *labels[operator];
addition:
a = a + b;
goto outer;
subtraction:
a = a - b;
goto outer;
mul:
a = a * b;
outer:
return;
}
Ignoring the fact that this code is really unsafe (if operator is greater than 2 it will crash the program), let's see what it looks like in assembly.
array_fun:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-36], edi
mov QWORD PTR [rbp-32], OFFSET FLAT:.L8
mov QWORD PTR [rbp-24], OFFSET FLAT:.L9
mov QWORD PTR [rbp-16], OFFSET FLAT:.L10 ## here you eliminate some comparisons,
## but you add some new moves and address calculations; everything below is practically the same as in the first function, so it's definitely not faster than the switch, and probably slower
mov eax, DWORD PTR [rbp-36]
cdqe
mov rax, QWORD PTR [rbp-32+rax*8]
nop
jmp rax
.L8:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
add eax, edx
mov DWORD PTR a[rip], eax
jmp .L12
.L9:
mov eax, DWORD PTR a[rip]
mov edx, DWORD PTR b[rip]
sub eax, edx
mov DWORD PTR a[rip], eax
jmp .L12
.L10:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
imul eax, edx
mov DWORD PTR a[rip], eax
.L12:
nop
pop rbp
ret
TL;DR: don't try to hand-optimize a switch on integers; there is essentially nothing to be gained over what the compiler already generates.
EDIT: With an array of pointers to functions it would be even worse, because you add the overhead of an actual function call.
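For comparison, the function-pointer version the question mentions might look something like this (a sketch with made-up names); each dispatch now pays for a real call and return, and nothing can be inlined:
/* Sketch of the function-pointer alternative. */
static int a, b;
static void op_add(void) { a = a + b; }
static void op_sub(void) { a = a - b; }
static void op_mul(void) { a = a * b; }
static void (*const ops[])(void) = { op_add, op_sub, op_mul };
void table_fun(int op)
{
    if (op >= 0 && op < 3)   /* at least it can be bounds-checked */
        ops[op]();           /* indirect call: extra call/ret overhead */
}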
The C source:
int sum(int a, int b) {
return a + b;
}
int main() {
int (*ptr_sum_1)(int,int) = sum; // assign the address of the "sum"
int (*ptr_sum_2)(int,int) = sum; // to the function pointer
int (*ptr_sum_3)(int,int) = sum;
int a = (*ptr_sum_1)(2,4); // call the "sum" through the pointer
int b = sum(2,4); // call the "sum" by usual way
return 0;
}
The crucial part of the assembly code:
lea rax, sum[rip]
mov QWORD PTR -24[rbp], rax
lea rax, sum[rip]
mov QWORD PTR -16[rbp], rax
lea rax, sum[rip]
mov QWORD PTR -8[rbp], rax
The executing program instructions from GDB:
0x5fa <sum>: push rbp
0x5fb <sum+1>: mov rbp,rsp
0x5fe <sum+4>: mov DWORD PTR [rbp-0x4],edi
0x601 <sum+7>: mov DWORD PTR [rbp-0x8],esi
0x604 <sum+10>: mov edx,DWORD PTR [rbp-0x4]
0x607 <sum+13>: mov eax,DWORD PTR [rbp-0x8]
0x60a <sum+16>: add eax,edx
0x60c <sum+18>: pop rbp
0x60d <sum+19>: ret
0x60e <main>: push rbp
0x60f <main+1>: mov rbp,rsp
0x612 <main+4>: sub rsp,0x20
0x616 <main+8>: lea rax,[rip+0xffffffffffffffdd] # 0x5fa <sum>
0x61d <main+15>: mov QWORD PTR [rbp-0x18],rax
0x621 <main+19>: lea rax,[rip+0xffffffffffffffd2] # 0x5fa <sum>
0x628 <main+26>: mov QWORD PTR [rbp-0x10],rax
0x62c <main+30>: lea rax,[rip+0xffffffffffffffc7] # 0x5fa <sum>
0x633 <main+37>: mov QWORD PTR [rbp-0x8],rax
0x637 <main+41>: mov rax,QWORD PTR [rbp-0x18]
0x63b <main+45>: mov esi,0x4
0x640 <main+50>: mov edi,0x2
0x645 <main+55>: call rax
0x647 <main+57>: mov DWORD PTR [rbp-0x20],eax
0x64a <main+60>: mov esi,0x4
0x64f <main+65>: mov edi,0x2
0x654 <main+70>: call 0x5fa <sum>
0x659 <main+75>: mov DWORD PTR [rbp-0x1c],eax
0x65c <main+78>: mov eax,0x0
0x661 <main+83>: leave
0x662 <main+84>: ret
I think the sum label is just the starting address of the sum procedure (0x5fa), so I don't understand why gcc can't use it directly but instead uses the sum[rip] calculation.
Question:
Why is sum[rip] used in the lea rax, sum[rip] instruction in assembly, instead of the simple sum label, e.g. lea rax, sum?
Will the mov rax, 0x5fa instruction do the same? Because we know the sum address after linking: the call 0x5fa <sum> instruction just uses it directly.
I believe it may depend on how your GCC was built, but on the Linux distributions I use, everything defaults to PIC (Position Independent Code) builds. PIC is better for both shared libraries and executables, because the result can be mapped anywhere in memory without a fixup pass, and it's better for security because ASLR can be applied.
With x86-64 there's no significant penalty for using PIC, so why wouldn't it be used everywhere?
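If you want to see the difference yourself, GCC's -fno-pie switch disables PIC code generation (and -no-pie disables PIE linking). Compiling a snippet like the one below both ways should show the lea rax, sum[rip] turn into a mov with an absolute OFFSET; this is a sketch, and the exact output will vary by GCC version:
/* pic_demo.c -- build it both ways and compare the asm for main:
 *   gcc -O0 -S pic_demo.c -o pie.s               (default PIE build)
 *   gcc -O0 -S -fno-pie pic_demo.c -o nopie.s    (non-PIC code generation)
 */
int sum(int a, int b) { return a + b; }
int main(void)
{
    int (*p)(int, int) = sum;   /* forces the compiler to materialize the address */
    return p(2, 4) - 6;         /* exit status 0 if the indirect call worked */
}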
I never learned the C language, so it confuses me. I'd just like to know if I did it correctly or where I need to improve. For this code I used 32-bit x86 assembly. Thanks.
This is what I am supposed to do:
Write a procedure with the signature
char *strchar(char *s1, char c1)
that returns a pointer to the first occurrence of the character c1 within the string s1 or, if not found, returns a null.
This is what I came up with:
strchar (char*, char):
push ebp
mov ebp, esp
mov dword ptr [ebp-24], edi
mov EAX , esi
mov BYTE PTR [ebp-28], al
.L5:
mov EAX , dword ptr [ebp-24]
movzx EAX , byte ptr [ EAX ]
test AL, AL
je .L2
mov EAX , dword PTR [ebp-24]
movzx EAX , BYTE PTR [ EAX ]
cmp BYTE PTR [ebp-28], al
jne .L3
mov eax, dword PTR [ebp-24]
jmp .L6
.L3:
add dword PTR [ebp-24], 1
jmp .L5
.L2:
LEA eax, [ebp-9]
MOV DWORD PTR [EBP-8], eax
MOV EAX, DWORD PTR [ebp-8]
.L6:
POP EBP
RET
The lines:
mov dword ptr [ebp-24], edi
mov EAX , esi
mov BYTE PTR [ebp-28], al
assume that a stack frame has been allocated for this function, which doesn't appear to be the case; I think you should have something like:
sub esp, 32
after the
mov ebp,esp
Also, the three lines after .L2 seem confused. The only way to reach .L2 is if the NUL (0) terminator is found in the string, at which point the code should return a NULL pointer.
The exit path in the code (.L6) leaves eax alone, so all that should be needed is:
L2:
mov eax, 0
It might make debugging easier if you kept the alias up to date; so:
L2:
mov eax, 0
mov [ebp-24], eax
Also, the calling convention used here is a bit odd: the string is passed in edi and the character in esi. Normally, in x86-32, these would both be passed on the stack. This looks like it might have been x86-64 code, converted to x86-32....
A final note: this assembly code looks like the output of a compiler with optimisations disabled. Generating the assembly with optimisations enabled often produces code that is easier to understand. This code, for example, could be written much more concisely as below, without even devolving into weird Intel ops:
strchar:
mov edx, esi
mov eax, edi
L:
mov dh, [eax]
test dh, dh
jz null
cmp dh, dl
je done
inc eax
jmp L
null:
mov eax, 0
done:
ret
Here is a version with stack-frame overhead
[global strchar]
strchar:
push ebp
mov ebp, esp
mov dl, byte [ebp + 12]
mov ecx, dword [ebp + 8]
xor eax, eax
.loop: mov al, [ecx]
or al, al
jz .exit
cmp al, dl
jz .found
add ecx, 1
jmp .loop
.found: mov eax, ecx
.exit:
leave
ret
And here is one without the stack-frame overhead
[global strchar]
strchar:
mov dl, byte [esp + 8]
mov ecx, dword [esp + 4]
xor eax, eax
.loop: mov al, [ecx]
or al, al
jz .exit
cmp al, dl
jz .found
add ecx, 1
jmp .loop
.found: mov eax, ecx
.exit:
ret
These use the 'cdecl' calling convention. For 'stdcall', change the last 'ret' to 'ret 8'.
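Assuming one of the listings above is assembled as a 32-bit object and linked in (on Linux, something like nasm -f elf32 strchar.asm followed by gcc -m32 main.c strchar.o), a minimal C driver would be:
#include <stdio.h>
/* The assembly above follows cdecl, so a plain prototype is enough. */
extern char *strchar(char *s1, char c1);
int main(void)
{
    char text[] = "hello world";
    char *hit  = strchar(text, 'w');   /* expected: pointer to "world" */
    char *miss = strchar(text, 'z');   /* expected: NULL               */
    printf("hit:  %s\n", hit  ? hit  : "(null)");
    printf("miss: %s\n", miss ? miss : "(null)");
    return 0;
}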
So, I am trying to see if there is any difference between using a jump table of function pointers and a switch statement for performing many small, one-command operations like these.
This is the code-to-assembly link I made.
Here is my actual code as well:
enum code {
ADD,
SUB,
MUL,
DIV,
REM
};
typedef struct {
int val;
} Value;
typedef struct {
enum code ins;
int operand;
} Op;
void run(Value* arg, Op* func)
{
switch(func->ins)
{
case ADD: arg->val += func->operand; break;
case SUB: arg->val -= func->operand; break;
case MUL: arg->val *= func->operand; break;
case DIV: arg->val /= func->operand; break;
case REM: arg->val %= func->operand; break;
}
}
My question is: based on the generated assembly in that link (or on the code), would there be any difference between this and making a bunch of small functions for the operations in the switch cases, putting pointers to them in an array, and calling them indexed by the same enum?
Using gcc x86_64 7.1.
void add(Value* arg, Op* func)
{
arg->val += func->operand;
}
static void (*jmptable[])(Value*, Op*) = {
&add
};
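Calling through the table would then be a one-liner like this (just a sketch of what I mean; only ADD is populated above, so the other enum values would need entries before this is safe to index):
void run_table(Value* arg, Op* func)
{
    jmptable[func->ins](arg, func);   /* indirect call instead of the switch */
}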
Assembly code paste:
run(Value*, Op*):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax]
cmp eax, 4
ja .L9
mov eax, eax
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
.L4:
.quad .L3
.quad .L5
.quad .L6
.quad .L7
.quad .L8
.L3:
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax+4]
add edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L5:
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax+4]
sub edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L6:
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax+4]
imul edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L7:
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov rdx, QWORD PTR [rbp-16]
mov esi, DWORD PTR [rdx+4]
cdq
idiv esi
mov edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L8:
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov rdx, QWORD PTR [rbp-16]
mov ecx, DWORD PTR [rdx+4]
cdq
idiv ecx
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
nop
.L2:
.L9:
nop
pop rbp
ret
A catchall answer to all these questions: you should measure.
Practically though, I'm betting on the switch version. Function calls have overhead (and they can be hardly inlined in this context), which you could eliminate with labels as values, which is a common compiler extension*, but you should really try all your options and measure if the performance of this piece of code matters to you greatly.
Otherwise, use whatever's most convenient to you.
*a switch is likely to generate a jump table equivalent to what you could compose from labels as values but it could switch between different implementations depending on the particular case values and their number
Can you spot the difference? Trust the compiler (it will do such micro-optimisations much better than you), and do not forget the break statements. Care about the algorithm, not about such small details.
https://godbolt.org/g/sPxse2
It looks like, due to branch prediction and bounds checking, using the switch labels as jump targets may be up to 20% faster on older systems; newer systems have better branch prediction. Basically, this relies on a compiler extension. You still have the switch, but it doesn't fall through to a single dispatcher: instead, each case has its own dispatcher that jumps directly into the next case's code. A number of popular VMs do this.
See here for more info and examples: https://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html
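Roughly, the pattern looks like this with GCC/Clang's labels-as-values extension (an illustrative sketch with made-up names, not code from the article): every handler ends by jumping straight to the next opcode's handler, so each opcode site gets its own indirect branch for the predictor to learn.
/* Replicated dispatch: requires the GCC/Clang computed-goto extension. */
int run_threaded(const unsigned char *code, int a, int b)
{
    static void *const dispatch[] = { &&do_add, &&do_sub, &&do_halt };
#define NEXT() goto *dispatch[*code++]
    NEXT();
do_add:  a += b; NEXT();
do_sub:  a -= b; NEXT();
do_halt: return a;
#undef NEXT
}
For example, run_threaded((const unsigned char[]){0, 0, 1, 2}, 1, 2) executes add, add, sub and returns 3.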