clang: compiling with non-flat x86 stack model

If I understand correctly, clang assumes that the stack segment for x86 is flat (has a base of 0). For example, when compiling with the following command line:
clang -cc1 -S -mllvm --x86-asm-syntax=intel -o - -triple i986-unknown-unknown -mrelocation-model static xxx.c
xxx.c:
void f()
{
int a = 5;
int *ap = &a;
int b = *ap;
}
the following assembly is produced:
f:
sub ESP, 12
lea EAX, DWORD PTR [ESP + 8]
mov DWORD PTR [ESP + 8], 5
mov DWORD PTR [ESP + 4], EAX
mov EAX, DWORD PTR [ESP + 4]
mov EAX, DWORD PTR [EAX]
mov DWORD PTR [ESP], EAX
add ESP, 12
ret
This can only be correct if the stack segment is flat: EAX contains an offset relative to the SS base, but mov EAX, DWORD PTR [EAX] dereferences it through DS by default.
Is it possible to compile the C code for SS with an arbitrary base?

When executing in protected mode, segment registers do not contain offsets you can use. They hold selectors, and each segment is independent of the others. As far as I know, no operating system in use today uses an ABI in which pointers carry segment information. That would be necessary here, because you can take the address of any local variable (which is placed on the stack). I don't think clang (or gcc, for that matter) can compile code for such a model.
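
To make the argument above concrete, here is a minimal C sketch of the situation. The names read_through, read_local_and_global and global_value are made up for illustration. The callee receives only a bare pointer and cannot know whether it refers to the stack or to static data, which is why a compiler that emits plain offsets has to assume both segments share the same base.

```c
#include <assert.h>
#include <stddef.h>

static int global_value = 7;    /* lives in static data, not on the stack */

/* The callee only sees an offset; nothing in the pointer says which
   segment it belongs to, so the load uses the default data segment. */
int read_through(const int *p)
{
    assert(p != NULL);
    return *p;
}

int read_local_and_global(void)
{
    int local_value = 5;        /* lives on the stack */

    /* Both calls pass a plain pointer of the same type. With a non-flat
       stack segment, the first call would read the wrong memory unless
       pointers also carried segment information. */
    return read_through(&local_value) + read_through(&global_value);
}
```

On a flat-model target read_local_and_global() returns 12. Under a segmented model the same source would need far pointers or a similar extension, which is the ABI support the answer says does not exist in current compilers.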

Related

returning Assembly Function In C Code x86 Compilation on 64 bit linux

C code
#include <stdio.h>
int fibonacci(int);
int main()
{
int x = fibonacci(3);
printf("Fibonacci is : %d",x);
return 0;
}
Assembly
section .text
global fibonacci
fibonacci:
push ebp;
mov ebp, esp;
; initialize
mov dword [prev], 0x00000000;
mov dword [cur], 0x00000001;
mov byte [it], 0x01;
mov eax, dword [ebp + 8]; // n = 3
mov byte [n], al;
getfib:
xor edx,edx;
mov dl, byte [n];
cmp byte [it] , dl;
jg loopend;
mov eax,dword [prev];
add eax, dword [cur];
mov ebx, dword [cur];
mov dword [prev], ebx;
mov dword [cur] , eax;
inc byte [it];
jmp getfib;
loopend:
mov eax, dword [cur];
pop ebp;
ret;
section .bss
it resb 1
prev resd 1
cur resd 1
n resb 1
I was trying to call this assembly function from C code. While debugging, I saw that the value of the variable x in the C code is correct, but an error occurs when I call the printf function.
I need help with this.
Command to compile:
nasm -f elf32 asmcode.asm -o a.o
gcc -ggdb -no-pie -m32 a.o ccode.c -o a.out

Your code does not preserve the ebx register, which is a callee-saved register. The main function apparently uses ebx as a base register for position-independent addressing to obtain the address of the format string for printf (32-bit x86 has no rip-relative addressing, so the compiler keeps a base address in ebx instead). This fails because your code overwrote ebx.
To fix this issue, make sure to save all callee-saved registers before you use them and then restore their value on return. For example, you can do
fibonacci:
push ebp
mov ebp, esp
push ebx ; <---
...
pop ebx ; <---
pop ebp
ret
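
Once ebx is preserved, it can help to have a C reference for the same loop to compare results against. The following is a sketch only: fibonacci_ref is a hypothetical name, and the upper bound on n is an assumption added here to keep the result inside a 32-bit int.

```c
#include <assert.h>

/* Hypothetical reference helper mirroring the loop in the assembly:
   prev = 0, cur = 1, then for it = 1..n: (prev, cur) = (cur, prev + cur). */
int fibonacci_ref(int n)
{
    assert(n >= 0 && n <= 45);  /* assumed bound: result fits in 32 bits */

    int prev = 0;
    int cur = 1;
    for (int it = 1; it <= n; ++it) {
        int next = prev + cur;
        prev = cur;
        cur = next;
    }
    return cur;
}
```

With this indexing, fibonacci_ref(3) is 3, which is the value the assembly routine should return for the call in main.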

Why does MSVC not inline variable names in the generated assembly code?

When compiling the program below to assembly using MSVC:
int main() {int var1 = 1; int var2 = 2;}
Then the generated assembly code will look like this:
var1$ = 0
var2$ = 4
main PROC
sub rsp, 24
mov DWORD PTR var1$[rsp], 1
mov DWORD PTR var2$[rsp], 2
xor eax, eax
add rsp, 24
ret 0
main ENDP
I'm wondering why var1$ = 0 and var2$ = 4 are necessary, since they could be avoided by inlining the offsets in the mov instructions, as in mov DWORD PTR [rsp], 1 and mov DWORD PTR [rsp + 4], 2. This would result in a smaller assembly listing, and maybe also a smaller .obj and .exe file.
I'm compiling using CL Test.c /Fa

Which actions are taken in compile time?

I cannot find a list of the things that are optimized at compile time, or of the compiler optimizations that perform the reduction when all the necessary information is available. The question is tagged C, but what prompted it is C++'s constexpr: I naively thought that the things the keyword allows were already available "out of the box".
EDIT: the basic example is
#include <stdio.h>
int main(int argc, char *argv[]){
int a = 10;
int b = 8;
printf("Answer is %d", a-b);
}
If compiled with -O0 (x86-64 GCC 9.2), we get a real load and subtraction at run time:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 10
mov DWORD PTR [rbp-8], 8
mov eax, DWORD PTR [rbp-4]
sub eax, DWORD PTR [rbp-8]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
When -O3 (x86-64 GCC9.2) is used, subtraction is optimized out and we load only 2
mov esi, 2
mov edi, OFFSET FLAT:.LC0
xor eax, eax
jmp printf
We can probably replace some function calls with constant values, or simplify functions, if the values are available in the code. What is the name of the optimization that does this?
EDIT2: The question is mainly about GCC or Clang on the x86-64 platform. I understand that a compiler is not required to optimize any code, but programmers often use compilers with options other than -O0. As noted by KamilCuk, the code above could be reduced to puts("Answer is 2"), but it is not, and I don't know why. I'd highly appreciate it if someone provided an overview (or a link to one) of the optimizations that GCC or Clang perform at different optimization levels.
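
The reduction shown in the -O3 listing is usually described as constant propagation followed by constant folding: the known values of a and b are substituted into a-b, and the resulting constant expression is evaluated by the compiler. A minimal sketch of the before and after forms, with hypothetical function names, assuming nothing about which pass of GCC or Clang actually performs it:

```c
#include <assert.h>

/* The computation as written in the source. */
int difference(void)
{
    int a = 10;
    int b = 8;
    return a - b;
}

/* What an optimizing build is allowed to reduce it to, because both
   operands are known at compile time. */
int difference_folded(void)
{
    return 2;
}

/* The two forms must be observably identical. */
void sanity_check(void)
{
    assert(difference() == difference_folded());
}
```

Comparing the assembly for difference() at -O0 and at -O2 or higher shows the same effect as in the printf example: the optimized version typically just loads 2 into the return register.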

Is there any way to save registers before jumping into function?

This is my first question, because I couldn't find anything related to this topic.
Recently, while making a class for my C game engine project, I found something interesting:
struct Stack *S1 = new(Stack);
struct Stack *S2 = new(Stack);
S1->bPush(S1, 1, 2); //at this point
bPush is a function pointer in the structure.
So I wondered what the -> operator does in that case, and I discovered:
mov r8b,2 ; a char, written to a low point of register r8
mov dl,1 ; also a char, but to d this time
mov rcx,qword ptr [S1] ; this is the 1st parameter of function
mov rax,qword ptr [S1] ; !Why cannot I use this one?
call qword ptr [rax+1A0h] ; pointer call
So I assume -> writes the object pointer to rcx, and I'd like to use it in functions (methods, as they would be). So the question is, how can I do something like
push rcx
// do other call vars
pop rcx
mov qword ptr [this], rcx
before it starts writing the other arguments of the function? Something with the preprocessor?

It looks like you'd have an easier time (and get asm that's the same or more efficient) if you wrote in C++ so you could use language built-in support for virtual functions, and for running constructors on initialization. Not to mention not having to manually run destructors. You wouldn't need your struct Class hack.
I'd like to implicitly pass *this pointer, because as shown in second asm part it does the same thing twice, yes, it is what I'm looking for, bPush is a part of a struct and it cannot be called from outside, but I have to pass the pointer S1, which it already has.
You get inefficient asm because you disabled optimization.
MSVC -O2 or -Ox doesn't reload the static pointer twice. It does waste a mov instruction copying between registers, but if you want better asm use a better compiler (like gcc or clang).
The oldest MSVC on the Godbolt compiler explorer is CL19.0 from MSVC 2015, which compiles this source
struct Stack {
int stuff[4];
void (*bPush)(struct Stack*, unsigned char value, unsigned char length);
};
struct Stack *const S1 = new(Stack);
int foo(){
S1->bPush(S1, 1, 2);
//S1->bPush(S1, 1, 2);
return 0; // prevent tailcall optimization
}
into this asm (Godbolt)
# MSVC 2015 -O2
int foo(void) PROC ; foo, COMDAT
$LN4:
sub rsp, 40 ; 00000028H
mov rax, QWORD PTR Stack * __ptr64 __ptr64 S1
mov r8b, 2
mov dl, 1
mov rcx, rax ;; copy RAX to the arg-passing register
call QWORD PTR [rax+16]
xor eax, eax
add rsp, 40 ; 00000028H
ret 0
int foo(void) ENDP ; foo
(I compiled in C++ mode so I could write S1 = new(Stack) without having to copy your github code, and write it at global scope with a non-constant initializer.)
Clang7.0 -O3 loads into RCX straight away:
# clang -O3
foo():
sub rsp, 40
mov rcx, qword ptr [rip + S1]
mov dl, 1
mov r8b, 2
call qword ptr [rcx + 16] # uses the arg-passing register
xor eax, eax
add rsp, 40
ret
Strangely, clang only decides to use low-byte registers when targeting the Windows ABI with __attribute__((ms_abi)). It uses mov esi, 1 to avoid false dependencies when targeting its default Linux calling convention, not mov sil, 1.
Or, if you are already using optimization, then it's because an even older MSVC is even worse. In that case you probably can't do anything in the C source to fix it, although you might try using a struct Stack *p = S1 local variable to hand-hold the compiler into loading it into a register once and reusing it from there.
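
The local-variable suggestion can be sketched in plain C as follows. The struct layout follows the one used in the answer; the count field, push_impl, stack_storage and push_twice are hypothetical additions so that the example is self-contained and does not depend on the asker's new(Stack) macro.

```c
#include <assert.h>
#include <stddef.h>

struct Stack {
    int stuff[4];
    int count;   /* hypothetical field, only for this sketch */
    void (*bPush)(struct Stack *, unsigned char value, unsigned char length);
};

/* Hypothetical push implementation; length is ignored here. */
static void push_impl(struct Stack *self, unsigned char value, unsigned char length)
{
    (void)length;
    assert(self->count < 4);
    self->stuff[self->count++] = value;
}

static struct Stack stack_storage = { {0}, 0, push_impl };
struct Stack *S1 = &stack_storage;   /* global pointer, as in the question */

int push_twice(void)
{
    struct Stack *p = S1;    /* read the global pointer once into a local */
    p->bPush(p, 1, 2);       /* the same local serves as object and argument */
    p->bPush(p, 3, 2);
    return p->count;
}
```

Whether this changes the generated code depends on the compiler and optimization level. With optimization enabled, the answer's listings suggest the pointer is loaded only once per call either way.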

How Do I Access Thread Local Storage From ml64.exe (MSVC 64-bit X64 Assembler)?

The following C function attempts to prevent recursion in multicore code in a thread-safe manner using a thread local storage variable. However, for reasons that are somewhat complicated, I NEED to write this function in X64 assembler (Intel X86 / AMD 64-bit) and assemble it with ml64.exe from VC2010. I know how to do this if I'm using global variables but I'm not sure how to do it properly with a TLS variable that has __declspec(thread).
__declspec(thread) int tls_VAR = 0;
void norecurse( )
{
if(0==tls_VAR)
{
tls_VAR=1;
DoWork();
tls_VAR=0;
}
}
Note: This is what VC2010 kicks out for the function. However, MASM (ml64.exe) doesn't support the gs:88 or OFFSET FLAT: parts of the code.
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
include listing.inc
INCLUDELIB MSVCRTD
INCLUDELIB OLDNAMES
PUBLIC norecurse
EXTRN DoWork:PROC
EXTRN tls_VAR:DWORD
EXTRN _tls_index:DWORD
pdata SEGMENT
$pdata$norecurse DD imagerel $LN4
DD imagerel $LN4+70
DD imagerel $unwind$norecurse
pdata ENDS
xdata SEGMENT
$unwind$norecurse DD 040a01H
DD 06340aH
DD 07006320aH
; Function compile flags: /Ogtpy
xdata ENDS
_TEXT SEGMENT
norecurse PROC
; File p:\hackytests\64bittest2010\64bittest\64bittest.cpp
; Line 19
$LN4:
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32 ; 00000020H
; Line 20
mov ecx, DWORD PTR _tls_index
mov rax, QWORD PTR gs:88
mov edi, OFFSET FLAT:tls_VAR
mov rbx, QWORD PTR [rax+rcx*8]
cmp DWORD PTR [rbx+rdi], 0
jne SHORT $LN1@norecurse
; Line 22
mov DWORD PTR [rbx+rdi], 1
; Line 23
call DoWork
; Line 24
mov DWORD PTR [rbx+rdi], 0
$LN1@norecurse:
; Line 26
mov rbx, QWORD PTR [rsp+48]
add rsp, 32 ; 00000020H
pop rdi
ret 0
norecurse ENDP
_TEXT ENDS
END

As your answer indicates, the problem comes down to finding the MASM equivalents of the following two lines in the assembly listing generated by Microsoft's C++ compiler:
mov rax, QWORD PTR gs:88
mov edi, OFFSET FLAT:tls_VAR
The first line is easy. Just replace gs:88 with gs:[88].
The second line is less obvious. The OFFSET FLAT: operator is a red herring. It means use the offset relative to the beginning of the "FLAT" segment. With the 32-bit version of MASM, the FLAT segment is the segment that includes the entire 4G address space. This is the segment that's used for both the code and data segment as part of the 32-bit flat memory model. The 64-bit version of MASM doesn't support memory models; it essentially always assumes a 64-bit version of the flat memory model, so it doesn't support the FLAT keyword. As a result, the plain OFFSET operator ends up meaning the same thing. (In fact, with the 32-bit assembler, plain OFFSET also normally means the same thing, because PECOFF only supports the flat memory model.)
However, using OFFSET here won't work. That's because it would use the offset of the address of tls_VAR in memory relative to address 0. In other words, it would use the absolute address of tls_VAR in memory. What's needed here is the offset relative to the beginning of the TLS data section.
So the compiler must be doing something special here. In order to find out, I dumped the relocations in the object file generated while compiling your example C code:
> dumpbin /relocations t215a.obj
...
RELOCATIONS #4
Symbol Symbol
Offset Type Applied To Index Name
-------- ---------------- ----------------- -------- ------
00000008 REL32 00000000 14 _tls_index
00000016 SECREL 00000000 8 tls_VAR
0000002D REL32 00000000 C DoWork
...
As you can see, it generates a relocation of type SECREL for the reference to tls_VAR. This makes the relocation relative to the base of the section in the generated executable that the symbol appears in. In this case that's the .tls section, so this relocation generates an offset relative to the beginning of the section used for static TLS data.
So now the question becomes how to get MASM to generate the same SECREL relocation the compiler emits. This turns out to have an easy solution as well: just replace OFFSET FLAT: with SECTIONREL.
So with these changes (and a bit of optimization) your function becomes:
EXTERN tls_VAR:DWORD
EXTERN _tls_index:DWORD
EXTERN DoWork:PROC
PUBLIC norecurse
_TEXT SEGMENT
norecurse PROC
push rbx
sub rsp, 32
mov rax, gs:[88]
mov ecx, _tls_index
mov rbx, [rax + rcx * 8]
cmp DWORD PTR [rbx + SECTIONREL tls_VAR], 0
jne return
mov DWORD PTR [rbx + SECTIONREL tls_VAR], 1
call DoWork
mov DWORD PTR [rbx + SECTIONREL tls_VAR], 0
return:
add rsp, 32
pop rbx
ret
norecurse ENDP
_TEXT ENDS
END
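
The address arithmetic that the three key instructions perform can be modelled in portable C. This is only a simulation in ordinary memory, not real thread-local storage: tls_array stands in for the per-thread pointer array read from gs:[88], the index plays the role of _tls_index, and the offset plays the role of SECTIONREL tls_VAR. All names and sizes are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MODULE_COUNT = 2, TLS_BLOCK_SIZE = 64 };

/* Stand-in for one thread's TLS blocks, one block per loaded module. */
static unsigned char tls_block_storage[MODULE_COUNT][TLS_BLOCK_SIZE];

/* Stand-in for the array that gs:[88] points at. */
static unsigned char *tls_array[MODULE_COUNT] = {
    tls_block_storage[0], tls_block_storage[1]
};

/* Models: mov rbx, [rax + rcx*8]  followed by  rbx + SECTIONREL var */
unsigned char *tls_address(size_t tls_index, size_t secrel_offset)
{
    assert(tls_index < MODULE_COUNT);
    assert(secrel_offset + sizeof(int) <= TLS_BLOCK_SIZE);
    unsigned char *block = tls_array[tls_index];  /* this module's block */
    return block + secrel_offset;                 /* section-relative offset */
}

int tls_load_int(size_t tls_index, size_t secrel_offset)
{
    int value;
    memcpy(&value, tls_address(tls_index, secrel_offset), sizeof value);
    return value;
}

void tls_store_int(size_t tls_index, size_t secrel_offset, int value)
{
    memcpy(tls_address(tls_index, secrel_offset), &value, sizeof value);
}
```

The point of the model is that the variable's location is the block base plus an offset that does not depend on where the block was allocated, which is why a section-relative relocation is needed rather than an absolute address.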

I was able to hack around the issue. My implementation in assembler is less efficient than the code generated by the C compiler, though, because I was not able to figure out how to use the following two addressing modes:
mov rax, QWORD PTR gs:88
mov edi, OFFSET FLAT:tls_VAR
For (1), I had to load 88 into rax and use gs:[rax] to access the TLS-base for the thread.
For (2), the lack of OFFSET FLAT in MASM (ml64.exe) meant that I had to be more clever. I computed an offset by subtracting _tls_start from the TLS base for the thread. That offset can then be applied to the addresses of TLS variables in assembler to access their thread-local values.
PUBLIC norecurse
EXTRN _tls_index:DWORD
EXTRN _tls_start:DWORD
EXTRN tls_VAR:DWORD
EXTRN DoWork:PROC
_TEXT SEGMENT
norecurse PROC
; non-volatile
push rbx
sub rsp,32
; The gs segment register refers to the base address of the TEB on x64.
; 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64
mov rax,88
mov edx, DWORD PTR _tls_index
mov rax, gs:[rax]
mov r11, QWORD PTR [rax+rdx*8]
lea r10, _tls_start
; r11 will be the offset-adjusted TLS-Base
sub r11, r10
; rbx will be the thread local address of tls_VAR
lea rdx, tls_VAR
lea rbx,[r11+rdx]
cmp DWORD PTR [rbx], 0
jne @F
mov DWORD PTR [rbx], 1
call DoWork
mov DWORD PTR [rbx], 0
@@:
add rsp,32
pop rbx
ret
norecurse ENDP
_TEXT ENDS
END
I'd love to see a more efficient method, or pointers on how to actually use the two addressing modes I couldn't figure out with MASM (ml64.exe).

Check out TlsGetValue, TlsSetValue, and friends.
