thread local storage in assembly

thread local storage in assembly - c

I want to increment a TLS variable in assembly but is gives a segmentation fault in the assembly code. I don't want to let compiler change any other register or memory. Is there a way to do this without using gcc input and output syntax?
__thread unsigned val;
int main() {
val = 0;
asm("incl %gs:val");
return 0;
}

If you really really need to be able to do this for some reason, you should access a thread-local variable from assembly language by preloading its address in C, like this:
__thread unsigned val;
void incval(void)
{
unsigned *vp = &val;
asm ("incl\t%0" : "+m" (*vp));
}
This is because the code sequence required to access a thread-local variable is different for just about every OS and CPU combination supported by GCC, and also varies if you're compiling for a shared library rather than an executable (i.e. with -fPIC). The above construct allows the compiler to emit the correct code sequence for you. In cases where it is possible to access the thread-local variable without any extra instructions, the address generation will be folded into the assembly operation. By way of illustration, here is how gcc 4.7 for x86/Linux compiles the above in several different possible modes (I've stripped out a bunch of assembler directives in all cases, for clarity)...
# -S -O2 -m32 -fomit-frame-pointer
incval:
incl %gs:val#ntpoff
ret
# -S -O2 -m64
incval:
incl %fs:val#tpoff
ret
# -S -O2 -m32 -fomit-frame-pointer -fpic
incval:
pushl %ebx
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
leal val#tlsgd(,%ebx,1), %eax
call ___tls_get_addr#PLT
incl (%eax)
popl %ebx
ret
# -S -O2 -m64 -fpic
incval:
.byte 0x66
leaq val#tlsgd(%rip), %rdi
.value 0x6666
rex64
call __tls_get_addr#PLT
incl (%rax)
ret
Do realize that all four examples would be different if I'd compiled for x86/OSX, and different yet again for x86/Windows.

Related

Why can't I get the value of asm registers in C?

I'm trying to get the values of the assembly registers rdi, rsi, rdx, rcx, r8, but I'm getting the wrong value, so I don't know if what I'm doing is taking those values or telling the compiler to write on these registers, and if that's the case how could I achieve what I'm trying to do (Put the value of assembly registers in C variables)?
When this code compiles (with gcc -S test.c)
#include <stdio.h>
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
it outputs the following assembly code (before the function call):
movl $1, %edi
movl $2, %esi
movl $3, %edx
movl $4, %ecx
movl $5, %r8d
callq _beautiful_function
When I compile and execute it outputs this:
0
0
4294967296
140732705630496
140732705630520
(some undefined values)
What did I do wrong ? and how could I do this?

Your code didn't work because Specifying Registers for Local Variables explicitly tells you not to do what you did:
The only supported use for this feature is to specify registers for input and output operands when calling Extended asm (see Extended Asm).
Other than when invoking the Extended asm, the contents of the specified register are not guaranteed. For this reason, the following uses are explicitly not supported. If they appear to work, it is only happenstance, and may stop working as intended due to (seemingly) unrelated changes in surrounding code, or even minor changes in the optimization of a future version of gcc:
Passing parameters to or from Basic asm
Passing parameters to or from Extended asm without using input or output operands.
Passing parameters to or from routines written in assembler (or other languages) using non-standard calling conventions.
To put the value of registers in variables, you can use Extended asm, like this:
long rdi, rsi, rdx, rcx;
register long r8 asm("r8");
asm("" : "=D"(rdi), "=S"(rsi), "=d"(rdx), "=c"(rcx), "=r"(r8));
But note that even this might not do what you want: the compiler is within its rights to copy the function's parameters elsewhere and reuse the registers for something different before your Extended asm runs, or even to not pass the parameters at all if you never read them through the normal C variables. (And indeed, even what I posted doesn't work when optimizations are enabled.) You should strongly consider just writing your whole function in assembly instead of inline assembly inside of a C function if you want to do what you're doing.

Even if you had a valid way of doing this (which this isn't), it probably only makes sense at the top of a function which isn't inlined. So you'd probably need __attribute__((noinline, noclone)). (noclone is a GCC attribute that clang will warn about not recognizing; it means not to make an alternate version of the function with fewer actual args, to be called in the case where some of them are known constants that can get propagated into the clone.)
register ... asm local vars aren't guaranteed to do anything except when used as operands to Extended Asm statements. GCC does sometimes still read the named register if you leave it uninitialized, but clang doesn't. (And it looks like you're on a Mac, where the gcc command is actually clang, because so many build scripts use gcc instead of cc.)
So even without optimization, the stand-alone non-inlined version of your beautiful_function is just reading uninitialized stack space when it reads your rdi C variable in const long save_rdi = rdi;. (GCC does happen to do what you wanted here, even at -Os - optimizes but chooses not to inline your function. See clang and GCC (targeting Linux) on Godbolt, with asm + program output.).
Using an asm statement to make register asm do something
(This does what you say you want (reading registers), but because of other optimizations, still doesn't produce 1 2 3 4 5 with clang when the caller can see the definition. Only with actual GCC. There might be a clang option to disable some relevant IPA / IPO optimization, but I didn't find one.)
You can use an asm volatile() statement with an empty template string to tell the compiler that the values in those registers are now the values of those C variables. (The register ... asm declarations force it to pick the right register for the right variable)
#include <stdlib.h>
#include <stdio.h>
__attribute__((noinline,noclone))
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
// "activate" the register-asm locals:
// associate register values with C vars here, at this point
asm volatile("nop # asm statement here" // can be empty, nop is just because Godbolt filters asm comments
: "=r"(rdi), "=r"(rsi), "=r"(rdx), "=r"(rcx), "=r"(r8) );
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
This makes asm in your beautiful_function that does capture the incoming values of your registers. (It doesn't inline, and the compiler happens not to have used any instructions before the asm statement that steps on any of those registers. The latter is not guaranteed in general.)
On Godbolt with clang -O3 and gcc -O3
gcc -O3 does actually work, printing what you expect. clang still prints garbage, because the caller sees that the args are unused, and decides not to set those registers. (If you'd hidden the definition from the caller, e.g. in another file without LTO, that wouldn't happen.)
(With GCC, noninline,noclone attributes are enough to disable this inter-procedural optimization, but not with clang. Not even compiling with -fPIC makes that possible. I guess the idea is that symbol-interposition to provide an alternate definition of beautiful_function that does use its args would violate the one definition rule in C. So if clang can see a definition for a function, it assumes that's how the function works, even if it isn't allowed to actually inline it.)
With clang:
main:
pushq %rax # align the stack
# arg-passing optimized away
callq beautiful_function#PLT
# indirect through the PLT because I compiled for Linux with -fPIC,
# and the function isn't "static"
xorl %eax, %eax
popq %rcx
retq
But the actual definition for beautiful_function does exactly what you want:
# clang -O3
beautiful_function:
pushq %r14
pushq %rbx
nop # asm statement here
movq %rdi, %r9 # copying all 5 register outputs to different regs
movq %rsi, %r10
movq %rdx, %r11
movq %rcx, %rbx
movq %r8, %r14
leaq .L.str(%rip), %rdi
xorl %eax, %eax
movq %r9, %rsi # then copying them to printf args
movq %r10, %rdx
movq %r11, %rcx
movq %rbx, %r8
movq %r14, %r9
popq %rbx
popq %r14
jmp printf#PLT # TAILCALL
GCC wastes fewer instructions, just for example starting with movq %r8, %r9 to move your r8 C var as the 6th arg to printf. Then movq %rcx, %r8 to set up the 5th arg, overwriting one of the output registers before it's read all of them. Something clang was over-cautious about. However, clang does still push/pop %r12 around the asm statement; I don't understand why. It ends by tailcalling printf, so it wasn't for alignment.
Related:
How to specify a specific register to assign the result of a C expression in inline assembly in GCC? - the opposite problem: materialize a C variable value in a specific register at a certain point.
Reading a register value into a C variable - the previous canonical Q&A which uses the now-unsupported register ... asm("regname") method like you were trying to. Or with a register-asm global variable, which hurts efficiency of all your code by leaving it otherwise untouched.
I forgot I'd answered that Q&A, making basically the same points as this. And some other points, e.g. that this doesn't work on registers like the stack pointer.

Compiling .asm and .c files gives "i386 incompatibility" problem when trying to link objective files [duplicate]

I wrote assembly code that successfully compiles:
as power.s -o power.o
However, it fails when I try to link the object file:
ld power.o -o power
In order to run on the 64 bit OS (Ubuntu 14.04), I added .code32 at the beginning of the power.s file, however I still get the error:
Segmentation fault (core dumped)
power.s:
.code32
.section .data
.section .text
.global _start
_start:
pushl $3
pushl $2
call power
addl $8, %esp
pushl %eax
pushl $2
pushl $5
call power
addl $8, %esp
popl %ebx
addl %eax, %ebx
movl $1, %eax
int $0x80
.type power, #function
power:
pushl %ebp
movl %esp, %ebp
subl $4, %esp
movl 8(%ebp), %ebx
movl 12(%ebp), %ecx
movl %ebx, -4(%ebp)
power_loop_start:
cmpl $1, %ecx
je end_power
movl -4(%ebp), %eax
imull %ebx, %eax
movl %eax, -4(%ebp)
decl %ecx
jmp power_loop_start
end_power:
movl -4(%ebp), %eax
movl %ebp, %esp
popl %ebp
ret

TL:DR: use gcc -m32 -static -nostdlib foo.S (or equivalent as and ld options).
Or if you don't define your own _start, just gcc -m32 -no-pie foo.S
You may need to install gcc-multilib if you link libc, or however your distro packages /usr/lib32/libc.so, /usr/lib32/libstdc++.so and so on. But if you define your own _start and don't link libraries, you don't need the library package, just a kernel that supports 32-bit processes and system calls. This includes most distros, but not Windows Subsystem for Linux v1.
Don't use .code32
.code32 does not change the output file format, and that's what determines the mode your program will run in. It's up to you to not try to run 32bit code in 64bit mode. .code32 is for assembling kernels that have some 16 and some 32-bit code, and stuff like that. If that's not what you're doing, avoid it so you'll get build-time errors when you build a .S in the wrong mode if it has any push or pop instructions, for example. .code32 just lets you create confusing-to-debug runtime problems instead of build-time errors.
Suggestion: use the .S extension for hand-written assembler. (gcc -c foo.S will run it through the C preprocessor before as, so you can #include <sys/syscall.h> for syscall numbers, for example). Also, it distinguishes it from .s compiler output (from gcc foo.c -O3 -S).
To build 32-bit binaries, use one of these commands
gcc -g foo.S -o foo -m32 -nostdlib -static # static binary with absolutely no libraries or startup code
# -nostdlib still dynamically links when Linux where PIE is the default, or on OS X
gcc -g foo.S -o foo -m32 -no-pie # dynamic binary including the startup boilerplate code.
# Use with code that defines a main(), not a _start
Documentation for nostdlib, -nostartfiles, and -static.
Using libc functions from _start (see the end of this answer for an example)
Some functions, like malloc(3), or stdio functions including printf(3), depend on some global data being initialized (e.g. FILE *stdout and the object it actually points to).
gcc -nostartfiles leaves out the CRT _start boilerplate code, but still links libc (dynamically, by default). On Linux, shared libraries can have initializer sections that are run by the dynamic linker when it loads them, before jumping to your _start entry point. So gcc -nostartfiles hello.S still lets you call printf. For a dynamic executable, the kernel runs /lib/ld-linux.so.2 on it instead of running it directly (use readelf -a to see the "ELF interpreter" string in your binary). When your _start eventually runs, not all the registers will be zeroed, because the dynamic linker ran code in your process.
However, gcc -nostartfiles -static hello.S will link, but crash at runtime if you call printf or something without calling glibc's internal init functions. (see Michael Petch's comment).
Of course you can put any combination of .c, .S, and .o files on the same command line to link them all into one executable. If you have any C, don't forget -Og -Wall -Wextra: you don't want to be debugging your asm when the problem was something simple in the C that calls it that the compiler could have warned you about.
Use -v to have gcc show you the commands it runs to assemble and link. To do it "manually":
as foo.S -o foo.o -g --32 && # skips the preprocessor
ld -o foo foo.o -m elf_i386
file foo
foo: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
gcc -nostdlib -m32 is easier to remember and type than the two different options for as and ld (--32 and -m elf_i386). Also, it works on all platforms, including ones where executable format isn't ELF. (But Linux examples won't work on OS X, because the system call numbers are different, or on Windows because it doesn't even use the int 0x80 ABI.)
NASM/YASM
gcc can't handle NASM syntax. (-masm=intel is more like MASM than NASM syntax, where you need offset symbol to get the address as an immediate). And of course the directives are different (e.g. .globl vs global).
You can build with nasm or yasm, then link the .o with gcc as above, or ld directly.
I use a wrapper script to avoid the repetitive typing of the same filename with three different extensions. (nasm and yasm default to file.asm -> file.o, unlike GNU as's default output of a.out). Use this with -m32 to assemble and link 32bit ELF executables. Not all OSes use ELF, so this script is less portable than using gcc -nostdlib -m32 to link would be..
#!/bin/bash
# usage: asm-link [-q] [-m32] foo.asm [assembler options ...]
# Just use a Makefile for anything non-trivial. This script is intentionally minimal and doesn't handle multiple source files
# Copyright 2020 Peter Cordes. Public domain. If it breaks, you get to keep both pieces
verbose=1 # defaults
fmt=-felf64
#ldopt=-melf_i386
ldlib=()
linker=ld
#dld=/lib64/ld-linux-x86-64.so.2
while getopts 'Gdsphl:m:nvqzN' opt; do
case "$opt" in
m) if [ "m$OPTARG" = "m32" ]; then
fmt=-felf32
ldopt=-melf_i386
#dld=/lib/ld-linux.so.2 # FIXME: handle linker=gcc non-static executable
fi
if [ "m$OPTARG" = "mx32" ]; then
fmt=-felfx32
ldopt=-melf32_x86_64
fi
;;
# -static
l) linker="gcc -no-pie -fno-plt -nostartfiles"; ldlib+=("-l$OPTARG");;
p) linker="gcc -pie -fno-plt -nostartfiles"; ldlib+=("-pie");;
h) ldlib+=("-Ttext=0x200800000");; # symbol addresses outside the low 32. data and bss go in range of text
# strace -e raw=write will show the numeric address
G) nodebug=1;; # .label: doesn't break up objdump output
d) disas=1;;
s) runsize=1;;
n) use_nasm=1 ;;
q) verbose=0 ;;
v) verbose=1 ;;
z) ldlib+=("-zexecstack") ;;
N) ldlib+=("-N") ;; # --omagic = read+write text section
esac
done
shift "$((OPTIND-1))" # Shift off the options and optional --
src=$1
base=${src%.*}
shift
#if [[ ${#ldlib[#]} -gt 0 ]]; then
# ldlib+=("--dynamic-linker" "$dld")
#ldlib=("-static" "${ldlib[#]}")
#fi
set -e
if (($use_nasm)); then
# (($nodebug)) || dbg="-g -Fdwarf" # breaks objdump disassembly, and .labels are included anyway
( (($verbose)) && set -x # print commands as they're run, like make
nasm "$fmt" -Worphan-labels $dbg "$src" "$#" &&
$linker $ldopt -o "$base" "$base.o" "${ldlib[#]}")
else
(($nodebug)) || dbg="-gdwarf2"
( (($verbose)) && set -x # print commands as they're run, like make
yasm "$fmt" -Worphan-labels $dbg "$src" "$#" &&
$linker $ldopt -o "$base" "$base.o" "${ldlib[#]}" )
fi
# yasm -gdwarf2 includes even .local labels so they show up in objdump output
# nasm defaults to that behaviour of including even .local labels
# nasm defaults to STABS debugging format, but -g is not the default
if (($disas));then
objdump -drwC -Mintel "$base"
fi
if (($runsize));then
size $base
fi
I prefer YASM for a few reasons, including that it defaults to making long-nops instead of padding with many single-byte nops. That makes for messy disassembly output, as well as being slower if the nops ever run. (In NASM, you have to use the smartalign macro package.)
However, YASM hasn't been maintained for a while and only NASM has AVX512 support; these days I more often just use NASM.
Example: a program using libc functions from _start
# hello32.S
#include <asm/unistd_32.h> // syscall numbers. only #defines, no C declarations left after CPP to cause asm syntax errors
.text
#.global main # uncomment these to let this code work as _start, or as main called by glibc _start
#main:
#.weak _start
.global _start
_start:
mov $__NR_gettimeofday, %eax # make a syscall that we can see in strace output so we know when we get here
int $0x80
push %esp
push $print_fmt
call printf
#xor %ebx,%ebx # _exit(0)
#mov $__NR_exit_group, %eax # same as glibc's _exit(2) wrapper
#int $0x80 # won't flush the stdio buffer
movl $0, (%esp) # reuse the stack slots we set up for printf, instead of popping
call exit # exit(3) does an fflush and other cleanup
#add $8, %esp # pop the space reserved by the two pushes
#ret # only works in main, not _start
.section .rodata
print_fmt: .asciz "Hello, World!\n%%esp at startup = %#lx\n"
$ gcc -m32 -nostdlib hello32.S
/tmp/ccHNGx24.o: In function `_start':
(.text+0x7): undefined reference to `printf'
...
$ gcc -m32 hello32.S
/tmp/ccQ4SOR8.o: In function `_start':
(.text+0x0): multiple definition of `_start'
...
Fails at run-time, because nothing calls the glibc init functions. (__libc_init_first, __dl_tls_setup, and __libc_csu_init in that order, according to Michael Petch's comment. Other libc implementations exist, including MUSL which is designed for static linking and works without initialization calls.)
$ gcc -m32 -nostartfiles -static hello32.S # fails at run-time
$ file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), statically linked, BuildID[sha1]=ef4b74b1c29618d89ad60dbc6f9517d7cdec3236, not stripped
$ strace -s128 ./a.out
execve("./a.out", ["./a.out"], [/* 70 vars */]) = 0
[ Process PID=29681 runs in 32 bit mode. ]
gettimeofday(NULL, NULL) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
You could also gdb ./a.out, and run b _start, layout reg, run, and see what happens.
$ gcc -m32 -nostartfiles hello32.S # Correct command line
$ file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, BuildID[sha1]=7b0a731f9b24a77bee41c13ec562ba2a459d91c7, not stripped
$ ./a.out
Hello, World!
%esp at startup = 0xffdf7460
$ ltrace -s128 ./a.out > /dev/null
printf("Hello, World!\n%%esp at startup = %#lx\n", 0xff937510) = 43 # note the different address: Address-space layout randomization at work
exit(0 <no return ...>
+++ exited (status 0) +++
$ strace -s128 ./a.out > /dev/null # redirect stdout so we don't see a mix of normal output and trace output
execve("./a.out", ["./a.out"], [/* 70 vars */]) = 0
[ Process PID=29729 runs in 32 bit mode. ]
brk(0) = 0x834e000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
.... more syscalls from dynamic linker code
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap2(NULL, 1814236, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xfffffffff7556000 # map the executable text section of the library
... more stuff
# end of dynamic linker's code, finally jumps to our _start
gettimeofday({1461874556, 431117}, NULL) = 0
fstat64(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0 # stdio is figuring out whether stdout is a terminal or not
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0xff938870) = -1 ENOTTY (Inappropriate ioctl for device)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfffffffff7743000 # 4k buffer for stdout
write(1, "Hello, World!\n%esp at startup = 0xff938fb0\n", 43) = 43
exit_group(0) = ?
+++ exited with 0 +++
If we'd used _exit(0), or made the sys_exit system call ourselves with int 0x80, the write(2) wouldn't have happened. With stdout redirected to a non-tty, it defaults to full-buffered (not line-buffered), so the write(2) is only triggered by the fflush(3) as part of exit(3). Without redirection, calling printf(3) with a string containing newlines will flush immediately.
Behaving differently depending on whether stdout is a terminal can be desirable, but only if you do it on purpose, not by mistake.

I'm learning x86 assembly (on 64-bit Ubuntu 18.04) and had a similar problem with the exact same example (it's from Programming From the Ground Up, in chapter 4 [http://savannah.nongnu.org/projects/pgubook/ ]).
After poking around I found the following two lines assembled and linked this:
as power.s -o power.o --32
ld power.o -o power -m elf_i386
These tell the computer that you're only working in 32-bit (despite the 64-bit architecture).
If you want to use gdb debugging, then use the assembler line:
as --gstabs power.s -o power.o --32.
The .code32 seems to be unnecessary.
I haven't tried it your way, but the gnu assembler (gas) also seems okay with:
.globl start
# (that is, no 'a' in global).
Moreover, I'd suggest you probably want to keep the comments from the original code as it seems to be recommended to comment profusely in assembly. (Even if you are the only one to look at the code, it'll make it easier to figure out what you were doing if you look at it months or years later.)
It would be nice to know how to alter this to use the 64-bit R*X and RBP, RSP registers though.

Combining c and assembler code

This is my C code:
#include <stdio.h>
void sum();
int newAlphabet;
int main(void)
{
sum();
printf("%d\n",newAlphabet);
}
And this is my assembler code:
.globl _sum
_sum:
movq $1, %rax
movq %rax, _newAlphabet
ret
I'm trying to call the sum function, from my main function, to set newAlphabet equal to 1, but when I compile it (gcc -o test assembler.c assembler.s, compiled on a 64-bit OSX laptop) I get the following errors:
32-bit absolute addressing is not supported for x86-64
cannot do signed 4 byte relocation
both caused by the line "movq %rax, _newAlphabet"
I'm sure I'm making a very basic mistake. Can anyone help? Thanks in advance.
EDIT:
Here are the relevant portions of the C code once it has been translated to assembler:
.comm _newAlphabet,4,2
...
movq _newAlphabet#GOTPCREL(%rip), %rax

Mac OS X uses position-independent executables by default, which means your code can't use constant global addresses for variables. Instead you'll need to access globals in an IP-relative way. Just change:
movq %rax, _newAlphabet
to:
mov %eax, _newAlphabet(%rip)
and you'll be set (I changed from 64 to 32 bit registers to match sizeof(int) on Mac OS X. Note that you also need a .globl _newAlphabet in there somewhere. Here's an example I just made based on your code (note that I initialized newAlphabet to prove it works):
example.c:
#include <stdio.h>
void sum(void);
int newAlphabet = 2;
int main(void)
{
printf("%d\n",newAlphabet);
sum();
printf("%d\n",newAlphabet);
return 0;
}
assembly.s:
.globl _sum
.globl _newAlphabet
_sum:
movl $1, _newAlphabet(%rip)
ret
Build & run:
$ cc -c -o example.o example.c
$ cc -c -o assembly.o assembly.s
$ cc -o example example.o assembly.o
$ ./example
2
1

likely(x) and __builtin_expect((x),1)

I know the kernel uses the likely and unlikely macros prodigiously. The docs for the macros are located at Built-in Function: long __builtin_expect (long exp, long c). But they don't really discuss the details.
How exactly does a compiler handle likely(x) and __builtin_expect((x),1)?
Is it handled by the code generator or the optimizer?
Does it depend upon optimization levels?
What's an example of the code generated?

I just tested a simple example on gcc.
For x86 this seems to be handled by the optimizer and depend on optimization levels. Although I guess a correct answer here would be "it depends on the compiler".
The code generated is CPU dependent. Some cpus (sparc64 comes immediately to my mind, but I'm sure there are others) have flags on conditional branch instructions that tell the CPU how to predict it, so the compiler generates "predict true/predict false" instructions depending on the built in rules in the compiler and hints from the code (like __builtin_expect).
Intel documents their behavior here: https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts . In short the behavior on Intel CPUs is that if the CPU has no previous information about a branch it will predict forward branches as unlikely to be taken, while backwards branches are likely to be taken (think about loops vs. error handling).
This is some example code:
int bar(int);
int
foo(int x)
{
if (__builtin_expect(x>10, PREDICTION))
return bar(10);
return 42;
}
Compiled with (I'm using omit-frame-pointer to make the output more readable, but I still cleaned it up below):
$ cc -S -fomit-frame-pointer -O0 -DPREDICTION=0 -o 00.s foo.c
$ cc -S -fomit-frame-pointer -O0 -DPREDICTION=1 -o 01.s foo.c
$ cc -S -fomit-frame-pointer -O2 -DPREDICTION=0 -o 20.s foo.c
$ cc -S -fomit-frame-pointer -O2 -DPREDICTION=1 -o 21.s foo.c
There's no difference between 00.s and 01.s, so that shows that this is dependent on optimization (for gcc at least).
Here's the (cleaned up) generated code for 20.s:
foo:
cmpl $10, %edi
jg .L2
movl $42, %eax
ret
.L2:
movl $10, %edi
jmp bar
And here is 21.s:
foo:
cmpl $10, %edi
jle .L6
movl $10, %edi
jmp bar
.L6:
movl $42, %eax
ret
As expected the compiler rearranged the code so that the branch we don't expect to take is done in a forward branch.

Get the Stack Pointer in C on Mac OS X Lion

I've run into some strange behaviour when trying to obtain the current stack pointer in C (using inline ASM). The code looks like:
#include <stdio.h>
class os {
public:
static void* current_stack_pointer();
};
void* os::current_stack_pointer() {
register void *esp __asm__ ("rsp");
return esp;
}
int main() {
printf("%p\n", os::current_stack_pointer());
}
If I compile the code using the standard gcc options:
$ g++ test.cc -o test
It generates the following assembly:
__ZN2os21current_stack_pointerEv:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 movq %rdi,0xf8(%rbp)
0000000000000008 movq 0xe0(%rbp),%rax
000000000000000c movq %rax,%rsp
000000000000000f movq %rsp,%rax
0000000000000012 movq %rax,0xe8(%rbp)
0000000000000016 movq 0xe8(%rbp),%rax
000000000000001a movq %rax,0xf0(%rbp)
000000000000001e movq 0xf0(%rbp),%rax
0000000000000022 popq %rbp
If I run the resulting binary it crashes with a SIGILL (Illegal Instruction). However if I add a little optimisation to the compile:
$ g++ -O1 test.cc -o test
The generated assembly is much simpler:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 movq %rsp,%rax
0000000000000007 popq %rbp
0000000000000008 ret
And the code runs fine. So to the question; is there a more stable to get hold of the stack pointer from C code on Mac OS X? The same code has no problems on Linux.

The problem with attempting to fetch the stack pointer through a function call is that the stack pointer inside the called function is pointing at a value that will be completely different after the function returns, and therefore you're capturing the address of a location that will be invalid after the call. You're also making the assumption that there was no function prologue added by the compiler on that platform (i.e., both your functions currently have a prologue where the compiler setups up the current activation record on the stack for the function, which will change the value of RSP that you are attempting to capture). At the very least, provided that there was no function prologue added by the compiler, you will need to subtract the size of a pointer on the platform you're using in order to actually get the "true" address to where the stack will be pointing after the return from the function call. This is because the assembly command call pushes the return address for the instruction pointer onto the stack, and ret in the callee will pop that value off the stack. Thus inside the callee, there will at the very least be a return-address instruction that the stack-pointer will be pointing to, and that location won't be valid after the function call. Finally, on certain platforms (unfortunately not x86), you can use the __attributes__((naked)) tag to create a function with no prologue in gcc. Using the inline keyword to avoid a prologue is not completely reliable since it does not force the compiler to inline the function ... under certain low-optimization levels, inlining will not occur, and you'll end up with a prologue again, and the stack-pointer will not be pointing to the correct location if you decide to take it's address in those cases.
If you must have the value of the stack pointer, then the only reliable method will be to use assembly, follow the rules of your platform's ABI, compile to an object file using an assembler, and then link that object file with the rest of the object files in your executable. You can then expose the assembler function to the rest of your code by including a function declaration in a header file. So your code could look like (assuming you're using gcc to compile your assembly):
//get_stack_pointer.h
extern "C" void* get_stack_ptr();
//get_stack_pointer.S
.section .text
.global get_stack_ptr
get_stack_ptr:
movq %rsp, %rax
addq $8, %rax
ret

Rather than using a register variable with a constraint, you should just write some explicit inline assembler to fetch %esp:
static void *getsp(void)
{
void *sp;
__asm__ __volatile__ ("movq %%rsp,%0"
: "=r" (sp)
: /* No input */);
return sp;
}
You can also convert this to a macro using gcc statement expressions:
#define GETSP() ({void *sp;__asm__ __volatile__("movl %%esp,%0":"=r"(sp):);sp;})

A multi arch version was what I needed recently:
/**
* helps to check the architecture macros:
* `echo | gcc -E -dM - | less`
*
* this is arm, x64 and i386 (linux | apple) compatible
* #return address where the stack starts
*/
void *get_sp(void) {
void *sp;
__asm__ __volatile__(
#ifdef __x86_64__
"movq %%rsp,%0"
#elif __i386__
"movl %%esp,%0"
#elif __arm__
// sp is an alias for r13
"mov %%sp,%0"
#endif
: "=r" (sp)
: /* no input */
);
return sp;
}

I do not have a reference for that, but GCC is known to occasionally (often) misbehave in the presence of inline assembly if compilation is not optimized at all. So you should always add the -O1 flag.
As a side-note, what you are trying to do is not very robust in the presence of an optimizing compiler, because the compiler may inline the call to current_stack_pointer() and the returned value may thus be an approximation of the current stack pointer value (not even a lower bound).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

thread local storage in assembly - c

Related

Why can't I get the value of asm registers in C?

Compiling .asm and .c files gives "i386 incompatibility" problem when trying to link objective files [duplicate]

Combining c and assembler code

likely(x) and __builtin_expect((x),1)

Get the Stack Pointer in C on Mac OS X Lion

Categories

Resources