I just learned about bit-fields in C and became curious about how compilers implement this feature. As far as I know, memory is not bit-addressable, so single bits cannot be accessed individually.
Bit-fields are implemented by reading the surrounding addressable unit of memory (byte or word), masking and shifting.
More precisely, reading a bit-field is implemented as read, shift, mask; writing to a bit-field is implemented as read, mask out the field, shift the value to write into position, OR it in, and write back.
This is pretty expensive, but if you intend to store data compactly and are willing to pay the price of the bitwise operations, then bit-fields offer a clearer, lighter syntax at the source level for the same operations that you could have written by hand. What you lose is control of the layout (the standard does not specify how bit-fields are allocated from a containing word, and this will vary from compiler to compiler more than the meaning of bitwise operations does).
Whenever you have doubts about what a C compiler does for a given construct, you can always read the assembly code:
struct s {
    unsigned int a:3;
    unsigned int b:3;
} s;

void f(void)
{
    s.b = 5;
}

int g(void)
{
    return s.a;
}
This is compiled by gcc -O -S to:
_f: ## @f
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
movq _s@GOTPCREL(%rip), %rax
movb (%rax), %cl ; read
andb $-57, %cl ; mask
orb $40, %cl ; since the value to write was a constant, 5, the compiler has pre-shifted it by 3, giving 40
movb %cl, (%rax) ; write
popq %rbp
retq
.cfi_endproc
.globl _g
.align 4, 0x90
_g: ## @g
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp7:
.cfi_def_cfa_offset 16
Ltmp8:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp9:
.cfi_def_cfa_register %rbp
movq _s@GOTPCREL(%rip), %rax
movzbl (%rax), %eax
andl $7, %eax
popq %rbp
retq
.cfi_endproc
Related
I have the C code:
long fib(long n) {
    if (n < 2) return 1;
    return fib(n-1) + fib(n-2);
}

int main(int argc, char** argv) {
    return 0;
}
which I compiled by running gcc -O0 -fno-optimize-sibling-calls -S file.c, yielding assembly code that has not been optimized:
.file "long.c"
.text
.globl fib
.type fib, @function
fib:
.LFB5:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $24, %rsp
.cfi_offset 3, -24
movq %rdi, -24(%rbp)
cmpq $1, -24(%rbp)
jg .L2
movl $1, %eax
jmp .L3
.L2:
movq -24(%rbp), %rax
subq $1, %rax
movq %rax, %rdi
call fib
movq %rax, %rbx
movq -24(%rbp), %rax
subq $2, %rax
movq %rax, %rdi
call fib
addq %rbx, %rax
.L3:
addq $24, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE5:
.size fib, .-fib
.globl main
.type main, @function
main:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE6:
.size main, .-main
.ident "GCC: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0"
.section .note.GNU-stack,"",@progbits
My question is:
Why do we decrement the stack pointer by 24, subq $24, %rsp? As I see it, after the initial two pushes we store only one element on the stack: the first argument n, passed in %rdi. So why don't we just decrement the stack pointer by 8 and move n to -8(%rbp)? That is:
subq $8, %rsp
movq %rdi, -8(%rbp)
GCC does not fully optimize with -O0, not even its stack use. (This may aid debugging by making some of its use of the stack more transparent to humans. For example, objects a, b, and c may share a single stack location with -O3 if their active lifetimes (defined by uses in the program, not by the model of lifetime in the C standard) do not overlap, but may have separately reserved places on the stack with -O0, which makes it easier for a human to see where a, b, and c are used in the assembly code. The wasted 16 bytes may be a side effect of this, as those spaces may be reserved for some purpose that this small function did not happen to use, such as space to save certain registers if needed.)
Changing optimization to -O3 results in GCC subtracting only eight from the stack pointer.
I'm reading Computer Systems: A Programmer's Perspective 3rd edition and the assembly in 3.10.5 Supporting Variable-Size Stack Frames, Figure 3.43 confuses me.
That part of the book is trying to explain how a variable-size stack frame is generated, and it gives C code and its assembly version as an example.
Here is the C code and assembly (Figure 3.43 of the book):
I don't know what lines 8-10 in the assembly are for. Why not just use movq %rsp, %r8 after line 7?
(a) C code
long vframe(long n, long idx, long *q) {
    long i;
    long *p[n];
    p[0] = &i;
    for (i = 1; i < n; i++)
        p[i] = q;
    return *p[idx];
}
(b) Portions of generated assembly code
vframe:
2: pushq %rbp
3: movq %rsp, %rbp
4: subq $16, %rsp
5: leaq 22(, %rdi, 8), %rax
6: andq $-16, %rax
7: subq %rax, %rsp
8: leaq 7(%rsp), %rax
9: shrq $3, %rax
10: leaq 0(, %rax, 8), %r8
11: movq %r8, %rcx
................................
12: L3:
13: movq %rdx, (%rcx, %rax, 8)
14: addq $1, %rax
15: movq %rax, -8(%rbp)
16: L2:
17: movq -8(%rbp), %rax
18: cmpq %rdi, %rax
19: jl L3
20: leave
21: ret
Here is what I think:
After line 7, %rsp should be a multiple of 16. (%rsp must be a multiple of 16 before vframe is called because of stack-frame alignment. When vframe is called, 8 is subtracted from %rsp to hold the caller's return address, the pushq instruction in line 2 subtracts another 8, and line 4 subtracts 16. So at the start of line 7, %rsp is a multiple of 16. Line 7 then subtracts %rax from %rsp, and since line 6 makes %rax a multiple of 16, line 7 leaves %rsp a multiple of 16.) That means the low 4 bits of %rsp are all zeros.
Then in line 8, %rsp+7 is stored in %rax, and in line 9 %rax is shifted right logically by 3 bits, and in line 10, %rax*8 is stored in %r8.
After line 7, the lower 4 bits of %rsp are all zeros. In line 8 %rsp+7 just makes the lower 3 bits all ones, and line 9 truncates these 3 ones, and in line 10 %rax*8 makes the result shift left by 3 bits. So the final result should just be the original %rsp (the result of line 7).
So I wonder whether lines 8-10 are useless.
Why not just use movq %rsp, %r8 after line 7 and remove the original lines 8-10?
I thought a useful exploratory experiment would be to reduce your generated code to:
.globl _vframe
_vframe:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 22(, %rdi, 8), %rax
andq $-16, %rax
subq %rax, %rsp
leaq 7(%rsp), %rax
shrq $3, %rax
leaq 0(, %rax, 8), %r8
mov %r8, %rax
sub %rsp, %rax
leave
ret
Note that I just eliminated the code that did anything useful, and returned the difference between %r8 and %rsp.
Then wrote a driver:
extern void *vframe(unsigned long n);

#include <stdio.h>

int main(void) {
    int i;
    for (i = 0; i < (1<<18); i++) {
        void *p = vframe(i);
        if (p) {
            printf("%d %p\n", i, p);
        }
    }
    return 0;
}
to check it out. The two values were always the same. So, why? It may be that this is a standard code emission when the compiler is confronted with a given construct (a variable-length array). The compiler has to maintain certain invariants, such as traceable call frames and alignment, so it might just emit this code as the known solution to that problem. Variable-length arrays are generally considered a mistake in the language, a half-working, half-thought-out mechanism added to C, so compiler implementors might not give too much attention to the code generated for them.
Why is the assembly output of store_idx_x86() the same as store_idx() and load_idx_x86() the same as load_idx()?
It was my understanding that __atomic_load_n() would flush the core's invalidation queue, and __atomic_store_n() would flush the core's store buffer.
Note -- I compiled with: gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)
Update: I understand that x86 will never reorder stores with other stores, or loads with other loads -- so is gcc smart enough to emit sfence and lfence only when needed, or should using __atomic_* result in a fence (assuming a memory model stricter than __ATOMIC_RELAXED)?
Code
#include <stdint.h>

inline void store_idx_x86(uint64_t* dest, uint64_t idx)
{
    *dest = idx;
}

inline void store_idx(uint64_t* dest, uint64_t idx)
{
    __atomic_store_n(dest, idx, __ATOMIC_RELEASE);
}

inline uint64_t load_idx_x86(uint64_t* source)
{
    return *source;
}

inline uint64_t load_idx(uint64_t* source)
{
    return __atomic_load_n(source, __ATOMIC_ACQUIRE);
}
Assembly:
.file "util.c"
.text
.globl store_idx_x86
.type store_idx_x86, @function
store_idx_x86:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq -8(%rbp), %rax
movq -16(%rbp), %rdx
movq %rdx, (%rax)
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size store_idx_x86, .-store_idx_x86
.globl store_idx
.type store_idx, @function
store_idx:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq -8(%rbp), %rax
movq -16(%rbp), %rdx
movq %rdx, (%rax)
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size store_idx, .-store_idx
.globl load_idx_x86
.type load_idx_x86, @function
load_idx_x86:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq (%rax), %rax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2:
.size load_idx_x86, .-load_idx_x86
.globl load_idx
.type load_idx, @function
load_idx:
.LFB3:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq (%rax), %rax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE3:
.size load_idx, .-load_idx
.ident "GCC: (GNU) 4.8.2 20140120 (Red Hat 4.8.2-16)"
.section .note.GNU-stack,"",@progbits
On x86, assuming compiler-enforced alignment, they are the same operations. Loads and stores of the native word size or smaller to aligned addresses are guaranteed to be atomic. See the Intel manual, vol. 3A, section 8.1.1:
The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically: Reading or writing a quadword aligned on a 64-bit boundary [...]
Furthermore, x86 enforces a strongly ordered memory model, meaning every store and load has implicit release and acquire semantics, respectively.
Lastly, the fencing instructions you mention are only required when using Intel's non-temporal SSE instructions (great reference here), or when you need a store-load fence (article here), and the latter actually calls for mfence or a locked instruction, not sfence or lfence.
Aside: I was curious about that statement in Intel's manuals, so I devised a test program. Frustratingly, on my computer (2 core i3-4030U), I get this output from it:
unaligned
4265292 / 303932066 | 1.40337%
unaligned, but in same cache line
2373 / 246957659 | 0.000960893%
aligned (8 byte)
0 / 247097496 | 0%
Which seems to violate what Intel says. I will investigate. In the meantime, you should clone that demo program and see what it gives you. You just need -std=c++11 ... -pthread on Linux.
While casually reading the assembler listing of a sample C program, I noticed that the stack pointer is not 16-byte aligned before calling the function foo:
void foo() { }
int func(int p) { foo(); return p; }
int main() { return func(1); }
func:
pushq %rbp
movq %rsp, %rbp
subq $8, %rsp ; See here
movl %edi, -4(%rbp)
movl $0, %eax
call foo
movl -4(%rbp), %eax
leave
ret
The subq $8, %rsp instruction leaves %rsp unaligned before calling foo (it seems it should be "subq $16, %rsp").
In the System V ABI, par. 3.2.2, I read: "the value (%rsp + 8) is always a multiple of 16 when control is transferred to the function entry point".
Can someone help me understand why gcc doesn't use subq $16, %rsp?
Thank you in advance.
Edit:
I forgot to mention my OS and compiler version:
Debian wheezy, gcc 4.7.2
Assuming that the stack pointer is 16-byte aligned when func is entered, then the combination of
pushq %rbp ; <- 8 bytes
movq %rsp, %rbp
subq $8, %rsp ; <- 8 bytes
will keep it 16-byte aligned for the subsequent call to foo().
It seems that since the compiler knows about the implementation of foo() and that it's a noop, it's not bothering with the stack alignment. If foo() is seen as only a declaration or prototype in the translation unit where func() is compiled you'll see your expected stack alignment.
Why does the type used to dereference pointers passed to printf affect the output, even when the types are the same size:
void test_double(void *x)
{
    double *y = x;
    uint64_t *z = x;
    printf("double/double: %lf\n", *y);
    printf("double/uint64: %lf\n", *z);
    printf("uint64/double: 0x%016llx\n", *y);
    printf("uint64/uint64: 0x%016llx\n", *z);
}

int main(int argc, char** argv)
{
    double x = 1.0;
    test_double(&x);
    return 0;
}
Output:
double/double: 1.000000
double/uint64: 1.000000
uint64/double: 0x00007f00e17d7000
uint64/uint64: 0x3ff0000000000000
I would have expected the last two lines to both correctly print 0x3ff0000000000000, the representation of 1.0 in an IEEE754 double floating point.
It's Undefined Behavior. The C language standard says that if the variadic arguments don't have the type implied by the format string, then that's UB. In your third print statement, you're passing a double, but it's expecting a uint64_t. Since it's UB, anything can happen.
This latitude allows the implementation to do things like pass integers in one set of registers but floating-point values in another, which is what I suspect is happening in your test case. For example, the x86-64 System V calling convention used here passes the first few integer arguments in general-purpose registers (%rdi, %rsi, ...) but floating-point arguments in the SSE registers (%xmm0...%xmm7).
If you look at the generated assembly, you'll probably discover why your third and fourth print statements are behaving differently. On Mac OS X 10.8.2 64-bit with Clang 4.1, I was able to reproduce similar results, and the assembly looked like this, which I've annotated:
.section __TEXT,__text,regular,pure_instructions
.globl _test_double
.align 4, 0x90
_test_double: ## @test_double
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
pushq %rbx
pushq %rax
Ltmp6:
.cfi_offset %rbx, -24
# printf("%lf", double)
movq %rdi, %rbx
movsd (%rbx), %xmm0
leaq L_.str(%rip), %rdi
movb $1, %al
callq _printf
# printf("%lf", uint64_t)
movq (%rbx), %rsi
leaq L_.str1(%rip), %rdi
xorb %al, %al
callq _printf
# printf("%llx", double)
leaq L_.str2(%rip), %rdi
movsd (%rbx), %xmm0
movb $1, %al
callq _printf
# printf("%llx", uint64_t)
leaq L_.str3(%rip), %rdi
movq (%rbx), %rsi
xorb %al, %al
addq $8, %rsp
popq %rbx
popq %rbp
jmp _printf ## TAILCALL
.cfi_endproc
In the case of printing a double value, it's putting the argument into the SIMD %xmm0 register:
movsd (%rbx), %xmm0
But in the case of a uint64_t value, it's passing the argument through the integer register %rsi:
movq (%rbx), %rsi