ASM to C: how to dereference a pointer and add an offset?


I feel kind of dumb, but I'm struggling with dereferencing a pointer (+ adding an offset) in C.
What I want to recreate in C is this behavior:
movabs rax, 0xdeadbeef
add rax, 0xa
mov rax, QWORD PTR [rax]
So at the end rax should be: *(0xdeadbeef+0xa)
The equivalent of mov rax, QWORD PTR [rax] is especially important, as I need to use the calculated value and retrieve the data (= a different address) that is stored at that location.
I tried so many things, but here is my current stage:
void *ptr = (void*)0xdeadbeef;
void *ptr2 = *(void*)(ptr+0xa);
Which translates to something like this:
0x7ffff7fe6050: mov QWORD PTR [rbp-0x38],rax
0x7ffff7fe6054: mov rax,QWORD PTR [rbp-0x38]
0x7ffff7fe6058: add rax,0xa
EDIT: It does not actually compile, I made a mistake with the provided C code here and can't figure out which code actually compiled to this. It's not that important anyways as the main target was the translation of ASM to C and the problem is solved now. Thanks everyone for participating.
So the first 2 lines are basically useless and just the value is added to my address and nothing more. I need it to be interpreted as an address and retrieve the value at that point though.
The data stored at those places doesn't matter at this point. Essentially what I want to do is find a specific value in memory and I know a way of adding offsets and dereferencing pointers to get to my goal. The final step will just be a typecast from my address to the actual datatype at that point.
I know this may seem trivial to some of you, but I'm not super familiar with C, so I'm struggling here...

You can simplify your asm to a single instruction, with the math done at assemble time. movabs rax, [0xdeadbeef + 0xa] can use the AL/AX/EAX/RAX-only form of mov that loads from a 64-bit absolute address (https://felixcloutier.com/x86/MOV.html). (It won't fit in a 32-bit sign-extended disp32, because the high bit of the low 32 is set, unlike normal static addresses in position-dependent code). Regular mov with a 32-bit address-size override would work, too, in about 7 bytes, because your address does fit in a zero-extended 32-bit integer.
In C you can also do the whole thing with a single statement. No need to overcomplicate things: your address is a pointer to a pointer, so you need to cast your integer to a x ** type.
void *ptr = *(void**)(0xdeadbeefUL + 0xa);
In asm pointers are just integers, so it makes sense to do your math using integers instead of char*. Making it unsigned guarantees it zero-extends to pointer-width instead of sign-extending.
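(Integer literals in C get the first type in a fixed list that can represent the value, though. For a hex literal that list includes the unsigned types, so 0xdeadbeef with a 32-bit int is an unsigned int; the decimal literal 3735928559 would be a long, or a long long where long is 32-bit. Either way, you wouldn't actually get 0xdeadbeef being a negative 32-bit int that sign-extended to 0xffffffffdeadbeef.)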
Since void doesn't have a size, you can't add / subtract integers to a void*. And pointer-math on void ** would be in chunks of sizeof(void*).
To avoid undefined behaviour from dereferencing a void** that's not aligned by 8 = alignof(void*) (in both mainstream x86-64 ABIs), you'd want to use memcpy. But I assume your example address is just a fake example. The mainstream x86 compilers like gcc don't do anything weird with unaligned addresses to punish programmers for UB, so the compiler output will contain unaligned loads which work fine on x86. But when auto-vectorizing you can run into problems from this kind of UB. See: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
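The memcpy approach mentioned above could look roughly like the sketch below. The helper name load_ptr_at is made up for illustration and is not part of the original answer; it assumes a flat address space where an integer can be converted to a pointer, as on x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a pointer-sized value stored at base + offset
   without assuming that the address is aligned for void*. */
static void *load_ptr_at(uintptr_t base, uintptr_t offset)
{
    void *result;
    /* memcpy has no alignment requirement on its source, so this stays
       well-defined even when base + offset is not a multiple of 8. */
    memcpy(&result, (const void *)(base + offset), sizeof result);
    return result;
}
```

With optimization enabled, mainstream x86-64 compilers typically turn this memcpy into a single plain load, so it should not cost anything compared with the direct dereference.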
But if you did for some reason want to break things up into multiple asm statements, you could transliterate it into multiple C statements like this:
uintptr_t wheres_the_beef = 0xdeadbeef; // mov eax, 0xdeadbeef
wheres_the_beef += 0xa; // add eax, 0xa
void **address = (void**)wheres_the_beef; // purely a cast, no asm instructions;
void *ptr = *address; // mov rax, [rax]
You could mess around with char* if you wanted to add byte offsets to pointers, but there's really no point here.
Again, this still has undefined behaviour on most C implementations, where alignof(void*) is greater than 1 so void **address = (void**)wheres_the_beef creates a misaligned pointer.
(Fun fact: even creating misaligned pointers is UB in ISO C. But all x86 compilers that support Intel's intrinsics must support creating of misaligned pointers for passing them to intrinsics like _mm_loadu_ps(), so only actually dereferencing them is a potential problem on x86 compilers.)

Related

Why is gcc allowed to speculatively load from a struct?

Example Showing the gcc Optimization and User Code that May Fault
The function 'foo' in the snippet below will load only one of the struct members A or B; well at least that is the intention of the unoptimized code.
typedef struct {
int A;
int B;
} Pair;
int foo(const Pair *P, int c) {
int x;
if (c)
x = P->A;
else
x = P->B;
return c/102 + x;
}
Here is what gcc -O3 gives:
mov eax, esi
mov edx, -1600085855
test esi, esi
mov ecx, DWORD PTR [rdi+4] <-- ***load P->B***
cmovne ecx, DWORD PTR [rdi] <-- ***load P->A***
imul edx
lea eax, [rdx+rsi]
sar esi, 31
sar eax, 6
sub eax, esi
add eax, ecx
ret
So it appears that gcc is allowed to speculatively load both struct members in order to eliminate branching. But then, is the following code considered undefined behavior or is the gcc optimization above illegal?
#include <stdlib.h>
int naughty_caller(int c) {
Pair *P = (Pair*)malloc(sizeof(Pair)-1); // *** Allocation is enough for A but not for B ***
if (!P) return -1;
P->A = 0x42; // *** Initializing allocation only where it is guaranteed to be allocated ***
int res = foo(P, 1); // *** Passing c=1 to foo should ensure only P->A is accessed? ***
free(P);
return res;
}
If the load speculation will happen in the above scenario there is a chance that loading P->B will cause an exception because the last byte of P->B may lie in unallocated memory. This exception will not happen if the optimization is turned off.
The Question
Is the gcc optimization shown above of load speculation legal? Where does the spec say or imply that it's ok?
If the optimization is legal, how does the code in 'naughty_caller' turn out to be undefined behavior?

Reading a variable (that was not declared as volatile) is not considered to be a "side effect" as specified by the C standard. So the program is free to read a location and then discard the result, as far as the C standard is concerned.
This is very common. Suppose you request 1 byte of data from a 4 byte integer. The compiler may then read the whole 32 bits if that's faster (aligned read), and then discard everything but the requested byte. Your example is similar to this but the compiler decided to read the whole struct.
Formally this is found in the behavior of "the abstract machine", C11 chapter 5.1.2.3. Given that the compiler follows the rules specified there, it is free to do as it pleases. And the only rules listed are regarding volatile objects and sequencing of instructions. Reading a different struct member in a volatile struct would not be ok.
As for the case of allocating too little memory for the whole struct, that's undefined behavior. Because the memory layout of the struct is usually not for the programmer to decide - for example the compiler is allowed to add padding at the end. If there's not enough memory allocated, you might end up accessing forbidden memory even though your code only works with the first member of the struct.

No, if *P is allocated correctly P->B will never be in unallocated memory. It might not be initialized, that is all.
The compiler has every right to do what it does. The only thing it is not allowed to do is fault on the access to P->B with the excuse that it is not initialized. But what it does and how it does it is at the discretion of the implementation and not your concern.
If you cast the pointer returned by malloc to Pair* and the block is not guaranteed to be large enough to hold a Pair, the behavior of your program is undefined.
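For contrast, a well-defined counterpart of naughty_caller might look like the sketch below. The name polite_caller is hypothetical; foo and Pair are repeated from the question so the block stands alone.

```c
#include <stdlib.h>

typedef struct {
    int A;
    int B;
} Pair;

/* Same function as in the question. */
static int foo(const Pair *P, int c)
{
    int x;
    if (c)
        x = P->A;
    else
        x = P->B;
    return c / 102 + x;
}

/* Hypothetical caller that allocates the whole struct. */
static int polite_caller(int c)
{
    Pair *P = malloc(sizeof *P);   /* room for A, B and any padding */
    if (!P)
        return -1;
    P->A = 0x42;
    P->B = 0;                      /* initialised so that c == 0 is defined too */
    int res = foo(P, c);
    free(P);
    return res;
}
```

Because the allocation covers the full Pair, a speculative load of P->B cannot touch memory outside the object, whichever branch the source code takes.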

This is perfectly legal because reading some memory location isn't considered an observable behavior in the general case (volatile would change this).
Your example code is indeed undefined behavior, but I can't find any passage in the standard docs that explicitly states this. But I think it's enough to have a look at the rules for effective types ... from N1570, §6.5 p6:
If a value is stored into an object having no declared type through an
lvalue having a type that is not a character type, then the type of the lvalue becomes the
effective type of the object for that access and for subsequent accesses that do not modify
the stored value.
So, your write access to *P actually gives that object the type Pair -- therefore it just extends into memory you didn't allocate, the result is an out of bounds access.

A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object. The value is that of the named member of the object to which the first expression points
If invoking the expression P->A is well-defined, then P must actually point to an object of type Pair, and consequently P->B is well-defined as well.

A -> operator on a Pair * implies that there's a whole Pair object fully allocated. (@Hurkyl quotes the standard.)
x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think that will be a performance win on target microarchitecture they're tuning for in any given situation.
Note that on x86 memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. Is it safe to read past the end of a buffer within the same page on x86 and x64?. libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't, instead using scalar loops for the leftover elements at the end of an auto-vectorized loop even where it already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because it would make runtime bounds-checking with valgrind difficult.)
To get the behaviour you were expecting, use a const int * arg.
An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if its struct code is a win, it's a missed optimization not to do it on arrays when it's also safe.)
In C, you're allowed to pass this function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.
Source + gcc and clang output for this and other variations on the Godbolt compiler explorer
// exactly equivalent to const int p[2]
int load_pointer(const int *p, int c) {
int x;
if (c)
x = p[0];
else
x = p[1]; // gcc missed optimization: still does an add with c known to be zero
return c + x;
}
load_pointer: # gcc7.2 -O3
test esi, esi
jne .L9
mov eax, DWORD PTR [rdi+4]
add eax, esi # missed optimization: esi=0 here so this is a no-op
ret
.L9:
mov eax, DWORD PTR [rdi]
add eax, esi
ret
In C, you can sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all the memory even if the C abstract machine doesn't. The syntax is int p[static 2]:
int load_array(const int p[static 2], int c) {
... // same body
}
But gcc doesn't take advantage, and emits identical code to load_pointer.
Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address.
lea rax, [rdi + 4]
test esi, esi
cmovne rax, rdi
add esi, dword ptr [rax]
mov eax, esi # missed optimization: mov on the critical path
ret
This isn't necessarily good: it has higher latency than gcc's struct code, because the load address is dependent on a couple extra ALU uops. It is pretty good if both addresses aren't safe to read and a branch would predict poorly.
We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones), instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.
int load_pointer_v3(const int *p, int c) {
int offset = (c==0);
int x = p[offset];
return c + x;
}
xor eax, eax
test esi, esi
sete al
add esi, dword ptr [rdi + 4*rax]
mov eax, esi
ret
gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what it does in the branching version:
xor eax, eax
test esi, esi
sete al
mov eax, dword ptr [rdi + 4*rax]
add eax, esi
ret
Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than others on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). The test and lea can run in parallel.
After inlining, that final mov probably wouldn't exist, and it could just add into esi.

This is always allowed under the "as-if" rule if no conforming program can tell the difference. For example, an implementation could guarantee that after each block allocated with malloc, there are at least eight bytes that can be accessed without side effects. In that situation, the compiler can generate code that would be undefined behaviour if you wrote it in your code. So it would be legal for the compiler to read P[1] whenever P[0] is correctly allocated, even if that would be undefined behaviour in your own code.
But in your case, if you don't allocate enough memory for a struct, then reading any member is undefined behaviour. So here the compiler is allowed to do this, even if reading P->B crashes.

x86 mov instruction in C pointer of different size

I'm trying to replicate an x86 mov instruction, such as mov %ecx,-0x4(%ebp) in C and am confused about how to do it. I have an int array for the registers and an int displacement. How would I move the value of %ecx into the memory address 4 less than the value stored in %ebp?
I have:
int* destAddress=(int*)(displacement + registers[destination]);
*destAddress=registers[source];
I'm getting a Warning: cast to pointer from integer of different size.

mov %ecx,-0x4(%ebp)
or, in Intel syntax:
mov DWORD PTR [ebp-4], ecx
is storing the value in ECX into the memory location [ebp-4].
EBP is the "base pointer" and is commonly used (in unoptimized code) to access data on the stack. Based on the negative offset, this instruction is almost certainly storing the value of ECX into the first DWORD-sized local variable.
If you wanted to translate this to C, it would be:
int local = value;
assuming that value is mapped to the ECX register, and local is a local variable allocated on the stack. Really, that's it.
[Except that a C compiler would generally put a local variable like this in a register, so this would really translate to something more like mov edx, ecx. The only time it would spill to stack would be if it ran out of registers (which isn't uncommon in the very register-poor x86 ISA). Alternatively, you could force it to spill by making the variable volatile: volatile int local = value;. But there is no good reason for doing that in real code.]
There is pointer dereferencing going on here under the hood, of course, as you see in the assembly-language instruction, but it doesn't manifest in the C representation.
If you wanted to get some pointer notation in there, say you had an array of values allocated on the stack, and wanted to initialize its first member:
int array[4];
array[0] = value; // set first element of array to 'value' (== ECX)
The displacement (-4) won't appear at all in the C code. The C compiler handles that.
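If the goal is instead to emulate the instruction with a register array, as the question describes, one possible sketch is shown below. It assumes the warning comes from building on a 64-bit host, where a 32-bit int sum is narrower than a pointer. All names here (REG_ECX, REG_EBP, guest_mem, emu_store32) are invented for illustration. Guest addresses are treated as indices into a simulated memory array rather than as host pointers, which sidesteps the cast entirely.

```c
#include <stdint.h>
#include <string.h>

enum { REG_ECX, REG_EBP, REG_COUNT };   /* hypothetical register indices */
#define GUEST_MEM_SIZE 4096u            /* hypothetical guest memory size */

static uint32_t registers[REG_COUNT];
static unsigned char guest_mem[GUEST_MEM_SIZE];

/* Emulates a 32-bit store: mov %src, disp(%base).
   Returns 0 on success, -1 if the guest address is out of range. */
static int emu_store32(int src, int base, int32_t disp)
{
    /* unsigned arithmetic wraps modulo 2^32, like the real address calculation */
    uint32_t addr = registers[base] + (uint32_t)disp;
    if (addr > GUEST_MEM_SIZE - 4u)
        return -1;
    /* copy the 4 bytes in host byte order (little-endian on an x86 host) */
    memcpy(&guest_mem[addr], &registers[src], 4);
    return 0;
}
```

For mov %ecx,-0x4(%ebp) the call would be emu_store32(REG_ECX, REG_EBP, -4).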

C function implemented in assembler x86 when parameters are passed by reference

I am implementing a C defined function in assembler. The function is as follows
extern void swapNums(float* one,float* two,float* three);
A) swapNums accepts 3 references to a float, and then places the smallest of these 3 values in 'one', the middle value in 'two' and the largest of the three values in 'three'.
I want to know:
Which registers are used to store the references to the 3 floats,
i.e. Is it rsi,rdi,rdx... or is it xmm0,xmm1,xmm3 ...
How do I change the value in *one, *two, *three so that i can satisfy A)
I am accustomed to implementing C functions in assembler when the parameters of the function are passed by value. For example when dealing with floating point parameters, I follow the conventions below:
section .data
global <name of function>
<name of function>:
movss [x1], xmm0 ; move the first parameter into memory location x1
movss [x2], xmm1 ; move the second parameter into memory location x2
movss [xNorm], xmm2 ; move the third parameter into memory location x3
and then link the object files of the .asm and .c files when creating an executable

I've got the exact same problem, and I figured it out.
rdi contains the literal address, [rdi] will 'point' to the value stored at that address; so the three addresses will be stored at rdi, rsi, and rdx (Linux, and Mac OS/X); or rcx, rdx, and r8 (Windows).
So to move the actual values that need to be compared across to the SSE registers, you need to use movss xmm0, [rdi].
Finally to get the floats back across to the C program, you need to use movss [rdi], xmm0: this will move the value of xmm0 over to where rdi is 'pointing' to.
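As a reference for what the assembly has to do, here is a sketch of the same behaviour written in C. The helper swap_if_greater is a made-up name. Each read of *a or *b corresponds to a load such as movss xmm0, [rdi], and each assignment through the pointer corresponds to a store such as movss [rdi], xmm0.

```c
/* Hypothetical helper: order two floats in place so that *a <= *b. */
static void swap_if_greater(float *a, float *b)
{
    if (*a > *b) {          /* load both values through the pointers */
        float tmp = *a;
        *a = *b;            /* store back through the pointers */
        *b = tmp;
    }
}

/* Smallest value ends up in *one, middle in *two, largest in *three. */
void swapNums(float *one, float *two, float *three)
{
    swap_if_greater(one, two);
    swap_if_greater(two, three);
    swap_if_greater(one, two);
}
```

Three compare-and-swap steps are enough to sort three values, which keeps a hand-written assembly version short as well.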

There is no pass by reference in C because there are no references in C. When programming in C don't use the word "reference", forget that it even exists. All values in C are passed by value because they are values, not references which don't exist. Pointers are values, if that wasn't clear. (Strictly speaking, the value of a pointer is a reference to another object so we can't completely forget the word "reference", but that really confuses people when learning C).
When writing x86_64 (your question implies x86_64, not x86 which your title mentions) assembler all pointers are handled exactly the same regardless of the type they point to and in the function call ABI they are passed in the same registers as integer values.

How to dereference zero address with GCC? [duplicate]

This question already has answers here:
C standard compliant way to access null pointer address?
Suppose I need to write to zero address (e.g. I've mmapped something there and want to access it, for whatever reason including curiosity), and the address is known at compile time. Here're some variants I could think of to obtain the pointer, one of these works and another three don't:
#include <stdint.h>
void testNullPointer()
{
// Obviously UB
unsigned* p=0;
*p=0;
}
void testAddressZero()
{
// doesn't work for zero, GCC detects it as NULL
uintptr_t x=0;
unsigned* p=(unsigned*)x;
*p=0;
}
void testTrickyAddressZero()
{
// works, but the resulting assembly is not as terse as it could be
unsigned* p;
asm("xor %0,%0\n":"=r"(p));
*p=0;
}
void testVolatileAddressZero()
{
// p is updated, but the code doesn't actually work
unsigned*volatile p=0;
*p=0; // because this doesn't dereference p! // EDIT: pointee should also be volatile, then this will work
}
I compile this with
gcc test.c -masm=intel -O3 -c -o test.o
and then objdump -d test.o -M intel --no-show-raw-insn gives me (alignment bytes are skipped here):
00000000 <testNullPointer>:
0: mov DWORD PTR ds:0x0,0x0
a: ud2a
00000010 <testAddressZero>:
10: mov DWORD PTR ds:0x0,0x0
1a: ud2a
00000020 <testTrickyAddressZero>:
20: xor eax,eax
22: mov DWORD PTR [eax],0x0
28: ret
00000030 <testVolatileAddressZero>:
30: sub esp,0x10
33: mov DWORD PTR [esp+0xc],0x0
3b: mov eax,DWORD PTR [esp+0xc]
3f: add esp,0x10
42: ret
Here the testNullPointer obviously has UB since it dereferences what is null pointer by definition.
The approach in testAddressZero gives the expected code for any address other than 0, e.g. 1, but for zero GCC appears to detect that address zero corresponds to the null pointer, so it also generates UD2.
The asm way of getting the zero address certainly inhibits the compiler's checks, but the price of that is that one has to write different assembly code for each architecture even if the principle of testAddressZero might have been successful (i.e. the same flat memory model on each arch) if not UD2 and similar traps. Also, the code appears not as terse as in the above two variants.
The way of volatile pointer would seem to be the best, but the code generated here appears to not dereference the address for some reason, so it's also broken.
The question now: if I'm targeting GCC, how can I seamlessly access zero address without any traps or other consequences of UB, and without the need to write in assembly?

As a workaround you can use the GCC option -fno-delete-null-pointer-checks, which stops the compiler from assuming that a dereferenced pointer cannot be null.
The optimization it disables (-fdelete-null-pointer-checks) is intended to speed up generated code, but turning it off can be useful in specific cases such as this one.
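A minimal sketch of a store through a numeric address, meant to be combined with that option, might look like this. The helper name poke_unsigned is hypothetical. The pointee is volatile, following the EDIT note in the question. Whether a call with address 0 behaves as wanted still depends on the compiler version, the flags, and the target, so treat that part as an assumption to verify on your own setup.

```c
#include <stdint.h>

/* Hypothetical helper: store `value` at the numeric address `addr`. */
static void poke_unsigned(uintptr_t addr, unsigned value)
{
    /* volatile pointee: the store is an observable access, so the
       optimizer may not drop it or merge it with other accesses */
    *(volatile unsigned *)addr = value;
}
```

The intended use for the question would be poke_unsigned(0, 0) after mapping page zero; against an ordinary object it is simply a store through a converted address.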

I would put the pointer into a global variable:
const uintptr_t zero = 0;
unsigned* zeroAddress= (unsigned *)zero;
void testZeroAddressPointer()
{
*zeroAddress=0;
}
Provided you expose the address beyond the scope of optimization (so the compiler can't figure out you don't set it somewhere else), that should do the trick, albeit slightly less efficiently.
Edit: make this code independent of implicit zero to null conversion.

The 0 address is the C99 NULL pointer (actually the "implementation" of the null pointer, which you can often write as 0....) on all the architectures I know about.
The null pointer has a very specific status in hosted C99: when a pointer can be (or was) dereferenced, it is guaranteed (by the language specification) to not be NULL (otherwise, it is undefined behavior).
Hence, the GCC compiler has the right to optimize (and actually will optimize)
int *p = something();
int x = *p;
/// the compiler is permitted to skip the following
/// because p has been dereferenced so cannot be NULL
if (p == NULL) { doit(); return; };
In your case, you might want to compile for the freestanding subset of the C99 standard. So compile with gcc -ffreestanding (beware, this option can bring some infelicities).
BTW, you might declare some extern char strange[] __attribute__((weak)); (perhaps even add asm("0") ...) and have some assembler or linker trick to make that strange have a 0 address. The compiler would not know that such a strange symbol is in fact at the 0 address...
My strong suggestion is to avoid dereferencing the 0 address. If you really need to dereference the address 0, be prepared to suffer (so code some asm, lower the optimization, etc.).
(If you have mmap-ed the first page, just avoid using its first byte at address 0; that is often not a big deal.)
(IIRC, you are touching a grey area of GCC optimizations, and perhaps even of the C99 language specification, and you certainly want the freestanding flavor of C; notice that -O3 optimization for freestanding C is not well tested in the GCC compiler and might have residual bugs.)
You could consider changing the GCC compiler so that the null pointer has the numerical address 42. That would take some work.

Performance difference when accessing using pointer and double pointer

Is there any performance difference when we access a memory location by using a pointer and double pointer?
If so, which one is faster ?

There is no simple answer to this, as it might depend on the actual machine. If I remember correctly, some legacy machines (such as the PDP11) offered a 'double pointer' access in a single instruction.
However, this is not the situation today. Accessing memory is not as simple as it looks and requires a lot of work, due to virtual memory. For this reason, my guess is that a double dereference should in fact be slower on most modern machines: more work has to be done to translate two virtual addresses to physical addresses and retrieve their contents. But that's just an educated guess.
Note however, that the compiler might optimize 'redundant' accesses for you already.
To the best of my knowledge, however, there is no machine on which 'double access' is faster than 'single access', so we can say that single access is not worse than double access.
As a side note, I believe that in real-life programs the difference is negligible (compared to anything else done in the program), and unless it is done in a very performance-sensitive loop, just do whatever is more readable. Also, the compiler might optimize it for you already if it can.

Assuming you are talking about something like
int a = 10;
int *aptr = &a;
int **aptrptr = &aptr;
Then the cost of
*aptr = 20;
is one dereference. The address held in aptr must first be retrieved, and then that address can be stored to.
The cost of
**aptrptr = 30;
is two dereferences. The address held in aptrptr must first be retrieved. Then the address stored at that address must be retrieved. Then this address can be stored to.
Is this what you were asking?
Therefore, to conclude: using a single pointer is faster if that suits your needs.
Note, that if you access a pointer or double pointer in a loop, for example,
while(some condition)
*aptr = something;
or
while(some condition)
**aptrptr = something;
The compiler will likely optimize so that the dereferencing is only done once at the start of the loop, so the cost is only 1 extra address fetch rather than N, where N is the number of times the loop executes.
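The hoisting described above can also be written by hand, as in the sketch below. The function names are invented for illustration. Whether a given compiler performs the same transformation automatically depends on what it can prove about aliasing, so the hand-hoisted form is the one that guarantees a single fetch of the inner pointer.

```c
#include <stddef.h>

/* Straightforward version: two dereferences written in the loop body. */
static void fill_double(int **pp, int value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        **pp = value + (int)i;
}

/* Hand-hoisted version: the inner pointer is fetched once, before the loop. */
static void fill_hoisted(int **pp, int value, size_t n)
{
    int *p = *pp;               /* the one extra address fetch */
    for (size_t i = 0; i < n; i++)
        *p = value + (int)i;
}
```

Both functions leave the same final value in the target int; they differ only in how often the source text reads *pp.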
EDIT:
(1) As Amit correctly points out the "how" of pointer access is not explicitly a C thing... it does depend on the underlying architecture. If your machine supports a double dereference as a single instruction then there might not be a big difference. He is using the index deferred addressing mode of the PDP11 as an example. You might find out that such an instruction still chews up more cycles... consult the hardware documentation and look at the optimization that your C compiler is able to apply for your specific architecture.
The PDP11 architecture dates from around the 1970s. As far as I know (if someone knows a modern architecture that can do this, please post!), most RISC architectures don't have such a double dereference and will probably need to do two fetches.
Therefore, to conclude: using a single pointer is probably faster in general, with the caveat that specific architectures may handle this better than others, and compiler optimizations, as I discussed, could make the difference negligible. To be sure, you just have to profile your code and read up about your architecture :)

Let's see it in this way:
int var = 5;
int *ptr_to_var = &var;
int **ptr_to_ptr = &ptr_to_var;
When the variable var is accessed, you need to
1. get the address of the variable
2. fetch its value from that address.
In the case of the pointer ptr_to_var, you need to
1. get the address of the pointer variable
2. fetch its value from that address (i.e., the address of the variable var)
3. fetch the value at the address pointed to.
In the third case, the pointer to pointer to int variable ptr_to_ptr, you need to
1. get the address of the pointer-to-pointer variable
2. fetch its value from that address (i.e., the address of the pointer variable ptr_to_var)
3. again fetch the value from the address fetched in the second step (i.e., the address of the variable var)
4. fetch the value at the address pointed to.
So we can say that access through a pointer-to-pointer variable is slower than access through a pointer variable, which in turn is slower than accessing a normal variable.

I got curious and set up the following scenario:
int v = 0;
int *pv = &v;
int **ppv = &pv;
I tried dereferencing the pointers and took a look at the disassembly, which showed the following:
int x;
x = *pv;
00B33C5B mov eax,dword ptr [pv]
00B33C5E mov ecx,dword ptr [eax]
00B33C60 mov dword ptr [x],ecx
x = **ppv;
00B33C63 mov eax,dword ptr [ppv]
00B33C66 mov ecx,dword ptr [eax]
00B33C68 mov edx,dword ptr [ecx]
00B33C6A mov dword ptr [x],edx
You can see that there is an additional mov instruction for dereferencing there so my best guess is: double dereferencing is inevitably slower.
