How are offset values determined in a disassembly? - c

long long int i = 57745158985; /* the C code */
0000000000100004: li r7,13           # high 32 bits of 57745158985 (0x0000000D)
0000000000100008: lis r8,29153       # upper half of the low word: 0x71E1 << 16
000000000010000c: ori r8,r8,0x3349   # low word complete: 0x71E13349
0000000000100010: stw r7,24(rsp)     # store the high word into a temporary slot at SP+24
0000000000100014: stw r8,28(rsp)     # store the low word just below it, at SP+28
0000000000100018: lfd fp0,24(rsp)    # reload the two words as one 8-byte value
000000000010001c: stfd fp0,8(rsp)    # store the 64-bit value at SP+8 (likely the variable's slot)
Hello, when I disassemble C code built with CodeWarrior, this is what comes up. In the code, offsets are applied to the stack pointer register. I don't understand why the offsets are 24, 28, and 8. How are they determined? Would it work if any other values were used?

This depends on the ABI for your platform and architecture.
Assuming your platform follows the PowerPC e500 Application Binary Interface:
The register in question (rsp) is the stack pointer. Those offsets are references into the function's stack frame. To understand a function's stack frame you need to consult the ABI.
If you look at Section 2.3, "The Stack Frame", of the e500 ABI document, you will see the layout of a function's stack frame. That is what the compiler targets when it chooses offsets like 24 and 28. The exact slots are the compiler's choice within that layout, so other offsets could work, but only if they are used consistently and stay inside the space the function has reserved for itself.
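As a side note on the rest of the sequence (my own arithmetic, not part of the ABI discussion): 57745158985 is 0x0000000D71E13349, so the compiler materializes the high word 13 in r7 and the low word 0x71E13349 in r8 (lis 29153 sets 0x71E1 in the upper half, ori merges in 0x3349) before storing both halves into the frame. A quick C check of that split:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    long long int i = 57745158985;                /* the constant from the question */
    uint32_t hi = (uint32_t)((uint64_t)i >> 32);  /* ends up in r7 via li */
    uint32_t lo = (uint32_t)(uint64_t)i;          /* ends up in r8 via lis/ori */
    printf("hi = %u, lo = 0x%08X (lis %u, ori 0x%04X)\n",
           (unsigned)hi, (unsigned)lo,
           (unsigned)(lo >> 16), (unsigned)(lo & 0xFFFF));
    return 0;
}

This prints hi = 13 and lis 29153 / ori 0x3349, matching the li/lis/ori instructions above.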

Related

Why I want to add an address to the stack pointer?

I'm trying to understand the first microcorruption challenge.
I want to ask about the first line of the main function.
Why would they add that address to the stack pointer?
This looks like a 16-bit ISA (see footnote 1), otherwise the disassembly makes no sense.
0xff9c is -100 in 16-bit 2's complement, so it looks like this is reserving 100 bytes of stack space for main to use. (Stacks grow downward on most machines). It's not an address, just a small offset.
See MSP430 Assembly Stack Pointer Behavior for a detailed example of MSP430 stack layout and usage.
Footnote 1: MSP430, possibly? See http://mspgcc.sourceforge.net/manual/x82.html: it's a 16-bit ISA with those register names and those mnemonics, and I think its machine code uses variable-length 2- or 4-byte instructions.
It's definitely not ARM; call and jmp are not ARM mnemonics; that would be bl and b. Also, ARM uses op dst, src1, src2 syntax, while this disassembly uses op src, dst.
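A quick check of the 0xff9c arithmetic (just a C sanity check, not part of the challenge itself):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Reinterpret the instruction's 16-bit operand as a signed two's-complement value. */
    int16_t offset = (int16_t)0xff9c;
    printf("0xff9c as a signed 16-bit value: %d\n", offset);  /* prints -100 */
    return 0;
}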

Stacktrace on ARM cortex-M4

When I run into a fault handler on my ARM Cortex-M4 (Thumb) I get a snapshot of the CPU registers just before the fault occurred. With this information I can find where the stack pointer was. Now, what I want is to backtrace through all the functions it passed through. The only problem I see here is that I don't have a frame pointer, so I cannot really see where each subroutine saved the LR, and so on up the call chain.
How would one tackle this problem if the frame pointer is not available in r7?
This blog post discusses this issue with reference to the MIPS architecture - the principles can be readily adapted to ARM architectures.
In short, it describes three possibilities for locating the stack frame for a given SP and PC:
Using compiler-generated debug information (not included in the executable image) to calculate it.
Using compiler-generated stack-unwinding (exception handling) information (included in the executable image) to calculate it.
Scanning the call site to locate the prologue or epilogue code that adjusts the stack pointer, and deducing the stack frame address from that.
Obviously it's very compiler- and compiler-option dependent, and not guaranteed to work in all cases.
R7 is not the frame pointer on the M4; it's R11. R7 is the FP for Cortex-M0+/M1, where only the lower registers are generally available. In any case, when Cortex-M makes a call to a function using BL and its variants, it saves the return address into LR (the link register). At function entry, the LR is normally saved onto the stack. So in theory, to get a call trace, you would "chase" the chain of saved LRs.
Unfortunately, the saved location of the LR on the stack is not defined by the calling convention, and it must be deduced from the debug info for that function in the DWARF records (in the .elf file). I do not know if there is a utility that extracts the LR locations from an ELF file, but it should not be too difficult to write one.
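If neither unwind tables nor debug info are available at run time, a cruder fallback is to scan the stack for anything that looks like a Thumb return address. A rough C sketch of that heuristic follows; the code-region bounds are hypothetical and must come from your own linker map, and the scan will report false positives:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical flash code region for the target; take the real bounds from your linker map. */
#define CODE_START 0x08000000u
#define CODE_END   0x08080000u

/* Scan stack words upward from the faulting SP and report anything that looks
 * like a Thumb return address (bit 0 set, pointing into the code region).
 * This needs no unwind info, but it misses frames and reports false positives. */
void naive_backtrace(const uint32_t *sp, const uint32_t *stack_top)
{
    for (const uint32_t *p = sp; p < stack_top; ++p) {
        uint32_t w = *p;
        if ((w & 1u) && w >= CODE_START && w < CODE_END)
            printf("possible return address: 0x%08lx\n", (unsigned long)(w & ~1u));
    }
}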
Richard at ImageCraft is right.
More information can be found here
This works fine with C code. I had a harder time applying it to C++, but it's not impossible.

x86_64 : is stack frame pointer almost useless?

Linux x86_64.
gcc 5.x
I was studying the output of two builds of the same code, with -fomit-frame-pointer and without (gcc enables that option by default at "-O3").
pushq %rbp        # save the caller's frame pointer
movq %rsp, %rbp   # establish this function's frame pointer
...
popq %rbp         # restore the caller's frame pointer
My question is:
If I globally disable that option, even for, at the extreme, compiling an operating system, is there a catch?
I know that interrupts use that information, so is that option good only for user space?
Compilers always generate self-consistent code, so disabling the frame pointer is fine as long as you don't use external or hand-crafted code that makes some assumption about it (e.g. code that relies on the value of rbp).
Interrupts don't use the frame pointer information; they may use the current stack pointer for saving a minimal context, but this depends on the type of interrupt and the OS (a hardware interrupt probably uses a Ring 0 stack).
You can look at Intel manuals for more information on this.
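As an example of the kind of hand-crafted code that does assume a frame pointer (my sketch, not from the question): a minimal backtrace that walks the saved-rbp chain. It relies on every function in the chain keeping the standard prologue, so it stops working once you build with -fomit-frame-pointer.

#include <stdio.h>

/* Walk the classic x86-64 frame-pointer chain: at [rbp] lies the caller's saved
 * rbp and at [rbp+8] the return address into the caller. This layout only
 * exists when the code is compiled with the frame pointer enabled. */
void backtrace_rbp(void)
{
    void **frame = __builtin_frame_address(0);    /* current rbp (GCC/Clang builtin) */
    for (int depth = 0; frame != NULL && depth < 16; ++depth) {
        printf("return address: %p\n", frame[1]); /* saved return address */
        frame = (void **)frame[0];                /* follow the saved rbp */
    }
}

int main(void)
{
    backtrace_rbp();
    return 0;
}

Build it with gcc -O0 (or -fno-omit-frame-pointer) and the chain is walkable; with -fomit-frame-pointer the loop reads whatever happens to be on the stack.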
About the usefulness of the frame pointer:
Years ago, after compiling a couple of simple routines and looking at the generated 64-bit assembly code, I had the same question.
If you don't mind reading a whole lot of notes I wrote for myself back then, here they are.
Note: Asking about the usefulness of something is a little bit relative. Writing assembly code for the current main 64-bit ABIs, I found myself using the stack frame less and less. However, this is just my coding style and opinion.
I like using the frame pointer and writing the prologue and epilogue of a function, but I like direct, uncomfortable answers too, so here's how I see it:
Yes, the frame pointer is almost useless in x86_64.
Be aware that it is not completely useless, especially for humans, but a compiler doesn't need it anymore.
To better understand why we have a frame pointer in the first place it is better to recall some history.
Back in the real mode (16 bit) days
When Intel CPUs supported only "16-bit mode" there were some limitations on how the stack could be accessed; in particular, this instruction was (and still is) illegal:
mov ax, WORD [sp+10h]
because sp cannot be used as a base register. Only a few designated registers could be used for that purpose, for example bx or the more famous bp.
Nowadays it's not a detail many people pay attention to, but bp has one advantage over the other base registers: by default it implies the use of ss as the segment/selector register, just like the implicit uses of sp (by push, pop, etc.), and like esp does on later 32-bit processors.
Even if your program was scattered all across memory, with each segment register pointing to a different area, bp and sp acted the same; after all, that was the intent of the designers.
So a stack frame was usually necessary, and consequently a frame pointer.
bp effectively partitioned the stack into four parts: the arguments area, the return address, the old bp (just a WORD) and the local variables area. Each area is identified by the offset used to access it: positive for the arguments and return address, zero for the old bp, negative for the local variables.
Extended effective addresses
As the Intel CPUs evolved, more extensive 32-bit addressing modes were added.
Specifically, any 32-bit general-purpose register could now be used as a base register, including esp.
With instructions like this
mov eax, DWORD [esp+10h]
now valid, the use of the stack frame and the frame pointer seemed doomed.
This turned out not to be the case, at least not at first.
It is true that esp could now be used on its own, but the separation of the stack into the four areas mentioned above was still useful, especially for humans.
Without the frame pointer, every push or pop changes the offset of an argument or local variable relative to esp, producing code that looks unintuitive at first sight. Consider how to implement the following C routine with the cdecl calling convention:
int my_add(int a, int b);

int my_routine(int a, int b)
{
    return my_add(a, b);
}
without and with a stack frame:
my_routine:
  push DWORD [esp+08h]
  push DWORD [esp+08h]
  call my_add
  add esp, 08h          ; cdecl: the caller removes the two arguments
  ret

my_routine:
  push ebp
  mov ebp, esp
  push DWORD [ebp+0Ch]
  push DWORD [ebp+08h]
  call my_add
  add esp, 08h          ; cdecl: the caller removes the two arguments
  pop ebp
  ret
At first sight it seems that the first version pushes the same value twice. It actually pushes the two separate arguments: the first push lowers esp, so the same effective address [esp+08h] refers to a different argument by the time of the second push.
If you add local variables (especially lots of them) then the situation quickly becomes hard to read: Does mov eax, [esp+0CAh] refer to a local variable or to an argument? With a stack frame we have fixed offsets for the arguments and local variables.
At first even compilers kept preferring the fixed offsets given by the frame base pointer; I saw this behavior change first with gcc.
In a debug build the stack frame effectively adds clarity to the code, making it easy for the (proficient) programmer to follow what is going on and to recover the call stack.
The modern compilers however are good at math and can easily keep count of the stack pointer movements and generate the appropriate offsets from esp, omitting the stack frame for faster execution.
When a CISC requires data alignment
Until the introduction of SSE instructions the Intel processors never asked much from the programmers compared to their RISC brothers.
In particular, they never asked for data alignment; we could access 32-bit data at an address that is not a multiple of 4 without major complaint (depending on the DRAM data width, this may result in increased latency).
SSE introduced 16-byte operands that need to be accessed on a 16-byte boundary, and as the SIMD paradigm was implemented efficiently in hardware and grew more popular, alignment on a 16-byte boundary became important.
The main 64-bit ABIs now require it: the stack must be aligned on paragraphs (i.e. 16 bytes).
Now, functions are usually called in such a way that the stack is aligned after the prologue, but suppose we are not blessed with that guarantee; we would then need to do one of these:
push rbp              push rbp
mov rbp, rsp          mov rbp, rsp
and spl, 0f0h         sub rsp, xxx
sub rsp, 10h*k        and spl, 0f0h
One way or another, the stack is aligned after either of these prologues; however, we can no longer use a negative offset from rbp to access local variables that need alignment, because the frame pointer itself is not aligned.
We need to use rsp. We could arrange a prologue that leaves rbp pointing at the top of an aligned area of local variables, but then the arguments would sit at unknown offsets.
We can arrange a complex stack frame (maybe with more than one pointer), but the key feature of the old-fashioned frame base pointer was its simplicity.
So we can use the frame pointer to access the arguments on the stack and the stack pointer for the local variables. Fair enough.
Alas, the role of the stack in argument passing has been reduced: for a small number of arguments (four or six, depending on the ABI) it is not even used, and in the future it will probably be used even less.
So we don't use the frame pointer for the local variables (mostly), nor for the arguments (mostly). What do we use it for?
It saves a copy of the original rsp, so that restoring the stack pointer at function exit takes a single mov. If the stack is aligned with an and, which is not invertible, an original copy is necessary.
Actually, some ABIs guarantee that the stack is aligned after the standard prologue, thereby allowing us to use the frame pointer as usual.
Some variables don't need alignment and can be accessed through an unaligned frame pointer; this is usually true for hand-crafted code.
Some functions require more than four parameters.
Summary
The frame pointer is a vestigial paradigm from 16-bit programs that proved itself still useful on 32-bit machines because of its simplicity and the clarity it brings when accessing local variables and arguments.
On 64-bit machines, however, the stricter alignment requirements take away most of that simplicity and clarity; the frame pointer does remain in use in debug builds.
On the fact that the frame pointer can be used to do fun things: it is true, I guess; I've never seen such code but I can imagine how it would work.
I, however, focused on the housekeeping role of the frame pointer, as that is the way I have always seen it.
All the crazy things can be done with any pointer set to the same value as the frame pointer; I just give the latter a more "special" role.
VS2013 for example sometimes uses rdi as a "frame pointer", but I don't consider it a real frame pointer if it doesn't use rbp/ebp/bp.
To me the use of rdi means a Frame Pointer Omission optimization :)

Memory alignment today and 20 years ago

In the famous paper "Smashing the Stack for Fun and Profit", its author takes a C function
void function(int a, int b, int c) {
char buffer1[5];
char buffer2[10];
}
and generates the corresponding assembly code output
pushl %ebp
movl %esp,%ebp
subl $20,%esp
The author explains that since computers address memory in multiples of word size, the compiler reserved 20 bytes on the stack (8 bytes for buffer1, 12 bytes for buffer2).
I tried to recreate this example and got the following
pushl %ebp
movl %esp, %ebp
subl $16, %esp
A different result! I tried various combinations of sizes for buffer1 and buffer2, and it seems that modern gcc no longer pads buffer sizes to multiples of the word size. Instead it abides by the -mpreferred-stack-boundary option.
As an illustration: using the paper's arithmetic rules, for buffer1[5] and buffer2[13] I'd expect 8 + 16 = 24 bytes reserved on the stack. But in reality 32 bytes got reserved.
The paper is quite old and a lot has happened since. I'd like to know what exactly motivated this change of behavior. Is it the move towards 64-bit machines? Or something else?
Edit
The code is compiled on an x86_64 machine using gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) like this:
$ gcc -S -o example1.s example1.c -fno-stack-protector -m32
What has changed is SSE, which requires 16-byte alignment. This is covered in this older gcc document for -mpreferred-stack-boundary=num, which says (emphasis mine):
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 suffers similar penalties if it is not 16 byte aligned.
This is also backed up by the paper Smashing The Modern Stack For Fun And Profit, which covers this and other modern changes that break Smashing the Stack for Fun and Profit.
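To connect the numbers (my arithmetic, not taken from either paper): word-size padding alone gives 8 + 16 = 24 bytes for the two buffers, and keeping the stack 16-byte aligned rounds the reservation up to 32; the compiler may of course reserve extra space for other reasons as well. A small sketch of that rounding:

#include <stdio.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static unsigned round_up(unsigned n, unsigned align)
{
    return (n + align - 1) & ~(align - 1);
}

int main(void)
{
    unsigned buffers = round_up(5, 4) + round_up(13, 4);                       /* 8 + 16 = 24 */
    printf("word-padded buffers: %u bytes\n", buffers);
    printf("16-byte aligned reservation: %u bytes\n", round_up(buffers, 16));  /* 32 */
    return 0;
}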
Memory alignment, of which stack alignment is just one aspect, depends on the architecture. It is partly defined in the Application Binary Interface of the language and in a Procedure Call Standard for the architecture (sometimes both are in a single spec, and it might even vary by platform), and it also depends on the compiler/toolchain where those documents leave room for variation.
The former two documents (names may vary) mostly cover the external interface between functions; they might leave internal structure to the toolchain. However, that still has to match the architecture. Normally the hardware requires a minimal alignment but allows a larger alignment for performance reasons (e.g. byte alignment is the minimum, but reading a 32-bit word would then take multiple bus cycles, so the compiler uses 32-bit alignment).
Normally the compiler (following the PCS) uses an alignment that is optimal for the architecture, under the control of the optimization settings (optimize for speed or size). It takes into account not only the size of the object (aligned to its natural boundary), but also the sizes of internal buses (e.g. a 32-bit x86 has internal 64- or 128-bit buses, ARM CPUs have internal 32- to 128-bit, possibly even wider, buses), caches, etc. For local variables it may also take access patterns into account, so that two adjacent variables can be loaded into a register pair in parallel instead of with two separate loads; it may even reorder such variables.
The stack pointer might require a higher alignment, for instance so the CPU can push two registers at once in an interrupt frame, or push vector registers which require higher alignment. You could write quite a thick book about this subject (and I bet someone already has).
So, in general, there is no single one-alignment-fits-all rule. However, for struct and array packing, the C standard does define some rules for packing/alignment, mostly to guarantee consistency of e.g. sizeof(type) and of the addresses within an array (required for a correct malloc()).
Even char arrays might be aligned for optimal cache layout. Note it is not only the CPU which might have caches, but also PCIe bridges, not to mention PCIe transfers themselves down to DRAM pages.
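As a small illustration of those struct packing rules (the numbers shown are typical for x86/x86-64; the exact padding is ABI-dependent):

#include <stdio.h>
#include <stdalign.h>

struct example {
    char c;   /* 1 byte, followed by padding so that i stays aligned */
    int  i;   /* 4 bytes, typically aligned to a 4-byte boundary */
};

int main(void)
{
    /* On typical x86/x86-64 ABIs this prints sizeof=8 alignof=4: three padding
     * bytes are inserted after c so that i, and every element of an array of
     * struct example, remains naturally aligned. */
    printf("sizeof=%zu alignof=%zu\n",
           sizeof(struct example), alignof(struct example));
    return 0;
}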
I have not tried that specific compiler version or the distribution build you report. My guess would be that the 16 comes from the byte-alignment requirement on the stack (i.e. all stack adjustments are x-byte aligned, and x may be 16 for your invocation).
Note that the variable alignment you seem to have started with is slightly different from the above; in gcc it is controlled by alignment markings on the variable. Try using those and you should see a difference.

Intel x86 to ARM assembly conversion

I am currently learning ARM assembly language;
To do so, I am trying to convert some x86 code (AT&T Syntax) to ARM assembly (Intel Syntax) code.
__asm__("movl $0x0804c000, %eax;");
__asm__("mov R0,#0x0804c000");
From this document, I learned that in x86 the Chunk 1 of the heap structure starts at 0x0804c000. But when I try to do the same in ARM,
I get the following error:
/tmp/ccfNZp9F.s:174: Error: invalid constant (804c000) after fixup
I am assuming the problem is that ARM instructions are only 32 bits wide.
Question 1: Any idea what would be the first chunk in case of ARM processors?
Question 2:
From my previous question, I know how memory indirect addressing works.
Are the snippets written below doing the same job?
movl (%eax), %ebx
LDR R0,[R1]
I am using ARMv7 Processor rev 4 (v7l)
Trying to learn ARM by looking at x86 is not a good idea: one is CISC and quite ugly, the other is RISC and much cleaner. Just learn ARM by looking at the instruction set reference in the Architecture Reference Manual. Look up the mov instruction, the add instruction, etc.
ARM doesn't use Intel syntax; it uses ARM syntax.
Don't learn by using inline assembly; write real assembly. Use an instruction set simulator first, not hardware.
ARM, MIPS and others aim for a fixed instruction length. So how would you, for example, fit an instruction that says "move some immediate to a register", specify the register, and fit the 32-bit immediate all in 32 bits? Not possible. So for fixed-length instruction sets you cannot simply load any immediate you want into any register; you must read up on the rules for that instruction set. MIPS allows 16-bit immediates; ARM allows roughly 8 bits, give or take, depending on the flavor of the instruction set and the instruction. MIPS limits where you can put those 16 bits (either high or low), while ARM lets you put those 8 bits anywhere in the 32-bit register, depending on the flavor of the ARM instruction set (ARM, Thumb, Thumb2 extensions).
As with most assembly languages you can solve this problem by doing something like this
ldr r0,my_value
...
my_value: .word 0x12345678
With CISC the immediate is simply tacked onto the instruction, so whether the value sits 0 bytes away or 20 bytes away, it is still there in memory with either approach.
ARM assemblers also generally allow you this shortcut:
ldr r0,=something
...
something:
which says load r0 with the ADDRESS of something, not the contents at that location but the address (like an lea)
But that lends itself to this immediate shortcut
ldr r0,=0x12345678
which, if supported by the assembler, will allocate a memory location to hold the value and generate a ldr r0,[pc,offset] instruction to read it. If the immediate is within the rules for a mov, the assembler might optimize it into a mov rd,#immediate.
Answer to Question 1
The MOV instruction on ARM only has 12 bits available for an immediate value, and those bits are used this way: 8 bits for value, and 4 bits to specify the number of rotations to the right (the number of rotations is multiplied by 2, to increase the range).
This means that only a limited number of values can be used with that instruction. They are:
0-255
256, 260, 264,..., 1020
1024, 1040, 1056, ..., 4080
etc
And so on. You are getting that error because your constant can't be created using the 8 bits + rotation. You can load that value into the register with the following instruction:
LDR r0, =0x0804c000
Notice that this is a pseudo-instruction though. The assembler will basically put that constant somewhere in your code and load it as a memory location with some offset to the PC (program counter).
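If you want to check whether a given constant fits the rotated 8-bit encoding described above, here is a small C sketch of that rule (it covers classic ARM data-processing immediates only; Thumb-2 adds further encodings such as movw):

#include <stdint.h>
#include <stdio.h>

/* An ARM data-processing immediate is an 8-bit value rotated right by an even
 * amount. A constant is encodable if rotating it left by some even amount
 * leaves only the low 8 bits set. */
int arm_immediate_ok(uint32_t v)
{
    for (unsigned r = 0; r < 32; r += 2) {
        uint32_t rotated = (v << r) | (v >> ((32 - r) & 31)); /* rotate left by r */
        if ((rotated & ~0xFFu) == 0)
            return 1;
    }
    return 0;
}

int main(void)
{
    printf("%d\n", arm_immediate_ok(0x0804c000u)); /* 0: not encodable, hence the error */
    printf("%d\n", arm_immediate_ok(0x000003FCu)); /* 1: 0xFF rotated */
    return 0;
}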
Answer to question 2
Yes, those instructions are equivalent: each loads a 32-bit word from the address held in one register into another register.
