Too many Local Variables and Stack Base Pointer Offset Overflows - c

So %ebp (the stack base pointer) plus a constant is used to reference local variables in assembly. What if there are so many local variables that the required constant is too large to fit in a single instruction (32 or 64 bits)? How are edge cases like this handled?
For example, assume that there are 2^30 local variables. To reference the last one we would need an offset of 2^32. If we are working in a 32-bit environment, this offset won't fit in one instruction, considering that the opcode, destination, etc. are also in that same instruction.

In its 32 bit and 64 bit operation modes, the x86 architecture addressing modes allow for either no displacement, an 8 bit displacement, or a 32 bit displacement.
In 32 bit mode, a 32 bit displacement is sufficient to describe every possible displacement (and thus, every possible stack offset). For your concern: the stack couldn't possibly contain 2^30 variables, as that would be 4 GiB of stack space, leaving no space to store the machine code.
In 64 bit mode, it is indeed possible to have displacements that cannot be described with a 32 bit displacement. This rarely happens in reality (which is why the AMD engineers decided to leave the displacement size at 32 bit) but it can happen occasionally. In such cases, the displacement has to be applied through a register:
mov rax,0x123456789abcdef0 ; displacement
mov eax,[rax+rbp] ; value
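The three cases described above (no displacement, 8 bit, 32 bit) plus the register fallback can be sketched in C. This is only an illustration: the enum and the function name classify_displacement are made up for this example, and it ignores encoding quirks such as [rbp] needing an explicit zero displacement byte.

```c
#include <stdint.h>

/* Hypothetical labels for the cases described in the answer. */
enum disp_kind {
    DISP_NONE,      /* no displacement field needed              */
    DISP_8,         /* fits a sign-extended 8-bit field          */
    DISP_32,        /* fits a sign-extended 32-bit field         */
    DISP_REGISTER   /* too large: must be loaded into a register */
};

/* Pick the smallest displacement form that can hold disp. */
enum disp_kind classify_displacement(int64_t disp)
{
    if (disp == 0)
        return DISP_NONE;
    if (disp >= INT8_MIN && disp <= INT8_MAX)
        return DISP_8;
    if (disp >= INT32_MIN && disp <= INT32_MAX)
        return DISP_32;
    return DISP_REGISTER;
}
```

With this sketch, a typical local at rbp-8 falls into DISP_8, while the 0x123456789abcdef0 example falls into DISP_REGISTER.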

Related

what is the size of every memory cell in the stack and is it possible to split one cell

In a 64 bit system every memory cell is 64 bit, so how does it save an int variable that takes up less space? Wouldn't it spend one 64 bit address anyway? If so, why bother to use different types of variables if they are going to occupy one cell anyway?
Your use of terminology is all over the place.
A memory cell typically corresponds to a logic gate on the hardware level and is very likely to be 1 bit large assuming binary computers.
What I think you are asking about is the smallest addressable unit in a computer, also known as a byte, which is very likely 8 bits large.
This has nothing to do with the data register width of the CPU, which is what one usually refers to when talking about "64 bit computers". The data register width is the largest chunk of data that the CPU can process in a single instruction, but not necessarily the smallest. And this has no relation with the address bus width of the computer, though they are often the same nowadays.
When you declare a variable in C, the size allocated depends on the system. An int is for example very likely 32 bit large on all 32 bit and 64 bit computers. Notably, all mainstream 64 bit computers also support 32 bit or smaller instructions. So it doesn't necessarily make sense for the compiler to allocate more memory than 32 bit - you might get larger memory use for no speed gained.
I believe the term you are fishing for is alignment. It is only inefficient for the computer to read smaller chunks in case they are allocated on misaligned addresses, that is, addresses which are not evenly divisible by the data register width (expressed in bytes). Such accesses are typically slower, or in some cases not supported at all. So a 64 bit compiler might decide to allocate a small variable inside an 8 byte chunk, and leave the remaining unused bytes as padding. However, in case the compiler optimizes for size, it may choose to store data in a more memory-efficient way, at the cost of access time.
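Padding of this kind can be observed from C itself. The following is a minimal sketch, assuming a C11 compiler (for _Alignof); struct sample and both helper functions are made up for this example, and the actual amount of padding depends on the compiler and target.

```c
#include <stddef.h>

/* Hypothetical struct: a 1-byte member followed by a wider one. */
struct sample {
    char tag;
    int  value;
};

/* Bytes of padding the compiler placed between tag and value. */
size_t padding_before_value(void)
{
    return offsetof(struct sample, value) - sizeof(char);
}

/* Non-zero if value starts at a multiple of int's alignment. */
int value_is_aligned(void)
{
    return offsetof(struct sample, value) % _Alignof(int) == 0;
}
```

On many mainstream targets padding_before_value() returns 3, but that number is not guaranteed; only the alignment of value is.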

32-bit fixed instruction length in 64-bit memory space

I am currently reading up on the AArch64 architecture by ARM. They are using a RISC-like instruction set with a fixed instruction length of 32-bit while operating on 64-bit addresses. I am still new to the topic of ISA so my question is: how can you operate with 64-bit long addresses when you only have 32-bit length in your instructions?
32 bits is the length of an instruction, not the operand-size or the address size.
ARM 32-bit (like all other 32-bit RISCs that use 32-bit fixed-width instruction words) can't fit a 32-bit address as an immediate into a single instruction either: there'd be no room for an opcode to say what instruction it is.
The width of an instruction limits the number of registers you can have. With 3 registers per instruction (dst, src1, src2), AArch64's increase from 16 to 32 registers means that each instruction needs 3 * log2(32) = 3* 5 = 15 bits to encode the registers. Or fewer for instructions with only 2 or 1 registers. (e.g. mov-immediate or add-immediate). The rest of the space goes to number of possible opcodes, and the size of immediates.
To get an address into a register, ARM compilers will typically load it from a nearby pool of constants (with a PC-relative addressing mode).
The other option is what most RISC CPUs do: use a 2-instruction sequence to put a 16-bit immediate in the upper half of a register, then OR a 16-bit immediate into the low half. (Or use the lower half of a static address as the displacement to a load/store instruction that uses an addressing mode like register + 16-bit offset.)
MIPS is a good example of a very simple RISC; see its ISA with binary encoding. Its lui reg, imm16 puts imm16 << 16 into a register (Load Upper Immediate). Then lw dst, imm16(base_reg) is a load like I was talking about in the last paragraph.
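The two-instruction sequence can be modelled in C to see why two 16-bit immediates are enough for any 32-bit constant. This is a sketch of the arithmetic only; the function names mirror the MIPS mnemonics lui and ori, and load_constant is a made-up helper.

```c
#include <stdint.h>

/* Models MIPS "lui reg, imm16": imm16 goes to the upper half, the lower half is zero. */
uint32_t lui(uint16_t imm16)
{
    return (uint32_t)imm16 << 16;
}

/* Models MIPS "ori dst, src, imm16": OR a zero-extended imm16 into the low half. */
uint32_t ori(uint32_t src, uint16_t imm16)
{
    return src | imm16;
}

/* Build a 32-bit constant from two 16-bit immediates. */
uint32_t load_constant(uint32_t value)
{
    uint32_t reg = lui((uint16_t)(value >> 16));
    return ori(reg, (uint16_t)(value & 0xFFFFu));
}
```

The variant that folds the low half into a load's displacement needs slightly more care, because that displacement is sign-extended; it is left out here.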
Even in 64-bit code, most numbers are still small, so there's not much need for wider immediate operands (except for addresses). e.g. x86 still uses a choice of 32-bit or 8-bit immediate operands for add r64, imm. x86 being a variable length ISA saves space when immediates are between -128 and +127 in a lot of cases.
You could have a 4-bit CPU that operates on 4096 bit values, it just wouldn't be terribly fast at doing it. Any programming language with a "bignum" type will work this way, though usually not to the same extreme.
The way a 32-bit CPU can operate on 64-bit values is because the hardware or software allows it, no other reason.
It's not like we couldn't manipulate 64-bit integer values before we had 64-bit CPUs. Even the old Intel 80386 could support 64-bit operations and it was released in 1985.
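As an illustration of how a narrower CPU handles wider values, here is a sketch of a 64-bit addition done with 32-bit operations only. The struct and function names are made up for this example; on x86 the two steps would correspond to an add followed by an add-with-carry.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, like a register pair on a 32-bit CPU. */
struct u64_pair {
    uint32_t lo;
    uint32_t hi;
};

/* Add two 64-bit values using only 32-bit arithmetic. */
struct u64_pair add64(struct u64_pair a, struct u64_pair b)
{
    struct u64_pair sum;
    uint32_t carry;

    sum.lo = a.lo + b.lo;          /* step 1: add the low halves            */
    carry  = sum.lo < a.lo;        /* unsigned wrap-around means carry out  */
    sum.hi = a.hi + b.hi + carry;  /* step 2: add high halves plus carry    */
    return sum;
}
```

The same pattern extends to any width by chaining more carries, which is essentially what a bignum library does.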

How does one access individual characters of a string properly aligned in memory, on ARM platform?

Since (from what I have read) ARM9 platform may fail to correctly load data at an unaligned memory address, let's assume unaligned meaning that the address value is not multiple of 2 (i.e. not aligned on 16-bit), then how would one access say, fourth character on a string of characters pointed to by a properly aligned pointer?
char buf[] = "Hello world.";
buf[3]; // (buf + 3) is unaligned here, is it not?
Does compiler generate extra code, as opposed to the case when buf + 3 is properly aligned? Or will the last statement in the example above have undesired results at runtime - yielding something else than the fourth character, the second l in Hello?
Byte accesses don't have to be aligned. The compiler will generate a ldrb instruction, which does not need any sort of alignment.
If you're curious as to why, this is because ARM will load the entire aligned word that contains the target byte, and then simply select that byte out of the four it just loaded.
The concept to remember is that the compiler will try to optimize access based on the type in order to get the most efficiency out of your processor. So when accessing ints, it'll want to use things like the ldr instruction, which will fault if it's an unaligned access. For something like a char access, the compiler will work out some of the details for you. Where you have to be concerned are things like:
Casting pointers. If you cast a char * to an int * and the pointer is not aligned correctly, you'll get an alignment trap. In general, it's okay to cast down (from an int to a char), but not the other way around. You would not want to do this:
char buf[] = "12345678";
int *p = &buf[1];
printf("0x%08X\n", *p); // *p is badness here!
Trying to pull data off the wire with structures. I've seen this done a lot, and it's just plain bad practice. Endianness issues aside, you can cause an alignment trap if the elements aren't aligned correctly for the platform.
FWIW, casting pointers is probably the number one issue I've seen in practice.
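Both pitfalls have a common workaround: read the bytes without ever forming a misaligned int pointer. The sketch below shows two ways to do that; the function names are made up for this example, and the little-endian variant assumes the wire format is little-endian.

```c
#include <stdint.h>
#include <string.h>

/* Copy 4 bytes out of the buffer; memcpy has no alignment requirement.
   The result is in the host's own byte order. */
uint32_t load_u32_native(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Assemble a little-endian 32-bit value one byte at a time.
   Uses only byte accesses, so it is independent of alignment and host endianness. */
uint32_t load_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

For the "12345678" buffer above, load_u32_le(buf + 1) reads the characters '2', '3', '4', '5' and never dereferences a misaligned int pointer.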
There's a great book called Write Portable Code which goes over quite a few details about writing code for multiple platforms. The sample chapter on the linked site actually contains a section talking about alignment.
There's a little more that's going on too. Memory functions, like malloc, also give you back aligned blocks (generally on a double-word boundary) so that you can write in data and not hit an alignment fault.
One last bit, while newer ARMs can cope with unaligned accesses better, that does not mean they're performant. It just means they're tolerant. The same can be said for the X86 processors too. They'll do the unaligned access, but you're forcing extra memory fetches by doing so.
Most systems use byte based addressing. The address 0x1234 is in terms of bytes for example. Assume that I mean 8 bit bytes for this answer.
The definition of unaligned has to do with the size of the transfer. A 32 bit transfer for example is 4 bytes. 4 is 2 to the power 2, so if the lower 2 bits of the address are anything other than zeros then that address is an unaligned 32 bit transfer.
So using a table like this, or just understanding powers of 2:
bits   bytes   zero bits   address bits
8      1       0           []
16     2       1           [0]
32     4       2           [1:0]
64     8       3           [2:0]
128    16      4           [3:0]
the first column is the number of bits in the transfer. the second is the number of bytes that represents, the third is the number of bits at the bottom of the address that have to be zero to make it an aligned transfer, and the last column describes those bits.
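The rule in the table reduces to a single mask test. A minimal sketch follows; is_aligned is a made-up name, and it assumes the transfer size is a power of two.

```c
#include <stdint.h>

/* Non-zero if address is aligned for a transfer of size_bytes.
   size_bytes must be a power of two (1, 2, 4, 8, 16, ...), so that
   size_bytes - 1 is a mask of the low address bits that must be zero. */
int is_aligned(uintptr_t address, uintptr_t size_bytes)
{
    return (address & (size_bytes - 1)) == 0;
}
```

For a 1-byte transfer the mask is zero, so every address passes.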
It is not possible to have an unaligned 8 bit transfer. Not on arm, not on any system. Please understand that.
16 bit transfers. Once we get into transfers of 16 bits or larger, you can START to talk about being unaligned. The problem with unaligned transfers has to do with the number of bus cycles. Say you are doing 16 bit transfers on a system with a 16 bit wide bus and 16 bit wide memories. That means that we have items in memory at these addresses for example, address on left, data on right:
0x0100 : 0x1234
0x0102 : 0x5678
If you want to do a 16 bit transfer that is aligned, the lsbit of your address must be zero: 0x100, 0x102, 0x104, etc. Unaligned transfers would be at addresses with the lsbit set: 0x101, 0x103, 0x105, etc. Why are they a problem? In this hypothetical system (there were and are still real systems like this), in order to get two bytes at address 0x0100 we only need to access the memory one time and take all 16 bits from that one address, resulting in 0x1234. But if we want 16 bits starting at address 0x0101, we have to do two memory transactions, 0x0100 and 0x0102, take one byte from each, and combine those to get the result, which little endian is 0x7812. That takes more clock cycles, more logic, etc. Inefficient and costly. On Intel x86 and other systems from that era, which were 8 or 16 bit processors but used 8 bit memory, everything larger than an 8 bit transfer took multiple clock cycles, and instructions themselves took multiple clock cycles to execute before the next one could start, so burning clock cycles and complication in the logic was not of interest (they saved themselves from pain in other ways).
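The two-transaction case can be simulated in a few lines of C. This is a sketch of the hypothetical 16 bit system from the example, not of any real memory controller; memory16, fetch_word and read16 are made-up names.

```c
#include <stdint.h>

/* Hypothetical 16-bit-wide memory: index 0 is address 0x0100, index 1 is 0x0102. */
static const uint16_t memory16[] = { 0x1234, 0x5678 };
#define MEMORY16_BASE 0x0100u

/* One bus transaction: fetch the aligned 16-bit word at an even address. */
static uint16_t fetch_word(uint32_t address)
{
    return memory16[(address - MEMORY16_BASE) >> 1];
}

/* Little-endian 16-bit read; *transactions reports how many fetches it took. */
uint16_t read16(uint32_t address, int *transactions)
{
    uint16_t first, second;

    if ((address & 1u) == 0) {
        *transactions = 1;
        return fetch_word(address);
    }
    /* Unaligned: the high byte of this word becomes the low byte of the
       result, the low byte of the next word becomes the high byte. */
    first  = fetch_word(address - 1);
    second = fetch_word(address + 1);
    *transactions = 2;
    return (uint16_t)((first >> 8) | ((second & 0xFFu) << 8));
}
```

Reading at 0x0101 takes two fetches and yields 0x7812, matching the value worked out in the text.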
The older arms may or may not have been from that era, but post acorn, the armv4 to the present is a 32 bit system from a perspective of the size of the general purpose registers, the data bus is 32 or 64 bits (the newest arms have 64 bit registers and I would assume if not already 128 bit busses) depending on your system. The core that put ARM on the map the ARM7TDMI which is an ARMv4T, I assume is a 32 bit data bus. The ARM7 and ARM9 ARM ARM (ARM Architectural Reference Manual) changed its language on each revision (I have several revisions going back to the paper only ones) with respect to words like UNPREDICTABLE RESULTS. When and where they would list something as bad or broken. Some of this was legal, understand ARM does not make chips, they sell IP, back then it was masks for a particular foundry today you get the source code to their core and you deal with it. So to survive you need a good legal defense, your secrets are exposed to customers, some of these items that were claimed not to be supported actually have deterministic results, if ARM were to find a clone (which is yet another legal discussion) with these unpredictable results being predictable and matching what arms logic does you have to be pretty good at explaining why. The clones have been crushed when they have tried (that or legally become licensed arm cores) so some of this is just interesting history. Another arm manual described quite clearly what happens when you do an unaligned transfer on the older ARM7 systems. And it is a bit of a duh moment when you see it, quite obvious what was going on (just plain keep it simple stupid logic).
The byte lanes rotated. On a 32 bit bus somewhere in the system, likely not on the amba/axi bus but inside the memory controller you would effectively get this:
0x0100 : 0x12345678
0x0101 : 0x78123456
0x0102 : 0x56781234
0x0103 : 0x34567812
address on the left resulting data on the right. Now why is that obvious you ask and what is the size of that transfer? The size of the transfer is irrelevant, doesnt matter, look at that address/data this way:
0x0100 : 0x12345678
0x0101 : 0xxx123456
0x0102 : 0xxxxx1234
0x0103 : 0xxxxxxx12
Using aligned transfers, 0x0100 is legal for 32, 16, and 8 bit, and if you look at the lower 8, 16, or 32 bits you get the right answer with the data as shown. For address 0x0101 only an 8 bit transfer is legal, and the byte you want is in the lower 8 bits of that data; just copy those over to the register's lower 8 bits. For address 0x0102, 8 and 16 bit are legal, aligned transfers, and 0x1234 is the right answer for 16 bit and 0x34 for 8. Lastly, at 0x0103, 8 bit is the only transfer size without alignment issues and 0x12 is the right answer.
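The rotation can be written down directly. The following sketch models the behaviour described above for the example word 0x12345678; rotated_load is a made-up name and this is not taken from ARM documentation.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits. */
static uint32_t rotate_right(uint32_t value, unsigned n)
{
    n &= 31u;
    return n == 0 ? value : (value >> n) | (value << (32u - n));
}

/* Model of the described behaviour: the aligned word is fetched and the
   byte lanes are rotated by the low two bits of the address. */
uint32_t rotated_load(uint32_t aligned_word, uint32_t address)
{
    return rotate_right(aligned_word, 8u * (address & 3u));
}
```

Truncating the result to 8 or 16 bits gives the values listed above, for example (uint8_t)rotated_load(0x12345678, 0x0103) is 0x12.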
This above information is all from publicly available documents, no secrets here or special insider knowledge, just generic programming experience.
ARM put an exception in, data abort or prefetch abort (thumb is a separate topic), to discourage the use of unaligned transfers, as do other architectures. Unfortunately x86 has led people to be very lazy and also not care about the performance hit that they incur when doing such a thing on an x86, which allows the transfer at the price of extra cycles and extra logic. The prefetch abort, if I remember, was not on by default on the ARM7 platforms I used, but was on by default on the ARM9 platforms I used. My memory could be wrong, and since I dont know how the defaults worked, that could have been a strap option on the core, so it could have varied from chip to chip, vendor to vendor. You could disable it and do unaligned transfers so long as you understood what happened with the data (rotate, not spill over into the next word).
More modern ARM processors do support unaligned transfers and they are as one would expect, I wont use 64 bit examples here to save typing and space but go back to that 16 bit example to paint the picture
0x0100: 0x1234
0x0102: 0x5678
With a 16 bit wide system, memory and bus, little endian, if you did a 16 bit unaligned transfer at address 0x0101 you would expect to see 0x7812 and that is what you get now on the modern arm systems. But it is still a software controlled feature, you can enable exceptions on unaligned transfers and you will get a data abort instead of a completed transfer.
As far as your question goes look at the ldrb instruction, that instruction does an 8 bit read from memory, being 8 bit there is no such thing as unaligned all addresses are valid, if buf[] happened to live at address 0x1234 then buf[3] is at address 0x1237 and that is a perfectly valid address for an 8 bit read. No alignment issues of any kind, no exceptions will fire. Where you would get into trouble is if you do one of these very ugly programming hacks:
char buf[]="hello world";
short *sptr;
int *iptr;
sptr=(short *)&buf[3];
iptr=(int *)&buf[3];
...
something=*sptr;
something=*iptr;
...
short_something=*(short *)&buf[3];
int_something=*(int *)&buf[3];
And then yes you would need to worry about unaligned transfers as well as hoping that you dont have any compiler optimization issues making the code not work as you had thought it would. +1 to jszakmeister for already covering this sub topic.
short answer:
char buf[]="hello world";
char is generally assumed to mean an 8 bit byte so this is a quantity of 8 bit items. certainly compiled for ARM that is what you will get (or mips or x86 or power pc, etc). So accessing buf[X] for any X within that string, cannot be unaligned because
something = buf[X];
Is an 8 bit transfer and you cant have unaligned 8 bit transfers. If you were to do this
short buf[]={1,2,1,2,3,2,1};
short is assumed, though that is not always the case, to be 16 bit; for the arm compilers I know it is 16 bit. But that doesnt matter: buf[X] here also cannot be unaligned because the compiler computes the offset for you, as follows: the address of buf[X] is base_address_of_buf + (X<<1). And the compiler and/or linker will ensure, on ARM, MIPS, and other systems, that buf is placed on a 16 bit aligned address so that math will always result in an aligned address.
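The offset computation can be checked from C. A small sketch, assuming a C11 compiler (for _Alignof); both function names are made up, and sizeof(short) is used instead of a hard-coded shift so the check also holds where short is not 16 bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of element index in a short array. With a 16-bit short
   this equals index << 1, as described in the text. */
size_t short_element_offset(size_t index)
{
    return index * sizeof(short);
}

/* Non-zero if the address of buf[index] is a multiple of short's alignment. */
int short_element_is_aligned(const short *buf, size_t index)
{
    return (uintptr_t)&buf[index] % _Alignof(short) == 0;
}
```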

Why is there no concept of near, far & huge pointers in a 32-bit compiler?

Why is there no concept of near, far & huge pointers in a 32-bit compiler? As far as I understand, programs created with a 16-bit 8086 compiler can be up to 1 MB in size, containing the data segment, graphics segments, etc. To access all those segments and to keep pointer increments working we need these various pointer types, but why is that not necessary in 32-bit?
32-bit compilers can address the entire address space made available to the program (or to the OS) with a single, 32-bit pointer. There is no need for basing because the pointer is large enough to address any byte in the available address space.
One could theoretically conceive of a 32-bit OS that addresses > 4GB of memory (and therefore would need a segment system common with 16-bit OS's), but the practicality is that 64-bit systems became available before the need for that complexity arose.
why there is no concept of near,far & huge pointer in 32 bit compiler?
It depends on the platform and the compiler. Open Watcom C/C++ supports near, far and huge pointers in 16-bit code and near and far pointers in 32-bit code.
As i know programs created on 16 bit 8086 architecture complier can have 1 mb size in which datasegment graphics segments etc are there. to access all those segment and to maintain pointer increment concept we need these various pointers, but why in 32 bit its not necessary?
Because in most cases near 32-bit pointers are enough to cover the entire address space (all 2^32 bytes = 4 GB of it), which is not the case with near or far 16-bit pointers, which as you said yourself can only cover up to 1 MB of memory. (Strictly speaking, in the 16-bit protected mode of the 80286+, you can use 16-bit far pointers to address up to at least 16 MB of memory. That's because those pointers are relative to the beginning of segments, and segments on the 80286+ can start anywhere in the first 16 MB, since the segment descriptors in the global descriptor table (GDT) or the local descriptor table (LDT) reserve 24 bits for the start address of a segment (2^24 bytes = 16 MB).)
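For comparison, this is roughly how a 16-bit real-mode far pointer maps to a physical address on the 8086, which is where the 1 MB limit comes from. The sketch uses a made-up function name and models only the address arithmetic, including the 20-bit wrap-around of the original 8086.

```c
#include <stdint.h>

/* 8086 real mode: linear address = segment * 16 + offset,
   truncated to the 20 address lines of the 8086. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

Different segment:offset pairs can name the same byte, for example 0x1000:0x0010 and 0x1001:0x0000. A flat 32-bit pointer has no such aliasing, which is part of why the near/far/huge distinction disappears.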

Why does an 8-byte array (C) in 64-bit Ubuntu take 16 bytes?

I've recently been (relearning) lower level CS material and I've been exploring buffer overflows. I created a basic C program that has an 8-byte array char buffer[8];. I then used GDB to explore and disassemble the program and step through its execution. I'm on a 64-bit version of Ubuntu, and I noticed that my 8-byte char array is actually represented in 16 bytes in memory - the high order bits all just being 0.
E.g. Instead of 0xDEADBEEF 0x12345678 as I might expect to represent the 8 byte array, it's actually something like 0x00000000 0xDEADBEEF 0x00000000 0x12345678.
I did some googling and was able to get GCC to compile my program as a 32-bit program (using -m32 flag) - which resulted in the expected 8 bytes as normal.
I'm just looking for an unambiguous explanation as to why the 8-byte character array is represented in 16 bytes on a 64-bit system. Is it because the minimum word size / addressable unit is 8 bytes (64 bits) and GDB is simply printing based on an 8-byte word size?
Hopefully this is clear, but let me know if clarification is needed.
64-bit systems are geared toward aligning memory to 16 byte boundaries (16 byte stack alignment is part of the System V ABI). For stack allocations, there are two parts to this: first, the stack itself needs to be aligned; second, any allocations then try to preserve that alignment.
This explains the first part, why the 8 byte array becomes 16 bytes on the stack. As to why it gets split into two 8 byte qwords, this is a little more difficult to tell, as you haven't provided any code (assembly or C) showing the use of this buffer. And trying to replicate this using mingw64 provides the 16 byte alignment, but not the funny layout you are seeing.
Of course, the other possibility stemming from the lack of ASM is that GDB is displaying 2xQWORD's even though its in fact 2xDWORD's (in other words, try using p/x (char[8]) to dump the contents...).
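The rounding involved in keeping a 16 byte alignment is simple arithmetic. The sketch below shows only that arithmetic, with a made-up function name; how a particular compiler actually lays out a stack frame is up to the compiler.

```c
#include <stddef.h>

/* Round size up to the next multiple of alignment.
   alignment must be a power of two. */
size_t round_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}
```

Under this arithmetic an 8 byte buffer occupies a 16 byte slot: round_up(8, 16) is 16.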
