What is overalignment of execution regions and input sections?

I came across code similar to the following today and I am curious as to what is actually happening:
#pragma pack(1)
__align(2) static unsigned char multi_array[7][24] = { 0 };
__align(2) static unsigned char another_multi_array[7][24] = { 0 };
#pragma pack()
When searching for a reference to the __align keyword in the Keil compiler, I came across this:
Overalignment of execution regions and input sections There are situations when you want to overalign code and data sections... If you have access to the original source code, you can do this at compile time with the __align(n) keyword...
I do not understand what is meant by "overaligning code and data sections". Can someone help to clarify how this overalignment occurs?

The compiler will naturally "align" data based on the needs of the system. For example, on a typical 32-bit system, a 32-bit integer should always be a single 4-byte word (as opposed to being partly in one word and partly on the next), so it will always start on a 4-byte-word boundary. (This mostly has to do with the instructions available on the processor. A system is very likely to have an instruction to load a single word from memory into a register, and much less likely to have a single instruction to load an arbitrary sequence of four adjacent bytes into a register.)
The compiler normally does this by introducing gaps in the data; for example, a struct with a char followed by a 32-bit int, on such a system, would require eight bytes: one byte for the char, three bytes of filler so the int is aligned right, and four bytes for the int itself.
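As a rough illustration of the padding described above, the following C11 sketch reports where the compiler places each member. The names `sample`, `int_member_offset` and `print_sample_layout` are made up for this example, and the figures in the comments assume a typical ABI where int is 4 bytes with 4-byte alignment; other targets may differ.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>

/* Hypothetical struct for illustration: a char followed by an int. */
struct sample {
    char c;
    int  i;
};

/* The C standard guarantees that the first member sits at offset 0. */
static_assert(offsetof(struct sample, c) == 0, "first member at offset 0");

/* Offset of the int member; commonly 4 when int is 4-byte aligned. */
size_t int_member_offset(void)
{
    return offsetof(struct sample, i);
}

/* Prints the layout so the filler bytes can be seen on the current target. */
void print_sample_layout(void)
{
    printf("offset of c = %zu\n", offsetof(struct sample, c));
    printf("offset of i = %zu\n", offsetof(struct sample, i));
    printf("sizeof      = %zu\n", sizeof(struct sample));
}
```

Calling print_sample_layout() from main shows the gap between c and i directly, which is usually more reliable than working the layout out by hand.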
To "overalign" the data is to request greater alignment than the compiler would naturally provide. For example, you might request that a 32-bit integer start on an 8-byte boundary, even on a system that uses 4-byte words. (One major reason to do this would be if you're aiming for byte-level interoperability with a system that uses 8-byte words: if you pass structs from one system to the other, you want the same gaps in both systems.)
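A minimal sketch of the same idea in standard C11, using alignas from <stdalign.h> rather than Keil's __align keyword. The variable and function names are invented for illustration; the point is only that the requested alignment is larger than what int needs on its own.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stdint.h>    /* uintptr_t */

/* Hypothetical variable: an int overaligned to an 8-byte boundary. */
static alignas(8) int overaligned_value = 42;

/* Overalignment may only raise the requirement, never lower it. */
static_assert(alignof(int) <= 8, "8 is at least the natural alignment of int");

/* Returns 1 if the variable's address is a multiple of 8. */
int overaligned_value_is_8_aligned(void)
{
    return ((uintptr_t)&overaligned_value % 8) == 0;
}
```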

Overalignment is when the data is aligned to more than its default alignment. For example, a 4-byte int usually has a default alignment of 4 bytes (meaning its address will be divisible by 4).
The default alignment of a datatype is quite often (but not always) the size of the datatype.
Overalignment allows you to increase this alignment to something greater than the default.
As for why you would want to do this:
One reason for this is to be able to access the data with a larger datatype (that has a larger alignment).
For example:
char buffer[16];
int *ptr = (int*)&buffer;
ptr[0] = 1;
ptr[1] = 2;
By default, buffer is only guaranteed to be aligned to 1 byte. However, int typically requires 4-byte alignment. If buffer isn't aligned to 4 bytes, you may get a misalignment exception. (AFAIK, some ARM cores don't allow misaligned memory access... x86/64 usually does, but with a performance penalty.)
__align() will let you force the alignment higher to make it work:
__align(4) char buffer[16];
A similar situation appears when using SIMD instructions. You will be accessing a smaller datatype through a larger SIMD datatype, which will likely require a larger alignment.
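For compilers without the Keil-specific __align keyword, roughly the same effect can be sketched in standard C11 with alignas. The helper names below are hypothetical. The memcpy-based load and store are shown as an alternative that works at any alignment and also sidesteps the aliasing concerns of casting a char buffer to int *.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stdint.h>    /* uintptr_t */
#include <string.h>    /* memcpy */

/* Byte buffer overaligned so that it is suitably aligned for int. */
static alignas(int) unsigned char buffer[16];

static_assert(sizeof buffer >= 2 * sizeof(int), "room for two ints");

/* Returns 1 if buffer starts on an int-aligned address. */
int buffer_is_int_aligned(void)
{
    return ((uintptr_t)buffer % alignof(int)) == 0;
}

/* Alignment-agnostic alternative: copy the bytes instead of casting. */
void store_int(unsigned char *dst, int value)
{
    memcpy(dst, &value, sizeof value);
}

/* Reads an int back from any byte address, aligned or not. */
int load_int(const unsigned char *src)
{
    int value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

For example, store_int(buffer + 1, 7) followed by load_int(buffer + 1) round-trips the value even though buffer + 1 is deliberately misaligned for int.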

By overalign, Keil means nothing more complex than aligning an object to a larger alignment boundary than the data type requires.
See the documentation for __align: "You can only overalign. That is, you can make a two-byte object four-byte aligned but you cannot align a four-byte object at 2 bytes."
In the case of the linker, you can force an extra alignment onto sections within other binary modules using the ALIGNALL or OVERALIGN directives. This may be useful for performance reasons, but isn't a common scenario.

Related

what is aligned attribute and what are the uses of it

I have following lines in the code
# define __align_(x) __attribute__((aligned(x)))
I can use it as int i __align_; what difference does it make?
If I use the aligned attribute as above, versus just creating my variable as int i;, does it change how the variable gets created in memory?
I can use it as int i __align_; what difference does it make?
This will not work because the macro is defined to have a parameter, __align_(x). When it is used without a parameter, it will not be replaced, and the compiler will report a syntax error. Also, identifiers starting with __ are reserved for the C implementation (for the use of the compiler, the standard library, and any other parts forming the C implementation), so a regular program should not use such a name.
When you use the macro correctly, it changes the normal alignment requirement for the type.
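A sketch of correct usage, assuming GCC or Clang (the aligned attribute is a compiler extension, not standard C). ALIGNED, counter and the two functions are hypothetical names chosen to avoid the reserved __ prefix.

```c
#include <assert.h>  /* assert */
#include <stdint.h>  /* uintptr_t */

/* ALIGNED is a hypothetical name that avoids the reserved "__" prefix. */
#define ALIGNED(x) __attribute__((aligned(x)))

/* The macro takes an argument, so it must be written as ALIGNED(16). */
static int counter ALIGNED(16);

/* Returns 1 if counter was placed on a 16-byte boundary. */
int counter_is_16_aligned(void)
{
    return ((uintptr_t)&counter % 16) == 0;
}

/* Increments counter after checking the placement the attribute asked for. */
int bump_counter(void)
{
    assert(counter_is_16_aligned());
    return ++counter;
}
```

A plain int counter; would normally only be guaranteed the alignment of int; the attribute raises that requirement for this one object.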
Generally, objects of various types have alignment requirements: They should be located in memory at addresses that are multiples of their requirement. The reason for this is that computer hardware is usually designed to work with groups of bytes, so it may fetch data from memory in groups of, for example, four bytes: Bytes from 0 to 3, bytes from 4 to 7, bytes from 8 to 11, and so on.
If a four-byte object with four-byte alignment requirement is located at a multiple of four bytes, then it can be read from memory easily, by loading the group of bytes it is in. It can also be written to memory easily.
If the object were not at a multiple of four bytes, it could not be loaded as one group of bytes. It can be loaded by loading the two groups of bytes it straddles, extracting the desired bytes, and combining the desired bytes in one processor register. However, that takes more work, so we want to avoid it. The compiler is written to automatically align things as desired for the C implementation, and it writes load and store instructions that expect the desired alignment. (See footnote 1.)
Different object types can have different alignment requirements even though they are bound by the same hardware behavior. For example, with a two-byte short, the alignment requirement may be two bytes. This is because, whether it starts at byte 0 or byte 2 within a group (say at address 100, 102, 104, or 106), we can load the short by loading a single group of four bytes and taking just the two bytes we want. However, if it started at byte 3 (say at address 103), we would have to load two groups of bytes (100 to 103 and 104 to 107) to get the bytes we needed for the short (103 and 104). So two-byte alignment suffices for this short even though the hardware is designed with four-byte groups.
As mentioned, the compiler handles alignment automatically. When you define a structure with multiple members of different types, the compiler inserts padding so that each member is aligned correctly, and it inserts padding at the end of the structure so that an array of them keeps the alignment from element to element in the array.
There are times when we want to override the compiler’s automatic behavior. When we are preparing to send data over a network connection, the communication protocol might require the different fields of a message to be packed together in consecutive bytes, with no padding. In this case, we can define a structure with an alignment requirement of 1 byte for it and all its members. When we are ready to send a message, we could copy data into this structure’s members and then write the structure to the network device.
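The packed-message idea can be sketched as follows, assuming GCC or Clang, where __attribute__((packed)) gives a structure and its members 1-byte alignment. The message layout (wire_msg, with a 1-byte type and a 4-byte length) is invented for illustration, and byte order is deliberately left out of the sketch.

```c
#include <assert.h>  /* static_assert */
#include <stdint.h>  /* uint8_t, uint32_t */

/* Hypothetical wire message: a 1-byte type followed by a 4-byte length. */
struct wire_msg {
    uint8_t  type;
    uint32_t length;
} __attribute__((packed));

/* With packing there is no gap, so the struct is exactly 5 bytes. */
static_assert(sizeof(struct wire_msg) == 5, "no padding expected");

/* Fills in a message; the compiler handles the unaligned length field. */
struct wire_msg make_msg(uint8_t type, uint32_t length)
{
    struct wire_msg m;
    m.type = type;
    m.length = length;
    return m;
}
```

Without the attribute, the same structure would commonly occupy 8 bytes, with 3 filler bytes after type.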
When you tell the compiler an object is not aligned normally, the compiler will generate instructions for that. Instead of the normal load or store instructions, it will use special unaligned load or store instructions if the computer architecture has them. If it does not, the compiler will use instructions to shift and store individual bytes or to shift and merge bytes and store them as aligned words, depending on what instructions are available in the computer architecture. This is generally inefficient; it will slow down your program. So it should not be used in normal programming. Decreasing the alignment requirements should be used only when there is a need for controlling the layout of data in memory.
Sometimes increasing the alignment requirements is used for performance. For example, an array of four-byte float elements generally only needs four-byte alignment. However, some computers have special instructions to process four float elements (16 bytes) at a time, and they benefit from having that data aligned to a multiple of 16 bytes. (And some computers have instructions for even more data at one time.) In this case, we might increase the alignment requirement for our float array (but not its individual elements) so that it is aligned to be good with these instructions.
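A small C11 sketch of that distinction, assuming 4-byte float: the array as a whole is aligned to 16 bytes, while its individual elements keep the ordinary alignment of float. The names samples and group_is_16_aligned are hypothetical.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas */
#include <stddef.h>    /* size_t */
#include <stdint.h>    /* uintptr_t */

/* The array as a whole starts on a 16-byte boundary; each element
   still only has the alignment of float. */
static alignas(16) float samples[8];

static_assert(sizeof samples % 16 == 0, "array covers whole 16-byte groups");

/* Returns 1 if the element at index i starts on a 16-byte boundary. */
int group_is_16_aligned(size_t i)
{
    return ((uintptr_t)&samples[i] % 16) == 0;
}
```

Under the 4-byte float assumption, every fourth element (index 0, 4, ...) begins a 16-byte group, while the elements in between do not.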
Footnote
1 What happens if you force an object to be located at an undesired alignment without telling the compiler varies. In some computers, when a load instruction is executed with an unaligned address, the processor will “trap,” meaning it stops normal program execution and transfers control to the operating system, reporting an error in your program. In some computers, the processor will ignore the low bits of the address and load the wrong data. In some computers, the processor will load the two groups of bytes, extract the desired bytes, and merge them. On computers that trap, the operating system might do the manual fix-up of loading the bytes, or it might terminate your program or report the error to your program.
The attribute tells the compiler that the variable in question must be placed in memory at an address that is aligned to a certain number of bytes (addr % alignment == 0).
This is important because the CPU can only work on some integer values if they are aligned: for example, int32 must be 4-byte aligned and int64 must be 8-byte aligned, and pointers need to be 4- or 8-byte aligned too (on 32- and 64-bit CPUs respectively).
The attribute is mostly used for structures, where certain fields within the structure must be memory aligned in order to allow the CPU to do integer operations on them (like mov.l) without hitting a BUS ERROR from the memory controller.
If structures aren't properly aligned, the compiler will have to add extra instructions to first load the unaligned value into a register with several memory operations which is more expensive in performance.
It can also be used to bump performance in more performance sensitive systems by creating buffers that are page aligned (4k usually) so that paging will have less of an impact, or if you want to create DMA-able buffer zones - but that's a bit more advanced...

Byte Alignment in Files

I've been looking at file formats and information on byte alignment in files is hard to come by. I can find information on memory byte alignment ("Data Structure Alignment"), but that's a different matter.
In setting up a standard format, is there an optimal way to align bytes in a file that is good or even necessary for various systems? This is not for one data type, but for many. Is 2-byte alignment sufficient, or is it really even necessary? What about 4-byte alignment? How well will a 32-bit or 64-bit system handle this?
When working with binary data, very often you'll just write memory directly to the file. In that case, data in the file is aligned exactly as it is in memory. This has the advantage of not requiring any intermediate steps when reading the information back into your memory data structures. It does use a bit more disk space than absolutely required if you were to eliminate the alignment, but typically not a lot of space.
You have to be careful, though, if you'll be reading that data from other programs. They have to be written to take the padding bytes into account. For example if you have this structure:
struct foo
{
int a;
char b;
int c;
};
And you tell it to align on 32-bit boundaries, your memory (and therefore disk) layout will be:
4 bytes - a
1 byte - b
3 bytes - padding
4 bytes - c
If the other program isn't written to take that into account and instead assumes byte alignment, it'll try to read c from the four bytes immediately following b. The result, as you can imagine, wouldn't be good.
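One way to avoid depending on the in-memory padding is to write each field separately, so the file layout is defined by the code rather than by the compiler. The sketch below is an assumption about how that might look, not something the answer prescribes; foo_serialize, foo_deserialize and FOO_PACKED_SIZE are hypothetical names, and byte order is still that of the host.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* size_t */
#include <string.h>  /* memcpy */

struct foo {
    int  a;
    char b;
    int  c;
};

/* Bytes needed when the fields are stored back to back, with no padding. */
enum { FOO_PACKED_SIZE = sizeof(int) + sizeof(char) + sizeof(int) };

static_assert(sizeof(struct foo) >= FOO_PACKED_SIZE, "padding only adds bytes");

/* Copies each field individually; returns the number of bytes written. */
size_t foo_serialize(const struct foo *f, unsigned char *out)
{
    size_t pos = 0;
    memcpy(out + pos, &f->a, sizeof f->a); pos += sizeof f->a;
    memcpy(out + pos, &f->b, sizeof f->b); pos += sizeof f->b;
    memcpy(out + pos, &f->c, sizeof f->c); pos += sizeof f->c;
    return pos;
}

/* Reads the fields back in the same order they were written. */
void foo_deserialize(struct foo *f, const unsigned char *in)
{
    size_t pos = 0;
    memcpy(&f->a, in + pos, sizeof f->a); pos += sizeof f->a;
    memcpy(&f->b, in + pos, sizeof f->b); pos += sizeof f->b;
    memcpy(&f->c, in + pos, sizeof f->c);
}
```

The trade-off is the one described above: this needs an extra copy step on every read and write, in exchange for a file layout that does not change with compiler alignment settings.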
When I'm working with binary data, I usually just write the data to the file, ignoring the typically small amount of "waste" that's due to data alignment.

structure padding - what is the purpose of natural alignment? [duplicate]

I was learning about structure padding and data alignment. I came across the point that all the elements of a structure in memory should be naturally aligned. So, for example, if I have the following structure declared:
struct align{
char c;
double d;
int s;
};
If I take a 32-bit architecture, it fetches 4 bytes at a time. So, keeping this point in mind, if I start padding I will get (my assumption):
1byte(char) + 3bytes(padding) + 8bytes(double) + 4bytes(int) ---------> 1
all these shall be fetched with minimum machine cycles.
But originally the following is happening:
1byte(char) + 7bytes(padding) + 8bytes(double) + 4bytes(int) ----------> 2
Why do we need this natural alignment for double when we could save 4 bytes by going with method 1 (while fetching each element in the same number of machine cycles in both cases)?
Natural alignment refers to the size of the variable, not the size of the processor register and/or data path. A floating point double is 8 bytes, and so its natural alignment is 8 bytes. To be more precise, the natural alignment is the smallest power of 2 that is large enough to hold the variable. That definition covers the case of "long double", or x86 extended precision, which is a 10-byte variable whose natural alignment is 16 bytes. For x86 processors, see the optimization manual and search for alignment; you will find this is a subject rich in detail, and specifics vary by micro-architecture, even within the same processor family. In particular, section 3.6.4 Alignment says:
For best performance, align data as follows:
Align 8-bit data at any address.
Align 16-bit data to be contained within an aligned 4-byte word.
Align 32-bit data so that its base address is a multiple of four.
Align 64-bit data so that its base address is a multiple of eight.
Align 80-bit data so that its base address is a multiple of sixteen.
Align 128-bit data so that its base address is a multiple of sixteen.
The Pentium 4 is a 32-bit processor, part of the IA-32 family, yet it has a 64-bit data path (Front Side Bus). There are 32-bit processors that have only 16-bit buses, see 32-bit computing historical perspective. Accessing a variable at an alignment other than its natural alignment may result in a performance penalty, or an alignment fault, depending on the processor, in some cases the setting of a control bit, the type of variable, the instruction used, etc.
The actual alignment is up to the compiler and the calling conventions. For structures, the requirement is that the first member must be at offset 0 (zero) and members must be allocated in the order they are declared; padding may be inserted between members for alignment, and after the last member to pad the size of the structure. In 32-bit Windows the stack is only required to be 4-byte aligned, so the compiler would have to generate extra code to ensure 8-byte alignment of a double allocated on the stack.
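Rather than predicting the layout, it can be measured. The sketch below applies offsetof to the struct from the question; padding_before_d and print_align_layout are hypothetical helpers. The figures in the comments are what x86-64 compilers commonly produce and are not guaranteed elsewhere.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>

struct align {
    char   c;
    double d;
    int    s;
};

/* Guaranteed by the C standard: the first member sits at offset 0. */
static_assert(offsetof(struct align, c) == 0, "first member at offset 0");

/* Filler bytes inserted between c and d on the current target
   (commonly 7 on x86-64). */
size_t padding_before_d(void)
{
    return offsetof(struct align, d) - sizeof(char);
}

/* Prints the member offsets and total size for the current target. */
void print_align_layout(void)
{
    printf("c at %zu, d at %zu, s at %zu, sizeof = %zu\n",
           offsetof(struct align, c), offsetof(struct align, d),
           offsetof(struct align, s), sizeof(struct align));
}
```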
In Agner Fog's Calling Conventions document you will find details on the alignment used in different operating systems and by different compilers. The stack has a 4-byte alignment in 32-bit Windows, which explains why you may have observed a floating point double aligned at a 4-byte but not 8-byte boundary when allocated on the stack: the compiler doesn't have a clue, when a function gets called, whether the stack will be 8-byte aligned or not. Table 2 of that document shows the alignment of various data types allocated in static storage as implemented by various compilers; you will notice that in 32-bit Windows the only compiler that allows 4-byte alignment for double is the Borland compiler.
When allocating in a structure according to that document the Borland compiler allows double to be at any byte offset (which I find surprising).
Here's the text description in the document, copied here for reference
Table 3 shows the alignment in bytes of data members of structures
and classes. The compiler will insert unused bytes, as required,
between members to obtain this alignment. The compiler will also
insert unused bytes at the end of the structure so that the total size
of the structure is a multiple of the alignment of the element that
requires the highest alignment. Many compilers have options to change
the default alignments. Differences in structure member alignment will
cause incompatibility between different programs or modules accessing
the same data and when data are stored in binary files. The programmer
can avoid such compatibility problems by ordering the structure
members so that no unused bytes need to be inserted. Likewise, the
padding at the end of the structure may be specified explicitly by
inserting dummy members of the required size. The size of the virtual
table pointer, if any, must be taken into account (see chapter 11).
5 Stack alignment
The stack pointer must be aligned by the stack word
size at all times. Some systems require a higher alignment. The Gnu
compiler version 3.x and later for 32-bit Linux and Mac OS X makes the
stack pointer aligned by 16 at every function call instruction.
Consequently it can rely on ESP = 12 modulo 16 at every function
entry. This alignment is not consistently implemented. It is
specified in the Mac OS ABI, but nowhere else. The stack is not
aligned when compiling with option -Os or
-mpreferred-stack-boundary=2, but apparently the Gnu compiler erroneously relies on the stack being aligned by 16 despite these
options. The Intel compiler (v. 9.1.038) for 32 bit Linux does not
have the same alignment. (I have submitted bug reports to Gnu and
Intel about this in 2006. In 2009 Intel added a -falign-stack=
assume-16-byte option to ICC version 11.0 to fix the problem). The
stack is aligned by 4 in 32-bit Windows. The 64 bit systems keep the
stack aligned by 16. The stack word size is 8 bytes, but the stack
must be aligned by 16 before any call instruction. Consequently, the
value of the stack pointer is always 8 modulo 16 at the entry of a
procedure. A procedure must subtract an odd multiple of 8 from the
stack pointer before any call instruction. A procedure can rely on
these rules when storing XMM data that require 16-byte alignment. This
applies to all 64 bit systems (Windows, Linux, BSD). Where at least
one function parameter of type __m256 is transferred on the stack,
Unix systems (32 and 64 bit) align the parameter by 32 and the called
function can rely on the stack being aligned by 32 before the call
(i.e. the stack pointer is 32 minus the word size modulo 32 at the
function entry). This does not apply if the parameter is transferred
in a register. Various methods for aligning the stack are described
in Intel's application note AP 589 "Software Conventions for
Streaming SIMD Extensions", "Data Alignment and Programming Issues
for the Streaming SIMD Extensions with the Intel® C/C++ Compiler", and
"IA-32 Intel ® Architecture Optimization Reference Manual".
Your comment is valid, and you'll probably get the result you are looking for if, instead of using a struct, you simply lay down the variables as part of the local stack inside a function. Something along these lines :
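#include <stdio.h>

void alignTest(void)
{
    char c;
    double d;
    int s;
    /* %p expects void *; casting an address to int may truncate it */
    printf("%p %p %p\n", (void *)&c, (void *)&d, (void *)&s);
}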
In this example, the compiler is free to make its optimal choices, performance and memory wise. Heck, it can even re-order variables if it wishes. On this setup, I've already witnessed double on 4-byte boundaries (not 8) using 32-bit compilers.
On the other hand, using a struct, you need to keep in mind that it is part of an interface contract. It's not just a matter of the compiler selecting whatever choice it feels better : if part of an API, this struct will be used by other programs, potentially using another compiler, or another version of the same compiler. It happens all the time : think DLL, wrapper from other languages (calling a C function from a Delphi or Python program) etc.
You can't have an interface element in a "random state", with different choices depending on compiler. In this case, the allocation rules regarding variables inside a struct are set in stone by the specification.
In this specification, variable order is always respected, and double are aligned on 8 bytes.

Data structure padding and memory allocation

According to Wikipedia, a structure containing a single byte and a four-byte integer, in this order, would require three additional bytes of padding because the four-byte integer has to be 4 bytes aligned.
A structure containing a four-byte integer and a single byte, in this order, would require no additional padding bytes because one byte will be 1-byte aligned?
The size of the first structure will be 8 but the size of the second structure will be 5?
What about another four-byte integer allocated in memory after the second structure above? Will it be allocated after a gap of 3 bytes so that it respect the 4 bytes alignment?
[update from comment:]
I forgot to mention my example is on a 32 bit system.
[UPDATE]
I just found out that pack instructions added at the beginning and end of a structure only apply to the members of that structure and do not propagate to other structures. This means if you have a structure of structures, you have to pack them individually, not just the parent structure.
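The non-propagation described in the update can be checked with a sketch like the one below, assuming a compiler that supports #pragma pack(push, 1) and #pragma pack(pop) (GCC, Clang and MSVC do). The struct and function names are hypothetical.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */

/* Declared without packing: keeps its normal internal padding. */
struct inner {
    char tag;
    int  value;
};

#pragma pack(push, 1)
/* Only the members of outer are packed; inner's own layout is unchanged. */
struct outer {
    char         flag;
    struct inner in;
};
#pragma pack(pop)

/* outer adds no padding of its own, so it is 1 byte larger than inner. */
static_assert(offsetof(struct outer, in) == 1, "in follows flag directly");
static_assert(sizeof(struct outer) == 1 + sizeof(struct inner),
              "inner keeps its padding inside outer");

/* Filler bytes that remain inside inner despite outer being packed. */
size_t inner_padding(void)
{
    return sizeof(struct inner) - (sizeof(char) + sizeof(int));
}
```

To remove the remaining filler bytes, struct inner would need its own pack pragma around its declaration.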
Maybe, maybe not. You might be on an architecture that likes padding to 8-byte boundaries.
Possibly. Never assume the same, predictable binary representation of a C structure across compilers. Or even across different options in the same compiler.
Maybe. In the example architecture, probably. But the gap may in fact be larger if the compiler's libraries tend to allocate bigger chunks.
A missing consideration in data alignment and packing is that there are at least 2 aspects of data alignment.
Performance: Certain alignments of types, like a 4-byte int often perform faster with an alignment on a matching (quad) address boundary. This is often a compiler default. Sometimes other lower performing alignments are possible. Compiler specific pack options may use this less optimal speed layout to achieve less padding.
Required: Certain alignments of types are required; for example, a 2-byte integer may cause a bus fault at an odd address. Compiler specific pack options will not violate this. Packing may reduce padding, yet some padding may remain.
To answer OP's questions:
All are "maybe". It is compiler specific with consideration to its options and target hardware.
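Since the answers are all "maybe", the practical approach is to ask the compiler. The following sketch compares the two member orders from the question; the struct names and the helper are hypothetical. On many common targets both structures come out at 8 bytes, because the second one gets padding at the end so that arrays of it stay aligned, but that is an observation about typical ABIs, not a guarantee.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */

/* A single byte followed by an int, as in the first question. */
struct char_first {
    char c;
    int  i;
};

/* An int followed by a single byte, as in the second question. */
struct int_first {
    int  i;
    char c;
};

/* A struct's size is always a multiple of its alignment. */
static_assert(sizeof(struct int_first) % _Alignof(struct int_first) == 0,
              "size is a multiple of alignment");

/* Filler bytes after the last member of int_first (commonly 3). */
size_t tail_padding_int_first(void)
{
    return sizeof(struct int_first)
         - (offsetof(struct int_first, c) + sizeof(char));
}
```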

Is it unnecessary to store a double member of a structure at an address that is a multiple of 8?

Suppose that sizeof(int) and sizeof(double) are 4 and 8 respectively and that there is no preprocessor command such as #pragma pack before the following code or compiler options with the same function as #pragma pack used in the compiler command line
typedef struct
{
int n;
double d;
} T;
then how much is sizeof(T)?
I think that it depends on the width of the data bus between the CPU and RAM. If the width is 32 bits, sizeof(T) is 12. If the width is 64 bits, sizeof(T) is 16. On a computer with a 32-bit data bus, to transfer a 64-bit number from CPU to RAM or vice versa, CPU has to access the data bus twice, reading or writing 32 bits at a time, so there is no point in storing the member d of the structure T at an address that is a multiple of 8.
Do you agree?
(Sorry for my poor English)
then how much is sizeof(T)?
You are correct, this is highly dependent on the system, the compiler, and the optimization settings. Generally speaking, the compiler knows best, at least in theory, what alignment to pick for the 8-byte double member of the structure. Moreover, compiler's decision could be different when you ask it to optimize for a smaller memory footprint compared to when you ask it to optimize for the fastest speed.
Finally, there may be systems where reading eight bytes from addresses aligned at four-byte boundary but not at eight-byte boundary may carry no penalty at all. Again, your compiler is in the best position to know that fact, and avoid padding your struct unnecessarily.
The most important thing to remember about alignment is that you should not assume a particular layout of your struct, even if you do not intend to port your product to a different platform, because a change as simple as adding an optimization flag to the makefile may be sufficient to invalidate your assumptions.
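In that spirit, a short sketch that measures the layout of T instead of assuming it. The helper names are hypothetical. As far as I know, x86-64 compilers commonly report a size of 16, while the 32-bit System V ABI on x86 commonly gives 12, which is the kind of variation the answer warns about.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>

typedef struct
{
    int n;
    double d;
} T;

/* Guaranteed by the C standard: the first member sits at offset 0. */
static_assert(offsetof(T, n) == 0, "first member at offset 0");

/* Where the compiler actually placed d on the current target. */
size_t offset_of_d(void)
{
    return offsetof(T, d);
}

/* Prints the measured layout instead of relying on assumptions. */
void print_T_layout(void)
{
    printf("offsetof(T, d) = %zu, sizeof(T) = %zu\n",
           offsetof(T, d), sizeof(T));
}
```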
