Assemblers and word alignment - c

Today I learned that if you declare a char variable (which is 1 byte), the assembler actually uses 4 bytes in memory so that the boundaries lie on multiples of the word size.
If a char variable uses 4 bytes anyway, what is the point of declaring it as a char? Why not declare it as an int? Don't they use the same amount of memory?

When you are writing in assembly language and declare space for a character, the assembler allocates space for one character and no more. (I write in regard to common assemblers.) If you want to align objects in assembly language, you must include assembler directives for that purpose.
When you write in C, and the compiler translates it to assembly and/or machine code, space for a character may be padded. Typically this is not done because of alignment benefits for character objects but because you have several things declared in your program. For example, consider what happens when you declare:
char a;
char b;
int i;
char c;
double d;
A naïve compiler might do this:
Allocate one byte for a at the beginning of the relevant memory, which happens to be aligned to a multiple of, say, 16 bytes.
Allocate the next byte for b.
Then it wants to place the int i which needs four bytes. On this machine, int objects must be aligned to multiples of four bytes, or a program that attempts to access them will crash. So the compiler skips two bytes and then sets aside four bytes for i.
Allocate the next byte for c.
Skip seven bytes and then set aside eight bytes for d. This makes d aligned to a multiple of eight bytes, which is beneficial on this hypothetical machine.
So, even with a naïve compiler, a character object does not require four whole bytes to itself. It can share with neighbor character objects, or other objects that do not require greater alignment. But there will be some wasted space.
A smarter compiler will do this:
Sort the objects it has to allocate space for according to their alignment requirements.
Place the most restrictive object first: Set aside eight bytes for d.
Place the next most restrictive object: Set aside four bytes for i. Note that i is aligned to a multiple of four bytes because it follows d, which is an eight-byte object aligned to a multiple of eight bytes.
Place the least restrictive objects: Set aside one byte each for a, b, and c.
This sort of reordering avoids wasting space, and any decent compiler will use it for memory that it is free to arrange (such as automatic objects on stack or static objects in global memory).
When you declare members inside a struct, the compiler is required to use the order in which you declare the members, so it cannot perform this reordering to save space. In that case, declaring a mixture of character objects and other objects can waste space.

Q: Does a program allocate four bytes for every "char" you declare?
A: No - absolutely not ;)
Q: Is it possible that, if you allocate a single byte, the program might "pad" with extra bytes?
A: Yes - absolutely yes.
The issue is "alignment". Some computer architectures must access a data value at a particular offset: a multiple of 16 bits, 32 bits, etc. Other architectures merely perform better when every access falls on such an offset. Hence "padding":
http://en.wikipedia.org/wiki/Byte_padding#Data_structure_padding

There may indeed not be any point in declaring a single char variable.
There may however be many good reasons to want a char-array, where an int-array really wouldn't do the trick!
(Try padding a data structure with ints...)

Others have for the most part answered this. Assuming a char is a single byte, does declaring a char mean that it always pads to an alignment? No: some compilers do by default, some don't, and with many you can change the default using a command-line option or pragma. Does this mean you shouldn't use a char? It depends. First, the padding doesn't always happen, so the few wasted bytes don't always materialize. You are programming in a high-level language using a compiler, so if you think you have only 3 wasted bytes in your whole binary... think again. Depending on the architecture, using chars can bring some savings; for example, loading immediates saves you three bytes or more on some architectures. On other architectures, even simple operations require extra instructions to sign-extend or mask the larger register so that it behaves like a byte-sized register. If you are on a 32-bit computer and you are using an 8-bit char because you are only counting from 1 to 100, you might want to use a full-sized int; in the long run you are probably not saving anyone anything by using the char. Now if this is an 8086-based PC running DOS, that is a different story, and on an 8-bit microcontroller you want to lean toward the 8-bit variables as much as possible.

Related

Is it better for performance to have alignment padding at the end of a small struct instead of between 2 members?

We know that there is padding in some structures in C. Please consider the following 2:
struct node1 {
    int a;
    int b;
    char c;
};
struct node2 {
    int a;
    char c;
    int b;
};
Assuming sizeof(int) = alignof(int) = 4 bytes:
sizeof(node1) = sizeof(node2) = 12, due to padding.
What is the performance difference between the two? (if any, w.r.t. the compiler or the architecture of the system, especially with GCC)
These are bad examples - in this case it doesn't matter, since the amount of padding will be the same in either case. There will not be any performance differences.
The compiler will always strive to fill up trailing padding at the end of a struct; otherwise arrays of structs wouldn't be feasible, since the first member of each element must be aligned. Without trailing padding in some item struct_array[0], the first member of struct_array[1] would end up misaligned.
The order would matter if we were to do this though:
struct node3 {
    int a;
    char b;
    int c;
    char d;
};
Assuming 4 byte int and 4 byte alignment, then b occupies 1+3 bytes here, and d an additional 1+3 bytes. This could have been written better if the two char members were placed adjacently, in which case the total amount of padding would just have been 2 bytes.
I would not be surprised if the interviewer's opinion was based on the old argument of backward compatibility when extending the struct in the future. Additional fields (char, smallint) may benefit from the space occupied by the trailing padding, without the risk of affecting the memory offset of the existing fields.
In most cases, it's a moot point. The approach itself is likely to break compatibility, for two reasons:
Starting the extensions on a new alignment boundary (as would happen to node2) may not be memory-optimal, but it might well prevent the new fields from accidentally being overwritten by the padding of a 'legacy' struct.
When compatibility is that much of an issue (e.g. when persisting or transferring data), then it makes more sense to serialize/deserialize (even if binary is a requirement) than to depend on a binary format that varies per architecture, per compiler, even per compiler option.
OK, I might be completely off the mark here since this is a bit out of my league. If so, please correct me. But this is how I see it:
First of all, why do we need padding and alignment at all? It's just wasted bytes, isn't it? Well, it turns out that processors like it. That is, if you issue an instruction to the CPU that operates on a 32-bit integer, the CPU will demand that this integer reside at a memory address divisible by 4. A 64-bit integer will need to reside at an address divisible by 8. And so on. This is done to make the CPU design simpler and faster.
If you violate this requirement (aka "unaligned memory access"), most CPUs will raise an exception. x86 is actually an oddity because it will still perform the operation - but it will take more than twice as long because it will fetch the value from memory in two passes rather than one and then do bitwise magic to stick the value together from these separate accesses.
So this is the reason why compilers add padding to structs - so that all the members are properly aligned and the CPU can access them quickly (or at all). Well, that's assuming the struct itself is located at a properly aligned memory address; that too is taken care of, as long as you stick to the standard ways of allocating memory.
But it is possible to explicitly tell the compiler that you want a different alignment too. For example, if you want to use your struct to read in a bunch of data from a tightly packed file, you could explicitly set the packing to 1 (no padding at all). In that case the compiler will also have to emit extra instructions to compensate for the potential misalignment.
TL;DR - wrong alignment makes everything slower (or under certain conditions can crash your program entirely).
However this doesn't answer the question "where to better put the padding?" Padding is needed, yes, but where? Well, it doesn't make much difference directly, however by rearranging your members carefully you can reduce the size of the entire struct. And less memory used usually means a faster program. Especially if you create large arrays of these structs, using less memory will mean less memory accesses and more efficient use of CPU cache.
In your example however I don't think there's any difference.
P.S. Why does your struct end with padding? Because of arrays. The compiler wants to make sure that if you allocate an array of these structs, they will all be properly aligned, because array members don't have any padding between them.
What is the performance difference between the two?
The performance difference is "indeterminable". For most cases it won't make any difference.
For cases where it does make a difference; either version might be faster, depending on how the structure is used. For one example, if you have a large array of these structures and frequently select a structure in the array "randomly"; then if you only access a and b of the randomly selected structure the first version can be faster (because a and b are more likely to be in the same cache line), and if you only access a and c then the second version can be faster.

Computer Memory Allocation for Duplicate Inputs

I'm taking Introduction to CS (CS50, Harvard) and we're learning type declaration in C. When we declare a variable and assign a type, the computer's allocating a specific amount of bits/bytes (1 byte for char, 4 bytes for int, 8 bytes for doubles etc...).
For instance, if we declare the string "EMMA", we're using 5 bytes, 1 for each "char" and 1 extra for the \0 null byte.
Well, I was wondering why 2 M's are allocated separate bytes. Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?
Would love some education on the matter (without getting too deep, as I'm fairly new to the field).
Edit: Fixed some bits into bytes — my bad
1 bit for char, 4 bytes for int, 8 bytes for doubles etc...
These are general values but they depend on the architecture (per this answer, there are even still 9-bit per byte architectures being sold these days).
Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?
While this idea is certainly feasible in theory, in practice the overhead is way too big for simple data like characters: one character is usually a single byte.
If we were to set up a system in which we allocate memory for the character value and only refer to it from the string, the string would be made of a series of elements which would be used to store which character should be there: in C this would be a pointer (you will encounter them at some point in your course) and is usually either 4 or 8 bytes long (32 or 64 bits). Assuming you use a 32-bit pointer, you would use 24 bytes of memory to store the string in this complex manner instead of 5 bytes using the simpler method (to expand on this answer, you would need even more metadata to be able to properly modify the string during your program's execution).
Your idea of storing a chunk of data and referring to it multiple times does however exist in several cases:
virtual memory (you will encounter this if you go towards OS development), where copy-on-write is used
higher level languages (like C++)
filesystems which implement a copy-on-write feature, like BTRFS
some backup systems (like borg or rsync) which deduplicate the files/chunks they store
Facebook's zstandard compression algorithm, where a dictionary of small common chunks of data is used to improve compression ratio and speed
In such settings, where lots of data are stored, the relative size of the information required to store the data once and refer to it multiple times while improving copy time is worth the added complexity.
For instance if we declare the string "EMMA", we're using 5 bits
I am sure you are speaking about 5 bytes instead of 5 bits.
Well, I was wondering why 2 M's are allocated separate bits. Can't the
computer make use of the chars or integers currently taking space in
the memory and refer to that specific slot when it wants to reuse it?
A pointer to a "slot" usually occupies 4 or 8 bytes, so it makes no sense to spend 8 bytes to point to an object that occupies only one byte.
Moreover, "EMMA" is a character array that consists of adjacent bytes, so all elements of the array have the same type and, correspondingly, the same size.
The compiler can reduce memory usage by avoiding duplicated string literals. For example, it can store identical string literals as a single literal. This depends on a compiler option.
So if in the program the same string literal occurs for example two times as in these statements
char *s = malloc( sizeof( "EMMA" ) );
strcpy( s, "EMMA" );
then the compiler can store only one copy of the string literal.
The compiler is not supposed to be clever about your program's data; it does the minimal, general thing, so that programs remain easy for programmers to understand and manipulate.
As a programmer you can make your program store data in the suggested way, but it won't be general.
E.g. I am making a database for my school and I entered a wrong name, and now I want to change the 2nd 'm' in "EMMA"; this would be troublesome if the system worked as you suggested, because that 'm' could be shared with other strings.
would love to clarify further if needed. :)

Statement about least significant bits in the next field of a linked list in C

I found the following statement about the least significant bits in the next field of a linked list in C:
"In C, the next field is a pointer. For performance reason related to memory subsystem on a processor, memory is allocated on word boundaries, and (at least) two least significant bits in the next pointers are 0."
Is this true? I can't understand why so if so. Please help.
Many processor architectures are designed so that operations are supposed to be performed on word-aligned addresses. For example, some 32-bit processors are designed so that any word operation must be done at addresses that are multiples of 4 bytes (32 bits), such as addresses 0, 4, 8, 12, 16, 20, etc. Similarly, some 64-bit processors only allow word operations to be done at addresses that are multiples of 8 bytes. This has various advantages in hardware, such as being able to more easily detect if two different instructions refer to the same word in memory, which makes the processor faster. In some processors, you'll get a bus error if you try to do a nonaligned read, while in others it's legal to do so but the performance will be significantly degraded.
Because of this, most memory allocation libraries are designed so that they align all allocations at word boundaries. This means that on a 32-bit system, the two low-order bits of the address will be 0 (because the number is a multiple of four) and on a 64-bit system the three low-order bits of the address will be 0. Many data structures compress their representations by using these low-order bits to store extra information. For example, some implementations of red/black trees will place the bit that stores whether a node is red or black into the low order bits of one of the pointers, and some AVL trees (which need to store two bits of information) will pack those bits into the low-order bits of these pointers. Some garbage collection algorithms use similar techniques to store mark bits.
EDIT: In C, some compilers support a uintptr_t type (from <stdint.h>) that represents an integer large enough to hold a pointer. You can cast the pointer to a uintptr_t, use standard bitwise operators on it to set or clear the low bits, then cast back to a pointer to store the result. C++ provides the same type in <cstdint>, and the technique is commonly used there as well, although the exact pointer-to-integer mapping is implementation-defined in both languages.
Hope this helps!
This is because when you request a block of memory, it is delivered to you aligned with the word size of the architecture; in other words, it starts at a memory address that is a multiple of the word size. If it is a multiple of the word size, it is an even number, ruling out the least significant bit being set. On a 32-bit machine the word size is 4 bytes, and 4 in binary is 100, hence the second least significant bit is off as well.
Here is an example of what I mean by alignment with the word. Consider the following structure (assuming 32-bit):
struct sample {
    char a;
    int b;
    char c;
    char d;
};
... requires 12 bytes, not 7 (due to data structure alignment): a at offset 0, three padding bytes, b at offset 4, c and d at offsets 8 and 9, plus two trailing padding bytes so the size is a multiple of 4.
Remark: Compilers are not strictly required to do this, but most adhere to it, or at least provide an option to.

Naturally aligned memory address

I need to extract a memory address from within an existing 64-bit value, and this address points to a 4K array, the starting value is:
0x000000030c486000
The address I need is stored within bits 51:12, so I extract those bits using:
address = (start >> 12) & 0xFFFFFFFFFF
This leaves me with the address of:
0x000000000030c486
However, the documentation I'm reading states that the array stored at the address is 4KB in size, and naturally aligned.
I'm a little bit confused over what naturally aligned actually means. I know with page aligned stuff the address normally ends with '000' (although I could be wrong on that).
I'm assuming that as the address taken from the starting value is only 40 bits long, I need to perform an additional bitshifting operation to arrange the bits so that they can be correctly interpreted any further.
If anyone could offer some advice on doing this, I'd appreciate it.
Thanks
Normally, "naturally aligned" means that any item is aligned to at least a multiple of its own size. For example, a 4-byte object is aligned to an address that's a multiple of 4, an 8-byte object is aligned to an address that's a multiple of 8, etc.
For an array, you don't normally look at the size of the whole array, but at the size of an element of the array.
Likewise, for a struct or union, you normally look at the size of the largest element.
Natural alignment requires that every N-byte access be aligned on a memory address boundary of N. We can express this in terms of the modulus operator: addr % N must be zero. For example:
Accessing 4 bytes of memory from address 0x10004 is aligned (0x10004 % 4 = 0).
Accessing 4 bytes of memory from address 0x10005 is unaligned (0x10005 % 4 = 1).
From a hardware perspective, memory is typically divided into chunks of some size, such that any or all of the data within a chunk can be read or written in a single operation, but any single operation can only affect data within a single chunk.
A typical 80386-era system would have memory grouped into four-byte chunks. Accessing a two-byte or four-byte value which fit entirely within a single chunk would require one operation. If the value was stored partially in one chunk and partially in another, two operations would be required.
Over the years, chunk sizes have gotten larger than data sizes, to the point that most randomly-placed 32-bit values would fit entirely within a chunk, but a second issue may arise with some processors: if a chunk is e.g. 512 bits (64 bytes) and a 32-bit word is known to be aligned at a multiple of four bytes (32 bits), fetching each bit of the word can come from any of 16 places. If the word weren't known to be aligned, each bit could come from any of 61 places for the cases where the word fits entirely within the chunk. The circuitry to quickly select from among 61 choices is more complex than circuitry to select among 16, and most code will use aligned data, so even in cases where an unaligned word would fit within a single accessible chunk, hardware might still need a little extra time to extract it.
A “naturally aligned” address is one that is a multiple of some value that is preferred for the data type on the processor. For most elementary data types on most common processors, the preferred alignment is the same as the size of the data: Four-byte integers should be aligned on multiples of four bytes, eight-byte floating-point should be aligned on multiples of eight bytes, and so on. Some platforms require alignment, some merely prefer it. Some types have alignment requirements different from their sizes. For example, a 12-byte long float may require four-byte alignment. Specific values depend on your target platform. “Naturally aligned” is not a formal term, so some people might define it only as preferred alignment that is a multiple of the data size, while others might allow it to be used for other alignments that are preferred on the processor.
Taking bits out of a 64-bit value suggests the address has been transformed in some way. For example, key bits from the address have been stored in a page table entry. Reconstructing the original address might or might not be as simple as extracting the bits and shifting them all the way to the “right” (low end). However, it is also common for bits such as this to be shifted to a different position (with zeroes left in the low bits). You should check the documentation carefully.
Note that a 4 KiB array, 4096 bytes, corresponds to 2^12 bytes. The coincidence of 12 with the 51:12 field in the 64-bit value suggests that the address might be obtained simply by extracting those 40 bits without shifting them at all.

What is overalignment of execution regions and input sections?

I came across code similar to the following today and I am curious as to what is actually happening:
#pragma pack(1)
__align(2) static unsigned char multi_array[7][24] = { 0 };
__align(2) static unsigned char another_multi_array[7][24] = { 0 };
#pragma pack()
When searching for a reference to the __align keyword in the Keil compiler, I came across this:
Overalignment of execution regions and input sections There are situations when you want to overalign code and data sections... If you have access to the original source code, you can do this at compile time with the __align(n) keyword...
I do not understand what is meant by "overaligning code and data sections". Can someone help to clarify how this overalignment occurrs?
The compiler will naturally "align" data based on the needs of the system. For example, on a typical 32-bit system, a 32-bit integer should always be a single 4-byte word (as opposed to being partly in one word and partly on the next), so it will always start on a 4-byte-word boundary. (This mostly has to do with the instructions available on the processor. A system is very likely to have an instruction to load a single word from memory into a register, and much less likely to have a single instruction to load an arbitrary sequence of four adjacent bytes into a register.)
The compiler normally does this by introducing gaps in the data; for example, a struct with a char followed by a 32-bit int, on such a system, would require eight bytes: one byte for the char, three bytes of filler so the int is aligned right, and four bytes for the int itself.
To "overalign" the data is to request greater alignment than the compiler would naturally provide. For example, you might request that a 32-bit integer start on an 8-byte boundary, even on a system that uses 4-byte words. (One major reason to do this would be if you're aiming for byte-level interoperability with a system that uses 8-byte words: if you pass structs from one system to the other, you want the same gaps in both systems.)
Overalignment is when the data is aligned to more than its default alignment. For example, a 4-byte int usually has a default alignment of 4 bytes. (meaning the address will be divisible by 4)
The default alignment of a datatype is quite-often (but not always) the size of the datatype.
Overalignment allows you to increase this alignment to something greater than the default.
As for why you would want to do this:
One reason for this is to be able to access the data with a larger datatype (one that has a larger alignment).
For example:
char buffer[16];
int *ptr = (int *)buffer;
ptr[0] = 1;
ptr[1] = 2;
By default, buffer is only guaranteed 1-byte alignment. However, int requires 4-byte alignment. If buffer isn't aligned to 4 bytes, you can get a misalignment exception. (AFAIK, many ARM cores don't allow misaligned memory access; x86/64 usually does, but with a performance penalty.)
__align() will let you force the alignment higher to make it work:
__align(4) char buffer[16];
A similar situation appears when using SIMD instructions: you access a smaller datatype through a larger SIMD datatype, which will likely require a larger alignment.
By overalign, Keil means nothing more complex than aligning an object to a larger alignment boundary than the data type requires.
See the documentation for __align: "You can only overalign. That is, you can make a two-byte object four-byte aligned but you cannot align a four-byte object at 2 bytes."
In the case of the linker, you can force an extra alignment onto sections within other binary modules using the ALIGNALL or OVERALIGN directives. This may be useful for performance reasons, but isn't a common scenario.
