Alignment on a 64-bit machine is not 8 bytes - c

I am trying to find out the alignment on my 64-bit machine (Win10 on an Intel Core i7). I thought of this experiment:
#include <stdio.h>

void check_alignment(char c1, char c2)
{
    printf("delta=%d\n", (int)&c2 - (int)&c1); // prints 4 instead of 8
}

void main() {
    check_alignment('a', 'b');
}
I was expecting delta=8. Since it's a 64-bit machine, char c1 and char c2 should be stored at multiples of 8. Isn't that right?
Even if we assume the compiler has optimized them to be stored in less space, why not store them back to back (delta=1)? Why 4-byte alignment?
I repeated the above experiment with float types, and it still gives delta=4:
void check_alignment(float f1, float f2)
{
    printf("delta=%d\n", (int)&f2 - (int)&f1); // prints 4
}

void main() {
    check_alignment(1.0f, 1.1f);
}

Firstly, if your platform is 64-bit, then why are you casting your pointer values to int? Is int 64 bits wide on your platform? If not, your subtraction is likely to produce a meaningless result. Use intptr_t or ptrdiff_t for that purpose, not int.
Secondly, in a typical implementation a 1-byte type will be aligned at a 1-byte boundary, regardless of whether your platform is 64-bit or not. To see an 8-byte alignment you'd need an 8-byte type. And in order to see how it is aligned you have to inspect the physical value of the address (i.e. whether it is divisible by 1, 2, 4, 8, etc.), not analyze how far apart two variables are spaced.
Thirdly, how far apart c1 and c2 are in memory has little to do with the alignment requirement of the char type. It is determined by how char values are passed (or stored locally) on your platform. In your case they are apparently allocated 4-byte storage cells each. That's perfectly fine. Nobody ever promised you that two unrelated objects with 1-byte alignment will be packed next to each other as tightly as possible.
If you want to determine alignment by measuring how far apart two objects are stored, declare an array. Do not try to measure the distance between two independent objects - that is meaningless.
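A minimal sketch of that array approach (the spacing between elements equals sizeof(T), which is always a multiple of the type's alignment; the exact numbers are implementation-defined):

#include <stdio.h>

int main(void)
{
    char   c[2];
    double d[2];

    /* array elements are contiguous, so their spacing is sizeof(T),
       a multiple of the type's alignment requirement */
    printf("char spacing:   %td\n", (char *)&c[1] - (char *)&c[0]); /* 1 */
    printf("double spacing: %td\n", (char *)&d[1] - (char *)&d[0]); /* typically 8 */
    return 0;
}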

To determine the greatest fundamental alignment in your C implementation, use:
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    printf("%zu bytes\n", _Alignof(max_align_t));
    return 0;
}
To determine the alignment requirement of any particular type, replace max_align_t above with that type.
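For example, a sketch printing a few individual alignment requirements (the values are implementation-defined; the ones in the comments are merely typical for x86-64):

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    printf("char:        %zu\n", _Alignof(char));        /* 1  */
    printf("int:         %zu\n", _Alignof(int));         /* 4  */
    printf("double:      %zu\n", _Alignof(double));      /* 8  */
    printf("max_align_t: %zu\n", _Alignof(max_align_t)); /* 16 */
    return 0;
}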
Alignment is not purely a function of the processor or other hardware. Hardware might support aligned or unaligned accesses with different performance effects, and some instructions might support unaligned access while others do not. A particular C implementation might choose to require or not require certain alignment in combination with choosing to use or not use various instructions. Additionally, on some hardware, whether unaligned access is supported is configurable by the operating system.

Related

Understanding memory alignment constraints and padding bytes in C

I have the following code snippet.
#include <stdio.h>

int main() {
    typedef struct {
        int a;
        int b;
        int c;
        char ch1;
        int d;
    } str;

    printf("Size: %zu\n", sizeof(str));
    return 0;
}
which gives the following output:
Size: 20
I know that the size of the structure is greater than the sum of the sizes of its members because of padding added to satisfy memory alignment constraints.
I want to know how it is decided how many bytes of padding have to be added. On what does it depend? Does it depend on the CPU architecture? And does it depend on the compiler as well? I am using a 64-bit CPU and the gcc compiler. How will the output change if these parameters change?
I know there are similar questions on StackOverflow, but they do not explain these memory alignment constraints thoroughly.
In general it depends on the requirements of the architecture. There is plenty written about this elsewhere, but it can be summarized as follows:
Storage for the basic C datatypes on an x86 or ARM processor doesn’t
normally start at arbitrary byte addresses in memory. Rather, each
type except char has an alignment requirement; chars can start on any
byte address, but 2-byte shorts must start on an even address, 4-byte
ints or floats must start on an address divisible by 4, and 8-byte
longs or doubles must start on an address divisible by 8. Signed or
unsigned makes no difference.
In your case the following is probably taking place: sizeof(str) = 4 (int a) + 4 (int b) + 4 (int c) + 1 (char ch1) + 7 (3 bytes of padding + 4 bytes for int d) = 20.
The padding is there so that d sits at an address that's a multiple of 4 bytes. This requirement comes from the fact that int is 4 bytes long (my assumption regarding the architecture you're using), but this will vary from one architecture to another.
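You can see exactly where the padding lands with offsetof; a small sketch based on the struct from the question (the offsets in the comments assume a 4-byte int with 4-byte alignment):

#include <stdio.h>
#include <stddef.h>

typedef struct {
    int  a;   /* offset 0  */
    int  b;   /* offset 4  */
    int  c;   /* offset 8  */
    char ch1; /* offset 12, followed by 3 bytes of padding */
    int  d;   /* offset 16 */
} str;

int main(void)
{
    printf("offsetof(str, d) = %zu, sizeof(str) = %zu\n", offsetof(str, d), sizeof(str));
    return 0;
}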
On what does it depend? Does it depend on the CPU architecture? And does it depend on the compiler as well?
CPU, operating system, and compiler at least.
I know that it depends on the CPU architecture; I think you can find some interesting articles that talk about this on the net - Wikipedia is not bad, in my opinion.
I am using a 64-bit Linux machine, and what I can say is that every field is aligned so that it sits at a memory address divisible by its own size (for basic types). For example:
int and float are aligned to 4 (must be at a memory address divisible by 4)
char and bool to 1 (which means no padding)
double and pointers to 8
The best way to avoid padding is to order your fields from largest to smallest (when there are only basic fields), as in the sketch below.
When there are compound fields (structs inside structs) it is a little more difficult to explain here, but I think you can work it out yourself on paper.
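A sketch of that reordering advice (the sizes in the comments assume a 4-byte int, an 8-byte double and typical x86-64 alignment):

#include <stdio.h>

/* char sandwiched between wider members: 7 bytes of padding after c, 4 at the end */
struct padded    { char c; double d; int i; };  /* typically 24 bytes */

/* same members, largest first: only 3 bytes of tail padding remain */
struct reordered { double d; int i; char c; };  /* typically 16 bytes */

int main(void)
{
    printf("%zu vs %zu\n", sizeof(struct padded), sizeof(struct reordered));
    return 0;
}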

How do I force the program to use unaligned addresses?

I've heard that reads and writes of aligned ints are atomic and safe. I wonder when the system places non-malloc'd globals at unaligned addresses, other than in packed structures and casting/pointer-arithmetic byte buffers?
[x86-64 Linux] In all of my normal cases, the system always chooses integer locations that don't get word-torn, for example two bytes on one word and the other two bytes on the other word. Can anyone post a program/snippet (C or assembly) that forces a global variable onto an unaligned address such that the integer gets torn and the system has to use two reads to load one integer value?
When I run the program below, the addresses are close to each other, such that multiple variables sit within 64 bits, but not once is word tearing seen (smartness in the system or the compiler?).
#include <stdio.h>

int a;
char b;
char c;
int d;
int e = 0;

int isaligned(void *p, int N)
{
    if (((int)p % N) == 0)
        return 1;
    else
        return 0;
}

int main()
{
    printf("processor is %zu byte mode \n", sizeof(int *));
    printf("a=%p/b=%p/c=%p/d=%p/f=%p\n", &a, &b, &c, &d, &e);
    printf(" check for 64bit alignment of test result of 0x80 = %d \n", isaligned((void *)0x80, 64));
    printf(" check for 64bit alignment of a result = %d \n", isaligned(&a, 64));
    printf(" check for 64bit alignment of d result = %d \n", isaligned(&e, 64));
    return 0;
}
Output:
processor is 8 byte mode
a=0x601038/b=0x60103c/c=0x60103d/d=0x601034/f=0x601030
check for 64bit alignment of test result of 0x80 = 1
check for 64bit alignment of a result = 0
check for 64bit alignment of d result = 0
How does a read of a char happen in the above case? Does it read from the 8-byte-aligned boundary (in my case 0x601030) and then go to 0x60103c?
Memory access granularity is always the word size, isn't it?
Thx.
1) Yes, there is no guarantee that unaligned accesses are atomic, because [at least sometimes, on certain types of processors] the data may be written as two separate writes - for example if you cross a memory page boundary [I'm not talking about 4KB pages for virtual memory, I'm talking about DDR2/3/4 pages, which are some fraction of the total memory size, typically 16 Kbits times whatever the width of the actual memory chip is - this will vary depending on the memory stick itself]. Equally, on processors other than x86, you may get a trap for reading unaligned memory, which would either cause the program to abort, or the read to be emulated in software as multiple reads to "fix" the unaligned read.
2) You could always make an unaligned memory region by something like this:
/* allocate room for `number` long longs plus a couple of spare bytes,
   so the deliberately offset pointer below stays inside the buffer */
char *ptr = malloc(sizeof(long long) * number + 2);
long long *unaligned = (long long *)&ptr[2];   /* misaligned by 2 bytes */
for (i = 0; i < number; i++)
    temp = unaligned[i];
By the way, your alignment check checks if the address is aligned to 64 bytes, not 64 bits. You'll have to divide by 8 to check that it's aligned to 64 bits.
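For reference, a sketch of the same check with the cast widened to uintptr_t so it does not truncate on a 64-bit target (is_aligned is just an illustrative name; pass n = 8 to test 64-bit, i.e. 8-byte, alignment):

#include <stdint.h>

/* returns 1 if p is aligned to an n-byte boundary */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p % n) == 0;
}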
3) A char is a single-byte read, and the address will be the actual address of the byte itself. The actual memory read performed is probably for a full cache line, starting at the target address and then wrapping around, so for example:
0x60103d is the target address, so the processor will read a cache line of 32 bytes, starting at the 64-bit word we want: 0x601038 (and as soon as that's completed, the processor goes on to the next instruction - meanwhile further reads fill the rest of the cache line: 0x601020, 0x601028, 0x601030). But should we turn the cache off [if you want your 3GHz latest x86 processor to be slightly slower than a 66MHz 486, disabling the cache is a good way to achieve that], the processor would just read one byte at 0x60103d.
4) Not on x86 processors, they have byte addressing - but for normal memory, reads are done on a cacheline basis, as explained above.
Note also that "may not be atomic" is not at all the same as "will not be atomic" - so you'll probably have a hard time making it go wrong by will - you really need to get all the timings of two different threads just right, and straddle cachelines, straddle memory page boundaries, and so on to make it go wrong - this will happen if you don't want it to happen, but trying to make it go wrong can be darn hard [trust me, I've been there, done that].
It probably doesn't, outside of those cases.
In assembly it's trivial. Something like:
.org 0x2
myglobal:
.word SOME_NUMBER
But on Intel, the processor can safely read unaligned memory. It might not be atomic, but that might not be apparent from the generated code.
Intel, right? The Intel ISA has single-byte read/write opcodes. Disassemble your program and see what it's using.
Not necessarily - you might have a mismatch between memory word size and processor word size.
1) This answer is platform-specific. In general, though, the compiler will align variables unless you force it to do otherwise.
2) The following will require two reads to load one variable when run on a 32-bit CPU:
uint64_t huge_variable;
The variable is larger than a register, so it will require multiple operations to access. You can also do something similar by using packed structures:
struct __attribute__ ((packed)) unaligned
{
    char buffer[2];
    int unaligned;
    char buffer2[2];
} sample_struct;
3) This answer is platform-specific. Some platforms may behave like you describe. Some platforms have instructions capable of fetching a half-register or quarter-register of data. I recommend examining the assembly emitted by your compiler for more details (make sure you turn off all compiler optimizations first).
4) The C language allows you to access memory with byte-sized granularity. How this is implemented under the hood and how much data your CPU fetches to read a single byte is platform-specific. For many CPUs, this is the same as the size of a general-purpose register.
The C standards guarantee that malloc(3) returns a memory area that complies with the strictest fundamental alignment requirement, so this just can't happen in that case. If there is unaligned data, it is probably read/written in pieces (that depends on the exact guarantees the architecture provides).
On some architectures unaligned access is allowed; on others it is a fatal error. When allowed, it is normally much slower than aligned access; when not allowed, the compiler must fetch the pieces and splice them together, and that is even much slower.
Characters (really bytes) are normally allowed to have any byte address. The instructions working with bytes just get/store the individual byte in that case.
No, memory access is according to the width of the data. But real memory access is in terms of cache lines (read up on CPU cache for this).
Non-aligned objects can never come into existence without you invoking undefined behavior. In other words, there is no sequence of actions, all having well-defined behavior, which a program can take that will result in a non-aligned pointer coming into existence. In particular, there is no portable way to get the compiler to give you misaligned objects. The closest thing is the "packed structure" many compilers have, but that only applies to structure members, not independent objects.
Further, there is no way to test alignedness in portable C. You can use the implementation-defined conversions of pointers to integers and inspect the low bits, but there is no fundamental requirement that "aligned" pointers have zeros in the low bits, or that the low bits after conversion to integer even correspond to the "least significant" bits of the pointer, whatever that would mean. In other words, conversions between pointers and integers are not required to commute with arithmetic operations.
If you really want to make some misaligned pointers, the easiest way to do it, assuming alignof(int)>1, is something like:
char buf[2*sizeof(int)+1];
int *p1 = (int *)buf, *p2 = (int *)(buf+sizeof(int)+1);
It's impossible for both buf and buf+sizeof(int)+1 to be simultaneously aligned for int if alignof(int) is greater than 1. Thus at least one of the two (int *) casts gets applied to a misaligned pointer, invoking undefined behavior, and the typical result is a misaligned pointer.

Map memory to another address

X86-64, Linux, Windows.
Consider that I want to make some sort of "free lunch for tagged pointers". Basically, I want to have two pointers that point to the same actual memory block but whose bits are different. (For example, I want one bit to be used by the GC or for some other reason.)
intptr_t ptr = (intptr_t)malloc(size);
intptr_t ptr2 = map(ptr | GC_FLAG_REACHABLE); // some magic call
int *p = (int *)ptr;
int *p2 = (int *)ptr2;
*p = 10;
*p2 = 20;
assert(*p == 20);
assert(p != p2);
On Linux, mmap() the same file twice. Same thing on Windows really, but it has its own set of functions for that.
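A minimal Linux sketch of that idea, using a temporary file as the shared backing and POSIX mmap (error checking omitted for brevity):

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    FILE *tmp = tmpfile();        /* any file descriptor works; tmpfile() keeps it simple */
    int fd = fileno(tmp);
    ftruncate(fd, len);

    /* two MAP_SHARED mappings of the same file offset: two addresses, one memory */
    int *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    a[0] = 20;
    printf("a=%p b=%p b[0]=%d\n", (void *)a, (void *)b, b[0]); /* b[0] is 20 */
    return 0;
}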
Mapping the same memory (mmap on POSIX as Ignacio mentions, MapViewOfFile on Windows) to multiple virtual addresses may provide you some interesting coherency puzzles (are writes at one address visible when read at another address?). Or maybe not. I'm not sure what all the platform guarantees are.
More commonly, one simply reserves a few bits in the pointer and shifts things around as necessary.
If all your objects are aligned to 8-byte boundaries, it's common to simply store tags in the 3 least-significant bits of a pointer and mask them off before dereferencing (as thkala mentions). If you choose a higher alignment, such as 16 bytes or 32 bytes, then there are 4 or 5 least-significant bits that can be used for tagging. Equivalently, choose a few most-significant bits for tagging, and shift them off before dereferencing. (Sometimes non-contiguous bits are used, for example when packing pointers into the signalling NaNs of IEEE-754 floats (2^23 values) or doubles (2^51 values).)
Continuing on the high end of the pointer, current implementations of x86-64 use at most 48 bits out of a 64-bit pointer (0x0000000000000000-0x00007fffffffffff + 0xffff800000000000-0xffffffffffffffff) and Linux and Windows only hand out addresses in the first range to userspace, leaving 17 most-significant bits that can be safely masked off. (This is neither portable nor guaranteed to remain true in the future, though.)
Another approach is to stop considering "pointers" and simply use indices into a larger memory array, as the JVM does with -XX:+UseCompressedOops. If you've allocated a 512MB pool and are storing 8-byte aligned objects, there are 2^26 possible object locations, so a 32-bit value has 6 bits to spare in addition to the index. A dereference will require adding the index times the alignment to the base address of the array, saved elsewhere (it's the same for every "pointer"). If you look at things carefully, this is simply a generalization of the previous technique (which always has a base of 0, where things line up with real pointers).
Once upon a time I worked on a Prolog implementation that used the following technique to have spare bits in a pointer:
Allocate a memory area with a known alignment. malloc() usually allocates memory with a 4-byte or 8-byte alignment. If necessary, use posix_memalign() to get areas with a higher alignment size.
Since the resulting pointer is aligned to intervals of multiple bytes, but it represents byte-accurate addresses, you have a few spare bits that will by definition be zero in the memory area pointer. For example a 4-byte alignment gives you two spare bits on the LSB side of the pointer.
You OR (|) your flags with those bits and now have a tagged pointer.
As long as you take care to properly mask the pointer before using it for memory access, you should be perfectly fine.
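A minimal sketch of that low-bit tagging approach (GC_FLAG_REACHABLE, tag and untag are just illustrative names; it assumes malloc returns storage aligned to at least 8 bytes, so the low 3 bits are free):

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAG_MASK          ((uintptr_t)0x7)  /* low 3 bits are spare with 8-byte alignment */
#define GC_FLAG_REACHABLE ((uintptr_t)0x1)

static void *tag(void *p, uintptr_t t) { return (void *)((uintptr_t)p | t); }
static void *untag(void *p)            { return (void *)((uintptr_t)p & ~TAG_MASK); }

int main(void)
{
    int *obj = malloc(sizeof *obj);               /* suitably aligned by malloc */
    int *tagged = tag(obj, GC_FLAG_REACHABLE);    /* carries the flag in bit 0 */

    *(int *)untag(tagged) = 20;                   /* always mask before dereferencing */
    assert(*obj == 20 && obj != tagged);
    free(obj);
    return 0;
}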

Why isn't the size of a struct equal to the sum of the sizes of its individual member types? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Why isn’t sizeof for a struct equal to the sum of sizeof of each member?
I guess similar (duplicate) questions must have been asked on SO before, but I'm unable to find them. Basically I don't know what to search for, so I'm asking it here.
Why isn't the size of a struct equal to the sum of the sizes of its individual member types?
I'm using the Visual C++ compiler.
For example, assume a 32-bit machine (=> sizeof(int) == 4; sizeof(char) == 1; sizeof(short) == 2):
struct {
    int k;
    char c;
} s;
The expected size is 4+1 = 5, but sizeof(s) gives 8. Here char is occupying 4 bytes instead of 1. I don't know the exact reason for this, but my guess is the compiler is doing so for efficiency purposes.
struct {
    long long k;
    int i;
} s;
The expected size is 8+4 = 12 (sizeof(long long) is 8 even on a 32-bit machine), but strangely sizeof(s) gives 16. Here both int and long long end up occupying 8 bytes each.
What is this thing called?
What exactly is going on?
Why is compiler doing this?
Is there a way to tell compiler to stop doing this?
It is because the compiler uses padding to bring each member to an alignment that is specific to the architecture for which you are compiling.
It can be for one of several reasons, but usually:
Because some CPUs simply cannot read a long or long long multi-byte value when it isn't at an address that is a multiple of its own size.
Because CPUs that can read misaligned data may do so much more slowly than for aligned data.
You can often force it off with a compiler-specific directive or pragma.
When you do this, the compiler will generate relatively inefficient code to access the off-aligned data using multiple read/write operations.
This is called padding; it involves adding extra bytes in order to align the structure at addresses that are divisible by some special number, usually 2, 4, or 8. The compiler can even place padding between members to align the fields themselves on those boundaries.
This is a performance optimization: access to aligned memory addresses is faster, and some architectures don't even support accessing unaligned addresses.
For VC++, you can use the pack pragma to control padding between fields. However, note that different compilers have different ways of handling this, so if, for example, you also want to support GCC, you'll have to use a different declaration for that.
The compiler can insert padding between members or at the end of the struct. Padding between members is typically done to keep the members aligned to maximize access speed. Padding at the end is done for roughly the same reason, in case you decide to create an array of the structs.
To prevent it from happening, you use something like #pragma pack(1).
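A sketch of that, using the push/pop form of the pack pragma, which both MSVC and GCC accept (the sizes in the comments assume an 8-byte long long and a 4-byte int):

#include <stdio.h>

struct padded { long long k; int i; };   /* typically 16 bytes: 4 bytes of tail padding */

#pragma pack(push, 1)
struct packed { long long k; int i; };   /* 12 bytes, but members may end up misaligned */
#pragma pack(pop)

int main(void)
{
    printf("%zu vs %zu\n", sizeof(struct padded), sizeof(struct packed));
    return 0;
}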

Does an 8-bit processor have to face the endianness problem?

If I have an int32-type integer in an 8-bit processor's memory (say, an 8051), how can I identify the endianness of that integer? Is it compiler-specific? I think this is important when sending multibyte data over serial lines, etc.
With an 8 bit microcontroller that has no native support for wider integers, the endianness of integers stored in memory is indeed up to the compiler writer.
The SDCC compiler, which is widely used on 8051, stores integers in little-endian format (the user guide for that compiler claims that it is more efficient on that architecture, due to the presence of an instruction for incrementing a data pointer but not one for decrementing).
If the processor has any operations that act on multi-byte values, or has any multi-byte registers, it has the possibility of having an endianness.
http://69.41.174.64/forum/printable.phtml?id=14233&thread=14207 suggests that the 8051 mixes different endiannesses in different places.
The endianness is specific to the CPU architecture. Since a compiler needs to target a particular CPU, the compiler has knowledge of the endianness as well. So if you need to send data over a serial connection, a network, etc., you may wish to use built-in functions to put data in network byte order - especially if your code needs to support multiple architectures.
For more information, see: http://www.gnu.org/s/libc/manual/html_node/Byte-Order.html
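For example, a sketch using the POSIX byte-order helpers (this assumes a hosted platform with <arpa/inet.h>; on a bare-metal 8051 toolchain you would more likely hand-roll the swap, and on Windows the equivalent functions live in <winsock2.h>):

#include <stdio.h>
#include <inttypes.h>
#include <arpa/inet.h>

int main(void)
{
    uint32_t host = 0x12345678;
    uint32_t wire = htonl(host);   /* network byte order is always big-endian */
    printf("host=0x%08" PRIx32 " wire=0x%08" PRIx32 " back=0x%08" PRIx32 "\n",
           host, wire, ntohl(wire));
    return 0;
}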
It's not just up to the compiler - '51 has some native 16-bit registers (DPTR, PC in standard, ADC_IN, DAC_OUT and such in variants) of given endianness which the compiler has to obey - but outside of that, the compiler is free to use any endianness it prefers or one you choose in project configuration...
An integer does not carry its endianness with it. You can't determine just from looking at the bytes whether it's big- or little-endian; you just have to know. For example, if your 8-bit processor is little-endian and you're receiving a message that you know to be big-endian (because, say, the fieldbus system defines big-endian), you have to convert values wider than 8 bits. You'll need to either hard-code that or have some definition on the system of which bytes to swap, as sketched below.
Note that swapping bytes is the easy part. You may also have to swap bits in bit fields, since the order of bits in bit fields is compiler-specific. Again, you basically have to know this at build time.
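When no library helper is available, the byte swap itself is just a few shifts; a minimal sketch:

#include <stdint.h>

/* swap the byte order of a 32-bit value; works regardless of the host's endianness */
static uint32_t swap32(uint32_t v)
{
    return  (v >> 24)
         | ((v >>  8) & 0x0000FF00u)
         | ((v <<  8) & 0x00FF0000u)
         |  (v << 24);
}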
unsigned long int x = 1;
unsigned char *px = (unsigned char *) &x;
puts(*px == 0 ? "big endian" : "little endian");   /* needs <stdio.h> */
If x is assigned the value 1 then the value 1 will be in the least significant byte.
If we then take a byte pointer to x, it will point to the lowest memory location of x. If that byte is 0, the machine is big-endian; otherwise it is little-endian.
#include <stdio.h>

union foo {
    int as_int;
    char as_bytes[sizeof(int)];
};

int main() {
    union foo data;
    int i;

    for (i = 0; i < sizeof(int); ++i) {
        data.as_bytes[i] = 1 + i;
    }
    printf("%0x\n", data.as_int);
    return 0;
}
Interpreting the output is up to you.
