Interlocked operations and alignment with _aligned_malloc

I am concerned about alignment and Interlocked operations. Again. The documentation for these functions states that the variable we want to update should be aligned on a 32-bit boundary, and that we can achieve this via _aligned_malloc. Fine.
So I have this small test program:
#include <stdio.h>

struct S
{
    char c;
    long l;
} an_S;

int main(void)
{
    printf("%p, %p\n", (void *)&an_S.c, (void *)&an_S.l);
    return 0;
}
In release mode, the output always gives an address for the long that is 4 bytes after the address of the char, hence it starts on a 32-bit boundary.
1) Is this purely by chance, or can I rely on this hence no need for _aligned_malloc?
2) If I have to use _aligned_malloc, can someone clarify how to do so? I've read the documentation at https://msdn.microsoft.com/en-us/library/8z34s9c6.aspx but it doesn't seem to show how to assign a value to the memory that is 'allocated'...
3) (Assuming I do need _aligned_malloc) If I want an array of structures with a long member like the above, which needs to be operated on via an Interlocked operation, do I need to add some sort of constructor to set this up, or is there an easier way of doing it?
4) I did a Google search for _aligned_malloc+interlockedCompareExchange and it brought back only 70 results. That tells me that either the bulk of the code out there that uses InterlockedCompareExchange (62,800 results) is wrong, or _aligned_malloc isn't necessary. Can someone please clarify?

If your structures are aligned, which is the default, then each member will be aligned suitably for its type.
As far as malloc goes, the MSVC documentation explains that on 32-bit targets the returned memory is 8-byte aligned, and on 64-bit targets it is 16-byte aligned. So you are fine to use malloc.
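A quick way to convince yourself of both claims (a minimal sketch, assuming a C11 compiler for _Alignof):
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>

struct S { char c; long l; };

int main(void)
{
    /* The compiler pads the struct so l sits at an offset suitable for long. */
    printf("offsetof(S, l) = %zu, _Alignof(long) = %zu\n",
           offsetof(struct S, l), _Alignof(long));

    /* malloc's result is aligned for any fundamental type, so l is aligned too. */
    struct S *p = malloc(sizeof *p);
    if (p) {
        printf("&p->l = %p\n", (void *)&p->l);
        free(p);
    }
    return 0;
}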

Related

Is it better for performance to have alignment padding at the end of a small struct instead of between 2 members?

We know that there is padding in some structures in C. Consider the following two:
struct node1 {
    int a;
    int b;
    char c;
};
struct node2 {
    int a;
    char c;
    int b;
};
Assuming sizeof(int) = alignof(int) = 4 bytes:
sizeof(node1) = sizeof(node2) = 12, due to padding.
What is the performance difference between the two? (if any, w.r.t. the compiler or the architecture of the system, especially with GCC)
These are bad examples - in this case it doesn't matter, since the amount of padding is the same either way. There will not be any performance difference.
The compiler will always strive to fill up trailing padding at the end of a struct; otherwise arrays of structs wouldn't be feasible, since the first member of each element must always be aligned. Without trailing padding in some item struct_array[0], the first member of struct_array[1] would end up misaligned.
The order would matter if we were to do this though:
struct node3 {
    int a;
    char b;
    int c;
    char d;
};
Assuming a 4-byte int and 4-byte alignment, b occupies 1+3 bytes here, and d another 1+3 bytes, for a total size of 16. This could have been written better: had the two char members been placed adjacently, the total amount of padding would have been just 2 bytes, for a total size of 12.
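A minimal sketch of the comparison (struct node3b is hypothetical, introduced here only to show the rearrangement):
#include <stdio.h>

struct node3 {   /* 4 + 1+3 + 4 + 1+3 = 16 bytes */
    int a;
    char b;
    int c;
    char d;
};

struct node3b {  /* 4 + 4 + 1 + 1 + 2 = 12 bytes */
    int a;
    int c;
    char b;
    char d;
};

int main(void)
{
    printf("node3:  %zu\n", sizeof(struct node3));   /* typically 16 */
    printf("node3b: %zu\n", sizeof(struct node3b));  /* typically 12 */
    return 0;
}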
I would not be surprised if the interviewer's opinion was based on the old argument of backward compatibility when extending the struct in the future. Additional small fields (a char, a short) may fit into the space occupied by the trailing padding, without the risk of affecting the memory offsets of the existing fields.
In most cases, it's a moot point. The approach itself is likely to break compatibility, for two reasons:
Starting the extensions on a new alignment boundary (as would happen to node2) may not be memory-optimal, but it might well prevent the new fields from accidentally being overwritten by the padding of a 'legacy' struct.
When compatibility is that much of an issue (e.g. when persisting or transferring data), then it makes more sense to serialize/deserialize (even if binary is a requirement) than to depend on a binary format that varies per architecture, per compiler, even per compiler option.
OK, I might be completely off the mark here since this is a bit out of my league. If so, please correct me. But this is how I see it:
First of all, why do we need padding and alignment at all? It's just wasted bytes, isn't it? Well, it turns out that processors like it. That is, if you issue an instruction to the CPU that operates on a 32-bit integer, the CPU will demand that this integer reside at a memory address divisible by 4. A 64-bit integer will need to reside at an address divisible by 8. And so on. This is done to make the CPU design simpler and faster.
If you violate this requirement (aka "unaligned memory access"), most CPUs will raise an exception. x86 is actually an oddity, because it will still perform the operation - but it will take more than twice as long, since it fetches the value from memory in two passes rather than one and then does bitwise magic to stitch the value together from the separate accesses.
So this is the reason why compilers add padding to structs: so that all the members are properly aligned and the CPU can access them quickly (or at all). That assumes the struct itself is located at a proper memory address, but the compiler will take care of that too, as long as you stick to the standard operations for allocating memory.
But it is possible to explicitly tell the compiler that you want a different alignment. For example, if you want to use your struct to read in a bunch of data from a tightly packed file, you could explicitly set the packing to 1. In that case the compiler will also have to emit extra instructions to compensate for potential misalignment.
TL;DR - wrong alignment makes everything slower (or under certain conditions can crash your program entirely).
However this doesn't answer the question "where to better put the padding?" Padding is needed, yes, but where? Well, it doesn't make much difference directly, however by rearranging your members carefully you can reduce the size of the entire struct. And less memory used usually means a faster program. Especially if you create large arrays of these structs, using less memory will mean less memory accesses and more efficient use of CPU cache.
In your example however I don't think there's any difference.
P.S. Why does your struct end with a padding? Because arrays. The compiler wants to make sure that if you allocate an array of these structs, they will all be properly aligned. Because array members don't have any padding between them.
What is the performance difference between the two?
The performance difference is "indeterminable". For most cases it won't make any difference.
For cases where it does make a difference, either version might be faster, depending on how the structure is used. For example, suppose you have a large array of these structures and frequently select one in the array "randomly": if you only access a and b of the selected structure, the first version can be faster (because a and b are more likely to be in the same cache line), and if you only access a and c, the second version can be faster.

Explanation of packed attribute in C

I was wondering if anyone could offer a fuller explanation of the meaning of the packed attribute used in the bitmap example in pset4.
"Our use, incidentally, of the attribute called packed ensures that clang does not try to "word-align" members (whereby the address of each member’s first byte is a multiple of 4), lest we end up with "gaps" in our structs that don’t actually exist on disk."
I do not understand the comment about gaps in our structs. Does this refer to gaps in the memory locations between each struct (i.e. one byte after each 3-byte RGB triple if it were word-aligned)? Why does this matter for optimization?
typedef uint8_t BYTE;

typedef struct
{
    BYTE rgbtBlue;
    BYTE rgbtGreen;
    BYTE rgbtRed;
} __attribute__((__packed__))
RGBTRIPLE;
Beware: prejudices on display!
As noted in comments, when the compiler adds the padding to a structure, it does so to improve performance. It uses the alignments for the structure elements that will give the best performance.
Not so very long ago, the DEC Alpha chips would handle an 'unaligned memory request' (umr) by taking a page fault, jumping into the kernel, fiddling with the bytes to get the required result, and returning the correct result. This was painfully slow by comparison with a correctly aligned memory request; you avoided such behaviour at all costs.
Other RISC chips (used to) give you a SIGBUS error if you do misaligned memory accesses. Even Intel chips have to do some fancy footwork to deal with misaligned memory accesses.
The purpose of removing padding is to (sacrifice some performance but) benefit by being able to serialize and deserialize the data without doing the job 'properly'. It is a form of laziness that doesn't actually work when the machines communicating are not of the same type, so proper serialization should have been done in the first place.
What I mean is that if you are writing data over the network, it seems simpler to be able to send the data by writing the contents of a structure as a block of memory (error checking etc omitted):
write(fd, &structure, sizeof(structure));
The receiving end can read the data:
read(fd, &structure, sizeof(structure));
However, if the machines are of different types (for example, one has an Intel CPU and the other a SPARC or Power CPU), the interpretation of the data in those structures will vary between the two machines (unless every element of the structure is either a char or an array of char). To relay the information reliably, you have to agree on a byte order (e.g. network byte order; this is very much a factor in TCP/IP networking, for example), and the data should be transmitted in the agreed order so that both ends can understand what the other is saying.
You can define other mechanisms: you could use a 'sender makes right' mechanism, in which the 'receiver' lets the sender know how it wants the data presented and the sender is responsible for fixing up the transmitted data. You can also use a 'receiver makes right' mechanism, which works the other way around. Both have been used commercially; see DRDA for one such protocol.
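As a hedged sketch of what 'doing the job properly' might look like, here is field-by-field serialization into network byte order (the struct and its fields are made up purely for illustration):
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl; on Windows, winsock2.h provides it */

/* Hypothetical message, purely for illustration. */
struct msg {
    uint32_t id;
    uint32_t length;
};

/* Serialize field by field in network byte order, instead of writing
   the raw struct (with its padding and host byte order) to the wire. */
static size_t msg_serialize(const struct msg *m, unsigned char buf[8])
{
    uint32_t id = htonl(m->id);
    uint32_t length = htonl(m->length);
    memcpy(buf + 0, &id, 4);
    memcpy(buf + 4, &length, 4);
    return 8;
}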
Given that the type of BYTE is uint8_t, there won't be any padding in the structure in any sane (commercially viable) compiler. IMO, the precaution is a fantasy or phobia without a basis in reality. I'd certainly need a carefully documented counter-example to believe that there's an actual problem that the attribute helps with.
I was led to believe that you could encounter issues when you pass the entire struct to a function like fread, since it assumes you're giving it an array-like chunk of memory with no gaps in it. If your struct has gaps, the first byte ends up in the right place, but the next two bytes may get written into the gap, which you don't have a proper way to access.
Sorta...but mostly no. The issue is that the values in the padding bytes are indeterminate. However, in the structure shown, there will be no padding in any compiler I've come across; the structure will be 3 bytes long. There is no reason to put any padding anywhere inside the structure (between elements) or after the last element (and the standard prohibits padding before the first element). So, in this context, there is no issue.
If you write binary data to a file and it has holes in it, then you get arbitrary byte values written where the holes are. If you read back on the same (type of) machine, there won't actually be a problem. If you read back on a different (type of) machine, there may be problems — hence my comments about serialization and deserialization. I've only been programming in C a little over 30 years; I've never needed packed, and don't expect to. (And yes, I've dealt with serialization and deserialization using a standard layout — the system I mainly worked on used big-endian data transfer, which corresponds to network byte order.)
Sometimes, the elements of a struct are simply aligned to a 4-byte boundary (or whatever the size of a register is in the CPU) to optimize read/write access to RAM. Often, smaller elements are packed together, but alignment is dictated by a larger type in the struct.
In your case, you probably don't need to pack the struct, but it doesn't hurt.
With some compilers, each byte in your struct could end up occupying 4 bytes of RAM (so, 12 bytes for the entire struct). Packing the struct removes the alignment requirement for each of the BYTEs and ensures that the entire struct is placed into one 4-byte DWORD (unless the alignment for the entire program is set to one byte, or the struct is in an array of said structs, in which case it would literally be stored in 3 contiguous bytes of RAM).
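If you want to verify that there are no gaps on your particular compiler, a one-line compile-time check works (a minimal sketch, assuming C11 _Static_assert):
/* Fails to compile if the compiler ever inserts padding into RGBTRIPLE. */
_Static_assert(sizeof(RGBTRIPLE) == 3, "RGBTRIPLE must be exactly 3 bytes");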
The objective is exactly what you said, not having gaps between each struct. Why is this important? Mostly because of cache. Memory access is slow!!! Cache is really fast. If you can fit more in cache you avoid cache misses (memory accesses).
Edit: It seems I was wrong; packing isn't really useful here if the objective was avoiding structure padding, since the struct consists of 3 BYTEs and has no padding anyway.

Is this pointer code legal on 64-bit computers

I plan to use memory across two pointers. Let's call them pointer1 and pointer2. Each pointer will be connected to its own share of memory as defined by block1 and block2 respectively.
I think this way works for all systems (both 32 and 64 bit):
char block1[100000];
char *pointer1=block1;
char block2[100000];
char *pointer2=block2;
However I think a faster way would be to use this code:
char block[200000];
char *pointer1=block;
char *pointer2=block+100000;
My question is would the last line of the last code fragment be compatible with 64-bit architecture?
The address space of a 32-bit architecture is 2**32 = 4294967296 bytes. For a 64-bit one it is 18446744073709551616. I think you will be OK. The compiler should handle it on its own; for your use case it is just plain pointer arithmetic that stays within the address space.
What you have done is set up a memory pool in its most basic form. Your example uses char arrays and pointers, so you are unlikely to get unwanted results; however, if your second pointer were, for instance, long * (with proper casting), you would get differences in alignment, which could cause significantly slower code unless you take special precautions to align the offsets manually (using hex values instead of decimal for offsets makes this a bit more obvious).
So in a more complex scenario it would matter, because a long may need to be aligned to 8 bytes, or 4.
I apologize for going a bit beyond the scope of the question, but I didn't want someone mistakenly extrapolating what is fine for char onto mixed types carved out of a char[]; a sketch of doing that safely follows.
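A minimal sketch of carving an aligned sub-block out of a char pool (assuming C11 for _Alignas/_Alignof; the helper name is made up):
#include <stddef.h>

static _Alignas(long) char pool[200000];

/* Round offset up to the next multiple of align (align must be a power of 2). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

int main(void)
{
    char *pointer1 = pool;                          /* char needs no special alignment */
    size_t off = align_up(100000, _Alignof(long));  /* pad up so a long * is safe      */
    long *pointer2 = (long *)(pool + off);
    /* Note: real allocators also have to respect effective-type (aliasing) rules;
       this only demonstrates the alignment arithmetic. */
    (void)pointer1; (void)pointer2;
    return 0;
}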

How are addresses resolved by a compiler in a medium memory model?

I'm new to programming small/medium memory model CPUs. I am working with an embedded processor that has 256KB of flash code space at addresses 0x00000 to 0x3FFFF, and 20KB of RAM at addresses 0xF0000 to 0xFFFFF. There are compiler options to choose between small, medium, or large memory models; I have medium selected. My question is, how does the compiler differentiate between a code/flash address and a RAM address?
Say, for example, I have a 1-byte variable at RAM address 10, and a const variable at the real address 10. I did something like:
value = *((unsigned char *)10);
How would the compiler choose between the real address 10 and the (virtual?) address 10? I suppose if I wanted to specify the value at real address 10 I would use:
value = *((const unsigned char *)10);
?
Also, can you explain the following code which I believe is related to the answer:
uint32_t var32;      // 32-bit unsigned integer.
unsigned char *ptr;  // 2-byte pointer.

ptr = (unsigned char *)5;
var32 = (uint32_t)ptr;
printf("%lu", var32);
The code prints 983045 (0xF0005 in hex). It seems unrealistic: how can a 16-bit variable hold a value greater than what 16 bits can store?
Read your compiler's documentation to find out details about each memory model.
It may have various sorts of pointer, e.g. char near * being 2-byte, and char far * being 4-byte. Alternatively (or as well as), it might have instructions for changing code pages which you'd have to manually invoke.
how can a 16 bit variable return a value greater than what 16 bits can store?
It can't. Your code converts the pointer to a 32-bit int, and 0xF0005 fits in a 32-bit int. Based on your description, I'd guess that char * only points into the data area, and you would use a different sort of pointer to point into the code area.
I tried to comment on Matt's answer but my comment was too long, and I think it might be an answer, so here's my comment:
I think this is an answer, though I'm really looking for more details. I've read the manual but it doesn't have much information on the topic. You are right, the compiler has near/far keywords you can use to manually specify the address type. I guess the C compiler knows whether a variable is a near or far pointer, and if it's a near pointer it generates instructions that map the 2-byte near pointer to a real address; these generated mapping instructions are opaque to the C programmer. That would be my only guess. This is why the pointer returns a value greater than its 16-bit range: the compiler maps the address to an absolute address before it stores the value in var32. This is possible because 1) the RAM addresses begin at 0xF0000 and end at 0xFFFFF, so you can always map a near address to its absolute address by or'ing it with 0xF0000, and 2) a code (far) pointer never overlaps a near pointer or'd with 0xF0000. Can anyone confirm?
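If that guess is right, the widening the compiler performs would amount to something like this (a hedged sketch; the function name and the fixed RAM window are assumptions based on the addresses above):
#include <stdint.h>

/* Hypothetical: widen a 16-bit near (RAM) pointer to a 20-bit linear address,
   assuming RAM occupies 0xF0000..0xFFFFF as described above. */
static uint32_t near_to_linear(uint16_t near_addr)
{
    return 0xF0000UL | near_addr;
}
/* e.g. near_to_linear(5) == 0xF0005 == 983045, matching the printf output. */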
My first take would be to read the documentation, but as I had seen, that was already done.
So my assumption would be that you somehow got to work, for example, on a large existing codebase developed with a not too widely supported compiler on a not too well known architecture.
In such a case (after all my attempts at acquiring proper documentation failed) my take would be generating assembler output for test programs and analysing it. I did this a while ago, so it is not from thin air (it was an 8051 PL/M compiler running on an MDS-70, which was emulated by a DOS-based emulator from the late 80s, for which DOS was emulated by DOSBox - yes, and for the huge codebase we needed to maintain we couldn't get around this mess).
So build simple programs which do something with some pointers, compile them without optimizations to assembly (or request an assembly dump, whatever the compiler can do for you), and understand the output; a minimal probe is sketched below. Try to cover all the pointer types and memory models your compiler knows of. It will clarify what is happening, and the existing documentation will hopefully also help once you understand its gaps this way. Finally, don't stop at understanding just enough for the immediate problem; try to document the gaps properly, so later you won't need to redo the experiments to figure out things you had once almost worked out.
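As a hedged sketch, a probe program of the sort described might look like this; compile it to assembly (many compilers have a -S flag or an assembly-listing option) and inspect how each access is translated:
/* Probe: exercise pointers into RAM and into const (flash) data, so the
   generated assembly reveals which addressing mode the compiler picks. */
static unsigned char ram_var = 0x55;
static const unsigned char flash_var = 0xAA;

unsigned char read_ram(void)   { return *(volatile unsigned char *)&ram_var; }
unsigned char read_flash(void) { return flash_var; }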

What is overalignment of execution regions and input sections?

I came across code similar to the following today and I am curious as to what is actually happening:
#pragma pack(1)
__align(2) static unsigned char multi_array[7][24] = { 0 };
__align(2) static unsigned char another_multi_array[7][24] = { 0 };
#pragma pack()
When searching for a reference to the __align keyword in the Keil compiler, I came across this:
Overalignment of execution regions and input sections There are situations when you want to overalign code and data sections... If you have access to the original source code, you can do this at compile time with the __align(n) keyword...
I do not understand what is meant by "overaligning code and data sections". Can someone help to clarify how this overalignment occurs?
The compiler will naturally "align" data based on the needs of the system. For example, on a typical 32-bit system, a 32-bit integer should always be a single 4-byte word (as opposed to being partly in one word and partly on the next), so it will always start on a 4-byte-word boundary. (This mostly has to do with the instructions available on the processor. A system is very likely to have an instruction to load a single word from memory into a register, and much less likely to have a single instruction to load an arbitrary sequence of four adjacent bytes into a register.)
The compiler normally does this by introducing gaps in the data; for example, a struct with a char followed by a 32-bit int, on such a system, would require eight bytes: one byte for the char, three bytes of filler so the int is aligned right, and four bytes for the int itself.
To "overalign" the data is to request greater alignment than the compiler would naturally provide. For example, you might request that a 32-bit integer start on an 8-byte boundary, even on a system that uses 4-byte words. (One major reason to do this would be if you're aiming for byte-level interoperability with a system that uses 8-byte words: if you pass structs from one system to the other, you want the same gaps in both systems.)
Overalignment is when data is aligned to more than its default alignment. For example, a 4-byte int usually has a default alignment of 4 bytes (meaning the address will be divisible by 4).
The default alignment of a datatype is quite-often (but not always) the size of the datatype.
Overalignment allows you to increase this alignment to something greater than the default.
As for why you would want to do this:
One reason for this is to be able access the data with a larger datatype (that has a larger alignment).
For example:
char buffer[16];
int *ptr = (int *)buffer;
ptr[0] = 1;
ptr[1] = 2;
By default, buffer is only required to be aligned to 1 byte. However, int requires 4-byte alignment. If buffer isn't aligned to 4 bytes, you will get a misalignment exception. (AFAIK, ARM doesn't allow misaligned memory access... x86/64 usually does, but with a performance penalty.)
__align() will let you force the alignment higher to make it work:
__align(4) char buffer[16];
A similar situation appears when using SIMD instructions: you will be accessing a smaller datatype through a larger SIMD datatype, which will likely require a larger alignment. For instance:
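(A hedged sketch using this thread's Keil-style __align; with GCC/Clang the equivalent would be __attribute__((aligned(16))).)
/* 16 chars, but forced onto a 16-byte boundary so that a 128-bit SIMD
   load of the whole buffer is guaranteed to be aligned. */
__align(16) static unsigned char simd_buffer[16];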
By overalign, Keil mean nothing more complex than aligning an object to a larger alignment boundary than the data type requires.
See the documentation for __align: "You can only overalign. That is, you can make a two-byte object four-byte aligned but you cannot align a four-byte object at 2 bytes."
In the case of the linker, you can force an extra alignment onto sections within other binary modules using the ALIGNALL or OVERALIGN directives. This may be useful for performance reasons, but isn't a common scenario.
