What guarantees about low-order bits does malloc provide?

When you call C's malloc, is there any guarantee about what the first few low order bits will be? If you're writing a compiler/interpreter for a dynamic language but want to have fixnums of the form bbbbbbbb bbbbbbbb . . . bbbbbbb1 (where b is a bit) and pointers of the form bbbbbbbb bbbbbbbb . . . bbbbbbb0 (or vice versa), is there any way to guarantee that malloc will return pointers that fit with such a scheme?
Should I just allocate two more bytes than I need, increment the return value by one if necessary to fit with the bit scheme, and store the actual pointer returned by malloc in the second byte so that I know what to free?
Can I just assume that malloc will return a pointer with zero as the final bit? Can I assume that an x86 will have two zero bits at the end and that an x64 will have four zero bits at the end?

C doesn't guarantee anything about the low-order bits being zero, only that the pointer is suitably aligned for all possible types. In practice, the 2 or 3 lowest bits will probably be zero, but do not count on it.
You can ensure it yourself, but a better way is to use something like posix_memalign.
If you want to do it yourself you need to overallocate memory and keep track of the original pointer. Something like (assuming you want 16-byte alignment, could be made generic, not tested):
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

void *my_aligned_malloc16(size_t sz) {
    void *p = malloc(sz + 15 + 16); /* 15 to ensure alignment, 16 for the payload */
    if (!p) return NULL;
    uintptr_t aligned_addr = ((uintptr_t)p + 15) & ~(uintptr_t)15;
    void *aligned_ptr = (void *)aligned_addr;
    memcpy(aligned_ptr, &p, sizeof p);  /* save the original pointer as the payload */
    return (void *)(aligned_addr + 16); /* return the aligned address past the payload */
}

void my_aligned_free16(void *ptr) {
    void **original_pointer = (void **)((uintptr_t)ptr - 16);
    free(*original_pointer);
}
As you can see this is rather ugly, so prefer something like posix_memalign. Your runtime probably has a similar function if that one is unavailable, e.g. memalign (thanks @R..) or _aligned_malloc when using MSVC.
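For comparison, here is how the posix_memalign route looks. A minimal sketch; note that posix_memalign is POSIX, not ISO C, so availability depends on your platform:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p = NULL;
    /* The alignment must be a power of two and a multiple of sizeof(void *). */
    if (posix_memalign(&p, 16, 100) != 0)
        return 1;        /* allocation failed; p was not modified */
    printf("%p\n", p);   /* the low 4 bits of this address are zero */
    free(p);             /* memory from posix_memalign is released with plain free() */
    return 0;
}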

Malloc is guaranteed to return a pointer with correct alignment for any data type, which on some architectures means four-byte boundaries and on others eight. You can imagine an environment where the alignment would be one, though. Since C11, the strictest alignment any standard type requires is exposed as _Alignof(max_align_t) in <stddef.h>.
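A quick check of that value, assuming a C11 compiler:

#include <stddef.h>
#include <stdio.h>

int main(void) {
    /* malloc results are suitably aligned for any standard type,
       i.e. to at least _Alignof(max_align_t). */
    printf("malloc alignment is at least %zu bytes\n",
           _Alignof(max_align_t));
    return 0;
}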

My answer may be slightly off-topic, but my cockpit alarms are going off like mad -- "Pull Up! Pull Up!"
This scheme sounds much too dependent on the hardware architecture and on deep implementation details of the compiler and the C libraries. C lets you get down to this level of the nitty-gritty bits and bytes, and peek at the bit pattern and alignment of objects, in order to write device drivers and the like. You, on the other hand, are building a high-level entity, an interpreter and runtime. C can do this as well, but you don't have to use C's low-level bit-twiddling capabilities; IMHO, you should avoid doing so. Been there, done that, have the holes in the feet to show for it.
If you are successful in creating your interpreter, you will want to port it to a different platform, where the rules about bit representation and alignment can differ, perhaps radically. Coming up with a design that does not depend on such tricky bit manipulation and peeking under the covers will help you immensely down the road.
-k

Related

Typecasting Arrays for Variable Width Access

Sorry, I am not sure if I wrote the title accurately.
But first, here are my constraints:
Array[], used as a register map, is declared as an unsigned 8-bit array (uint8_t); this is so that indexing (the offset) is per byte.
Data to be read/written into the array has varying width (8-bit, 16-bit, 32-bit and 64-bit).
Memory is very limited, and speed is a must.
What are the caveats of doing the following:
uint8_t some_function(uint16_t offset_addr) // 16-bit address
{
    uint8_t Array[0x100];
    uint8_t data_byte = 0xAA;
    uint16_t data_word;
    uint32_t data_double = 0xBEEFFACE;
    // A. Storing wider data into the array
    *((uint32_t *)&Array[offset_addr]) = data_double;
    // B. Reading multiple bytes from the array
    data_word = *((uint16_t *)&Array[offset_addr]);
    return 0;
}
I know I could write the data byte by byte, but that would be slow because of the bit shifting.
Is there going to be a significant problem with this usage?
I have run this on my hardware and have not seen any problems so far, but I want to take note of potential problems this implementation might cause.
Is there going to be a significant problem with this usage?
It produces undefined behavior. Therefore, even if in practice that manifests as you intend on your current C implementation, hardware, program, and data, you might find that it breaks unexpectedly when something (anything) changes.
Even if the compiler implements the cast and dereference in the obvious way (which it is not obligated to do, because of the UB), misaligned accesses resulting from your approach will at least slow down many CPUs, and will produce traps on some.
The standard-conforming way to do what you want is this:
#include <string.h>

uint8_t some_function(uint16_t offset_addr) {
    uint8_t Array[0x100];
    uint8_t data_byte = 0xAA;
    uint16_t data_word;
    uint32_t data_double = 0xBEEFFACE;
    // A. Storing wider data into the array
    memcpy(Array + offset_addr, &data_double, sizeof data_double);
    // B. Reading multiple bytes from the array
    memcpy(&data_word, Array + offset_addr, sizeof data_word);
    return 0;
}
This is not necessarily any slower than your version, and it has defined behavior as long as you do not overrun the bounds of your array.
This is probably fine. Many have done things like this. C performs well with this kind of thing.
Two things to watch out for:
Buffer overruns. You know those zero-days like EternalBlue and attacks like WannaCry? Many of them exploited bugs in code like yours. Malicious input caused the code to write too much stuff into data structures like your uint8_t Array[0x100]. Be careful. Avoid allocating buffers on the stack (as function-local variables) as you have done, because clobbering the stack is exploitable. Make them big enough. Check that you don't overrun them.
Machine byte ordering vs. network byte ordering, a.k.a. endianness. If these data structures move from machine to machine over the net, you may get into trouble.
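If the data does cross machines, one robust approach is to define the wire format explicitly and serialize byte by byte with shifts, which works regardless of host endianness. A minimal sketch; the helper names put_u32le/get_u32le are made up for illustration:

#include <stdint.h>

/* Store a 32-bit value into a byte buffer in little-endian order,
   regardless of the host's native byte order. */
static void put_u32le(uint8_t *buf, uint32_t v) {
    buf[0] = (uint8_t)(v);
    buf[1] = (uint8_t)(v >> 8);
    buf[2] = (uint8_t)(v >> 16);
    buf[3] = (uint8_t)(v >> 24);
}

/* Read it back the same way. */
static uint32_t get_u32le(const uint8_t *buf) {
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}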

Why it's not recommended to use pointers for array access in C

I'm learning C programming and I came across this tutorial online, which states that you should always prefer using the [] operator over pointer arithmetic as much as possible.
https://www.cs.swarthmore.edu/~newhall/unixhelp/C_arrays.html#dynamic
you can use pointer arithmetic (but in general don't)
Consider the following code in C:
int *p_array;
int i;
p_array = (int *)malloc(sizeof(int) * 50);
for (i = 0; i < 50; i++) {
    p_array[i] = 0;
}
What is the difference in doing it using pointer arithmetic, as in the following code (and why is it not recommended)?
int *p_array;
int i;
p_array = (int *)malloc(sizeof(int) * 50); // allocate 50 ints
int *dptr = p_array;
for (i = 0; i < 50; i++) {
    *dptr = 0;
    dptr++;
}
What are the cases where using pointer arithmetic can cause issues in software? Is it bad practice, or is it just that an inexperienced engineer may not be paying attention?
Since there seems to be all-out confusion on this:
In the old days, we had 16-bit CPUs; think 8088, 286, etc.
To formulate an address you had to load your segment register (a 16-bit register) and your address register. If accessing an array, you could load the array base into the segment register, and the address register would be the index.
C compilers for these platforms did exist, but pointer arithmetic involved checking the address for overruns and bumping the segment register if necessary (inefficient). Flat-addressed pointers simply weren't possible in hardware.
Fast forward to the 80386. Now we have a full 32-bit space, so hardware pointers are possible. Index + base addressing incurs a one-clock-cycle penalty. The segments are also 32-bit, though, so arrays can be addressed through segments, avoiding this penalty even when running in 32-bit mode. The 386 also increases the number of segment registers by two. (No idea why Intel thought that was a good idea.) There was still a lot of 16-bit code around, though.
These days, segment registers are disabled in 64-bit mode, and base + index addressing is free.
Is there any platform where a flat pointer can outperform array addressing in hardware? Well, yes: the Motorola 68000, released in 1979, has a flat 32-bit address space, no segments, and its base + index addressing mode incurs an 8-clock-cycle penalty over immediate addressing. So if you're programming an early-'80s Sun workstation, Apple Lisa, etc., this might be relevant.
In short: if you want an array, use an array; if you want a pointer, use a pointer. Don't try to outsmart your compiler. Convoluted code that turns arrays into pointers is exceedingly unlikely to provide any benefit, and may be slower.
This code is not recommended:
int *p_array;
int i;
p_array = (int *)malloc(sizeof(int) * 50); // allocate 50 ints
int *dptr = p_array;
for (i = 0; i < 50; i++) {
    *dptr = 0;
    dptr++;
}
because 1) for no reason you have two different pointers that point to the same place, 2) you don't check the result of malloc() -- it's known to return NULL occasionally, 3) the code is not easy to read, and 4) it's easy to make a silly mistake that is very hard to spot later on.
All in all, I'd recommend to use this instead:
int array[50] = { 0 }; // make sure it's zero-initialized
int* p_array = array; // if you must =)
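If the array has to live on the heap after all, note that calloc (unlike malloc) zero-initializes, which removes the loop entirely. A minimal sketch:

#include <stdlib.h>

int *p_array = calloc(50, sizeof *p_array); /* 50 ints, all zero */
if (p_array == NULL) {
    /* handle allocation failure */
}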
In your example, without compiler optimizations, pointer arithmetic may be more efficient, because it is easier to just increment a pointer than to calculate a new offset in every single loop iteration. However, most modern CPUs are optimized in such a way that accessing memory with an offset does not incur a (significant) performance penalty.
Even if you happen to be programming on a platform in which pointer arithmetic is faster, then it is likely that, if you activate compiler optimizations ("-O3" on most compilers), the compiler will use whatever method is fastest.
Therefore, it is mostly a matter of personal preference whether you use pointer arithmetic or not.
Code using array indexing instead of pointer arithmetic is generally easier to understand and less prone to errors.
Another advantage of not using pointer arithmetic is that pointer aliasing may be less of an issue (because you are using fewer pointers). That way, the compiler may have more freedom to optimize your code (making it faster).

Can I cast pointers like this?

Code:
unsigned char array_add[8] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
...
if ((*((uint32_t *)array_add) != 0) || (*((uint32_t *)array_add + 1) != 0))
{
    ...
}
I want to check if the array is all zero. So naturally I thought of casting the address of the array, which also happens to be the address of its first member, to an unsigned 32-bit integer type, so I'd only need to do this twice, since it's a 64-bit, 8-byte array. The problem is, it compiled successfully, but the program crashes every time around here.
I'm running my program on an 8-bit microcontroller, a Cortex-M0.
How wrong am I?
In theory this could work, but in practice there is a thing you aren't considering: aligned memory accesses.
If a uint32_t requires aligned memory access (e.g. to 4 bytes), then casting an array of unsigned char, which has a 1-byte alignment requirement, to uint32_t * produces a pointer to a potentially misaligned uint32_t.
According to documentation:
There is no support for unaligned accesses on the Cortex-M0 processor. Any attempt to perform an unaligned memory access operation results in a HardFault exception.
In practice this is just dangerous and fragile code which invokes undefined behavior in certain circumstances, as pointed out by Olaf and better explained here.
To test multiple bytes at once, code could use memcmp().
How speedy this is depends on the compiler, as an optimizing compiler may simply emit code that does a quick 8-byte (or two 4-byte) compare. Even memcmp() might not be too slow on an 8-bit processor. Profiling the code helps.
Take care with micro-optimizations, as they are too often not an efficient use of a coder's time compared to more significant optimizations.
unsigned char array_add[8] = ...
const unsigned char array_zero[8] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
if (memcmp(array_zero, array_add, 8) == 0) ...
Another method uses a union. Be careful not to assume that add.array8[0] is the most or the least significant byte.
union {
    uint8_t array8[8];
    uint64_t array64;
} add;
// The code below checks whether all 8 elements of add.array8[] are zero.
if (add.array64 == 0)
In general, focus on writing clear code and reserve such small optimizations to very select cases.
I am not sure, but if your array has 8 bytes, then just assign the base address to a long long variable and compare it to 0. That should solve your problem of checking whether the array is all 0.
Edit 1: After Olaf's comment I would say: replace long long with int64_t. However, why not use a simple loop to iterate over the array and check? 8 chars is all you need to compare.
Edit 2: The other approach could be to OR all elements of the array and then compare the result with 0. If all are 0, then the OR will be zero. I do not know whether CMP or OR will be faster; please refer to the Cortex-M0 docs for exact CPU cycle requirements. However, I would expect CMP to be slower.
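The OR-accumulation idea from Edit 2 would look like this (a sketch using array_add from the question; whether it beats memcmp() is something only profiling on the target will tell you):

#include <stdint.h>

uint8_t acc = 0;
for (int i = 0; i < 8; i++)
    acc |= array_add[i];   /* accumulate all bytes */
if (acc == 0) {
    /* every element of array_add is zero */
}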

How do I force the program to use unaligned addresses?

I've heard that reads and writes of aligned ints are atomic and safe. I wonder when the system makes non-malloc'd globals unaligned, other than in packed structures and in byte buffers accessed via casts/pointer arithmetic?
[x86-64 Linux] In all of my normal cases, the system always chooses integer locations that don't get word-torn (torn meaning, for example, two bytes in one word and the other two bytes in the next). Can anyone post a program/snippet (C or assembly) that forces a global variable onto an unaligned address, such that the integer gets torn and the system has to use two reads to load one integer value?
When I run the program below, the addresses are close to each other, such that multiple variables sit within 64 bits, but not once is word tearing seen (smartness in the system or the compiler?)
#include <stdio.h>

int a;
char b;
char c;
int d;
int e = 0;

int isaligned(void *p, int N)
{
    if (((int)p % N) == 0) /* note: casting a pointer to int truncates it on 64-bit */
        return 1;
    else
        return 0;
}

int main()
{
    printf("processor is %zu byte mode \n", sizeof(int *));
    printf("a=%p/b=%p/c=%p/d=%p/f=%p\n",
           (void *)&a, (void *)&b, (void *)&c, (void *)&d, (void *)&e);
    printf(" check for 64bit alignment of test result of 0x80 = %d \n", isaligned((void *)0x80, 64));
    printf(" check for 64bit alignment of a result = %d \n", isaligned(&a, 64));
    printf(" check for 64bit alignment of d result = %d \n", isaligned(&e, 64));
    return 0;
}
Output:
processor is 8 byte mode
a=0x601038/b=0x60103c/c=0x60103d/d=0x601034/f=0x601030
check for 64bit alignment of test result of 0x80 = 1
check for 64bit alignment of a result = 0
check for 64bit alignment of d result = 0
How does a read of a char happen in the above case? Does it read from an 8-byte-aligned boundary (in my case 0x601030) and then go to 0x60103c?
Memory access granularity is always the word size, isn't it?
Thx.
1) Yes, there is no guarantee that unaligned accesses are atomic, because [at least sometimes, on certain types of processors] the data may be written as two separate writes - for example, if you cross a memory page boundary [I'm not talking about 4 KB pages for virtual memory; I'm talking about DDR2/3/4 pages, which are some fraction of the total memory size, typically 16 Kbits times whatever the width of the actual memory chip is - which will vary depending on the memory stick itself]. Equally, on processors other than x86, you may get a trap for reading unaligned memory, which would either cause the program to abort, or cause the read to be emulated in software as multiple reads to "fix" the unaligned read.
2) You could always make an unaligned memory region with something like this:
char *ptr = malloc(sizeof(long long) * (number + 1)); /* number of elements, defined elsewhere; one extra element of slack for the offset */
long long *unaligned = (long long *)&ptr[2];          /* 2 bytes past an aligned address => misaligned */
long long temp;
for (size_t i = 0; i < number; i++)
    temp = unaligned[i];
By the way, your alignment check checks if the address is aligned to 64 bytes, not 64 bits. You'll have to divide by 8 to check that it's aligned to 64 bits.
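A version of the check that tests 64-bit (8-byte) alignment and inspects the whole pointer might look like this (a sketch using uintptr_t from <stdint.h>):

#include <stdint.h>

int isaligned(const void *p, unsigned nbytes)
{
    /* uintptr_t preserves the full pointer value; casting to int
       would truncate it on a 64-bit platform */
    return ((uintptr_t)p % nbytes) == 0;
}

/* 64-bit alignment means 8-byte alignment, so call isaligned(&a, 8) */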
3) A char is a single byte read, and the address will be on the actual address of the byte itself. The actual memory read performed is probably for a full cache-line, starting at the target address, and then cycling around, so for example:
0x60103d is the target address, so the processor will read a cache line of 32 bytes, starting at the 64-bit word we want, 0x601038 (and as soon as that word arrives, the processor goes on to the next instruction while the remaining reads fill the cache line); the cache line is then completed with 0x601020, 0x601028, and 0x601030. But should we turn the cache off [if you want your 3 GHz latest-generation x86 processor to be slightly slower than a 66 MHz 486, disabling the cache is a good way to achieve that], the processor would just read one byte at 0x60103d.
4) Not on x86 processors; they have byte addressing - but for normal memory, reads are done on a cache-line basis, as explained above.
Note also that "may not be atomic" is not at all the same as "will not be atomic" - so you'll probably have a hard time making it go wrong on purpose. You really need to get the timing of two different threads just right, straddle cache lines, straddle memory page boundaries, and so on, to make it go wrong - it will happen when you don't want it to, but trying to make it go wrong can be darn hard [trust me, I've been there, done that].
It probably doesn't, outside of those cases.
In assembly it's trivial. Something like:
.org 0x2
myglobal:
.word SOME_NUMBER
But on Intel, the processor can safely read unaligned memory. It might not be atomic, but that might not be apparent from the generated code.
Intel, right? The Intel ISA has single-byte read/write opcodes. Disassemble your program and see what it's using.
Not necessarily - you might have a mismatch between memory word size and processor word size.
1) This answer is platform-specific. In general, though, the compiler will align variables unless you force it to do otherwise.
2) The following will require two reads to load one variable when run on a 32-bit CPU:
uint64_t huge_variable;
The variable is larger than a register, so it will require multiple operations to access. You can also do something similar by using packed structures:
struct __attribute__((packed)) unaligned
{
    char buffer[2];
    int unaligned;
    char buffer2[2];
} sample_struct;
3) This answer is platform-specific. Some platforms may behave like you describe. Some platforms have instructions capable of fetching a half-register or quarter-register of data. I recommend examining the assembly emitted by your compiler for more details (make sure you turn off all compiler optimizations first).
4) The C language allows you to access memory with byte-sized granularity. How this is implemented under the hood and how much data your CPU fetches to read a single byte is platform-specific. For many CPUs, this is the same as the size of a general-purpose register.
The C standards guarantee that malloc(3) returns a memory area that complies with the strictest alignment requirements, so this just can't happen in that case. If there is unaligned data, it is probably read/written in pieces (that depends on the exact guarantees the architecture provides).
On some architectures unaligned access is allowed, on others it is a fatal error. When allowed, it is normally much slower than aligned access; when not allowed the compiler must take the pieces and splice them together, and that is even much slower.
Characters (really bytes) are normally allowed to have any byte address. The instructions working with bytes just get/store the individual byte in that case.
No, memory access is according to the width of the data. But real memory access is in terms of cache lines (read up on CPU cache for this).
Non-aligned objects can never come into existence without you invoking undefined behavior. In other words, there is no sequence of actions, all having well-defined behavior, which a program can take that will result in a non-aligned pointer coming into existence. In particular, there is no portable way to get the compiler to give you misaligned objects. The closest thing is the "packed structure" many compilers have, but that only applies to structure members, not independent objects.
Further, there is no way to test alignedness in portable C. You can use the implementation-defined conversions of pointers to integers and inspect the low bits, but there is no fundamental requirement that "aligned" pointers have zeros in the low bits, or that the low bits after conversion to integer even correspond to the "least significant" bits of the pointer, whatever that would mean. In other words, conversions between pointers and integers are not required to commute with arithmetic operations.
If you really want to make some misaligned pointers, the easiest way to do it, assuming alignof(int)>1, is something like:
char buf[2 * sizeof(int) + 1];
int *p1 = (int *)buf, *p2 = (int *)(buf + sizeof(int) + 1);
It's impossible for both buf and buf+sizeof(int)+1 to be simultaneously aligned for int if alignof(int) is greater than 1. Thus at least one of the two (int *) casts gets applied to a misaligned pointer, invoking undefined behavior, and the typical result is a misaligned pointer.

Map memory to another address

X86-64, Linux, Windows.
Consider that I'd want to make some sort of "free lunch for tagged pointers". Basically, I want to have two pointers that point to the same actual memory block but whose bits are different. (For example, I want one bit to be used by GC collection, or for some other reason.)
intptr_t ptr = (intptr_t)malloc(...);          // pseudocode
intptr_t ptr2 = map(ptr | GC_FLAG_REACHABLE);  // some magic call
int *p = (int *)ptr;
int *p2 = (int *)ptr2;
*p = 10;
*p2 = 20;
assert(*p == 20);
assert(p != p2);
On Linux, mmap() the same file twice. Same thing on Windows really, but it has its own set of functions for that.
Mapping the same memory (mmap on POSIX as Ignacio mentions, MapViewOfFile on Windows) to multiple virtual addresses may provide you some interesting coherency puzzles (are writes at one address visible when read at another address?). Or maybe not. I'm not sure what all the platform guarantees are.
More commonly, one simply reserves a few bits in the pointer and shifts things around as necessary.
If all your objects are aligned to 8-byte boundaries, it's common to simply store tags in the 3 least-significant bits of a pointer and mask them off before dereferencing (as thkala mentions). If you choose a higher alignment, such as 16 bytes or 32 bytes, then there are 4 or 5 least-significant bits that can be used for tagging. Equivalently, choose a few most-significant bits for tagging, and shift them off before dereferencing. (Sometimes non-contiguous bits are used, for example when packing pointers into the signalling NaNs of IEEE-754 floats (2^23 values) or doubles (2^51 values).)
Continuing on the high end of the pointer, current implementations of x86-64 use at most 48 bits out of a 64-bit pointer (0x0000000000000000-0x00007fffffffffff + 0xffff800000000000-0xffffffffffffffff) and Linux and Windows only hand out addresses in the first range to userspace, leaving 17 most-significant bits that can be safely masked off. (This is neither portable nor guaranteed to remain true in the future, though.)
Another approach is to stop considering "pointers" and simply use indices into a larger memory array, as the JVM does with -XX:+UseCompressedOops. If you've allocated a 512 MB pool and are storing 8-byte-aligned objects, there are 2^26 possible object locations, so a 32-bit value has 6 bits to spare in addition to the index. A dereference will require adding the index times the alignment to the base address of the array, saved elsewhere (it's the same for every "pointer"). If you look at things carefully, this is simply a generalization of the previous technique (which always has a base of 0, where things line up with real pointers).
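A sketch of that index scheme under the stated assumptions; the pool_base variable and the 26-bit/6-bit split are illustrative, not part of any real API:

#include <stdint.h>

static uint8_t *pool_base; /* start of the 512 MB pool, set at startup */

/* A 32-bit "compressed pointer": the low 26 bits are the object index,
   leaving 6 bits free for tags. */
static inline void *decode(uint32_t cptr)
{
    uint32_t index = cptr & ((UINT32_C(1) << 26) - 1); /* strip the tag bits */
    return pool_base + ((uintptr_t)index << 3);        /* index * 8-byte alignment */
}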
Once upon a time I worked on a Prolog implementation that used the following technique to have spare bits in a pointer:
Allocate a memory area with a known alignment. malloc() usually allocates memory with a 4-byte or 8-byte alignment. If necessary, use posix_memalign() to get areas with a higher alignment size.
Since the resulting pointer is aligned to intervals of multiple bytes, but it represents byte-accurate addresses, you have a few spare bits that will by definition be zero in the memory area pointer. For example a 4-byte alignment gives you two spare bits on the LSB side of the pointer.
You OR (|) your flags with those bits and now have a tagged pointer.
As long as you take care to properly mask the pointer before using it for memory access, you should be perfectly fine.
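As a sketch, those three steps might translate into C like this (assuming 8-byte alignment, so the low 3 bits are spare; TAG_REACHABLE is a made-up flag name):

#include <stdint.h>

#define TAG_MASK      ((uintptr_t)0x7) /* 8-byte alignment leaves 3 spare bits */
#define TAG_REACHABLE ((uintptr_t)0x1) /* hypothetical GC flag */

static inline void *tag_ptr(void *p, uintptr_t tag)
{
    return (void *)((uintptr_t)p | (tag & TAG_MASK)); /* OR the flags into the spare bits */
}

static inline void *untag_ptr(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK); /* mask before any memory access */
}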
