Processor heap read and prefetch - c

So I'm trying to figure out how reading from the heap can slow down the processor's prefetching. This is a purely theoretical question, so in the example I use a C-like wonder-language.
Let's suppose we have a heap of 120 bytes, some of which is already in use by the program:
[0...19, /* FREE: 20...39 */, 40...79, /* FREE TILL THE END (119) */]
and I have some structs that are magically sized and aligned so each takes exactly 20 bytes:
#include <stdlib.h>

struct magic_struct {
    long long int foo[3];
    short int bar;
};  /* in the wonder-language this is magically exactly 20 bytes */

typedef struct magic_struct MagicStruct;

void read_magic_struct(MagicStruct *buzz) {
    /* Some code to read the struct */
}

int main(void) {
    MagicStruct *str1 = malloc(sizeof(MagicStruct));
    MagicStruct *str2 = malloc(sizeof(MagicStruct));
    read_magic_struct(str1);
    read_magic_struct(str2);
    free(str1);
    free(str2);
    return 0;
}
Now let's suppose that our processor fetches cachelines of 40 bytes.
Does that mean that with our current memory layout the processor can't prefetch str2 while reading str1, so program execution will be slowed down? How do structs get allocated if there is an empty memory buffer, or if the first empty memory chunk is 40 bytes in? Would the processor hit cache misses if the structs' size were 50 bytes? Does some mechanism decide where and when to allocate memory on the heap?

Does some mechanism decide where and when to allocate memory on the heap?
The mechanism that decides where memory is allocated on the heap is a piece of code called the "memory allocator" - a part of the C runtime library - and it doesn't care much about prefetching or anything like that. Most memory allocators do their best to keep "related" allocations close together, but only on a best-effort basis. So you can't really assume anything about how far apart those two allocations are.
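As a minimal sketch of that point (not tied to any particular allocator), you can print the addresses malloc actually hands back and see how far apart two back-to-back allocations land; the distance is purely an implementation detail:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Two back-to-back allocations; the allocator may or may not
       place them next to each other. */
    char *a = malloc(20);
    char *b = malloc(20);
    if (a && b) {
        /* We only print the raw addresses for inspection; arithmetic
           between unrelated blocks is not meaningful in portable C. */
        printf("a = %p\n", (void *)a);
        printf("b = %p\n", (void *)b);
    }
    free(a);
    free(b);
    return 0;
}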
How do structs get allocated if there is an empty memory buffer, or if the first empty memory chunk is 40 bytes in?
In any way that is possible: you can't know until you read the source code of the memory allocator used by the C runtime library on your system (e.g. glibc on most Linux systems). It's essentially arbitrary, and in general you can't predict how those structs will get allocated. Also, in real life the heap doesn't have to be contiguous, i.e. it doesn't have to be a single big memory block; it will usually be many memory blocks, with gaps between them.
So let's suppose that our processor fetches cachelines of 40 bytes. Does that mean that with our current memory layout the processor can't prefetch str2 while reading str1, so program execution will be slowed down?
Assuming the heap layout you proposed, the only valid statement is as follows: the cacheline containing str1 does not contain str2. That's all you can say, and no more. It tells you nothing about prefetch, because prefetch has to do with fetching other cachelines ahead of time, before they are needed.
I think you're misusing the term "prefetch", because the processor only exchanges data with memory in terms of complete cachelines. Prefetch does not mean that, when fetching a single cacheline, some other useful data happened to be inside that cacheline. That is spatial locality, and it's a property of how you lay out data in your program. If you need such fine control, you need to write your own memory allocator (even if it's a simple one).
Prefetch means that the processor has a system that monitors the cachelines being fetched, anticipates the need for another cacheline, and fetches it before you use it. Prefetch can be triggered by explicit prefetch machine instructions, if you want to control it to that extent, or by the prefetch algorithms implemented in the processor. Modern processors are extremely good at detecting sequential access that's contiguous or even separated by repeating gaps.
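As an example of the explicit kind, GCC and Clang expose prefetch through the __builtin_prefetch intrinsic. The sketch below asks for a cacheline a few iterations ahead of the current one; the distance of 16 elements is an arbitrary illustrative value, and whether any of this helps must be measured:

#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* illustrative; the right distance is machine-specific */

long sum_with_prefetch(const long *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* arguments: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
        }
        total += data[i];
    }
    return total;
}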
It's certainly good to ponder such things, but you have to understand that processors are pretty darn good, and if you think that prefetch may be your problem, you need a very good argument for why that is so, and normally you need actual measurements. Talking about optimizing prefetch without measurements showing that the processor is waiting for cachelines to come in from memory is a waste of time.

So, in practical terms, if you want to explore this with real prefetch on real processors (as opposed to imaginary stuff), you'll have to install e.g. Intel VTune Profiler, or use Valgrind's cachegrind (on Linux), run your program under those tools, interpret the results, pinpoint problems, address them by changing data layout or using explicit prefetch machine instructions (or compiler intrinsics), and then validate your solution by seeing improved cache hit rates. Ideally this should be automated so you don't get performance regressions, i.e. you'd run all this instrumentation and result interpretation as scripts under the continuous integration system, so you know you won't accidentally break things.

But that takes a lot of work, and if you're starting from scratch (a new project), even assuming you are minimally fluent in scripting, CI systems, and C, it could still take hundreds of hours to fully set up and shake down until it's trustworthy. The guiding principle is:
Lack of measurement is prima-facie evidence that you don't care about the result
In other words, if you actually care about some aspect of performance, then the only way to show that care is to measure it, and to make those measurements part of the normal workflow, i.e. once they are set up, you don't even worry about them: you check in some new code, and if you broke performance, the performance tests fail and you have to fix them before you're allowed to merge. That's how it's normally done in environments where performance matters. Otherwise, the only interpretation is that you don't really care, since without automated measurements it is almost certain that some "minor" code change will affect performance and slip through unnoticed. Once you're told this, there's no going back, sorry :)

Related

Is there a performance cost to using large mmap calls that go beyond expected memory usage?

For initializing data structures that are both persistent for the duration of the program and require a dynamic amount of memory, is there any reason not to mmap an upper bound from the start?
An example is an array that will persist for the program's entire life but whose final size is unknown. The approach I am most familiar with is something along the lines of:
type * array = malloc(size);
and when the array has reached capacity doubling it with:
array = realloc(array, 2 * size);
size *= 2;
I understand this is probably the best way to do it if the array might be freed mid-execution so that its VM can be reused, but if it is persistent, is there any reason not to just initialize the array as follows:
array = mmap(0,
             huge_size,
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
             -1, 0);
so that the elements never need to be copied.
Edit: Specifically for an OS that uses on-demand paging.
Don't try to be smarter than the standard library, unless you 100% know what you are doing.
malloc() already does this for you. If you request a large amount of memory, malloc() will mmap() you a dedicated memory area. If what you are concerned about is the performance hit coming from doing size *= 2; realloc(old, size), then just malloc(huge_size) at the beginning and keep track of the actual used size in your program. There really is no point in doing an mmap() unless you explicitly need it for some specific reason: it isn't faster or better in any particular way, and if malloc() thinks it's needed, it will do it for you.
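A minimal sketch of that suggestion, assuming an overcommitting OS (the names huge_size, big_buffer and big_buffer_push are illustrative, not from any library):

#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;
    size_t used;      /* bytes actually in use */
    size_t capacity;  /* bytes reserved up front */
} big_buffer;

/* Reserve the upper bound once; untouched pages cost only virtual
   address space until they are actually written. */
static int big_buffer_init(big_buffer *b, size_t huge_size) {
    b->data = malloc(huge_size);
    if (!b->data) return -1;
    b->used = 0;
    b->capacity = huge_size;
    return 0;
}

/* Append len bytes; no realloc, hence no copying of existing elements. */
static void *big_buffer_push(big_buffer *b, const void *src, size_t len) {
    if (b->capacity - b->used < len) return NULL;  /* reserved space exhausted */
    void *dst = b->data + b->used;
    memcpy(dst, src, len);
    b->used += len;
    return dst;
}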
It's fine to allocate upper bounds as long as:
You're building a 64-bit program: 32-bit ones have restricted virtual address space, even on 64-bit CPUs
Your upper bounds don't approach 2^47, as a mathematically derived one might
You're fine with crashing as your out-of-memory failure mode
You'll only run on systems where overcommit is enabled
As a side note, an end user application doing this may want to borrow a page from GHC's book and allocate 1TB up front even if 10GB would do. This unrealistically large amount will ensure that users don't confuse virtual memory usage with physical memory usage.
If you know for a fact that wasting a chunk of memory (most likely an entire page, which is typically 4096 bytes) will not cause your program or the other programs running on your system to run out of memory, AND you know for a fact that your program will only ever be compiled and run on UNIX machines, then this approach is not incorrect, but it is not good programming practice for the following reasons:
The <stdlib.h> header you #include to use malloc() and free() in your C programs is specified by the C standard, but it is implemented specifically for your platform by the writers of your C library and operating system. This means that your specific system was kept in mind when these functions were written, so finding a sneaky way to improve the efficiency of memory allocation is unlikely unless you know the inner workings of memory management in your OS better than those who wrote it.
Furthermore, the <sys/mman.h> header you #include to mmap() things is not part of the C standard and will only compile on UNIX machines, which reduces the portability of your code.
There's also a really good chance (assuming a UNIX environment) that malloc() and realloc() already use mmap() behind-the-scenes to allocate memory for your process anyway, so it's almost certainly better to just use them. (read that as "realloc doesn't necessarily actively allocate more space for me, because there's a good chance there's already a chunk of memory that my process has control of that can satisfy my new memory request without calling mmap() again")
Hope that helps!

Are stack float array ops faster than heap float ops on modern x86 systems?

On C float (or double) arrays small enough to fit in the L1 or L2 cache (about 16k), and whose size I know at compile time, is there generally a speed benefit to defining them within the function they are used in, so they are stack variables? If so, is it a large difference? I know that in the old days heap variables were much slower than stack ones, but nowadays, with the far more complicated structure of CPU addressing and caches, I don't know if this is still true.
I need to do repeated runs of floating point math over these arrays in 'chunks', over and over again over the same arrays (about 1000 times), and I wonder if I should define them locally. I imagine keeping them in the closest/fastest locations will let me iterate over them repeatedly much faster, but I don't understand the implications of caching in this scenario. Perhaps the compiler or CPU is clever enough to realize what I am doing and make these data arrays highly local on the hardware during the inner processing loops without my intervention, and perhaps it does a better job at this than I can.
Maybe I risk running out of stack space if I allocate large arrays this way? Or is stack space not hugely limited on modern systems? The array size can be defined at compile time, and I only need one array and one CPU, as I need to stick to a single thread for this work.
It is the allocation and deallocation speed that may make a difference.
Allocating on the stack is just subtracting the required size from the stack pointer, which is normally done for all local variables once upon function entry anyway, so it is essentially free (unless alloca is used). Same applies to deallocating memory on the stack.
Allocating on the heap requires calling malloc or new which ends up executing an order of magnitude more instructions. Same applies to free and delete.
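To make the distinction concrete, here is a sketch of the two variants (the array size of 1024 floats is arbitrary):

#include <stdlib.h>

#define N 1024

void use_stack_array(void) {
    float a[N];                 /* "allocation" is just part of the function prologue */
    for (int i = 0; i < N; i++) a[i] = (float)i;
    /* released automatically when the function returns */
}

void use_heap_array(void) {
    float *a = malloc(N * sizeof *a);   /* a call into the allocator */
    if (!a) return;
    for (int i = 0; i < N; i++) a[i] = (float)i;
    free(a);                            /* a second call into the allocator */
}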
There should be no difference in the speed of access to the arrays once they are allocated. However, the stack is more likely to be in the CPU cache already, because previous function calls have already used the same stack memory region for local variables.
If your architecture employs Non-uniform memory access (NUMA), there can be a difference in access speed to different memory regions when your thread gets re-scheduled to run on a different CPU from the one that originally allocated the memory.
For in-depth treatment of the subject have a read of What Every Programmer Should Know About Memory.
The answer is: probably not.
On a modern processor such as the i7 the L1/L2/L3 cache sizes are 64K/1MB/8MB, shared across 4x2 cores. Your numbers are a bit off.
The biggest thing to worry about is parallelism. If you can get all 8 cores running 100% that's a good start.
There is no difference between heap and stack memory, it's just memory. Heap allocation is way slower than stack allocation, but hopefully you don't do much of that.
Cache coherency matters. Cache prefetching matters. The order of accessing things in memory matters. Good reading here: https://software.intel.com/en-us/blogs/2009/08/24/what-you-need-to-know-about-prefetching.
But all this is useless until you can benchmark. You can't improve what you can't measure.
Re comment: there is nothing special about stack memory. The thing that usually does matter is keeping all your data close together. If you access local variables a lot then allocating arrays next to them on the stack might work. If you have several blocks of heap memory then subdividing a single allocation might work better than separate allocations. You'll only know if you read the generated code and benchmark.
They are the same speed on average. Assuming the cache lines that the array occupies have not been touched by other code.
One thing to make sure of is that the array is at least 32-bit or 64-bit aligned (for float and double respectively), so an array element will not cross cache-line boundaries. Cache lines are 64 bytes on x86.
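With C11 you can request the alignment explicitly, as in the sketch below (the element count of 1024 is arbitrary; note that aligned_alloc requires the size to be a multiple of the alignment, and its result is released with free()):

#include <stdalign.h>
#include <stdlib.h>

#define N 1024  /* illustrative element count */

/* A static array pinned to a 64-byte cache-line boundary. */
static alignas(64) float samples[N];

/* A heap array with the same alignment. */
float *make_aligned_floats(void) {
    size_t bytes = N * sizeof(float);   /* 4096, a multiple of 64 */
    return aligned_alloc(64, bytes);
}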
Another important element is to make sure the compiler is using SSE instructions for scalar floating point operations. This should be the default for modern compilers. The legacy floating point (a.k.a 387 with 80 bit register stack) is much slower and harder to optimize.
If this memory is frequently allocated and released, try to reduce calls to malloc/free by allocating it in a pool, globally or on the stack.
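A sketch of that last suggestion, reusing one statically allocated pool instead of calling malloc/free repeatedly (the pool capacity is arbitrary, and there is deliberately no per-chunk free; the whole pool is reset between processing passes):

#include <stddef.h>

#define POOL_FLOATS 4096          /* arbitrary capacity for illustration */

static float pool[POOL_FLOATS];   /* allocated once, for the whole program */
static size_t pool_used;

/* Hand out a chunk of the pool; returns NULL when it is exhausted. */
float *pool_alloc(size_t count) {
    if (POOL_FLOATS - pool_used < count) return NULL;
    float *p = &pool[pool_used];
    pool_used += count;
    return p;
}

void pool_reset(void) { pool_used = 0; }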

C allocation and memory overhead

Apologies if this is a stupid question, but it's been kinda bothering me for a long time.
I'd like to know some details on how the memory manager knows what memory is in use.
Imagine a one-chip microcomputer with 1024B of RAM - not much to spare.
Now you allocate 100 ints - each int is 4 bytes, and each pointer is 4 bytes too (yeah, an 8-bit one-chip will most likely have smaller pointers, but whatever).
So you've just used 800B of RAM for 100 ints? But it's worse - the allocation system must somehow take note of where memory is malloc'd and where it's free - another 200 bytes or so? Or some bit marks?
If this is true, why is C favoured over assembler so often?
Is this really how it works? So super inefficient?
(Or am I having a totally incorrect idea about it?)
It may surprise younger developers to learn that greying old ones like myself used to write in C on systems with 1 or 2k of RAM.
In systems this size, dynamic memory allocation would have been a luxury that we could not afford. It's not just the pointer overhead of managing the free store, but also the effects of free store fragmentation making memory allocation inefficient and very probably leading to a fatal out-of-memory condition (virtual memory was not an option).
So we used to use static memory allocation (i.e. global variables), kept a very tight control on the depth of function call nesting, and an even tighter control over nested interrupt handling.
When writing on these systems, I didn't even link the standard library. I wrote my own C startup routines and provided custom minimal I/O routines.
One program I wrote on a 2k RAM system used the lower part of RAM as the data area and the upper part as the stack. In the final cut, I proved that the maximum stack usage reached so far down in memory that it was 1 byte away from the last variable in the data area.
Ah, the good old days...
EDIT:
To answer your question specifically, the original K&R free store manager used to add a header block to the beginning of each block of memory allocated via malloc.
The header block looked something like this:
union header {
    struct {
        union header *ptr;
        unsigned size;
    } s;
};
Where ptr is the address of the next header block and size is the size of the memory allocated (in blocks). The malloc function would actually return the address computed by &header + sizeof(header). The free function would subtract the size of the header from the pointer you provided in order to re-link the block back into the free list.
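In other words, the arithmetic looks roughly like this (a simplified sketch of that K&R scheme, not the full allocator):

#include <stddef.h>

union header {
    struct {
        union header *ptr;   /* next block on the free list */
        unsigned      size;  /* block size, in header-sized units */
    } s;
};

/* What malloc hands back: the address just past the bookkeeping header. */
static void *user_pointer(union header *block) {
    return (void *)(block + 1);
}

/* What free does first: step back to recover the header from the pointer
   the caller supplied, so the block can be relinked into the free list. */
static union header *header_of(void *user_ptr) {
    return (union header *)user_ptr - 1;
}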
There are several approaches to how you can do that:
as you write, malloc() one memory block for every int you have. Completely inefficient, thus I strike it out.
malloc() an array of 100 ints. That needs in total 100*sizeof(int) + 1* sizeof(int*) + whatever malloc() internally needs. Much better.
statically allocate the array. Here you just need 100*sizeof(int).
allocate the array on the stack. That needs the same, but only for the current function call.
Which of these you need depends on how long you need the memory and other criteria.
If you have that little RAM, it might even be questionable whether it is useful to use malloc() at all. It can be an option if several code blocks need a lot of RAM, but not at the same time.
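A sketch of the trade-off for the 100 ints from the question (option numbers refer to the list above):

#include <stdlib.h>

/* Option 3: static allocation. Costs exactly 100 * sizeof(int) in the
   data segment, with no allocator bookkeeping at all. */
static int table_static[100];

void demo(void) {
    /* Option 4: stack allocation. Same size, but the space exists only
       while this function is running. */
    int table_stack[100];
    table_stack[0] = 1;

    /* Option 2: one heap allocation for all 100 ints. Costs the array,
       one pointer, plus whatever header malloc() attaches internally. */
    int *table_heap = malloc(100 * sizeof *table_heap);
    if (table_heap) {
        table_heap[0] = 1;
        free(table_heap);
    }

    table_static[0] = 1;
}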
As for how the memory addresses are tracked, that also depends:
for malloc(), you have to put the pointer in a place where you don't lose it.
for an array on the stack, it is expressed relative to the current frame pointer. The code sets up the frame and thus knows about the offset, so it is normally not needed to store it anywhere.
for an array in the data segment, the compiler and linker know about the address and statically put the address where it is needed.
If this is true, why is C favoured over assembler so often?
You're simplifying the problem too much. C or assembler - doesn't matter, you still need to manage the memory chunks. The main issue is fragmentation, not the actual management overhead. In a system like the one you described, you would probably just allocate the memory and never ever release it, thus no need to check what's free - whatever is below the watermark is free, and that's it.
Is this really how it works? So super inefficient?
There are many algorithms around this problem, but if you're simplifying - yes, that's basically it. In reality it's a much more complicated problem, and there are much more complicated systems revolving around servicing memory, dealing with fragmentation, garbage collection (at the OS level), and so on.

What is the optimal amount to malloc at a given time when the total needed is not known?

I've implemented a multi-level cache simulator that needs to store the values currently in the simulator. With current configurations, the maximum size of all values being stored could reach 2G. Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front. Instead, I have the program set to allocate memory as needed in chunks. The expense of this allocation is exacerbated by the fact that I'm callocing in order to provide 0 values when no write has occurred previously at the specified location.
My question is, is there a good heuristic for how much memory should be allocated each time more is needed? Currently I'm using an arbitrary value, and I considered a solution that would use some ratio of the total system memory (I presume it's possible to detect this dynamically at compile time and/or runtime), but even with the latter I'd be using an arbitrary ratio, which still doesn't sit well with me.
Any insight into best practices for this kind of situation would be appreciated!
A common rule of thumb is to grow geometrically, for example by doubling, on each reallocation.
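A sketch of that doubling pattern with the realloc failure case handled and the new region zeroed, since the question relies on calloc-style zero values (the type and names are illustrative; a caller would start from a zero-initialized struct, e.g. growable g = {0};):

#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;
    size_t         used;
    size_t         capacity;
} growable;

/* Ensure room for `extra` more bytes, doubling capacity as needed.
   Returns 0 on success, -1 if the reallocation failed. */
static int growable_reserve(growable *g, size_t extra) {
    if (g->capacity - g->used >= extra) return 0;
    size_t new_cap = g->capacity ? g->capacity : 64;   /* arbitrary seed size */
    while (new_cap - g->used < extra) new_cap *= 2;
    unsigned char *p = realloc(g->data, new_cap);
    if (!p) return -1;          /* the original buffer is still valid */
    /* realloc does not zero the new bytes; clear them to mimic calloc. */
    memset(p + g->capacity, 0, new_cap - g->capacity);
    g->data = p;
    g->capacity = new_cap;
    return 0;
}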
It's best to understand allocation patterns of your program, if this is a problem you need to optimize for. This comes by understanding the program's implementation, the architecture(s) it runs within, and by observation (e.g. time and memory profiling).
The truth is, you can optimize from many perspectives, but things change over time (inputs change, environments change). In the user-land, your memory usage is already second guessed.
Given your allocation sizes, I assume you are already depending on a system which will default to a backing store as needed. As such, you don't have much control over what is paged or when. Peeking at available physical memory is not worth consideration in this case, and you will have to work hard to do better than the system's existing virtual memory implementation. Several of these systems try to use all available memory (e.g. "Unused RAM is wasted RAM").
Having said that and if those assumptions are correct: It's often better to just reduce your allocation sizes and working sets and do I/O yourself as needed.
Your OS probably uses disk caching as well; reads and writes are probably faster than you suspect for large blocks of memory.
Even deeper: Use virtual memory or memory mapped files for these large data sets. Your kernel will likely handle these cases very well.
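A sketch of the memory-mapped-file suggestion on a POSIX system (the function name and the read-only mapping are illustrative choices):

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an existing data file read-only; the kernel pages it in on demand
   and can drop clean pages under memory pressure. */
const unsigned char *map_dataset(const char *path, size_t *out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;

    *out_len = (size_t)st.st_size;
    return p;                  /* release later with munmap() */
}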
Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front.
Then you will likely be surprised to learn that a 2 GB calloc alone may be better than other alternatives people come up with in some environments because a large calloc could just reserve a domain in virtual memory, loading/initializing pages only when you access them. Depending on your usage, this approach will be much better than some alternatives you may be given.
A good starting point for many problems when understanding a program or input's allocation patterns is to start out conservative, and then make the most beneficial adjustments based on observation. In many cases, you will need little more information than a) accurately determining how much to resize by when resizing is necessary b) reusing allocations where appropriate c) designing your data well for the problem at hand.

Fragmentation-resistant Microcontroller Heap Algorithm

I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to 2 options I'm aware of, however I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a Lua script engine, which will require a range of allocation sizes (heavy on the 32 byte blocks). What matters most is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about Cortex-M and PIC32 parts, with memory ranging from 128K to 16MB (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-Double-indirection options are out because of the huge Lua code base that I don't want to fundamentally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand) require a binary tree for each order/bin to store free nodes sorted by memory address for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment). (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?
If block sizes are not rounded up to powers of two or some equivalent(*), certain sequences of allocation and deallocation will generate an essentially-unbounded amount of fragmentation even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely-related object sizes but not using a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
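For reference, the rounding a binary-buddy allocator performs looks roughly like this (a sketch; the 32-byte minimum block mirrors the question's allocation mix, and there is no overflow guard since requests on a small MCU stay far below SIZE_MAX):

#include <stddef.h>

#define MIN_BLOCK 32u   /* smallest buddy block */

/* Round a request up to the power-of-two block a buddy allocator would
   actually hand out; this bounded rounding is what buys the fragmentation
   guarantee, at the cost of internal waste. */
static size_t buddy_block_size(size_t request) {
    size_t block = MIN_BLOCK;
    while (block < request)
        block <<= 1;
    return block;
}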
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks - static allocation, free lists, allocation on the stack - can be shown to work because, at least as described to us, you have a Lua interpreter hovering in the background ready to do who knows what at run time - what if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?
