loading huge database into memory visual c - c

We have a somewhat unusual c app in that it is a database of about 120 gigabytes, all of which is loaded into memory for maximum performance. The machine it runs on has about a quarter terabyte of memory, so there is no issue with memory availability. The database is read-only.
Currently we are doing all the memory allocation dynamically, which is quite slow, but it is only done once so it is not an issue in terms of time.
We were thinking about whether it would be faster, either in startup or in runtime performance, if we were to use global data structures instead of dynamic allocation. But it appears that Visual Studio limits global data structures to a meager 4gb, even if you set the linker heap commit and reserve size much larger.
Anyone know of a way around this?

One way to do this would be to have your database as a persistent memory mapped file and then use the query part of your database to access that instead of dynamically allocated structures. It could be worth a try, I don't think performance would suffer that much (but of course will be slower).

How many regions of memory are you allocating? (1 x 120GB) or (120 Billion x 1 byte) etc.
I believe the work done when dynamically allocating memory is proportional to the number of allocated regions rather than their size.
Depending on your data and usage (elaborate and we can be more specific), you can allocate a large block of heap memory (e.g. 120 GB) once then manage that yourself.

Startup performance: If you're thinking of switching from dynamic to static global allocation, then I'd assume that you know how much you're allocating at compile time and there is a fixed number of allocations performed at runtime. I'd consider reducing the number of allocations performed, the actual call to new is the real bottleneck, not the actual allocation itself.
Runtime performance: No, it wouldn't improve runtime performance. Data structures of that size are going to end up on the heap, and subsequently in cache as they are read. To improve performance at runtime you should be aiming to improve locality of data so that data required subsequent to some you've just used, will end up on the same cache line, and paced in cache with the data you just used.
Both of these techniques I've used to great effect, efficiently ordering voxel data in 'batches', reducing the locality of data in a tree structure and reducing the number of calls to new, greatly increased the performance of a realtime renderer I worked on in a previous position. We're talking ~40GB voxel structures, possibly streaming of disk. Worked for us :).

Have you conducted an actual benchmark of your "in memory" solution versus having a well indexed read only table set on the solid state drives? Depending upon the overall solution it's entirely possible that your extra effort yields only small improvements to the end user. I happen to be aware of at least one solution approaching a half a petabyte of storage where the access pattern is completely random with an end user response time of less than 10 seconds with all data on disk.

Related

What is the optimal amount to malloc at a given time when the total needed is not known?

I've implemented a multi-level cache simulator that needs to store the values currently in the simulator. With current configurations, the maximum size of all values being stored could reach 2G. Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front. Instead, I have the program set to allocate memory as needed in chunks. The expense of this allocation is exacerbated by the fact that I'm callocing in order to provide 0 values when no write has occurred previously at the specified location.
My question is, is there a good heuristic for how much memory should be allocated each time more is needed? Currently I'm using an arbitrary value and I considered some solution that would use some ratio of the total system memory (I presume it's possible to dynamically detect this at compile and/or runtime), but even with the latter I'm using an arbitrary ratio with still doesn't sit well with me.
Any insight into best practices for this kind of situation would be appreciated!
A common rule of thumb is to grow geometrically, for example by doubling, on each reallocation.
It's best to understand allocation patterns of your program, if this is a problem you need to optimize for. This comes by understanding the program's implementation, the architecture(s) it runs within, and by observation (e.g. time and memory profiling).
The truth is, you can optimize from many perspectives, but things change over time (inputs change, environments change). In the user-land, your memory usage is already second guessed.
Given your allocation sizes, I assume you are already depending on a system which will default to a backing store as needed. As such, you don't have much control over what is paged or when. Peeking at available physical memory is not worth consideration in this case, and you will have to work hard to do better than the system's existing virtual memory implementation. Several of these systems try to use all available memory (e.g. "Unused RAM is wasted RAM").
Having said that and if those assumptions are correct: It's often better to just reduce your allocation sizes and working sets and do I/O yourself as needed.
Your OSs probably use disk caching as well; reads and writes are probably faster than you suspect for large blocks of memory.
Even deeper: Use virtual memory or memory mapped files for these large data sets. Your kernel will likely handle these cases very well.
Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front.
Then you will likely be surprised to learn that a 2 GB calloc alone may be better than other alternatives people come up with in some environments because a large calloc could just reserve a domain in virtual memory, loading/initializing pages only when you access them. Depending on your usage, this approach will be much better than some alternatives you may be given.
A good starting point for many problems when understanding a program or input's allocation patterns is to start out conservative, and then make the most beneficial adjustments based on observation. In many cases, you will need little more information than a) accurately determining how much to resize by when resizing is necessary b) reusing allocations where appropriate c) designing your data well for the problem at hand.

dealing with memory fragmentation for a simulation of dynamic memory allocation

I am working on a dynamic memory allocation simulation using a fixed sized array in C and i would like to know the best way to deal with fragmentation. My plan is to split the array into two parts, the left part reserved for small blocks and the right part reserved for big blocks. I would then use the best fit approach to find the smallest/largest memory block available to use. Is there another better approach to avoid fragmentation(where you have a bunch of blocks available throughout the array but a single one does not meet the space needed)?
The best approach depends on the modus operandi of your program (the user of your memory manager). If the usage pattern is to allocate many small fragments and delete them frequently, you don't need to be overly aggressive with defragmentation. In that case rare large block users will pay for the defragmentation operation. Similarly, if large block allocations are frequent, it might make sense to defragment more often. But the best strategy (assuming you still want to roll your own) is to program it in a general, tunable way and then measure performance impact (in fragmentation ops or otherwise) based on real program run.

Best fit vs segregated fit vs buddy system for least fragmentation

I am working on a dynamic memory allocation simulation(malloc() and free()) using a fixed sized array in C and i would like to know which of these will give the least amount of fragmentation(internal and external)? I do not care about speed, being able to reduce fragmentation is the issue i want to solve.
For the buddy system i've read that it has high internal fragmentation because most requested size are not to the power of 2.
For the best fit, the only negative i've heard of is that it is sequential so it takes a long time to search(not a problem in my case). In any case i can use a binary tree from my understanding to search instead. I could also split the fixed sized array into two where the left size is for smaller blocks and the right size is for bigger blocks. I don't know of any other negatives.
For segregated fit it is similar to the buddy system but i'm not sure if it has the same problems for fragmentation since it does not split by a power of 2.
Does anyone have some statistics or know the answer to my question from having used these before?
Controlling fragmentation is use-case dependent. The only scenario where fragmentation will never occur is when your malloc() function returns a fixed size memory chunk whenever you call it. This is more in to memory pooling. Some high-end servers often do this in an attempt to boost performance when they know what they are going to allocate and for how long they will keep that allocation.
The moment you start thinking of reducing fragmentation in a general purpose memory manager, these simple algorithms you mentioned will do no good. There are complex memory management algorithms like jemalloc, tcmalloc and Poul-Henning Kamp malloc that deal with this kind of problem. There are many whitepapers published around this. ACM and IEEE have plethora of literature around this.

Fragmentation-resistant Microcontroller Heap Algorithm

I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to 2 options I'm aware of, however I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a lua script engine, which will require a range of allocation sizes (heavy on the 32 byte blocks). The main requirement is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about a Cortex-M and PIC32 parts, with memory ranging from 128K and 16MB or memory (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-double indirect options are out because of the huge Lua code base that I don't want to fundamtnetally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand) require a binary tree for each order/bin to store free nodes sorted by memory address for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment). (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?
If block sizes are not rounded up to powers of two or some equivalent(*), certain sequences of allocation and deallocation will generate an essentially-unbounded amount of fragmentation even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely-related object sizes but not using a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks - static allocation, free lists, allocation on the stack, can be shown to work because - at least as described to us - you have a Lua interpreted hovering in the background ready to do who knows what at run time - what if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?

Memory mapped database

I have 8 terabytes of data composed of ~5000 arrays of small sized elements (under a hundred bytes per element). I need to load sections of these arrays (a few dozen megs at a time) into memory to use in an algorithm as quickly as possible. Are memory mapped files right for this use, and if not what else should I use?
Given your requirements I would definitely go with memory mapped files. It's almost exactly what they were made for. And since memory mapped files consume few physical resources, your extremely large files will have little impact on the system as compared to other methods, especially since smaller views can be mapped into the address space just before performing I/O (eg, those arrays of elements). The other big benefit is they give you the simplest working environment possible. You can (mostly) just view your data as a large memory address space and let Windows worry about the I/O. Obviously, you'll need to build in locking mechanisms to handle multiple threads, but I'm sure you know that.

Resources