Temporary files vs malloc (in C)

I have a program that generates a variable amount of data that it has to store to use later.
When should I choose malloc+realloc and when should I choose temporary files?

mmap(2,3p) (or file mappings) means never having to choose between the two.

Use temporary files if the size of your data is larger than the virtual address space of your target system (2-3 GB on 32-bit hosts), or if it's at least big enough to put serious resource strain on the system.
Otherwise use malloc.
If you go the temporary-file route, use the tmpfile function to create them, since on good systems the files never have names in the filesystem and have no chance of being left around if your program terminates abnormally. Most people do not like the temp-file cruft that Microsoft Office products tend to leave all over the place. ;-)
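A minimal sketch of the tmpfile approach, using only the standard library:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* tmpfile() opens an anonymous temporary file in "wb+" mode; it is
       deleted automatically when closed or when the program exits. */
    FILE *tmp = tmpfile();
    if (tmp == NULL) {
        perror("tmpfile");
        return EXIT_FAILURE;
    }

    fputs("scratch data\n", tmp);
    rewind(tmp);                     /* seek back to read what we wrote */

    char line[64];
    if (fgets(line, sizeof line, tmp) != NULL)
        fputs(line, stdout);

    fclose(tmp);                     /* the file disappears here */
    return 0;
}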

Prefer a temporary file if you need or want it to be visible to other processes, and malloc/realloc if not. Also consider the amount of data relative to your address space and virtual memory: will the data consume too much swap space if left in memory? Also consider how good a fit each approach is for your application: file read/write can be a pain compared to memory access. Memory-mapped files make it easier, but you may need custom library support to do dynamic memory allocation within them.

In a modern OS, all the memory gets paged out to disk if needed anyway, so feel free to malloc() anything up to a couple of gigabytes.

If you know the maximum size, it's not too big, and you only need one copy, you should use a static buffer, allocated at program load time:
char buffer[1000];
int buffSizeUsed;
If any of those pre-conditions are false and you only need the information while the program is running, use malloc:
char *buffer = malloc(actualSize);
Just make sure you check that the allocations work and that you free whatever you allocate.
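For instance, a minimal sketch of that checking pattern (actualSize stands in for whatever size your program computes):

char *buffer = malloc(actualSize);
if (buffer == NULL) {
    perror("malloc");   /* allocation failed: recover or bail out */
    exit(EXIT_FAILURE);
}
/* ... use buffer ... */
free(buffer);
buffer = NULL;          /* optional, but prevents accidental reuse */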
If the information has to survive the termination of your program or be usable from other programs at the same time, it'll need to go into a file (or long-lived shared memory if you have that capability).
And, if it's too big to fit into your address space at once, you'll need to store it in a file and read it in a bit at a time.
That's basically going from the easiest/least-flexible to the hardest/most-flexible possibilities.
Where your requirements lie along that line is a decision you need to make.

On a 32-bit system, you won't be able to malloc() more than 2GB or 3GB or so. The big advantage of files is that they are limited only by disk size. Even with a 64-bit system, it's unusual to be able to allocate more than 8GB or 16GB because there are usually limits on how large the swap file can grow.

Use RAM for data that is private and only needed for the life of a single process. Use a temp file if the data needs to persist beyond a single process.

Related

Is there a performance cost to using large mmap calls that go beyond expected memory usage?

For initializing data structures that are both persistent for the duration of the program and require a dynamic amount of memory, is there any reason not to mmap an upper bound from the start?
An example is an array that will persist for the entire program's life but whose final size is unknown. The approach I am most familiar with is something along the lines of:
type *array = malloc(size);
and when the array has reached capacity, doubling it with:
array = realloc(array, 2 * size);
size *= 2;
I understand this is probably the best way to do this if the array might be freed mid-execution so that its VM can be reused, but if it is persistent, is there any reason not to just initialize the array as follows:
array = mmap(NULL, huge_size,
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
             -1, 0);
so that the elements never need to be copied.
Edit: Specifically for an OS that uses on-demand paging.
Don't try to be smarter than the standard library, unless you 100% know what you are doing.
malloc() already does this for you. If you request a large amount of memory, malloc() will mmap() you a dedicated memory area. If what you are concerned about is the performance hit of doing size *= 2; realloc(old, size), then just malloc(huge_size) at the beginning and keep track of the actual used size in your program. There really is no point in calling mmap() yourself unless you explicitly need it for some specific reason: it isn't faster or better in any particular way, and if malloc() thinks it's needed, it will do it for you.
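A minimal sketch of that suggestion, where the capacity figure is an arbitrary stand-in for your real upper bound:

size_t capacity = (size_t)1 << 24;   /* hypothetical upper bound, in elements */
size_t used = 0;                     /* elements actually in use */
int *array = malloc(capacity * sizeof *array);
if (array == NULL) {
    /* handle allocation failure */
}

/* append without ever calling realloc or copying elements */
array[used++] = 42;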
It's fine to allocate upper bounds as long as:
You're building a 64-bit program: 32-bit ones have restricted virtual address space, even on 64-bit CPUs
Your upper bounds don't approach 2^47, as a mathematically derived one might
You're fine with crashing as your out-of-memory failure mode
You'll only run on systems where overcommit is enabled
As a side note, an end user application doing this may want to borrow a page from GHC's book and allocate 1TB up front even if 10GB would do. This unrealistically large amount will ensure that users don't confuse virtual memory usage with physical memory usage.
If you know for a fact that wasting a chunk of memory (most likely an entire page which is likely 4096 bytes) will not cause your program or the other programs running on your system to run out of memory, AND you know for a fact that your program will only ever be compiled and run on UNIX machines, then this approach is not incorrect, but it is not good programming practice for the following reasons:
The <stdlib.h> file you #include to use malloc() and free() in your C programs is specified by the C standard, but it is specifically implemented for your architecture by the writers of the operating system. This means that your specific system was kept in-mind when these functions were written, so finding a sneaky way to improve efficiency for memory allocation is unlikely unless you know the inner workings of memory management in your OS better than those who wrote it.
Furthermore, the <sys/mman.h> file you include to mmap() stuff is not part of the C standard, and will only compile on UNIX machines, which reduces the portability of your code.
There's also a really good chance (assuming a UNIX environment) that malloc() and realloc() already use mmap() behind-the-scenes to allocate memory for your process anyway, so it's almost certainly better to just use them. (read that as "realloc doesn't necessarily actively allocate more space for me, because there's a good chance there's already a chunk of memory that my process has control of that can satisfy my new memory request without calling mmap() again")
Hope that helps!

What is the best way to allocate an array when reading big files

I have a file of around 100 MB which needs to be processed.
After I get the dimensions of that file (h and w), I should read the data into an array. I am thinking of several ways to do that:
1. Static (automatic)
int matrix[h][w];
2. Dynamic
int (*matrix)[w] = malloc(h * sizeof *matrix);   /* h rows of w ints, on the heap */
I am worried about the limitations (and freeing the memory).
Also, would a static array be freed when its scope is over?
In my situation, the solution is dynamic allocation.
It seems that int matrix[h][w]; puts the data on the stack, which is limited (small), whereas using malloc() puts the data on the heap, which can be as big as 75% of the virtual memory (in Linux).
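A minimal sketch of the dynamic version, assuming h and w have already been read from the file; a single flat allocation keeps the matrix contiguous and lets one free() release everything:

int *matrix = malloc((size_t)h * w * sizeof *matrix);
if (matrix == NULL) {
    perror("malloc");
    exit(EXIT_FAILURE);
}
matrix[(size_t)i * w + j] = 0;   /* element (i, j) */
free(matrix);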
It depends on the application, but one good approach most of the time is to map the file to memory. On a modern OS, this will return a pointer to the file contents and make a file on a hard drive work like the swap file: the application will just see its contents as memory, but the OS will only load the pages into memory when (it expects that) you access them. This could save you a lot of time and complication if you only need to read a small part of the file.
This is how the glib library does file I/O, and it is available through mmap() on Unix/Linux and through a different set of functions in the Windows API.
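On Unix/Linux, a minimal sketch looks like this (path is a placeholder, and error handling is abbreviated):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open(path, O_RDONLY);
struct stat st;
if (fd < 0 || fstat(fd, &st) != 0) {
    /* handle error */
}

char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (data == MAP_FAILED) {
    /* handle error */
}

/* data[0 .. st.st_size - 1] now reads straight from the page cache;
   pages are faulted in lazily as you touch them. */

munmap(data, st.st_size);
close(fd);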

malloc and other associated functions

I have an array named 'ArrayA' and it is full of ints, but I want to add another 5 cells to the end of the array every time a condition is met. How would I do this? (The internet is not being very helpful.)
If this is a static array, you will have to create a new one with more space and copy the data yourself. If it was allocated with malloc(), as the title to your question suggests, then you can use realloc() to do this more-or-less automatically. Note that the address of your array will, in general, have changed.
It is precisely because of the need for "dynamic" arrays that grow (and shrink) as needed, that languages like C++ introduced vectors. They do the management under the covers.
You need the realloc function.
Also note that adding 5 cells at a time is not the best-performing solution.
It is best to double the size of your array every time an increase is needed.
Use two variables, one for the size (the number of integers in use) and one for the capacity (the allocated size of the array).
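A minimal sketch of that size/capacity pattern, guarding against realloc failure (new_value stands in for whatever you are appending):

size_t size = 0;        /* number of ints currently in use */
size_t capacity = 8;    /* number of ints currently allocated */
int *arrayA = malloc(capacity * sizeof *arrayA);
if (arrayA == NULL) {
    /* handle allocation failure */
}

/* append one value, doubling the capacity when the array is full */
if (size == capacity) {
    int *tmp = realloc(arrayA, 2 * capacity * sizeof *arrayA);
    if (tmp == NULL) {
        /* out of memory: arrayA is still valid and unchanged */
    } else {
        arrayA = tmp;
        capacity *= 2;
    }
}
arrayA[size++] = new_value;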
In a modern OS it is generally safe to assume that if you allocate a lot of memory that you don't use then it will not actually consume physical RAM, but only exist as virtual mappings. The OS will provide physical RAM as soon as a page (today generally in chunks of 4Kb) is used for the first time.
You can specifically enforce this behavior by using mmap to create a large anonymous mapping (MAP_PRIVATE | MAP_ANONYMOUS), e.g. as large as you ever intend to hold. On modern x64 systems virtual mappings can be up to 64 TB large. The memory is logically available to your program, but in practice pages will be added to it only as you start using them.
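A minimal sketch of that kind of reservation; the 1 TiB figure is just an illustration, and a 64-bit build is assumed:

#include <sys/mman.h>

size_t max_bytes = (size_t)1 << 40;   /* 1 TiB: far more than we expect to touch */
char *base = mmap(NULL, max_bytes, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (base == MAP_FAILED) {
    /* handle error; MAP_NORESERVE can help under strict overcommit settings */
}

base[0] = 1;   /* the first touch of each 4 KiB page is what consumes RAM */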
realloc, as described by the other posters, is the naive way to resize a malloc'd area, but make sure that realloc was successful. It can fail!
Problems with memory arise when you touch memory once and then stop using it without deallocating it. In contrast, allocated but untouched memory generally does not consume resources other than VM table entries.

How best to allocate and use memory when program traverses and reports on file system details

I have a program which reads all the file system file/dir names, sizes, etc. and populates them into a tree data structure. Once this is done, it will generate a report.
I want to write my program to collect and then report this data using memory as efficiently as possible, without exceeding my heap space.
I worry that if the file system has a lot of files and dirs, it will consume a lot of memory and might eventually run out (malloc() will start to fail).
Ultimately this is genuine memory consumption. Are there any methods/techniques to overcome this?
You could employ the Flyweight Design Pattern for each folder node.
http://en.wikipedia.org/wiki/Flyweight_pattern
Instead of storing the full path for each item, you could store an array of pointers to partial paths (folder names). The full paths can then be easily reconstructed when needed.
It also depends on what you need for your report. Do you need to hold all the information in memory during construction, or could you just accumulate some of the space count variables as you traverse the tree?
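A minimal sketch of the partial-path idea (the struct layout is illustrative, not something from the question):

#include <stdio.h>

struct node {
    struct node *parent;   /* NULL for the filesystem root */
    char *name;            /* this path component only, e.g. "usr" */
    long long size;        /* bytes, accumulated for the report */
};

/* reconstruct a full path on demand by walking up to the root */
static void print_path(const struct node *n)
{
    if (n->parent != NULL) {
        print_path(n->parent);
        putchar('/');
    }
    fputs(n->name, stdout);
}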
Perhaps using valgrind or Boehm's garbage collector could help you (at least on Linux).

How can I reserve memory addresses without allocating them

I would like (in *nix) to allocate a large, contiguous address space without consuming resources straight away, i.e. I want to reserve an address range and allocate from it later.
Suppose I do foo=malloc(3*1024*1024*1024) to allocate 3 GB on a computer with 1 GB of RAM and 1 GB of swap. It will fail, right?
What I want to do is say "Give me a memory address range foo...foo+3G into which I will be allocating" so I can guarantee all allocations within this area are contiguous, but without actually allocating straight away.
In the example above, I want to follow the foo=reserve_memory(3G) call with a bar=malloc(123) call, which should succeed since reserve_memory hasn't consumed any resources yet; it just guarantees that bar will not be in the range foo...foo+3G.
Later I would do something like allocate_for_real(foo,0,234) to consume bytes 0..234 of foo's range. At that point, the kernel would allocate some virtual pages and map them into the range foo...foo+234.
Is this possible in userspace?
(The point of this is that objects in foo... need to be contiguous and cannot reasonably be moved after they are created.)
Thank you.
Short answer: it already works that way.
Slightly longer answer: the bad news is that there is no special way of reserving a range, but not allocating it. However, the good news is that when you allocate a range, Linux does not actually allocate it, it just reserves it for use by you, later.
The default behavior of Linux is to always accept a new allocation as long as there is address range left. When you actually start using the memory, though, there had better be some RAM or at least swap backing it up. If not, the kernel will kill a process to free memory, usually the process which allocated the most memory.
So the problem on Linux with default settings shifts from "how much can I allocate?" to "how much can I allocate and then still be alive when I start using the memory?"
Here is some info on the subject.
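If you really want the explicit reserve/commit split from the question, one common approximation (my suggestion, not something from the linked info) is to reserve with PROT_NONE and commit pieces with mprotect:

#include <sys/mman.h>

/* reserve 3 GiB of address space with no access rights; nothing backs it yet */
size_t reserved = 3UL << 30;
char *foo = mmap(NULL, reserved, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
if (foo == MAP_FAILED) {
    /* handle error */
}

/* later: commit the first 234 bytes (the kernel rounds to whole pages) */
if (mprotect(foo, 234, PROT_READ | PROT_WRITE) != 0) {
    /* handle error */
}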
I think a simple way would be to do that with a large static array.
On any modern system this will not be mapped to existing memory (in the executable file on disk or in the RAM of your execution machine) unless you really access it. Once you do access it (and the system has enough resources), it will be miraculously initialized to all zeros.
And your program will seriously slow down once you reach the limit of physical memory, and then randomly crash if you run out of swap.
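A minimal sketch of that trick, where the 1 GiB size is an arbitrary illustration:

/* lives in .bss: takes no space in the executable and no physical RAM
   until a page of it is first touched */
static char pool[1UL << 30];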
