Correct way to profile a memory allocator - c

I have written a memory allocator that is (supposedly) faster than using malloc/free.
I have written a small amount of code to test this, but I'm not sure whether it is the correct way to profile a memory allocator. Can anyone give me some advice?
The output of this code is:
Mem_Alloc: 0.020000s
malloc: 3.869000s
difference: 3.849000s
Mem_Alloc is 193.449997 times faster.
This is the code:
int i;
int mem_alloc_time, malloc_time;
float mem_alloc_time_float, malloc_time_float, times_faster;

// Test Mem_Alloc
timeBeginPeriod (1);
mem_alloc_time = timeGetTime ();
for (i = 0; i < 100000; i++) {
    void *p = Mem_Alloc (100000);
    Mem_Free (p);
}
// Get the duration
mem_alloc_time = timeGetTime () - mem_alloc_time;

// Test malloc
malloc_time = timeGetTime ();
for (i = 0; i < 100000; i++) {
    void *p = malloc (100000);
    free (p);
}
// Get the duration
malloc_time = timeGetTime () - malloc_time;
timeEndPeriod (1);

// Convert both times to seconds
mem_alloc_time_float = (float)mem_alloc_time / 1000.0f;
malloc_time_float = (float)malloc_time / 1000.0f;

// Print the results
printf ("Mem_Alloc: %fs\n", mem_alloc_time_float);
printf ("malloc: %fs\n", malloc_time_float);
if (mem_alloc_time_float > malloc_time_float) {
    printf ("difference: %fs\n", mem_alloc_time_float - malloc_time_float);
} else {
    printf ("difference: %fs\n", malloc_time_float - mem_alloc_time_float);
}
times_faster = (float)max (mem_alloc_time_float, malloc_time_float) /
               (float)min (mem_alloc_time_float, malloc_time_float);
printf ("Mem_Alloc is %f times faster.\n", times_faster);

Nobody cares[*] whether your allocator is faster or slower than their allocator at allocating and then immediately freeing a 100k block, 100k times. That is not a common memory allocation pattern, and for any situation where it does occur there are probably better ways to optimize than using your memory allocator: for example, use the stack via alloca, or use a static array.
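For instance, the pattern benchmarked above can be handled with no allocator at all. A minimal sketch, where use_buffer is a hypothetical stand-in for whatever work the 100k block is needed for:

#include <string.h>

/* Hypothetical consumer of the scratch space. */
static void use_buffer(char *buf, size_t len)
{
    memset(buf, 0, len);
}

void process(void)
{
    /* Instead of Mem_Alloc(100000)/Mem_Free(p) on every call,
       reuse one static buffer: no allocator is involved at all. */
    static char scratch[100000];
    use_buffer(scratch, sizeof scratch);
}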
People care greatly whether or not your allocator will speed up their application.
Choose a real application. Study its performance at allocation-heavy tasks with the two different allocators, and compare that. Then study more allocation-heavy tasks.
Just for one example, you might compare the time to start up Firefox and load the StackOverflow front page. You could mock the network (or at least use a local HTTP proxy), to remove a lot of the random variation from the test. You could also use a profiler to see how much time is spent in malloc and hence whether the task is allocation-heavy or not, but beware that stuff like "overcommit" might mean that not all of the cost of memory allocation is paid in malloc.
If you wrote the allocator in order to speed up your own application, you should use your own application.
One thing to watch out for is that often what people want in an allocator is good behavior in the worst case. That is to say, it's all very well if your allocator is 99.5% faster than the default most of the time, but if it does comparatively badly when memory gets fragmented then you lose in the end, because Firefox runs for a couple of hours and then can't allocate memory any more and falls over. Then you realise why the default is taking so long over what appears to be a trivial task.
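One way to probe that worst case is a fragmentation stress test: allocate many blocks of mixed sizes, free every other one to punch holes in the heap, then time larger allocations that have to be satisfied from the fragmented free space. A sketch (the sizes and counts are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000

int main(void)
{
    static void *blocks[N];

    /* Punch holes: mixed small sizes, then free every other block. */
    for (int i = 0; i < N; i++)
        blocks[i] = malloc(64 + (i % 17) * 48);
    for (int i = 0; i < N; i += 2) {
        free(blocks[i]);
        blocks[i] = NULL;
    }

    /* Time larger allocations against the fragmented heap; none of
       the holes left above is big enough to satisfy these. */
    clock_t t0 = clock();
    for (int i = 0; i < N; i += 2)
        blocks[i] = malloc(4096);
    clock_t t1 = clock();
    printf("allocations from fragmented heap: %.3fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC);

    for (int i = 0; i < N; i++)
        free(blocks[i]);
    return 0;
}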
[*] This may seem harsh. Nobody cares whether it's harsh ;-)

All that the implementation you are testing against is missing, in order to win this benchmark, is a check for whether the current request size is the same as the previously freed one:

if (size == prev_free->size)
{
    current = allocate(prev_free);
    return current;
}
It is "trivial" to make efficient malloc/free functions for memory until memory is not fragmented. Challenge is when you allocate lot of memory of different sizes and you try to free some and then allocate some whit no specific order.
You have to check which library you tested against and what conditions that library was optimised for, such as:
- efficient handling of fragmented memory
- fast free and fast malloc (you can make either one O(1))
- memory footprint
- multiprocessor support
- realloc
Check existing implementations and the problems they were dealing with, and try to improve on them or solve the difficulties they had. Try to figure out what users expect from such a library.
Make tests based on these assumptions, not just on some operation you think is important.
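A randomized driver along those lines might look like the following sketch: a mix of sizes and lifetimes rather than a single fixed alloc/free pair. Substituting the asker's Mem_Alloc/Mem_Free for malloc/free here would compare the two allocators under the same pattern:

#include <stdlib.h>

#define SLOTS 4096

void random_workload(unsigned iterations, unsigned seed)
{
    void *slot[SLOTS] = { 0 };
    srand(seed);

    for (unsigned n = 0; n < iterations; n++) {
        unsigned i = rand() % SLOTS;
        if (slot[i] != NULL) {       /* randomized lifetime: free it... */
            free(slot[i]);
            slot[i] = NULL;
        } else {                     /* ...or allocate a randomized size */
            slot[i] = malloc(16 + rand() % 2048);
        }
    }
    for (unsigned i = 0; i < SLOTS; i++)
        free(slot[i]);               /* free(NULL) is a no-op */
}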

Related

When using MPI, how can I fix the error: "_int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed"?

I'm using a scientific simulation code that my supervisor wrote about 10 years ago to run some calculations. An intermittent issue keeps arising when running it in parallel on our cluster (which has hyperthreading enabled) using mpirun. The error it produces is very terse, and simply tells me that a memory allocation has failed.
program-name: malloc.c:4036: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
[MKlabgroup:3448077] *** Process received signal ***
[MKlabgroup:3448077] Signal: Aborted (6)
[MKlabgroup:3448077] Signal code: (-6)
I've used the advice here to start the program and halt it so that I can attach a debugger session to one of the instances on a single core. The error occurs during the partitioning of the input mesh (using metis) when a custom matrix function is called for the first time, and requests space for ~4000 rows and 4 columns, with each element being an 8 byte integer. This particular function (below) uses an array of n pointers addressing m arrays of integer pointers:
int **matrix_int(int n, int m)
{
    int i;
    int **mat;

    // First: Assign the rows [n]
    mat = (int **) malloc(n*sizeof(int*));
    if (!mat)
    {
        program_error("ERROR[0]: Memory allocation failure for the rows #[matrix_int()]");
    }

    // Second: Assign the columns [m]
    for (i = 0; i <= n-1; i++)
    {
        mat[i] = (int *) malloc(m*sizeof(int));
        if (!mat[i])
        {
            program_error("ERROR[0]: Memory allocation failure for the columns #[matrix_int()]");
        }
    }
    return mat;
}
My supervisor thinks that the issue has to do with automatic resource allocation on the CPU. As such, I've recently tried using the -rf option in mpirun in conjunction with a rankfile specifying which cores to use, but this has produced similarly intermittent results; sometimes a single process crashes, sometimes several, and sometimes it runs fine. It always runs reliably in serial, but the calculations are extremely slow on a single core.
Does anyone know of a change to the server configuration or the code itself that I can make (aside from globally disabling hyperthreading) that would allow this to run for certain every time?
(Any general tips on debugging in parallel would also be greatly appreciated! I'm still pretty new to C/C++ and MPI, and have another bug to chase after this one which is probably related.)
After using the compiler flags suggested by n. 1.8e9-where's-my-share m. to diagnose memory access violations I've discovered that the memory corruption is indeed caused by a function that is called just before the one in my original question.
The offending function reads in data from a text file using sscanf, and would allocate a 3-element array for each line of the file (for the 3 numbers to be read in per line). The next part is conjecture, but I think the problem arose because sscanf writes a terminating null byte at the end of a sequence it reads. I'm surmising that this null byte was written one byte past the end of the 3 allocated elements, corrupting the bookkeeping data that malloc keeps next to each block, so that a later call into the allocator failed its internal consistency checks and aborted, long after the code that actually caused the damage had returned.
I was able to fix the bug by changing the size of the allocated array in the read function from 3 to 4 elements. This gives the null terminator somewhere to go without it interfering with subsequent memory allocations.
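A minimal illustration of that off-by-one pattern, assuming the values were parsed into a character buffer (the actual read code isn't shown in the question): string conversions in sscanf always append a terminating '\0', so the destination must have room for it.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *line = "abc";

    /* WRONG: room for 3 characters but not the terminating '\0'.
       "%3s" writes 4 bytes here; the out-of-bounds byte is undefined
       behavior and, depending on the heap layout, can corrupt malloc's
       metadata and trigger assertions much later. */
    char *bad = malloc(3);
    sscanf(line, "%3s", bad);

    /* RIGHT: one extra byte for the null terminator. */
    char *good = malloc(4);
    sscanf(line, "%3s", good);

    free(good);
    free(bad);      /* may abort if the heap is already corrupted */
    return 0;
}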

What algorithm to apply for continuous reallocation of small memory chunks?

In a C program I deal with transactions that require a lot of memory chunks. I need to know whether there is an algorithm or best-practice technique for handling all these malloc/free calls. I've used arrays to store these memory chunks, but at some point the array itself gets full, and reallocating it is just more waste. What is the elegant way to handle this issue?
The best algorithm in this case would be a free-list allocator plus a binary search tree.
You ask the system for one big memory chunk, and then allocate blocks of a fixed size out of that chunk. When the chunk is full you allocate another one, and link the chunks into a red-black or AVL interval tree (otherwise, finding the owning chunk during free by iterating over a list of chunks becomes a bottleneck).
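A minimal sketch of the fixed-size part of that scheme, with a singly linked free list threaded through the free blocks themselves (all names are illustrative; chunk lookup and chunk release are omitted):

#include <stdlib.h>

#define BLOCK_SIZE       64      /* the one fixed allocation size */
#define BLOCKS_PER_CHUNK 1024

typedef union block {
    union block *next;           /* link, valid only while the block is free */
    char payload[BLOCK_SIZE];
} block_t;

static block_t *free_list = NULL;

/* Carve one big chunk from the system and thread every block of it
   onto the free list. */
static int pool_grow(void)
{
    block_t *chunk = malloc(sizeof(block_t) * BLOCKS_PER_CHUNK);
    if (chunk == NULL)
        return -1;
    for (int i = 0; i < BLOCKS_PER_CHUNK; i++) {
        chunk[i].next = free_list;
        free_list = &chunk[i];
    }
    return 0;
}

void *pool_alloc(void)           /* O(1): pop the free-list head */
{
    if (free_list == NULL && pool_grow() != 0)
        return NULL;
    block_t *b = free_list;
    free_list = b->next;
    return b;
}

void pool_free(void *p)          /* O(1): push back onto the list */
{
    block_t *b = p;
    b->next = free_list;
    free_list = b;
}

The search tree from the answer only becomes necessary once there are several block sizes, or you want to return empty chunks to the system, because then pool_free has to discover which chunk (and which size class) the pointer belongs to.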
At the same time, multi-threading and thread synchronization become important. If you simply use mutexes or trivial spin-locks, you will find that libc's malloc works much faster than your custom memory allocator. A solution can be found in hazard pointers, i.e. each thread has its own memory allocator (or one allocator per CPU core). This adds another problem: when one thread allocates memory and another thread releases it, the free function has to find the exact allocator that owns the block, which requires strict locking or lock-free data structures.
The best option: use jemalloc, tcmalloc, or any other fast general-purpose memory allocator to replace your libc's default allocator (i.e. ptmalloc) completely or partially.
If you need to track these various buffers as part of the functionality you are delivering, then you are going to need some kind of buffer-management layer, which itself includes allocating memory for the bookkeeping.
However if you are worried about memory fragmentation, that is a different question.
I have seen a lot written about how using malloc() and free() leads to memory fragmentation along with various ways to reduce or eliminate this problem.
I suspect years ago, it may have been a problem. However these days I question whether most programmers can do as good a job at managing memory as the people who develop the run time of modern compilers. I am sure there are special applications but compiler runtime development is very much aware of memory fragmentation and there are a number of approaches they use to help mitigate the problem.
For instance, here is an older article from 2010 describing how a modern runtime implements malloc() and free(): A look at how malloc works on the Mac.
Here is a bit about the GNU allocator.
This article from November 2004 from IBM describes some of the considerations for memory management, Inside memory management, and provides what they call "code for a simplistic implementation of malloc and free to help demonstrate what is involved with managing memory." Notice that this is a simplistic example meant to illustrate some of the issues and is not a demonstration of current practice.
I did a quick console application with Visual Studio 2015 that called a C function in a C source file that interspersed malloc() and free() calls of various sizes. I ran this while watching the process in Windows Task Manager. Peak Working Set (Memory) maxed out at 34MB. Watching the Memory (Private Working Set) measurement, I saw it rise and fall as the program ran.
#include <malloc.h>
#include <stdio.h>

void funcAlloc(void)
{
    char *p[50000] = { 0 };
    int i;

    for (i = 0; i < 50000; i += 2) {
        p[i] = malloc(32 + (i % 1000));
    }
    for (i = 0; i < 50000; i += 2) {
        free(p[i]); p[i] = 0;
    }
    for (i = 1; i < 50000; i += 2) {
        p[i] = malloc(32 + (i % 1000));
    }
    for (i = 1; i < 50000; i += 2) {
        free(p[i]); p[i] = 0;
    }
    for (i = 0; i < 50000; i++) {
        p[i] = malloc(32 + (i % 1000));
    }
    for (i = 0; i < 50000; i += 3) {
        free(p[i]); p[i] = 0;
    }
    for (i = 1; i < 50000; i += 3) {
        free(p[i]); p[i] = 0;
    }
    for (i = 2; i < 50000; i += 3) {
        free(p[i]); p[i] = 0;
    }
}

void funcMain(void)
{
    for (int i = 0; i < 5000; i++) {
        funcAlloc();
    }
}
Probably the only thing a programmer should do to help the memory allocator, under conditions of churning memory with malloc() and free(), is to use a set of standard buffer sizes for varying lengths of data.
For example, if you are creating temporary buffers for several different data structs of varying sizes, use a standard buffer size of the largest struct for all of the buffers, so that the memory manager works with equal-sized chunks of memory and can more efficiently reuse freed chunks. I have seen some dynamic data structures use this approach, such as allocating a dynamic string with a minimum length of 32 characters, or rounding a buffer request up to a multiple of four or eight.
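A request-rounding helper of the kind described might look like this sketch (the 32-byte floor and the rounding granularity are arbitrary choices):

#include <stddef.h>

/* Round a request up to the next multiple of 8, with a 32-byte floor,
   so the allocator sees a small set of recurring block sizes. */
static size_t round_request(size_t n)
{
    if (n < 32)
        return 32;
    return (n + 7) & ~(size_t)7;
}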

Performance of GNU C implementation of getcwd()

According to the GNU Lib C documentation on getcwd()...
The GNU C Library version of this function also permits you to specify a null pointer for the buffer argument. Then getcwd allocates a buffer automatically, as with malloc (see Unconstrained Allocation). If the size is greater than zero, then the buffer is that large; otherwise, the buffer is as large as necessary to hold the result.
I now draw your attention to the implementation using the standard getcwd(), described in the GNU documentation:
char* gnu_getcwd ()
{
    size_t size = 100;

    while (1)
    {
        char *buffer = (char *) xmalloc (size);
        if (getcwd (buffer, size) == buffer)
            return buffer;
        free (buffer);
        if (errno != ERANGE)
            return 0;
        size *= 2;
    }
}
This seems great for portability and stability but it also looks like a clunky compromise with all that allocating and freeing memory. Is this a possible performance concern given that there may be frequent calls to the function?
*It's easy to say "profile it", but that can't account for every possible system, present or future.
The initial size is 100, holding a 99-char path, longer than most of the paths that exist on a typical system. This means in general that there is no “allocating and freeing memory”, and no more than 98 bytes are wasted.
The heuristic of doubling at each try means that at a maximum, a logarithmic number of spurious allocations take place. On many systems, the maximum length of a path is otherwise limited, meaning that there is a finite limit on the number of re-allocations caused.
This is about the best one can do as long as getcwd is used as a black box.
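On glibc specifically, you can skip the loop entirely by using the extension quoted at the top of the question: pass a null buffer and a size of 0, and getcwd allocates an exactly sized buffer itself. A short example (not portable; POSIX leaves the null-buffer case unspecified):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension: getcwd allocates a buffer of exactly the
       required size; the caller is responsible for freeing it. */
    char *cwd = getcwd(NULL, 0);
    if (cwd == NULL) {
        perror("getcwd");
        return 1;
    }
    printf("%s\n", cwd);
    free(cwd);
    return 0;
}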
This is not a performance concern because it's the getcwd function. If that function is in your critical path then you're doing it wrong.
Joking aside, there's none of this code that could be removed. The only way you could improve this with profiling is to adjust the magic number "100" (it's a speed/space trade-off). Even then, you'd only have optimized it for your file system.
You might also think of replacing free/malloc with realloc, but that would result in an unnecessary memory copy, and with the error checking wouldn't even be less code.
Thanks for the input, everyone. I have recently concluded what should have been obvious from the start: define the starting value ("100" in this case) and the growth formula (doubling, in this case) based on the target platform. This could account for all systems, especially with the use of additional flags.

Unexpectedly large number of TLB misses in simple PAPI profiling on x86

I am using the PAPI high level API to check TLB misses in a simple program looping through an array, but seeing larger numbers than expected.
In other simple test cases, the results seem quite reasonable, which leads me to think the results are real and the extra misses are due to a hardware prefetch or similar.
Can anyone explain the numbers or point me to some error in my use of PAPI?
int events[] = {PAPI_TLB_TL};
long long values[1];
char *databuf = (char *) malloc(4096 * 32);

if (PAPI_start_counters(events, 1) != PAPI_OK) exit(-1);
if (PAPI_read_counters(values, 1) != PAPI_OK) exit(-1);  // Zeros the counters

for (int i = 0; i < 32; ++i) {
    databuf[4096 * i] = 'a';
}

if (PAPI_read_counters(values, 1) != PAPI_OK) exit(-1);  // Extracts the counters
printf("%llu\n", values[0]);
I expected the printed number to be in the region of 32, or at least some multiple, but consistently get a result of 93 or above (not consistently above 96 i.e. not simply 3 misses for every iteration). I am running pinned to a core with nothing else on it (apart from timer interrupts).
I am on Nehalem and not using huge pages, so there are 64 entries in DTLB (512 in L2).
Based on the comments:
~90 misses if malloc() is used.
32 misses if calloc() is used or if the array is iterated through beforehand.
The reason is due to lazy allocation. The OS isn't actually giving you the memory until you touch it.
When you first touch the page, it will result in a page-fault. The OS will trap this page-fault and properly allocate it on the fly (which involves zeroing among other things). This is the overhead that is resulting in all those extra TLB misses.
But if you use calloc() or you touch all the memory ahead of time, you move this overhead to before you start the counters. Hence the smaller result.
As for the 32 remaining misses... I have no idea.
(Or as mentioned in the comments, it's probably the PAPI interference.)
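A sketch of the pre-touch variant described above (using the same buffer layout as the question; the PAPI calls are elided):

#include <stdlib.h>

int main(void)
{
    char *databuf = malloc(4096 * 32);
    if (databuf == NULL)
        return 1;

    /* First touch: force the OS to actually back each page, so the
       page-fault overhead is paid before any counters are started. */
    for (int i = 0; i < 32; ++i)
        databuf[4096 * i] = 0;

    /* ... start the PAPI counters and run the measured loop here,
       exactly as in the code from the question ... */

    free(databuf);
    return 0;
}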
This is possibly because you are jumping 4096 bytes ahead in every loop iteration:

for (int i = 0; i < 32; ++i) {
    databuf[4096 * i] = 'a';
}

There is a good chance that you get a cache miss for every access.

Purposely waste all of main memory to learn fragmentation

In my class we have an assignment and one of the questions states:
Memory fragmentation in C: Design, implement, and run a C program that does the following: it allocates memory for a sequence of 3m arrays of 500,000 elements each; then it deallocates all even-numbered arrays and allocates a sequence of m arrays of 700,000 elements each. Measure the amounts of time your program requires for the allocations of the first sequence and for the second sequence. Choose m so that you exhaust all of the main memory available to your program. Explain your timings.
My implementation of this is as follows:
#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    clock_t begin1, stop1, begin2, stop2;
    double tdif = 0, tdif2 = 0;

    for (int k = 0; k < 1000; k++) {
        double dif, dif2;
        const int m = 50000;

        begin1 = clock();
        printf("Step One\n");
        int *container[3*m];
        for (int i = 0; i < (3*m); i++)
        {
            int *tmpAry = (int *)malloc(500000*sizeof(int));
            container[i] = tmpAry;
        }
        stop1 = clock();

        printf("Step Two\n");
        for (int i = 0; i < (3*m); i += 2)
        {
            free(container[i]);
        }

        begin2 = clock();
        printf("Step Three\n");
        int *container2[m];
        for (int i = 0; i < m; i++)
        {
            int *tmpAry = (int *)malloc(700000*sizeof(int));
            container2[i] = tmpAry;
        }
        stop2 = clock();

        dif = (stop1 - begin1) / 1000.00;
        dif2 = (stop2 - begin2) / 1000.00;
        tdif += dif;
        tdif /= 2;
        tdif2 += dif2;
        tdif2 /= 2;
    }
    printf("To Allocate the first array it took: %.5f\n", tdif);
    printf("To Allocate the second array it took: %.5f\n", tdif2);
    system("pause");
    return 0;
}
I have changed this up a few different ways, but the consistent behavior I see is that when I initially allocate the memory for the 3*m arrays of 500,000 elements, it uses up all of the available main memory. But then, when I tell it to free them, the memory is not released back to the OS, so when the program goes on to allocate the m arrays of 700,000 elements, it does so in the page file (swap), and therefore does not actually demonstrate memory fragmentation.
The above code runs this 1000 times and averages it, which takes quite some time. The first sequence averaged 2.06913 seconds and the second sequence 0.67594 seconds. To my mind the second sequence is supposed to take longer, to show how fragmentation works, but because swap is being used this does not occur. Is there a way around this, or am I wrong in my assumption?
I will ask the professor about what I have on Monday, but until then any help would be appreciated.
Many libc implementations (I think glibc included) don't release memory back to the OS when you call free(), but keep it around so the next allocation can be served without a syscall. Also, because of the complexity of modern paging and virtual memory strategies, you can never be sure where anything is in physical memory, which makes it almost impossible to fragment it intentionally (even if it does become fragmented). You have to remember that virtual memory and physical memory are different beasts.
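On glibc you can observe, and to some extent override, that keep-it-around behavior: malloc_trim() asks the allocator to return free memory at the top of the heap to the OS. A small demonstration (glibc-specific, and only a hint to the allocator):

#include <malloc.h>   /* glibc-specific: malloc_trim */
#include <stdlib.h>

int main(void)
{
    enum { N = 100000 };
    static void *p[N];

    /* Many small blocks are carved from the heap (brk), not mmap,
       so freeing them normally leaves the memory with the process. */
    for (int i = 0; i < N; i++)
        p[i] = malloc(512);
    for (int i = 0; i < N; i++)
        free(p[i]);

    /* Ask glibc to hand back to the OS whatever it can. */
    malloc_trim(0);
    return 0;
}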
(The following is written for Linux, but is probably applicable to Windows and OS X.)
When your program makes the first allocations, let's say there is enough physical memory for the OS to squeeze all of the pages in. They aren't all next to each other in physical memory -- they are scattered wherever there is room. The OS then modifies the page table to create a set of contiguous virtual addresses that refer to those scattered pages. But here's the thing: because you don't really use the first memory you allocate, it becomes a really good candidate for swapping out. So when you come along to do the next allocations, the OS, running out of memory, will probably swap out some of those pages to make room for the new ones. Because of this, you are actually measuring disk speeds and the efficiency of the operating system's paging mechanism -- not fragmentation.
Remember, a set of contiguous virtual addresses is almost never physically contiguous in practice (and may not even all be resident in memory).
