Can someone explain the performance behavior of the following memory-allocating C program?

On my machine, Time A and Time B swap depending on whether A is
defined or not (which changes the order in which the two callocs are called).
I initially attributed this to the paging system. Weirdly, when
mmap is used instead of calloc, the oddity disappears: both loops take the same amount of time, as one would expect. As
can be seen with strace, the callocs ultimately result in two
mmaps, so there is no return-already-allocated-memory magic going on.
I'm running Debian testing on an Intel i7.
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
    clock_t start, finish;
#ifdef A
    int *arr1 = ALLOC(sizeof(int), SIZE);
    int *arr2 = ALLOC(sizeof(int), SIZE);
#else
    int *arr2 = ALLOC(sizeof(int), SIZE);
    int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
    int i;

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr1[i] = (i + 13) * 5;
    }
    finish = clock();
    printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr2[i] = (i + 13) * 5;
    }
    finish = clock();
    printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);
    return 0;
}
The output I get:
~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.94
Time B: 0.34
~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.34
Time B: 0.90
~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.89
Time B: 0.90
~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.91
Time B: 0.92

You should also test using malloc instead of calloc. One thing that calloc does is fill the allocated memory with zeros.
I believe that in your case, when you calloc arr1 last and then assign to it, it is already faulted into cache memory, since it was the last one allocated and zero-filled. When you calloc arr1 first and arr2 second, the zero-fill of arr2 pushes arr1 out of cache.
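For reference, a minimal way to run that test against the question's program is a malloc-based variant of the ALLOC macro (a sketch; the longer answer further down uses the same idea behind a USE_MALLOC switch):

/* Sketch: malloc-based variant of the question's ALLOC macro.
   malloc does not zero the memory, so no memset happens at
   allocation time; pages are still faulted in on first touch. */
#ifdef USE_MALLOC
#define ALLOC(a, b) (malloc((a) * (b)))
#endif

Compiled with -DUSE_MALLOC, both write loops then pay the first-touch cost, the same way the mmap version does.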

Guess I could have written more, or less, especially as less is more.
The reason can differ from system to system. However, for glibc:
The total time used for each operation is the other way around if you time
the calloc plus the iteration.
I.e.:
Calloc arr1 : 0.494992654
Calloc arr2 : 0.000021250
Itr arr1 : 0.430646035
Itr arr2 : 0.790992411
Sum arr1 : 0.925638689
Sum arr2 : 0.791013661
Calloc arr1 : 0.503130736
Calloc arr2 : 0.000025906
Itr arr1 : 0.427719162
Itr arr2 : 0.809686047
Sum arr1 : 0.930849898
Sum arr2 : 0.809711953
The first calloc (and subsequently malloc) has a longer execution time than the
second. A call such as malloc(0) before any calloc etc. evens out the time
used for malloc-like calls in the same process (explanation below). One can,
however, see a slight decline in time for these calls if one does several in a row.
The iteration time, however, will flatten out.
So in short: the total system time used is highest for whichever array gets allocated first.
This is, however, an overhead that can't be escaped within the confines of a process.
There is a lot of maintenance going on. A quick touch on some of the cases:
A short note on pages
When a process requests memory, it is served a virtual address range. This range
translates via a page table to physical memory. If pages were translated byte by
byte, we would quickly get huge page tables. This is one reason why
memory ranges are served in chunks, or pages. The page size is system
dependent. The architecture can also provide various page sizes.
If we look at the execution of the above code and add some reads from /proc/PID/stat,
we see this in action (note RSS in particular):
PID Stat {
PID : 4830 Process ID
MINFLT : 214 Minor faults, (no page memory read)
UTIME : 0 Time user mode
STIME : 0 Time kernel mode
VSIZE : 2039808 Virtual memory size, bytes
RSS : 73 Resident Set Size, Number of pages in real memory
} : Init
PID Stat {
PID : 4830 Process ID
MINFLT : 51504 Minor faults, (no page memory read)
UTIME : 4 Time user mode
STIME : 33 Time kernel mode
VSIZE : 212135936 Virtual memory size, bytes
RSS : 51420 Resident Set Size, Number of pages in real memory
} : Post calloc arr1
PID Stat {
PID : 4830 Process ID
MINFLT : 51515 Minor faults, (no page memory read)
UTIME : 4 Time user mode
STIME : 33 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 51428 Resident Set Size, Number of pages in real memory
} : Post calloc arr2
PID Stat {
PID : 4830 Process ID
MINFLT : 51516 Minor faults, (no page memory read)
UTIME : 36 Time user mode
STIME : 33 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 51431 Resident Set Size, Number of pages in real memory
} : Post iteration arr1
PID Stat {
PID : 4830 Process ID
MINFLT : 102775 Minor faults, (no page memory read)
UTIME : 68 Time user mode
STIME : 58 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 102646 Resident Set Size, Number of pages in real memory
} : Post iteration arr2
PID Stat {
PID : 4830 Process ID
MINFLT : 102776 Minor faults, (no page memory read)
UTIME : 68 Time user mode
STIME : 69 Time kernel mode
VSIZE : 2179072 Virtual memory size, bytes
RSS : 171 Resident Set Size, Number of pages in real memory
} : Post free()
As we can see, actually committing pages to memory is postponed for arr2, awaiting
page requests; this lasts until iteration begins. If we add a malloc(0) before the
calloc of arr1, we can register that neither array is allocated in physical
memory before iteration.
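For reference, snapshots like the ones above can be produced with a small helper along these lines (a sketch, assuming Linux and the field order documented in proc(5); error handling kept minimal):

/* Sketch: read minor faults (field 10) and RSS in pages (field 24)
   from /proc/self/stat, per proc(5). */
#include <stdio.h>

static void print_stat(const char *label)
{
    long minflt = 0, rss = 0;
    FILE *f = fopen("/proc/self/stat", "r");
    if (f == NULL)
        return;
    /* skip pid, comm, state and fields 4-9, read minflt,
       skip fields 11-23, read rss */
    fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %ld "
              "%*u %*u %*u %*u %*u %*d %*d %*d %*d %*d %*d %*u %*u %ld",
           &minflt, &rss);
    fclose(f);
    printf("%-22s MINFLT: %ld RSS: %ld pages\n", label, minflt, rss);
}

Calling it at the points marked above (init, post calloc, post iteration, post free) gives output comparable to the dumps.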
As a page might never be used, it is more efficient to do the mapping on request.
This is why, when the process e.g. does a calloc, a sufficient number of pages
is reserved, but not necessarily actually allocated in real memory.
When an address is referenced, the page table is consulted. If the address is
in a page which is not allocated, the system serves a page fault and the page
is subsequently allocated. The total sum of allocated pages is called the Resident
Set Size (RSS).
We can do an experiment with our array by iterating over (touching) e.g. 1/4 of it.
Here I have also added a malloc(0) before any calloc.
Pre iteration 1/4:
RSS : 171 Resident Set Size, Number of pages in real memory
for (i = 0; i < SIZE / 4; ++i)
    arr1[i] = 0;
Post iteration 1/4:
RSS : 12967 Resident Set Size, Number of pages in real memory
Post iteration 1/1:
RSS : 51134 Resident Set Size, Number of pages in real memory
To further speed up things most systems additionally cache the N most recent
page table entries in a translation lookaside buffer (TLB).
brk, mmap
When a process (c|m|re)allocs, the upper bound of the heap is expanded by
brk() or sbrk(). These system calls are expensive, and to compensate for
this malloc collects multiple smaller calls into one bigger brk().
This also affects free(): as a negative brk() is also resource expensive,
such calls are collected and performed as one bigger operation.
For huge requests, i.e. like the one in your code, malloc() uses mmap().
The threshold for this, which is configurable by mallopt(), is an educated
value (in glibc, the M_MMAP_THRESHOLD parameter, 128 kB by default).
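As an aside, the threshold can be changed from code with mallopt() (a sketch; M_MMAP_THRESHOLD is the glibc parameter, and setting it explicitly also disables glibc's dynamic threshold adjustment):

/* Sketch: lower the glibc mmap threshold so that requests of
   64 kB and above are served by individual mmap()s instead of
   the brk heap. */
#include <malloc.h>

int main(void)
{
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);
    /* ... allocations >= 64 kB are now backed by mmap ... */
    return 0;
}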
We can have fun with this by modifying the SIZE in your code. If we utilize
malloc.h and use
struct mallinfo minf = mallinfo();
(no, not milf), we can show this (note Arena and Hblkhd, …):
Initial:
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : Initial
MAX = ((128 * 1024) / sizeof(int))
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 1 (Number of chunks allocated with mmap)
Hblkhd : 135168 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : After malloc arr1
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 2 (Number of chunks allocated with mmap)
Hblkhd : 270336 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : After malloc arr2
Then we subtract sizeof(int) from MAX and get:
mallinfo {
Arena : 266240 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 131064 (Memory occupied by chunks handed out by malloc)
Fordblks: 135176 (Memory occupied by free chunks)
Keepcost: 135176 (Size of the top-most releasable chunk)
} : After malloc arr1
mallinfo {
Arena : 266240 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 262128 (Memory occupied by chunks handed out by malloc)
Fordblks: 4112 (Memory occupied by free chunks)
Keepcost: 4112 (Size of the top-most releasable chunk)
} : After malloc arr2
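(For reference, the dumps above can be printed with a helper along these lines; a sketch: mallinfo() is glibc-specific, and newer glibc deprecates it in favor of mallinfo2().)

/* Sketch: dump a few mallinfo fields (glibc-specific). */
#include <malloc.h>
#include <stdio.h>

static void show_minf(const char *when)
{
    struct mallinfo minf = mallinfo();
    printf("%s: Arena: %d Hblks: %d Hblkhd: %d Uordblks: %d\n",
           when, minf.arena, minf.hblks, minf.hblkhd, minf.uordblks);
}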
We register that the system works as advertised. If the size of the allocation is
below the threshold, sbrk is used and the memory is handled internally by malloc;
otherwise mmap is used.
This structure also helps prevent fragmentation of memory, etc.
Point being that the malloc family is optimized for general usage. However,
mmap limits can be modified to meet special needs.
Note this (and on down through 100+ lines of the source comments) when / if modifying the mmap threshold.
This can be further observed if we fill (touch) every page of arr1 and arr2
before we do the timing:
Touch pages … (Here with page size of 4 kB)
for (i = 0; i < SIZE; i += 4096 / sizeof(int)) {
    arr1[i] = 0;
    arr2[i] = 0;
}
Itr arr1 : 0.312462317
CPU arr1 : 0.32
Itr arr2 : 0.312869158
CPU arr2 : 0.31
Also see:
Synopsis of compile-time options
Vital statistics
… actually the entire file is a nice read.
Sub notes:
So, the CPU knows the physical address then? Nah.
In the world of memory a lot has to be addressed ;). A core piece of hardware for
this is the memory management unit (MMU), either an integrated part of
the CPU or an external chip.
The operating system configures the MMU on boot and defines access for various
regions (read only, read-write, etc.), thus giving a level of security.
The address we as mortals see is the logical address that the CPU uses. The
MMU translates this to a physical address.
The CPU's address consists of two parts: a page address and an offset.
[PAGE_ADDRESS.OFFSET]
And for the process of getting a physical address, we can have something like:
.-----. .--------------.
| CPU > --- Request page 2 ----> | MMU |
+-----+ | Pg 2 == Pg 4 |
| +------v-------+
+--Request offset 1 -+ |
| (Logical page 2 EQ Physical page 4)
[ ... ] __ | |
[ OFFSET 0 ] | | |
[ OFFSET 1 ] | | |
[ OFFSET 2 ] | | |
[ OFFSET 3 ] +--- Page 3 | |
[ OFFSET 4 ] | | |
[ OFFSET 5 ] | | |
[ OFFSET 6 ]__| ___________|____________+
[ OFFSET 0 ] | |
[ OFFSET 1 ] | ...........+
[ OFFSET 2 ] |
[ OFFSET 3 ] +--- Page 4
[ OFFSET 4 ] |
[ OFFSET 5 ] |
[ OFFSET 6 ]__|
[ ... ]
A CPU's logical address space is directly linked to its address length. A
32-bit address processor has a logical address space of 2^32 bytes.
The physical address space is how much memory the system can afford.
There is also the handling of fragmented memory, re-alignment, etc.
This brings us into the world of swap files. If a process requests more memory
than is physically available, one or several pages of other process(es) are
transferred to disk/swap and their pages "stolen" by the requesting process.
The MMU keeps track of this; thus the CPU doesn't have to worry about where
the memory is actually located.
This further brings us to dirty memory.
If we print some information from /proc/[pid]/smaps, specifically the range
of our arrays, we get something like:
Start:
b76f3000-b76f5000
Private_Dirty: 8 kB
Post calloc arr1:
aaeb8000-b76f5000
Private_Dirty: 12 kB
Post calloc arr2:
9e67c000-b76f5000
Private_Dirty: 20 kB
Post iterate 1/4 arr1
9e67b000-b76f5000
Private_Dirty: 51280 kB
Post iterate arr1:
9e67a000-b76f5000
Private_Dirty: 205060 kB
Post iterate arr2:
9e679000-b76f5000
Private_Dirty: 410096 kB
Post free:
9e679000-9e67d000
Private_Dirty: 16 kB
b76f2000-b76f5000
Private_Dirty: 12 kB
When a virtual page is created, the system typically clears a dirty bit in the
page.
When the CPU writes to a part of this page, the dirty bit is set; thus, when
swapping, pages with the dirty bit set are written out, and clean pages are skipped.
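The Private_Dirty figures above can be collected with a scan along these lines (a sketch; this sums Private_Dirty over all mappings of the process rather than just the arrays' ranges):

/* Sketch: sum Private_Dirty over all mappings in /proc/self/smaps. */
#include <stdio.h>

static long private_dirty_kb(void)
{
    char line[256];
    long kb, total = 0;
    FILE *f = fopen("/proc/self/smaps", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
            total += kb;
    fclose(f);
    return total;
}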

Short Answer
The first time calloc is called, it explicitly zeroes out the memory. The next time it is called, it assumes that the memory returned from mmap is already zeroed out.
Details
Here are some of the things that I checked to come to this conclusion, which you could try yourself if you wanted:
Insert a calloc call before your first ALLOC call. You will see that after this, Time A and Time B are the same.
Use the clock() function to check how long each of the ALLOC calls takes. In the case where they both use calloc, you will see that the first call takes much longer than the second one (see the sketch at the end of this answer).
Use time to measure the execution time of the calloc version and of the USE_MMAP version. When I did this I saw that the execution time for USE_MMAP was consistently slightly less.
I ran with strace -tt -T, which shows both the time at which each system call was made and how long it took. Here is part of the output:
Strace output:
21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>
You can see that the first mmap call took 0.000014 seconds, but that about 1.5 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call a few hundred microseconds later.
I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc did not make any calls to memset. You can see the source code for calloc here (look for __libc_calloc) if you are interested. As for why calloc does the memset on the first call but not on subsequent ones, I don't know. But I feel fairly confident that this explains the behavior you have asked about.
As for why the memset-zeroed array shows improved performance, my guess is that values are loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference you asked about is that the two calloc calls behave differently when they are executed.
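For the clock() check in the list above, the per-call timing can be done with a few lines around the question's existing macro (a sketch reusing ALLOC and clock(); clock()'s resolution is coarse, but the first calloc's memset is large enough to show up):

/* Sketch: time each ALLOC call separately. */
clock_t t0, t1;

t0 = clock();
int *arr1 = ALLOC(sizeof(int), SIZE);
t1 = clock();
printf("Alloc arr1: %.6f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

t0 = clock();
int *arr2 = ALLOC(sizeof(int), SIZE);
t1 = clock();
printf("Alloc arr2: %.6f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);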

It's just a matter of when the process memory image expands by a page.

Summary: The time difference is explained by analysing the time it takes to allocate the arrays. The last calloc takes just a bit more time, whereas the others (or all, when using mmap) take virtually no time. The real allocation in memory is probably deferred until first access.
I don't know enough about the internals of memory allocation on Linux. But I ran your script slightly modified: I've added a third array and some extra iterations per array operation. And I have taken into account the remark of Old Pro that the time to allocate the arrays was not taken into account.
Conclusion: Using calloc takes longer than using mmap for the allocation (mmap uses virtually no time when you allocate the memory; the allocation is probably postponed until first access), and using my program there is almost no difference in the end between using mmap or calloc for the overall program execution.
Anyway, first remark: both memory allocations happen in the memory-mapping region and not in the heap. To verify this, I've added a quick 'n' dirty pause so you can check the memory mapping of the process (/proc/[pid]/maps).
Now to your question: the last array allocated with calloc seems to be really allocated in memory (not postponed), as arr1 and arr2 now behave exactly the same (the first iteration is slow, subsequent iterations are faster). arr3 is faster for the first iteration because the memory was allocated earlier. When using the A macro, it is arr1 which benefits from this. My guess would be that the kernel has preallocated the array in memory for the last calloc. Why? I don't know... I've tested it also with only one array (so I removed all occurrences of arr2 and arr3), and then I get the same time (roughly) for all 10 iterations of arr1.
Both malloc and mmap behave the same (results not shown below): the first iteration is slow, subsequent iterations are faster, for all 3 arrays.
Note: all results were consistent across the various gcc optimization flags (-O0 to -O3), so it doesn't look like the root of the behaviour derives from some kind of gcc optimization.
Note 2: Test run on Ubuntu Precise Pangolin (kernel 3.2), with GCC 4.6.3.
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE 500002816
#define ITERATION 10

#if defined(USE_MMAP)
# define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#elif defined(USE_MALLOC)
# define ALLOC(a, b) (malloc(b * a))
#elif defined(USE_CALLOC)
# define ALLOC calloc
#else
# error "No alloc routine specified"
#endif

int main() {
    clock_t start, finish, gstart, gfinish;
    start = clock();
    gstart = start;
#ifdef A
    unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
    unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
    unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
#else
    unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
    unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
    unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
#endif
    finish = clock();

    unsigned int i, j;
    double intermed, finalres;

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    printf("Time to create: %.2f\n", intermed);
    printf("arr1 addr: %p\narr2 addr: %p\narr3 addr: %p\n", arr1, arr2, arr3);

    finalres = 0;
    for (j = 0; j < ITERATION; j++)
    {
        start = clock();
        {
            for (i = 0; i < SIZE; i++)
                arr1[i] = (i + 13) * 5;
        }
        finish = clock();
        intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
        finalres += intermed;
        printf("Time A: %.2f\n", intermed);
    }
    printf("Time A (average): %.2f\n", finalres/ITERATION);

    finalres = 0;
    for (j = 0; j < ITERATION; j++)
    {
        start = clock();
        {
            for (i = 0; i < SIZE; i++)
                arr2[i] = (i + 13) * 5;
        }
        finish = clock();
        intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
        finalres += intermed;
        printf("Time B: %.2f\n", intermed);
    }
    printf("Time B (average): %.2f\n", finalres/ITERATION);

    finalres = 0;
    for (j = 0; j < ITERATION; j++)
    {
        start = clock();
        {
            for (i = 0; i < SIZE; i++)
                arr3[i] = (i + 13) * 5;
        }
        finish = clock();
        intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
        finalres += intermed;
        printf("Time C: %.2f\n", intermed);
    }
    printf("Time C (average): %.2f\n", finalres/ITERATION);

    gfinish = clock();
    intermed = ((double)(gfinish - gstart))/CLOCKS_PER_SEC;
    printf("Global Time: %.2f\n", intermed);
    return 0;
}
Results:
Using USE_CALLOC
Time to create: 0.13
arr1 addr: 0x7fabcb4a6000
arr2 addr: 0x7fabe917d000
arr3 addr: 0x7fac06e54000
Time A: 0.67
Time A: 0.48
...
Time A: 0.47
Time A (average): 0.48
Time B: 0.63
Time B: 0.47
...
Time B: 0.48
Time B (average): 0.48
Time C: 0.45
...
Time C: 0.46
Time C (average): 0.46
With USE_CALLOC and A
Time to create: 0.13
arr1 addr: 0x7fc2fa206010
arr2 addr: 0x7fc2dc52e010
arr3 addr: 0x7fc2be856010
Time A: 0.44
...
Time A: 0.43
Time A (average): 0.45
Time B: 0.65
Time B: 0.47
...
Time B: 0.46
Time B (average): 0.48
Time C: 0.65
Time C: 0.48
...
Time C: 0.45
Time C (average): 0.48
Using USE_MMAP
Time to create: 0.0
arr1 addr: 0x7fe6332b7000
arr2 addr: 0x7fe650f8e000
arr3 addr: 0x7fe66ec65000
Time A: 0.55
Time A: 0.48
...
Time A: 0.45
Time A (average): 0.49
Time B: 0.54
Time B: 0.46
...
Time B: 0.49
Time B (average): 0.50
Time C: 0.57
...
Time C: 0.40
Time C (average): 0.43

Related

Determine NUMA layout via latency/performance measurements

Recently I have been observing performance effects in memory-intensive workloads that I was unable to explain. Trying to get to the bottom of this, I started running several microbenchmarks in order to determine common performance parameters like cache line size and L1/L2/L3 cache size (I knew them already; I just wanted to see if my measurements reflected the actual values).
For the cache line test my code roughly looks as follows (Linux C, but the concept is similar to Windows etc., of course):
char *array = malloc (ARRAY_SIZE);
int count = ARRAY_SIZE / STEP;
clock_gettime(CLOCK_REALTIME, &start_time);
for (int i = 0; i < ARRAY_SIZE; i += STEP) {
    array[i]++;
}
clock_gettime(CLOCK_REALTIME, &end_time);
// calculate time per element here:
[..]
Varying STEP from 1 to 128 shows that from STEP=64 on, the time per element did not increase further, i.e. every iteration needs to fetch a new cache line, dominating the runtime.
Varying ARRAY_SIZE from 1K to 16384K with STEP=64, I was able to create a nice plot exhibiting a step pattern that roughly corresponds to L1, L2 and L3 latency. It was necessary to repeat the for loop a number of times (for very small array sizes even 100,000s of times) to get reliable numbers, though. Then, on my Ivy Bridge notebook I can clearly see L1 ending at 64K, L2 at 256K and even L3 at 6M.
Now on to my real question: in a NUMA system, any single core will obtain remote main memory and even shared cache that is not necessarily as close as its local cache and memory. I was hoping to see a difference in latency/performance, thus determining how much memory I can allocate while staying within my fast caches/part of memory.
For this, I refined my test to walk through the memory in 1/10 MB chunks, measuring the latency separately, and later collecting the fastest chunks, roughly like this:
for (int chunk_start = 0; chunk_start < ARRAY_SIZE; chunk_start += CHUNK_SIZE) {
    int chunk_end = MIN (ARRAY_SIZE, chunk_start + CHUNK_SIZE);
    int chunk_els = CHUNK_SIZE / STEP;
    for (int i = chunk_start; i < chunk_end; i += STEP) {
        array[i]++;
    }
    // calculate time per element
    [..]
}
As soon as I start increasing ARRAY_SIZE to something larger than the L3 size, I get wildly unreliable numbers that not even a large number of repeats is able to even out. There is no way I can make out a pattern usable for performance evaluation with this, let alone determine where exactly a NUMA stripe starts, ends or is located.
Then I figured the hardware prefetcher is smart enough to recognize my simple access pattern and simply fetch the needed lines into the cache before I access them. Adding a random number to the array index increases the time per element, but did not seem to help much otherwise, probably because I had a rand() call every iteration. Precomputing some random values and storing them in an array did not seem a good idea to me, as this array would also be stored in a hot cache and skew my measurements. Increasing STEP to 4097 or 8193 did not help much either; the prefetcher must be smarter than me.
Is my approach sensible/viable, or did I miss the larger picture? Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?
I disabled address space randomization just to be sure and to preclude strange cache aliasing effects. Is there something else operating-system-wise that has to be tuned before measuring?
Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?
Memory allocators are NUMA-aware, so by default you will not observe any NUMA effects until you explicitly ask to allocate memory on another node. The simplest way to achieve the effect is numactl(8). Just run your application on one node and bind memory allocations to another, like so:
numactl --cpunodebind 0 --membind 1 ./my-benchmark
See also numa_alloc_onnode(3).
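For binding allocations from within the program itself, libnuma offers the same control (a sketch; link with -lnuma, and the node numbers are machine-dependent):

/* Sketch: place a buffer on a specific NUMA node with libnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return EXIT_FAILURE;
    }
    size_t sz = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(sz, 1);   /* memory bound to node 1 */
    if (buf == NULL)
        return EXIT_FAILURE;
    /* ... run the latency walk over buf while pinned to node 0 ... */
    numa_free(buf, sz);
    return EXIT_SUCCESS;
}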
Is there something else operating-system-wise that has to be tuned before measuring?
Turn off CPU frequency scaling, otherwise your measurements might be noisy:
find '/sys/devices/system/cpu/' -name 'scaling_governor' | while read F; do
    echo "==> ${F}"
    echo "performance" | sudo tee "${F}" > /dev/null
done
Now, regarding the test itself: sure, to measure latency the access pattern must be (pseudo) random; otherwise your measurements will be contaminated with fast cache hits.
Here is an example how you could achieve this:
Data Initialization
Fill the array with random numbers:
static void random_data_init()
{
    for (size_t i = 0; i < ARR_SZ; i++) {
        arr[i] = rand();
    }
}
Benchmark
Perform 1M operations per benchmark iteration to reduce measurement noise. Use the random numbers stored in the array to jump over a few cache lines:
const size_t OPERATIONS = 1 * 1000 * 1000; // 1M operations per iteration

int random_step_sizeK(size_t size)
{
    size_t idx = 0;
    for (size_t i = 0; i < OPERATIONS; i++) {
        arr[idx & (size - 1)]++;
        idx += arr[idx & (size - 1)] * 64; // assuming cache line is 64B
    }
    return 0;
}
Results
Here are the results for an i5-4460 CPU @ 3.20GHz:
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
random_step_sizeK/4 4217004 ns 4216880 ns 166
random_step_sizeK/8 4146458 ns 4146227 ns 168
random_step_sizeK/16 4188168 ns 4187700 ns 168
random_step_sizeK/32 4180545 ns 4179946 ns 163
random_step_sizeK/64 5420788 ns 5420140 ns 129
random_step_sizeK/128 6187776 ns 6187337 ns 112
random_step_sizeK/256 7856840 ns 7856549 ns 89
random_step_sizeK/512 11311684 ns 11311258 ns 57
random_step_sizeK/1024 13634351 ns 13633856 ns 51
random_step_sizeK/2048 16922005 ns 16921141 ns 48
random_step_sizeK/4096 15263547 ns 15260469 ns 41
random_step_sizeK/6144 15262491 ns 15260913 ns 46
random_step_sizeK/8192 45484456 ns 45482016 ns 23
random_step_sizeK/16384 54070435 ns 54064053 ns 14
random_step_sizeK/32768 59277722 ns 59273523 ns 11
random_step_sizeK/65536 63676848 ns 63674236 ns 10
random_step_sizeK/131072 66383037 ns 66380687 ns 11
There are obvious steps between 32K/64K (so my L1 cache is ~32K), 256K/512K (so my L2 cache size is ~256K) and 6144K/8192K (so my L3 cache size is ~6M).

"-Nan" value for the total sum of array elements with GPU code

I am working on an OpenCL code which computes the sum of array elements. Everything works fine up to a size of 1.024 * 1e+8 for the 1D input array, but with 1.024 * 1e+9 the final value is "-NaN".
Here's the source of the code on this link
The Kernel code is on this link
and the Makefile on this link
Here's the result for the last array size that works (1.024 * 1e+8):
$ ./sumReductionGPU 102400000
Max WorkGroup size = 4100
Number of WorkGroups = 800000
Problem size = 102400000
Final Sum Sequential = 5.2428800512000000000e+15
Final Sum GPU = 5.2428800512000000000e+15
Initializing Arrays : Wall Clock = 0 second 673785 micro
Preparing GPU/OpenCL : Wall Clock = 1 second 925451 micro
Time for one NDRangeKernel call and WorkGroups final Sum : Wall Clock = 0 second 30511 micro
Time for Sequential Sum computing : Wall Clock = 0 second 398485 micro
I have taken local_item_size = 128, so as indicated above, I have 800000 work-groups for NWorkItems = 1.024 * 1e+8.
Now if I take 1.024 * 1e+9, the partial sums are no longer computed; I get a "-nan" value for the total sum of array elements.
$ ./sumReductionGPU 1024000000
Max WorkGroup size = 4100
Number of WorkGroups = 8000000
Problem size = 1024000000
Final Sum Sequential = 5.2428800006710899200e+17
Final Sum GPU = -nan
Initializing Arrays : Wall Clock = 24 second 360088 micro
Preparing GPU/OpenCL : Wall Clock = 19 second 494640 micro
Time for one NDRangeKernel call and WorkGroups final Sum : Wall Clock = 0 second 481910 micro
Time for Sequential Sum computing : Wall Clock = 166 second 214384 micro
Maybe I have reached the limit of what the GPU can compute, but I would like your advice to confirm this.
If a double is 8 bytes, this will require 1.024 * 1e9 * 8 ~ 8 GB for the input array: isn't that too much? I have only 8 GB of RAM.
From your experience, where could this issue come from?
Thanks
As you already found out, your 1D input array requires a lot of memory. Thus the memory allocations with malloc or clCreateBuffer are prone to fail.
For the malloc, I suggest using a helper function checked_malloc which detects a failed memory allocation, prints a message and exits the program.
#include <stdlib.h>
#include <stdio.h>

void * checked_malloc(size_t size, const char purpose[]) {
    void *result = malloc(size);
    if(result == NULL) {
        fprintf(stderr, "ERROR: malloc failed for %s\n", purpose);
        exit(1);
    }
    return result;
}

int main()
{
    double *p1 = checked_malloc(1e8 * sizeof *p1, "array1");
    double *p2 = checked_malloc(64 * 1e9 * sizeof *p2, "array2");
    return 0;
}
On my PC, which has only 48 GB of virtual memory, the second allocation fails and the program prints:
ERROR: malloc failed for array2
You can apply this scheme to clCreateBuffer as well. But you have to check the result of every OpenCL call anyway, so I recommend using a macro for this:
#define CHECK_CL_ERROR(result) if(result != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL call failed at: %s:%d with code %d\n", __FILE__, __LINE__, result); }
An example usage would be:
cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                    nWorkItems * sizeof(double), NULL, &ret);
CHECK_CL_ERROR(ret);

Measurement of TLB effects on a Cortex-A9

After reading the paper https://people.freebsd.org/~lstewart/articles/cpumemory.pdf ("What Every Programmer Should Know About Memory"), I wanted to try one of the author's tests, that is, measuring the effects of the TLB on the final execution time.
I am working on a Samsung Galaxy S3 that embeds a Cortex-A9.
According to the documentation:
we have two micro TLBs, one each for the instruction and data sides of L1 (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)
the main TLB is located at the L2 level (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)
the data micro TLB has 32 entries (the instruction micro TLB has either 32 or 64 entries)
L1 size == 32 KB
L1 cache line == 32 bytes
L2 size == 1 MB
I wrote a small program that allocates an array of structs with N entries. Each entry's size is 32 bytes, so it fits in a cache line.
I perform several read accesses and I measure the execution time.
typedef struct {
int elmt; // sizeof(int) == 4 bytes
char padding[28]; // 4 + 28 = 32B == cache line size
}entry;
volatile entry ** entries = NULL;
//Allocate memory and init to 0
entries = calloc(NB_ENTRIES, sizeof(entry *));
if(entries == NULL) perror("calloc failed"); exit(1);
for(i = 0; i < NB_ENTRIES; i++)
{
entries[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
}
entries[LAST_ELEMENT]->elmt = -1
//Randomly access and init with random values
n = -1;
i = 0;
while(++n < NB_ENTRIES -1)
{
//init with random value
entries[i]->elmt = rand() % NB_ENTRIES;
//loop till we reach the last element
while(entries[entries[i]->elmt]->elmt != -1)
{
entries[i]->elmt++;
if(entries[i]->elmt == NB_ENTRIES)
entries[i]->elmt = 0;
}
i = entries[i]->elmt;
}
gettimeofday(&tStart, NULL);
for(i = 0; i < NB_LOOPS; i++)
{
j = 0;
while(j != -1)
{
j = entries[j]->elmt
}
}
gettimeofday(&tEnd, NULL);
time = (tEnd.tv_sec - tStart.tv_sec);
time *= 1000000;
time += tEnd.tv_usec - tStart.tv_usec;
time *= 100000
time /= (NB_ENTRIES * NBLOOPS);
fprintf(stdout, "%d %3lld.%02lld\n", NB_ENTRIES, time / 100, time % 100);
I have an outer loop that makes NB_ENTRIES vary from 4 to 1024.
As one can see in the figure below, when NB_ENTRIES == 256, the execution time is longer.
When NB_ENTRIES == 404 I get an "out of memory" error (why? micro TLBs exceeded? main TLB exceeded? page tables exceeded? virtual memory for the process exceeded?).
Can someone please explain to me what is really going on from 4 to 256 entries, and then from 257 to 404 entries?
EDIT 1
As suggested, I ran membench (src code), with the results below:
EDIT 2
In the following paper (page 3) they ran (I suppose) the same benchmark. But the different steps are clearly visible from their plots, which is not the case for mine.
Right now, according to their results and explanations, I can only identify a few things.
The plots confirm that the L1 cache line size is 32 bytes because, as they said,
"once the array size exceeds the size of the data cache (32KB), the reads begin to generate misses [...] an inflection point occurs when every read generates a miss".
In my case the very first inflection point appears when stride == 32 bytes.
The graph shows that we have a second-level (L2) cache. I think it is depicted by the yellow line (1MB == L2 size).
Therefore the two last plots above the latter probably reflect the latency when accessing main memory (+ TLB?).
However, from this benchmark I am not able to identify:
the cache associativity. Normally the D-cache and I-cache are 4-way associative (Cortex-A9 TRM);
the TLB effects. As they said,
"in most systems, a secondary increase in latency is indicative of the TLB, which caches a limited number of virtual to physical translations. [...] The absence of a rise in latency attributable to TLB indicates that [...]"
large page sizes have probably been used/implemented.
EDIT 3
This link explains the TLB effects from another membench graph. One can actually observe the same effects on my graph.
"On a 4KB page system, as you grow your strides, while they're still < 4K, you'll enjoy less and less utilization of each page [...] you'll have to access the 2nd level TLB on each access [...]"
The Cortex-A9 supports a 4KB page mode. Indeed, as one can see in my graph, latencies increase up to strides of 4K; then, when the stride reaches 4K,
"you suddenly start benefiting again since you're actually skipping whole pages."
tl;dr -> Provide a proper MCVE.
This answer should be a comment, but it is too big to be posted as a comment, so posting as an answer instead:
I had to fix a bunch of syntax errors (missing semicolons) and declare undefined variables.
After fixing all those problems, the code did NOTHING (the program quit even prior to executing the first mmap). My tip: use curly brackets all the time. Here are your first and second errors, both caused by NOT doing so:
// after calloc:
if(entries == NULL) perror("calloc failed"); exit(1);
// after mmap
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
both lines just terminate your program regardless of the condition.
Here you've got an endless loop (reformatted, curly brackets added, but no other change):
//Randomly access and init with random values
n = -1;
i = 0;
while (++n < NB_ENTRIES -1) {
//init with random value
entries[i]->elmt = rand() % NB_ENTRIES;
//loop till we reach the last element
while (entries[entries[i]->elmt]->elmt != -1) {
entries[i]->elmt++;
if (entries[i]->elmt == NB_ENTRIES) {
entries[i]->elmt = 0;
}
}
i = entries[i]->elmt;
}
The first iteration starts by setting entries[0]->elmt to some random value; then the inner loop increments until it reaches LAST_ELEMENT. Then i is set to that value (i.e. LAST_ELEMENT), and the second iteration overwrites the end marker -1 with some other random value. After that, it is endlessly incremented mod NB_ENTRIES in the inner loop until you hit CTRL+C.
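As an aside, a common way to build the kind of random pointer chain the poster seems to be after, without the pitfalls above, is Sattolo's algorithm, which shuffles an identity mapping into a single cycle (a sketch, not the poster's code):

/* Sketch: Sattolo's algorithm builds one cycle over next[0..n-1],
   so chasing i = next[i] visits every element before wrapping. */
#include <stdlib.h>

static void build_chain(int *next, int n)
{
    for (int i = 0; i < n; i++)
        next[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % i;           /* j < i guarantees a single cycle */
        int tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}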
Conclusion
If you want help, then post a Minimal, Complete, and Verifiable example and not something else.

Why do very large stack allocations fail despite unlimited ulimit?

The following static allocation gives a segmentation fault:
double U[100][2048][2048];
But the following dynamic allocation goes fine
double ***U = (double ***)malloc(100 * sizeof(double **));
for(i = 0; i < 100; i++)
{
    U[i] = (double **)malloc(2048 * sizeof(double *));
    for(j = 0; j < 2048; j++)
    {
        U[i][j] = (double *)malloc(2048 * sizeof(double));
    }
}
The ulimit is set to unlimited in Linux.
Can anyone give me a hint on what's happening?
When you say the ulimit is set to unlimited, are you using the -s option? Otherwise this doesn't change the stack limit, only the file size limit.
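For example (a sketch of a bash session; the 8192 kB soft limit shown is typical but distribution-dependent):
~ $ ulimit -s              # stack limit, in kB
8192
~ $ ulimit -s unlimited    # lifts the stack limit for this shell and its children
~ $ ulimit                 # with no option, this reports -f, the file size limit
unlimited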
There appear to be stack limits regardless, though. I can allocate:
double *u = malloc(200*2048*2048*(sizeof(double))); // 6gb contiguous memory
And running the binary I get:
VmData: 6553660 kB
However, if I allocate on the stack, it's:
double u[200][2048][2048];
VmStk: 2359308 kB
Which is clearly not correct (suggesting overflow). With the original allocations, the two give the same results:
Array: VmStk: 3276820 kB
malloc: VmData: 3276860 kB
However, running the stack version, I cannot generate a segfault no matter what the size of the array -- even if it's more than the total memory actually on the system, if -s unlimited is set.
EDIT:
I did a test with malloc in a loop until it failed:
VmData: 137435723384 kB // my system doesn't quite have 131068gb RAM
Stack usage never gets above 4gb, however.
Assuming your machine actually has enough free memory to allocate 3.125 GiB of data, the difference most likely lies in the fact that the static allocation needs all of this memory to be contiguous (it's actually a 3-dimensional array), while the dynamic allocation only needs contiguous blocks of about 2048*8 = 16 KiB (it's an array of pointers to arrays of pointers to quite small actual arrays).
It is also possible that your operating system uses swap files for heap memory when it runs out, but not for stack memory.
There is a very good discussion of Linux memory management, and specifically the stack, here: 9.7 Stack overflow; it is worth the read.
You can use this command to find out what your current stack soft limit is
ulimit -s
On Mac OS X the hard limit is 64MB, see How to change the stack size using ulimit or per process on Mac OS X for a C or Ruby program?
You can modify the stack limit at run-time from your program, see Change stack size for a C++ application in Linux during compilation with GNU compiler
I combined your code with the sample there; here's a working program:
#include <stdio.h>
#include <sys/resource.h>

unsigned myrand() {
    static unsigned x = 1;
    return (x = x * 1664525 + 1013904223);
}

void increase_stack( rlim_t stack_size )
{
    rlim_t MIN_STACK = 1024 * 1024;
    stack_size += MIN_STACK;

    struct rlimit rl;
    int result;
    result = getrlimit(RLIMIT_STACK, &rl);
    if (result == 0)
    {
        if (rl.rlim_cur < stack_size)
        {
            rl.rlim_cur = stack_size;
            result = setrlimit(RLIMIT_STACK, &rl);
            if (result != 0)
            {
                fprintf(stderr, "setrlimit returned result = %d\n", result);
            }
        }
    }
}

void my_func() {
    double U[100][2048][2048];
    int i,j,k;
    for(i=0;i<100;++i)
        for(j=0;j<2048;++j)
            for(k=0;k<2048;++k)
                U[i][j][k] = myrand();
    double sum = 0;
    int n;
    for(n=0;n<1000;++n)
        sum += U[myrand()%100][myrand()%2048][myrand()%2048];
    printf("sum=%g\n",sum);
}

int main() {
    increase_stack( sizeof(double) * 100 * 2048 * 2048 );
    my_func();
    return 0;
}
You are hitting the limit of the stack. By default on Windows the stack is 1 MB but can grow more if there is enough memory.
On many *nix systems the default stack size is 512 KB.
You are trying to allocate 2048 * 2048 * 100 * 8 bytes on the stack, which is over 3 GB. If you have a lot of virtual memory available and still want to allocate this on the stack, use a different stack limit while linking the application.
Linux:
How to increase the gcc executable stack size?
Change stack size for a C++ application in Linux during compilation with GNU compiler
Windows:
http://msdn.microsoft.com/en-us/library/tdkhxaks%28v=vs.110%29.aspx

sysinfo returns incorrect value for freeram (even with mem_unit)

My full C MATE applet can be found at github here: https://github.com/geniass/mate-resource-applet/tree/cmake (BRANCH CMAKE). It's a hacky mess right now, so see the code below.
I couldn't find an applet to display my computer's free ram, so that's basically what this is. I am using sysinfo to get this information, and it works fine for my system's total ram (roughly 4GB, it shows 3954 MB). htop shows 3157 MB used of 3954 MB.
However, the value sysinfo gives for free ram (136 MB) is obviously wrong (if free ram is ram that hasn't been allocated or something, I don't know).
This question is the same problem, but the solution, involving mem_unit, doesn't work because mem_unit = 1 on my system.
Here's a minimal program that gives the same values:
#include <stdio.h>
#include <sys/sysinfo.h>
int main() {
    /* Conversion constants. */
    const long minute = 60;
    const long hour = minute * 60;
    const long day = hour * 24;
    const double megabyte = 1024 * 1024;

    /* Obtain system statistics. */
    struct sysinfo si;
    sysinfo (&si);

    /* Summarize interesting values. */
    printf ("system uptime : %ld days, %ld:%02ld:%02ld\n",
            si.uptime / day, (si.uptime % day) / hour,
            (si.uptime % hour) / minute, si.uptime % minute);
    printf ("total RAM : %5.1f MB\n", si.totalram / megabyte);
    printf ("free RAM : %5.1f MB\n", si.freeram / megabyte);
    printf ("mem_unit: : %u\n", si.mem_unit);
    printf ("process count : %d\n", si.procs);
    return 0;
}
Output:
system uptime : 0 days, 10:25:18
total RAM : 3953.9 MB
free RAM : 162.1 MB
mem_unit: : 1
process count : 531
What's going on here? Is freeram not what I think it is?
Linux doesn't like to leave memory sitting unused. When there is free memory available, it temporarily uses it for caches and buffers. The free memory then seems very low, but as soon as a program requires memory, Linux reduces its cache/buffer usage and gives the program what it wants.
The Linux command free -m displays the state of the memory, cache and buffers.
See this link for example and detailed information.
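A closer estimate of what is actually usable can be had by adding the buffers that sysinfo does report, or better, by reading MemAvailable from /proc/meminfo (a sketch; the MemAvailable field needs kernel 3.14 or later):

/* Sketch: estimate usable memory. sysinfo exposes bufferram but not
   the page cache, so MemAvailable from /proc/meminfo (kernel >= 3.14)
   is the more accurate figure. */
#include <stdio.h>
#include <string.h>
#include <sys/sysinfo.h>

int main() {
    const double megabyte = 1024 * 1024;
    struct sysinfo si;
    if (sysinfo(&si) == 0)
        printf("free + buffers : %5.1f MB\n",
               (si.freeram + si.bufferram) * (double)si.mem_unit / megabyte);

    char line[128];
    FILE *f = fopen("/proc/meminfo", "r");
    if (f != NULL) {
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "MemAvailable:", 13) == 0)
                fputs(line, stdout);
        fclose(f);
    }
    return 0;
}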
