Is 'malloc' possibly a bottleneck in my program? - c

My program calls malloc 10'000 times a second. I have absolutely no idea how long a malloc call takes.
As long as an uncontested mutex lock? (10-100 ns)
As long as compressing 1kb of data? (1-10 us)
As long as an SSD random read? (100-1000 us)
As long as a network transfer to Amsterdam? (10-100ms)
Instead of spending two hours to investigate this, only to find out that it is absolutely dwarfed by some other thing my program does, I would like to get a rough idea of what to expect. Ballpark. Not precise. Off by factor 10 does not matter at all.
The following picture was upvoted 200 times here:

To state the obvious first: profiling for specific use cases is always required. However, this question asked for a rough general ballpark approximation guesstimate of the order of magnitude. That's something we do when we don't know if we should even think about a problem. Do I need to worry about my data being in cache when it is then sent to Amsterdam? Looking at the picture in the question, the answer is a resounding No. Yes, it could be a problem, but only if you messed up big. We assume that case to be ruled out and instead discuss the problem in probabilistic generality.
It may be ironic that the question arose when I was working on a program that cares very much about small details, where a performance difference of a few percent translates into millions of CPU hours. Profiling suggested malloc was not an issue, but before dismissing it outright, I wanted to sanity check: Is it theoretically plausible that malloc is a bottleneck?
As repeatedly suggested in a closed, earlier version of the question, there are large differences between environments.
I tried various machines (Intel: i7-8700K, i5-5670, some early-generation mobile i7 in a laptop; AMD: Ryzen 4300G, Ryzen 3900X), various OSes (Windows 10, Debian, Ubuntu) and compilers (gcc, clang-14, cygwin-g++, MSVC; no debug builds).
I've used this to get an idea about the characteristics(*), using just 1 thread:
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    const size_t allocs = 10;
    const size_t repeats = 10000;
    printf("chunk\tms\tM1/s\tGB/s\tcheck\n");
    for (size_t size = 16; size < 10 * 1000 * 1000; size *= 2) {
        float t0 = (float)clock() / CLOCKS_PER_SEC;
        size_t check = 0;
        for (size_t repeat = 0; repeat < repeats; ++repeat) {
            char* ps[allocs];
            for (size_t i = 0; i < allocs; i++) {
                ps[i] = malloc(size);
                if (!ps[i]) {
                    exit(1);
                }
                /* actually touch the allocated memory (one byte per 512) */
                for (size_t touch = 0; touch < size; touch += 512) {
                    ps[i][touch] = 1;
                }
            }
            for (size_t i = 0; i < allocs; i++) {
                check += ps[i][0];
                free(ps[i]);
            }
        }
        float dt = (float)clock() / CLOCKS_PER_SEC - t0;
        printf("%zu\t%1.5f\t%7.3f\t%7.1f\t%zu\n",
               size,
               dt / allocs / repeats * 1000,
               allocs / dt * repeats / 1000 / 1000,
               allocs / dt * repeats * size / 1024 / 1024 / 1024,
               check);
    }
}
The variance is stark but, as expected, the values still land in the same ballpark.
(*) The following table is representative; the others were off by less than a factor of 10.
chunk ms M1/s GB/s check
16 0.00003 38.052 0.6 100000
32 0.00003 37.736 1.1 100000
64 0.00003 37.651 2.2 100000
128 0.00004 24.931 3.0 100000
256 0.00004 26.991 6.4 100000
512 0.00004 26.427 12.6 100000
1024 0.00004 24.814 23.7 100000
2048 0.00007 15.256 29.1 100000
4096 0.00007 14.633 55.8 100000
8192 0.00008 12.940 98.7 100000
16384 0.00066 1.511 23.1 100000
32768 0.00271 0.369 11.3 100000
65536 0.00707 0.141 8.6 100000
131072 0.01594 0.063 7.7 100000
262144 0.04401 0.023 5.5 100000
524288 0.11226 0.009 4.3 100000
1048576 0.25546 0.004 3.8 100000
2097152 0.52395 0.002 3.7 100000
4194304 0.80179 0.001 4.9 100000
8388608 1.78242 0.001 4.4 100000
Here's one from a 3900X on cygwin-g++. You can clearly see the larger CPU cache, and after that, the higher memory throughput.
chunk ms M1/s GB/s check
16 0.00004 25.000 0.4 100000
32 0.00005 20.000 0.6 100000
64 0.00004 25.000 1.5 100000
128 0.00004 25.000 3.0 100000
256 0.00004 25.000 6.0 100000
512 0.00005 20.000 9.5 100000
1024 0.00004 25.000 23.8 100000
2048 0.00005 20.000 38.1 100000
4096 0.00005 20.000 76.3 100000
8192 0.00010 10.000 76.3 100000
16384 0.00015 6.667 101.7 100000
32768 0.00077 1.299 39.6 100000
65536 0.00039 2.564 156.5 100000
131072 0.00067 1.493 182.2 100000
262144 0.00093 1.075 262.5 100000
524288 0.02679 0.037 18.2 100000
1048576 0.14183 0.007 6.9 100000
2097152 0.26805 0.004 7.3 100000
4194304 0.51644 0.002 7.6 100000
8388608 1.01604 0.001 7.7 100000
So what gives?
With small chunk sizes, 10 million or more calls per second are possible even on old commodity hardware.
Once sizes go beyond the CPU caches, i.e. roughly 1 to 100 MB, RAM access quickly dominates (I did not test malloc without actually using the chunks).
Depending on the sizes you malloc, one or the other will be the (ballpark) limit.
However, with something like 10k allocs per second, this is something you can likely ignore for the time being.
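To put that in perspective, here is a back-of-the-envelope budget (my own illustration; the 100 ns per call figure is an assumption in the small-chunk ballpark measured above):
#include <stdio.h>

// Rough budget: what fraction of one core do N malloc/free pairs per second
// cost if each pair takes roughly t nanoseconds? Both numbers below are
// assumptions: 10,000 calls/s from the question, 100 ns/call as a pessimistic
// small-chunk guess based on the measurements above.
int main(void) {
    const double calls_per_second = 10000.0;
    const double ns_per_call = 100.0;
    const double busy_ns_per_second = calls_per_second * ns_per_call;
    printf("malloc/free budget: %.0f ns per second = %.4f%% of one core\n",
           busy_ns_per_second, busy_ns_per_second / 1e9 * 100.0);
    return 0;
}
Even if a malloc/free pair were ten times slower than that, 10,000 calls per second would still cost only about 1% of a single core.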

Related

How does the CPU cache affect the performance of a C program

I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.
// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    double sum = 0.0;
    for (uint64_t i = 0; i < ni; ++i) {
        sum += data[i][0];
    }
    return sum;
}

int compare(void const* _a, void const* _b)
{
    double const a = *((double*)_a);
    double const b = *((double*)_b);
    return (a > b) - (a < b);
}

int main(int argc, char** argv)
{
    // set sizes
    assert(argc == 2);
    uint64_t const iter_max = 101;
    uint64_t const ni = 1000000;
    uint64_t const nj = strtol(argv[1], 0, 10);

    // initialize data
    double(*data)[nj] = calloc(ni, sizeof(*data));
    for (uint64_t i = 0; i < ni; ++i) {
        for (uint64_t j = 0; j < nj; ++j) {
            data[i][j] = rand() / (double)RAND_MAX;
        }
    }

    // test performance
    double* dt = calloc(iter_max, sizeof(*dt));
    double const sum0 = sum_column(ni, nj, data);
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        clock_t const t_start = clock();
        double const sum = sum_column(ni, nj, data);
        clock_t const t_stop = clock();
        assert(sum == sum0);
        dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
    }

    // sort dt
    qsort(dt, iter_max, sizeof(*dt), compare);

    // compute mean dt
    double dt_mean = 0.0;
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        dt_mean += dt[iter];
    }
    dt_mean /= iter_max;

    // print results
    printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
           dt[iter_max - 1]);

    // free memory
    free(data);
}
However, the results are not quite how I would expect them to be:
As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 bytes on my machine). This would explain why, with growing nj, the time to solution first increases linearly and then levels out at some value. If nj == 1, one load places the next 7 values in the cache, and thus we only need to load from RAM for every 8th value. If nj == 2, following the same logic, we need to access the RAM for every 4th value. After some size, we will have to access the RAM for every value, which should result in the performance leveling out. My guess for why the linear section of the graph extends beyond 4 is that in reality there are multiple levels of cache at work here, and the way that values end up in these caches is a little more complex than what I explained here.
What I cannot explain is why there are these performance peaks at multiples of 16.
After thinking about this question for a bit, I decided to check if this also occurs for higher values of nj:
In fact, it does. And, there is more: Why does the performance increase again after ~250?
Could someone explain to me, or point me to some appropriate reference, why there are these peaks and why the performance increases for higher values of nj?
If you would like to try the code for yourself, I will also attach my plotting script, for your convenience:
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt("out.csv")
data[:,1:] /= data[0,1]
dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
    if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
        plt.axvline(data[i + 1,0], color='gray', linestyle='--')
        plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
                 rotation=0, ha="center", va="center",
                 bbox=dict(boxstyle="round", ec='gray', fc='w'))
plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")
The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with cache lines of 64 bytes (typically an x86 one).
I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:
First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the accesses are contiguous. In this case, the processor can efficiently prefetch the cache lines from DRAM and so hide its high latency. There is also good spatial cache locality since many contiguous items share the same cache line.
When nj = 2, only half of each cache line is used. This means the number of requested cache lines is twice as big for the same number of operations. That being said, the time is not much bigger, because of the relatively high latency of adding two floating-point numbers, which makes the code compute-bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel. Note that most processors can also load multiple values from the cache per cycle.
When nj = 4, a new cache line is requested every 2 cycles (since a double takes 8 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One might expect the time to be stable for nj >= 8 since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the huge DRAM latency in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK such a prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, so when nj >= 64). This explains why the timings increase sharply for nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way set-associative, with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:
Cache 0: L1 data cache, line size 64, 8-ways, 64 sets, size 32k
Cache 1: L1 instruction cache, line size 64, 8-ways, 64 sets, size 32k
Cache 2: L2 unified cache, line size 64, 4-ways, 1024 sets, size 256k
Cache 3: L3 unified cache, line size 64, 12-ways, 12288 sets, size 9216k
This means that two values fetched from DRAM at respective addresses A1 and A2 can conflict in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In that case, the processor needs to choose which cache line to replace within a set of N = 8 cache lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations can compete for the same cache lines, resulting in excessive cache misses. More information about this can be found also in this post.
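To make the conflict condition concrete, here is a small sketch (my own illustration) that computes the L1 set index under the parameters listed above (64-byte lines, 64 sets, 8 ways) and shows how a 512-byte stride keeps hitting the same few sets:
#include <stdint.h>
#include <stdio.h>

// L1 parameters from the listing above (i5-9600KF): 64-byte lines, 64 sets,
// 8 ways => 32 KiB total. Two addresses compete for the same set of 8 lines
// when their set indices are equal.
#define LINE_SIZE 64u
#define L1_SETS   64u

static uint64_t l1_set_index(uint64_t addr) {
    return (addr / LINE_SIZE) % L1_SETS;
}

int main(void) {
    // Hypothetical stride of nj = 64 doubles (512 bytes): the set index
    // advances by 8 each access, so every 8th access lands in the same set.
    const uint64_t stride_bytes = 64 * sizeof(double);
    for (uint64_t i = 0; i < 10; ++i) {
        uint64_t addr = i * stride_bytes;
        printf("access %2llu -> set %2llu\n",
               (unsigned long long)i,
               (unsigned long long)l1_set_index(addr));
    }
    return 0;
}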
Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (i.e. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem, since the prefetcher needs a pretty large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching one cache line from DRAM is several dozen nanoseconds (e.g. ~70 ns) while the bandwidth is dozens of GiB/s (e.g. ~40 GiB/s): dozens of cache lines (e.g. ~40) must be fetched concurrently to hide the latency and saturate the DRAM.
Here is a simulation of the number of cache lines that can actually be used in my L1 cache as a function of nj:
nj #cache-lines
1 512
2 512
3 512
4 512
5 512
6 512
7 512
8 512
9 512
10 512
11 512
12 512
13 512
14 512
15 512
16 256 <----
17 512
18 512
19 512
20 512
21 512
22 512
23 512
24 512
25 512
26 512
27 512
28 512
29 512
30 512
31 512
32 128 <----
33 512
34 512
35 512
36 512
37 512
38 512
39 512
40 512
41 512
42 512
43 512
44 512
45 512
46 512
47 512
48 256 <----
49 512
50 512
51 512
52 512
53 512
54 512
55 512
56 512
57 512
58 512
59 512
60 512
61 512
62 512
63 512
64 64 <----
==============
80 256
96 128
112 256
128 32
144 256
160 128
176 256
192 64
208 256
224 128
240 256
256 16
384 32
512 8
1024 4
We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher will preload data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value then needs to be fetched again from the L2 or even the L3, resulting in slower execution. Note that the L2 cache is also subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute accesses and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
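For reference, here is a simplified model of how such a simulation can be done (my own reconstruction, not the answerer's exact code: it counts the L1 sets touched by the stride and multiplies by the associativity, which reproduces the pattern above for most strides):
#include <stdint.h>
#include <stdio.h>

// Simplified model: count the distinct L1 sets touched by the column walk
// (addresses i * nj * 8) and assume each touched set contributes all of its
// 8 ways. 64 sets * 8 ways = 512 lines total when every set is reachable.
#define LINE_SIZE 64u
#define L1_SETS   64u
#define L1_WAYS   8u

static unsigned usable_lines(uint64_t nj, uint64_t accesses) {
    unsigned char touched[L1_SETS] = {0};
    for (uint64_t i = 0; i < accesses; ++i) {
        uint64_t set = (i * nj * sizeof(double) / LINE_SIZE) % L1_SETS;
        touched[set] = 1;
    }
    unsigned sets = 0;
    for (unsigned s = 0; s < L1_SETS; ++s)
        sets += touched[s];
    return sets * L1_WAYS;
}

int main(void) {
    for (uint64_t nj = 1; nj <= 64; ++nj)
        printf("%3llu %u\n", (unsigned long long)nj,
               usable_lines(nj, 1 << 16));
    return 0;
}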
Here are the timings on my machine for some of the peaks:
32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03
As expected, the timings are overall bigger in practice in the cases where the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising, since they are significantly faster than the others. This is the case where the number of available cache lines equals the number of ways of associativity (N). My guess is that Intel processors detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).
Finally, for large nj strides, a bigger nj should result in higher overheads, mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with a bigger nj, and the number of TLB entries is limited. In fact, this is what I can observe on my machine: timings tend to increase slowly and in a very stable way, unlike on your target platform.
I cannot really explain this very strange behavior yet.
Here are some wild guesses:
The OS could tend to use more huge pages when nj is large (so as to reduce the overhead of the TLB), since wider blocks are allocated. This could result in more concurrency for the prefetcher, as AFAIK it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit memory mapping), or possibly disable them. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
If the target architecture is a NUMA one (e.g. new AMD processors, or a server with multiple processors each having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance due to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux to force local allocations.
For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 cache works in practice is available here.
Removing the peaks
To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetch instructions, so that cache lines are fetched into a non-temporal cache structure close to the processor that should not cause cache thrashing in the L1 (if possible). Such a cache structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop unrolling optimization to speed up the code when nj is small:
#include <xmmintrin.h>  // for _mm_prefetch and _MM_HINT_NTA

double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;

    if (nj % 16 == 0)
    {
        // Cache-bypassing prefetch to avoid cache thrashing
        const size_t distance = 12;
        for (uint64_t i = 0; i < ni; ++i) {
            _mm_prefetch(&data[(i + distance) * nj + 0], _MM_HINT_NTA);
            sum0 += data[i * nj + 0];
        }
    }
    else
    {
        // Unrolling is much better for small strides (assumes ni % 4 == 0)
        for (uint64_t i = 0; i < ni; i += 4) {
            sum0 += data[(i + 0) * nj + 0];
            sum1 += data[(i + 1) * nj + 0];
            sum2 += data[(i + 2) * nj + 0];
            sum3 += data[(i + 3) * nj + 0];
        }
    }

    return sum0 + sum1 + sum2 + sum3;
}
Here is the result of the modified code:
We can see that peaks no longer appear in the timings. We can also see that the values are much bigger due to dt0 being about 4 times smaller (due to the loop unrolling).
Note that cache thrashing in the L2 cache is not avoided by this method in practice (at least on Intel processors). This means the effect is still present on my machine with huge nj strides that are a multiple of 512 (4 KiB); it is actually slower than before, especially when nj >= 2048. It may be a good idea to stop the prefetching when (nj % 512) == 0 && nj >= 512 on x86 processors; AFAIK, there is no other way to address this problem. That being said, performing such big strided accesses over very large data structures is a very bad idea in the first place.
Note that distance should be chosen carefully, since prefetching too early can result in cache lines being evicted before they are actually used (so they need to be fetched again), and prefetching too late is not very useful. I think using a value close to the number of entries in the LFB/MAB is a good idea (e.g. 12 on Skylake/Kaby Lake/Cannon Lake, 22 on Zen 2).

Determine NUMA layout via latency/performance measurements

Recently I have been observing performance effects in memory-intensive workloads I was unable to explain. Trying to get to the bottom of this I started running several microbenchmarks in order to determine common performance parameters like cache line size and L1/L2/L3 cache size (I knew them already, I just wanted to see if my measurements reflected the actual values).
For the cache line test my code roughly looks as follows (Linux C, but the concept is similar on Windows etc., of course):
char *array = malloc(ARRAY_SIZE);
int count = ARRAY_SIZE / STEP;

clock_gettime(CLOCK_REALTIME, &start_time);
for (int i = 0; i < ARRAY_SIZE; i += STEP) {
    array[i]++;
}
clock_gettime(CLOCK_REALTIME, &end_time);

// calculate time per element here:
[..]
Varying STEP from 1 to 128 showed that from STEP=64 on, the time per element did not increase further, i.e. every iteration needs to fetch a new cache line, which dominates the runtime.
Varying ARRAY_SIZE from 1K to 16384K keeping STEP=64 I was able to create a nice plot exhibiting a step pattern that roughly corresponds to L1, L2 and L3 latency. It was necessary to repeat the for loop a number of times, for very small array sizes even 100,000s of times, to get reliable numbers, though. Then, on my IvyBridge notebook I can clearly see L1 ending at 64K, L2 at 256K and even the L3 at 6M.
Now on to my real question: in a NUMA system, any single core can reach remote main memory, and even shared cache, that is not necessarily as close as its local cache and memory. I was hoping to see a difference in latency/performance, thus determining how much memory I can allocate while staying within my fast caches/part of memory.
For this, I refined my test to walk through the memory in 1/10 MB chunks measuring the latency separately and later collect the fastest chunks, roughly like this:
for (int chunk_start = 0; chunk_start < ARRAY_SIZE; chunk_start += CHUNK_SIZE) {
    int chunk_end = MIN(ARRAY_SIZE, chunk_start + CHUNK_SIZE);
    int chunk_els = CHUNK_SIZE / STEP;
    for (int i = chunk_start; i < chunk_end; i += STEP) {
        array[i]++;
    }
    // calculate time per element
    [..]
}
As soon as I start increasing ARRAY_SIZE to something larger than the L3 size, I get wildly unreliable numbers that not even a large number of repeats is able to even out. There is no way I can make out a pattern usable for performance evaluation with this, let alone determine where exactly a NUMA stripe starts, ends, or is located.
Then I figured the hardware prefetcher is smart enough to recognize my simple access pattern and simply fetch the needed lines into the cache before I access them. Adding a random number to the array index increases the time per element but did not seem to help much otherwise, probably because I had a rand() call every iteration. Precomputing some random values and storing them in an array did not seem a good idea to me, as this array would also be stored in a hot cache and skew my measurements. Increasing STEP to 4097 or 8193 did not help much either; the prefetcher must be smarter than me.
Is my approach sensible/viable or did I miss the larger picture? Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?
I disabled address space randomization just to be sure, to preclude strange cache aliasing effects. Is there something else operating-system-wise that has to be tuned before measuring?
Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?
Memory allocators are NUMA aware, so by default you will not observe any NUMA effects until you explicitly ask to allocate memory on another node. The simplest way to achieve the effect is numactl(8). Just run your application on one node and bind memory allocations to another, like so:
numactl --cpunodebind 0 --membind 1 ./my-benchmark
See also numa_alloc_onnode(3).
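If you would rather place the memory from inside the program, a minimal sketch using libnuma (link with -lnuma; the node numbers 0 and 1 are assumptions for a two-node machine, and error handling is trimmed):
#include <numa.h>     // libnuma, link with -lnuma
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    const size_t size = 256 * 1024 * 1024;

    // Run this thread on node 0 but place the memory on node 1,
    // mirroring the numactl example above.
    numa_run_on_node(0);
    char *remote = numa_alloc_onnode(size, 1);
    if (!remote) {
        perror("numa_alloc_onnode");
        return 1;
    }

    // ... run the latency benchmark against 'remote' here ...

    numa_free(remote, size);
    return 0;
}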
Is there something else operating-system-wise that has to be tuned before measuring?
Turn off CPU scaling otherwise your measurements might be noisy:
find '/sys/devices/system/cpu/' -name 'scaling_governor' | while read F; do
    echo "==> ${F}"
    echo "performance" | sudo tee "${F}" > /dev/null
done
Now regarding the test itself. Sure, to measure the latency, the access pattern must be (pseudo-)random. Otherwise your measurements will be contaminated with fast cache hits.
Here is an example how you could achieve this:
Data Initialization
Fill the array with random numbers:
static void random_data_init()
{
    for (size_t i = 0; i < ARR_SZ; i++) {
        arr[i] = rand();
    }
}
Benchmark
Perform 1M operations per benchmark iteration to reduce measurement noise. Use the random numbers stored in the array to jump over a few cache lines:
const size_t OPERATIONS = 1 * 1000 * 1000; // 1M operations per iteration

int random_step_sizeK(size_t size)
{
    // size must be a power of two for the mask below to wrap correctly
    size_t idx = 0;
    for (size_t i = 0; i < OPERATIONS; i++) {
        arr[idx & (size - 1)]++;
        idx += arr[idx & (size - 1)] * 64; // assuming cache line is 64B
    }
    return 0;
}
Results
Here are the results for an i5-4460 CPU @ 3.20GHz:
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
random_step_sizeK/4 4217004 ns 4216880 ns 166
random_step_sizeK/8 4146458 ns 4146227 ns 168
random_step_sizeK/16 4188168 ns 4187700 ns 168
random_step_sizeK/32 4180545 ns 4179946 ns 163
random_step_sizeK/64 5420788 ns 5420140 ns 129
random_step_sizeK/128 6187776 ns 6187337 ns 112
random_step_sizeK/256 7856840 ns 7856549 ns 89
random_step_sizeK/512 11311684 ns 11311258 ns 57
random_step_sizeK/1024 13634351 ns 13633856 ns 51
random_step_sizeK/2048 16922005 ns 16921141 ns 48
random_step_sizeK/4096 15263547 ns 15260469 ns 41
random_step_sizeK/6144 15262491 ns 15260913 ns 46
random_step_sizeK/8192 45484456 ns 45482016 ns 23
random_step_sizeK/16384 54070435 ns 54064053 ns 14
random_step_sizeK/32768 59277722 ns 59273523 ns 11
random_step_sizeK/65536 63676848 ns 63674236 ns 10
random_step_sizeK/131072 66383037 ns 66380687 ns 11
There are obvious steps between 32K/64K (so my L1 cache is ~32K), 256K/512K (so my L2 cache size is ~256K) and 6144K/8192K (so my L3 cache size is ~6M).

No performance gain with transpose of large 2d Matrix using Loop tiling

Transposing a global 2D square matrix/array of size 1 GB with a tiling approach (cache-aware) shows no performance gain in single-threaded execution over the normal transpose method. I am not discussing the transpose speedup using AVX/SSE (SIMD) or any other cache-oblivious transpose algorithm (http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf).
#include <stdio.h>
#include <sys/time.h>

#define SIZE 16384

float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
    int i, j, k, l;
    b[0][9999] = 1.0;
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            a[i][j] = b[j][i];
}

void testTiledTranspose() {
    int i, j, k, l;
    b[0][9999] = 1.0;
    int blocksize = 16;
    for (i = 0; i < SIZE; i += blocksize) {
        for (j = 0; j < SIZE; j += blocksize) {
            for (int ii = i; ii < i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }
            }
        }
    }
}

int main()
{
    struct timeval t1, t2;
    /*
    gettimeofday(&t1, NULL);
    testNormalTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Normal transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    */
    gettimeofday(&t1, NULL);
    testTiledTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Tiled transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    printf("%f\n", a[9999][0]);
}
Loop tiling helps when data is being reused. If you use an element SIZE times, you had better use it SIZE times and only then proceed to the next element.
Unfortunately, when transposing a 2D matrix you are not reusing any elements of either matrix a or b. Even more, since the loop mixes row and column access (i.e. a[i][j] = b[j][i]), you will never get unit-stride memory access on both the a and b arrays at the same time, but only on one of them.
So, in this case tiling is not that efficient, but still you might have some performance improvements even with "random" memory access if:
the element you are accessing now is on the same cache line with an element you were accessing previously AND
that cache line is still available.
So, to see any improvement, the memory footprint of these "random" accesses must fit into the cache of your system. Basically this means you have to choose the blocksize carefully; the 16 you have chosen in the example might work better on one system and worse on another.
Here are the results from my computer for different power of 2 block sizes and SIZE 4096:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 32052765 ns 32051761 ns 21
tiled_transpose_2d/2 22246701 ns 22245867 ns 31
tiled_transpose_2d/4 16912984 ns 16912487 ns 41
tiled_transpose_2d/8 16284471 ns 16283974 ns 43
tiled_transpose_2d/16 16604652 ns 16604149 ns 42
tiled_transpose_2d/32 23661431 ns 23660226 ns 29
tiled_transpose_2d/64 32260575 ns 32259564 ns 22
tiled_transpose_2d/128 32107778 ns 32106793 ns 22
fixed_tile_transpose_2d 16735583 ns 16729876 ns 41
As you can see, the version with blocksize 8 works best for me and almost doubles the performance.
Here are the results for SIZE 4131 and power of 3 block sizes:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 29875351 ns 29874381 ns 23
tiled_transpose_2d/3 30077471 ns 30076517 ns 23
tiled_transpose_2d/9 20420423 ns 20419499 ns 35
tiled_transpose_2d/27 13470242 ns 13468992 ns 51
tiled_transpose_2d/81 11318953 ns 11318646 ns 61
tiled_transpose_2d/243 10229250 ns 10228884 ns 65
fixed_tile_transpose_2d 10217339 ns 10217066 ns 67
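The fixed_tile_transpose_2d rows above refer to a variant with a compile-time block size, which lets the compiler fully unroll the inner loops; the answer does not show that code. A sketch of what such a variant might look like (my assumption, reusing the globals a, b and SIZE from the question, and assuming SIZE is a multiple of the tile size):
#define TILE 8  // compile-time block size; 8 was the best runtime value in the first table

void testFixedTiledTranspose() {
    // Same transpose as testTiledTranspose(), but with a constant block size
    // so the compiler can completely unroll the two innermost loops.
    for (int i = 0; i < SIZE; i += TILE)
        for (int j = 0; j < SIZE; j += TILE)
            for (int ii = i; ii < i + TILE; ++ii)
                for (int jj = j; jj < j + TILE; ++jj)
                    a[ii][jj] = b[jj][ii];
}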
Regarding the 16384 size issue: I cannot reproduce it, i.e. I still see the same gain for the big matrix. Just note that 16384 * 16384 * sizeof(float) makes 1 GiB per matrix (2 GiB for a and b together), which might expose some system issues...

analysis of cpu cache access time

I have the following program, which I wrote with the help of someone else on Stack Overflow to understand cache lines and CPU caches. I have posted the result of the calculation below.
1 450.0 440.0
2 420.0 230.0
4 400.0 110.0
8 390.0 60.0
16 380.0 30.0
32 320.0 10.0
64 180.0 10.0
128 60.0 0.0
256 40.0 10.0
512 10.0 0.0
1024 10.0 0.0
I have plotted a graph using gnuplot which is posted below.
I have the following questions.
1. Is my timing calculation in milliseconds correct? 440 ms seems to be a lot of time.
2. From the graph of cache_access_1 (red line), can we conclude that the size of a cache line is 32 bits (and not 64 bits)?
3. Between the for loops in the code, is it a good idea to clear the cache? If yes, how do I do that programmatically?
4. As you can see, I have some 0.0 values in the results above. What does this indicate? Is the granularity of measurement too coarse?
Kindly reply.
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int k = 0;
    int count = 0;

    /*
     * A MAX_SIZE array is too big for the stack. This is an unfortunate rough edge of the way the stack works.
     * It lives in a fixed-size buffer, set by the program executable's configuration according to the
     * operating system, but its actual size is seldom checked against the available space.
     */
    /*int arr[MAX_SIZE];*/
    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* First loop: touch every element, three passes */
    for (k = 0; k < 3; k++)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;
        for (i = 0; i < MAX_SIZE; i++)
        {
            arr[i] += 3;
            /*count++;*/
        }
        /*cpu clock ticks count stop*/
        end = clock();
        cpu_time = ((double)(end - start)) / CLOCKS_PER_SEC;
        printf("cpu time for loop 1 (k : %4d) %.1f ms.\n", k, (cpu_time * 1000));
    }
    printf("\n");

    /* Second loop: stride of k elements over the whole array */
    for (k = 1; k <= 1024; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;
        for (i = 0; i < MAX_SIZE; i += k)
        {
            /*count++;*/
            arr[i] += 3;
        }
        /*cpu clock ticks count stop*/
        end = clock();
        cpu_time = ((double)(end - start)) / CLOCKS_PER_SEC;
        printf("cpu time for loop 2 (k : %4d) %.1f ms.\n", k, (cpu_time * 1000));
    }
    printf("\n");

    /* Third loop, performing the same operations as loop 2,
       but only touching 16KB of memory */
    for (k = 1; k <= 1024; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;
        for (i = 0; i < MAX_SIZE; i += k)
        {
            count++;
            arr[i & 0xfff] += 3;
        }
        /*cpu clock ticks count stop*/
        end = clock();
        cpu_time = ((double)(end - start)) / CLOCKS_PER_SEC;
        printf("cpu time for loop 3 (k : %4d) %.1f ms.\n", k, (cpu_time * 1000));
    }

    return 0;
}
Since you are on Linux, I'll answer from that perspective. I will also write with an Intel (i.e., x86-64) architecture in mind.
440 ms is probably accurate. A better way to look at the results would be time per element or access. Note that increasing your k reduces the number of elements accessed. Now, cache access 2 shows a fairly steady result of 0.9ns / access. This time is roughly comparable to 1 - 3 cycles per access (depending on CPU's clock rate). So sizes 1 - 16 (maybe 32) are accurate.
No (although I will first assume you mean 32 versus 64 bytes). You should ask yourself what "cache line size" would look like in the data. If your stride is smaller than the cache line, then you will miss once and subsequently hit one or more times. If it is greater than or equal to the cache line size, every access will miss. At k=32 and above, the access time for access 1 is relatively constant at 20ns per access. At k=1-16, the overall access time is constant, suggesting that there are approximately the same number of cache misses. So I would conclude that the cache line size is 64 bytes.
Yes, at least for the last loop that is only storing ~16KB. How? Either touch a lot of other data, like another GB array. Or call an instruction like x86's WBINVD, which writes to memory and then invalidates all cache contents; however, it requires you to be in kernel-mode.
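If you want the "touch a lot of other data" variant from user space, a minimal sketch (the 64 MiB figure is my assumption for something comfortably larger than the last-level cache):
#include <stdlib.h>

// Evict (most of) the cache contents by streaming over a buffer much larger
// than the last-level cache. Not as thorough as WBINVD, but usable from
// user space without kernel privileges.
static void flush_caches(void) {
    static volatile char *evict;
    const size_t evict_size = 64u * 1024 * 1024;  // assumed > LLC size
    if (!evict)
        evict = malloc(evict_size);
    if (!evict)
        return;
    for (size_t i = 0; i < evict_size; i += 64)   // one touch per cache line
        evict[i]++;
}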
As you noted, beyond size 32 the times hover around 10ms, which is showing your timing granularity. You need to either increase the time required (so that a 10ms granularity is sufficient) or switch to a different timing mechanism, which is what the comments are debating. I'm a fan of using the instruction rdtsc (read timestamp counter, i.e. cycle count), but this can be even more problematic than the suggestions above. Switching your code to rdtsc basically requires replacing clock, clock_t, and CLOCKS_PER_SEC. However, you could still face clock drift if your thread migrates, but this is a fun test, so I wouldn't concern myself with that issue.
More caveats: the trouble with consistent strides (like powers of 2) is that the processor likes to hide the cache miss penalty by prefetching. You can disable the prefetcher on many machines in the BIOS (see "Changing the Prefetcher for Intel Processors").
Page faults may also be impacting your results. You are allocating 500M ints or about 2GB of storage. Loop 1 tries to touch the memory so that the OS will allocate pages, but if you don't have this much available memory (not just total, as the OS, etc takes up some space) then your results will be skewed. Furthermore, the OS may start reclaiming some of the space so that you will always be page faulting on some of your accesses.
Related to the previous, the TLB is also going to have some impact on the results. The hardware keeps a small cache of mappings from virtual to physical address in a translation lookaside buffer (TLB). Each page of memory (4KB on Intel) needs a TLB entry. So your experiment is going to need 2GB / 4KB => ~500,000 entries. Most TLBs hold less than 1000 entries, so the measurements are also skewed by this miss. Fortunately, it is only once every 4KB or 1024 ints. It is possible that malloc is allocating "large" or "huge" pages for you, for more details - Huge Pages in Linux.
Another experiment would be to repeat the third loop, but change the mask that you are using, so that you can observe the size of each cache level (L1, L2, maybe L3, rarely L4). You may also find that different cache levels use different cacheline sizes.
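A sketch of that last experiment, written as a fourth loop that could be appended to the program above (it reuses arr, start, end, cpu_time, i and MAX_SIZE from the question; the fixed stride of 16 ints is one 64-byte cache line, and the mask requires the working-set size to be a power of two):
/* Fourth loop: fixed stride of one cache line, growing working set.
   The mask keeps all accesses inside ws * sizeof(int) bytes, so the time
   per access should step up as the working set exceeds L1, then L2, then L3. */
for (size_t ws = 4 * 1024; ws <= 16 * 1024 * 1024; ws <<= 1) { /* ws in ints */
    start = clock();
    for (i = 0; i < MAX_SIZE; i += 16) {
        arr[i & (ws - 1)] += 3;
    }
    end = clock();
    cpu_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("working set %8zu KB : %.1f ms\n",
           (ws * sizeof(int)) / 1024, cpu_time * 1000);
}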

Linux, timerfd accuracy

I have a system that needs at least 10 milliseconds of accuracy for timers.
I went for timerfd as it suits me perfectly, but found that even for intervals up to 15 milliseconds it is not accurate at all; either that, or I don't understand how it works.
The times I have measured were up to 21 milliseconds on a 10 millisecond timer.
I have put together a quick test that shows my problem.
Here a test:
#include <sys/timerfd.h>
#include <time.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <inttypes.h>

int main(int argc, char *argv[]) {
    int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
    int milliseconds = atoi(argv[1]);

    struct itimerspec timspec;
    bzero(&timspec, sizeof(timspec));
    timspec.it_interval.tv_sec = 0;
    timspec.it_interval.tv_nsec = milliseconds * 1000000;
    timspec.it_value.tv_sec = 0;
    timspec.it_value.tv_nsec = 1;

    int res = timerfd_settime(timerfd, 0, &timspec, 0);
    if (res < 0) {
        perror("timerfd_settime:");
        return 1;
    }

    uint64_t expirations = 0;
    int iterations = 0;
    while ((res = read(timerfd, &expirations, sizeof(expirations)))) {
        if (res < 0) { perror("read:"); continue; }
        if (expirations > 1) {
            printf("%" PRIu64 " expirations, %d iterations\n", expirations, iterations);
            break;
        }
        iterations++;
    }
    return 0;
}
And executed like this:
Zack ~$ for i in 2 4 8 10 15; do echo "intervals of $i milliseconds"; ./test $i;done
intervals of 2 milliseconds
2 expirations, 1 iterations
intervals of 4 milliseconds
2 expirations, 6381 iterations
intervals of 8 milliseconds
2 expirations, 21764 iterations
intervals of 10 milliseconds
2 expirations, 1089 iterations
intervals of 15 milliseconds
2 expirations, 3085 iterations
Even assuming some possible delays, a 15 millisecond delay sounds like too much to me.
Try altering it as follows. This should pretty much guarantee that it'll never miss a wakeup, but be careful with it, since running at realtime priority can lock your machine hard if it doesn't sleep. Also, you may need to set things up so that your user has the ability to run stuff at realtime priority (see /etc/security/limits.conf).
#include <sys/timerfd.h>
#include <time.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

int main(int argc, char *argv[])
{
    int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
    int milliseconds = atoi(argv[1]);

    struct itimerspec timspec;
    struct sched_param schedparm;

    memset(&schedparm, 0, sizeof(schedparm));
    schedparm.sched_priority = 1; // lowest rt priority
    sched_setscheduler(0, SCHED_FIFO, &schedparm);

    bzero(&timspec, sizeof(timspec));
    timspec.it_interval.tv_sec = 0;
    timspec.it_interval.tv_nsec = milliseconds * 1000000;
    timspec.it_value.tv_sec = 0;
    timspec.it_value.tv_nsec = 1;

    int res = timerfd_settime(timerfd, 0, &timspec, 0);
    if (res < 0) {
        perror("timerfd_settime:");
    }

    uint64_t expirations = 0;
    int iterations = 0;
    while ((res = read(timerfd, &expirations, sizeof(expirations)))) {
        if (res < 0) { perror("read:"); continue; }
        if (expirations > 1) {
            printf("%" PRIu64 " expirations, %d iterations\n", expirations, iterations);
            break;
        }
        iterations++;
    }
}
If you are using threads you should use pthread_setschedparam instead of sched_setscheduler.
Realtime also isn't about low latency, it's about guarantees. RT means that if you want to wake up exactly once every second, on the second, you WILL. Normal scheduling does not give you this; it might decide to wake you up 100ms later, because it had some other work to do at that time anyway. If you want to wake up every 10ms and you REALLY do need to, then you should set yourself to run as a realtime task; the kernel will then wake you up every 10ms without fail, unless a higher-priority realtime task is busy doing stuff.
If you need to guarantee that your wakeup interval is exactly some time, it doesn't matter if it's 1ms or 1 second: you won't get it unless you run as a realtime task. There are good reasons the kernel will do this to you (saving power is one of them, higher throughput is another, there are others), but it's well within its rights to do so since you never told it you need better guarantees. Most stuff doesn't actually need to be this accurate, or need to never miss, so you should think hard about whether or not you really do need it.
quoting from http://www.ganssle.com/articles/realtime.htm
A hard real time task or system is
one where an activity simply must be
completed - always - by a specified
deadline. The deadline may be a
particular time or time interval, or
may be the arrival of some event. Hard
real time tasks fail, by definition,
if they miss such a deadline.
Notice this definition makes no
assumptions about the frequency or
period of the tasks. A microsecond or
a week - if missing the deadline
induces failure, then the task has
hard real time requirements.
Soft realtime is pretty much the same, except that missing a deadline, while undesirable, is not the end of the world (for example video and audio playback are soft realtime tasks, you don't want to miss displaying a frame, or run out of buffer, but if you do it's just a momentary hiccough, and you simply continue). If what you are trying to do is 'soft' realtime I wouldn't bother with running at realtime priority since you should generally get your wakeups in time (or at least close to it).
EDIT:
If you aren't running realtime, the kernel will by default give any timers you make some 'slack', so that it can merge your request to wake up with other events that happen at times close to the one you asked for (that is, if the other event is within your 'slack' time, it will not wake you at the time you asked but a little earlier or later, at the same time it was already going to do something else; this saves power).
For a little more info, see High- (but not too high-) resolution timeouts and Timer slack. (Note I'm not sure if either of those things is exactly what's really in the kernel, since both those articles are about lkml mailing list discussions, but something like the first one really is in the kernel.)
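If you only want to remove that deliberate coalescing without going full realtime, the per-thread slack can be lowered with prctl; a minimal sketch (note this mainly affects poll/select/nanosleep-style waits, so whether it helps timerfd specifically depends on the kernel):
#include <stdio.h>
#include <sys/prctl.h>

int main(void) {
    // Request 1 ns of timer slack for this thread instead of the default
    // (50 us on the kernels I have checked). This only removes the
    // deliberate coalescing; it does not give realtime guarantees.
    if (prctl(PR_SET_TIMERSLACK, 1UL, 0UL, 0UL, 0UL) != 0)
        perror("prctl(PR_SET_TIMERSLACK)");
    printf("current slack: %d ns\n",
           prctl(PR_GET_TIMERSLACK, 0UL, 0UL, 0UL, 0UL));

    // ... set up the timerfd loop from the question here ...
    return 0;
}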
I've got a feeling that your test is very hardware dependent. When I ran your sample program on my system, it appeared to hang at 1ms. To make your test at all meaningful on my computer, I had to change from milliseconds to microseconds. (I changed the multiplier from 1000000 to 1000.)
$ grep 1000 test.c
timspec.it_interval.tv_nsec = microseconds * 1000;
$ for i in 1 2 4 5 7 8 9 15 16 17\
31 32 33 47 48 49 63 64 65 ; do\
echo "intervals of $i microseconds";\
./test $i;done
intervals of 1 microseconds
11 expirations, 0 iterations
intervals of 2 microseconds
5 expirations, 0 iterations
intervals of 4 microseconds
3 expirations, 0 iterations
intervals of 5 microseconds
2 expirations, 0 iterations
intervals of 7 microseconds
2 expirations, 0 iterations
intervals of 8 microseconds
2 expirations, 0 iterations
intervals of 9 microseconds
2 expirations, 0 iterations
intervals of 15 microseconds
2 expirations, 7788 iterations
intervals of 16 microseconds
4 expirations, 1646767 iterations
intervals of 17 microseconds
2 expirations, 597 iterations
intervals of 31 microseconds
2 expirations, 370969 iterations
intervals of 32 microseconds
2 expirations, 163167 iterations
intervals of 33 microseconds
2 expirations, 3267 iterations
intervals of 47 microseconds
2 expirations, 1913584 iterations
intervals of 48 microseconds
2 expirations, 31 iterations
intervals of 49 microseconds
2 expirations, 17852 iterations
intervals of 63 microseconds
2 expirations, 24 iterations
intervals of 64 microseconds
2 expirations, 2888 iterations
intervals of 65 microseconds
2 expirations, 37668 iterations
(Somewhat interesting that I got the longest runs from 16 and 47 microseconds, but 17 and 48 were awful.)
time(7) has some suggestions on why our platforms are so different:
High-Resolution Timers
Before Linux 2.6.21, the accuracy of timer and sleep system
calls (see below) was also limited by the size of the jiffy.
Since Linux 2.6.21, Linux supports high-resolution timers
(HRTs), optionally configurable via CONFIG_HIGH_RES_TIMERS. On
a system that supports HRTs, the accuracy of sleep and timer
system calls is no longer constrained by the jiffy, but instead
can be as accurate as the hardware allows (microsecond accuracy
is typical of modern hardware). You can determine whether
high-resolution timers are supported by checking the resolution
returned by a call to clock_getres(2) or looking at the
"resolution" entries in /proc/timer_list.
HRTs are not supported on all hardware architectures. (Support
is provided on x86, arm, and powerpc, among others.)
All the 'resolution' lines in my /proc/timer_list are 1ns on my (admittedly ridiculously powerful) x86_64 system.
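A quick way to check the same thing from code instead of /proc/timer_list, as a small sketch:
#include <stdio.h>
#include <time.h>

int main(void) {
    // With high-resolution timers the reported resolution is typically 1 ns;
    // on a jiffy-driven kernel it would be several milliseconds instead.
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0) {
        perror("clock_getres");
        return 1;
    }
    printf("CLOCK_MONOTONIC resolution: %ld s %ld ns\n",
           (long)res.tv_sec, res.tv_nsec);
    return 0;
}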
I decided to try to figure out where the 'breaking point' is on my computer, but gave up on the 110 microsecond run:
$ for i in 70 80 90 100 110 120 130\
; do echo "intervals of $i microseconds";\
./test $i;done
intervals of 70 microseconds
2 expirations, 639236 iterations
intervals of 80 microseconds
2 expirations, 150304 iterations
intervals of 90 microseconds
4 expirations, 3368248 iterations
intervals of 100 microseconds
4 expirations, 1964857 iterations
intervals of 110 microseconds
^C
90 microseconds ran for three million iterations before it failed a few times; that's 22 times better resolution than your very first test, so I'd say that given the right hardware, 10ms shouldn't be anywhere near difficult. (90 microseconds is 111 times better resolution than 10 milliseconds.)
But if your hardware doesn't provide the timers for high resolution timers, then Linux can't help you without resorting to SCHED_RR or SCHED_FIFO. And even then, perhaps another kernel could better provide you with the software timer support you need.
Good luck. :)
Here's a theory. If HZ is set to 250 for your system (as is typical), then you have a 4 millisecond timer resolution. Once your process is swapped out by the scheduler, it's likely that a number of other processes will be scheduled and run before your process gets another time slice. This might explain why you are seeing timer resolutions in the 15 to 21 millisecond range. The only way to get around this would be to run a real-time kernel.
The typical solution for high resolution timing on non-realtime systems is basically to busy wait with a call to select.
Depending on what else the system is doing, it may be a bit slow in switching back to your task. Unless you have a "real" realtime system, there's no guarantee it will do better than what you're seeing, although I agree that result is a bit disappointing.
You can (mostly) eliminate that task switch / scheduler time. If you have CPU power (and electrical power!) to spare, a brutal but effective solution would be a busy wait spin loop.
The idea is to run your program in a tight loop that continuously polls the clock for what time it is, and then calls your other code when the time is right. At the expense of making your system act very sluggish for everything else and heating up your CPU, you will end up with task scheduling that is mostly jitter free.
I wrote a system like this once under Windows XP to spin a stepper motor, supplying evenly spaced pulses up to 40K times per second, and it worked fine. Of course, your mileage may vary.
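A minimal sketch of such a spin loop on Linux, using CLOCK_MONOTONIC (it burns a full core, as warned above; the 10 ms period mirrors the question):
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

int main(void) {
    const int64_t period_ns = 10 * 1000 * 1000; // 10 ms, as in the question
    int64_t next = now_ns() + period_ns;
    for (int tick = 0; tick < 100; ++tick) {
        while (now_ns() < next) {
            /* busy wait: deliberately burn CPU to avoid scheduler latency */
        }
        printf("tick %3d, late by %lld ns\n",
               tick, (long long)(now_ns() - next));
        next += period_ns;   // schedule relative to the ideal grid, not to "now"
    }
    return 0;
}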
