C microbenchmark 'bug' when measuring store latency

I have been trying a few experiments on x86 - namely the effect of mfence on store/load latencies, etc.
Here is what I have started with:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define ARRAY_SIZE 10
#define DUMMY_LOOP_CNT 1000000
int main()
{
char array[ARRAY_SIZE];
for (int i =0; i< ARRAY_SIZE; i++)
array[i] = 'x'; //This is to force the OS to actually allocate the array
asm volatile ("mfence\n");
for (int i=0;i<DUMMY_LOOP_CNT;i++); //A dummy loop to just warmup the processor
struct result_tuple{
uint64_t tsp_start;
uint64_t tsp_end;
int offset;
};
struct result_tuple* results = calloc(ARRAY_SIZE , sizeof (struct result_tuple));
for (int i = 0; i< ARRAY_SIZE; i++)
{
uint64_t *tsp_start,*tsp_end;
tsp_start = &results[i].tsp_start;
tsp_end = &results[i].tsp_end;
results[i].offset = i;
asm volatile (
"mfence\n"
"rdtscp\n"
"mov %%rdx,%[arg]\n"
"shl $32,%[arg]\n"
"or %%rax,%[arg]\n"
:[arg]"=&r"(*tsp_start)
::"rax","rdx","rcx","memory"
);
array[i] = 'y'; //A simple store
asm volatile (
"mfence\n"
"rdtscp\n"
"mov %%rdx,%[arg]\n"
"shl $32,%[arg]\n"
"or %%rax,%[arg]\n"
:[arg]"=&r"(*tsp_end)
::"rax","rdx","rcx","memory"
);
}
printf("Offset\tLatency\n");
for (int i=0;i<ARRAY_SIZE;i++)
{
printf("%d\t%lu\n",results[i].offset,results[i].tsp_end - results[i].tsp_start);
}
free (results);
}
I compile quite simply with gcc microbenchmark.c -o microbenchmark
My system configuration is as follows:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)
My issue is this:
In a single run, all the latencies are similar
When repeating the experiment over and over, I don't get results similar to the previous run!
For instance:
In run 1 I get:
Offset Latency
1 275
2 262
3 262
4 262
5 275
...
252 275
253 275
254 262
255 262
In another run I get:
Offset Latency
1 75
2 75
3 75
4 72
5 72
...
251 72
252 72
253 75
254 75
255 72
This is pretty surprising: the run-to-run variation is large, whereas the variation within a run is negligible. I am not sure how to explain this. What is the issue with my microbenchmark?
Note: I do understand that a normal store would be a write-allocate store, which technically makes my measurement that of a load rather than a store. Also, mfence should flush the store buffer, thereby ensuring that no stores are 'delayed'.

Your warm-up dummy loop only does 1 million iterations, about 6 million clock cycles in a -O0 debug build. That is probably not long enough to get the CPU up to max turbo on a CPU before Skylake's hardware P-state management. (See: Idiomatic way of performance evaluation?)
RDTSCP counts fixed-frequency reference cycles, not core clock cycles. Your runs are so short that all the run-to-run variation is probably explained by the CPU frequency being low or high. See How to get the CPU cycle count in x86_64 from C++?
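One way to act on the two points above is to warm up for a fixed amount of time instead of a fixed number of iterations. The following is only a minimal sketch: warm_up_ms is a hypothetical helper, it uses the standard clock() function rather than the TSC, and how many milliseconds are actually enough for a given CPU and governor is an assumption you would have to check yourself.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: busy-spin for roughly `ms` milliseconds of CPU time
   so the core has a chance to ramp up its clock before timing starts.
   Returns the number of loop iterations executed. */
static uint64_t warm_up_ms(long ms)
{
    const clock_t budget = (clock_t)((double)ms * CLOCKS_PER_SEC / 1000.0);
    const clock_t start = clock();
    volatile uint64_t spins = 0; /* volatile: keep the loop from being optimized away */
    while (clock() - start < budget)
        spins++;
    return spins;
}
```

Calling something like warm_up_ms(100) right before the timed loop would replace the DUMMY_LOOP_CNT loop; the value 100 is a guess, not a measured threshold.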
Also, this debug (-O0) build will do extra stores and reloads inside your timed region, but "fortunately" the results[i].offset = i; store plus the mfence before the first rdtscp ensures the result array is also hot in cache before entering the timed region.
Your array is tiny, and you're only doing 1-byte stores (so 64 stores are all in the same cache line.) It's very likely still in MESI Modified state from when you initialized it, so I wouldn't expect an RFO on any of the array[i] = 'y' stores. That already happened for the few lines of stack memory involved before your timed loop. If you want to pre-fault the array without also getting it cached, maybe touch one line per 4k page and leave the other lines untouched. But HW prefetch will get ahead of your stores, especially if you only store 1 byte at a time with 2 slow mfences per store, so again the waiting for off-core memory requests will be outside the timed region. You should expect data to already be in L1d cache or at least L2 in Exclusive state, ready to be flipped to Modified on a store.
BTW, having an offset member seems pointless; it can be implicit from the array index. e.g. print i instead of offset[i]. It's also not very useful to store both start and stop absolute TSC values. You could just store a 32-bit difference, then you wouldn't need to shift / OR in your inline asm, just declare a clobber on the unused EDX output.
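The suggestion to keep only a 32-bit difference could look roughly like the sketch below. It assumes x86-64 with GCC or Clang inline asm; fenced_tsc_low and time_one_store are hypothetical names, and the result is still in reference cycles, not core clock cycles.

```c
#include <stdint.h>

/* x86-64 only (GCC/Clang inline asm). Returns the low 32 bits of the TSC,
   read after an mfence as in the question. */
static inline uint32_t fenced_tsc_low(void)
{
    uint32_t lo;
    __asm__ volatile("mfence\n\t"
                     "rdtscp"
                     : "=a"(lo)                 /* EAX: low half of the TSC */
                     :
                     : "rdx", "rcx", "memory"); /* EDX and ECX are written but unused */
    return lo;
}

/* Time one byte store. Unsigned subtraction stays correct across a single
   32-bit wrap, which is fine for intervals this short. */
static uint32_t time_one_store(volatile char *p)
{
    uint32_t start = fenced_tsc_low();
    *p = 'y';
    uint32_t end = fenced_tsc_low();
    return end - start;
}
```

With this shape, the results array only needs one uint32_t per sample and the index i serves as the offset.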
Also note that "store latency" typically only matters for performance in real code when mfence is involved. Otherwise the important thing is store->load forwarding, which can happen from the store buffer before the store commits to L1d cache. That's about 6 cycles, or sometimes lower if the reload isn't attempted right away. (It's variable on Sandybridge-family.)

Related

How does the CPU cache affect the performance of a C program

I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.
// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
double sum = 0.0;
for (uint64_t i = 0; i < ni; ++i) {
sum += data[i][0];
}
return sum;
}
int compare(void const* _a, void const* _b)
{
double const a = *((double*)_a);
double const b = *((double*)_b);
return (a > b) - (a < b);
}
int main(int argc, char** argv)
{
// set sizes
assert(argc == 2);
uint64_t const iter_max = 101;
uint64_t const ni = 1000000;
uint64_t const nj = strtol(argv[1], 0, 10);
// initialize data
double(*data)[nj] = calloc(ni, sizeof(*data));
for (uint64_t i = 0; i < ni; ++i) {
for (uint64_t j = 0; j < nj; ++j) {
data[i][j] = rand() / (double)RAND_MAX;
}
}
// test performance
double* dt = calloc(iter_max, sizeof(*dt));
double const sum0 = sum_column(ni, nj, data);
for (uint64_t iter = 0; iter < iter_max; ++iter) {
clock_t const t_start = clock();
double const sum = sum_column(ni, nj, data);
clock_t const t_stop = clock();
assert(sum == sum0);
dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
}
// sort dt
qsort(dt, iter_max, sizeof(*dt), compare);
// compute mean dt
double dt_mean = 0.0;
for (uint64_t iter = 0; iter < iter_max; ++iter) {
dt_mean += dt[iter];
}
dt_mean /= iter_max;
// print results
printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
dt[iter_max - 1]);
// free memory
free(data);
}
However, the results are not quite how I would expect them to be:
As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 byte on my machine). This would explain, why with growing nj the time to solution first increases linearly and levels out at some value. If nj == 1, one load places the next 7 values in the cache and thus we only need to load from RAM every 8th value. If nj == 2, following the same logic, we need to access the RAM every 4th value. After some size, we will have to access the RAM for every value, which should result in the performance leveling out. My guess, for why the linear section of the graph goes further than 4 is that in reality there are multiple levels of cache at work here and the way that values end up in these caches is a little more complex than what I explained here.
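The simple model described in this paragraph can be written down as a small function. This is only a sketch of that reasoning (it ignores prefetching and the multiple cache levels mentioned above); new_line_fraction is a hypothetical name and the 8-byte size of a double is assumed.

```c
#include <stddef.h>

/* Fraction of loads that touch a new cache line when reading one double
   every `nj` doubles, ignoring prefetching and multiple cache levels. */
static double new_line_fraction(size_t nj, size_t line_bytes)
{
    const size_t per_line = line_bytes / sizeof(double); /* 8 for 64-byte lines */
    if (nj >= per_line)
        return 1.0; /* every load lands in a different line */
    return (double)nj / (double)per_line;
}
```

Under this model the fraction is 1/8 for nj == 1, 1/4 for nj == 2, and saturates at 1 from nj == 8 on.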
What I cannot explain is why there are these performance peaks at multiples of 16.
After thinking about this question for a bit, I decided to check if this also occurs for higher values of nj:
In fact, it does. And, there is more: Why does the performance increase again after ~250?
Could someone explain to me, or point me to some appropriate reference, why there are these peaks and why the performance increases for higher values of nj.
If you would like to try the code for yourself, I will also attach my plotting script, for your convenience:
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt("out.csv")
data[:,1:] /= data[0,1]
dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
plt.axvline(data[i + 1,0], color='gray', linestyle='--')
plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
rotation=0, ha="center", va="center",
bbox=dict(boxstyle="round", ec='gray', fc='w'))
plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")

The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with 64-byte cache lines (typically an x86 one).
I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:
First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the access is contiguous. In this case, the processor can efficiently prefetch the cache lines from DRAM so as to hide its high latency. There is also good spatial cache locality since many contiguous items share the same cache line. When nj = 2, only half the values of a cache line are used. This means the number of requested cache lines is twice as large for the same number of operations. That being said, the time is not much bigger because the relatively high latency of adding two floating-point numbers makes the code compute-bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel. Note that most processors can also load multiple values from the cache per cycle.
When nj = 4, a new cache line is requested every 2 loop iterations (a double takes 8 bytes, so the stride is 32 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One may expect the time to be stable for nj >= 8 since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the DRAM latency, which is huge in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK this prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, so when nj >= 64). This explains why the timings increase sharply when nj < 32 and become relatively stable for 32 <= nj <= 256, with exceptions for the peaks.
The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way associative, with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:
Cache 0: L1 data cache, line size 64, 8-ways, 64 sets, size 32k
Cache 1: L1 instruction cache, line size 64, 8-ways, 64 sets, size 32k
Cache 2: L2 unified cache, line size 64, 4-ways, 1024 sets, size 256k
Cache 3: L3 unified cache, line size 64, 12-ways, 12288 sets, size 9216k
This means that two values fetched from DRAM at addresses A1 and A2 can conflict in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In this case, the processor needs to choose which cache line to replace from a set of N=8 cache lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations can compete for the same cache lines, resulting in excessive cache misses. More information about this can also be found in this post.
Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (i.e. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem since the prefetcher needs a fairly large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching one cache line from DRAM is several dozen nanoseconds (e.g. ~70 ns) while its bandwidth is dozens of GiB/s (e.g. ~40 GiB/s): dozens of cache lines (e.g. ~40) should be fetched concurrently so as to hide the latency and saturate the DRAM.
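The "dozens of cache lines" estimate follows from multiplying latency by bandwidth (Little's law). A small worked sketch, using the example figures from the paragraph above; lines_in_flight is a hypothetical helper and the inputs are the rough values quoted there, not measurements.

```c
/* Little's law: lines in flight = latency * bandwidth / line size. */
static double lines_in_flight(double latency_ns, double bandwidth_gib_s, double line_bytes)
{
    const double bytes_per_ns = bandwidth_gib_s * 1073741824.0 / 1e9;
    return latency_ns * bytes_per_ns / line_bytes;
}
```

With 70 ns, 40 GiB/s and 64-byte lines this gives about 47 lines in flight, the same order of magnitude as the ~40 quoted above.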
Here is the simulation of the number of cache lines that can be actually used in my L1 cache regarding the value of the nj:
nj #cache-lines
1 512
2 512
3 512
4 512
5 512
6 512
7 512
8 512
9 512
10 512
11 512
12 512
13 512
14 512
15 512
16 256 <----
17 512
18 512
19 512
20 512
21 512
22 512
23 512
24 512
25 512
26 512
27 512
28 512
29 512
30 512
31 512
32 128 <----
33 512
34 512
35 512
36 512
37 512
38 512
39 512
40 512
41 512
42 512
43 512
44 512
45 512
46 512
47 512
48 256 <----
49 512
50 512
51 512
52 512
53 512
54 512
55 512
56 512
57 512
58 512
59 512
60 512
61 512
62 512
63 512
64 64 <----
==============
80 256
96 128
112 256
128 32
144 256
160 128
176 256
192 64
208 256
224 128
240 256
256 16
384 32
512 8
1024 4
We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher will preload data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value then needs to be fetched again from the L2 or even the L3, resulting in slower execution. Note that the L2 cache is also subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors uses hashing to distribute lines more evenly and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
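The simulation code itself is not shown, so the following is only a guess at how such a table could be produced. It assumes the simple model from the formula above (a line is identified by (A % 32768) / 64, with addresses A = i * nj * 8); usable_l1_lines is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

enum { L1_BYTES = 32768, LINE_BYTES = 64, L1_LINES = L1_BYTES / LINE_BYTES };

/* Count distinct values of (A % 32768) / 64 over the addresses A = i * nj * 8. */
static unsigned usable_l1_lines(uint64_t nj)
{
    unsigned char seen[L1_LINES];
    unsigned count = 0;
    memset(seen, 0, sizeof seen);
    /* The offsets modulo 32768 repeat after at most 4096 steps (32768 / 8),
       so that many iterations reach every line that can be reached. */
    for (uint64_t i = 0; i < L1_BYTES / sizeof(double); ++i) {
        uint64_t addr = i * nj * sizeof(double);
        unsigned line = (unsigned)((addr % L1_BYTES) / LINE_BYTES);
        if (!seen[line]) {
            seen[line] = 1;
            ++count;
        }
    }
    return count;
}
```

Under that assumption the function gives 512 for nj = 1, 256 for nj = 16, 64 for nj = 64 and 4 for nj = 1024, matching the rows listed above.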
Here are the timings on my machine for some peaks (columns, as printed by the program: nj, median, mean, min and max, in seconds):
32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03
As expected, the timings are overall bigger in practice when the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising since they are significantly faster than the others. This is the case where the number of available cache lines is equal to the number of ways of associativity (N). My guess is that Intel processors probably detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).
Finally, for large nj strides, a bigger nj should result in higher overheads, mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with bigger nj and the number of TLB entries is limited. In fact, this is what I observe on my machine: timings tend to increase slowly and in a very stable way, unlike on your target platform.
I cannot really explain this very strange behavior yet.
Here are some wild guesses:
The OS could tend to use more huge pages when nj is large (so as to reduce the overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher since AFAIK it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit mmap), or possibly disable them. My system appears to use 2 MiB transparent huge pages independently of the nj value.
If the target architecture is a NUMA one (eg. new AMD processors or a server with multiple processors having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance due to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux so to force local allocations.
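For the first guess, the AnonHugePages check can be done from inside the benchmark. A minimal sketch, assuming the usual "Key:   value kB" layout of /proc/meminfo on Linux; meminfo_kb is a hypothetical helper that works on text already read into memory.

```c
#include <stdio.h>
#include <string.h>

/* Return the value in kB of `key` (e.g. "AnonHugePages") found in
   meminfo-style text, or -1 if the key is absent. */
static long meminfo_kb(const char *text, const char *key)
{
    const size_t key_len = strlen(key);
    for (const char *line = text; line && *line; ) {
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            long kb;
            if (sscanf(line + key_len + 1, "%ld", &kb) == 1)
                return kb;
        }
        line = strchr(line, '\n');
        if (line)
            ++line; /* step past the newline to the next line */
    }
    return -1;
}
```

Reading /proc/meminfo into a buffer with fopen and fread, once after the allocation for each nj, and passing it to this function would show whether the amount of huge-page-backed memory changes with nj.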
For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 cache works in practice is available here.
Removing the peaks
To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetching instructions so that cache lines are fetched into a non-temporal cache structure, in a location close to the processor that should not cause cache thrashing in the L1 (if possible). Such a cache structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop unrolling optimization to speed up the code when nj is small:
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_NTA
double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
double sum0 = 0.0;
double sum1 = 0.0;
double sum2 = 0.0;
double sum3 = 0.0;
if(nj % 16 == 0)
{
// Cache-bypassing prefetch to avoid cache thrashing
const size_t distance = 12;
for (uint64_t i = 0; i < ni; ++i) {
_mm_prefetch((const char*)&data[(i+distance)*nj+0], _MM_HINT_NTA);
sum0 += data[i*nj+0];
}
}
else
{
// Unrolling is much better for small strides
uint64_t i = 0;
for (; i + 4 <= ni; i+=4) {
sum0 += data[(i+0)*nj+0];
sum1 += data[(i+1)*nj+0];
sum2 += data[(i+2)*nj+0];
sum3 += data[(i+3)*nj+0];
}
// Remaining rows when ni is not a multiple of 4
for (; i < ni; ++i) {
sum0 += data[i*nj+0];
}
}
return sum0 + sum1 + sum2 + sum3;
}
Here is the result of the modified code:
We can see that peaks no longer appear in the timings. We can also see that the values are much bigger due to dt0 being about 4 times smaller (due to the loop unrolling).
Note that cache thrashing in the L2 cache is not avoided with this method in practice (at least on Intel processors). This means that the effect is still present with huge nj strides that are multiples of 512 (4 KiB) on my machine (it is actually slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj%512) == 0 && nj >= 512 on x86 processors. AFAIK, there is no way to address this problem. That being said, it is a very bad idea to perform such big strided accesses on very large data structures.
Note that distance should be carefully chosen: prefetching too early can result in cache lines being evicted before they are actually used (so they need to be fetched again), and prefetching too late is not very useful. I think using a value close to the number of entries in the LFB/MAB is a good idea (e.g. 12 on Skylake/KabyLake/CannonLake, 22 on Zen-2).

Determine NUMA layout via latency/performance measurements

Recently I have been observing performance effects in memory-intensive workloads I was unable to explain. Trying to get to the bottom of this I started running several microbenchmarks in order to determine common performance parameters like cache line size and L1/L2/L3 cache size (I knew them already, I just wanted to see if my measurements reflected the actual values).
For the cache line test my code roughly looks as follows (Linux C, but the concept is similar on Windows etc. of course):
char *array = malloc (ARRAY_SIZE);
int count = ARRAY_SIZE / STEP;
clock_gettime(CLOCK_REALTIME, &start_time);
for (int i = 0; i < ARRAY_SIZE; i += STEP) {
array[i]++;
}
clock_gettime(CLOCK_REALTIME, &end_time);
// calculate time per element here:
[..]
Varying STEP from 1 to 128, I saw that from STEP=64 on the time per element did not increase further, i.e. every iteration needs to fetch a new cache line, which dominates the runtime.
Varying ARRAY_SIZE from 1K to 16384K keeping STEP=64 I was able to create a nice plot exhibiting a step pattern that roughly corresponds to L1, L2 and L3 latency. It was necessary to repeat the for loop a number of times, for very small array sizes even 100,000s of times, to get reliable numbers, though. Then, on my IvyBridge notebook I can clearly see L1 ending at 64K, L2 at 256K and even the L3 at 6M.
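The repeated strided loop described above could be wrapped as follows. This is a sketch rather than the code actually used: ns_per_access is a hypothetical name, and it swaps clock_gettime(CLOCK_REALTIME) for the standard clock() function, which measures CPU time at a coarser resolution.

```c
#include <stddef.h>
#include <time.h>

/* Touch every `step`-th byte of `array`, `repeats` times over, and return the
   average CPU time per access in nanoseconds (0 if nothing was touched or the
   run was too short for clock() to resolve). */
static double ns_per_access(volatile unsigned char *array, size_t size,
                            size_t step, size_t repeats)
{
    size_t accesses = 0;
    if (step == 0)
        return 0.0;
    const clock_t start = clock();
    for (size_t r = 0; r < repeats; ++r) {
        for (size_t i = 0; i < size; i += step) {
            array[i]++;
            ++accesses;
        }
    }
    const clock_t stop = clock();
    if (accesses == 0)
        return 0.0;
    return (double)(stop - start) / CLOCKS_PER_SEC * 1e9 / (double)accesses;
}
```

The repeats parameter plays the role of the repetition count mentioned above: small arrays need many repeats before the elapsed time is large enough to be measured.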
Now on to my real question: In a NUMA system, any single core will obtain remote main memory and even shared cache that is not necessarily as close as its local cache and memory. I was hoping to see a difference in latency/performance thus determining how much memory I could allocate while staying in my fast caches/part of memory.
For this, I refined my test to walk through the memory in 1/10 MB chunks measuring the latency separately and later collect the fastest chunks, roughly like this:
for (int chunk_start = 0; chunk_start < ARRAY_SIZE; chunk_start += CHUNK_SIZE) {
int chunk_end = MIN (ARRAY_SIZE, chunk_start + CHUNK_SIZE);
int chunk_els = CHUNK_SIZE / STEP;
for (int i = chunk_start; i < chunk_end; i+= STEP) {
array[i]++;
}
// calculate time per element
[..]
As soon as I start increasing ARRAY_SIZE to something larger than the L3 size, I get wildly unreliable numbers that not even a large number of repeats is able to even out. There is no way I can make out a pattern usable for performance evaluation with this, let alone determine where exactly a NUMA stripe starts, ends or is located.
Then, I figured the Hardware prefetcher is smart enough to recognize my simple access pattern and simply fetch the needed lines into the cache before I access them. Adding a random number to the array index increases the time per element but did not seem to help much otherwise, probably because I had a rand () call every iteration. Precomputing some random values and storing them in an array did not seem a good idea to me as this array as well would be stored in a hot cache and skew my measurements. Increasing STEP to 4097 or 8193 did not help much either, the prefetcher must be smarter than me.
Is my approach sensible/viable or did I miss the larger picture? Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?
I disabled address space randomization just to be sure and to preclude strange cache aliasing effects. Is there something else operating-system wise that has to be tuned before measuring?

Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?
Memory allocators are NUMA aware, so by default you will not observe any NUMA effects until you explicitly ask to allocate memory on another node. The simplest way to achieve the effect is numactl(8). Just run your application on one node and bind memory allocations to another, like so:
numactl --cpunodebind 0 --membind 1 ./my-benchmark
See also numa_alloc_onnode(3).
Is there something else operating-system wise that has to be tuned before measuring?
Turn off CPU frequency scaling, otherwise your measurements might be noisy:
find '/sys/devices/system/cpu/' -name 'scaling_governor' | while read F; do
echo "==> ${F}"
echo "performance" | sudo tee "${F}" > /dev/null
done
Now regarding the test itself. To measure latency, the access pattern must be (pseudo) random. Otherwise your measurements will be contaminated with fast cache hits.
Here is an example of how you could achieve this:
Data Initialization
Fill the array with random numbers:
static void random_data_init()
{
for (size_t i = 0; i < ARR_SZ; i++) {
arr[i] = rand();
}
}
Benchmark
Perform 1M operations per benchmark iteration to reduce measurement noise. Use the random numbers in the array to jump over a few cache lines:
const size_t OPERATIONS = 1 * 1000 * 1000; // 1M operations per iteration
int random_step_sizeK(size_t size)
{
size_t idx = 0;
for (size_t i = 0; i < OPERATIONS; i++) {
arr[idx & (size - 1)]++;
idx += arr[idx & (size - 1)] * 64; // assuming cache line is 64B
}
return 0;
}
Results
Here are the results for i5-4460 CPU @ 3.20GHz:
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
random_step_sizeK/4 4217004 ns 4216880 ns 166
random_step_sizeK/8 4146458 ns 4146227 ns 168
random_step_sizeK/16 4188168 ns 4187700 ns 168
random_step_sizeK/32 4180545 ns 4179946 ns 163
random_step_sizeK/64 5420788 ns 5420140 ns 129
random_step_sizeK/128 6187776 ns 6187337 ns 112
random_step_sizeK/256 7856840 ns 7856549 ns 89
random_step_sizeK/512 11311684 ns 11311258 ns 57
random_step_sizeK/1024 13634351 ns 13633856 ns 51
random_step_sizeK/2048 16922005 ns 16921141 ns 48
random_step_sizeK/4096 15263547 ns 15260469 ns 41
random_step_sizeK/6144 15262491 ns 15260913 ns 46
random_step_sizeK/8192 45484456 ns 45482016 ns 23
random_step_sizeK/16384 54070435 ns 54064053 ns 14
random_step_sizeK/32768 59277722 ns 59273523 ns 11
random_step_sizeK/65536 63676848 ns 63674236 ns 10
random_step_sizeK/131072 66383037 ns 66380687 ns 11
There are obvious steps between 32K/64K (so my L1 cache is ~32K), 256K/512K (so my L2 cache size is ~256K) and 6144K/8192K (so my L3 cache size is ~6M).
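A different technique for getting a prefetcher-resistant access pattern, not the one used in the benchmark above, is pointer chasing: the array stores a random single-cycle permutation, so each load address depends on the previous load and no separate table of random numbers is needed. A minimal sketch, with hypothetical helper names and rand() as the random source:

```c
#include <stddef.h>
#include <stdlib.h>

/* Sattolo's algorithm: fill next[0..n-1] with a random permutation that is a
   single cycle, so following next[] from any slot visits all n slots. */
static void build_single_cycle(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        next[i] = i;
    for (size_t i = n; i > 1; --i) {
        size_t j = (size_t)rand() % (i - 1); /* 0 <= j < i-1, never i-1 itself */
        size_t tmp = next[i - 1];
        next[i - 1] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` hops; each load depends on the previous one. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    while (steps--)
        idx = next[idx];
    return idx;
}
```

In a real latency test each slot would presumably be padded to one cache line (64 bytes assumed) so that every hop touches a new line, and the value returned by chase would be printed or otherwise used so the loop is not optimized away.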

optimized sum of an array of doubles in C [duplicate]

This question already has answers here:
How to optimize these loops (with compiler optimization disabled)?
I've got an assignment where I must take a program and make it more efficient in terms of time.
the original code is:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
help++;
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// ... and this one.
return 0;
}
I almost exclusively modified the second for loop by changing it to
double *j=array;
double *p=array+ARRAY_SIZE;
for(; j<p;j+=10){
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
}
This on its own was able to reduce the time to meet the criteria...
It already seems to work, but are there any mistakes I'm not seeing?

I posted an improved version of this answer on a duplicate of this: C loop optimization help for final assignment. It was originally just a repost, but then I made some changes to answer the differences in that question. I forget what's different, but you should probably read that one instead. Maybe I should just delete this one.
See also other optimization guides in the x86 tag wiki.
First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
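When splitting the code into two files is not convenient, a single-file stand-in is sometimes used instead. The sketch below is such a substitute, not the separate-compilation setup described above: it relies on GCC/Clang extensions (the noinline attribute and an empty asm statement acting as a compiler barrier), and sum_array and run_repeats are hypothetical names.

```c
#include <stddef.h>

/* GCC/Clang extension: ask the compiler to keep the kernel out of line. */
__attribute__((noinline))
static double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Repeat the kernel and return a value that depends on every call, so the
   caller has something to print. */
static double run_repeats(const double *a, size_t n, int repeats)
{
    double total = 0.0;
    for (int r = 0; r < repeats; ++r) {
        /* Compiler barrier: tell the compiler the array may have changed, so
           it cannot reuse the result of a previous call. */
        __asm__ volatile("" : : "r"(a) : "memory");
        total += sum_array(a, n);
    }
    return total;
}
```

Printing the value returned by run_repeats from main() gives the compiler a reason to keep the work; how aggressively a particular compiler version still optimizes across the call is something to verify in the generated asm.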
Also:
gcc -Wall -O3 -march=native fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
long int help;
I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.
Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to optimize your C source for the CPU microarchitecture, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can do, and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing something.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.
Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
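The number of independent dependency chains needed is the latency multiplied by the throughput. A tiny worked sketch using the figures quoted above (accumulators_needed is a hypothetical helper):

```c
/* Independent accumulators needed = latency (cycles) * throughput (ops/cycle). */
static unsigned accumulators_needed(unsigned latency_cycles, unsigned ops_per_cycle)
{
    return latency_cycles * ops_per_cycle;
}
```

With the numbers from the text this gives 3 for Sandybridge through Broadwell (3 cycles, 1 per cycle) and 8 for Skylake (4 cycles, 2 per cycle). With SIMD, each accumulator would be a whole vector register rather than a scalar.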
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
compiler options
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.
vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
It's actually slower when I tell it that the array is aligned (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned). I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. Uop throughput shouldn't be an issue, though, since this code is only getting 0.65 insns per cycle (and 0.93 uops / cycle), according to perf.
Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.
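You can also check alignment from the program itself instead of a debugger. This is only a sketch, assuming a POSIX system (posix_memalign is POSIX, not ISO C); is_aligned and alloc_doubles_32 are hypothetical helper names, not part of the original code.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if p is aligned to `align` bytes; align must be a power of two. */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Allocates n zeroed doubles on a 32-byte boundary (calloc-like, but with
 * the alignment that 32B vector loads want). Returns NULL on failure or
 * when n is 0. Release the result with free(). */
double *alloc_doubles_32(size_t n)
{
    void *p = NULL;
    if (n == 0)
        return NULL;
    if (posix_memalign(&p, 32, n * sizeof(double)) != 0)
        return NULL;
    memset(p, 0, n * sizeof(double));
    return p;
}
```

Calling is_aligned(array, 32) on the pointer returned by calloc would show the same thing the debugger did: whether you got a 32B-aligned buffer or only a 16B-aligned one.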
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. Order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and find the parallelism to keep those medium-latency, high-throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standards call it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4-accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
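If the size were not known to be a multiple of 4, a cleanup loop along these lines would avoid the out-of-bounds read. This is a sketch only: sum_4acc is a hypothetical function name, and it combines the accumulators on every call, whereas in the assignment you would keep them live across the outer loop.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles using four independent accumulators, plus a scalar
 * cleanup loop for the last n % 4 elements. */
double sum_4acc(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t j = 0;

    /* Main loop: only runs while a full group of 4 remains,
     * so it never reads past a[n-1]. */
    for (; j + 4 <= n; j += 4) {
        s0 += a[j];
        s1 += a[j + 1];
        s2 += a[j + 2];
        s3 += a[j + 3];
    }

    /* Cleanup: at most 3 leftover elements. */
    for (; j < n; j++)
        s0 += a[j];

    /* Combine the accumulators only once, at the very end. */
    return (s0 + s1) + (s2 + s3);
}
```

Writing the loop condition as j + 4 <= n rather than j < n is what keeps the unrolled body from touching elements past the end.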
gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare p with end
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at godbolt. Note I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See https://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.)
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned, so cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfer per cycle, which should keep up with the one 32B FP vector add per cycle.
32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) IDK why the HW prefetcher can't get ahead after one stall, and then stay ahead. Possibly software prefetch could help? Maybe somehow avoid having the HW prefetcher run past the array, and instead start prefetching the start of the array again. (loop tiling is a much better solution, see below.)
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the L1 on an AMD CPU.
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. N_TIMES is just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times, unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do several stages of processing on one chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
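Here is a rough sketch of what tiling the repeat loop could look like. The function names and the block parameter are hypothetical, chosen for illustration. Both functions do exactly n * n_times additions, only in a different order. Since FP addition is not associative, the two results can differ by rounding for general data, even though they match exactly for small integer-valued inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Reference order: n_times full passes over the whole array. */
double repeat_sum_naive(const double *a, size_t n, size_t n_times)
{
    double sum = 0;
    for (size_t i = 0; i < n_times; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[j];
    return sum;
}

/* Tiled order: finish all n_times passes over one block before moving
 * on, so each block is reused while it is still hot in cache. */
double repeat_sum_tiled(const double *a, size_t n, size_t n_times,
                        size_t block)
{
    assert(block > 0);
    double sum = 0;
    for (size_t start = 0; start < n; start += block) {
        /* Clamp the last block so we never index past a[n-1]. */
        size_t end = (n - start < block) ? n : start + block;
        for (size_t i = 0; i < n_times; i++)
            for (size_t j = start; j < end; j++)
                sum += a[j];
    }
    return sum;
}
```

You would pick block so that block * sizeof(double) fits comfortably in L1 (for example 1008 doubles, the size that fit in the experiment above). This sketch uses a single accumulator to keep the reordering easy to see; in practice you would combine it with the multiple-accumulator inner loop.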

I would try this for the inner loop:
double* tmp = array;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += *tmp; // Use a pointer
tmp++; // because it is faster to increment the pointer
// than it is to recalculate array+j every time
help++;
}
or better
double* tmp = array;
double* end = array + ARRAY_SIZE; // Get rid of variable j by calculating
// the end criteria and
while (tmp != end) { // just compare if the end is reached
sum += *tmp;
tmp++;
help++;
}

I think you should read about the OpenMP library if you are allowed to use multiple threads. But this is such a simple example that I don't think it can be optimized much.
One certain thing is that you don't need to declare i and j before the for loop. This would do:
for (int i = 0; i < N_TIMES; i++)

Why does my CPU suddenly work twice as fast?

I've been trying to use a simple profiler to measure the efficiency of some C code on a school server, and I'm hitting an odd situation. After a short amount of time (half a second-ish), the processor suddenly starts executing instructions twice as fast. I've tested for just about every possible reason I could think of (caching, load balancing on cores, CPU frequency being altered due to coming out of sleep), but everything seems normal.
For what it's worth, I'm doing this testing on a school linux server, so it's possible there's an unusual configuration I don't know about, but the processor ID being used doesn't change, and (via top) the server was completely idle as I tested.
Test code:
#include <time.h>
#include <stdio.h>
#define MY_CLOCK CLOCK_MONOTONIC_RAW
// no difference if set to CLOCK_THREAD_CPUTIME_ID
typedef struct {
unsigned int tsc;
unsigned int proc;
} ans_t;
static ans_t rdtscp(void){
ans_t ans;
__asm__ __volatile__ ("rdtscp" : "=a"(ans.tsc), "=c"(ans.proc) : : "edx");
return ans;
}
static void nop(void){
__asm__ __volatile__ ("");
}
void test(){
for(int i=0; i<100000000; i++) nop();
}
int main(){
int c=10;
while(c-->0){
struct timespec tstart,tend;
ans_t start = rdtscp();
clock_gettime(MY_CLOCK,&tstart);
test();
ans_t end = rdtscp();
clock_gettime(MY_CLOCK,&tend);
unsigned int tdiff = (tend.tv_sec-tstart.tv_sec)*1000000000+tend.tv_nsec-tstart.tv_nsec;
unsigned int cdiff = end.tsc-start.tsc;
printf("%u cycles and %u ns (%lf GHz) start proc %u end proc %u\n",cdiff,tdiff,(double)cdiff/tdiff,start.proc,end.proc);
}
}
Output I see:
351038093 cycles and 125680883 ns (2.793091 GHz) start proc 14 end proc 14
350911246 cycles and 125639359 ns (2.793004 GHz) start proc 14 end proc 14
350959546 cycles and 125656776 ns (2.793001 GHz) start proc 14 end proc 14
351533280 cycles and 125862608 ns (2.792992 GHz) start proc 14 end proc 14
350903833 cycles and 125636787 ns (2.793002 GHz) start proc 14 end proc 14
350924336 cycles and 125644157 ns (2.793002 GHz) start proc 14 end proc 14
349827908 cycles and 125251782 ns (2.792997 GHz) start proc 14 end proc 14
175289886 cycles and 62760404 ns (2.793001 GHz) start proc 14 end proc 14
175283424 cycles and 62758093 ns (2.793001 GHz) start proc 14 end proc 14
175267026 cycles and 62752232 ns (2.793001 GHz) start proc 14 end proc 14
I get similar output (with it taking a different number of tests to double in efficiency) using different optimization levels (-O0 to -O3).
Could it perhaps have something to do with hyperthreading, where two logical cores in a physical core (the server is using Xeon X5560s which may have this effect) can somehow "merge" to form one twice-as-fast processor?

Some systems scale the processor speed depending on the system load. As you justly note, this is particularly annoying when benchmarking.
If your server is running Linux, please type
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
If this outputs ondemand, powersave or userspace, then CPU frequency scaling is active, and you're going to find it very difficult to do benchmarks. If this says performance, then CPU frequency scaling is disabled.
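If you want the benchmark itself to warn you, a check like the following could be done at startup. This is a sketch under assumptions: read_first_line and governor_is_performance are hypothetical helper names, and the sysfs file may simply not exist (for example inside a VM or when no cpufreq driver is loaded), which the sketch reports as -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the first line of a sysfs-style text file into buf, with the
 * trailing newline stripped. Returns 0 on success, -1 if the file cannot
 * be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t size)
{
    assert(path != NULL && buf != NULL && size > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, (int)size, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the governor string is exactly "performance". */
int governor_is_performance(const char *governor)
{
    return strcmp(governor, "performance") == 0;
}
```

For example, call read_first_line("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", buf, sizeof buf) and print a warning when it succeeds and governor_is_performance(buf) is 0. The cpu0 in that path is an assumption; the cat command above checks every CPU.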

Some CPUs have on-chip optimizations that learn the path your code usually takes (branch prediction). When the CPU successfully forecasts what the next if statement will do, it does not need to discard the queue and load all the new operations from scratch. Depending on the chip and the algorithm, it might take 5 to 10 cycles until it successfully forecasts the if statements. But there are also reasons that speak against this being the cause of this behaviour.
Looking at your output, I would say this might also just be the OS scheduling and/or the CPU frequency governor in use. Are you very sure the CPU frequency doesn't change during the execution of your code? No CPU boost?
Linux tools like cpufreq are often used to regulate the CPU frequency.

Hyper-threading means replicating the register space, not the actual decode/execution units, so it cannot be the explanation.
To test the accuracy of the micro-benchmark method I would do the following:
Run the program with high priority
Count the number of instructions to see if it is correct. I would do that using perf stat ./binary - that means you need to have perf. I would do this multiple times and look at the clocks and instructions metrics to see how multiple instructions can execute in a single cycle.
I have some additional remarks:
For each nop you also do a comparison and a conditional jump in the for loop. If you really want to execute NOPs I'd write a statement like this:
#define NOP5 __asm__ __volatile__ ("nop\n\tnop\n\tnop\n\tnop\n\tnop"); // one instruction per line for the assembler
#define NOP25 NOP5 NOP5 NOP5 NOP5 NOP5
#define NOP100 NOP25 NOP25 NOP25 NOP25
#define NOP500 NOP100 NOP100 NOP100 NOP100 NOP100
...
for(int i=0; i<100000000; i++)
{
NOP500 NOP500 NOP500 NOP500
}
This construct will allow you to actually do NOP's instead of comparing i with 100M.

Tracking down cuda kernel register usage

I am trying to track down register usage and came across an interesting scenario. Consider the following source:
#define OL 20
#define NHS 10
__global__ void loop_test( float ** out, const float ** in,int3 gdims,int stride){
const int idx = blockIdx.x*blockDim.x + threadIdx.x;
const int idy = blockIdx.y*blockDim.y + threadIdx.y;
const int idz = blockIdx.z*blockDim.z + threadIdx.z;
const int index = stride*gdims.y*idz + idy*stride + idx;
int i = 0,j =0;
float sum =0.f;
float tmp;
float lf;
float u2, tW;
u2 = 1.0;
tW = 2.0;
float herm[NHS];
for(j=0; j < OL; ++j){
for(i = 0; i < NHS; ++i){
herm[i] += in[j][index];
}
}
for(j=0; j<OL; ++j){
for(i=0;i<NHS; ++i){
tmp = sum + herm[i]*in[j][index];
sum = tmp;
}
out[j][index] = sum;
sum =0.f;
}
}
As a side note on the source - the running sum I could do +=, but was playing with how changing that affects register usage (seems it doesn't - just adds an extra mov instruction).
Additionally this source is oriented for accessing memory mapped to 3D space.
Counting out the registers it would seem there are 22 registers ( I believe a float[N] takes up N+1 registers - please correct me if I'm wrong) based on the declarations.
However compiling with:
nvcc -cubin -arch=sm_20 -Xptxas="-v" src/looptest.cu
yields:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 72 bytes cmem[0]
Ok so the number is different from what is 'expected'. Additionally, if compiled with:
nvcc -cubin -arch=sm_13 -Xptxas="-v" src/looptest.cu
The register usage is far lower - 8 fewer, to be exact ( apparently due to stronger adherence in sm_20 than sm_13 to IEEE floating point math standards?):
ptxas info : Compiling entry function '_Z9loop_testPPfPPKfS2_4int3i' for 'sm_13'
ptxas info : Used 17 registers, 40+16 bytes smem, 8 bytes cmem[1]
As a final note, change the macro OL to 40, and suddenly:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 72 bytes cmem[0]
In conclusion I would like to know where registers are being eaten up, and what results in the couple observations I have made.
I don't have enough experience with assembly to get through a cuobjdump - the answer certainly lies buried in there - maybe someone can enlighten me about what I should be looking for or show me a guide as to how to approach the assembly dump.

sm_20 and sm_13 are very different architectures, with very different instruction set (ISA) design. The main difference that causes the increase in register usage that you see is that sm_1x has special-purpose address registers, while sm_2x and later do not. Instead, addresses are stored in general-purpose registers just like values are, which means most programs require more registers on sm_2x than on sm_1x.
sm_20 also has twice the register file size of sm_13, to compensate for this effect.

Register usage does not necessarily have a close correlation to the number of variables.
The compiler tries to assess the speed benefit of keeping a variable in a register between two points of use in the code by comparing the potential gain in a single kernel with the cost to all concurrently running kernels due to there being fewer registers available in the register pool. (A Fermi SM has 32768 registers.) So, it's not surprising if changing your code causes unexpected fluctuations in the number of registers used.
You really should only be worried about register usage if the profiler says that your occupancy is limited by register usage. In that case, you can use the --maxrregcount setting to lower the number of registers used by a single kernel to see if it improves overall execution speed.
To help reduce the number of registers used by a kernel, you can try to keep variable use as local as possible. For instance, if you do:
set variable 1
set variable 2
use variable 1
use variable 2
That may cause 2 registers to be used. While, if you:
set variable 1
use variable 1
set variable 2
use variable 2
That might cause 1 register to be used.
