CURAND - Device API and seed - c

Context: I am currently learning how to properly use CUDA, in particular how to generate random numbers using CURAND. I learned here that it might be wise to generate my random numbers directly when I need them, inside the kernel which performs the core calculation in my code.
Following the documentation, I decided to play a bit and try to come up with a simple running piece of code which I can later adapt to my needs.
I excluded MTGP32 because of the limit of 256 concurrent threads in a block (and just 200 pre-generated parameter sets). Besides, I do not want to use doubles, so I decided to stick to the default generator (XORWOW).
Problem: I am having a hard time understanding why the same seed value in my code is generating different sequences of numbers for a number of threads per block bigger than 128 (when blockSize < 129, everything runs as I would expect). After doing proper CUDA error checking, as suggested by Robert in his comment, it is somewhat clear that hardware limitations play a role. Moreover, not using the "-G -g" flags at compile time raises the "threshold for trouble" from 128 to 384.
Questions: What exactly is causing this? Robert wrote in his comment that "it might be a registers per thread issue". What does this mean? Is there an easy way to look at the hardware specs and say where this limit will be? Can I get around this issue without having to generate more random numbers per thread?
A related issue seems to have been discussed here but I do not think it applies to my case.
My code (see below) was mostly inspired by these examples.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand_kernel.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
    if (code != cudaSuccess){
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
__global__ void setup_kernel(curandState *state, int seed, int n){
    int id = threadIdx.x + blockIdx.x*blockDim.x;
    if(id < n){
        curand_init(seed, id, 0, &state[id]);
    }
}
__global__ void generate_uniform_kernel(curandState *state, float *result, int n){
    int id = threadIdx.x + blockIdx.x*blockDim.x;
    float x;
    if(id < n){
        curandState localState = state[id];
        x = curand_uniform(&localState);
        state[id] = localState;
        result[id] = x;
    }
}
int main(int argc, char *argv[]){
    curandState *devStates;
    float *devResults, *hostResults;

    int n = atoi(argv[1]);
    int s = atoi(argv[2]);
    int blockSize = atoi(argv[3]);
    int nBlocks = n/blockSize + (n%blockSize == 0 ? 0 : 1);
    printf("\nn: %d, blockSize: %d, nBlocks: %d, seed: %d\n", n, blockSize, nBlocks, s);

    hostResults = (float *)calloc(n, sizeof(float));
    cudaMalloc((void **)&devResults, n*sizeof(float));
    cudaMalloc((void **)&devStates, n*sizeof(curandState));

    setup_kernel<<<nBlocks, blockSize>>>(devStates, s, n);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    generate_uniform_kernel<<<nBlocks, blockSize>>>(devStates, devResults, n);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    cudaMemcpy(hostResults, devResults, n*sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < n; i++) {
        printf("\n%10.13f", hostResults[i]);
    }

    cudaFree(devStates);
    cudaFree(devResults);
    free(hostResults);
    return 0;
}
I compiled two binaries, one using the "-G -g" debugging flags and the other without. I named them rng_gen_d and rng_gen, respectively:
$ nvcc -lcuda -lcurand -O3 -G -g --ptxas-options=-v rng_gen.cu -o rng_gen_d
ptxas /tmp/tmpxft_00002257_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
ptxas info : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
ptxas info : Used 43 registers, 32 bytes smem, 72 bytes cmem[1], 6480 bytes lmem
ptxas info : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
ptxas info : Used 10 registers, 36 bytes smem, 40 bytes cmem[1], 48 bytes lmem
$ nvcc -lcuda -lcurand -O3 --ptxas-options=-v rng_gen.cu -o rng_gen
ptxas /tmp/tmpxft_00002b73_00000000-5_rng_gen.ptx, line 533; warning : Double is not supported. Demoting to float
ptxas info : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
ptxas info : Used 20 registers, 32 bytes smem, 48 bytes cmem[1], 6440 bytes lmem
ptxas info : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
ptxas info : Used 19 registers, 36 bytes smem, 4 bytes cmem[1]
To start with, there is a strange warning message at compile time (see above):
ptxas /tmp/tmpxft_00002b31_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
Some debugging showed that the line causing this warning is:
curandState localState = state[id];
There are no doubles declared, so I do not know exactly how to solve this (or even if this needs solving).
Now, an example of the (actual) problem I am facing:
$ ./rng_gen_d 5 314 127
n: 5, blockSize: 127, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen_d 5 314 128
n: 5, blockSize: 128, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen_d 5 314 129
n: 5, blockSize: 129, nBlocks: 1, seed: 314
GPUassert: too many resources requested for launch rng_gen.cu 54
Line 54 is gpuErrchk() right after setup_kernel().
With the other binary (no "-G -g" flags at compile time), the "threshold for trouble" is raised to 384:
$ ./rng_gen 5 314 129
n: 5, blockSize: 129, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen 5 314 384
n: 5, blockSize: 384, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen 5 314 385
n: 5, blockSize: 385, nBlocks: 1, seed: 314
GPUassert: too many resources requested for launch rng_gen.cu 54
Finally, should this be somehow related to the hardware I am using for this preliminary testing (the project will be later launched on a much more powerful machine), here are the specs of the card I am using:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro NVS 160M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 256 MBytes (268107776 bytes)
( 1) Multiprocessors, ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock rate: 1450 MHz (1.45 GHz)
Memory Clock rate: 702 Mhz
Memory Bus Width: 64-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Quadro NVS 160M
Result = PASS
And this is it. Any guidance on this matter will be most welcome. Thanks!
EDIT:
1) Added proper cuda error checking, as suggested by Robert.
2) Deleted the cudaMemset line, which was useless anyway.
3) Compiled and ran the code without the "-G -g" flags.
4) Updated the output accordingly.

First of all, when you're having trouble with CUDA code, it's always advisable to do proper CUDA error checking. It will eliminate a certain amount of head scratching, probably save you some time, and will certainly improve the ability of folks to help you on sites like this one.
Now you've discovered you have a registers-per-thread issue. The compiler, while generating code, will use registers for various purposes. Each thread requires this complement of registers to run its thread code. When you attempt to launch a kernel, one of the requirements that must be met is that the number of registers required per thread times the number of requested threads in the launch must not exceed the total number of registers available per block. Note that the number of registers required per thread may have to be rounded up to some granular allocation increment. Also note that the number of threads requested will normally be rounded up to the next higher increment of 32 (if not evenly divisible by 32), as threads are launched in warps of 32. Also note that the max registers per block varies by compute capability, and this quantity can be inspected via the deviceQuery sample, as you've shown. And, as you've discovered, certain command-line switches like -G affect how nvcc utilizes registers.
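To make that concrete with your numbers (assuming the cc1.x behavior of rounding the thread count up to a multiple of 64, i.e. two warps, for register allocation, as the occupancy calculator suggests): the debug build of setup_kernel needs 43 registers per thread, so a 129-thread launch is budgeted as 192 threads and needs roughly 192 * 43 = 8256 registers, just over the 8192 registers per block your device offers, whereas 128 * 43 = 5504 fits comfortably. The release build needs 20 registers per thread, so 384 * 20 = 7680 still fits, but 385 threads (budgeted as 448) would need about 448 * 20 = 8960 and the launch fails.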
To get advance notice of these types of resource issues, you can compile your code with additional command line switches:
nvcc -arch=sm_11 -Xptxas=-v -o mycode mycode.cu
The -Xptxas=-v switch will generate resource usage output from the ptxas assembler (which converts intermediate PTX code to SASS assembly, i.e. machine code), including the registers required per thread. Note that the output is delivered per kernel in this case, as each kernel may have its own requirements. You can get more info about the nvcc compiler in the documentation.
As a crude workaround, you can specify a switch at compile time to limit all kernel compilation to a max register usage number:
nvcc -arch=sm_11 -Xptxas=-v -maxrregcount=16 -o mycode mycode.cu
This would limit each kernel to using no more than 16 registers per thread. When multiplied by 512 (the hardware limit of threads per block for a cc1.x device) this yields a value of 8192, which is the hardware limit on total registers per threadblock for your device.
However the above method is crude in that it applies the same limit to all kernels in your program. If you wanted to tailor this to each kernel launch (for example if different kernels in your program were launching different numbers of threads) you could use the launch bounds methodology, which is described here.
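As a minimal sketch (not a drop-in fix), a per-kernel cap with __launch_bounds__ applied to your setup_kernel might look like this; it tells ptxas the kernel will never be launched with more than 512 threads per block, so it must keep the per-thread register count low enough for that block size (spilling to local memory if necessary):

__global__ void __launch_bounds__(512)
setup_kernel(curandState *state, int seed, int n){
    int id = threadIdx.x + blockIdx.x*blockDim.x;
    if(id < n){
        curand_init(seed, id, 0, &state[id]);
    }
}

The 512 here is just the cc1.x hardware maximum; you could use your actual maximum block size instead.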

Related

How does the CPU cache affect the performance of a C program

I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.
// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    double sum = 0.0;
    for (uint64_t i = 0; i < ni; ++i) {
        sum += data[i][0];
    }
    return sum;
}

int compare(void const* _a, void const* _b)
{
    double const a = *((double*)_a);
    double const b = *((double*)_b);
    return (a > b) - (a < b);
}

int main(int argc, char** argv)
{
    // set sizes
    assert(argc == 2);
    uint64_t const iter_max = 101;
    uint64_t const ni = 1000000;
    uint64_t const nj = strtol(argv[1], 0, 10);

    // initialize data
    double(*data)[nj] = calloc(ni, sizeof(*data));
    for (uint64_t i = 0; i < ni; ++i) {
        for (uint64_t j = 0; j < nj; ++j) {
            data[i][j] = rand() / (double)RAND_MAX;
        }
    }

    // test performance
    double* dt = calloc(iter_max, sizeof(*dt));
    double const sum0 = sum_column(ni, nj, data);
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        clock_t const t_start = clock();
        double const sum = sum_column(ni, nj, data);
        clock_t const t_stop = clock();
        assert(sum == sum0);
        dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
    }

    // sort dt
    qsort(dt, iter_max, sizeof(*dt), compare);

    // compute mean dt
    double dt_mean = 0.0;
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        dt_mean += dt[iter];
    }
    dt_mean /= iter_max;

    // print results
    printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
           dt[iter_max - 1]);

    // free memory
    free(data);
}
However, the results are not quite how I would expect them to be:
As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 bytes on my machine). This would explain why, with growing nj, the time to solution first increases linearly and then levels out at some value. If nj == 1, one load places the next 7 values in the cache and thus we only need to load from RAM every 8th value. If nj == 2, following the same logic, we need to access RAM every 4th value. After some size, we will have to access RAM for every value, which should result in the performance leveling out. My guess for why the linear section of the graph goes further than 4 is that in reality there are multiple levels of cache at work here, and the way that values end up in these caches is a little more complex than what I explained here.
What I cannot explain is why there are these performance peaks at multiples of 16.
After thinking about this question for a bit, I decided to check if this also occurs for higher values of nj:
In fact, it does. And, there is more: Why does the performance increase again after ~250?
Could someone explain to me, or point me to some appropriate reference, why there are these peaks and why the performance increases for higher values of nj.
If you would like to try the code for yourself, I will also attach my plotting script, for your convenience:
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt("out.csv")
data[:,1:] /= data[0,1]

dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
    if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
        plt.axvline(data[i + 1,0], color='gray', linestyle='--')
        plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
                 rotation=0, ha="center", va="center",
                 bbox=dict(boxstyle="round", ec='gray', fc='w'))

plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")
The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with 64-byte cache lines (typically an x86 one).
I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:
First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the accesses are contiguous. In this case, the processor can efficiently prefetch the cache lines from DRAM so as to hide its high latency. There is also good spatial cache locality since many contiguous items share the same cache line. When nj = 2, only half of each cache line is used. This means twice as many cache lines are requested for the same number of operations. That being said, the time is not much larger because of the relatively high latency of adding two floating-point numbers, which makes the code compute-bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel. Note that most processors can also load multiple values from the cache per cycle. When nj = 4, a new cache line is requested every 2 cycles (since a double takes 8 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One may expect the time to be stable for nj >= 8 since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the huge overhead of the DRAM latency in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK such a prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, i.e. when nj >= 64). This explains why the timings increase sharply when nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way associative, with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:
Cache 0: L1 data cache, line size 64, 8-ways, 64 sets, size 32k
Cache 1: L1 instruction cache, line size 64, 8-ways, 64 sets, size 32k
Cache 2: L2 unified cache, line size 64, 4-ways, 1024 sets, size 256k
Cache 3: L3 unified cache, line size 64, 12-ways, 12288 sets, size 9216k
This means that two values fetched from DRAM with respective addresses A1 and A2 can conflict in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In that case, the processor needs to choose which cache line to replace from a set of N=8 cache lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations can compete for the same cache lines, resulting in excessive cache misses. More information about this can also be found in this post.
Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (i.e. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem since the prefetcher needs a fairly large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching one cache line from DRAM is several dozen nanoseconds (e.g. ~70 ns) while the bandwidth is dozens of GiB/s (e.g. ~40 GiB/s): dozens of cache lines (e.g. ~40) should be fetched concurrently so as to hide the latency and saturate the DRAM.
Here is a simulation of the number of cache lines that can actually be used in my L1 cache as a function of nj:
nj #cache-lines
1 512
2 512
3 512
4 512
5 512
6 512
7 512
8 512
9 512
10 512
11 512
12 512
13 512
14 512
15 512
16 256 <----
17 512
18 512
19 512
20 512
21 512
22 512
23 512
24 512
25 512
26 512
27 512
28 512
29 512
30 512
31 512
32 128 <----
33 512
34 512
35 512
36 512
37 512
38 512
39 512
40 512
41 512
42 512
43 512
44 512
45 512
46 512
47 512
48 256 <----
49 512
50 512
51 512
52 512
53 512
54 512
55 512
56 512
57 512
58 512
59 512
60 512
61 512
62 512
63 512
64 64 <----
==============
80 256
96 128
112 256
128 32
144 256
160 128
176 256
192 64
208 256
224 128
240 256
256 16
384 32
512 8
1024 4
We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher will preload data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are then more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value needs to be fetched again from L2 or even L3, resulting in slower execution. Note that the L2 cache is subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute addresses and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
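As a rough cross-check (this is not the author's exact simulation, which also accounts for prefetching), an upper bound of "distinct L1 sets touched x associativity" can be computed with a few lines of C, assuming the 32 KiB, 8-way, 64-set, 64-byte-line L1 listed above:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t line = 64;        // cache line size in bytes
    const size_t sets = 64;        // number of L1 sets (32 KiB / 8 ways / 64 B)
    const size_t ways = 8;         // associativity
    const size_t ni   = 1000000;   // rows, as in the benchmark

    for (size_t nj = 1; nj <= 1024; nj *= 2) {
        char used[64] = {0};       // which sets the column walk touches
        size_t distinct = 0;
        for (size_t i = 0; i < ni; ++i) {
            size_t addr = i * nj * sizeof(double);   // byte offset of data[i][0]
            size_t set  = (addr / line) % sets;
            if (!used[set]) { used[set] = 1; ++distinct; }
        }
        printf("nj=%4zu  usable L1 lines <= %zu\n", nj, distinct * ways);
    }
    return 0;
}

For power-of-two strides this reproduces the step pattern of the table above; the exact values for the largest strides come from the more detailed simulation.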
Here are the timings on my machine for some of the peaks:
32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03
As expected, the timings are overall bigger in practice for the cases where the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising since they are significantly faster than the others. This is the case where the number of available cache lines is equal to the associativity (N). My guess is that this is because Intel processors likely detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).
Finally, for large nj strides, a bigger nj should result in higher overheads, mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with a bigger nj, and the number of TLB entries is limited. In fact, this is what I observe on my machine: timings tend to increase slowly and steadily, unlike on your target platform.
I cannot really explain this very strange behavior yet.
Here are some wild guesses:
The OS could tend to use more huge pages when nj is large (so as to reduce the overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher, although AFAIK it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit memmap), or possibly disable them. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
If the target architecture is a NUMA one (e.g. new AMD processors, or a server with multiple processors each having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance due to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux so as to force local allocations.
For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 cache works in practice is available here.
Removing the peaks
To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetch instructions, so that cache lines are fetched into a non-temporal cache structure close to the processor that should not cause cache thrashing in the L1 (if possible). Such a cache structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop unrolling optimization to speed up the code when nj is small:
// Note: _mm_prefetch and _MM_HINT_NTA come from the x86 intrinsics headers
// (e.g. #include <xmmintrin.h> or <immintrin.h>).
double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;

    if(nj % 16 == 0)
    {
        // Cache-bypassing prefetch to avoid cache thrashing
        const size_t distance = 12;
        for (uint64_t i = 0; i < ni; ++i) {
            _mm_prefetch(&data[(i+distance)*nj+0], _MM_HINT_NTA);
            sum0 += data[i*nj+0];
        }
    }
    else
    {
        // Unrolling is much better for small strides
        for (uint64_t i = 0; i < ni; i+=4) {
            sum0 += data[(i+0)*nj+0];
            sum1 += data[(i+1)*nj+0];
            sum2 += data[(i+2)*nj+0];
            sum3 += data[(i+3)*nj+0];
        }
    }

    return sum0 + sum1 + sum2 + sum3;
}
Here is the result of the modified code:
We can see that peaks no longer appear in the timings. We can also see that the values are much bigger due to dt0 being about 4 times smaller (due to the loop unrolling).
Note that cache thrashing in the L2 cache is not avoided with this method in practice (at least on Intel processors). This means that the effect is still there for huge nj strides that are multiples of 512 (4 KiB) on my machine (it is actually slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj%512) == 0 && nj >= 512 on x86 processors. AFAIK, there is no way to address this problem. That being said, it is a very bad idea to perform such big strided accesses on very large data structures anyway.
Note that distance should be carefully chosen, since early prefetching can result in cache lines being evicted before they are actually used (so they need to be fetched again), and late prefetching is not very useful. I think using a value close to the number of entries in the LFB/MAB is a good idea (e.g. 12 on Skylake/KabyLake/CannonLake, 22 on Zen 2).

C microbenchmark 'bug' when measuring store latency

I have been trying a few experiments on x86 - namely the effect of mfence on store/load latencies, etc.
Here is what I have started with:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define ARRAY_SIZE 10
#define DUMMY_LOOP_CNT 1000000

int main()
{
    char array[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        array[i] = 'x'; // This is to force the OS to allocate the array

    asm volatile ("mfence\n");

    for (int i = 0; i < DUMMY_LOOP_CNT; i++); // A dummy loop to just warm up the processor

    struct result_tuple{
        uint64_t tsp_start;
        uint64_t tsp_end;
        int offset;
    };

    struct result_tuple* results = calloc(ARRAY_SIZE, sizeof (struct result_tuple));
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        uint64_t *tsp_start, *tsp_end;
        tsp_start = &results[i].tsp_start;
        tsp_end = &results[i].tsp_end;
        results[i].offset = i;

        asm volatile (
            "mfence\n"
            "rdtscp\n"
            "mov %%rdx,%[arg]\n"
            "shl $32,%[arg]\n"
            "or %%rax,%[arg]\n"
            :[arg]"=&r"(*tsp_start)
            ::"rax","rdx","rcx","memory"
        );

        array[i] = 'y'; // A simple store

        asm volatile (
            "mfence\n"
            "rdtscp\n"
            "mov %%rdx,%[arg]\n"
            "shl $32,%[arg]\n"
            "or %%rax,%[arg]\n"
            :[arg]"=&r"(*tsp_end)
            ::"rax","rdx","rcx","memory"
        );
    }

    printf("Offset\tLatency\n");
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        printf("%d\t%lu\n", results[i].offset, results[i].tsp_end - results[i].tsp_start);
    }
    free(results);
}
I compile quite simply with gcc microbenchmark.c -o microbenchmark
My system configuration is as follows:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)
My issue is this:
In a single run, all the latencies are similar
When repeating the experiment over and over, I don't get results similar to the previous run!
For instance:
In run 1 I get:
Offset Latency
1 275
2 262
3 262
4 262
5 275
...
252 275
253 275
254 262
255 262
In another run I get:
Offset Latency
1 75
2 75
3 75
4 72
5 72
...
251 72
252 72
253 75
254 75
255 72
This is pretty surprising (The among-run variation is pretty high, whereas there is negligible within-run variation)! I am not sure how to explain this. What is the issue with my microbenchmark?
Note: I do understand that a normal store would be a write allocate store.. Technically making my measurement that of a load (rather than a store). Also, mfence should flush the store buffer, thereby ensuring that no stores are 'delayed'.
Your warm-up dummy loop only does 1 million iterations, ~6 million clock cycles in a -O0 debug build - probably not long enough to get the CPU up to max turbo, on a CPU before Skylake's hardware P-state management. (Idiomatic way of performance evaluation?)
RDTSCP counts fixed-frequency reference cycles, not core clock cycles. Your runs are so short that all the run-to-run variation is probably explained by the CPU frequency being low or high. See How to get the CPU cycle count in x86_64 from C++?
Also, this debug (-O0) build will do extra stores and reloads inside your timed region, but "fortunately" the results[i].offset = i; store plus the mfence before the first rdtscp ensures the result array is also hot in cache before entering the timed region.
Your array is tiny, and you're only doing 1-byte stores (so 64 stores are all in the same cache line.) It's very likely still in MESI Modified state from when you initialized it, so I wouldn't expect an RFO on any of the array[i] = 'y' stores. That already happened for the few lines of stack memory involved before your timed loop. If you want to pre-fault the array without also getting it cached, maybe touch one line per 4k page and leave the other lines untouched. But HW prefetch will get ahead of your stores, especially if you only store 1 byte at a time with 2 slow mfences per store, so again the waiting for off-core memory requests will be outside the timed region. You should expect data to already be in L1d cache or at least L2 in Exclusive state, ready to be flipped to Modified on a store.
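A minimal sketch of that pre-faulting idea (hypothetical helper, assuming buf points to len bytes of freshly allocated memory and a 4 KiB page size):

// Touch one byte per 4 KiB page so the OS backs the buffer with physical
// pages, while leaving most of its cache lines untouched.
static void prefault_pages(volatile char *buf, size_t len)
{
    for (size_t off = 0; off < len; off += 4096)
        buf[off] = 'x';
}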
BTW, having an offset member seems pointless; it can be implicit from the array index. e.g. print i instead of offset[i]. It's also not very useful to store both start and stop absolute TSC values. You could just store a 32-bit difference, then you wouldn't need to shift / OR in your inline asm, just declare a clobber on the unused EDX output.
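A sketch of what that could look like with the __rdtscp intrinsic instead of hand-written asm, storing only the 32-bit difference (assumes GCC or Clang on x86-64; in real measurement code you may still want explicit compiler barriers around the timed store):

#include <stdint.h>
#include <x86intrin.h>   // __rdtscp, _mm_mfence

// Time a single 1-byte store, mirroring the fencing of the original loop.
static inline uint32_t time_store(volatile char *p)
{
    unsigned aux;                       // receives IA32_TSC_AUX, unused here
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    *p = 'y';                           // the timed store
    _mm_mfence();
    uint64_t stop = __rdtscp(&aux);
    return (uint32_t)(stop - start);    // the difference easily fits in 32 bits
}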
Also note that "store latency" typically only matters for performance in real code when mfence is involved. Otherwise the important thing is store->load forwarding, which can happen from the store buffer before the store commits to L1d cache. That's about 6 cycles, or sometimes lower if the reload isn't attempted right away. (It's variable on Sandybridge-family.)

Sum reduction with parallel algorithm - Bad performances compared to CPU version

I have written a small code for doing sum reduction of a 1D array, and I am comparing a CPU sequential version and an OpenCL version.
The code is available on this link1
The kernel code is available on this link2
and if you want to compile : link3 for Makefile
My issue is the bad performance of the GPU version:
for vector sizes lower than 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements), the runtime of the GPU version is higher (only slightly, but higher) than the CPU one.
As you can see, I have taken 2^n values in order to have a number of work-items compatible with the size of a work-group.
Concerning the number of workgroups, I have taken :
// Number of work-groups
int nWorkGroups = size/local_item_size;
But for a high number of workitems, I wonder if the value of nWorkGroups is suitable ( for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 workgroups, isn't this too much ?? ).
I tried to modify local_item_size in the range [64, 128, 256, 512, 1024], but the performance remains bad for all these values.
I have good benefits only for size = 1.024 * 10^9 elements, here are the runtimes :
Size of the vector
1024000000
Problem size = 1024000000
GPU Parallel Reduction : Wall Clock = 20 second 977511 micro
Final Sum Sequential = 5.2428800006710899200e+17
Sequential Reduction : Wall Clock = 337 second 459777 micro
From your experience, why do I get such bad performance? I thought that the advantage over the CPU version should be more significant.
Maybe someone can spot a major mistake in the source code because, at the moment, I can't manage to solve this issue.
Thanks
Well I can tell you some reasons:
You don't need to write the reduction buffer. You can directly clear it in GPU memory using clEnqueueFillBuffer() or a helper kernel.
ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
                           local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);
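A sketch of what the replacement could look like (assuming OpenCL 1.2 or later and the same variable names as above):

// Clear the reduction buffer directly in GPU memory instead of
// writing zeros from the host; non-blocking, no wait list.
const double zero = 0.0;
ret = clEnqueueFillBuffer(command_queue, reductionBuffer,
                          &zero, sizeof(zero),                  // fill pattern and its size
                          0, local_item_size * sizeof(double),  // offset and size in bytes
                          0, NULL, NULL);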
Don't use blocking calls, except for the last read. Otherwise you are wasting some time there.
You are doing the last reduction on the CPU. Iterative processing through the kernel can help.
If your kernel only reduces 128 elements per pass, your 10^9 elements only get down to 8*10^6, and the CPU does the rest. If you add the data copy on top of that, it is not worth it at all.
However, if you run 3 passes at 512 elements per pass, you read out from the GPU just 10^9/512^3 = 8 values. So, the only bottleneck would be the first GPU copy and the kernel launch.
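A host-side sketch of that iterative approach (the names reduceKernel, inBuf and outBuf are hypothetical; each pass produces one partial sum per work-group):

size_t remaining = size;
while (remaining > local_item_size) {
    size_t global = ((remaining + local_item_size - 1) / local_item_size)
                    * local_item_size;
    clSetKernelArg(reduceKernel, 0, sizeof(cl_mem), &inBuf);
    clSetKernelArg(reduceKernel, 1, sizeof(cl_mem), &outBuf);
    clSetKernelArg(reduceKernel, 2, sizeof(cl_ulong), &remaining);
    clEnqueueNDRangeKernel(command_queue, reduceKernel, 1, NULL,
                           &global, &local_item_size, 0, NULL, NULL);
    remaining = global / local_item_size;              // partial sums left to reduce
    cl_mem tmp = inBuf; inBuf = outBuf; outBuf = tmp;   // ping-pong the buffers
}
// Only now read the few remaining partial sums back (a blocking read is fine here).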

Is there any limitation of array length in C?

I'm working on a program in C. I want to initialize an array which has a length of 1,000,000
It compiles without any errors or warnings, but during execution Windows terminates the process.
I modified my code so there will be 4 arrays each having 500,000 integers. It again compiles without error or warning but the problem still exists.
I use Code::Blocks (with the GCC compiler, I think).
Here is my code:
#include <stdio.h>
#include <math.h>

// Prototypes:
int checkprime(int n);

int main(){
    int m = 0;
    int A[500001] = {2,2,0}; // from k=1 to 500000
    int B[500000] = {0};     // from k=500001 to 1000000
    int C[500000] = {0};     // from k=1000001 to 1500000
    int D[500000] = {0};     // from k=1500001 to 2000000
    int n = 3;
    int k = 2;
    for(n = 3; n < 2000001; n += 2){
        if(checkprime(n)){
            if (k <= 500000){
                A[k] = n;
                k += 1;
            }
            else if ((k > 500000) && (k <= 1000000)){
                B[k-500001] = n;
                k += 1;
            }
            else if ((k > 1000000) && (k <= 1500000)){
                C[k-1000001] = n;
                k += 1;
            }
            else if(k > 1500000){
                D[k-1500001] = n;
                k += 1;
            }
        } // end of if
    } // end for
    int i = 0;
    for(i = 1; i < 500001; i++)
    {
        m = m + A[i];
    }
    for(i = 0; i < 5000001; i++)
    {
        m = m + B[i];
    }
    for(i = 0; i < 5000001; i++)
    {
        m = m + C[i];
    }
    for(i = 0; i < 5000001; i++)
    {
        m = m + D[i];
    }
    printf("answer is %d", m);
    return 0; // Successful end indicator
} // end of main

int checkprime(int n){
    int m = sqrt(n);
    if (!(m % 2))
    {
        m = m + 1;
    }
    int stop = 0;
    int d = 0;
    int isprime = 1;
    while((m != 1) && (stop == 0)){
        d = n % m;
        if (d == 0){
            stop = 1;
            isprime = 0;
        }
        m -= 2;
    } // end of while
    return isprime;
} // end of checkprime
The limit on maximum stack size is controlled with the ulimit command. The compiler can (or may not) set a smaller limit, but not a bigger one than that.
To see current limit (in kilobytes):
ulimit -s
To remove limit:
ulimit -s unlimited
I hope your huge initialized array is static or global. If it is a local variable it would overflow the stack at runtime.
I believe that old versions of GCC had sub-optimal behavior (perhaps quadratic time) when initializing an array.
I also believe that the C standard might define a (small) minimal array size which all conforming compilers should accept (there is such a lower limit for string sizes, it might be as small as 512).
IIRC, recent versions of GCC improved their behavior when initializing static arrays. Try with GCC 4.7
With my Debian/Sid's gcc-4.7.1 I am able to compile a biga.c file starting with
int big[] = {2 ,
3 ,
5 ,
7 ,
11 ,
13 ,
and ending with
399999937 ,
399999947 ,
399999949 ,
399999959 ,
};
and containing 23105402 lines:
% time gcc -c biga.c
gcc -c biga.c 43.51s user 1.87s system 96% cpu 46.962 total
% /usr/bin/time -v gcc -O2 -c biga.c
Command being timed: "gcc -O2 -c biga.c"
User time (seconds): 48.99
System time (seconds): 2.10
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5157040
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 691666
Voluntary context switches: 25
Involuntary context switches: 5162
Swaps: 0
File system inputs: 32
File system outputs: 931512
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is on a i7 3770K desktop with 16Gb RAM.
Having huge arrays as locals, even inside main, is always a bad idea. Either heap-allocate them (e.g. with calloc or malloc, then free them appropriately) or make them global or static. Space for local data on the call stack is always a scarce resource. A typical call frame (the combined size of all local variables) should be less than a kilobyte. A call frame bigger than a megabyte is almost always bad and unprofessional.
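A minimal sketch of the heap-allocation route for this program (sizes as in the question; calloc also gives you the zero-initialization the arrays relied on):

#include <stdio.h>
#include <stdlib.h>

int main(void){
    size_t n = 2000000;
    int *primes = calloc(n, sizeof *primes);   // zero-initialized, lives on the heap
    if (primes == NULL){
        perror("calloc");
        return 1;
    }
    /* ... fill and use primes[0..n-1] instead of A, B, C and D ... */
    free(primes);
    return 0;
}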
Yes, there is.
Local arrays are created on the stack, and if your array is too large, the stack will collide with something else in memory and crash the program.

Tracking down cuda kernel register usage

I am trying to track down register usage and came across an interesting scenario. Consider the following source:
#define OL 20
#define NHS 10

__global__ void loop_test( float ** out, const float ** in, int3 gdims, int stride){
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    const int idy = blockIdx.y*blockDim.y + threadIdx.y;
    const int idz = blockIdx.z*blockDim.z + threadIdx.z;
    const int index = stride*gdims.y*idz + idy*stride + idx;

    int i = 0, j = 0;
    float sum = 0.f;
    float tmp;
    float lf;
    float u2, tW;

    u2 = 1.0;
    tW = 2.0;

    float herm[NHS];

    for(j = 0; j < OL; ++j){
        for(i = 0; i < NHS; ++i){
            herm[i] += in[j][index];
        }
    }

    for(j = 0; j < OL; ++j){
        for(i = 0; i < NHS; ++i){
            tmp = sum + herm[i]*in[j][index];
            sum = tmp;
        }
        out[j][index] = sum;
        sum = 0.f;
    }
}
As a side note on the source - the running sum could be done with +=, but I was playing with how changing that affects register usage (it seems it doesn't - it just adds an extra mov instruction).
Additionally this source is oriented for accessing memory mapped to 3D space.
Counting the registers, it would seem there are 22 registers (I believe a float[N] takes up N+1 registers - please correct me if I'm wrong) based on the declarations.
However compiling with:
nvcc -cubin -arch=sm_20 -Xptxas="-v" src/looptest.cu
yields:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 72 bytes cmem[0]
Ok, so the number is different from what is 'expected'. Additionally, if compiled with:
nvcc -cubin -arch=sm_13 -Xptxas="-v" src/looptest.cu
The register usage is far less - 8 fewer registers, to be exact (apparently due to stronger adherence to IEEE floating-point math standards in sm_20 than in sm_13?):
ptxas info : Compiling entry function '_Z9loop_testPPfPPKfS2_4int3i' for 'sm_13'
ptxas info : Used 17 registers, 40+16 bytes smem, 8 bytes cmem[1]
As a final note, change the macro OL to 40, and suddenly:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 72 bytes cmem[0]
In conclusion, I would like to know where the registers are being eaten up, and what causes the couple of observations I have made.
I don't have enough experience with assembly to get through a cuobjdump - the answer certainly lies buried in there - maybe someone can enlighten me about what I should be looking for or show me a guide as to how to approach the assembly dump.
sm_20 and sm_13 are very different architectures, with very different instruction set (ISA) design. The main difference that causes the increase in register usage that you see is that sm_1x has special-purpose address registers, while sm_2x and later do not. Instead, addresses are stored in general-purpose registers just like values are, which means most programs require more registers on sm_2x than on sm_1x.
sm_20 also has twice the register file size of sm_13, to compensate for this effect.
Register usage does not necessarily have a close correlation to the number of variables.
The compiler tries to assess the speed benefit of keeping a variable in a register between two points of use in the code, by comparing the potential gain in a single kernel with the cost to all concurrently running kernels due to there being fewer registers available in the register pool. (A Fermi SM has 32768 registers.) So, it's not surprising if changing your code causes unexpected fluctuations in the number of registers used.
You really should only be worried about register usage if the profiler says that your occupancy is limited by register usage. In that case, you can use the --maxrregcount setting to lower the number of registers used by a single kernel to see if it improves overall execution speed.
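For example, reusing the compile line from the question with a (hypothetical) cap of 20 registers per thread:

nvcc -cubin -arch=sm_20 -Xptxas="-v" -maxrregcount=20 src/looptest.cu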
To help reduce the number of registers used by a kernel, you can try to keep variable use as local as possible. For instance, if you do:
set variable 1
set variable 2
use variable 1
use variable 2
That may cause 2 registers to be used, whereas if you:
set variable 1
use variable 1
set variable 2
use variable 2
That might cause 1 register to be used.
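A toy illustration of the difference (hypothetical device functions, not the kernel from the question):

// Interleaved lifetimes: a and b are both live at the same time,
// so the compiler may need two registers for them.
__device__ void interleaved(float *out, const float *in, int idx){
    float a = in[idx];          // set variable 1
    float b = in[idx + 1];      // set variable 2
    out[idx]     = a * 2.0f;    // use variable 1
    out[idx + 1] = b * 2.0f;    // use variable 2
}

// Short lifetimes: a is dead before b is set, so one register can be reused.
__device__ void sequential(float *out, const float *in, int idx){
    float a = in[idx];          // set variable 1
    out[idx] = a * 2.0f;        // use variable 1
    float b = in[idx + 1];      // set variable 2
    out[idx + 1] = b * 2.0f;    // use variable 2
}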
