Slower SSE performance on large array sizes - c

I am new to SSE programming so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.
#include <emmintrin.h>  /* SSE2 intrinsics */

int ssum(const int *d, unsigned int len)
{
    static const unsigned int BLOCKSIZE = 4;
    unsigned int i, remainder;
    int output;
    __m128i xmm0, accumulator;
    __m128i *src;

    remainder = len % BLOCKSIZE;
    src = (__m128i *)d;
    /* Assumes len >= BLOCKSIZE: the first block seeds the accumulator */
    accumulator = _mm_loadu_si128(src);
    output = 0;
    for (i = BLOCKSIZE; i < len - remainder; i += BLOCKSIZE) {
        xmm0 = _mm_loadu_si128(++src);
        accumulator = _mm_add_epi32(accumulator, xmm0);
    }
    /* Horizontal sum of the four 32-bit lanes */
    accumulator = _mm_add_epi32(accumulator, _mm_srli_si128(accumulator, 8));
    accumulator = _mm_add_epi32(accumulator, _mm_srli_si128(accumulator, 4));
    output = _mm_cvtsi128_si32(accumulator);
    /* Scalar cleanup of the remaining 0..3 elements */
    for (i = len - remainder; i < len; i++) {
        output += d[i];
    }
    return output;
}
As you can see, it is a fairly straightforward implementation: I sum the array four elements at a time using the extended XMM registers and then clean up at the end by adding the remaining elements.
I then compared the performance of this SIMD implementation against just a plain for loop. The result of this experiment is available here:
SIMD vs. for-loop
As you can see, in comparison to a for loop, this implementation does indeed show about a 60% speedup for input sizes (meaning the length of the array) up to about 5M elements. However, for larger input sizes the performance, relative to the for loop, takes a dramatic dive and yields only about a 20% speedup.
I am at a loss to explain this dramatic decrease in performance. I am more or less stepping linearly through memory, so the effect of cache misses and page faults should be about the same for both implementations. What am I missing here? Is there any way to flatten that curve out? Any thoughts would be greatly appreciated.
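For reference, the plain for loop I am comparing against is essentially just the obvious scalar accumulation (a minimal sketch, since I did not post it above):

int sum(const int *d, unsigned int len)
{
    int output = 0;
    unsigned int i;
    for (i = 0; i < len; i++) {
        output += d[i];
    }
    return output;
}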

For large inputs, the data is outside the cache, and the code is memory bound.
For small inputs, the data fits inside the cache (i.e. the L1 / L2 / L3 cache), and the code is compute bound.
I assume you didn't try to flush the cache before the performance measurement.
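If you want to measure cold-cache behavior for the small sizes as well, one option (on x86) is to explicitly evict the buffer between runs. A minimal sketch, assuming 64-byte cache lines and SSE2 support for _mm_clflush:

#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stddef.h>

/* Evict a buffer from all cache levels before a timed run. */
static void flush_buffer(const void *p, size_t bytes)
{
    const char *c = (const char *)p;
    size_t i;
    for (i = 0; i < bytes; i += 64)
        _mm_clflush(c + i);
    _mm_mfence();  /* ensure the flushes complete before timing starts */
}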
The cache memory is inside the CPU, and the bandwidth between the cache and the ALU (or SSE) units is very high (high bandwidth means less time spent transferring data).
Your highest-level cache (i.e. L3) is about 4MB to 8MB (depending on your CPU model).
Larger amounts of data must be located in DDR SDRAM, which is external RAM (outside the CPU).
The CPU is connected to the DDR SDRAM by the memory bus, which has much lower bandwidth than the cache memory.
Example:
Assume your external RAM type is Dual Channel DDR3 SDRAM 1600.
The maximum theoretical bandwidth between external RAM and CPU is about 25GB/Sec.
Reading 100MBytes of data (at 25GB/S) from the RAM to the CPU takes about 100e6 / 25e9 = 4msec.
From my experience the utilized bandwidth is about half of theoretical bandwidth, so the reading time is about 8msec.
The computation time is shorter:
Assume each iteration of your loop takes about 2 CPU clocks (just an example).
Each iteration process 16 bytes of data.
Total CPU clocks for processing 100MB takes about (100e6 / 16)*2 = 12500000 clks.
Assume CPU frequency is 3GHz.
Total SSE processing time is about 12500000 / 3e9 = 4.2msec.
As you can see, reading the data from external RAM takes about twice as long as the SSE computation.
Since the data transfer and computation occur in parallel, the total time is the maximum of 4.2msec and 8msec (i.e. 8msec).
Let's assume the loop without SSE takes twice as much computation time, so without SSE the computation time is about 8.4msec.
In the above example the total improvement of using SSE is about 0.4msec.
Note: The selected numbers are just for example purposes.
Benchmarks:
I did some benchmarks on my system.
I am using Windows 10 and Visual Studio 2010.
Benchmark test: Summing 100MBytes of data (summing 25*1024^2 32-bit integers).
CPU
Intel Core i5 3550 (Ivy Bridge).
CPU Base frequency is 3.3GHz.
Actual Core Speed during the test: 3.6GHz (Turbo boost is enabled).
L1 data cache size: 32KBytes.
L2 cache size: 256KBytes (single core L2 cache size).
L3 cache size: 6MBytes.
Memory:
8GB DDR3 Dual channel.
RAM frequency: 666MHz (1333 MT/s effective, thanks to double data rate).
Memory theoretical maximum bandwidth: (128*1333/8) / 1024 = 20.8GBytes/Sec.
Sum 100MB as large chunk with SSE (data in external RAM).
Processing time: 6.22msec
Sum 1KB 102400 times with SSE (100MB total, data stays inside the cache).
Processing time: 3.86msec
Sum 100MB as large chunk without SSE (data in external RAM).
Processing time: 8.1msec
Sum 1KB 102400 times without SSE (100MB total, data stays inside the cache).
Processing time: 4.73msec
Utilized memory bandwidth: 100/6.22 = 16GB/Sec (dividing data size by time).
Average clocks per iteration with SSE (data in cache): (3.6e9*3.86e-3)/(25/4*1024^2) = 2.1 clks/iteration (dividing total CPU clocks by number of iterations).

Related

How does the CPU cache affect the performance of a C program

I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.
// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    double sum = 0.0;
    for (uint64_t i = 0; i < ni; ++i) {
        sum += data[i][0];
    }
    return sum;
}

int compare(void const* _a, void const* _b)
{
    double const a = *((double*)_a);
    double const b = *((double*)_b);
    return (a > b) - (a < b);
}

int main(int argc, char** argv)
{
    // set sizes
    assert(argc == 2);
    uint64_t const iter_max = 101;
    uint64_t const ni = 1000000;
    uint64_t const nj = strtol(argv[1], 0, 10);

    // initialize data
    double(*data)[nj] = calloc(ni, sizeof(*data));
    for (uint64_t i = 0; i < ni; ++i) {
        for (uint64_t j = 0; j < nj; ++j) {
            data[i][j] = rand() / (double)RAND_MAX;
        }
    }

    // test performance
    double* dt = calloc(iter_max, sizeof(*dt));
    double const sum0 = sum_column(ni, nj, data);
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        clock_t const t_start = clock();
        double const sum = sum_column(ni, nj, data);
        clock_t const t_stop = clock();
        assert(sum == sum0);
        dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
    }

    // sort dt
    qsort(dt, iter_max, sizeof(*dt), compare);

    // compute mean dt
    double dt_mean = 0.0;
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        dt_mean += dt[iter];
    }
    dt_mean /= iter_max;

    // print results
    printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
           dt[iter_max - 1]);

    // free memory
    free(data);
}
However, the results are not quite how I would expect them to be:
As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 bytes on my machine). This would explain why, with growing nj, the time to solution first increases linearly and then levels out at some value. If nj == 1, one load places the next 7 values in the cache, and thus we only need to load from RAM every 8th value. If nj == 2, following the same logic, we need to access the RAM for every 4th value. After some size, we will have to access the RAM for every value, which should result in the performance leveling out. My guess for why the linear section of the graph extends beyond 4 is that in reality there are multiple levels of cache at work here, and the way values end up in these caches is a little more complex than what I explained here.
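As a sanity check of that reasoning, the expected number of new cache lines touched per access grows linearly with nj until it saturates at one per access (a tiny sketch, assuming 64-byte lines and 8-byte doubles):

#include <stdio.h>

int main(void)
{
    /* Fraction of column accesses that have to pull in a new 64-byte
     * cache line, for a row stride of nj doubles (8 bytes each). */
    for (int nj = 1; nj <= 16; ++nj) {
        double lines_per_access = nj * 8.0 / 64.0;
        if (lines_per_access > 1.0)
            lines_per_access = 1.0;
        printf("nj=%2d -> %.3f new cache lines per access\n", nj, lines_per_access);
    }
    return 0;
}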
What I cannot explain is why there are these performance peaks at multiples of 16.
After thinking about this question for a bit, I decided to check if this also occurs for higher values of nj:
In fact, it does. And, there is more: Why does the performance increase again after ~250?
Could someone explain to me, or point me to some appropriate reference, why there are these peaks and why the performance increases for higher values of nj?
If you would like to try the code for yourself, I will also attach my plotting script, for your convenience:
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt("out.csv")
data[:,1:] /= data[0,1]
dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
    if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
        plt.axvline(data[i + 1,0], color='gray', linestyle='--')
        plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
                 rotation=0, ha="center", va="center",
                 bbox=dict(boxstyle="round", ec='gray', fc='w'))
plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")
The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with 64-byte cache lines (typically an x86 one).
I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:
First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the access is contiguous. In this case, the processor can efficiently prefetch the cache lines from DRAM so as to hide its high latency. There is also good spatial cache locality, since many contiguous items share the same cache line.

When nj = 2, only half of each cache line is used. This means the number of requested cache lines is twice as large for the same number of operations. That being said, the time is not much larger, because of the relatively high latency of adding two floating-point numbers, which makes the code compute bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel. Note that most processors can also load multiple values from the cache per cycle.

When nj = 4, a new cache line is requested every 2 cycles (since a double takes 8 bytes). As a result, the memory throughput can become so large that the computation becomes memory bound. One might expect the time to be stable for nj >= 8, since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the DRAM latency, which is huge in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK, this prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, so here when nj >= 64). This explains why the timings increase sharply for nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way associative, with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:
Cache 0: L1 data cache, line size 64, 8-ways, 64 sets, size 32k
Cache 1: L1 instruction cache, line size 64, 8-ways, 64 sets, size 32k
Cache 2: L2 unified cache, line size 64, 4-ways, 1024 sets, size 256k
Cache 3: L3 unified cache, line size 64, 12-ways, 12288 sets, size 9216k
This means that two values fetched from DRAM, with respective addresses A1 and A2, can conflict in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In this case, the processor needs to choose which cache line to replace from a set of N=8 cache lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations can compete for the same cache lines, resulting in excessive cache misses. More information about this can also be found in this post.
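For clarity, the conflict condition above can be written as a tiny helper (the 32 KiB size and 64-byte line are the L1 figures listed above; this mirrors the formula in the text rather than the exact hardware indexing):

#include <stdbool.h>
#include <stdint.h>

/* True if two addresses compete for the same L1 slot, per the condition above. */
static bool same_l1_slot(uintptr_t a1, uintptr_t a2)
{
    return (a1 % 32768) / 64 == (a2 % 32768) / 64;
}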
Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (i.e. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem, since the prefetcher needs a fairly large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here, since the latency of fetching one cache line from DRAM is several dozens of nanoseconds (e.g. ~70 ns) while the bandwidth is dozens of GiB/s (e.g. ~40 GiB/s): dozens of cache lines (e.g. ~40) must be fetched concurrently to hide the latency and saturate the DRAM.
Here is a simulation of the number of cache lines that can actually be used in my L1 cache, depending on the value of nj:
nj #cache-lines
1 512
2 512
3 512
4 512
5 512
6 512
7 512
8 512
9 512
10 512
11 512
12 512
13 512
14 512
15 512
16 256 <----
17 512
18 512
19 512
20 512
21 512
22 512
23 512
24 512
25 512
26 512
27 512
28 512
29 512
30 512
31 512
32 128 <----
33 512
34 512
35 512
36 512
37 512
38 512
39 512
40 512
41 512
42 512
43 512
44 512
45 512
46 512
47 512
48 256 <----
49 512
50 512
51 512
52 512
53 512
54 512
55 512
56 512
57 512
58 512
59 512
60 512
61 512
62 512
63 512
64 64 <----
==============
80 256
96 128
112 256
128 32
144 256
160 128
176 256
192 64
208 256
224 128
240 256
256 16
384 32
512 8
1024 4
We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher will preload data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value then needs to be fetched again from the L2 or even the L3, resulting in slower execution. Note that the L2 cache is subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute things and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
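For reference, a simulation like the one tabulated above can be done by counting the distinct 64-byte slots (modulo the 32 KiB L1 size) touched by a column walk with a stride of nj doubles; the count naturally tops out at 512, the total number of L1 lines. A sketch of such a counting program (it reproduces the idea of the table, not necessarily the exact script used):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Count the distinct L1 slots (64-byte lines modulo the 32 KiB cache size)
 * touched by accessing data[i][0] for i = 0..ni-1 with a row stride of nj doubles. */
static int usable_cache_lines(uint64_t ni, uint64_t nj)
{
    bool used[512] = { false };   /* 32768 / 64 slots */
    int count = 0;
    for (uint64_t i = 0; i < ni; ++i) {
        uint64_t slot = ((i * nj * sizeof(double)) % 32768) / 64;
        if (!used[slot]) {
            used[slot] = true;
            ++count;
        }
    }
    return count;
}

int main(void)
{
    for (uint64_t nj = 1; nj <= 64; ++nj)
        printf("%3llu %4d\n", (unsigned long long)nj, usable_cache_lines(1000000, nj));
    return 0;
}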
Here are the timings on my machine for some of the peaks:
32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03
As expected, the timings are overall larger in practice in the cases where the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising, since they are significantly faster than the others. This is the case where the number of available cache lines equals the number of ways of associativity (N). My guess is that Intel processors detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).
Finally, for large nj strides, a bigger nj should result in higher overheads, mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with bigger nj, and the number of TLB entries is limited. In fact, this is what I observe on my machine: timings tend to increase slowly and in a very stable way, unlike on your target platform.
I cannot really explain this very strange behavior yet.
Here is some wild guesses:
The OS could tend to use more huge pages when nj is large (so as to reduce the TLB overhead), since larger blocks are allocated. This could result in more concurrency for the prefetcher, as AFAIK it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit memmap), or possibly disable them. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
If the target architecture is a NUMA one (e.g. newer AMD processors, or a server with multiple processors having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance due to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux to force local allocations.
For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 cache works in practice is available here.
Removing the peaks
To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetch instructions, so that cache lines are fetched into a non-temporal cache structure close to the processor that should not cause cache thrashing in the L1 (if possible). Such a structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop unrolling optimization to speed up the code when nj is small:
/* Note: _mm_prefetch requires #include <xmmintrin.h> (or <immintrin.h>). */
double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;

    if(nj % 16 == 0)
    {
        // Cache-bypassing prefetch to avoid cache thrashing
        const size_t distance = 12;
        for (uint64_t i = 0; i < ni; ++i) {
            _mm_prefetch(&data[(i+distance)*nj+0], _MM_HINT_NTA);
            sum0 += data[i*nj+0];
        }
    }
    else
    {
        // Unrolling is much better for small strides
        for (uint64_t i = 0; i < ni; i+=4) {
            sum0 += data[(i+0)*nj+0];
            sum1 += data[(i+1)*nj+0];
            sum2 += data[(i+2)*nj+0];
            sum3 += data[(i+3)*nj+0];
        }
    }

    return sum0 + sum1 + sum2 + sum3;
}
Here is the result of the modified code:
We can see that peaks no longer appear in the timings. We can also see that the values are much bigger due to dt0 being about 4 times smaller (due to the loop unrolling).
Note that cache thrashing in the L2 cache is not avoided by this method in practice (at least on Intel processors). This means that the effect is still there with huge nj strides that are multiples of 512 (4 KiB) on my machine (it is actually slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj%512) == 0 && nj >= 512 on x86 processors; AFAIK, there is no other way to address this problem. That being said, it is a very bad idea to perform such big strided accesses on very large data structures anyway.
Note that distance should be carefully chosen, since prefetching too early can result in cache lines being evicted before they are actually used (so they need to be fetched again), and prefetching too late is not very useful. I think using a value close to the number of entries in the LFB/MAB is a good idea (e.g. 12 on Skylake/KabyLake/CannonLake, 22 on Zen 2).

Effectively migrate a linux process with C

I need to estimate how much it costs to migrate a Linux process to another core of the same computer. To migrate the process I'm using the sched_setaffinity system call, but I've noticed that the migration does not always happen instantaneously, which is my requirement.
More specifically, I'm creating a C program which performs lots of simple computations twice, the first time without migration and the second time with migration. Computing the difference between the two timestamps should give me a rough estimate of the migration overhead. However, I need to figure out how I can migrate the current process and wait until the migration happens.
#define _GNU_SOURCE
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <stdint.h>

//Migrates the process
int migrate(pid_t pid) {
    const int totCPU = 8;
    const int nextCPU = (sched_getcpu() + 1) % totCPU;
    cpu_set_t target;
    CPU_SET(nextCPU, &target);

    if(sched_setaffinity(pid, sizeof(target), &target) < 0)
        perror("Setaffinity");

    return nextCPU;
}

int main(void) {
    long i = 0;
    const long iterations = 4;
    uint64_t total_sequential_delays = 0;
    uint64_t total_migration_delays = 0;
    uint64_t delta_us;

    for(int i = 0; i < iterations; i++) {
        struct timespec start, end;
        //Migration benchmark only happens in odd iterations
        bool do_migration = i % 2 == 1;
        //Start timestamp
        clock_gettime(CLOCK_MONOTONIC_RAW, &start);
        //Target CPU to migrate
        int target;
        if(do_migration) {
            target = migrate(0);
            //if current CPU is not the target CPU
            if(target != sched_getcpu()) {
                do {
                    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
                }
                while(target != sched_getcpu());
            }
        }
        //Simple computation
        double k = 5;
        for(int j = 1; j <= 9999; j++) {
            k *= j / (k - 3);
        }
        //End timestamp
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);
        //Elapsed time
        delta_us = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_nsec - start.tv_nsec) / 1000;

        if(do_migration) total_migration_delays += delta_us;
        else total_sequential_delays += delta_us;
    }

    //Compute the averages
    double avg_migration = total_migration_delays / iterations;
    double avg_sequential = total_sequential_delays / iterations;
    //Print them
    printf("\navg_migration=%f, avg_sequential=%f", avg_migration, avg_sequential);

    return EXIT_SUCCESS;
}
The problem here is that the do-while loop that spins on sched_getcpu() (inside the do_migration branch) sometimes runs forever.
I need to estimate how much it costs to migrate a Linux process to another core of the same computer.
OK, the cost can be estimated as:
the time taken to set the new CPU affinity and do a "yield" or "sleep(0)" to force a task switch/reschedule (including task switch overhead, etc).
the cost of a cache miss for every future "was cached on the old CPU but not cached in the new CPU yet" memory access
the cost of a TLB miss for every future "virtual to physical translation was cached on the old CPU but not cached in the new CPU yet" memory access
NUMA penalties
load balancing problems (e.g. migrating from "lightly loaded" CPU or core to "heavily loaded by other processes" CPU or core may cause severe performance problems, including the cost of the kernel deciding to migrate other processes to different CPUs to fix the load balancing, where the costs/overheads paid by other processes probably should be included in the total cost caused by migrating your process).
Note that:
a) there are multiple levels of caches (trace cache, instruction cache, L1 data cache, L2 data cache, ..) and some caches are shared between some CPUs (e.g. L1 might be shared between logical CPUs within the same core, L2 might be shared by 2 cores, L3 might be shared by 8 cores).
b) TLB miss costs depend on a lot of things (e.g. whether the kernel is using Meltdown mitigations without the PCID feature and blows away TLB information on every system call anyway).
c) NUMA penalties are latency costs - every access to RAM (e.g. cache miss) that was allocated on the previous CPU (for the previous NUMA node) will have higher latency than accesses to RAM that are allocated on the new/current CPU (correct NUMA node).
d) All of the cache miss costs, TLB miss costs and NUMA penalties depend on memory access patterns. A benchmark that has no memory accesses will be misleading.
e) The cache miss costs, TLB miss costs and NUMA penalties are highly dependent on the hardware involved - e.g. a benchmark on one "slow CPUs with fast RAM and no NUMA" computer will be completely irrelevant for a different "fast CPUs with slow RAM and many NUMA domains" computer. In the same way it's highly dependent on which CPUs (e.g. migrating from CPU #0 to CPU #1 might cost very little, and migrating from CPU #0 to CPU #15 might be very expensive).
To migrate the process I'm using the sched_setaffinity system call, but I've noticed that the migration does not always happen instantaneously, which is my requirement.
Put a "sleep(0);" after the "sched_setaffinity();".
You aren't clearing the target set:
cpu_set_t target;
CPU_ZERO(&target); /* you need to add this line */
CPU_SET(nextCPU, &target);
so you don't know what you are setting your affinity to.
Another problem is that the cost of migration is not paid up front, so simply measuring the time from set_affinity() until you have settled isn't really a proper measurement. You might be better off running a workload contained to a single cpu, then two, three. Also, consider whether your workload should involve walking largish data structs, possibly with updates, since the loss of your cache investment is also a cost of migration.
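Putting both suggestions together, a corrected migrate() might look like the sketch below (the CPU_ZERO fix and the yield/sleep(0) after sched_setaffinity() are the points made above; the hard-coded 8 CPUs is kept from the question):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int migrate(pid_t pid)
{
    const int totCPU = 8;
    const int nextCPU = (sched_getcpu() + 1) % totCPU;
    cpu_set_t target;

    CPU_ZERO(&target);                 /* clear the set before adding a CPU */
    CPU_SET(nextCPU, &target);

    if (sched_setaffinity(pid, sizeof(target), &target) < 0)
        perror("Setaffinity");

    sched_yield();                     /* or sleep(0): give the kernel a chance to move us */
    return nextCPU;
}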

ARM CMSIS DSP throughput for average of N float samples

I wrote simple C code to compute the average of N floats stored in an array. I got 10.5 clock cycles per float as throughput for large N.
arm_mean_f32() is actually poorer in performance.
Isn't this too many CCs/float?
The 3 operations
load-from-memory
accumulation-of-loaded-values
increment of pointer
can happen in parallel.
Does ARM Cortex M4F do this?
The project was run on a custom board with a Freescale K24 processor, which has an ARM Cortex-M4F.
The ARM implementation is very traditional; you can check it on GitHub. They just do loop unrolling to reduce the loop overhead, accumulate the sum of 4 samples per loop iteration, and finally divide by the number of samples. I tried it with an M4F and got 5.3 cycles per float.
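The general shape of that 4-way unrolled accumulation is roughly the following (a sketch of the idea in plain C, not the actual CMSIS-DSP source):

#include <stdint.h>

typedef float float32_t;

/* Accumulate four samples per loop iteration, handle the 0..3 leftover
 * samples, then divide once at the end. */
static float32_t mean_unrolled(const float32_t *src, uint32_t n)
{
    float32_t sum = 0.0f;
    uint32_t blk = n >> 2;
    while (blk > 0u) {
        sum += src[0] + src[1] + src[2] + src[3];
        src += 4;
        --blk;
    }
    blk = n & 3u;
    while (blk > 0u) {
        sum += *src++;
        --blk;
    }
    return sum / (float32_t)n;
}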
Here is the code I used:
#include "arm_math.h"
#define MAX_BLOCKSIZE 32
float32_t src_buf_f32[MAX_BLOCKSIZE] =
{
-0.4325648115282207, -1.6655843782380970, 0.1253323064748307,
0.2876764203585489, -1.1464713506814637, 1.1909154656429988,
1.1891642016521031, -0.0376332765933176, 0.3272923614086541,
0.1746391428209245, -0.1867085776814394, 0.7257905482933027,
-0.5883165430141887, 2.1831858181971011, -0.1363958830865957,
0.1139313135208096, 1.0667682113591888, 0.0592814605236053,
-0.0956484054836690, -0.8323494636500225, 0.2944108163926404,
-1.3361818579378040, 0.7143245518189522, 1.6235620644462707,
-0.6917757017022868, 0.8579966728282626, 1.2540014216025324,
-1.5937295764474768, -1.4409644319010200, 0.5711476236581780,
-0.3998855777153632, 0.6899973754643451
};
float32_t result_f32;
int main(void)
{
    arm_mean_f32(src_buf_f32, MAX_BLOCKSIZE, &result_f32);
    return 0;
}
I think this is about the best performance you can get with floating point. Your poor performance may be due to measuring the number of cycles incorrectly, or to your silicon. You can also try increasing the compiler optimization level.

Sum reduction with parallel algorithm - Bad performances compared to CPU version

I have written a small piece of code for doing a sum reduction of a 1D array. I am comparing a sequential CPU version and an OpenCL version.
The code is available on this link1
The kernel code is available on this link2
and if you want to compile : link3 for Makefile
My issue is the bad performance of the GPU version:
for vector sizes lower than 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements) the runtime for the GPU version is higher (only slightly, but higher) than the CPU one.
As you can see, I have taken 2^n values in order to have a number of work-items compatible with the size of a work-group.
Concerning the number of workgroups, I have taken :
// Number of work-groups
int nWorkGroups = size/local_item_size;
But for a high number of work-items, I wonder if the value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't this too much?).
I tried to modify local_item_size in the range [64, 128, 256, 512, 1024] but the performance remains bad for all these values.
I get a real benefit only for size = 1.024 * 10^9 elements; here are the runtimes:
Size of the vector
1024000000
Problem size = 1024000000
GPU Parallel Reduction : Wall Clock = 20 second 977511 micro
Final Sum Sequential = 5.2428800006710899200e+17
Sequential Reduction : Wall Clock = 337 second 459777 micro
From your experience, why do I get such bad performance? I thought the advantage would be more significant compared to the CPU version.
Maybe someone can spot a major mistake in the source code because, at the moment, I can't solve this issue.
Thanks
Well I can tell you some reasons:
You don't need to write the reduction buffer. You can directly clear it in GPU memory using clEnqueueFillBuffer() or a helper kernel.
ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);
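A sketch of the clEnqueueFillBuffer() alternative mentioned above (assumes OpenCL 1.2+ and reuses the variable names from the snippet):

/* Clear the reduction buffer directly in GPU memory instead of writing zeros from the host. */
const double zero = 0.0;
ret = clEnqueueFillBuffer(command_queue, reductionBuffer,
                          &zero, sizeof(zero),                  /* pattern and its size */
                          0, local_item_size * sizeof(double),  /* offset and size to fill */
                          0, NULL, NULL);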
Don't use blocking calls, except for the last read. Otherwise you are wasting some time there.
You are doing the last reduction on the CPU. Iterative processing through the kernel can help, because if your kernel is only reducing 128 elements per pass, your 10^9 elements only get down to 8*10^6, and the CPU does the rest. If you add the data copy on top of that, it makes the GPU version not worth it at all.
However, if you run 3 passes at 512 elements per pass, you read out from the GPU just 10^9/512^3 = 8 values. So, the only bottleneck would be the first GPU copy and the kernel launch.
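That arithmetic is easy to check (a tiny sketch, using the 1.024 * 10^9 problem size from the question and 512 elements reduced per work-group per pass):

#include <stdio.h>

int main(void)
{
    long long remaining = 1024000000LL;          /* problem size */
    for (int pass = 1; pass <= 3; ++pass) {
        remaining = (remaining + 511) / 512;     /* ceiling division: elements left after this pass */
        printf("after pass %d: %lld elements left\n", pass, remaining);
    }
    return 0;   /* prints 2000000, 3907, 8 */
}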

Why doesn't this code scale linearly?

I wrote this SOR solver code. Don't worry too much about what this algorithm does; it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.
I run it with an ill-conditioned 2097152-row sparse matrix (that never converges), with at most 7 non-zero columns per row.
Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split in chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be less.
There is data dependency among the threads in the values of the sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, when reading, it can get either a previous or the current value at that position (if you are familiar with solvers, this is a Gauss-Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).
typedef struct {
    size_t size;

    unsigned int *col_buffer;
    unsigned int *row_jumper;
    real *elements;
} Mat;

int work_line;

// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;

    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;

            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];

                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];

                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }

                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }

                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
                #pragma omp atomic write
                sol[i] = new_val;
            }

            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);

    return count;
}
As you can see, there is no lock inside the parallel region, so, from what they always teach us, it is the kind of 100% parallel problem. That is not what I see in practice.
All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, with hyper-threading enabled, summing up to 40 logical cores.
On my first set runs, work_line was fixed on 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):
The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not very well used, so I dug up this virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size that told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set the work_line to 8:
Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:
While running the above, I thought "Who am I to predict what work_line is good? Lets just see...", and scheduled to run every work_line from 8 to 2048, steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
I believe the cases with low work_line suffers badly from cache synchronization, while bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I am convinced multi-core systems are a very well sold lie, I am asking you here first:
How can I make this code scale linearly to the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?
Update
Following suggestions, I tested both static and dynamic scheduling, but removing the atomic read/write on the array sol. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. For what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweigh its overhead, making it faster. The atomic operations make no difference at all.
The sparse matrix vector multiplication is memory bound (see here) and it could be shown with a simple roofline model. Memory bound problems benefit from higher memory bandwidth of multisocket NUMA systems but only if the data initialisation is done in such a way that the data is distributed among the two NUMA domains. I have some reasons to believe that you are loading the matrix in serial and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable memory interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access instead of having all threads on the second CPU being hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:
$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...
OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
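An alternative to interleaving, implied by the remark about data initialisation above, is to rely on first-touch page placement: have each thread initialize the rows it will later work on, so the corresponding pages land on that thread's NUMA node. A sketch, assuming the Mat/real types from the question, that row_jumper is already filled, and that the solver loop then uses a matching schedule(static):

void first_touch_init(Mat *matrix, real *rhs, real *sol)
{
    const int size = (int)matrix->size;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < size; ++i) {
        sol[i] = 0.0;                 /* first touch from the owning thread */
        rhs[i] = 0.0;                 /* real values can be filled in here, per thread */
        for (unsigned int j = matrix->row_jumper[i]; j < matrix->row_jumper[i + 1]; ++j) {
            matrix->elements[j]   = 0.0;
            matrix->col_buffer[j] = 0;
        }
    }
}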
I would also like to point out that this:
done = true;
#pragma omp parallel shared(done)
{
bool tdone = true;
// ...
#pragma omp atomic update
done &= tdone;
}
is probably a not-very-efficient re-implementation of:
done = true;
#pragma omp parallel reduction(&:done)
{
// ...
if(residual > tolerance) {
done = false;
}
// ...
}
There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but it is still not a good idea to reimplement existing OpenMP primitives, for the sake of portability and readability.
Try running the IPCM (Intel Performance Counter Monitor). You can watch memory bandwidth, and see if it maxes out with more cores. My gut feeling is that you are memory bandwidth limited.
As a quick back of the envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32 bit word per clock cycle. Your inner loop is basically just a multiple-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads, you don't get any performance gain.
Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of the reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't any explicit one.
It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP implementations aren't necessarily all that smart.
No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.
I suspect you are having caching issues. When one thread updates a value in the sol array, it invalidates the copies of that cache line in the caches of the other CPUs. This forces the caches to be updated, which then leads to the CPUs stalling.
Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.
So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you have no more performance gain from additional CPU cores that only serve to worsen the situation.
