C, Way to make multiply in array element fast - c

import numpy as np
array = np.random.rand(16384)
array *= 3
above python code make each element in array has 3 times multiplied value of its own.
On my Laptop, these code took 5ms
Below code is what i tried on C language.
#include <headers...>
array = make 16384 elements...;
for(int i = 0 ; i < 16384 ; ++i)
array[i] *= 3
compile command was
gcc -O2 main.cpp
it takes almost 30ms.
Is there any way i can reduce process time of this?
P.S it was my fault. I confused unit of timestamp value.
this code is faster than numpy. sorry for this question.

This sounds pretty unbelievable. For reference, I wrote a trivial (but complete) program that does roughly what you seem to be describing. I used C++ so I could use its chrono library to get (much) more precise timing than C's clock provides, but I wouldn't expect that to affect the speed at all.
#include <iostream>
#include <chrono>
#define SIZE (16384)
float array[SIZE];
int main() {
using namespace std::chrono;
for (int i = 0; i < SIZE; i++) {
array[i] = i;
}
auto start = high_resolution_clock::now();
for (int i=0; i<SIZE; i++) {
array[i] *= 3.0;
}
auto stop = high_resolution_clock::now();
std::cout << duration_cast<microseconds>(stop - start).count() << '\n';
long total = 0;
for (int i = 0; i < SIZE; i++) {
total += i;
}
std::cout << "Ignore: " << total << "\n";
}
On my machine (2.8 GHz Haswell, so probably slower than whatever you're running) this shows a time of 7 or 8 microseconds, so around 600-700 times as fast as you're getting from Python.
Adding the compiler flag to use AVX 2 instructions reduces that to 4 microseconds, or a little more than 1000 times as fast (warning: AMD processors generally don't get as much of a speed boost from using AVX 2, but if you have a reasonably new AMD processor I'd expect it to be faster than this anyway).
Bottom line: the speed you're reporting for your C code only seems to make sense if you're running the code on some sort of slow microcontroller, or maybe a really old desktop system--though it would have to be quite old to run nearly as slow as you're reporting. My immediate guess is that even a 386 would be faster than that.
When/if you have something that takes enough time to justify it, you can also use OpenMP to run a loop like this in multiple threads. I tried that, but in this case the overhead of starting up and synchronizing the threads is (quite a bit) more than running in parallel can gain, so it's a net loss.
Compiler: VS 2019 (Microsoft (R) C/C++ Optimizing Compiler Version 19.27.28919.3 for x64).
Flags: /O2b2 /GL (and part of the time, /arch:AVX2)

Related

how to properly do multiply accumulate with NEON intrinsics

i need to do a simple multiply accumulate of two signed 8 bit arrays.
This routine runs every millisecond on an ARM7 embedded device. I am trying to speed it up a bit. I have already tried optimizing and enabling vector ops.
-mtune=cortex-a15.cortex-a7 -mfpu=neon-vfpv4 -ftree-vectorize -ffast-math -mfloat-abi=hard
this helped but I am still running close to the edge.
this is the 'c' code.
for(i = 4095; i >= 0; --i)
{
accum += arr1[i]*arr2[i];
}
I am trying to use NEON intrinsics. This loop runs ~5 times faster, but I get different results. I am pretty sure I am not properly retrieving the accumulation, or it rolls over before I do. Any help/pointers is greatly appreciated. Any detailed docs would also be helpful.
for(int i = 256; i > 0; --i)
{
int8x16_t vec16a = vld1q_s8(&arr1[index]);
int8x16_t vec16b = vld1q_s8(&arr2[index]);
vec16res = vmlaq_s8(vec16res, vec16a, vec16b);
index+=16;
}
EDIT to post solution.
Thanks to tips from all!
I dropped to to 8x8 and have a fast solution
using the below code I achieved a "fast enough" time. Not as fast as the 128bit version but good enough.
I added __builtin_prefetch() for the data, and did a 10 pass avg.
Neon is substantially faster.
$ ./test 10
original code time ~ 30392nS
optimized C time ~ 8458nS
NEON elapsed time ~ 3199nS
int32_t sum = 0;
int16x8_t vecSum = vdupq_n_s16(0);
int8x8_t vec8a;
int8x8_t vec8b;
int32x4_t sum32x4;
int32x2_t sum32x2;
#pragma unroll
for (i = 512; i > 0; --i)
{
vec8a = vld1_s8(&A[index]);
vec8b = vld1_s8(&B[index]);
vecSum = vmlal_s8(vecSum,vec8a,vec8b);
index += 8;
}
sum32x4 = vaddl_s16(vget_high_s16(vecSum),vget_low_s16(vecSum));
sum32x2 = vadd_s32(vget_high_s32(sum32x4),vget_low_s32(sum32x4));
sum += vget_lane_s32(vpadd_s32(sum32x2,sum32x2),0);
Your issue is likely overflow, so you'll need to lengthen when you do your multiply-accumulate.
As you're on ARMv7, you'll want vmlal_s8.
ARMv8 A64 has vmlal_high_s8 which allows you to stay in 128-bit vectors, which will give an added speed-up.
As mentioned in comments, seeing what auto-vectorization will do with -O options / pragma unroll is very valuable, and learning from the godbolt of that. Unrolling often gives speed-ups when doing by hand as well.
Lots more valuable tips on optimization in the Arm Neon resources.

OpenMP for beginners

I just got started with openMP; I wrote a little C code in order to check if what I have studied is correct. However I found some troubles; here is the main.c code
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes to the PC to do such operation using 1 and 2 CPUs; in order to to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
Both corresponding to N=1 and N=2. as you can see when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't implicitly mean, that you're going to get better performance.
OpenMP has to manage the data distribution among you're cores which is going to take time as well. Especially for very basic operations such as only a single multiplication you are doing, performance of a sequential (single core) program will be better.
Second, by going through every element of you're array only once and not doing anything else, you make no use of cache memory and most certainly not of shared cache between cpu's.
So you should start reading some things about general algorithm performance. To make use of multiple cores using shared cache is in my opinion the essence.
Todays computers have come to a stage where the CPU is so much faster than a memory allocation, read or write. This means when using multiple cores, you'll only have a benefit if you use things like shared cache, because the data distribution,initialization of the threads and managing them will use time as well. To really see a performance speedup (See the link, essential term in parallel computing) you should program an algorithm which has a heavy accent on computation not on memory; this has to do with locality (another important term).
So if you wanna experience a big performance boost by using multiple cores test it on a matrix-matrix-multiplication on big matrices such as 10'000*10'000. And plot some graphs with inputsize(matrix-size) to time and matrix-size to gflops and compare the multicore with the sequential version.
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the numbers of cores/threads within the code itself either directly at the #pragma line #pragma omp parallel for num_threads(2) or using the omp_set_num_threads function omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to always run the program multiple times and then take the mean of all the runtimes or something like that. Running the respective programs only once will not give you a meaningful reading of used time. Always call multiple times in a row. Not to forget to also alternate the quality of data.
I suggest writing a test.c file, which takes your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm, add some rand() for your values etc. seed them with srand() etc. If you have more questions on the subject or to my answer leave a comment and I'll try to explain further by adding more explanations.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total cpu time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

Node js comparison to C

On my dual core machine, Node JS runs faster than an equivalent program written in C.
Is node so well optimized that it actually is more efficient, or is there something wrong with my C program that makes it slower?
Node js code:
var Parallel = require("paralleljs");
function slow(n){
var i = 0;
while(++i < n * n){}
return i;
}
var p = new Parallel([20001, 32311, 42222]);
p.map(slow).then(function(data){
console.log("Done!"+data.toString());
});
C code:
#include <stdio.h>
#include <pthread.h>
struct thread_s {
long int n;
long int r;
};
void *slow(void *p){
thread_s *t = (thread_s*)p;
long int i = 0;
while(++i < t->n * t->n){}
t->r = i;
pthread_exit( 0 );
}
thread_s arr[] = {{20001, 0}, {32311, 0}, {42222, 0}};
int main(){
pthread_t t[3];
for(int c = 0; c < 3; c++){
pthread_create(&t[c], NULL, slow, &arr[c]);
}
for(int c = 0; c < 3; c++){
pthread_join(t[c], NULL);
}
printf("Done! %ld %ld %ld\n", arr[0].r, arr[1].r, arr[2].r);
return 0;
}
You are benchmarking a toy program which is not a good way to compare compilers. Also, the loops you are doing have no side-effects. All it does is set i to n * n. The loops should be optimized out. Are you running unoptimized?
Try to compute something real that approximates the work-load that you will later apply in production. If your code will be numerics-heavy you could benchmark a naive matrix multiplication for example.
All basic operations (+-, Math.xx etc.) are mapped to V8 engine which just execute it as C programm. So you should have pretty same results for C vs Node.js in these kind of scenarios.
Also I have tried C#.NET vs Node Fibonacci of 45. And first time I ran it was 5 times slower for C#, that was really strange. In a moment I understood that this is due to debug mode I ran C# app.
Going to release make it very close(20sec node, 22sec C#), probably this is just measurement inconsistency.
In any case this is just a matter of percents.
Question currently lacks details on benchmark, so it is impossible to say anything definitive about it. However, general comparison between V8 running javascript, and a bknary program compiled from C source is possible.
V8 is pretty darn good at JIT compilation, so while there is the overhead of JIT compilation, this compensates for dynamic nature of JavaScript, so for simple integer operations in a loop there's no reason for JIT code to be slower. JIT
Another consideration is startup time. If you load node.js first and load javascript from interactive prompt, startup time of script is minimal even with JIT, especially compared to dynamically linked binary which needs to resolve symbols etc. If you have small statically linked binary, it will start very fast, and will have done a lot of processing by the time a new node.js is even started and starts to look for some Javascript to execute. You need to be careful how you handle this in benchmarks, or results will be meaningless.

OpenMP Parallel for-loop showing little performance increase

I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes)
for (i = 1; i <= n; i++)
{
if (is_prime(i) == true)
numprimes ++;
}
I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 # 3.00GHz × 2, and as expected, the performance is better than for a serial program.
The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to manually set the number of threads to use with num_threads, but this did not change anything.) Counting all the primes up to 10 000 000 gives me the following times (using time):
8-core machine:
real 0m8.230s
user 0m50.425s
sys 0m0.004s
dual-core machine:
real 0m10.846s
user 0m17.233s
sys 0m0.004s
And this pattern continues for counting more primes, the machine with more cores shows a slight performance increase, but not as much as I would expect for having so many more cores available. (I would expect 4 times more cores to imply almost 4 times less running time?)
Counting primes up to 50 000 000:
8-core machine:
real 1m29.056s
user 8m11.695s
sys 0m0.017s
dual-core machine:
real 1m51.119s
user 2m50.519s
sys 0m0.060s
If anyone can clarify this for me, it would be much appreciated.
EDIT
This is my prime-checking function.
static int is_prime(int n)
{
/* handle special cases */
if (n == 0) return 0;
else if (n == 1) return 0;
else if (n == 2) return 1;
int i;
for(i=2;i<=(int)(sqrt((double) n));i++)
if (n%i==0) return 0;
return 1;
}
This performance is happening because:
is_prime(i) takes longer the higher i gets, and
Your OpenMP implementation uses static scheduling by default for parallel for constructs without the schedule clause, i.e. it chops the for loop into equal sized contiguous chunks.
In other words, the highest-numbered thread is doing all of the hardest operations.
Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide work among the threads fairly.
This version will divide the work better:
int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
if (is_prime(i) == true)
numprimes ++;
}
Information on scheduling syntax is available via MSDN and Wikipedia.
schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP wihtepaper.
Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.
The reason for the apparently poor scaling of your program is, as #naroom has suggested, the variability in the run time of each call to your is_prime function. The run time does not simply increase with the value of i. Your code shows that the test terminates as soon as the first factor of i is found so the longest run times will be for numbers with few (and large) factors, including the prime numbers themselves.
As you've already been told, the default schedule for your parallelisation will parcel out the iterations of the master loop a chunk at a time to the available threads. For your case of 5*10^7 integers to test and 8 cores to use, the first thread will get the integers 1..6250000 to test, the second will get 6250001..12500000 and so on. This will lead to a severely unbalanced load across the threads because, of course, the prime numbers are not uniformly distributed.
Rather than using the default scheduling you should experiment with dynamic scheduling. The following statement tells the run-time to parcel out the iterations of your master loop m iterations at a time to the threads in your computation:
#pragma omp parallel for schedule(dynamic,m)
Once a thread has finished its m iterations it will be given m more to work on. The trick for you is to find the sweet spot for m. Too small and your computation will be dominated by the work that the run time does in parcelling out iterations, too large and your computation will revert to the unbalanced loads that you have already seen.
Take heart though, you will learn some useful lessons about the costs, and benefits, of parallel computation by working through all of this.
I think your code need to use dynamic so the threads each can consume different number of iterations as your iterations have different work load so the current code is balanced which won't help in your case try this out please:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes) schedule(dynamic,1)
for (i = 1; i <= n; i++){
if (is_prime(i) == true)
++numprimes;
}

When should prefetch be used on modern machines? [duplicate]

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:
It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.
Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same as the one that GCC emits.
To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.
#include <iostream>
using std::cout;
using std::endl;
#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>
#define ENABLE_PREFETCH
#define f_vector __m128d
#define i_ptr size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
// To be super-optimized later.
f_vector *stop = A + L;
do{
f_vector tmpA = *A;
f_vector tmpB = *B;
*A++ = tmpB;
*B++ = tmpA;
}while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
// Transposes T.
// T contains x columns and x rows.
// Each unit is of size (block * sizeof(f_vector)) bytes.
//Conditions:
// - 0 < block
// - 1 < x
i_ptr row_size = block * x;
i_ptr iter_size = row_size + block;
// End of entire matrix.
f_vector *stop_T = T + row_size * x;
f_vector *end = stop_T - row_size;
// Iterate each row.
f_vector *y_iter = T;
do{
// Iterate each column.
f_vector *ptr_x = y_iter + block;
f_vector *ptr_y = y_iter + row_size;
do{
#ifdef ENABLE_PREFETCH
_mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif
swap_block(ptr_x,ptr_y,block);
ptr_x += block;
ptr_y += row_size;
}while (ptr_y < stop_T);
y_iter += iter_size;
}while (y_iter < end);
}
int main(){
i_ptr dimension = 4096;
i_ptr block = 16;
i_ptr words = block * dimension * dimension;
i_ptr bytes = words * sizeof(f_vector);
cout << "bytes = " << bytes << endl;
// system("pause");
f_vector *T = (f_vector*)_mm_malloc(bytes,16);
if (T == NULL){
cout << "Memory Allocation Failure" << endl;
system("pause");
exit(1);
}
memset(T,0,bytes);
// Perform in-place data transpose
cout << "Starting Data Transpose... ";
clock_t start = clock();
transpose_even(T,block,dimension);
clock_t end = clock();
cout << "Done" << endl;
cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
_mm_free(T);
system("pause");
}
When I run it with ENABLE_PREFETCH enabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.725 seconds
Press any key to continue . . .
When I run it with ENABLE_PREFETCH disabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.822 seconds
Press any key to continue . . .
So there's a 13% speedup from prefetching.
EDIT:
Here's some more results:
Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release
Intel Core i7 860 # 2.8 GHz, 8 GB DDR3 # 1333 MHz
Prefetch : 0.868
No Prefetch: 0.960
Intel Core i7 920 # 3.5 GHz, 12 GB DDR3 # 1333 MHz
Prefetch : 0.725
No Prefetch: 0.822
Intel Core i7 2600K # 4.6 GHz, 16 GB DDR3 # 1333 MHz
Prefetch : 0.718
No Prefetch: 0.796
2 x Intel Xeon X5482 # 3.2 GHz, 64 GB DDR2 # 800 MHz
Prefetch : 2.273
No Prefetch: 2.666
Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.
In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int binarySearch(int *array, int number_of_elements, int key) {
int low = 0, high = number_of_elements-1, mid;
while(low <= high) {
mid = (low + high)/2;
#ifdef DO_PREFETCH
// low path
__builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
// high path
__builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif
if(array[mid] < key)
low = mid + 1;
else if(array[mid] == key)
return mid;
else if(array[mid] > key)
high = mid-1;
}
return -1;
}
int main() {
int SIZE = 1024*1024*512;
int *array = malloc(SIZE*sizeof(int));
for (int i=0;i<SIZE;i++){
array[i] = i;
}
int NUM_LOOKUPS = 1024*1024*8;
srand(time(NULL));
int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
for (int i=0;i<NUM_LOOKUPS;i++){
lookups[i] = rand() % SIZE;
}
for (int i=0;i<NUM_LOOKUPS;i++){
int result = binarySearch(array, SIZE, lookups[i]);
}
free(array);
free(lookups);
}
When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch
Performance counter stats for './with-prefetch':
356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits
861,807,382 L1-dcache-loads
8.787467487 seconds time elapsed
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch
Performance counter stats for './no-prefetch':
382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits
392,799,791 L1-dcache-loads
11.376439030 seconds time elapsed
Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.
I learned a lot from the excellent answers provided by #JamesScriven and #Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).
There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.
In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason, why there is such a modest improvement for #Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from the #JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
//the actual work is happening here
void operator()(){
//set up the oracle - let see it in the future oracle_offset steps
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//use oracle and prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work, the less the better
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index); #update oracle
index=next(mem[index]); #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
}
}
Some remarks:
data is prepared in such a way, that the oracle is alway right.
maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.
Compiling and executing leads:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demp.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>
const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
template<bool prefetch>
struct Worker{
std::vector<int> mem;
double result;
int oracle_offset;
void operator()(){
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work:
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index);
index=next(mem[index]);
}
}
Worker(std::vector<int> &mem_):
mem(mem_), result(0.0), oracle_offset(0)
{}
};
template <typename Worker>
double timeit(Worker &worker){
auto begin = std::chrono::high_resolution_clock::now();
worker();
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}
int main() {
//set up the data in special way!
std::vector<int> keys(SIZE);
for (int i=0;i<SIZE;i++){
keys[i] = i;
}
Worker<false> without_prefetch(keys);
Worker<true> with_prefetch(keys);
std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
std::cout<<std::setprecision(17);
for(int i=0;i<20;i++){
//let oracle see i steps in the future:
without_prefetch.oracle_offset=i;
with_prefetch.oracle_offset=i;
//calculate:
double time_with_prefetch=timeit(with_prefetch);
double time_no_prefetch=timeit(without_prefetch);
std::cout<<i<<"\t"
<<time_no_prefetch<<"\t"
<<time_with_prefetch<<"\t"
<<(time_no_prefetch/time_with_prefetch)<<"\n";
}
}
From the documentation:
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}
Pre-fetching data can be optimized to the Cache Line size, which for most modern 64-bit processors is 64 bytes to for example pre-load a uint32_t[16] with one instruction.
For example on ArmV8 I discovered through experimentation casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the required instructions required as before I had to increment by 8 as it was only loading half the data, even though my understanding was that it fetches a full cache line.
Pre-fetching an uint32_t[32] original code example...
int addrindex = &B[0];
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);
After...
int addrindex = &B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
For some reason int datatype for the address index/offset gave better performance. Tested with GCC 8 on Cortex-a53. Using an equivalent 64 byte vector on other architectures might give the same performance improvement if you find it is not pre-fetching all the data like in my case. In my application with a one million iteration loop, it improved performance by 5% just by doing this. There were further requirements for the improvement.
the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.
uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
Also, I had to use C operators instead of Neon Intrinsics, since they require regular datatype pointers (in my case it was uint32_t *) otherwise the new built in prefetch method had a performance regression.
My real world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in the scrypt_core() and its internal function which are all easy to read. The hard work is done by GCC8. Overall improvement to performance was 25%.

Resources