Sieve of Eratosthenes Pthread implementation: thread number doesn't affect computation time - c

I'm trying to implement a parallel Sieve of Eratosthenes with Pthreads. I have finished my coding and the program works correctly and as expected, meaning that if I use more than 1 thread, the computation time is less than with the sequential program (only 1 thread used). However, no matter how many extra threads I use, the computation time stays basically the same. For example, when sieving from 1 to 1 billion, the sequential program takes about 21 seconds, and the parallel program with 2 threads takes about 14 seconds. But it still takes about 14 seconds with 3, 4, 5, 10, 20, or 50 threads, as I tried. I want to know what causes this and how to solve it. My code is listed below:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>   /* for time() and time_t */

//The group of arguments passed to a thread
struct thrd_data{
    long id;
    long start;
    long end;   /* the sub-range is from start to end */
};

typedef struct {
    pthread_mutex_t count_lock;    /* mutex semaphore for the barrier */
    pthread_cond_t ok_to_proceed;  /* condition variable for leaving */
    long count;                    /* count of the number of threads who have arrived */
} mylib_barrier_t;

//global variables
bool *GlobalList; //GlobalList[i] == 0 means the number i+1 is (still) considered prime
long Num_Threads;
mylib_barrier_t barrier; /* barrier */

void mylib_barrier_init(mylib_barrier_t *b)
{
    b->count = 0;
    pthread_mutex_init(&(b->count_lock), NULL);
    pthread_cond_init(&(b->ok_to_proceed), NULL);
}

void mylib_barrier(mylib_barrier_t *b, long id)
{
    pthread_mutex_lock(&(b->count_lock));
    b->count++;
    if (b->count == Num_Threads)
    {
        b->count = 0; /* must be reset for future re-use */
        pthread_cond_broadcast(&(b->ok_to_proceed));
    }
    else
    {
        /* note: pthread_cond_wait returns 0 on a spurious wakeup, so this
           loop only retries on error; a production barrier should wait on
           a predicate (e.g. a generation counter) instead */
        while (pthread_cond_wait(&(b->ok_to_proceed), &(b->count_lock)) != 0);
    }
    pthread_mutex_unlock(&(b->count_lock));
}

void mylib_barrier_destroy(mylib_barrier_t *b)
{
    pthread_mutex_destroy(&(b->count_lock));
    pthread_cond_destroy(&(b->ok_to_proceed));
}
void *DoSieve(void *thrd_arg)
{
    struct thrd_data *t_data;
    long i, start, end;
    long k = 2; //the current prime number in the first loop
    long myid;

    /* Initialize my part of the global array */
    t_data = (struct thrd_data *) thrd_arg;
    myid = t_data->id;
    start = t_data->start;
    end = t_data->end;
    printf("Thread %ld doing look-up from %ld to %ld\n", myid, start, end);

    //First loop: iterate over all primes k with k*k <= end
    while (k*k <= end)
    {
        long first; /* index of the first multiple of k inside my sub-range */
        if (k*k >= start)
            first = k*k - 1;
        else
            /* first multiple of k that is >= start; the original expression
               skipped start itself when start was a multiple of k */
            first = (start % k == 0) ? start - 1 : start + k - start % k - 1;
        //Second loop: mark all multiples of the current prime number.
        //Index i represents the number i+1, so number end has index end-1;
        //the original condition i <= end wrote one slot past the sub-range
        //(and past the array for the last thread).
        for (i = first; i < end; i += k)
            GlobalList[i] = 1;
        i = k;
        //wait for the other threads to finish the second loop for the current prime
        mylib_barrier(&barrier, myid);
        //find the next prime number greater than the current one
        while (GlobalList[i] == 1)
            i++;
        k = i + 1;
    }
    //decrement the counter of threads before exiting
    pthread_mutex_lock(&barrier.count_lock);
    Num_Threads--;
    if (barrier.count == Num_Threads)
    {
        barrier.count = 0; /* must be reset for future re-use */
        pthread_cond_broadcast(&(barrier.ok_to_proceed));
    }
    pthread_mutex_unlock(&barrier.count_lock);
    pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
    long i, n, n_threads;
    long k, nq, nr;
    struct thrd_data *t_arg;
    pthread_t *thread_id;
    pthread_attr_t attr;

    /* Pthreads setup: initialize barrier and explicitly create
       threads in a joinable state (for portability) */
    mylib_barrier_init(&barrier);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    /* ask the user for n and n_threads */
    printf("enter the range n = ");
    scanf("%ld", &n);
    printf("enter the number of threads n_threads = ");
    scanf("%ld", &n_threads);

    time_t start = time(0); //record the start time

    //Initialize the global list
    GlobalList = (bool *)malloc(sizeof(bool) * n);
    for (i = 0; i < n; i++)
        GlobalList[i] = 0;

    /* create arrays of thread ids and thread args */
    thread_id = (pthread_t *)malloc(sizeof(pthread_t) * n_threads);
    t_arg = (struct thrd_data *)malloc(sizeof(struct thrd_data) * n_threads);

    /* distribute the load and create threads for computation */
    nq = n / n_threads;
    nr = n % n_threads;
    k = 1;
    Num_Threads = n_threads;
    for (i = 0; i < n_threads; i++) {
        t_arg[i].id = i;
        t_arg[i].start = k;
        if (i < nr)
            k = k + nq + 1;
        else
            k = k + nq;
        t_arg[i].end = k - 1;
        pthread_create(&thread_id[i], &attr, DoSieve, (void *) &t_arg[i]);
    }

    /* Wait for all threads to complete, then print all prime numbers */
    for (i = 0; i < n_threads; i++) {
        pthread_join(thread_id[i], NULL);
    }

    //Get the time spent on the computation by all participating threads
    time_t stop = time(0);
    printf("Time to do everything except print = %lu seconds\n", (unsigned long) (stop - start));

    //print the resulting prime numbers, 15 per line
    printf("The prime numbers are listed below:\n");
    int j = 1;
    for (i = 1; i < n; i++)
    {
        if (GlobalList[i] == 0)
        {
            printf("%ld ", i + 1);
            j++;
            if (j % 15 == 0)  /* moved inside the if: one line break per 15 primes */
                printf("\n");
        }
    }
    printf("\n");

    // Clean up and exit
    free(GlobalList);
    pthread_attr_destroy(&attr);
    mylib_barrier_destroy(&barrier); // destroy barrier object
    pthread_exit(NULL);
}

You make a valid observation: more threads doesn't mean more work gets done.
You are running your program on a dual-core CPU. You already saturate the system with 2 threads.
With 1 thread only 1 core is used. With 2 threads 2 cores are used. With, say, 4 threads you will see about the same performance as with 2 threads. Hyper-threading doesn't help here because a logical core (HT core) shares the memory system with its physical core.
Here is the output of running
perf stat -d sieve
23879.553188 task-clock (msec) # 1.191 CPUs utilized
3,666 context-switches # 0.154 K/sec
1,470 cpu-migrations # 0.062 K/sec
219,177 page-faults # 0.009 M/sec
76,070,790,848 cycles # 3.186 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
34,500,622,236 instructions # 0.45 insns per cycle
4,172,395,541 branches # 174.727 M/sec
1,020,010 branch-misses # 0.02% of all branches
21,065,385,093 L1-dcache-loads # 882.152 M/sec
1,223,920,596 L1-dcache-load-misses # 5.81% of all L1-dcache hits
69,357,488 LLC-loads # 2.904 M/sec
<not supported> LLC-load-misses:HG
This is the output of an i5-4460 CPU's hardware performance monitor. It tracks some interesting statistics.
Notice the low instructions-per-cycle count: the CPU is executing 0.45 instructions per cycle. Normally you want to see this value > 1.
Update: the key thing to notice is that increasing the number of threads doesn't help. The CPU can only do a finite amount of branching and memory access.

Two observations.
First, if you fix your sieve code, it should run about 25 times as fast as it does now, which corresponds roughly to the expected gain from distributing your current code successfully over 32 cores.
Have a look at prime number summing still slow after using sieve, where I showed how to sieve the numbers up to 2,000,000,000 in 1.25 seconds in C#, of all languages. The article discusses (and benchmarks) each step/technique separately, so that you can pick what you like and roll a solution that strikes the perfect bang/buck ratio for your needs. Things will be even faster in C/C++, because there you can count on the compiler sweating the small stuff for you (at least with excellent compilers like gcc or VC++).
Second: when sieving large ranges the most important resource is the level 1 cache of the CPU. Everything else plays second fiddle. You can see this also from the benchmarks in my article. To distribute a sieving task across several CPUs, count the L1 caches in your system and assign a sieving job to each cache (1/kth of the range where k is the number of L1 caches). This is a bit of a simplification since you'd normally choose a finer granularity for the size of the work items, but it gives the general idea.
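To make that concrete, here is a minimal single-threaded sketch of a cache-sized, windowed (segmented) sieve; count_primes, SEGMENT, and the 32 KB figure are illustrative assumptions, not code from the question or the linked article:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

enum { SEGMENT = 32 * 1024 };   /* assumption: one window per 32 KB L1 data cache */

long count_primes(long n)
{
    /* sieve the small primes up to sqrt(n) once, up front */
    long r = 2;
    while (r * r <= n) r++;     /* r = floor(sqrt(n)) + 1 */
    bool *composite = calloc(r, sizeof(bool));
    for (long p = 2; p < r; p++)
        if (!composite[p])
            for (long q = p * p; q < r; q += p)
                composite[q] = true;

    /* then sieve the rest of the range one cache-sized window at a time */
    long count = 0;
    bool window[SEGMENT];
    for (long low = 2; low <= n; low += SEGMENT) {
        long high = low + SEGMENT - 1;
        if (high > n) high = n;
        memset(window, 0, sizeof window);
        for (long p = 2; p < r; p++) {
            if (composite[p]) continue;
            long q = p * p;
            if (q < low) q = ((low + p - 1) / p) * p;  /* first multiple >= low */
            for (; q <= high; q += p)
                window[q - low] = true;
        }
        for (long q = low; q <= high; q++)
            if (!window[q - low]) count++;
    }
    free(composite);
    return count;
}

In the multithreaded version, each thread (one per L1 cache) would run this windowed loop over its own 1/kth of [2, n].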
I said 'caches', not 'cores', 'virtual cores' or 'threads' because that is precisely what you need to do: assign the jobs such that each job has its own L1 cache. How that works depends not only on the operating system but also on the specific CPU(s) in your system. If two 'whatevers' share an L1 cache, give a job to only one of the two and ignore the other (or rather, set the affinity for the job such that it can run on any one of the two but nowhere else).
This is easy enough to do with operating system APIs (e.g. Win32) but I don't know enough about pthreads to tell whether it offers the required precision. As a first approximation you could match the number of threads to the suspected number of L1 caches.
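For what it's worth, Linux does expose thread affinity through the non-portable pthread_setaffinity_np extension. A minimal sketch, with the caveat that mapping consecutive CPU numbers to distinct L1 caches is an assumption that doesn't hold on every system (on an HT system you may need every second logical CPU):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one CPU. Whether CPU i has its own L1 is system-specific. */
static void pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

Calling something like pin_to_cpu(thread_id[i], i) after each pthread_create in the question's code would keep every sieving job on its own core.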

Related

OpenMP: No Speedup in parallel workloads

I can't really figure this out with my fairly simple OpenMP-parallelized for loop. Running on the same input size, P=1 runs in ~50 seconds, but P=2 takes almost 300 seconds, and P=4 runs in ~250 seconds.
Here's the parallelized loop
double time = omp_get_wtime();
printf("Input Size: %d\n", n);
/* n, i and the accumulator in are declared earlier in the program;
   inCircle() presumably tests x*x + y*y <= 1 */
#pragma omp parallel for private(i) reduction(+:in)
for(i = 0; i < n; i++) {
    double x = (double)(rand() % 10000)/10000;
    double y = (double)(rand() % 10000)/10000;
    if(inCircle(x, y)) {
        in++;
    }
}
double ratio = (double)in/(double)n;
double est_pi = ratio * 4.0;
time = omp_get_wtime() - time;
Runtimes:
p=1, n=1073741824 - 52.764 seconds
p=2, n=1073741824 - 301.66 seconds
p=4, n=1073741824 - 274.784 seconds
p=8, n=1073741824 - 188.224 seconds
Running in an Ubuntu 20.04 VM with 8 cores of a Xeon 5650 and 16 GB of DDR3 ECC RAM, on top of a FreeNAS installation on a dual Xeon 5650 system with 70 GB of RAM.
Partial Solution:
The rand() function inside the loop causes the time to jump when running on multiple threads.
Since rand() uses state saved from the previous call to generate the next PRN, it can't run in multiple threads at the same time: multiple threads would need to read/write the PRNG state simultaneously.
POSIX states that rand() need not be thread-safe. This means your code could simply not work right. Or the C library might put in a mutex so that only one thread can call rand() at a time. This is what's happening, and it slows the code down considerably: the threads spend nearly all their time trying to get access to the rand critical section, as nothing else they are doing takes any significant time.
To solve this, try using rand_r(), which does not use shared state, but instead is passed the seed value it should use for state.
Keep in mind that using the same seed for every thread will defeat the purpose of increasing the number of trials in your Monte Carlo simulation. Each thread would just use the exact same pseudo-random sequence. Try something like this:
unsigned int seed;
#pragma omp parallel private(seed)
{
    seed = omp_get_thread_num();
    #pragma omp for private(i) reduction(+:in)
    for(i = 0; i < n; i++) {
        double x = (double)(rand_r(&seed) % 10000)/10000;
        double y = (double)(rand_r(&seed) % 10000)/10000;
        if(inCircle(x, y)) {
            in++;
        }
    }
}
BTW, you might notice your estimate is off. x and y need to be evenly distributed in the range [0, 1], and they are not.
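A hypothetical fix for that: divide by RAND_MAX instead of taking a modulo, so x and y cover [0, 1] without truncation bias. The two lines inside the loop would become:

double x = (double)rand_r(&seed) / RAND_MAX;
double y = (double)rand_r(&seed) / RAND_MAX;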

pthread is slower than the "default" version

SITUATION
I want to see the advantage of using pthreads. If I'm not wrong, threads allow me to execute given parts of a program in parallel.
So here is what I'm trying to accomplish: I want to make a program that takes a number (let's say n) and outputs the sum of [0..n].
code
#include <stdio.h>

#define MAX 1000000000

int
main() {
    long long n = 0;
    for (long long i = 1; i < MAX; ++i)
        n += i;
    printf("\nn: %lld\n", n);
    return 0;
}
time: 0m2.723s
To my understanding, I could simply take MAX, divide it by 2, and let 2 threads do the job.
code
#include <pthread.h>
#include <stdio.h>

#define MAX 1000000000
#define MAX_THREADS 2
#define STRIDE (MAX / MAX_THREADS)

typedef struct {
    long long off;
    long long res;
} arg_t;

void*
callback(void *args) {
    arg_t *arg = (arg_t*)args;
    for (long long i = arg->off; i < arg->off + STRIDE; ++i)
        arg->res += i;
    pthread_exit(0);
}

int
main() {
    pthread_t threads[MAX_THREADS];
    arg_t results[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; ++i) {
        results[i].off = i * STRIDE;
        results[i].res = 0;
        pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
    }
    for (int i = 0; i < MAX_THREADS; ++i)
        pthread_join(threads[i], NULL);
    long long result = results[0].res;
    for (int i = 1; i < MAX_THREADS; ++i)
        result += results[i].res;
    printf("\nn: %lld\n", result);
    return 0;
}
time: 0m8.530s
PROBLEM
The version with pthreads runs slower. Logically this version should run faster, but perhaps the creation of threads is more expensive than the work itself.
Can someone suggest a solution or show what I'm doing/understanding wrong here?
Your problem is cache thrashing combined with a lack of optimization (I bet you're compiling with optimization turned off).
The naive (-O0) code for
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
    arg->res += i;
will access the memory of *arg on every iteration. With your results array defined the way it is, that memory is very close to the memory of the next arg, and the two threads fight for the same cache line, which makes caching very ineffective.
If you compile with -O1, the loop should use a register instead and only write to memory at the end. Then you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely).
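You can get the same effect at any optimization level by accumulating into a local variable by hand; a minimal sketch of the rewritten callback:

void*
callback(void *args) {
    arg_t *arg = (arg_t*)args;
    long long local = 0;                 /* private to this thread's stack */
    for (long long i = arg->off; i < arg->off + STRIDE; ++i)
        local += i;
    arg->res = local;                    /* a single write to shared memory */
    pthread_exit(0);
}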
Another (better) option is to align arg_t on a cache line:
typedef struct {
    _Alignas(64) /* typical cache-line size */ long long off;
    long long res;
} arg_t;
Then you should get better performance with threads regardless of whether or not you turn optimization on.
Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his famous What Every Programmer Should Know About Memory).
Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.

False sharing in multi threads

The following code runs slower as I increase NTHREADS. Why does using more threads make the program run slower? Is there any way to fix it? Someone said it is about false sharing but I do not really understand that concept.
The program basically calculates the sum from 1 to 100000000. The idea of using multithreading is to separate the number list into several chunks and calculate the sum of each chunk in parallel to make the calculation faster.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define LENGTH 100000000
#define NTHREADS 2
#define NREPEATS 10
#define CHUNCK (LENGTH / NTHREADS)

typedef struct {
    size_t id;
    long *array;
    long result;
} worker_args;

void *worker(void *args) {
    worker_args *wargs = (worker_args*) args;
    const size_t start = wargs->id * CHUNCK;
    const size_t end = wargs->id == NTHREADS - 1 ? LENGTH : (wargs->id+1) * CHUNCK;
    for (size_t i = start; i < end; ++i) {
        wargs->result += wargs->array[i];
    }
    return NULL;
}

int main(void) {
    long* numbers = malloc(sizeof(long) * LENGTH);
    for (size_t i = 0; i < LENGTH; ++i) {
        numbers[i] = i + 1;
    }
    worker_args *args = malloc(sizeof(worker_args) * NTHREADS);
    for (size_t i = 0; i < NTHREADS; ++i) {
        args[i] = (worker_args) {
            .id = i,
            .array = numbers,
            .result = 0
        };
    }
    pthread_t thread_ids[NTHREADS];
    for (size_t i = 0; i < NTHREADS; ++i) {
        pthread_create(thread_ids+i, NULL, worker, args+i);
    }
    for (size_t i = 0; i < NTHREADS; ++i) {
        pthread_join(thread_ids[i], NULL);
    }
    long sum = 0;
    for (size_t i = 0; i < NTHREADS; ++i) {
        sum += args[i].result;
    }
    printf("Total sum is %ld\n", sum); /* fixed: the original referenced an undeclared n */
    free(args);
    free(numbers);
}
Why does using more threads make the program run slower?

There is an overhead in creating and joining threads. If the threads don't have much to do, this overhead may be more expensive than the actual work.
Your threads are only doing a simple sum, which isn't that expensive. Also consider that going from e.g. 10 to 11 threads doesn't change the workload per thread a lot:
10 threads --> 10000000 sums per thread
11 threads --> 9090909 sums per thread
The overhead of creating an extra thread may exceed the "work load saved" per thread.
On my PC the program runs in less than 100 milliseconds. Multi-threading isn't worth the trouble here.
You need a more processing-intensive task before multi-threading is worth doing.
Also notice that it seldom makes sense to create more threads than the number of cores (including hyper-threaded ones) your computer has.
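A minimal sketch of that sizing rule, assuming a POSIX system (sysconf and _SC_NPROCESSORS_ONLN are standard; the cap shown here is illustrative):

#include <unistd.h>

/* query the number of online CPUs and never start more workers than that */
long ncores = sysconf(_SC_NPROCESSORS_ONLN);
long nthreads = (NTHREADS < ncores) ? NTHREADS : ncores;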
false sharing
Yes, "false sharing" can impact the performance of a multi-threaded program, but I doubt it's the real problem in your case.
"False sharing" is something that happens in (some) cache systems when two threads (or rather two cores) write to two different variables that belong to the same cache line. In such cases the two threads/cores compete to own the cache line (for writing), and consequently they have to refresh the memory and the cache again and again. That's bad for performance.
As I said, I doubt that is your problem. A clever compiler will execute your loop solely in CPU registers and only write to memory at the end. You can check the disassembly of your code to see whether that is the case.
You can avoid "false sharing" by increasing the size of your struct so that each struct occupies a full cache line on your system.

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect changes in an array. The function array_modifier() picks a random element and toggles its value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear; I know this is bad practice). change_detector() scans the array, and when an element doesn't match its prior value and is equal to 1, the change is detected and the diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads, each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)

typedef struct {
    unsigned int tid;
} parm;

static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};

void* array_modifier(void* args);
void* change_detector(void* arg);

int main(int argc, char** argv) {
    if (argc < 2) {
        exit(1);
    }
    N = (unsigned int)strtoul(argv[1], NULL, 0);
    my_array = calloc(N, sizeof(int));
    time_array = malloc(N * sizeof(struct timeval));
    old_value = calloc(N, sizeof(int));
    parm* p = malloc(NTHREADS * sizeof(parm));
    pthread_t generator_thread;
    pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
    for (unsigned int i = 0; i < NTHREADS; i++) {
        p[i].tid = i;
        pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
    }
    pthread_create(&generator_thread, NULL, array_modifier, NULL);
    pthread_join(generator_thread, NULL);
    usleep(500);
    for (unsigned int i = 0; i < NTHREADS; i++) {
        pthread_cancel(detector_thread[i]);
    }
    for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
    fprintf(stderr, "\n");
    _exit(0);
}

void* array_modifier(void* arg) {
    UNUSED(arg);
    srand(time(NULL));
    unsigned int changing_signals = CHANGES;
    while (changing_signals--) {
        usleep(TIME_INTERVAL);
        const unsigned int r = rand() % N;
        gettimeofday(&time_array[r], NULL);
        my_array[r] ^= 1;
    }
    pthread_exit(NULL);
}

void* change_detector(void* arg) {
    const parm* p = (parm*) arg;
    const unsigned int tid = p->tid;
    const unsigned int start = tid * (N / NTHREADS) +
                               (tid < N % NTHREADS ? tid : N % NTHREADS);
    const unsigned int end = start + (N / NTHREADS) +
                             (tid < N % NTHREADS);
    unsigned int r = start;
    while (1) {
        unsigned int tmp;
        while ((tmp = my_array[r]) == old_value[r]) {
            r = (r < end - 1) ? r + 1 : start;
        }
        old_value[r] = tmp;
        if (tmp) {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            // detection time in usec
            diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
        }
    }
}
When I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My cpu has 6 cores.
Short Answer
You are sharing memory between threads, and sharing memory between threads is slow.
Long Answer
Your program uses one thread to write to my_array and a number of threads to read from it. Effectively my_array is shared by a number of threads.
Now let's assume you are benchmarking on a multicore machine; you are probably hoping that the OS will assign a different core to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance, CPUs have multi-level caches. The fastest cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20-30 cycles.
Now in lots of CPU architectures each core has its own L1 cache, but the L2 cache is shared. This means any data that is shared between threads (cores) has to go through the L2 cache, which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.
Bottom line: if you want your multithreaded programs to perform well, you need to ensure that threads do not share memory. Sharing memory is slow.
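One concrete instance in this program: all NTHREADS detector threads write their timings into adjacent elements of diff[], which likely sit on the same cache line. A sketch of padding each counter onto its own line (64 bytes is an assumed line size):

typedef struct {
    _Alignas(64) unsigned long value;  /* one counter per (assumed) 64-byte cache line */
} padded_counter;

static padded_counter diff[NTHREADS]; /* detectors would then update diff[tid].value */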
Aside
Never rely on volatile to do the correct thing when sharing memory between threads; use your library's atomic operations or mutexes instead. This is because some CPUs allow out-of-order reads and writes that may do strange things if you do not know what you are doing.
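A minimal sketch of what that could look like with C11 atomics, swapping <stdatomic.h> in for the volatile pointer; toggle and peek are illustrative helper names, and the memory orders are chosen for illustration:

#include <stdatomic.h>

static _Atomic unsigned int *my_array;   /* replaces the volatile pointer */

/* modifier thread: toggle one element atomically */
static void toggle(unsigned int r) {
    atomic_fetch_xor_explicit(&my_array[r], 1u, memory_order_acq_rel);
}

/* detector thread: read with acquire ordering */
static unsigned int peek(unsigned int r) {
    return atomic_load_explicit(&my_array[r], memory_order_acquire);
}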
It is rare that a multithreaded program scales perfectly with the number of threads. In your case you measured a speed-up factor of about 0.9 (665/748) with 4 threads. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects the caching in the L1/L2/L3 caches. And if the data is too large to fit into RAM you need to factor in disk I/O. Usually multithreaded implementations are slower on large data sets, because they touch more data at a time; but in rare instances they can be faster, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the scanning loop. Each iteration takes only a few clock cycles. Then you call gettimeofday, which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

cpu cacheline and prefetch policy

I read this article http://igoro.com/archive/gallery-of-processor-cache-effects/. The article says that because of cache-line effects, the two loops in the following code:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
will have almost the same execution time. I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with 64-bit Ubuntu, on an ARMv6-compatible processor rev 7 with Debian, and also on a Core 2 T6600. None of the results were what the article said.
My code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>

/* LENGTH's definition was not shown in the post; a value on the order of
   tens of millions of ints is assumed here */
#define LENGTH (16 * 1024 * 1024)

long int jobTime(struct timespec start, struct timespec stop) {
    long int seconds = stop.tv_sec - start.tv_sec;
    long int nsec = stop.tv_nsec - start.tv_nsec;
    return seconds * 1000 * 1000 * 1000 + nsec;
}

int main() {
    struct timespec start;
    struct timespec stop;
    int i = 0;
    struct sched_param param;
    int * arr = malloc(LENGTH * 4);
    printf("---------sieofint %zu\n", sizeof(int));
    param.sched_priority = 0;
    sched_setscheduler(0, SCHED_FIFO, &param);

    //clock_gettime(CLOCK_MONOTONIC, &start);
    //for (i = 0; i < LENGTH; i++) arr[i] *= 5;
    //clock_gettime(CLOCK_MONOTONIC, &stop);
    //printf("step %d : time %ld\n", 1, jobTime(start, stop));

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 2) arr[i] *= 5;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    printf("step %d : time %ld\n", 2, jobTime(start, stop));
}
Each time I choose one piece to compile and run (comment one and uncomment another).
compile with:
gcc -O0 -o cache cache.c -lrt
On Xeon I get this:
step 1 : 258791478
step 2 : 97875746
I want to know whether what the article said was correct. Alternatively, do the newest CPUs have more advanced prefetch policies?
Short Answer (TL;DR): you're accessing uninitialized data; your first loop has to allocate new physical pages for the entire array within the timed region.
When I run your code and comment each of the sections in turn, I get almost the same timing for the two loops. However, I do get the same results you report when I uncomment both sections and run them one after the other. This makes me suspect you did that too, and suffered from a cold-start effect when comparing the first loop with the second. It's easy to check: just swap the order of the loops and see if the first is still slower.
To avoid this, either pick a large enough LENGTH (depending on your system) so that you don't get any cache benefits from the first loop helping the second, or just add a single untimed traversal of the entire array.
Note that the second option wouldn't exactly prove what the blog wanted to say - that memory latency masks the execution latency, so it doesn't matter how many elements of a cache line you use, you're still bottlenecked by the memory access time (or, more accurately, the bandwidth).
Also: benchmarking code compiled with -O0 is really bad practice.
Edit:
Here's what I'm getting (I removed the scheduling, as it's not related).
This code:
for (i = 0; i < LENGTH; i++) arr[i] = 1; // warmup!
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i++) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 1, jobTime(start, stop));
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i+=16) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
Gives:
---------sieofint 4
step 1 : time 58862552
step 16 : time 50215446
Commenting out the warmup line gives the same advantage on the second loop that you reported:
---------sieofint 4
step 1 : time 279772411
step 16 : time 50615420
Swapping the order of the loops (warmup still commented out) shows it's indeed not related to the step size but to the ordering:
---------sieofint 4
step 16 : time 250033980
step 1 : time 59168310
(gcc version 4.6.3, on Opteron 6272)
Now a note about what's going on here. In theory, you'd expect warmup to be meaningful only when the array is small enough to sit in some cache - in this case the LENGTH you used is too big even for the L3 on most machines. However, you're forgetting the pagemap - you didn't just skip warming the data itself - you avoided initializing it in the first place. This could never give you meaningful results in real life, but since this is a benchmark you didn't notice that; you're just multiplying junk data for the latency of it.
This means that each new page you access in the first loop doesn't only go to memory, it probably takes a page fault and has to call the OS to map a new physical page for it. This is a lengthy process, multiplied by the number of 4K pages you use - accumulating to a very long time. At this array size you can't even benefit from TLBs (you have 16k different physical 4k pages, way more than most TLBs can support even with 2 levels), so it's just a question of the fault flows. This can probably be measured with any profiling tool.
The second iteration over the same array won't have this effect and will be much faster, even though it still has to do a full pagewalk on each new page (that's done purely in HW) and then fetch the data from memory.
By the way, this is also the reason why, when you benchmark some behavior, you repeat the same thing multiple times (in this case it would have solved your problem if you had run over the array several times with the same stride and ignored the first few rounds).
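A minimal sketch of that kind of repetition harness, reusing start, stop, arr, LENGTH and jobTime from the code above; ROUNDS and WARMUP are illustrative values, not from the post:

enum { ROUNDS = 5, WARMUP = 2 };

for (int round = 0; round < ROUNDS; round++) {
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 16) arr[i] *= 5;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    if (round >= WARMUP)   /* the first rounds pay the page-fault cost */
        printf("round %d : time %ld\n", round, jobTime(start, stop));
}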
