False sharing in multi threads

False sharing in multi threads - c

The following code runs slower as I increase the NTHREADS. Why use more threads make the program run slower? Is there any way to fix it? Someone said it is about false sharing but I do not really understand that concept.
The program basicly calculate the sum from 1 to 100000000. The idea to use multithread is to seperate the number list into several chuncks, and calculate the sum of each chunck parallelly to make the calculation faster.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define LENGTH 100000000
#define NTHREADS 2
#define NREPEATS 10
#define CHUNCK (LENGTH / NTHREADS)
typedef struct {
size_t id;
long *array;
long result;
} worker_args;
void *worker(void *args) {
worker_args *wargs = (worker_args*) args;
const size_t start = wargs->id * CHUNCK;
const size_t end = wargs->id == NTHREADS - 1 ? LENGTH : (wargs->id+1) * CHUNCK;
for (size_t i = start; i < end; ++i) {
wargs->result += wargs->array[i];
}
return NULL;
}
int main(void) {
long* numbers = malloc(sizeof(long) * LENGTH);
for (size_t i = 0; i < LENGTH; ++i) {
numbers[i] = i + 1;
}
worker_args *args = malloc(sizeof(worker_args) * NTHREADS);
for (size_t i = 0; i < NTHREADS; ++i) {
args[i] = (worker_args) {
.id = i,
.array = numbers,
.result = 0
};
}
pthread_t thread_ids[NTHREADS];
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_create(thread_ids+i, NULL, worker, args+i);
}
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
long sum = 0;
for (size_t i = 0; i < NTHREADS; ++i) {
sum += args[i].result;
}
printf("Run %2zu: total sum is %ld\n", n, sum);
free(args);
free(numbers);
}

Why use more threads make the program run slower?
There is an overhead creating and joining threads. If the threads hasn't much to do then this overhead may be more expensive than the actual work.
Your threads are only doing a simple sum which isn't that expensive. Also consider that going from e.g. 10 to 11 threads doesn't change the work load per thread a lot.
10 threads --> 10000000 sums per thread
11 threads --> 9090909 sums per thread
The overhead of creating an extra thread may exceed the "work load saved" per thread.
On my PC the program runs in less than 100 milliseconds. Multi-threading isn't worth the trouble.
You need a more processing intensive task before multi-threading is worth doing.
Also notice that it seldom make sense to create more threads than the number of cores (incl hyper thread) your computer has.
false sharing
yes, "false sharing" can impact the performance of a multi-threaded program but I doubt it's the real problem in your case.
"false sharing" is something that happens in (some) cache systems when two threads (or rather two cores) writes to two different variables that belongs to the same cache line. In such cases the two threads/cores competes to own the cache line (for writing) and consequently, they'll have to refresh the memory and the cache again and again. That's bad for performance.
As I said - I doubt that is your problem. A clever compiler will do your loop solely be using CPU registers and only write to memory at the end. You can check the disassemble of your code to see if that is the case.
You can avoid "false sharing" by increasing the sizeof of your struct so that each struct fits the size of a cache line on your system.

Related

pthread is slower than the "default" version

SITUATION
I want to see the advantage of using pthread. If I'm not wrong: threads allow me to execute given parts of program in parallel.
so here is what I try to accomplish: I want to make a program that takes a number(let's say n) and outputs the sum of [0..n].
code
#define MAX 1000000000
int
main() {
long long n = 0;
for (long long i = 1; i < MAX; ++i)
n += i;
printf("\nn: %lld\n", n);
return 0;
}
time: 0m2.723s
to my understanding I could simply take that number MAX and divide by 2 and let 2 threads
do the job.
code
#define MAX 1000000000
#define MAX_THREADS 2
#define STRIDE MAX / MAX_THREADS
typedef struct {
long long off;
long long res;
} arg_t;
void*
callback(void *args) {
arg_t *arg = (arg_t*)args;
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
pthread_exit(0);
}
int
main() {
pthread_t threads[MAX_THREADS];
arg_t results[MAX_THREADS];
for (int i = 0; i < MAX_THREADS; ++i) {
results[i].off = i * STRIDE;
results[i].res = 0;
pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
}
for (int i = 0; i < MAX_THREADS; ++i)
pthread_join(threads[i], NULL);
long long result;
result = results[0].res;
for (int i = 1; i < MAX_THREADS; ++i)
result += results[i].res;
printf("\nn: %lld\n", result);
return 0;
}
time: 0m8.530s
PROBLEM
The version with pthread runs slower. Logically this version should run faster, but maybe creation of threads is more expensive.
Can someone suggest a solution or show what I'm doing/understanding wrong here?

Your problem is cache thrashing combined with a lack of optimization (I bet you're compiling without it on).
The naive (-O0) code for
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
will access the memory of *arg. With your results array being defined the way it is, that memory is very close to the memory of the next arg and the two threads will fight for the same cache-line, making RAM caching very ineffective.
If you compile with -O1, the loop should use a register instead and only write to memory at the end. Then, you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely)
Another (better) option is to align arg_t on a cache line:
typedef struct {
_Alignas(64) /*typical cache line size*/ long long off;
long long res;
} arg_t;
Then you should get better performance with threads regardless of whether or not you turn optimization on.
Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his infamous What Every Programmer Should Know About Memory).

Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.

How can I best "parallelise" a set of four nested for()-loops in a Brute-Force attack?

I have the following homework task:
I need to brute force 4-char passphrase with the following mask
%%##
( where # - is a numeric character, % - is an alpha character )
in several threads using OpenMP.
Here is a piece of code, but I'm not sure if it is doing the right thing:
int i, j, m, n;
const char alph[26] = "abcdefghijklmnopqrstuvwxyz";
const char num[10] = "0123456789";
#pragma omp parallel for private(pass) schedule(dynamic) collapse(4)
for (i = 0; i < 26; i++)
for (j = 0; j < 26; j++)
for (m = 0; m < 10; m++)
for (n = 0; n < 10; n++) {
pass[0] = alph[i];
pass[1] = alph[j];
pass[2] = num[m];
pass[3] = num[n];
/* Working with pass here */
}
So my question is :
How to correctly specify the "parallel for" instruction, in order to split the range of passphrases between several cores?
Help is much appreciated.

Your code is pretty much right, except for using alph instead of num. If you're able to define the pass variable within the loop, that'll save you many a headache.
A full MWE might look like:
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>
int PassCheck(char *pass){
usleep(50); //Sleep for 100 microseconds to simulate work
return strncmp(pass, "qr34", 4)==0;
}
int main(){
const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
const char num[11] = "0123456789";
char goodpass[5] = "----"; //Provide a default password to indicate an error state
int i, j, m, n;
#pragma omp parallel for collapse(4)
for (i = 0; i < 26; i++)
for (j = 0; j < 26; j++)
for (m = 0; m < 10; m++)
for (n = 0; n < 10; n++){
char pass[4];
pass[0] = alph[i];
pass[1] = alph[j];
pass[2] = num[m];
pass[3] = num[n];
if(PassCheck(pass)){
//It is good practice to use `critical` here in case two
//passwords are somehow both valid. This won't arise in
//your code, but is worth thinking about.
#pragma omp critical
{
memcpy(goodpass, pass, 4);
goodpass[4] = '\0';
//#pragma omp cancel for //Escape for loops!
}
}
}
printf("Password was '%s'.\n",goodpass);
return 0;
}
Dynamic scheduling
Using a dynamic schedule here is probably pointless. Your expectation should be that each password will take, on average, about the same amount of time to check. Therefore, each iteration of the loop will take about the same amount of time. Therefore, there is no need to use dynamic scheduling because your loops will remain evenly distributed.
Visual noise
Note that the loop nest is stacked, rather than indented. You'll often see this in code where there are many nested loops as it tends to reduce visual noise.
Breaking early
#pragma omp cancel for is available as of OpenMP 4.0; however, I got a warning using it in this context, so I've commented it out. If you are able to get it working, that'll reduce your run-time by half since all effort is wasted once the correct password has been found and the password will, on average, be located half-way through the search space.
Where the guessed password is generated
One of the commentors suggests moving, e.g. pass[0] so that it is not in the innermost loop. This is a bad idea as doing so will prevent you from using collapse(4). As a result you could parallelize the outer loop, but you run the risk that its iteration count cannot be evenly divided by the number of threads, resulting in a large load imbalance. Alternatively, you could parallelize the inner loop, which exposes you to the same problem plus high synchronization costs each time the loop ends.
Why usleep?
The usleep function causes the code to run slowly. This is intentional; it provides feedback on the effect of parallelism, since the workload is so small.
If I remove the usleep, then the code completes in 0.003s on a single core and 0.004s on 4 cores. You cannot tell that the parallelism is even working. Leaving usleep in gives 8.950s on a single core and 2.257s on 4 cores, an apt demonstration of the effectiveness of the parallelism.
Naturally, you would remove this line once you're sure that parallelism is working correctly.
Further, any actual brute-force password cracker would likely be computing an expensive hash function inside the PassCheck function. Including usleep() here allows us to simulate that function and experiment with high-level design without having to the function first.

Non-deterministic CUDA C kernel

I'm still a beginner with CUDA and I have been trying to write a simple kernel to perform a parallel prime sieve on the GPU. Originally I had written my code in C but I wanted to investigate the speed up on a GPU so I rewrote it:
41.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define B 1024
#define T 256
#define N (B*T)
#define checkCudaErrors(error) {\
if (error != cudaSuccess) {\
printf("CUDA Error - %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(error));\
exit(1);\
}\
}\
__global__ void prime_sieve(int *primes) {
unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
primes[i] = i;
primes[0] = primes[1] = 0;
if (i > 1 && i<N) {
for (int j=2; j<N/2; j++) {
if (i*j < N) {
primes[i*j] = 0;
}
}
}
}
int main() {
int *h_primes=(int*)malloc(N * sizeof(int));
int *d_primes;
checkCudaErrors(cudaMalloc( (void**)&d_primes, N*sizeof(int)));
checkCudaErrors(cudaMemcpy(d_primes,h_primes,N*sizeof(int),cudaMemcpyHostToDevice));
prime_sieve<<<B,T>>>(d_primes);
checkCudaErrors(cudaMemcpy(h_primes,d_primes,N*sizeof(int),cudaMemcpyDeviceToHost));
checkCudaErrors(cudaFree(d_primes));
int size = 0;
int total = 0;
for (int i=2; i<N; i++) {
if (h_primes[i]) {
size++;
}
total++;
}
printf("\n");
printf("Length = %d\tPrimes = %d\n",total,size);
free(h_primes);
return 0;
}
I run the program on Ubuntu 16.04 (4.4.0-83-generic) and I compile using nvcc 41.cu -o 41.o -arch=sm_30 under version 8.0.61. The program is run on a GeForce GTX 780 Ti but everytime it runs, it always produces non-deterministic results:
Length = 262142 Primes = 49477
Length = 262142 Primes = 49486
Length = 262142 Primes = 49596
Length = 262142 Primes = 49589
There were no errors reported back. At first I thought it was a race condition but cuda-memcheck didn't report back any hazards for racecheck,initcheck or synccheck and I couldn't think of any problems with my assumptions. I was thinking this could be a synchronisation problem?
This non-deterministic behaviour only occurs when I increase the block size and thread size as seen in the code. When I tried a block size and thread size of say 16, then there were no problems (as far as I could tell). It seems that not all threads get the chance to execute? I was planning to run this on very large array sizes (< 1 billion integers) but I am stuck at this point.
What am I doing wrong here?

There is a giant race-condition
So prime[i] > 0 means prime, while prime[i]=0 means composite.
primes[i] = i; is executed as first update on primes by each thread. Keep this in mind.
Now let's see what happen when thread 16 executes. It marks primes[16]=16 and and all multiples of 16 too. Something like the following
primes[16] = primes[32] = primes[48]=....=primes[k*16]=0
Imagine that thread 48 gets scheduled just after thread 16 completed its job (or when j>3 in thread 16 loop`).
Thread 48 sets primes[48] = 48. You have lost the update made by thread 16.
That is a race condition.
When coding in CUDA you should make sure that the correctness of your code does not depend on a particular scheduling of warps.
You should think as the order of execution as something non-deterministic.

MPI with pthreads slow performance [duplicate]

I'm trying around on the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers.
#include <iostream>
#include <thread>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <cmath>
double add_single(int N) {
double sum=0;
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand()/RAND_MAX);
}
return sum/N;
}
void add_multi(int N, double& result) {
double sum=0;
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand()/RAND_MAX);
}
result = sum/N;
}
int main() {
srand (time(NULL));
int N = 1000000;
// single-threaded
auto t1 = std::chrono::high_resolution_clock::now();
double result1 = add_single(N);
auto t2 = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
std::cout << "time single: " << time_elapsed << std::endl;
// multi-threaded
std::vector<std::thread> th;
int nr_threads = 3;
double partual_results[] = {0,0,0};
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < nr_threads; ++i)
th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) ));
for(auto &a : th)
a.join();
double result_multicore = 0;
for(double result:partual_results)
result_multicore += result;
result_multicore /= nr_threads;
t2 = std::chrono::high_resolution_clock::now();
time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
std::cout << "time multi: " << time_elapsed << std::endl;
return 0;
}
Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is
time single: 33
time multi: 565
So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas.
edit:
This problem scales for larger N, so the problem is not the short runtime
The time for creating the threads is not the problem. Excluding it does not change the result significantly
Wow I found the problem. It was indeed rand(). I replaced it with an C++11 equivalent and now the runtime scales perfectly. Thanks everyone!

On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected.
void add_multi(int N, double& result) {
double sum=0;
unsigned int seed = time(NULL);
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX);
}
result = sum/N;
}

As you discovered, rand is the culprit here.
For those who are curious, it's possible that this behavior comes from your implementation of rand using a mutex for thread safety.
For example, eglibc defines rand in terms of __random, which is defined as:
long int
__random ()
{
int32_t retval;
__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);
return retval;
}
This kind of locking would force multiple threads to run serially, resulting in lower performance.

The time needed to execute the program is very small (33msec). This means that the overhead to create and handle several threads may be more than the real benefit. Try using programs that need longer times for the execution (e.g., 10 sec).

To make this faster, use a thread pool pattern.
This will let you enqueue tasks in other threads without the overhead of creating a std::thread each time you want to use more than one thread.
Don't count the overhead of setting up the queue in your performance metrics, just the time to enqueue and extract the results.
Create a set of threads and a queue of tasks (a structure containing a std::function<void()>) to feed them. The threads wait on the queue for new tasks to do, do them, then wait on new tasks.
The tasks are responsible for communicating their "done-ness" back to the calling context, such as via a std::future<>. The code that lets you enqueue functions into the task queue might do this wrapping for you, ie this signature:
template<typename R=void>
std::future<R> enqueue( std::function<R()> f ) {
std::packaged_task<R()> task(f);
std::future<R> retval = task.get_future();
this->add_to_queue( std::move( task ) ); // if we had move semantics, could be easier
return retval;
}
which turns a naked std::function returning R into a nullary packaged_task, then adds that to the tasks queue. Note that the tasks queue needs be move-aware, because packaged_task is move-only.
Note 1: I am not all that familiar with std::future, so the above could be in error.
Note 2: If tasks put into the above described queue are dependent on each other for intermediate results, the queue could deadlock, because no provision to "reclaim" threads that are blocked and execute new code is described. However, "naked computation" non-blocking tasks should work fine with the above model.

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect the changes on an array. Function array_modifier() picks a random element and toggles it's value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear, I know this is bad practice). change_detector() scans the array and when an element doesn't match it's prior value and it is equal to 1, the change is detected and diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)
typedef struct {
unsigned int tid;
} parm;
static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};
void* array_modifier(void* args);
void* change_detector(void* arg);
int main(int argc, char** argv) {
if (argc < 2) {
exit(1);
}
N = (unsigned int)strtoul(argv[1], NULL, 0);
my_array = calloc(N, sizeof(int));
time_array = malloc(N * sizeof(struct timeval));
old_value = calloc(N, sizeof(int));
parm* p = malloc(NTHREADS * sizeof(parm));
pthread_t generator_thread;
pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
for (unsigned int i = 0; i < NTHREADS; i++) {
p[i].tid = i;
pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
}
pthread_create(&generator_thread, NULL, array_modifier, NULL);
pthread_join(generator_thread, NULL);
usleep(500);
for (unsigned int i = 0; i < NTHREADS; i++) {
pthread_cancel(detector_thread[i]);
}
for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
fprintf(stderr, "\n");
_exit(0);
}
void* array_modifier(void* arg) {
UNUSED(arg);
srand(time(NULL));
unsigned int changing_signals = CHANGES;
while (changing_signals--) {
usleep(TIME_INTERVAL);
const unsigned int r = rand() % N;
gettimeofday(&time_array[r], NULL);
my_array[r] ^= 1;
}
pthread_exit(NULL);
}
void* change_detector(void* arg) {
const parm* p = (parm*) arg;
const unsigned int tid = p->tid;
const unsigned int start = tid * (N / NTHREADS) +
(tid < N % NTHREADS ? tid : N % NTHREADS);
const unsigned int end = start + (N / NTHREADS) +
(tid < N % NTHREADS);
unsigned int r = start;
while (1) {
unsigned int tmp;
while ((tmp = my_array[r]) == old_value[r]) {
r = (r < end - 1) ? r + 1 : start;
}
old_value[r] = tmp;
if (tmp) {
struct timeval tv;
gettimeofday(&tv, NULL);
// detection time in usec
diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
}
}
}
when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My cpu has 6 cores.

Short Answer
You are sharing memory between thread and sharing memory between threads is slow.
Long Answer
Your program is using a number of thread to write to my_array and another thread to read from my_array. Effectively my_array is shared by a number of threads.
Now lets assume you are benchmarking on a multicore machine, you probably are hoping that the OS will assign different cores to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance CPUs have multi-level caches. The fastest Cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20 - 30 cycles.
Now in lots of CPU architectures each core has its own L1 cache but the L2 cache is shared. This means any data that is shared between thread (cores) has to go through the L2 cache which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.
Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.
Aside
Never rely on volatile to do the correct thing when sharing memory between thread, either use your library atomic operations or use mutexes. This is because some CPUs allow out of order reads and writes that may do strange things if you do not know what you are doing.

It is rare that a multithreaded program scales perfectly with the number of threads. In your case you measured a speed-up factor of ca 0.9 (665/748) with 4 threads. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects the caching in L1/L2/L3 cache. And if the data is too large to fit into RAM you need to factor in disk I/O. Usually, multithreaded implementations are slower, because they want to work on more data at a time but in rare instances they can be faster, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the for loop. Each iteration takes but a few clock cycles. Then you call gettimeofday which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight