Do we need a lock in a single writer multi-reader system? - c

OS books say there must be a lock to protect data from being accessed by a reader and a writer at the same time.
But when I test this simple example on an x86 machine, it works fine.
I want to know: is the lock necessary here?
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
struct doulnum
{
int i;
long int l;
char c;
unsigned int ui;
unsigned long int ul;
unsigned char uc;
};
long int global_array[100] = {0};
void* start_read(void *_notused)
{
int i;
struct doulnum d;
int di;
long int dl;
char dc;
unsigned char duc;
unsigned long dul;
unsigned int dui;
while(1)
{
for(i = 0;i < 100;i ++)
{
dl = global_array[i];
//di = d.i;
//dl = d.l;
//dc = d.c;
//dui = d.ui;
//duc = d.uc;
//dul = d.ul;
if(dl > 5 || dl < 0)
printf("error\n");
/*if(di > 5 || di < 0 || dl > 10 || dl < 5)
{
printf("i l value %d,%ld\n",di,dl);
exit(0);
}
if(dc > 15 || dc < 10 || dui > 20 || dui < 15)
{
printf("c ui value %d,%u\n",dc,dui);
exit(0);
}
if(dul > 25 || dul < 20 || duc > 30 || duc < 25)
{
printf("uc ul value %u,%lu\n",duc,dul);
exit(0);
}*/
}
}
}
int start_write(void)
{
int i;
//struct doulnum dl;
while(1)
{
for(i = 0;i < 100;i ++)
{
//dl.i = random() % 5;
//dl.l = random() % 5 + 5;
//dl.c = random() % 5 + 10;
//dl.ui = random() % 5 + 15;
//dl.ul = random() % 5 + 20;
//dl.uc = random() % 5 + 25;
global_array[i] = random() % 5;
}
}
return 0;
}
int main(int argc,char **argv)
{
int i;
cpu_set_t cpuinfo;
pthread_t pt[3];
//struct doulnum dl;
//dl.i = 2;
//dl.l = 7;
//dl.c = 12;
//dl.ui = 17;
//dl.ul = 22;
//dl.uc = 27;
for(i = 0;i < 100;i ++)
global_array[i] = 2;
for(i = 0;i < 3;i ++)
if(pthread_create(pt + i,NULL,start_read,NULL) != 0) /* pthread_create returns a positive error number on failure */
return -1;
/* for(i = 0;i < 3;i ++)
{
CPU_ZERO(&cpuinfo);
CPU_SET_S(i,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pt[i],sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity %d\n",i);
exit(0);
}
}
CPU_ZERO(&cpuinfo);
CPU_SET_S(3,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pthread_self(),sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity recver\n");
exit(0);
}*/
start_write();
return 0;
}

If you don't synchronise reads and writes, a reader could read while a writer is writing, and read the data in a half-written state if the write operation is not atomic. So yes, synchronisation would be necessary to keep that from happening.

You certainly need synchronization here. The simple reason is that there is a distinct possibility that the data will be in an inconsistent state when start_write is updating the global array and one of your 3 threads tries to read the same data from it.
What you quote is also imprecise. "must be a lock to protect data from accessed by reader and writer at the same time" should be "there must be a lock to protect data from being modified by the writer while it is accessed by a reader".
If the shared data is being modified by one of the threads while another thread is reading from it, you need to use a lock to protect it.
If the shared data is only being read by two or more threads, you don't need to protect it.

It will work fine if the threads are just reading from global_array. printf should be fine since this does a single IO operation in append mode.
However, since the main thread calls start_write to update global_array at the same time the other threads are in start_read, they are going to be reading the values in a very unpredictable manner. It depends highly on how threads are implemented in the OS, how many CPUs/cores you have, etc. This might work well on your dual-core development box but then fail spectacularly when you move to a 16-core production server.
For example, if the threads are not synchronized, under the right circumstances they might never see any updates to global_array. Or some threads would see changes sooner than others. It's all about the timing of when writes are flushed to main memory and when the threads see the changes in their caches. To ensure consistent results you need synchronization (memory barriers) to force the caches to be updated.

The general answer is you need some way to ensure/enforce necessary atomicity, so the reader doesn't see an inconsistent state.
A lock (done correctly) is sufficient but not always necessary. But in order to prove that it's not necessary, you need to be able to say something about the atomicity of the operations involved.
This involves both the architecture of the target host and, to some extent, the compiler.
In your example, you're writing a long to an array. In this case, the question is: is the store of a long atomic? It probably is, but it depends on the host. It's possible that the CPU writes out portions of the long (upper/lower words/bytes) separately, so the reader could get a value that was never written. (This is, I believe, unlikely on most modern CPU architectures, but you'd have to check to be sure.)
It's also possible for there to be write buffering in the CPU. It's been a long time since I looked at this, but I believe it's possible to get store reordering if you don't have the necessary write barrier instructions. It's unclear from your example if you would be relying on this.
Finally, you'd probably need to flag the array as volatile (again, I haven't done this in a while so I'm rusty on the specifics) in order to ensure that the compiler doesn't make assumptions about the data not changing underneath it.

It depends on how much you care about portability.
At least on an actual Intel x86 processor, when you're reading/writing dword (32-bit) data that's also dword aligned, the hardware gives you atomicity "for free" -- i.e., without your having to do any sort of lock to enforce it.
Changing much of anything (up to and including compiler flags that might affect the data's alignment) can break that -- but in ways that might remain hidden for a long time (especially if you have low contention over a particular data item). It also leads to extremely fragile code -- for example, switching to a smaller data type can break the code, even if you're only using a subset of the values.
The current atomic "guarantee" is pretty much an accidental side-effect of the way the cache and bus happen to be designed. While I'm not sure I'd really expect a change that broke things, I wouldn't consider it particularly far-fetched either. The only place I've seen documentation of this atomic behavior was in the same processor manuals that cover things like model-specific registers that definitely have changed (and continue to change) from one model of processor to the next.
The bottom line is that you really should do the locking, but you probably won't see a manifestation of the problem with your current hardware, no matter how much you test (unless you change conditions like mis-aligning the data).
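
For the single-word case discussed above, a portable alternative to relying on the hardware is to ask the language for atomicity. The following is a minimal sketch, assuming a C11 compiler with <stdatomic.h>; the names atomic_array, writer_done and atomic_array_demo are hypothetical. Relaxed loads and stores guarantee that each individual element is never seen half-written, but they promise nothing about ordering between different elements.

```c
#include <stdatomic.h>
#include <pthread.h>

#define ARRAY_LEN 100

/* Hypothetical shared array; static atomics start out as zero. */
static atomic_long atomic_array[ARRAY_LEN];
static atomic_int writer_done;

/* Scans the array until the writer finishes; counts values outside 0..4. */
static void *atomic_reader(void *out)
{
    long bad = 0;
    do {
        for (int i = 0; i < ARRAY_LEN; i++) {
            long v = atomic_load_explicit(&atomic_array[i],
                                          memory_order_relaxed);
            if (v < 0 || v > 4)
                bad++;
        }
    } while (!atomic_load_explicit(&writer_done, memory_order_acquire));
    *(long *)out = bad;
    return NULL;
}

/* One writer, three readers, no lock; returns the number of
   out-of-range values the readers observed. */
long atomic_array_demo(void)
{
    pthread_t readers[3];
    long bad[3] = {0, 0, 0};

    atomic_store(&writer_done, 0);
    for (int i = 0; i < 3; i++)
        pthread_create(&readers[i], NULL, atomic_reader, &bad[i]);
    for (int round = 0; round < 1000; round++)
        for (int i = 0; i < ARRAY_LEN; i++)
            atomic_store_explicit(&atomic_array[i], (round + i) % 5,
                                  memory_order_relaxed);
    atomic_store_explicit(&writer_done, 1, memory_order_release);
    for (int i = 0; i < 3; i++)
        pthread_join(readers[i], NULL);
    return bad[0] + bad[1] + bad[2];
}
```

This only covers values that fit in one atomic object. A multi-field struct such as the question's struct doulnum would still need a lock or some other protocol to be read consistently.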

Related

pthread is slower than the "default" version

SITUATION
I want to see the advantage of using pthreads. If I'm not wrong, threads allow me to execute given parts of a program in parallel.
So here is what I am trying to accomplish: I want to make a program that takes a number (let's say n) and outputs the sum of [0..n].
code
#define MAX 1000000000
int
main() {
long long n = 0;
for (long long i = 1; i < MAX; ++i)
n += i;
printf("\nn: %lld\n", n);
return 0;
}
time: 0m2.723s
To my understanding I could simply take that number MAX, divide it by 2, and let 2 threads do the job.
code
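#include <pthread.h>
#include <stdio.h>
#define MAX 1000000000
#define MAX_THREADS 2
#define STRIDE (MAX / MAX_THREADS) /* parenthesized so the macro expands safely inside larger expressions */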
typedef struct {
long long off;
long long res;
} arg_t;
void*
callback(void *args) {
arg_t *arg = (arg_t*)args;
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
pthread_exit(0);
}
int
main() {
pthread_t threads[MAX_THREADS];
arg_t results[MAX_THREADS];
for (int i = 0; i < MAX_THREADS; ++i) {
results[i].off = i * STRIDE;
results[i].res = 0;
pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
}
for (int i = 0; i < MAX_THREADS; ++i)
pthread_join(threads[i], NULL);
long long result;
result = results[0].res;
for (int i = 1; i < MAX_THREADS; ++i)
result += results[i].res;
printf("\nn: %lld\n", result);
return 0;
}
time: 0m8.530s
PROBLEM
The version with pthread runs slower. Logically this version should run faster, but maybe creation of threads is more expensive.
Can someone suggest a solution or show what I'm doing/understanding wrong here?
Your problem is cache thrashing combined with a lack of optimization (I bet you're compiling without it on).
The naive (-O0) code for
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
will access the memory of *arg. With your results array being defined the way it is, that memory is very close to the memory of the next arg and the two threads will fight for the same cache-line, making RAM caching very ineffective.
If you compile with -O1, the loop should use a register instead and only write to memory at the end. Then, you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely)
Another (better) option is to align arg_t on a cache line:
typedef struct {
_Alignas(64) /*typical cache line size*/ long long off;
long long res;
} arg_t;
Then you should get better performance with threads regardless of whether or not you turn optimization on.
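
What -O1 does to the loop can also be written by hand. The following is a minimal sketch with hypothetical names (range_arg, sum_range, parallel_sum) and a smaller range than the question's so it finishes quickly: each thread sums into a local variable and writes to the shared struct only once.

```c
#include <pthread.h>

#define RANGE_MAX 1000000LL
#define N_WORKERS 2
#define RANGE_STRIDE (RANGE_MAX / N_WORKERS)

typedef struct {
    long long off;   /* first value this worker adds */
    long long res;   /* written once, when the worker is done */
} range_arg;

static void *sum_range(void *p)
{
    range_arg *arg = p;
    long long local = 0;   /* private to this thread, no shared cache line */

    for (long long i = arg->off; i < arg->off + RANGE_STRIDE; ++i)
        local += i;
    arg->res = local;      /* single store to shared memory */
    return NULL;
}

/* Sums 0 .. RANGE_MAX-1 using N_WORKERS threads. */
long long parallel_sum(void)
{
    pthread_t threads[N_WORKERS];
    range_arg args[N_WORKERS];
    long long total = 0;

    for (int i = 0; i < N_WORKERS; ++i) {
        args[i].off = i * RANGE_STRIDE;
        args[i].res = 0;
        pthread_create(&threads[i], NULL, sum_range, &args[i]);
    }
    for (int i = 0; i < N_WORKERS; ++i) {
        pthread_join(threads[i], NULL);
        total += args[i].res;
    }
    return total;
}
```

With the local accumulator, the two structs can sit on the same cache line without the threads contending for it during the loop, whatever the optimization level.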
Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his infamous What Every Programmer Should Know About Memory).
Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.

False sharing in multi threads

The following code runs slower as I increase NTHREADS. Why does using more threads make the program run slower? Is there any way to fix it? Someone said it is about false sharing, but I do not really understand that concept.
The program basically calculates the sum from 1 to 100000000. The idea of using multiple threads is to separate the number list into several chunks and calculate the sum of each chunk in parallel to make the calculation faster.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define LENGTH 100000000
#define NTHREADS 2
#define NREPEATS 10
#define CHUNCK (LENGTH / NTHREADS)
typedef struct {
size_t id;
long *array;
long result;
} worker_args;
void *worker(void *args) {
worker_args *wargs = (worker_args*) args;
const size_t start = wargs->id * CHUNCK;
const size_t end = wargs->id == NTHREADS - 1 ? LENGTH : (wargs->id+1) * CHUNCK;
for (size_t i = start; i < end; ++i) {
wargs->result += wargs->array[i];
}
return NULL;
}
int main(void) {
long* numbers = malloc(sizeof(long) * LENGTH);
for (size_t i = 0; i < LENGTH; ++i) {
numbers[i] = i + 1;
}
// repeat the whole computation NREPEATS times; n is the run number printed below
for (size_t n = 0; n < NREPEATS; ++n) {
worker_args *args = malloc(sizeof(worker_args) * NTHREADS);
for (size_t i = 0; i < NTHREADS; ++i) {
args[i] = (worker_args) {
.id = i,
.array = numbers,
.result = 0
};
}
pthread_t thread_ids[NTHREADS];
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_create(thread_ids+i, NULL, worker, args+i);
}
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
long sum = 0;
for (size_t i = 0; i < NTHREADS; ++i) {
sum += args[i].result;
}
printf("Run %2zu: total sum is %ld\n", n, sum);
free(args);
}
free(numbers);
}
Why use more threads make the program run slower?
There is an overhead in creating and joining threads. If the threads don't have much to do, this overhead may be more expensive than the actual work.
Your threads are only doing a simple sum, which isn't that expensive. Also consider that going from e.g. 10 to 11 threads doesn't change the workload per thread a lot.
10 threads --> 10000000 sums per thread
11 threads --> 9090909 sums per thread
The overhead of creating an extra thread may exceed the "work load saved" per thread.
On my PC the program runs in less than 100 milliseconds. Multi-threading isn't worth the trouble.
You need a more processing intensive task before multi-threading is worth doing.
Also notice that it seldom makes sense to create more threads than the number of cores (incl. hyper-threads) your computer has.
false sharing
Yes, "false sharing" can impact the performance of a multi-threaded program, but I doubt it's the real problem in your case.
"False sharing" is something that happens in (some) cache systems when two threads (or rather two cores) write to two different variables that belong to the same cache line. In such cases the two threads/cores compete to own the cache line (for writing) and consequently they have to refresh the memory and the cache again and again. That's bad for performance.
As I said, I doubt that is your problem. A clever compiler will do your loop solely by using CPU registers and only write to memory at the end. You can check the disassembly of your code to see if that is the case.
You can avoid "false sharing" by increasing the size of your struct so that each struct fills a whole cache line on your system.
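
One way to do that is sketched below, assuming a C11 compiler and a 64-byte cache line (typical on x86-64, but an assumption you should check for your CPU). The names CACHE_LINE, padded_worker_args and result_field_distance are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: cache line size in bytes */

/* Aligning the first member forces the whole struct onto a cache-line
   boundary and pads its size up to a multiple of CACHE_LINE. */
typedef struct {
    _Alignas(CACHE_LINE) size_t id;
    long *array;
    long result;
} padded_worker_args;

static_assert(sizeof(padded_worker_args) % CACHE_LINE == 0,
              "each element should occupy whole cache lines");

/* Byte distance between the result fields of two adjacent elements. */
size_t result_field_distance(void)
{
    padded_worker_args args[2];
    return (size_t)((char *)&args[1].result - (char *)&args[0].result);
}
```

With this layout the result fields of neighbouring array elements are a full cache line apart, so two threads updating them no longer compete for the same line. If the array is allocated dynamically, aligned_alloc(CACHE_LINE, ...) would be the allocation call to pair with it, since plain malloc is not required to honour the extended alignment.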

Looking the cause of a race condition on a multicore corepack

I am using a simple software queue based on a write index and a read index.
Introduction details; Language: C, Compiler: GCC Optimization: -O3 with extra parameters, Architecture: Armv7a, CPU: Multicore, 2 Cortex A-15, L2 Cache: Shared and enabled, L1 Cache: Every CPU, enabled, Architecture is supposed to be cache coherent.
CPU 1 does the writing stuff and CPU 2 does the reading stuff. Below is the very simplified example code. You can assume the initial values of the indexes are zero.
COMMON:
#define QUE_LEN 4
unsigned int my_que_write_index = 0; //memory
unsigned int my_que_read_index = 0; //memory
struct my_que_struct{
unsigned int param1;
unsigned int param2;
};
struct my_que_struct my_que[QUE_LEN]; //memory
CPU 1 runs:
void que_writer()
{
unsigned int write_index_local;
write_index_local = my_que_write_index; //my_que_write_index is in memory
my_que[write_index_local].param1 = 16; //my_que is my queue and stored in memory also
my_que[write_index_local].param2 = 32;
//similar writing stuff
++write_index_local;
if(write_index_local == QUE_LEN) write_index_local = 0;
my_que_write_index = write_index_local;
}
CPU 2 runs:
void que_reader()
{
unsigned int read_index_local, param1, param2;
read_index_local = my_que_read_index; //also in memory
while(read_index_local != my_que_write_index)
{
param1 = my_que[read_index_local].param1;
if(param1 == 0) FATAL_ERROR;
param2 = my_que[read_index_local].param2;
//similar reading stuff
my_que[read_index_local].param1 = 0;
++read_index_local;
if(read_index_local == QUE_LEN) read_index_local = 0;
}
my_que_read_index = read_index_local;
}
Okay, in the normal case, a fatal error should never occur because param1 of the queue is always stored with a constant value of 16. But somehow param1 of the queue ends up being 0 and the fatal error occurs.
It is clear that this is some kind of race condition, but I can't figure out how it is happening. The indexes are updated separately by the CPUs.
I don't want to fill my code with memory barriers without understanding the core of the problem. Do you have any idea how this is happening?
Details: This is a baremetal system, these codes are interrupt-disabled, and there is no preemption or task switching.
The compiler and the CPU are allowed to rearrange stores and loads as they see fit (i.e. as long as a single threaded program would not be able to observe a difference). Of course for multi-threaded programs these effects are observable quite well.
For example, this code
write_index_local = my_que_write_index;
my_que[write_index_local].param1 = 16;
my_que[write_index_local].param2 = 32;
++write_index_local;
if(write_index_local == QUE_LEN) write_index_local = 0;
my_que_write_index = write_index_local;
could be reordered like this
a = my_que_write_index;
my_que_write_index = a == QUE_LEN - 1 ? 0 : a + 1;
my_que[a].param1 = 16;
my_que[a].param2 = 32;
Getting this stuff right requires atomics and barriers that avoid these kinds of reorderings. Check out Preshing's excellent series of blog posts to learn about atomics, this one is probably a good start: http://preshing.com/20120612/an-introduction-to-lock-free-programming/ but check out the following ones as well.
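
A minimal sketch of how the index handoff could be ordered, assuming C11 <stdatomic.h> is available for the target. The names spsc_item, spsc_push and spsc_pop are hypothetical, and the queue-full check is an addition that the question's code does not have. The release store on an index publishes everything written before it; the matching acquire load on the other CPU makes those writes visible before the slot is touched.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SPSC_LEN 4   /* one slot stays empty, so at most 3 items are queued */

struct spsc_item {
    unsigned int param1;
    unsigned int param2;
};

static struct spsc_item spsc_buf[SPSC_LEN];
static atomic_uint spsc_write_index;   /* written only by the producer */
static atomic_uint spsc_read_index;    /* written only by the consumer */

/* Producer side. Returns false if the queue is full. */
bool spsc_push(struct spsc_item item)
{
    unsigned int w = atomic_load_explicit(&spsc_write_index,
                                          memory_order_relaxed);
    unsigned int next = (w + 1) % SPSC_LEN;

    if (next == atomic_load_explicit(&spsc_read_index, memory_order_acquire))
        return false;
    spsc_buf[w] = item;
    /* Release: the slot contents become visible before the new index. */
    atomic_store_explicit(&spsc_write_index, next, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the queue is empty. */
bool spsc_pop(struct spsc_item *out)
{
    unsigned int r = atomic_load_explicit(&spsc_read_index,
                                          memory_order_relaxed);

    /* Acquire: pairs with the producer's release store. */
    if (r == atomic_load_explicit(&spsc_write_index, memory_order_acquire))
        return false;
    *out = spsc_buf[r];
    /* Release: the slot is handed back only after it has been copied out. */
    atomic_store_explicit(&spsc_read_index, (r + 1) % SPSC_LEN,
                          memory_order_release);
    return true;
}
```

The consumer publishes its read index after every item rather than once at the end of a batch, so the producer never reuses a slot that is still being read.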

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect changes in an array. The function array_modifier() picks a random element, toggles its value (1 to 0 and vice versa) and then sleeps for some time (long enough that race conditions do not appear; I know this is bad practice). change_detector() scans the array, and when an element doesn't match its prior value and is equal to 1, the change is detected and the diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)
typedef struct {
unsigned int tid;
} parm;
static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};
void* array_modifier(void* args);
void* change_detector(void* arg);
int main(int argc, char** argv) {
if (argc < 2) {
exit(1);
}
N = (unsigned int)strtoul(argv[1], NULL, 0);
my_array = calloc(N, sizeof(int));
time_array = malloc(N * sizeof(struct timeval));
old_value = calloc(N, sizeof(int));
parm* p = malloc(NTHREADS * sizeof(parm));
pthread_t generator_thread;
pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
for (unsigned int i = 0; i < NTHREADS; i++) {
p[i].tid = i;
pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
}
pthread_create(&generator_thread, NULL, array_modifier, NULL);
pthread_join(generator_thread, NULL);
usleep(500);
for (unsigned int i = 0; i < NTHREADS; i++) {
pthread_cancel(detector_thread[i]);
}
for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
fprintf(stderr, "\n");
_exit(0);
}
void* array_modifier(void* arg) {
UNUSED(arg);
srand(time(NULL));
unsigned int changing_signals = CHANGES;
while (changing_signals--) {
usleep(TIME_INTERVAL);
const unsigned int r = rand() % N;
gettimeofday(&time_array[r], NULL);
my_array[r] ^= 1;
}
pthread_exit(NULL);
}
void* change_detector(void* arg) {
const parm* p = (parm*) arg;
const unsigned int tid = p->tid;
const unsigned int start = tid * (N / NTHREADS) +
(tid < N % NTHREADS ? tid : N % NTHREADS);
const unsigned int end = start + (N / NTHREADS) +
(tid < N % NTHREADS);
unsigned int r = start;
while (1) {
unsigned int tmp;
while ((tmp = my_array[r]) == old_value[r]) {
r = (r < end - 1) ? r + 1 : start;
}
old_value[r] = tmp;
if (tmp) {
struct timeval tv;
gettimeofday(&tv, NULL);
// detection time in usec
diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
}
}
}
when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My cpu has 6 cores.
Short Answer
You are sharing memory between threads, and sharing memory between threads is slow.
Long Answer
Your program is using one thread to write to my_array and a number of threads to read from my_array. Effectively my_array is shared by a number of threads.
Now let's assume you are benchmarking on a multicore machine; you are probably hoping that the OS will assign a different core to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance CPUs have multi-level caches. The fastest cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20-30 cycles.
Now in lots of CPU architectures each core has its own L1 cache but the L2 cache is shared. This means any data that is shared between threads (cores) has to go through the L2 cache, which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.
The bottom line is that if you want your multithreaded programs to perform well, you need to ensure that threads do not share memory. Sharing memory is slow.
Aside
Never rely on volatile to do the correct thing when sharing memory between threads; use either your library's atomic operations or mutexes. This is because some CPUs allow out-of-order reads and writes that may do strange things if you do not know what you are doing.
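
As a minimal sketch of the atomic alternative, assuming C11 <stdatomic.h>: the names signals, toggle_signal and read_signal are hypothetical stand-ins for the question's my_array accesses. The release ordering on the toggle means a write made just before it (such as the timestamp) is visible to a thread that observes the new value through the acquire load.

```c
#include <stdatomic.h>

#define N_SIGNALS 100

/* Hypothetical replacement for the volatile array; starts out all zero. */
static atomic_uint signals[N_SIGNALS];

/* Modifier thread: flips element r between 0 and 1 and returns the
   new value. The read-modify-write happens as one indivisible step. */
unsigned int toggle_signal(unsigned int r)
{
    unsigned int old = atomic_fetch_xor_explicit(&signals[r], 1u,
                                                 memory_order_release);
    return old ^ 1u;
}

/* Detector thread: reads element r with acquire ordering. */
unsigned int read_signal(unsigned int r)
{
    return atomic_load_explicit(&signals[r], memory_order_acquire);
}
```

This does not make the shared access cheaper; it only makes the behaviour well defined.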
It is rare that a multithreaded program scales perfectly with the number of threads. In your case you measured a speed-up factor of ca 0.9 (665/748) with 4 threads. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects the caching in L1/L2/L3 cache. And if the data is too large to fit into RAM you need to factor in disk I/O. Usually, multithreaded implementations are slower, because they want to work on more data at a time but in rare instances they can be faster, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the scanning loop. Each iteration takes but a few clock cycles. Then you call gettimeofday, which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

Bakery Lock when used inside a struct doesn't work

I'm new at multi-threaded programming and I tried to code the Bakery Lock Algorithm in C.
Here is the code:
int number[N]; // N is the number of threads
int choosing[N];
void lock(int id) {
choosing[id] = 1;
number[id] = max(number, N) + 1;
choosing[id] = 0;
for (int j = 0; j < N; j++)
{
if (j == id)
continue;
while (1)
if (choosing[j] == 0)
break;
while (1)
{
if (number[j] == 0)
break;
if (number[j] > number[id]
|| (number[j] == number[id] && j > id))
break;
}
}
}
void unlock(int id) {
number[id] = 0;
}
Then I run the following example. I run 100 threads and each thread runs the following code:
for (i = 0; i < 10; ++i) {
lock(id);
counter++;
unlock(id);
}
After all threads have been executed, the result of the shared counter is 10 * 100 = 1000 which is the expected value. I executed my program multiple times and the result was always 1000. So it seems that the implementation of the lock is correct. That seemed weird based on a previous question I had because I didn't use any memory barriers/fences. Was I just lucky?
Then I wanted to create a multi-threaded program that will use many different locks. So I created this (full code can be found here):
typedef struct {
int number[N];
int choosing[N];
} LOCK;
and the code changes to:
void lock(LOCK l, int id)
{
l.choosing[id] = 1;
l.number[id] = max(l.number, N) + 1;
l.choosing[id] = 0;
...
Now when executing my program, sometimes I get 997, sometimes 998, sometimes 1000. So the lock algorithm isn't correct.
What am I doing wrong? What can I do in order to fix it?
Is it perhaps a problem now that I'm reading arrays number and choosing from a struct
and that's not atomic or something?
Should I use memory fences and if so at which points (I tried using asm("mfence") in various points of my code, but it didn't help)?
With pthreads, the standard states that accessing a variable in one thread while another thread is, or might be, modifying it is undefined behavior. Your code does this all over the place. For example:
while (1)
if (choosing[j] == 0)
break;
This code accesses choosing[j] over and over while waiting for another thread to modify it. The compiler is entirely free to modify this code as follows:
int cj=choosing[j];
while(1)
if(cj == 0)
break;
Why? Because the standard is clear that another thread may not modify the variable while this thread may be accessing it, so the value can be assumed to stay the same. But clearly, that won't work.
It can also do this:
while(1)
{
int cj=choosing[j];
if(cj==0) break;
choosing[j]=cj;
}
Same logic. It is perfectly legal for the compiler to write back a variable whether it has been modified or not, so long as it does so at a time when the code could be accessing the variable. (Because, at that time, it's not legal for another thread to modify it, so the value must be the same and the write is harmless. In some cases, the write really is an optimization and real-world code has been broken by such writebacks.)
If you want to write your own synchronization functions, you have to build them with primitive functions that have the appropriate atomicity and memory visibility semantics. You must follow the rules or your code will fail, and fail horribly and unpredictably.
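
As an illustration of building on such primitives, here is a minimal sketch of a spinlock based on C11 atomic_flag. This is a different algorithm from the Bakery lock in the question, and the names spinlock, spin_lock, spin_unlock and spinlock_demo are hypothetical.

```c
#include <stdatomic.h>
#include <pthread.h>

typedef struct {
    atomic_flag held;
} spinlock;

static spinlock counter_lock = { ATOMIC_FLAG_INIT };
static long shared_counter;

/* Spins until the flag was previously clear. Acquire ordering keeps the
   critical section from being reordered before the lock is taken. */
void spin_lock(spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* busy-wait */
}

/* Release ordering makes the critical section's writes visible to the
   next thread that takes the lock. */
void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void *increment_worker(void *unused)
{
    (void)unused;
    for (int i = 0; i < 10000; i++) {
        spin_lock(&counter_lock);
        shared_counter++;
        spin_unlock(&counter_lock);
    }
    return NULL;
}

/* Four threads, 10000 increments each; returns the final counter. */
long spinlock_demo(void)
{
    pthread_t threads[4];

    shared_counter = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, increment_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return shared_counter;
}
```

The lock is passed by pointer so that every thread operates on the same object; a lock struct passed by value would give each call its own private copy.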

Resources