I'm not sure how pthread thread-specific data works. Considering the following code (found on the web), does this mean I can create, for example, 5 threads in main and call func in only some of them (let's say 2), so that those threads have the data for 'key' set to something (ptr = malloc(OBJECT_SIZE)), while the other threads have the same key existing but with a NULL value?
static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void
make_key(void)
{
    (void) pthread_key_create(&key, NULL);
}

void
func(void)
{
    void *ptr;

    (void) pthread_once(&key_once, make_key);
    if ((ptr = pthread_getspecific(key)) == NULL) {
        ptr = malloc(OBJECT_SIZE);
        ...
        (void) pthread_setspecific(key, ptr);
    }
    ...
}
An explanation of how thread-specific data works, and how it might be implemented in pthreads (in a simple way), would be appreciated!
Your reasoning is correct. These calls are for thread-specific data. They're a way of giving each thread a "global" area where it can store what it needs, but only if it needs it.
The key is shared among all threads, since it's created with pthread_once() the first time it's needed, but the value associated with that key is different for each thread (unless it remains set to NULL). Because the value is a void* pointing to a memory block, a thread that needs thread-specific data can allocate it and save the address for later use. Threads that never call a routine that needs thread-specific data never waste memory, since nothing is ever allocated for them.
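The scenario from the question (5 threads, only 2 of them calling func) can be sketched as follows. This is a minimal sketch, not the original poster's program: OBJECT_SIZE is given an assumed value, free is registered as the key's destructor so the blocks are released at thread exit, and worker, worker_arg and count_threads_with_value are hypothetical names introduced for the demonstration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define OBJECT_SIZE 64      /* assumed size; the original leaves it unspecified */
#define NUM_THREADS 5

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() as destructor releases each thread's block when it exits */
    int rc = pthread_key_create(&key, free);
    assert(rc == 0);
}

static void func(void)
{
    pthread_once(&key_once, make_key);
    if (pthread_getspecific(key) == NULL) {
        void *ptr = malloc(OBJECT_SIZE);
        assert(ptr != NULL);
        pthread_setspecific(key, ptr);
    }
}

/* Hypothetical per-thread argument: should it call func, and what did it see? */
struct worker_arg { int call_func; int has_value; };

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    pthread_once(&key_once, make_key);   /* the key must exist before getspecific */
    if (arg->call_func)
        func();
    arg->has_value = (pthread_getspecific(key) != NULL);
    return NULL;
}

/* Returns how many of the five threads ended up with a non-NULL value. */
static int count_threads_with_value(void)
{
    pthread_t tid[NUM_THREADS];
    struct worker_arg args[NUM_THREADS];
    int count = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].call_func = (i < 2);     /* only the first two call func */
        args[i].has_value = 0;
        int rc = pthread_create(&tid[i], NULL, worker, &args[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(tid[i], NULL);
        count += args[i].has_value;
    }
    return count;
}
```

Calling count_threads_with_value() should report that only the threads that called func have a value; the others see NULL for the same key.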
The one area where I have used them is to make a standard C library thread-safe. The strtok() function (as opposed to a thread-safe strtok_r() which was considered an abomination when we were doing this) in an implementation I was involved in used almost this exact same code the first time it was called, to allocate some memory which would be used by strtok() for storing information for subsequent calls. These subsequent calls would retrieve the thread-specific data to continue tokenizing the string without interfering with other threads doing the exact same thing.
It meant users of the library didn't have to worry about cross-talk between threads - they still had to ensure a single thread didn't call the function until the last one had finished but that's the same as with single-threaded code.
It allowed us to give a 'proper' C environment to each thread running in our system without the usual "you have to call these special non-standard re-entrant routines" limitations that other vendors imposed on their users.
As for implementation, from what I remember of DCE user-mode threads (which I think were the precursor to the current pthreads), each thread had a single structure which stored things like instruction pointers, stack pointers, register contents and so on. It was a very simple matter to add one pointer to this structure to achieve very powerful functionality with minimal cost. The pointer pointed to an array (a linked list in some implementations) of key/pointer pairs, so each thread could have multiple keys (e.g., one for strtok(), one for rand()).
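To make the "one array per thread control structure" idea concrete, here is a toy model. It is only a sketch of the general approach described above, not the DCE or any real pthreads implementation: all toy_* names and TOY_KEYS_MAX are invented for illustration, and the thread control block is passed explicitly instead of being found from the running thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KEYS_MAX 8                     /* assumed limit for the sketch */

/* Per-thread control block; a real one also holds saved registers etc. */
struct toy_thread {
    void *values[TOY_KEYS_MAX];            /* one slot per key, NULL by default */
};

static int toy_key_in_use[TOY_KEYS_MAX];   /* process-wide key table */

/* Hands out the next free key; a real library would lock this table. */
static int toy_key_create(int *key)
{
    for (int k = 0; k < TOY_KEYS_MAX; k++) {
        if (!toy_key_in_use[k]) {
            toy_key_in_use[k] = 1;
            *key = k;
            return 0;
        }
    }
    return -1;                             /* out of keys */
}

/* Stores a value in the calling thread's own slot for this key. */
static void toy_setspecific(struct toy_thread *self, int key, void *value)
{
    assert(key >= 0 && key < TOY_KEYS_MAX && toy_key_in_use[key]);
    self->values[key] = value;
}

/* Reads the calling thread's own slot; other threads' slots are untouched. */
static void *toy_getspecific(const struct toy_thread *self, int key)
{
    assert(key >= 0 && key < TOY_KEYS_MAX && toy_key_in_use[key]);
    return self->values[key];
}
```

The key is just an index that is the same for every thread, while the slot it selects lives in each thread's own structure, which is why a thread that never sets a value reads NULL.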
The answer to your first question is yes. In simple terms, it allows each thread to allocate and save its own data. This is roughly equivalent to each thread simply allocating and passing around its own data structure. The API saves you the trouble of passing the thread-local structure to all subfunctions, and allows you to look it up on demand instead.
The implementation really doesn't matter all that much (it may vary per-OS), as long as the results are the same.
You can think of it as a two-level hashmap. The key specifies which thread-local "variable" you want to access, and the second level might perform a thread-id lookup to request the per-thread value.
I use a third-party library written in C. It is designed to run as a singleton and contains plenty of static functions, static variables and a user interface. I need to be able to run multiple instances of it so that they do not interfere with each other. For example, if one thread sets a static variable
static int index = 0;
index = 10;
the second thread still sees index = 0.
I am not sure if it is even possible to implement.
What you are asking is not possible.
Let's assume for pedagogical purposes that you are on a unix machine.
Any process (such as the executable ./a.out) has the following memory layout:
Text
Data
  Initialized
  Uninitialized
Heap
Stack
When you create a thread, it shares all of these memory segments except the stack (basically, each thread gets its own stack and stack pointer).
Moreover, static variables are stored in the Data segment (in your case the initialized data segment), which is shared, so when one thread changes a static variable, it changes for all other threads as well.
So threads only have the following things local to themselves:
Stack pointer
Program Counter
registers
Hope it helped :-).
I know that declaring a static variable within a function in C means that this variable retains its state between function invocations. In the context of threads, will this result in the variable retaining its state over multiple threads, or having a separate state between each thread?
Here is a past paper exam question I am struggling to answer:
The following C function is intended to be used to allocate unique identifiers (UIDs) to its callers:
get_uid()
{
static int i = 0;
return i++;
}
Explain in what way get_uid() might work incorrectly in an environment where it is being called by multiple threads. Using a
specific example scenario, give specific detail on why and how such
incorrect behaviour might occur.
At the moment I am assuming that each thread has a separate state for the variable, but I am not sure if that is correct or if the answer is more to do with the lack of mutual exclusion. If that is the case then how could semaphores be implemented in this example?
Your assumption (that threads have their own copy) is not correct. The main problem with the code is that when multiple threads call get_uid(), there is a race condition over which thread increments i and gets the ID, so the IDs may not be unique.
All the threads of a process share the same address space. Since i is a static variable, it has a fixed address. Its "state" is just the content of the memory at that address, which is shared by all the threads.
The postfix ++ operator increments its argument and yields the value of the argument before the increment. The order in which these are done is not defined. One possible implementation is
copy i to R1
copy R1 to R2
increment R2
copy R2 to i
return R1
If more than one thread is running, they can both be executing these instructions simultaneously or interspersed. Work out for yourself sequences where various results obtain. (Note that each thread does have its own register state, even for threads running on the same CPU, because registers are saved and restored when threads are switched.)
A situation like this, where the result depends on the nondeterministic ordering of operations in different threads, is called a race condition, because there's a "race" among the different threads as to which one does which operation first.
No. If you want a variable whose value depends on the thread in which it is used, you should have a look at Thread Local Storage.
You can think of a static variable as a completely global variable; it's really much the same. It is shared by every part of the program that knows its address.
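The difference between an ordinary static and a thread-local variable can be seen in a small sketch. It assumes a C11 compiler (for the _Thread_local keyword); the variable and function names are made up for the example, and the two threads are run one after the other so the demonstration itself has no race.

```c
#include <assert.h>
#include <pthread.h>

static int shared_counter = 0;                /* one copy for the whole process */
static _Thread_local int local_counter = 0;   /* one copy per thread (C11) */

struct result { int shared_seen; int local_seen; };

static void *bump(void *p)
{
    struct result *r = p;
    r->shared_seen = ++shared_counter;        /* sees earlier threads' updates */
    r->local_seen = ++local_counter;          /* always starts from 0 in a new thread */
    return NULL;
}

/* Runs two threads in turn and reports what the second one observed. */
static struct result run_two_threads_in_turn(void)
{
    struct result first = { 0, 0 }, second = { 0, 0 };
    pthread_t t;

    int rc = pthread_create(&t, NULL, bump, &first);
    assert(rc == 0);
    pthread_join(t, NULL);

    rc = pthread_create(&t, NULL, bump, &second);
    assert(rc == 0);
    pthread_join(t, NULL);
    return second;
}
```

The second thread sees the static already incremented by the first thread, while its thread-local counter starts fresh.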
EDIT: as a comment points out, if you keep this implementation with a static variable, race conditions can cause i to be incremented by several threads at the same time, so you have no guarantee about the values returned by the function calls. In such cases, you should protect access with synchronization objects such as mutexes or critical sections.
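As one possible way to protect the counter, here is a sketch that guards get_uid() with a pthread mutex and then checks that several threads receive distinct IDs. The thread count, the number of IDs per thread, and the helper names take_ids and count_distinct_uids are arbitrary choices for the demonstration.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define IDS_PER_THREAD 1000

static pthread_mutex_t uid_lock = PTHREAD_MUTEX_INITIALIZER;

int get_uid(void)
{
    static int i = 0;

    pthread_mutex_lock(&uid_lock);
    int uid = i++;              /* load, increment and store happen under the lock */
    pthread_mutex_unlock(&uid_lock);
    return uid;
}

static char seen[NUM_THREADS * IDS_PER_THREAD];   /* seen[uid] = 1 once handed out */

static void *take_ids(void *unused)
{
    (void) unused;
    for (int n = 0; n < IDS_PER_THREAD; n++)
        seen[get_uid()] = 1;
    return NULL;
}

/* Starts the threads and returns how many distinct UIDs were handed out. */
static int count_distinct_uids(void)
{
    pthread_t tid[NUM_THREADS];
    int distinct = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, take_ids, NULL);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    for (int u = 0; u < NUM_THREADS * IDS_PER_THREAD; u++)
        distinct += seen[u];
    return distinct;
}
```

With the lock in place, every call returns a different value, so the number of distinct IDs equals the total number of calls.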
Since this looks like homework, I'll answer only part of this and that is each thread will share the same copy of i. IOW, threads do not get their own copies. I'll leave the mutual exclusion bit to you.
Each thread shares the same static variable, which is effectively a global. Some threads can get a wrong value because of a race condition: the increment isn't done in a single step but in three assembly instructions (load, increment, store). The linked article, "Race Condition", has a diagram that explains this well.
If you are using gcc you can use the atomic builtin functions. I'm not sure what is available for other compilers.
int get_uid()
{
static int i = 0;
return __atomic_fetch_add(&i, 1, __ATOMIC_SEQ_CST);
}
This will ensure that the variable cannot be acted on by more than one thread at a time.
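For compilers other than gcc, a similar effect can be had with standard C11 atomics, assuming the compiler ships <stdatomic.h>. This is a sketch of that alternative, not a replacement for the builtin shown above.

```c
#include <assert.h>
#include <stdatomic.h>

int get_uid(void)
{
    static atomic_int i = 0;

    /* atomic_fetch_add returns the old value and adds 1 as one indivisible step */
    int uid = atomic_fetch_add(&i, 1);

    /* atomic signed counters wrap silently; this sketch treats wrap-around as a bug */
    assert(uid >= 0);
    return uid;
}
```

atomic_fetch_add uses sequentially consistent ordering by default, which matches the __ATOMIC_SEQ_CST argument in the gcc version.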
Is sprintf thread-safe?
//Global log buffer
char logBuffer[20];
logStatus (char * status, int length)
{
snprintf(logBuffer, 19, status);
printf ("%s\n", logBuffer);
}
The thread safety of this function depends entirely on the thread safety of snprintf/sprintf.
Update:
Thanks for your answers.
I don't mind if the actual contents get messed up, but I want to confirm that sprintf would not cause memory corruption or a buffer overflow beyond 20 bytes in this case, when multiple threads are trying to write to logBuffer.
There is no problem using snprintf() in multiple threads. But here you are writing to a shared string buffer, which I assume is shared across threads.
So your use of this function would not be thread safe.
Your question has an incorrect premise. Even if sprintf itself can be safely called from multiple threads at the same time (as I sure hope it can), your code is not protecting your global variable. The standard library can't possibly help you there.
You have several problems with your code.
Your usage of snprintf is very suspicious. Don't use it just to copy a string. In general, don't pass dynamically allocated strings with arbitrary content as the format to any of the printf functions. They interpret the contents, and if anything in them resembles a %-format, you are doomed.
Don't use static buffers as you do. This is certainly neither thread-safe nor re-entrant.
Either use printf with an appropriate format directly, or replace the call with puts.
Then, Linux adheres to the POSIX standard, which requires that the standard IO functions are thread safe.
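Putting those points together, one possible rewrite of logStatus uses a buffer on the calling thread's stack and a fixed "%s" format. This is only a sketch of the advice above; format_status is a hypothetical helper added so the truncation behaviour can be checked, and the unused length parameter from the question is dropped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_BUFFER_SIZE 20

/* Copies status into out, truncating if needed; returns the characters stored. */
static size_t format_status(char *out, size_t out_size, const char *status)
{
    assert(out != NULL && out_size > 0 && status != NULL);

    /* "%s" keeps any % characters in status from being treated as a format */
    int n = snprintf(out, out_size, "%s", status);
    if (n < 0) {
        out[0] = '\0';
        return 0;
    }
    return (size_t) n < out_size ? (size_t) n : out_size - 1;
}

void logStatus(const char *status)
{
    char logBuffer[LOG_BUFFER_SIZE];    /* lives on this thread's stack, not shared */

    format_status(logBuffer, sizeof logBuffer, status);
    puts(logBuffer);
}
```

Because each call has its own buffer, concurrent callers can no longer overwrite each other's text; passing sizeof logBuffer also lets snprintf use the full 20 bytes, terminator included.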
Regarding your update about not worrying if the logBuffer content get garbled:
I'm not sure why you want to avoid making your function completely thread safe by using a locally allocated buffer or some synchronization mechanism, but if you want to know what POSIX has to say about it, here you go (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11):
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. [followed by a list of functions which provide synchronization]
So, POSIX says that your program needs to make sure multiple threads won't be modifying logBuffer concurrently (or modifying logBuffer in one thread while reading it in another). If you don't hold to that, there's no promise made that the worst that will happen is garbled data in logBuffer. There's simply no promise made at all about what the results will be. I don't know if Linux might document a more specific behavior, but I doubt it does.
"There is no problem using snprintf() in multiple threads."
Not true, at least in the case of POSIX functions.
None of the standard vararg functions are fully MT-safe. This includes the whole printf() family (1), but also every other variadic function (2).
(1) sprintf(), for example, is "MT-Safe locale | AS-Unsafe heap | AC-Unsafe mem", which means it can fail if the locale is changed asynchronously or if asynchronous cancellation of threads is used. In other words, special attention must be paid when using such functions in a multithreaded environment.
(2) va_arg is marked "MT-Safe race:ap | AS-Safe | AC-Unsafe corrupt", which means that inter-locking is needed.
Additionally, and this should be obvious, even a totally MT-safe function can be used in an unsafe way, for example when two or more threads operate on the same data or memory areas.
It's not thread safe, since the buffer where you sprintf is shared between all threads.
"Do you have a refernce which says that they are not thread safe? When I Google, it seems that they are"
My previous answer to this question has been removed/deleted (why?), so I'll try again, using a different approach:
AC (asynchronous cancellation of threads): this is obviously a case where almost all "apparently MT-safe" code can fail, simply because the thread is interrupted at a random point in time, so no synchronization method is guaranteed to work correctly (i.e. no form of mutex can really be guaranteed to work correctly).
Threads can use the same malloc() arena, which means that if one of the threads fails (i.e. it damages the malloc arena), then all subsequent calls to malloc() can cause critical errors. This of course depends on system configuration, but it also means that nobody should assume that malformed memory (de)allocations are safe.
Since all systems provide the option to use different locale settings, it is obvious that an asynchronous change to the locale settings can cause errors.
Regards.
All the threads of a process share the same memory. For example, a change to a global variable in one thread is visible in another thread. Since each thread has its own stack, the local variables created inside a thread are unique to it. In that case, why do we need a thread-specific data mechanism? Can't the same thing be achieved with automatic storage variables inside the thread function?
Kindly clarify.
BR
Rj
Normal globals are shared between threads. Local variables are specific to a particular invocation of a function. If you want something that (for example) is visible to a number of functions running in the same thread, but unique to that thread, then thread specific data is what you're looking for.
It's not required but it's rather handy. Some functions like rand and strtok use static storage duration information which is likely to be problematic when shared among threads.
Say you have a random number function where you want to maintain a different sequence (hence seed) for each thread. You have two approaches.
You can use something like the kludgy:
unsigned int seed = (unsigned int) time (NULL);
int r = rand_r (&seed);
where the seed has to be created by the caller and passed in each time.
Or you can use the rather nicer, ISO-compliant:
srand (time (NULL));
int r = rand();
which can use thread-local storage internally to maintain a thread-specific seed (the standard does not require this, so check your implementation). The same goes for the information strtok keeps about its position within the string it's processing.
That way, you don't have to muck about with changing your code between threaded and non-threaded versions.
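As an illustration of how a library might keep a per-thread seed behind the standard-looking interface, here is a sketch using C11 _Thread_local and the sample generator from the C standard. The names my_srand, my_rand and first_value_in_new_thread are hypothetical; a real C library's rand is not guaranteed to work this way.

```c
#include <assert.h>
#include <pthread.h>

/* Per-thread generator state; every thread starts as if seeded with 1. */
static _Thread_local unsigned long my_state = 1;

void my_srand(unsigned int seed)
{
    my_state = seed;                 /* affects only the calling thread */
}

int my_rand(void)
{
    /* the portable example generator given in the C standard */
    my_state = my_state * 1103515245UL + 12345UL;
    return (int) ((my_state / 65536UL) % 32768UL);
}

static void *first_value(void *out)
{
    *(int *) out = my_rand();        /* uses this new thread's own state */
    return NULL;
}

/* Returns the first number a freshly created thread gets from my_rand(). */
int first_value_in_new_thread(void)
{
    pthread_t t;
    int value = -1;

    int rc = pthread_create(&t, NULL, first_value, &value);
    assert(rc == 0);
    pthread_join(t, NULL);
    return value;
}
```

However far the calling thread has advanced its own sequence, a new thread starts from its own default seed, and no caller has to pass a seed pointer around.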
Now you could create that information in the thread function, but how is the rand function going to know its address without it being passed down? And what if rand is called 87 stack levels down? That's an awful lot of levels to be transferring a pointer through.
And, even if you do something like:
void *pthread_fn (void *unused) {
int seed;
rand_set_seed_location (&seed);
:
}
and rand subsequently uses that value regardless of how deep it is in the stack, that's still a code change from the standard. It may work but so may writing an operating system in COBOL. That doesn't make it a good idea :-)
Yes, the stack is one way of allocating thread-local storage (including handles to heap allocations local to the particular thread).
The best example of thread-specific data is errno. When a call to a C library function fails, errno is set, and you can check it to find the reason for the failure. Without thread-specific data, it would be impossible to port these functions to a multithreaded environment, because errno could be overwritten by another thread before you check it.
As a general rule, most uses of TSD should be avoided in new APIs. If a function needs some information, it should be passed to it.
However, sometimes you need TSD to 'paper over' an API defect. A good example is 'gmtime'. The 'gmtime' function returns a pointer to a structure that is valid until the next call to 'gmtime'. But that would make 'gmtime' awfully hard to use in a multi-threaded program. What if some library called 'gmtime' when you didn't expect it, trashing your structure? One simple workaround is make the structure returned thread-specific. (The long-term solution, of course, is to create a more suitable API such as 'gmtime_r'.)
One case where it's perfectly reasonable to use TSD in new designs is for information that won't be accessed frequently that would clutter the API. For example, if a critical error is discovered, it might be nice to log certain context information from higher-level code (Which client were you serving? What command did they send?). Your choices are basically to pass this context information from function to function to function (which isn't even always possible if some of the functions are outside your control) or to store it in TSD.
I'm writing a program with a consumer thread and a producer thread. Queue synchronization seems to be a big overhead in the program, so I looked for lock-free queue implementations, but only found Lamport's version and an improved version from PPoPP '08:
enqueue_nonblock(data) {
if (NULL != buffer[head]) {
return EWOULDBLOCK;
}
buffer[head] = data;
head = NEXT(head);
return 0;
}
dequeue_nonblock(data) {
data = buffer[tail];
if (NULL == data) {
return EWOULDBLOCK;
}
buffer[tail] = NULL;
tail = NEXT(tail);
return 0;
}
Both versions require a pre-allocated array for the data. My question is: is there any single-consumer, single-producer lock-free queue implementation that uses malloc() to allocate space dynamically?
A related question: how can I measure the exact overhead of queue synchronization, such as how much time pthread_mutex_lock() takes?
If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex? Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familiar route of premature optimisation.
The algorithm you show manages to work because although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head index of the queue (the producer), and only one thread ever alters the tail index (the consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer puts the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index.
It works as well as it does because the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.
I.e., instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.
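A sketch of that idea follows: the ring buffer from the question, holding pointers to malloc()ed items. It assumes C11 atomics so that slot reads and writes are properly ordered between the two threads, and it is only safe for exactly one producer and one consumer. QUEUE_SIZE and the send_int helper are hypothetical additions for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define QUEUE_SIZE 8
#define NEXT(i) (((i) + 1) % QUEUE_SIZE)

static _Atomic(void *) buffer[QUEUE_SIZE];   /* NULL marks an empty slot */
static size_t head;                          /* touched by the producer only */
static size_t tail;                          /* touched by the consumer only */

/* Producer side: publishes a pointer, or reports that the queue is full. */
static int enqueue_nonblock(void *data)
{
    assert(data != NULL);                    /* NULL is reserved for "empty" */
    if (atomic_load_explicit(&buffer[head], memory_order_acquire) != NULL)
        return EWOULDBLOCK;
    atomic_store_explicit(&buffer[head], data, memory_order_release);
    head = NEXT(head);
    return 0;
}

/* Consumer side: takes a pointer out, or reports that the queue is empty. */
static int dequeue_nonblock(void **data)
{
    void *item = atomic_load_explicit(&buffer[tail], memory_order_acquire);
    if (item == NULL)
        return EWOULDBLOCK;
    atomic_store_explicit(&buffer[tail], NULL, memory_order_release);
    tail = NEXT(tail);
    *data = item;
    return 0;
}

/* Hypothetical helper: allocates an item and hands ownership to the consumer. */
static int send_int(int value)
{
    int *item = malloc(sizeof *item);
    if (item == NULL)
        return ENOMEM;
    *item = value;
    int rc = enqueue_nonblock(item);
    if (rc != 0)
        free(item);                          /* queue full: nothing was published */
    return rc;
}
```

The consumer calls free() on each item once it is done with it. The array itself stays pre-allocated; only the payloads are dynamic, so the allocator's own locking still applies to every malloc() and free().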
Also, POSIX does support reading a nanosecond clock, although the actual precision is system dependent. You can read this high-resolution clock before and after the operation and just subtract.
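For instance, the average cost of an uncontended pthread_mutex_lock()/pthread_mutex_unlock() pair could be estimated roughly like this, using clock_gettime() with CLOCK_MONOTONIC. The function name ns_per_lock_unlock and the iteration count are arbitrary; the figure includes loop overhead and says nothing about contended locks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Returns the average time, in nanoseconds, of one lock/unlock pair. */
static double ns_per_lock_unlock(long iterations)
{
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;

    assert(iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);      /* read the clock before */
    for (long n = 0; n < iterations; n++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);        /* and after, then subtract */

    double elapsed_ns = (double) (end.tv_sec - start.tv_sec) * 1e9
                      + (double) (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double) iterations;
}
```

Timing many iterations and dividing smooths out the clock's granularity; a call such as ns_per_lock_unlock(1000000) gives a per-pair estimate for the machine it runs on.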
Yes.
There exist a number of lock-free multiple-reader multiple-writer queues.
I have implemented one, by Michael and Scott, from their 1996 paper.
I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.
You should look at the FastFlow library.
I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require use of a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good perf characteristics--- the locks only prevented multiple readers or multiple producers from hitting the queue at the same time, the producer still never raced with the consumer.
I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and the tail pointers of the list would only ever refer to the same data when the list was empty. I wish I could find the paper, I'm not doing the algorithm justice with my explanation...
AH-ha!
I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.
I've worked with a fairly simple queue implementation that meets most of your criteria. It used a static, maximum-size pool of bytes, and we implemented messages within that. There was a head pointer that one process would move, and a tail pointer that the other process would move.
Locks were still required, but we used Peterson's 2-Processor Algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for very small, well-bounded area: a few CPU cycles at most, so you never block for long.
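For reference, Peterson's two-thread algorithm can be sketched with C11 sequentially consistent atomics, which it needs on modern CPUs and compilers to be correct. This is a generic textbook-style version, not the code from the system described above; the names and the iteration count are invented for the demonstration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ITERATIONS 50000

static atomic_bool wants[2];    /* wants[t]: thread t wants the lock */
static atomic_int turn;         /* which thread is let through when both want it */
static long counter;            /* plain variable protected by the lock */

static void peterson_lock(int self)
{
    int other = 1 - self;

    atomic_store(&wants[self], true);
    atomic_store(&turn, other);              /* give way to the other thread */
    while (atomic_load(&wants[other]) && atomic_load(&turn) == other)
        ;                                    /* spin; no system call involved */
}

static void peterson_unlock(int self)
{
    atomic_store(&wants[self], false);
}

static void *add_many(void *arg)
{
    int self = *(int *) arg;                 /* thread index: 0 or 1 */

    for (int n = 0; n < ITERATIONS; n++) {
        peterson_lock(self);
        counter++;                           /* short, well-bounded critical section */
        peterson_unlock(self);
    }
    return NULL;
}

/* Runs both threads to completion and returns the final counter value. */
static long run_two_threads(void)
{
    pthread_t tid[2];
    int ids[2] = { 0, 1 };

    for (int t = 0; t < 2; t++) {
        int rc = pthread_create(&tid[t], NULL, add_many, &ids[t]);
        assert(rc == 0);
    }
    for (int t = 0; t < 2; t++)
        pthread_join(tid[t], NULL);
    return counter;
}
```

The lock works for exactly two participants, and a waiting thread burns CPU while it spins, which is acceptable only because the protected region is a handful of instructions.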
I think the allocator can be a performance problem. You can try a custom multithreaded memory allocator that uses a linked list to maintain freed blocks. If your blocks are not (nearly) the same size, you can implement a buddy-system memory allocator, which is very fast. You have to synchronise your queue (ring buffer) with a mutex.
To avoid too much synchronisation, you can try to write/read multiple values to/from the queue at each access.
If you still want to use lock-free algorithms, then you must use pre-allocated data or a lock-free allocator.
There is a paper about a lock-free allocator, "Scalable Lock-Free Dynamic Memory Allocation", and an implementation, Streamflow.
Before starting with lock-free stuff, look at: Circular lock-free buffer
Adding malloc would kill any performance gain you may make and a lock based structure would be just as effective. This is so because malloc requires some sort of CAS lock over the heap and hence some forms of malloc have their own lock so you may be locking in the Memory Manager.
To use malloc you would need to pre allocate all the nodes and manage them with another queue...
Note that you can make some form of expandable array, which would need to lock when it is expanded.
Also, while interlocked operations are lock-free on the CPU, they do place a memory lock and block memory for the duration of the instruction, and they often stall the pipeline.
This implementation uses C++'s new and delete which can trivially be ported to the C standard library using malloc and free:
http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448?pgno=2