Is there any difference between calling munmap in the main thread and calling it in a pthread function?
There is no difference.
mmap() and munmap() are system calls that affect the virtual memory of the entire process (all the threads), but it doesn't matter which thread makes the call.
Of course these functions do not perform any inter-thread synchronization, so if there are conflicting accesses or potential data races on the memory regions in question, that's your problem to solve.
Related
Is a thread dynamically allocated memory?
I have been researching and have a fair understanding of threads and how they are used. I have specifically looked at the POSIX API for threads.
I am trying to understand thread creation and how it differs from a simple malloc call.
I understand that threads share certain memory segments with the parent process, but it has it's own stack.
Any resources I can read through on this topic is appreciated. Thanks!
Thread creation and a malloc() call are completely different concepts. A malloc() call dynamically allocates the requested byte chunk of memory from the heap for the use of the program.
Whereas a thread can be considered as a 'light-weight process'. The thread is an entity within a process and every process will have atleast one thread to help complete its execution. The threads of a process will share the process virtual address and all the resources of the process. When you create new threads of a process, these new threads will have their own user stack, they will be scheduled independently to be executed by the scheduler. Also for the thread to run concurrently they will have their context which will store the state of the thread just before preemption i.e the status of all the registers.
Is a thread dynamically allocated memory?
No, nothing of the sort. Threads have memory uniquely associated with them -- at least a stack -- but such memory is not the thread itself.
I am trying to understand thread creation and how it differs from a simple malloc call.
New thread creation is not even the same kind of thing as memory allocation. The two are not at all comparable.
Threading implementations that have direct OS support (not all do) are unlikely to rely on the C library to obtain memory for their stack, kernel data structures, or any other thread-implementation-associated data. On the other hand, those that do not have OS support, such as Linux's old "green" threads, are more likely to allocate memory via the C library. Even threading implementations without direct OS support have the option of using a system call to obtain the memory they need, just as malloc() itself must do. In any case, the memory obtained is not itself the thread.
Note also that the difference between threading systems with and without OS support is orthogonal to the threading API. For example, Linux's green threads and the now-ubiquitous, kernel-supported NPTL threads both implement the POSIX thread API.
It is expected that threads, on which pthread_detach() was not called, should be pthread_join()ed before the main thread returns from main() or calls exit().
However, what happens when this requirement is not met? What happens when a process terminates when it still contains unjoined and not detached threads?
I would find it odd to learn that these other threads’ resources will not be reclaimed until system reboot. However, if these resources will be reclaimed, then there may be little need to bother about joining or detaching, mightn’t it?
It is up to the operating system. Typical modern operating systems will indeed reclaim the memory and descriptors (handles) used by abandoned threads. This is similar to how dynamically allocated memory works: typical modern systems will reclaim it when a process exits, even if the process never explicitly freed the memory. For certain unusual programs, this can be a meaningful performance optimization, because freeing lots of small resources takes time and the OS may be able to do it more quickly.
However, what happens when this requirement is not met? What happens when a process terminates when it still contains unjoined and not detached threads?
On any system with POSIX threads that is not ancient, the non-joined threads simply "evaporate" into space when the SYS_exit system call is performed by the main thread.
I would find it odd to learn that these other threads’ resources will not be reclaimed until system reboot.
They will be.
However, if these resources will be reclaimed, then there may be little need to bother about joining or detaching, mightn’t it?
It depends on what these threads do. The danger is at-exit data races.
In C++, global variables are destructed (usually via atexit or equivalent registration mechanism), FILE handles are deleted, etc. etc.
If non-joined thread tries to access any such resource, it will likely crash with SIGSEGV, possibly producing core dump, and an unclean process exit code, both of which are often quite undesirable.
This question already has answers here:
Does guarding a variable with a pthread mutex guarantee it's also not cached?
(3 answers)
Closed 3 years ago.
Do pthread_mutex_lock and pthread_mutex_unlock functions call memory fence/barrier instructions? Or do the the lower level instructions like compare_and_swap implicity have memory barriers?
Do pthread_mutex_lock and pthread_mutex_unlock functions call memory fence/barrier instructions?
They do, as well as thread creation.
Note, however, there are two types of memory barriers: compiler and hardware.
Compiler barriers only prevent the compiler from reordering reads and writes and speculating variable values, but don't prevent the CPU from reordering.
The hardware barriers prevent the CPU from reordering reads and writes. Full memory fence is usually the slowest instruction, most of the time you only need operations with acquire and release semantics (to implement spinlocks and mutexes).
With multi-threading you need both barriers most of the time.
Any function whose definition is not available in this translation unit (and is not intrinsic) is a compiler memory barrier. pthread_mutex_lock, pthread_mutex_unlock, pthread_create also issue a hardware memory barrier to prevent the CPU from reordering reads and writes.
From Programming with POSIX Threads by David R. Butenhof:
Pthreads provides a few basic rules about memory visibility. You can count on all implementations of the standard to follow these rules:
Whatever memory values a thread can see when it calls pthread_create can also be seen by the new thread when it starts. Any data written to memory after the call to pthread_create may not necessarily be seen by the new thread, even if the write occurs before the thread starts.
Whatever memory values a thread can see when it unlocks a mutex, either directly or by waiting on a condition variable, can also be seen by any thread that later locks the same mutex. Again, data written after the mutex is unlocked may not necessarily be seen by the thread that locks the mutex, even if the write occurs before the lock.
Whatever memory values a thread can see when it terminates, either by cancellation, returning from its start function, or by calling pthread_exit, can also be seen by the thread that joins with the terminated thread bycalling pthread_join. And, of course, data written after the thread terminates may not necessarily be seen by the thread that joins, even if the write occurs before the join.
Whatever memory values a thread can see when it signals or broadcasts a condition variable can also be seen by any thread that is awakened by that signal or broadcast. And, one more time, data written after the signal or broadcast may not necessarily be seen by the thread that wakes up, even if the write occurs before it awakens.
Also see C++ and Beyond 2012: Herb Sutter - atomic<> Weapons for more details.
Please take a look at section 4.12 of the POSIX specification.
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. [emphasis mine]
Then a list of functions is given which synchronize memory, plus a few additional notes.
If that requires memory barrier instructions on some architecture, then those must be used.
About compare_and_swap: that isn't in POSIX; check the documentation for whatever you are using. For instance, IBM defines a compare_and_swap function for AIX 5.3. which doesn't have full memory barrier semantics The documentation note says:
If compare_and_swap is used as a locking primitive, insert an isync at the start of any critical sections.
From this documentation we can guess that IBM's compare_and_swap has release semantics: since the documentation does not require a barrier for the end of the critical section. The acquiring processor needs to issue an isync to make sure it is not reading stale data, but the publishing processor doesn't have to do anything.
At the instruction level, some processors have compare and swap with certain synchronizing guarantees, and some don't.
I am looking at the section 4.11 of The Open Group Base Specifications Issue 7 (IEEE Std 1003.1, 2013 Edition), section 4.11 document, which spells out the memory synchronization rules. This is the most specific by the POSIX standard I have managed to come by for detailing the POSIX/C memory model.
Here's a quote
4.11 Memory Synchronization
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads:
fork() pthread_barrier_wait() pthread_cond_broadcast()
pthread_cond_signal() pthread_cond_timedwait() pthread_cond_wait()
pthread_create() pthread_join() pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock() pthread_mutex_unlock() pthread_spin_lock()
pthread_spin_trylock() pthread_spin_unlock() pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock() pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock() pthread_rwlock_trywrlock()
pthread_rwlock_unlock() pthread_rwlock_wrlock() sem_post()
sem_timedwait() sem_trywait() sem_wait() semctl() semop() wait()
waitpid()
(exceptions to the requirement omitted).
Basically, paraphrasing the above document, the rule is that when applications read or modify a memory location while another thread or process may modify it, they should make sure to synchronize the thread execution and memory with respect to other threads by calling one of the listed functions. Among them, pthread_create(3) is mentioned to provide that memory synchronization.
I understand that this basically means there needs to be some sort of memory barrier implied by each of the functions (although the standard seems not to use that concept). So for example returning from pthread_create(), we are guaranteed that the memory modifications made by that thread before the call appear to other threads (running possibly different CPU/core) after they also synchronize memory. But what about the newly created thread - is there implied memory barrier before the thread starts running the thread function so that it unfailingly sees the memory modifications synchronized by pthread_create()? Is this specified by the standard? Or should we provide memory synchronization explicitly to be able to trust correctness of any data we read according to POSIX standard?
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
Example:
Thread #1 creates a constant object allocated from heap. Thread #1 creates a new thread #2 that reads the data from the object. If we can assume the new thread #2 starts with memory synchronized then everything is fine. However, if the CPU core running the new thread has copy of previously allocated but since discarded data in its cache memory instead of the new value, then it might have wrong view of the state and the application may function incorrectly.
More concretely...
Previously in the program (this is the value in CPU #1 cache memory)
int i = 0;
Thread T0 running in CPU #0:
pthread_mutex_lock(...);
int tmp = i;
pthread_mutex_unlock(...);
Thread T1 running in CPU #1:
i = 42;
pthread_create(...);
Newly created thread T2 running in CPU #0:
printf("i=%d\n", i); /* First step in the thread function */
Without memory barrier, without synchronizing thread T2 memory it could happen that the output would be
i=0
(previously cached, unsynchronized value).
Update:
Lot of applications using POSIX thread library would not be thread safe if this implementation craziness was allowed.
is there implied memory barrier before the thread starts running the thread function so that it
unfailingly sees the memory modifications synchronized by pthread_create()?
Yes. Otherwise there would be no point to pthread_create acting as memory synchronization (barrier).
(This is afaik. not explicitly stated by posix, (nor does posix define a standard memory model),
so you'll have to decide whether you trust your implementation to do the only sane thing it possibly could - ensure synchronization before the new thread is run- I would not worry particularly about it).
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
No, a context switch does not act as a barrier.
Thread #1 creates a constant object allocated from heap. Thread #1 creates a new thread #2 that reads the data from the object. If we can assume the new thread #2 starts with memory synchronized then everything is fine. However, if the CPU core running the new thread has copy of previously allocated but since discarded data in its cache memory instead of the new value, then it might have wrong view of the state and the application may function incorrectly.
Since pthread_create must perform memory synchronization, this cannot happen. Any old memory that reside in a cpu cache on another core must be invalidated. (Luckily, the commonly used platforms are cache coherent, so the hardware takes care of that).
Now, if you change your object after you've created your 2. thread, you need memory synchronization again so all parties can see the changes, and otherwise avoid race conditions. pthread mutexes are commonly used to achieve that.
cache coherent architectures guarantee from the architectural design point of view that even separated CPUs (ccNUMA - cache coherent Not Uniform Memory Architecture), with independent memory channels when accessing a memory location will not incur in the incoherency you are describing in the example.
This happens with an important penalty, but the application will function correctly.
Thread #1 runs on CPU0, and hold the object memory in cache L1. When thread #2 on CPU1 read the same memory address (or more exactly: the same cache line - look for false sharing for more info), it forces a cache miss on CPU0 before loading that cache line.
You've turned the guarantee pthread_create provides into an incoherent one. The only thing the pthread_create function could possibly do is establish a "happens before" relationship between the thread that calls it and the newly-created thread.
There is no way it could establish such a relationship with existing threads. Consider two threads, one calls pthread_create, the other accesses a shared variable. What guarantee could you possibly have? "If the thread called pthread_create first, then the other thread is guaranteed to see the latest value of the variable". But that "If" renders the guarantee meaningless and useless.
Creating thread:
i = 1;
pthread_create (...)
Created thread:
if (i == 1)
...
Now, this is a coherent guarantee -- the created thread must see i as 1 since that "happened before" the thread was created. Our code made it possible for the standard to enforce a logical "happens before" relationship, and the standard did so to assure us that our code works as we expect.
Now, let's try to do that with an unrelated thread:
Creating thread:
i = 1;
pthread_create (...)
Unrelated thread:
if ( i == 1)
...
What guarantee could we possible have, even if the standard wanted to provide one? With no synchronization between the threads, we haven't tried to make a logical happens before relationship. So the standard can't honor it -- there's nothing to honor. There no particular behavior that is "right", so no way the standard can promise us the right behavior.
The same applies to the other functions. For example, the guarantee for pthread_mutex_lock means that a thread that acquires a mutex sees all changes made by, or seen by, any threads that have unlocked the mutex. We logically expect our thread to get the mutex "after" any threads that got the mutex "before", and the standard promises to honor that expectation so our code works.
I am working over an embedded http server written in C which was originally using fork() for handling each client request.
I switched it to use pthread_create instead of fork().
During memory usage comparison b/w the fork() and threaded version, I observed that is a change in %VSZ utilization as listed by top. The fork() version reports higher %VSZ then of pthread_create().
Can anyone explain why this change is there, because, as far as I think all the changes I have done are related to creating threads. I can't determine how it as changed the Virtual memory Size of the process.
Basically a fork()creates another process, which means that it gets assigned its own memory space, which means that you multiply the memory used.
A Thread on the other hand shares its memory space with the process that created it, therefore your memory usage will be way smaller, but you have to worry about race conditions and deadlocks if you access the same variable from multiple threads. (Does not happen with processes unless you use shared memory constructs)