How do jemalloc and tcmalloc track threads?

Now I am actively studying the code of memory managers jemalloc and tcmalloc. But I can't understand how these two managers track threads.
If I understand correctly, a new thread can be detected during memory allocation, after which a new thread cache is created. But how does tcmalloc / jemalloc detect when a thread is destroyed and the thread cache attached to it can be freed for a future use?
Google results could not give even a minimum of any useful information.

I can only answer for jemalloc, but the way it works is that when the thread cache is created it is associated with the thread specific data for that thread.
When you create thread-specific data, you can give it a 'destructor', which is invoked when the thread is being destroyed. With pthreads this is done with the pthread_key_create routine, the C way of creating thread-specific data; its second argument is the destructor.
In the case of jemalloc, there is a bit of code in tcache.h, which hooks tcache_thread_cleanup with the tcache data (my source jemalloc-3.0.0):
malloc_tsd_funcs(JEMALLOC_INLINE, tcache, tcache_t *, NULL,
    tcache_thread_cleanup)
So when the thread exits, the destructor is called. It receives the pointer to that thread's cache and runs the tcache_thread_cleanup routine at that point.
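The mechanism described above can be sketched with plain pthreads. This is only an illustration, not jemalloc's code: toy_cache_t, toy_cache_get, toy_cache_cleanup and toy_cache_demo are hypothetical names, and the "cache" is just a small heap block created lazily the first time a thread asks for it.

```c
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 16

/* Hypothetical stand-in for an allocator's per-thread cache. */
typedef struct {
    int slots[8];
} toy_cache_t;

static pthread_key_t cache_key;
static pthread_once_t cache_key_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int caches_destroyed = 0;

/* Destructor: the pthreads runtime calls this at thread exit for every
 * thread whose value for cache_key is non-NULL. */
static void toy_cache_cleanup(void *p) {
    free(p);
    pthread_mutex_lock(&count_lock);
    caches_destroyed++;
    pthread_mutex_unlock(&count_lock);
}

static void make_key(void) {
    pthread_key_create(&cache_key, toy_cache_cleanup);
}

/* Lazily create the calling thread's cache on first use, much as an
 * allocator could do on a thread's first allocation. */
static toy_cache_t *toy_cache_get(void) {
    pthread_once(&cache_key_once, make_key);
    toy_cache_t *c = pthread_getspecific(cache_key);
    if (c == NULL) {
        c = calloc(1, sizeof *c);
        if (c != NULL)
            pthread_setspecific(cache_key, c);
    }
    return c;
}

static void *worker(void *arg) {
    (void)arg;
    toy_cache_get();   /* "detects" the new thread and attaches a cache */
    return NULL;       /* the destructor runs as the thread exits */
}

/* Runs up to n threads and returns how many caches the destructor freed. */
int toy_cache_demo(int n) {
    pthread_t threads[MAX_THREADS];
    int started = 0;
    if (n > MAX_THREADS)
        n = MAX_THREADS;
    caches_destroyed = 0;
    while (started < n &&
           pthread_create(&threads[started], NULL, worker, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return caches_destroyed;
}
```

Calling toy_cache_demo(4) from main should report four destroyed caches, one per exited thread, without any thread having to announce its own exit.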

Related

Thread Creation - is it dynamically allocated?

Is a thread dynamically allocated memory?
I have been researching and have a fair understanding of threads and how they are used. I have specifically looked at the POSIX API for threads.
I am trying to understand thread creation and how it differs from a simple malloc call.
I understand that threads share certain memory segments with the parent process, but each has its own stack.
Any resources I can read through on this topic are appreciated. Thanks!

Thread creation and a malloc() call are completely different concepts. A malloc() call dynamically allocates a chunk of memory of the requested size from the heap for the program to use.
A thread, by contrast, can be thought of as a 'lightweight process'. It is an entity within a process, and every process has at least one thread to carry out its execution. The threads of a process share the process's virtual address space and all of its resources. Each newly created thread has its own user stack and is scheduled independently by the scheduler. For threads to run concurrently, each also has its own context, which stores the thread's state just before preemption, i.e., the contents of all the registers.
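As a small illustration of the shared address space and the per-thread stack, here is a sketch in which a new thread writes into a variable owned by its creator while keeping its own local variable on a separate stack. The names probe_t, probe and address_space_demo are made up for this example.

```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    int *owner_value;      /* points at a variable on the creating thread's stack */
    uintptr_t local_addr;  /* address of a variable on the new thread's stack */
} probe_t;

static void *probe(void *arg) {
    probe_t *p = arg;
    int local = 7;                       /* lives on this thread's own stack */
    p->local_addr = (uintptr_t)&local;
    *p->owner_value = local;             /* same address space, so this write reaches the creator */
    return NULL;
}

/* Returns 1 if the new thread wrote into the caller's variable while
 * using a distinct local variable of its own, -1 on thread errors. */
int address_space_demo(void) {
    int value = 0;
    probe_t p = { &value, 0 };
    pthread_t t;
    if (pthread_create(&t, NULL, probe, &p) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return value == 7 && p.local_addr != (uintptr_t)&value;
}
```

No malloc() call appears anywhere: the thread is created by pthread_create, and the only memory involved is whatever the threads happen to touch.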

Is a thread dynamically allocated memory?
No, nothing of the sort. Threads have memory uniquely associated with them -- at least a stack -- but such memory is not the thread itself.
I am trying to understand thread creation and how it differs from a simple malloc call.
New thread creation is not even the same kind of thing as memory allocation. The two are not at all comparable.
Threading implementations that have direct OS support (not all do) are unlikely to rely on the C library to obtain memory for their stacks, kernel data structures, or any other thread-implementation-associated data. On the other hand, those that do not have OS support, such as user-space ("green") thread libraries, are more likely to allocate memory via the C library. Even threading implementations without direct OS support have the option of using a system call to obtain the memory they need, just as malloc() itself must do. In any case, the memory obtained is not itself the thread.
Note also that the difference between threading systems with and without OS support is orthogonal to the threading API. For example, a user-space library such as GNU Pth and the now-ubiquitous, kernel-supported NPTL both offer the POSIX thread API.

pthread_create(3) and memory synchronization guarantee in SMP architectures

I am looking at section 4.11 of The Open Group Base Specifications Issue 7 (IEEE Std 1003.1, 2013 Edition), which spells out the memory synchronization rules. This is the most specific text in the POSIX standard I have managed to find detailing the POSIX/C memory model.
Here's a quote:
4.11 Memory Synchronization
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads:
fork() pthread_barrier_wait() pthread_cond_broadcast()
pthread_cond_signal() pthread_cond_timedwait() pthread_cond_wait()
pthread_create() pthread_join() pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock() pthread_mutex_unlock() pthread_spin_lock()
pthread_spin_trylock() pthread_spin_unlock() pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock() pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock() pthread_rwlock_trywrlock()
pthread_rwlock_unlock() pthread_rwlock_wrlock() sem_post()
sem_timedwait() sem_trywait() sem_wait() semctl() semop() wait()
waitpid()
(exceptions to the requirement omitted).
Basically, paraphrasing the above document, the rule is that when applications read or modify a memory location while another thread or process may modify it, they should make sure to synchronize the thread execution and memory with respect to other threads by calling one of the listed functions. Among them, pthread_create(3) is mentioned to provide that memory synchronization.
I understand that this basically means there needs to be some sort of memory barrier implied by each of the functions (although the standard does not seem to use that concept). So, for example, on returning from pthread_create(), we are guaranteed that the memory modifications made by the calling thread before the call appear to other threads (possibly running on a different CPU/core) after they also synchronize memory. But what about the newly created thread: is there an implied memory barrier before the thread starts running the thread function so that it unfailingly sees the memory modifications synchronized by pthread_create()? Is this specified by the standard? Or should we provide memory synchronization explicitly to be able to trust the correctness of any data we read, according to the POSIX standard?
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
Example:
Thread #1 creates a constant object allocated from the heap. Thread #1 creates a new thread #2 that reads the data from the object. If we can assume the new thread #2 starts with memory synchronized, then everything is fine. However, if the CPU core running the new thread has a copy of previously allocated but since discarded data in its cache memory instead of the new value, then it might have a wrong view of the state and the application may function incorrectly.
More concretely...
Previously in the program (this is the value in CPU #1 cache memory)
int i = 0;
Thread T0 running in CPU #0:
pthread_mutex_lock(...);
int tmp = i;
pthread_mutex_unlock(...);
Thread T1 running in CPU #1:
i = 42;
pthread_create(...);
Newly created thread T2 running in CPU #0:
printf("i=%d\n", i); /* First step in the thread function */
Without a memory barrier, i.e. without synchronizing thread T2's memory, it could happen that the output would be
i=0
(previously cached, unsynchronized value).
Update:
A lot of applications using the POSIX thread library would not be thread safe if this implementation craziness was allowed.

is there an implied memory barrier before the thread starts running the thread function so that it unfailingly sees the memory modifications synchronized by pthread_create()?
Yes. Otherwise there would be no point in pthread_create acting as memory synchronization (a barrier).
(As far as I know, this is not explicitly stated by POSIX, nor does POSIX define a standard memory model, so you'll have to decide whether you trust your implementation to do the only sane thing it possibly could: ensure synchronization before the new thread is run. I would not worry particularly about it.)
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
No, a context switch does not act as a barrier.
Thread #1 creates a constant object allocated from the heap. Thread #1 creates a new thread #2 that reads the data from the object. If we can assume the new thread #2 starts with memory synchronized, then everything is fine. However, if the CPU core running the new thread has a copy of previously allocated but since discarded data in its cache memory instead of the new value, then it might have a wrong view of the state and the application may function incorrectly.
Since pthread_create must perform memory synchronization, this cannot happen. Any old memory that resides in a CPU cache on another core must be invalidated. (Luckily, the commonly used platforms are cache coherent, so the hardware takes care of that.)
Now, if you change your object after you've created your second thread, you need memory synchronization again so that all parties can see the changes and race conditions are avoided. pthread mutexes are commonly used to achieve that.
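Both cases can be sketched in a few lines. The names config, progress, progress_lock, reader and visibility_demo are invented for this example; the sketch relies on the guarantees discussed above, namely that the new thread sees writes made before pthread_create and that later changes need a mutex.

```c
#include <pthread.h>

static int config = 0;    /* written before pthread_create, never changed afterwards */
static int progress = 0;  /* changed after creation, so guarded by the mutex */
static pthread_mutex_t progress_lock = PTHREAD_MUTEX_INITIALIZER;

static void *reader(void *arg) {
    int *seen = arg;
    *seen = config;       /* no lock needed: the write preceded pthread_create */

    pthread_mutex_lock(&progress_lock);
    progress += 1;        /* shared and modified after creation: lock required */
    pthread_mutex_unlock(&progress_lock);
    return NULL;
}

/* Returns 0 if the new thread saw config == 42 and both increments of
 * progress were observed, 1 if not, -1 on thread errors. */
int visibility_demo(void) {
    int seen = -1;
    pthread_t t;

    progress = 0;
    config = 42;          /* happens before the thread is created */
    if (pthread_create(&t, NULL, reader, &seen) != 0)
        return -1;

    pthread_mutex_lock(&progress_lock);
    progress += 1;        /* runs concurrently with the new thread */
    pthread_mutex_unlock(&progress_lock);

    if (pthread_join(t, NULL) != 0)
        return -1;
    /* pthread_join synchronizes memory too, so plain reads are fine here. */
    return (seen == 42 && progress == 2) ? 0 : 1;
}
```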

Cache-coherent architectures guarantee by design that even separate CPUs with independent memory channels (ccNUMA, cache-coherent Non-Uniform Memory Access) will not suffer the incoherency you describe in the example when accessing a memory location.
This comes with a significant performance penalty, but the application will function correctly.
Thread #1 runs on CPU0 and holds the object's memory in its L1 cache. When thread #2 on CPU1 reads the same memory address (or, more exactly, the same cache line; look up false sharing for more info), the coherence protocol makes CPU0 give up or share its copy of that cache line before CPU1 loads it.

You've turned the guarantee pthread_create provides into an incoherent one. The only thing the pthread_create function could possibly do is establish a "happens before" relationship between the thread that calls it and the newly-created thread.
There is no way it could establish such a relationship with existing threads. Consider two threads, one calls pthread_create, the other accesses a shared variable. What guarantee could you possibly have? "If the thread called pthread_create first, then the other thread is guaranteed to see the latest value of the variable". But that "If" renders the guarantee meaningless and useless.
Creating thread:
i = 1;
pthread_create (...)
Created thread:
if (i == 1)
...
Now, this is a coherent guarantee -- the created thread must see i as 1 since that "happened before" the thread was created. Our code made it possible for the standard to enforce a logical "happens before" relationship, and the standard did so to assure us that our code works as we expect.
Now, let's try to do that with an unrelated thread:
Creating thread:
i = 1;
pthread_create (...)
Unrelated thread:
if ( i == 1)
...
What guarantee could we possibly have, even if the standard wanted to provide one? With no synchronization between the threads, we haven't tried to make a logical happens-before relationship. So the standard can't honor it -- there's nothing to honor. There is no particular behavior that is "right", so there is no way the standard can promise us the right behavior.
The same applies to the other functions. For example, the guarantee for pthread_mutex_lock means that a thread that acquires a mutex sees all changes made by, or seen by, any threads that have unlocked the mutex. We logically expect our thread to get the mutex "after" any threads that got the mutex "before", and the standard promises to honor that expectation so our code works.

Freeing memory across threads

Is it bad practice to free memory across threads? That is, a thread allocates memory and, on exiting, passes the pointer to the main thread, which frees the memory. I feel like the answer is yes but I'm just wondering.
The purpose of this in my code is so that the main thread can do some other stuff with the memory before it gets freed. There are plenty of workarounds in my case, which I'm totally fine with using. But having a thread return a void * to a block of memory can, in my case, make the code pretty convenient.
EDIT: I know there are no technical faults in doing this.

It's not wrong for a thread to pass control of memory it has allocated to another thread. For example, in a producer/consumer model, it would be very reasonable for the producer thread to allocate memory for whatever it is that it produces, and then hand control over that memory to the consumer thread for the consumer thread to use and release.
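A minimal sketch of such a handoff, using the pattern from the question: the worker allocates, returns the pointer as its exit value, and the joining thread uses and frees it. The names producer and handoff_demo and the message text are placeholders for this example.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Worker: allocates its result and returns it as the thread's exit value.
 * Ownership of the block passes to whichever thread joins this one. */
static void *producer(void *arg) {
    (void)arg;
    char *msg = malloc(32);
    if (msg != NULL)
        strcpy(msg, "result from worker");
    return msg;
}

/* Joins the worker, uses its result, then frees it in the joining thread.
 * Returns the length of the worker's message, or -1 on failure. */
int handoff_demo(void) {
    pthread_t t;
    void *result = NULL;
    if (pthread_create(&t, NULL, producer, NULL) != 0)
        return -1;
    if (pthread_join(t, &result) != 0 || result == NULL)
        return -1;
    int len = (int)strlen(result);  /* safe: the worker has finished with the block */
    free(result);                   /* freed by a different thread than allocated it */
    return len;
}
```

The pthread_join call is what makes this safe: it guarantees the worker has finished writing before the main thread reads and frees the block.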

It's not "bad practice" as long as it makes sense for your data flow model, and in particular for the requirements your program has on object lifetimes, but it can incur costs. Many modern allocators use thread-local arenas, where allocating and freeing an object in the same thread incurs no synchronization penalty, but freeing it in a different thread forces synchronization or incurs other costs. I would not change your design for this reason unless it's a major bottleneck, but with this implementation detail in mind you could also consider other designs, such as having the thread store its output in a buffer provided by the parent thread in the argument to the thread start function.
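The alternative design mentioned last, a buffer supplied by the parent through the thread's start argument, might look like the following sketch. job_t, worker and caller_buffer_demo are illustrative names, and the computation is a placeholder.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int input;
    char output[64];  /* owned by the parent; the worker only fills it in */
} job_t;

static void *worker(void *arg) {
    job_t *job = arg;
    /* No allocation in the worker, so nothing has to be freed across threads. */
    snprintf(job->output, sizeof job->output, "square=%d",
             job->input * job->input);
    return NULL;
}

/* Returns 0 if the worker filled the parent's buffer as expected,
 * a non-zero strcmp result if not, -1 on thread errors. */
int caller_buffer_demo(void) {
    job_t job = { 6, "" };
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &job) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return strcmp(job.output, "square=36");
}
```

Here the buffer lives on the parent's stack, so its lifetime must cover the whole run of the worker; joining before leaving the function ensures that.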

All threads share a common heap. It doesn't matter which thread allocates or frees the memory, as long as the other threads are done using the memory when it gets freed.

Dynamic memory usage comes with the responsibility of being in complete control of it. It is the user's responsibility to explicitly manage the lifetime of a dynamically allocated object and ensure its deallocation once its expected lifetime ends. There is nothing wrong with dynamically allocated memory blocks being used across different threads. All the threads in the same process share the same heap area. The only care one needs to take is that object lifetimes are clearly defined and scoped.

Restarting threads from saved contexts

I am trying to implement a checkpointing scheme for multithreaded applications by using fork. I will take the checkpoint at a safe location such as a barrier. One thread will call fork to replicate the address space, and signals will be sent to all other threads so that they can save their contexts and write them to a file.
The forked process will not run initially. Only when a restart from the checkpoint is required would a signal be sent to it so that it can start running. At that point, the threads that were not forked but whose contexts were saved will be recreated from the saved contexts.
My first question is whether it is enough to recreate threads from saved contexts and run them from there, if I assume there was no lock held, no signal pending during the checkpoint, etc. Second, how can a thread be created to run from a known context?

What you want is not possible without major integration with the pthreads implementation. Internal thread structures will likely contain their own kernel-space thread ids, which will be different in the restored contexts.
It sounds to me like what you really want is forkall, which is non-trivial to implement. I don't think barriers are useful at all for what you're trying to accomplish. Asynchronous interruption and checkpointing is just as good as synchronized.
If you want to try hacking forkall into glibc, you should start out by looking at the setxid code NPTL uses for synchronizing setuid() calls between threads using signals. The same principle is what's needed to implement forkall, but you'd basically call setjmp instead of setuid in the signal handlers, and then longjmp back into them after making new threads in the child. After that you'd have to patch up the thread structures to have the right pid/tid values, free the excess new stacks that were created, etc.
Edit: Since the setxid code in glibc/NPTL is rather dense reading for someone not familiar with the codebase, you might instead look at the corresponding code I have in musl, called __synccall:
http://git.etalabs.net/cgi-bin/gitweb.cgi?p=musl;a=blob;f=src/thread/synccall.c;h=91ac5eb77322da7393f778da29d35fb3c2def15d;hb=HEAD
It uses a signal to synchronize all threads, then runs a callback sequentially in each thread one-by-one. To implement forkall, you'd want to do something like this prior to the fork, but instead of a callback, simply save jump buffers for each thread except the calling thread (you can't use a callback for this because the return would invalidate the jump buffer you just saved), then perform the fork from the calling thread. After that, you would make N new threads, and have them jump back to the old threads' saved jump buffers, and destroy their new (unneeded) stacks. You'd also need to make the right syscall to update their thread register (e.g. %gs on x86) and tid address.
Then you need to take these ideas and integrate them with glibc's thread allocation and thread stack cache framework. :-)

Why would pthread_create() fail with only 2 threads active?

I'm having some trouble in my first foray into threads in C. I'm trying (for now) to write a very simple server program that accepts a socket connection and starts a new thread to process it. It seems to work fine except that it will only create about 300 threads (303, sometimes 304) before pthread_create() fails with the EAGAIN code, which means:
"The system lacked the necessary resources to create another thread, or the system-imposed limit on the total number of threads in a process {PTHREAD_THREADS_MAX} would be exceeded."
This is not 303 threads at the same time - each thread exits which is confirmed by gdb. Each time the process request function is called there are two threads running.
So it means "the system lacked the necessary resources". My question is (and it may be a bit stupid) - what are these resources? Presumably it's a memory leak in my program (certainly possible, likely even), but I'd have thought that even so it could manage more than 300 considering the rest of the program does very little.
How can I find out how much memory my program has available to confirm that it's running out of it? There's plenty of memory and swap free so presumably there's an artificial limit imposed by the OS (Linux).
Thanks

If you are not creating the threads with the attribute PTHREAD_CREATE_DETACHED (or detaching them with pthread_detach()), you need to call pthread_join() on each created thread after it exits to free up the resources associated with it.
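Both options can be sketched as follows. handle_request, spawn_joined and spawn_detached are placeholder names for this example, and the request handler does no real work.

```c
#include <pthread.h>

/* Placeholder for the per-connection work done by each thread. */
static void *handle_request(void *arg) {
    (void)arg;
    return NULL;
}

/* Creates n short-lived threads one after another and joins each one,
 * which releases its stack and bookkeeping. Returns how many were created. */
int spawn_joined(int n) {
    int created = 0;
    for (int i = 0; i < n; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, handle_request, NULL) != 0)
            break;
        pthread_join(t, NULL);
        created++;
    }
    return created;
}

/* Creates n detached threads: their resources are reclaimed automatically
 * when they exit, and they must not be joined. Returns how many were created. */
int spawn_detached(int n) {
    pthread_attr_t attr;
    int created = 0;
    if (pthread_attr_init(&attr) != 0)
        return -1;
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    for (int i = 0; i < n; i++) {
        pthread_t t;
        if (pthread_create(&t, &attr, handle_request, NULL) != 0)
            break;
        created++;
    }
    pthread_attr_destroy(&attr);
    return created;
}
```

With either approach, a loop far longer than the roughly 300 iterations seen in the question should complete, because exited threads no longer hold on to their stacks.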

Possibly a little overkill(?), but Valgrind can help you locate memory leaks on Linux.
Could you perhaps post some code snippets? Preferably the parts where you allocate/free memory/sockets and where you create your threads.
