Profiling waiting threads with perf or other tools


In a multithreaded program, how can one effectively profile threads that are waiting on a lock, are sleeping or are scheduled out in some other way?
For my profiling purposes I need insight into lock contention, so I would like to see it in, for example, a stack trace profiler from which one can generate flame graphs. I first tried the gperftools CPU profiler. But as the name suggests, it only profiles threads that are actually running on a CPU, so you will not see the stack traces of threads waiting on a lock.
So I switched to perf, which I hoped would be powerful enough to gather profile information on scheduled-out threads as well, but so far without luck.
Here is my test program:
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void *threadfunction(void *arg)
{
    pthread_mutex_t *mutex = arg;
    /* Blocks here until main releases the mutex. */
    pthread_mutex_lock(mutex);
    printf("I am the thread function!\n");
    pthread_mutex_unlock(mutex);
    return NULL;
}

int main(void)
{
    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, 0);
    /* Hold the mutex while printing so the thread has to wait. */
    pthread_mutex_lock(&mutex);
    pthread_t thread;
    (void) pthread_create(&thread, NULL, threadfunction, &mutex);
    for (int i = 0; i < 60000000; i++)
        printf("I am the main function!\n");
    pthread_mutex_unlock(&mutex);
    (void) pthread_join(thread, NULL);
    return 0;
}
Now I run the following perf record command:
perf record -F50 --call-graph dwarf test_program
I turn the result into a flame graph with:
perf script > perf.script
stackcollapse-perf.pl perf.script | flamegraph.pl > flamegraph.svg
The resulting flame graph basically shows only the stacks belonging to the main function. The threadfunction runs in a different thread, and because it is waiting for a lock and is therefore scheduled out, you don't see any stack traces related to it.
I have tried adding certain events to perf-record
-e sched:sched_stat_sleep,sched:sched_switch
But that did not help either.
How could I effectively create and compile a lock contention based stack trace profile combined with a CPU based stack trace profile using perf? I like to use perf but I am very much open for other tool suggestions as well.
For instance, I know that with the gdb-based poor man's profiler you can actually get the stack traces of sleeping threads. This is reasonable, because gdb must somehow be able to get stack information from any thread. But I would prefer a more attuned and dedicated tool than gdb for such a task.
Thanks!

Related

pthread_create error code 11 even with a few threads

So I basically have code that looks like this:
pthread_t cpu[10];
while (a certain condition){
    for (int i=0;i<10;i++){
        pthread_create(&cpu[i],NULL,(a function), NULL);
    }
}
This code should be running about 10 threads at a time. However, after running the while loop a certain number of times, pthread_create fails with error code 11. I know I am creating the threads repeatedly, but shouldn't only 10 instances exist at any one time?

The limit on threads is being reached because the program is calling pthread_create() in a loop, constantly spawning more threads, without ever calling pthread_join() to clean up the resources of the existing threads. This quickly fills up the process's threads-table, at which point pthread_create() starts to error out because there is no more room in the process's threads-table for any more threads.
To avoid that problem, you need to modify the code so that it only summons a finite (and reasonable -- read: dozens, not hundreds or thousands) number of threads into existence at one time.
A simple calling-pattern to achieve that would look something like this:
pthread_t cpu[10];
while (a certain condition){
    // spawn 10 threads
    for (int i=0;i<10;i++){
        pthread_create(&cpu[i],NULL,(a function), NULL);
    }
    // at this point all 10 threads are running
    // wait until all 10 threads have exited
    for (int i=0;i<10;i++){
        // pthread_join takes the pthread_t by value, not a pointer to it
        pthread_join(cpu[i], NULL);
    }
}
Another common (and somewhat more elegant) approach would be to use a thread-pool instead of spawning and joining threads. That's often preferable because it avoids the overhead required to constantly create and then tear down threads, and because it means that as soon as a thread has finished computing job A, it can immediately grab job B out of the pending-jobs-queue and start working on it -- unlike the code shown above, which has to wait for all 10 threads to complete before it can spawn 10 more.
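
As a rough illustration of the thread-pool idea, here is a minimal sketch. The names (struct pool, worker, run_pool, N_WORKERS) are made up for this example, and the "job queue" is just a counter handing out the integers 0..n_jobs-1, where a real pool would hold actual work items. A fixed set of workers pulls jobs until none are left, so no thread is ever created per job.

```c
#include <pthread.h>

#define N_WORKERS 4   /* hypothetical pool size */

/* Hypothetical job queue: jobs are simply the integers 0..n_jobs-1. */
struct pool {
    pthread_mutex_t lock;
    int next_job;     /* next job index to hand out */
    int n_jobs;       /* total number of jobs */
    long sum;         /* result accumulated by the workers */
};

static void *worker(void *arg)
{
    struct pool *p = arg;
    for (;;) {
        /* Grab the next pending job, or exit when the queue is empty. */
        pthread_mutex_lock(&p->lock);
        if (p->next_job >= p->n_jobs) {
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        int job = p->next_job++;
        pthread_mutex_unlock(&p->lock);

        long result = (long)job * 2;   /* stand-in for real work */

        pthread_mutex_lock(&p->lock);
        p->sum += result;
        pthread_mutex_unlock(&p->lock);
    }
}

/* Runs n_jobs jobs on at most N_WORKERS threads.
 * Returns the accumulated sum, or -1 if no worker could be started. */
long run_pool(int n_jobs)
{
    struct pool p = { .next_job = 0, .n_jobs = n_jobs, .sum = 0 };
    pthread_t threads[N_WORKERS];
    int started = 0;

    if (pthread_mutex_init(&p.lock, NULL) != 0)
        return -1;
    for (int i = 0; i < N_WORKERS; i++) {
        if (pthread_create(&threads[started], NULL, worker, &p) != 0)
            break;                     /* carry on with fewer workers */
        started++;
    }
    /* Every thread that was created is joined, so nothing leaks. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&p.lock);
    return started > 0 ? p.sum : -1;
}
```

Calling run_pool(100) processes all 100 jobs with only four threads ever existing, instead of creating 100 threads.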

#include <stdio.h>
#include <string.h>
int main() {
    /* Print the message that belongs to error number 11. */
    printf("%s\n", strerror(11));
    return 0;
}
Gives on my system: Resource temporarily unavailable
Again on my system: EAGAIN is defined as 11 (errno.h).
man pthread_create states as possible error return codes:
EAGAIN
Insufficient resources to create another thread.
EAGAIN
A system-imposed limit on the number of threads was encountered. There are a number of limits that may trigger this error: ...
The reason for that was already answered by Jeremy Friesner in the comments section above.
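
To catch this condition in the program itself, the return value of pthread_create() can be checked. Below is a minimal sketch; the helper names noop and spawn_and_join are invented for illustration. Note that the pthread functions report failure through their return value rather than through errno.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *noop(void *arg)
{
    return arg;   /* placeholder thread body */
}

/* Creates one thread, reports a failure such as EAGAIN, and joins it.
 * Returns the pthread error number (0 on success). */
int spawn_and_join(void)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, noop, NULL);
    if (rc != 0) {
        /* rc is the error number itself, e.g. EAGAIN (11 on Linux). */
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
        return rc;
    }
    /* Joining releases the thread's resources. */
    return pthread_join(t, NULL);
}
```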

Why doesn't the parent thread calling pthread_yield make child threads run first?

Here is my code.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sched.h>

void *helper(void *arg)
{
    printf("HELPER\n");
    return NULL;
}

int main()
{
    pthread_t thread;
    pthread_create(&thread, NULL, &helper, NULL);
    pthread_yield();
    // sched_yield();
    printf("MAIN\n");
    pthread_join(thread, NULL);
    return 0;
}
Using pthread_yield() or sched_yield(), the output is always:
MAIN
HELPER
Two facts I have learnt lead me to presume that HELPER would be printed before MAIN:
Calling pthread_yield causes the calling thread to relinquish the CPU. The thread is placed at the end of the run queue for its static priority and another thread is scheduled to run. If the calling thread is the only thread in the highest priority list at that time, it will continue to run.
The child thread is created with the same priority as the parent.
What may be the reason that HELPER is printed after MAIN?

Whether pthread_yield can/will do what you expect (i.e. make the HELPER run first) depends on the system scheduler being used and how it's configured.
In general you can only expect (or rather hope for) such behavior on systems with very simple schedulers. A modern (Linux) system will, by default, use a more complex scheduler, so you can't rely on pthread_yield to enforce the order of execution.
And even if MAIN was stopped and the HELPER was started, the HELPER could be preempted before doing the printing. Or how about multi-core CPUs? What if both threads ran in parallel? Which would do the print first?
So: no, pthread_yield is not the tool for synchronizing threads.
For more on scheduling read:
https://man7.org/linux/man-pages/man7/sched.7.html
There you can read about a number of system calls that you can use to query and configure the scheduler.
But in order to control the thread execution order, you shouldn't rely on controlling the scheduler. Implement your own control, e.g. by using a mutex.
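
One possible way to implement such control is a mutex combined with a condition variable and a flag. The sketch below is only an illustration: the names helper_done, helper_done_cv, order and run_in_order are invented here, and the order buffer exists only so the resulting order can be inspected afterwards. The main thread waits until the helper has set the flag, so HELPER is always printed first regardless of how the scheduler behaves.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t helper_done_cv = PTHREAD_COND_INITIALIZER;
static int helper_done = 0;   /* predicate protected by lock */
static char order[32];        /* records who printed first */

static void *helper(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    printf("HELPER\n");
    strcat(order, "HELPER ");
    helper_done = 1;
    pthread_cond_signal(&helper_done_cv);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Starts the helper and prints MAIN only after the helper has printed.
 * Returns the recorded order, or NULL if the thread could not be created. */
const char *run_in_order(void)
{
    pthread_t thread;
    order[0] = '\0';
    helper_done = 0;
    if (pthread_create(&thread, NULL, helper, NULL) != 0)
        return NULL;

    pthread_mutex_lock(&lock);
    /* Loop on the predicate: condition waits can wake up spuriously. */
    while (!helper_done)
        pthread_cond_wait(&helper_done_cv, &lock);
    printf("MAIN\n");
    strcat(order, "MAIN");
    pthread_mutex_unlock(&lock);

    pthread_join(thread, NULL);
    return order;
}
```

The flag matters as much as the condition variable: if the helper finishes before the main thread starts waiting, the main thread sees helper_done already set and does not block at all.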

Ways to create Linux threads besides pthread_create

I intercepted the pthread_create call to capture the relations among all the threads. But I found that the creation of some threads was not recorded when intercepting only pthread_create. I also tried to intercept the posix_spawn, posix_spawnp and clone calls. But there are still some threads running in my experiment whose creator I don't know. So are there any other ways to create threads on Linux?
More specifically,
I used LD_PRELOAD to intercept the pthread_create call; the code fragment is shown below:
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg){
    static void *handle = NULL;
    static P_CREATE old_create=NULL;
    if( !handle )
    {
        /* Look up the real pthread_create once and cache it. */
        handle = dlopen("libpthread.so.0", RTLD_LAZY);
        old_create = (P_CREATE)dlsym(handle, "pthread_create");
    }
    pthread_t tmp=pthread_self();
    //print pthread_t pid
    /* Forward the caller's argument unchanged to the real function. */
    int result=old_create(thread,attr,start_routine,arg);
    //print thread pid
    return result;
}
In this way, I captured all the thread creations. The same goes for clone, but clone was actually never called by the application. Sometimes I got a parent-child thread pair in which the parent thread had not been printed before. So I don't know whether there are other ways to create this parent thread.
Even more specifically, the application on top is a MapReduce job on JVM 1.7. I want to observe all the threads and processes and their relations.
Thank you.

(moving from the comment)
LD_PRELOAD tricks just let you intercept C calls to external libraries - in this particular case to libpthread (for pthread_create) and to libc (for fork/clone); but to create threads a program can bypass them completely and talk straight to the kernel, by invoking such syscalls (clone in particular) using int 80h or sysenter (on 32-bit x86) or the syscall instruction (on amd64).
Straight syscalls cannot be intercepted that easily; you generally need the help of the kernel itself, which generally happens through the ptrace interface, which incidentally is how stuff like strace and debuggers are implemented. You should look in particular at the PTRACE_O_TRACECLONE, PTRACE_O_TRACEVFORK and PTRACE_O_TRACEFORK options to trace the creation of new processes/threads, or straight PTRACE_SYSCALL to stop on all syscalls.
The setup of ptrace is a bit laborious and I don't have much time now, but there are several examples on the Internet of the basic ptrace loop that you'll surely be able to find/adapt to your objective.
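
For orientation, a basic ptrace loop using those options might look roughly like the sketch below. It is Linux-specific and deliberately simplified: the function name trace_creations is invented, the loop waits on every child of the calling process, and it swallows all SIGSTOP and SIGTRAP stops instead of telling genuine signals apart from ptrace-generated ones. Treat it as a starting point under those assumptions, not a finished tracer.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv under ptrace and prints every fork/vfork/clone it observes.
 * Returns the number of creation events seen, or -1 on setup failure. */
int trace_creations(char *const argv[])
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(126);
        raise(SIGSTOP);            /* stop so the parent can set options */
        execvp(argv[0], argv);
        _exit(127);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    long opts = PTRACE_O_TRACECLONE | PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK;
    if (ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)opts) < 0) {
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return -1;
    }
    ptrace(PTRACE_CONT, child, NULL, NULL);

    int events = 0;
    pid_t pid;
    /* New tasks are attached automatically; loop until none are left. */
    while ((pid = waitpid(-1, &status, __WALL)) > 0) {
        if (!WIFSTOPPED(status))
            continue;              /* a tracee exited or was killed */
        int sig = WSTOPSIG(status);
        int event = status >> 16;  /* nonzero for PTRACE_EVENT_* stops */
        long deliver = 0;
        if (sig == SIGTRAP && (event == PTRACE_EVENT_CLONE ||
                               event == PTRACE_EVENT_FORK ||
                               event == PTRACE_EVENT_VFORK)) {
            unsigned long new_id = 0;
            /* The event message holds the id of the new thread/process. */
            ptrace(PTRACE_GETEVENTMSG, pid, NULL, &new_id);
            printf("%d created %lu\n", (int)pid, new_id);
            events++;
        } else if (sig != SIGTRAP && sig != SIGSTOP) {
            deliver = sig;         /* pass ordinary signals through */
        }
        ptrace(PTRACE_CONT, pid, NULL, (void *)deliver);
    }
    return events;
}
```

Because the stops come from the kernel, this also sees threads created by a raw clone syscall, which an LD_PRELOAD wrapper cannot observe.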

using pthread in C?

I am trying to write a program that will continuously take readings from a sensor that monitors water level. Then, every 10-15 minutes, it will need to take soil moisture readings from other sensors. I have never used POSIX pthreads before. This is what I have so far as a concept of how it may work.
It seems to be working the way I want it to, but is it a correct way to implement this? Is there anything else I need to do?
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void funct(void);   /* declared before first use */

void *soilMoisture(void *vargp)
{
    sleep(10);
    funct();
    return NULL;
}

int main()
{
    pthread_t pt;
    int k=1;
    pthread_create(&pt, NULL, soilMoisture, NULL);
    while(k>0)
    {
        printf("This is the main thread (Measuring the water level) : %d\n", k);
        sleep(1);
    }
    return 0;
}

void funct(void)
{
    printf("******(Measuring soil moisture after sleeping for 10SEC)***********\n");
    /* Starts the next soil-moisture thread; the current one then exits. */
    pthread_t ptk;
    pthread_create(&ptk, NULL, soilMoisture, NULL);
}

It is not clear why you create a new thread every 10 seconds rather than just letting the original continue. Since the original thread exits, you aren't directly accumulating threads, but you aren't waiting for any of them, so there are some resources unreleased. You also aren't error checking, so you won't know when anything does go wrong; monitoring will simply stop.
You will eventually run out of space, one way or another. You have a couple of options.
Don't create a new thread every 10 seconds. Leave the thread running by making a loop in the soilMoisture() function and do away with funct() — or at least the pthread_create() call in it.
If you must create new threads, make them detached. You'll need to create a non-default pthread_attr_t using the functions outlined and linked to in When pthread_attr_t is not NULL.
There are a myriad issues you've not yet dealt with, notably synchronization between the two threads. If you don't have any such synchronization, you'd be better off with two separate programs; the Unix mantra of "each program does one job but does it well" still applies. You'd have one program to do the soil moisture reading, and the other to do the water level reading. You'll need to decide whether data is stored in a database or otherwise logged, and for how long such data is kept. You'll need to think about rotating logs. What should happen if sensors go off-line? How can you restart threads or processes? How can you detect when threads or processes lock up unexpectedly or exit unexpectedly? Etc.
I assume the discrepancy between 10-15 minutes mentioned in the question and 10 seconds in the code is strictly for practical testing rather than a misunderstanding of the POSIX sleep() function.
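
The first option (one long-lived thread with a loop, plus error checking) could be sketched as follows. The struct soil_task, its fields and run_monitor are names made up for this example, and the fixed number of readings is only there so that the sketch terminates; a real monitor would presumably loop until told to stop.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical parameters for the soil-moisture thread. */
struct soil_task {
    int readings;          /* how many readings to take before exiting */
    unsigned interval_s;   /* seconds to sleep between readings */
    int taken;             /* readings taken so far */
};

/* One thread loops; no new thread is created per reading. */
static void *soil_moisture(void *arg)
{
    struct soil_task *t = arg;
    for (int i = 0; i < t->readings; i++) {
        sleep(t->interval_s);
        printf("Measuring soil moisture (%d)\n", i + 1);
        t->taken++;
    }
    return NULL;
}

/* Starts the soil thread, checks for errors, and waits for it to finish.
 * Returns the number of readings taken, or -1 on error. */
int run_monitor(int readings, unsigned interval_s)
{
    struct soil_task t = { readings, interval_s, 0 };
    pthread_t pt;
    int rc = pthread_create(&pt, NULL, soil_moisture, &t);
    if (rc != 0) {
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
        return -1;
    }
    /* The main thread would measure the water level here. */
    rc = pthread_join(pt, NULL);   /* releases the thread's resources */
    return rc == 0 ? t.taken : -1;
}
```

With the 10-second test interval from the question this would be called as run_monitor(n, 10), and with something like 600 to 900 seconds for the real 10-15 minute schedule.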

pthread_mutex_unlock not atomic

I've the following source code (adapted from my original code):
#include "stdafx.h"
#include <stdlib.h>
#include <stdio.h>
#include "pthread.h"

#define MAX_ENTRY_COUNT 4

int entries = 0;
bool start = false;
bool send_active = false;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t condNotEmpty = PTHREAD_COND_INITIALIZER;
pthread_cond_t condNotFull = PTHREAD_COND_INITIALIZER;

void send()
{
    for (;;) {
        if (!start)
            continue;
        start = false;
        for(int i = 0; i < 11; ++i) {
            send_active = true;
            pthread_mutex_lock(&mutex);
            while(entries == MAX_ENTRY_COUNT)
                pthread_cond_wait(&condNotFull, &mutex);
            entries++;
            pthread_cond_broadcast(&condNotEmpty);
            pthread_mutex_unlock(&mutex);
            send_active = false;
        }
    }
}

void receive(){
    for(int i = 0; i < 11; ++i){
        pthread_mutex_lock(&mutex);
        while(entries == 0)
            pthread_cond_wait(&condNotEmpty, &mutex);
        entries--;
        pthread_cond_broadcast(&condNotFull);
        pthread_mutex_unlock(&mutex);
    }
    if (send_active)
        printf("x");
}

int _tmain(int argc, _TCHAR* argv[])
{
    pthread_t s;
    pthread_create(&s, NULL, (void *(*)(void*))send, NULL);
    for (;;) {
        pthread_mutex_init(&mutex, NULL);
        pthread_cond_init(&condNotEmpty, NULL);
        pthread_cond_init(&condNotFull, NULL);
        start = true;
        receive();
        pthread_mutex_destroy(&mutex);
        mutex = NULL;
        pthread_cond_destroy(&condNotEmpty);
        pthread_cond_destroy(&condNotFull);
        condNotEmpty = NULL;
        condNotFull = NULL;
        printf(".");
    }
    return 0;
}
The problem is as follows: from time to time, the last unlock in the send function has not finished before the receive function continues. In my original code the mutexes are located in objects which are deleted after doing the job. If the send function has not finished its last unlock by then, the mutexes are invalid and my program fails in unlock.
The behavior can be easily reproduced by running the program: each time the "x" is displayed, the receive function has nearly finished and the send function "hangs" in the unlock call.
I've compiled with VS2008 and VS2010; the results are the same.
If pthread_mutex_unlock were atomic, this would solve the problem. How can I solve this issue? Any comments are welcome...
Best regards
Michael

Your printf("x") is a textbook race condition example.
After pthread_mutex_unlock() returns, the OS is free to not schedule this thread for any amount of time: ticks, seconds or days. You can't assume that send_active will be set to false in time.

pthread_mutex_unlock() must by definition release the mutex before it returns. The instant the mutex is released, another thread that's contending for the mutex could be scheduled. Note that even if pthread_mutex_unlock() could arrange to not release the mutex until just after it returned (what I think you mean by it being atomic), there would still be an equivalent race condition to whatever you're seeing now (it's not clear to me what race you're seeing, since a comment indicates that you're not really interested in the race condition of accessing send_active to control the printf() call).
In that case the other thread could be scheduled 'between-the-lines' of the pthread_mutex_unlock() and the following statement/expression in the function that called it - you'd have the same race condition.

Here's some speculation on what might be happening. A couple caveats on this analysis:
this is based on the assumption that you're using the Win32 pthreads package from http://sourceware.org/pthreads-win32/
this is based only on a pretty brief examination of the pthreads source on that site and the information in the question and comments here - I have not had an opportunity to actually try to run/debug any of the code.
When pthread_mutex_unlock() is called, it decrements a lock count, and if that lock count drops to zero, the Win32 SetEvent() API is called on an associated event object to let any threads waiting on the mutex to be unblocked. Pretty standard stuff.
Here's where the speculation comes in. Let's say that SetEvent() has been called to unblock thread(s) waiting on the mutex, and it signals the event associated with the handle it was given (as it should). However, before the SetEvent() function does anything else, another thread starts running and closes the event object handle that that particular SetEvent() was called with (by calling pthread_mutex_destroy()).
Now the SetEvent() call in progress has a handle value that's no longer valid. I can't think of a particular reason that SetEvent() would do anything with that handle after it's signaled the event, but maybe it does (I could also imagine someone making a reasonable argument that SetEvent() should be able to expect that the event object handle remain valid for the duration of the API call).
If this is what's happening to you (and that's a big if), I'm not sure if there's an easy fix. I think that the pthreads library would have to have changes made so that it duplicated the event handle before calling SetEvent() then close that duplicate when the SetEvent() call returned. That way the handle would remain valid even if the 'main' handle were closed by another thread. I'm guessing it would have to do this in a number of places. This could be implemented by replacing the affected Win32 API calls with calls to wrapper functions that perform the "duplicate handle/call API/close duplicate" sequence.
It might not be unreasonable for you to try to make this change for the SetEvent() call in pthread_mutex_unlock() and see if it solves (or at least improves) your particular problem. If so, you may want to contact the maintainer of the library to see if a more comprehensive fix might be arranged somehow (be prepared - you may be asked to do a significant part of the work).
Out of curiosity, in your debugging of the state of the thread that hangs in pthread_mutex_unlock()/SetEvent(), do you have any information on exactly what's happening? What is SetEvent() waiting for? (The cdb debugger that's in the Debugging Tools for Windows package might be able to give you more information about this than the Visual Studio debugger.)
Also, note the following comment in the source for pthread_mutex_destroy(), which seems related to (but different from) your particular problem:
/*
* FIXME!!!
* The mutex isn't held by another thread but we could still
* be too late invalidating the mutex below since another thread
* may already have entered mutex_lock and the check for a valid
* *mutex != NULL.
*
* Note that this would be an unusual situation because it is not
* common that mutexes are destroyed while they are still in
* use by other threads.
*/

Michael, thank you for your investigation and comments!
The code I'm using is from http://sourceware.org/pthreads-win32/.
The situation described by you in the third and fourth paragraph is exactly what's happening.
I've checked some solutions and a simple one seems to work for me: I wait until the send function (and SetEvent) has finished. All my tests with this solution have been successful so far. I'm going to do a larger test over the weekend.
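
One way to express "wait until the send function has finished" in portable pthreads code is to join the sender thread before destroying the mutex and condition variables. The sketch below restructures the example from the question along those lines; struct channel, sender and run_exchange are names invented for illustration, and it is not the code that was actually tested above.

```c
#include <pthread.h>

#define MAX_ENTRY_COUNT 4

/* Hypothetical object that owns the synchronization primitives. */
struct channel {
    pthread_mutex_t mutex;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
    int entries;
    int items;   /* number of items to transfer */
};

static void *sender(void *arg)
{
    struct channel *c = arg;
    for (int i = 0; i < c->items; i++) {
        pthread_mutex_lock(&c->mutex);
        while (c->entries == MAX_ENTRY_COUNT)
            pthread_cond_wait(&c->not_full, &c->mutex);
        c->entries++;
        pthread_cond_broadcast(&c->not_empty);
        pthread_mutex_unlock(&c->mutex);
    }
    return NULL;
}

/* Transfers `items` entries, then tears the channel down safely.
 * Returns the number of entries left over (expected 0), or -1 on error. */
int run_exchange(int items)
{
    struct channel c = { .entries = 0, .items = items };
    pthread_t s;

    pthread_mutex_init(&c.mutex, NULL);
    pthread_cond_init(&c.not_empty, NULL);
    pthread_cond_init(&c.not_full, NULL);
    if (pthread_create(&s, NULL, sender, &c) != 0)
        return -1;

    for (int i = 0; i < items; i++) {
        pthread_mutex_lock(&c.mutex);
        while (c.entries == 0)
            pthread_cond_wait(&c.not_empty, &c.mutex);
        c.entries--;
        pthread_cond_broadcast(&c.not_full);
        pthread_mutex_unlock(&c.mutex);
    }

    /* After the join, the sender has returned from its last unlock,
     * so nothing can still be inside the mutex when it is destroyed. */
    pthread_join(s, NULL);
    pthread_mutex_destroy(&c.mutex);
    pthread_cond_destroy(&c.not_empty);
    pthread_cond_destroy(&c.not_full);
    return c.entries;
}
```

The design point is ownership: whoever destroys the primitives first makes sure, through the join, that no other thread can still be using them.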
