Weird behaviour while resolving a deadlock in C

I had an exercise for class a few weeks ago; my solution worked, but I noticed some weird behaviour when observing it for a longer time.
The exercise was to generate a deadlock with two POSIX threads and then resolve it. (I abstracted the solution so it has no unnecessary code.)
The scenario is the following:
I have two threads that share two fictional resources
both threads start in sequence and then try to occupy both resources (in sequence too)
both threads have different time spans for occupying
when a thread has both resources it works for 5 seconds, then frees the resources and takes a break; when the break is over it starts again trying to occupy both resources
every 8 seconds a function checks whether both threads are in the waiting state (each thread holds ONE resource and is waiting for the second)
when a deadlock occurs, the thread that has worked more is cancelled and then restarted
Here comes the problem: depending on the machine and the compiler flags, the output says that e.g. thread A is cancelled but then thread B is started. I tried it on different computers with different compilers and different installations.
The weird thing is that when I compile with gcc -Wall -Werror -ansi -pedantic -D_POSIX_C_SOURCE=200809L -pthread -lrt the problem occurs with the second deadlock, but when I remove -Wall and -Werror the problem comes with the third deadlock.
I uploaded the source here. Compile flags are in the source, I tried gcc and clang.
And I also tried Ubuntu 13.04 and Arch.
Here is the output; I marked the relevant lines with "-->"
Did I forget something that makes this effect appear? I don't think there are bugs in the libraries.

The problem is that you are passing the address of a local variable to the thread, and that local variable may no longer exist when the thread starts: the thread then dereferences the address that used to hold the local variable but now holds something else.
Since that address is still in the program's stack space, you aren't getting a segfault.
Here's a highlight of the problem areas of code and how it can be caused:
void resolve_deadlock()
{
    void *pthread_exit_state;
    int id_a = THREAD_A;
    int id_b = THREAD_B;

    <some code to detect deadlocks and kill a thread>

    /* restart the killed thread */
    if (pthread_create(&threads[THREAD_B], NULL, &thread_function, (void *) &id_b) != 0) {
        perror("Create THREAD_B\n");
        exit(EXIT_FAILURE);
    }
}
So the program runs and:
resolve_deadlock is called
thread X is killed
pthread_create is called to create a thread
resolve_deadlock function ends
the stack is overwritten on the next function call
The OS swaps us out and runs another thread
thread X runs and dereferences our local var which no longer exists -> undefined behaviour.
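One common fix is to pass the small integer id by value, packed into the void * argument, so the new thread no longer depends on the caller's stack. A minimal sketch against the snippet above (it assumes the id fits in an int; a static or heap-allocated id that outlives the thread works too):

#include <pthread.h>
#include <stdint.h>

void *thread_function(void *arg)
{
    int id = (int)(intptr_t)arg;   /* unpack the id; nothing on the caller's stack is referenced */
    /* ... */
    return NULL;
}

/* in resolve_deadlock(): */
if (pthread_create(&threads[THREAD_B], NULL, &thread_function,
                   (void *)(intptr_t)THREAD_B) != 0) {
    perror("Create THREAD_B");
    exit(EXIT_FAILURE);
}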

C malloc "can't allocate region" error, but can't repro with GDB?

How can I debug a C application that does not crash when attached with gdb and run inside of gdb?
It crashes consistently when run standalone - even the same debug build!
A few of us are getting this error with a C program written for BSD/Linux, and we are compiling on macOS with OpenSSL.
app(37457,0x7000017c7000) malloc: *** mach_vm_map(size=13835058055282167808) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
ERROR: malloc(buf->length + 1) failed!
I know, not helpful.
Recompiling the application with -g -rdynamic gives the same error. Ok, so now we know it isn't because of a release build as it continues to fail.
It works when running within a gdb debugging session though!!
$ sudo gdb app
(gdb) b malloc_error_break
Function "malloc_error_break" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (malloc_error_break) pending.
(gdb) run -threads 8
Starting program: ~/code/app/app -threads 8
[New Thread 0x1903 of process 45436]
warning: unhandled dyld version (15)
And it runs for hours. Ctrl-C, then run ./app -threads 8 standalone, and it crashes after a second or two (a few million iterations).
Obviously there's an issue within one of the threads. But those workers for the threads are pretty big (a few hundred lines of code). Nothing stands out.
Note that the threads iterate over loops of about 20 million per second.
macOS 10.12.3
Homebrew w/GNU gcc and openssl (linking to crypto)
PS: I'm not too familiar with C, especially any kind of debugging. Be kind and expressive/verbose in answers. :)
One debugging technique that is sometimes overlooked is to include debug prints in the code; of course it has its disadvantages, but it also has advantages. One thing you must keep in mind in the face of abnormal termination is to make sure the printouts actually get printed. Often it's enough to print to stderr (but if that doesn't do the trick, you may need to fflush the stream explicitly).
Another trick is to stop the program before the error occurs. This requires you to know when the program is about to crash, preferably as close as possible. You do this by using raise:
raise(SIGSTOP);
This does not terminate the program; it just suspends execution. Now you can attach with gdb using the command gdb <program-name> <pid> (use ps to find the pid of the process). Then, in gdb, you have to tell it to ignore SIGSTOP:
> handle SIGSTOP ignore
Then you can set break-points. You can also step out of the raise function using the finish command (may have to be issued multiple times to return to your code).
This technique makes the program behave normally up to the point where you decide to stop it; hopefully the final part, running under gdb, will not alter the behaviour enough to matter.
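As a minimal sketch of where such a call might go (the printed message and the surrounding code are just placeholders):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* ... normal execution up to the point you want to inspect ... */
    fprintf(stderr, "stopping myself, attach with: gdb ./app %d\n", (int)getpid());
    raise(SIGSTOP);   /* the process suspends itself here until gdb (or SIGCONT) resumes it */
    /* ... the code you suspect of crashing follows here ... */
    return 0;
}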
A third option is to use valgrind. Normally when you see this kind of error there are problems involved that valgrind will pick up, such as out-of-range accesses and uninitialised variables.
Many memory managers initialise memory to a known bad value to expose problems like this (e.g. Microsoft's CRT uses a range of values: 0xCD means uninitialised, 0xDD means already freed, etc.).
After each use of malloc, try memset'ing the memory to 0xCD (or some other constant value). This will allow you to identify uninitialised memory more easily with the debugger. Don't use 0x00, as this is a 'normal' value and will be harder to spot if it's wrong (it will also probably 'fix' your problem).
Something like:
void *memory = malloc(sizeof(my_object));
memset(memory, 0xCD, sizeof(my_object));
If you know the size of the blocks, you could do something similar before free (this is sometimes harder unless you know the size of your objects, or track it in some way):
memset(memory, 0xDD, sizeof(my_object));
free(memory);
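If it helps, one possible way to package that pattern (debug_malloc and debug_free are made-up names for this sketch):

#include <stdlib.h>
#include <string.h>

void *debug_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        memset(p, 0xCD, size);   /* mark as "allocated but not yet initialised" */
    return p;
}

void debug_free(void *p, size_t size)
{
    if (p != NULL)
        memset(p, 0xDD, size);   /* mark as "freed" so stale reads stand out */
    free(p);
}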

fatal error disappeared when running with gdb

I have a program which produces a fatal error with a testcase, and I can locate the problem by reading the log and the stack trace of the fatal error - it turns out that there is a read operation on a null pointer.
But when I try to attach gdb to it and set a breakpoint around the suspicious code, the null pointer just cannot be observed! The program works smoothly without any error.
This is a single-process, single-threaded program; I haven't experienced this kind of thing before. Can anyone give me some comments? Thanks.
Appended: I also tried calling the pause() syscall before the code that triggers the fatal error, expecting to make the program sleep before the fatal point and then attach gdb to it on the fly; sadly, no fatal error occurred.
It's only guesswork without looking at the code, but debuggers sometimes do this:
They initialize certain stuff for you
The timing of the operations is changed
I don't have a quote on GDB, but I do have one on valgrind (granted the two do wildly different things..)
My program crashes normally, but doesn't under Valgrind, or vice versa. What's happening?
When a program runs under Valgrind, its environment is slightly different to when it runs natively. For example, the memory layout is different, and the way that threads are scheduled is different. Most of the time this doesn't make any difference, but it can, particularly if your program is buggy.
Same would go for GDB.
So the true problem is likely in your program.
There can be several things happening. The timing of the application can be changed, so if it's a multi-threaded application it is possible that, for example, you first set the ready flag and then copy the data into the buffer; without a debugger attached, the other thread might access the buffer before the buffer is filled or before some pointer is set.
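For illustration, here is a contrived C sketch of the kind of flag-before-data race just described (buffer, ready, and the thread functions are made-up names):

#include <pthread.h>
#include <string.h>

static char buffer[64];
static volatile int ready = 0;

void *producer(void *arg)
{
    (void)arg;
    ready = 1;                    /* BUG: flag raised before the data is written */
    strcpy(buffer, "hello");
    return NULL;
}

void *consumer(void *arg)
{
    (void)arg;
    while (!ready)
        ;                         /* spin until the flag is set */
    /* with a debugger attached the timing often changes and buffer is already
       filled; standalone, this may read an empty or partial buffer */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}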
It could also be possible that the application has anti-debug functionality; maybe that piece of code is never reached when running inside a debugger.
One way to analyze it is with a core dump, which you can enable with ulimit -c unlimited. Then start the application, and when the core is dumped you can load it into gdb with gdb ./application ./core. You can find a useful write-up here: http://www.ffnn.nl/pages/articles/linux/gdb-gnu-debugger-intro.php
If it is an invalid read on a pointer, then unpredictable behaviour is possible. Since you already know what is causing the fault, you should get rid of it asap. In general, expect the unexpected when dealing with faulty pointer operations.

How to get more detailed backtrace [duplicate]

I am trying to print a backtrace when my C++ program terminates. The function that prints the backtrace looks like this:
void print_backtrace(void)
{
    void *tracePtrs[10];
    size_t count;

    count = backtrace(tracePtrs, 10);
    char **funcNames = backtrace_symbols(tracePtrs, count);

    for (int i = 0; i < count; i++)
        syslog(LOG_INFO, "%s\n", funcNames[i]);

    free(funcNames);
}
It gives output like this:
desktop program: Received SIGSEGV signal, last error is : Success
desktop program: ./program() [0x422225]
desktop program: ./program() [0x422371]
desktop program: /lib/libc.so.6(+0x33af0) [0x7f0710f75af0]
desktop program: /lib/libc.so.6(+0x12a08e) [0x7f071106c08e]
desktop program: ./program() [0x428895]
desktop program: /lib/libc.so.6(__libc_start_main+0xfd) [0x7f0710f60c4d]
desktop program: ./program() [0x4082c9]
Is there a way to get a more detailed backtrace with function names and line numbers, like gdb outputs?
Yes - pass the -rdynamic flag to the linker. It will cause the linker to put into the link tables the names of all the non-static functions in your code, not just the exported ones.
The price you pay is a very slightly longer startup time for your program. For small to medium programs you won't notice it. What you get is that backtrace() is able to give you the names of all the non-static functions in your backtrace.
However - BEWARE: there are several gotchas you need to be aware of:
backtrace_symbols allocates memory from malloc. If you got into a SIGSEGV due to malloc arena corruption (quite common), you will double fault here and never see your backtrace.
Depending on the platform this runs on (e.g. x86), the address/function name of the exact function where you crashed will be replaced in place on the stack with the return address of the signal handler. You need to get the right EIP of the crashed function from the signal handler parameters for those platforms.
syslog is not an async-signal-safe function. It might take a lock internally, and if that lock is held when the crash occurred (because you crashed in the middle of another call to syslog), you have a deadlock.
If you want to learn all the gory details, check out this video of me giving a talk about it at OLS: http://free-electrons.com/pub/video/2008/ols/ols2008-gilad-ben-yossef-fault-handlers.ogg
Feed the addresses to addr2line and it will show you the file name, line number, and function name.
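For example, assuming the binary was built with -g, something like addr2line -f -C -e ./program 0x422225 should print the function name and the file:line for that address (-f prints function names, -C demangles C++ symbols).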
If you're fine with only getting proper backtraces when running through valgrind, then this might be an option for you:
VALGRIND_PRINTF_BACKTRACE(format, ...):
It will give you the backtrace for all functions, including static ones.
The better option I have found is libbacktrace by Ian Lance Taylor:
https://github.com/ianlancetaylor/libbacktrace
backtrace_symbols() prints only exported symbols and could hardly be less portable, as it requires the GNU libc.
addr2line is nice as it includes file names and line numbers. But it fails as soon as the loader performs relocations; nowadays, as ASLR is common, it will fail very often.
libunwind alone will not allow one to print file names and line numbers. To do this, one needs to parse DWARF debugging information inside the ELF binary file. This can be done using libdwarf, though. But why bother when libbacktrace gives you everything required for free?
Create a pipe
fork()
Make child process execute addr2line
In parent process, convert the addresses returned from backtrace() to hexadecimal
Write the hex addresses to the pipe
Read back the output from addr2line and print/log it
Since you're doing all this from a signal handler, make sure to not use functionality which is not async-signal-safe. You can see a list of async-signal-safe POSIX functions here.
If you don't want to take the "signal a different process that runs gdb on you" approach, which I think gby is advocating, you can also slightly alter your code to call open() on a crash log file and then backtrace_symbols_fd() with the fd returned by open() - both functions are async-signal-safe according to the glibc manual. You'll still need -rdynamic, of course. Also, from what I've seen, you still sometimes need to run addr2line on some addresses that the backtrace*() functions won't be able to decode.
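As a rough sketch of that async-signal-safe variant (the file name, buffer size, and handler name are arbitrary choices here):

#include <execinfo.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void crash_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    int fd = open("crash.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

    (void)sig;
    if (fd != -1) {
        backtrace_symbols_fd(frames, n, fd);   /* writes directly to the fd, no malloc involved */
        close(fd);
    }
    _exit(EXIT_FAILURE);   /* _exit is async-signal-safe, exit() is not */
}

/* installed early on, e.g. in main(): signal(SIGSEGV, crash_handler); */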
Also note fork() is not async signal safe: http://article.gmane.org/gmane.linux.man/1893/match=fork+async, at least not on Linux. Neither is syslog(), as somebody already pointed out.
If you want a very detailed backtrace, you should use ptrace(2) to trace the process you want the backtrace of.
You will be able to see all the functions your process used, but you need some basic asm knowledge.

What can cause non-deterministic output in a program?

I have a bug in a multi-process program. The program receives input and instantly produces output; no network is involved, and it doesn't have any time references.
What makes the cause of this bug hard to track down is that it only happens sometimes.
If I constantly run it, it produces both correct and incorrect output, with no discernible order or pattern.
What can cause such non-deterministic behavior? Are there tools out there that can help? There is a possibility that there are uninitialized variables in play. How do I find those?
EDIT: Problem solved; thanks to everyone who suggested a
race condition.
I hadn't thought of it, mainly because I was sure that my design prevented it. The problem was that I used 'wait' instead of 'waitpid', so sometimes, when some process was lucky enough to finish before the one I was expecting, the correct order of things went wild.
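For reference, the distinction in a minimal sketch (assuming pid holds the specific child you meant to wait for):

#include <sys/types.h>
#include <sys/wait.h>

int status;

wait(&status);               /* reaps whichever child happens to exit first */
waitpid(pid, &status, 0);    /* blocks until that particular child exits    */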
You say it's a "multi-processes" program - could you be more specific? It may very well be a race condition in how you're handling the multiple processes.
If you could tell us more about how the processes interact, we might be able to come up with some possibilities. Note that although Artem's suggestion of using a debugger is fine in and of itself, you need to be aware that introducing a debugger may very well change the situation completely - particularly when it comes to race conditions. Personally I'm a fan of logging a lot, but even that can change the timing subtly.
The scheduler!
Basically, when you have multiple processes, they can run in any bizarre order they want. If those processes are sharing a resource that they are both reading and writing from (whether it be a file or memory or an IO device of some sort), ops are going to get interleaved in all sorts of weird orders. As a simple example, suppose you have two threads (they're threads so they share memory) and they're both trying to increment a global variable, x.
y = x + 1;
x = y
Now run those processes, but interleave the code in this way
Assume x = 1
P1:
y = x + 1
So now in P1, for variable y which is local and on the stack, y = 2. Then the scheduler comes in and starts P2
P2:
y = x + 1
x = y
x was still 1 coming into this, so 1 has been added to it and now x = 2
Then P1 finishes
P1:
x = y
and x is still 2! We incremented x twice but only got one increment. And because we don't know how this interleaving will happen, it's referred to as non-deterministic behaviour.
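If it helps to see it end to end, here is a small runnable pthreads sketch of that lost update (the names and the iteration count are mine; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

static int x = 0;

static void *incrementer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        int y = x + 1;   /* read x into a local */
        x = y;           /* write it back; another thread may have run in between */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, incrementer, NULL);
    pthread_create(&b, NULL, incrementer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* without any locking, the printed value is usually less than the expected total */
    printf("x = %d (expected %d)\n", x, 2 * ITERATIONS);
    return 0;
}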
The good news is, you've stumbled upon one of the hardest problems in Systems programming as well as the primary battle cry of many of the functional language folks.
You're most likely looking at a race condition, i.e. an unpredictable and therefore hard to reproduce and debug interaction between improperly synchronized threads or processes.
The non-determinism in this case stems from process/thread and memory access scheduling. This is unpredictable because it is influenced by a large number of external factors, including network traffic and user input which constantly cause interrupts and lead to different actual sequences of execution in the program's threads each time it's run.
It could be a lot of things: memory leaks, critical-section access, unclosed resources, unclosed connections, etc. There is one tool which can help you - a debugger. Or try examining your algorithm to find the bug, or, if you have managed to pinpoint the problematic part, you can paste a snippet here and we will try to help you.
Start with the basics... make sure that all your variables have a default value and that all dynamic memory is zeroed out before you use it (i.e. use calloc rather than malloc). There should be a compiler option to flag this (unless you're using some obscure compiler).
If this is C++ (I know it's supposed to be a C question), there are times where object creation and initialization lagging behind variable assignment can bite you. For example, if you have a value that is used concurrently by multiple threads (as in a singleton or a global var), this can cause issues:
if (!foo)
    foo = new Foo();
If you have multiple threads accessing the above, the first thread finds foo == null, starts the object creation and assignment, and then yields. Another thread comes in, finds foo != null (even though the object may not be fully set up yet), skips the section, and starts to use foo. A safer lazy-initialisation pattern is sketched below.
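For what it's worth, a C-flavoured sketch of a lazy initialisation that avoids this race, using pthread_once (struct foo and its contents are placeholders):

#include <pthread.h>
#include <stdlib.h>

struct foo { int value; };

static struct foo *foo_instance;
static pthread_once_t foo_once = PTHREAD_ONCE_INIT;

static void init_foo(void)
{
    foo_instance = malloc(sizeof *foo_instance);   /* runs exactly once, even with concurrent callers */
    foo_instance->value = 42;                      /* placeholder initialisation */
}

struct foo *get_foo(void)
{
    pthread_once(&foo_once, init_foo);
    return foo_instance;
}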
We'd need to see specifics about your code to be able to give a more accurate answer, but to be concise: when you have a program that coordinates multiple processes or multiple threads, the variability in when the threads execute can add indeterminacy to your application. Essentially, the scheduling that the OS does can cause processes and threads to execute out of order. Depending on your environment and code, the OS's scheduling can cause wildly different results. You can search Google for more information about out-of-order execution with multithreading; it's a large topic.
By "multi-process" do you mean multi-threaded? If we had two threads that do this routine
int i = 1;
while (1)
{
    printf("%d", i++);
    if (i > 4) i = 1;
}
Normally we'd expect the output to be something like
112233441122334411223344
But actually we'd be seeing something like
11232344112233441231423
This is because each thread gets to use the CPU at different times. (There's a whole lot of complexity behind the scheduling, and I'm not equipped to explain the technical details.) Suffice to say, from the average person's point of view the scheduling is pretty random.
This is an example of race condition mentioned in other comments.

Debugging a clobbered static variable in C (gdb broken?)

I've done a lot of programming but not much in C, and I need advice on debugging. I have a static variable (file scope) that is being clobbered after about 10-100 seconds of execution of a multithreaded program (using pthreads on OS X 10.4). My code looks something like this:
static float some_values[SIZE];
static int * addr;
addr points to a valid memory address for a while, and then gets clobbered with some value (sometimes 0, sometimes nonzero), thereby causing a segfault when dereferenced. Poking around with gdb, I have verified that addr is laid out in memory immediately after some_values, as one would expect, so my first guess would be that I used an out-of-bounds index to write to some_values. However, this is a tiny file, so it is easy to check that this is not the problem.
The obvious debugging technique would be to set a watchpoint on the variable addr. But doing so seems to create erratic and inexplicable behavior in gdb. The watchpoint gets triggered at the first assignment to addr; then after I continue execution, I immediately get a nonsensical segfault in another thread...supposedly a segfault on accessing the address of a static variable in a different part of the program! But then gdb lets me read from and write to that memory address interactively.
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x001d5bd0
0x0000678d in receive (arg=0x0) at mainloop.c:39
39 sample_buf_cleared ++;
(gdb) p &sample_buf_cleared
$17 = (int *) 0x1d5bd0
(gdb) p sample_buf_cleared
$18 = 1
(gdb) set sample_buf_cleared = 2
(gdb)
gdb is obviously confused. Does anyone know why? Or does anyone have any suggestions for debugging this bug without using watchpoints?
You could put an array of uints between some_values and addr and determine whether you are overrunning some_values, or whether the corruption affects more addresses than you first thought. I would initialize the padding to 0xDEADBEEF or some other obvious pattern that is easy to distinguish and unlikely to occur in the program. If a value in the padding changes, then cast it to float and see if the number makes sense as a float.
static float some_values[SIZE];
static unsigned int padding[1024];
static int * addr;
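A rough sketch of initialising and checking that guard (init_padding and check_padding are made-up helper names; padding refers to the array declared above):

#include <stdio.h>
#include <string.h>

/* fill the guard once at startup... */
static void init_padding(void)
{
    for (size_t i = 0; i < sizeof padding / sizeof padding[0]; i++)
        padding[i] = 0xDEADBEEF;
}

/* ...and call this periodically (or from gdb) to see whether anything scribbled on it */
static void check_padding(void)
{
    for (size_t i = 0; i < sizeof padding / sizeof padding[0]; i++) {
        if (padding[i] != 0xDEADBEEF) {
            float as_float;
            memcpy(&as_float, &padding[i], sizeof as_float);
            fprintf(stderr, "padding[%zu] clobbered: 0x%08x (as float: %f)\n",
                    i, padding[i], as_float);
        }
    }
}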
Run the program multiple times. In each run disable a different thread and see when the problems goes away.
Set the program's process affinity to a single core and then try the watchpoint. You may have better luck if you don't have two threads simultaneously modifying the value. NOTE: this does not prevent that from happening; it may just make the problem easier to catch in a debugger.
static variables and multi-threading generally do not mix.
Without seeing your code (you should include your threaded code), my guess is that you have two threads concurrently writing to the addr variable. That doesn't work.
You either need to:
create separate instances of addr for each thread; or
provide some sort of synchronisation around addr to stop two threads changing the value at the same time.
Try using valgrind; I haven't tried valgrind on OS X, and I don't understand your problem, but "try valgrind" is the first thing I think of when you say "clobbered".
One thing you could try would be to create a separate thread whose only purpose is to watch the value of addr, and to break when it changes. For example:
static int * volatile addr; // volatile here is important, and must be after the *
void *addr_thread_proc(void *arg)
{
    while (1)
    {
        int *old_value = addr;
        while (addr == old_value)
            /* spin */;
        __asm__("int3"); // break into the debugger, or raise SIGTRAP if no debugger
    }
}
...
pthread_t spin_thread;
pthread_create(&spin_thread, NULL, &addr_thread_proc, NULL);
Then, whenever the value of addr changes, the int3 instruction will run, which will break into the debugger, stopping all threads.
gdb often acts weird with multithreaded programs. Another solution (if you can afford it) would be to put printf()s all over the place to try and catch the moment where your value gets clobbered. Not very elegant, but sometimes effective.
I have not done any debugging on OSX, but I have seen the same behavior in GDB on Linux: program crashes, yet GDB can read and write the memory which program just tried to read/write unsuccessfully.
This doesn't necessarily mean GDB is confused; rather the kernel allowed GDB to read/write memory via ptrace() which the inferior process is not allowed to read or write. IOW, it was a (recently fixed) kernel bug.
Still, it sounds like GDB watchpoints aren't working for you for whatever reason.
One technique you could use is to mmap space for some_values rather than statically allocating space for them, arrange for the array to end on a page boundary, and arrange for the next page to be non-accessible (via mprotect).
If any code tries to access past the end of some_values, it will get an exception (effectively you are setting a non-writable "watch point" just past some_values).
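A rough sketch of that arrangement (the helper name is made up; on Linux the mapping flag may be spelled MAP_ANONYMOUS instead of MAP_ANON):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* returns a pointer to `count` floats laid out so that the array ends exactly
   at a page boundary, with the following page made inaccessible */
static float *alloc_guarded_floats(size_t count)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + page - 1) / page * page;

    char *base = mmap(NULL, rounded + page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* any write past the end of the array now faults immediately */
    mprotect(base + rounded, page, PROT_NONE);

    return (float *)(base + rounded - bytes);
}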
