I am trying to implement a user level thread library and need to schedule threads in a round robin fashion. I am currently trying to make switching work for 2 threads that I have created using makecontext, getcontext and swapcontext. setitimer with ITIMER_PROF value is used and sigaction is assigned a handler to schedule a new thread whenever the SIGPROF signal is generated.
However, the signal handler is not invoked and the threads therefore never get scheduled. What could be the reason? Here are some snippets of the code:
void userthread_init(long period){
/*long time_period = period;
//Includes all the code like initializing the timer and attaching the signal
// handler function "schedule()" to the signal SIGPROF.
// create a linked list of threads - each thread's context gets added to the list/updated in the list
// in userthread_create*/
struct itimerval it;
struct sigaction act;
act.sa_flags = SA_SIGINFO;
act.sa_sigaction = &schedule;
sigemptyset(&act.sa_mask);
sigaction(SIGPROF,&act,NULL);
time_period = period;
it.it_interval.tv_sec = 4;
it.it_interval.tv_usec = period;
it.it_value.tv_sec = 1;
it.it_value.tv_usec = 100000;
setitimer(ITIMER_PROF, &it,NULL);
//for(;;);
}
The above code initializes a timer and attaches the handler schedule() to the signal SIGPROF. I am assuming that SIGPROF will be delivered to the process, which will then invoke schedule(). The schedule() function is given below:
void schedule(int sig, siginfo_t *siginf, ucontext_t* context1){
printf("\nIn schedule");
ucontext_t *ucp = NULL;
ucp = malloc(sizeof(ucontext_t));
getcontext(ucp);
//ucp = &sched->context;
sched->context = *context1;
if(sched->next != NULL){
sched = sched->next;
}
else{
sched = first;
}
setcontext(&sched->context);
}
I have a queue of ready threads in which their respective contexts are stored. Each thread should get scheduled whenever the setcontext call is executed. However, schedule() is never invoked! Can anyone please point out my mistake?
Completely revising this answer after looking at the code. There are a few issues:
There are several compiler warnings
You are never initializing your thread IDs, either outside or inside your thread creation method, so I'm surprised the code even works!
You are reading from uninitialized memory in your gtthread_create() function. I tested on both OSX and Linux: on OSX it crashes, while on Linux, by some miracle, it happens to be initialized.
In some places you call malloc() and then overwrite the returned pointer with a pointer to something else, leaking memory.
Your threads don't remove themselves from the linked list after they've finished, so weird things happen after the routines finish.
When I add in the while(1) loop, I do see schedule() being called and output from thread 2, but thread 1 vanishes into thin air (probably because of the uninitialized thread ID). I think your code needs a major cleanup.
Here's what I'd suggest:
Fix ALL of your compiler warnings — even if you think they don't matter, the noise may lead to you missing things (such as incompatible pointer types, etc). You're compiling with -Wall & -pedantic; that's a good thing - so now take the next step & fix them.
Put \n at the END of your printf statements, not the start — The two threads ARE outputting to stdout, but it's not getting flushed so you can't see it. Change your printf("\nMessage"); calls to printf("Message\n");
Use Valgrind to detect memory issues — valgrind is the single most amazing tool you will ever use for C/C++ development. It's available through apt-get & yum. Instead of running ./test1, run valgrind ./test1 and it will highlight memory corruption, memory leaks, uninitialized reads, etc. I can't stress this enough; Valgrind is amazing.
If a system call returns a value, check it — in your code, check the return values to all of getcontext, swapcontext, sigaction, setitimer
Only call async-signal-safe methods from your scheduler (or any signal handler) — so far you've fixed malloc() and printf() from inside your scheduler. Check out the signal(7) man page - see "Async-signal-safe functions"
Modularize your code — your linked list implementation could be tidier, and if it was separated out, then 1) your scheduler would have less code & be simpler, and 2) you can isolate issues in your linked list without having to debug scheduler code.
You're almost there, so keep at it - but keep in mind these three simple rules:
Clean as you go
Keep the compiler warnings fixed
When weird things are happening, use valgrind
Good luck!
Old answer:
You should check the return value of any system call. Whether or not it helps you find the answer, you should do it anyway :)
Check the return value of sigaction(), if it's -1, check errno. sigaction() can fail for a few reasons. If your signal handler isn't getting fired, it's possible it hasn't been set up.
Edit: and make sure you check the return of setitimer() too!
Edit 2: Just a thought, can you try getting rid of the malloc()? malloc is not signal safe. eg: like this:
void schedule(int sig, siginfo_t *siginf, ucontext_t* context1){
printf("In schedule\n");
getcontext(&sched->context);
if(sched->next != NULL){
sched = sched->next;
}
else{
sched = first;
}
setcontext(&sched->context);
}
Edit 3: According to this discussion, you can't use printf() inside a signal handler. You can try replacing it with a call to write(), which is async-signal safe:
// printf("In schedule\n");
const char message[] = "In schedule\n";
/* sizeof includes the terminating NUL, so write one byte less */
write( 1, message, sizeof( message ) - 1 );
Related
I have a small program that contains a variable that I need to malloc:
char **v;
v = (char**)malloc(sizeof(char *) * MAX_EVENTS);
for (int i = 0; i < MAX_EVENTS; i++)
v[i] = (char *)malloc(MAX_NAME_SIZE);
In order to make Valgrind happy, to avoid any memory leaks, I set up handlers for termination signals. This handler will simply free that allocation before exiting, as well as terminating child processes.
static void term_handler() {
if (v != NULL) {
for (int i = 0; i < MAX_EVENTS; i++) {
if (v[i] != NULL)
free(v[i]);
}
free(v);
}
for (int i = 0; i < MAX_PROCS; i++)
if (children[i])
kill(children[i], SIGTERM);
exit(EXIT_SUCCESS);
}
To access v from the handler, I put it as a global variable. children is a static array pid_t children[MAX_PROCS]; but could potentially be malloced as well.
What is the cleanest way to access those allocations from the handler? Having global variables is not recommended, but neither are memory leaks and improperly terminated programs.
Should I keep an array of pointers to my allocations as a global variable? Or should I just avoid handling unexpected signals?
Signal handlers are tricky, in that they are called asynchronously, and therefore there are only a small set of function calls that are safe to call from within a signal handler. In particular, allocating or freeing memory from within a signal-handler is a no-no (as is calling exit()!), so don't do it.
If you want to make sure the memory gets freed(*), however, you can do so by having your signal handler "tell" your program's main thread that it is time for it to exit. The main thread can then break out of its event loop, free the memory, and do any other cleanup work it would normally do before exiting.
So then the question becomes, how can a signal handler safely tell the main thread to perform a controlled/graceful exit?
If the main thread is running an event loop that executes on a fixed schedule (e.g. every so-many milliseconds), it may be as easy as declaring a global variable (e.g. volatile bool pleaseQuitNow = false;) that the main thread tests on each iteration of its event loop, and having the signal handler set that variable to a different value. The main thread will then see the changed variable on its next iteration and respond by breaking out of the event loop.
If the main thread's event loop is event-based, on the other hand (e.g. it is blocked inside select() or poll() or similar and the call won't return for some indefinite amount of time), then an alternate way to wake up the main thread would be to create a pipe() or socketpair() at program startup, and have the main thread watch one of the two file descriptors for read-ready status. Then when the signal handler runs, it can write() a byte to the other file descriptor (send() also works if it is a socket), which will cause the first file descriptor to indicate ready-for-read status. The main thread can respond to that ready-for-read status by breaking out of its event loop and exiting gracefully.
In addition to avoiding async-signal-unsafe calls, the benefit of doing it this way is that you have only one shutdown/cleanup-path to test/debug/maintain, instead of two.
(*) Of course on any modern OS the memory will get freed anyway, by the OS's process-cleanup routines; but valgrind will complain about memory leaks, so it's better to free the memory manually if possible, if only so that you can use valgrind to find "real" memory leaks without having to sort through a bunch of false-positives every time.
I've written a program that uses SIGALRM and a signal handler.
I'm now trying to add this as a test module within the kernel.
I found that I had to replace a lot of the functions that libc provides with their underlying syscalls, for example timer_create with sys_timer_create, timer_settime with sys_timer_settime and so on.
However, I'm having issues with sigaction.
Compiling the kernel throws the following error
arch/arm/mach-vexpress/cpufreq_test.c:157:2: error: implicit declaration of function 'sys_sigaction' [-Werror=implicit-function-declaration]
I've attached the relevant code block below
int estimate_from_cycles() {
timer_t timer;
struct itimerspec old;
struct sigaction sig_action;
struct sigevent sig_event;
sigset_t sig_mask;
memset(&sig_action, 0, sizeof(struct sigaction));
sig_action.sa_handler = alarm_handler;
sigemptyset(&sig_action.sa_mask);
VERBOSE("Blocking signal %d\n", SIGALRM);
sigemptyset(&sig_mask);
sigaddset(&sig_mask, SIGALRM);
if(sys_sigaction(SIGALRM, &sig_action, NULL)) {
ERROR("Could not assign sigaction\n");
return -1;
}
if (sigprocmask(SIG_SETMASK, &sig_mask, NULL) == -1) {
ERROR("sigprocmask failed\n");
return -1;
}
memset (&sig_event, 0, sizeof (struct sigevent));
sig_event.sigev_notify = SIGEV_SIGNAL;
sig_event.sigev_signo = SIGALRM;
sig_event.sigev_value.sival_ptr = &timer;
if (sys_timer_create(CLOCK_PROCESS_CPUTIME_ID, &sig_event, &timer)) {
ERROR("Could not create timer\n");
return -1;
}
if (sigprocmask(SIG_UNBLOCK, &sig_mask, NULL) == -1) {
ERROR("sigprocmask unblock failed\n");
return -1;
}
cycles = 0;
VERBOSE("Entering main loop\n");
if(sys_timer_settime(timer, 0, &time_period, &old)) {
ERROR("Could not set timer\n");
return -1;
}
while(1) {
ADD(CYCLES_REGISTER, 1);
}
return 0;
}
Is such an approach of taking user-space code and changing the calls alone sufficient to run the code in kernel-space?
Is such an approach of taking user-space code and changing the calls
alone sufficient to run the code in kernel-space?
Of course not! What you are doing is calling the implementation of a system call directly from kernel space, but there is no guarantee that the sys_* function has the same function definition as the system call. The correct approach is to find the kernel routine that does what you need. Unless you are writing a driver or a kernel feature, you don't need to write kernel code. System calls should only be invoked from user space. Their main purpose is to offer a safe way to access the low-level mechanisms offered by an operating system, such as the file system, sockets and so on.
Regarding signals: trying to use the signal system calls from kernel space in order to receive a signal is a TERRIBLE idea. A process sends a signal to another process, and signals are meant to be used in user space, that is, between user-space processes. Typically, when you send a signal to another process and the signal is not masked, the receiving process is interrupted and the signal handler is executed. Note that achieving this requires two switches between user space and kernel space.
However, the kernel has its internal tasks, which have exactly the same structure as user-space ones, with some differences (e.g. memory mapping, parent process, etc.). Of course you cannot send a signal from a user process to a kernel thread (imagine what would happen if you sent SIGKILL to a crucial component). Since kernel threads have the same structure as user-space threads, they can receive signals, but their default behaviour is to drop them unless specified otherwise.
I'd recommend changing your code to send a signal from kernel space to user space rather than trying to receive one. (How would you send a signal to kernel space? Which pid would you specify?) This may be a good starting point: http://people.ee.ethz.ch/~arkeller/linux/kernel_user_space_howto.html#toc6
You are having a problem with sys_sigaction because it is the old definition of the system call. The current definition is sys_rt_sigaction.
From the 3.12 kernel source:
#ifdef CONFIG_OLD_SIGACTION
asmlinkage long sys_sigaction(int, const struct old_sigaction __user *,
struct old_sigaction __user *);
#endif
#ifndef CONFIG_ODD_RT_SIGACTION
asmlinkage long sys_rt_sigaction(int,
const struct sigaction __user *,
struct sigaction __user *,
size_t);
#endif
BTW, you should not call any of them, they are meant to be called from user space.
You're working in kernel space so you should start thinking like you're working in kernel space instead of trying to port a userspace hack into the kernel. If you need to call the sys_* family of functions in kernel space, 99.95% of the time, you're already doing something very, very wrong.
Instead of while (1), have the loop exit on a volatile variable, and start a thread that simply sleeps and then changes the value of the variable when it finishes.
I.e.
void some_function(volatile int *condition) {
sleep(x);
*condition = 0;
}
volatile int condition = 1;
start_thread(some_function, &condition);
while(condition) {
ADD(CYCLES_REGISTER, 1);
}
However, what you're doing (I'm assuming you're trying to get the number of cycles the CPU is operating at) is inherently impossible on a preemptive kernel like Linux without a lot of hacking. If you keep interrupts on, your cycle count will be inaccurate since your kernel thread may be switched out at any time. If you turn interrupts off, other threads won't run and your code will just infinite loop and hang the kernel.
Are you sure you can't simply use the BogoMIPS value from the kernel? It is essentially what you're trying to measure, but the kernel does it very early in the boot process and does it right.
This question is based on:
When is it safe to destroy a pthread barrier?
and the recent glibc bug report:
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
I'm not sure about the semaphores issue reported in glibc, but presumably it's supposed to be valid to destroy a barrier as soon as pthread_barrier_wait returns, as per the above linked question. (Normally, the thread that got PTHREAD_BARRIER_SERIAL_THREAD, or a "special" thread that already considered itself "responsible" for the barrier object, would be the one to destroy it.) The main use case I can think of is when a barrier is used to synchronize a new thread's use of data on the creating thread's stack, preventing the creating thread from returning until the new thread gets to use the data; other barriers probably have a lifetime equal to that of the whole program, or controlled by some other synchronization object.
In any case, how can an implementation ensure that destruction of the barrier (and possibly even unmapping of the memory it resides in) is safe as soon as pthread_barrier_wait returns in any thread? It seems the other threads that have not yet returned would need to examine at least some part of the barrier object to finish their work and return, much like how, in the glibc bug report cited above, sem_post has to examine the waiters count after having adjusted the semaphore value.
I'm going to take another crack at this with an example implementation of pthread_barrier_wait() that uses mutex and condition variable functionality as might be provided by a pthreads implementation. Note that this example doesn't try to deal with performance considerations (specifically, when the waiting threads are unblocked, they are all re-serialized when exiting the wait). I think that using something like Linux Futex objects could help with the performance issues, but Futexes are still pretty much out of my experience.
Also, I doubt that this example handles signals or errors correctly (if at all in the case of signals). But I think proper support for those things can be added as an exercise for the reader.
My main fear is that the example may have a race condition or deadlock (the mutex handling is more complex than I like). Also note that it is an example that hasn't even been compiled. Treat it as pseudo-code. Also keep in mind that my experience is mainly in Windows - I'm tackling this more as an educational opportunity than anything else. So the quality of the pseudo-code may well be pretty low.
However, disclaimers aside, I think it may give an idea of how the problem asked in the question could be handled (ie., how can the pthread_barrier_wait() function allow the pthread_barrier_t object it uses to be destroyed by any of the released threads without danger of using the barrier object by one or more threads on their way out).
Here goes:
/*
* Since this is a part of the implementation of the pthread API, it uses
* reserved names that start with "__" for internal structures and functions
*
* Functions such as __mutex_lock() and __cond_wait() perform the same function
* as the corresponding pthread API.
*/
// struct __barrier_waitdata is intended to hold all the data
// that `pthread_barrier_wait()` will need after releasing
// waiting threads. This will allow the function to avoid
// touching the passed in pthread_barrier_t object after
// the wait is satisfied (since any of the released threads
// can destroy it)
struct __barrier_waitdata {
struct __mutex cond_mutex;
struct __cond cond;
unsigned waiter_count;
int wait_complete;
};
struct __barrier {
unsigned count;
struct __mutex waitdata_mutex;
struct __barrier_waitdata* pwaitdata;
};
typedef struct __barrier pthread_barrier_t;
int __barrier_waitdata_init( struct __barrier_waitdata* pwaitdata)
{
int rc;
pwaitdata->waiter_count = 0;
pwaitdata->wait_complete = 0;
// as with the pthread API, a non-zero return code means failure
rc = __mutex_init( &pwaitdata->cond_mutex, NULL);
if (rc) {
return rc;
}
rc = __cond_init( &pwaitdata->cond, NULL);
if (rc) {
__mutex_destroy( &pwaitdata->cond_mutex);
return rc;
}
return 0;
}
int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned int count)
{
int rc;
rc = __mutex_init( &barrier->waitdata_mutex, NULL);
if (rc) return rc; // non-zero means failure
barrier->pwaitdata = NULL;
barrier->count = count;
//TODO: deal with attr
return 0;
}
int pthread_barrier_wait(pthread_barrier_t *barrier)
{
int rc;
struct __barrier_waitdata* pwaitdata;
unsigned target_count;
// potential waitdata block (only one thread's will actually be used)
struct __barrier_waitdata waitdata;
// nothing to do if we only need to wait for one thread...
if (barrier->count == 1) return PTHREAD_BARRIER_SERIAL_THREAD;
rc = __mutex_lock( &barrier->waitdata_mutex);
if (rc) return rc;
if (!barrier->pwaitdata) {
// no other thread has claimed the waitdata block yet -
// we'll use this thread's
rc = __barrier_waitdata_init( &waitdata);
if (rc) {
__mutex_unlock( &barrier->waitdata_mutex);
return rc;
}
barrier->pwaitdata = &waitdata;
}
pwaitdata = barrier->pwaitdata;
target_count = barrier->count;
// all data necessary for handling the return from a wait is pointed to
// by `pwaitdata`, and `pwaitdata` points to a block of data on the stack of
// one of the waiting threads. We have to make sure that the thread that owns
// that block waits until all others have finished with the information
// pointed to by `pwaitdata` before it returns. However, after the 'big' wait
// is completed, the `pthread_barrier_t` object that's passed into this
// function isn't used. The last operation done to `*barrier` is to set
// `barrier->pwaitdata = NULL` to satisfy the requirement that this function
// leaves `*barrier` in a state as if `pthread_barrier_init()` had been called - and
// that operation is done by the thread that signals the wait condition
// completion before the completion is signaled.
// note: we're still holding `barrier->waitdata_mutex`;
rc = __mutex_lock( &pwaitdata->cond_mutex);
pwaitdata->waiter_count += 1;
if (pwaitdata->waiter_count < target_count) {
// need to wait for other threads
__mutex_unlock( &barrier->waitdata_mutex);
do {
// TODO: handle the return code from `__cond_wait()` to break out of this
// if a signal makes that necessary
__cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
} while (!pwaitdata->wait_complete);
}
else {
// this thread satisfies the wait - unblock all the other waiters
pwaitdata->wait_complete = 1;
// 'release' our use of the passed in pthread_barrier_t object
barrier->pwaitdata = NULL;
// unlock the barrier's waitdata_mutex - the barrier is
// ready for use by another set of threads
__mutex_unlock( &barrier->waitdata_mutex);
// finally, unblock the waiting threads
__cond_broadcast( &pwaitdata->cond);
}
// at this point, barrier->waitdata_mutex is unlocked, the
// barrier->pwaitdata pointer has been cleared, and no further
// use of `*barrier` is permitted...
// however, each thread still has a valid `pwaitdata` pointer - the
// thread that owns that block needs to wait until all others have
// dropped the pwaitdata->waiter_count
// also, at this point the `pwaitdata->cond_mutex` is locked, so
// we're in a critical section
rc = 0;
pwaitdata->waiter_count--;
if (pwaitdata == &waitdata) {
// this thread owns the waitdata block - it needs to hang around until
// all other threads are done
// as a convenience, this thread will be the one that returns
// PTHREAD_BARRIER_SERIAL_THREAD
rc = PTHREAD_BARRIER_SERIAL_THREAD;
while (pwaitdata->waiter_count != 0) {
__cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
}
__mutex_unlock( &pwaitdata->cond_mutex);
__cond_destroy( &pwaitdata->cond);
__mutex_destroy( &pwaitdata->cond_mutex);
}
else {
// wake the owning thread once the last waiter has left, and always
// release cond_mutex so the remaining threads can make progress
if (pwaitdata->waiter_count == 0) {
__cond_signal( &pwaitdata->cond);
}
__mutex_unlock( &pwaitdata->cond_mutex);
}
return rc;
}
17 July 2011: Update in response to a comment/question about process-shared barriers
I forgot completely about the situation with barriers that are shared between processes. And as you mention, the idea I outlined will fail horribly in that case. I don't really have experience with POSIX shared memory use, so any suggestions I make should be tempered with scepticism.
To summarize (for my benefit, if no one else's):
When any of the threads gets control after pthread_barrier_wait() returns, the barrier object needs to be in the 'init' state (however the most recent pthread_barrier_init() on that object set it). Also implied by the API is that once any of the threads returns, one or more of the following things could occur:
another call to pthread_barrier_wait() to start a new round of synchronization of threads
pthread_barrier_destroy() on the barrier object
the memory allocated for the barrier object could be freed or unshared if it's in a shared memory region.
These things mean that before the pthread_barrier_wait() call allows any thread to return, it pretty much needs to ensure that all waiting threads are no longer using the barrier object in the context of that call. My first answer addressed this by creating a 'local' set of synchronization objects (a mutex and an associated condition variable) outside of the barrier object that would block all the threads. These local synchronization objects were allocated on the stack of the thread that happened to call pthread_barrier_wait() first.
I think that something similar would need to be done for barriers that are process-shared. However, in that case simply allocating those sync objects on a thread's stack isn't adequate (since the other processes would have no access). For a process-shared barrier, those objects would have to be allocated in process-shared memory. I think the technique I listed above could be applied similarly:
the waitdata_mutex that controls the 'allocation' of the local sync variables (the waitdata block) would be in process-shared memory already by virtue of it being in the barrier struct. Of course, when the barrier is set to PTHREAD_PROCESS_SHARED, that attribute would also need to be applied to the waitdata_mutex
when __barrier_waitdata_init() is called to initialize the local mutex & condition variable, it would have to allocate those objects in shared memory instead of simply using the stack-based waitdata variable.
when the 'cleanup' thread destroys the mutex and the condition variable in the waitdata block, it would also need to clean up the process-shared memory allocation for the block.
in the case where shared memory is used, there needs to be some mechanism to ensure that the shared memory object is opened at least once in each process, and closed the correct number of times in each process (but not closed entirely before every thread in the process is finished using it). I haven't thought through exactly how that would be done...
I think these changes would allow the scheme to operate with process-shared barriers. The last bullet point above is a key item to figure out. Another is how to construct a name for the shared memory object that will hold the 'local' process-shared waitdata. There are certain attributes you'd want for that name:
you'd want the storage for the name to reside in the struct pthread_barrier_t structure so all processes have access to it; that means a known limit to the length of the name
you'd want the name to be unique to each 'instance' of a set of calls to pthread_barrier_wait() because it might be possible for a second round of waiting to start before all threads have gotten all the way out of the first round waiting (so the process-shared memory block set up for the waitdata might not have been freed yet). So the name probably has to be based on things like process id, thread id, address of the barrier object, and an atomic counter.
I don't know whether or not there are security implications to having the name be 'guessable'. if so, some randomization needs to be added - no idea how much. Maybe you'd also need to hash the data mentioned above along with the random bits. Like I said, I really have no idea if this is important or not.
As far as I can see there is no need for pthread_barrier_destroy to be an immediate operation. You could have it wait until all threads that are still in their wakeup phase are woken up.
E.g. you could have an atomic counter, awakening, that is initially set to the number of threads that are being woken up. Each thread would then decrement it as its last action before pthread_barrier_wait returns. pthread_barrier_destroy could then simply spin until that counter falls to 0.
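As a rough sketch of that counter using C11 atomics: the struct toy_barrier and the three function names are made up for illustration, the rest of the barrier is elided, and this is not how any real pthreads implementation is known to do it. Note that it only protects against an early pthread_barrier_destroy(), not against the memory being unmapped without a destroy call.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical barrier with only the field relevant to this idea. */
struct toy_barrier {
    atomic_uint awakening;  /* threads woken but not yet out of wait() */
    /* ... the rest of the barrier state is elided ... */
};

/* Called by the releasing thread just before it wakes n waiters. */
static void toy_release_begin(struct toy_barrier *b, unsigned n)
{
    atomic_store(&b->awakening, n);
}

/* The very last thing each woken thread does inside wait(). */
static void toy_wait_exit(struct toy_barrier *b)
{
    unsigned prev = atomic_fetch_sub(&b->awakening, 1);

    assert(prev > 0);   /* more exits than woken threads would be a bug */
    (void)prev;
}

/* destroy() spins until every woken thread has left wait(). */
static void toy_destroy(struct toy_barrier *b)
{
    while (atomic_load(&b->awakening) != 0)
        sched_yield();  /* let the stragglers run */
}
```

The decrement has to be the final access to the barrier object in each thread, otherwise destroy could still return while a thread is about to touch it.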
Is it possible to restore the normal execution flow of a C program, after the Segmentation Fault error?
struct A {
int x;
};
struct A* a = 0;
a->x = 123; // this is where segmentation violation occurs
// after handling the error I want to get back here:
printf("normal execution");
// the rest of my source code....
I want a mechanism similar to NullPointerException that is present in Java, C# etc.
Note: Please don't tell me that there is an exception handling mechanism in C++, because I know that, and don't tell me I should check every pointer before assignment etc.
What I really want to achieve is to get back to normal execution flow as in the example above. I know some actions can be undertaken using POSIX signals. What should it look like? Other ideas?
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <signal.h>
#include <stdlib.h>
#include <ucontext.h>
void safe_func(void)
{
puts("Safe now ?");
exit(0); //can't return to main, it's where the segfault occurred.
}
void
handler (int cause, siginfo_t * info, void *uap)
{
/* For testing only. Never call stdio functions in a signal handler otherwise. */
printf ("SIGSEGV raised at address %p\n", info->si_addr);
ucontext_t *context = uap;
/* On my particular system, compiled with gcc -O2, the offending instruction
generated for "*f = 16;" is 6 bytes. Let's try to set the instruction
pointer to the next instruction (general register 14 is EIP on Linux x86). */
context->uc_mcontext.gregs[14] += 6;
// alternatively, try to jump to a "safe place"
//context->uc_mcontext.gregs[14] = (unsigned int)safe_func;
}
int
main (int argc, char *argv[])
{
struct sigaction sa;
sa.sa_sigaction = handler;
int *f = NULL;
sigemptyset (&sa.sa_mask);
sa.sa_flags = SA_SIGINFO;
if (sigaction (SIGSEGV, &sa, 0)) {
perror ("sigaction");
exit(1);
}
//cause a segfault
*f = 16;
puts("Still Alive");
return 0;
}
$ ./a.out
SIGSEGV raised at address (nil)
Still Alive
I would beat someone with a bat if I saw something like this in production code though; it's an ugly, for-fun hack. You'll have no idea whether the segfault has corrupted some of your data, you'll have no sane way of recovering and knowing that everything is OK now, and there's no portable way of doing this. The only mildly sane thing you could do is try to log an error (use write() directly, not any of the stdio functions, which are not signal safe) and perhaps restart the program. For those cases you're much better off writing a supervisor process that monitors a child process's exit, logs it and starts a new child process.
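A bare-bones sketch of such a supervisor, assuming the work can be expressed as a function run in a forked child. The names run_in_child() and supervise() are invented for illustration, and crash_with_segv() and exit_cleanly() are stand-ins for a real worker.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs fn in a child process. Returns 0 if the child exited normally,
 * the signal number if it was killed by a signal, or -1 on error. */
static int run_in_child(void (*fn)(void))
{
    int status;
    pid_t pid;

    assert(fn != NULL);
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0);   /* skip the parent's atexit handlers and stdio flush */
    }
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Restarts the child each time a signal kills it, at most max_restarts
 * times. Returns how many times the child was started. */
static int supervise(void (*fn)(void), int max_restarts)
{
    int starts = 0;

    for (;;) {
        int sig = run_in_child(fn);
        starts++;
        if (sig <= 0)
            return starts;      /* clean exit, or fork/wait failed */
        /* This is the parent, not a signal handler, so stdio is fine. */
        fprintf(stderr, "child killed by signal %d\n", sig);
        if (starts > max_restarts)
            return starts;
    }
}

/* Stand-in workers for demonstration purposes only. */
static void crash_with_segv(void) { raise(SIGSEGV); }
static void exit_cleanly(void) { }
```

The crash stays confined to the child's address space, so the supervisor never has to reason about corrupted state.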
You can catch segmentation faults using a signal handler, and decide to continue the execution of the program (at your own risk).
The signal name is SIGSEGV.
You will have to use the sigaction() function, from the signal.h header.
Basically, it works the following way:
struct sigaction sa1;
struct sigaction sa2;
sa1.sa_handler = your_handler_func;
sa1.sa_flags = 0;
sigemptyset( &sa1.sa_mask );
sigaction( SIGSEGV, &sa1, &sa2 );
Here's the prototype of the handler function:
void your_handler_func( int id );
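As you can see, the handler does not return a value. When it returns, the program's execution resumes at the instruction that faulted, so unless the handler has removed the cause of the fault (or stops the program itself), the same fault is raised again.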
"All things are permissible, but not all are beneficial" - typically a segfault is game over for a good reason... A better idea than picking up where it was would be to keep your data persisted (database, or at least a file system) and enable it to pick up where it left off that way. This will give you much better data reliability all around.
See R.'s comment on MacMade's answer.
Expanding on what he said (after handling SIGSEGV or, for that matter, SIGFPE, the CPU+OS can return you to the offending insn), here is a test I have for division by zero handling:
#include <stdio.h>
#include <limits.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>
static jmp_buf context;
static void sig_handler(int signo)
{
/* XXX: don't do this, not reentrant */
printf("Got SIGFPE\n");
/* avoid infinite loop */
longjmp(context, 1);
}
int main()
{
int a;
struct sigaction sa;
memset(&sa, 0, sizeof(struct sigaction));
sa.sa_handler = sig_handler;
sa.sa_flags = SA_RESTART;
sigaction(SIGFPE, &sa, NULL);
if (setjmp(context)) {
/* If this one was on setjmp's block,
* it would need to be volatile, to
* make sure the compiler reloads it.
*/
sigset_t ss;
/* Make sure to unblock SIGFPE, according to POSIX it
* gets blocked when calling its signal handler.
* sigsetjmp()/siglongjmp would make this unnecessary.
*/
sigemptyset(&ss);
sigaddset(&ss, SIGFPE);
sigprocmask(SIG_UNBLOCK, &ss, NULL);
goto skip;
}
a = 10 / 0;
skip:
printf("Exiting\n");
return 0;
}
No, it's not possible, in any logical sense, to restore normal execution following a segmentation fault. Your program just tried to dereference a null pointer. How are you going to carry on as normal if something your program expects to be there isn't? It's a programming bug, the only safe thing to do is to exit.
Consider some of the possible causes of a segmentation fault:
you forgot to assign a legitimate value to a pointer
a pointer has been overwritten possibly because you are accessing heap memory you have freed
a bug has corrupted the heap
a bug has corrupted the stack
a malicious third party is attempting a buffer overflow exploit
malloc returned null because you have run out of memory
Only in the first case is there any kind of reasonable expectation that you might be able to carry on.
If you have a pointer that you want to dereference but it might legitimately be null, you must test it before attempting the dereference. I know you don't want me to tell you that, but it's the right answer, so tough.
Edit: here's an example to show why you definitely do not want to carry on with the next instruction after dereferencing a null pointer:
void foobarMyProcess(struct SomeStruct* structPtr)
{
    char* aBuffer = structPtr->aBigBufferWithLotsOfSpace; // if structPtr is NULL, will SIGSEGV
    //
    // if you SIGSEGV and come back to here, at this point aBuffer contains whatever garbage was in memory at the point
    // where the stack frame was created
    //
    strcpy(aBuffer, "Some longish string"); // You've just written the string to some random location in your address space
                                            // good luck with that!
}
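For contrast, here is a sketch of the "test it before dereferencing" version. The original post does not show struct SomeStruct, so the layout below (a pointer member plus a hypothetical bufferSize field) and the name foobarMyProcessChecked are assumptions made only to get something that compiles.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: the original post does not define this struct. */
struct SomeStruct {
    char  *aBigBufferWithLotsOfSpace;
    size_t bufferSize;               /* hypothetical field */
};

/* Returns 0 on success, -1 if there is nothing valid to write to. */
int foobarMyProcessChecked(struct SomeStruct *structPtr)
{
    static const char text[] = "Some longish string";

    /* Test the pointers before touching them instead of relying on SIGSEGV. */
    if (structPtr == NULL || structPtr->aBigBufferWithLotsOfSpace == NULL)
        return -1;
    /* Also refuse to overflow the destination. */
    if (structPtr->bufferSize < sizeof text)
        return -1;

    memcpy(structPtr->aBigBufferWithLotsOfSpace, text, sizeof text);
    return 0;
}
```

The caller gets an error code it can act on, and no handler ever has to guess what state the program is in.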
Register a handler like this; when a segfault occurs, your code will execute segv_handler and then continue back to where it was.
void segv_handler(int signo)
{
    // Do what you want here
}
signal(SIGSEGV, segv_handler);
There is no meaningful way to recover from a SIGSEGV unless you know EXACTLY what caused it, and there's no way to do that in standard C. It may be possible (conceivably) in an instrumented environment, like a C-VM (?). The same is true for all program error signals; if you try to block/ignore them, or establish handlers that return normally, your program will probably break horribly when they happen unless perhaps they're generated by raise or kill.
Just do yourself a favour and take error cases into account.
In POSIX, your process will get sent SIGSEGV when you do that. The default handler just crashes your program. You can add your own handler using the signal() call. You can implement whatever behaviour you like by handling the signal yourself.
You can use the SetUnhandledExceptionFilter() function (on Windows), but even to be able to skip the "illegal" instruction you will need to decode some assembler opcodes. And, as glowcoder said, even if you could "comment out" at runtime the instructions that generate segfaults, what would be left of the original program logic (if it may be called that)?
Everything is possible, but it doesn't mean that it has to be done.
Unfortunately, you can't in this case. The buggy function has undefined behavior and could have corrupted your program's state.
What you CAN do is run the functions in a new process. If this process dies with a return code that indicates SIGSEGV, you know it has failed.
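A rough sketch of the separate-process idea on a POSIX system, using fork() and waitpid(). The names run_isolated, crash_with_segv and returns_normally are invented for this example, and raise(SIGSEGV) stands in for the buggy function so the demo is deterministic.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs fn() in a child process (hypothetical helper).
 * Returns 0 if the child exited normally, the signal number if it was
 * killed by a signal (e.g. SIGSEGV), or -1 if fork/waitpid failed.
 */
int run_isolated(void (*fn)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {          /* child: run the risky code, then leave */
        fn();
        _exit(0);
    }

    /* parent: wait for the child and look at how it ended */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}

/* Stand-in for a function that segfaults (assumption for the demo). */
void crash_with_segv(void) { raise(SIGSEGV); }

void returns_normally(void) { }
```

If run_isolated(some_function) comes back equal to SIGSEGV, the function crashed, but only the child's memory was affected; the parent's state is untouched. Getting results back from the child (pipe, shared memory, exit code) is left out here.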
You could also rewrite the functions yourself.
I can see a case for recovering from a segmentation violation: if you're handling events in a loop and one of these events causes a segmentation violation, then you would only want to skip over this event and continue processing the remaining events. In my eyes segmentation violations are much the same as NullPointerExceptions in Java. Yes, the state will be inconsistent and unknown after either of these, however in some cases you would like to handle the situation and carry on. For instance, in algo trading you would pause the execution of an order and allow a trader to manually take over, without crashing the entire system and ruining all other orders.
The best solution is to wrap each unsafe access this way:
#include <iostream>
#include <cstdlib>
#include <signal.h>
#include <setjmp.h>

static jmp_buf buf;
int counter = 0;

void signal_handler(int)
{
    longjmp(buf, 1);
}

int main()
{
    signal(SIGSEGV, signal_handler);
    setjmp(buf);
    if (counter++ == 0) { // only attempt the access once
        *(int*)(0x1215) = 10; // write to an address that is almost certainly not mapped
    }
    std::cout << "i am alive !!" << std::endl; // we get here in either case
    system("pause"); // Windows-only; drop this line elsewhere
    return 0;
}
Your program will survive the fault on almost every OS.
This glibc manual gives you a clear picture of how to write signal handlers.
A signal handler is just a function that you compile together with the rest
of the program. Instead of directly invoking the function, you use signal
or sigaction to tell the operating system to call it when a signal arrives.
This is known as establishing the handler.
In your case you will have to wait for the SIGSEGV indicating a segmentation fault. The list of other signals can be found here.
Signal handlers are broadly classified into two categories:
You can have the handler function note that the signal arrived by tweaking some global data structures, and then return normally.
You can have the handler function terminate the program or transfer control to a point where it can recover from the situation that caused the signal.
SIGSEGV comes under program error signals.
For the example code in section 10.6, the expected result is:
After several iterations, the static structure used by getpwnam will be corrupted, and the program will terminate with a SIGSEGV signal.
But on my platform, Fedora 11, gcc (GCC) 4.4.0, the result is
[Langzi#Freedom apue]$ ./corrupt
in sig_alarm
I see the output from sig_alarm only once, and then the program seems to hang for some reason, although the process still exists and is still running.
But when I run the program under gdb, it seems OK: I see the output from sig_alarm at regular intervals.
My manual says the signal handler is reset to SIG_DFL after the signal is handled, and the system will not block the signal. So I re-establish the handler at the beginning of my signal handler.
Maybe I should use sigaction instead, but I only want to know the reason for the difference between running normally and running under gdb.
Any advice and help will be appreciated.
Following is my code:
#include "apue.h"
#include <pwd.h>

void sig_alarm(int signo);

int main()
{
    struct passwd *pwdptr;
    signal(SIGALRM, sig_alarm);
    alarm(1);
    for(;;) {
        if ((pwdptr = getpwnam("Zhijin")) == NULL)
            err_sys("getpwnam error");
        if (strcmp("Zhijin", pwdptr->pw_name) != 0) {
            printf("data corrupted, pw_name: %s\n", pwdptr->pw_name);
        }
    }
}

void sig_alarm(int signo)
{
    signal(SIGALRM, sig_alarm);
    struct passwd *rootptr;
    printf("in sig_alarm\n");
    if ((rootptr = getpwnam("root")) == NULL)
        err_sys("getpwnam error");
    alarm(1);
}
According to the C standard, you're really not allowed to do much in a signal handler. About all you are guaranteed to be able to do in the signal-handling function, without causing undefined behavior, is to call signal for the signal being handled (or abort or _Exit), and to assign a value to a volatile static object of the type sig_atomic_t.
The first few times I ran this program, on Ubuntu Linux, it looked like your call to alarm in the signal handler didn't work, so the loop in main just kept running after the first alarm. When I tried it later, the program ran the signal handler a few times, and then hung. All this is consistent with undefined behavior: the program fails, sometimes, and in various more or less interesting ways.
It is not uncommon for programs that have undefined behavior to work differently in the debugger. The debugger is a different environment, and your program and data could for example be laid out in memory in a different way, so errors can manifest themselves in a different way, or not at all.
I got the program to work by adding a variable:
volatile sig_atomic_t got_interrupt = 0;
And then I changed your signal handler to this very simple one:
void sig_alarm(int signo) {
    got_interrupt = 1;
}
And then I inserted the actual work into the infinite loop in main:
if (got_interrupt) {
    got_interrupt = 0;
    signal(SIGALRM, sig_alarm);
    struct passwd *rootptr;
    printf("in sig_alarm\n");
    if ((rootptr = getpwnam("root")) == NULL)
        perror("getpwnam error");
    alarm(1);
}
I think the "apue" you mention is the book "Advanced Programming in the UNIX Environment", which I don't have here, so I don't know if the purpose of this example is to show that you shouldn't mess around with things inside of a signal handler, or just that signals can cause problems by interrupting the normal work of the program.
According to the spec, the function getpwnam is not reentrant and is not guaranteed to be thread safe. Since you are accessing the structure in two different threads of control (signal handlers are effectively running in a different thread context), you are running into this issue. Whenever you have concurrent or parallel execution (as when using pthreads or when using a signal handler), you must be sure not to modify shared state (e.g. the structure owned by 'getpwnam'), and if you do, then appropriate locking/synchronization must be used.
Additionally, use of the signal function is discouraged in favor of the sigaction function, because the semantics of signal vary between systems. To ensure portable behavior when registering signal handlers, you should use sigaction.
With sigaction, you can set the SA_RESETHAND flag to have the disposition reset to the default when the handler is entered. You can also use the sigprocmask function to block and unblock the delivery of signals without modifying their handlers.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

void sigalrm_handler(int);

int main()
{
    signal(SIGALRM, sigalrm_handler);
    alarm(3);
    while(1)
    {
    }
    return 0;
}

void sigalrm_handler(int sign)
{
    printf("I am alive. Catch the sigalrm %d!\n",sign);
    alarm(3);
}
For example, my code runs in main doing nothing, and every 3 seconds my program says it is alive.
I think that if you do as I did, calling alarm with the value 3 inside the handler function, the problem is resolved :)