How can I efficiently catch and handle segmentation faults from C in an OSX Carbon application?
Background: I am making an OSX Carbon application. I must call a library function from a third party. Because of threading issues, the function can occasionally crash, usually because it's updating itself from one thread, and it's got some internally stale pointer or handle as I query it from another. The function is a black box to me. I want to be able to call the function but be able to "catch" if it has crashed and supply an alternative return.
In Windows, I can use the simple Visual C and Intel C compiler's __try{} and __except.
/* Working Windows Example */
__try { x=DangerousFunction(y);}
__except(EXCEPTION_EXECUTE_HANDLER) {x=0.0;} /* whups, func crashed! */
I am trying to make the same kind of crash-catcher for OSX. I am using pure C on a very large application. I call the function millions of times per second, so efficiency is very important too. (Impressively, the Windows __try() overhead is immeasurably small!)
Here's what I have experimented with:
1) C++ exceptions. I am not sure if C++ exceptions catch the segfault crashes. And my app is currently C. I could try wrappers and #ifdefs to make it C++ but this is a lot of work for the app, and I don't think C++ exceptions will catch the crash.
2) signal + setjmp + longjmp. I thought this would work, since that is what it is designed for. But I set up my SIGSEGV handler [in fact I set it up for every signal!] and it is never called during the crash. I can manually test (and succeed) by calling raise(SIGSEGV), but the crashes don't seem to actually call it. My guess is that CFM applications do NOT have access to the full set of BSD signals, only a subset, and that a Mach-O application is necessary for the Real Thing.
3) MPSetExceptionHandler. Not well documented. I attempted to set a handler. It compiled and ran, but did not catch the segfault.
Are you sure you're not getting a SIGBUS rather than a SIGSEGV?
The below catches SIGBUS as caused by trying to write at memory location 0:
cristi:tmp diciu$ cat test.c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void sigac(int sig)
{
    printf("sig action here, signal is %d\n", sig);
    exit(1);
}

int main()
{
    (void)signal(SIGSEGV, sigac);
    (void)signal(SIGBUS, sigac);
    printf("Raising\n");
    strcpy(0, "aaksdjkajskd|");
}
cristi:tmp diciu$ ./a.out
Raising
sig action here, signal is 10
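If the fault really is delivered as SIGSEGV or SIGBUS, the handler can jump back into the caller instead of exiting. Below is a minimal sketch of that idea, assuming a build where the POSIX calls sigaction, sigsetjmp and siglongjmp are available. The names install_crash_guard, guarded_call, half and crash_on_purpose are made up for this example; crash_on_purpose only stands in for the third-party function.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf crash_env;                   /* one recovery point, not thread-safe */
static volatile sig_atomic_t guard_active = 0; /* nonzero only inside guarded_call() */

static void crash_handler(int sig)
{
    if (guard_active)
        siglongjmp(crash_env, sig);  /* unwind back into guarded_call() */
    /* Fault outside a guarded call: fall back to the default action. */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Installs the handler for both signals; returns 0 on success, -1 on failure. */
int install_crash_guard(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;
    if (sigaction(SIGBUS, &sa, NULL) != 0) return -1;
    return 0;
}

/* Calls fn(arg); returns fallback if the call faults. */
double guarded_call(double (*fn)(double), double arg, double fallback)
{
    volatile double result = fallback;
    if (sigsetjmp(crash_env, 1) == 0) {  /* 1: save the signal mask too */
        guard_active = 1;
        result = fn(arg);
    }
    guard_active = 0;
    return result;
}

/* Hypothetical stand-ins for a well-behaved and a crashing function. */
double half(double x) { return x / 2.0; }

double crash_on_purpose(double x)
{
    int *volatile bad = NULL;  /* volatile so the compiler keeps the load */
    return x + *bad;
}
```

After install_crash_guard() succeeds, guarded_call(crash_on_purpose, 8.0, -1.0) returns the fallback. Two caveats: saving the signal mask in sigsetjmp typically costs a system call per guarded call, which may matter at millions of calls per second, and the library's internal state after a fault is unknown, so this is damage control rather than a fix.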
First SO question, so here it goes.
I'm not asking for someone to review the code; I want to get to the bottom of this.
It would be helpful if someone knew what change in the kernel could be responsible for the following.
At university we were tasked with implementing extended functionality in a model operating system written in C (by my professor), which models each core with a pthread.
Project github forked by me.
We had to implement the necessary functionality by implementing the required syscalls. (multithreading, sockets, pipes, mlfq, etc).
After implementing each functionality we had to confirm that it was working using the validate_api program.
Problem time:
The validate_api.c contains a lot of tests to check the functionality of the OS.
BOOT_TEST: bare-boots the machine and tests something.
A simple test for creating a new thread inside a process:
BOOT_TEST(test_create_join_thread,
    "Test that a process thread can be created and joined. Also, that "
    "the argument of the thread is passed correctly."
    )
{
    int flag = 0;
    int task(int argl, void* args) {
        ASSERT(args == &flag);
        *(int*)args = 1;
        return 2;
    }
    Tid_t t = CreateThread(task, sizeof(flag), &flag);
    /* Success in creating thread */
    ASSERT(t!=NOTHREAD);
    int exitval;
    /* Join should succeed */
    ASSERT(ThreadJoin(t, &exitval)==0);
    /* Exit status should be correct */
    ASSERT(exitval==2);
    /* Shared variable should be updated */
    ASSERT(flag==1);
    /* A second Join should fail! */
    ASSERT(ThreadJoin(t, NULL)==-1);
    return 0;
}
As you can see, there is a nested function called task(), which is the starting point of the thread that is created using the CreateThread() syscall we implemented.
The problem is that, although the thread is created correctly, the program exits with a segmentation fault when the thread is scheduled to run: it cannot access the memory of the task function, and gdb doesn't even recognize it as a variable (in the thread struct field pointing to it). The weird thing is that this happens ONLY with kernel versions newer than 5.7. I opened an issue in the original project's repo.
Running the actual OS and its programs is fine, with no issues whatsoever; only validate_api fails, because of that nested function. If I move the task function to global scope, the test finishes successfully. The same goes for every other test that has a nested function inside.
Note: The project has been finished for a month now; I downgraded to 5.4 just to test my implementation.
Note 2: I don't need help with the implementation of any functionality (the project is finished anyway); I just want to figure out why it doesn't work on kernels > 5.7.
Note 3: I'm here because my professor doesn't respond to my repeated emails regarding the issue.
I tried compiling using -fno-stack-protector and with -z execstack with no luck. Also simple nested functions like:
int main(){
    int foo(){
        puts("Hello there");
    }
    foo();
}
work with any kernel
Machine Details:
Arch Linux - 5.10 / 5.4 LTS
GCC 10.2
Thank you
UPDATE:
The test joins the thread, so it never goes out of scope.
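For reference, the workaround mentioned above (moving task out of the test body) does not need GCC's nested-function extension at all, because the flag can travel through the args pointer instead of being referenced from the enclosing scope. A minimal sketch follows; run_task is a hypothetical stand-in that just calls the entry point, not the project's CreateThread().

```c
#include <stddef.h>

/* File-scope task: it reaches the caller's flag only through args. */
int task(int argl, void *args)
{
    (void)argl;           /* size of the argument, unused here */
    *(int *)args = 1;     /* update the shared variable */
    return 2;             /* exit value the test expects */
}

/* Hypothetical stand-in for the thread start path: calls the entry point. */
int run_task(int (*entry)(int, void *), int argl, void *args)
{
    return entry(argl, args);
}
```

GCC calls a nested function that refers to enclosing locals through a trampoline built on the enclosing function's stack, so a file-scope function sidesteps that mechanism entirely. Whether the trampoline is what breaks on newer kernels is not established here.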
When writing code I often have checks to see if errors occurred. An example would be:
char *x = malloc( some_bytes );
if( x == NULL ){
    fprintf( stderr, "Malloc failed.\n" );
    exit(EXIT_FAILURE);
}
I've also used strerror( errno ) in the past.
I've only ever written small desktop applications where it doesn't matter if the program exit()ed in case of an error.
Now, however, I'm writing C code for an embedded system (Arduino) and I don't want the system to just exit in case of an error. I want it to go to a particular state/function where it can power down systems, send error reports and idle safely.
I could simply call an error_handler() function, but I could be deep in the stack and very low on memory, leaving error_handler() inoperable.
Instead, I'd like execution to effectively collapse the stack, free up a bunch of memory and start sorting out powering down and error reporting. There is a serious fire risk if the system doesn't power down safely.
Is there a standard way that safe error handling is implemented in low memory embedded systems?
EDIT 1:
I'll limit my use of malloc() in embedded systems. In this particular case, the errors would occur when reading a file, if the file was not of the correct format.
Maybe you're waiting for the Holy and Sacred setjmp/longjmp, the one who came to save all the memory-hungry stacks of their sins?
#include <setjmp.h>

jmp_buf jumpToMeOnAnError;
int theWorldIsOver;  /* set elsewhere when a fatal error is detected */

void someUpperFunctionOnTheStack() {
    if(setjmp(jumpToMeOnAnError) != 0) {
        // Error handling code goes here
        // Return, abort(), while(1) {}, or whatever here...
    }
    // Do routine stuff
}

void someLowerFunctionOnTheStack() {
    if(theWorldIsOver)
        longjmp(jumpToMeOnAnError, -1);
}
Edit: Prefer not to do malloc()/free()s on embedded systems, for the same reasons you said. It's simply unmanageable. Unless you use a lot of return codes/setjmp()s to free the memory all the way up the stack...
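Here is a compilable variant of the same idea, as a rough sketch. parse_record, parse_file and run_with_recovery are made-up names, and the "record must start with #" rule is only a placeholder for the file-format error from the question.

```c
#include <setjmp.h>
#include <stddef.h>

enum { ERR_NONE = 0, ERR_BAD_FORMAT = 1 };

static jmp_buf recovery_point;

/* Deep in the call chain: bail out on malformed input. */
static int parse_record(const char *record)
{
    if (record == NULL || record[0] != '#')
        longjmp(recovery_point, ERR_BAD_FORMAT);  /* collapses the stack */
    return (int)record[1];
}

/* Intermediate level: no error plumbing needed here. */
static int parse_file(const char *record)
{
    return parse_record(record);
}

/* Returns ERR_NONE on success, or the code that was passed to longjmp(). */
int run_with_recovery(const char *record)
{
    switch (setjmp(recovery_point)) {
    case 0:                  /* direct return: run the normal path */
        (void)parse_file(record);
        return ERR_NONE;
    case ERR_BAD_FORMAT:     /* arrived here via longjmp() */
        /* power down, report, idle safely ... */
        return ERR_BAD_FORMAT;
    default:
        return -1;
    }
}
```

The setjmp() call sits directly in the switch condition because the C standard only allows it in a few expression contexts. Nothing between the two points is cleaned up, so any resource acquired on the way down has to be released by the recovery code.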
If your system has a watchdog, you could use:
char *x = malloc( some_bytes );
assert(x != NULL);
The implementation of assert() could be something like:
#define assert(condition) \
    do { if (!(condition)) for (;;) { } } while (0)
In case of a failure the watchdog would trigger and reset the system. At restart the system would check the reset reason; if it was a watchdog reset, the system would go to a safe state.
update
Before entering the endless loop, assert could also output an error message, print the stack trace, or save some data in non-volatile memory.
Is there a standard way that safe error handling is implemented in low memory embedded systems?
Yes, there is an industry de facto way of handling it. It is all rather simple:
For every module in your program you need to have a result type, such as a custom enum, which describes every possible thing that could go wrong with the functions inside that module.
You document every function properly, stating what codes it will return upon error and what code it will return upon success.
You leave all error handling to the caller.
If the caller is another module, it too passes the error on to its own caller, possibly renaming it to something more suitable where applicable.
The error handling mechanism is located in main(), at the bottom of the call stack.
This works well together with classic state machines. A typical main would be:
void main (void)
{
    for(;;)
    {
        serve_watchdog();
        result = state_machine();
        if(result != good)
        {
            error_handler(result);
        }
    }
}
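To make the scheme above concrete, here is a small sketch. The two modules (sensor and logger) and every enumerator are invented for illustration; only the pattern itself (one result enum per module, errors passed up and renamed, handling left to the caller) comes from the description.

```c
/* Result type of the hypothetical "sensor" module. */
typedef enum {
    SENSOR_OK,
    SENSOR_ERR_TIMEOUT,   /* no reading arrived */
    SENSOR_ERR_RANGE      /* reading outside 0..1023 */
} sensor_result_t;

/* Result type of the hypothetical "logger" module, which uses the sensor. */
typedef enum {
    LOGGER_OK,
    LOGGER_ERR_SENSOR,    /* any sensor failure, renamed for the caller */
    LOGGER_ERR_FULL       /* no storage slots left */
} logger_result_t;

/* Validates a raw reading; a negative raw value models a timeout. */
sensor_result_t sensor_read(int raw, int *out)
{
    if (raw < 0)    return SENSOR_ERR_TIMEOUT;
    if (raw > 1023) return SENSOR_ERR_RANGE;
    *out = raw;
    return SENSOR_OK;
}

/* Reads and stores one value; does no error handling of its own. */
logger_result_t logger_store(int raw, int *slots_left)
{
    int value;
    if (sensor_read(raw, &value) != SENSOR_OK)
        return LOGGER_ERR_SENSOR;   /* pass the error on, renamed */
    if (*slots_left == 0)
        return LOGGER_ERR_FULL;
    --*slots_left;                  /* pretend the value was stored */
    return LOGGER_OK;
}
```

The state machine called from main() would return a code like logger_result_t, and error_handler() is then the only place that decides what to do about it.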
You should not use malloc in bare-metal or RTOS microcontroller applications, not so much for safety reasons, but simply because it doesn't make any sense whatsoever to use it. Apply common sense when programming.
Use setjmp(3) to set a recovery point and longjmp(3) to jump to it, restoring the stack to what it was at the setjmp point. It won't free malloc'ed memory.
Generally, it is not a good idea to use malloc/free in an embedded program if it can be avoided. For example, a static array may be adequate, or even using alloca() is marginally better.
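As an illustration of the static-array suggestion, here is a sketch of a fixed pool declared at file scope, so the memory footprint is known at link time. line_buf_acquire and line_buf_release are invented names, and the sizes are arbitrary.

```c
#include <stddef.h>

#define LINE_BUF_COUNT 4
#define LINE_BUF_SIZE  64

static char line_bufs[LINE_BUF_COUNT][LINE_BUF_SIZE];
static unsigned char line_buf_used[LINE_BUF_COUNT];

/* Hands out the first free buffer, or NULL when all are in use. */
char *line_buf_acquire(void)
{
    size_t i;
    for (i = 0; i < LINE_BUF_COUNT; ++i) {
        if (!line_buf_used[i]) {
            line_buf_used[i] = 1;
            return line_bufs[i];
        }
    }
    return NULL;  /* pool exhausted: the caller must handle this */
}

/* Marks a buffer obtained from line_buf_acquire() as free again. */
void line_buf_release(char *buf)
{
    size_t i;
    for (i = 0; i < LINE_BUF_COUNT; ++i) {
        if (buf == line_bufs[i])
            line_buf_used[i] = 0;
    }
}
```

Exhaustion is still possible, but it is bounded and easy to test for, and there is no fragmentation to worry about.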
To minimize stack usage:
Write the program so that calls are made side by side rather than nested (a function that calls a sub-function that calls a sub-function, and so on). That is, the top-level function calls a sub-function, which promptly returns with status info; the top-level function then calls the next sub-function, and so on.
The (bad for stack limited) nested method of program architecture:
top level function
second level function
third level function
fourth level function
should be avoided in embedded systems
the preferred method of program architecture for embedded systems is:
top level function (the reset event handler)
(variations in the following depending on if 'warm' or 'cold' start)
initialize hardware
initialize peripherals
initialize communication I/O
initialize interrupts
initialize status info
enable interrupts
enter background processing
interrupt handler
re-enable the interrupt
using 'scheduler'
select a foreground function
trigger dispatch for selected foreground function
return from interrupt
background processing
(this can be, and often is implemented as a 'state' machine rather than a loop)
loop:
if status info indicates need to call second level function 1
second level function 1, which updates status info
if status info indicates need to call second level function 2
second level function 2, which updates status info
etc
end loop:
Note that, as much as possible, there is no 'third level function x'
Note that, the foreground functions must complete before they are again scheduled.
Note: there are lots of other details that I have omitted in the above, like
kicking the watchdog,
the other interrupt events,
'critical' code sections and use of mutex(),
considerations between 'soft real-time' and 'hard real-time',
context switching
continuous BIT, commanded BIT, and error handling
etc
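The background loop in the outline above might look roughly like this in C. It is only a sketch: the flag names, timer_isr, take_sample and send_report are invented, and on real hardware the updates to status_flags in the background code would need a critical section, since an interrupt can modify the flags at the same time.

```c
#include <stdint.h>

#define NEED_SAMPLE 0x01u
#define NEED_REPORT 0x02u

volatile uint8_t status_flags;   /* status info shared with interrupt handlers */
unsigned samples_taken;
unsigned reports_sent;

/* Interrupt handler: records that work is needed and returns at once. */
void timer_isr(void)
{
    status_flags |= NEED_SAMPLE;
}

/* Second-level function 1: does its work, then updates the status info. */
static void take_sample(void)
{
    ++samples_taken;
    status_flags &= (uint8_t)~NEED_SAMPLE;
    if (samples_taken % 4 == 0)
        status_flags |= NEED_REPORT;   /* every fourth sample gets reported */
}

/* Second-level function 2. */
static void send_report(void)
{
    ++reports_sent;
    status_flags &= (uint8_t)~NEED_REPORT;
}

/* One pass of the background loop; the reset handler would call this forever. */
void background_step(void)
{
    if (status_flags & NEED_SAMPLE)
        take_sample();
    if (status_flags & NEED_REPORT)
        send_report();
}
```

Each second-level function returns straight to background_step(), so the call depth never exceeds two no matter how much work is pending.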
Is there any function in C to check whether the computer is going to sleep, hibernating, or being locked, and when it wakes up from these states?
MSDN provides this for C# and C++, but not for C.
My OS is Windows 7.
Below is the code I'm using to measure the time between starting the program and terminating it (shutting down the system terminates the program, so the duration can be measured this way).
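#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <time.h>

clock_t start_time = 0;

/* Runs at normal program exit and records the elapsed time in seconds. */
void bye(void)
{
    clock_t end_time = clock();
    FILE *out = fopen("F:\\count.txt", "w");
    if (out == NULL)
        return;
    /* clock_t is not guaranteed to be int, so cast to match the format */
    fprintf(out, "Time: %ld", (long)((end_time - start_time) / CLOCKS_PER_SEC));
    fclose(out);
}

int main(void)
{
    start_time = clock();
    atexit(bye);
    //exit (EXIT_SUCCESS);
    getch();
    return 0;
}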
In the same way, I want to detect lock, sleep, and hibernate.
One possible way is to wrap the C++ code (provided in the link) in C, as mentioned by @ddriver.
But is it really not possible in C at all?
The WinAPI generally has at least the same capabilities as the .NET framework. What you are asking for is the Power Management API.
You will have to register to receive PowerSettingNotificationEvents with the RegisterPowerSettingNotification function. Unfortunately, it is used differently for a GUI application where you give a handle to a window that will then receive a WM_POWERBROADCAST message each time the system is about to change state (one of the suspend modes or the hibernate mode), and for a non GUI (typically a service) that registers a HandlerEx callback with a dwControl parameter of SERVICE_CONTROL_POWEREVENT and a dwEventType of PBT_POWERSETTINGCHANGE.
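For the GUI-style route, a plain C sketch could look like the following. As far as I know, the suspend and resume events arrive as WM_POWERBROADCAST on any top-level window (a hidden one is enough), while RegisterPowerSettingNotification adds the PBT_POWERSETTINGCHANGE events on top. power_event_name, power_wnd_proc and run_power_listener are names made up for this example, and the Windows-only part is guarded so the helper also compiles elsewhere.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Numeric values as defined in the Windows SDK headers, repeated here only
   so that the helper below also compiles on other platforms. */
#define PBT_APMSUSPEND          0x0004
#define PBT_APMRESUMESUSPEND    0x0007
#define PBT_APMRESUMEAUTOMATIC  0x0012
#endif

/* Translates the wParam of WM_POWERBROADCAST into a printable name. */
const char *power_event_name(unsigned long event)
{
    switch (event) {
    case PBT_APMSUSPEND:         return "suspending";
    case PBT_APMRESUMESUSPEND:   return "resumed (user present)";
    case PBT_APMRESUMEAUTOMATIC: return "resumed";
    default:                     return "other";
    }
}

#ifdef _WIN32
static LRESULT CALLBACK power_wnd_proc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
{
    if (msg == WM_POWERBROADCAST) {
        printf("power event: %s\n", power_event_name((unsigned long)wp));
        return TRUE;
    }
    if (msg == WM_DESTROY) {
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProcA(hwnd, msg, wp, lp);
}

/* Creates a hidden top-level window and pumps messages until WM_QUIT. */
int run_power_listener(void)
{
    WNDCLASSA wc = {0};
    MSG msg;

    wc.lpfnWndProc = power_wnd_proc;
    wc.hInstance = GetModuleHandleA(NULL);
    wc.lpszClassName = "PowerListenerSketch";
    if (!RegisterClassA(&wc))
        return -1;
    if (!CreateWindowA(wc.lpszClassName, "", 0, 0, 0, 0, 0,
                       NULL, NULL, wc.hInstance, NULL))
        return -1;
    while (GetMessageA(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessageA(&msg);
    }
    return 0;
}
#endif
```

A console program would call run_power_listener() from main() (or from a separate thread) and do its bookkeeping inside the WM_POWERBROADCAST branch. The window is never shown, so the program still behaves like a console tool.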
The link you provide is about signals, emitted when power mode is changing. So, obviously, you can check when the system is about to go to sleep, or it just woke up.
As for checking whether the system is currently asleep, that is simply not possible, as user code will not be running during deep sleep states. There may be some platform-specific, very low-level BIOS API, but those are usually not public and far from portable.
I'm working on a runtime non-native binary translator in Windows, and so far I've been able to "trap" interrupts (i.e. INT 0x99) for the OS binaries I'm trying to emulate by using an ugly hack that uses Windows SEH to handle invalid interrupts; but only because the system call vector is different than the one in Windows, allowing me to catch these "soft" exceptions by doing something like this:
static int __stdcall handler_cb(EXCEPTION_POINTERS* pes, ...)
{
    if (pes->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
        return EXCEPTION_CONTINUE_SEARCH;
    /* unsigned char, so that opcode bytes >= 0x80 compare equal to 0xcd and 0x99 */
    unsigned char* instruct = (unsigned char*) pes->ContextRecord->Eip;
    if (!instruct)
        handle_invalid_instruction(instruct);
    switch (instruct[0])
    {
    case 0xcd: // INT
        {
            if (instruct[1] != 0x99) // INT 0x99
                handle_invalid_instruction(instruct);
            handle_syscall_translation();
            ...
        }
    ...
    default:
        halt_and_catch_fire();
    }
    return EXCEPTION_CONTINUE_EXECUTION;
}
This works fairly well (but slowly). The problem is that Windows first attempts to handle the instruction/interrupt itself, and for non-native binaries that use sysenter/sysexit instead of int 0x99, some sysenter instructions in the non-native binary are actually valid NT kernel calls when executed. That means my handler is never called and, worse, the state of the "host" OS is compromised. Is there any way to "trap" sysenter instructions in Windows? How would I go about doing this?
As far as I know, there is no way (from a user-mode process) to "disable" SYSENTER, so that executing it will generate an exception. (I'm assuming your programs don't try to SYSEXIT, because only Ring 0 can do that).
The only option I think you have is to do what VirtualBox does: scan for invalid instructions and replace them with illegal opcodes or something similar that you can trap on and emulate. See 10.4. Details about software virtualization.
To fix these performance and security issues, VirtualBox contains a Code Scanning and Analysis Manager (CSAM), which disassembles guest code, and the Patch Manager (PATM), which can replace it at runtime.
Before executing ring 0 code, CSAM scans it recursively to discover problematic instructions. PATM then performs in-situ patching, i.e. it replaces the instruction with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. So, with its current complexity, one could argue that PATM is an advanced in-situ recompiler.
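The patching step can be illustrated in a few lines. This sketch only pattern-matches the two-byte SYSENTER opcode (0F 34) and overwrites it with UD2 (0F 0B), which raises an illegal-instruction exception that an exception handler can then catch. patch_sysenter is a made-up name, and the raw byte scan is an assumption that a real translator could not rely on: the same byte pair can occur inside the operands of other instructions, which is why VirtualBox disassembles the code instead.

```c
#include <stddef.h>

/* Replaces every 0F 34 byte pair in code[0..len) with 0F 0B.
   Returns the number of patches made. The buffer must be writable. */
size_t patch_sysenter(unsigned char *code, size_t len)
{
    size_t i;
    size_t patched = 0;

    for (i = 0; i + 1 < len; ++i) {
        if (code[i] == 0x0F && code[i + 1] == 0x34) {
            code[i + 1] = 0x0B;   /* 0F 34 (SYSENTER) -> 0F 0B (UD2) */
            ++patched;
            ++i;                  /* skip the byte just rewritten */
        }
    }
    return patched;
}
```

The handler would then have to recognise UD2 at the faulting address, emulate the system call, and advance the instruction pointer by two bytes. For code that is already mapped as executable, the page protection has to be changed before patching.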
I am using the library function ConnectToTCPServer. This function times out when the host is not reachable. In that case the application crashes with the following error:
"NON-FATAL RUN-TIME ERROR: "MyClient.c", line 93, col 15, thread id 0x000017F0: Library function error (return value == -11 [0xfffffff5]). Timeout error"
Error code -11 is a timeout error, which could happen quite often in my application. However, the application crashes; I would like to catch this error rather than have my application crash.
How can I catch this run-time error in ANSI C90?
EDIT:
Here is a code snippet of the current use:
ConnectToTCPServer(&srvHandle, srvPort, srvName, HPMClientCb, answer, timeout);
with
int HPMClientCb(UINT handle, int xType, int errCode, void *transData){
    printf("This was never printed\n");
    return errCode;
}
The callback function is never called. My server is not running, so ConnectToTCPServer will time out. I would expect the callback to be called, but it never is.
EDIT 2: The callback function is indeed not called; the return value of ConnectToTCPServer contains the same error information. I think it might be a bug that ConnectToTCPServer throws this error. I just need to catch it and bin it in C90. Any ideas?
EDIT 3: I tested the callback function: on the rare occasion that my server is online, the callback is actually called. This does not help, though, because the callback is not called when an error occurs.
Looking in NI documentation, I see this:
"Library error breakpoints -- You can set an option to break program execution whenever a LabWindows/CVI library function returns an error during run time. "
I would speculate they have a debug option that stops the program on run-time errors, which you need to disable in the configuration, at compile time, or at run time.
My first guess would have been configuration value or compilation flag, but this is the only option I found, which is a run-time option:
// If debugging is enabled, this function directs LabWindows/CVI not
// to display a run-time error dialog box when a National Instruments
// library function reports an error.
DisableBreakOnLibraryErrors();
Let me know if it helped.
There's no such thing as a general way of "catching" an error (or an 'exception') in standard C. It's up to your library to decide what to do with it. Likely it's logging its state and then simply calling abort(). On Unix, that raises SIGABRT, which can be handled rather than just exiting. Or the library may just be logging and then calling exit().
You could run your application under a utility like strace to see what system calls are being performed and what signals are being asserted.
I'd work with your vendor if you can't make any headway otherwise.
From the documentation, it seems you should get a call to your clientCallbackFunction when an error occurs. If you don't, you should edit your question to clarify that.
I'm not sure I understand you.
I looked at the documentation for the library function ConnectToTCPServer(). It returns an int; 0 means success, negative numbers are the error codes.
EDIT: Here is a code snippet of the current use:
ConnectToTCPServer(&srvHandle, srvPort, srvName, HPMClientCb, answer, timeout);
If that's really the current use, you don't seem to be trying to tell whether ConnectToTCPServer() succeeds. To do that, you'd need
int err_code;
...
err_code = ConnectToTCPServer(&srvHandle, srvPort, srvName, HPMClientCb, answer, timeout);
and then test err_code.
The documentation for ConnectToTCPServer() implies that your callback function won't be called unless there's a message from a TCP server. No server, no message. In that case, ConnectToTCPServer() should return a negative number.
You should check the return value of ConnectToTCPServer().
Finding a negative number there, you should do something sensible.
Did I understand the documentation correctly?
Normally, you should be able to simply check the return value. The fact that your application exits implies that something is already catching the error and asserting (or something similar). Without seeing any context (i.e. code demonstrating how you're using this function), it's difficult to be any more precise.
The documentation states that ConnectToTCPServer will return the error code. The callback is only called if the connection is established, disconnected or when there is data ready to be read.
The message you get states that the error is NON-FATAL, hence it shouldn't abort. If you're sure the code doesn't abort later it seems indeed like a bug in the library.
I'm not familiar with CVI, but there might be a (compile-/runtime-) option to abort even on non-fatal errors (for debugging purposes). If you can reproduce this in a minimal example you should report it to NI.