Ptrace mprotect debugging trouble - c

I'm having trouble with an research project.
What i am trying to is to use ptrace to watch the execution of a target process.
With the help of ptrace i am injecting a mprotect syscall into the targets code segment (similar to a breakpoint) and set the stack protection to PROT_NONE.
After that i restore the original instructions and let the target continue.
When i get an invalid permisson segfault i again inject the syscall to unprotect the stack again and afterwards i execute the instruction which caused the segfault and protect the stack again.
(This does indeed work for simple programs.)
My problem now is, that with this setup the target (pretty) randomly crashes in library function calls (no matter whether i use dynamic or static linking).
By crashing i mean, it either tries to access memory which for some reason is not mapped, or it just keeps hanging in the function __lll_lock_wait_private (that was following a malloc call).
Let me emphasis again, that the crashes don't always happen and don't always happen at the same positions.
It kind of sounds like an synchronisation problem but as far as i can tell (meaning i looked into /proc/pid/tasks/) there is only one thread running.
So do you have any clue what could be the reason for this?
Please tell me your suggestions even if you are not sure, i am running out of ideas here ...

It's also possible the non-determinism is created by address space randomization.
You may want to disable that to try and make the problem more deterministic.
EDIT:
Given that turning ASR off 'fixes' the problem then maybe the under-lying problem might be:
Somewhere thinking 0 is invalid when it should be valid, or visaversa. (What I had).
Using addresses from one run against a different run?

Related

FreeRTOS - Stack corruption on STM32F4

I am currently having problems with what I think is stack corruption of some error of configuration while running FreeRTOS on an STM32F407 target.
I have looked at FreeRTOS stack corruption on STM32F4 with gcc but got no help there.
The application runs two tasks and relies on one CAN interrupt. The workflow is as follows:
The two tasks, network_task and app_task is created along with two queues, raw_msg_queue and app_msg_queue. The CAN interrupt is also set up.
The network_task has the highest priority and starts waiting on the raw_msg_queue, indefinitely.
The app_task is next and starts waiting on the app_msg_queue.
The CAN interrupt then triggers because of an external event, adding a CAN message to the raw_msg_queue.
The network_task wakes up, process the message, adds the processed message to the app_msg_queue and then continues to wait on the raw_msg_queue.
The app_task wakes up and I get a hard fault.
The thing is that I have wrapped the calls that app_task makes to xQueueReceive in two steps because of end-user convenience and portability. The app_task total function chain is that it calls network_receive(..) -> os_queue_receive(..) -> xQueueReceive(..). This works well, but when it returns from xQueueReceive(..) it only manages to return to os_queue_receive(..) before it returns to a seemingly random memory location and i get a hard-fault.
The stack sizes should be adequate and are set to 2048 for both, all large data structures are passed around as pointers.
I am running my code on two STM32F407. FreeRTOS is at version 7.4.2, the latest at the time of writing.
I am really hoping that someone can help me out here!
First, you can take a look here and try to get more info about the hard fault.
You may also want to check your interrupt priority setting, as the tricky ARM Cortex-M interrupt priority mechanism causes some trouble in FreeRTOS. Refer to here.
I know this question is rather old, but perhaps this could help other people out facing a similar issue. In FreeRTOS, you can utilize the
void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
function to detect a stack overflow and grab relevent information about the offending task. It's possible that data would be corrupt due to the overflow, but you can atleast address the fact that an overflow occured (reset system, set error flag/LED, etc.)
For this specific question, I'd be curious to see the thread initialization code as well as the interrupt routine. If the problem is in fact an overflow, I think it would be fairly simply to adjust these parameters until the problem goes away. You mention 2048 bytes should be sufficient for each thread - if that's truly the case, I doubt the problem is an overflow. At that point, it's more likely you're dereferencing a dangling pointer to a stale memory address.

Hijacking sys calls

I'm writing a kernel module and I need to hijack/wrap some sys calls. I'm brute-forcing the sys_call_table address and I'm using cr0 to disable/enable page protection. So far so good (I'll make public the entire code once it's done, so I can update this question if somebody wants).
Anyways, I have noticed that if I hijack __NR_sys_read I get a kernel oops when I unload the kernel module, and also all konsoles (KDE) crash. Note that this doesn't happen with __NR_sys_open or __NR_sys_write.
I'm wondering why is this happening. Any ideas?
PS: Please don't go the KProbes way, I already know about it and it's not possible for me to use it as the final product should be usable without having to recompile the entire kernel.
EDIT: (add information)
I restore the original function before unloading. Also, I have created two test-cases, one with _write only and one with _read. The one with _write unloads fine, but the one with _read unloads and then crashes the kernel).
EDIT: (source code)
I'm currently at home so I can't post the source code right now, but if somebody wants, I can post an example code as soon as I get to work. (~5 hours)
This may be because a kernel thread is currently inside read - if calling your read-hook doesn't lock the module, it can't be unloaded safely.
This would explain the "konsoles" (?) crashing as they are probably currently performing the read syscall, waiting for data. When they return from the actual syscall, they'll be jumping into the place where your function used to be, causing the problem.
Unloading will be messy, but you need to first remove the hook, then wait for all callers exit the hook function, then unload the module.
I've been playing with linux syscall hooking recently, but I'm by no means a kernel guru, so I appologise if this is off-base.
PS: This technique might prove more reliable than brute-forcing the sys_call_table. The brute-force techniques I've seen tend to kernel panic if sys_close is already hooked.

fatal error disappeared when running with gdb

I have a program which produces a fatal error with a testcase, and I can locate the problem by reading the log and the stack trace of the fatal - it turns out that there is a read operation upon a null pointer.
But when I try to attach gdb to it and set a breakpoint around the suspicious code, the null pointer just cannot be observed! The program works smoothly without any error.
This is a single-process, single-thread program, I didn't experience this kind of thing before. Can anyone give me some comments? Thanks.
Appended: I also tried to call pause() syscall before the fatal-trigger code, and expected to make the program sleep before fatal point and then attach the gdb on it on-the-fly, sadly, no fatal occurred.
It's only guesswork without looking at the code, but debuggers sometimes do this:
They initialize certain stuff for you
The timing of the operations is changed
I don't have a quote on GDB, but I do have one on valgrind (granted the two do wildly different things..)
My program crashes normally, but doesn't under Valgrind, or vice versa. What's happening?
When a program runs under Valgrind,
its environment is slightly different
to when it runs natively. For example,
the memory layout is different, and
the way that threads are scheduled is
different.
Same would go for GDB.
Most of the time this doesn't make any
difference, but it can, particularly
if your program is buggy.
So the true problem is likely in your program.
There can be several things happening.. The timing of the application can be changed, so if it's a multi threaded application it is possible that you for example first set the ready flag and then copy the data into the buffer, without debugger attached the other thread might access the buffer before the buffer is filled or some pointer is set.
It's could also be possible that some application has anti-debug functionality. Maybe the piece of code is never touched when running inside a debugger.
One way to analyze it is with a core dump. Which you can create by ulimit -c unlimited then start the application and when the core is dumped you could load it into gdb with gdb ./application ./core You can find a useful write-up here: http://www.ffnn.nl/pages/articles/linux/gdb-gnu-debugger-intro.php
If it is an invalid read on a pointer, then unpredictable behaviour is possible. Since you already know what is causing the fault, you should get rid of it asap. In general, expect the unexpected when dealing with faulty pointer operations.

Is there a better way than parsing /proc/self/maps to figure out memory protection?

On Linux (or Solaris) is there a better way than hand parsing /proc/self/maps repeatedly to figure out whether or not you can read, write or execute whatever is stored at one or more addresses in memory?
For instance, in Windows you have VirtualQuery.
In Linux, I can mprotect to change those values, but I can't read them back.
Furthermore, is there any way to know when those permissions change (e.g. when someone uses mmap on a file behind my back) other than doing something terribly invasive and using ptrace on all threads in the process and intercepting any attempt to make a syscall that could affect the memory map?
Update:
Unfortunately, I'm using this inside of a JIT that has very little information about the code it is executing to get an approximation of what is constant. Yes, I realize I could have a constant map of mutable data, like the vsyscall page used by Linux. I can safely fall back on an assumption that anything that isn't included in the initial parse is mutable and dangerous, but I'm not entirely happy with that option.
Right now what I do is I read /proc/self/maps and build a structure I can binary search through for a given address's protection. Any time I need to know something about a page that isn't in my structure I reread /proc/self/maps assuming it has been added in the meantime or I'd be about to segfault anyways.
It just seems that parsing text to get at this information and not knowing when it changes is awfully crufty. (/dev/inotify doesn't work on pretty much anything in /proc)
I do not know an equivalent of VirtualQuery on Linux. But some other ways to do it which may or may not work are:
you setup a signal handler trapping SIGBUS/SIGSEGV and go ahead with your read or write. If the memory is protected, your signal trapping code will be called. If not your signal trapping code is not called. Either way you win.
you could track each time you call mprotect and build a corresponding data structure which helps you in knowing if a region is read or write protected. This is good if you have access to all the code which uses mprotect.
you can monitor all the mprotect calls in your process by linking your code with a library redefining the function mprotect. You can then build the necessary data structure for knowing if a region is read or write protected and then call the system mprotect for really setting the protection.
you may try to use /dev/inotify and monitor the file /proc/self/maps for any change. I guess this one does not work, but should be worth the try.
There sorta is/was /proc/[pid|self]/pagemap, documentation in the kernel, caveats here:
https://lkml.org/lkml/2015/7/14/477
So it isn't completely harmless...

What is a privileged instruction?

I have added some code which compiles cleanly and have just received this Windows error:
---------------------------
(MonTel Administrator) 2.12.7: MtAdmin.exe - Application Error
---------------------------
The exception Privileged instruction.
(0xc0000096) occurred in the application at location 0x00486752.
I am about to go on a bug hunt, and I am expecting it to be something silly that I have done which just happens to produce this message. The code compiles cleanly with no errors or warnings. The size of the EXE file has grown to 1,454,132 bytes and includes links to ODCS.lib, but it is otherwise pure C to the Win32 API, with DEBUG on (running on a P4 on Windows 2000).
To answer the question, a privileged instruction is a processor op-code (assembler instruction) which can only be executed in "supervisor" (or Ring-0) mode.
These types of instructions tend to be used to access I/O devices and protected data structures from the windows kernel.
Regular programs execute in "user mode" (Ring-3) which disallows direct access to I/O devices, etc...
As others mentioned, the cause is probably a corrupted stack or a messed up function pointer call.
This sort of thing usually happens when using function pointers that point to invalid data.
It can also happen if you have code that trashes your return stack. It can sometimes be quite tricky to track these sort of bugs down because they usually are hard to reproduce.
A privileged instruction is an IA-32 instruction that is only allowed to be executed in Ring-0 (i.e. kernel mode). If you're hitting this in userspace, you've either got a really old EXE, or a corrupted binary.
As I suspected, it was something silly that I did. I think I solved this twice as fast because of some of the clues in comments in the messages above. Thanks to those, especially those who pointed to something early in the app overwriting the stack. I actually found several answers here more useful than the post I have marked as answering the question as they clued and queued me as to where to look, though I think it best sums up the answer.
As it turned out, I had just added a button that went over the maximum size of an array holding some toolbar button information (which was on the stack). I had forgotten that
#define MAX_NUM_TOOBAR_BUTTONS (24)
even existed!
First probability that I can think of is, you may be using a local array and it is near the top of the function declaration. Your bounds checking gone insane and overwrite the return address and it points to some instruction that only kernel is allowed to execute.
The error location 0x00486752 seems really small to me, before where executable code usually lives. I agree with Daniel, it looks like a wild pointer to me.
I saw this with Visual c++ 6.0 in the year 2000.
The debug C++ library had calls to physical I/O instructions in it, in an exception handler.
If I remember correctly, it was dumping status to an I/O port that used to be for DMA base registers, which I assume someone at Microsoft was using for a debugger card.
Look for some error condition that might be latent causing diagnostics code to run.
I was debugging, backtracked and read the dissassembly. It was an exception while processing std::string, maybe indexing off the end.
The CPU of most processors manufactured in the last 15 years have some special instructions which are very powerful. These privileged instructions are kept for operating system kernel applications and are not able to be used by user written programs.
This restricts the damage that a user-written program can inflict upon the system and cuts down the number of times that the system actually crashes.
When executing in kernel mode, the operating system has unrestricted access to both the kernel and the user program's memory.
The load instructions for the base and limit registers are privileged instructions.

Resources