Get user stackpointer from task_struct

Get user stackpointer from task_struct - c

I have kcore and I want to get userspace backtrace from kcore. Because some one from our application is making lot of munmap and making the system hang(CPU soft lockup 22s!). I looked at some macro but still this is just giving me kernel backtrace only. What I want is userspace backtrace.
Good news is I have pointer to task_struct.
task_struct->thread->sp (Kernel stack pointer)
task_struct->thread->usersp (user stack pointer) but this is junk
My question is how to get userspace backtrace from kcore or task_struct.

First of all, vmcore is a immediate full memory snapshot, so it contains all pages (including user pages). But if the user pages are swapped out, they couldn't be accessed. So that is why kdump (and similar tools as your gdb python script) focused on kernel debugging functionality only. For userspace debugging and stacktraces you have to use coredump functionality. By default the coredumps are produced when kernel sends (for example) SIGSEGV to your app, but you can make them when you want by using gcore of modifying kernel. Also there is a "userspace" way of making process dump, see google coredumper project
Also, you can try to unwind user stacktrace directly from kcore - but this is a tricky way, and you will have to hope that userspace stacktrace is not swapped out at the moment. (do you use a swap?) You can see __save_stack_trace_user, it will make sense of how to retrieve userspace context

First of all vmcores typically don't contain user pages. I'm unaware of any magic which would help here - you would have to inspect vm mappings for given task address space and then inspect physical pages, and I highly doubt the debugger knows how to do it.
But most importantly you likely don't have any valid reason to do it in the first place.
So, what are you trying to achieve?
=======================
Given the edit:
some one from our application is making lot of munmap and making the
system hang(CPU soft lockup 22s!).
There may or may not be an actual kernel issue which must be debugged. I don't see any use for userspace stacktraces for this one though.
So as I understand presumed issue is excessive mmap + munmap calls from the application.Inspecting the backtrace of the thread reported with said lockup may or may not happen to catch the culprit. What you really want is to collect backtraces of /all/ callers and sort them by frequency. This can be done (albeit with pain) with systemtap.

Related

How to read stack trace kernelside in ebpf?

I would like to filter my ebpf with address in stack,
by example if stack trace contain the address of _do_fork then write to map.
I seen this https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html#q-can-bpf-programs-access-stack-pointer saying that it isn't possible to get adresses. But I also seen this https://www.spinics.net/lists/netdev/msg497159.html
"The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through". So I'm confused.
The final question is how we can get adresses of stack trace in-kernel with bpf_get_stack, if it is possible?
thanks in advance

It is possible to access the stack traces.
The first link you mention (bpf_design_QA) does not refer to the program being traced, it deals with the stack pointer used by the BPF program itself when performing the tracing operation. But as mentioned in the commit log for bpf_get_stack(), you can get access to the stack.
There is some documentation for the BPF helpers, such as bpf_get_stack(), available online. You probably want to have a look at code samples using it too.
I don't have much experience myself with tracing stack, but it seems that very few tools doing so are actually using thie bpf_get_stack() helper. Instead, tools from bcc like profile or from kernel samples like offwaketime (BPF side, user space side) are generally using stack trace maps (BPF_MAP_TYPE_STACK_TRACE), so you may want to have a look at this too (bcc even offers a specific API for them).

Is there a way to dump the complete stack trace after normal execution of the binary?

I want the complete stack trace, mainly the list of functions traversed in a normal execution of a binary.
AFAIK, GDB provides the trace only when it hits a break point or in case of a crash.

That is called the call graph.
That would require either:
Instrumentation, i.e. adding code into each function to record when entering/leaving it
Profiling, i.e. sampling the program's state and recording which functions are detected
Emulation, i.e. running the program on a fake/virtual CPU and recording when jumps occur
Of the above, only the first one would provide 100% accuracy, and of course in general its very hard to do since you often use libraries and those wouldn't be instrumented even if you got your own code to be.
The reason this is hard is that the stack frame "history" isn't normally recorded; once the program has stopped running there is no current stack frame to inspect, unlike when breaking in a debugger.
See also this question.

If your OS provides dtrace, you can use the PID provider:
pid Provider
The pid provider allows for tracing of the entry and return of any function in a user process ...

Hijacking sys calls

I'm writing a kernel module and I need to hijack/wrap some sys calls. I'm brute-forcing the sys_call_table address and I'm using cr0 to disable/enable page protection. So far so good (I'll make public the entire code once it's done, so I can update this question if somebody wants).
Anyways, I have noticed that if I hijack __NR_sys_read I get a kernel oops when I unload the kernel module, and also all konsoles (KDE) crash. Note that this doesn't happen with __NR_sys_open or __NR_sys_write.
I'm wondering why is this happening. Any ideas?
PS: Please don't go the KProbes way, I already know about it and it's not possible for me to use it as the final product should be usable without having to recompile the entire kernel.
EDIT: (add information)
I restore the original function before unloading. Also, I have created two test-cases, one with _write only and one with _read. The one with _write unloads fine, but the one with _read unloads and then crashes the kernel).
EDIT: (source code)
I'm currently at home so I can't post the source code right now, but if somebody wants, I can post an example code as soon as I get to work. (~5 hours)

This may be because a kernel thread is currently inside read - if calling your read-hook doesn't lock the module, it can't be unloaded safely.
This would explain the "konsoles" (?) crashing as they are probably currently performing the read syscall, waiting for data. When they return from the actual syscall, they'll be jumping into the place where your function used to be, causing the problem.
Unloading will be messy, but you need to first remove the hook, then wait for all callers exit the hook function, then unload the module.
I've been playing with linux syscall hooking recently, but I'm by no means a kernel guru, so I appologise if this is off-base.
PS: This technique might prove more reliable than brute-forcing the sys_call_table. The brute-force techniques I've seen tend to kernel panic if sys_close is already hooked.

Ptrace mprotect debugging trouble

I'm having trouble with an research project.
What i am trying to is to use ptrace to watch the execution of a target process.
With the help of ptrace i am injecting a mprotect syscall into the targets code segment (similar to a breakpoint) and set the stack protection to PROT_NONE.
After that i restore the original instructions and let the target continue.
When i get an invalid permisson segfault i again inject the syscall to unprotect the stack again and afterwards i execute the instruction which caused the segfault and protect the stack again.
(This does indeed work for simple programs.)
My problem now is, that with this setup the target (pretty) randomly crashes in library function calls (no matter whether i use dynamic or static linking).
By crashing i mean, it either tries to access memory which for some reason is not mapped, or it just keeps hanging in the function __lll_lock_wait_private (that was following a malloc call).
Let me emphasis again, that the crashes don't always happen and don't always happen at the same positions.
It kind of sounds like an synchronisation problem but as far as i can tell (meaning i looked into /proc/pid/tasks/) there is only one thread running.
So do you have any clue what could be the reason for this?
Please tell me your suggestions even if you are not sure, i am running out of ideas here ...

It's also possible the non-determinism is created by address space randomization.
You may want to disable that to try and make the problem more deterministic.
EDIT:
Given that turning ASR off 'fixes' the problem then maybe the under-lying problem might be:
Somewhere thinking 0 is invalid when it should be valid, or visaversa. (What I had).
Using addresses from one run against a different run?

Is there a better way than parsing /proc/self/maps to figure out memory protection?

On Linux (or Solaris) is there a better way than hand parsing /proc/self/maps repeatedly to figure out whether or not you can read, write or execute whatever is stored at one or more addresses in memory?
For instance, in Windows you have VirtualQuery.
In Linux, I can mprotect to change those values, but I can't read them back.
Furthermore, is there any way to know when those permissions change (e.g. when someone uses mmap on a file behind my back) other than doing something terribly invasive and using ptrace on all threads in the process and intercepting any attempt to make a syscall that could affect the memory map?
Update:
Unfortunately, I'm using this inside of a JIT that has very little information about the code it is executing to get an approximation of what is constant. Yes, I realize I could have a constant map of mutable data, like the vsyscall page used by Linux. I can safely fall back on an assumption that anything that isn't included in the initial parse is mutable and dangerous, but I'm not entirely happy with that option.
Right now what I do is I read /proc/self/maps and build a structure I can binary search through for a given address's protection. Any time I need to know something about a page that isn't in my structure I reread /proc/self/maps assuming it has been added in the meantime or I'd be about to segfault anyways.
It just seems that parsing text to get at this information and not knowing when it changes is awfully crufty. (/dev/inotify doesn't work on pretty much anything in /proc)

I do not know an equivalent of VirtualQuery on Linux. But some other ways to do it which may or may not work are:
you setup a signal handler trapping SIGBUS/SIGSEGV and go ahead with your read or write. If the memory is protected, your signal trapping code will be called. If not your signal trapping code is not called. Either way you win.
you could track each time you call mprotect and build a corresponding data structure which helps you in knowing if a region is read or write protected. This is good if you have access to all the code which uses mprotect.
you can monitor all the mprotect calls in your process by linking your code with a library redefining the function mprotect. You can then build the necessary data structure for knowing if a region is read or write protected and then call the system mprotect for really setting the protection.
you may try to use /dev/inotify and monitor the file /proc/self/maps for any change. I guess this one does not work, but should be worth the try.

There sorta is/was /proc/[pid|self]/pagemap, documentation in the kernel, caveats here:
https://lkml.org/lkml/2015/7/14/477
So it isn't completely harmless...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight