How to read a stack trace kernel-side in eBPF? - c

I would like to filter in my eBPF program on addresses in the stack:
for example, if the stack trace contains the address of _do_fork, then write to a map.
I have seen https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html#q-can-bpf-programs-access-stack-pointer saying that it isn't possible to get addresses. But I have also seen https://www.spinics.net/lists/netdev/msg497159.html, which says:
"The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through". So I'm confused.
The final question is: how can we get the addresses of a stack trace in-kernel with bpf_get_stack, if that is possible?
Thanks in advance.

It is possible to access the stack traces.
The first link you mention (bpf_design_QA) does not refer to the program being traced; it deals with the stack pointer used by the BPF program itself when performing the tracing operation. But as mentioned in the commit log for bpf_get_stack(), you can get access to the stack of the traced program.
There is some documentation for the BPF helpers, such as bpf_get_stack(), available online. You probably want to have a look at code samples using it too.
I don't have much experience with stack tracing myself, but it seems that very few tools actually use the bpf_get_stack() helper. Instead, tools from bcc like profile, or kernel samples like offwaketime (BPF side, user space side), generally use stack trace maps (BPF_MAP_TYPE_STACK_TRACE), so you may want to have a look at those too (bcc even offers a specific API for them).
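As a rough illustration of the filtering step itself, here is a minimal sketch in plain user-space C (it is not BPF code). It assumes the stack has already been captured as an array of 64-bit instruction addresses, which is the layout bpf_get_stack() writes for a kernel stack, and that the start address and length of _do_fork are already known. The function name stack_hits_range and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if any captured frame address lies inside the half-open range
 * [start, start + size), otherwise 0.
 * 'frames' stands in for the buffer a stack helper would have filled.
 */
int stack_hits_range(const uint64_t *frames, size_t nframes,
                     uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < nframes; i++) {
        /* Written this way to avoid overflow in start + size. */
        if (frames[i] >= start && frames[i] - start < size)
            return 1;
    }
    return 0;
}
```

Inside a real BPF program the same loop would need a compile-time upper bound on nframes to pass the verifier, and the map update would happen where this sketch returns 1.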

Related

Use execution trace on-chip buffer (ETB) on STM32H7

I need to output the on-chip buffer (ETB) execution trace in some particular cases. I'm talking about an operational functionality, not about ETM trace during debugging phase.
I've read Arm® CoreSight™ ETM-M7 Technical Reference Manual but there is almost no detail about using this ETB feature.
There is also this link on ARM Information center, but I found it particularly unclear.
How can I use the ETB?
EDIT: I clarified the situation a little thanks to a presentation from STMicro. It states that "The ETF can be used as a trace buffer for storing traces onchip. The trace can be read by software, or by the debugger,
or flushed via the trace port. If configured as a circular buffer,
the trace will be stored continuously, so the most recent trace
will overwrite the oldest. Alternatively, the FIFO full flag can
be used to stop a trace when the buffer is full, and hence
capture a trace at a particular point in time." So what I need to access is not the ETB but the ETF, which is done through a register (the FIFO is apparently not memory-mapped?).
Yes, the CoreSight Architecture and ETM trace are designed to enable this sort of crash analysis, particularly in realtime systems where crashes can be difficult to reproduce and you may not be able to have the target device hooked up to an external debug capture device all the time. ETM trace can be completely non-intrusive (except for the additional power consumption cost of having the logic active).
The architecture is quite generic, although each implementation will make different trade-offs about what is implemented. This unfortunately means that the documentation is quite spread-out. You might find this technical overview is useful for context (but not detail).
To achieve the crash analysis, you need to cover the following steps:
Configure ETF in circular buffer mode
Configure ETM to trace everything, with fairly frequent synchronisation
Disable the ETM after a crash (so the buffer is not overwritten)
Extract the trace from the crash (to SD card, for example)
Unpack any wrapping protocol added by the ETF
Decompress the trace (presumably offline)
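For the extraction step, the captured words have to be put back into oldest-to-newest order before any decoder can use them. The following is a minimal sketch of that reordering only, assuming the buffer contents, the current write index and a "has wrapped" flag have already been read out by some means. The function name linearize_ring and its parameters are hypothetical and do not model any ETF register.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy a circular trace buffer into 'out' in oldest-to-newest order.
 * 'wr' is the index of the next word to be written; 'wrapped' is nonzero
 * once the buffer has been overwritten at least once.
 * 'out' must have room for 'nwords' entries.
 * Returns the number of valid words copied.
 */
size_t linearize_ring(const uint32_t *ring, size_t nwords, size_t wr,
                      int wrapped, uint32_t *out)
{
    size_t n = 0;

    if (wrapped) {
        /* Oldest data starts at the write index and runs to the end. */
        for (size_t i = wr; i < nwords; i++)
            out[n++] = ring[i];
    }
    /* Newest data runs from the start of the buffer up to the write index. */
    for (size_t i = 0; i < wr; i++)
        out[n++] = ring[i];
    return n;
}
```

The linearized output still begins at an arbitrary point in the packet stream, so the decoder has to skip forward to the first synchronisation point.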
With a circular buffer, trace decompression can only start from a synchronisation point. The ETMv4 protocol uses variable-length packets, and rarely traces a full PC address value. You probably want about 4 synchronisation points in the buffer, so that at most roughly the oldest 25% of the buffer is lost.
Trace decompression relies on having the code image which was running - this shouldn't be too much of a problem in this use case.
If you can't buffer far enough back after a crash, it is possible to use the filtering logic in the ETM to exclude any code you know is not interesting. Depending on the nature of the crash, you might want timing information. You can set this with a threshold to get a tick in the trace every 100 cycles or so. You pay for the accuracy with buffer space, but it might be a great clue.
For programming the ETM, you want the ETMv4 architecture (it uses DWT comparators as 'processor comparator inputs' if you need filtering) and for the ETF I think it will be this technical reference manual. Check part_number in the Peripheral ID registers to make sure you have the right programmer's model.
Normally you use the ETB with a hardware debugger that supports it, such as Segger J-Trace or Keil uLinkPro. It is normally something for the tool vendor to worry about, and is not directly usable within your application.
The necessary trace pins (TRACED0 to TRACED3 and TRACECLK) need to be available on your debug header, and not multiplexed to some other function by your application.
The STM32H7 reference manuals contain a whole section on the "Trace and debug subsystem" (you have not specified the exact part, so you'll have to find it yourself). In RM0399 for the STM32H745/755 and STM32H747/757, which I am looking at, it occupies over 100 pages.

Get user stackpointer from task_struct

I have kcore and I want to get a userspace backtrace from it, because someone in our application is making a lot of munmap calls and making the system hang (CPU soft lockup 22s!). I looked at some macros, but they only give me the kernel backtrace. What I want is the userspace backtrace.
Good news is I have pointer to task_struct.
task_struct->thread->sp (Kernel stack pointer)
task_struct->thread->usersp (user stack pointer) but this is junk
My question is how to get userspace backtrace from kcore or task_struct.
First of all, a vmcore is an immediate full memory snapshot, so it contains all pages (including user pages). But if the user pages are swapped out, they can't be accessed. That is why kdump (and similar tools, such as your gdb Python script) focus on kernel debugging functionality only. For userspace debugging and stack traces you have to use the coredump functionality. By default, coredumps are produced when the kernel sends (for example) SIGSEGV to your app, but you can produce them whenever you want by using gcore or by modifying the kernel. There is also a "userspace" way of dumping a process; see the Google coredumper project.
Also, you can try to unwind the user stack trace directly from kcore, but this is tricky, and you will have to hope that the userspace stack was not swapped out at that moment (do you use swap?). Have a look at __save_stack_trace_user; it will give you a sense of how to retrieve the userspace context.
First of all vmcores typically don't contain user pages. I'm unaware of any magic which would help here - you would have to inspect vm mappings for given task address space and then inspect physical pages, and I highly doubt the debugger knows how to do it.
But most importantly you likely don't have any valid reason to do it in the first place.
So, what are you trying to achieve?
=======================
Given the edit:
some one from our application is making lot of munmap and making the
system hang(CPU soft lockup 22s!).
There may or may not be an actual kernel issue which must be debugged. I don't see any use for userspace stacktraces for this one though.
So, as I understand it, the presumed issue is excessive mmap + munmap calls from the application. Inspecting the backtrace of the thread reported with said lockup may or may not happen to catch the culprit. What you really want is to collect backtraces of /all/ callers and sort them by frequency. This can be done (albeit with pain) with systemtap.
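The "sort by frequency" part can be sketched independently of how the backtraces are collected. The following minimal C example assumes each observed backtrace has already been reduced to a single 64-bit identifier (for instance a hash of its frames). The names bt_count and rank_backtraces are hypothetical, and this is not systemtap code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct bt_count {
    uint64_t id;     /* hypothetical identifier of one distinct backtrace */
    size_t   count;  /* how many times it was observed */
};

static int cmp_id(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Higher counts first; ties are broken by ascending id. */
static int cmp_count_desc(const void *a, const void *b)
{
    const struct bt_count *x = a, *y = b;
    if (x->count != y->count)
        return (x->count < y->count) - (x->count > y->count);
    return (x->id > y->id) - (x->id < y->id);
}

/*
 * Sort 'ids' in place, collapse equal runs into 'out' (room for n entries),
 * then order the result by descending count.
 * Returns the number of distinct backtraces.
 */
size_t rank_backtraces(uint64_t *ids, size_t n, struct bt_count *out)
{
    size_t m = 0;

    qsort(ids, n, sizeof(ids[0]), cmp_id);
    for (size_t i = 0; i < n; i++) {
        if (m > 0 && out[m - 1].id == ids[i]) {
            out[m - 1].count++;
        } else {
            out[m].id = ids[i];
            out[m].count = 1;
            m++;
        }
    }
    qsort(out, m, sizeof(out[0]), cmp_count_desc);
    return m;
}
```

After the call, out[0] holds the most frequent caller, which is the first place to look for the excessive munmap traffic.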

Is there a way to dump the complete stack trace after normal execution of the binary?

I want the complete stack trace, mainly the list of functions traversed in a normal execution of a binary.
AFAIK, GDB provides the trace only when it hits a break point or in case of a crash.
That is called the call graph.
That would require either:
Instrumentation, i.e. adding code into each function to record when entering/leaving it
Profiling, i.e. sampling the program's state and recording which functions are detected
Emulation, i.e. running the program on a fake/virtual CPU and recording when jumps occur
Of the above, only the first one would provide 100% accuracy, and of course in general it's very hard to do, since you often use libraries and those wouldn't be instrumented even if you got your own code to be.
The reason this is hard is that the stack frame "history" isn't normally recorded; once the program has stopped running there is no current stack frame to inspect, unlike when breaking in a debugger.
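For the instrumentation approach, one option with GCC is the -finstrument-functions flag, which makes the compiler call __cyg_profile_func_enter and __cyg_profile_func_exit around every instrumented function. The following is a minimal sketch of such hooks that records entered function addresses in a fixed-size log. The log array, its size and the names entered and n_entered are assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_EVENTS 1024

/* Log of entered function addresses, in call order. */
void  *entered[MAX_EVENTS];
size_t n_entered;

/* The hooks themselves must not be instrumented, or they would recurse. */
void __cyg_profile_func_enter(void *fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *call_site)
    __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *call_site)
{
    (void)call_site;
    /* Drop events silently once the log is full. */
    if (n_entered < MAX_EVENTS)
        entered[n_entered++] = fn;
}

void __cyg_profile_func_exit(void *fn, void *call_site)
{
    /* Nothing is recorded on exit in this sketch. */
    (void)fn;
    (void)call_site;
}
```

Only code compiled with the flag is covered, so calls inside prebuilt libraries stay invisible, as noted above. The recorded addresses can be mapped back to function names offline with a tool such as addr2line.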
See also this question.
If your OS provides dtrace, you can use the PID provider:
pid Provider
The pid provider allows for tracing of the entry and return of any function in a user process ...

Ptrace mprotect debugging trouble

I'm having trouble with a research project.
What I am trying to do is use ptrace to watch the execution of a target process.
With the help of ptrace I inject an mprotect syscall into the target's code segment (similar to a breakpoint) and set the stack protection to PROT_NONE.
After that I restore the original instructions and let the target continue.
When I get an invalid-permission segfault, I inject the syscall again to unprotect the stack, then execute the instruction which caused the segfault, and then protect the stack again.
(This does indeed work for simple programs.)
My problem now is that with this setup the target crashes (pretty) randomly in library function calls (no matter whether I use dynamic or static linking).
By crashing I mean that it either tries to access memory which for some reason is not mapped, or it just keeps hanging in the function __lll_lock_wait_private (that was following a malloc call).
Let me emphasise again that the crashes don't always happen and don't always happen at the same positions.
It kind of sounds like a synchronisation problem, but as far as I can tell (meaning I looked into /proc/pid/tasks/) there is only one thread running.
So do you have any clue what could be the reason for this?
Please tell me your suggestions even if you are not sure; I am running out of ideas here ...
It's also possible the non-determinism is created by address space randomization.
You may want to disable that to try and make the problem more deterministic.
EDIT:
Given that turning ASR off 'fixes' the problem, the underlying problem might be:
Somewhere treating 0 as invalid when it should be valid, or vice versa. (What I had.)
Using addresses from one run against a different run?
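On Linux, one way for a tracer to switch randomization off for just the traced program is to set the ADDR_NO_RANDOMIZE persona flag in the child before it calls exec. The following is a minimal sketch of that call, assuming a Linux system with glibc. The wrapper name disable_randomization is hypothetical.

```c
#include <sys/personality.h>

/*
 * Set the ADDR_NO_RANDOMIZE persona flag on the calling process.
 * The flag is inherited across fork and exec, so a program exec'd
 * afterwards gets a non-randomized address space layout.
 * Returns 0 on success, -1 on failure.
 */
int disable_randomization(void)
{
    /* 0xffffffff queries the current persona without changing it. */
    int cur = personality(0xffffffff);

    if (cur == -1)
        return -1;
    if (personality((unsigned long)cur | ADDR_NO_RANDOMIZE) == -1)
        return -1;
    return 0;
}
```

In a ptrace-based tool this would typically be called in the forked child, before PTRACE_TRACEME and exec. Some sandboxes restrict the personality call, so the failure path matters.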

What is a privileged instruction?

I have added some code which compiles cleanly and have just received this Windows error:
---------------------------
(MonTel Administrator) 2.12.7: MtAdmin.exe - Application Error
---------------------------
The exception Privileged instruction.
(0xc0000096) occurred in the application at location 0x00486752.
I am about to go on a bug hunt, and I am expecting it to be something silly that I have done which just happens to produce this message. The code compiles cleanly with no errors or warnings. The size of the EXE file has grown to 1,454,132 bytes and includes links to ODCS.lib, but it is otherwise pure C to the Win32 API, with DEBUG on (running on a P4 on Windows 2000).
To answer the question, a privileged instruction is a processor op-code (assembler instruction) which can only be executed in "supervisor" (or Ring-0) mode.
These types of instructions tend to be used by the Windows kernel to access I/O devices and protected data structures.
Regular programs execute in "user mode" (Ring-3) which disallows direct access to I/O devices, etc...
As others mentioned, the cause is probably a corrupted stack or a messed up function pointer call.
This sort of thing usually happens when using function pointers that point to invalid data.
It can also happen if you have code that trashes your return stack. It can sometimes be quite tricky to track this sort of bug down because it is usually hard to reproduce.
A privileged instruction is an IA-32 instruction that is only allowed to be executed in Ring-0 (i.e. kernel mode). If you're hitting this in userspace, you've either got a really old EXE, or a corrupted binary.
As I suspected, it was something silly that I did. I think I solved this twice as fast because of some of the clues in comments in the messages above. Thanks to those, especially those who pointed to something early in the app overwriting the stack. I actually found several answers here more useful than the post I have marked as answering the question, as they cued me as to where to look, though I think the marked post best sums up the answer.
As it turned out, I had just added a button that went over the maximum size of an array holding some toolbar button information (which was on the stack). I had forgotten that
#define MAX_NUM_TOOBAR_BUTTONS (24)
even existed!
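A minimal sketch of the kind of guard that would have turned this overflow into a visible error instead of a corrupted stack. The struct toolbar, the function toolbar_add and the macro MAX_BUTTONS are hypothetical stand-ins, not the original application's code.

```c
#include <stddef.h>

#define MAX_BUTTONS 24  /* hypothetical stand-in for the real limit */

struct toolbar {
    int    ids[MAX_BUTTONS];
    size_t count;
};

/* Append a button id; return 0 on success, -1 if the array is full. */
int toolbar_add(struct toolbar *tb, int id)
{
    if (tb->count >= MAX_BUTTONS)
        return -1;          /* refuse instead of writing past the end */
    tb->ids[tb->count++] = id;
    return 0;
}
```

With a check like this, adding a twenty-fifth button fails at the call site rather than overwriting whatever sits next to the array on the stack.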
The first possibility I can think of is that you are using a local array declared near the top of the function. If your bounds checking has gone wrong, a write past the end can overwrite the return address, which then points to some instruction that only the kernel is allowed to execute.
The error location 0x00486752 seems really small to me, before where executable code usually lives. I agree with Daniel, it looks like a wild pointer to me.
I saw this with Visual C++ 6.0 in the year 2000.
The debug C++ library had calls to physical I/O instructions in it, in an exception handler.
If I remember correctly, it was dumping status to an I/O port that used to be for DMA base registers, which I assume someone at Microsoft was using for a debugger card.
Look for some error condition that might be latent causing diagnostics code to run.
I was debugging, backtracked and read the disassembly. It was an exception while processing std::string, maybe indexing off the end.
Most CPUs manufactured in the last 15 years have some special instructions which are very powerful. These privileged instructions are reserved for the operating system kernel and cannot be used by user-written programs.
This restricts the damage that a user-written program can inflict upon the system and cuts down the number of times that the system actually crashes.
When executing in kernel mode, the operating system has unrestricted access to both the kernel and the user program's memory.
The load instructions for the base and limit registers are privileged instructions.

Resources