Debug printf hanging kernel — possible causes and solutions?

Debug printf hanging kernel — possible causes and solutions? - arm

There are numerous parts in kernel where putting printf can lead to hanging of kernel (especially at early boot). (On a side note, it comes from my experience the same situation can have place with printk at early stages of boot of Linux). It can have numerous reasons, but sometimes reason is not obvious. You just put printf in a function that there is no way it is executed more than once during boot (or at least just a few times, not hundreds of times). It could still have some timing issues?
What are the typical causes of kernel boot hanging by adding printf in code?
As far as it comes for solution, you can probably create your own data structure and push your "stack trace" there and then print it at some point. But finding proper point to print it could be problematic. Can there be any possible policy to apply in such problematic early stages of boot? Or we are doomed to do detailed step by step analysis every time we encounter such a problem?

Related

What Can Cause a C Program to Crash Operating System

I recently found that a fairly large image manipulation program I'm writing in C on a Windows 8 machine has a bug when used in very particular circumstances. Unfortunately, the bug is causing my entire computer to come to a standstill so that my only option is to pull the plug on the computer (especially annoying when I'm working remotely...)
Because it's an image manipulation program, I can't just flood it with print statements to isolate the problematic section - the problem occurs somewhere in a loop that's called billions of times, so adding a printf slows it down to the point that it would take days to get to a failing iteration.
I understand, therefore, if this question is too broad, as it isn't really reasonable for me to put down all of the code that could cause my problem, I'm simply asking
What are the circumstances in which C code can, instead of seg faulting or halting the program, actually freeze the entire OS
When I search the problem, I see code golf questions like this
A C program which crashes the system(shuts down the system)
This is not what I'm asking - obviously I haven't written system("shutdown") anywhere in my loop.
Being most familiar with python and java, this problem is not what I'm used to, but in my experience,
Dividing by zero produces a seg fault
Accessing memory by accident that is slightly outside an intended array causes a seg fault (sometimes down the road a little)
Accessing protected memory causes the program to hang
Stack overflow causes a seg fault
Dereferencing a non-initialized pointer causes a seg fault
Is this impression false - could those cases cause the whole system to crash? What cases am I missing? Is it dependent on my version of gcc, or my permission status?
I haven't been able to try to reproduce it on a different operating system yet, as it requires a few dependencies to run the entire program.
If my only option is to sit for days waiting for the program to run with print statements, or avoid weird situations, then, of course, so be it. I'm looking for key places to look for the bug.

On modern systems with hardware-enforced privilege separation between user-mode and kernel-mode, and an operating system that functions to correctly configure these mechanisms, you simply cannot crash the system from a user mode process.
Any of those errors are trapped by the CPU, which call exception handlers in the OS which will quickly pull the plug on your system.
If I had to guess, a piece of hardware is overheating or malfunctioning:
Overheating CPU due to poor thermal conductivity with heatsink
Failing / under-sized power supply
Failing DIMMs
Failing hard drive
Failing CPU
Failing / overheating GPU
I've seen cryptocoin-mining software bring a system to its knees because it was pushing the limits of the GPU. When the card would lock-up/reset, the driver would get confused or lock-up, and the system would end up needed rebooted.
Your system is doing next to nothing when you're just sitting there browsing the web, etc. But if your system locks up when you start running a CPU-intensive application, it can bring out problems that you didn't know where there.
While this is a little out-of-place on Stack Overflow, it falls into one of those grey areas between hardware and software. I would stress-test your system, keeping an eye on CPU/GPU/memory temperatures, and power supply voltages. Check out MemTest86, Stresslinux.

The most trivial cause of OS freezing is "memory full". If you have processes that use a lot of memory, then your system is going to swap from main memory (typically RAM) to secondary memory (typically disk) which lead to a very huge overhead... As a user what you usually observe is a almost freezed computer, sometimes so freezed that you think it is crashed. If your OS is badly designed then it sometimes crashes!

Is there any tool which detects race conditions at runtime and rewrites code to avoid them in future

I have seen tools which can run a multithreaded program deterministically, even in the presence of race conditions. Now I wonder if there is any tool which can actually detect races and rewrite the code (at runtime) to not have the detected races in the future.
Does such kind of a tool exists? Or is it too hard to create one? I think it can be created with a tool that does dynamic binary translation of code, such as PIN or valgrind.

A race condition means that the outcome of your computation depends on the timing of events occurring (interrupts, scheduler, etc.). What developers typically mean (what what I assume you also mean) is that the program is "correct" most of the time and only sometimes fails - a bug triggered by a race condition which only happens rarely.
How is some automated detection algorithm supposed to know what is the desired outcome and what not?
And even if it could do that: Knowing that it happened and figuring out how and then fixing it is usually way more difficult. I'm sure you can reduce this problem to the halting problem.
At work we have played around with some frameworks which allow to create unit tests to detect race conditions (I can have a look when I'm back at work what it is called), but it's based on enumerating all possible thread schedules. As a simple test we have let it run on a concurrent queue implemention and the test case with 1 consumer and 1 producer and a queue capacity of 1 took a few seconds to run. Simply increasing the queue capacity to 2 made it run for several days. Might be that the tool wasn't too good but it shows the number of possible combinations explodes really fast.

While running a program in a deterministic manner is too much to ask, detecting some (if not all) race conditions is not difficult.
A lot of research is presently happening in this area. And you are on the right track with
PIN -------> Intel Parallel Studio
Valgrind --> Cachegrind and ThreadSanitizer 1.0
There are many open source and commercial tools. I used to work on 1
The way such tools work is that they track lock and other information per memory access and maintain a history of previous accesses. Every access is compared with history of previous accesses. There is more to it.

I turned on compiler optimization and my multithreaded C program imploded aggressively, any articles I can read on this?

I'm using MinGW, which is gcc for Windows. My program involves multiple windows, two different main threads, and several worker threads in a thread pool for overlapped network I/O.
It works perfectly fine without compiler optimization.
A) Is compiler optimization even necessary? My program's already very fast. Is it at all likely that it will provide a significant improvement?
B) Are there any articles on how to properly build a multthreaded program so compiler optimization can do its job?

“Imploded aggressively” is a bit weird (is your program a controller for a fission bomb?), but I understand that your program behaved as desired without compiler optimizations and mysteriously with compiler optimizations.
The technical term for this is that your program is buggy.
Multithreaded programming is intrinsically hard. Multithreaded programming when the threads share memory is very hard; it's the masochist way of concurrent programming (message passing is a lot easier to get right). You don't just need to read an article or two, you need to read several books and get a few years' programming experience.
You were unlucky that your program seemed to work without optimizations. It probably wouldn't work on a different machine where the timings are a bit different, or with a different compiler, or on a different operating system, either. So you ended up wasting your time thinking your program worked. But it doesn't. A compiler transforms correct source code into correct executables, no matter what optimization level you choose.¹
¹ Barring compiler bugs, sure. But the odds are very strongly stacked against you.

99.9% of all household failures in one optimization mode and not another are due to serious bugs. Multithreading races etc. are very sensitive to code performance. An instruction reorder or loop shortcut can turn a test pass into a debugging nightmare.
I'm assuming that the server runs up OK and detonates under load in aparrently different places, so making conventional debugging useless?
You are going to have to rely on logging and changing the test conditions to narrow down the point of ignition. My guess is this is going to be a Heisenbug that mutates with changes to the code, optimization, options, load profile, buffer sizes etc.
Not fixing the problem is not a good plan since it wil just show up in another form on next years boxes with more cores etc. Even with optimization off, it's still there, lurking, waiting for the opportunity to strike.
I hope I'm providing some comfort.
Seriously - log everything you can with a good logger - one that queues up the logs so as to keep disk latency out of the main app. Change things around to try and make the bug mutate and perhaps show up in the non-optimized build too. Write down, (type in), absolutely everything that you do amd what happens after any change, good or bad. Making the bug worse is actually better than making its symptoms go away, (without knowing exactly why). Try the server on various hardware configs, if you can.
Eventually, you will find the bug!
You have one thing going for you - it seems that you can reliably reproduce the problem. That, in itself, is a massive plus.
Forgot to ask - apart from the nuclear explosive metaphor, what is the main symptom? Is it AV'ing/segfaulting all over the place, or is it locked or livelocked up?

To answer part "A" of your question, the unoptimized version of your code still has the concurrency bugs in it, but the timing of how the threads run is such that the bugs have not yet been exposed with your test workloads. The current version of the unoptimized program will eventually fail in use, so you will need to fix the concurrency bugs before using the program for real work.

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However their is something strange and I don't know if it's an anomaly or if that's normal. I get different read outs each time I run my CUDA app.
Could these readings by related to design or is that something I should expect.
I'm not running any graphic intensive applications on my machine, other than Windows 7.

Well it depends how big the differences are. One thing you can see anomalies caused by is the kernel scheduler. It may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved) which shows more execution time. If the differences are very large I would say check your code but if it's very low in orders of milliseconds I wouldn't worry about it +- 10msecs is the usual for the timeslicing quantum in most OS's (windows probably included).
Also Aero is kind of intensive so that may be adding to the discrepancies you are seeing.

I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

Profiling network software / Profiling software with lot of system call waiting

I'm working on a complex network software and I have trouble determining how to improve the systems performance.
Specifically in one part of the software which is using blocking synchronous calls. Since this part of the system is doing heavy computations it's nearly impossible to determine whether the slowness of this component is caused by these computations or the waiting for the other parts of the system.
Are there any light-weight profilers that can capture this information? I can't use heavy duty profile like valgrind since that would completely skew the results (although valgrind would be perfect, since it captures all the required information).
I tried using oProfile but I just wasn't able to get any meaningful results out of it (perhaps if there is a concise tutorial somewhere...).

What you need is something that gives you stack samples, on wall-clock time (not just CPU time like gprof), and reports by line (not just by function) the percent of samples containing the line.
Zoom will do it,
but I just do random-pausing. Here's why it works.
Here's a blow-by-blow example.
Here's another explanation.

Comment out your "heavy computations" and see if it's still slow. That will tell you if it's waiting on other systems over the network or the computations. The answer may not be either/or and may just be an accumulation of things.

You could also do some old fashioned printf debugging and print the time before and after executing the function to standard output or syslog. That is about as light-weight as profiling gets.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight