I am a new coder in C, having recently moved over from Python, but I still like to challenge myself with fairly ambitious projects (like a chess program), and I have found that my computer suffers an unusual number of BSODs, both when I am running a program and when I am not (admittedly, attempting to use the entirety of my memory as a hash table may not have been the greatest idea).
So my question is: are these most likely caused by my crappy C code, or is it more likely that my 3-year-old, overworked laptop is the culprit?
If it could be the code, what are the big things I should avoid doing so as to prevent this?
A BSOD usually contains some information about what caused it.
What information it contains, and how exactly it is displayed depends on the version of Windows you are running.
As can be seen from the list here:
https://hetmanrecovery.com/recovery_news/bsod-errors
Most BSOD errors come from device / driver / kernel code, and not from your typical userland program.
That said, it might be possible to trigger a BSOD if your code uses particularly low-level Windows APIs, especially if you run it with administrator privileges.
Note that simply filling up memory will cause allocations in your program to fail, and possibly crash your program, but not the whole OS.
Windows also places limits on how much memory an individual process can allocate.
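To make that concrete, here is a minimal sketch (mine, not from the question) of what memory exhaustion looks like from user space: malloc() starts returning NULL once the process hits its limit, and the program fails on its own while the OS keeps running. (On Linux, memory overcommit can complicate this picture, but the principle stands.)

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate until the OS refuses.  The process fails; the OS does not. */
int main(void) {
    size_t total_mib = 0;
    for (;;) {
        void *p = malloc(64 * 1024 * 1024);   /* request 64 MiB at a time */
        if (p == NULL) {
            fprintf(stderr, "allocation failed after %zu MiB\n", total_mib);
            return 1;                         /* graceful, local failure */
        }
        total_mib += 64;
    }
}
```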
One final note:
"3 year old laptop" does not provide enough information to tell anything about your hardware, since there are different tiers of laptops available, and some of the high end 3 year old ones will still be better performing then a mid tier one bought yesterday.
As a troubleshooting measure, I would recommend backing up your data, making a clean install of your OS (aka "format the machine"), then making sure all your drivers are up to date.
You may also want to try hardware diagnostic tools such as MemTest86, check the SMART data on your storage, etc.
It's not supposed to be possible for anything you do in an ordinary "user space" program to crash the whole computer. Something else must also be wrong. Here are some possibilities:
If you are making the computer do CPU- and RAM-intensive work for long periods, you may stress the hardware to the point where a marginally defective component fails. Usually it's either the RAM, the power supply, or the cooling fans at fault.
Make sure your power supply is rated for all of the kit you have, running simultaneously. Make sure you have enough airflow for the amount of heat you're generating; check for dust-clogged heatsinks and fans that aren't actually spinning. If you have more than one RAM stick, take one out at a time and see if that makes the problem disappear.
I'd like to tell you to get error-correcting RAM if you don't have it already, but for infuriating market differentiation reasons you'd have to replace the motherboard and CPU as well. It's still worth doing, in the long run, but it amounts to replacing the whole computer.
You may be tickling a bug in the OS or the drivers. The most probable culprit is the GPU driver, particularly if your program does anything graphical. Regrettably, all you can do about this is make sure you're fully patched up.
I understand that in the past with C you could screw up pointers and memory allocation, and potentially accidentally corrupt other running programs or the operating system itself outside of your program, and crash the system. This would require a restart to pick up the pieces and continue with program development.
Have system security improvements stopped this from happening?
In the past with MS-DOS, Windows 3.1/95/98/Me, and Mac OS prior to version 10 (generally, before preemptive multitasking became the norm for everything), system security generally did not exist. Programs had full control to write data anywhere at any time.
But now with more modern system design and process security, programs generally are blocked by system security from accidentally or intentionally damaging anything else.
The execute-disable feature of modern processors may also be helping with preventing accidentally jumping to a random memory location and running whatever is there as processor machine code.
So how badly can you screw up with modern programming with C without attempting to hack the operating system security?
Can you still manage to accidentally crash the whole system? I assume this is no longer possible. The kernel or other system security steps in and halts the action.
Can you corrupt your login environment and have to log out and back in again? I assume this too is prevented, as processes should not normally have access to other process memory space, even within your own login security environment.
In general it seems like programming in C may now be much easier than it was in the past just due to these system protections that are now used everywhere, to keep you from shooting yourself and the system in the foot.
In the realm of what you can accidentally do, things have certainly gotten better than in the MS-DOS days. I remember a bug that corrupted the in-memory disk cache. I was lucky to have anything left after that one.
Now, unless you're writing C code that runs inside the kernel, that's not possible to do anymore. Nor is crashing the OS in general unless you are actively trying to exploit a flaw.
The various other protections, like the NX bit and similar attempts to erect a bit of a security fence around C programs, do make your program crash faster and at a place closer to where the error really happened. But they aren't anywhere near the level of win you got from simple separated address spaces. They are designed to make active attempts at exploitation much more difficult, and they're far better at that than at catching accidents or mistakes.
Corrupting the login environment, as distinct from the OS as a whole, is also statistically unlikely. Though if your program does sophisticated file manipulation, a bug could mess up the user's files.
And, short of actually crashing the OS, you can accidentally cause resource exhaustion. While this can often be recovered from, it's very uncomfortable while it's going on: the system slows to a crawl and may not even be able to launch processes. Linux has some protection against this, and if you put your development environment in a cgroup, you can prevent it completely.
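As a lighter-weight cousin of the cgroup approach (a different technique than the one described above, and only a sketch, assuming a POSIX system), a process can also cap its own address space with setrlimit(), so a runaway allocation fails locally with NULL instead of dragging the whole machine into swap:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Cap this process's virtual address space at 512 MiB so a leak or
 * runaway allocation fails here instead of exhausting the machine. */
int main(void) {
    struct rlimit lim = { 512UL * 1024 * 1024, 512UL * 1024 * 1024 };
    if (setrlimit(RLIMIT_AS, &lim) != 0) {
        perror("setrlimit");
        return 1;
    }
    void *p = malloc(1UL << 30);   /* 1 GiB: exceeds the cap */
    printf("1 GiB allocation %s\n", p ? "succeeded" : "failed, as intended");
    free(p);
    return 0;
}
```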
Of course, anything that an active exploit could do, you could do by mistake. But, I'm talking about the statistical likelihood of doing these things by accident.
Probably the biggest improvement since separated address spaces is tools like Valgrind that monitor your program on the fly for out-of-bounds accesses, accesses to freed memory, and the like.
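As a contrived illustration (my example, not part of the answer), both bug classes fit in a few lines. Run natively, this program may well appear to work; run under `valgrind ./a.out`, both errors are reported at the exact offending lines:

```c
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(16);
    if (buf == NULL)
        return 1;
    memset(buf, 0, 17);   /* heap overflow: writes one byte past the block */
    free(buf);
    buf[0] = 'x';         /* use after free */
    return 0;
}
```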
MS-DOS and early Windows were rather weird for their use as general purpose computers with high-quality C development environments and also having such promiscuous memory. It's taken Windows a long time to outgrow that, and programming practices on the platform are still a little weird.
Short answer: yes.
On computers that use virtual memory and keep data and code separate, it is much harder to write catastrophic bugs as you'll get a hardware exception (that the OS translates into something softer). So you can't have bugs that run off to overwrite your own executable code, or have runaway code that starts executing random "op codes" from the data segment. The bug will still be there of course, and needs to be fixed. But it will be a whole lot less mysterious and not nearly as disastrous.
Computers that don't have these security features require more care and testing by the programmer.
Crashing the whole system is still quite possible, but mostly this is because of bugs in the OS and API. The occasional "blue screen of death" in Windows is still a thing. With some effort you could also lag down the whole computer by using 100% CPU or by allocating excessive amounts of heap memory, turning it next to useless.
TL;DR at end in bold if you don't want rationale/context (which I'm providing since it's always good to explain the core issue and not simply ask for help with Method X which might not be the best approach)
I frequently do software performance analysis on older hardware, which surfaces race conditions, single-frame graphical glitches, and other issues more readily than more modern silicon does.
Often, it would be really cool to be able to take screenshots of a misbehaving application that might render garbage for one or two frames or display erroneous values for a few fractions of a second. Unfortunately, problems most frequently arise when the systems in question are swapping heavily to disk, making it consistently unlikely that the screenshots I try to take will contain the bugs I'm trying to capture.
The obvious solution would be a capture device, and I definitely want to explore pixel-perfect image and video recording in the future when I have the resources for that (it sounds like a hugely fun opportunity to explore FPGAs).
I recently realized, however, that the kernel is what performs the swapping, so if I move screenshotting into kernelspace, I don't have to wait for my screenshot keystroke to make its way through the X input layer into the screenshot program, and then wait for that program to do its XSHM dance and fetch the screenshot data, all while the system is heavily I/O-loaded (e.g., a 5-second system load of >10). I can simply have the kernel memcpy() the displayed area of video memory to a preallocated buffer at the exact fraction of a second I hit PrtSc!
TL;DR: Where should I start looking to figure out how to "portably" (within the sense of Linux having different graphics drivers, each with different architectural designs) access the currently-displayed area of video memory?
I get the impression I should be looking at libdrm, possibly within KMS, but I would really appreciate some pointers to knowing what actually accesses video memory.
I'm also guessing there are probably some caveats and gotchas to reading video memory directly on certain chipsets? I don't expect my code to make it into the Linux kernel (who knows, but I doubt it) but I'd still like whatever I build to be fairly portable across computers for convenience.
NOTE: I am not using compositing with the systems in question, in case this changes anything. I'm interested to know whether I could write a compositing-compatible system; I suspect this would be nontrivial.
I recently found that a fairly large image manipulation program I'm writing in C on a Windows 8 machine has a bug when used in very particular circumstances. Unfortunately, the bug is causing my entire computer to come to a standstill so that my only option is to pull the plug on the computer (especially annoying when I'm working remotely...)
Because it's an image manipulation program, I can't just flood it with print statements to isolate the problematic section - the problem occurs somewhere in a loop that's called billions of times, so adding a printf slows it down to the point that it would take days to get to a failing iteration.
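(The cheapest instrumentation I can think of is masked logging, which costs one bitwise AND per pass and prints only every ~16 million iterations; a sketch with stand-in numbers:)

```c
#include <stdio.h>

int main(void) {
    const unsigned long long total = 1ULL << 30;   /* stand-in for the real workload */
    for (unsigned long long iter = 0; iter < total; ++iter) {
        /* ... per-pixel work would go here ... */
        if ((iter & ((1ULL << 24) - 1)) == 0)      /* one AND per iteration */
            fprintf(stderr, "reached iteration %llu\n", iter);
    }
    return 0;
}
```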
I understand, therefore, if this question is too broad, as it isn't really reasonable for me to put down all of the code that could cause my problem. I'm simply asking:
What are the circumstances in which C code can, instead of segfaulting or halting the program, actually freeze the entire OS?
When I search the problem, I see code golf questions like this
A C program which crashes the system (shuts down the system)
This is not what I'm asking - obviously I haven't written system("shutdown") anywhere in my loop.
Being most familiar with python and java, this problem is not what I'm used to, but in my experience,
Dividing by zero produces a seg fault
Accessing memory by accident that is slightly outside an intended array causes a seg fault (sometimes down the road a little)
Accessing protected memory causes the program to hang
Stack overflow causes a seg fault
Dereferencing a non-initialized pointer causes a seg fault
Is this impression false - could those cases cause the whole system to crash? What cases am I missing? Is it dependent on my version of gcc, or my permission status?
I haven't been able to try to reproduce it on a different operating system yet, as it requires a few dependencies to run the entire program.
If my only option is to sit for days waiting for the program to run with print statements, or avoid weird situations, then, of course, so be it. I'm looking for key places to look for the bug.
On modern systems with hardware-enforced privilege separation between user mode and kernel mode, and an operating system that correctly configures these mechanisms, you simply cannot crash the system from a user-mode process.
Any of those errors is trapped by the CPU, which invokes an exception handler in the OS that will quickly pull the plug on your process, not on the system as a whole.
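A two-line demonstration (my own): the MMU traps the bad access, the OS terminates just this process (segmentation fault on Linux, access violation on Windows), and everything else keeps running.

```c
/* Deliberate invalid access: trapped by the CPU, fatal to this process only. */
int main(void) {
    int *p = (int *)0;
    return *p;   /* read through a null pointer */
}
```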
If I had to guess, a piece of hardware is overheating or malfunctioning:
Overheating CPU due to poor thermal conductivity with heatsink
Failing / under-sized power supply
Failing DIMMs
Failing hard drive
Failing CPU
Failing / overheating GPU
I've seen cryptocoin-mining software bring a system to its knees because it was pushing the limits of the GPU. When the card would lock up or reset, the driver would get confused or lock up, and the system would end up needing a reboot.
Your system is doing next to nothing when you're just sitting there browsing the web, etc. But if your system locks up when you start running a CPU-intensive application, that can bring out problems you didn't know were there.
While this is a little out-of-place on Stack Overflow, it falls into one of those grey areas between hardware and software. I would stress-test your system, keeping an eye on CPU/GPU/memory temperatures, and power supply voltages. Check out MemTest86, Stresslinux.
The most trivial cause of OS freezing is "memory full". If you have processes that use a lot of memory, your system is going to swap from main memory (typically RAM) to secondary memory (typically disk), which leads to a very large overhead... What you usually observe as a user is an almost-frozen computer, sometimes so frozen that you think it has crashed. If your OS is badly designed, it sometimes really does crash!
I'm using MinGW, which is gcc for Windows. My program involves multiple windows, two different main threads, and several worker threads in a thread pool for overlapped network I/O.
It works perfectly fine without compiler optimization, but with optimization enabled it imploded aggressively.
A) Is compiler optimization even necessary? My program's already very fast. Is it at all likely that it will provide a significant improvement?
B) Are there any articles on how to properly build a multithreaded program so compiler optimization can do its job?
“Imploded aggressively” is a bit weird (is your program a controller for a fission bomb?), but I understand that your program behaved as desired without compiler optimizations and mysteriously with compiler optimizations.
The technical term for this is that your program is buggy.
Multithreaded programming is intrinsically hard. Multithreaded programming when the threads share memory is very hard; it's the masochist way of concurrent programming (message passing is a lot easier to get right). You don't just need to read an article or two, you need to read several books and get a few years' programming experience.
You were unlucky that your program seemed to work without optimizations. It probably wouldn't work on a different machine where the timings are a bit different, or with a different compiler, or on a different operating system, either. So you ended up wasting your time thinking your program worked. But it doesn't. A compiler transforms correct source code into correct executables, no matter what optimization level you choose.¹
¹ Barring compiler bugs, sure. But the odds are very strongly stacked against you.
99.9% of all failures that show up in one optimization mode and not another are due to serious bugs. Multithreading races etc. are very sensitive to code timing. An instruction reorder or loop shortcut can turn a test pass into a debugging nightmare.
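To make the "loop shortcut" point concrete, here is a minimal sketch of the classic failure mode (my illustration, not the asker's code): a busy-wait on a plain int. At -O0 the flag is re-read every pass and the program usually works; at -O2 the compiler may legally hoist the load out of the loop and main() spins forever.

```c
#include <pthread.h>
#include <stdio.h>

/* `done` is a plain int, so the compiler may assume the loop body cannot
 * change it.  Fix: declare it _Atomic int, or use a mutex/condvar. */
static int done = 0;

static void *worker(void *arg) {
    (void)arg;
    done = 1;               /* unsynchronized write: a data race */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    while (!done)           /* unsynchronized read: may never see the write */
        ;
    pthread_join(t, NULL);
    puts("worker finished");
    return 0;
}
```

Build with `gcc -pthread -O0` and then `-O2` to see the difference for yourself.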
I'm assuming that the server starts up OK and detonates under load in apparently different places, making conventional debugging useless?
You are going to have to rely on logging and changing the test conditions to narrow down the point of ignition. My guess is this is going to be a Heisenbug that mutates with changes to the code, optimization, options, load profile, buffer sizes etc.
Not fixing the problem is not a good plan, since it will just show up in another form on next year's boxes with more cores etc. Even with optimization off, it's still there, lurking, waiting for the opportunity to strike.
I hope I'm providing some comfort.
Seriously, log everything you can with a good logger, one that queues up the log entries so as to keep disk latency out of the main app. Change things around to try to make the bug mutate and perhaps show up in the non-optimized build too. Write down (type in) absolutely everything that you do and what happens after any change, good or bad. Making the bug worse is actually better than making its symptoms go away without knowing exactly why. Try the server on various hardware configs, if you can.
Eventually, you will find the bug!
You have one thing going for you - it seems that you can reliably reproduce the problem. That, in itself, is a massive plus.
Forgot to ask: apart from the nuclear-explosive metaphor, what is the main symptom? Is it AV'ing/segfaulting all over the place, or is it deadlocked or livelocked?
To answer part "A" of your question, the unoptimized version of your code still has the concurrency bugs in it, but the timing of how the threads run is such that the bugs have not yet been exposed with your test workloads. The current version of the unoptimized program will eventually fail in use, so you will need to fix the concurrency bugs before using the program for real work.
I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually (using log statements with timestamps and another script to parse the log and output to Excel... phew).
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first. It's plain C on a proprietary mobile platform.
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
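A throwaway harness for that might look like this (a sketch; the dummy body of subroutine_A just stands in for whatever your timer tick really runs):

```c
#include <stdio.h>

static volatile double sink;

/* Stand-in for the routine under test; burns enough CPU that 1000
 * iterations take several seconds to run. */
static void subroutine_A(void) {
    for (int i = 1; i <= 100000; ++i)
        sink += 1.0 / i;
}

int main(void) {
    for (int i = 0; i < 1000; ++i)   /* long enough to grab ~10 stack samples */
        subroutine_A();
    puts("done");
    return 0;
}
```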
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themselves in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's VTune. It is a slightly different mindset from a traditional profiler that instruments the code. It works by sampling the processor 1,000 times a second to see where the instruction pointer is. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .NET or Java, since there isn't a way for VTune to map the instruction pointer to a symbol like there is with traditional code.
It also allows you to measure all sorts of other processor/hardware-centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc., which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest.
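A minimal sketch of the histogram part (mine; the ISR hookup, the code-region base address, and how you recover the interrupted PC are all platform-specific assumptions):

```c
#define NUM_BUCKETS   256
#define CODE_START    0x08000000UL   /* assumed start of the code region */
#define BUCKET_SHIFT  10             /* 1 KiB of code per bucket */

static unsigned long histogram[NUM_BUCKETS];

/* Call from the timer ISR with the PC at the moment of the interrupt,
 * dug out of the stack frame or a link register. */
void profile_sample(unsigned long interrupted_pc) {
    unsigned long bucket = (interrupted_pc - CODE_START) >> BUCKET_SHIFT;
    if (bucket < NUM_BUCKETS)            /* out-of-region samples are dropped */
        histogram[bucket]++;
}
```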
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM-limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us more about your platform.
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use DevPartner with MSVC 6 and XP.
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now.