Periodically, the program enters the HardFault_Handler. The FORCED bit is set in the HFSR register and the UNALIGNED bit is set in the UFSR register.
The project uses STM32F417, FreeRTOS and lwIP. In most cases the call stack at the fault is inside an lwIP function. The error is rare, occurring about once every few days.
The program is compiled with the flag --no_unaligned_access.
It is unclear why this error occurs at all while the --no_unaligned_access flag is enabled, and why only every few days. Second, is it possible to handle or ignore this error and continue running the program?
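A minimal sketch of how the two status bits described above could be decoded from values captured in the fault handler. The bit positions follow the ARMv7-M layout (HFSR.FORCED is bit 30; UFSR is the upper halfword of CFSR, so UFSR.UNALIGNED lands on CFSR bit 24). The helper name and mask macros are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* HFSR.FORCED is bit 30. UFSR occupies bits 31:16 of CFSR, so
 * UFSR.UNALIGNED (UFSR bit 8) appears as CFSR bit 24. */
#define HFSR_FORCED_MASK    (1UL << 30)
#define CFSR_UNALIGNED_MASK (1UL << 24)

/* Hypothetical helper: true when the captured register values describe a
 * usage fault for an unaligned access that was escalated to a hard fault. */
static bool fault_is_forced_unaligned(uint32_t hfsr, uint32_t cfsr)
{
    return (hfsr & HFSR_FORCED_MASK) != 0 &&
           (cfsr & CFSR_UNALIGNED_MASK) != 0;
}
```

With CMSIS headers, the arguments would typically be SCB->HFSR and SCB->CFSR read at the top of the handler.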
(I know this is years after the OQ. Posting in case it's still helpful.)
I'm working on a project that is using LWIP 1.4.1 that does have at least one unaligned access fault; I've just now fixed it. (I'm here researching if this is a known issue.)
In src/netif/etharp.c: etharp_request()
return etharp_raw(netif, (struct eth_addr *)netif->hwaddr, &ethbroadcast,
                  (struct eth_addr *)netif->hwaddr, &netif->ip_addr, &ethzero,
                  ipaddr, ARP_REQUEST);
The cast (struct eth_addr *)netif->hwaddr causes the fact that netif->hwaddr is not aligned to be discarded. A subsequent memcpy() inside etharp_raw() then faults.
The solution I've shoe-horned in is to allocate temporary storage that is aligned and pass that instead:
struct eth_addr hwaddr;
memcpy(hwaddr.addr, netif->hwaddr, ETHARP_HWADDR_LEN);
return etharp_raw(netif, &hwaddr, &ethbroadcast,
                  &hwaddr, &netif->ip_addr, &ethzero,
                  ipaddr, ARP_REQUEST);
A quick check through the rest of etharp.c reveals quite a number of such casts, some of which are harmless but at least one or two others are also likely to fault.
lwIP 2.x has been available since mid-last year (2018). In my experience, changing over from 1.4.x to 2.x caused few if any issues, so it's best to switch. Problems of that sort (if they are actual problems) might have been fixed.
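Also, the F4x series is Cortex-M4, which can handle unaligned accesses for ordinary single loads and stores (multi-word instructions such as LDM, STM, LDRD and STRD still fault on unaligned addresses). Unaligned access is a problem across the board only on the F0xx (Cortex-M0) or L0xx (Cortex-M0+) series.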
I don't know if this is already resolved, but I ran into the same problem with an STM32F746 during one of my projects.
Solution: just add the following line in main():
SCB->CCR &= ~(1u << 3); // Clear bit 3 (UNALIGN_TRP), which enables the fault on unaligned access
Check if it is still relevant for you.
In my case I was using packed structures, which was causing this problem. After the above-mentioned fix with option 1, it went away for me.
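As an alternative to turning the trap off, a field that may sit at an odd address (for example inside a packed structure or a raw packet buffer) can be read through memcpy instead of a pointer cast. This is only a sketch; the helper name is hypothetical and the value is interpreted in the host's native byte order.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any byte address without dereferencing a
 * possibly misaligned uint32_t pointer. memcpy leaves it to the compiler
 * to choose accesses that are legal for the target. */
static uint32_t read_u32_unaligned(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

A call such as read_u32_unaligned(buffer + 1) is then safe regardless of how buffer is aligned.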
Related
I'm having one of those moments where I'm sure there is some obvious thing I'm missing but I can't see it for looking.
We have some code (Not Invented Here, natch) which looks something like this (I've made it pseudocode for ease of reading):
struct outputs_struct {
    char *SomeString;
};
int DoGetOutputsFromInputs(struct allthings_struct *AllThings, struct inputs_struct *Inputs, struct outputs_struct *Outputs);
int DoSomething(struct allthings_struct *AllThings)
{
    struct inputs_struct The_Inputs;
    struct outputs_struct The_Outputs;
    int error = 0;
    // Populate input data, then:
    error = DoGetOutputsFromInputs(AllThings, &The_Inputs, &The_Outputs);
    return error;
}
int DoGetOutputsFromInputs(struct allthings_struct *AllThings, struct inputs_struct *Inputs, struct outputs_struct *Outputs)
{
    // Some reading of input data, then:
    Outputs->SomeString = (char *)malloc(100);
    strcpy(Outputs->SomeString, "Hello,world");
    // Some other stuff
    return 0;
}
As soon as this function returns, we get a SEGFAULT.
It SEGFAULTs immediately on coming back from DoGetOutputsFromInputs(). Likewise if I print markers & pause before the return statement in DoGetOutputsFromInputs() it is fine right up to the moment it actually returns.
I have also tried upping my caffeine dosage, experiments are ongoing in that department, so far: no progress.
Edit 1: Further testing reveals it's not the malloc() that's at fault / causing the issue, the code actually crashes if we return sooner than that part, so I think there is some oddness going on elsewhere that I will have to chase down.
Apologies for the vagueness and pseudocode, it's a huge steaming pile of code auto-generated by gSoap (which doesn't auto-generate any sort of comments or documentation, of course...) from ONVIF WSDL's, we're developing in Ubuntu and the target is a TI DaVinci DSP/ARM9 SoC. This code is a subsection of a corner of the TI SDK and hence various things are outside our immediate influence / too time-consuming to delve into.
Your example does not repro. I suspect that the reference to The_Outputs, which is declared on the parent's stack frame, is the culprit: somewhere in the code a cast fools the compiler into writing a few bytes higher on the stack, exactly where the saved ebp and return address would be, triggering the fault when executing the ret (I assume an x86-like stack architecture).
Running under gdb should make this fairly trivial to capture. Enter DoGetOutputsFromInputs and use watch to set a break-on-write on the stack ret address (see Can I set a breakpoint on 'memory access' in GDB?). Let it run, should break when the overwrite occurs (if my hypothesis is correct) and that instruction is your culprit.
Of course, compiling with stack-smash protection would also capture the problem fairly easily, but where is the fun?
Well to answer my own question and close this off / avoid wasting anyone's time... basically, it's not the malloc, it's unlikely it's even that function, there is something lurking in the code which isn't quite right and which I will have to devote a fair bit more time & coffee to tracking down.
Thanks all for the input.
Nurse, fetch the valium!
It's impossible to say without the actual code, but this could be due to memory corruption (e.g., buffer overflow or underflow) or UB (undefined behavior). If it is, chances are the actual issue is happening somewhere else and just happens to show up at this point.
A few things you can do to narrow down the cause:
Use Valgrind or a similar tool to look for memory issues.
Create a minimal example code that replicates the issue.
Double-check all memory allocations, frees, and copies.
Test the DoGetOutputsFromInputs() to ensure it works as expected.
I'm developing software for a Cortex-M3 embedded microcontroller (Atmel SAM3S).
I am using the IAR EWARM IDE & compiler.
I suspect that for some reason I have a buffer overflow, or a memory leak, which causes the stack to be corrupted, because I suddenly find myself stuck outside of my code space.
The reason I ask this question is that it's really hard to find out what actually caused this mess-up, and I want to know which techniques you use when you want to find the cause of such an issue.
Are you using memory debuggers, in-circuit trace debugging hardware, etc.?
You should try using canary values. This is how it basically goes - say you have some struct:
struct foo {
unsigned long bar;
void * baz;
};
Modify it so it looks like this:
struct foo {
unsigned long canary1;
unsigned long bar;
void * baz;
unsigned long canary2;
};
When you initialize the struct, put some arbitrary values into canary1 and canary2. Whenever you do some operation on your struct, check if the values stay the same. This way, if you have a buffer overflow or stack smashing, you'll detect it. You can do the same inside functions with automatic variables:
int foo(int bar) {
unsigned long canary1 = 0xDEADBABE;
char baz[20];
unsigned long canary2 = 0xBAD0C0DE;
...
}
And so on. Don't forget to check that the values remain the same before you return. Also, if you can get your code to consistently jump to the same location, try putting some code there (or a breakpoint) and get a stack trace.
GCC knows how to add these canary values by itself, but I don't know if your compiler can do that. But you could still do it manually.
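The struct variant described above could be wired up roughly like this. It is a sketch only: the helper names and the choice of guard constants are illustrative, and the check has to be called by hand at the points where corruption is suspected.

```c
#include <stdbool.h>
#include <stddef.h>

#define FOO_CANARY1 0xDEADBABEUL
#define FOO_CANARY2 0xBAD0C0DEUL

struct foo {
    unsigned long canary1;  /* guard word in front of the payload */
    unsigned long bar;
    void *baz;
    unsigned long canary2;  /* guard word behind the payload */
};

/* Hypothetical helper: stamp both guard words when the struct is set up. */
static void foo_init(struct foo *f)
{
    f->canary1 = FOO_CANARY1;
    f->bar = 0;
    f->baz = NULL;
    f->canary2 = FOO_CANARY2;
}

/* Hypothetical helper: false as soon as either guard word has changed. */
static bool foo_canaries_intact(const struct foo *f)
{
    return f->canary1 == FOO_CANARY1 && f->canary2 == FOO_CANARY2;
}
```

Calling foo_canaries_intact() before and after each suspect operation narrows down which one tramples the neighbouring memory.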
The program counter is a register, and as such it cannot be "overwritten". What might happen is, as you say, that the stack gets overwritten and then you execute a return instruction which reads an invalid return address from the stack, thus causing a jump into la-la-land.
My favourite debugging method is printing things out, which might be difficult on an embedded target, of course. The second-best would be to step through the suspect routine.
You should also investigate things that are known to cause jumps, such as interrupt service routines.
I had a similar issue using IAR EWARM on an STM32. Memory dumps, disassembly, canaries, all turned up nothing. Finally rolled back to an earlier version of EWARM and the problem went away. I sent a message to IAR support but never heard back. I'm sorry I don't remember which version of EWARM this was. It was a few projects ago.
I would keep a memory window open and try the canary test first. If it still randomly jumps out of code space, try installing an older version of EWARM.
One thing I can add is that with ARM chips, it is possible that there was a BL somewhere instead of a BX or BLX causing the chip to go into the wrong Thumb/ARM mode. Not as common with later chips, but still...
When I find jumps to nowhere, I look for bad function pointer tables, overwrites of any interrupt vector tables, and yes, stack overflow, which is the easiest to test for. Fill your stack area with known byte values and, when the crash occurs, use a debugger to see how much stack you had remaining. If none, there you go.
I'd also do the standard see what's changed in the last X days stuff to try and isolate any problems. Finally, just printf the heck out of your code to try and narrow where the bad jump is occurring. If you can get it down to a function or two, you can trace the assembler and see if it's a compiler issue, a memory issue, or an interrupt issue pretty quickly. Good luck!
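The "known byte values" stack test mentioned above might look like the following sketch. The fill byte, the function names and the idea of passing the stack region as a plain buffer are assumptions; on a real target the base and size would come from the linker script or the RTOS task descriptor.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_BYTE 0xA5u

/* Fill a stack region with a known pattern. Do this before the stack
 * region is in use, e.g. at startup or before the task is created. */
static void stack_paint(uint8_t *base, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        base[i] = STACK_FILL_BYTE;
    }
}

/* Count pattern bytes that are still untouched, scanning up from the
 * lowest address. Assumes a descending stack, so the low end of the
 * region is the last part to be used. */
static size_t stack_unused_bytes(const uint8_t *base, size_t size)
{
    size_t n = 0;
    while (n < size && base[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}
```

A result of zero after a crash suggests the stack was exhausted; a small but non-zero result shows how close it came.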
I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
On average, when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
I ran the code in Valgrind for four days locally with no memory errors.
I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
Now when a segfault happens I can print out the current stack using backtrace.
I can print out variable values.
I created a variable which is set to the current line number.
Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
I suspect that the checkpointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
I suspect that indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC - a tutorial on catching segfaults and getting the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts; setting the 'requirements' field in the condor submit file made the problem disappear completely.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here
If you can, compile with debugging and run under gdb.
Alternatively, get a core dump and load that into the debugger.
MPICH has a built-in debugger, or you can buy a commercial parallel debugger.
Then you can step through the code in the debugger to see what is happening.
http://nmi.cs.wisc.edu/node/1610
http://nmi.cs.wisc.edu/node/1611
Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.
Look at what instruction caused the fault. Was it even a valid instruction, or are you trying to execute data? If valid, what memory is it trying to access? Where did this pointer come from? You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built-in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common cause of these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.
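Logging malloc() / free() pairs, as suggested above, can be done with thin wrappers. The sketch below is one possible shape; dbg_malloc, dbg_free, the macros and the live_allocations counter are all hypothetical names, and the counter is not thread-safe.

```c
#include <stdio.h>
#include <stdlib.h>

static long live_allocations = 0;  /* successful mallocs not yet freed */

/* Log every allocation together with its call site. */
static void *dbg_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    if (p != NULL) {
        live_allocations++;
    }
    fprintf(stderr, "malloc(%zu) = %p at %s:%d\n", size, p, file, line);
    return p;
}

/* Log every release together with its call site, then free the block. */
static void dbg_free(void *p, const char *file, int line)
{
    fprintf(stderr, "free(%p) at %s:%d\n", p, file, line);
    if (p != NULL) {
        live_allocations--;
    }
    free(p);
}

#define DBG_MALLOC(size) dbg_malloc((size), __FILE__, __LINE__)
#define DBG_FREE(ptr)    dbg_free((ptr), __FILE__, __LINE__)
```

Matching the logged addresses afterwards shows double frees and blocks that are used after their free() line appears in the log.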
If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code, and uses its own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.
The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?
If you don't, can you correlate the segfaults to when condor restarts the executable from a checkpoint (the user log will tell you when this happens)?
You've tried most of what I'd think of. The only other thing I'd suggest is start adding a lot of logging code and hope you can narrow down where the error is happening.
The one thing you do not say is how much flexibility you have to solve the problem.
Can you, for example, have the system come to a halt and just run your application?
Also how important are these crashes to solve?
I am assuming that for the most part you do. This may require a lot of resources.
The short-term step is to put in lots of (semi-handwritten) asserts on each variable to make sure it hasn't changed when you don't want it to. This can continue to work as you go through the long-term process.
Long term: try running it on a cluster of two (maybe your home computer and a VM).
Do you still see the segfaults? If not, increase the cluster size until you start seeing segfaults.
Run it on a minimum configuration (to get segfaults) and record all your inputs until a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistently get a crash with minimal input.
At that point look around. If you still can't find the bug, then you will have to ask again with some extra data you gathered with those runs.
I have some software that I have working on a redhat system with icc and it is working fine. When I ported the code to an IRIX system running with MIPS then I get some calculations that come out as "nan" when there should definitely be values there.
I don't have any good debuggers on the non-redhat system, but I have tracked down that some of my arrays are getting "nan" sporadically in them and that is causing my dot product calculation to come back as "nan."
Seeing as how I can't track it down with a debugger, I am thinking that the problem may be with a memcpy. Are there any issues with the MIPS compiler memcpy() function with dynamically allocated arrays? I am basically using
memcpy(to, from, n*sizeof(double));
And I can't really prove it, but I think this may be the issue. Is there some workaround? Perhaps some data is misaligned? How do I fix that?
I'd be surprised if your problem came from a bug in memcpy. It may be an alignment issue: are your doubles sufficiently aligned? (They will be if you only store them in double or double[] objects or through double* pointers but might not be if you move them around via void* pointers). X86 platforms are more tolerant to misalignment than most.
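Whether a pointer is sufficiently aligned for double can be checked at run time before it is used. A minimal sketch, with a hypothetical helper name, assuming the alignment passed in is a power of two:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when ptr is a multiple of alignment. The alignment must be a
 * power of two for the mask trick to be valid. */
static bool is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

Checking both to and from with is_aligned(ptr, sizeof(double)) just before the memcpy call would confirm or rule out the misalignment theory.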
Did you try compiling your code with gcc at a high warning level? (Gcc is available just about everywhere that's not a microcontroller or mainframe. It may produce slower code but better diagnostics than the “native” compiler.)
Of course, it could always be a buffer overflow or other memory management problem in some unrelated part of the code that just happened not to cause any visible bug on your original platform.
If you can't get access to a good debugger, at least try printf'ing stuff in key places.
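One thing worth printing in those key places is the position of the first NaN in each array, so the point where it first appears can be bracketed. A sketch with a hypothetical helper; it relies on NaN being the only value that compares unequal to itself, which holds for IEEE 754 arithmetic unless the compiler is told to relax floating-point rules.

```c
#include <stddef.h>

/* Return the index of the first NaN in values[0..n), or -1 if there is
 * none. A NaN is detected by the fact that it is not equal to itself. */
static long first_nan_index(const double *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (values[i] != values[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

Calling it on the source before the memcpy and on the destination after it shows whether the NaN was already present or arrived with the copy.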
Is it possible for the memory regions to and from to overlap? memcpy isn't required to handle overlapping memory regions. If this is your problem then the solution is as simple as using memmove instead.
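To illustrate the overlap case, here is a small sketch (the function name is made up) in which source and destination share memory, so memmove is the correct call:

```c
#include <stddef.h>
#include <string.h>

/* Shift the first n elements of values up by one slot. The array must
 * have room for n + 1 elements. Source and destination overlap, so
 * memmove is required; memcpy on overlapping regions is undefined. */
static void shift_up_by_one(double *values, size_t n)
{
    memmove(values + 1, values, n * sizeof(double));
}
```

If to and from in the original call can never point into the same array, memcpy is fine and the problem lies elsewhere.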
Is sizeof() definitely supported?
I have built plain C code on Linux (Fedora) using the CodeSourcery toolchain. This is for an ARM Cortex-A8 target. The code runs on a Cortex-A8 board running embedded Linux.
When I run this code for a test case that does dynamic memory allocation (malloc) of a large size (10MB), it crashes after some time with the error message below:
select 1 (init), adj 0, size 61, to kill
select 1030 (syslogd), adj 0, size 64, to kill
select 1032 (klogd), adj 0, size 74, to kill
select 1227 (bash), adj 0, size 378, to kill
select 1254 (ppp), adj 0, size 1069, to kill
select 1255 (TheoraDec_Corte), adj 0, size 1159, to kill
send sigkill to 1255 (TheoraDec_Corte), adj 0, size 1159
Program terminated with signal SIGKILL, Killed.
Then, when I debug the same test case using a gdb built for the target, malloc fails at the point of this allocation and returns NULL. During a normal stand-alone run, however, I believe malloc should also be failing, but strangely it does not seem to return NULL; instead the program crashes and the OS kills my process.
Why is this behaviour different when run under gdb and without the debugger?
Why would malloc fail yet not return NULL? Is this possible, or is there some other reason for the error message I am getting?
How do I fix this?
thanks,
-AD
So, for this part of the question, there is a surefire answer:
Why would malloc fail yet not return NULL? Is this possible, or is there some other reason for the error message I am getting?
In Linux, by default the kernel interfaces for allocating memory almost never fail outright. Instead, they set up your page table in such a way that on the first access to the memory you asked for, the CPU will generate a page fault, at which point the kernel handles this and looks for physical memory that will be used for that (virtual) page. So, in an out-of-memory situation, you can ask the kernel for memory, it will "succeed", and the first time you try to touch that memory it returned back, this is when the allocation actually fails, killing your process. (Or perhaps some other unfortunate victim. There are some heuristics for that, which I'm not incredibly familiar with. See "oom-killer".)
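Because of this lazy behaviour, a NULL check alone does not prove the memory is really available. One way to surface the problem at allocation time rather than somewhere later in the program is to write to the whole block straight away. The sketch below uses a hypothetical helper name; a non-zero fill value is used so that the compiler cannot fold the malloc/memset pair into calloc.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate size bytes and write to all of them, so the kernel has to
 * back the pages with physical memory now rather than at first use.
 * Returns NULL if malloc itself fails. Under overcommit, a shortage of
 * physical memory may instead show up as the OOM killer ending a
 * process while the memset is running. */
static void *alloc_and_touch(size_t size)
{
    void *p = malloc(size);
    if (p == NULL) {
        return NULL;
    }
    memset(p, 0xA5, size);
    return p;
}
```

This does not prevent the kill; it only moves it to a predictable place, which makes the failure easier to diagnose.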
Some of your other questions, the answers are less clear for me.
Why is this behaviour different when run under gdb and when without debugger?
It could be (just a guess really) that GDB has its own malloc, and is tracking your allocations somehow. On a somewhat related point, I've actually frequently found that heap bugs in my code often aren't reproducible under debuggers. This is frustrating and makes me scratch my head, but it's basically something I've pretty much figured one has to live with...
How do I fix this?
This is a bit of a sledgehammer solution (that is, it changes the behavior for all processes rather than just your own, and it's generally not a good idea to have your program alter global state like that), but you can write the string 2 to /proc/sys/vm/overcommit_memory. See this link that I got from a Google search.
Failing that... I'd just make sure you're not allocating more than you expect to.
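Before changing the setting, it can be useful to see which overcommit mode the system is currently in (0 is the heuristic default, 1 always overcommits, 2 is strict accounting). A small sketch with a hypothetical function name that only reads the file:

```c
#include <stdio.h>

/* Return the current value of /proc/sys/vm/overcommit_memory (0, 1 or 2),
 * or -1 if the file cannot be opened or parsed. */
static int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f == NULL) {
        return -1;
    }
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1) {
        mode = -1;
    }
    fclose(f);
    return mode;
}
```

Logging this value at program start makes it clear whether a NULL return from malloc can be expected on that machine at all.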
By definition, running under a debugger is different from running standalone. Debuggers can and do hide many bugs. If you compile for debugging you can add a fair amount of code, similar to compiling completely unoptimized (allowing you to single-step or watch variables, for example). Compiling for release, on the other hand, can remove debugging options and remove code that you needed; there are many optimization traps you can fall into. I don't know from your post who is controlling the compile options or what they are.
Unless you plan to deliver the product to be run under the debugger you should do your testing standalone. Ideally do your development without the debugger as well, saves you from having to do everything twice.
It sounds like a bug in your code, slowly re-read your code using new eyes as if you were explaining it to someone, or perhaps actually explain it to someone, line by line. There may be something right there that you cannot see because you have been looking at it the same way for too long. It is amazing how many times and how well that works.
It could also be a compiler bug. Doing things like printing out the return value, or not, can cause the compiler to generate different code. Adding another variable and saving the result to that variable can kick the compiler into doing something different. Try changing the compiler options, reduce or remove any optimization options, reduce or remove the debugger compiler options, etc.
Is this a proven system or are you developing on new hardware? Try running without any of the caches enabled, for example. If it works in a debugger and not standalone, and it is not a compiler bug, it can be a timing issue: single-stepping flushes the pipeline, mixes the cache up differently, and gives the cache and memory system an eternity to come up with a result, which it doesn't have in real time.
In short, there is a very long list of reasons why running under a debugger hides bugs that you cannot find until you test in an environment like the final deliverable; I have only touched on a few. Having it work in the debugger and not standalone is not unexpected, it is simply how the tools work. It is likely your code, the hardware, or your tools, based on the description you have given so far.
The fastest way to eliminate it being your code or the tools is to disassemble the section and inspect how the passed values and return values are handled. If the return value is optimized out there is your answer.
Are you compiling for a shared C library or static? Perhaps compile for static...