Lets state the conditions where sqlcxt() can cause segmentation fault, I am woking on unix, using ProC for database connections to Oracle database.
My program crashes and the core file shows that the crash is due to the sqlcxt() function
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
...
dbx: warning: Some symbolic information might be incorrect.
...
t#null (l#1) program terminated by signal SEGV
(no mapping at the fault address)0xffffffffffffffff:
<bad address 0xffffffffffffffff>
Current function is dbMatchConsortium
442 **sqlcxt((void **)0, &sqlctx, &sqlstm, &sqlfpn);**
There is a decent chance that the problem you are having is some sort of pointer-error / memory allocation error in your C code. These things are never easy to find. Some things that you might
try:
See if you can comment out (or #ifdef) out sections of your program and if the problem disappears. If so then you can close in on the bad section
Run your program in a debugger.
Do a code review with somebody else - this will often lead to finding more than one problem (Usually works in my code).
I hope that this helps. Please add more details and I will check back on this question and see if I can help you .
It's probably an allocation error in your program. When I got this kind of behaviour it was always my fault. I develop on Solaris/SPARC and Oracle 10g. Once it was a double free (i.e. I freed the same pointer twice) and the second time I had a core in the Oracle part of the program was when I freed a pointer which was not an allocated memory block.
If you're on Solaris you can try the libumem allocation library (google it for details) to see if the behaviour change.
A solution that worked for me: Delete the c files created by ProC & make(recompile)
Pro c files(*.pc) are 'compiled'/preprocessed in c files and sometimes when 'compiling' them some errors may occur (in my case it wasn't any more space left) and even if the build succeeds I would get a SIGSEGV signal in sqlcxt libclntsh.so when executing them.
pstack & gdb could help you for debugging if that's not the case.
Related
I get segmentation fault on a certain scenario(it is C code with DEC VAX FMS(Forms Management System) calls to get a certain field on a CRT screen - pretty old legacy code). I am on an AIX machine, and have only dbx installed on it. GDB, valgrind etc. are not available.
Here is what I get when I try to debug:
Unreadable instruction at address 0x53484950
I do not know how to proceed from here.
I have tried a few things:
1.
(dbx) up
not that many levels
(dbx) down
not that many levels
(dbx) n
where
Segmentation fault in . at 0x53484950 ($t1)
0x53484950 (???) Unreadable instruction at address 0x53484950
Tried tracei(for machine instructions), dump(dump gives so much output, I am unable to make sense of it) etc. but nothing seems to help.
(dbx) &0x53484950/X
expected variable, found "1397246288"
I am used to getting a stack trace on "where" and going on from there. This is something I have not encountered before, and it appears I am not very good at dbx either. Any help to get to at least the line of code that is causing trouble is appreciated.
Once you have hit a segfault, there is no way to continue, so the n command is not going to do anything. At that point, all you can do is examine the stack and the variables, and that will be meaningless unless you have the source code and can recompile it.
In fact, without the source code, I am not sure how you could possibly proceed with fixing the program. Even if you could "decompile" the program, or at least disassemble the program, the risk of making a mistake when trying to patch the binary in order to fix it is virtually 100%.
I'm sorry. Given the limitations you are working under, I would argue the the problem is insolvable. Without tools such as gdb or valgind, it will be difficult to find the problem, and without the source code, it will be very difficult to fix the problem once you have found it.
I have a critical bug in my project. When I use gdb to open the .core it shows me something like(I didn't put all the gdb output for ease of reading):
This is very very suspicious, new written part of code ::
0x00000000004579fe in http_chunk_count_loop
(f=0x82e68dbf0, pl=0x817606e8a Address 0x817606e8a out of bounds)
This is very mature part of code which worked for a long time without problem::
0x000000000045c8a5 in packet_handler_http
(f=0x82e68dbf0, pl=0x817606e8a Address 0x817606e8a out of bounds)
Ok now what messes my mind is the pl=0x817606e8a Address 0x817606e8a out of bounds, gdb shows it was already out of bounds before it reached new written code. This make me think the problem caused by function which calls packet_handler_http.
But packet_handler_http is very mature and working for a long time without problem. And this makes me I am misundertanding gdb output.
The problem is with packet_handler_http I guess but because of this was already working code I am confused, am I right with my guess or am I missing something?
To detect "memory errors" you might like to run the program under Valgrind: http://valgrind.org
If having compiled the program with symbols (-g for gcc) you could quite reliably detect "out of bounds" conditions down to the line of code where the error occurrs, as well with the line of code having allocated the memory (if ever).
The problem is with packet_handler_http I guess
That guess is unlikely to be correct: if the packet_handler_http is really receiving invalid pointer, then the corruption has happened "upstream" from it.
This is very mature part of code which worked for a long time without problem
I routinely find bugs in code that worked "without problem" for 10+ years. Also, the corruption may be happening in newly-added code, but causing problems elsewhere. Heap and stack buffer overflows are often just like that.
As alk already suggested, run your executable under Valgrind, or Address Sanitizer (also included in GCC-4.8), and fix any problems they find.
Thanks guys for your contrubition , even gdb says opposite it turn out pointer was good.
There was a part in new code which causes out of bounds problem.
There was line like :: (goodpointer + offset) and this offset was http chunk size and I were taking it from network(data sniffing). And there was kind of attack that this offset were extremely big, which cause integer overflow. And this resulted out of bounds problem.
My conclusions : don't thrust the parameters from network never AND gdb may not always points the parameter correctly at coredump because at the moment of crush things can get messy in stack .
I got a segment error in a object like this:
http_client_reset(struct http_client *client) {
if (client->last_req) {
/* #client should never be NULL, but weather
a valid object, I don't know */
...
}
}
by debugging the core dump file in GDB, the memory address of client is 0x40a651c0. I have tried several times, and the address is the same.
Then I tried the bt command in GDB:
(gdb) bt
#0 0x0804c80e in http_client_reset (
c=<error reading variable: Cannot access memory at address 0x40a651c0>,
c#entry=<error reading variable: Cannot access memory at address 0x40a651bc>)
at http/client.c:170
Cannot access memory at address 0x40a651bc
there is no back trace message, I have greped my source code, and there is only one call on http_client_reset.
How to debug such a bug via only a memory address?
Is there a way to judge a object is valid before access its field(except obj == NULL)?
Never a coredump crash debugging is a 'Black and White' matter.So you would not be able to get an exact answer for the questions pertaining to debugging coredump. However, most coredump will be due to programming errors which can be classified into broad areas. I will provide some of these broad areas and some debugging mechanism - which might help you.
Class of programming error leading to crash
multi-threaded code - check for missing critical section while accessing common data. This can corrupt the data leading to such crash. In your case you can check for http_client pointer, access of this and CRUD - Create/Read/Update and Delete.
Heap Corruption - In most of the cases, this would be a valid pointer and due incorrect handling of heap in another section of code, this may cause the valid pointer to be overwritten. Think of an array in and around the pointer location - ABW etc kind of issues would easily cause this problem.
Stack Corruption - This is very unlikely, but hard to find them. In case you overwrite stack data - similar to array in the above example - but on stack, then the same issue will occur.
Ways to un-earth the coredump root cause
You need understand that - technically coredump is an illegal operation causing un-handled exception leading to crash. Since most of it are related to memory handling, a static-analysis tool - such as kloc/PCLint would capture almost 80% of the issues. Then I would next run on valgrind/purify and would most probably uncover the rest of the issue. Very few issues miss both of them - which would be some sequencing/timing related code - which can be found out with code review.
HTH!
I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
On Average when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
I ran the code in valGrind for four days locally with no memory errors.
I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
Now when a segfault happens I can print out the current stack using backtrace.
I can print out variable values.
I created a variable which is set to the current line number.
Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
I suspect that the check pointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts, by setting the 'requiremets' field in the condor submit file to problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here
if you can, compile with debugging, and run under gdb.
alternatively, get core dumped and load that into debugger.
mpich has built-in debugger, or you can buy commercial parallel debugger.
Then you can step through the code to see what happening in debugger
http://nmi.cs.wisc.edu/node/1610
http://nmi.cs.wisc.edu/node/1611
Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.
Look at what instruction caused the fault. Was it even a valid instruction or are you trying to execute data? If valid, what memory is it trying to access? Where did this pointer come from. You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case for these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.
If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code, and uses it's own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.
The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?
If it doesn't, can you correlate the segfaults to when condor restarts the executable from a checkpoint (the user log will tell you when this happens).
You've tried most of what I'd think of. The only other thing I'd suggest is start adding a lot of logging code and hope you can narrow down where the error is happening.
The one thing you do not say is how much flexibility you have to solve the problem.
Can you, for example, have the system come to a halt and just run your application?
Also how important are these crashes to solve?
I am assuming that for the most part you do. This may require a lot of resources.
The short term step is to put tons of "asserts" ( semi handwritten ) of each variable
to make sure it hasn't changed when you don't want it to. This can ccontinue to work as you go through the long term process.
Long term-- try running it on a cluster of two ( maybe your home computer and a VM ).
Do you still see the segfaults. If not increase the cluster size until you start seeing segfaults.
Run it on a minimum configuration ( to get segfaults ) and record all your inputs till a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistent get a crash with minimal input.
At that point look around. If you still can't find the bug, then you will have to ask again with some extra data you gathered with those runs.
I have built a plain C code on Linux (Fedora) using code-sorcery tool-chain. This is for ARM Cortex-A8 target. This code is running on a Cortex A8 board, running embedded Linux.
When I run this code for some test case, which does dynamic memory allocation (malloc) for some large size (10MB), it crashes after some time giving error message as below:
select 1 (init), adj 0, size 61, to kill
select 1030 (syslogd), adj 0, size 64, to kill
select 1032 (klogd), adj 0, size 74, to kill
select 1227 (bash), adj 0, size 378, to kill
select 1254 (ppp), adj 0, size 1069, to kill
select 1255 (TheoraDec_Corte), adj 0, size 1159, to kill
send sigkill to 1255 (TheoraDec_Corte), adj 0, size 1159
Program terminated with signal SIGKILL, Killed.
Then, when I debug this code for the same test case using gdb built for the target, the point where this dynamic memory allocation happens, code fails to allocate that memory and malloc returns NULL. But during normal stand-alone run, I believe malloc should be failing to allocate but it strangely might not be returning NULL, but it crashes and the OS kills my process.
Why is this behaviour different when run under gdb and when without debugger?
Why would malloc fails yet not return a NULL. Could this be possible, or the reason for the error message I am getting is else?
How do I fix this?
thanks,
-AD
So, for this part of the question, there is a surefire answer:
Why would malloc fails yet not return a NULL. Could this be possible, or the reason for the error message i am getting is else?
In Linux, by default the kernel interfaces for allocating memory almost never fail outright. Instead, they set up your page table in such a way that on the first access to the memory you asked for, the CPU will generate a page fault, at which point the kernel handles this and looks for physical memory that will be used for that (virtual) page. So, in an out-of-memory situation, you can ask the kernel for memory, it will "succeed", and the first time you try to touch that memory it returned back, this is when the allocation actually fails, killing your process. (Or perhaps some other unfortunate victim. There are some heuristics for that, which I'm not incredibly familiar with. See "oom-killer".)
Some of your other questions, the answers are less clear for me.
Why is this behaviour different when run under gdb and when without debugger?It could be (just a guess really) that GDB has its own malloc, and is tracking your allocations somehow. On a somewhat related point, I've actually frequently found that heap bugs in my code often aren't reproducible under debuggers. This is frustrating and makes me scratch my head, but it's basically something I've pretty much figured one has to live with...
How do i fix this?
This is a bit of a sledgehammer solution (that is, it changes the behavior for all processes rather than just your own, and it's generally not a good idea to have your program alter global state like that), but you can write the string 2 to /proc/sys/vm/overcommit_memory. See this link that I got from a Google search.
Failing that... I'd just make sure you're not allocating more than you expect to.
By definition running under a debugger is different than running standalone. Debuggers can and do hide many of the bugs. If you compile for debugging you can add a fair amount of code, similar to compiling completely unoptimized (allowing you to single step or watch variables for example). Where compiling for release can remove debugging options and remove code that you needed, there are many optimization traps you can fall into. I dont know from your post who is controlling the compile options or what they are.
Unless you plan to deliver the product to be run under the debugger you should do your testing standalone. Ideally do your development without the debugger as well, saves you from having to do everything twice.
It sounds like a bug in your code, slowly re-read your code using new eyes as if you were explaining it to someone, or perhaps actually explain it to someone, line by line. There may be something right there that you cannot see because you have been looking at it the same way for too long. It is amazing how many times and how well that works.
I could also be a compiler bug. Doing things like printing out the return value, or not can cause the compiler to generate different code. Adding another variable and saving the result to that variable can kick the compiler to do something different. Try changing the compiler options, reduce or remove any optimization options, reduce or remove the debugger compiler options, etc.
Is this a proven system or are you developing on new hardware? Try running without any of the caches enabled for example. Working in a debugger and not in standalone, if not a compiler bug can be a timing issue, single stepping flushes the pipline, mixes the cache up differently, gives the cache and memory system an eternity to come up with a result which it doesnt have in real time.
In short there is a very long list of reasons why running under a debugger hides bugs that you cannot find until you test in the final deliverable like environment, I have only touched on a few. Having it work in the debugger and not in standalone is not unexpected, it is simply how the tools work. It is likely your code, the hardware, or your tools based on the description you have given so far.
The fastest way to eliminate it being your code or the tools is to disassemble the section and inspect how the passed values and return values are handled. If the return value is optimized out there is your answer.
Are you compiling for a shared C library or static? Perhaps compile for static...