My program after running for around few hours randomly crashes because of a segmentation fault. My environment is Ubuntu (Linux)
When I try to print the data structure thats being accessed when it crashed the pointer is always pointing to invalid memory.
(gdb) p *xxx_info[8]
**Cannot access memory at address 0x7fd200000000**
(gdb)
In order to detect data corruption I add two fence variables that were hardcoded with a well defined value so that I can detect a memory corruption. I see that my fence variables for the linked list node which caused my process to crash had been violated from my earlier logs.
(gdb) p xxx_info
$2 = {0x7fd248000c30, 0x7fd248001050, 0x7fd248000b30, 0x7fd248000d40, 0x7fd248000f50, 0x0,
0x7fd248001160, 0x7fd248001280, 0x7fd200000000, 0x7fd2480008c0,
0x7fd248003000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p xxx_info[8]
$3 = (xxx_info_t *) 0x7fd200000000
(gdb) p *xxx_info[8]
**Cannot access memory at address 0x7fd200000000**
(gdb)
I added the two fence variables at the start and the end of the data structures.
(gdb) pt xxx_info_t
type = struct xxx_info_t {
uint32_t begin_fence;
char *str;
int cluster_id;
uint32_t end_fence;
}
Under normal circumstances the fence variables MUST always be 0xdeaddead as shown below:
(gdb) p/x *xxx_info[7]
$4 = {**begin_fence= 0xdeaddead**, str= 0x7fd248002f10, cluster_id= 0x1f2, **end_fence= 0xdeaddead**}
(gdb)
Whenever I access the array of pointers xxx_info I check for the fence values for each index. I noticed that I would get error messages saying that the fence variables for index 8 look corrupted. Later the code crashed when accessing index 8.
This means that somewhere I am overwriting the memory address pointed to by the index 8 of the array xxx_info .
My question is:
How can I debug these errors? Can I dynamically set some breakpoints so that whenever somebody overwrites that memory address I complain and assert. If this were a global variable that somebody were corrupting I could have set a HW breakpoint. Since this is a list created on heap (i malloc) the addresses that I will get will be dynamic and hence I can't use memory breakpoints/watchpoints.
Any ideas on what I can do?
Related
I'm trying to migrate a small c program from hpux to linux. The project compiles fine but crashes at runtime showing me a segmentation fault. I've already tried to see behind the mirror using strace and gdb but still don't understand. The relevant (truncated) parts:
tts_send_2.c
Contains a method
int sequenznummernabgleich(int sockfd, char *snd_id, char *rec_id, int timeout_quit) {
TS_TEL_TAB tel_tab_S01;
int n;
# truncated
}
which is called from within that file like this:
. . .
. . .
switch(sequenznummernabgleich(sockfd,c_snd_id,c_rec_id,c_timeout_quit)) {
/* kritischer Fehler */
case -1:
. . .
. . .
when calling that method I'm presented a segmentation fault (gdb output):
Program received signal SIGSEGV, Segmentation fault.
0x0000000000403226 in sequenznummernabgleich (sockfd=<error reading variable: Cannot access memory at address 0x7fffff62f94c>,
snd_id=<error reading variable: Cannot access memory at address 0x7fffff62f940>, rec_id=<error reading variable: Cannot access memory at address 0x7fffff62f938>,
timeout_quit=<error reading variable: Cannot access memory at address 0x7fffff62f934>) at tts_snd_2.c:498
498 int sequenznummernabgleich(int sockfd, char *snd_id, char *rec_id, int timeout_quit) {
which I just don't understand. When I'm stepping to the line where the method is called using gdb, all the variables are looking fine:
1008 switch(sequenznummernabgleich(sockfd,c_snd_id,c_rec_id,c_timeout_quit)) {
(gdb) p sockfd
$9 = 8
(gdb) p &sockfd
$10 = (int *) 0x611024 <sockfd>
(gdb) p c_snd_id
$11 = "KR", '\000' <repeats 253 times>
(gdb) p &c_snd_id
$12 = (char (*)[256]) 0xfde220 <c_snd_id>
(gdb) p c_rec_id
$13 = "CO", '\000' <repeats 253 times>
(gdb) p &c_rec_id
$14 = (char (*)[256]) 0xfde560 <c_rec_id>
(gdb) p c_timeout_quit
$15 = 20
(gdb) p &c_timeout_quit
$16 = (int *) 0xfde660 <c_timeout_quit>
I've also created an strace output. Here's the last part concerning the code shown above:
strace output
Any ideas ? I've searched the web and of course stackoverflow for hours without finding a really similar case.
Thanks
Kriz
I haven't used an HP/UX in eons but do hazily remember enough for the following suggestions:
Make sure you're initializing variables / struts correctly. Use calloc instead of malloc.
Also don't assume a specific bit pattern order: eg low byte then high byte. Ska endian-ness of the machine. There are usually macros in the compiler that will handle the appropriate ordering for you.
Update 15.10.16
After debugging for even more hours I found the real Problem. On the first line of the Method "sequenznummernabgleich" is a declaration of a struct
TS_TEL_TAB tel_tab_S01;
This is defined as following:
typedef struct {
TS_BOF_REC bof;
TS_REM_REC rem;
TS_EOF_REC eof;
int bof_len;
int rem_len;
int eof_len;
int cnt;
char teltyp[LEN_TELTYP+1];
TS_TEL_ENTRY entries[MAX_TEL];
} TS_TEL_TAB;
and it's embedded struct TS_TEL_ENTRY
typedef struct {
int len;
char tel[MAX_TEL_LEN];
} TS_TEL_ENTRY;
The problem is that the value for MAX_TEL_LEN had been changed from 512 to 1024 and thus the struct almost doubled in size what lead to that the STACK SIZE was not big enough anymore.
SOLUTION
Simply set the stack size from 8Mb to 64Mb. This can be achieved using ulimit command (under linux).
List current stack size: ulimit -s
Set stack size to 64Mb: ulimit -s 65535
Note: Values for stack size are in kB.
For a good short ref on ulimit command have a look # ss64
I have an issue with my pointer to a structure variable. I just started using GDB to debug the issue. The application stops when it hits on the line of code below due to segmentation fault. ptr_var is a pointer to a structure
ptr_var->page = 0;
I discovered that ptr_var is set to an invalid memory 0x0 after a series of function calls which caused the segmentation fault when assigning the value "0" to struct member "page". The series of function calls does not have a reference to ptr_var. The old address that used to be assigned to ptr_var is still in memory. I can still still print the values of members from the struct ptr_var using the old address. GDB session below shows that I am printing a string member of the struct ptr_var using its address
(gdb) x /s *0x7e11c0
0x7e0810: "Sample String"
I couldn't tell when the variable ptr_var gets assigned an invalid address 0x0. I'm a newbie to GDB and an average C programmer. Your assistance in this matter is greatly appreciated. Thank you.
What you want to do is set a watchpoint, GDB will then stop execution every time a member of a struct is modified.
With the following example code
typedef struct {
int val;
} Foo;
int main(void) {
Foo foo;
foo.val = 5;
foo.val = 10;
}
Drop a breakpoint at the creation of the struct and execute watch -l foo.val Then every time that member is changed you will get a break. The following is my GDB session, with my input
(gdb) break test.c:8
Breakpoint 3 at 0x4006f9: file test.c, line 8.
(gdb) run
Starting program: /usr/home/sean/a.out
Breakpoint 3, main () at test.c:9
9 foo.val = 5;
(gdb) watch -l foo.val
Hardware watchpoint 4: -location foo.val
(gdb) cont
Continuing.
Hardware watchpoint 4: -location foo.val
Old value = 0
New value = 5
main () at test.c:10
(gdb) cont
Continuing.
Hardware watchpoint 4: -location foo.val
Old value = 5
New value = 10
main () at test.c:11
(gdb) cont
If you can rerun, then break at a point where ptr_var is correct you can set a watch point on ptr_var like this: (gdb) watch ptr_var. Now when you continue every time ptr_var is modified gdb should stop.
Here's an example. This does contain undefined behaviour, as I'm trying to reproduce a bug, but hopefully it should be good enough to show you what I'm suggesting:
#include <stdio.h>
#include <stdint.h>
int target1;
int target2;
void
bad_func (int **bar)
{
/* Set contents of bar. */
uintptr_t ptr = (uintptr_t) bar;
printf ("Should clear %p\n", (void *) ptr);
ptr += sizeof (int *);
printf ("Will clear %p\n", (void *) ptr);
/* Bad! We just corrupted foo (maybe). */
*((int **) ptr) = NULL;
}
int
main ()
{
int *foo = &target1;
int *bar = &target2;
printf ("&foo = %p\n", (void *) &foo);
printf ("&boo = %p\n", (void *) &bar);
bad_func (&bar);
return *foo;
}
And here's a gdb session:
(gdb) break bad_func
Breakpoint 1 at 0x400542: file watch.c, line 11.
(gdb) r
&foo = 0x7fffffffdb88
&boo = 0x7fffffffdb80
Breakpoint 1, bad_func (bar=0x7fffffffdb80) at watch.c:11
11 uintptr_t ptr = (uintptr_t) bar;
(gdb) up
#1 0x00000000004005d9 in main () at watch.c:27
27 bad_func (&bar);
(gdb) watch foo
Hardware watchpoint 2: foo
(gdb) c
Continuing.
Should clear 0x7fffffffdb80
Will clear 0x7fffffffdb88
Hardware watchpoint 2: foo
Old value = (int *) 0x60103c <target1>
New value = (int *) 0x0
bad_func (bar=0x7fffffffdb80) at watch.c:18
18 }
(gdb)
For some reason the watchpoint appears to trigger on the line after the change was made, even though I compiled this with -O0, which is a bit of a shame. Still, it's usually close enough to help identify the problem.
For such kind of problems I often use the old electric fence library, it can be used to find bug in "software that overruns the boundaries of a malloc() memory allocation". You will find all the instructions and basic usage at this page:
http://elinux.org/Electric_Fence
(At the very end of the page linked above you will find the download link)
While debugging a crash using GDB, I found the program crashes in ASSERT(). The odd thing is that the pointer contains 0x0 which points to valid data.
Sample code:
#define MAX_NUM 10;
...
...
assert(x->y != NULL);
assert(x->y->z < MAX_NUM); <-- Crashes here
I can see that 'x' points to a valid address. When I do:
(gdb) print x
$16 = 0x841eda3
(gdb) print x->y
$17 = 0x0
(gdb) print *x->y
$18 = {
...
...
z = 1;
...
}
How is this possible? Shouldn't I get "Cannot access memory at address 0x0" error from GDB?
What version of GDB?
Core dumps don't always tell the truth. They can be messed up in many ways. I've had core dumps that look valid and are not. I think you simply got lucky that it looks right.
I'm running into a strange situation with passing a pointer to a structure with a very large array defined in the struct{} definition, a float array around 34MB in size. In a nutshell, the psuedo-code looks like this:
typedef config_t{
...
float values[64000][64];
} CONFIG;
int32_t Create_Structures(CONFIG **the_config)
{
CONFIG *local_config;
int32_t number_nodes;
number_nodes = Find_Nodes();
local_config = (CONFIG *)calloc(number_nodes,sizeof(CONFIG));
*the_config = local_config;
return(number_nodes);
}
int32_t Read_Config_File(CONFIG *the_config)
{
/* do init work here */
return(SUCCESS);
}
main()
{
CONFIG *the_config;
int32_t number_nodes,rc;
number_nodes = Create_Structures(&the_config);
rc = Read_Config_File(the_config);
...
exit(0);
}
The code compiles fine, but when I try to run it, I'll get a SIGSEGV at the { beneath Read_Config_File().
(gdb) run
...
Program received signal SIGSEGV, Segmentation fault.
0x0000000000407d0a in Read_Config_File (the_config=Cannot access memory at address 0x7ffffdf45428
) at ../src/config_parsing.c:763
763 {
(gdb) bt
#0 0x0000000000407d0a in Read_Config_File (the_config=Cannot access memory at address 0x7ffffdf45428
) at ../src/config_parsing.c:763
#1 0x00000000004068d2 in main (argc=1, argv=0x7fffffffe448) at ../src/main.c:148
I've done this sort of thing all the time, with smaller arrays. And strangely, 0x7fffffffe448 - 0x7ffffdf45428 = 0x20B8EF8, or about the 34MB of my float array.
Valgrind will give me similar output:
==10894== Warning: client switching stacks? SP change: 0x7ff000290 --> 0x7fcf47398
==10894== to suppress, use: --max-stackframe=34311928 or greater
==10894== Invalid write of size 8
==10894== at 0x407D0A: Read_Config_File (config_parsing.c:763)
==10894== by 0x4068D1: main (main.c:148)
==10894== Address 0x7fcf47398 is on thread 1's stack
The error messages all point to me clobbering the stack pointer, but a) I've never run across one that crashes on entry of the function and b) I'm passing pointers around, not the actual array.
Can someone help me out with this? I'm on a 64-bit CentOS box running kernel 2.6.18 and gcc 4.1.2
Thanks!
Matt
You've blown up the stack by allocating one of these huge config_t structs onto it. The two stack pointers on evidence in the gdb output, 0x7fffffffe448 and 0x7ffffdf45428, are very suggestive of this.
$ gdb
GNU gdb 6.3.50-20050815 ...blahblahblah...
(gdb) p 0x7fffffffe448 - 0x7ffffdf45428
$1 = 34312224
There's your ~34MB constant that matches the size of the config_t struct. Systems don't give you that much stack space by default, so either move the object off the stack or increase your stack space.
The short answer is that there must be a config_t declared as a local variable somewhere, which would put it on the stack. Probably a typo: missing * after a CONFIG declaration somewhere.
I am writing code to use a library called SCIP (solves optimisation problems). The library itself can be compiled in two ways: create a set of .a files, then the binary, OR create a set of shared objects. In both cases, SCIP is compiled with it's own, rather large, Makefile.
I have two implementations, one which compiles with the .a files (I'll call this program 1), the other links with the shared objects (I'll call this program 2). Program 1 is compiled using a SCIP-provided makefile, whereas program 2 is compiled using my own, much simpler makefile.
The behaviour I'm encountering occurs in the SCIP code, not in code that I wrote. The code extract is as follows:
void* BMSallocMemory_call(size_t size)
{
void* ptr;
size = MAX(size, 1);
ptr = malloc(size);
// This is where I call gdb print statements.
if( ptr == NULL )
{
printf("ERROR - unable to allocate memory for a SCIP*.\n");
}
return ptr;
}
void SCIPcreate(SCIP** A)
{
*A = (SCIP*)BMSallocMemory_call(sizeof(**(A)))
.
.
.
}
If I debug this code in gdb, and step through BMSallocMemory_call() in order to see what's happening, and view the contents of *((SCIP*)(ptr)), I get the following output:
Program 1 gdb output:
289 size = MAX(size, 1);
(gdb) step
284 {
(gdb)
289 size = MAX(size, 1);
(gdb)
290 ptr = malloc(size);
(gdb) print ptr
$1 = <value optimised out>
(gdb) step
292 if( ptr == NULL )
(gdb) print ptr
$2 = <value optimised out>
(gdb) step
290 ptr = malloc(size);
(gdb) print ptr
$3 = (void *) 0x8338448
(gdb) print *((SCIP*)(ptr))
$4 = {mem = 0x0, set = 0x0, interrupt = 0x0, dialoghdlr = 0x0, totaltime = 0x0, stat = 0x0, origprob = 0x0, eventfilter = 0x0, eventqueue = 0x0, branchcand = 0x0, lp = 0x0, nlp = 0x0, relaxation = 0x0, primal = 0x0, tree = 0x0, conflict = 0x0, cliquetable = 0x0, transprob = 0x0, pricestore = 0x0, sepastore = 0x0, cutpool = 0x0}
Program 2 gdb output:
289 size = MAX(size, 1);
(gdb) step
290 ptr = malloc(size);
(gdb) print ptr
$1 = (void *) 0xb7fe450c
(gdb) print *((SCIP*)(ptr))
$2 = {mem = 0x1, set = 0x8232360, interrupt = 0x1, dialoghdlr = 0xb7faa6f8, totaltime = 0x0, stat = 0xb7fe45a0, origprob = 0xb7fe4480, eventfilter = 0xfffffffd, eventqueue = 0x1, branchcand = 0x826e6a0, lp = 0x8229c20, nlp = 0xb7fdde80, relaxation = 0x822a0d0, primal = 0xb7f77d20, tree = 0xb7fd0f20, conflict = 0xfffffffd, cliquetable = 0x1, transprob = 0x8232360, pricestore = 0x1, sepastore = 0x822e0b8, cutpool = 0x0}
The only reason I can think of is that in either program 1's or SCIP's makefile, there is some sort of option that forces malloc to initialise memory it allocates. I simply must learn why the structure is initialised in the compiled implementation, and is not in the shared object implementation.
I doubt the difference has to do with how the two programs are built.
malloc does not initialize the memory it allocates. It may so happen by chance that the memory you get back is filled with zeroes. For example, a program that's just started is more likely to get zero-filled memory from malloc than a program that's been running for a while and allocating/deallocating memory.
edit You may find the following past questions of interest:
malloc zeroing out memory?
Create a wrapper function for malloc and free in C
When and why will an OS initialise memory to 0xCD, 0xDD, etc. on malloc/free/new/delete?
Initialization of malloc-ed memory may be implementation dependent. Implementations are free not to do so for performance reasons, but they could initialize the memory for example in debug mode.
One more note. Even uninitialized memory may contain zeros.
On Linux, according to this thread, memory will be zero-filled when first handed to the application. Thus, if your call to malloc() caused the program's heap to grow, the "new" memory will be zero-filled.
One way to verify is of course to just step into malloc() from your routine, that should make it pretty clear whether or not it contains code to initialize the memory, directly.