Does the main function of a C program ever reclaim the stack?

Does the main function of a C program ever reclaim the stack? - c

I am working through the OverTheWire wargames and one of my exploits overwrites the return address of main with the address of system. I have then used the fact that at the point main returns, esp is still pointing at one of my local variables and hence I can fill it with the command I want system to run (e.g. sh;#).
My confusion comes from that I thought functions in C reclaim the stack before returning and hence at the point the return address is called the stack pointer would be pointing at the return address rather than at the local variables. However, my exploit works so it seems that my stack pointer is pointing at the local variables when the return address is called.
The main thing I have noticed about this particular challenge compared to others is that it calls exit(0) at the end, instead of just ending, so the assembly doesn't end with leave, which may be the reason for this behaviour.
I haven't included the actual code since it's quite long and I was hoping there was a general explanation for what I am seeing, but please let me know if the assembly would be useful.

#include <stdio.h>
int main ( void )
{
printf("hello\n");
return(0);
}
the interesting relevant parts.
0000000000400430 <main>:
400430: 48 83 ec 08 sub $0x8,%rsp
400434: bf d4 05 40 00 mov $0x4005d4,%edi
400439: e8 c2 ff ff ff callq 400400 <puts#plt>
40043e: 31 c0 xor %eax,%eax
400440: 48 83 c4 08 add $0x8,%rsp
400444: c3 retq
400445: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40044c: 00 00 00
40044f: 90 nop
0000000000400450 <_start>:
400450: 31 ed xor %ebp,%ebp
400452: 49 89 d1 mov %rdx,%r9
400455: 5e pop %rsi
400456: 48 89 e2 mov %rsp,%rdx
400459: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
40045d: 50 push %rax
40045e: 54 push %rsp
40045f: 49 c7 c0 c0 05 40 00 mov $0x4005c0,%r8
400466: 48 c7 c1 50 05 40 00 mov $0x400550,%rcx
40046d: 48 c7 c7 30 04 40 00 mov $0x400430,%rdi
400474: e8 97 ff ff ff callq 400410 <__libc_start_main#plt>
400479: f4 hlt
40047a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
For the most part there is nothing special about main nor printf, etc these are just functions that conform to the calling convention. As re-asked SO questions will show sometimes the compiler will add extra stack or other calls when it sees a main() that it doesnt otherwise. but still it is a function that needs to conform to the calling convention. As seen in this case where the stack pointer is put back where it was found.
Before an operating system (Linux, Windows, MacOS, etc) can even think about running a program it needs to allocate some space for that program and tag that memory for that program in some way depending on the features of the processor and the OS, etc. Then you load the program from whatever media, and launch it at the binary file specified and/or well known entry point. A clean exit of the program will cause the operating system to free that memory, which the .text, .data, .bss and stack are the trivial/obvious ones that just go away as their memory just goes away. Other items that may have been allocated and associated with this program, open files, runtime allocated (not stack) memory, etc can/should also be freed, depends on the design of the os and/or the C library as to how that happens.
In the above case we see the bootstrap calls main and main returns then hlt is hit which this is an application not kernel code so that should cause a trap that causes the OS to clean up. An explicit exit() should be no different than a printf() or puts() or fopen() or any other function that ultimately makes one or more syscalls to the operating system. All that you can possibly find for these types of operating systems (Linux, Windows, MacOS) is the syscall. The release of memory happens outside the program as the program does not have control over it, would be a chicken and egg problem, program frees mmu tables that is using to free mmu tables...
compile and disassemble the object for main rather than the whole program
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <main+0xe>
e: 31 c0 xor %eax,%eax
10: 48 83 c4 08 add $0x8,%rsp
14: c3 retq
no surprise there same as before, all we needed to see to understand that the stack was cleaned up before return. and that main is not special:
#include <stdio.h>
int notmain ( void )
{
printf("hello\n");
return(0);
}
0000000000000000 <notmain>:
0: 48 83 ec 08 sub $0x8,%rsp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <notmain+0xe>
e: 31 c0 xor %eax,%eax
10: 48 83 c4 08 add $0x8,%rsp
14: c3 retq
Now if you are asking if there is an exit() within main then sure it wont hit the return point in main so the stack pointer is offset by whatever amount. but if main calls some function and that function calls some function then that function calls exit() then the stack pointer is left at the stack frame point of function number two plus whatever the call (this is an x86) plus the exit() stack frame adds to it. You cannot simply assume that when exit() is called, if it is called, what the stack pointer is pointing at. You would have to examine the disassembly around that call to exit() plus the exit() code and anything it calls, to figure this out.

Related

Minimal 64-bit Windows executable crashes with tail-call optimization enabled by gcc

I'm trying to create a minimal 64-bit Windows executable to better understand how the Windows executable format works.
I wrote very basic assembly and C code as follows.
hi.s
section .text
hi:
db "hi", 0
global sayHi
align 16
sayHi:
lea rax, [rel hi]
ret
start.c
extern int puts();
extern const char *sayHi();
void start() {
puts(sayHi());
}
compiled with,
nasm -fwin64 hi.s
gcc -c -ostart.obj -O3 -fno-optimize-sibling-calls start.c
# I will explain the flag
and linked with,
golink /fo r.exe /console start.obj hi.obj msvcrt.dll
# create a console application `r.exe`
# the default entry point is `start`
The program runs fine and prints hi, but note the gcc flag -fno-optimize-sibling-calls. That flag disables tail-call optimizations so that the program always allocates stack space and calls a function. Without the flag, the program crashes.
This is the disassembled result without tail-call optimization. Not sure why gcc put a nop there, but otherwise it's very simple and runs fine.
0000000000401000 <.text>:
401000: 48 83 ec 28 sub rsp,0x28
401004: e8 27 00 00 00 call 0x401030 # sayHi
401009: 48 89 c1 mov rcx,rax
40100c: e8 ff 2f 00 00 call 0x404010 # puts
401011: 90 nop
401012: 48 83 c4 28 add rsp,0x28
401016: c3 ret
...
401020: 68 69 00 90 90 push 0xffffffff90900069 # "hi"
...
401030: 48 8d 05 e9 ff ff ff lea rax,[rip+0xffffffffffffffe9] # 0x401020
401037: c3 ret
This is when tail-call opt is enabled, in which the program crashes.
0000000000401000 <.text>:
401000: 48 83 ec 28 sub rsp,0x28
401004: e8 27 00 00 00 call 0x401030 # sayHi
401009: 48 89 c1 mov rcx,rax
40100c: 48 83 c4 28 add rsp,0x28
401010: e9 eb 2f 00 00 jmp 0x404000 # puts
...
401020: 68 69 00 90 90 push 0xffffffff90900069 # "hi"
...
401030: 48 8d 05 e9 ff ff ff lea rax,[rip+0xffffffffffffffe9] # 0x401020
401037: c3 ret
Now the program doesn't allocate stack space before puts and simply does a jmp instead of call.
I investigated further to see where exactly it jumps when calling puts.
In the no-tail-call case, the called address 0x404010 in the .idata section has the instruction jmp QWORD PTR [rip+0xffffffffffffffea] # 0x404000, and 0x404000 seems to contain the address to puts.
However in the tail-call case, the called address 0x404000 has 54 40 00 00 which is no meaningful instruction. The debugger says the program segfaults at 0x404003, so I'm pretty sure the program chokes trying to execute a garbage instruction.
I must be doing something wrong, but I'm not sure which, so could you explain why the tail-call case fails and how to get it work?

The problem was on golink not correctly handling tail-calls. I searched a while to make GNU ld link the program with the same options given to golink.
You can create a console-mode Windows executable by GNU ld with this command.
ld -o... --subsystem=console object-files...
--subsystem console or -subsystem=console also means the same. Use --subsystem=windows to create a GUI application.
GNU ld also handles Windows dll files, so in this case, simply giving ld a copy of msvcrt.dll from the system folder worked.

i386-elf-gcc out put strange assembler command about "static a = 0"

i am write a mini os. And when i write this code to show time clock, its goes wrong
7 void timer_callback(pt_regs *regs)
8 {
9 static uint32_t tick = 0;
10 printf("Tick: %dtimes\n", tick);
11 tick++;
12 }
tick is initialise not with 0, but 1818389861. but if tick init with 0x01 or anything else zero, it's ok!!!
so i wirte a simple c file then objdump:
staic.o: file format elf32-i386
Disassembly of section .text:
00000000 <main>:
extern void printf(char *, int);
int main(){
0: 8d 4c 24 04 lea 0x4(%esp),%ecx
4: 83 e4 f0 and $0xfffffff0,%esp
7: ff 71 fc pushl -0x4(%ecx)
a: 55 push %ebp
b: 89 e5 mov %esp,%ebp
d: 51 push %ecx
e: 83 ec 04 sub $0x4,%esp
static int a = 1;
printf("%d\n", a);
11: a1 00 00 00 00 mov 0x0,%eax
16: 83 ec 08 sub $0x8,%esp
19: 50 push %eax
1a: 68 00 00 00 00 push $0x0
1f: e8 fc ff ff ff call 20 <main+0x20>
24: 83 c4 10 add $0x10,%esp
return 0;
27: b8 00 00 00 00 mov $0x0,%eax
}
2c: 8b 4d fc mov -0x4(%ebp),%ecx
2f: c9 leave
30: 8d 61 fc lea -0x4(%ecx),%esp
33: c3 ret
so strange, no memory used!!!
Update: let me say it clearly
the second static.c is an experiment, it was thought it show no memory used, but i was wrong, mov 0x0 %eab is. i confuse 0x0 and $0x0 /..\
my origin problem is why tick not succeed init with 0.(but can init with 1 or anyelsenumber).
i look up it again use gdb, ok, it do use memory like mov
eax,ds:0x106010,but the real strong thing is the memory x 0x106010 is not 0,but it should be, just as i said, if i let tick = 1 or anythingelse, memory do init as i want, that is the strange thing!
the tool: gdb ,objdump return different asm(different means,not formate),because, just learn os,not good at c, so i let it go,ignore it....

Memory is used, be sure of that; however, you won't find that memory in the .text section. Memory for static variables is allocated in either .bss (when zero-initialized; or, in case of C++, dynamically initialized) or .data (when non-zero initialized) section.
When dumping object files with objdump using the -d (disassembly) option, it is important to also use the -r (relocations) option. Without that, the disassembly you get is deceiving and makes little sense.
In your case, the instruction at addresses 11 and 1f must have relocations, at address 11, to the variable a and at address 1f, to the function printf. The instruction at address 11 loads the value from your variable a, without proper relocations it looks as if it loaded a value from address 0.
As to your original question, the value you get, 1818389861, or 0x6C626D65, is quite remarkable. I would bet that somewhere in your program you have a buffer overrun involving a string containing the subsequence embl.
As a side note, I would like to call your attention to the use of correct type specifications in printf calls. The type specification %d corresponds to the type int; on all modern mainstream architectures, int and int32_t are of the same size. However, that is not guaranteed to always be so. There are special type specifications for use with explicitly-sized types, for example, for an int32_t you use "PRId32":
uint32_t x;
printf("%"PRId32, x);

Calling C function which takes no parameters with parameters

I have some weird question about probably undefined behavior between C calling convention and 64/32 bits compilation.
First here is my code:
int f() { return 0; }
int main()
{
int x = 42;
return f(x);
}
As you can see I am calling f with an argument while f takes no parameters.
My first question was does this argument is really given to f while calling it.
The mysterious lines
After a little objdump I obtained curious results.
While passing x as argument of f:
00000000004004b6 <f>:
4004b6: 55 push %rbp
4004b7: 48 89 e5 mov %rsp,%rbp
4004ba: b8 00 00 00 00 mov $0x0,%eax
4004bf: 5d pop %rbp
4004c0: c3 retq
00000000004004c1 <main>:
4004c1: 55 push %rbp
4004c2: 48 89 e5 mov %rsp,%rbp
4004c5: 48 83 ec 10 sub $0x10,%rsp
4004c9: c7 45 fc 2a 00 00 00 movl $0x2a,-0x4(%rbp)
4004d0: 8b 45 fc mov -0x4(%rbp),%eax
4004d3: 89 c7 mov %eax,%edi
4004d5: b8 00 00 00 00 mov $0x0,%eax
4004da: e8 d7 ff ff ff callq 4004b6 <f>
4004df: c9 leaveq
4004e0: c3 retq
4004e1: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4004e8: 00 00 00
4004eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Without passing x as a argument:
00000000004004b6 <f>:
4004b6: 55 push %rbp
4004b7: 48 89 e5 mov %rsp,%rbp
4004ba: b8 00 00 00 00 mov $0x0,%eax
4004bf: 5d pop %rbp
4004c0: c3 retq
00000000004004c1 <main>:
4004c1: 55 push %rbp
4004c2: 48 89 e5 mov %rsp,%rbp
4004c5: 48 83 ec 10 sub $0x10,%rsp
4004c9: c7 45 fc 2a 00 00 00 movl $0x2a,-0x4(%rbp)
4004d0: b8 00 00 00 00 mov $0x0,%eax
4004d5: e8 dc ff ff ff callq 4004b6 <f>
4004da: c9 leaveq
4004db: c3 retq
4004dc: 0f 1f 40 00 nopl 0x0(%rax)
So as we can see:
4004d0: 8b 45 fc mov -0x4(%rbp),%eax
4004d3: 89 c7 mov %eax,%edi
happen when I call f with x but because I am not really good in assembly I don't really understand these lines.
The 64/32 bits paradoxe
Otherwise I tried something else and start printing the stack of my program.
Stack with x given to f (compiled in 64bits):
Address of x: ffcf115c
ffcf1128: 0 0
ffcf1130: -3206820 0
ffcf1138: -3206808 134513826
ffcf1140: 42 -3206820
ffcf1148: -145495616 134513915
ffcf1150: 1 -3206636
ffcf1158: -3206628 42
ffcf1160: -143903780 -3206784
Stack with x not given to f (compiled in 64bits):
Address of x: 3c19183c
3c191818: 0 0
3c191820: 1008277568 32766
3c191828: 4195766 0
3c191830: 1008277792 32766
3c191838: 0 42
3c191840: 4195776 0
And for some reason in 32bits x seems to be push on the stack.
Stack with x given to f (compiled in 32bits):
Address of x: ffdc8eac
ffdc8e78: 0 0
ffdc8e80: -2322772 0
ffdc8e88: -2322760 134513826
ffdc8e90: 42 -2322772
ffdc8e98: -145086016 134513915
ffdc8ea0: 1 -2322588
ffdc8ea8: -2322580 42
ffdc8eb0: -143494180 -2322736
Why the hell does x appear in 32 but not 64 ???
Code for printing: http://paste.awesom.eu/yayg/QYw6&ln
Why am I asking such stupid questions ?
First because I didn't found any standard that answer to my question
Secondly, think about calling a variadic function in C without the count of arguments given.
Last but not least, I think undefined behavior is fun.
Thank you for taking the time to read until here and for helping me understanding something or making me realize that my questions are pointless.

The answer is that, as you suspect, what you are doing is undefined behavior (in the case where the superfluous argument is passed).
The actual behavior in many implementations is harmless, however. An argument is prepared on the stack, and is ignored by the called function. The called function is not responsible for removing arguments from the stack, so there no harm (such as an unbalanced stack pointer).
This harmless behavior was what enabled C hackers to develop, once upon a time, a variable argument list facility that used to be under #include <varargs.h> in ancient versions of the Unix C library.
This evolved into the ANSI C <stdarg.h>.
The idea was: pass extra arguments into a function, and then march through the stack dynamically to retrieve them.
That won't work today. For instance, as you can see, the parameter is not in fact put into the stack, but loaded into the RDI register. This is the convention used by GCC on x86-64. If you march through the stack, you won't find the first several parameters. On IA-32, GCC passes parameters using the stack, by contrast: though you can get register-based behavior with the "fastcall" convention.
The va_arg macro from <stdarg.h> will correctly take into account the mixed register/stack parameter passing convention. (Or, rather, when you use the correct declaration for a variadic function, it will perhaps suppress the passage of the trailing arguments in registers, so that va_arg can just march through memory.)
P.S. your machine code might be easier to follow if you added some optimization. For instance, the sequence
4004c9: c7 45 fc 2a 00 00 00 movl $0x2a,-0x4(%rbp)
4004d0: 8b 45 fc mov -0x4(%rbp),%eax
4004d3: 89 c7 mov %eax,%edi
4004d5: b8 00 00 00 00 mov $0x0,%eax
is fairly obtuse due to what look like some wasteful data moves.

How arguments are passed to a function is dependent on the platform ABI (application binary interface). The ABI makes it possible to compile libraries with compiler X and use them with code compiled with compiler Y. None of this is defined by the standard.
There is no requirement by the standard that a "stack" even exist, much less that it be used for function calling.
The x86 chips had limited numbers of registers, and the ABI reflects that fact; the normal 32-bit x86 calling convention uses the stack for all arguments.
That is not the case with the 64-bit architecture, which has many more registers and uses some of them for the first few parameters. This significantly speeds up function calls.
Similarly, the Windows 32-bit "fastcall" calling convention passes a few arguments in registers. (In order to use a non-standard calling convention, you need to appropriately annotate the function declaration, and do so consistently where it is defined.)
You can find more information on various calling conventions in this Wikipedia article. The AMD64 ABI can be found on x86-64.org (PDF document). The original System V IA-32 ABI (the basis of the ABI used on Linux, xBSD and OS X) can still be accessed from www.sco.com (PDF document).
Undefined behaviour?
The code presented in the OP is definitely undefined behaviour.
In a function definition, an empty parameter list means that the function does not take any arguments. In a function declaration, an empty parameter fails to declare how many arguments the function takes.
§6.7.6.3/p.14: An empty list in a function declarator that is part of a definition of that function specifies that the function has no parameters. The empty list in a function declarator that is not part of a definition of that function specifies that no information about the number or types of the parameters is supplied.
When the function is eventually called, it must be called with the correct number of parameters:
§6.5.2.2/p.6: If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double... If the number of arguments does not equal the number of parameters, the behavior is undefined.
If the function is defined as a vararg function (with a trailing ellipsis), the vararg declaration must be visible wherever the function is called.
(Continuing from previous quote): If the function is defined with a type that includes a prototype, and either the prototype ends with an ellipsis (, ...) or the types of the arguments after promotion are not compatible with the types of the parameters, the behavior is undefined.

Assembly - why is %rsp decremented by so much, and why are arguments stored at the top of the stack?

Assembly newbie here... I wrote the following simple C program:
void fun(int x, int* y)
{
char arr[4];
int* sp;
sp = y;
}
int main()
{
int i = 4;
fun(i, &i);
return 0;
}
I compiled it with gcc and ran objdump with -S, but the Assembly code output is confusing me:
000000000040055d <fun>:
void fun(int x, int* y)
{
40055d: 55 push %rbp
40055e: 48 89 e5 mov %rsp,%rbp
400561: 48 83 ec 30 sub $0x30,%rsp
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
40056c: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
400573: 00 00
400575: 48 89 45 f8 mov %rax,-0x8(%rbp)
400579: 31 c0 xor %eax,%eax
char arr[4];
int* sp;
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
}
400583: 48 8b 45 f8 mov -0x8(%rbp),%rax
400587: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
40058e: 00 00
400590: 74 05 je 400597 <fun+0x3a>
400592: e8 a9 fe ff ff callq 400440 <__stack_chk_fail#plt>
400597: c9 leaveq
400598: c3 retq
0000000000400599 <main>:
int main()
{
400599: 55 push %rbp
40059a: 48 89 e5 mov %rsp,%rbp
40059d: 48 83 ec 10 sub $0x10,%rsp
int i = 4;
4005a1: c7 45 fc 04 00 00 00 movl $0x4,-0x4(%rbp)
fun(i, &i);
4005a8: 8b 45 fc mov -0x4(%rbp),%eax
4005ab: 48 8d 55 fc lea -0x4(%rbp),%rdx
4005af: 48 89 d6 mov %rdx,%rsi
4005b2: 89 c7 mov %eax,%edi
4005b4: e8 a4 ff ff ff callq 40055d <fun>
return 0;
4005b9: b8 00 00 00 00 mov $0x0,%eax
}
4005be: c9 leaveq
4005bf: c3 retq
First, in the line:
400561: 48 83 ec 30 sub $0x30,%rsp
Why is the stack pointer decremented so much in the call to 'fun' (48 bytes)? I assume it has to do with alignment issues, but I cannot visualize why it would need so much space (I only count 12 bytes for local variables (assuming 8 byte pointers))?
Second, I thought that in x86_64, the arguments to a function are either stored in specific registers, or if there are a lot of them, just 'above' (with a downward growing stack) the base pointer, %rbp. Like in the picture at http://en.wikipedia.org/wiki/Call_stack#Structure except 'upside-down'.
But the lines:
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
suggest to me that they are being stored way down from the base of the stack (%rsi and %edi are where main put the arguments, right before calling 'fun', and 0x30 down from %rbp is exactly where the stack pointer is pointing...). And when I try to do stuff with them , like assigning their values to local variables, it grabs them from those locations near the head of the stack:
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
... what is going on here?! I would expect them to grab the arguments from either the registers they were stored in, or just above the base pointer, where I thought they are 'supposed to be', according to every basic tutorial I read. Every answer and post I found on here related to stack frame questions confirms my understanding of what stack frames "should" look like, so why is my Assembly output so darn weird?

Because that stuff is a hideously simplified version of what really goes on. It's like wondering why Newtonian mechanics doesn't model the movement of the planets down to the millimeter. Compilers need stack space for all sorts of things. For example, saving callee-saved registers.
Also, the fundamental fact is that debug-mode compilations contain all sorts of debugging and checking machinery. The compiler outputs all sorts of code that checks that your code is correct, for example the call to __stack_chk_fail.
There are only two ways to understand the output of a given compiler. The first is to implement the compiler, or be otherwise very familiar with the implementation. The second is to accept that whatever you understand is a gross simplification. Pick one.

Because you're compiling without optimization, the compiler does lots of extra stuff to maybe make things easier to debug, which use lots of extra space.
it does not attempt to compress the stack frame to reuse memory for anything, or get rid of any unused things.
it redundantly copies the arguments into the stack frame (which requires still more memory)
it copies a 'canary' on to the stack to guard against stack smashing buffer overflows (even though they can't happen in this code).
Try turning on optimization, and you'll see more real code.

This is 64 bit code. 0x30 of stack space corresponds to 6 slots on the stack. You have what appears to be:
2 slots for function arguments (which happen also to be passed in registers)
2 slots for local variables
1 slot for saving the AX register
1 slot looks like a stack guard, probably related to DEBUG mode.
Best thing is to experiment rather than ask questions. Try compiling in different modes (DEBUG, optimisation, etc), and with different numbers and types of arguments and variables. Sometimes asking other people is just too easy -- you learn better by doing your own experiments.

Anti-debugging: gdb does not write 0xcc byte for breakpoints. Any idea why?

I am learning some anti-debugging techniques on Linux and found a snippet of code for checking 0xcc byte in memory to detect the breakpoints in gdb. Here is that code:
if ((*(volatile unsigned *)((unsigned)foo + 3) & 0xff) == 0xcc)
{
printf("BREAKPOINT\n");
exit(1);
}
foo();
But it does not work. I even tried to set a breakpoint on foo() function and observe the contents in memory, but did not see any 0xcc byte written for breakpoint. Here is what I did:
(gdb) b foo
Breakpoint 1 at 0x804846a: file p4.c, line 8.
(gdb) x/x 0x804846a
0x804846a <foo+6>: 0xe02404c7
(gdb) x/16x 0x8048460
0x8048460 <frame_dummy+32>: 0x90c3c9d0 0x83e58955 0x04c718ec 0x0485e024
0x8048470 <foo+12>: 0xfefae808 0xc3c9ffff .....
As you can see, there seems to be no 0xcc byte written on the entry point of foo() function. Does anyone know what's going on or where I might be wrong? Thanks.

Second part is easily explained (as Flortify correctly stated):
GDB shows original memory contents, not the breakpoint "bytes". In default mode it actually even removes breakpoints when debugger suspends and re-inserts them before continuing. Users typically want to see their code, not strange modified instructions used for breakpoints.
With your C code you missed breakpoint for few bytes. GDB sets breakpoint after function prologue, because function prologue is not typically what gdb users want to see. So, if you put break to foo, actual breakpoint will be typically located few bytes after that (depends on prologue code itself that is function dependent as it may or might not have to save stack pointer, frame pointer and so on). But it is easy to check. I used this code:
#include <stdio.h>
int main()
{
int i,j;
unsigned char *p = (unsigned char*)main;
for (j=0; j<4; j++) {
printf("%p: ",p);
for (i=0; i<16; i++)
printf("%.2x ", *p++);
printf("\n");
}
return 0;
}
If we run this program by itself it prints:
0x40057d: 55 48 89 e5 48 83 ec 10 48 c7 45 f8 7d 05 40 00
0x40058d: c7 45 f4 00 00 00 00 eb 5a 48 8b 45 f8 48 89 c6
0x40059d: bf 84 06 40 00 b8 00 00 00 00 e8 b4 fe ff ff c7
0x4005ad: 45 f0 00 00 00 00 eb 27 48 8b 45 f8 48 8d 50 01
Now we run it in gdb (output re-formatted for SO).
(gdb) break main
Breakpoint 1 at 0x400585: file ../bp.c, line 6.
(gdb) info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x0000000000400585 in main at ../bp.c:6
(gdb) disas/r main,+32
Dump of assembler code from 0x40057d to 0x40059d:
0x000000000040057d (main+0): 55 push %rbp
0x000000000040057e (main+1): 48 89 e5 mov %rsp,%rbp
0x0000000000400581 (main+4): 48 83 ec 10 sub $0x10,%rsp
0x0000000000400585 (main+8): 48 c7 45 f8 7d 05 40 00 movq $0x40057d,-0x8(%rbp)
0x000000000040058d (main+16): c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp)
0x0000000000400594 (main+23): eb 5a jmp 0x4005f0
0x0000000000400596 (main+25): 48 8b 45 f8 mov -0x8(%rbp),%rax
0x000000000040059a (main+29): 48 89 c6 mov %rax,%rsi
End of assembler dump.
With this we verified, that program is printing correct bytes. But this also shows that breakpoint has been inserted at 0x400585 (that is after function prologue), not at first instruction of function.
If we now run program under gdb (with run) and then "continue" after breakpoint is hit, we get this output:
(gdb) cont
Continuing.
0x40057d: 55 48 89 e5 48 83 ec 10 cc c7 45 f8 7d 05 40 00
0x40058d: c7 45 f4 00 00 00 00 eb 5a 48 8b 45 f8 48 89 c6
0x40059d: bf 84 06 40 00 b8 00 00 00 00 e8 b4 fe ff ff c7
0x4005ad: 45 f0 00 00 00 00 eb 27 48 8b 45 f8 48 8d 50 01
This now shows 0xcc being printed for address 9 bytes into main.

If your hardware supports it, GDB may be using Hardware Breakpoints, which do not patch the code.
While I have not confirmed this via any official docs, this page indicates that
By default, gdb attempts to use hardware-assisted break-points.
Since you indicate expecting 0xCC bytes, I'm assuming you're running on x86 hardware, as the int3 opcode is 0xCC. x86 processors have a set of debug registers DR0-DR3, where you can program the address of data to cause a breakpoint exception. DR7 is a bitfield which controls the behavior of the breakpoints, and DR6 indicates the status.
The debug registers can only be read/written from Ring 0 (kernel mode). That means that the kernel manages these registers for you (via the ptrace API, I believe.)
However, for the sake of anti-debugging, all hope is not lost! On Windows, the GetThreadContext API allows you to get (a copy) of the CONTEXT for a (stopped) thread. This structure includes the contents of the DRx registers. This question is about how to implement the same on Linux.

This may also be a white lie that GDB is telling you... there may be a breakpoint there in RAM but GDB has noted what was there beforehand (so it can restore it later) and is showing you that, instead of the true contents of RAM.
Of course, it could also be using Hardware Breakpoints, which is a facility available on some processors. Setting h/w breakpoints is done by telling the processor the address it should watch out for (and trigger a breakpoint interrupt if it gets hit by the program counter while executing code).