x86 ADC carry flag and length - c

I'm just doing some analysis of a disassembled 32-bit program I wrote in C. Here is a portion of the output from the disassembler:
41153c 02 00 add al, [eax]
41153e 00 00 add [eax], al
411540 44 inc esp
411541 15 41 00 F8 FF adc eax, 0xfff80041 ; "A"
411546 FF invalid
I'm just trying to make sense of the ADC instruction. From what I've read in both the Intel developer's manual and various articles on x86 assembly, the opcode 0x15 is the ADC instruction with EAX as the destination, and it would appear that following the opcode is a four-byte 'immediate' that indicates the memory address for use in the add with carry.
However I'm a little unsure as to why the following byte (0xFF) is being marked as invalid.
I'm quite new to assembler, but I'm assuming this is something to do with the carry flag, and might be there to sign-extend the value.
I've used two separate disassemblers to look at the code, and whilst one marks it as invalid, the other simply ignores it.
If someone could offer some advice, I'd appreciate it.
Thanks
UPDATE
I'll add a bit more information to this post, as there are two more ADC operations, and one of them doesn't have the extra 'invalid' byte.
411547 FF 04 00 inc [eax+eax]
41154a 00 00 add [eax], al
41154c 61 popa
41154d 15 41 00 EC FF adc eax, 0xffec0041 ; "A"
411552 FF invalid
411553 FF 04 00 inc [eax+eax]
411556 00 00 add [eax], al
411558 5C pop esp
411559 15 41 00 74 65 adc eax, 0x65740041 ; "A"
41155e 73 74 jnc 0x4115d4 ↓
The second ADC also has the extra 0xff 'invalid' byte; however, the third does not.
From what I can see, the only difference between all three ADC operations is that the first two start with 0xff and have an extra 'invalid' byte, whilst the third does not. I'm assuming that this is forming some kind of flag to indicate if the extra byte is needed.

This is not code. You're trying to disassemble data.
If we rearrange the bytes aligned to 4, we can make some sense of it:
411548: 04 00 00 00 ; integer 4
41154C: 61 15 41 00 ; address 0x411561
411550: EC FF FF FF ; integer 0xFFFFFFEC (-20)
411554: 04 00 00 00 ; integer 4
411558: 5C 15 41 00 ; address 0x41155C
41155C: 74 65 73 74 ; string 'test'

Related

Understanding how assembly language passes arguments from one method to another

For a few hours now, I've been trying to enlarge my understanding of assembly language by reading and trying to understand the instructions of a very simple program I wrote in C, to teach myself how arguments are handled in ASM.
#include <stdio.h>
int say_hello();
int main(void) {
printf("say_hello() -> %d\n", say_hello(10, 20, 30, 40, 50, 60, 70, 80, 90, 100));
}
int say_hello(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j) {
printf("a:b:c:d:e:f:g:h:i:j -> %d:%d:%d:%d:%d:%d:%d:%d:%d:%d\n", a, b, c, d, e, f, g, h, i, j);
return 1000;
}
The program is, as I said, very basic and contains two functions: main, and another one called say_hello which takes 10 arguments, from a to j, and prints each of them in a printf call. I had already tried the same process (trying to understand the instructions and what's happening) with the same program and fewer arguments, and I think I was able to understand most of it. But then I wondered: "OK, but what happens if I have so many arguments that there aren't any more registers available to store the values in?"
So I went to look for how many registers were available and usable in my case, and I found out from this website that "only" (not sure, correct me if I'm wrong) the following registers could be used to store argument values: edi, esi, r8d, r9d, r10d, r11d, edx and ecx, which is 8. So I modified my C program and added a few more arguments so that I reached the limit of 8; I even added one more, I don't really know why, let's say just in case.
So when I compiled my program using gcc with no optimization-related options whatsoever, I was expecting the main() function to push the values that were left over after all 8 registers had been used, but I wasn't expecting anything in particular from the say_hello() function; that's pretty much why I tried this out in the first place.
So I compiled my program, then disassembled it using the objdump command (more specifically, the full command I used was objdump -d -M intel helloworld) and started looking for my main function, which was doing pretty much what I expected:
000000000000064a <main>:
64a: 55 push rbp
64b: 48 89 e5 mov rbp,rsp
64e: 6a 64 push 0x64
650: 6a 5a push 0x5a
652: 6a 50 push 0x50
654: 6a 46 push 0x46
656: 41 b9 3c 00 00 00 mov r9d,0x3c
65c: 41 b8 32 00 00 00 mov r8d,0x32
662: b9 28 00 00 00 mov ecx,0x28
667: ba 1e 00 00 00 mov edx,0x1e
66c: be 14 00 00 00 mov esi,0x14
671: bf 0a 00 00 00 mov edi,0xa
676: b8 00 00 00 00 mov eax,0x0
67b: e8 1e 00 00 00 call 69e <say_hello>
680: 48 83 c4 20 add rsp,0x20
684: 89 c6 mov esi,eax
686: 48 8d 3d 0b 01 00 00 lea rdi,[rip+0x10b] # 798 <_IO_stdin_used+0x8>
68d: b8 00 00 00 00 mov eax,0x0
692: e8 89 fe ff ff call 520 <printf@plt>
697: b8 00 00 00 00 mov eax,0x0
69c: c9 leave
69d: c3 ret
So, as I expected, it pushed the values that were left over after all the registers had been used onto the stack, and then just did the usual work to pass values from one function to another. But then I went to look at the say_hello function, and it got me really confused.
000000000000069e <say_hello>:
69e: 55 push rbp
69f: 48 89 e5 mov rbp,rsp
6a2: 48 83 ec 20 sub rsp,0x20
6a6: 89 7d fc mov DWORD PTR [rbp-0x4],edi
6a9: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
6ac: 89 55 f4 mov DWORD PTR [rbp-0xc],edx
6af: 89 4d f0 mov DWORD PTR [rbp-0x10],ecx
6b2: 44 89 45 ec mov DWORD PTR [rbp-0x14],r8d
6b6: 44 89 4d e8 mov DWORD PTR [rbp-0x18],r9d
6ba: 44 8b 45 ec mov r8d,DWORD PTR [rbp-0x14]
6be: 8b 7d f0 mov edi,DWORD PTR [rbp-0x10]
6c1: 8b 4d f4 mov ecx,DWORD PTR [rbp-0xc]
6c4: 8b 55 f8 mov edx,DWORD PTR [rbp-0x8]
6c7: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
6ca: 48 83 ec 08 sub rsp,0x8
6ce: 8b 75 28 mov esi,DWORD PTR [rbp+0x28]
6d1: 56 push rsi
6d2: 8b 75 20 mov esi,DWORD PTR [rbp+0x20]
6d5: 56 push rsi
6d6: 8b 75 18 mov esi,DWORD PTR [rbp+0x18]
6d9: 56 push rsi
6da: 8b 75 10 mov esi,DWORD PTR [rbp+0x10]
6dd: 56 push rsi
6de: 8b 75 e8 mov esi,DWORD PTR [rbp-0x18]
6e1: 56 push rsi
6e2: 45 89 c1 mov r9d,r8d
6e5: 41 89 f8 mov r8d,edi
6e8: 89 c6 mov esi,eax
6ea: 48 8d 3d bf 00 00 00 lea rdi,[rip+0xbf] # 7b0 <_IO_stdin_used+0x20>
6f1: b8 00 00 00 00 mov eax,0x0
6f6: e8 25 fe ff ff call 520 <printf@plt>
6fb: 48 83 c4 30 add rsp,0x30
6ff: b8 e8 03 00 00 mov eax,0x3e8
704: c9 leave
705: c3 ret
706: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
70d: 00 00 00
I'm really sorry in advance: I'm not exactly sure I understand what the square brackets do, but from what I've read and understood it's a way to "point" to the address containing the value I want (please correct me if I'm wrong), so for example mov DWORD PTR [rbp-0x4],edi moves the value in edi to the address rbp-0x4, right?
I'm also not sure why this process is required. Can't the say_hello function just read edi, for example, and that's it? Why does the program have to move it into [rbp-0x4] and then read it back from [rbp-0x4] into eax?
So the program just goes on and reads every value it needs, putting them into available registers, and when it reaches the point where there are no registers left, it starts moving each of them into esi and pushing them onto the stack, repeating the process until all 10 arguments have been stored somewhere.
So that made sense; I was satisfied, and then went to double-check that I had really got it, so I started reading from bottom to top, from 0x6ea to 0x6e2. The sample I'm working on is:
6e2: 45 89 c1 mov r9d,r8d
6e5: 41 89 f8 mov r8d,edi
6e8: 89 c6 mov esi,eax
6ea: 48 8d 3d bf 00 00 00 lea rdi,[rip+0xbf] # 7b0 <_IO_stdin_used+0x20>
So just like on all my previous tests, I was expecting the arguments to go in "reverse" like the first argument is the last instruction executed, and the last one the first instruction executed, so I started double checking every field.
So the first one, rdi was [rip+0x10b] which I thought for sure was pointing to my string.
So then I moved to 0x6e8, which moves eax, currently equal to the value stored in [rbp-0x4], which was loaded from edi at 0x6a6; and edi was set to 0xa (10) at 0x671. So my first argument is my string, and the second one is 10, which is exactly what I expected.
But then, when I jumped to the instruction executed right before 0x6e8, i.e. 0x6e5, I was expecting it to be 20, so I did the same process. edi is moved to r8d and is currently equal to the value stored in [rbp-0x10], which came from ecx, which, as set at 0x662, is... 40? What the heck? I'm confused, why would it be 40? Then I tried looking up the instruction right above that one and found 50, and did the same for the next one, and again I found 60! Why? Is the way I get those values wrong? Am I missing something in the instructions? Or did I just assume something from looking at my previous programs (which all had far fewer arguments, and were all in "reverse" like I said earlier) that I should not have?
I'm sorry if this is a dumb post, I'm very new to ASM (a few hours of experience!) and just trying to get my mind clear on this one, as I really can't figure it out alone. I'm also sorry if this post is too long; I was trying to include a lot of information so that what I'm trying to do, the result I get, and what my problem is are all clear. Thanks a lot for reading, and an even bigger thanks to anyone who will help!

Why is a byte transfer used for accessing elements of a char array?

Let's consider this very simple code
int main(void)
{
char buff[500];
int i;
for (i=0; i<500; i++)
{
(buff[i])++;
}
}
So, it just goes through 500 bytes and increments each one. This code was compiled using gcc on the x86-64 architecture and disassembled using the objdump -D utility. Looking at the disassembled code, I found out that data is transferred between memory and registers byte by byte (see: the movzbl instruction is used to load data from memory, and mov %dl is used to store data back).
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 48 81 ec 88 01 00 00 sub $0x188,%rsp
4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4004ff: eb 20 jmp 400521 <main+0x34>
400501: 8b 45 fc mov -0x4(%rbp),%eax
400504: 48 98 cltq
400506: 0f b6 84 05 00 fe ff movzbl -0x200(%rbp,%rax,1),%eax
40050d: ff
40050e: 8d 50 01 lea 0x1(%rax),%edx
400511: 8b 45 fc mov -0x4(%rbp),%eax
400514: 48 98 cltq
400516: 88 94 05 00 fe ff ff mov %dl,-0x200(%rbp,%rax,1)
40051d: 83 45 fc 01 addl $0x1,-0x4(%rbp)
400521: 81 7d fc f3 01 00 00 cmpl $0x1f3,-0x4(%rbp)
400528: 7e d7 jle 400501 <main+0x14>
40052a: c9 leaveq
40052b: c3 retq
40052c: 0f 1f 40 00 nopl 0x0(%rax)
It looks like this has performance implications, because you have to access memory 500 times to read and 500 times to store. I know the cache system will cope with it somehow, but still.
My question is: why can't we load a quadword, do a couple of bit operations to increment each byte of it, and then store it back to memory? Obviously it would require some additional logic to deal with a final chunk of data smaller than a quadword, and an extra register, but this approach would dramatically reduce the number of memory accesses, which are the most expensive operations. Probably I don't see some obstacle that inhibits such an optimization, so it would be great to get an explanation here.
Reason why this shouldn't be done: Imagine if char happened to be unsigned (to make overflow have defined behavior) and you had a byte 0xFF followed (or preceded, depending on endianness) by 0x1.
Incrementing a byte at a time, you'd end up with the 0xFF becoming 0x00 and the 0x01 becoming 0x02. But if you just loaded 4 or 8 bytes at a time and added 0x01010101 (or eight byte equivalent) to achieve the same result, the 0xFF would overflow into the 0x01, so you'd end up with 0x00 and 0x03, not 0x00 and 0x02.
Similar issues would typically occur with signed char too; signed overflow and truncation rules (or lack thereof) make it more complicated, but the gist is that incrementing a byte at a time limits effects to that byte, with no cross-byte "interference".
When you compile without optimization, the compiler does a more literal translation of code to assembly, part of the reason for this is so that when you step through the code in a debugger, the steps correspond to your code.
If you enable optimization then the assembly may look completely different.
Also, your program causes undefined behaviour by reading the uninitialized chars in buff.

Type casting a macro to optimize the code

I'm working on optimizing some code. Is it a good idea to cast a macro to char to reduce memory consumption? What could be the side effects of doing this?
Example:
#define TRUE 1 //non-optimized code
sizeof(TRUE) --> 4
#define TRUE ((char) 0x01) //To optimize
sizeof(TRUE) --> 1
#define MAX 10 //non-optimized code
sizeof(MAX) --> 4
#define MAX ((char) 10) //To optimize
sizeof(MAX) --> 1
They will make virtually no difference in memory consumption.
These macros provide values to be used in expressions, while the actual memory usage is (roughly) dictated by the type and number of variables and dynamically allocated memory. So, you may have TRUE as an int or as a char, but what actually matters is the type of variable it (or, the expression in which it appears) gets assigned to, which is not influenced by the type of the constant.
The only influence the type of these constants may have is in how the expressions they are used in are carried out; but even that effect should be almost nonexistent, given that the C standard (simplifying) implicitly promotes all the smaller types to int or unsigned before carrying out almost any operation.[1]
So: if you want to reduce your memory consumption, don't look at your constants, but at your data structures, possibly global and dynamically-allocated ones[2]! Maybe you have a huge array of double values where the precision of float would be enough, maybe you are keeping big data around longer than you need it, or you have memory leaks, or a big array of a badly-laid-out struct, or of booleans that are 4 bytes wide when they could be a bitfield; this is the kind of thing you should look for, definitely not these #defines.
Notes
[1] The idea being that integral operations are carried out at the native register size, which traditionally corresponds to int. Besides, even if this rule weren't true, the only memory effect of changing the size of integral temporary values in expressions would be at most to increase stack usage a bit (which is generally mostly preallocated anyway) in case of heavy register spilling.
[2] What is allocated on the stack generally isn't problematic; as said above, it's generally preallocated, and if you were exhausting it your program would already be crashing.
There is no such thing as a char constant in C, which is why there are no suffixes for "short" and "char" as there are for "long" and "long long". The cast value (char)0x10 will immediately be promoted back to an int in almost any integer context, because of the integer promotions (§6.3.1.1p2).
So if c is a char and you write if (c == (char)0x10) ..., both c and (char)0x10 are promoted to int before being compared.
Of course, a given compiler might elide the conversion if it knows that it makes no difference, but that compiler would certainly also use a byte constant if possible even without the explicit cast.
The optimization level depends on (1) where those defines are used and (2) the processor (or microcontroller) architecture you're running the code on.
(1) has already been addressed in other answers.
(2) is important because some processors/microcontrollers perform better with 8 bits than with 32 bits. There are processors that are, for example, 16-bit, and if you use 8-bit variables it may decrease the memory needed but increase the run time of the program.
Below is an example and its disassembly:
#include <stdint.h>
#define _VAR_UINT8 ((uint8_t) -1)
#define _VAR_UINT16 ((uint16_t) -1)
#define _VAR_UINT32 ((uint32_t) -1)
#define _VAR_UINT64 ((uint64_t) -1)
volatile uint8_t v1b;
volatile uint16_t v2b;
volatile uint32_t v4b;
volatile uint64_t v8b;
int main(void) {
v1b = _VAR_UINT8;
v2b = _VAR_UINT8;
v2b = _VAR_UINT16;
v4b = _VAR_UINT8;
v4b = _VAR_UINT16;
v4b = _VAR_UINT32;
v8b = _VAR_UINT8;
v8b = _VAR_UINT16;
v8b = _VAR_UINT32;
v8b = _VAR_UINT64;
return 0;
}
Below is the disassembly for one specific x86-64 platform (it could be different if you compile the above code and disassemble it on your own processor):
00000000004004ec <main>:
4004ec: 55 push %rbp
4004ed: 48 89 e5 mov %rsp,%rbp
4004f0: c6 05 49 0b 20 00 ff movb $0xff,0x200b49(%rip) # 601040 <v1b>
4004f7: 66 c7 05 48 0b 20 00 movw $0xff,0x200b48(%rip) # 601048 <v2b>
4004fe: ff 00
400500: 66 c7 05 3f 0b 20 00 movw $0xffff,0x200b3f(%rip) # 601048 <v2b>
400507: ff ff
400509: c7 05 31 0b 20 00 ff movl $0xff,0x200b31(%rip) # 601044 <v4b>
400510: 00 00 00
400513: c7 05 27 0b 20 00 ff movl $0xffff,0x200b27(%rip) # 601044 <v4b>
40051a: ff 00 00
40051d: c7 05 1d 0b 20 00 ff movl $0xffffffff,0x200b1d(%rip) # 601044 <v4b>
400524: ff ff ff
400527: 48 c7 05 06 0b 20 00 movq $0xff,0x200b06(%rip) # 601038 <v8b>
40052e: ff 00 00 00
400532: 48 c7 05 fb 0a 20 00 movq $0xffff,0x200afb(%rip) # 601038 <v8b>
400539: ff ff 00 00
40053d: c7 05 f1 0a 20 00 ff movl $0xffffffff,0x200af1(%rip) # 601038 <v8b>
400544: ff ff ff
400547: c7 05 eb 0a 20 00 00 movl $0x0,0x200aeb(%rip) # 60103c <v8b+0x4>
40054e: 00 00 00
400551: 48 c7 05 dc 0a 20 00 movq $0xffffffffffffffff,0x200adc(%rip) # 601038 <v8b>
400558: ff ff ff ff
40055c: b8 00 00 00 00 mov $0x0,%eax
400561: 5d pop %rbp
400562: c3 retq
400563: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40056a: 00 00 00
40056d: 0f 1f 00 nopl (%rax)
On my specific platform, four forms of the mov instruction are used: movb (7-byte encoding), movw (9 bytes), movl (10 bytes) and movq (12 bytes), depending on the variable type and the value being assigned.

What do the numeric values mean in disassembled C code?

I'm studying assembly and C code.
I have the following C program, compiled to generate an object file only.
#include <stdio.h>
int main()
{
int i = 10;
int j = 22 + i;
return 0;
}
I executed the following command:
objdump -S myprogram.o
Output of above command is:
objdump -S testelf.o
testelf.o: file format elf32-i386
Disassembly of section .text:
00000000 <main>:
#include <stdio.h>
int main()
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 10 sub $0x10,%esp
int i = 10;
6: c7 45 f8 0a 00 00 00 movl $0xa,-0x8(%ebp)
int j = 22 + i;
d: 8b 45 f8 mov -0x8(%ebp),%eax
10: 83 c0 16 add $0x16,%eax
13: 89 45 fc mov %eax,-0x4(%ebp)
return 0;
16: b8 00 00 00 00 mov $0x0,%eax
}
1b: c9 leave
1c: c3 ret
What is meant by the numbers before the mnemonics,
i.e. "83 ec 10" before the sub instruction, or
"c7 45 f8 0a 00 00 00" before the movl instruction?
I'm using the following platform to compile this code:
$ lscpu
Architecture: i686
CPU op-mode(s): 32-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Vendor ID: GenuineIntel
Those are x86 opcodes. A detailed reference, other than the ones listed in the comments above is available here.
For example the c7 45 f8 0a 00 00 00 before the movl $0xa,-0x8(%ebp) are hexadecimal values for the opcode bytes. They tell the CPU to move the immediate value of 10 decimal (as a 4-byte value) into the address located on the current stack 8-bytes above the stack frame base pointer. That is where the variable i from your C source code is located when your code is running. The top of the stack is at a lower memory address than the bottom of the stack, so moving a negative direction from the base is moving up the stack.
In the sequence c7 45 f8, the byte c7 is the opcode (mov r/m32, imm32), and 45 f8 is the ModR/M byte plus an 8-bit displacement, which together encode the destination -0x8(%ebp). See the reference for more detail.
The remainder of the codes are an immediate value. Since you are using a little endian system, the least significant byte of a number is listed first, such that 10 decimal which is 0x0a in hexadecimal and has a 4-byte value of 0x0000000a is stored as 0a 00 00 00.

Anti-debugging: gdb does not write 0xcc byte for breakpoints. Any idea why?

I am learning some anti-debugging techniques on Linux and found a snippet of code that checks for a 0xcc byte in memory to detect gdb breakpoints. Here is that code:
if ((*(volatile unsigned *)((unsigned)foo + 3) & 0xff) == 0xcc)
{
printf("BREAKPOINT\n");
exit(1);
}
foo();
But it does not work. I even tried to set a breakpoint on foo() function and observe the contents in memory, but did not see any 0xcc byte written for breakpoint. Here is what I did:
(gdb) b foo
Breakpoint 1 at 0x804846a: file p4.c, line 8.
(gdb) x/x 0x804846a
0x804846a <foo+6>: 0xe02404c7
(gdb) x/16x 0x8048460
0x8048460 <frame_dummy+32>: 0x90c3c9d0 0x83e58955 0x04c718ec 0x0485e024
0x8048470 <foo+12>: 0xfefae808 0xc3c9ffff .....
As you can see, there seems to be no 0xcc byte written on the entry point of foo() function. Does anyone know what's going on or where I might be wrong? Thanks.
The second part is easily explained (as Flortify correctly stated):
GDB shows the original memory contents, not the breakpoint bytes. In its default mode it actually even removes breakpoints while the debugger is suspended and re-inserts them before continuing. Users typically want to see their code, not the strange modified instructions used for breakpoints.
With your C code you missed the breakpoint by a few bytes. GDB sets the breakpoint after the function prologue, because the prologue is typically not what gdb users want to see. So if you put a break on foo, the actual breakpoint will typically be located a few bytes after that (how many depends on the prologue itself, which varies from function to function, as it may or may not have to save the stack pointer, frame pointer and so on). But it is easy to check. I used this code:
#include <stdio.h>
int main()
{
int i,j;
unsigned char *p = (unsigned char*)main;
for (j=0; j<4; j++) {
printf("%p: ",p);
for (i=0; i<16; i++)
printf("%.2x ", *p++);
printf("\n");
}
return 0;
}
If we run this program by itself it prints:
0x40057d: 55 48 89 e5 48 83 ec 10 48 c7 45 f8 7d 05 40 00
0x40058d: c7 45 f4 00 00 00 00 eb 5a 48 8b 45 f8 48 89 c6
0x40059d: bf 84 06 40 00 b8 00 00 00 00 e8 b4 fe ff ff c7
0x4005ad: 45 f0 00 00 00 00 eb 27 48 8b 45 f8 48 8d 50 01
Now we run it in gdb (output re-formatted for SO).
(gdb) break main
Breakpoint 1 at 0x400585: file ../bp.c, line 6.
(gdb) info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x0000000000400585 in main at ../bp.c:6
(gdb) disas/r main,+32
Dump of assembler code from 0x40057d to 0x40059d:
0x000000000040057d (main+0): 55 push %rbp
0x000000000040057e (main+1): 48 89 e5 mov %rsp,%rbp
0x0000000000400581 (main+4): 48 83 ec 10 sub $0x10,%rsp
0x0000000000400585 (main+8): 48 c7 45 f8 7d 05 40 00 movq $0x40057d,-0x8(%rbp)
0x000000000040058d (main+16): c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp)
0x0000000000400594 (main+23): eb 5a jmp 0x4005f0
0x0000000000400596 (main+25): 48 8b 45 f8 mov -0x8(%rbp),%rax
0x000000000040059a (main+29): 48 89 c6 mov %rax,%rsi
End of assembler dump.
With this we verified that the program is printing the correct bytes. But it also shows that the breakpoint has been inserted at 0x400585 (that is, after the function prologue), not at the first instruction of the function.
If we now run program under gdb (with run) and then "continue" after breakpoint is hit, we get this output:
(gdb) cont
Continuing.
0x40057d: 55 48 89 e5 48 83 ec 10 cc c7 45 f8 7d 05 40 00
0x40058d: c7 45 f4 00 00 00 00 eb 5a 48 8b 45 f8 48 89 c6
0x40059d: bf 84 06 40 00 b8 00 00 00 00 e8 b4 fe ff ff c7
0x4005ad: 45 f0 00 00 00 00 eb 27 48 8b 45 f8 48 8d 50 01
This now shows a 0xcc byte at main+8 (address 0x400585), exactly where the breakpoint was set.
If your hardware supports it, GDB may be using Hardware Breakpoints, which do not patch the code.
While I have not confirmed this via any official docs, this page indicates that
By default, gdb attempts to use hardware-assisted break-points.
Since you indicate expecting 0xCC bytes, I'm assuming you're running on x86 hardware, as the int3 opcode is 0xCC. x86 processors have a set of debug registers DR0-DR3, where you can program the address of data to cause a breakpoint exception. DR7 is a bitfield which controls the behavior of the breakpoints, and DR6 indicates the status.
The debug registers can only be read/written from Ring 0 (kernel mode). That means that the kernel manages these registers for you (via the ptrace API, I believe.)
However, for the sake of anti-debugging, all hope is not lost! On Windows, the GetThreadContext API allows you to get (a copy) of the CONTEXT for a (stopped) thread. This structure includes the contents of the DRx registers. This question is about how to implement the same on Linux.
This may also be a white lie that GDB is telling you... there may be a breakpoint there in RAM but GDB has noted what was there beforehand (so it can restore it later) and is showing you that, instead of the true contents of RAM.
Of course, it could also be using Hardware Breakpoints, which is a facility available on some processors. Setting h/w breakpoints is done by telling the processor the address it should watch out for (and trigger a breakpoint interrupt if it gets hit by the program counter while executing code).
