Where string data is stored? - c

I wrote a small c program:
#include <stdio.h>
int main()
{
char s[] = "Hello, world!";
printf("%s\n", s);
return 0;
}
which compiles to (on my linux machine):
.file "hello.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
movw $33, -20(%rbp)
leaq -32(%rbp), %rax
movq %rax, %rdi
call puts
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",#progbits
I don't understand the assembly code, but I can't see anywhere the string message. So how the executable know what to print?

It's here:
movl $1819043144, -32(%rbp) ; 1819043144 = 0x6C6C6548 = "lleH"
movl $1998597231, -28(%rbp) ; 1998597231 = 0x77202C6F = "w ,o"
movl $1684828783, -24(%rbp) ; 1684828783 = 0x646C726F = "dlro"
movw $33, -20(%rbp) ; 33 = 0x0021 = "\0!"
In this particular case the compiler is generating inline instructions to generate the literal string constant before calling printf. Of course in other situations it may not do this but may instead store a string constant in another section of memory. Bottom line: you can not make any assumptions about how or where the compiler will generate and store string literals.

The string is here:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
This copies a bunch of values to the stack. Those values happen to be your string.

string constants are stored in the binary of your application. Exactly where is up to your compiler.

Assembly has no "string" concept. Thus, the "string" is actually a chunk of memory. The string is stored somewhere in memory (up to the compiler) then you can manipulate this chunk of data using its memory address (pointer).
If your string is constant, compiler might want to use it as constants instead of storing it into memory, which is faster. This is your case, as pointed out by Paul R:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
You cannot make assumptions about how the compiler will treat your string.

In addition to the above, the compiler can see that your string literal cannot be referenced directly (i.e. there can't be any valid pointers to your string), which is why it can just copy it inline. If however you assign a character pointer instead, i.e.
char *s = "Hello, world!";
The compiler will initialise a string literal somewhere in memory, since you can of course now point to it. This modification produces on my machine:
.LC0:
.string "Hello, world!"
.text
.globl main
.type main, #function
One assumption can be made about string literals: if a pointer is initialised to a literal, it will point to a static char array held somewhere in memory. As a result the pointer is valid in any part of the program, e.g. you can return a pointer to a string literal initialised in a function, and it will still be valid.

Related

In x86, why do I have the same instruction two times, with reversed operands?

I am doing several experiments with x86 asm trying to see how common language constructs map into assembly. In my current experiment, I am trying to see specifically how C language pointers map to register-indirect addressing. I have written a fairly hello-world like pointer program:
#include <stdio.h>
int
main (void)
{
int value = 5;
int *int_val = &value;
printf ("The value we have is %d\n", *int_val);
return 0;
}
and compiled it to the following asm using: gcc -o pointer.s -fno-asynchronous-unwind-tables pointer.c:[1][2]
.file "pointer.c"
.section .rodata
.LC0:
.string "The value we have is %d\n"
.text
.globl main
.type main, #function
main:
;------- function prologue
pushq %rbp
movq %rsp, %rbp
;---------------------------------
subq $32, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
;----------------------------------
movl $5, -20(%rbp) ; This is where the value 5 is stored in `value` (automatic allocation)
;----------------------------------
leaq -20(%rbp), %rax ;; (GUESS) If I have understood correctly, this is where the address of `value` is
;; extracted, and stored into %rax
;----------------------------------
movq %rax, -16(%rbp) ;;
movq -16(%rbp), %rax ;; Why do I have two times the same instructions, with reversed operands???
;----------------------------------
movl (%rax), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
;----------------------------------
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.9.1-16ubuntu6) 4.9.1"
.section .note.GNU-stack,"",#progbits
My issue is that I don't understand why it contains the instruction movq two times, with reversed operands. Could someone explain it to me?
[1]: I want to avoid having my asm code interspersed with cfi directives when I don't need them at all.
[2]: My environment is Ubuntu 14.10, gcc 4.9.1 (modified by ubuntu), and Gnu assembler (GNU Binutils for Ubuntu) 2.24.90.20141014, configured to target x86_64-linux-gnu
Maybe it will be clearer if you reorganize your blocks:
;----------------------------------
leaq -20(%rbp), %rax ; &value
movq %rax, -16(%rbp) ; int_val
;----------------------------------
movq -16(%rbp), %rax ; int_val
movl (%rax), %eax ; *int_val
movl %eax, %esi ; printf-argument
movl $.LC0, %edi ; printf-argument (format-string)
movl $0, %eax ; no floating-point numbers
call printf
;----------------------------------
The first block performs int *int_val = &value;, the second block performs printf .... Without optimization, the blocks are independent.
Since you're not doing any optimization, gcc creates very simple-minded code that does each statement in the program one at a time without looking at any other statement. So in your example, it stores a value into the variable int_val, and then the very next instruction reads that variable again as part of the next statement. In both cases, it is using %rax as the temporary to hold value, as that's the first register generally used for things.

Incrementing a variable through embedded assembly language

I am trying to understand how to embed assembly language in C (using gcc on x86_64 architecture). I wrote this program to increment the value of a single variable. But I am getting garbage value as output. And ideas why?
#include <stdio.h>
int main(void) {
int x;
x = 4;
asm("incl %0": "=r"(x): "r0"(x));
printf("%d", x);
return 0;
}
Thanks
Update The program is giving expected result on gcc 4.8.3 but not on gcc 4.6.3. I am pasting the assembly output of the non-working code:
.file "abc.c"
.section .rodata
.LC0:
.string "%d"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $24, %rsp
movl $4, -20(%rbp)
movl -20(%rbp), %eax
incl %edx
movl %edx, %ebx
.cfi_offset 3, -24
movl %ebx, -20(%rbp)
movl $.LC0, %eax
movl -20(%rbp), %edx
movl %edx, %esi
movq %rax, %rdi
movl $0, %eax
call printf
movl $0, %eax
addq $24, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",#progbits
You don't need to say x twice; once is sufficient:
asm("incl %0": "+r"(x));
The +r says that the value will be input and output.
Your way, with separate inputs and output registers, requires that you take the input from %1, add one, and write the output to %0, but you can't do that with incl.
The reason it works on some compilers is because GCC is free to allocate both %0 and %1 to the same register, and appears to have done so in those cases, but it does not have to. Incidentally, if you want to prevent GCC allocating an input and output to the same register (say, if you want to initialize the output before using the input to calculate a final output), you need to use the & modifier.
The documentation for the modifiers is here.

Read only string in C [duplicate]

This question already has answers here:
C/C++: Optimization of pointers to string constants
(6 answers)
Closed 9 years ago.
While reading about the "read only" string and came across the below snippet.
#include<stdio.h>
main()
{
char *foo = "some string";
char *bar = "some string";
printf("%d %d\n",foo,bar);
}
What i understood is foo and bar both will print the same address, but I am not able to understand what actually happening in background. i.e. When the string is same it will return same address but when I modify the string addresses are different.
foo and bar both will print the same address
Actually, according to the standard, they are not required to have the same address, it's unspecified. But in practice, most compiler will make identical string literals holding the same address.
You can't modify a string literal, I think you mean you use different string literals, in that case, it's obvious that the string will hold different addresses.
C11 6.4.5 String literals
It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
Build the code with
gcc yourcode.c -S -o yourcode.S
.file "main.c"
.section .rodata
.LC0:
.string "some string"
.LC1:
.string "%d %d\n"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq $.LC0, -16(%rbp)
movq $.LC0, -8(%rbp)
movl $.LC1, %eax
movq -8(%rbp), %rdx
movq -16(%rbp), %rcx
movq %rcx, %rsi
movq %rax, %rdi
movl $0, %eax
call printf
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-10ubuntu1) 4.6.3 20120918 (prerelease)"
.section .note.GNU-stack,"",#progbits
[char * foo] and [char *bar] are pointed to the same address. In this case "some string" is not permitted to modify. That will cause a runtime exception.

Why doesn't the compiler allocate and deallocate local var with "sub*" and "add*" on the stack?

According to some textbooks, the compiler will use sub* to allocate memory for local variables.
For example, I write a Hello World program:
int main()
{
puts("hello world");
return 0;
}
I guess this will be compiled to some assembly code on the 64 bit OS:
subq $8, %rsp
movq $.LC0, (%rsp)
calq puts
addq $8, %rsp
The subq allocates 8 byte memory (size of a point) for the argument and the addq deallocates it.
But when I input gcc -S hello.c (I use the llvm-gcc on Mac OS X 10.8), I get some assembly code.
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $16, %rsp
Ltmp2:
xorb %al, %al
leaq L_.str(%rip), %rcx
movq %rcx, %rdi
callq _puts
movl $0, -8(%rbp)
movl -8(%rbp), %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
.......
L_.str:
.asciz "hello world!"
Around this callq without any addq and subq. Why? And what is the function of addq $16, %rsp?
Thanks for any input.
You don't have any local variables in your main(). All you may have in it is a pseudo-variable for the parameter passed to puts(), the address of the "hello world" string.
According to your last disassembly, the calling conventions appear to be such that the first parameter to puts() is passed in the rdi register and not on the stack, which is why there isn't any stack space allocated for this parameter.
However, since you're compiling your program with optimization disabled, you may encounter some unnecessary stack space allocations and reads and writes to and from that space.
This code illustrates it:
subq $16, %rsp ; allocate some space
...
movl $0, -8(%rbp) ; write to it
movl -8(%rbp), %eax ; read back from it
movl %eax, -4(%rbp) ; write to it
movl -4(%rbp), %eax ; read back from it
addq $16, %rsp
Those four mov instructions are equivalent to just one simple movl $0, %eax, no memory is needed to do that.
If you add an optimization switch like -O2 in your compile command, you'll see more meaningful code in the disassembly.
Also note that some space allocations may be needed solely for the purpose of keeping the stack pointer aligned, which improves performance or avoids issues with misaligned memory accesses (you could get the #AC exception on misaligned accesses if it's enabled).
The above code shows it too. See, those four mov instructions only use 8 bytes of memory, while the add and sub instructions grow and shrink the stack by 16.

Finding a string in a compiled executable

I have a very simple program as below:
#include
int main(){
char* mystring = "ABCDEFGHIJKLMNO";
puts(mystring);
char otherstring[15];
otherstring[0] = 'a';
otherstring[1] = 'b';
otherstring[2] = 'c';
otherstring[3] = 'd';
otherstring[4] = 'e';
otherstring[5] = 'f';
otherstring[6] = 'g';
otherstring[7] = 'h';
otherstring[8] = 'i';
otherstring[9] = 'j';
otherstring[10] = 'k';
otherstring[11] = 'l';
otherstring[12] = 'm';
otherstring[13] = 'n';
otherstring[14] = 'o';
puts(otherstring);
return 0;
}
Compiler was MS VC++.
Whether I build this program with or without optimisations I can find the string "ABCDEFGHIJKLMNO" in the executable using a hex editor.
However, I cannot find the string "abcdefghijklmno"
What is the compiler doing that is different for otherstring?
The hex editor I used was Hexedit - but tried others and still couldn't find otherstring. Anyone any ideas why not or how to find?
By the way I am not doing this for hacking reasons.
This is what my gcc did with this code. I assume your compiler does a similar thing. The string constant is stored in the read only section and mystring is initialized with it's address.
The individual chars are placed directly into their array location on the stack. Also note that otherstring is not NULL terminated when you're calling puts with it.
.file "test.c"
.section .rodata
.LC0:
.string "ABCDEFGHIJKLMNO"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $48, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
/* here is where mystring is loaded with the address of "ABCDEFGHIJKLMNO" */
movq $.LC0, -40(%rbp)
/* this is the call to puts */
movq -40(%rbp), %rax
movq %rax, %rdi
call puts
/* here is where the bytes are loaded into otherstring on the stack */
movb $97, -32(%rbp) //'a'
movb $98, -31(%rbp) //'b'
movb $99, -30(%rbp) //'c'
movb $100, -29(%rbp) //'d'
movb $101, -28(%rbp) //'e'
movb $102, -27(%rbp) //'f'
movb $103, -26(%rbp) //'g'
movb $104, -25(%rbp) //'h'
movb $105, -24(%rbp) //'i'
movb $106, -23(%rbp) //'j'
movb $107, -22(%rbp) //'k'
movb $108, -21(%rbp) //'l'
movb $109, -20(%rbp) //'m'
movb $110, -19(%rbp) //'n'
movb $111, -18(%rbp) //'o'
The compiler is likely placing the number for each character into each array position, just as you wrote it, without any optimization that would be found from reading the code. Remember that a single character is no different than a number in c, so you could even use the ascii codes instead of 'a'. From a hexeditor I would expect you would see those converted back to letters, just spaced out a bit.
In the first case the compiler initializes data with the exact string "ABC...".
In the second case, each assignment is done sequentially, therefore compiler generates code to perform this assignment. In the executable you should see 15 repeating byte sequences where only the initializer ('a', 'b', 'c'...) changes.

Resources