Read only string in C [duplicate] - c

This question already has answers here:
C/C++: Optimization of pointers to string constants
(6 answers)
Closed 9 years ago.
While reading about the "read only" string and came across the below snippet.
#include<stdio.h>
main()
{
char *foo = "some string";
char *bar = "some string";
printf("%d %d\n",foo,bar);
}
What i understood is foo and bar both will print the same address, but I am not able to understand what actually happening in background. i.e. When the string is same it will return same address but when I modify the string addresses are different.

foo and bar both will print the same address
Actually, according to the standard, they are not required to have the same address, it's unspecified. But in practice, most compiler will make identical string literals holding the same address.
You can't modify a string literal, I think you mean you use different string literals, in that case, it's obvious that the string will hold different addresses.
C11 6.4.5 String literals
It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.

Build the code with
gcc yourcode.c -S -o yourcode.S
.file "main.c"
.section .rodata
.LC0:
.string "some string"
.LC1:
.string "%d %d\n"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq $.LC0, -16(%rbp)
movq $.LC0, -8(%rbp)
movl $.LC1, %eax
movq -8(%rbp), %rdx
movq -16(%rbp), %rcx
movq %rcx, %rsi
movq %rax, %rdi
movl $0, %eax
call printf
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-10ubuntu1) 4.6.3 20120918 (prerelease)"
.section .note.GNU-stack,"",#progbits
[char * foo] and [char *bar] are pointed to the same address. In this case "some string" is not permitted to modify. That will cause a runtime exception.

Related

How does x=x+1 is evaluated by the compiler and how is represented in assembly?

I'm trying to understand how does the compiler "sees" the i+1 part from expression i=i+1. I understand that i=3 means putting the value 3 in the location memory of variable i.
My guess about the i=i+1 is that the compiler expects a value from the right side of the "=" operator, so it gets the value from the location memory of variable i (which is 3, after the assignment) and add 1 to it, and the final result of the "i+1" expression(3+1=4) is stored back into the location memory of variable i, as a value. Is that correct?
And if it is, it means that any variable/combination of variables and literals present on the right side of an "=" operator will always be replaced with the value stored in them and those value can be added/substracted/etc with the values from other variables/literals (as in the x+1 expression), whilst the final result of those calculations will also be literal values (ex: 5, literal strings, etc), and will also be stored like values in a single variable on the left side of the "=" operator.
I'm also curious how this code is seen in assembly, and what are the main operations of this incrementation of i ( i = i+1);
#include <stdio.h>
int main()
{
int i = 3;
i = i + 1; // i should have the value of 4 stored back in it;
return 0;
}
This is not answerable for the general case. It depends on the target platform. If you want to inspect the assembly, you can do so with the -S parameter with gcc. When I did that to your code, it gave me this:
/tmp$ cat main.s
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $3, -4(%rbp)
addl $1, -4(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Debian 9.2.1-8) 9.2.1 20190909"
.section .note.GNU-stack,"",#progbits
A brief little explanation of what is happening here. First we push the value of the stackpointer. This is so that we can jump back later.
.cfi_startproc
pushq %rbp
Then we set up the stack frame with this code. It corresponds to declaring variables.
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
Then we have this. Comments are mine.
movl $3, -4(%rbp) # i = 3;
addl $1, -4(%rbp) # i = i + 1;
Lastly, we return from the main function
movl $0, %eax # Store 0 in the "return register"
popq %rbp # Restore stackpointer
.cfi_def_cfa 7, 8
ret # return
Note that there is not a 1-1 relationship between lines. Not even for very simple lines.
Please also note that C imposes requirement on the observable behavior of the program and not on the generated assembly. So for instance, a compiler might remove the whole body for the main function because the variable i is not used in an observable way. And it will if you use optimization. When I recompiled your code with -O3 I got this instead:
/tmp/$ cat main.s
.file "main.c"
.text
.section .text.startup,"ax",#progbits
.p2align 4
.globl main
.type main, #function
main:
.LFB11:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE11:
.size main, .-main
.ident "GCC: (Debian 9.2.1-8) 9.2.1 20190909"
.section .note.GNU-stack,"",#progbits
Notice how much that got removed from main. It can be interesting that movl $0, %eax has changed to xorl %eax, %eax. If you think about it, it's pretty obvious that this is a "set zero" operation. One could reasonably argue why anyone would write stuff like that. Well, the optimizer does certainly not optimize for readability. There are a few reasons why it is better. You can read about them here: What is the best way to set a register to zero in x86 assembly: xor, mov or and?

Why does a simple C "hello world" program not work if gcc -O3 without "volatile"?

I have the following C program
int main() {
char string[] = "Hello, world.\r\n";
__asm__ volatile ("syscall;" :: "a" (1), "D" (0), "S" ((unsigned long) string), "d" (sizeof(string) - 1)); }
which I want to run under Linux with with x86 64 bit. I call the syscall for "write" with 0 as fd argument because this is stdout.
If I compile under gcc with -O3, it does not work. A look into the assembly code
.file "test_for_o3.c"
.text
.section .text.startup,"ax",#progbits
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
subq $40, %rsp
.cfi_def_cfa_offset 48
xorl %edi, %edi
movl $15, %edx
movq %fs:40, %rax
movq %rax, 24(%rsp)
xorl %eax, %eax
movq %rsp, %rsi
movl $1, %eax
#APP
# 5 "test_for_o3.c" 1
syscall;
# 0 "" 2
#NO_APP
movq 24(%rsp), %rcx
xorq %fs:40, %rcx
jne .L5
xorl %eax, %eax
addq $40, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore_state
call __stack_chk_fail#PLT
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0"
.section .note.GNU-stack,"",#progbits
tells us that gcc has simply not put the string data into the assembly code. Instead, if I declare "string" as "volatile", it works fine.
However, the idea of "volatile" is just to use it for variables that can change their values by (from the view of the executing function) unexpected events, isn't it? "volatile" can make code much slower, hence it should be avoided if possible.
As I would suppose, gcc must assume that the content of "string" must not be ignored because the pointer "string" is used as an input parameter in the inline assembly (and gcc has no idea what the inline assembly code will do with it).
If this is "allowed" behaviour of gcc, where can I read more about all the formal constraints I have to be aware of when writing code for -O3?
A second question would be what the "volatile" statement along with the inline assembly directive does exactly. I just got used to mark all inline assembly directives with "volatile" because it had not worked otherwise, in some situations.

Understanding an unexpected result due to an unmatched prototype (C89)

I have a program goo.c
void foo(double);
#include <stdio.h>
void foo(int x){
printf ("in foo.c:: x= %d\n",x);
}
which is called by foo.c
int main(){
double x=3.0;
foo(x);
}
I compile and run
gcc foo.c goo.c
./a.out
Guess what? I get "x= 1" as result. Then I find the signature of 'foo' should have been void foo(int). Apparently, my double input value 3.0 has to be downcast to an int. But, if I try to see the value of (int) 3.0 with the test program:
int main(){
double x=3.0;
printf ("%d", ((int) x));
}
I get 3 as output, which makes the earlier ` x= 1' even more hard to understand. Any idea? For information, my gcc is run with ANSI C standard. Thanks.
[EDIT] If I use gcc -S as suggested by JS1,
I get goo.s
.file "goo.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movabsq $4613937818241073152, %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, -24(%rbp)
movsd -24(%rbp), %xmm0
call foo
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",#progbits
and foo.s
.file "foo.c"
.section .rodata
.LC0:
.string "in foo.c:: x= %d\n"
.text
.globl foo
.type foo, #function
foo:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",#progbits
Anyone who knows how to read Assembly can help figure out the source problem?
Understanding why you get '1' requires a bit of ASM and x86-64 ABI
knowledge. First of all, goo.c and foo.c are two separate compilation
units. The only thing that foo.c knows about the foo function is
the bogus prototype.
The bogus prototype is as follows: void foo(double);. It's a function
that takes only a single double argument. The x86-64 ABI mandates that
the doubles are passed through the xmm registers (The exact phrasing
is 'If the class is SSE, the next available vector register is used,
the registers are taken in the order from %xmm0 to %xmm7.'.
That means that when the compiler sets up the arguments to call the
foo() function, it's going to pass the argument via %xmm0. In
simplified asm what happens is:
mov 3.0, %xmm0
call foo
Now, foo(), on it's side, believes it's going to recieve an int. The
x86-64 ABI says: 'If the class is INTEGER, the next available register
of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.'. The first
argument is supposed to be passed via %rdi. That means that foo()
will do something like:
mov %rdi, %rsi
mov 0xabcd, %rdi // 0xabcd being the address of the "%d" string
call printf
So you're going to end up printing whatever was in %rsi, and not %xmm0.
But why 1? You'll get an idea by issuing the following commands:
./a.out a
./a.out a b
./a.out a b c
See a pattern? Let's go back to the simplified assembly:
main:
mov 3.0, %xmm0
call foo
ret
foo:
mov %rdi, %rsi
mov 0xabcd, %rdi // 0xabcd being the address of the "%d" string
call printf
ret
As you can see, nothing is setting %rdi until it reaches foo(),
where it's passed on to printf. Which means 1 was passed to main
in the first place. Now, in the question, main is given the following
prototype: int main(). But the compiler actually setup the function to
have the following prototype instead: int main (int argc, char *argv[],
char *envp[]). The first argument, thus stored in %rdi, is actually
argc. That's why the program was printing 1.

Incrementing a variable through embedded assembly language

I am trying to understand how to embed assembly language in C (using gcc on x86_64 architecture). I wrote this program to increment the value of a single variable. But I am getting garbage value as output. And ideas why?
#include <stdio.h>
int main(void) {
int x;
x = 4;
asm("incl %0": "=r"(x): "r0"(x));
printf("%d", x);
return 0;
}
Thanks
Update The program is giving expected result on gcc 4.8.3 but not on gcc 4.6.3. I am pasting the assembly output of the non-working code:
.file "abc.c"
.section .rodata
.LC0:
.string "%d"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $24, %rsp
movl $4, -20(%rbp)
movl -20(%rbp), %eax
incl %edx
movl %edx, %ebx
.cfi_offset 3, -24
movl %ebx, -20(%rbp)
movl $.LC0, %eax
movl -20(%rbp), %edx
movl %edx, %esi
movq %rax, %rdi
movl $0, %eax
call printf
movl $0, %eax
addq $24, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",#progbits
You don't need to say x twice; once is sufficient:
asm("incl %0": "+r"(x));
The +r says that the value will be input and output.
Your way, with separate inputs and output registers, requires that you take the input from %1, add one, and write the output to %0, but you can't do that with incl.
The reason it works on some compilers is because GCC is free to allocate both %0 and %1 to the same register, and appears to have done so in those cases, but it does not have to. Incidentally, if you want to prevent GCC allocating an input and output to the same register (say, if you want to initialize the output before using the input to calculate a final output), you need to use the & modifier.
The documentation for the modifiers is here.

Where string data is stored?

I wrote a small c program:
#include <stdio.h>
int main()
{
char s[] = "Hello, world!";
printf("%s\n", s);
return 0;
}
which compiles to (on my linux machine):
.file "hello.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
movw $33, -20(%rbp)
leaq -32(%rbp), %rax
movq %rax, %rdi
call puts
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",#progbits
I don't understand the assembly code, but I can't see anywhere the string message. So how the executable know what to print?
It's here:
movl $1819043144, -32(%rbp) ; 1819043144 = 0x6C6C6548 = "lleH"
movl $1998597231, -28(%rbp) ; 1998597231 = 0x77202C6F = "w ,o"
movl $1684828783, -24(%rbp) ; 1684828783 = 0x646C726F = "dlro"
movw $33, -20(%rbp) ; 33 = 0x0021 = "\0!"
In this particular case the compiler is generating inline instructions to generate the literal string constant before calling printf. Of course in other situations it may not do this but may instead store a string constant in another section of memory. Bottom line: you can not make any assumptions about how or where the compiler will generate and store string literals.
The string is here:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
This copies a bunch of values to the stack. Those values happen to be your string.
string constants are stored in the binary of your application. Exactly where is up to your compiler.
Assembly has no "string" concept. Thus, the "string" is actually a chunk of memory. The string is stored somewhere in memory (up to the compiler) then you can manipulate this chunk of data using its memory address (pointer).
If your string is constant, compiler might want to use it as constants instead of storing it into memory, which is faster. This is your case, as pointed out by Paul R:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
You cannot make assumptions about how the compiler will treat your string.
In addition to the above, the compiler can see that your string literal cannot be referenced directly (i.e. there can't be any valid pointers to your string), which is why it can just copy it inline. If however you assign a character pointer instead, i.e.
char *s = "Hello, world!";
The compiler will initialise a string literal somewhere in memory, since you can of course now point to it. This modification produces on my machine:
.LC0:
.string "Hello, world!"
.text
.globl main
.type main, #function
One assumption can be made about string literals: if a pointer is initialised to a literal, it will point to a static char array held somewhere in memory. As a result the pointer is valid in any part of the program, e.g. you can return a pointer to a string literal initialised in a function, and it will still be valid.

Resources