In which data segment is the C string stored? - c

I'm wondering what's the difference between char s[] = "hello" and char *s = "hello".
After reading this and this, I'm still not very clear on this question.
As I know, there are five data segments in memory, Text, BSS, Data, Stack and Heap.
From my understanding,
in case of char s[] = "hello":
"hello" is in Text.
s is in Data if it is a global variable or in Stack if it is a local variable.
We also have a copy of "hello" where the s is stored, so we can modify the value of this string via s.
in case of char *s = "hello":
"hello" is in Text.
s is in Data if it is a global variable or in Stack if it is a local variable.
s just points to "hello" in Text and we don't have a copy of it, therefore modifying the value of string via this pointer should cause "Segmentation Fault".
Am I right?

You are right that "hello" for the first case is mutable and for the second case is immutable string. And they are kept in read-only memory before initialization.
In the first case the mutable memory is initialized/copied from immutable string. In the second case the pointer refers to immutable string.
For first case wikipedia says,
The values for these variables are initially stored within the
read-only memory (typically within .text) and are copied into the
.data segment during the start-up routine of the program.
Let us examine segment.c file.
char*s = "hello"; // string
char sar[] = "hello"; // string array
char content[32];
int main(int argc, char*argv[]) {
char psar[] = "parhello"; // local/private string array
char*ps = "phello"; // private string
content[0] = 1;
sar[3] = 1; // OK
// sar++; // not allowed
// s[2] = 1; // segmentation fault
s = sar;
s[2] = 1; // OK
psar[3] = 1; // OK
// ps[2] = 1; // segmentation fault
ps = psar;
ps[2] = 1; // OK
return 0;
}
Here is the assembly generated for segment.c file. Note that both s and sar is in global aka .data segment. It seems sar is const pointer to a mutable initialized memory or not pointer at all(practically it is an array). And eventually it has an implication that sizeof(sar) = 6 is different to sizeof(s) = 8. There are "hello" and "phello" in readonly(.rodata) section and effectively immutable.
.file "segment.c"
.globl s
.section .rodata
.LC0:
.string "hello"
.data
.align 8
.type s, #object
.size s, 8
s:
.quad .LC0
.globl sar
.type sar, #object
.size sar, 6
sar:
.string "hello"
.comm content,32,32
.section .rodata
.LC1:
.string "phello"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $64, %rsp
movl %edi, -52(%rbp)
movq %rsi, -64(%rbp)
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1752326512, -32(%rbp)
movl $1869376613, -28(%rbp)
movb $0, -24(%rbp)
movq $.LC1, -40(%rbp)
movb $1, content(%rip)
movb $1, sar+3(%rip)
movq $sar, s(%rip)
movq s(%rip), %rax
addq $2, %rax
movb $1, (%rax)
movb $1, -29(%rbp)
leaq -32(%rbp), %rax
movq %rax, -40(%rbp)
movq -40(%rbp), %rax
addq $2, %rax
movb $1, (%rax)
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L2
call __stack_chk_fail
.L2:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",#progbits
Again for local variable in main, the compiler does not bother to create a name. And it may keep it in register or in stack memory.
Note that local variable value "parhello" is optimized into 1752326512 and 1869376613 numbers. I discovered it by changing the value of "parhello" to "parhellp". The diff of the assembly output is as follows,
39c39
< movl $1886153829, -28(%rbp)
---
> movl $1869376613, -28(%rbp)
So there is no separate immutable store for psar . It is turned into integers in the code segment.

answer to your first question:
char s[] = "hello";
s is an array of type char. An array is a const pointer, meaning that you cannot change the s using pointer arithmetic (i.e. s++). The data aren't const, though, so you can change it.
See this example C code:
#include <stdio.h>
void reverse(char *p){
char c;
char* q = p;
while (*q) q++;
q--; // point to the end
while (p < q) {
c = *p;
*p++ = *q;
*q-- = c;
}
}
int main(){
char s[] = "DCBA";
reverse( s);
printf("%s\n", s); // ABCD
}
which reverses the text "DCBA" and produces "ABCD".
char *p = "hello"
p is a pointer to a char. You can do pointer arithmetic -- p++ will compile -- and puts data in read-only parts of the memory (const data).
and using p[0]='a'; will result to runtime error:
#include <stdio.h>
int main(){
char* s = "DCBA";
s[0]='D'; // compile ok but runtime error
printf("%s\n", s); // ABCD
}
this compiles, but not runs.
const char* const s = "DCBA";
With a const char* const, you can change neither s nor the data content which point to (i.e. "DCBE"). so data and pointer are const:
#include <stdio.h>
int main(){
const char* const s = "DCBA";
s[0]='D'; // compile error
printf("%s\n", s); // ABCD
}
The Text segment is normally the segment where your code is stored and is const; i.e. unchangeable. In embedded systems, this is the ROM, PROM, or flash memory; in a desktop computer, it can be in RAM.
The Stack is RAM memory used for local variables in functions.
The Heap is RAM memory used for global variables and heap-initialized data.
BSS contains all global variables and static variables that are initialized to zero or not initialized vars.
For more information, see the relevant Wikipedia and this relevant Stack Overflow question
With regards to s itself: The compiler decides where to put it (in stack space or CPU registers).
For more information about memory protection and access violations or segmentation faults, see the relevant Wikipedia page
This is a very broad topic, and ultimately the exact answers depend on your hardware and compiler.

Related

Please explain the output , where the C character pointer was assigned to new memory without Malloc call

#include <stdio.h>
int
main ()
{
char *a = "Hello";
a = "Hello_World";
printf ("%s", a);
return 0;
}
Now this program returned corrected and printed “Hello_World”.
But I remember reading that for changing a once initialised string pointer , I must use malloc to allocate memory and then input the new value of the string .
Please explain? Especially where is the memory allocated for the new changed value of the string , and what about the old memory.
Use gcc -S to see generated assembler.
You well see something like this
.LC0:
.string "Hello"
.LC1:
.string "Hello_World"
It was allocated in .data section as constants.
Then it wil be used like this
movq $.LC0, -8(%rbp)
movq $.LC1, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rsi
movl $.LC2, %edi

What's different between pointer with array in c? [duplicate]

This question already has answers here:
What is the difference between char s[] and char *s?
(14 answers)
Closed 5 years ago.
I try to google this topic, but no one can explain clear. I try the below code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char * argv[]){
char * p1 = "dddddd";
const char * p2 = "dddddd";
char p3[] = "dddddd";
char * p4 =(char*)malloc(sizeof("dddddd")+1);
strcpy(p4, "dddddd");
//*(p1+2) = 'b'; // test_1
//Output >> Bus error: 10
// *(p2+2) = 'b'; // test_2
// Output >> char_point.c:11:13: error: read-only variable is not assignable
*(p3+2) = 'b'; // test_3
// Output >>
//d
//dddddd
//dddddd
//ddbddd
*(p4+2) = 'k'; // test_4
// Output >>
//d
//dddddd
//dddddd
//ddbddd
//ddkddd
printf("%c\n", *(p1+2));
printf("%s\n", p1);
printf("%s\n", p2);
printf("%s\n", p3);
printf("%s\n", p4);
return 0;
}
I have try 3 tests, but only the test_3 and test_4 can pass. I know const char *p2 is read only, because it's a constant value! but i don't know why p1 can't be modified! which section of memory it's layout? BTW, I compile it on my Mac with GCC.
I try to compile it to dis-asm it by gcc -S, I got this.
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 13
.globl _main
.p2align 4, 0x90
_main: ## #main
.cfi_startproc
## BB#0:
pushq %rbp
Lcfi0:
.cfi_def_cfa_offset 16
Lcfi1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Lcfi2:
.cfi_def_cfa_register %rbp
subq $48, %rsp
movl $8, %eax
movl %eax, %ecx
leaq L_.str(%rip), %rdx
movl $0, -4(%rbp)
movl %edi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
movq %rdx, -32(%rbp)
movl L_main.p3(%rip), %eax
movl %eax, -39(%rbp)
movw L_main.p3+4(%rip), %r8w
movw %r8w, -35(%rbp)
movb L_main.p3+6(%rip), %r9b
movb %r9b, -33(%rbp)
movq %rcx, %rdi
callq _malloc
xorl %r10d, %r10d
movq %rax, -48(%rbp)
movl %r10d, %eax
addq $48, %rsp
popq %rbp
retq
.cfi_endproc
.section __TEXT,__cstring,cstring_literals
L_.str: ## #.str
.asciz "dddddd"
L_main.p3: ## #main.p3
.asciz "dddddd"
.subsections_via_symbols
I want to know every pointer what i declaration, which section is it?
"Why p1 can't be modified?"
Roughly speaking, p1 points to a string literal, and attempts to modify string literals cause undefined behavior in C.
More specifically, according to the §6.4.5 6 of the C11 Standard, string literals are:
used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char....
Concerning objects with static storage duration, §5.1.2 1 states that
All objects with static storage duration shall be initialized (set to their initial values) before program startup. The manner and timing of such initialization are otherwise unspecified.
"Which section of memory it's layout?"
But, the Standard does not specify any specific memory layouts that an implementation must follow.
What the Standard does say about the arrays of char which are created from string literals is that (§6.4.5 7):
It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
So
char * p1 = "dddddd";
this should be
const char * p1 = "dddddd";
String literals (the ones in quotes) reside in read-only memory. Even if you
don't use the const keyword in the declaration of the variable, p1 still
points to read-only memory. So
*(p1+2) = 'b'; // test_1
is going to fail.
Here
*(p2+2) = 'b'; // test_2
// Output >> char_point.c:11:13: error: read-only variable is not assignable
the compiler tells you, you cannot do that because you declared p2 as const.
The difference between the first test and this one, is that the code tries to
modify a character and fails.
Now this:
char * p4 =(char*)malloc(sizeof("dddddd")+1);
First, do not cast malloc & friends. Second: the sizeof-operator returns the
number of bytes needed to store the expression in memory. "ddddd" is a string
literal, it returns a pointer to char, so sizeof("dddddd") returns the number
of bytes that a pointer to char needs to be stored in memory.
The correct function would be strlen:
char * p4 = malloc(strlen("dddddd")+1);
Note that in this case
char txt[] = "Hello world";
printf("%lu\n", sizeof(txt));
will print 12 and not 11. C strings are '\0'-terminated, that means that txt
holds all these characters plus the '\0'-terminating byte. In this case
sizeof doesn't return the number of bytes for a pointer, because txt is an
array.
void foo(char *txt)
{
printf("%lu\n", sizeof(txt));
}
void bar(void)
{
char txt[] = "Hello world";
foo(txt);
}
Here you won't get 12 like before, most probably 8 (today's common size for a
pointer). Even though txt in bar is an array, the txt in foo is a
pointer.
Arrays are constant pointer, which means that an array points to a memory address and you cant change were it points. But you can change the elements in it.
While you can change where the pointer points, but it's elements are constant.
for example consider this code
int main(){
int a[] = {1,2,3};
int * ptr = {1,2,3};
//a[0] == *(a+0)
//a[1] == *(a+1)
a += 1; // this is wrong, because we cant change were array points
ptr += 1; // this is correct, now the pointer ptr will points to the next element which is 2
a[0] += 2 // this is correct, now a[0] will become 3
*ptr += 2 // this is wrong, because we cant change the elements of the pointer.
return 0;
}

Memory Allocation of Static String Literals

Consider the following struct:
struct example_t {
char * a;
char * b;
};
struct example_t test {
"Chocolate",
"Cookies"
};
I am aware of the implementation specific nature of the allocation of memory for the char*'s, but what of the string literals?
In this case, are there any guarantee from the C-standard with regards to the adjacent placement of "Chocolate" and "Cookies"?
In most implementations I tested the two literals are not padded, and are directly adjacent.
This allows the struct to be copied quickly with a memcpy, although I suspect this behavior is undefined. Does anyone have any information on this topic?
In your example, there are no absolute guarantees of the adjacency/placement of the two string literals with respect to each other. GCC in this case happens to demonstrate such behavior, but it has no obligation to exhibit this behavior.
In this example, we see no padding, and we can even use undefined behavior to demonstrate adjacency of string literals. This works with GCC, but using alternate libc's or different compilers, you could get other behavior, such as detecting duplicate string literals across translation units and reducing redundancy to save memory in the final application.
Also, while the pointers you declared are of type char *, the literals actually should be const char*, since they will be stored in RODATA, and writing to that memory will cause a segfault.
Code Listing
#include <stdio.h>
#include <string.h>
struct example_t {
char * a;
char * b;
char * c;
};
int main(void) {
struct example_t test = {
"Chocolate",
"Cookies",
"And milk"
};
size_t len = strlen(test.a) + strlen(test.b) + strlen(test.c) + ((3-1) * sizeof(char));
char* t= test.a;
int i;
for (i = 0; i< len; i++) {
printf("%c", t[i]);
}
return 0;
}
Sample output
./a.out
ChocolateCookiesAnd milk
Output of gcc -S
.file "test.c"
.section .rodata
.LC0:
.string "Chocolate"
.LC1:
.string "Cookies"
.LC2:
.string "And milk"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $72, %rsp
.cfi_offset 3, -24
movq $.LC0, -48(%rbp)
movq $.LC1, -40(%rbp)
movq $.LC2, -32(%rbp)
movq -48(%rbp), %rax
movq %rax, %rdi
call strlen
movq %rax, %rbx
movq -40(%rbp), %rax
movq %rax, %rdi
call strlen
addq %rax, %rbx
movq -32(%rbp), %rax
movq %rax, %rdi
call strlen
addq %rbx, %rax
addq $2, %rax
movq %rax, -64(%rbp)
movq -48(%rbp), %rax
movq %rax, -56(%rbp)
movl $0, -68(%rbp)
jmp .L2
.L3:
movl -68(%rbp), %eax
movslq %eax, %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movzbl (%rax), %eax
movsbl %al, %eax
movl %eax, %edi
call putchar
addl $1, -68(%rbp)
.L2:
movl -68(%rbp), %eax
cltq
cmpq -64(%rbp), %rax
jb .L3
movl $0, %eax
addq $72, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
.section .note.GNU-stack,"",#progbits
No, there is no guarantee for adjacent placement.
One occasion where actual compilers will place them far apart is if the same string literal appears in different places (as read-only objects) and the string combining optimization is enabled.
Example:
char *foo = "foo";
char *baz = "baz";
struct example_t bar = {
"foo",
"bar"
}
may well end up in memory as "foo" followed by "baz" followed by "bar".
Here is an example demonstrating a real-world scenario where the strings are not adjacent. GCC decides to reuse the string "Chocolate" from earlier.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
const char *a = "Chocolate";
const char *b = "Spinach";
struct test_t {
const char *a;
const char *b;
};
struct test_t test = {"Chocolate", "Cookies"};
int main(void)
{
printf("%p %p\n", (const void *) a, (const void *) b);
printf("%p %p\n", (const void *) test.a, (const void *) test.b);
return EXIT_SUCCESS;
}
Output:
0x400614 0x40061e
0x400614 0x400626
I'll try to show you an example of gcc behaviour where, even in that case you don't get strings aligned in memory:
#include <stdio.h>
#include <stdlib.h>
char *s = "Cookies";
struct test {
char *a, *b, *c, *d;
};
struct test t = {
"Chocolate",
"Cookies",
"Milk",
"Cookies",
};
#define D(x) __FILE__":%d:%s: " x, __LINE__, __func__
#define P(x) do{\
printf(D(#x " = [%#p] \"%s\"\n"), x, x); \
} while(0)
int main()
{
P(t.a);
P(t.b);
P(t.c);
P(t.d);
return 0;
}
In this case, as the compiler tries to reuse already seen string literals, the ones you use to assign to the structure fields don't get aligned.
This is the output of the program:
$ pru3
pru3.c:25:main: t.a = [0x8518] "Chocolate"
pru3.c:26:main: t.b = [0x8510] "Cookies"
pru3.c:27:main: t.c = [0x8524] "Milk"
pru3.c:28:main: t.d = [0x8510] "Cookies"
As you see, the pointers are even repeated for the "Cookies" value.
The compiling here was made with default values, with:
gcc -o pru3 pru3.c

Where string data is stored?

I wrote a small c program:
#include <stdio.h>
int main()
{
char s[] = "Hello, world!";
printf("%s\n", s);
return 0;
}
which compiles to (on my linux machine):
.file "hello.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
movw $33, -20(%rbp)
leaq -32(%rbp), %rax
movq %rax, %rdi
call puts
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",#progbits
I don't understand the assembly code, but I can't see anywhere the string message. So how the executable know what to print?
It's here:
movl $1819043144, -32(%rbp) ; 1819043144 = 0x6C6C6548 = "lleH"
movl $1998597231, -28(%rbp) ; 1998597231 = 0x77202C6F = "w ,o"
movl $1684828783, -24(%rbp) ; 1684828783 = 0x646C726F = "dlro"
movw $33, -20(%rbp) ; 33 = 0x0021 = "\0!"
In this particular case the compiler is generating inline instructions to generate the literal string constant before calling printf. Of course in other situations it may not do this but may instead store a string constant in another section of memory. Bottom line: you can not make any assumptions about how or where the compiler will generate and store string literals.
The string is here:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
This copies a bunch of values to the stack. Those values happen to be your string.
string constants are stored in the binary of your application. Exactly where is up to your compiler.
Assembly has no "string" concept. Thus, the "string" is actually a chunk of memory. The string is stored somewhere in memory (up to the compiler) then you can manipulate this chunk of data using its memory address (pointer).
If your string is constant, compiler might want to use it as constants instead of storing it into memory, which is faster. This is your case, as pointed out by Paul R:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
You cannot make assumptions about how the compiler will treat your string.
In addition to the above, the compiler can see that your string literal cannot be referenced directly (i.e. there can't be any valid pointers to your string), which is why it can just copy it inline. If however you assign a character pointer instead, i.e.
char *s = "Hello, world!";
The compiler will initialise a string literal somewhere in memory, since you can of course now point to it. This modification produces on my machine:
.LC0:
.string "Hello, world!"
.text
.globl main
.type main, #function
One assumption can be made about string literals: if a pointer is initialised to a literal, it will point to a static char array held somewhere in memory. As a result the pointer is valid in any part of the program, e.g. you can return a pointer to a string literal initialised in a function, and it will still be valid.

Finding a string in a compiled executable

I have a very simple program as below:
#include
int main(){
char* mystring = "ABCDEFGHIJKLMNO";
puts(mystring);
char otherstring[15];
otherstring[0] = 'a';
otherstring[1] = 'b';
otherstring[2] = 'c';
otherstring[3] = 'd';
otherstring[4] = 'e';
otherstring[5] = 'f';
otherstring[6] = 'g';
otherstring[7] = 'h';
otherstring[8] = 'i';
otherstring[9] = 'j';
otherstring[10] = 'k';
otherstring[11] = 'l';
otherstring[12] = 'm';
otherstring[13] = 'n';
otherstring[14] = 'o';
puts(otherstring);
return 0;
}
Compiler was MS VC++.
Whether I build this program with or without optimisations I can find the string "ABCDEFGHIJKLMNO" in the executable using a hex editor.
However, I cannot find the string "abcdefghijklmno"
What is the compiler doing that is different for otherstring?
The hex editor I used was Hexedit - but tried others and still couldn't find otherstring. Anyone any ideas why not or how to find?
By the way I am not doing this for hacking reasons.
This is what my gcc did with this code. I assume your compiler does a similar thing. The string constant is stored in the read only section and mystring is initialized with it's address.
The individual chars are placed directly into their array location on the stack. Also note that otherstring is not NULL terminated when you're calling puts with it.
.file "test.c"
.section .rodata
.LC0:
.string "ABCDEFGHIJKLMNO"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $48, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
/* here is where mystring is loaded with the address of "ABCDEFGHIJKLMNO" */
movq $.LC0, -40(%rbp)
/* this is the call to puts */
movq -40(%rbp), %rax
movq %rax, %rdi
call puts
/* here is where the bytes are loaded into otherstring on the stack */
movb $97, -32(%rbp) //'a'
movb $98, -31(%rbp) //'b'
movb $99, -30(%rbp) //'c'
movb $100, -29(%rbp) //'d'
movb $101, -28(%rbp) //'e'
movb $102, -27(%rbp) //'f'
movb $103, -26(%rbp) //'g'
movb $104, -25(%rbp) //'h'
movb $105, -24(%rbp) //'i'
movb $106, -23(%rbp) //'j'
movb $107, -22(%rbp) //'k'
movb $108, -21(%rbp) //'l'
movb $109, -20(%rbp) //'m'
movb $110, -19(%rbp) //'n'
movb $111, -18(%rbp) //'o'
The compiler is likely placing the number for each character into each array position, just as you wrote it, without any optimization that would be found from reading the code. Remember that a single character is no different than a number in c, so you could even use the ascii codes instead of 'a'. From a hexeditor I would expect you would see those converted back to letters, just spaced out a bit.
In the first case the compiler initializes data with the exact string "ABC...".
In the second case, each assignment is done sequentially, therefore compiler generates code to perform this assignment. In the executable you should see 15 repeating byte sequences where only the initializer ('a', 'b', 'c'...) changes.

Resources