Memory Allocation of Static String Literals - c

Consider the following struct:
struct example_t {
char * a;
char * b;
};
struct example_t test = {
"Chocolate",
"Cookies"
};
I am aware of the implementation specific nature of the allocation of memory for the char*'s, but what of the string literals?
In this case, does the C standard guarantee anything about the adjacent placement of "Chocolate" and "Cookies"?
In most implementations I tested the two literals are not padded, and are directly adjacent.
This allows the struct to be copied quickly with a memcpy, although I suspect this behavior is undefined. Does anyone have any information on this topic?

In your example, there is no guarantee about the adjacency or placement of the two string literals with respect to each other. GCC happens to place them adjacently in this case, but it has no obligation to do so.
In the listing below we see no padding, and we can even use undefined behavior (reading past the end of the first literal) to demonstrate adjacency of the string literals. This works with this GCC build, but a different compiler, linker, or optimization level could give other behavior, such as merging duplicate string literals across translation units to save memory in the final application.
Also, while the pointers you declared are of type char *, it is better to declare them const char *. In C a string literal has type char[N], not const char[N], but modifying it is undefined behavior; implementations typically place literals in a read-only section such as .rodata, where a write usually causes a segfault.
Code Listing
#include <stdio.h>
#include <string.h>
struct example_t {
char * a;
char * b;
char * c;
};
int main(void) {
struct example_t test = {
"Chocolate",
"Cookies",
"And milk"
};
size_t len = strlen(test.a) + strlen(test.b) + strlen(test.c) + ((3-1) * sizeof(char));
char* t= test.a;
int i;
for (i = 0; i< len; i++) {
printf("%c", t[i]);
}
return 0;
}
Sample output
./a.out
ChocolateCookiesAnd milk
Output of gcc -S
.file "test.c"
.section .rodata
.LC0:
.string "Chocolate"
.LC1:
.string "Cookies"
.LC2:
.string "And milk"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $72, %rsp
.cfi_offset 3, -24
movq $.LC0, -48(%rbp)
movq $.LC1, -40(%rbp)
movq $.LC2, -32(%rbp)
movq -48(%rbp), %rax
movq %rax, %rdi
call strlen
movq %rax, %rbx
movq -40(%rbp), %rax
movq %rax, %rdi
call strlen
addq %rax, %rbx
movq -32(%rbp), %rax
movq %rax, %rdi
call strlen
addq %rbx, %rax
addq $2, %rax
movq %rax, -64(%rbp)
movq -48(%rbp), %rax
movq %rax, -56(%rbp)
movl $0, -68(%rbp)
jmp .L2
.L3:
movl -68(%rbp), %eax
movslq %eax, %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movzbl (%rax), %eax
movsbl %al, %eax
movl %eax, %edi
call putchar
addl $1, -68(%rbp)
.L2:
movl -68(%rbp), %eax
cltq
cmpq -64(%rbp), %rax
jb .L3
movl $0, %eax
addq $72, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
.section .note.GNU-stack,"",@progbits
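Because adjacency is not guaranteed, code that needs the three strings in one contiguous block has to build that block itself. The following is a minimal sketch, assuming a hypothetical helper named join3 (it is not part of the listing above) that copies each string explicitly instead of reading past the end of a literal:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated string holding
   a, b and c back to back, or NULL if allocation fails.
   The caller is responsible for calling free() on the result. */
char *join3(const char *a, const char *b, const char *c)
{
    size_t la = strlen(a), lb = strlen(b), lc = strlen(c);
    char *out = malloc(la + lb + lc + 1);   /* +1 for the terminator */
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);                     /* each copy stays inside its own string */
    memcpy(out + la, b, lb);
    memcpy(out + la + lb, c, lc + 1);       /* also copies c's '\0' */
    return out;
}
```

Calling join3(test.a, test.b, test.c) with the struct from the listing gives "ChocolateCookiesAnd milk" regardless of where the compiler placed the literals.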

No, there is no guarantee for adjacent placement.
One occasion where actual compilers will place them far apart is if the same string literal appears in different places (as read-only objects) and the string combining optimization is enabled.
Example:
char *foo = "foo";
char *baz = "baz";
struct example_t bar = {
"foo",
"bar"
};
may well end up in memory as "foo" followed by "baz" followed by "bar".

Here is an example demonstrating a real-world scenario where the strings are not adjacent. GCC decides to reuse the string "Chocolate" from earlier.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
const char *a = "Chocolate";
const char *b = "Spinach";
struct test_t {
const char *a;
const char *b;
};
struct test_t test = {"Chocolate", "Cookies"};
int main(void)
{
printf("%p %p\n", (const void *) a, (const void *) b);
printf("%p %p\n", (const void *) test.a, (const void *) test.b);
return EXIT_SUCCESS;
}
Output:
0x400614 0x40061e
0x400614 0x400626

Here is an example of GCC behaviour where the strings do not end up adjacent in memory:
#include <stdio.h>
#include <stdlib.h>
char *s = "Cookies";
struct test {
char *a, *b, *c, *d;
};
struct test t = {
"Chocolate",
"Cookies",
"Milk",
"Cookies",
};
#define D(x) __FILE__":%d:%s: " x, __LINE__, __func__
#define P(x) do{\
printf(D(#x " = [%p] \"%s\"\n"), (void *)(x), x); \
} while(0)
int main()
{
P(t.a);
P(t.b);
P(t.c);
P(t.d);
return 0;
}
In this case, because the compiler reuses string literals it has already seen, the ones assigned to the structure fields are not adjacent.
This is the output of the program:
$ pru3
pru3.c:25:main: t.a = [0x8518] "Chocolate"
pru3.c:26:main: t.b = [0x8510] "Cookies"
pru3.c:27:main: t.c = [0x8524] "Milk"
pru3.c:28:main: t.d = [0x8510] "Cookies"
As you can see, the pointer for the "Cookies" value is even repeated.
This was compiled with the default options:
gcc -o pru3 pru3.c

Related

Making a C callable write/print function in x64 assembly

I'm trying to make a write function in x64 that I can call in a C file.
I have the following files
write.s
.text
.globl write
write:
// stack thing
pushq %rbp
movq %rsp, %rbp
// function arguments as done by the C convention
movl %edi, -4(%rbp) // fd
movl %esi, -8(%rbp) // buf
movl %edx, -12(%rbp) // length
// write
movq $1, %rax // syscall 1 for write
movq -4(%rbp), %rdi // fd to rdi
movq -8(%rbp), %rsi // buf to rsi
movq -12(%rbp), %rdx // len to rdx
syscall
// return
movq %rbp, %rsp
popq %rbp
ret
write.h
void write(int fd, char *buf, int len);
main.c
#include "write.h"
int main() {
int fd = 1;
char *buf = "hi";
int len = 2;
write(fd, buf, len);
return 0;
}
The problem is that when I compile this with gcc -no-pie -o main write.s main.c
and run ./main it doesn't output anything.
I'm sorry if this is some obvious mistake, as I am not that familiar with x64 assembly.

GCC emits a label that's not jumped to by anything outside that label?

Taking the following C code
#include <stdio.h>
void test(unsigned char buffer[], int size) {
for (int i = 0; i < size; i++) {
unsigned char data = buffer[i];
printf("%c", data);
}
}
void main() {
unsigned char buffer[5] = "Hello";
test(buffer, 5);
return;
}
and compiling it with the flags -fno-stack-protector -fno-asynchronous-unwind-tables -fno-unroll-loops for clarity produces the following assembly for the test() function:
test:
testl %esi, %esi
jle .L6
pushq %rbp
leal -1(%rsi), %eax
pushq %rbx
leaq 1(%rdi,%rax), %rbp
movq %rdi, %rbx
subq $8, %rsp
.p2align 4,,10
.p2align 3
.L3:
movzbl (%rbx), %edi
addq $1, %rbx
call putchar@PLT
cmpq %rbp, %rbx
jne .L3
addq $8, %rsp
popq %rbx
popq %rbp
ret
.p2align 4,,10
.p2align 3
.L6:
ret
.size test, .-test
.section .text.startup,"ax",@progbits
.p2align 4
It seems to me like the L3 label here is completely useless since it is never jumped to or entered. (Except by jne .L3, but that instruction is inside of the L3 label already).
Can anyone explain how and why this assembly still produces the expected effect?
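If you read the assembly from the top, you will see that execution falls through into .L3 once the loop setup is done; a label is only a name for an address, not a block that has to be entered by a jump. The jne .L3 at the bottom then jumps back to it for each further iteration, which is your for loop in C.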
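To make the fall-through visible, here is a hand-written C sketch of the shape the compiler gave the loop: one guard test up front, then a do/while whose body is entered without a jump. The name test_rotated and the returned count are assumptions added for illustration; they are not in the original code.

```c
#include <stdio.h>

/* Hand-written equivalent of the loop shape in the assembly above.
   Returns the number of characters written (added for illustration). */
int test_rotated(const unsigned char *buffer, int size)
{
    if (size <= 0)                            /* testl %esi, %esi ; jle .L6 */
        return 0;

    const unsigned char *p = buffer;          /* movq %rdi, %rbx */
    const unsigned char *end = buffer + size; /* leaq 1(%rdi,%rax), %rbp */
    int written = 0;

    do {                                      /* .L3: reached by falling through */
        putchar(*p++);                        /* movzbl ; addq $1 ; call putchar */
        written++;
    } while (p != end);                       /* cmpq %rbp, %rbx ; jne .L3 */

    return written;
}
```

The only jump that targets .L3 is the backward one at the end of the body, which is exactly what the do/while expresses.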

In which data segment is the C string stored?

I'm wondering what's the difference between char s[] = "hello" and char *s = "hello".
After reading this and this, I'm still not very clear on this question.
As far as I know, there are five segments in memory: Text, BSS, Data, Stack and Heap.
From my understanding,
in case of char s[] = "hello":
"hello" is in Text.
s is in Data if it is a global variable or in Stack if it is a local variable.
We also have a copy of "hello" where the s is stored, so we can modify the value of this string via s.
in case of char *s = "hello":
"hello" is in Text.
s is in Data if it is a global variable or in Stack if it is a local variable.
s just points to "hello" in Text and we don't have a copy of it, therefore modifying the value of string via this pointer should cause "Segmentation Fault".
Am I right?
You are right that the string is mutable in the first case and immutable in the second. In both cases the initial characters come from the program image.
In the first case the mutable memory is initialized (copied) from an immutable string. In the second case the pointer refers to the immutable string directly.
For the first case, Wikipedia says:
The values for these variables are initially stored within the
read-only memory (typically within .text) and are copied into the
.data segment during the start-up routine of the program.
Let us examine segment.c file.
char*s = "hello"; // string
char sar[] = "hello"; // string array
char content[32];
int main(int argc, char*argv[]) {
char psar[] = "parhello"; // local/private string array
char*ps = "phello"; // private string
content[0] = 1;
sar[3] = 1; // OK
// sar++; // not allowed
// s[2] = 1; // segmentation fault
s = sar;
s[2] = 1; // OK
psar[3] = 1; // OK
// ps[2] = 1; // segmentation fault
ps = psar;
ps[2] = 1; // OK
return 0;
}
Here is the assembly generated for segment.c. Note that both s and sar are global and live in the .data segment. sar is not a pointer at all: it is an array, that is, mutable initialized memory at a fixed address. One consequence is that sizeof(sar) is 6 while sizeof(s) is 8. "hello" and "phello" are in the read-only (.rodata) section and are effectively immutable.
.file "segment.c"
.globl s
.section .rodata
.LC0:
.string "hello"
.data
.align 8
.type s, @object
.size s, 8
s:
.quad .LC0
.globl sar
.type sar, @object
.size sar, 6
sar:
.string "hello"
.comm content,32,32
.section .rodata
.LC1:
.string "phello"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $64, %rsp
movl %edi, -52(%rbp)
movq %rsi, -64(%rbp)
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1752326512, -32(%rbp)
movl $1869376613, -28(%rbp)
movb $0, -24(%rbp)
movq $.LC1, -40(%rbp)
movb $1, content(%rip)
movb $1, sar+3(%rip)
movq $sar, s(%rip)
movq s(%rip), %rax
addq $2, %rax
movb $1, (%rax)
movb $1, -29(%rbp)
leaq -32(%rbp), %rax
movq %rax, -40(%rbp)
movq -40(%rbp), %rax
addq $2, %rax
movb $1, (%rax)
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L2
call __stack_chk_fail
.L2:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
For the local variables in main, the compiler does not bother to create a symbol name. It may keep them in registers or in stack memory.
Note that the local variable value "parhello" is turned into the numbers 1752326512 and 1869376613. I discovered this by changing the value of "parhello" to "parhellp". The diff of the assembly output is as follows,
39c39
< movl $1886153829, -28(%rbp)
---
> movl $1869376613, -28(%rbp)
So there is no separate immutable store for psar. It is turned into integer immediates in the code segment.
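The size difference between the pointer and the array can be checked directly. This is a minimal sketch reusing the names s and sar from segment.c; the two helper functions are hypothetical and exist only to expose the sizes:

```c
#include <stddef.h>

char *s = "hello";      /* pointer object; the characters live elsewhere */
char sar[] = "hello";   /* array object; the characters live in sar itself */

/* Hypothetical helpers that report the two sizes. */
size_t size_of_pointer(void) { return sizeof s; }    /* sizeof(char *), 8 on x86-64 */
size_t size_of_array(void)   { return sizeof sar; }  /* 5 letters + '\0' = 6 */
```

The array size is fixed by the literal, while the pointer size depends on the platform and says nothing about the length of the string it points to.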
Answer to your first question:
char s[] = "hello";
s is an array of char. An array is not a pointer, although it converts to a pointer to its first element in most expressions; you cannot assign to or increment s itself (i.e. s++ does not compile). The data aren't const, though, so you can change them.
See this example C code:
#include <stdio.h>
void reverse(char *p){
char c;
char* q = p;
while (*q) q++;
q--; // point to the end
while (p < q) {
c = *p;
*p++ = *q;
*q-- = c;
}
}
int main(){
char s[] = "DCBA";
reverse( s);
printf("%s\n", s); // ABCD
}
which reverses the text "DCBA" and produces "ABCD".
char *p = "hello";
p is a pointer to a char. You can do pointer arithmetic (p++ will compile), but the characters it points to are typically placed in a read-only part of memory (const data).
Using p[0]='a'; is undefined behavior and will typically result in a runtime error:
#include <stdio.h>
int main(){
char* s = "DCBA";
s[0]='D'; // compiles, but undefined behavior: typically crashes at run time
printf("%s\n", s); // typically never reached
}
This compiles, but typically crashes when run.
const char* const s = "DCBA";
With a const char* const, you can change neither s nor the data it points to (i.e. "DCBA"), so both the data and the pointer are const:
#include <stdio.h>
int main(){
const char* const s = "DCBA";
s[0]='D'; // compile error
printf("%s\n", s); // would print DCBA if the line above were removed
}
The Text segment is normally the segment where your code is stored and is const; i.e. unchangeable. In embedded systems, this is the ROM, PROM, or flash memory; in a desktop computer, it can be in RAM.
The Stack is RAM memory used for local variables in functions.
The Heap is RAM used for dynamically allocated memory (for example, memory obtained from malloc).
Data contains global and static variables with a non-zero initializer; BSS contains global and static variables that are initialized to zero or not explicitly initialized.
For more information, see the relevant Wikipedia and this relevant Stack Overflow question
With regards to s itself: The compiler decides where to put it (in stack space or CPU registers).
For more information about memory protection and access violations or segmentation faults, see the relevant Wikipedia page
This is a very broad topic, and ultimately the exact answers depend on your hardware and compiler.

Char array from C to ASM x64 GAS

I've got an assessment that requires using an array from C in an ASM function.
I figured out that I need to pass the address of that array. But how do I then access the values of the array in ASM (e.g. array[0], array[1], etc.)?
C function:
#include <stdio.h>
void asm_function(char *address);
int main() {
char array[] = "Abc";
asm_function(15, array);
return 0;
}
ASM function:
.type asm_function, @function
.section .data
EXIT = 60
EXIT_SUCCESS = 1
BUFF_LENGTH = 512
format: .asciz "%s\n"
.section .bss
.lcomm buffer, BUFF_LENGTH
.section .text
.globl asm_function
asm_function:
movq %rdi, buffer
subq $8, %rsp
movq $0, %rax
movq buffer_lancuch, %rsi
movq $format, %rdi
call printf #prints whole String
exit:
addq $8, %rsp
movq $EXIT, %rax
movq $EXIT_SUCCESS, %rdi
syscall
What I actually need is to access each of the chars separately.
I will appreciate any hints.
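As a rough guide to what the assembly has to do, here is a C sketch of walking the array one character at a time. In the System V x86-64 calling convention the first pointer argument arrives in %rdi, so each element is a single byte load from that address plus an index. The function name for_each_char, the callback parameter, and the instructions named in the comments are illustrative assumptions, not a drop-in replacement for asm_function:

```c
#include <stddef.h>

/* Hypothetical C equivalent of walking the array inside asm_function.
   Calls visit() once per character and returns how many were seen. */
size_t for_each_char(const char *address, void (*visit)(char))
{
    size_t i = 0;                    /* index register, e.g. %rcx */
    while (address[i] != '\0') {     /* movzbl (%rdi,%rcx), %eax ; testb %al, %al */
        if (visit != NULL)
            visit(address[i]);       /* at this point %al would hold array[i] */
        i++;                         /* incq %rcx */
    }
    return i;
}
```

For the array "Abc" from main, the loop body runs three times, once each for 'A', 'b' and 'c', and stops at the terminating zero byte.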

Where string data is stored?

I wrote a small c program:
#include <stdio.h>
int main()
{
char s[] = "Hello, world!";
printf("%s\n", s);
return 0;
}
which compiles to (on my linux machine):
.file "hello.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
movw $33, -20(%rbp)
leaq -32(%rbp), %rax
movq %rax, %rdi
call puts
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",@progbits
I don't understand the assembly code, and I can't see the string message anywhere in it. So how does the executable know what to print?
It's here:
movl $1819043144, -32(%rbp) ; 1819043144 = 0x6C6C6548 = "lleH"
movl $1998597231, -28(%rbp) ; 1998597231 = 0x77202C6F = "w ,o"
movl $1684828783, -24(%rbp) ; 1684828783 = 0x646C726F = "dlro"
movw $33, -20(%rbp) ; 33 = 0x0021 = "\0!"
In this particular case the compiler is generating inline instructions that build the string on the stack before the call (the printf has been turned into a call to puts). Of course in other situations it may not do this but may instead store a string constant in another section of memory. Bottom line: you cannot make any assumptions about how or where the compiler will generate and store string literals.
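The decoding in the comments above can be reproduced in C. This is a small sketch, with hypothetical helper names store_le and rebuild_hello, that writes the four immediates into a buffer least significant byte first, the same way the movl and movw instructions do on little-endian x86-64:

```c
#include <stddef.h>
#include <stdint.h>

/* Store the low `nbytes` bytes of `value` at dst, least significant
   byte first, which is what a movl/movw does on little-endian x86-64. */
static void store_le(char *dst, uint32_t value, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (char)((value >> (8 * i)) & 0xFF);
}

/* Hypothetical helper: rebuilds the 14-byte stack buffer from the
   four immediates in the listing. out must have room for 14 bytes. */
void rebuild_hello(char out[14])
{
    store_le(out + 0,  1819043144u, 4);  /* "Hell" */
    store_le(out + 4,  1998597231u, 4);  /* "o, w" */
    store_le(out + 8,  1684828783u, 4);  /* "orld" */
    store_le(out + 12, 33u,         2);  /* "!" and the terminating '\0' */
}
```

The offsets 0, 4, 8 and 12 correspond to -32, -28, -24 and -20 relative to %rbp in the listing, so after the four stores the buffer holds "Hello, world!" followed by its terminator.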
The string is here:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
This copies a bunch of values to the stack. Those values happen to be your string.
String constants are stored in the binary of your application. Exactly where is up to your compiler.
Assembly has no "string" concept. Thus, the "string" is actually a chunk of memory. The string is stored somewhere in memory (where is up to the compiler), and you can then manipulate this chunk of data using its memory address (pointer).
If your string is constant, the compiler might embed it as immediate constants in the instructions instead of storing it separately in memory, which can be faster. This is your case, as pointed out by Paul R:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
You cannot make assumptions about how the compiler will treat your string.
In addition to the above, the compiler can see that your string literal cannot be referenced directly (i.e. there can't be any valid pointers to your string), which is why it can just copy it inline. If however you assign a character pointer instead, i.e.
char *s = "Hello, world!";
The compiler will initialise a string literal somewhere in memory, since you can of course now point to it. This modification produces on my machine:
.LC0:
.string "Hello, world!"
.text
.globl main
.type main, @function
One assumption can be made about string literals: if a pointer is initialised to a literal, it will point to a static char array held somewhere in memory. As a result the pointer is valid in any part of the program, e.g. you can return a pointer to a string literal initialised in a function, and it will still be valid.
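A minimal sketch of that last point, using a hypothetical function named greeting: the literal has static storage duration, so the returned pointer stays valid after the function returns.

```c
/* Valid: the literal has static storage duration, so the pointer
   remains usable after the function returns. */
const char *greeting(void)
{
    const char *s = "Hello, world!";
    return s;
}

/* By contrast, returning a local array would be invalid, because the
   array itself lives in the function's stack frame and dies at return:
       char s[] = "Hello, world!";
       return s;   // dangling pointer, do not do this
*/
```

The pointer is declared const char * so that the compiler rejects accidental writes through it.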
