It is well known that one must not modify string literals in C. The spec (section 6.4.5, paragraph 7) clearly states that modifying a string literal is undefined behaviour. Trying to do so with GCC results in a segfault as the literals get stored in read-only memory.
However the following code seems to work fine with GCC.
#include <stdio.h>
int main(void) {
    int *arr = (int[]){1, 2, 3};  /* compound literal: a modifiable array */
    arr[1] = 100;
    printf("arr[1]: %i\n", arr[1]);
    return 0;
}
Section 6.5.2.5, paragraph 5 of the spec shows that arr has automatic storage duration, just as if I had declared int arr[] = {1,2,3}; instead.
Is there a reason why string literals are handled differently?
Trying to do so with GCC results in a segfault as the literals get stored in read-only memory.
This is false in general.
A good counter-example is the Linux kernel.
It is compiled by GCC and stored in RAM. Read about undefined behavior.
On the OSDEV wiki you can find mentions of other operating system kernels compiled by GCC and stored in writable RAM.
And with special linker flags, you can ask GCC to put string literals in writable memory. There are few reasons to do so in 2020 (except when you are coding your toy operating system).
You might write or patch an existing C compiler (such as nwcc or tinycc) to put literal strings in some writable data segment.
You may be interested in static program analysis tools such as Frama-C or the Clang Static Analyzer.
Is there a reason why string literals are handled differently?
Yes: writing optimizing compilers is difficult. Also be aware of Rice's theorem. Consider using CompCert (you may need to pay for a license), or writing your own GCC plugin (it might, for instance, store every literal string starting with the letter a in some data segment).
Related
I have recently started learning how to use the inline assembly in C Code and came across an interesting feature where you can specify registers for local variables (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables).
The usage of this feature is as follows:
register int *foo asm ("r12");
Then I started to wonder whether it was possible to insert a char pointer such as
const char d[4] = "r12";
register int *foo asm (d);
but got the error: expected string literal before ‘d’ (as expected)
I can understand why this would be a bad practice, but is there any possible way to achieve a similar effect where I can use a char pointer to access the register? If not, is there any particular reason why this is not allowed besides the potential security issues?
Additionally, I read this StackOverflow question: String literals: pointer vs. char array
Thank you.
The syntax to initialize the variable would be register char *foo asm ("r12") = d; to point an asm-register variable at a string. You can't use a runtime-variable string as the register name; register choices have to get assembled into machine code at compile time.
If that's what you're trying to do, you're misunderstanding something fundamental about assembly language and/or how ahead-of-time compiled languages compile into machine code. GCC won't make self-modifying code (and even if it wanted to, doing that safely would require redoing register allocation done by the ahead-of-time optimizer), or code that re-JITs itself based on a string.
(The first time I looked at your question, I didn't understand what you were even trying to do, because I was only considering things that are possible. @FelixG's comment was the clue I needed to make sense of the question.)
(Also note that registers aren't indexable; even in asm you can't use a single instruction to read a register number selected by an integer in another register. You could branch on it, or store all the registers in memory and index that like variadic functions do for their incoming register args.)
And if you do want a compile-time constant string literal, just use it with the normal syntax. Use a CPP macro if you want the same string to initialize a char array.
I recently had a question. I know that a pointer initialized with a string constant, as in the code below, points into the .rodata region, and that this region is read-only.
However, I saw in the C11 standard that writing to this memory address is undefined behavior.
I am aware that Borland's Turbo C compiler lets you write where the pointer points. Is this because the processor operated in real mode on some systems of the time, such as MS-DOS? Or is it independent of the operating mode of the processor? Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
#include <stdio.h>
int main(void) {
char *st = "aaa";
*st = 'b';
return 0;
}
When this code is compiled with Turbo C under MS-DOS, the write to memory succeeds.
As has been pointed out, trying to modify a constant string in C results in undefined behavior. There are several reasons for this.
One reason is that the string may be placed in read-only memory. This allows it to be shared across multiple instances of the same program, and doesn't require the memory to be saved to disk if the page it's on is paged out (since the page is read-only and thus can be reloaded later from the executable). It also helps detect run-time errors by giving an error (e.g. a segmentation fault) if an attempt is made to modify it.
Another reason is that the string may be shared. Many compilers (e.g., gcc) will notice when the same literal string appears more than once in a compilation unit, and will share the same storage for it. So if a program modifies one instance, it could affect others as well.
There is also never a need to do this, since the same intended effect can easily be achieved by using a static character array. For instance:
#include <stdio.h>
int main(void) {
static char st_arr[] = "aaa";
char *st = st_arr;
*st = 'b';
return 0;
}
This does exactly what the posted code attempted to do, but without any undefined behavior. It also takes the same amount of memory. In this example, the string "aaa" is used as an array initializer, and does not have any storage of its own. The array st_arr takes the place of the constant string from the original example, but (1) it will not be placed in read-only memory, and (2) it will not be shared with any other references to the string. So it's safe to modify it, if in fact that's what you want.
Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
GCC 3 and earlier used to support gcc -fwritable-strings to let you compile old K&R C where this was apparently legal, according to https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Incompatibilities.html. (It's undefined behaviour in ISO C and thus a bug in an ISO C program.) That option defines the behaviour of the assignment which ISO C leaves undefined.
GCC 3.3.6 manual - C Dialect options
-fwritable-strings
Store string constants in the writable data segment and don't uniquize them. This is for compatibility with old programs which assume they can write into string constants.
Writing into string constants is a very bad idea; “constants” should be constant.
GCC 4.0 removed that option (release notes); the last GCC3 series was gcc3.4.6 in March 2006, although the option had apparently become buggy in that version.
gcc -fwritable-strings would treat string literals like non-const anonymous character arrays (see @gnasher's answer), so they go in the .data section instead of .rodata, and thus get linked into a segment of the executable that's mapped to read+write pages, not read-only. (Executable segments have basically nothing to do with x86 segmentation, it's just a start+range memory-mapping from the executable file to memory.)
And it would disable duplicate-string merging, so char *foo() { return "hello"; } and char *bar() { return "hello"; } would return different pointer values, instead of merging identical string literals.
Related:
How can some GCC compilers modify a constant char pointer?
https://softwareengineering.stackexchange.com/questions/294748/why-are-c-string-literals-read-only
Linker option: still Undefined Behaviour so probably not viable
On GNU/Linux, linking with ld -N (--omagic) will make the text (as well as data) section read+write. This may apply to .rodata even though modern GNU Binutils ld puts .rodata in its own section (normally with read but not exec permission) instead of making it part of .text. Having .text writeable could easily be a security problem: you never want a page with write+exec at the same time, otherwise some bugs like buffer overflows can turn into code-injection attacks.
To do this from gcc, use gcc -Wl,-N to pass on that option to ld when linking.
This doesn't do anything about it being Undefined Behaviour to write const objects. e.g. the compiler will still merge duplicate strings, so writing into one char *foo = "hello"; will affect all other uses of "hello" in the whole program, even across files.
What to use instead:
If you want something writeable, use static char foo[] = "hello"; where the quoted string is just an array initializer for a non-const array. As a bonus, this is more efficient than static char *foo = "hello"; at global scope, because there's one fewer level of indirection to get to the data: it's just an array instead of a pointer stored in memory.
You are asking whether or not the platform may cause undefined behavior to be defined. The answer to that question is yes.
But you are also asking whether or not the platform defines this behavior. In fact it does not.
Under some optimization hints, the compiler will merge string constants, so that writing to one constant will write to the other uses of that constant. I used this compiler once, and it was quite capable of merging strings.
Don't write this code. It's not good. You will regret writing code in this style when you move onto a more modern platform.
Your literal "aaa" produces a static array of four char ('a', 'a', 'a', '\0') in an anonymous location, which must be treated as if it were const, and yields a pointer to the first 'a' of type char*.
Trying to modify any of the four characters is undefined behaviour. Undefined behaviour can do anything, from modifying the char as intended, pretending to modify the char, doing nothing, or crashing.
It's basically the same as static const char anonymous[4] = { 'a', 'a', 'a', '\0' }; char* st = (char*) &anonymous [0];
To add to the correct answers above: DOS runs in real mode, so there is no memory protection and no read-only memory for programs; all RAM is writable. Hence, writing to the literal (or to any const variable) worked in practice at the time, even though the standard still leaves the behavior undefined.
I am not sure why strcat works in this case for me:
char* foo="foo";
printf(strcat(foo,"bar"));
It successfully prints "foobar" for me.
However, as per an earlier topic discussed on stackoverflow here: I just can't figure out strcat
It says, that the above should not work because foo is declared as a string literal. Instead, it needs to be declared as a buffer (an array of a predetermined size so that it can accommodate another string which we are trying to concatenate).
In that case, why does the above program work for me successfully?
This code invokes Undefined Behavior (UB), meaning that you have no guarantee of what will happen (it may well fail).
The reason is that string literals are immutable: any attempt to modify one invokes UB.
Note how hard to find the logical errors that arise from UB can be: the code might work (today, on your system) while still being wrong, which makes it very likely that you miss the error and carry on as if everything were fine.
PS: In this Live Demo, I am lucky enough to get a Segmentation fault. I say lucky, because this seg fault will make me investigate and debug the code.
It's worth noting that GCC issues no warning, and the warning from Clang is unrelated to the actual problem:
prog.c:7:8: warning: format string is not a string literal (potentially insecure) [-Wformat-security]
printf(strcat(foo,"bar"));
^~~~~~~~~~~~~~~~~
prog.c:7:8: note: treat the string as an argument to avoid this
printf(strcat(foo,"bar"));
^
"%s",
1 warning generated.
String literals are immutable in the sense that the compiler will operate under the assumption that you won't mutate them, not that you'll necessarily get an error if you try to modify them. In legalese, this is "undefined behavior", so anything can happen, and, as far as the standard is concerned, it's fine.
Now, on modern platforms and with modern compilers you do have extra protections: on platforms that have memory protection the string table generally gets placed in a read-only memory area, so that modifying it will get you a runtime error.
Still, you may have a compiler that doesn't provide any of the runtime-enforced checks, either because you are compiling for a platform without memory protection (e.g. pre-80386 x86, so pretty much any C compiler for DOS such as Turbo C, most microcontrollers when operating on RAM and not on flash, ...), or with an older compiler which doesn't exploit this hardware capability by default to remain compatible with older revisions (older VC++ for a long time), or with a modern compiler which has such an option explicitly enabled, again for compatibility with older code (e.g. gcc with -fwritable-strings). In all these cases, it's normal that you won't get any runtime error.
Finally, there's an extra devious corner case: current-day optimizers actively exploit undefined behavior - i.e. they assume that it will never happen, and modify the code accordingly. It's not impossible that a particularly smart compiler can generate code that just drops such a write, as it's legally allowed to do anything it likes most for such a case.
This can be seen for some simple code, such as:
int foo() {
char *bar = "bar";
*bar = 'a';
if(*bar=='b') return 1;
return 0;
}
here, with optimizations enabled:
VC++ sees that the write is used just for the condition that immediately follows, so it simplifies the whole thing to return 0; no memory write, no segfault, it "appears to work" (https://godbolt.org/g/cKqYU1);
gcc 4.1.2 "knows" that literals don't change; the write is redundant and it gets optimized away (so, no segfault), the whole thing becomes return 1 (https://godbolt.org/g/ejbqDm);
any more modern gcc chooses a more contradictory route: the write is not elided (so you get a segfault with the default linker options), but if it succeeded (e.g. if you manually fiddle with memory protection) you'd get a return 1 (https://godbolt.org/g/rnUDYr) - so, memory modified but the code that follows thinks it hasn't been modified; this is particularly egregious on AVR, where there's no memory protection and the write succeeds.
clang does pretty much the same as gcc.
Long story short: don't try your luck and tread carefully. Always assign string literals to const char * (not plain char *) and let the type system help you avoid this kind of problem.
String literals are lvalues, which leaves the door open to modify string literals.
From C in a Nutshell:
In C source code, a literal is a token that denotes a fixed value, which may be an integer, a floating-point number, a character, or a string. A literal’s type is determined by its value and its notation.
The literals discussed here are different from compound literals, which were introduced in the C99 standard. Compound literals are ordinary modifiable objects, similar to variables.
Although C does not strictly prohibit modifying string literals, you should not attempt to do so. For one thing, the compiler, treating the string literal as a constant, may place it in read-only memory, in which case the attempted write operation causes a fault. For another, if two or more identical string literals are used in the program, the compiler may store them at the same location, so that modifying one causes unexpected results when you access another.
The first paragraph says that "a literal in C denotes a fixed value".
Does it mean that a literal (except compound literals) shouldn't be modified?
Since a string literal isn't a compound literal, should a string literal be modified?
The second paragraph says that "C does not strictly prohibit
modifying string literals" while compilers do. So should a string
literal be modified?
Do the two paragraphs contradict each other? How shall I understand them?
Can a literal which is neither compound literal nor string literal be modified?
From the C Standard (6.4.5 String literals)
7 It is unspecified whether these arrays are distinct provided their
elements have the appropriate values. If the program attempts to
modify such an array, the behavior is undefined.
As for your statement.
The second paragraph says that "C does not strictly prohibit modifying
string literals" while compilers do. So should a string literal be
modified?
Then compilers do not modify string literals. They may store identical string literals as one array.
As @o11c pointed out in a comment, Annex J (informative) Portability issues says:
J.5 Common extensions
1 The following extensions are widely used in
many systems, but are not portable to all implementations. The
inclusion of any extension that may cause a strictly conforming
program to become invalid renders an implementation nonconforming.
Examples of such extensions are new keywords, extra library functions
declared in standard headers, or predefined macros with names that do
not begin with an underscore.
J.5.5 Writable string literals
1 String literals are modifiable (in which case, identical string
literals should denote distinct objects) (6.4.5).
Don't modify string literals. Treat them as char const[].
String literals are effectively char const[] (modifying them results in undefined behavior), but for legacy reason they're really char [], which means the compiler won't stop you from writing into them, but your program will still go undefined if you do.
More practically: not every hardware platform provides mechanisms to protect the memory where read-only objects are stored, which is why modification had to be left as UB. There are 3 possible outcomes:
Literals (and constant objects more generally) are kept in RAM and the hardware does not provide memory protection mechanisms. Nothing can stop the programmer from writing to this location.
Literals (and constant objects) are kept in RAM but the hardware does provide memory protection mechanisms: you will get a segfault.
Read-only data is stored in read-only memory (for example microcontroller flash). You can try to write to it, but the write has no effect (for example on ARM) and no hardware exception is raised.
The first paragraph says that "a literal in C denotes a fixed value".
Does it mean that a literal (except compound literals) shouldn't be modified?
I don't know what the author's intention was, but modification of the array resulting from a string literal during runtime is blatantly undefined, according to C11/6.4.5p7: "If the program attempts to modify such an array, the behavior is undefined."
It should also be noted that attempts to modify a const-qualified compound literal during runtime will also result in undefined behavior, which is explained alongside some volatile-related undefined behaviour in C11/6.7.3p6. It is otherwise well defined to modify compound literals.
For example:
char *fubar = "hello world";
(*fubar)++; // SQUARELY UNDEFINED BEHAVIOUR!
char *fubar = (char[]){"hello world"};
(*fubar)++; // This is well defined.
Literally replacing "hello world" with "goodbye galaxy", in either piece of source code, is fine. Redefining standard functions, however (i.e. #define memcpy strncpy or #define size_t signed char, which are both great ways to ruin someone else's day), is undefined behaviour.
Since a string literal isn't a compound literal, should a string literal be modified?
The array resulting from a string literal should certainly not be modified during runtime, for any attempt to do so would trigger undefined behaviour.
The string literal itself, which exists as a quoted sequence of characters within your source code, on the other hand... of course, that can be modified as you choose. You're not obliged to modify it, though.
The second paragraph says that "C does not strictly prohibit modifying string literals" while compilers do. So should a string literal be modified?
The C standard doesn't strictly prohibit a lot of undefined behavior; it leaves the behavior undefined, meaning your program is likely to behave erratically or be non-portable. In the realms of well defined C, your programs should not invoke any undefined behaviour, including overflowing arrays, modifying const-qualified objects or the arrays resulting from string literals, race conditions caused by multithreading, etc.
If you want to invoke undefined behaviour, C will let you shoot yourself in the foot. You might have a good reason for doing so; perhaps your program will be more optimal, or perhaps your compiler actually lets you modify string literals ("it's a feature, not a bug", they say, "so give us your money", they say, as you become reliant upon their non-standard quirks). Be aware that some compilers will instead behave as though the attempted modification didn't occur, or crash, or there could be some vulnerability caused.
... and above all else, be aware that your code will no longer be compliant C code!
Do the two paragraphs contradict each other?
By omission, perhaps. The first paragraph does state that the values are fixed, and the second paragraph that the values might be modifiable during runtime through invocation of undefined behaviour.
I think the author meant to make the distinction between elements of source code and the runtime environment. He/she could simply clarify this by ensuring it's explicit that literals should not be modified during runtime, for example.
How shall I understand them?
In the realms of C such values can't change during runtime because invoking undefined behaviour means the code in question is no longer compliant C code.
Perhaps they were trying to avoid explaining undefined behaviour, because it may seem too complex to explain. If you look deeper into the subject, you'll find that the meaning is, as predicted, roughly a conjunction of the two words.
undefined: /ʌndɪˈfʌɪnd/ adj. not clear or defined.
behaviour: /bɪˈheɪvjə/ noun. the way in which a machine or natural phenomenon works or functions
That is to say, an attempt to modify the array resulting from a string literal during runtime results in "unclear functionality". It's not required to be documented anywhere in the realms of computer science, and even if it is documented, that documentation might be a lie.
Can a literal which is neither compound literal nor string literal be modified?
As a lexical element in source code, providing it doesn't override a standard symbol, yes. Literals which aren't l-values (i.e. don't have any storage) such as integer constants, obviously can't be modified during runtime. I suppose it might be possible on some systems to attempt to modify the memory which a function pointer points at, which could be seen as a literal; that's also undefined behaviour and would result in code that isn't C.
It might also be possible to modify many other types of elements which aren't seen as objects by the C standard, such as the return address on the stack. That's what makes buffer overflows so subtly dangerous!
In C, in terms of the amount of memory used, if there are a bunch of functions all with return 1;, is each 1 literal stored or just one 1?
I.e., would it be better to use (at file scope) static const int numOne = 1 and then have the functions use return numOne;?
In case it is compiler dependent, I am compiling for a TI MCU using TI's C28x compiler.
Please note this question is about C not C++.
No, usually literals aren't "stored" at all. In particular, small integer constants such as this one usually become immediate operands in the generated assembly: they sit directly in the code, not in some data section.