Can a string literal in C be modified?

I know that a string literal pointed to as in the code below is placed in the .rodata section, and that this section is read-only.
However, I saw in the C11 standard that writing to this memory address is undefined behavior.
I am aware that Borland's Turbo C compiler can write where the pointer points. Is this because the processor operated in real mode on some systems of the time, such as MS-DOS? Or is it independent of the operating mode of the processor? Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
#include <stdio.h>
int main(void) {
char *st = "aaa";
*st = 'b';
return 0;
}
Compiling this code with Turbo C on MS-DOS, you are able to write to that memory.

As has been pointed out, trying to modify a constant string in C results in undefined behavior. There are several reasons for this.
One reason is that the string may be placed in read-only memory. This allows it to be shared across multiple instances of the same program, and doesn't require the memory to be saved to disk if the page it's on is paged out (since the page is read-only and thus can be reloaded later from the executable). It also helps detect run-time errors by giving an error (e.g. a segmentation fault) if an attempt is made to modify it.
Another reason is that the string may be shared. Many compilers (e.g., gcc) will notice when the same literal string appears more than once in a compilation unit, and will share the same storage for it. So if a program modifies one instance, it could affect others as well.
There is also never a need to do this, since the same intended effect can easily be achieved by using a static character array. For instance:
#include <stdio.h>
int main(void) {
static char st_arr[] = "aaa";
char *st = st_arr;
*st = 'b';
return 0;
}
This does exactly what the posted code attempted to do, but without any undefined behavior. It also takes the same amount of memory. In this example, the string "aaa" is used as an array initializer, and does not have any storage of its own. The array st_arr takes the place of the constant string from the original example, but (1) it will not be placed in read-only memory, and (2) it will not be shared with any other references to the string. So it's safe to modify it, if in fact that's what you want.

Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
GCC 3 and earlier supported gcc -fwritable-strings to let you compile old K&R C where this was apparently legal, according to https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Incompatibilities.html. (It's undefined behaviour in ISO C and thus a bug in an ISO C program.) That option defines the behaviour of the assignment which ISO C leaves undefined.
GCC 3.3.6 manual - C Dialect options
-fwritable-strings
Store string constants in the writable data segment and don't uniquize them. This is for compatibility with old programs which assume they can write into string constants.
Writing into string constants is a very bad idea; “constants” should be constant.
GCC 4.0 removed that option (release notes); the last release in the GCC 3 series was GCC 3.4.6 in March 2006, although apparently the option had become buggy by then.
gcc -fwritable-strings would treat string literals like non-const anonymous character arrays (see @gnasher's answer), so they go in the .data section instead of .rodata, and thus get linked into a segment of the executable that's mapped to read+write pages, not read-only. (Executable segments have basically nothing to do with x86 segmentation; a segment is just a start+range memory-mapping from the executable file to memory.)
And it would disable duplicate-string merging, so char *foo() { return "hello"; } and char *bar() { return "hello"; } would return different pointer values, instead of merging identical string literals.
Related:
How can some GCC compilers modify a constant char pointer?
https://softwareengineering.stackexchange.com/questions/294748/why-are-c-string-literals-read-only
Linker option: still Undefined Behaviour so probably not viable
On GNU/Linux, linking with ld -N (--omagic) will make the text (as well as data) section read+write. This may apply to .rodata even though modern GNU Binutils ld puts .rodata in its own section (normally with read but not exec permission) instead of making it part of .text. Having .text writeable could easily be a security problem: you never want a page with write+exec at the same time, otherwise some bugs like buffer overflows can turn into code-injection attacks.
To do this from gcc, use gcc -Wl,-N to pass on that option to ld when linking.
This doesn't do anything about it being Undefined Behaviour to write const objects. e.g. the compiler will still merge duplicate strings, so writing into one char *foo = "hello"; will affect all other uses of "hello" in the whole program, even across files.
What to use instead:
If you want something writeable, use static char foo[] = "hello"; where the quoted string is just an array initializer for a non-const array. As a bonus, this is more efficient than static char *foo = "hello"; at global scope, because there's one fewer level of indirection to get to the data: it's just an array instead of a pointer stored in memory.
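As a minimal sketch of that difference (the names greeting, greeting_size and capitalize_greeting are made up for this illustration), the array form gives you storage that you own and may modify:

```c
#include <stddef.h>

/* Hypothetical example: a writable array initialized from a literal.
   The literal is only an initializer; the array owns its 6 bytes. */
static char greeting[] = "hello";

/* sizeof sees the whole array: 5 characters plus the terminating '\0'. */
size_t greeting_size(void) {
    return sizeof greeting;
}

/* Modifying the array is well defined, unlike writing through
   a char * that points at a string literal. */
char *capitalize_greeting(void) {
    greeting[0] = 'H';
    return greeting;
}
```

With static char *foo = "hello"; instead, sizeof foo would be the size of a pointer, and the write would be undefined behaviour.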

You are asking whether or not the platform may cause undefined behavior to be defined. The answer to that question is yes.
But you are also asking whether or not the platform defines this behavior. In fact it does not.
Under some optimization hints, the compiler will merge string constants, so that writing to one constant will write to the other uses of that constant. I used this compiler once; it was quite capable of merging strings.
Don't write this code. It's not good. You will regret writing code in this style when you move onto a more modern platform.

Your literal "aaa" produces a static array of four const char 'a', 'a', 'a', '\0' in an anonymous location and returns a pointer to the first 'a', cast to char*.
Trying to modify any of the four characters is undefined behaviour. Undefined behaviour can do anything, from modifying the char as intended, pretending to modify the char, doing nothing, or crashing.
It's basically the same as static const char anonymous[4] = { 'a', 'a', 'a', '\0' }; char* st = (char*) &anonymous [0];

To add to the correct answers above: DOS runs in real mode, so there is no read-only memory protection. All memory is writable. Hence, writing to the literal worked in practice at the time (as did writing to any sort of const variable), even though the C standard leaves it undefined.

Related

Modifying non-string literals in C [duplicate]

It is well known that one must not modify string literals in C. The spec (section 6.4.5, paragraph 7) clearly states that modifying a string literal is undefined behaviour. Trying to do so with GCC results in a segfault as the literals get stored in read-only memory.
However the following code seems to work fine with GCC.
#include <stdio.h>
int main(void) {
int *arr = (int[]){1,2,3};
arr[1] = 100;
printf("arr[1]: %i\n", arr[1]);
return 0;
}
Section 6.5.2.5, paragraph 5 of the spec shows that arr would have automatic storage duration, similar to if I had declared int arr[] = {1,2,3}; instead.
Is there a reason why string literals are handled differently?
Trying to do so with GCC results in a segfault as the literals get stored in read-only memory.
This is false in general.
A good counter-example is the Linux kernel.
It is compiled by GCC and stored in RAM. Read about undefined behavior.
On the OSDEV wiki you can find mentions of other operating system kernels compiled by GCC and stored in writable RAM.
And with special linker flags, you can ask GCC to put string literals in writable memory. There are few reasons to do so in 2020 (except when you are coding your toy operating system).
You might write or patch an existing C compiler (such as nwcc or tinycc) to put literal strings in some writable data segment.
You might be interested in static program analysis tools such as Frama-C or the Clang Static Analyzer.
Is there a reason why string literals are handled differently?
Yes, coding optimizing compilers is difficult. Be also aware of Rice's theorem. Consider using CompCert (you may need to pay for a license), or writing your GCC plugin (it might store every literal string starting with a in some data segment).
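To illustrate the difference the question observes, here is a small sketch (the function name is invented for the example). A compound literal of non-const type is an ordinary modifiable object, so one way to get a writable, unnamed string is to wrap the literal in a char[] compound literal; the string literal then only serves as its initializer:

```c
/* Hypothetical helper: modifies a compound literal and returns its
   first character. Inside a function body the compound literal has
   automatic storage duration, so it must not be used after the
   enclosing block ends. */
char first_after_write(void) {
    char *s = (char[]){"aaa"};  /* writable copy, not the literal itself */
    s[0] = 'b';                 /* well defined */
    return s[0];
}
```

Writing s[0] = 'b' after char *s = "aaa"; would look almost identical but is undefined behaviour, because s would then point at the string literal's own storage.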

Reading registers in the GCC using a char pointer

I have recently started learning how to use the inline assembly in C Code and came across an interesting feature where you can specify registers for local variables (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables).
The usage of this feature is as follows:
register int *foo asm ("r12");
Then I started to wonder whether it was possible to insert a char pointer such as
const char d[4] = "r12";
register int *foo asm (d);
but got the error: expected string literal before ‘d’ (as expected)
I can understand why this would be a bad practice, but is there any possible way to achieve a similar effect where I can use a char pointer to access the register? If not, is there any particular reason why this is not allowed besides the potential security issues?
Additionally, I read this StackOverflow question: String literals: pointer vs. char array
Thank you.
The syntax to initialize the variable would be register char *foo asm ("r12") = d; to point an asm-register variable at a string. You can't use a runtime-variable string as the register name; register choices have to get assembled into machine code at compile time.
If that's what you're trying to do, you're misunderstanding something fundamental about assembly language and/or how ahead-of-time compiled languages compile into machine code. GCC won't make self-modifying code (and even if it wanted to, doing that safely would require redoing register allocation done by the ahead-of-time optimizer), or code that re-JITs itself based on a string.
(The first time I looked at your question, I didn't understand what you were even trying to do, because I was only considering things that are possible. @FelixG's comment was the clue I needed to make sense of the question.)
(Also note that registers aren't indexable; even in asm you can't use a single instruction to read a register number selected by an integer in another register. You could branch on it, or store all the registers in memory and index that like variadic functions do for their incoming register args.)
And if you do want a compile-time constant string literal, just use it with the normal syntax. Use a CPP macro if you want the same string to initialize a char array.

string literals and strcat

I am not sure why strcat works in this case for me:
char* foo="foo";
printf(strcat(foo,"bar"));
It successfully prints "foobar" for me.
However, as per an earlier topic discussed on stackoverflow here: I just can't figure out strcat
It says, that the above should not work because foo is declared as a string literal. Instead, it needs to be declared as a buffer (an array of a predetermined size so that it can accommodate another string which we are trying to concatenate).
In that case, why does the above program work for me successfully?
This code invokes Undefined Behavior (UB), meaning that you have no guarantee of what will happen (it fails when I run it).
The reason is that string literals are immutable: any attempt to modify one invokes UB.
Note how hard to find the logical errors arising from UB can be: the code might work (today, on your system) while still being wrong, which makes it very likely that you miss the error and carry on as if everything were fine.
PS: In this Live Demo, I am lucky enough to get a Segmentation fault. I say lucky, because this seg fault will make me investigate and debug the code.
It's worth noting that GCC issues no warning, and the warning from Clang is also irrelevant:
prog.c:7:8: warning: format string is not a string literal (potentially insecure) [-Wformat-security]
printf(strcat(foo,"bar"));
^~~~~~~~~~~~~~~~~
prog.c:7:8: note: treat the string as an argument to avoid this
printf(strcat(foo,"bar"));
^
"%s",
1 warning generated.
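For comparison, here is a hedged sketch of a version that stays within defined behaviour. The function name and the size check are assumptions made for the illustration; the point is only that the destination of strcat must be a writable buffer large enough for the result:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: builds "foobar" in a caller-supplied writable
   buffer. Returns 0 on success, -1 if the buffer is too small. */
int build_foobar(char *buf, size_t size) {
    const char *first = "foo";
    const char *second = "bar";

    /* Room is needed for both strings plus the terminating '\0'. */
    if (strlen(first) + strlen(second) + 1 > size)
        return -1;

    strcpy(buf, first);   /* buf is writable, unlike the literal "foo" */
    strcat(buf, second);  /* appends in place; the literals are only read */
    return 0;
}
```

A caller would declare something like char buf[16]; and, on success, print it with printf("%s\n", buf); which also avoids the format-string warning shown above.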
String literals are immutable in the sense that the compiler will operate under the assumption that you won't mutate them, not that you'll necessarily get an error if you try to modify them. In legalese, this is "undefined behavior", so anything can happen, and, as far as the standard is concerned, it's fine.
Now, on modern platforms and with modern compilers you do have extra protections: on platforms that have memory protection the string table generally gets placed in a read-only memory area, so that modifying it will get you a runtime error.
Still, you may have a compiler that doesn't provide any of the runtime-enforced checks, either because you are compiling for a platform without memory protection (e.g. pre-80386 x86, so pretty much any C compiler for DOS such as Turbo C, most microcontrollers when operating on RAM and not on flash, ...), or with an older compiler which doesn't exploit this hardware capability by default to remain compatible with older revisions (older VC++ for a long time), or with a modern compiler which has such an option explicitly enabled, again for compatibility with older code (e.g. gcc with -fwritable-strings). In all these cases, it's normal that you won't get any runtime error.
Finally, there's an extra devious corner case: current-day optimizers actively exploit undefined behavior - i.e. they assume that it will never happen, and modify the code accordingly. It's not impossible that a particularly smart compiler can generate code that just drops such a write, as it's legally allowed to do anything it likes most for such a case.
This can be seen for some simple code, such as:
int foo() {
char *bar = "bar";
*bar = 'a';
if(*bar=='b') return 1;
return 0;
}
here, with optimizations enabled:
VC++ sees that the write is used just for the condition that immediately follows, so it simplifies the whole thing to return 0; no memory write, no segfault, it "appears to work" (https://godbolt.org/g/cKqYU1);
gcc 4.1.2 "knows" that literals don't change; the write is redundant and it gets optimized away (so, no segfault), the whole thing becomes return 1 (https://godbolt.org/g/ejbqDm);
any more modern gcc chooses a more schizophrenic route: the write is not elided (so you get a segfault with the default linker options), but if it succeeded (e.g. if you manually fiddle with memory protection) you'd get a return 1 (https://godbolt.org/g/rnUDYr) - so, memory modified but the code that follows thinks it hasn't been modified; this is particularly egregious on AVR, where there's no memory protection and the write succeeds.
clang does pretty much the same as gcc.
Long story short: don't try your luck and tread carefully. Always assign string literals to const char * (not plain char *) and let the type system help you avoid this kind of problem.

Does the preprocessor prepare a list of unique constant strings before the compiler goes into action?

In the code below, I have two different local char* variables declared in two different functions.
Each variable is initialized to point to a constant string, and the contents of the two strings are identical.
Checking in runtime, the variables are initialized to point to the same address in memory.
So the compiler must have assigned the same (constant) value to each one of them.
How is that possible?
#include <stdio.h>
void PrintPointer()
{
char* p = "abc";
printf("%p\n",p);
}
int main()
{
char* p = "abc";
printf("%p\n",p);
PrintPointer();
return 0;
}
It has nothing to do with the preprocessor. But the compiler is explicitly allowed (not required) by the standard to share the memory for identical string literals. For details on when this happens, you must consult your compiler's documentation.
For example, here's the relevant documentation for VC2013:
In some cases, identical string literals may be pooled to save space in the executable file. In string-literal pooling, the compiler causes all references to a particular string literal to point to the same location in memory, instead of having each reference point to a separate instance of the string literal. To enable string pooling, use the /GF compiler option.
The C++ standard says in N3797 2.14.5/12:
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation defined. The effect of attempting to modify a string literal is undefined.
The C standard now contains similar wording (it leaves it unspecified whether the arrays are distinct, and makes modifying them undefined). Historically it was possible to modify string literals at run-time in C, but this is now Undefined Behaviour. Some compilers may allow it, some not.
Technically, the compiler does it by storing string literals in the symbol table. If an identical string is seen more than once, the same symbolic reference is used each time. The same technique might well be used for other literals, but would not be so easily detected.
The preprocessor, by the way, has nothing to do with it.
How is that possible?
It's possible because the compiler keeps track of values like that. But no, the preprocessor generally doesn't get involved in things like this; the preprocessor does things like macro substitutions that modify the code before the compiler starts working. In this case, though, we're talking about actual code:
char* p = "abc";
and that's the domain of the compiler, not the preprocessor.
So the compiler must have assigned the same (constant) value to each one of them. How is that possible?
If you have two identical string literals, as you do here, then the compiler is allowed to combine them into a single one; apparently, your compiler does that. It's also allowed to store them separately.
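A quick way to see what a given compiler does is to compare the addresses directly. This is only a sketch with made-up function names; either result is permitted, and it can change with compiler options:

```c
/* Two functions that each return a pointer to an identical literal. */
static const char *first_abc(void)  { return "abc"; }
static const char *second_abc(void) { return "abc"; }

/* Returns 1 if this implementation stored both literals at the same
   address, 0 if it kept them as separate objects. Both outcomes are
   allowed, so portable code must not depend on the answer. */
int literals_pooled(void) {
    return first_abc() == second_abc();
}
```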

C string literals linking

As far as I am aware, the linker tries to merge two string literals into one single literal, if they are both the same, e.g.:
file1.c
char const* firstString = "foo";
file2.c
char const* secondString = "foo";
Would result in only one occurrence of foo\0 in the respective memory section (saving 4 bytes). This is especially important for embedded applications (how do avr-gcc and gcc behave here?).
But I was wondering if I can actually count on this happening, and rely on the pointers being equal whenever two strings are equal (provided that in the whole program, you only pass string literals around and no runtime generated strings exist -- which is a reasonable assumption in my case). Obviously, I want to speed up string comparisons with this, and allow a commonly used function to receive a string literal like so:
void lock(char const*);
void unlock(char const*);
lock("test");
dosmth();
unlock("test");
In essence, I want to avoid having a huge enum and huge switches inside the lock/unlock functions.
Actually they can't point to the same memory, since you can write to them.
Now, if you were to say:
char *s1 = "MyString";
char *s2 = "MyString";
then indeed it is possible that s1 == s2. I don't think it's guaranteed, though.
I would probably not count on it as it is compiler dependent on how symbols get resolved like this. Most compilers create 1 instance of the constant string and then simply reference that but I would not depend on this as there are cases when this does not work. Personally I wouldn't use strings like that in your lock and unlock method. An enum will probably serve you better.
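If you still want the fast path, one option (a sketch, with a hypothetical helper name) is to treat pointer equality as a shortcut only and fall back to strcmp, so correctness never depends on the literals being merged:

```c
#include <string.h>

/* Returns nonzero if both names denote the same string.
   Equal pointers imply equal contents; unequal pointers prove nothing,
   so the contents are compared in that case. */
int same_name(const char *a, const char *b) {
    return a == b || strcmp(a, b) == 0;
}
```

When the compiler does merge the literals, the comparison ends at the pointer test; when it does not, the result is still correct.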
The linker doesn't have to merge anything. It just has to map a declared type to a defined type. That involves finding the defined type and filling out the address offsets to jump to the right item.
What you are talking about would be an optimizing linker. Many linkers don't optimize at all, and those that do aren't held to an optimizing standard, so you'll never be able to generalize beyond the observed findings for the linker you discover (on that machine, at that time).
I think you are on the wrong track. Not only does the linker not merge string literals from different compilation units, it most probably doesn't even create an external symbol for them at all.
Even inside the same compilation unit, the compiler may merge two occurrences of the same string literal into one, but it is not obliged to do so:
It is unspecified whether these arrays are distinct provided their elements have the appropriate values.
Now to the real problem that seems to be at the source of your question. Doing lock/unlock pairs based on string literals is probably not a good idea in C. As you say this has the tendency to bloat your code with switch or similar stuff.
More natural would be to have a lock type and to declare a global variable with a distinguishable name for each of your distinctive lock/unlock events. This forces you to declare each such event in a way that is visible in all your compilation units, and to pick out one compilation unit where you actually define it.
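A rough sketch of that suggestion follows. The struct and function names are assumptions made for the illustration, and the held flag stands in for whatever real locking primitive the program uses:

```c
/* In a shared header (hypothetical names throughout). */
struct named_lock {
    const char *name;  /* kept only for diagnostics */
    int held;          /* placeholder for a real lock state */
};

/* The header would declare: extern struct named_lock test_lock;
   Exactly one compilation unit then defines it: */
struct named_lock test_lock = { "test", 0 };

/* Locks are identified by the object's address, not by string contents,
   so no string comparison or switch is needed. */
void lock_acquire(struct named_lock *l) { l->held = 1; }
void lock_release(struct named_lock *l) { l->held = 0; }
```

Callers then write lock_acquire(&test_lock); and lock_release(&test_lock); and a misspelled lock name becomes a compile error instead of a silent mismatch.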

Resources