C string literals linking - c

As far as I am aware, the linker tries to merge two string literals into one single literal, if they are both the same, e.g.:
file1.c
char const* firstString = "foo";
file2.c
char const* secondString = "foo";
Would result in only one occurence of foo\0 in the respective memory section (saving 4 Bytes). This is especially important for embedded applications (how does avr-gcc vs. gcc behave).
But I was wondering if I can actually count on this to happen, and rely, that if two strings are equal, also their pointers are equal (provided that in the whole program, you only pass string literals around and no runtime generated strings exist -- which is a reasonable assumption in my case). Obviously, I want to speed up speed comparisons with this, and allow a commonly used function to receive a string literal like so:
void lock(char const*);
void unlock(char const*);
lock("test");
dosmth();
unlock("test");
In essence, I want to avoid having a huge enum and huge switches inside the lock/unlock functions.

Actually they can't point to the same memory, since you can write to them.
Now, if you were to say:
char *s1 = "MyString";
char *s2 = "MyString";
then indeed it is possible that s1 == s2. Don't think it's guaranteed.

I would probably not count on it as it is compiler dependent on how symbols get resolved like this. Most compilers create 1 instance of the constant string and then simply reference that but I would not depend on this as there are cases when this does not work. Personally I wouldn't use strings like that in your lock and unlock method. An enum will probably serve you better.

The linker doesn't have to merge anything. It just has to map a declared type to a defined type. That involves finding the defined type and filling out the address offsets to jump to the right item.
What you are talking about would be an optimizing linker. Many linkers don't optimize at all, and those that do aren't held to an optimizing standard, so you'll never be able to generalize beyond the observed findings for the linker you discover (on that machine, at that time).

I think you are on the wrong track. Not only that the linker doesn't merge string literals from different compilation units, it most probably doesn't even create an external symbol for them at all.
Even inside the same compilation unit, the compiler may merge two occurrences of the same string literal into one, but it is not obliged to do so:
It is unspecified whether these arrays are distinct provided their
elements have the appropriate values.
Now to the real problem that seems to be at the source of your question. Doing lock/unlock pairs based on string literals is probably not a good idea in C. As you say this has the tendency to bloat your code with switch or similar stuff.
More natural would be to have a lock type and to declare a global variable with a distinguishable name for each of your distinctive lock/unlock events. This forces you to declare each such event in a way that is visible in all your compilation units, and to pick out one compilation unit where you actually define it.

Related

Is passing empty strings ("") in C bad practice?

I'm trying to fix a bug in very old C code that just popped up recently on Windows after we updated Visual Studio to version 2017. The code still runs on several linux platforms. That tells me we're probably relying on some undefined behavior, and we previously got lucky.
We have a bunch of function calls like this:
get_some_data(parent, "ge", "", type);
Running in debug, I noticed that upon entry to this function, that empty string is immediately filled with garbage, before the function has done anything. The function is declared like this:
static void get_some_data(
KEY Parent,
char *Prefix,
char *Suffix,
ENT EntType)
So is it unwise to pass the strings directly ("ge", "")? I know it's trivial to fix this case by declaring char *suffix="" and passing suffix instead of "", but I'm now questioning whether I need to go through this entire suite of code looking for this type of function call.
So is it unwise to pass the strings directly ("ge", "")?
There is nothing inherently wrong with passing string literals to functions in general.
However, it is unwise to pass pointers (in)to string literals specifically to parameters declared as pointers to non-const char, because C specifies that undefined behavior results from attempting to modify a string literal, and in practice, that UB often manifests as abrupt program termination. If a function declares a parameter as const char * then you can reasonably take that as a promise that it will not attempt to modify the target of that pointer -- which is what you need to ensure -- but if it declares a parameter as just char * then no such promise is made, and the function doesn't even have a way to check at runtime whether the argument is writable.
Possibly you can rely on documentation in place of const-qualification, for you're ok in this regard as long as no attempt is made in practice to modify a string literal, but that still leaves you more open to bugs than you otherwise would be.
I know it's trivial to fix this case by declaring char *suffix="" and passing suffix instead of ""
Such a change may disguise what you're doing from the compiler, so that it does not warn about the function call, but it does not fix anything. The same pointer value is passed to the function either way, and the same semantics and constraints apply. Also, if the compiler warned about the function call then it should also warn about the assignment.
This is not an issue in C++, by the way, or at least not the same issue, because in C++, string literals represent arrays of const char in the first place.
, but I'm now questioning whether I need to go through this entire suite of code looking for this type of function call.
Better might be to modify the signatures of the called functions. Where you intend for it to be ok to pass a string literal, ensure that the parameter has type const char *, like so:
static void get_some_data(
KEY Parent,
const char *Prefix,
const char *Suffix,
ENT EntType)
But do note that is highly likely to cause new warnings about violations of const-correctness. To ensure safety, you need to fix these, too, without casting away constness. This could well cascade broadly, but the exercise will definitely help you identify and fix places where your code was mishandling string literals.
On the other hand, a genuine fix that might be less pervasive would be to pass pointers to modifiable arrays instead of (unmodifiable) string literals. Perhaps that's what you had in mind with your proposed fix, but the correct way to do that is this:
char prefix[] = "ge";
char suffix[] = "";
get_some_data(parent, prefix, suffix, type);
Here, prefix and suffix are separate (modifiable) local arrays, initialized with copies of the string literals' contents.
With all that said, I'm inclined to suspect that if you're getting bona fide runtime errors related to these arguments with VS-compiled executables but not GCC-compiled ones, then the source of those is probably something else. My first guess would be that array bounds are being overrun. My second guess would be that you are compiling C code as C++, and running afoul of one or more of the (other) differences between them.
That's not to say that you shouldn't take a good look at the constness / writability concerns involved here, but it would suck to go through the whole exercise just to find out that you were ok to begin with. You could still end up with better code, but that's a little tricky to sell to the boss when they ask why the bug hasn't been fixed yet.
No, there is absolutely nothing wrong with passing a string literal, empty or not. Quite the opposite — if you try to "fix" your code by doing your trivial change, you will hide the bug and make life harder for whoever is going to fix it in future.
I encountered this same problem and the cause was a similar situation in a recently executed function where, crucially, the contents of the string were being changed.
Prior to the call where this strange behaviour was being observed, there was a call to older code with a parameter of PSTR type. Not realizing that the contents of that parameter were going to be changed, a programmer had supplied an empty string. The code was only ever updating the first character so declaring a char type parameter and supplying the address of that was sufficient to solve the problem that was exhibiting in the later call.

Reading registers in the GCC using a char pointer

I have recently started learning how to use the inline assembly in C Code and came across an interesting feature where you can specify registers for local variables (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables).
The usage of this feature is as follows:
register int *foo asm ("r12");
Then I started to wonder whether it was possible to insert a char pointer such as
const char d[4] = "r12";
register int *foo asm (d);
but got the error: expected string literal before ‘d’ (as expected)
I can understand why this would be a bad practice, but is there any possible way to achieve a similar effect where I can use a char pointer to access the register? If not, is there any particular reason why this is not allowed besides the potential security issues?
Additionally, I read this StackOverflow question: String literals: pointer vs. char array
Thank you.
The syntax to initialize the variable would be register char *foo asm ("r12") = d; to point an asm-register variable at a string. You can't use a runtime-variable string as the register name; register choices have to get assembled into machine code at compile time.
If that's what you're trying to do, you're misunderstanding something fundamental about assembly language and/or how ahead-of-time compiled languages compile into machine code. GCC won't make self-modifying code (and even if it wanted to, doing that safely would require redoing register allocation done by the ahead-of-time optimizer), or code that re-JITs itself based on a string.
(The first time I looked at your question, I didn't understand what you were even trying to do, because I was only considering things that are possible. #FelixG's comment was the clue I needed to make sense of the question.)
(Also note that registers aren't indexable; even in asm you can't use a single instruction to read a register number selected by an integer in another register. You could branch on it, or store all the registers in memory and index that like variadic functions do for their incoming register args.)
And if you do want a compile-time constant string literal, just use it with the normal syntax. Use a CPP macro if you want the same string to initialize a char array.

Can a string literal in C be modified?

I recently had a question, I know that a pointer to a constant array initialized as it is in the code below, is in the .rodata region and that this region is only readable.
However, I saw in pattern C11, that writing in this memory address behavior will be undefined.
I was aware that the Borland's Turbo-C compiler can write where the pointer points, this would be because the processor operated in real mode on some systems of the time, such as MS-DOS? Or is it independent of the operating mode of the processor? Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
#include <stdio.h>
int main(void) {
char *st = "aaa";
*st = 'b';
return 0;
}
In this code compiling with Turbo-C in MS-DOS, you will be able to write to memory
As has been pointed out, trying to modify a constant string in C results in undefined behavior. There are several reasons for this.
One reason is that the string may be placed in read-only memory. This allows it to be shared across multiple instances of the same program, and doesn't require the memory to be saved to disk if the page it's on is paged out (since the page is read-only and thus can be reloaded later from the executable). It also helps detect run-time errors by giving an error (e.g. a segmentation fault) if an attempt is made to modify it.
Another reason is that the string may be shared. Many compilers (e.g., gcc) will notice when the same literal string appears more than once in a compilation unit, and will share the same storage for it. So if a program modifies one instance, it could affect others as well.
There is also never a need to do this, since the same intended effect can easily be achieved by using a static character array. For instance:
#include <stdio.h>
int main(void) {
static char st_arr[] = "aaa";
char *st = st_arr;
*st = 'b';
return 0;
}
This does exactly what the posted code attempted to do, but without any undefined behavior. It also takes the same amount of memory. In this example, the string "aaa" is used as an array initializer, and does not have any storage of its own. The array st_arr takes the place of the constant string from the original example, but (1) it will not be placed in read-only memory, and (2) it will not be shared with any other references to the string. So it's safe to modify it, if in fact that's what you want.
Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
GCC 3 and earlier used to support gcc -fwriteable-strings to let you compile old K&R C where this was apparently legal, according to https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Incompatibilities.html. (It's undefined behaviour in ISO C and thus a bug in an ISO C program). That option will define the behaviour of the assignment which ISO C leaves undefined.
GCC 3.3.6 manual - C Dialect options
-fwritable-strings
Store string constants in the writable data segment and don't uniquize them. This is for compatibility with old programs which assume they can write into string constants.
Writing into string constants is a very bad idea; “constants” should be constant.
GCC 4.0 removed that option (release notes); the last GCC3 series was gcc3.4.6 in March 2006. Although apparently it had become buggy in that version.
gcc -fwritable-strings would treat string literals like non-const anonymous character arrays (see #gnasher's answer), so they go in the .data section instead of .rodata, and thus get linked into a segment of the executable that's mapped to read+write pages, not read-only. (Executable segments have basically nothing to do with x86 segmentation, it's just a start+range memory-mapping from the executable file to memory.)
And it would disable duplicate-string merging, so char *foo() { return "hello"; } and char *bar() { return "hello"; } would return different pointer values, instead of merging identical string literals.
Related:
How can some GCC compilers modify a constant char pointer?
https://softwareengineering.stackexchange.com/questions/294748/why-are-c-string-literals-read-only
Linker option: still Undefined Behaviour so probably not viable
On GNU/Linux, linking with ld -N (--omagic) will make the text (as well as data) section read+write. This may apply to .rodata even though modern GNU Binutils ld puts .rodata in its own section (normally with read but not exec permission) instead of making it part of .text. Having .text writeable could easily be a security problem: you never want a page with write+exec at the same time, otherwise some bugs like buffer overflows can turn into code-injection attacks.
To do this from gcc, use gcc -Wl,-N to pass on that option to ld when linking.
This doesn't do anything about it being Undefined Behaviour to write const objects. e.g. the compiler will still merge duplicate strings, so writing into one char *foo = "hello"; will affect all other uses of "hello" in the whole program, even across files.
What to use instead:
If you want something writeable, use static char foo[] = "hello"; where the quoted string is just an array initializer for a non-const array. As a bonus, this is more efficient than static char *foo = "hello"; at global scope, because there's one fewer level of indirection to get to the data: it's just an array instead a pointer stored in memory.
You are asking whether or not the platform may cause undefined behavior to be defined. The answer to that question is yes.
But you are also asking whether or not the platform defines this behavior. In fact it does not.
Under some optimization hints, the compiler will merge string constants, so that writing to one constant will write to the other uses of that constant. I used this compiler once, it was quite capable of merging strings.
Don't write this code. It's not good. You will regret writing code in this style when you move onto a more modern platform.
Your literal "aaa" produces a static array of four const char 'a', 'a', 'a', '\0' in an anonymous location and returns a pointer to the first 'a', cast to char*.
Trying to modify any of the four characters is undefined behaviour. Undefined behaviour can do anything, from modifying the char as intended, pretending to modify the char, doing nothing, or crashing.
It's basically the same as static const char anonymous[4] = { 'a', 'a', 'a', '\0' }; char* st = (char*) &anonymous [0];
To add to the correct answers above, DOS runs in real mode, so there is no read only memory. All memory is flat and writable. Hence, writing to the literal was well defined (as it was in any sort of const variable) at the time.

Does the preprocessor prepare a list of unique constant strings before the compiler goes into action?

In the code below, I have two different local char* variables declared in two different functions.
Each variable is initialized to point to a constant string, and the contents of the two strings are identical.
Checking in runtime, the variables are initialized to point to the same address in memory.
So the compiler must have assigned the same (constant) value to each one of them.
How is that possible?
#include <stdio.h>
void PrintPointer()
{
char* p = "abc";
printf("%p\n",p);
}
int main()
{
char* p = "abc";
printf("%p\n",p);
PrintPointer();
return 0;
}
It has nothing to do with the preprocessor. But the compiler is explicitly allowed (not required) by the standard to share the memory for identical string literals. For details on when this happens, you must consult your compiler's documentation.
For example, here's the relevant documentation for VC2013:
In some cases, identical string literals may be pooled to save space in the executable file. In string-literal pooling, the compiler causes all references to a particular string literal to point to the same location in memory, instead of having each reference point to a separate instance of the string literal. To enable string pooling, use the /GF compiler option.
The C++ standard says in N3797 2.14.15/12:
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation defined. The effect of attempting to modify a string literal is undefined.
The C standard now contains the same wording. Historically it was possible to modify string literals at run-time in C, but this is now Undefined Behaviour. Some compilers may allow it, some not.
Technically, the compiler does it by storing string literals in the symbol table. If an identical string is seen more than once, the same symbolic reference is used each time. The same technique might well be used for other literals, but would not be so easily detected.
The preprocessor, by the way, has nothing to do with it.
How is that possible?
It's possible because the compiler keeps track of values like that. But no, the preprocessor generally doesn't get involved in things like this; the preprocessor does things like macro substitutions that modify the code before the compiler starts working. In this case, though, we're talking about actual code:
char* p = "abc";
and that's the domain of the compiler, not the preprocessor.
So the compiler must have assigned the same (constant) value to each one of them. How is that possible?
If you have two identical string literals, as you do here, then the compiler is allowed to combine them into a single one; apparently, your compiler does that. It's also allowed to store them separately.

C function parameter optimization: (MyStruct const * const myStruct) vs. (MyStruct const myStruct)

Example available at ideone.com:
int passByConstPointerConst(MyStruct const * const myStruct)
int passByValueConst (MyStruct const myStruct)
Would you expect a compiler to optimize the two functions above such that neither one would actually copy the contents of the passed MyStruct?
I do understand that many optimization questions are specific to individual compilers and optimization settings, but I can't be designing for a single compiler. Instead, I would like to have a general expectation as to whether or not I need to be passing pointers to avoid copying. It just seems like using const and allowing the compiler to handle the optimization (after I configure it) should be a better choice and would result in more legible and less error prone code.
In the case of the example at ideone.com, the compiler clearly is still copying the data to a new location.
In the first case (passing a const pointer to const) no copying occurs.
In the second case, copying does occur and I would not expect that to be optimized out if for no other reason because the address of the object is taken and then passed through an ellipsis into a function and from the point of view of the compiler, who knows what the function does with that pointer?
More generally speaking, I don't think changing call-by-value into call-by-reference is something compilers do. If you want copy by reference, implement it yourself.
Is it theoretically possible that a compiler could detect that it could just convert the function to be pass-by-reference? Yes; nothing in the C standard says it cannot..
Why are you worrying about this? If you are concerned about performance, has profiling shown copy-by-value to be a significant bottleneck in your software?
This topic is addressed by the comp.lang.c FAQ:
http://c-faq.com/struct/passret.html
When large structures are passed by value, this is commonly optimized by actually passing the address of the object rather than a copy. The callee then determines whether a copy needs to be made, or whether it can simply work with the original object.
The const qualifier on the parameter makes no difference. It is not part of the type; it is simply ignored. That is to say, these two function declarations are equivalent:
int foo(int);
int foo(const int);
It's possible for the declaration to omit the const, but for the definition to have it and vice versa. The optimization of the call cannot hinge on this const in the declaration. That const is not what creates the semantics that the object is passed by value and hence the original cannot be modified.
The optimization has to preserve the semantics; it has to look as if a copy of the object was really passed.
There are two ways you can tell that a copy was not passed: one is that a modification to the apparent copy affects the original. The other way is to compare addresses. For instance:
int compare(struct foo *ptr, struct foo copy);
Inside compare we can take the address of copy and see whether it is equal to ptr. If the optimization takes place even though we have done this, then it reveals itself to us.
The second declaration is actually a direct request by the user to receive a copy of the passed struct.
const modifier eliminates the possibility of any modifications made to the local copy, however, it is does not eliminate all the reasons for copying.
Firstly, the copy has to maintain its address identity, meaning that inside the second function the &myStruct expression should produce a value different from the address of any other MyStruct object. A smart compiler can, of course, detect the situations that depend on the address identity of the object.
Secondly, aliasing presents another problem. Imagine that the program has a global pointer MyStruct *global_struct and inside the second function someone modifies the *global_struct. There's a possibility that the *global_struct is the same struct object that was passed to the function as an argument. If no copy was made, the modifications made to *global_struct will be visible through the local parameter, which is a disaster. Aliasing issues are much more difficult (and in general case impossible) to resolve at compilation time, which is why compilers usually won't be able to optimize out the copying.
So, I would expect any compiler to perform the copying, as requested.

Resources