I've got two C strings that I want to append, with the result assigned to a variable on the left-hand side. I saw a static initialization like:
char* out = "May God" "Bless You";
When printed, the output really was "May GodBless You". I understand this result could be the product of some undefined behaviour.
The code was actually in production and never gave wrong results. Nor did such statements appear in only one place: they occur in multiple places in very stable code, where they are used to build SQL queries.
Does C standard allow such concatenation?
Yes, it is guaranteed.
Extract from http://en.wikipedia.org/wiki/C_syntax#String_literal_concatenation :
Adjacent string literals are
concatenated at compile time; this
allows long strings to be split over
multiple lines, and also allows string
literals resulting from C preprocessor
defines and macros to be appended to
strings at compile time
Yes, this concatenation is allowed in C; it is not undefined behavior.
Although I think it should produce "May GodBless You" (since there is no space in the quoted part)
The Standard says
5.1.1.2 Translation phases
6. Adjacent string literal tokens are concatenated.
So, the Solaris compiler was doing the right thing.
String literals are lvalues, which leaves the door open to modify string literals.
From C in a Nutshell:
In C source code, a literal is a token that denotes a fixed value, which may be an integer, a floating-point number, a character, or a string. A literal’s type is determined by its value and its notation.
The literals discussed here are different from compound literals, which were introduced in the C99 standard. Compound literals are ordinary modifiable objects, similar to variables.
Although C does not strictly prohibit modifying string literals, you should not attempt to do so. For one thing, the compiler, treating the string literal as a constant, may place it in read-only memory, in which case the attempted write operation causes a fault. For another, if two or more identical string literals are used in the program, the compiler may store them at the same location, so that modifying one causes unexpected results when you access another.
The first paragraph says that "a literal in C denotes a fixed value".
Does it mean that a literal (except compound literals) shouldn't be modified?
Since a string literal isn't a compound literal, should a string literal be modified?
The second paragraph says that "C does not strictly prohibit
modifying string literals" while compilers do. So should a string
literal be modified?
Do the two paragraphs contradict each other? How shall I understand them?
Can a literal which is neither compound literal nor string literal be modified?
From the C Standard (6.4.5 String literals)
7 It is unspecified whether these arrays are distinct provided their
elements have the appropriate values. If the program attempts to
modify such an array, the behavior is undefined.
As for your statement.
The second paragraph says that "C does not strictly prohibit modifying
string literals" while compilers do. So should a string literal be
modified?
Then compilers do not modify string literals. They may store identical string literals as one array.
As @o11c pointed out in a comment, Annex J (informative), Portability issues, says:
J.5 Common extensions
1 The following extensions are widely used in
many systems, but are not portable to all implementations. The
inclusion of any extension that may cause a strictly conforming
program to become invalid renders an implementation nonconforming.
Examples of such extensions are new keywords, extra library functions
declared in standard headers, or predefined macros with names that do
not begin with an underscore.
J.5.5 Writable string literals
1 String literals are modifiable (in which case, identical string
literals should denote distinct objects) (6.4.5).
Don't modify string literals. Treat them as char const[].
String literals are effectively char const[] (modifying them results in undefined behavior), but for legacy reason they're really char [], which means the compiler won't stop you from writing into them, but your program will still go undefined if you do.
Speaking more practically: not every hardware platform provides mechanisms to protect the memory locations where read-only objects are stored, so the behaviour had to be left undefined. There are 3 possible options:
Literals (and constant objects more generally) are kept in the RAM but the hardware does not provide memory protection mechanisms. Nothing can stop the programmer from writing to this location
Literals (and constant objects) are kept in the RAM but the hardware does provide memory protection mechanisms - you will get segfault
Read Only data is stored in the read only memory (for example uC FLASH). You can try to write it but there is no effect of it (example ARM). No hardware exception raised
The first paragraph says that "a literal in C denotes a fixed value".
Does it mean that a literal (except compound literals) shouldn't be modified?
I don't know what the author's intention was, but modification of the array resulting from a string literal during runtime is blatantly undefined, according to C11/6.4.5p7: "If the program attempts to modify such an array, the behavior is undefined."
It should also be noted that attempts to modify a const-qualified compound literal during runtime will also result in undefined behavior, which is explained along-side some volatile-related undefined behaviour in C11/6.7.3p6. It is otherwise well defined to modify compound literals.
For example:
char *fubar = "hello world";
(*fubar)++; // SQUARELY UNDEFINED BEHAVIOUR!
char *fubar = (char[]){"hello world"};
(*fubar)++; // This is well defined.
Literally replacing "hello world" with "goodbye galaxy", in either piece of source code, is fine. Redefining standard functions, however (i.e. #define memcpy strncpy or #define size_t signed char, which are both great ways to ruin someone else's day), is undefined behaviour.
Since a string literal isn't a compound literal, should a string literal be modified?
The array resulting from a string literal should certainly not be modified during runtime, for any attempt to do so would trigger undefined behaviour.
The string literal itself, which exists as a quoted sequence of characters within your source code, on the other hand... of course, that can be modified as you choose. You're not obliged to modify it, though.
The second paragraph says that "C does not strictly prohibit modifying string literals" while compilers do. So should a string literal be modified?
The C standard doesn't strictly prohibit a lot of undefined behavior; it leaves the behavior undefined, meaning your program is likely to behave erratically or be non-portable. In the realms of well defined C, your programs should not invoke any undefined behaviour, including overflowing arrays, modifying const-qualified objects or the arrays resulting from string literals, race conditions caused by multithreading, etc.
If you want to invoke undefined behaviour, C will let you shoot yourself in the foot. You might have a good reason for doing so; perhaps your program will be more optimal, or perhaps your compiler actually lets you modify string literals ("it's a feature, not a bug", they say, "so give us your money", they say, as you become reliant upon their non-standard quirks). Be aware that some compilers will instead behave as though the attempted modification didn't occur, or crash, or there could be some vulnerability caused.
... and above all else, be aware that your code will no longer be compliant C code!
Do the two paragraphs contradict each other?
By omission, perhaps. The first paragraph does state that the values are fixed, and the second paragraph that the values might be modifiable during runtime through invocation of undefined behaviour.
I think the author meant to make the distinction between elements of source code and the runtime environment. He/she could simply clarify this by ensuring it's explicit that literals should not be modified during runtime, for example.
How shall I understand them?
In the realms of C such values can't change during runtime because invoking undefined behaviour means the code in question is no longer compliant C code.
Perhaps they were trying to avoid explaining undefined behaviour, because it may seem too complex to explain. If you look deeper into the subject, you'll find that the meaning is, as predicted, roughly a conjunction of the two words.
undefined: /ʌndɪˈfʌɪnd/ adj. not clear or defined.
behaviour: /bɪˈheɪvjə/ noun. the way in which a machine or natural phenomenon works or functions
That is to say, an attempt to modify the array resulting from a string literal during runtime results in "unclear functionality". It's not required to be documented anywhere in the realms of computer science, and even if it is documented, that documentation might be a lie.
Can a literal which is neither compound literal nor string literal be modified?
As a lexical element in source code, providing it doesn't override a standard symbol, yes. Literals which aren't l-values (i.e. don't have any storage) such as integer constants, obviously can't be modified during runtime. I suppose it might be possible on some systems to attempt to modify the memory which a function pointer points at, which could be seen as a literal; that's also undefined behaviour and would result in code that isn't C.
It might also be possible to modify many other types of elements which aren't seen as objects by the C standard, such as the return address on the stack. That's what makes buffer overflows so subtly dangerous!
In the code below, I have two different local char* variables declared in two different functions.
Each variable is initialized to point to a constant string, and the contents of the two strings are identical.
Checking in runtime, the variables are initialized to point to the same address in memory.
So the compiler must have assigned the same (constant) value to each one of them.
How is that possible?
#include <stdio.h>

void PrintPointer(void)
{
    char* p = "abc";
    printf("%p\n", (void*)p);  /* %p expects a void* argument */
}

int main(void)
{
    char* p = "abc";
    printf("%p\n", (void*)p);
    PrintPointer();
    return 0;
}
It has nothing to do with the preprocessor. But the compiler is explicitly allowed (not required) by the standard to share the memory for identical string literals. For details on when this happens, you must consult your compiler's documentation.
For example, here's the relevant documentation for VC2013:
In some cases, identical string literals may be pooled to save space in the executable file. In string-literal pooling, the compiler causes all references to a particular string literal to point to the same location in memory, instead of having each reference point to a separate instance of the string literal. To enable string pooling, use the /GF compiler option.
The C++ standard says in N3797 2.14.5/12:
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation defined. The effect of attempting to modify a string literal is undefined.
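The C standard has similar wording: C11 6.4.5p7 leaves it unspecified (rather than implementation-defined) whether the arrays are distinct, and makes any attempt to modify them undefined. Historically it was possible to modify string literals at run time in some C implementations, but this is Undefined Behaviour. Some compilers may allow it, some not.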
Technically, the compiler does it by storing string literals in the symbol table. If an identical string is seen more than once, the same symbolic reference is used each time. The same technique might well be used for other literals, but would not be so easily detected.
The preprocessor, by the way, has nothing to do with it.
How is that possible?
It's possible because the compiler keeps track of values like that. But no, the preprocessor generally doesn't get involved in things like this; the preprocessor does things like macro substitutions that modify the code before the compiler starts working. In this case, though, we're talking about actual code:
char* p = "abc";
and that's the domain of the compiler, not the preprocessor.
So the compiler must have assigned the same (constant) value to each one of them. How is that possible?
If you have two identical string literals, as you do here, then the compiler is allowed to combine them into a single one; apparently, your compiler does that. It's also allowed to store them separately.
I work on C code that was not written by me, and there are lots of fprintf calls like this:
fprintf(file, "blabla1""blabla2%s""blabla3", mystring);
I had never seen that we could put several strings in the second argument of fprintf, is this a sort of concatenation ? Or is this a feature of fprintf ? If so, the specification of fprintf does not mention it ?
This is a feature of string literals: they are concatenated if they are adjacent. The draft C99 standard, section 6.4.5 String literals, paragraph 4 says:
In translation phase 6, the multibyte character sequences specified by any sequence of
adjacent character and wide string literal tokens are concatenated into a single multibyte
character sequence. If any of the tokens are wide string literal tokens, the resulting
multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.
As Lundin points out a simpler quote can be found in section 5.1.1.2 Translation phases paragraph 6:
Adjacent string literal tokens are concatenated.
No, this is not a feature of fprintf(), that would be impossible (how would you implement such a function yourself?) since fprintf() is just a standard function with no extra magic done by the compiler.
It's a feature of C's syntax: adjacent string literals are treated as a single literal by just concatenating them together.
It's very useful together with the preprocessor's stringification support, for instance.
In the code you show, there is only one format code: "%s". It accepts the value contained in mystring, so the result will be: "blabla1blabla2_contents of mystring_blabla3"
Yes, this is legal code. I am not sure why someone would do this though.
I'll answer each question in turn.
is this a sort of concatenation ?
You hit the nail on the head. Yes indeed.
Or is this a feature of fprintf ?
Nope, just part of C syntax.
If so, the specification of fprintf does not mention it ?
That isn't actually a question, despite the punctuation, but you're probably correct that the fprintf specification does not mention this type of concatenation, and that's because it's part of the language, not the specific function.
I just found something very interesting which was introduced by my typo. Here's a very simple code sample:
printf("A" "B");
The result would be
$> AB
Can someone explain how this happens?
As a part of the C standard, string literals that are next to one another are concatenated:
For C (quoting C11; C99 has similar wording in 6.4.5p4):
(C11, 6.4.5p5) "In translation phase 6, the multibyte character
sequences specified by any sequence of adjacent character and
identically-prefixed string literal tokens are concatenated into a
single multibyte character sequence."
C++ has a similar standard.
This is standard behaviour and can be very useful when splitting a very long string constant over multiple lines.
This is string concatenation, part of C standard. Any two or more consecutive string literals are combined into one.
I am doing a simple program that should count the occurrences of ternary operator ?: in C source code. And I am trying to simplify that as much as it is possible. So I've filtered from source code these things:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurrences of question marks.
So my question is: is there any other symbol, operator or anything else that could cause a problem by containing '?'?
Let's suppose that the source is syntactically valid.
I think you found all places where a question mark is introduced and therefore eliminated all possible false positives (for the ternary op). But maybe you eliminated too much: maybe you want to count those "?:"'s that get introduced by macros; you don't count those. Is that what you intend? If so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
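I didn't include trigraph sequences because they are replaced in the very first translation phase, before tokens are formed. (They are part of standard C through C17 rather than a compiler extension, although some compilers only process them on request.) I don't mean you should remove the check from your program; I'm trying to say you already went further than this list requires.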
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character can not occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I were to compile this C file with -D somemacro="15 // this is a comment"?