two strings separated by blank being concatenated automatically - c

I just found something very interesting which was introduced by my typo. Here's a sample of very easy code script:
printf("A" "B");
The result would be
$> AB
Can someone explain how this happens?

As a part of the C standard, string literals that are next to one another are concatenated:
For C (quoting C99, but C11 has something similar in 6.4.5p5):
(C99, 6.4.5p5) "In translation phase 6, the multibyte character
sequences specified by any sequence of adjacent character and
identically-prefixed string literal tokens are concatenated into a
single multibyte character sequence."
C++ has a similar standard.

This is standard behaviour and can be very useful when splitting a very long string constant over multiple lines.

This is string concatenation, part of C standard. Any two or more consecutive string literals are combined into one.

Related

Are trigraphs required to write a newline character in C99 using only ISO 646?

Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly. For instance, one could opt to write a Hello World program as such:
%:include <stdio.h>
%:include <stdlib.h>
int main()
<%
fputs("Hello World!??/n", stdout);
return EXIT_SUCCESS;
%>
However, besides digraphs, I used the ??/ trigraph to write the \ character.
Given my assumptions above, is it possible to either
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
write a newline to a FILE * without using the '\n' character?
For stdout you could just use puts("") to output a newline. Or indeed replace the fputs in your original program with puts and delete the \n.
If you want to get the newline character into a variable so you can do other things with it, I know another standard function that gives you one for free:
int gimme_a_newline(void)
{
time_t t = time(0);
return strchr(ctime(&t), 0)[-1];
}
You could then say
fprintf(stderr, "Hello, world!%c", gimme_a_newline());
(I hope all of the characters I used are ISO646 or digraph-accessible. I found it surprisingly difficult to get a simple list of which ASCII characters are not in ISO646. Wikipedia has a color-coded table with not nearly enough contrast between colors for me to tell what's what.)
Your premise:
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly.
is questionable. C99 defines "source" and "execution" character sets, and requires that both include representations of the backslash character (C99 5.2.1). The only reason I can imagine for an effort such as you describe would be to try to produce source code that does not require character set transcoding upon movement among machines. In that case, however, the choice of ISO 646 as a common baseline is odd. You're more likely to run into an EBCDIC machine than one that uses an ISO 646 variant that is not coincident with the ISO-8859 family of character sets. (And if you can assume ISO 8859, then backslash does not present a problem.)
Nevertheless, if you insist on writing C source code without using a literal backslash character, then the trigraph for that character is the way to do so. That's what trigraphs were invented for. In character constants and string literals, you cannot portably substitute anything else for \n or its trigraph equivalent, ??/n, because it is implementation-dependent how that code is mapped. In particular, it is not safe to assume that it maps to a line-feed character (which, however, is included among the invariant characters of ISO 646).
Update:
You ask specifically whether it is possible to
include the '\n' character (which is translated to a newline in functions) in a string without the use of trigraphs, or
No, it is not possible, because there is no one '\n' character. Moreover, there seems to be a bit of a misconception here: \n in a character or string literal represents one character in the execution character set. The compiler is therefore responsible for that transformation, not the stdio functions. The stdio functions' responsibility is to handle that character on output by writing a character or character sequence intended to produce the specified effect ("[m]oves the active position to the initial position of the next line").
You also ask whether it is possible to
write a newline to a FILE * without using the '\n' character?
This one depends on exactly what you mean. If you want to write a character whose code in the execution character set you know, then you can write a numeric constant having that numeric value. In particular, if you want to write the character with encoded value 0xa (in the execution character set) then you can do so. For example, you could
fputc(0xa, my_file);
but that does not necessarily produce a result equivalent to
fputc('\n', my_file);
Short answer is, yes, for what you want to do, you have to use this trigraph.
Even if there was a digraph for \, it would be useless inside a string literal because digraphs must be tokens, they are recognized by the tokenizer, while trigraphs are pre-processed and so still work inside string literals and the like.
Still wondering why somebody would encode source this way today ... :o
No. \n (or its trigraph equivalent) is the portable representation of a newline character.
No. You'd have to represent the literal newline somehow, and \n (or it's trigraph equivalent) is the only portable representation.
It's very unusual to find C source code that uses trigraphs or digraphs! Some compilers (e.g. GNU gcc) require command-line options to enable the use of trigraphs and assume they have been used unintentionally and issues a warning if it encounters them in the source code.
EDIT: I forgot about puts(""). That's a sneaky way to do it, but only works for stdout.
Yes of course it's possible
fputc(0x0A, file);

strings one after the other in c : "...""...", as if they were concatenate

I work on a c code that was not written by me, and there is lots of fprintf calls like this :
fprintf(file, "blabla1""blabla2%s""blabla3", mystring);
I had never seen that we could put several strings in the second argument of fprintf, is this a sort of concatenation ? Or is this a feature of fprintf ? If so, the specification of fprintf does not mention it ?
This is feature of string literals, they will be concatated if they are adjacent. If we look at the draft C99 standard section 6.4.5 String literals paragraph 4 says:
In translation phase 6, the multibyte character sequences specified by any sequence of
adjacent character and wide string literal tokens are concatenated into a single multibyte
character sequence. If any of the tokens are wide string literal tokens, the resulting
multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.
As Lundin points out a simpler quote can be found in section 5.1.1.2 Translation phases paragraph 6:
Adjacent string literal tokens are concatenated.
No, this is not a feature of fprintf(), that would be impossible (how would you implement such a function yourself?) since fprintf() is just a standard function with no extra magic done by the compiler.
It's a feature of C's syntax: adjacent string literals are treated as a single literal by just concatenating them together.
It's very useful together with the preprocessor's stringification support, for instance.
In the code you show, there is only one format code: "%s". It accepts the value contained in mystring, so the result will be: "blablablabla2_contents of mystring_blabla3"
Yes, this is legal code. I am not sure why someone would do this though.
I'll answer each question in turn.
is this a sort of concatenation ?
You hit the nail on the head. Yes indeed.
Or is this a feature of fprintf ?
Nope, just part of C syntax.
If so, the specification of fprintf does not mention it ?
That isn't actually a question, despite the punctuation, but you're probably correct that the fprintf specification does not mention this type of concatenation, and that's because it's because it's part of the language, not the specific function.

Why does this C program compile without an error?

I'm a beginner in C, and I was playing with C. I typed a C code like this:
#include <stdio.h>
int main()
{
printf("hello world\n");
\
return 0;
}
Even though I used \ knowingly, the C compiler doesn't throw any error. What is this symbol used for in the C language?
Edit:
Even this works:
"\n";
The sequence backslash-newline is removed from the code in a very early phase (phase 2) of the translation process. It used to be how you created long string literals before there was string concatenation, and is how you still extend macros over multiple lines.
See §5.1.1.2 Translation Phases of the C99 standard:
The precedence among the syntax rules of translation is specified by the following
phases.5)
Physical source file multibyte characters are mapped, in an implementation defined
manner, to the source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.
Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
The source file is decomposed into preprocessing tokens6) and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
Each source character set member and escape sequence in character constants and
string literals is converted to the corresponding member of the execution character
set; if there is no corresponding member, it is converted to an implementation defined
member other than the null (wide) character.7)
Adjacent string literal tokens are concatenated.
White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
All external object and function references are resolved. Library components are
linked to satisfy external references to functions and objects not defined in the
current translation. All such translator output is collected into a program image
which contains information needed for execution in its execution environment.
5) Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.
6) As described in 6.4, the process of dividing a source file’s characters into preprocessing tokens is
context-dependent. For example, see the handling of < within a #include preprocessing directive.
7) An implementation need not convert all non-corresponding source characters to the same execution
character.
If you had a blank or any other character after your stray backslash, you would have a compilation error. We can tell that you don't have anything after it because you don't have a compilation error.
The other part of your question, about:
"\n";
is quite different. It is a simple expression that has no side-effects and therefore no effect on the program. The optimizer will completely discard it. When you write:
i = 1;
you have an expression with a value that is discarded; it is evaluated for its side-effect of modifying i.
Sometimes, you'll find code like:
*ptr++;
The compiler will warn you that the result of the expression is discarded; the expression can be simplified to:
ptr++;
and will achieve the same effect in the program.
The \, when immediately followed by a newline, is consumed by preprocessing and causes the next "physical" line to be joined to the current logical line. This is very important for writing long preprocessing directives, which have to be all on one logical line:
#define SHORT very log macro \
consisting of lots and \
lots of preprocessor \
tokens
If you remove the backslash-newline sequences, it is no longer correct. Some other languages from the Unix culture have a similar backslash line continuation syntax: the POSIX shell language derived from the Bourne shell, and also makefiles.
$ this is \
one shell command
About "\n";, that is a primary expression used to form an expression-statement. In C, expressions can be used as statements, and this is exploited all the time. Your printf call, for instance, is an expression statement. printf("hello world\n") is a postfix expression which calls a function, obtaining a return value. Because you used this expression as a statement, the return value is thrown away. The return value of printf
indicates how many characters were printed, or whether it was successful at all, so by throwing it away, your program makes itself oblivious to whether the printf call actually worked.
Since the value of an expression-statement is discarded, if such a statement also has no side effects, it is a useless statement which does nothing (like your "\n"). But such useless expression statements are not erroneous. If you add warning options to your compiler command line you might get a warning such as "statement with no effect" or something like that.
The backslash \ get interpreted by the C preprocessor. It protect its following character (the new line character on your case).
The backslash is simply escaping the next character. In this case, probably a line end (CR) character. Perfectly reasonable.
The backslash plus what is following it is an escape sequence; "\n" together is the newline character (prints a newline). Another important one is "\t", for tab.

C string append

I've got two C strings that I want to append and result should be assigned to an lhs variable. I saw a static initialization code like:
char* out = "May God" "Bless You";.
The output was really "May GodBless You" on printing out. I understand this result can be output of some undefined behaviour.
The code was actually in production and never gave wrong results. And it was not like we had such statements only at one place. It could be seen at multiple places of very much stable code and were used to form sql queries.
Does C standard allow such concatenation?
Yes, it is guaranteed.
Extract from http://en.wikipedia.org/wiki/C_syntax#String_literal_concatenation :
Adjacent string literals are
concatenated at compile time; this
allows long strings to be split over
multiple lines, and also allows string
literals resulting from C preprocessor
defines and macros to be appended to
strings at compile time
Yes. this concatenation is allowed in C, it is not undefined behavior.
Although I think it should produce "May GodBless You" (since there is no space in the quoted part)
The Standard says
5.1.1.2 Translation phases
6. Adjacent string literal tokens are concatenated.
So, the Solaris compiler was doing the right thing.

Occurrences of question mark in C code

I am doing a simple program that should count the occurrences of ternary operator ?: in C source code. And I am trying to simplify that as much as it is possible. So I've filtered from source code these things:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurances of questionmarks.
So my question question is: Is there any other symbol, operator or anything else what could cause problem - contain '?' ?
Let's suppose that the source is syntax valid.
I think you found all places where a question-mark is introduced and therefore eliminated all possible false-positives (for the ternary op). But maybe you eliminated too much: Maybe you want to count those "?:"'s that get introduced by macros; you dont count those. Is that what you intend? If that's so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
I didn't include trigraph sequences since they are a compiler extension and not "valid C". I don't mean you should remove the check from your program, I'm trying to say you already went further then what's needed for ANSI C.
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character can not occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I would compiler this C file with -D somemacro="15 // this is a comment"?

Resources