I am doing a simple program that should count the occurrences of ternary operator ?: in C source code. And I am trying to simplify that as much as it is possible. So I've filtered from source code these things:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurances of questionmarks.
So my question question is: Is there any other symbol, operator or anything else what could cause problem - contain '?' ?
Let's suppose that the source is syntax valid.
I think you found all places where a question-mark is introduced and therefore eliminated all possible false-positives (for the ternary op). But maybe you eliminated too much: Maybe you want to count those "?:"'s that get introduced by macros; you dont count those. Is that what you intend? If that's so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
I didn't include trigraph sequences since they are a compiler extension and not "valid C". I don't mean you should remove the check from your program, I'm trying to say you already went further then what's needed for ANSI C.
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character can not occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I would compiler this C file with -D somemacro="15 // this is a comment"?
Related
The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
return 0;
}
GCC Live demo here.
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
return 0;
}
GCC Live demo here.
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In the translation phase 3 the preprocessor will generate preprocessor tokens and having a " end up in the catch all non-white-space character that cannot be one of the above
is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference the preprocessing-token are:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Of which the unmatched " in your second example matches non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint the compiler is not obliged to diagnose it but it is certainly allowed to and using -pedantic-errors it even becomes an error godbolt session. As rici points out it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment. /* you look for the next */, // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second code causes undefined behaviour at translation phase 3.
Honestly I know the syntax of the C programming language well, but know almost nothing about the syntax of the C preprocessor, although I use it in my programming practice sometimes.
So the question. Suppose we have a simple macro that expands to nothing:
#define macro(param)
What is the restrictions for the syntax that can be put inside macro invoking construction?
It is certainly impossible to use single or multiple comma when invoking the macro:
macro(,); // won't compile
However if we put the comma into the brackets it will be accepted by C preprocessor:
macro((,)); // compiles fine
Of course, you can't use the comment characters:
macro(//); // compile error
because, as far as I know, comments are processed by preprocessor itself.
Unclosed quotes and round brackets aren't allowed too when using the macro:
macro("); // compile error
But characters unused in the C syntax are accepted well:
macro(##$); // compiles
Even characters of foreign languages work fine:
macro(бла-бла-бла я пишу по-русски); // compiles too
Can I use a random valid C/C++ code in curly brackets when invoking the macro? Can I use a random valid C/C++ code without curly brackets? The following code seems to compile fine:
macro(int a = 5; printf("%d\n", a););
You cannot pass arbitrary text to be ignored with your proposed scheme:
The C preprocessor will read the input file and parse it as a sequence of preprocessing tokens, after stripping comments, to pass to the macro as arguments. It matches parentheses to determine what tokens constitute arguments separated by ,.
Text containing strings or character constants with unrecognized escape sequences or stand alone backslashes does not parse as standard preprocessing token. Whether macro(##$); compiles is implementation dependent.
Note however that you can work around the , problem by defining your macro as taking a variable number of arguments:
#define macro(...)
Here's a formal grammar brain teaser (maybe :P)
I'm fairly certain there is no context where the character sequence => may appear in a valid C program (except obviously within a string). However, I'm unable to prove this to myself. Can you either:
Describe a method that I can use for an arbitrary character sequence to determine whether it is possible in a valid C program (outside a string/comment). Better solutions require less intuition.
Point out a program that does this. I have a weak gut feeling this could be undecidable but it'd be great if I was wrong.
To get your minds working, other combos I've been thinking about:
:- (b ? 1:-1), !? don't think so, ?! (b ?!x:y), <<< don't think so.
If anyone cares: I'm interested because I'm creating a little custom C pre-processor for personal use and was hoping to not have to parse any C for it. In the end I will probably just have my tokens start with $ or maybe a backquote but I still found this question interesting enough to post.
Edit: It was quickly pointed out that header names have almost no restrictions so let me amend that I'm particularly interested in non-pre-processor code, alternatively, we could consider characters within the <> of #include <...> as a string literal.
Re-edit: I guess macros/pre-processor directives beat this question any which way I ask it :P but if anyone can answer the question for pure (read: non-macro'd) C code, I think it's an interesting one.
#include <abc=>
is valid in a C program. The text inside the <...> can be any member of the source character set except a newline and >.
This means that most character sequences, including !? and <<<, could theoretically appear.
In addition to all the other quibbles, there are a variety of cases involving macros.
The arguments to a macro expansion don't need to be syntactically correct, although of course they would need to be syntactically correct in the context of their expansion. But then, they might never be expanded:
#include <errno.h>
#define S_(a) #a
#define _(a,cosmetic,c) [a]=#a" - "S_(c)
const char* err_names[] = {
_(EAGAIN, =>,Resource temporarily unavailable),
_(EINTR, =>,Interrupted system call),
_(ENOENT, =>,No such file or directory),
_(ENOTDIR, =>,Not a directory),
_(EPERM, =>,Operation not permitted),
_(ESRCH, =>,No such process),
};
#undef _
const int nerr = sizeof(err_names)/sizeof(err_names[0]);
Or, they could be used but in stringified form:
#define _(a,b,c) [a]=#a" "S_(b)" "S_(c)
Note: Why #a but S_(c)? Because EAGAIN and friends are macros, not constants, and in this case we don't want them to be expanded before stringification.
/*=>*/
//=>
"=>"
'=>'
I'm looking at implementing a C preprocessor in two phases, where the first phase converts the source file into an array of preprocessing tokens. This would be good for simplicity and performance, as the work of tokenizing would not need to be redone when a header file is included by multiple files in a project.
The snag:
#define f(x) #x
main() {
puts(f(a+b));
puts(f(a + b));
}
According to the standard, the output should be:
a+b
a + b
i.e. the information about whether constituent tokens were separated by whitespace is supposed to be preserved. This would require the two-phase design to be scrapped.
The uses of the # operator that I've seen so far don't actually need this, e.g. assert would still work fine if the output were always a + b regardless of whether the constituent tokens were separated by whitespace in the source file.
Is there any existing code anywhere that does depend on the exact behavior prescribed by the standard for this operator?
You might want to look at the preprocessor of the LCC compiler, written as an example ANSI C compiler for compiler courses. Another preprocessor is MCPP.
C/C++ preprocessing is quite tricky, if you stick to it make sure to get at least drafts of the relevant standards, and pilfer test suites somewhere.
Ignoring that there are sometimes better non-macro ways to do this (I have good reasons, sadly), I need to write a big bunch of generic code using macros. Essentially a macro library that will generate a large number of functions for some pre-specified types.
To avoid breaking a large number of pre-existing unit tests, one of the things the library must do is, for every type, generate the name of that type in all caps for printing. E.g. a type "flag" must be printed as "FLAG".
I could just manually write out constants for each type, e.g.
#define flag_ALLCAPSNAME FLAG
but this is not ideal. I'd like to be able to do this programatically.
At present, I've hacked this together:
char capname_buf[BUFSIZ];
#define __MACRO_TO_UPPERCASE(arg) strcpy(capname_buf, arg); \
for(char *c=capname_buf;*c;c++)*c = (*c >= 'a' && *c <= 'z')? *c - 'a' + 'A': *c;
__MACRO_TO_UPPERCASE(#flag)
which does what I want to some extent (i.e. after this bit of code, capname_buf has "FLAG" as its contents), but I would prefer a solution that would allow me to define a string literal using macros instead, avoiding the need for this silly buffer.
I can't see how to do this, but perhaps I'm missing something obvious?
I have a variadic foreach loop macro written (like this one), but I can't mutate the contents of the string literal produced by #flag, and in any case, my loop macro would need a list of character pointers to iterate over (i.e. it iterates over lists, not over indices or the like).
Thoughts?
It is not possible in portable C99 to have a macro which converts a constant string to all uppercase letters (in particular because the notion of letter is related to character encoding. An UTF8 letter is not the same as an ASCII one).
However, you might consider some other solutions.
customize your editor to do that. For example, you could write some emacs code which would update each C source file as you require.
use some preprocessor on your C source code (perhaps a simple C code generator script which would emit a bunch of #define in some #include-d file).
use GCC extensions to have perhaps
#define TO_UPPERCASE_COUNTED(Str,Cnt)
#define TO_UPPERCASE(Str) TO_UPPERCASE_COUNTED(Str,__COUNT__) {( \
static char buf_##Cnt[sizeof(Str)+4]; \
char *str_##Cnt = Str; \
int ix_##Cnt = 0; \
for (; *str_##Cnt; str_##Cnt++, ix_##Cnt++) \
if (ix_##Cnt < sizeof(buf_##Cnt)-1) \
buf_##Cnt[ix_##Cnt] = toupper(*str_##Cnt); \
buf_##Cnt; )}
customize GCC, perhaps using MELT (a domain specific language to extend GCC), to provide your __builtin_capitalize_constant to do the job (edit: MELT is now an inactive project). Or code in C++ your own GCC plugin doing that (caveat, it will work with only one given GCC version).
It's not possible to do this entirely using the c preprocessor. The reason for this is that the preprocessor reads the input as (atomic) pp-tokens from which it composes the output. There's no construct for the preprocessor to decompose a pp-token into individual characters in any way (no one that would help you here anyway).
In your example when the preprocessor reads the string literal "flag" it's to the preprocessor basically an atomic chunk of text. It have constructs to conditionally remove such chunks or glue them together into larger chunks.
The only construct that allows you in some sense to decompose a pp-token is via some expressions. However these expressions only can work on arithmetic types which is why they won't help you here.
Your approach circumvents this problem by using C language constructs, ie you do the conversion at runtime. The only thing the preprocessor does then is to insert the C code to convert the string.