Limit of preprocessor __VA_ARGS__ length? - c

I'm using a preprocessor __VA_ARGS__ hack for SQL code so that it can be pasted directly into sqlite3.exe for quick no-build debugging:
#define QUOTE(...) #__VA_ARGS__
char const example[] = QUOTE(
INSERT INTO Some_Table(p, q) VALUES(?, ?);
);
https://stackoverflow.com/a/17996915/1848654
However, this is hardly how __VA_ARGS__ is supposed to be used. In particular, my SQL code consists of hundreds of tokens.
What is the limit of __VA_ARGS__ length (if any)?

The only thing I can find is this bit in C99 (C11 still contains the same text):
5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that
contains at least one instance of every one of the following limits:13)
[...]
127 arguments in one macro invocation
[...]
4095 characters in a character string literal or wide string literal (after concatenation)
[...]
[...]
13) Implementations should avoid imposing fixed translation limits whenever possible.
So according to the standard there is no fixed limit. You will have to check whether your compiler documents any limits or just tries to support whatever you throw at it (until it runs out of RAM or something).
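If you are worried about brushing up against those minimums, one workaround (a sketch, with arbitrary split points) is to break the SQL into several smaller QUOTE() invocations and let adjacent string literals be concatenated in translation phase 6:
#define QUOTE(...) #__VA_ARGS__
/* Each invocation stays well below the 127-arguments-per-invocation and
   4095-characters-per-logical-line minimums; note that the final literal
   after concatenation is still subject to the 4095-characters-per-string-
   literal minimum. */
char const example[] =
    QUOTE(INSERT INTO Some_Table(p, q)) " "
    QUOTE(VALUES(?, ?););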

Related

Is there a limit to the length of a macro's contents?

For embedded programs, I often convert data tables into header #defines, which get dropped into variables/arrays in the .c program.
I've just written a conversion tool that can potentially produce massive output in this format, and now I'm wondering if I should be aware of any limitations of this pattern.
Header example:
#define BIG_IMAGE_BLOCK \
0x00, 0x01, 0x02, 0x03, \
0x04, 0x05, 0x06, 0x07, \
/* this goes on ... */ \
0xa8, 0xa9, 0xaa, 0xab
Code example (avr-gcc):
const uint8_t ImageData[] PROGMEM = {
BIG_IMAGE_BLOCK
};
Can't seem to find an answer to this particular question; it seems drowned out by everyone asking about identifier, line length, and macro re-evaluation limits.
C17 Section 5.2.4.1, clause 1, lists a number of minimum translation limits. This means implementations are permitted, but not required, to exceed those limits. In the quote below, I've omitted a couple of footnote references; the limit most likely relevant to this question is the one on the number of characters in a logical source line.
The implementation shall be able to translate and execute at least one program that contains at least
one instance of every one of the following limits:
— 127 nesting levels of blocks
— 63 nesting levels of conditional inclusion
— 12 pointer, array, and function declarators (in any combinations) modifying an arithmetic, structure, union, or void type in a declaration
— 63 nesting levels of parenthesized declarators within a full declarator
— 63 nesting levels of parenthesized expressions within a full expression
— 63 significant initial characters in an internal identifier or a macro name (each universal character name or extended source character is considered a single character)
— 31 significant initial characters in an external identifier (each universal character name specifying a short identifier of 0000FFFF or less is considered 6 characters, each universal character name specifying a short identifier of 00010000 or more is considered 10 characters, and each extended source character is considered the same number of characters as the corresponding universal character name, if any)
— 4095 external identifiers in one translation unit
— 511 identifiers with block scope declared in one block
— 4095 macro identifiers simultaneously defined in one preprocessing translation unit
— 127 parameters in one function definition
— 127 arguments in one function call
— 127 parameters in one macro definition
— 127 arguments in one macro invocation
— 4095 characters in a logical source line
— 4095 characters in a string literal (after concatenation)
— 65535 bytes in an object (in a hosted environment only)
— 15 nesting levels for #included files
— 1023 case labels for a switch statement (excluding those for any nested switch statements)
— 1023 members in a single structure or union
— 1023 enumeration constants in a single enumeration
— 63 levels of nested structure or union definitions in a single struct-declaration-list
The limit on the number of characters in a logical source line is relevant because the expansion of a macro lands on a single logical source line. For example, if \ is used in a macro definition to indicate a multi-line macro, all the parts are spliced into a single logical source line. This splicing is required by Section 5.1.1.2, clause 1, second item.
Depending on how the macro is defined, it may be affected by other limits as well.
In practice, all implementations (compilers and their preprocessors) do exceed these limits. For example, the allowed length of a logical source line for the GNU compiler is determined by available memory.
The C standard is very lax about specifying such limits. A C implementation must be able to translate “at least one program” with 4095 characters on a logical source line (C 2018 5.2.4.1). However, it may fail in other situations with shorter lines. The length of macro replacement text (measured in either characters or preprocessor tokens) is not explicitly addressed.
So, C implementations may have limits on the lengths of macro replacement text and other text, but it is not controlled by the C standard and often not well documented, or documented at all, by C implementations.
A common technique for preparing complicated or massive data needed in source code is to write a separate program, run as part of the build, that processes the data and writes the desired source text. This is generally preferable to abusing C preprocessor features.
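As a rough sketch of that approach (the file names and output format here are made up for illustration), a tiny build-time generator could read the raw bytes and emit a ready-to-compile source file, sidestepping the giant macro entirely:
/* gen_image.c - hypothetical build-time generator (names are illustrative).
   Reads a binary file and writes a C source file containing the data array,
   so no oversized #define is needed. */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s input.bin output.c\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "w");
    if (!in || !out) {
        perror("fopen");
        return 1;
    }
    fprintf(out, "#include <stdint.h>\n#include <avr/pgmspace.h>\n\n");
    fprintf(out, "const uint8_t ImageData[] PROGMEM = {\n");
    int c, n = 0;
    while ((c = fgetc(in)) != EOF)
        fprintf(out, "0x%02x,%s", c, (++n % 12 == 0) ? "\n" : " ");
    fprintf(out, "\n};\n");
    fclose(in);
    fclose(out);
    return 0;
}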

Why should the controlled group in a conditional inclusion be lexically valid when the conditional is false?

The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
return 0;
}
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
return 0;
}
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor generates preprocessing tokens, and having a " end up in the catch-all category (each non-white-space character that cannot be one of the above) is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference, the grammar for preprocessing-token is:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
The unmatched " in your second example falls into that last category: each non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint violation, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error (godbolt session). As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: for /* comments you look for the next */, and for // comments you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second example causes undefined behaviour at translation phase 3.
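A minimal illustration of the practical consequence (reusing the question's WILL_NOT_COMPILE example): terminate the string literal, or hide it in a comment, and the skipped group tokenizes cleanly in phase 3 even though it is never compiled:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char *s = "failure";   /* properly ended string literal: lexically valid */
/* char *u = "an unterminated quote is fine inside a comment */
#endif
int main(void)
{
    return 0;
}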

printf with a predefined macro concatenated with the format string in C

I'm programming on Windows with MSVC. There are two ways I know of to write a DEBUG_PRINT statement:
printf(__FUNCTION__": Error code: %x\n", ErrorCode);
printf("%s: Error code: %x\n", __FUNCTION__, ErrorCode);
Is it okay to concatenate a predefined macro with string literals like this? I don't know whether predefined macros like __FUNCTION__ or __LINE__ are legitimate string literals. Intuitively, it seems like a dangerous way to treat strings in C.
And what's the difference between these two? When I use the /FAcs compiler option to output the code to assembly, I really can't see much of a difference.
First of all, __FUNCTION__ is not in the C standard; you should probably use __func__ instead (except that Microsoft has decided to skip support for that in their C compiler).
Second, the __FUNCTION__/__func__ "macro" is not really a macro (or at least not normally - Microsoft's compiler seems to behave differently); it behaves more like a local variable and therefore isn't a candidate for string literal concatenation. You should use string formatting instead (since that will make your code more portable).
The __LINE__ macro (which is a macro) doesn't work well with string concatenation directly, since it doesn't expand to a string - it expands to a number (which by the way can be useful in other cases). However, you can use the preprocessor to stringify it (the XSTR macro will first expand its argument and then stringify the result, while STR will stringify its argument without expanding it first):
#define STR(x) # x
#define XSTR(x) STR(x)
printf("I'm on line #" XSTR(__LINE__));
The __FILE__ macro (which is also a macro) does expand to a string literal which plays well together with string concatenation directly.
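Putting those pieces together, a hedged sketch of a DEBUG_PRINT macro (the macro names are illustrative, not from the question): __FILE__ and the stringified __LINE__ are glued to the format string at compile time, while the function name goes through %s at runtime because __func__ is not a string literal:
#include <stdio.h>

#define STR(x)  #x
#define XSTR(x) STR(x)

/* __FILE__ is a string literal and XSTR(__LINE__) stringifies the line
   number, so both concatenate with the format string at compile time.
   __func__ is not a literal, so it is passed as a runtime %s argument. */
#define DEBUG_PRINT(fmt, ...) \
    printf(__FILE__ ":" XSTR(__LINE__) " in %s: " fmt "\n", __func__, __VA_ARGS__)

int main(void)
{
    unsigned int error_code = 0x42;            /* example value */
    DEBUG_PRINT("Error code: %x", error_code);
    return 0;
}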
The reason you don't see any difference is that the compiler knows what printf does and can use that for optimization. It would figure out that it doesn't need to rely on printf code to expand the %s at runtime since it can do it at compile time.
The former will concatenate the function name in __FUNCTION__ with the format string at compile-time. The second will format it into the output at runtime.
This assumes it's a pre-defined macro (it's not standard). MSVC has __FUNCTION__ as a proper string literal, GCC does not.
__LINE__ is supported by GCC, and expands to a decimal integer, i.e. not a string.
For performance reasons, I would suggest always using the first way when possible, i.e. when the two strings are compile-time constants. There will be a price to pay, as usual: the string pool will be larger, since each use of the macro creates a unique format string. For desktop-class systems, this is probably negligible.
The difference is that in the first case the string literal gets concatenated with the format string during the compiler's translation phases, while in the second case it gets read in at run time. So the first method is much faster.
If you know that the macro is a pre-defined string literal, I don't see anything wrong with the code.
That being said, I have no idea what __FUNCTION__ is. There is the standard C __func__ (a predefined identifier rather than a macro), but it is not a string literal and should be treated like a const char array.
__LINE__ is a standard C macro that gives the source file line as an integer.
Thanks for answering. Looks like the first way is legitimate string literal concatenation; it just happens at build time, and it's faster, just more space consuming.
I think I'll stick with the first way. Thanks again.

Is it possible to convert a C string literal to uppercase using the preprocessor (macros)?

Ignoring that there are sometimes better non-macro ways to do this (I have good reasons, sadly), I need to write a big bunch of generic code using macros. Essentially a macro library that will generate a large number of functions for some pre-specified types.
To avoid breaking a large number of pre-existing unit tests, one of the things the library must do is, for every type, generate the name of that type in all caps for printing. E.g. a type "flag" must be printed as "FLAG".
I could just manually write out constants for each type, e.g.
#define flag_ALLCAPSNAME FLAG
but this is not ideal. I'd like to be able to do this programmatically.
At present, I've hacked this together:
char capname_buf[BUFSIZ];
#define __MACRO_TO_UPPERCASE(arg) strcpy(capname_buf, arg); \
for(char *c=capname_buf;*c;c++)*c = (*c >= 'a' && *c <= 'z')? *c - 'a' + 'A': *c;
__MACRO_TO_UPPERCASE(#flag)
which does what I want to some extent (i.e. after this bit of code, capname_buf has "FLAG" as its contents), but I would prefer a solution that would allow me to define a string literal using macros instead, avoiding the need for this silly buffer.
I can't see how to do this, but perhaps I'm missing something obvious?
I have a variadic foreach loop macro written (like this one), but I can't mutate the contents of the string literal produced by #flag, and in any case, my loop macro would need a list of character pointers to iterate over (i.e. it iterates over lists, not over indices or the like).
Thoughts?
It is not possible in portable C99 to have a macro which converts a constant string to all uppercase letters (in particular because the notion of a letter depends on the character encoding: a UTF-8 letter is not the same as an ASCII one).
However, you might consider some other solutions.
customize your editor to do that. For example, you could write some emacs code which would update each C source file as you require.
use some preprocessor on your C source code (perhaps a simple C code generator script which would emit a bunch of #define in some #include-d file).
use GCC extensions (statement expressions and __COUNTER__), perhaps something like:
#define TO_UPPERCASE_COUNTED(Str,Cnt) ({ \
    static char buf_##Cnt[sizeof(Str)+4]; \
    const char *str_##Cnt = Str; \
    int ix_##Cnt = 0; \
    for (; *str_##Cnt; str_##Cnt++, ix_##Cnt++) \
        if (ix_##Cnt < (int)sizeof(buf_##Cnt)-1) \
            buf_##Cnt[ix_##Cnt] = toupper((unsigned char)*str_##Cnt); \
    buf_##Cnt; })  /* needs <ctype.h> for toupper */
#define TO_UPPERCASE(Str) TO_UPPERCASE_COUNTED(Str,__COUNTER__)
customize GCC, perhaps using MELT (a domain specific language to extend GCC), to provide your __builtin_capitalize_constant to do the job (edit: MELT is now an inactive project). Or code in C++ your own GCC plugin doing that (caveat, it will work with only one given GCC version).
It's not possible to do this entirely using the C preprocessor. The reason for this is that the preprocessor reads the input as (atomic) pp-tokens from which it composes the output. There's no construct for the preprocessor to decompose a pp-token into individual characters in any way (none that would help you here, anyway).
In your example, when the preprocessor reads the string literal "flag", it is basically an atomic chunk of text to the preprocessor. The preprocessor has constructs to conditionally remove such chunks or glue them together into larger chunks.
The only construct that lets you, in some sense, look inside a pp-token is evaluation in preprocessor conditional expressions. However, those expressions work only on arithmetic types, which is why they won't help you here.
Your approach circumvents this problem by using C language constructs, i.e. you do the conversion at runtime. The only thing the preprocessor does then is to insert the C code to convert the string.
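For what it is worth, the runtime conversion the questioner's macro performs can also be wrapped in an ordinary helper function (a sketch; the function name is made up), which avoids the shared global buffer:
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies src into dst (of capacity dstsize),
   uppercasing as it goes. Equivalent to the questioner's macro but
   without the single global buffer. */
static void to_upper_copy(char *dst, size_t dstsize, const char *src)
{
    size_t i = 0;
    if (dstsize == 0)
        return;
    for (; src[i] != '\0' && i + 1 < dstsize; ++i)
        dst[i] = (char)toupper((unsigned char)src[i]);
    dst[i] = '\0';
}

/* Usage: char capname[64]; to_upper_copy(capname, sizeof capname, "flag"); */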

What are lexical and syntactic analysis in the process of compiling with a C compiler?

What are lexical and syntactic analysis during the process of compiling? Does preprocessing happen after lexical and syntactic analysis?
Consider this code:
int a = 10;
if (a < 4)
{
printf("%d", a);
}
In the Lexical Analysis phase: You identify each word/token and assign a meaning to it.
In the code above, you start by identifying that i followed by n followed by t and then a space is the word int, and that it is a language keyword; 1 followed by 0 and a space is the number 10; and so on.
In the Syntactic Analysis phase: You verify whether the code follows the language syntax (grammar rules). For example, you check whether there is only one variable on the LHS of an assignment operator (considering language C), that each statement is terminated by a ;, that if is followed by a conditional/Boolean expression, etc.
As others have mentioned, preprocessing usually happens before lexical or syntactic analysis.
Lexical analysis happens BEFORE the syntactical analysis. This is logical because before a macro can be expanded, the borders of the identifier naming it must be found, and that is done by lexical analysis. After that, syntactical analysis kicks in. Note that compilers typically do not generate the full preprocessed source before starting the syntactic analysis. They read the source picking one lexeme at a time, do the preprocessing if needed, and feed the result to syntactic analysis.
In one case lexical analysis happens twice. This is the paste buffering. Look at the code:
#define En(x) Abcd ## x ## x
enum En(5)
{
a, b = 20, c, d
};
This code defines an enum with the name Abcd55. When the ## operators are processed during macro expansion, the operands are placed into an internal buffer. After that this buffer is rescanned, much like a small #include. During that scan the compiler breaks the contents of the buffer into lexemes. The borders of the rescanned lexemes may not match the borders of the original lexemes that were placed into the buffer. In the example above, three lexemes are placed into the buffer but only one is retrieved.
Preprocessing happens before the lexical analysis, IIRC.
Comments get filtered out, #define, ... and after that, a compiler generates tokens with a scanner/lexer (lexical analysis). After that, compilers generate parse trees, which are used for the syntactic analysis.
There are exceptions, but it usually breaks out like this:
Preprocess - transform program text to program text
Lexical analysis - transform program text to "tokens", which are essentially small integers with attributes attached
Syntactic analysis - transform program text to abstract syntax
The definition of "abstract syntax" can vary. In one-pass compilers, abstract syntax amounts to tartget code. But theses days it's usually a tree or DAG that logically represents the structure of the program.
When we are talking about the C programming language, we should note that there is an ISO (ANSI) standard for the language. Here is the last public draft of C99 (ISO/IEC 9899:1999): www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
There is a section "5.1.1.2 Translation phases" which says how a C program should be parsed. There are stages:
... some steps for multi-byte, trigraph and backslash processing...
3). The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
This is lexical analysis for preprocessing. Only preprocessor directives, punctuation, string constants, identifiers, comments are lexed here.
4). Preprocessing directives are executed, macro invocations are expanded
This is preprocessing itself. This phase will also include files from #include and then it will delete preprocessing directives (like #define or #ifdef and others).
... processing of string literals...
7). White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
Conversion to token means language keyword detection and constants detection.
This is the step of final lexical analysis; syntactic and semantic analyses.
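A small illustration of that conversion step (a sketch): the preprocessor lexes 0x1E+2 as a single preprocessing number (pp-number), which then cannot be converted into a valid token in phase 7, whereas the spaced form is three ordinary tokens:
/* Phase 3 lexes "0x1E+2" as one pp-number; in phase 7 it cannot be converted
   into a valid token, so the first declaration would be rejected, while the
   spaced form parses as an addition. */
/* int bad  = 0x1E+2; */   /* error: invalid pp-number                      */
int good = 0x1E + 2;       /* fine: hex constant, '+', decimal constant (32) */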
So, your question was:
Does preprocessing happen after lexical and syntactic analysis?
Some lexical analysis is needed to do preprocessing, so order is:
lexical_for_preprocessor, preprocessing, true_lexical, other_analysis.
PS: A real C compiler may be organized in a slightly different way, but it must behave the same way as written in the standard.

Resources