Preprocessor only on arbitrary file?

I wanted to demonstrate that the preprocessor is totally independent of the build process. It has its own grammar and its own lexer, separate from those of the C language. In fact, I wanted to show that the preprocessor can be applied to any type of file.
So I have this arbitrary file:
#define FOO
#ifdef FOO
I am foo
#endif
#
#
#
Something
#pragma Hello World
And I thought this would work:
$ gcc -E test.txt -o-
gcc: warning: test.txt: linker input file unused because linking not done
Unfortunately it only works with this:
$ cat test.txt | gcc -E -
Why does GCC give this warning?

You need to tell gcc it's a C file.
gcc -xc -E test.txt

The C compiler driver uses the file name suffix to decide what to do with each file: files ending in .c are compiled, files ending in .o or .so are only linked, files ending in .s are passed to the assembler as(1), files ending in .f are passed to the Fortran compiler, and files ending in .cc are compiled as C++.
Normally, the driver treats any file whose suffix it does not recognise as linker input and hands it to the linker, ld(1). This is what happens with your .txt file; because -E stops before linking, gcc warns that the file is unused. The linker in turn has a similar way of telling ld(1) scripts apart from object files and shared objects.
BTW, the CPP language is indeed a macro language, but some similarities with C cannot be avoided. At the very least it has to recognise C identifiers, since macro names have the same syntax as C identifiers and it must check whether each identifier matches a macro name. It also has to recognise C comments and C strings (it removes comments before the compiler sees them), since macros are not expanded inside them, and it has to recognise parentheses and the , symbol, which delimit and separate macro arguments. Inside a macro body it also recognises the tokens # (to stringify a parameter) and ## (to paste two tokens into one). This last operator forces cpp to recognise almost every C token, as it must report an error if you try to paste something like + and - into +-, which is not a valid token.
So the conclusion is: cpp doesn't have to implement the whole C syntax, but it must recognise the tokens of the C language almost completely. The C standard requires the preprocessor to tokenize its input, so that the ## operator can be used to merge tokens (and the result checked for validity). This means that, if you define a macro like:
#define M(p) +p
and then you call it like:
a = +M(-c);
you will get a string similar to:
a = + +-c;
in the output. A space is inserted between the two + signs so that they don't get merged into a ++ operator; the + and - are left together because they can never be scanned as one token. See the next example (input is preceded by the > symbol):
$ cpp - <<EOF
> #define M(p) +p
> a = +M(p);
> b = -M(p);
> p = +M(+p);
> p = +M(-p);
> EOF
# 1 "<stdin>"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 346 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "<stdin>" 2
a = + +p;
b = -+p;
p = + + +p;
p = + +-p;
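The spaces in that output only matter for the textual form. For the compiler proper the expansion is already a token sequence, so the separate + tokens are never re-read as ++. A minimal sketch of that point, assuming any hosted C compiler (the function names are made up for illustration):

```c
#include <assert.h>

#define M(p) +p

/* +M(+p) is the token sequence + + + p: three unary pluses, no increment. */
int plus_m_plus(int p)
{
    return +M(+p);
}

/* -M(p) is the token sequence - + p, i.e. plain negation. */
int minus_m(int p)
{
    return -M(p);
}

/* Hypothetical self-check: call it from main() to run the assertions. */
void check_token_boundaries(void)
{
    assert(plus_m_plus(5) == 5);
    assert(minus_m(5) == -5);
}
```

If the tokens had been glued together textually, +M(+p) would read as ++ applied to +p, which does not compile.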
Another example will show more difficulties in parsing the tokens (input is delimited with >, stderr with >> and stdout is unquoted):
$ cpp - <<EOF
> #define M(a,b) a##b
> a = M(a+,+b)
> a = M(a+,-b)
> a = M(a,+b)
> a = M(a,b)
> a = M(a,300)
> a = M(a,300.2)
> EOF
>> <stdin>:3:5: error: pasting formed '+-', an invalid preprocessing token
>> a = M(a+,-b)
>> ^
>> <stdin>:1:17: note: expanded from macro 'M'
>> #define M(a,b) a##b
>> ^
>> <stdin>:4:5: error: pasting formed 'a+', an invalid preprocessing token
>> a = M(a,+b)
>> ^
>> <stdin>:1:17: note: expanded from macro 'M'
>> #define M(a,b) a##b
>> ^
>> <stdin>:7:5: error: pasting formed 'a300.2', an invalid preprocessing token
>> a = M(a,300.2)
>> ^
>> <stdin>:1:17: note: expanded from macro 'M'
>> #define M(a,b) a##b
>> ^
>> 3 errors generated.
# 1 "<stdin>"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 346 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "<stdin>" 2
a = a++b
a = a+-b
a = a+b
a = ab
a = a300
a = a 300.2
As you can see in this example, merging a and 300 goes fine: the result, a300, is a valid identifier, so cpp(1) doesn't complain. When merging a and 300.2, however, the result a300.2 is not a valid token in C, so it is rejected. The two tokens are also left unjoined, with a space inserted between them so that the compiler sees them as separate; had they been joined, the text would have been scanned as the tokens a300 and .2.
If you want a language-independent macro preprocessor, consider using m4(1). It's far more powerful than cpp in many ways. But beware: it's difficult to learn because of the complexity of the macro expansions it allows.
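The successful case can be checked from C itself. This is only a sketch: the name reg300 and the value 7 are invented for illustration, and it assumes a C11 compiler for _Static_assert.

```c
#define M(a, b) a##b

/* M(reg, 300) pastes two tokens into the single identifier reg300. */
enum { M(reg, 300) = 7 };

/* From here on the pasted name is an ordinary identifier. */
_Static_assert(reg300 == 7, "reg##300 should have formed the identifier reg300");

int read_reg300(void)
{
    return reg300;
}
```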

You can use the C preprocessor, cpp (or the more traditional form, /lib/cpp):
cpp test.txt
or
/lib/cpp test.txt

Related

Expand pragma to a comment (for doxygen)

Comments are usually converted to a single whitespace character before the preprocessor is run. However, there is a compelling use case.
#pragma once
#ifdef DOXYGEN
#define DALT(t,f) t
#else
#define DALT(t,f) f
#endif
#define MAP(n,a,d) \
DALT ( COMMENT(| n | a | d |) \
, void* mm_##n = a \
)
/// Memory map table
/// | name | address | description |
/// |------|---------|-------------|
MAP (reg0 , 0 , foo )
MAP (reg1 , 8 , bar )
In this example, when the DOXYGEN flag is set, I want to generate doxygen markup from the macro. When it isn't, I want to generate the variables. In this instance, the desired behaviour is to generate comments in the macros. Any thoughts about how?
I've tried /##/ and another example with more indirection:
#define COMMENT SLASH(/)
#define SLASH(s) /##s
Neither works.

Doxygen can run a command on each source file before it is fed into the doxygen kernel. The Doxyfile has several FILTER settings for this; in this case the relevant one is INPUT_FILTER, and the line should read:
INPUT_FILTER = "sed -e 's%^ *MAP *(\([^,]*\),\([^,]*\),\([^)]*\))%/// | \1 | \2 | \3 |%'"
Furthermore, the entire #if construct can disappear, and one probably just needs:
#define MAP(n,a,d) void* mm_##n = a

The ISO C standard describes the output of the preprocessor as a stream of preprocessing tokens, not text. Comments are not preprocessing tokens; they are stripped from the input before tokenization happens. Therefore, within the standard facilities of the language, it is fundamentally impossible for preprocessing output to contain comments or anything that resembles them.
In particular, consider
#define EMPTY
#define NOT_A_COMMENT_1(text) /EMPTY/EMPTY/ text
#define NOT_A_COMMENT_2(text) / / / text
NOT_A_COMMENT_1(word word word)
NOT_A_COMMENT_2(word word word)
After translation phase 4, the fourth and fifth lines of the above will both become the six-token sequence
[/][/][/][word][word][word]
where square brackets indicate token boundaries. There isn't any such thing as a // token, and therefore there is nothing you can do to make the preprocessor produce one.
Now, the ISO C standard doesn't specify the behavior of doxygen. However, if doxygen is reusing a preprocessor that came with someone's C compiler, the people who wrote that preprocessor probably thought textual preprocessor output should be, above all, an accurate reflection of the token sequence that the "compiler proper" would receive. That means it will forcibly insert spaces where necessary to make separate tokens remain separate. For instance, with test.c containing the above example,
$ gcc -E test.c
...
/ / / word word word
/ / / word word word
(I have elided some irrelevant chatter above the output we're interested in.)
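The same token-boundary rule can be observed without looking at -E output at all. In this sketch (function names are hypothetical), a / and a * that come out of the preprocessor side by side are still two tokens, so they mean "divide by the dereferenced pointer" and never open a comment:

```c
#include <assert.h>

#define EMPTY

/* Returns 8 divided by the int that p points to. After EMPTY expands to
 * nothing, the slash and the star are adjacent, yet they remain separate
 * tokens and are not re-read as a comment opener. */
int eight_divided_by(const int *p)
{
    return 8 /EMPTY*p;
}

/* Hypothetical self-check: call it from main() to run the assertion. */
void check_no_comment(void)
{
    int two = 2;
    assert(eight_divided_by(&two) == 4);
}
```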
If there is a way around this, you are most likely to find it in the doxygen manual. There might, for instance, be configuration options that teach it that certain macros should be understood to define symbols, and what symbols those are, and what documentation they should have.

Creating identifiers containing universal character names via token concatenation

I wrote this code that creates identifiers containing universal character names via token concatenation.
//#include <stdio.h>
int printf(const char*, ...);
#define CAT(a, b) a ## b
int main(void) {
//int \u306d\u3053 = 10;
int CAT(\u306d, \u3053) = 10;
printf("%d\n", \u306d\u3053);
//printf("%d\n", CAT(\u306d, \u3053));
return 0;
}
This code worked well with gcc 4.8.2 with the -fextended-identifiers option and with gcc 5.3.1, but didn't work with clang 3.3, which gave this error message:
prog.c:10:17: error: use of undeclared identifier 'ねこ'
printf("%d\n", \u306d\u3053);
^
1 error generated.
and with my local clang (Apple LLVM version 7.0.2 (clang-700.1.81)), which gave this error message:
$ clang -std=c11 -Wall -Wextra -o uctest1 uctest1.c
warning: format specifies type 'int' but the argument has type
'<dependent type>' [-Wformat]
uctest1.c:10:17: error: use of undeclared identifier 'ねこ'
printf("%d\n", \u306d\u3053);
^
1 warning and 1 error generated.
When I used -E option to have the compilers output code with macro expanded, gcc 5.3.1 emitted this:
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.c"
int printf(const char*, ...);
int main(void) {
int \U0000306d\U00003053 = 10;
printf("%d\n", \U0000306d\U00003053);
return 0;
}
local clang emitted this:
# 1 "uctest1.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 326 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "uctest1.c" 2
int printf(const char*, ...);
int main(void) {
int \u306d\u3053 = 10;
printf("%d\n", ねこ);
return 0;
}
As you can see, the identifier that is declared and the one used in printf() match in gcc's output, but they don't match in clang's output.
I know that creating universal character names via token concatenation invokes undefined behavior.
Quote from N1570 5.1.1.2 Translation phases:
If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined.
I thought that this character sequence \u306d\u3053 may "match the syntax of a universal character name" because it contains universal character names as its substring.
I also thought that "match" may mean that the entire token produced via concatenation stands for one universal character name, and that therefore this undefined behavior isn't invoked in this code.
Reading PRE30-C. Do not create a universal character name through concatenation, I found a comment saying this kind of concatenation is allowed:
What is forbidden, to create a new UCN via concatenation. Like doing
assign(\u0001,0401,a,b,4)
just concatenating stuff that happens to contain UCNs anywhere is okay.
There is also a log showing that a code example like mine (but with 4 characters) was replaced with another code example.
Does my code example invoke any undefined behavior (not limited to the one invoked by producing universal character names via token concatenation)?
Or is this a bug in clang?

Your code does not trigger the undefined behavior you mention, because no universal character name (6.4.3) is produced by the token concatenation.
And, according to 6.10.3.3, since both the left and right operands of the ## operator are identifiers, and the resulting token is also a valid preprocessing token (again an identifier), the ## operator itself does not trigger undefined behavior.
After reading the descriptions of identifiers (6.4.2, D.1, D.2) and universal character names (6.4.3), I'm pretty sure this is a bug in the clang preprocessor, which treats an identifier produced by token concatenation differently from a normal identifier.

simple script or commands to substitute stray "\\n" with "\n"

alright, i understand that the title of this topic sounds a bit gibberish... so i'll try to explain it as clearly as i can...
this is related to this previous post (an approach that's been verified to work):
multipass a source code to cpp
-- which basically asks the cpp to preprocess the code once before starting the gcc compile build process
take the previous post's sample code:
#include <stdio.h>
#define DEF_X #define X 22
int main(void)
{
DEF_X
printf("%u", X);
return 1;
}
now, to be able to freely insert the DEF_X anywhere, we need to add a newline
this doesn't work:
#define DEF_X \
#define X 22
this still doesn't work, but is more likely to:
#define DEF_X \n \
#define X 22
if we get the latter above to work, thanks to C's free form syntax and constant string multiline concatenation, it works anywhere as far as C/C++ is concerned:
"literal_str0" DEF_X "literal_str1"
now when cpp preprocesses this:
# 1 "d:/Projects/Research/tests/test.c"
# 1 "<command-line>"
# 1 "d:/Projects/Research/test/test.c"
# 1 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/stdio.h" 1 3
# 19 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/stdio.h" 3
# 1 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/_mingw.h" 1 3
# 32 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/_mingw.h" 3=
# 33 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/_mingw.h" 3
# 20 "c:\\mingw\\bin\\../lib/gcc/mingw32/4.7.2/../../../../include/stdio.h" 2 3
ETC_ETC_ETC_IGNORED_FOR_BREVITY_BUT_LOTS_OF_DECLARATIONS
int main(void)
{
\n #define X 22
printf("%u", X);
return 1;
}
we have a stray \n in our preprocessed file. so now the problem is to get rid of it....
now, the unix system commands aren't really my strongest suit. i've compiled dozens of packages in linux and written simple bash scripts that simply enter multiline commands (so i don't have to type them every time or keep pressing the up arrow and choose the correct command successions). so i don't know the finer points of stream piping and their arguments.
having said that, i tried these commands:
cpp $MY_DIR/test.c | perl -p -e 's/\\n/\n/g' > $MY_DIR/test0.c
gcc $MY_DIR/test0.c -o test.exe
it works, it removes that stray \n.
ohh, as to using perl rather than sed, i'm just more familiar with perl's variant to regex... it's more consistent in my eyes.
anyways, this has the nasty side effect of eating up any \n in the file (even in string literals)... so i need a script or a series of commands to:
remove a \n if:
it is not inside a quote -- so this won't be modified: "hell0_there\n"
it is not passed to a function call (inside the argument list)
this is safe as one can never pass a single \n, which is neither a keyword nor an identifier.
if i need to "stringify" an expression with \n, i can simply call a function macro QUOTE_VAR(token). so that encapsulates all instances that \n would have to be treated as a string.
this should cover all cases that \n should be substituted... at least for my own coding conventions.
really, i would do this if i could manage it on my own... but my regex skills are extremely lacking; i only use it for simple substitutions.

A better way is to replace \n only when it occurs at the beginning of a line.
The following command should do the job:
sed -e 's/^\s*\\n/\n/'
or only when it occurs before #:
sed -e 's/\\n\s*#/\n#/g'
Or you can reverse the order of preprocessing and substitute DEF_X with your own tool before running the C preprocessor.
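If a regex feels too blunt, the "not inside a quote" rule from the question can also be written as a small state machine. The following is only a sketch under stated assumptions: the function names are invented, the input is already-preprocessed text (so comments are gone), literals are well formed, and the caller supplies an output buffer at least as large as the input.

```c
#include <assert.h>
#include <string.h>

/* Copies in to out, turning a backslash-n pair into a real newline unless
 * the pair sits inside a string or character literal.
 * out must have room for strlen(in) + 1 bytes. */
void replace_stray_newlines(const char *in, char *out)
{
    char quote = 0;   /* 0 outside literals, else the opening quote character */

    while (*in) {
        if (quote) {
            if (*in == '\\' && in[1]) {   /* keep escape sequences intact */
                *out++ = *in++;
                *out++ = *in++;
                continue;
            }
            if (*in == quote)
                quote = 0;                /* closing quote ends the literal */
            *out++ = *in++;
        } else if (*in == '"' || *in == '\'') {
            quote = *in;                  /* entering a literal */
            *out++ = *in++;
        } else if (*in == '\\' && in[1] == 'n') {
            *out++ = '\n';                /* stray marker becomes a line break */
            in += 2;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}

/* Hypothetical self-check: call it from main() to run the assertions. */
void check_replace_stray_newlines(void)
{
    char buf[64];

    replace_stray_newlines("\\n #define X 22", buf);
    assert(strcmp(buf, "\n #define X 22") == 0);

    /* The escape inside the string literal is left alone. */
    replace_stray_newlines("puts(\"a\\n\"); \\n", buf);
    assert(strcmp(buf, "puts(\"a\\n\"); \n") == 0);
}
```

This does not implement the "not inside an argument list" rule from the question; it only protects quoted text.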

Can the C preprocessor perform arithmetic and if so, how?

I'm currently writing code for a microcontroller; since the ATMega128 does not have a hardware multiplier or divider, these operations must be done in software and they take up a decent amount of cycles. However, for code portability and ease of use, I'd prefer not to hard-code precomputed values into my code. So for instance, I have a number of tasks which are dependent on the system clock frequency. Currently I'm running at 16MHz, but should I choose to lower that, say to reduce power consumption for battery applications, I'd like to change one line of code rather than many.
So with that said, can the C preprocessor compute arithmetic expressions and then "paste" the result into my code rather than "pasting" the original expression into the code? If so, how would I go about doing this? Are there compiler options and whatnot that I need to consider?
NOTE: The values I want to compute are constant values, so I see no reason why this would not be a feature.

This is one question:
Q1. Can the C preprocessor perform arithmetic?
And this is another:
Q2. Can the C preprocessor compute arithmetic expressions and then "paste"
the result into my code rather than "pasting" the original expression into the code?
The answer to Q1 is Yes. The answer to Q2 is No. Both facts can be illustrated
with the following file:
foo.c
#define EXPR ((1 + 2) * 3)
#if EXPR == 9
int nine = EXPR;
#else
int not_nine = EXPR;
#endif
If we pass this to the C preprocessor, either by cpp foo.c or
equivalently gcc -E foo.c, we see output like:
# 1 "foo.c"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 30 "/usr/include/stdc-predef.h" 3 4
# 1 "/usr/include/x86_64-linux-gnu/bits/predefs.h" 1 3 4
# 31 "/usr/include/stdc-predef.h" 2 3 4
# 1 "<command-line>" 2
# 1 "foo.c"
int nine = ((1 + 2) * 3);
The fact that the preprocessor retains the line defining int nine and
has dropped the line defining not_nine shows us that it has correctly performed
the arithmetic required to evaluate #if EXPR == 9.
The fact that the preprocessed text of the definition is int nine = ((1 + 2) * 3);
shows us that the #define directive causes the preprocessor to replace
EXPR with its definition ((1 + 2) * 3), and not with the arithmetic value
of its definition, 9.
Does the C preprocessor have any directive that substitutes the arithmetic
value of an expression rather than its text? No.
But this does not of course imply that the definition of int nine must entail a
runtime calculation, because the compiler will almost certainly evaluate
the arithmetic expression ((1 + 2) * 3) at compile time and replace it with
the constant 9.
We can see how the compiler has translated the source file by examining the
compiled object file. Most toolchains will provide something like GNU binutils
objdump to assist with this. If I compile foo.c with gcc:
gcc -c -o foo.o foo.c
and then invoke:
objdump -s foo.o
to see the full contents of foo.o, I get:
foo.o: file format elf64-x86-64
Contents of section .data:
0000 09000000 ....
Contents of section .comment:
0000 00474343 3a202855 62756e74 752f4c69 .GCC: (Ubuntu/Li
0010 6e61726f 20342e38 2e312d31 30756275 naro 4.8.1-10ubu
0020 6e747539 2920342e 382e3100 ntu9) 4.8.1.
And there is the hoped-for 9 hard-coded in the .data section.
Note that the preprocessor's arithmetic capabilities are restricted to integer arithmetic.
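A compact way to see the integer-only rule, building on the foo.c example. The macro names EXPR_IS_NINE and DIVISION_TRUNCATES are invented for this sketch, and _Static_assert assumes a C11 compiler:

```c
#define EXPR ((1 + 2) * 3)

/* #if evaluates integer constant expressions, so this branch is chosen
 * by the preprocessor itself, before the compiler proper runs. */
#if EXPR == 9
#define EXPR_IS_NINE 1
#else
#define EXPR_IS_NINE 0
#endif

/* Division in #if is integer division: 7 / 2 is 3, not 3.5.
 * A floating constant such as 1.5 is not accepted in #if at all. */
#if 7 / 2 == 3
#define DIVISION_TRUNCATES 1
#else
#define DIVISION_TRUNCATES 0
#endif

_Static_assert(EXPR_IS_NINE && DIVISION_TRUNCATES, "unexpected #if arithmetic");
```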

It can, but it is unnecessary: you don't actually need to involve the preprocessor unless you actually want to generate new identifiers that involve numbers in some way (e.g. stuff like func1, func2).
Expressions like 1 + 2 * 3, where all elements are compile-time constant integer values, will be replaced with the single result at compile-time (this is more or less demanded by the C standard, so it's not "really" an optimisation). So just #define constants where you need to name a value that can be changed from one place, make sure the expression doesn't involve any runtime variables, and unless your compiler is intentionally getting in your way you should have no runtime operations to worry about.
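Applied to the clock-frequency case from the question, that could look like the sketch below. F_CPU, BAUD and UBRR_VALUE are hypothetical names loosely following common AVR conventions, and the divisor formula is only an illustration, not taken from the question:

```c
/* Hypothetical names; only F_CPU needs editing when the clock changes. */
#define F_CPU 16000000UL            /* system clock in Hz */
#define BAUD  9600UL
#define UBRR_VALUE (F_CPU / (16UL * BAUD) - 1UL)

/* An enumeration constant must be an integer constant expression, so this
 * line only compiles if UBRR_VALUE can be evaluated at translation time. */
enum { UBRR_FOLDED = UBRR_VALUE };

unsigned long ubrr_value(void)
{
    /* The macro still expands to the full expression text here; turning it
     * into a single constant is the compiler's job, not the preprocessor's. */
    return UBRR_VALUE;
}
```

With the values above the expression evaluates to 103, since 16000000 / 153600 truncates to 104.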

Yes, you can do arithmetic using the preprocessor, but it takes a lot of work. This page shows how to create an increment counter and a while loop. With those you could create addition:
#define ADD_PRED(x, y) y
#define ADD_OP(x, y) INC(x), DEC(y)
#define ADD(x, y) WHILE(ADD_PRED, ADD_OP, x, y)
EVAL(ADD(1, 2)) // Expands to 3
So reusing the ADD macro you can then create a MUL macro something like this:
#define MUL_PRED(r, x, y) y
#define MUL_OP(r, x, y) ADD(r, x), x, DEC(y)
#define MUL_FINAL(r, x, y) r
#define MUL(x, y) MUL_FINAL(WHILE(MUL_PRED, MUL_OP, 0, x, y))
EVAL(MUL(2, 3)) // Expands to 6
Division and subtraction can be built in a similar way.
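The INC and DEC building blocks are not defined in this answer. One common way to write them, shown here only as a sketch and not necessarily how the linked page does it, is a lookup table driven by token pasting. The table below covers just 0 to 4, and func3 is a made-up function:

```c
#define CAT(a, b) CAT_(a, b)   /* extra level so arguments expand first */
#define CAT_(a, b) a##b

#define INC(x) CAT(INC_, x)
#define INC_0 1
#define INC_1 2
#define INC_2 3
#define INC_3 4

#define DEC(x) CAT(DEC_, x)
#define DEC_1 0
#define DEC_2 1
#define DEC_3 2
#define DEC_4 3

/* INC(INC(1)) expands to the single token 3, which can then be pasted
 * into an identifier: CAT(func, INC(INC(1))) names func3. */
int func3(void) { return 30; }

int call_generated(void)
{
    return CAT(func, INC(INC(1)))();
}
```

Unlike ordinary constant folding, the result here is a real token, which is why it can become part of an identifier.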

I ran gcc -E on a file containing the following lines.
#define MUL(A, B) ((A)*(B))
#define CONST_A 10
#define CONST_B 20
int foo()
{
return MUL(CONST_A, CONST_B);
}
The output was:
# 1 "test-96.c"
# 1 "<command-line>"
# 1 "test-96.c"
int foo()
{
return ((10)*(20));
}
That's just one data point for you.

Is partial macro application / currying possible in the C preprocessor?

As an example of the problem, is there any way to implement the macro partialconcat in the following code?
#define apply(f, x) f(x)
apply(partialconcat(he),llo) //should produce hello
EDIT:
Here's another example, given a FOR_EACH variadic macro (see an example implementation in this answer to another question).
Say I want to call a member on several objects,
probably within another macro for a greater purpose.
I would like a macro callMember that behaves like this:
FOR_EACH(callMember(someMemberFunction), a, b, c);
produces
a.someMemberFunction(); b.someMemberFunction(); c.someMemberFunction();
This needs callMember(someMember) to produce a macro that behaves like
#define callMember_someMember(o) o.someMember()

You can achieve the desired result with the preprocessor using Vesa Karvonen's incredible "Order" language/library: http://rosettacode.org/wiki/Order
This works by implementing a whole second high-level language on top of the preprocessor itself, with support for things like currying and first-class macros and so on. It's pretty heavy-duty, though: nontrivial Order code takes a very long time to compile because CPP wasn't designed to be used that way, and most C compilers can't handle it. It's also very fragile: errors in the input code tend to produce incomprehensible gibberish output.
But yes, it can be done, and done in one preprocessor pass. It's just a lot more complicated than you might have been expecting.

Use higher order macros:
#define OBJECT_LIST(V) \
V(a) \
V(b) \
V(c)
#define MEMBER_CALL(X) \
X.some_func();
OBJECT_LIST(MEMBER_CALL)
output
$ g++ -E main.cc
# 1 "main.cc"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.cc"
# 10 "main.cc"
a.some_func(); b.some_func(); c.some_func();
Since it is a compile-time loop, currying is difficult. The OBJECT_LIST macro fixes how many arguments every user of the list receives, so the (default) call arguments become part of the list definition. Each user macro is then free to ignore a supplied default or to substitute a constant of its own. I was not able to find a proper way to reduce the number of arguments in the preprocessor, which limits the generality of this technique.
#define OBJECT_LIST(V) \
V(a, 1,2,3) \
V(b, 4,5,6)
#define MEMBER_CALL(X, A1, A2, A3) \
X.somefunc(A1, A2, A3);
#define CURRY_CALL(X, A1, A2, A3) \
X.somefunc(A1, 2, 2);
#define NO_CURRY_CALL(X, A1, A2, A3) \
X.somefunc(A1);
OBJECT_LIST(MEMBER_CALL)
OBJECT_LIST(CURRY_CALL)
OBJECT_LIST(NO_CURRY_CALL)
output:
# 1 "main2.cc"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main2.cc"
# 12 "main2.cc"
a.somefunc(1, 2, 3); b.somefunc(4, 5, 6);
a.somefunc(1, 2, 2); b.somefunc(4, 2, 2);
a.somefunc(1); b.somefunc(4);
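The same list-macro idea works in plain C. As a sketch, with AS_ENUM, AS_NAME, object_id and object_names being names invented here, one list can drive both an enum and a matching table of strings:

```c
#define OBJECT_LIST(V) \
    V(a)               \
    V(b)               \
    V(c)

#define AS_ENUM(x) OBJ_##x,
#define AS_NAME(x) #x,

/* Expands to: enum object_id { OBJ_a, OBJ_b, OBJ_c, OBJ_COUNT }; */
enum object_id { OBJECT_LIST(AS_ENUM) OBJ_COUNT };

/* Expands to: { "a", "b", "c", } */
const char *const object_names[] = { OBJECT_LIST(AS_NAME) };
```

Adding an entry to OBJECT_LIST updates both definitions at once.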

The C preprocessor is 'only' a simple text processor. In particular, one macro cannot define another macro; you cannot create a #define out of the expansion of a macro.
I think that means that the last two lines of your question:
This needs callMember(someMember) to produce a macro that behaves like
#define callMember_someMember(o) o.someMember()
are not achievable with a single application of the C preprocessor (and, in the general case, you'd need to apply the preprocessor an arbitrary number of times, depending on how the macros are defined).
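What a single pass can do is carry a partial application around as data instead of as a new macro. The sketch below uses a different interface from the apply(f, x) in the question: PARTIAL, APPLY and the helper macros are hypothetical names, and variadic macros (C99) are assumed.

```c
#define CONCAT(a, b) a##b

/* A "partial application" is just a parenthesised pair: (macro, bound_arg). */
#define PARTIAL(f, a) (f, a)

/* APPLY unpacks the pair and supplies the missing argument. */
#define APPLY(p, x)        APPLY_I(UNPACK p, x)
#define UNPACK(f, a)       f, a
#define APPLY_I(...)       APPLY_II(__VA_ARGS__)
#define APPLY_II(f, a, x)  f(a, x)

/* APPLY(PARTIAL(CONCAT, he), llo) expands to the identifier hello. */
enum { hello = 42 };
enum { RESULT = APPLY(PARTIAL(CONCAT, he), llo) };
```

The extra APPLY_I step is there so that UNPACK has been expanded into three separate arguments before APPLY_II counts them.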
