ANTLR C grammar, optional init_declarator_list? - c

In the ANSI C grammar for ANTLR v3 ( http://antlr.org/grammar/1153358328744/C.g ), how can init_declarator_list be optional in the declaration rule?
Instead of:
| declaration_specifiers init_declarator_list? ';'
I would say:
| declaration_specifiers init_declarator_list ';'
What part of the C standard allows statements like:
int;
EDIT:
I just tried it, and it is allowed! Okay then, why is it allowed?

A wild guess: To make it simpler to write programs that produce machine-generated C.

The ANTLR grammar probably follows the grammar of the C standard directly. I haven't read the C standard, but for C++ the standard states separately that the init_declarator_list can be omitted only when declaring a class or enum type. So the grammar alone merely encompasses all the possible forms of declaration, while each particular case is further restricted using plain language.
As for the case you indicated, int; is disallowed by rules outside the grammar.
Note that the C/C++ language cannot be completely defined by the grammar alone. Many extra rules must be specified in plain human language.

Related

why is `int test {}` a function definition in C language BNF

I'm interested in the famous The syntax of C in Backus-Naur Form and studied it for a while; what confuses me is that some syntax looks wrong to me but is considered right according to the BNF.
For example, int test {}: what is this? I think it is ill-formed syntax in C, but the truth is that the BNF considers it a function definition:
int -> type_const -> type_spec -> decl_specs
test-> id -> direct_declarator -> declarator
'{' '}' -> compound_stat
decl_specs declarator compound_stat -> function_definition
I tried this with bison; it considered the input int test {} well-formed, but when I tried it on a C compiler, it would not compile.
So I have these questions:
Is int test {} valid syntax or not?
If it is valid syntax, what does it mean, and why do compilers not recognize it?
If it is invalid syntax, can I say the BNF is not rigorous? And does that mean modern C compilers do not stick to this BNF?
The grammar is necessary but not sufficient to describe a valid C program. For that you need constraints from the standard too. A simpler example of this would be 0++, which follows the syntax of a C expression, but certainly isn't a valid program fragment...
C11 6.9.1p2:
The identifier declared in a function definition (which is the name of the function) shall have a function type, as specified by the declarator portion of the function definition. [162]
Footnote 162 explains that the intent of the constraint is that a typedef cannot be used, i.e. that
typedef int F(void);
F f { /* ... */ }
will not be valid, even though such a typedef could be used for a function declaration, i.e.
F f;
would declare the function
int f(void);
But mere existence of this constraint also proves that the BNF grammar in itself is not sufficient in this case. Hence you are correct in that the grammar would consider such a fragment a function definition.
The BNF form is a precise way to describe the syntax of a language, i.e. what to do precisely to get the parse tree starting from raw input.
For each language you can define infinitely many grammars that describe that language. Grammars that describe the same language can differ a lot in their properties.
If you study the grammar of the C language, take care: it is not context-free but context-sensitive, which means the decision to choose one rule or another depends on what surrounds that point in the input.
Read about the lexer hack to see how to correctly interpret the Backus Naur form of the C grammar.

What does it mean that the language of preprocessor directives is weakly related to the grammar of C?

The Wikipedia article on the C Preprocessor says:
The language of preprocessor directives is only weakly related to the grammar of C, and so is sometimes used to process other kinds of text files.
How is the language of a preprocessor different from C grammar? What are the advantages? Has the C Preprocessor been used for other languages/purposes?
Can it be used to differentiate between inline functions and macros, since inline functions have the syntax of a normal C function whereas macros use slightly different grammar?
The Wikipedia article is not really an authoritative source for the C programming language. The C preprocessor grammar is part of the C grammar, but it is completely distinct from the phrase structure grammar; the two are related only in that both understand the input as consisting of C language tokens (though the C preprocessor has the concept of preprocessing numbers, which means that something like 123_abc is a legal preprocessing token even though it is not a valid identifier).
After preprocessing has been completed and before translation using the phrase structure grammar commences (by now the preprocessor directives have been removed, macros expanded, and so forth),
Each preprocessing token is converted into a token. (C11 5.1.1.2p1 item 7)
Using the C preprocessor for any other language is really abuse. The reason is that the preprocessor requires that the file consist of proper C preprocessing tokens. It isn't designed to work with other languages. Even C++, with its recent extensions such as raw string literals, cannot be preprocessed by a C preprocessor!
Here's an excerpt from the cpp (GNU C preprocessor) manuals:
The C preprocessor is intended to be used only with C, C++, and
Objective-C source code. In the past, it has been abused as a general
text processor. It will choke on input which does not obey C's lexical
rules. For example, apostrophes will be interpreted as the beginning of
character constants, and cause errors. Also, you cannot rely on it
preserving characteristics of the input which are not significant to
C-family languages. If a Makefile is preprocessed, all the hard tabs
will be removed, and the Makefile will not work.
The preprocessor creates preprocessing tokens, which are later converted into C tokens.
In general the conversion is quite direct, but not always. For example, if you have a conditional preprocessing directive that evaluates to false as in
#if 0
comments
#endif
then in place of the comments you can write almost anything you want: it will be converted into preprocessing tokens that are never converted into C tokens, so in this way you can put non-comment, non-C text inside a C source file.
The only link between the language of the preprocessor and C is that many tokens are defined almost, but not always exactly, the same.
For example, it is valid to have preprocessing numbers (called pp-numbers in the ISO 9899 standard) like 4MD, which are valid preprocessing numbers but not valid C numbers. Using the ## operator you can obtain a valid C identifier from such preprocessing numbers. For example:
#define version 4A
#define name TEST_
#define VERSION_(x, y) x##y
#define VERSION(x, y) VERSION_(x, y) /* indirection: arguments are expanded before pasting */
VERSION(name, version) /* expands to TEST_4A, a valid C identifier */
(Note: with a single-level #define VERSION(x, y) x##y, the operands of ## are not macro-expanded, and the result would be nameversion instead.)
The preprocessor was conceived to be applicable to any language for text translation, not with only C in mind. In C it is useful mainly for making a clear separation between interfaces and implementations.
Conditionals in the C preprocessor are valid C expressions so the link between the preprocessor and the C language proper is intimate.
#define A (6)
#if A > 5
Here is a 6
#elif A < 0
# error
#endif
This expands to meaningless C, but may be meaningful text.
Here is a 6
Though the expanded text is invalid C, the preprocessor uses features of C to evaluate the conditional lines correctly. The C standard defines this in terms of the constant expression:
From the C99 standard §6.10.1:
6.10.1 Conditional inclusion
Preprocessing directives of the forms
# if constant-expression new-line group_opt
# elif constant-expression new-line group_opt
check whether the controlling constant expression evaluates to nonzero.
And here is the definition of a constant-expression
6.6 Constant expressions
Syntax:
constant-expression:
conditional-expression
Description
A constant expression can be evaluated during translation rather than runtime, and accordingly may be used in any place that a constant may be.
Constraints
Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.
Each constant expression shall evaluate to a constant that is in the range of representable values for its type.
Given the above, it's clear that the preprocessor requires a limited form of C language expression evaluation to work, and therefore knowledge of the C typesystem, grammar, and expression semantics.

How to make C language context-free?

I know that C is not a context-free language, a famous example is:
int foo;
typedef int foo;
foo x;
In this case the lexer doesn't know whether foo in the third line is an identifier or a typedef name.
My question is: is this the only reason that makes C a context-sensitive language?
I mean, if we got rid of typedef, would it become a context-free language? Or are there other reasons (examples) that prevent it from being so?
Yes. C can be parsed with a classical lex + yacc combo. The lexer definition and the yacc grammar are freely available at
http://www.quut.com/c/ANSI-C-grammar-l-2011.html
and
http://www.quut.com/c/ANSI-C-grammar-y-2011.html
As you can see from the lex file, it's straightforward except for the context-sensitive check_type() (and comment(), but comment processing technically belongs to the preprocessor), which makes typedef the only source of context-sensitivity there. Since the yacc file doesn't contain any context-sensitivity introducing tricks either, a typedef-less C would be a perfectly context-free language.
No. C cannot be a strictly context-free language. For that, you would have to describe a syntax that disallows the use of an undeclared variable (this is context), in a similar way to what you describe in your question. Language authors always describe syntax using some kind of context-free grammar, but only to describe the main syntactic constructs of the language. The case you describe (making a type identifier fit into a different token class so that it can go in places where it otherwise shouldn't) is only one example. To take another: the freedom in ordering for things like static unsigned long long int variable makes the syntax easier for programmers to remember, but complicates things for compiler authors.
As per my knowledge and research, there are two basic reasons that make C a context-sensitive language:
A variable must be declared before it is used.
Matching the formal and actual parameters of functions or subroutines.
These two checks can't be done by a pushdown automaton (PDA), but a linear bounded automaton (LBA) can do them.

Is the Syntax of C Language completely defined by CFGs?

I think the question is self-sufficient. Is the syntax of the C language completely defined through context-free grammars, or do we have language constructs that require non-context-free definitions in the course of parsing?
An example of a non-CFL construct I thought of was the declaration of variables before their use. But in Compilers (Aho, Ullman, Sethi) it is stated that the C language does not distinguish between identifiers on the basis of their names; all identifiers are tokenized as 'id' by the lexical analyzer.
If C is not completely defined by CFGs, can anyone please give an example of a non-CFL construct in C?
The problem is that you haven't defined "the syntax of C".
If you define it as the language C in the CS sense, meaning the set of all valid C programs, then C (like virtually every other language aside from Turing tarpits and Lisp) is not context-free. The reasons are not related to the problem of interpreting a C program (e.g. deciding whether a * b; is a multiplication or a declaration). Instead, it's simply because context-free grammars can't help you decide whether a given string is a valid C program. Even something as simple as int main() { return 0; } needs a more powerful mechanism than context-free grammars, as you have to (1) remember the return type and (2) check that whatever occurs after the return matches the return type. a * b; faces a similar problem: you don't need to know whether it's a multiplication, but if it is a multiplication, that must be a valid operation for the types of a and b. I'm not actually sure whether a context-sensitive grammar is enough for all of C, as some restrictions on valid C programs are quite subtle, even if you exclude undefined behaviour (some of which may even be undecidable).
Of course, the above notion is hardly useful. Generally, when talking grammars, we're only interested in a pretty good approximation of a valid program: We want a grammar that rules out as many strings which aren't C as possible without undue complexity in the grammar (for example, 1 a or (-)). Everything else is left to later phases of the compiler and called a semantic error or something similar to distinguish it from the first class of errors. These "approximate" grammars are almost always context free grammars (including in C's case), so if you want to call this approximation of the set of valid programs "syntax", C is indeed defined by a context free grammar. Many people do, so you'd be in good company.
The C language, as defined by the language standard, includes the preprocessor. The following is a syntactically correct C program:
#define START int main(
#define MIDDLE ){
START int argc, char** argv MIDDLE return 0; }
It seems to be really tempting to answer this question (which arises a lot) "sure, there is a CFG for C", based on extracting a subset of the grammar in the standard, which grammar in itself is ambiguous and recognizes a superset of the language. That CFG is interesting and even useful, but it is not C.
In fact, the productions in the standard do not even attempt to describe what a syntactically correct source file is. They describe:
The lexical structure of the source file (along with the lexical structure of valid tokens after pre-processing).
The grammar of individual preprocessor directives
A superset of the grammar of the post-processed language, which relies on some other mechanism to distinguish between typedef-name and other uses of identifier, as well as a mechanism to distinguish between constant-expression and other uses of conditional-expression.
There are many who argue that the issues in point 3 are "semantic", rather than "syntactic". However, the nature of C (and even more so its cousin C++) is that it is impossible to disentangle "semantics" from the parsing of a program. For example, the following is a syntactically correct C program:
#define base 7
#if base * 2 < 10
&one ?= two*}}
#endif
int main(void){ return 0; }
So if you really mean "is the syntax of the C language defined by a CFG", the answer must be no. If you meant, "Is there a CFG which defines the syntax of some language which represents strings which are an intermediate product of the translation of a program in the C language," it's possible that the answer is yes, although some would argue that the necessity to make precise what is a constant-expression and a typedef-name make the syntax necessarily context-sensitive, in a way that other languages are not.
Is the syntax of C Language completely defined through Context Free Grammars?
Yes it is. This is the grammar of C in BNF:
http://www.cs.man.ac.uk/~pjj/bnf/c_syntax.bnf
If no rule has anything other than exactly one symbol on its left-hand side, then the grammar is context-free. That is the very definition of context-free grammars (Wikipedia):
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty).
Since ambiguity is mentioned by others, I would like to clarify a bit. Imagine the following grammar:
A -> B x | C x
B -> y
C -> y
This is an ambiguous grammar. However, it is still a context free grammar. These two are completely separate concepts.
Obviously, the semantic analyzer of C is context-sensitive. This answer from the duplicate question has further explanations.
There are two things here:
The structure of the language (syntax): this is context free as you do not need to know the surroundings to figure out what is an identifier and what is a function.
The meaning of the program (semantics): this is not context free as you need to know whether an identifier has been declared and with what type when you are referring to it.
If by the "syntax of C" you mean all valid C strings that some C compiler accepts, after running the preprocessor but ignoring typing errors, then this is the answer: yes, but not unambiguously.
First, you could assume the input program is tokenized according to the C standard. The grammar then describes relations among these tokens, not the bare characters. Such context-free grammars are found in books about C and in implementations that use parser generators. This tokenization is a big assumption, because quite some work goes into "lexing" C. So I would argue that we have not described C with a context-free grammar yet if we have not used context-free grammars to describe the lexical level. The staging between the tokenizer and the parser, combined with the ordering imposed by a scanner generator (prefer keywords, longest match, etc.), is a major increase in computational power that is not easily simulated in a context-free grammar.
So, if you do not assume a tokenizer that can, for example, distinguish type names from variable names using a symbol table, then a context-free grammar is going to be harder. However, the trick here is to accept ambiguity. We can fully describe the syntax of C, including its tokens, in a context-free grammar. Only the grammar will be ambiguous and produce different interpretations for the same string. For example, for A *a; it will have derivations both for a multiplication and for a pointer declaration. No problem: the grammar is still describing the C syntax as you requested, just not unambiguously.
Notice that we have assumed having run the pre-processor first as well, I believe your question was not about the code as it looks before pre-processing. Describing that using a context-free grammar would be just madness since syntactic correctness depends on the semantics of expanding user-defined macros. Basically, the programmer is extending the syntax of the C language every time a macro is defined. At CWI we did write context-free grammars for C given a set of known macro definitions to extend the C language and that worked out fine, but that is not a general solution.

Are C preprocessor statements a part of the C language?

I recall a claim made by one of my professors in an introductory C course. He stated that the #define preprocessor command enables a programmer to create a constant for use in later code, and that the command was a part of the C language.
/* Is this truly C code? */
#define FOO 42
Since this was in an introductory programming class, I suspect that he was merely simplifying the relationship between the source file and the compiler, but nevertheless I wish to verify my understanding.
Are preprocessor statements completely independent from the C language (dependent on the specific compiler used) or are they explicitly described in the C99 standard? Out of curiosity, did K&R ever mention preprocessor macros?
Yes, the standard describes the preprocessor. It's a standardized part of the C language.
Note that #include, which is essential for modularization of code, is a preprocessor directive.
In the publicly available draft of the C99 standard, the preprocessor is described in section 6.10.
The preprocessor is indeed part of the C and C++ standard (chapter 16 in the C++ standard) and the standards describe how the preprocessor and the language interact (for example it is illegal to re-#define the C keywords).
However the C preprocessor can work with other languages than C for any kind of simple file preprocessing (I have seen it used with LaTeX files for example).
Yes, the preprocessor is part of the C language. Conceptually it is run before the source is compiled.
Along with constant definitions, the preprocessor is used to implement two very important constructs:
#include which brings other files into the compilation unit.
include guards; i.e. the pattern,
#if !defined(METAWORD)
#define METAWORD 1
/* struct definition, function prototype */
#endif
Out of interest, these two usages have survived into C++, while constant definition can be implemented in other (better?) ways.
