How to make C language context-free? - c

I know that C is not a context-free language, a famous example is:
int foo;
typedef int foo;
foo x;
In this case the lexer doesn't know, whether foo in the 3rd line, is an identifier, or typedef.
My question is, is this the only reason that makes C a Context-Sensitive Language?
I mean, if we get rid of typedef, would it become context-free language? Or there are other reasons (examples) that prevent it from being so?

Yes. C can be parsed with a classical lex + yacc combo. The lexer definition and the yacc grammar are freely available at
http://www.quut.com/c/ANSI-C-grammar-l-2011.html
and
http://www.quut.com/c/ANSI-C-grammar-y-2011.html
As you can see from the lex file, it's straightforward except for the context-sensitive check_type() (and comment(), but comment processing technically belongs to the preprocessor), which makes typedef the only source of context-sensitivity there. Since the yacc file doesn't contain any context-sensitivity introducing tricks either, a typedef-less C would be a perfectly context-free language.

No. C cannot be a strict context independent language. For that, you should describe a syntax that doesn't allow to use a nondeclared variable (this is context) in a similar way as what you describe in your question. The language authors always describe syntax using some kind of context free grammar, but just to describe the main syntactic constructs of the language. The case you describe (making a type identifier to fit in a different token class to be able to go in places where it shouldn't) is only an example. If you look for example, the freedom in the order for things like static unsigned long long int variable simplifies the syntax remembering by programmers, but complicates things to the compiler authors.

As per my knowledge and research there are two basic reasons that make C context sensitive language. These are:
Variable is declared before it is used.
Matching the formal and actual parameters of functions or sub-routines.
These two can't be done by PushDown Automata (PDA) but Linear Bounded Automata (LBA) can do thes two.

Related

ANSI C - direct-declarator grammar - Why does the C grammar allow syntactically legal, but sementically illegal declarations like int func()()?

The ANSI C grammar specifies:
declarator:
pointer_opt direct-declarator
direct-declarator:
identifier
( declarator )
direct-declarator [ constant-expression_opt ]
direct-declarator ( parameter-type-list )
direct-declarator ( identifier-list_opt )
According to this grammar, it would be possible to derive
func()()
as a declarator, and
int func()()
as a declaration, which is semantically illegal. Why does the C grammar allow such syntactically legal, but sementically illegal declarations?
These kinds of questions typically can't be answered for certain, because you're asking for information about the collective thoughts and deliberations of the C committee, in 1989. They've never conducted the work of language development wholly in public, the way, say, the people responsible for Python do, and thirty years ago they did that even less. And if you polled them personally, they probably wouldn't remember.
We can look at the C Rationale document (I'm linking to the edition corresponding to C1999, but as far as I know it didn't change very much since 1989) for clues, but on a quick skim, I don't see anything relevant to your question.
That leaves me making guesses based on general principles of programming language design. There is a general principle relevant to your question: Particularly for older languages, designers try to make the formal syntax be context-free as much as possible. This makes it much easier to write an efficient parser. Rules like "you can't have a function that returns a function" require context, and so they are left out of the syntax. It's straightforward to handle them as post-hoc constraints applied to the parse tree instead, so that's what designers do.
The C grammar has a whole bunch of places where this principle appears to have been used, not just the one you're asking about. For instance, the "maximal munch" rule for tokenization exists because it means the tokenizer does not need to be aware of the full parser context, even though it leads to inconvenient results, such as a-----b being interpreted as a -- -- - b instead of a -- - -- b, even though the parser will reject the former but accept the latter.
This design principle for programming languages is often surprising to beginners, because it's so different from how humans understand natural languages; we will go out of our way to "repair" some kind of contextually appropriate meaning from even the most nonsensical sentences, and we actually rely on this in conversation. It might help to contemplate the meta-principle that worse is better (to oversimplify, because you can get the first 90% of the work done quickly and put it out there and then iterate on the remaining 90%).
Why does the C grammar allow syntactically legal, but semantically illegal declarations like int func()()?
Your question basically answers itself:
Quite simply, it's because it's a grammar's whole job to accept syntactically legal constructs. If something is syntactically legal, but semantically meaningless or illegal, it's not the grammar's job to reject it -- it gets rejected later, during semantic analysis.
And if the question is, "Why wasn't the grammar written differently, so that semantically illegal constructs were also syntactically illegal (such that the grammar could reject them)?", the answer is that it's often a tradeoff whether to reject things during parsing or during semantic analysis. C's declaration syntax is pretty complicated, and there's an obvious desire to make the grammar which accepts it about as complicated as, but not significantly more complicated than, it has to be. Often, you can keep a grammar nicely simple by deferring certain checks to the semantic analysis phase.
Why does the C grammar allow such syntactically legal, but sementically illegal declarations?
What makes you think it sensible to expect the language syntax to be unable to express any semantically incorrect statements?
Not all semantic problems can even be detected at compile time (example: y = 1 / x;, which is well-defined except when x is zero). Even formulating the syntax rules so that they do not accept any statements, declarations, or expressions that can be proven semantically wrong at compile time would be of little benefit. It would complicate the syntax rules tremendously for very little gain, as compilers have to do the semantic analysis either way.
Note well that the primary audience for the language standard is people, not machines. That's why it describes the language semantics with prose.

Reserved words vs standard identifier, how to see the difference

So I'm completely new to programming. I currently study computer science and have just read the first 200 pages of my programming book, but there's one thing I cannot seem to see the difference between and which havn't been clearly specified in the book and that's reserved words vs. standard identifiers - how can I see from code if it's one or the other.
I know the reserved words are some that cannot be changed, while the standard indentifiers can (though not recommended according to my book). The problem is while my book says reserved words are always in pure lowercase like,
(int, void, double, return)
it kinda seems to be the very same for standard indentifier like,
(printf, scanf)
so how do I know when it is what, or do I have to learn all the reserved words from the ANSI C, which is the current language we are trying to learn, (or whatever future language I might work with) to know when it is when?
First off, you'll have to learn the rules for each language you learn as it is one of the areas that varies between languages. There's no universal rule about what's what.
Second, in C, you need to know the list of keywords; that seems to be what you're referring to as 'reserved words'. Those are important; they're immutable; they can't be abused because the compiler won't let you. You can't use int as a variable name; it is always a type.
Third, the C preprocessor can be abused to hijack anything; if you compile with #define double int in effect, you get what you deserve, but there's nothing much to stop you doing that.
Fourth, the only predefined variable name is __func__, the name of the current function.
Fifth, names such as printf() are defined by the standard library, but the standard library has to be implemented by someone using a C compiler; ask the maintainers of the GNU C library. For a discussion of many of the ideas behind the treaty between the standard and the compiler writers, and between the compiler writers and the programmers using a compiler, see the excellent book The Standard C Library by P J Plauger from 1992. Yes, it is old and the modern standard C library is somewhat bigger than the one from C90, but the background information is still valid and very helpful.
Reserved words are part of the language's syntax. C without int is not C, but something else. They are built into the language and are not and cannot be defined anywhere in terms of this particular language.
For example, if is a reserved keyword. You can't redefine it and even if you could, how would you do this in terms of the C language? You could do that in assembly, though.
The standard library functions you're talking about are ordinary functions that have been included into the standard library, nothing more. They are defined in terms of the language's syntax. Also, you can redefine these functions, although it's not advised to do so as this may lead to all sorts of bugs and unexpected behavior. Yet it's perfectly valid to write:
int puts(const char *msg) {
printf("This has been monkey-patched!\n");
return -1;
}
You'd get a warning that'd complain about the redefinition of a standard library function, but this code is valid anyway.
Now, imagine reimplementing return:
unknown_type return(unknown_type stuff) {
// what to do here???
}

Is the Syntax of C Language completely defined by CFGs?

I think the Question is self sufficient. Is the syntax of C Language completely defined through Context Free Grammars or do we have Language Constructs which may require non-Context Free definitions in the course of parsing?
An example of non CFL construct i thought was the declaration of variables before their use. But in Compilers(Aho Ullman Sethi), it is stated that the C Language does not distinguish between identifiers on the basis of their names. All the identifiers are tokenized as 'id' by the Lexical Analyzer.
If C is not completely defined by CFGs, please can anyone give an example of Non CFL construct in C?
The problem is that you haven't defined "the syntax of C".
If you define it as the language C in the CS sense, meaning the set of all valid C programs, then C – as well as virtually every other language aside from turing tarpits and Lisp – is not context free. The reasons are not related to the problem of interpreting a C program (e.g. deciding whether a * b; is a multiplication or a declaration). Instead, it's simply because context free grammars can't help you decide whether a given string is a valid C program. Even something as simple as int main() { return 0; } needs a more powerful mechanism than context free grammars, as you have to (1) remember the return type and (2) check that whatever occurs after the return matches the return type. a * b; faces a similar problem – you don't need to know whether it's a multiplication, but if it is a multiplication, that must be a valid operation for the types of a and b. I'm not actually sure whether a context sensitive grammar is enough for all of C, as some restrictions on valid C programs are quite subtle, even if you exclude undefined behaviour (some of which may even be undecidable).
Of course, the above notion is hardly useful. Generally, when talking grammars, we're only interested in a pretty good approximation of a valid program: We want a grammar that rules out as many strings which aren't C as possible without undue complexity in the grammar (for example, 1 a or (-)). Everything else is left to later phases of the compiler and called a semantic error or something similar to distinguish it from the first class of errors. These "approximate" grammars are almost always context free grammars (including in C's case), so if you want to call this approximation of the set of valid programs "syntax", C is indeed defined by a context free grammar. Many people do, so you'd be in good company.
The C language, as defined by the language standard, includes the preprocessor. The following is a syntactically correct C program:
#define START int main(
#define MIDDLE ){
START int argc, char** argv MIDDLE return 0; }
It seems to be really tempting to answer this question (which arises a lot) "sure, there is a CFG for C", based on extracting a subset of the grammar in the standard, which grammar in itself is ambiguous and recognizes a superset of the language. That CFG is interesting and even useful, but it is not C.
In fact, the productions in the standard do not even attempt to describe what a syntactically correct source file is. They describe:
The lexical structure of the source file (along with the lexical structure of valid tokens after pre-processing).
The grammar of individual preprocessor directives
A superset of the grammar of the post-processed language, which relies on some other mechanism to distinguish between typedef-name and other uses of identifier, as well as a mechanism to distinguish between constant-expression and other uses of conditional-expression.
There are many who argue that the issues in point 3 are "semantic", rather than "syntactic". However, the nature of C (and even more so its cousin C++) is that it is impossible to disentangle "semantics" from the parsing of a program. For example, the following is a syntactically correct C program:
#define base 7
#if base * 2 < 10
&one ?= two*}}
#endif
int main(void){ return 0; }
So if you really mean "is the syntax of the C language defined by a CFG", the answer must be no. If you meant, "Is there a CFG which defines the syntax of some language which represents strings which are an intermediate product of the translation of a program in the C language," it's possible that the answer is yes, although some would argue that the necessity to make precise what is a constant-expression and a typedef-name make the syntax necessarily context-sensitive, in a way that other languages are not.
Is the syntax of C Language completely defined through Context Free Grammars?
Yes it is. This is the grammar of C in BNF:
http://www.cs.man.ac.uk/~pjj/bnf/c_syntax.bnf
If you don't see other than exactly one symbol on the left hand side of any rule, then the grammar is context free. That is the very definition of context free grammars (Wikipedia):
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty).
Since ambiguity is mentioned by others, I would like to clarify a bit. Imagine the following grammar:
A -> B x | C x
B -> y
C -> y
This is an ambiguous grammar. However, it is still a context free grammar. These two are completely separate concepts.
Obviously, the semantics analyzer of C is context sensitive. This answer from the duplicate question has further explanations.
There are two things here:
The structure of the language (syntax): this is context free as you do not need to know the surroundings to figure out what is an identifier and what is a function.
The meaning of the program (semantics): this is not context free as you need to know whether an identifier has been declared and with what type when you are referring to it.
If you mean by the "syntax of C" all valid C strings that some C compiler accepts, and after running the pre-processor, but ignoring typing errors, then this is the answer: yes but not unambiguously.
First, you could assume the input program is tokenized according to the C standard. The grammar will describe relations among these tokens and not the bare characters. Such context-free grammars are found in books about C and in implementations that use parser generators. This tokenization is a big assumption because quite some work goes into "lexing" C. So, I would argue that we have not described C with a context-free grammar yet, if we have not used context-free grammars to describe the lexical level. The staging between the tokenizer and the parser combined with the ordering emposed by a scanner generator (prefer keywords, longest match, etc) are a major increase in computational power which is not easily simulated in a context-free grammar.
So, If you do not assume a tokenizer which for example can distinguish type names from variable names using a symbol table, then a context-free grammar is going to be harder. However: the trick here is to accept ambiguity. We can describe the syntax of C including its tokens in a context-free grammar fully. Only the grammar will be ambiguous and produce different interpretations for the same string . For example for A *a; it will have derivations for a multiplication and a pointer declaration both. No problem, the grammar is still describing the C syntax as you requested, just not unambiguously.
Notice that we have assumed having run the pre-processor first as well, I believe your question was not about the code as it looks before pre-processing. Describing that using a context-free grammar would be just madness since syntactic correctness depends on the semantics of expanding user-defined macros. Basically, the programmer is extending the syntax of the C language every time a macro is defined. At CWI we did write context-free grammars for C given a set of known macro definitions to extend the C language and that worked out fine, but that is not a general solution.

Examples of Non Context free language in C language?

What are examples of non - context free languages in C language ? How the following non-CFL exists in C language ?
a) L1 = {wcw|w is {a,b}*}
b) L2 = {a^n b^m c^n d^m| n,m >=1}
The question is clumsily worded, so I'm reading between the lines, here. Still, it's a common homework/study question.
The various ambiguities [1] in the C grammar as normally presented do not render the language non-context-free. (Indeed, they don't even render the grammars non-context-free.) The general rule "if it looks like a declaration, it's a declaration regardless of other possible parses" can probably be codified in a very complex context-free grammar (although it's not 100% obvious that that is true, since CFGs are not closed under intersection or difference), but it's easier to parse with a simpler CFG and then disambiguate according to the declaration rule.
Now, the important point about C (and most programming languages) is that the syntax of the language is quite a bit more complex than the BNF used for explanatory purposes. For example, a C program is not well-formed if a variable is used without being defined. That's a syntax error, but it's not detected by the CFG parser. The grammatical productions needed to define these cases are quite complicated, due to the complicated syntax of the language, but they're going to boil down to requiring that ids appear twice in a valid program. Hence L1 = {wcw|w is {a,b}+} (here w is the identifier, and c is way too complicated to spell out). In practice, checking this requirement is normally done with a symbol table, and the formal language rules, while precise, are not written in a logical formalism. Since L1is not a context-free language, the formalism could not be context-free, but a context-sensitive grammar can recognize L1, so it's not totally impossible. (See, for example, Algol 68.)
The symbol table is also used to decide whether a particular identifier is to be reduced to typedef-name [2]. This is required to resolve a number of ambiguities in the grammar. (It also further restricts the set of strings in the language, because there are some cases where an identifier must be resolved as a typedef-name in order for the program to be valid.)
For another type of context-sensitivity, function calls need to match function declarations in the number of arguments; this sort of requirement is modelled by L2 = {a^n b^m c^n d^m| n,m >=1} where a and c represent the definition and use of some function, and b and d represent the definition and use of a different function. (Again, in a highly-simplified form.)
This second requirement is possibly less clearly a syntactic requirement. Other languages (Python, for example) allow function calls with any number of arguments, and detect a argument/parameter count match as a semantic error only detected at runtime. In the case of C, however, a mismatch is clearly a syntax error.
In short, the set of grammatically valid strings which constitute the C language is a proper subset of the set of strings recognized by the CFG presented in the C language definition; the set of valid parses is a proper subset of the set of derivations generated by the CFG, and the language itself is (a) unambiguous, and (b) not context-free.
Note 1: Most of these are not really ambiguities, because they depend upon how a given identifier is resolved (typedef name, function identifier, declared variable,...).
Note 2: It is not the case that identifier must be resolved as a typedef-name if it happens to be one; that only happens in places where the reduction is possible. It is not a syntax error to use the same identifier for both a type and a variable, even in the same scope. (It's not a good idea, but it's valid.) The following example, adapted from an example in section 6.7.8 of the standard, shows the use of t as both a field name and a typedef:
typedef signed int t;
struct tag {
unsigned t:4; // field named 't' of type unsigned int
const t:5; // unnamed field of type 't' (signed int)
};
These things aren't context-free in C:
foo * bar; // foo multiplied by bar or declaration of bar pointing to foo?
foo(*bar); // foo called with *bar as param or declaration of bar pointing to foo?
foo bar[2] // is bar an array of foo or a pointer to foo?
foo (bar baz) // is foo a function or a pointer to a function?

What is the rationale behind typedef vs struct/union/enum, couldn't there be only one namespace?

In C if I declare a struct/union/enum:
struct Foo { int i ... }
when I want to use my structure I need to specify the tag:
struct Foo foo;
To loose this requirement, I have to alias my structure using typedef:
typedef struct Foo Foo;
Why not have all types/structs/whatever in the same "namespace" by default? What is the rationale behind the decision of requiring the declaration tag at each variable declaration (unless typdefe'd) ???
Many other languages do not make this distinction, and it seems that it's only bringing an extra level of complexity IMHO.
Structures/records were a very early pre-C addition to B, just after Dennis Ritchie added a the basic 'typed' structure. I believe that the original struct syntax did not have a tag at all, for every variable you made an anonymous struct:
struct {
int i;
char a[5];
} s;
Later, the tag was added to enable reuse of structure layout, but it wasn't really regarded as real 'type'. Also, removing the struct/union would make parsing impossible:
/* is Foo a union or a struct? */
Foo { int i; double x; };
Foo s;
or break the 'declaration syntax mimics expression syntax' paradigm that is so fundamental to C.
I suspect that typedef was added much later, possible a few years after the 'birth' of C.
The argument "C was the highest level language at the time." does not seem true. Algol-68 predates it and has records as proper types. The same holds for Pascal.
If you like to know more about the history of C you might find Ritchie's "The Development of the C Language" an interesting read.
Well, other languages also usually support namespaces. C doesn't.
It probably isn't the reason, but it makes sense to have at least this natural namespace.
Interesting question! Here are my thoughts.
When C was created, little abstraction existed over assembly language. There was FORTRAN, B, and others, but when C came to be it was arguably the highest level language in existence. It's goal was to provide functionality and syntax powerful enough to create and maintain an operating system, and it succeed remarkably.
Think that, at the time, porting a system to a new platform meant rewriting and adapting components to the platform's assembly language. With the release of C, it eventually came down to porting the C compiler, and recompiling existent code.
It was probably an asset back then that the very syntax of the language forced you to differentiate between types that could fit in a register, and types that couldn't.
Language syntax has evolved a lot since then, and most of the things we're used to see in modern languages are missing in C. User-defined namespaces is only one of them, and I don't think the concept of "syntax sugar" even existed back then. Or rather, C was the peak of syntax sugar.
We're surrounded with things like this. I mean, take a look at your keyboard: why do we ave a PAUSE/BREAK key? I don't think I've pressed that key for years.
It's inheritance from a time in which it made sense.

Resources