Write a compiler from scratch in C [duplicate] - c

Possible Duplicate:
How to code a compiler in C?
How would I start writing a compiler from scratch (no Flex or Bison or Lex or Yacc) in C? I have a language that I wrote an interpreter for, and it's kind of like Forth. Sort of. It takes in symbols and interprets them one at a time, using a stack.
How would I make a compiler?
The interpreter is on GitHub, just to show people the syntax and simplicity (not meant as spam): http://github.com/tekknolagi/StackBased

Simple!
1. Tokenize the input.
2. Build a proper representation of it; generally this is an Abstract Syntax Tree, but that is not required.
3. Perform any tree transformations you may require (optional).
4. Generate the code by walking the tree.
5. Link any disparate portions together (optional).
Flex and Bison help with stages 1 and 2; everything else is up to you. If you're still stuck, I suggest going through Programming Language Pragmatics or the Dragon Book.
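To make that concrete, here is a minimal hand-written sketch (in C, no Flex or Bison) of stages 1, 2 and 4 for a toy expression grammar, emitting code for a stack machine roughly in the spirit of the question's stack-based interpreter. All names and the instruction set are invented for illustration; it is a sketch, not a real compiler.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* ---- stage 1: tokenize ---- */
    enum { NUM = 256 };
    static const char *src;   /* current position in the input   */
    static int tok;           /* current token: a char, NUM or 0 */
    static int tok_val;       /* number value when tok == NUM    */

    static void next_token(void)
    {
        while (isspace((unsigned char)*src)) src++;
        if (isdigit((unsigned char)*src)) {
            char *end;
            tok = NUM;
            tok_val = (int)strtol(src, &end, 10);
            src = end;
        } else {
            tok = *src ? *src++ : 0;   /* '+', '-', '*', '/', '(', ')' or end */
        }
    }

    /* ---- stage 2: build an AST ---- */
    typedef struct Node {
        int op;                  /* NUM, '+', '-', '*' or '/' */
        int val;                 /* used when op == NUM       */
        struct Node *lhs, *rhs;
    } Node;

    static Node *mknode(int op, int val, Node *lhs, Node *rhs)
    {
        Node *n = malloc(sizeof *n);
        n->op = op; n->val = val; n->lhs = lhs; n->rhs = rhs;
        return n;
    }

    static Node *parse_expr(void);

    static Node *parse_factor(void)          /* factor := NUM | '(' expr ')' */
    {
        Node *n;
        if (tok == NUM) { n = mknode(NUM, tok_val, NULL, NULL); next_token(); return n; }
        if (tok == '(') { next_token(); n = parse_expr(); next_token(); /* assume ')' */ return n; }
        fprintf(stderr, "parse error\n"); exit(1);
    }

    static Node *parse_term(void)            /* term := factor (('*'|'/') factor)* */
    {
        Node *n = parse_factor();
        while (tok == '*' || tok == '/') {
            int op = tok; next_token();
            n = mknode(op, 0, n, parse_factor());
        }
        return n;
    }

    static Node *parse_expr(void)            /* expr := term (('+'|'-') term)* */
    {
        Node *n = parse_term();
        while (tok == '+' || tok == '-') {
            int op = tok; next_token();
            n = mknode(op, 0, n, parse_term());
        }
        return n;
    }

    /* ---- stage 4: generate stack-machine code by walking the tree ---- */
    static void gen(const Node *n)
    {
        if (n->op == NUM) { printf("push %d\n", n->val); return; }
        gen(n->lhs);
        gen(n->rhs);
        puts(n->op == '+' ? "add" : n->op == '-' ? "sub"
           : n->op == '*' ? "mul" : "div");
    }

    int main(void)
    {
        src = "1 + 2 * (3 - 4)";
        next_token();
        gen(parse_expr());   /* push 1, push 2, push 3, push 4, sub, mul, add */
        return 0;
    }

The same shape scales up: the tokenizer grows more token kinds, the parser grows more node kinds, and the code generator targets real instructions (or C, or assembly text) instead of a toy stack machine.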

Related

What kind of lexer/parser was used in the very first C compiler? [closed]

In the early 1970s, Dennis Ritchie wrote the very first C compiler.
In the year 2017, I wanted to write a C compiler. Books like Expert C Programming: Deep C Secrets (Peter van der Linden) say that C was, above all else, designed to be easy to compile. But I've been having an inordinate amount of trouble with it.
For starters, it's already relatively difficult to come up with Lex/Yacc specifications for the C language, and these tools didn't even exist yet when Ritchie made his compiler!
Plus, there are a great many examples of surprisingly small C compilers that do not use any help from Lex & Yacc. (Check out this tiny obfuscated C compiler from Fabrice Bellard. Note that his "production" tinycc source is actually quite a bit longer, most likely in an effort to accommodate more architectures, and to be more readable)
So what am I missing here? What kind of lexer/parser did Ritchie use in his compiler? Is there some easier way of writing compilers that I just haven't stumbled onto?
Yacc's name is an abbreviation for "yet another compiler compiler", which strongly suggests that it was neither the first nor the second such tool.
Indeed, the Wikipedia article on History of Compiler Construction notes that
In the early 1960s, Robert McClure at Texas Instruments invented a compiler-compiler called TMG, the name taken from "transmogrification". In the following years TMG was ported to several UNIVAC and IBM mainframe computers.
…
Not long after Ken Thompson wrote the first version of Unix for the PDP-7 in 1969, Doug McIlroy created the new system's first higher-level language: an implementation of McClure's TMG. TMG was also the compiler definition tool used by Ken Thompson to write the compiler for the B language on his PDP-7 in 1970. B was the immediate ancestor of C.
That's not quite an answer to your question, but it provides some possibilities.
Original answer:
I wouldn't be at all surprised if Ritchie just banged together a hand-built top-down or operator precedence parser. The techniques were well-known, and the original C language presented few challenges. But parser generating tools definitely existed.
Postscript:
A comment on the OP by Alexey Frunze points to this early version of the C compiler. It's basically a recursive-descent top-down parser, up to the point where an expression needs to be parsed, at which point it switches to a shunting-yard-like operator-precedence grammar. (See the function tree in the first source file for the expression parser.) This style of starting with a top-down algorithm and switching to a bottom-up algorithm (such as operator precedence) when needed is sometimes called "left corner" (LC) parsing.
So that's basically the architecture which I said wouldn't surprise me, and it didn't :).
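For the curious, here is a modern toy reconstruction of that expression-parsing idea (not Ritchie's code, and greatly simplified): operands and operators go on two explicit stacks, and higher-precedence operators are reduced into subtrees before lower-precedence ones are pushed. Only single-digit operands and four operators are handled.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node { int op, val; struct Node *l, *r; } Node;

    static Node *leaf(int v) { Node *n = calloc(1, sizeof *n); n->op = 'n'; n->val = v; return n; }

    static int prec(int op) { return (op == '*' || op == '/') ? 2 : 1; }

    /* Pop one operator and two operands, push the combined subtree. */
    static void reduce(int ops[], int *nops, Node *vals[], int *nvals)
    {
        Node *n = calloc(1, sizeof *n);
        n->op = ops[--*nops];
        n->r = vals[--*nvals];
        n->l = vals[--*nvals];
        vals[(*nvals)++] = n;
    }

    static Node *parse_expr(const char *s)
    {
        int ops[64], nops = 0, nvals = 0;
        Node *vals[64];
        for (; *s; s++) {
            if (isspace((unsigned char)*s)) continue;
            if (isdigit((unsigned char)*s)) { vals[nvals++] = leaf(*s - '0'); continue; }
            /* operator: first reduce anything on the stack with >= precedence */
            while (nops > 0 && prec(ops[nops - 1]) >= prec(*s))
                reduce(ops, &nops, vals, &nvals);
            ops[nops++] = *s;
        }
        while (nops > 0) reduce(ops, &nops, vals, &nvals);
        return vals[0];
    }

    /* Print the tree as a fully parenthesised expression. */
    static void show(const Node *n)
    {
        if (n->op == 'n') { printf("%d", n->val); return; }
        putchar('('); show(n->l); printf(" %c ", n->op); show(n->r); putchar(')');
    }

    int main(void)
    {
        show(parse_expr("1 + 2 * 3 - 4"));   /* prints ((1 + (2 * 3)) - 4) */
        putchar('\n');
        return 0;
    }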
It's worth noting that the compiler unearthed by Alexey (and also by @torek in a comment to this post) does not handle anything close to what we generally consider the C language these days. In particular, it handles only a small subset of the declaration syntax (no structs or unions, for example), which is probably the most complicated part of the K&R C grammar. So it does not answer your question about how to produce a "simple" parser for C.
C is (mostly) parseable with an LALR(1) grammar, although you need to implement some version of the "lexer hack" in order to correctly parse cast expressions. The input to the parser (translation phase 7) will be a stream of tokens produced by the preprocessing code (translation phase 4, probably incorporating phases 5 and 6), which itself may draw upon a (f)lex tokenizer (phase 3) whose input will have been sanitized in some fashion according to phases 1 and 2. (See §5.1.1.2 of the C standard for a precise definition of the translation phases.)
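A minimal sketch of the "lexer hack" mentioned above, with invented token codes and an invented table: the parser records every name introduced by a typedef, and the lexer consults that table so it can return TYPEDEF_NAME instead of IDENTIFIER, which is what lets an LALR(1) parser tell a cast (T)(x) apart from a call f(x).

    #include <stdio.h>
    #include <string.h>

    enum { IDENTIFIER = 258, TYPEDEF_NAME = 259 };   /* illustrative token codes */

    static const char *typedef_names[1024];
    static int n_typedefs;

    /* The parser calls this when it reduces "typedef ... name;". */
    void remember_typedef(const char *name)
    {
        typedef_names[n_typedefs++] = name;
    }

    /* The lexer calls this for every identifier-shaped lexeme. */
    int classify_identifier(const char *lexeme)
    {
        for (int i = 0; i < n_typedefs; i++)
            if (strcmp(typedef_names[i], lexeme) == 0)
                return TYPEDEF_NAME;
        return IDENTIFIER;
    }

    int main(void)
    {
        remember_typedef("size_t");
        printf("%d %d\n", classify_identifier("size_t") == TYPEDEF_NAME,
                          classify_identifier("foo") == IDENTIFIER);   /* 1 1 */
        return 0;
    }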
Sadly, (f)lex was not designed to be part of a pipeline; it really wants to handle the whole task of reading the source itself. However, flex can be convinced to let you provide chunks of input by redefining the YY_INPUT macro. Handling trigraphs (if you choose to do that) and line continuations can be done using a simple state machine; conveniently, these transformations only shrink the input, which simplifies handling of the maximum-input-length parameter to YY_INPUT. (Don't provide input one character at a time, as suggested by the example in the flex manual.)
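As a sketch (untested, with a hypothetical helper name), the YY_INPUT redefinition might look like this in the definitions section of the .l file; here the filter only splices backslash-newline line continuations, which, as noted, can only shrink the input:

    #include <stdio.h>

    /* Read up to max bytes of already-spliced input into buf; return the count.
     * fill_from_phase2 is an invented name for the phase-1/2 filter. */
    static int fill_from_phase2(char *buf, int max)
    {
        int n = 0, c;
        while (n < max && (c = getchar()) != EOF) {
            if (c == '\\') {
                int d = getchar();
                if (d == '\n') continue;          /* line continuation: drop both */
                buf[n++] = (char)c;
                if (d == EOF) break;
                if (n < max) buf[n++] = (char)d;
                else { ungetc(d, stdin); break; } /* save it for the next call */
                continue;
            }
            buf[n++] = (char)c;
        }
        return n;   /* 0 signals end of input to flex */
    }

    #define YY_INPUT(buf, result, max_size) \
        { (result) = fill_from_phase2((buf), (max_size)); }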
Since the preprocessor must produce a stream of tokens (at this point, whitespace is no longer important), it is convenient to use bison's push-parser interface. (Indeed, it is very often more convenient to use the push API.) If you take that suggestion, you will end up with phase 4 as the top-level driver of the parse.
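A sketch of what the resulting phase-4 driver loop could look like, assuming the grammar file declares %define api.push-pull push so bison generates the push API, and assuming the header produced by bison -d is named parser.tab.h; next_pp_token() is a hypothetical name for the preprocessor's token source:

    #include "parser.tab.h"   /* bison-generated: token codes, YYSTYPE, yypstate */

    extern int next_pp_token(YYSTYPE *val);   /* returns 0 at end of input (assumption) */

    int parse_translation_unit(void)
    {
        yypstate *ps = yypstate_new();
        int status, tok;
        YYSTYPE val;

        do {
            tok = next_pp_token(&val);             /* phase 4 produces tokens...      */
            status = yypush_parse(ps, tok, &val);  /* ...and pushes them into phase 7 */
        } while (status == YYPUSH_MORE);

        yypstate_delete(ps);
        return status;   /* 0 on success, as with yyparse() */
    }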
You could hand-build a preprocessor-directive parser, but getting #if expressions and pragmas right suggests the use of a separate bison parser for preprocessing.
If you just want to learn how to build a compiler, you might want to start with a simpler language such as Tiger, the language used as a running example in Andrew Appel's excellent textbooks on compiler construction.

Popular diagramming tools or methodologies for C [duplicate]

This question already has answers here:
Tools to get a pictorial function call graph of code [closed]
(7 answers)
Coming from a Java (and other OO) background, I got very cosy with my objects, natural encapsulation and polymorphism.
All this I expected; the one thing I didn't expect was to miss my class diagrams!
When the going gets tough, or you start to worry about over-coupling, they were always my first stop.
But I can't seem to find a C-style equivalent (that doesn't date from the mid-90s): a diagramming system or utility for C.
Have I just missed something? Is there a hidden gem out there somewhere?
Even just something to show function calls between files, so I can get an idea of what's going on where.
In short: does anyone have a suggestion (or tool) for how to model C file sets? Function calls, includes, etc.
Thanks.
You can generate C code from class diagrams with UML applications such as IBM Rational Rhapsody or Eclipse-based open source Topcased.
You can generate call graphs, caller graphs and dependency graphs from C code with doxygen, powered by Graphviz.

How does C work? [duplicate]

Possible Duplicate:
How was the first compiler written?
I'm asking this as a single question because what I'm really trying to ask is how, at the bottom, all of this is implemented. Here goes:
How was the first C compiler created? Since the C compiler is written in C itself, how was the first C compiler's source compiled?
Is C written in ASM? How are languages actually designed? Before we had high-level languages, the only way to build something was ASM; even if C is derived from earlier languages, how were they designed? (My guess is ASM.)
I'm getting confused about how C works down at the bottom. Since, at the bottom, everything is implemented by processor opcodes, my understanding was that C programs are "essentially" translated to syscalls, which are implemented by the kernel.
But then how are syscalls implemented? Do they directly correspond to opcodes, or is there another layer of abstraction?
How was the first C compiler created? Since the C compiler is written in C itself, how was the first C compiler's source compiled?
Bootstrapping: the first compiler for any language has to be written in something else (assembly, or an earlier language; for C, the lineage runs through BCPL and B). Once a minimal compiler exists, the compiler can be rewritten in its own language and compiled with the previous version, and the process repeats with each new version.
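For the second half of the question (how syscalls relate to opcodes): a system call is ultimately just a CPU instruction plus a register convention, and C library functions like write() are thin wrappers around it. A minimal sketch, assuming x86-64 Linux and GCC/Clang inline assembly:

    #include <stddef.h>

    /* On x86-64 Linux: syscall number goes in rax (1 = write), arguments in
     * rdi, rsi, rdx; the kernel's result comes back in rax; the syscall
     * instruction clobbers rcx and r11. */
    static long raw_write(int fd, const void *buf, size_t len)
    {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "0"(1L), "D"((long)fd), "S"(buf), "d"(len)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        raw_write(1, "hello from a raw syscall\n", 25);
        return 0;
    }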

Pascal to C converter [closed]

I'm writing a program that translates Pascal to C and I need some help. I started with the scanner generator Flex. I defined some rules and created a scanner that works more or less OK. It breaks Pascal source into tokens; for now it only prints what it finds. But I have no idea what I should do next. Are there any articles or books covering this subject? What is the next step?
Why do you want to do such a Pascal to C converter?
If you just want to run some Pascal programs, it is simpler to use (or improve) existing compilers like GPC, or Pascal-to-C translators such as p2c.
If you want to convert hand-written Pascal code to humanly readable (and improvable) C code, the task is much more difficult; in particular, you probably want to preserve the indentation and the comments, and keep the same names as much as possible (while avoiding clashes with system names), etc.!
You always want to parse into some abstract syntax tree, but the precise nature of those trees differs. Perhaps flex + bison or even ANTLR may or may not be adequate (you can always write a hand-written parser). Also, error recovery may or may not be important to you (aborting on the first syntax error is very easy; trying to make sense of an ill-written, syntactically incorrect Pascal source is quite hard).
If you want to build a toy Pascal compiler, consider using LLVM (or perhaps even the GCC middle-end and back-ends).
You might want to take a look at "Translating Between Programming Languages Using A Canonical Representation And Attribute Grammar Inversion" and references therein.
The most common approach would be to build a parse tree in your front end, and then walk through that tree outputting the equivalent C in the back end. This gives you the flexibility to perform any reordering of declarations that's required (IIRC Pascal supports use before declaration, but C doesn't). If you're using flex for the scanner, tradition would dictate using bison for the parser, although there are alternatives. If you look, you can probably find a freely available Pascal syntax in the format expected by bison.
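As a concrete (and entirely hypothetical) sketch of that "walk the tree, print C" back end, with invented node kinds: the front end would build nodes like these from the Pascal source, and the back end prints the C spelling of each one.

    #include <stdio.h>

    typedef enum { N_NUM, N_VAR, N_BINOP, N_ASSIGN } Kind;

    typedef struct Node {
        Kind kind;
        int value;              /* N_NUM                       */
        const char *name;       /* N_VAR                       */
        char op;                /* N_BINOP: '+', '-', '*', ... */
        struct Node *lhs, *rhs; /* N_BINOP / N_ASSIGN          */
    } Node;

    /* Emit the C equivalent of a (tiny) Pascal expression/statement tree. */
    static void emit_c(const Node *n)
    {
        switch (n->kind) {
        case N_NUM:    printf("%d", n->value); break;
        case N_VAR:    printf("%s", n->name); break;
        case N_BINOP:
            putchar('('); emit_c(n->lhs);
            printf(" %c ", n->op);
            emit_c(n->rhs); putchar(')');
            break;
        case N_ASSIGN: /* Pascal "x := e" becomes C "x = e;" */
            emit_c(n->lhs); printf(" = "); emit_c(n->rhs); printf(";\n");
            break;
        }
    }

    int main(void)
    {
        /* x := (y + 2) * 3   becomes   x = ((y + 2) * 3); */
        Node two = {N_NUM, 2}, three = {N_NUM, 3};
        Node y = {.kind = N_VAR, .name = "y"}, x = {.kind = N_VAR, .name = "x"};
        Node sum  = {.kind = N_BINOP, .op = '+', .lhs = &y,   .rhs = &two};
        Node prod = {.kind = N_BINOP, .op = '*', .lhs = &sum, .rhs = &three};
        Node stmt = {.kind = N_ASSIGN, .lhs = &x, .rhs = &prod};
        emit_c(&stmt);
        return 0;
    }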
You have to know the Pascal grammar and the C grammar, and build (design) "something" (a grammar, an automaton, ...) that can translate every Pascal rule into the corresponding C rule.
Then, once you have your token stream, you can use a method such as LR parsing to find the syntax tree corresponding to the sequence of Pascal rules applied, and convert every rule into the corresponding C rule (this can be done easily with Bison).
Note that Pascal and C do not have purely context-free grammars, so some extra control will be necessary.

Which language is useful to create a report for a valid C program [closed]

Can anyone suggest a programming language that would be helpful for creating a tool to analyse a given C program and generate a txt or HTML report containing information about the program (function list, variable list, etc.)?
The program I intend to build is similar to doxygen, but I want it for my personal use.
ctags, perhaps?
Ctags generates an index (or tag) file of language objects found in source files that allows these items to be quickly and easily located by a text editor or other utility. A tag signifies a language object for which an index entry is available (or, alternatively, the index entry created for that object).
Both Python and Perl have excellent string processing capabilities.
I'd suggest using something like ctags to parse the program, and just create a script to read the ctags file and output in txt/html.
The file format used by ctags is well-defined so that other programs can read it. See http://ctags.sourceforge.net for more information on ctags itself and the file it uses.
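As a sketch of that approach in C (assuming the usual Exuberant/Universal Ctags tags-file format of tab-separated name, file and ex-command fields, with header lines starting with "!_TAG_"), the report generator can be very small:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[4096];
        FILE *f = fopen("tags", "r");   /* e.g. the output of: ctags -R . */
        if (!f) { perror("tags"); return 1; }

        puts("<html><body><ul>");
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "!_TAG_", 6) == 0) continue;   /* skip header lines */
            char *name = strtok(line, "\t");
            char *file = strtok(NULL, "\t");
            if (name && file)
                printf("  <li><code>%s</code> (%s)</li>\n", name, file);
        }
        puts("</ul></body></html>");
        fclose(f);
        return 0;
    }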
You're opening a big can of worms, this isn't an effective use of your time, blah blah blah, etc.
Moving on to an answer, if you're talking about anything beyond trivial analysis and you need accuracy, you will need to parse the C source code. You can do that in any language, but you will almost certainly want to generate your parser from a high-level grammar. There are any number of tools for that. A modern and particularly powerful parser generator is ANTLR; there are a number of ANTLR grammars for C, including easier-to-work-with subsets.
Look into scripting languages. I'd recommend Python or Perl.
Haskell has a relatively recent language-c project http://www.sivity.net/projects/language.c which allows the analysis of C code.
If you are familiar with Haskell, then it might be worth a look. Even if you are not, it might be interesting to have a go.
If it's a programming language you want, then I'd say something known for its string-processing power, which would mean Perl.
However, the task you describe can be rather complicated, since you need to "know" the language: you would have to follow the same steps the compiler does, namely lexical and grammatical analysis of the language (think flex, think yacc), in order to truly know what meaning those strings have.
Perhaps the best starting point is to take a look at doxygen and try to reuse as much of the work done there as possible.
Lex/yacc are appropriate for building parsers.
pycparser is a complete parser for ANSI C89/C90 written in pure Python. It's being widely used to analyze C source code for various needs. It comes with some example code, such as listing all the function definitions in files, etc.
