I have built a small tool for static analysis of C code. Its purpose is to warn users about the use of functions such as strcpy() that could cause buffer overflows.
Now, to formalise this, I need to write a formal grammar that shows the excluded library functions as NOT part of the allowed set of accepted library functions.
For example,
AllowedSentence->ANSI C Permitted Code, NOT UnSafeLibraryMethods
UnSafeLibraryMethods->strcpy|other potentially unsafe methods
Any ideas on how this grammar can be formalised?
I think this should not be done at the grammar level. It should be a rule that is applied to the parse tree after parsing is done.
You hardly need a parser for the way you have posed the problem. If your only goal is to object to the presence of certain identifiers ("strcpy"), you can simply build a lexer that processes C and picks out identifiers. Special lexemes can recognize your list of "you shouldn't use this" names. This way you use positive recognition instead of negative recognition to pick out the identifiers that you believe to be trouble.
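A rough, self-contained sketch of that lexer-only approach follows. It is not a full C lexer (it does not skip comments or string literals, so it can flag names that appear inside them), and the banned list is just a sample:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Sample list only -- extend with whatever you consider unsafe. */
static const char *banned[] = { "strcpy", "strcat", "sprintf", "gets", NULL };

static int is_banned(const char *id) {
    for (int i = 0; banned[i]; i++)
        if (strcmp(id, banned[i]) == 0)
            return 1;
    return 0;
}

int main(void) {
    int c, line = 1;
    char id[128];

    while ((c = getchar()) != EOF) {
        if (c == '\n') { line++; continue; }
        if (isalpha(c) || c == '_') {          /* start of an identifier */
            size_t n = 0;
            do {
                if (n < sizeof id - 1) id[n++] = (char)c;
                c = getchar();
            } while (c != EOF && (isalnum(c) || c == '_'));
            id[n] = '\0';
            if (is_banned(id))
                printf("line %d: use of unsafe function %s?\n", line, id);
            if (c == '\n') line++;
        }
    }
    return 0;
}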
If you want a more sophisticated analysis tool, you'll likely want to parse C, name-resolve the identifiers to their actual definitions, then scan the tree looking for identifiers that are objectionable. This will at least let you decide whether the identifier is actually defined by the user or comes from some known library; surely, if my code defines strcpy, you shouldn't complain unless you know my strcpy is defective somehow.
Related
I am currently in the process of writing a C to assembly compiler. It is not meant to be practical, but I would like to do it for the educational value. When I am testing for keywords, is there a more efficient way than just reading in the next word from the file and running it through a bunch of nested if statements that test for each keyword? Is there a better way?
Your question is actually quite specific. You are asking about how to build the lexical analyzer, also known as the scanner, and how to efficiently and conveniently recognize keywords. The scanner is the first phase of a typical compiler, and it converts the source code, which is a sequence of characters, to a sequence of tokens, where a token is a unit such as a number, an operator or a keyword.
Since keywords match the pattern for general identifiers, a common trick is to put all the keywords in the symbol table, together with information that each one is a keyword. Then, when the scanner finds an identifier, it searches the symbol table as usual to see if that identifier has been seen before. If the identifier is a keyword, it will be found, together with the information about which keyword it is.
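A minimal sketch of that trick. The symbol table here is a deliberately tiny linear-search array, and names like preload_keywords and classify are just for illustration; a real scanner would use hashing and check for table overflow:

#include <string.h>

enum token_kind { TOK_IDENT, TOK_IF, TOK_WHILE, TOK_RETURN /* ... one per keyword ... */ };

typedef struct {
    const char     *name;
    enum token_kind kind;    /* TOK_IDENT for ordinary names, else which keyword it is */
} Symbol;

static Symbol table[1024];   /* sketch only: no overflow check */
static int    nsyms;

static Symbol *lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

static Symbol *insert(const char *name, enum token_kind kind) {
    table[nsyms].name = strdup(name);   /* strdup is POSIX / C23 */
    table[nsyms].kind = kind;
    return &table[nsyms++];
}

/* Call once before scanning starts: keywords go in first, marked as keywords. */
void preload_keywords(void) {
    insert("if", TOK_IF);
    insert("while", TOK_WHILE);
    insert("return", TOK_RETURN);
}

/* The scanner calls this for every identifier-shaped lexeme it reads. */
enum token_kind classify(const char *lexeme) {
    Symbol *s = lookup(lexeme);
    if (s == NULL)
        s = insert(lexeme, TOK_IDENT);   /* first sighting: ordinary identifier */
    return s->kind;                      /* a keyword token or TOK_IDENT */
}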
Are you doing this as part of a class? If so, there should be guidelines on parsing and lexing. If not, you're in for a lot of work!
Writing an actual compiler is much more complicated than just going through a bunch of if statements, because you need to keep track of the environment. You'll need to think about how you allow classes, functions, function calls, class instantiations, recursive functions... the list goes on.
Take a look at course lectures from UC Berkeley on the subject (parsing, lexing, code generation) and the tools you'll need:
http://www-inst.eecs.berkeley.edu/~cs164/fa13/
Note that this course in particular used C++ to write a Python 2.5 to assembly compiler, but the concepts in the lectures and readings, and some of the tools, are not language-restricted.
Keywords (as opposed to tokens in general) form a closed set, for which it is practical to generate a collision-free (perfect) hash function. Because the set is small, the hash function does not even need to be minimal.
You can do it with a bunch of if / else if statements and strcmp(). However, writing statements for all of the keywords gets annoying very quickly. You'd be better off using a hash table: at the start of the compilation you put all the keywords in the table, and then you do lookups as needed. The drawback is that if you have to use C, you'll also have to write your own hash table (or use one from a library). If you can use C++, though, you can use a map or an unordered_map from the STL. In any case, if you are worried about performance, as someone else mentioned, this won't be a bottleneck.
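A minimal sketch of such a keyword table in C. The table size, the djb2-style hash and the sample keyword list are arbitrary choices made only for illustration:

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 64   /* comfortably larger than the number of C keywords */

static const char *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert with linear probing; the table is sized so it never fills up. */
static void insert(const char *kw) {
    unsigned i = hash(kw);
    while (table[i] != NULL) i = (i + 1) % TABLE_SIZE;
    table[i] = kw;
}

static int is_keyword(const char *word) {
    unsigned i = hash(word);
    while (table[i] != NULL) {
        if (strcmp(table[i], word) == 0) return 1;
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;
}

int main(void) {
    static const char *keywords[] = { "if", "else", "while", "for", "return",
                                      "int", "char", "void", "struct", NULL };
    for (int i = 0; keywords[i]; i++)   /* done once, at the start of compilation */
        insert(keywords[i]);

    printf("%d %d\n", is_keyword("while"), is_keyword("whale"));   /* prints: 1 0 */
    return 0;
}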
I am trying to build the semantic phase of a C compiler using lex and yacc. Right now the problem is that if I have multiple errors in the C program, it stops after the first one. What can I do?
I strongly recommend that you perform the semantic analysis as a separate phase, not as a part of the parsing phase. Use YACC only to build an abstract syntax tree, then traverse this tree in a separate function. Said function will have unlimited freedom when it comes to moving around in the tree, as opposed to having to "follow the parsing". As for the specific problem you mentioned, #pmg's comment seems to have pinpointed the problem.
There is no one absolute answer to this. A typical way to handle it is to create a special pattern that reads symbols until it reaches (for example) a semicolon at the end of a line, which gives a reasonable signal that whatever follows is intended as a new declaration, definition, statement, etc. Parsing then re-starts from that point, retaining enough context to know that, for example, you are currently parsing a function body, so you accept or reject input on that basis.
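In yacc, the built-in error token exists for exactly this kind of resynchronization. In a hand-written parser, the same panic-mode idea looks roughly like the sketch below; the token names, next_token() and the globals are placeholders for whatever your own scanner and parser use:

#include <stdio.h>

enum { TOK_EOF, TOK_SEMI, TOK_RBRACE, TOK_OTHER };   /* placeholder token kinds */

extern int next_token(void);   /* your scanner */
extern int current_token;
extern int error_count;

/* Report the error, then skip tokens until a likely synchronization point
 * (';' or '}') and resume, so later errors in the file are still reported. */
void syntax_error(const char *msg, int line) {
    fprintf(stderr, "line %d: %s\n", line, msg);
    error_count++;

    while (current_token != TOK_SEMI &&
           current_token != TOK_RBRACE &&
           current_token != TOK_EOF)
        current_token = next_token();

    if (current_token != TOK_EOF)
        current_token = next_token();   /* consume the synchronizing token */
}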
I have a "a pain in the a$$" task to extract/parse all standard C functions that were called in the main() function. Ex: printf, fseek, etc...
Currently, my only plan is to read each line inside the main() and search if a standard C functions exists by checking the list of standard C functions that I will also be defining (#define CFUNCTIONS "printf...")
As you know there are so many standard C functions, so defining all of them will be so annoying.
Any idea on how can I check if a string is a standard C functions?
If you have heard of cscope, try looking into the database it generates. There are instructions available in the cscope front end for listing all the functions that a given function calls.
If you look at the list of the calls from main(), you should be able to narrow down your work considerably.
If you have to parse by hand, I suggest starting with the included standard headers. They should give you a decent idea about which functions you could expect to see in main().
Either way, the work sounds non-trivial and interesting.
Parsing C source code seems simple at first blush, but as others have pointed out, the possibility of a programmer getting far off the leash by using #defines and #includes is rather common. Unless it is known that the specific program to be parsed is mild-mannered with respect to text substitution, the complexity of parsing arbitrary C source code is considerable.
Consider the less used, but far more effective tactic of parsing the object module. Compile the source module, but do not link it. To further simplify, reprocess the file containing main to remove all other functions, but leave declarations in their places.
Depending on the requirements, there are two ways to complete the task:
Write a program which opens the object module and iterates through the external reference symbol table. If the symbol matches one of the interesting function names, list it. Many platforms have library functions for parsing an object module.
Write a command file or script which uses the developer tools to examine object modules. For example, on Linux, the command nm lists undefined external references with a U.
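As a rough sketch of that second approach on Linux (assuming nm is on the PATH; the object file name and the list of interesting functions are placeholders, and the substring match is deliberately crude):

#define _POSIX_C_SOURCE 200809L   /* for popen/pclose */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *interesting[] = { "printf", "fseek", "fclose", NULL };

    FILE *p = popen("nm -u main.o", "r");   /* -u: undefined (external) symbols only */
    if (!p) { perror("popen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, p)) {
        for (int i = 0; interesting[i]; i++)
            if (strstr(line, interesting[i]))       /* crude: also matches e.g. fprintf */
                printf("main.o references: %s", line);
    }
    pclose(p);
    return 0;
}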
The task may look simple at first, but in order to be really 100% sure you would need to parse the C file. It is not sufficient to just look for the name; you need to know the context as well, i.e. where the identifier appears. Only once you have determined that the identifier is actually a function call can you check whether it is a standard C runtime function.
(plus I guess it makes the task more interesting :-)
I don't think there's any way around having to define a list of standard C functions to accomplish your task. But it's even more annoying than that -- consider macros, for example:
#define OUTPUT(foo) printf("%s\n",foo)
main()
{
OUTPUT("Ha ha!\n");
}
So you'll probably want to run your code through the preprocessor before checking which functions are called from main(). Then you might have cases like this:
some_func("This might look like a call to fclose(fp), but surprise!\n");
So you'll probably need a full-blown parser to do this rigorously, since string literals may span multiple lines.
I won't bring up trigraphs...that would just be pointless sadism. :-) Anyway, good luck, and happy coding!
I'm building my own language using Flex, but I want to know some things:
Why should I use lexical analyzers?
Are they going to help me in something?
Are they obligatory?
Lexical analysis helps simplify parsing because the lexemes can be treated as abstract entities rather than concrete character sequences.
You'll need more than flex to build your language, though: Lexical analysis is just the first step.
Any time you are converting an input string into space-separated strings and/or numeric values, you are performing lexical analysis. Writing a cascading series of else if (strcmp (..)==0) ... statements counts as lexical analysis. Even such nasty tools as sscanf and strtok are lexical analysis tools.
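For instance, even this crude use of strtok is doing lexical analysis of a sort (the input string is just a made-up example):

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "x = 42 + y";   /* made-up input */

    /* Splitting on whitespace: each piece strtok hands back is, in effect, a token. */
    for (char *tok = strtok(line, " \t"); tok != NULL; tok = strtok(NULL, " \t"))
        printf("token: %s\n", tok);

    return 0;
}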
You'd want to use a tool like flex instead of one of the above for one of several reasons:
The error handling can be made much better.
You can be much more flexible in what different things you recognize with flex. For instance, it is tough to parse a C-format hexadecimal value properly with scanf routines. scanf pretty much has to know the hex value is coming. Lex can figure it out for you.
Lex scanners are faster. If you are parsing a lot of files, and/or large ones, this could become important.
You would consider using a lexical analyzer because you could use BNF (or EBNF) to describe your language (the grammar) declaratively, then use a parser to parse a program written in your language, get it into a structure in memory, and then manipulate it freely.
It's not obligatory and you can of course write your own, but that depends on how complex the language is and how much time you have to reinvent the wheel.
Also, the fact that you can use a language (BNF) to describe your language without changing the lexical analyzer itself enables you to experiment a lot and change the grammar of your language until you have exactly what works for you.
Apparently (at least according to gcc -std=c99) C99 doesn't support function overloading. The reason for not supporting some new feature in C is usually backward compatibility, but in this case I can't think of a single case in which function overloading would break backward compatibility. What is the reasoning behind not including this basic feature?
To understand why you aren't likely to see overloading in C, it might help to better learn how overloading is handled by C++.
After compiling code, but before it is ready to run, the intermediate object code must be linked. This transforms a rough database of compiled functions and other objects into a ready-to-load/run binary file. This extra step is important because it is the principal mechanism of modularity available to compiled programs. This step allows you to take code from existing libraries and mix it with your own application logic.
At this stage, the object code may have been written in any language, with any combination of features. To make this possible, it's necessary to have some sort of convention so that the linker is able to pick the right object when another object refers to it. If you're coding in assembly language, when you define a label, that label is used exactly, because it is assumed you know what you're doing.
In C, functions become the symbol names for the linker, so when you write
int main(int argc, char **argv) { return 1; }
the compiler produces object code that contains a symbol called main.
This works well, but it means that you cannot have two objects with the same name, because the linker would be unable to decide which name it should use. The linker doesn't know anything about argument types, and very little about code in general.
C++ resolves this by encoding additional information into the symbol name directly. The return type, the number and types of the arguments, the reference type of the arguments, whether it's const or not, etc., are added to the symbol name, and are referred to that way at the point of a function call. The linker doesn't have to know this is even happening, since as far as it can tell, the function call is unambiguous.
The downside of this is that the symbol names don't look anything like the original function names. In particular, it's almost impossible to predict what the name of an overloaded function will be so that you can link to it. To link to foreign code, you can use extern "C", which causes those functions to follow the C style of symbol names, but of course you cannot overload such a function.
These differences are related to the design goals of each language. C is oriented toward portability and interoperability. C goes out of its way to do predictable and compatible things. C++ is more strongly oriented toward building rich and powerful systems, and not terribly focused on interacting with other languages.
I think it is unlikely for C to ever pursue any feature that would produce code that is as difficult to interact with as C++.
Edit: Imagist asks:
Would it really be less portable or more difficult to interact with a function if you resolved int main(int argc, char** argv) to something like main-int-int-char** instead of to main (and this were part of the standard)? I don't see a problem here. In fact, it seems to me that this gives you more information (which could be used for optimization and the like).
To answer this, I will turn again to C++ and the way it deals with overloads. C++ uses this mechanism, almost exactly as described, but with one caveat. C++ does not standardize how certain parts of itself should be implemented, and then goes on to suggest how to deal with some of the consequences of that omission. In particular, C++ has a rich type system that includes virtual class members. How this feature should be implemented is left to the compiler writers, and the details of vtable resolution have a strong effect on function signatures. For this reason, C++ deliberately suggests that compiler writers make name mangling mutually incompatible across compilers, or across versions of the same compiler with different implementations of these key features.
This is just a symptom of the deeper issue that while higher-level languages like C++ and C have detailed type systems, the lower-level machine code is totally typeless. Arbitrarily rich type systems are built on top of the untyped binary provided at the machine level. Linkers do not have access to the rich type information available to the higher-level languages. The linker is completely dependent on the compiler to handle all of the type abstractions and produce properly type-free object code.
C++ does this by encoding all of the necessary type information in the mangled object names. C, however, has a significantly different focus, aiming to be a sort of portable assembly language. C thus prefers a strict one-to-one correspondence between the declared name and the resulting object's symbol name. If C mangled its names, even in a standardized and predictable way, you would have to go to great efforts to match the altered names to the desired symbol names, or else you would have to turn mangling off as you do in C++. This extra effort comes with almost no benefit, because unlike C++, C's type system is fairly small and simple.
At the same time, it is practically standard practice to define several similarly named C functions that vary only by the types they take as arguments. For a lengthy example of just this, have a look at the OpenGL namespace.
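A small made-up example of that convention; OpenGL names such as glVertex3f and glVertex3i follow the same pattern, with the suffix encoding the argument types:

#include <stdio.h>

/* With no overloading, the argument types are encoded in the names themselves. */
void print_int(int x)         { printf("%d\n", x); }
void print_double(double x)   { printf("%f\n", x); }
void print_str(const char *s) { printf("%s\n", s); }

int main(void) {
    print_int(42);           /* the caller, not the compiler, picks the right name */
    print_double(3.14);
    print_str("hello");
    return 0;
}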
When you compile a C source file, symbol names remain intact. If you introduce function overloading, you have to provide a name mangling technique to prevent name clashes. Consequently, just as in C++, you'll have machine-generated symbol names in the compiled binary.
Also, C does not feature strict typing. Many things are implicitly convertible to each other in C. The complexity of overload resolution rules could introduce confusion in such a language.
Lots of language designers, including me, think that the combination of function overloading with C's implicit promotions can result in code that is heinously difficult to understand. For evidence, look at the body of knowledge accumulated about C++.
In general, C99 was intended to be a modest revision largely compatible with existing practice. Overloading would have been a pretty big departure.