Parsing multiple C source files

I have multiple C source files and their respective header files. I am trying to parse these files using a parser generator, e.g. ANTLR.
In an ANTLR parser grammar, you can name your header files using
#parser::includes
{#include"a.h"}
You can start parsing the first file e.g.
CommonTree tree = Parser.start("a.c");
and the parser will parse the header file
a.h
But how do you parse the files if you have multiple source files, e.g. b.c, c.c and so on, with their respective header files?

C is a pig to parse: the semantic type of a token depends on what it's been declared as. Consider:
T(*b)[4]
If T is a type name, then this is a variable declaration (b is a pointer to an array of four T). If T names a function, it's a function call whose result is then indexed. In order to resolve this, any C parser that expects to actually work is going to have to keep a full type environment, which means it needs to be an unpleasantly large chunk of a C compiler.
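A minimal sketch of that ambiguity, giving the same token sequence both meanings in one file. Only T(*b)[4] comes from the text above; table, as_expression and as_declaration are names made up for illustration.

```c
#include <stddef.h>

static int table[8] = {0, 10, 20, 30, 40, 50, 60, 70};

/* At file scope, T is an ordinary function returning a pointer into table. */
static int *T(int offset) { return table + offset; }

/* T names a function here, so T(*b)[4] is an expression:
   call T(*b), then index the returned pointer with [4]. */
int as_expression(void)
{
    int one = 1;
    int *b = &one;
    return T(*b)[4];      /* T(1)[4] is table[5] */
}

/* Inside this function a typedef hides the function T, so the very same
   tokens are now a declaration of b: pointer to an array of 4 ints. */
size_t as_declaration(void)
{
    typedef int T;
    T(*b)[4];
    return sizeof *b;     /* size of int[4]; b itself is never read */
}
```

A parser that does not track the typedef cannot tell these two functions apart at the T(*b)[4] line, which is why the symbol table has to be consulted while parsing.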
There are ANTLR parsers for C that get all this stuff right, but they're not trivial to use and I don't have any experience with them, so I can't comment there.
Instead, you might want to look at using external tools to parse your C into something that's easier to deal with. gcc-xml is one such; it uses gcc itself to parse source files and then spits out XML that's much easier to handle.

Related

Is it ok to store functions in header files that aren't shared across multiple source files?

What if you have a small number of structures, functions and macros but want to move them out of the source file, to make the source code more concise and readable and to reduce the number of lines of code?
Are structures, functions or macros (data in general) accessible/viewable by examining the binary even if they are never used in the source code? And if so, how?
For the sake of readability, is it safe to move structures, functions and macros from a source file into a header file that is used by multiple source files, even if some source files don't use all of the structures, functions and macros (for small header files)?

Are structures, functions or macros (data in general) accessible/viewable
by examining the binary even if they are never used in the
source code?
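Depends on what you build. If you build a library (.a, .so, .lib, .dll, whatever), the functions are probably in there, and everything in that library is accessible in some way. If you build an executable, unused code may well be left out: the linker only pulls in the parts of a static library that are actually needed, and compilers and linkers can discard unused functions when the build asks for it. Macros and structure declarations produce no code of their own, so they do not show up in the binary as such.
Have a look at nm, which lists the symbols in an object file, library or executable.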
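As a rough illustration of what nm can show, here is a hypothetical shapes.c containing one function that nothing calls. The file and function names are made up, and the commands in the comment assume a GCC- or Clang-style toolchain on a Unix-like system.

```c
/* shapes.c (hypothetical)
 *
 *   cc -c shapes.c        compile to an object file, no linking
 *   nm shapes.o           list the symbols that ended up in it
 *
 * Both functions below have external linkage, so the compiler has to emit
 * both, and nm lists each of them as a code symbol even though nothing in
 * this file calls unused_perimeter.
 */
int used_area(int w, int h)
{
    return w * h;
}

int unused_perimeter(int w, int h)
{
    return 2 * (w + h);
}

/* The only caller in this file: it uses used_area and nothing else. */
int room_area(void)
{
    return used_area(3, 4);
}
```

Whether unused_perimeter survives into a final executable depends on how the program is built, so running nm on the executable as well is the way to find out.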
For the sake of readability is it safe to cut structures, functions
and macros from a source file into a header file that is used by
multiple source files
Yes and no. Put declarations of functions/structs in header files and their implementations in .c files. Don't put a lot of unrelated functions and structs in one header. You end up including all those declarations in every source file even though you're using 5% of them. That means extra work for your compiler, probably some extra work for your linker, and extra work for your brain and future programmers' brains when reading all this unnecessary stuff.
So, guessing what's happening in your code base, you probably want to put them in separate header files.
Be careful when using macros, and even more so when putting them in header files. You should avoid this most of the time.
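A minimal sketch of that split, using a made-up point module. It is shown as a single listing so that it compiles as is; in a real project the first part would live in point.h and the second in point.c, which would start with #include "point.h".

```c
/* ---- point.h (hypothetical): declarations only ---- */
#ifndef POINT_H
#define POINT_H

struct point {
    int x;
    int y;
};

/* Declaration: tells every file that includes the header that the
   function exists and how to call it. */
int point_manhattan(const struct point *p);

#endif /* POINT_H */

/* ---- point.c (hypothetical): the one place with the implementation ---- */
#include <stdlib.h>   /* abs */

int point_manhattan(const struct point *p)
{
    return abs(p->x) + abs(p->y);
}
```

The include guard (POINT_H) keeps the struct from being declared twice when several headers include point.h.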
even if some source files don't use all of the structures, functions
and macros
That is quite common. You include some standard C headers too and don't use all of the functions and structs in there, right? Just (as I said) put together what belongs together.

Understanding GCC dependency pragma directive

I was exploring the pragmas that gcc supports and I just didn't get what the manual says about #pragma GCC dependency:
#pragma GCC dependency allows you to check the relative dates of the current file and another file. If the other file is more recent than the current file, a warning is issued. This is useful if the current file is derived from the other file, and should be regenerated. The other file is searched for using the normal include search path. Optional trailing text can be used to give more information in the warning message.
Can anyone explain this part with some minimal code?
This is useful if the current file is derived from the other file
How can the current file be derived from the other file? I can understand how another file can be derived from the current file but not vice versa.

How can the current file be derived from the other file? I can
understand how another file can be derived from the current file but
not vice versa.
The primary case served is when a C source file is created by a program, using the designated other file as an input. The C source is derived from the other file by running the program. It is to be presumed that differences in the other file would cause the code generator program to produce the C file differently, at least a little, else the pragma in question would not be used.
Thus, if the designated other file's last-modification timestamp is more recent than the C file's, then it is highly suspect to be compiling the C file at all, for it probably does not correspond to the current version of the other file. Instead, one should regenerate the C source from the other file by running the code generator program again, obtaining a whole new version of the C file to replace the current one. The new one will, of course, have a last modification timestamp newer than the other file's, because the other file had to exist before the new version of the C file could be generated from it.
Example:
There is a classic program named lex whose purpose is to help write programs that process text, especially the text of programming languages or rich data languages (the details are not important). The input file for this program describes how to recognize and categorize the basic units of this language, which are called "tokens". If the language being parsed were C, then tokens would include language keywords, numeric constants, and operators. The input file for lex is typically tens of lines long.
lex reads such an input file and writes a C source file defining several functions and some internal tables that implement the "scanning" behavior required: reading the input text and breaking it up into tokens, which it reports back to its caller. The C source generated by this program is typically a few thousand lines long, which, compared to the much smaller input file, explains in a nutshell why lex is useful.
To build a program that scans the language in question, one provides functions (in a different source file) that call those generated by lex, and compiles them along with the lex-generated C source to obtain a complete program. Say the lex input file is named language.l, and the output of running lex on that file is named language.c. If I want to change the behavior of the scanner functions then the thing to do is to modify (small, simple) language.l and then re-run lex to regenerate language.c.
When I change language.l in any meaningful way, language.c becomes out of date until I generate a new version of it from language.l by re-running lex. If I compile the outdated version of language.c then the result does not reflect the current version of language.l. This usually constitutes an error on the part of the person building the program, and #pragma GCC dependency provides a mechanism for eliciting a warning from the compiler in that situation.
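In terms of minimal code, the generated language.c from this example could carry a line like the one below. This is only a sketch: lex does not add the pragma on its own, so assume it gets there some other way, for instance from the %{ ... %} block of language.l, which lex copies into its output. The trailing text is free-form and chosen here for illustration.

```c
/* language.c: generated from language.l by lex; do not edit by hand. */

/* Warn if language.l has been modified more recently than this file.
   The text after the file name is optional and is added to the warning. */
#pragma GCC dependency "language.l" re-run lex to regenerate language.c

/* ... the scanner code emitted by lex would follow here ... */
```

If language.l is edited and language.c is then compiled without being regenerated, GCC warns that the current file is older than language.l. Once lex has been re-run, language.c is the newer file again and the warning goes away.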

What is the difference between include and link when linking to a library?

What do include and link REALLY do? What are the differences? And why do I need to specify both of them?
When I write #include <math.h> and then pass -lm when compiling, what do #include <math.h> and -lm do, respectively?
In my understanding, when linking a library, you need its .h file and its .o file. Does this suggest that #include <math.h> means take in the .h file, while -lm takes in the .o file?

The reason that you need both a header (the interface description) and the library (the implementation) is that C separates the two more clearly than languages like C# or Java do. One can compile a C function (e.g. by invoking gcc -c <sourcefile>) which calls library code even when the called library is not present; the header, which contains the interface description, suffices. (This is not possible with C# or Java; the assemblies or class files/jars must be present.) During the link stage, though, the library must be there, even when it's dynamic, afaik.
With C#, Java, or script languages, by contrast, the implementation contains all information necessary to define the interface. The compiler (which is not as clearly separated from the linker) looks in the jar file or the C# assembly which contain called implementations and obtains information about function signatures and types from there.
Theoretically, that information could probably be present in a library written in C as well; it's basically the debug information. But the classic C compiler (as opposed to the linker) is oblivious to libraries or object files and cannot parse them. (One should remember that the "compiler" executable you usually use to compile a C program, e.g. gcc, is a "compiler driver" which interprets the command line arguments and calls the programs which actually do stuff, e.g. the preprocessor, actual compiler and actual linker, to create the desired output.)
So in theory, if you have a properly annotated library in a known location, you could probably write a compiler which compiles a C function against it without having function declarations and type definitions; the compiler would have to produce the proper declarations. The compiler would have to know which library to parse (which corresponds to setting a C# project "Reference" in VS or having a class path and name/class correspondence in Java).
It would probably be easiest to use a well-known debugging format like stabs or dwarf and extract the interface definitions from it with a little helper program which uses the API for the debug format, extracts the information and produces a C header which is prepended to every source file. That would be the job of the compiler driver, and the actual compiler would still be oblivious to that.

It's because header files contain only declarations, while .o files (or .obj, .dll or .lib) contain the definitions of functions.
If you open an .h file, you will not see the code of the functions, because that is in the libraries.
One reason is commercial: you may need to ship your code while keeping the source code inside your company. Libraries are compiled, so you can publish them without the source.
Header files only tell the compiler what functions and types it can find in the library.

The header files are kind of a table of contents plus a kind of dictionary for the compiler. They tell the compiler what the library offers and give special values readable names.
The library file itself contains the contents.

What you are asking about are two entirely different things.
Don't worry, I will explain them to you.
You use #include to instruct the preprocessor to include the math.h header file, which contains the function prototypes of fabs(), ceil(), etc.
And you use -lm to instruct the linker to include the pre-compiled function definitions of fabs(), ceil(), etc. in the executable file.
Now, you may ask why we have to explicitly link the library of math functions, unlike for other functions. The answer is historical: on many Unix-like systems the math functions live in a separate library (libm) rather than in the main C library.

Parsing C header files to extract information about data types, functions and function arguments

I have a C header file. I want to parse it and extract information about data types, functions and function arguments. Who can help me? I need some example in C.
Thank you very much.

You could try Clang, in particular its lexer and preprocessor library.

Use ANTLR. There's a decent grammar for C already written for you, and ANTLR will generate a parser in C (or in some other language if you prefer) that builds a tree, which you can then traverse to get what you want.

There is also srcML.
It is similar to c2xml, but it works on the source code directly, whereas c2xml starts from preprocessor output.
Assuming good C coding rules (as opposed to arbitrary use of preprocessing), this has been an advantage for my re-engineering tasks, as it preserves the names of #defines and makes it possible to process selected macros in a specific way.

The DMS Software Reengineering Toolkit with its C Front End can do this.
DMS provides general purpose parsing, symbol table construction, flow analysis, and program transformations, parameterized by a language definition. Using its C front end, DMS parses any of a variety of C dialects, builds ASTs for the code elements, and builds full symbol tables with complete name and type resolution of all symbols (including parameter lists in function headers); you can stop there and dump those out. DMS can also do control and data flow analysis on the C code; you can use other DMS facilities to further analyze or transform the code. (The C front end has a full C preprocessor built in.)
The EDG front end can also be used for parsing and symbol tables, but does not have the other capabilities of DMS.

Yet another option is to use the c2xml tool from "sparse". Its C parser isn't 100% standard-compliant (e.g. it won't parse K&R-style declarations), but for reasonably modern C code it works quite well.

If you need human-readable output (e.g. in HTML or PDF), then you can use doxygen/doxywizard. In doxywizard, "All entities" has to be selected.

A Java programmer has questions regarding C header files

I have a fair amount of practice with Java as a programming language, but I am completely new to C. I understand that a header file contains forward declarations for methods and variables. How is this different from an abstract class in Java?

The short answer:
Abstract classes are a concept of object-oriented programming. Header files are a necessity due to the way that the C language is constructed. The two cannot be compared in any way.
The long answer:
To understand the header file, and the need for header files, you must understand the concepts of "declaration" and "definition". In C and C++, a declaration means that you declare that something exists somewhere, for example a function.
void Test(int i);
We have now declared that somewhere in the program there exists a function Test that takes a single int parameter. When you have a definition, you define what it is:
void Test(int i)
{
...
}
Here we have defined what the function void Test(int) actually is.
Global variables are declared using the extern keyword:
extern int i;
They are defined without the extern keyword:
int i;
When you compile a C program, you compile each source file (.c file) into an .obj file. Definitions are compiled into the .obj file as actual code. When all these have been compiled, they are linked into the final executable. Therefore, a function should be defined in only one .c file; otherwise the same function would end up in the program more than once, and the linker will normally reject that with a duplicate-symbol error. The same goes for global variables: if a variable were defined twice, it would be unclear which instance each part of the code is supposed to use.
But functions defined in one .c file cannot see functions defined in another .c file. So if, from file1.c, you need to access the function Test(int) defined in file2.c, you need to have a declaration of Test(int) present when compiling file1.c. When file1.c is compiled into file1.obj, the resulting .obj file will contain information that it needs Test(int) to be defined somewhere. When the program is linked, the linker will identify that file2.obj contains the function that file1.obj depends on.
If there is no .obj file containing the definition for this function, you will get a linker error, not a compiler error (linker errors are considerably more difficult to find and correct than compiler errors, because you get no source filename and line number).
So you use the header file to store declarations for the definitions stored in the corresponding source file.
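A compact sketch of the whole arrangement, shown as one listing so that it compiles as is. Test(int) is the function from the text; call_count and run_tests are made up for illustration, and the comments mark which part would sit in which file.

```c
/* ---- what file1.c sees: declarations only (normally via a header) ---- */
extern int call_count;      /* declared here, defined somewhere else */
void Test(int i);           /* declared here, defined somewhere else */

/* file1.c can be compiled with nothing more than the declarations above;
   its object file just records that it needs Test and call_count. */
int run_tests(void)
{
    Test(1);
    Test(2);
    return call_count;
}

/* ---- what file2.c provides: the single definition of each ---- */
int call_count = 0;

void Test(int i)
{
    (void)i;                /* the argument is unused in this sketch */
    call_count++;
}
```

If the second half were missing from every file in the program, each file would still compile, and the failure would only appear when linking.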

IMO it's mainly because many C programmers seem to think that Java programmers don't know how to program “for real”, e.g. handling pointers, memory and so on.
I would rather compare headers to Java interfaces, in the sense that they generally define how the API must be used.
Headers are basically just a way to avoid copy-pasting: the preprocessor simply includes the content of the header in the source file when it encounters an #include directive.
You put in a header every declaration that the user will commonly use.

Here are the answers:
Java has had a bad reputation among some hardcore C programmers mainly because they think:
it's "too easy" (no memory-management, segfaults)
"can't be used for serious work"
"just for the web" or,
"slow".
Java is hardly the easiest language in the world these days, compared to languages like Python.
It is used in many desktop apps; applets aren't even used that often. Finally, Java is often slower than C, because it runs on a virtual machine instead of being compiled ahead of time to machine code. Sometimes, though, extreme speed isn't needed. Anyway, the JVM isn't the slowest language VM ever.
When you're working in C, there aren't abstract classes.
All a header file does is contain code which is pasted into other files. The main reason you put it in a header file is so that it is at the top of the file - this way, you don't need to care where you put your functions in the actual implementation file.
While you can kind-of use OO concepts in C, it doesn't have built-in support for classes and similar fundamentals of OO. Inheritance can only be emulated by hand in plain C (for example with nested structs and function pointers), so you never get real OO, or abstract classes for that matter. I would suggest sticking to plain old structs.
If it makes it easier for you to learn, by all means think of them as abstract classes (with the implementation file being the inheriting class), but IMHO it is a difficult mindset to use when working in a language without explicit support for said features.
I'm not sure if Java has them, but I think a closer analogue could be partial classes in C#.
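For what "kind-of OO in C" tends to look like in practice, here is a minimal sketch using a struct with a function pointer as a hand-made stand-in for an abstract class. All names (shape, rectangle and so on) are made up for illustration, and this is one common idiom rather than a language feature.

```c
/* The "abstract class": data every shape has, plus one "abstract method". */
struct shape {
    const char *name;
    int (*area)(const struct shape *self);
};

/* A "subclass": the base struct comes first, so a pointer to a rectangle
   can also be used as a pointer to its shape part. */
struct rectangle {
    struct shape base;
    int width;
    int height;
};

/* The rectangle's implementation of the "abstract method". */
static int rectangle_area(const struct shape *self)
{
    const struct rectangle *r = (const struct rectangle *)self;
    return r->width * r->height;
}

/* "Constructor": fills in the function pointer by hand. */
struct rectangle rectangle_make(int width, int height)
{
    struct rectangle r = { { "rectangle", rectangle_area }, width, height };
    return r;
}

/* Works for any shape without knowing its concrete type. */
int shape_area(const struct shape *s)
{
    return s->area(s);
}
```

Everything a Java compiler would do for you (wiring up the method, checking the cast) is done by hand here, which is the reason for the advice to stick to plain structs while learning.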

If you forward declare something and use it, you have to actually deliver and implement it, else the linker will complain. The header allows you to display a "module"'s public API and make the declarations available (for type checking and so on) to other parts of the program.

Comprehensive reading: Learning C from Java. Recommended reading for developers who are coming from Java to C.

I think that there is much derision (mockery, laughter, contempt, ridicule) for Java simply because it's popular.
Abstract classes and interfaces specify a contract or a set of functions that can be invoked on an object of a certain type. Function prototypes in C only really do compile time type checking of function arguments/return values.

While your first question seems subjective to me, I will answer the second one:
A header file contains the declarations, which are then made available to other files via #include by the preprocessor.
For instance, you declare a function in a header and implement it in a .c file. Other files will be able to use the function so long as they can see the declaration (by including the header file).
At linking time the linker will look among the object files, or the various libraries linked, for some object which provides the code for the function.
A typical pattern is: you distribute the header files for your library, and a dll (for instance) which contains the object code. Then in your application you include the header, and the compiler will be able to compile because it will find the declaration in the header. No need to provide the actual implementation of the code, which will be available for the linker through the dll.

C programs run directly, while Java programs run inside the JVM, so a common belief is that Java programs are slow. Java also hides some low-level constructs from you (pointers, direct memory access), memory management, etc.
In C, the declaration and the definition of a function are separate. The declaration "declares" that there exists a function that, when called with those arguments, returns something. The definition "defines" what the function actually does. The former is done in header files, the latter in the actual code. When you are compiling your code, you must use the header files to tell your compiler that there is such a function, and link in a binary that contains the binary code for the function.
In Java, the binary code itself also contains the declaration of the functions, so it is enough for the compiler to look at the class files to get both the definition and declaration of the available functions.
