Extracting the call graph with the accessed variables and called functions from C Source Code or C binary - text-parsing

I would like to know a reliable and convenient way (preexisting libraries, C compilers) to extract specific information from C source code or a C binary. The input format does not really matter; the result is what matters ;-).
Basically the idea is to extract the read and write accesses to variables on a function level as well as called subfunctions recursively, top to bottom.
That means I want to find out which variables a certain parent function A accesses directly (read, write, or both), and then recursively extract the same information from the subfunctions B and C called by function A. Similarly, the variables accessed directly by subfunctions B and C shall be identified, and the same procedure shall continue recursively for all subfunctions of B and C.
I think of it as a call graph, where
Nodes represent a function
Arrows represent a function call
Each node contains additional information about:
name of the accessed variable
access type to a variable (Read, Write or Read/Write)
I attached a picture showing the call graph below to illustrate the idea better. Please be aware that the variable access information shall be collected for each function, and not only for the parent function A.
I am aware that there are preexisting compilers and libraries that can parse C source code or work directly on the binary, such as the clang compiler, LLVM, the libclang Python library and pycparser. I am, however, not sure whether those tools are capable of achieving my task. For example, from my understanding the combination of Clang and LLVM can extract the variable accesses from the binary application file, but not recursively, so to speak. That means all of the variable accesses done by child functions B, C, E, F, etc. would be directly assigned to the parent function A.
I definitely want to avoid writing a script (e.g. in Python) of my own that would extract this information from the C source code manually; I would rather use a well-tested, well-documented tool/API/compiler for this task.
Thank you a lot in advance!

Related

How can C have no inner functions, but languages implemented in C can?

I've been coding in scripting languages like Lua recently, and the existence of anonymous inner functions has got me thinking. How can a language implemented in C, like Lua, have inner functions while in C, no matter what you do, you cannot escape the fact that functions must be declared well beforehand at compile time? Does this mean that in C there is actually a way to achieve inner functions, and it's simply a matter of implementing a huge code base to make them possible?
For example
void *block = malloc(sizeof(1) * 1024); // somehow
// write bytes to this memory address to make it operate
// like an inner function?
// is that even possible?
char (*letterFunct)(int) = (char (*)(int))block;
// somehow trick C into thinking this block is a function?
printf("%c\n", (*letterFunct)(5)); // call it
What is the key concept I'm missing that bridges this gap in understanding why some languages with advanced features (classes, objects, inner functions, multi threading) can be implemented in languages devoid of all of those?
Just because the compiler / interpreter for a particular language is written in C doesn't mean that that language has to translate into C and then get compiled.
I don't know about Lua, but in the case of Java the code is compiled to Java Byte Code, which (loosely speaking) a Java VM reads and interprets.
The original C compiler was written in assembly, and the original C++ compiler was written in C, so it's possible to write a compiler for a higher-level language in a lower-level language.
Closures and inner functions usually involve passing an extra (hidden) argument to the function that holds the environment of the closure. In C you don't have those hidden extra arguments, so to implement closures or inner functions in C, you need to make those extra arguments explicit. So to implement an "inner function" in C, you might end up with something like:
struct shared_locals {
    // locals of function shared with inner function
};
int inner_function(struct shared_locals *sl, /* other args to inner function */...) {
    // code for the inner function -- shared locals accessed via sl
}
int function(...) {
    struct shared_locals sl; // the shared locals
    // call inner function directly
    inner_function(&sl, ...);
    // pass inner function as a callback
    func_with_callback(inner_function, &sl);
}
The above kind of code is why 'callbacks' in C code usually involve both a function pointer and an extra void * argument that is passed to the callback.
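A compilable version of that pattern might look like the sketch below. The names for_each, count_above and count_greater_than are made up for illustration; the point is only that the "inner function" receives its environment through the void * argument.

```c
#include <stddef.h>

/* State the "inner function" shares with its enclosing function. */
struct shared_locals {
    int threshold;   /* read by the callback */
    int matches;     /* written by the callback */
};

/* Callback type: a function pointer plus an opaque context pointer. */
typedef void (*item_callback)(int item, void *ctx);

/* Generic iterator: calls cb(item, ctx) for every element. */
static void for_each(const int *items, size_t n, item_callback cb, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        cb(items[i], ctx);
}

/* The "inner function": its environment arrives through ctx. */
static void count_above(int item, void *ctx)
{
    struct shared_locals *sl = ctx;
    if (item > sl->threshold)
        sl->matches++;
}

/* The "outer function" that owns the shared locals. */
int count_greater_than(const int *items, size_t n, int threshold)
{
    struct shared_locals sl = { threshold, 0 };
    for_each(items, n, count_above, &sl);
    return sl.matches;
}
```

Because sl lives on the stack of count_greater_than, the callback must not be kept and called after that function returns; that lifetime limit is what real closures remove.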
To implement inner functions, you need closures. To implement closures you need some more advanced mechanism of allocating local variables than just the stack. C was meant to be a lightweight language, so advanced concepts like closures and garbage collectors were excluded.
C++ is kind of extension of C that has all the advanced concepts, closures and inner functions included.
Your example with a memory block that you fill with assembler code: You can do that, but it will not be portable. It would require cooperation from operating system and from the compiler. The only portable solution I can think of would be embedding the compiler into every executable, which is again too much.
And this would still be a normal non-inner function. To implement inner functions compiled at runtime you would again need closures.
Lua is a program---just like any other program---you run it, it reads input, it produces output. When you run lua MyProgram.lua, the lua program reads from the file MyProgram.lua, and it writes output to the console. As with many other programs, what it spits out depends on what it read in.
The lua program is written in C.
If your MyProgram.lua file contains print("x") at top level, then when the lua program reads that line it will print x.
Note: It was lua that printed x. It wasn't really MyProgram.lua. MyProgram.lua is just a data file. The lua program reads it in, and it uses the data to decide what it's supposed to do.
When the lua program reads that line, it doesn't "translate" the line into C or into any other language. It just does what the line says to do.
It prints x.
There's a name for that: We say that the lua program interprets MyProgram.lua.
Note: I lied. The lua program doesn't really do anything. The lua program is just a data file. When you type lua MyProgram.lua, the computer reads the data into memory, and then it uses the data to decide what it is supposed to do.
When we talk about a computer system, we speak at different levels of abstraction. When we say, "the computer hardware did X," we are speaking about a low level of abstraction. When we say, "MyProgram.lua did Z", we are speaking about a higher level of abstraction. And, when we say that the lua interpreter did something, we are talking about a level somewhere in-between.
In between the hardware and the end user's experience, you can find many levels of abstraction if you look deep enough.
But, back to Lua...
If your MyProgram.lua contains function p() print("y") end at top level, then the Lua program doesn't do anything with that right away. It just remembers what you wanted p() to mean. Then later, if it sees p() at top-level, then it prints y.
You could write the program that does those things (i.e., you could write Lua) in almost any language. Your choice of what language you used to implement lua might affect the internal architecture of your Lua interpreter, but it need not limit the language that your interpreter understands (i.e., the Lua language) in any way.
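To make "it just remembers what you wanted p() to mean" concrete, here is a toy sketch in C. It is not how Lua works internally; the three-command mini-language (print, def, call) and all names are invented for illustration. The definition is stored as plain data, and only a later call acts on it.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEFS 16
#define MAX_TEXT 64

/* A remembered definition: a name and the text it should print. */
struct definition {
    char name[MAX_TEXT];
    char text[MAX_TEXT];
};

struct interp {
    struct definition defs[MAX_DEFS];
    int ndefs;
};

/* Interprets one line. Anything "printed" is appended to out, which must
 * already hold a NUL-terminated string. Returns 0 on success, -1 on error. */
int interp_line(struct interp *in, const char *line, char *out, size_t cap)
{
    char name[MAX_TEXT], text[MAX_TEXT];

    if (sscanf(line, "print %63[^\n]", text) == 1) {
        /* Act immediately: do what the line says. */
        strncat(out, text, cap - strlen(out) - 1);
        return 0;
    }
    if (sscanf(line, "def %63s %63[^\n]", name, text) == 2) {
        /* Nothing visible happens: just remember what the name means. */
        if (in->ndefs == MAX_DEFS)
            return -1;
        strcpy(in->defs[in->ndefs].name, name);
        strcpy(in->defs[in->ndefs].text, text);
        in->ndefs++;
        return 0;
    }
    if (sscanf(line, "call %63s", name) == 1) {
        /* Look the name up and act on the remembered definition. */
        for (int i = 0; i < in->ndefs; i++) {
            if (strcmp(in->defs[i].name, name) == 0) {
                strncat(out, in->defs[i].text, cap - strlen(out) - 1);
                return 0;
            }
        }
    }
    return -1; /* unknown command or undefined name */
}
```

Feeding it "def p y" produces no output; a later "call p" appends y. Nothing in the C source knows about p: the definition exists only as data inside the running interpreter.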
You're confusing the C source code with the binary executable. The Lua interpreter (the program that reads and runs Lua scripts) is written in C. But after it's compiled, it's not C anymore. It would behave the same if it were written in Fortran (assuming it compiled to the same binary CPU instructions).
There's no such thing as a "running C environment". There are only binary machine instructions. The CPU doesn't know C any more than it knows French.
As far as how Lua handles inner functions, the designers of Lua sat down and figured out all of the context they would need to keep track of whenever the interpreter encounters an inner function, and wrote the code to assemble and keep track of that context for as long as the inner function is viable. The inner function is a specifically Lua construct -- it has nothing to do with C, because when the Lua interpreter is running, there is no C anywhere.

How to detect the language that lib caller is using?

I am writing a static library in C++ and expect it to be used from either Fortran or C. Since Fortran indices start from 1, I have to do some index adjustment inside my library when it is called from Fortran, because I pass an array of indices to the library and they are important for further computation.
Of course, an intuitive way to solve this problem is to add an argument to the interface that lets users tell me what language they are using, but I don't think that is a cool way to do this.
So I wonder if there is any way to detect in my library whether it is called from Fortran or C?
Thanks!
If you are just passing arrays and their lengths, there shouldn't be any issue. The problem is only if you pass an index, then you need to know what that index is relative to. (Which in Fortran could be any value, if the array is explicitly declared to start with an index other than one). If you have this case, my suggestion is to write glue routines for one of the languages that will convert the index values, then call the regular library routines. The problem with this solution is that it obligates the user of the "special" language to call the special glue routines; calling the regular routines is a mistake.
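As a rough sketch of the glue-routine idea, assuming a hypothetical library routine sum_selected that takes zero-based indices: the one-based entry point shifts each index and then reuses the regular routine. Making the glue routine callable from Fortran (name mangling, pass-by-reference arguments or bind(C)) is left out here.

```c
#include <stddef.h>

/* Regular library routine: indices are zero-based, as a C caller passes them. */
double sum_selected(const double *values, const int *indices, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += values[indices[i]];
    return total;
}

/* Glue routine for callers that use one-based indices (default Fortran
 * arrays). It converts each index, then calls the regular routine. */
double sum_selected_one_based(const double *values, const int *indices,
                              size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++) {
        int zero_based = indices[i] - 1;
        total += sum_selected(values, &zero_based, 1);
    }
    return total;
}
```

With values {10, 20, 30}, the C indices {0, 2} and the one-based indices {1, 3} select the same two elements.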
Applications and libraries in any language will be built to target the same ABI. The ABI defines calling conventions and other details that make it possible for two functions built by different compilers (possibly for different languages) to call each other. There shouldn't be anything obviously different in the calls because great effort has gone into avoiding those differences.
You could look for out-of-band information like symbols provided by the FORTRAN compiler (or pulled in from its utility libraries). You would declare some symbol with the weak attribute and if it became valid you would know that there was some FORTRAN somewhere in the current executable image. However, you could not know if it was calling you directly or simply due to some other library pulled in.
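A minimal sketch of the weak-symbol check, assuming GCC or Clang on an ELF platform such as Linux. The symbol _gfortran_set_args is assumed here to be provided by GNU Fortran's runtime library; other Fortran compilers would need a different symbol, so treat the name as an illustration rather than a reliable test.

```c
#include <stddef.h>

/* Weak reference: if nothing in the final image defines this symbol, its
 * address resolves to NULL instead of causing a link error.
 * ASSUMPTION: _gfortran_set_args comes from the GNU Fortran runtime. */
extern void _gfortran_set_args(int, char **) __attribute__((weak));

/* Returns 1 if the GNU Fortran runtime appears to be present, else 0. */
int fortran_runtime_present(void)
{
    return _gfortran_set_args != NULL;
}
```

As noted above, a result of 1 only says that Fortran code is somewhere in the executable image, not that the current caller is Fortran.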
The right solution appears to be using explicit wrappers to call C from FORTRAN: Calling a FORTRAN subroutine from C
It would be better if you set a start index. Fortran arrays do not necessarily have to start at 1. They can start at any number. The array may have been declared as
real, dimension(-20:20) :: neg
real, dimension(4:99) :: pos
So passing an index of 5 for neg would mean the 26th element and passing an index of 5 for pos would mean the 2nd element.
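On the C side, accepting the start index explicitly could be as small as the following sketch (to_offset is a made-up helper name): the caller passes the lower bound its array was declared with, and the library converts every index to a zero-based offset.

```c
#include <stddef.h>

/* Converts a caller-side index into a zero-based offset, given the lower
 * bound of the caller's array: 0 for C, 1 for default Fortran arrays,
 * -20 for dimension(-20:20), and so on.
 * Assumes index >= lower_bound. */
size_t to_offset(int index, int lower_bound)
{
    return (size_t)(index - lower_bound);
}
```

For the two arrays above, to_offset(5, -20) is 25 (the 26th element) and to_offset(5, 4) is 1 (the 2nd element).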

Is it possible to LD_PRELOAD a function with different parameters?

Say I replace a function by creating a shared object and using LD_PRELOAD to load it first. Is it possible to have parameters to that function different from the one in original library?
For example, if I replace pthread_mutex_lock, such that instead of parameter pthread_mutex_t it takes pthread_my_mutex_t. Is it possible?
Secondly, besides function, is it possible to change structure declarations using LD_PRELOAD? For example, one may add one more field to a structure.
Although you can arrange to provide your modified pthread_mutex_lock() function, the code will have been compiled to call the standard function. This will lead to problems when the replacement is called with the parameters passed to the standard function. This is a polite way of saying:
Expect it to crash and burn
Any pre-loaded function must implement the same interface — same name, same arguments in, same values out — as the function it replaces. The internals can be implemented as differently as you need, but the interface must be the same.
Similarly with structures. The existing code was compiled to expect one size for the structure, with one specific layout. You might get away with adding an extra field at the end, but the non-substituted code will probably not work correctly. It will allocate space for the original size of structure, not the enhanced structure, etc. It will never access the extra element itself. It probably isn't quite impossible, but you must have designed the program to handle dynamically changing structure sizes, which places severe enough constraints on when you can do it that the answer "you can't" is probably apposite (and is certainly much simpler).
IMNSHO, the LD_PRELOAD mechanism is for dire emergencies (and is a temporary band-aid for a given problem). It is not a mechanism you should plan to use on anything remotely resembling a regular basis.
LD_PRELOAD does one thing, and one thing only. It arranges for a particular DSO file to be at the front of the list that ld.so uses to look up symbols. It has nothing to do with how the code uses a function or data item once found.
Anything you can do with LD_PRELOAD, you can simulate by just linking the replacement library with -l at the front of the list. If, on the other hand, you can't accomplish a task with that -l, you can't do it with LD_PRELOAD.
The effects of what you're describing are conceptually the same as the effects of providing a mismatching external function at normal link time: undefined behavior.
If you want to do this, rather than playing with fire, why don't you make your replacement function also take pthread_mutex_t * as its argument type, and then just convert the pointer to pthread_my_mutex_t * in the function body? Normally this conversion will take place only at the source level anyway; no code should be generated for it.
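A sketch of that suggestion follows. The layout of pthread_my_mutex_t is an assumption made for illustration: it embeds the standard mutex as its first member, which is what makes the pointer conversion well defined. The function is named my_mutex_lock so the sketch can be compiled and tried on its own; a real interposer would be named pthread_mutex_lock and would have to locate the original function itself, for example with dlsym(RTLD_NEXT, ...).

```c
#include <pthread.h>

/* Hypothetical extended mutex. The standard mutex must be the FIRST member
 * so that a pointer to it is also a valid pointer to the whole struct. */
typedef struct {
    pthread_mutex_t base;
    unsigned long lock_count;   /* extra bookkeeping field */
} pthread_my_mutex_t;

/* Same parameter type as pthread_mutex_lock, so code compiled against the
 * standard prototype stays compatible.
 * ASSUMPTION: every mutex passed in was allocated as a pthread_my_mutex_t.
 * For a plain pthread_mutex_t, touching lock_count is out of bounds. */
int my_mutex_lock(pthread_mutex_t *mutex)
{
    pthread_my_mutex_t *mine = (pthread_my_mutex_t *)mutex;
    int rc = pthread_mutex_lock(&mine->base);
    if (rc == 0)
        mine->lock_count++;
    return rc;
}
```

The assumption in the comment is the catch: existing code that allocates plain pthread_mutex_t objects does not satisfy it, which is the structure-size problem described earlier.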

How to implement standard C function extraction?

I have a "pain in the a$$" task: extract/parse all standard C functions that are called in the main() function, e.g. printf, fseek, etc.
Currently, my only plan is to read each line inside main() and check whether it contains a standard C function, by comparing against a list of standard C functions that I would also define myself (#define CFUNCTIONS "printf...").
As you know, there are so many standard C functions that defining all of them would be very annoying.
Any idea how I can check whether a string is a standard C function?
If you have heard of cscope, try looking into the database it generates. There are instructions available at the cscope front end to list out all the functions that a given function has called.
If you look at the list of the calls from main(), you should be able to narrow down your work considerably.
If you have to parse by hand, I suggest starting with the included standard headers. They should give you a decent idea about which functions could you expect to see in main().
Either way, the work sounds non-trivial and interesting.
Parsing C source code seems simple at first blush, but as others have pointed out, the possibility of a programmer getting far off the leash by using #defines and #includes is rather common. Unless it is known that the specific program to be parsed is mild-mannered with respect to text substitution, the complexity of parsing arbitrary C source code is considerable.
Consider the less used, but far more effective tactic of parsing the object module. Compile the source module, but do not link it. To further simplify, reprocess the file containing main to remove all other functions, but leave declarations in their places.
Depending on the requirements, there are two ways to complete the task:
Write a program which opens the object module and iterates through the external reference symbol table. If the symbol matches one of the interesting function names, list it. Many platforms have library functions for parsing an object module.
Write a command file or script which uses the developer tools to examine object modules. For example, on Linux, the command nm lists external references with a U.
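For the script-style approach, the part that picks undefined symbols out of nm output could look like this sketch. It assumes nm's default output format, in which an undefined symbol has no address and appears as leading spaces, the letter U, a space, and the name; undefined_symbol is a made-up helper name.

```c
#include <stddef.h>
#include <string.h>

/* Parses one line of `nm` output (default format). If the line describes
 * an undefined symbol, copies its name into name and returns 1.
 * Returns 0 for defined symbols, malformed lines, or names too long for
 * the buffer. */
int undefined_symbol(const char *line, char *name, size_t cap)
{
    /* Undefined symbols look like: "                 U printf" */
    const char *p = line + strspn(line, " \t");
    if (p[0] != 'U' || p[1] != ' ')
        return 0;
    p += 2;

    size_t len = strcspn(p, " \t\r\n");
    if (len == 0 || len >= cap)
        return 0;
    memcpy(name, p, len);
    name[len] = '\0';
    return 1;
}
```

Each name found this way would still have to be compared against a list of standard function names. Note also that the object file reflects what the compiler emitted rather than what the source says: GCC, for example, may turn a printf call with a constant string into a call to puts.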
The task may look simple at first, but to be really 100% sure you would need to parse the C file. It is not sufficient to just look for the name; you need to know the context as well, i.e. when to check the identifier. Only once you have determined that the identifier is a function can you check whether it is a standard C runtime function.
(plus I guess it makes the task more interesting :-)
I don't think there's any way around having to define a list of standard C functions to accomplish your task. But it's even more annoying than that -- consider macros,
for example:
#define OUTPUT(foo) printf("%s\n",foo)
main()
{
OUTPUT("Ha ha!\n");
}
So you'll probably want to run your code through the preprocessor before checking
which functions are called from main(). Then you might have cases like this:
some_func("This might look like a call to fclose(fp), but surprise!\n");
So you'll probably need a full-blown parser to do this rigorously, since string literals
may span multiple lines.
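To illustrate the string-literal pitfall, here is a small scanner sketch that looks for an identifier followed by an opening parenthesis while skipping string literals, character literals and comments. It is far from a full parser: it assumes it is given only the already preprocessed text of main()'s body, and it cannot tell a call from a declaration. The name calls_function is made up.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if src contains the identifier `name` followed by '(' outside
 * of string literals, character literals and comments; otherwise 0. */
int calls_function(const char *src, const char *name)
{
    size_t n = strlen(name);
    const char *p = src;

    while (*p) {
        if (*p == '"' || *p == '\'') {             /* skip a literal */
            char quote = *p++;
            while (*p && *p != quote) {
                if (*p == '\\' && p[1])
                    p++;                           /* skip the escaped char */
                p++;
            }
            if (*p)
                p++;                               /* closing quote */
        } else if (p[0] == '/' && p[1] == '*') {   /* skip a block comment */
            p += 2;
            while (*p && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p)
                p += 2;
        } else if (p[0] == '/' && p[1] == '/') {   /* skip a line comment */
            while (*p && *p != '\n')
                p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;                 /* read a whole identifier */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            size_t len = (size_t)(p - start);
            const char *q = p;
            while (isspace((unsigned char)*q))
                q++;
            if (len == n && strncmp(start, name, n) == 0 && *q == '(')
                return 1;
        } else {
            p++;
        }
    }
    return 0;
}
```

On the some_func example above, this reports a call to some_func but not to fclose, because the fclose text sits inside a string literal.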
I won't bring up trigraphs...that would just be pointless sadism. :-) Anyway, good luck, and happy coding!

Some general C questions

I am trying to fully understand the process from writing code in some language to its execution by the OS. In my case, the language would be C and the OS would be Windows. So far I have read many different articles, but I am not sure whether I understand the process correctly, and I would like to ask whether you know some good articles on some subjects I couldn't find.
So, what I think I know about C (and basically other languages):
The C compiler itself handles only data types, basic math operations, pointer operations, and work with functions. By work with functions I mean how to pass arguments to a function and how to get its output. During compilation, a function call is replaced by pushing the arguments onto the stack, and then, if the function is not inline, the call is replaced by some symbol for the linker. The linker then finds the function definition and replaces the symbol with a jump address to that function (and of course the program then jumps back afterwards).
If the above is generally true and I understand it correctly, where in the final .exe file does the linker actually store the functions? After the main() function? And what creates the .exe header, the compiler or the linker?
Now, the additional capabilities of C, today known as the C standard library, are a set of functions and their declarations that other programmers wrote to extend and simplify the use of the C language. But functions like printf() were (or could be?) written in a different language, or in assembler. And here comes my next question: can the printf() function, for example, be written in pure C without the use of assembler?
I know this is quite a big question, but I mostly just want to know whether I am right or not. And trust me, I have read a lot of articles on the web, and I would not ask you if I could find this information together in one place, in one article. Instead I must gather the information piece by piece, so I am not sure if I am right. Thanks.
I think that you're exposed to some information that is less relevant as a beginning C programmer and that might be confusing you - part of the goal of using a higher level language like this is to not have to initially think about how this process works. Over time, however, it is important to understand the process. I think you generally have the right understanding of it.
The C compiler merely takes C code and generates object files that contain machine language. Most of the object file is taken by the content of the functions. A simple function call in C, for example, would be represented in the compiled form as low level operators to push things into the stack, change the instruction pointer, etc.
The C library and any other libraries you would use are already available in this compiled form.
The linker is the thing that combines all the relevant object files, resolves all the dependencies (e.g., one object file calling a function in the standard library), and then creates the executable.
As for the language libraries are written in: Think of every function as a black box. As long as the black box has a standard interface (the C calling convention; that is, it takes arguments in a certain way, returns values in a certain way, etc.), how it is written internally doesn't matter. Most typically, the functions would be written in C or directly in assembly. By the time they make it into an object file (or as a compiled library), it doesn't really matter how they were initially created, what matters is that they are now in the compiled machine form.
The format of an executable depends on the operating system, but much of the body of the executable in windows is very similar to that of the object files. Imagine as if someone merged together all the object files and then added some glue. The glue does loading related stuff and then invokes the main(). When I was a kid, for example, people got a kick out of "changing the glue" to add another function before the main() that would display a splash screen with their name.
One thing to note, though is that regardless of the language you use, eventually you have to make use of operating system services. For example, to display stuff on the screen, to manage processes, etc. Most operating systems have an API that is also callable in a similar way, but its contents are not included in your EXE. For example, when you run your browser, it is an executable, but at some point there is a call to the Windows API to create a window or to load a font. If this was part of your EXE, your EXE would be huge. So even in your executable, there are "missing references". Usually, these are addressed at load time or run time, depending on the operating system.
I am a new user and this system does not allow me to post more than one link. To get around that restriction, I have posted some idea at my blog http://zhinkaas.blogspot.com/2010/04/how-does-c-program-work.html. It took me some time to get all links, but in totality, those should get you started.
The compiler is responsible for translating all your functions written in C into assembly, which it saves in the object file (DLL or EXE, for example). So, if you write a .c file that has a main function and a few other functions, the compiler will translate all of those into assembly and save them together in the EXE file. Then, when you run the file, the loader (which is part of the OS) knows to start running the main function first. Otherwise, the main function is just like any other function for the compiler.
The linker is responsible for resolving any references between functions and variables in one object file with the references in other files. For example, if you call printf(), since you do not define the function printf() yourself, the linker is responsible for making sure that the call to printf() goes to the right system library where printf() is defined. This is done at compile-time.
printf() can indeed be written in pure C. What it does is invoke a system call in the OS, which knows how to actually send characters to the standard output (like a terminal window). When you call printf() in your program, the linker is responsible at build time for linking your call to the printf() function in the standard C library. When the function is called at run time, printf() formats the arguments properly and then invokes the appropriate OS system call to actually display the characters.
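As a rough illustration that the formatting part needs nothing beyond C, here is a deliberately tiny printf-like sketch. It understands only %d, %s, %c and %%, and all function names are made up. The final output step assumes a POSIX system and uses write(); on Windows a different OS call would take its place.

```c
#include <stdarg.h>
#include <stddef.h>
#include <unistd.h>   /* write(): POSIX only */

/* Appends one character if there is room (keeps space for the '\0'). */
static void put_char(char *out, size_t cap, size_t *len, char c)
{
    if (*len + 1 < cap)
        out[(*len)++] = c;
}

/* Formats into out, handling only %d, %s, %c and %%. Returns the length. */
size_t mini_vformat(char *out, size_t cap, const char *fmt, va_list ap)
{
    size_t len = 0;
    for (; *fmt; fmt++) {
        if (*fmt != '%') {
            put_char(out, cap, &len, *fmt);
            continue;
        }
        switch (*++fmt) {
        case 'd': {
            int v = va_arg(ap, int);
            unsigned int u = v < 0 ? 0u - (unsigned int)v : (unsigned int)v;
            char digits[12];
            int n = 0;
            do {                      /* collect digits, least significant first */
                digits[n++] = (char)('0' + u % 10);
                u /= 10;
            } while (u);
            if (v < 0)
                put_char(out, cap, &len, '-');
            while (n)
                put_char(out, cap, &len, digits[--n]);
            break;
        }
        case 's':
            for (const char *s = va_arg(ap, const char *); *s; s++)
                put_char(out, cap, &len, *s);
            break;
        case 'c':
            put_char(out, cap, &len, (char)va_arg(ap, int));
            break;
        case '%':
            put_char(out, cap, &len, '%');
            break;
        default:                      /* unsupported conversion: ignored */
            if (*fmt == '\0')
                fmt--;                /* lone '%' at the end of the format */
            break;
        }
    }
    if (cap)
        out[len] = '\0';
    return len;
}

/* Formats into a caller-supplied buffer. */
size_t mini_snprintf(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    size_t len = mini_vformat(out, cap, fmt, ap);
    va_end(ap);
    return len;
}

/* Formats into a local buffer, then hands the bytes to the OS. */
int mini_printf(const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    va_start(ap, fmt);
    size_t len = mini_vformat(buf, sizeof buf, fmt, ap);
    va_end(ap);
    return (int)write(STDOUT_FILENO, buf, len);
}
```

Everything up to the write() call is ordinary C. Only that last step depends on the operating system, and even there the C code calls a library wrapper rather than containing assembler itself.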

Resources