How to implement standard C function extraction? - c

I have a "a pain in the a$$" task to extract/parse all standard C functions that were called in the main() function. Ex: printf, fseek, etc...
Currently, my only plan is to read each line inside main() and check whether it contains a standard C function by comparing it against a list of standard C functions that I would also define myself (#define CFUNCTIONS "printf...").
As you know, there are so many standard C functions that defining all of them would be very annoying.
Any idea how I can check whether a string is a standard C function?

If you have heard of cscope, try looking into the database it generates. There are instructions available at the cscope front end to list out all the functions that a given function has called.
If you look at the list of the calls from main(), you should be able to narrow down your work considerably.
If you have to parse by hand, I suggest starting with the included standard headers. They should give you a decent idea of which functions you could expect to see in main().
Either way, the work sounds non-trivial and interesting.

Parsing C source code seems simple at first blush, but as others have pointed out, it is rather common for a programmer to get far off the leash with #defines and #includes. Unless it is known that the specific program to be parsed is mild-mannered with respect to text substitution, the complexity of parsing arbitrary C source code is considerable.
Consider the less used, but far more effective tactic of parsing the object module. Compile the source module, but do not link it. To further simplify, reprocess the file containing main to remove all other functions, but leave declarations in their places.
Depending on the requirements, there are two ways to complete the task:
Write a program which opens the object module and iterates through the external reference symbol table. If the symbol matches one of the interesting function names, list it. Many platforms have library functions for parsing an object module.
Write a command file or script which uses the developer tools to examine object modules. For example, on Linux, the command nm lists external references with a U.
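As a rough sketch of the first option, assuming you feed the program the text printed by nm rather than opening the object module directly: the helper below (nm_undefined_symbol is a made-up name, not a library function) picks the symbol name out of one line of nm output when that line is marked U. It assumes the default nm layout, where undefined symbols have a blank address column.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if `line` is an nm entry of type U (undefined),
   copy the symbol name into `name` and return 1; otherwise return 0.
   Assumes the default nm layout: "<address or blanks> <type> <name>". */
int nm_undefined_symbol(const char *line, char *name, size_t size)
{
    const char *p = line;
    size_t len;

    while (isspace((unsigned char)*p))
        p++;                                   /* skip the blank address column */
    if (p[0] != 'U' || !isspace((unsigned char)p[1]))
        return 0;                              /* defined symbol or other line */
    p += 2;
    while (isspace((unsigned char)*p))
        p++;
    len = strcspn(p, " \t\r\n");               /* the name ends at whitespace */
    if (len == 0 || len >= size)
        return 0;
    memcpy(name, p, len);
    name[len] = '\0';
    return 1;
}
```

Each name it returns can then be compared against the list of interesting functions. Keep in mind that the compiler may have rewritten some calls, for example a simple printf of a constant string can show up as puts.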

The task may look simple at first, but in order to be really 100% sure you would need to parse the C file. It is not sufficient to just look for the name; you need to know the context as well, i.e. when to check the identifier. Only once you have determined that the identifier is a function can you check whether it is a standard C runtime function.
(plus I guess it makes the task more interesting :-)

I don't think there's any way around having to define a list of standard C functions to accomplish your task. But it's even more annoying than that -- consider macros,
for example:
#include <stdio.h>
#define OUTPUT(foo) printf("%s\n",foo)
int main(void)
{
    OUTPUT("Ha ha!\n");
    return 0;
}
So you'll probably want to run your code through the preprocessor before checking
which functions are called from main(). Then you might have cases like this:
some_func("This might look like a call to fclose(fp), but surprise!\n");
So you'll probably need a full-blown parser to do this rigorously, since string literals
may span multiple lines.
I won't bring up trigraphs...that would just be pointless sadism. :-) Anyway, good luck, and happy coding!
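For the list itself, here is a minimal sketch of how the lookup could be done once you have an identifier in hand. The table is deliberately tiny and the names std_functions and is_standard_function are made up for illustration; a real table would have to cover every function you care about and must stay sorted for bsearch to work.

```c
#include <stdlib.h>
#include <string.h>

/* Abbreviated, illustrative table. Must be kept in strcmp() order. */
static const char *const std_functions[] = {
    "fclose", "fopen", "fprintf", "free", "fseek",
    "malloc", "printf", "strlen"
};

/* bsearch passes the key first and a pointer to the table element second. */
static int cmp_name(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns nonzero if `name` is in the table above. */
int is_standard_function(const char *name)
{
    size_t count = sizeof std_functions / sizeof std_functions[0];
    return bsearch(name, std_functions, count,
                   sizeof std_functions[0], cmp_name) != NULL;
}
```

This only answers "is this string in my list"; it does nothing about macros or string literals, which is why the preprocessing and parsing steps above still matter.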


Is it possible to LD_PRELOAD a function with different parameters?

Say I replace a function by creating a shared object and using LD_PRELOAD to load it first. Is it possible to have parameters to that function different from the one in original library?
For example, if I replace pthread_mutex_lock, such that instead of parameter pthread_mutex_t it takes pthread_my_mutex_t. Is it possible?
Secondly, besides function, is it possible to change structure declarations using LD_PRELOAD? For example, one may add one more field to a structure.
Although you can arrange to provide your modified pthread_mutex_lock() function, the code will have been compiled to call the standard function. This will lead to problems when the replacement is called with the parameters passed to the standard function. This is a polite way of saying:
Expect it to crash and burn
Any pre-loaded function must implement the same interface — same name, same arguments in, same values out — as the function it replaces. The internals can be implemented as differently as you need, but the interface must be the same.
Similarly with structures. The existing code was compiled to expect one size for the structure, with one specific layout. You might get away with adding an extra field at the end, but the non-substituted code will probably not work correctly. It will allocate space for the original size of structure, not the enhanced structure, etc. It will never access the extra element itself. It probably isn't quite impossible, but you must have designed the program to handle dynamically changing structure sizes, which places severe enough constraints on when you can do it that the answer "you can't" is probably apposite (and is certainly much simpler).
IMNSHO, the LD_PRELOAD mechanism is for dire emergencies (and is a temporary band-aid for a given problem). It is not a mechanism you should plan to use on anything remotely resembling a regular basis.
LD_PRELOAD does one thing, and one thing only. It arranges for a particular DSO file to be at the front of the list that ld.so uses to look up symbols. It has nothing to do with how the code uses a function or data item once found.
Anything you can do with LD_PRELOAD, you can simulate by just linking the replacement library with -l at the front of the list. If, on the other hand, you can't accomplish a task with that -l, you can't do it with LD_PRELOAD.
The effects of what you're describing are conceptually the same as the effects of providing a mismatching external function at normal link time: undefined behavior.
If you want to do this, rather than playing with fire, why don't you make your replacement function also take pthread_mutex_t * as its argument type, and then just convert the pointer to pthread_my_mutex_t * in the function body? Normally this conversion will take place only at the source level anyway; no code should be generated for it.
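A minimal sketch of that last suggestion, under some loud assumptions: pthread_my_mutex_t is a hypothetical type whose first member is the real pthread_mutex_t, the function is called my_mutex_lock here so the sketch can be tried without interposing anything (in a real preload library it would carry the replaced function's name), and every mutex passed in must actually have been allocated as a pthread_my_mutex_t by code you control.

```c
#include <pthread.h>

/* Hypothetical extended mutex. The real mutex MUST be the first member,
   so a pointer to it is also a valid pointer to the whole structure. */
typedef struct {
    pthread_mutex_t base;
    unsigned long lock_count;   /* extra field, invisible to other code */
} pthread_my_mutex_t;

/* Same interface as the standard lock function: takes pthread_mutex_t *
   and returns int. The conversion happens only inside the body. */
int my_mutex_lock(pthread_mutex_t *m)
{
    pthread_my_mutex_t *mine = (pthread_my_mutex_t *)m;
    int rc = pthread_mutex_lock(&mine->base);
    if (rc == 0)
        mine->lock_count++;     /* only safe while the lock is held */
    return rc;
}
```

If any caller passes a plain pthread_mutex_t that was not allocated as the larger structure, the access to lock_count reads and writes past the object, which is exactly the crash-and-burn case described above.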

How to give name to the functions of a library header in C

I am trying to write a generic library in pure C, just some data structures like stack, queue...
I have some questions about naming the functions in my stack.h.
1. Can I use a name such as "init" for the function that initializes a stack? Will there be something wrong?
2. I know there may be other functions that do other things and have the same name "init". Would the program then be confused, especially when I include both headers that declare an init?
3. I know my worry may be unnecessary, but I still want to know the principle.
Any help is appreciated, thanks.
Can I use such name, for example "init" as the function name to init a
stack. Will there be something wrong?
Yes, if anyone else wants a function named init.
I know my worry may be unnecessary, but i still want to know the
principle
Your worry is necessary: this (the lack of namespaces) is a serious problem in C.
Export as few functions as possible. Make everything static if you can.
Prefix function names with something. For instance, instead of init, try stack_init.
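To make those two points concrete, here is a small sketch of what a stack.c could look like. The fixed capacity, the int element type and the struct layout are assumptions made up for the example; only the stack_init name comes from the advice above.

```c
#include <stddef.h>

#define STACK_CAPACITY 16   /* illustrative fixed size */

struct stack {
    int items[STACK_CAPACITY];
    size_t count;
};

/* static: internal linkage, so this name cannot collide with
   anything outside this source file. */
static int stack_is_full(const struct stack *s)
{
    return s->count == STACK_CAPACITY;
}

/* Exported functions all carry the stack_ prefix. */
void stack_init(struct stack *s)
{
    s->count = 0;
}

int stack_push(struct stack *s, int value)
{
    if (stack_is_full(s))
        return -1;              /* no room left */
    s->items[s->count++] = value;
    return 0;
}

int stack_pop(struct stack *s, int *out)
{
    if (s->count == 0)
        return -1;              /* nothing to pop */
    *out = s->items[--s->count];
    return 0;
}
```

Only stack_init, stack_push and stack_pop are visible to the linker. The type is spelled struct stack rather than stack_t on purpose: POSIX already defines stack_t in <signal.h>, which is the same collision problem in miniature.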
You don't have namespaces in C so usually you prefix every identifier with the name or nickname of your library.
init();
becomes
fancy_lib_init();
There might be existing libraries doing what you want (e.g. GLib). At least, study them a little before writing your own.
If you claim to develop a generic reusable C library, I suggest having naming conventions. For instance, have all the identifiers (notably function names, typedef-s, struct names...) share some common prefix.
Be systematic in your naming conventions. For instance, initializers for stacks and for queues should have similar names & signatures, and end with _init. Document your naming conventions.
Define very clearly how data should be allocated and released. Who should call free, and when?
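One possible shape for such conventions, as a sketch only: the fl_ prefix (short for the fancy_lib example above), the queue layout and the create/destroy pairing are all assumptions chosen for illustration. The ownership rule documented here is that the library allocates in fl_queue_create and the caller releases with fl_queue_destroy, never with free directly.

```c
#include <stdlib.h>

/* Every public identifier shares the hypothetical fl_ prefix. */
typedef struct fl_queue {
    int *items;
    size_t capacity, head, count;
} fl_queue;

/* Allocates a queue; returns NULL on failure. capacity should be > 0.
   Ownership: the caller must release it with fl_queue_destroy(). */
fl_queue *fl_queue_create(size_t capacity)
{
    fl_queue *q = malloc(sizeof *q);
    if (q == NULL)
        return NULL;
    q->items = malloc(capacity * sizeof *q->items);
    if (q->items == NULL) {
        free(q);
        return NULL;
    }
    q->capacity = capacity;
    q->head = 0;
    q->count = 0;
    return q;
}

/* Releases everything fl_queue_create() allocated. NULL is accepted. */
void fl_queue_destroy(fl_queue *q)
{
    if (q == NULL)
        return;
    free(q->items);
    free(q);
}

int fl_queue_push(fl_queue *q, int value)
{
    if (q->count == q->capacity)
        return -1;                                   /* full */
    q->items[(q->head + q->count) % q->capacity] = value;
    q->count++;
    return 0;
}

int fl_queue_pop(fl_queue *q, int *out)
{
    if (q->count == 0)
        return -1;                                   /* empty */
    *out = q->items[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return 0;
}
```

A stack in the same library would then follow the same pattern (fl_stack_create, fl_stack_destroy, fl_stack_push, fl_stack_pop) with the same return conventions.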
1. init() might be okay (if you're including your library into something else as an actual library, rather than compiling its source in), but it's better practice to use something like stack_init(), and to prefix your library's functions with stack_ or queue_, etc.
2. A program using your library may get confused, depending on the order the libraries are included; see #1.
3. As far as the principles go, the linker (on Linux, anyway) will look for symbols, and there's an ordering to how those symbols will be found. For more information, you can check out the man page for dlsym(), and specifically for RTLD_NEXT.
Function names in C are global unless the function is declared static. If two such functions in a program have the same name, the program should fail to compile. (Well, sometimes it fails at link time, but the idea still holds.)
Generally, you get around this problem by using some sort of prefix or suffix on the function names in your library. "apporc_stack_init()" is much less likely to collide with something than "init()" is.

Harmful C Source File Check?

Is there a way to programmatically check if a single C source file is potentially harmful?
I know that no check will yield 100% accuracy, but I am interested in at least doing some basic checks that will raise a red flag if certain expressions or keywords are found. Any ideas of what to look for?
Note: the files I will be inspecting are relatively small in size (few 100s of lines at most), implementing numerical analysis functions that all operate in memory. No external libraries (except math.h) shall be used in the code. Also, no I/O should be used (functions will be run with in-memory arrays).
Given the above, are there some programmatic checks I could do to at least try to detect harmful code?
Note: since I don't expect any I/O, if the code does I/O -- it is considered harmful.
Yes, there are programmatic ways to detect the conditions that concern you.
It seems to me you ideally want a static analysis tool to verify that the preprocessed version of the code:
1. Doesn't call any functions except those it defines and non-I/O functions in the standard library,
2. Doesn't do any bad stuff with pointers.
By preprocessing, you get rid of the problem of detecting macros, possibly-bad-macro content, and actual use of macros. Besides, you don't want to wade through all the macro definitions in standard C headers; they'll hurt your soul because of all the historical cruft they contain.
If the code only calls its own functions and trusted functions in the standard library, it isn't calling anything nasty. (Note: It might be calling some function through a pointer, so this check either requires a function-points-to analysis or the agreement that indirect function calls are verboten, which is actually probably reasonable for code doing numerical analysis).
The purpose of checking for bad stuff with pointers is so that it doesn't abuse pointers to manufacture nasty code and pass control to it. This first means, "no casts to pointers from ints" because you don't know where the int has been :-}
For the who-does-it-call check, you need to parse the code and name/type resolve every symbol, and then check call sites to see where they go. If you allow pointers/function pointers, you'll need a full points-to analysis.
One of the standard static analyzer tool companies (Coverity, Klocwork) likely provides some kind of method of restricting what functions a code block may call. If that doesn't work, you'll have to fall back on more general analysis machinery like our DMS Software Reengineering Toolkit with its C Front End. DMS provides customizable machinery to build arbitrary static analyzers for a language description provided to it as a front end. DMS can be configured to do exactly the test 1), including the preprocessing step; it also has full points-to and function-points-to analyzers that could be used for the points-to checking.
For 2) "doesn't use pointers maliciously", again the standard static analysis tool companies provide some pointer checking. However, here they have a much harder problem because they are statically trying to reason about a Turing machine. Their solution is either to miss cases or to report false positives. Our CheckPointer tool is a dynamic analysis, that is, it watches the code as it runs, and if there is any attempt to misuse a pointer CheckPointer will report the offending location immediately. Oh, yes, CheckPointer outlaws casts from ints to pointers :-} So CheckPointer won't provide a static diagnostic "this code can cheat", but you will get a diagnostic if it actually attempts to cheat. CheckPointer has rather high overhead (all that checking costs something), so you probably want to run your code with it for a while to gain some faith that nothing bad is going to happen, and then stop using it.
EDIT: Another poster says "There's not a lot you can do about buffer overwrites for statically defined buffers." CheckPointer will do those tests and more.
If you want to make sure it's not calling anything not allowed, then compile the piece of code and examine what it's linking to (say via nm). Since you're hung up on doing this by a "programmatic" method, just use python/perl/bash to compile then scan the name list of the object file.
There's not a lot you can do about buffer overwrites for statically defined buffers, but you could link against an electric-fence type memory allocator to prevent dynamically allocated buffer overruns.
You could also compile and link the C-file in question against a driver which would feed it typical data while running under valgrind which could help detect poorly or maliciously written code.
In the end, however, you're always going to run up against the "does this routine terminate" question, which is famous for being undecidable. A practical way around this would be to compile your program and run it from a driver which would alarm-out after a set period of reasonable time.
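A minimal sketch of such a driver, assuming a POSIX system and that the routine under test can be called as a plain function. The name run_with_timeout and its return codes are made up for the example. The routine runs in a child process so that the alarm kills only the child, not the driver.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical driver helper: runs fn() in a child process.
   Returns 0 if fn() finished in time, 1 if the alarm killed it,
   and -1 on any other failure. seconds must be > 0, because
   alarm(0) cancels the alarm instead of setting one. */
int run_with_timeout(void (*fn)(void), unsigned seconds)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        alarm(seconds);   /* default action of SIGALRM terminates the child */
        fn();
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM)
        return 1;         /* ran out of time */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;         /* finished normally */
    return -1;
}

/* Two stand-ins for the code under test. */
static void quick_routine(void) { }

static void endless_routine(void)
{
    volatile unsigned long n = 0;
    for (;;)
        n++;              /* never returns */
}
```

Calling run_with_timeout(endless_routine, 1) comes back after about a second with 1, while quick_routine returns 0 almost immediately. This says nothing about whether the routine would ever terminate, only that it did not do so within the budget you gave it.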
EDIT: Example showing use of nm:
Create a C snippet defining function foo which calls fopen:
#include <stdio.h>
void foo(void) {
    FILE *fp = fopen("/etc/passwd", "r");  /* shows up as an undefined symbol */
    (void)fp;
}
Compile with -c, and then look at the resulting object file:
$ gcc -c foo.c
$ nm foo.o
0000000000000000 T foo
                 U fopen
Here you'll see that there are two symbols in the foo.o object file. One is defined: foo, the name of the subroutine we wrote. The other is undefined: fopen, which will be resolved to its definition when the object file is linked together with the other C files and necessary libraries. Using this method, you can see immediately if the compiled object is referencing anything outside of its own definitions, which by your rules can be considered "bad".
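If you would rather do the scan in C than in a script, something along these lines could work. It is a sketch: count_forbidden and the allowlist are made up for illustration, the allowlist is only a handful of math.h names, and the input is assumed to be the text printed by nm in its default format.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative allowlist: a few math.h functions. Extend as needed. */
static const char *const allowed[] = {
    "cos", "exp", "log", "pow", "sin", "sqrt"
};

static int is_allowed(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if (strcmp(name, allowed[i]) == 0)
            return 1;
    return 0;
}

/* Counts the undefined (U) symbols in nm output that are not allowed. */
int count_forbidden(const char *nm_output)
{
    int forbidden = 0;
    const char *line = nm_output;

    while (*line != '\0') {
        char buf[256], name[128];
        size_t len = strcspn(line, "\n");      /* length of this line */

        if (len < sizeof buf) {
            memcpy(buf, line, len);
            buf[len] = '\0';
            /* Undefined entries look like "<blanks> U <name>". */
            if (sscanf(buf, " U %127s", name) == 1 && !is_allowed(name))
                forbidden++;
        }
        line += len;
        if (*line == '\n')
            line++;
    }
    return forbidden;
}
```

Fed the two lines of nm output shown above, it reports one forbidden symbol (fopen). A nonzero result is a red flag, not proof of harm; a zero result does not rule out inline assembly or calls through function pointers.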
You could do some obvious checks for "bad" function calls like network IO or assembly blocks. Beyond that, I can't think of anything you can do with just a C file.
Given the nature of C you're just about going to have to compile to even get started. Macros and such make static analysis of C code pretty difficult.

Some general C questions

I am trying to fully understand the process from writing code in some language to its execution by the OS. In my case, the language would be C and the OS would be Windows. So far, I have read many different articles, but I am not sure whether I understand the process correctly, and I would like to ask if you know some good articles on the subjects I couldn't find.
So, what I think I know about C (and basically other languages):
The C compiler itself handles only data types, basic math operations, pointer operations, and working with functions. By working with functions I mean how to pass arguments to a function and how to get output from it. During compilation, a function call is replaced by pushing the arguments onto the stack, and then, if the function is not inline, the call is replaced by some symbol for the linker. The linker then finds the function definition and replaces the symbol with a jump address to that function (and of course a jump back to the program afterwards).
If the above is generally true and I have it right, where in the final .exe file does the linker actually save the functions? After the main() function? And what creates the .exe header, the compiler or the linker?
Now, the additional capabilities of C, today known as the C standard library, are a set of functions and their declarations that other programmers wrote to extend and simplify use of the C language. But these functions, like printf(), were (or could be?) written in a different language, or in assembler. And here comes my next question: can the printf() function, for example, be written in pure C without the use of assembler?
I know this is quite a big question, but I mostly just want to know whether I am right or not. And trust me, I have read a lot of articles on the web, and I would not ask you if I could find this information together in one place, in one article. Instead I must gather the information piece by piece, so I am not sure if I am right. Thanks.
I think that you're exposed to some information that is less relevant as a beginning C programmer and that might be confusing you - part of the goal of using a higher level language like this is to not have to initially think about how this process works. Over time, however, it is important to understand the process. I think you generally have the right understanding of it.
The C compiler merely takes C code and generates object files that contain machine language. Most of the object file is taken by the content of the functions. A simple function call in C, for example, would be represented in the compiled form as low level operators to push things into the stack, change the instruction pointer, etc.
The C library and any other libraries you would use are already available in this compiled form.
The linker is the thing that combines all the relevant object files, resolves all the dependencies (e.g., one object file calling a function in the standard library), and then creates the executable.
As for the language libraries are written in: Think of every function as a black box. As long as the black box has a standard interface (the C calling convention; that is, it takes arguments in a certain way, returns values in a certain way, etc.), how it is written internally doesn't matter. Most typically, the functions would be written in C or directly in assembly. By the time they make it into an object file (or as a compiled library), it doesn't really matter how they were initially created, what matters is that they are now in the compiled machine form.
The format of an executable depends on the operating system, but much of the body of the executable in windows is very similar to that of the object files. Imagine as if someone merged together all the object files and then added some glue. The glue does loading related stuff and then invokes the main(). When I was a kid, for example, people got a kick out of "changing the glue" to add another function before the main() that would display a splash screen with their name.
One thing to note, though is that regardless of the language you use, eventually you have to make use of operating system services. For example, to display stuff on the screen, to manage processes, etc. Most operating systems have an API that is also callable in a similar way, but its contents are not included in your EXE. For example, when you run your browser, it is an executable, but at some point there is a call to the Windows API to create a window or to load a font. If this was part of your EXE, your EXE would be huge. So even in your executable, there are "missing references". Usually, these are addressed at load time or run time, depending on the operating system.
I am a new user and this system does not allow me to post more than one link. To get around that restriction, I have posted some idea at my blog http://zhinkaas.blogspot.com/2010/04/how-does-c-program-work.html. It took me some time to get all links, but in totality, those should get you started.
The compiler is responsible for translating all your functions written in C into machine code, which it saves in an object file. So, if you write a .c file that has a main function and a few other functions, the compiler will translate all of those, and the linker will then store them together in the final file (an EXE or DLL, for example). Then, when you run the file, the loader (which is part of the OS) knows to start running the main function first. Otherwise, the main function is just like any other function for the compiler.
The linker is responsible for resolving references to functions and variables in one object file against their definitions in other files. For example, if you call printf(), since you do not define the function printf() yourself, the linker is responsible for making sure that the call to printf() goes to the right library where printf() is defined. This is done at link time, after compilation.
printf() can indeed be written in pure C. What it does is make a system call into the OS, which knows how to actually send characters to the standard output (like a terminal window). When you call printf() in your program, the linker is responsible for linking your call to the printf() function in the standard C library. When the function is called at run-time, printf() formats the arguments properly and then makes the appropriate OS system call to actually display the characters.
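To illustrate the "format, then hand the bytes to the OS" idea, here is a toy version. It is only a sketch: tiny_printf is a made-up name, it cheats by borrowing vsnprintf from the C library for the formatting step, and it uses the POSIX write() call for output, so on Windows the last step would have to go through the Windows API instead. A real printf also buffers its output, which this does not.

```c
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Toy printf: formats into a fixed buffer, then asks the OS to write
   the bytes to standard output. Returns the number of bytes written,
   or -1 on error. Output longer than the buffer is truncated. */
int tiny_printf(const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, sizeof buf, fmt, ap);    /* step 1: format in plain C */
    va_end(ap);

    if (n < 0)
        return -1;
    if ((size_t)n >= sizeof buf)
        n = (int)sizeof buf - 1;                /* text was truncated */

    if (write(STDOUT_FILENO, buf, (size_t)n) != n)   /* step 2: system call */
        return -1;
    return n;
}
```

Everything here is ordinary C; the only part that leaves your program is the write() call, and even that is reached through a normal C function call into the library.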

What naming convention for a C API

We are working on a game engine written in C and currently we are using the following naming conventions.
ABClass object;
ABClassMethod(object, args)
AB being our prefix.
Our API, even if working on objects, does not have inheritance, polymorphism or anything. All we have is data types and methods working on them.
Our constants are named like AB_ConstantName, and preprocessor macros are named like AB_API_BEGIN. We don't use function-like macros.
I was wondering how well this fits as a C API. Also, you may note that the entire API is wrapped into Lua, and you can use the API either from C or from Lua. Most of the time the engine will be used from Lua.
Whatever API you come up with, for your users' mental sanity (and for yours), ensure that it's consistent throughout the code.
Consistency, to me, includes three things:
Naming. Case and use of the underscore should be regulated. For example: ABClass() is a "public" symbol while AB_Class() is not (in the sense that it might be visible, for whatever reason, to other modules but it's reserved for internal use).
If you have "ABClass()", you should never have "abOtherClass()" or "AbYet_anotherClass()"
Nouns and verbs. If something is called "point" it must always be "point" and not "pnt" or "p" or similar.
Standard C library, for example, has both putc() and putchar() (yes, they are different but the name doesn't tell which one writes on stdout).
Also verbs should be consistent: avoid having "CreateNewPoint()", "BuildCircle()" and "NewSquareMake()" at the same time!
Argument position. If a set of related functions takes similar arguments (e.g. a string or a file), ensure they have the same position. Again the C standard library does a poor job with fwrite() and fprintf(): one has the file as the last argument, the other as the first one.
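Side by side, the inconsistency looks like this. The wrapper function write_both_ways is made up purely to put the two standard calls next to each other.

```c
#include <stdio.h>
#include <string.h>

/* Writes `text` twice, once with each call, to show where the
   FILE * argument sits. Returns 0 on success, -1 on failure. */
int write_both_ways(FILE *fp, const char *text)
{
    size_t len = strlen(text);

    if (fwrite(text, 1, len, fp) != len)   /* stream is the LAST argument */
        return -1;
    if (fprintf(fp, "%s", text) < 0)       /* stream is the FIRST argument */
        return -1;
    return 0;
}
```

In an API of your own you get to pick one position (object first is the common choice, as in ABClassMethod(object, args)) and stick to it everywhere.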
The rest is much up to your taste and any other constraint you might have.
For example, you mentioned you're using Lua: following a convention that is similar to the Lua one could be a plus if programmers have to be exposed to both APIs at the same time.
This seems standard enough. OpenGL did it with a gl prefix, so you can't be that far off. :)
There are a lot of C APIs. If you are creative enough to invent a new one, there's no "majority" to blame you. On the other hand, no matter which way you go there are enough zealots of other standards to get mad at you.
