Is there a way to programmatically check if a single C source file is potentially harmful?
I know that no check will yield 100% accuracy -- but am interested at least to do some basic checks that will raise a red flag if some expressions / keywords are found. Any ideas of what to look for?
Note: the files I will be inspecting are relatively small in size (few 100s of lines at most), implementing numerical analysis functions that all operate in memory. No external libraries (except math.h) shall be used in the code. Also, no I/O should be used (functions will be run with in-memory arrays).
Given the above, are there some programmatic checks I could do to at least try to detect harmful code?
Note: since I don't expect any I/O, if the code does I/O -- it is considered harmful.
Yes, there are programmatic ways to detect the conditions that concern you.
It seems to me you ideally want a static analysis tool to verify that the preprocessed version of the code:
Doesn't call any functions except those it defines and non I/O functions in the standard library,
Doesn't do any bad stuff with pointers.
By preprocessing, you get rid of the problem of detecting macros, possibly-bad-macro content, and actual use of macros. Besides, you don't want to wade through all the macro definitions in standard C headers; they'll hurt your soul because of all the historical cruft they contain.
If the code only calls its own functions and trusted functions in the standard library, it isn't calling anything nasty. (Note: It might be calling some function through a pointer, so this check either requires a function-points-to analysis or the agreement that indirect function calls are verboten, which is actually probably reasonable for code doing numerical analysis).
The purpose of checking for bad stuff with pointers is so that it doesn't abuse pointers to manufacture nasty code and pass control to it. This first means, "no casts to pointers from ints" because you don't know where the int has been :-}
For the who-does-it-call check, you need to parse the code and name/type resolve every symbol, and then check call sites to see where they go. If you allow pointers/function pointers, you'll need a full points-to analysis.
One of the standard static analyzer tool companies (Coverity, Klocwork) likely provide some kind of method of restricting what functions a code block may call. If that doesn't work, you'll have to fall back on more general analysis machinery like our DMS Software Reengineering Toolkit
with its C Front End. DMS provides customizable machinery to build arbitrary static analyzers, for the a language description provided to it as a front end. DMS can be configured to do exactly the test 1) including the preprocessing step; it also has full points-to, and function-points-to analyzers that could be used to the points-to checking.
For 2) "doesn't use pointers maliciously", again the standard static analysis tool companies provide some pointer checking. However, here they have a much harder problem because they are statically trying to reason about a Turing machine. Their solution is either miss cases or report false positives. Our CheckPointer tool is a dynamic analysis, that is, it watches the code as it runs and if there is any attempt to misuse a pointer CheckPointer will report the offending location immediately. Oh, yes, CheckPointer outlaws casts from ints to pointers :-} So CheckPointer won't provide a static diagnostic "this code can cheat", but you will get a diagnostic if it actually attempts to cheat. CheckPointer has rather high overhead (all that checking costs something) so you probably want to run you code with it for awhile to gain some faith that nothing bad is going to happen, and then stop using it.
EDIT: Another poster says There's not a lot you can do about buffer overwrites for statically defined buffers. CheckPointer will do those tests and more.
If you want to make sure it's not calling anything not allowed, then compile the piece of code and examine what it's linking to (say via nm). Since you're hung up on doing this by a "programmatic" method, just use python/perl/bash to compile then scan the name list of the object file.
There's not a lot you can do about buffer overwrites for statically defined buffers, but you could link against an electric-fence type memory allocator to prevent dynamically allocated buffer overruns.
You could also compile and link the C-file in question against a driver which would feed it typical data while running under valgrind which could help detect poorly or maliciously written code.
In the end, however, you're always going to run up against the "does this routine terminate" question, which is famous for being undecidable. A practical way around this would be to compile your program and run it from a driver which would alarm-out after a set period of reasonable time.
EDIT: Example showing use of nm:
Create a C snippet defining function foo which calls fopen:
#include <stdio.h>
foo() {
FILE *fp = fopen("/etc/passwd", "r");
}
Compile with -c, and then look at the resulting object file:
$ gcc -c foo.c
$ nm foo.o
0000000000000000 T foo
U fopen
Here you'll see that there are two symbols in the foo.o object file. One is defined, foo, the name of the subroutine we wrote. And one is undefined, fopen, which will be linked to its definition when the object file is linked together with the other C-files and necessary libraries. Using this method, you can see immediately if the compiled object is referencing anything outside of its own definition, and by your rules, can considered to be "bad".
You could do some obvious checks for "bad" function calls like network IO or assembly blocks. Beyond that, I can't think of anything you can do with just a C file.
Given the nature of C you're just about going to have to compile to even get started. Macros and such make static analysis of C code pretty difficult.
Related
From bpf man page:
eBPF programs can be written in a restricted C that is compiled
(using the clang compiler) into eBPF bytecode. Various features are
omitted from this restricted C, such as loops, global variables,
variadic functions, floating-point numbers, and passing structures as
function arguments.
AFAIK the man page it's not updated. I'd like to know what is exactly forbidden when using restricted C to write an eBPF program? Is what the man page says still true?
It is not really a matter of what is “allowed” in the ELF file itself. This sentence means that once compiled into eBPF instructions, your C code may produce code that would be rejected by the verifier. For example, loops in BPF programs have long been rejected in BPF programs, because there was no guaranty that they would terminate (the only workaround was to unroll them at compile time).
So you can basically use pretty much whatever you want in C and produce successfully an ELF object file. But then you want it to pass the verifier. What components will surely result in the verifier complaining? Let's have a look at the list from man page:
Loops: Linux version 5.3 introduces support for bounded loops, so loops now work to some extent. “Bounded loops” means loops for which the verifier has a way to tell they will eventually finish: typically, a for (i = 0; i < CONSTANT; i++) kind loop should work (assuming i is not modified in the block).
Global variables: There has been some work recently to support global variables, but they are processed in a specific way (as single-entry maps) and I have not really experimented with them, so I don't know how transparent this is and if you can simply have global variables defined in your program. Feel free to experiment :).
Variadic functions: Pretty sure this is not supported, I don't see how that would translate in eBPF at the moment.
Floating point numbers: Still not supported.
Passing structure as function arguments: Not supported, although passing pointers to structs should work I think.
If you are interested in this level of details, you should really have a look at Cilium's documentation on BPF. It is not completely up-to-date (only the very new features are missing), but much more complete and accurate than the man page. In particular, the LLVM section has a list of items that should or should not work in C programs compiled to eBPF. In addition to the aforedmentioned items, they cite:
(All function needing to be inlined, no function calls) -> This one is outdated, BPF has function calls.
No shared library calls: This is true. You cannot call functions from standard libraries, or functions defined in other BPF programs. You can only call into functions defined in the same BPF programs, or BPF helpers implemented in the kernel, or perform “tail calls”.
Exception: LLVM built-in functions for memset()/memcpy()/memmove()/memcmp() are available (I think they're pretty much the only functions you can call, other than BPF helpers and your other BPF functions).
No const string or arrays allowed (because of how they are handled in the ELF file): I think this is still valid today?
BPF program stack is limited to 512 bytes, so your C program must not result in an executable that attempts to use more.
Additional allowed or good-to-know items are listed. I can only encourage you to dive into it!
Let's go over these:
Variadic functions, floating-point numbers, and passing structures as function arguments are still not possible. As far as I know, there is no ongoing work to support these.
Global variables should be supported in kernels >=5.2, thanks to a recent patchset by Daniel Borkmann.
Unbounded loops are still unsupported. There is limited support for bounded loops in kernels >=5.3. I'm using "limited" here because a small program may still be rejected if the loop is too large.
The best way to know what the last versions of Linux allow is to read the verifier's code and/or to follow the bpf mailing list (you might also want to follow the netdev mailing list as some patchsets may still land there). I've also found checking patchwork to be quite efficient as it more clearly outlines the state of each patchset.
I would like to use GCC kind of as a JIT compiler, where I just compile short snippets of code every now and then. While I could of course fork a GCC process for each function I want to compile, I find that GCC's startup overhead is too large for that (it seems to be about 50 ms on my computer, which would make it take 50 seconds to compile 1000 functions). Therefore, I'm wondering if it's possible to run GCC as a daemon or use it as a library or something similar, so that I can just submit a function for compilation without the startup overhead.
In case you're wondering, the reason I'm not considering using an actual JIT library is because I haven't found one that supports all the features I want, which include at least good knowledge of the ABI so that it can handle struct arguments (lacking in GNU Lightning), nested functions with closure (lacking in libjit) and having a C-only interface (lacking in LLVM; I also think LLVM lacks nested functions).
And no, I don't think I can batch functions together for compilation; half the point is that I'd like to compile them only once they're actually called for the first time.
I've noticed libgccjit, but from what I can tell, it seems very experimental.
My answer is "No (you can't run GCC as a daemon process, or use it as a library)", assuming you are trying to use the standard GCC compiler code. I see at least two problems:
The C compiler deals in complete translation units, and once it has finished reading the source, compiles it and exits. You'd have to rejig the code (the compiler driver program) to stick around after reading each file. Since it runs multiple sub-processes, I'm not sure that you'll save all that much time with it, anyway.
You won't be able to call the functions you create as if they were normal statically compiled and linked functions. At the least you will have to load them (using dlopen() and its kin, or writing code to do the mapping yourself) and then call them via the function pointer.
The first objection deals with the direct question; the second addresses a question raised in the comments.
I'm late to the party, but others may find this useful.
There exists a REPL (read–eval–print loop) for c++ called Cling, which is based on the Clang compiler. A big part of what it does is JIT for c & c++. As such you may be able to use Cling to get what you want done.
The even better news is that Cling is undergoing an attempt to upstream a lot of the Cling infrastructure into Clang and LLVM.
#acorn pointed out that you'd ruled out LLVM and co. for lack of a c API, but Clang itself does have one which is the only one they guarantee stability for: https://clang.llvm.org/doxygen/group__CINDEX.html
Hullo,
When one disasembly some win32 exe prog compiled by c compiler it
shows that some compilers links some 'hidden' routines in it -
i think even if c program is an empty one and has a 5 bytes or so.
I understand that such 5 bytes is enveloped in PE .exe format but
why to put some routines - it seem not necessary for me and even
somewhat annoys me. What is that? Can it be omitted? As i understand
c program (not speaking about c++ right now which i know has some
initial routines) should not need such complementary hidden functions..
Much tnx for answer, maybe even some extended info link, cause this
topic interests me much
//edit
ok here it is some disasembly Ive done way back then
(digital mars and old borland commandline (i have tested also)
both make much more code, (and Im specialli interested in bcc32)
but they do not include readable names/symbols in such dissassembly
so i will not post them here
thesse are somewhat readable - but i am not experienced in understending
what it is ;-)
https://dl.dropbox.com/u/42887985/prog_devcpp.htm
https://dl.dropbox.com/u/42887985/prog_lcc.htm
https://dl.dropbox.com/u/42887985/prog_mingw.htm
https://dl.dropbox.com/u/42887985/prog_pelles.htm
some explanatory comments whats that heere?
(I am afraid maybe there is some c++ sh*t here, I am
interested in pure c addons not c++ though,
but too tired now to assure that it was compiled in c
mode, extension of compiled empty-main prog was c
so I was thinking it will be output in c not c++)
tnx for longer explanations what it is
Since your win32 exe file is a dynamically linked object file, it will contain the necessary data needed by the dynamic linker to do its job, such as names of libraries to link to, and symbols that need resolving.
Even a program with an empty main() will link with the c-runtime and kernel32.dll libraries (and probably others? - a while since I last did Win32 dev).
You should also be aware that main() is only the entry point of your program - quite a bit has already gone on before this point such as retrieving and tokening the command-line, setting up the locale, creating stderr, stdin, and stdout and setting up the other mechanism required by the c-runtime library such a at_exit(). Similarly, when your main() returns, the runtime does some clean-up - and at the very least needs to call the kernel to tell it that you're done.
As to whether it's necessary? Yes, unless you fancy writing your own program prologue and epilogue each time. There are probably are ways of writing minimal, statically linked applications if you're sufficiently masochistic.
As for storage overhead, why are you getting so worked up? It's not enough to worry about.
There are several initialization functions that load whenever you run a program on Windows. These functions, among other things, call the main() function that you write - which is why you need either a main() or WinMain() function for your program to run. I'm not aware of other included functions though. Do you have some disassembly to show?
You don't have much detail to go on but I think most of what you're seeing is probably the routines of the specific C runtime library that your compiler works with.
For instance there will be code enabling it to run from the entry point 'main' which portable executable format understands to call the main(char ** args) that you wrote in your C program.
I am trying to run static analysis on a C project to identify dead code i.e functions or code lines that are never ever called. I can build this project with Visual Studio .Net for Windows or using gcc for Linux. I have been trying to find some reasonable tool that can do this for me but so far I have not succeeded. I have read related questions on Stack Overflow i.e this and this and I have tried to use -Wunreachable-code with gcc but the output in gcc is not very helpful. It is of the following format
/home/adnan/my_socket.c: In function ‘my_sockNtoH32’:
/home/adnan/my_socket.c:666: warning: will never be executed
but when I look at line 666 in my_socket.c, it's actually inside another function that is being called from function my_sockNtoH32() and will not be executed for this specific instance but will be executed when called from some other functions.
What I need is to find the code which will never be executed. Can someone plz help with this?
PS: I can't convince management to buy a tool for this task, so please stick to free/open source tools.
If GCC isn't cutting it for you, try clang (or more accurately, its static analyzer). It (generally, your mileage may vary of course) has a much better static analysis than GCC (and produces much better output). It's used in Apple's Xcode but it's open-source and can be used seperately.
When GCC says "will never be executed", it means it. You may have a bug that, in fact, does make that dead code. For example, something like:
if (a = 42) {
// some code
} else {
// warning: unreachable code
}
Without seeing the code it's not possible to be specific, of course.
Note that if there is a macro at line 666, it's possible GCC refers to a part of that macro as well.
GCC will help you find dead code within a compilation. I'd be surprised if it can find dead code across multiple compilation units. A file-level declaration of a function or variable in a compilation unit means that some other compilation unit might reference it. So anything declared at the top level of a file, GCC can't eliminate, as it arguably only sees one compilation unit at a time.
The problem gets get harder. Imagine that compilation unit A declares function a, and compilation unit B has a function b that calls a. Is a dead? On the face of it, no. But in fact, it depends; if b is dead, and the only reference to a is in b, then a is dead, too. We get the same problem if b merely takes &a and puts it into an array X. Now to decide if a is dead, we need a points-to analysis across the entire system, to see if that pointer to a is used anywhere.
To get this kind of accurate "dead" information, you need a global view of the entire set of compilation units, and need to compute a points-to analysis, followed by the construction of a call-graph based on that points-to analysis. Function a is dead only if the call graph (as a tree,
with main as the root) doesn't reference it somewhere.
(Some caveats are necessary: whatever the analysis is, as a practical matter it must be conservative, so even a full-points to analysis may not identify a function correctly as dead. You also have to worry about uses of a C artifact from outside the set of C functions, e.g., a call to a from some bit of assembler code).
Threading makes this worse; each thread has some root function which is probably at the top of the call DAG. Since how a thread gets started isn't defined by C compilers, it should be clear that to determine if a multithreaded C application has dead code, somehow the analysis has to be told the thread root functions, or be told how to discover them by looking for thread-initialization primitives.
You aren't getting a lot responses on how to get a correct answer. While it isn't open source, our DMS Software Reengineering Toolkit with its C Front End has all the machinery to do this, including C parsers, control- and dataflow- analysis, local and global points-to analysis, and global call graph construction. DMS is easily customized to include extra information such as external calls from assembler, and/or a list of thread roots or specific source-patterns that are thread initialization calls, and we've actually done that (easily) for some large embedded engine controllers with millions of lines of code. DMS has been applied to systems as large as 26 million lines of code (some 18,000 compilation units) for the purpose of building such calls graphs.
[An interesting aside: in processing individual comilation units, DMS for scaling reasons in effect deletes symbols and related code that aren't used in that compilation unit. Remarkably, this gets rid of about 95% of code by volume when you take into account the iceberg usually hiding in the include file nest. It says C software typically has poorly factored include files. I suspect you all know that already.]
Tools like GCC will remove dead code while compiling. That's helpful, but the dead code is still lying around in your compilation unit source code using up developer's attention (they have to figure out if it is dead, too!). DMS in its program transformation mode can be configured, modulo some preprocessor issues, to actually remove that dead code from the source. On very large software systems, you don't really want to do this by hand.
I have a "a pain in the a$$" task to extract/parse all standard C functions that were called in the main() function. Ex: printf, fseek, etc...
Currently, my only plan is to read each line inside the main() and search if a standard C functions exists by checking the list of standard C functions that I will also be defining (#define CFUNCTIONS "printf...")
As you know there are so many standard C functions, so defining all of them will be so annoying.
Any idea on how can I check if a string is a standard C functions?
If you have heard of cscope, try looking into the database it generates. There are instructions available at the cscope front end to list out all the functions that a given function has called.
If you look at the list of the calls from main(), you should be able to narrow down your work considerably.
If you have to parse by hand, I suggest starting with the included standard headers. They should give you a decent idea about which functions could you expect to see in main().
Either way, the work sounds non-trivial and interesting.
Parsing C source code seems simple at first blush, but as others have pointed out, the possibility of a programmer getting far off the leash by using #defines and #includes is rather common. Unless it is known that the specific program to be parsed is mild-mannered with respect to text substitution, the complexity of parsing arbitrary C source code is considerable.
Consider the less used, but far more effective tactic of parsing the object module. Compile the source module, but do not link it. To further simplify, reprocess the file containing main to remove all other functions, but leave declarations in their places.
Depending on the requirements, there are two ways to complete the task:
Write a program which opens the object module and iterates through the external reference symbol table. If the symbol matches one of the interesting function names, list it. Many platforms have library functions for parsing an object module.
Write a command file or script which uses the developer tools to examine object modules. For example, on Linux, the command nm lists external references with a U.
The task may look simple at first but in order to be really 100% sure you would need to parse the C-file. It is not sufficient to just look for the name, you need to know the context as well i.e. when to check the id, first when you have determined that the id is a function you can check if it is a standard c-runtime function.
(plus I guess it makes the task more interesting :-)
I don't think there's any way around having to define a list of standard C functions to accomplish your task. But it's even more annoying than that -- consider macros,
for example:
#define OUTPUT(foo) printf("%s\n",foo)
main()
{
OUTPUT("Ha ha!\n");
}
So you'll probably want to run your code through the preprocessor before checking
which functions are called from main(). Then you might have cases like this:
some_func("This might look like a call to fclose(fp), but surprise!\n");
So you'll probably need a full-blown parser to do this rigorously, since string literals
may span multiple lines.
I won't bring up trigraphs...that would just be pointless sadism. :-) Anyway, good luck, and happy coding!