I am implementing a call graph program for C using a Perl script. I wonder how to resolve call graph edges for function pointers using the output of 'objdump'?
How do different call graph applications resolve function pointers?
Are function pointers resolved at run time, or can they be resolved statically?
EDIT
How do call graphs resolve cycles in static evaluation of program?
It is easy to build a call graph of A-calls-B when the call statement explicitly mentions B. It is much harder to handle indirect calls, as you've noticed.
Good static analysis tools form estimates of the contents of pointer variables by propagating pointer assignments/copies/arithmetic across program data flows (inter and intra-procedural ["global"]) using a variety of schemes, often conservative ("you get too much").
Without such an estimate, you cannot have any idea what a pointer contains and therefore simply cannot make a useful prediction (well, you can use the ultimate conservative estimate that it will go anywhere, but I think you've already rejected that solution).
Our DMS Software Reengineering Toolkit has static control/dataflow/points-to/call graph analysis that has been applied to huge systems (~25 million lines) of C code, and produced such call graphs. The machinery to do this is pretty complex, but you can find it covered under advanced topics in the compiler literature. I doubt you want to implement this in Perl.
This is easier when you have source code, because you at least reliably know what is code, and what is not. You're trying to do this on object code, which means you can't even eliminate data.
Using function pointers is a way of choosing the actual function to call at runtime, so in general it isn't possible to know statically which function will actually be called.
However, you could look at all the functions it is possible to call and perhaps show those in some way. Often callbacks have a unique enough signature (though not always).
If you want to do better, you have to analyze the source code, to see which functions are assigned to pointers to begin with.
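As a minimal illustration of why source-level analysis helps, consider this sketch (all names here are invented for illustration): because the only assignments to the pointer are statically visible, an analyzer can bound the indirect call site to exactly two targets.

```c
/* Hypothetical sketch: a points-to analysis can bound the targets of
 * "handler" to {on_open, on_close}, because those are the only
 * functions whose addresses ever flow into it. */
typedef void (*handler_t)(void);

static int state = 0;
static void on_open(void)  { state = 1; }
static void on_close(void) { state = 0; }

static handler_t handler = on_close;

void select_handler(int want_open) {
    /* Both assignments are statically visible, so the indirect call
     * site "handler()" gets exactly two call graph edges. */
    handler = want_open ? on_open : on_close;
}

int current_state(void) {
    handler();          /* indirect call */
    return state;
}
```

With only the objdump output, the two assignments above are just stores of addresses, which is why recovering this edge set from object code is so much harder.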
I have successfully written a complicated function with the PETSc library (an MPI-based scientific library for solving huge linear systems in parallel). This library provides its own version of "malloc" and its own basic datatypes (i.e. "PetscInt" as the standard "int"). For this function, I've always used PETSc facilities instead of standard ones such as "malloc" and "int". The function has been extensively tested and has always worked fine. Despite the use of MPI, the function is fully serial, and all processors perform it on the same data (each processor has its own copy): no communication is involved at all.
Then, I decided not to use PETSc and to write a standard MPI version. Basically, I rewrote all the code, substituting PETSc facilities with classic C ones, not with brute force but paying careful attention to the substitutions (no editor "Replace" tool, I mean! All done by hand). During substitution, a few minor changes were made, such as declaring two different variables a and b instead of declaring a[2]. These are the substitutions:
PetscMalloc -> malloc
PetscScalar -> double
PetscInt -> int
PetscBool -> an enum created to replicate it, as C has no built-in boolean datatype.
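For reference, a plain-C stand-in for PetscBool might look like the sketch below; the enumerator names are a guess for illustration, not PETSc's actual header.

```c
/* One way to replicate PetscBool in plain C (without C99's
 * <stdbool.h>); a sketch, not PETSc's actual definition. */
typedef enum { PETSC_FALSE = 0, PETSC_TRUE = 1 } PetscBool;
```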
Basically, the algorithms were not changed during the substitution process. The main function is a "for" loop (actually 4 nested loops). At each iteration, it calls another function; let's call it Disfunction. Well, Disfunction works perfectly outside the 4-cycle (I tested it separately), but inside the 4-cycle it works in some cases and not in others. Also, I checked the data passed to Disfunction at each iteration: with EXACTLY the same input, Disfunction performs different computations between one iteration and another.
Also, the computed data doesn't seem to be the result of undefined behaviour, as Disfunction always gives back the same results across different runs of the program.
I've noticed that changing the number of processors for "mpiexec" gives different computational results.
That's my problem. A few other considerations: the program uses "malloc" extensively; the computed data is the same for all processes, correct or not; Valgrind doesn't detect errors (apart from one with normal use of printf, which is another problem and off-topic); Disfunction recursively calls two other functions (extensively tested in the PETSc version as well); the algorithms involved are mathematically correct; Disfunction depends on an integer parameter p>0: for p=1,2,3,4,5 it works PERFECTLY, for p>=6 it does not.
If asked, I can post the code, but it's long and complicated (scientifically, not informatically), and I think it would take time to explain.
My idea is that I'm messing up memory allocation somewhere, but I can't understand where.
Sorry for my English and for the bad formatting.
Well, I don't know if anyone is still interested, but the problem was that the PETSc function PetscMalloc zero-initializes the data, unlike standard C malloc. Stupid mistake... – user3029623
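Given that diagnosis, here is a hedged sketch of the porting fix: when replacing an allocator that zero-initializes (as the poster reports PetscMalloc does), plain malloc is not a drop-in replacement because its memory is uninitialized; calloc, or malloc followed by memset, preserves the old behavior. The helper names below are invented.

```c
#include <stdlib.h>
#include <string.h>

/* calloc returns memory that is already zeroed, matching the
 * behavior of a zero-initializing allocator. */
double *alloc_zeroed(size_t n) {
    return calloc(n, sizeof(double));
}

/* Equivalent alternative: allocate, then clear explicitly. */
double *alloc_zeroed_memset(size_t n) {
    double *p = malloc(n * sizeof *p);
    if (p)
        memset(p, 0, n * sizeof *p);
    return p;
}
```

(For IEEE 754 doubles, all-zero bytes represent 0.0, so both helpers behave identically.)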
The only suggestion I can offer without reference to the code itself is to try to construct progressively simpler test cases that demonstrate your issue.
When you narrow the iterative process down to a single point in your data set or a single step (by eliminating some loops), does the error still occur? If not, that might suggest your loop bounds are wrong.
Does the erroneous output always occur on particular loop indices, especially the first or last? Perhaps there are some ghost or halo values you're missing or some boundary condition that you're not properly accounting for.
When I read open source code (Linux C code), I see that a lot of functions are used instead of performing all operations in main(), for example:
int main(void) {
    function1();
    return 0;
}

void function1(void) {
    // do something
    function2();
}

void function2(void) {
    function3();
    // do something
    function4();
}

void function3(void) {
    // do something
}

void function4(void) {
    // do something
}
Could you tell me the pros and cons of using functions as much as possible?
easy to add/remove functions (or new operations)
readability of the code
source efficiency(?) as the variables in the functions will be destroyed (unless dynamic allocation is done)
would the nested function slow the code flow?
Easy to add/remove functions (or new operations)
Definitely - it's also easy to see where the context for an operation starts and finishes. That's much easier to see than some arbitrary range of lines in the source.
Readability of the code
You can overdo it. There are cases where having a function or not does not make a difference in line count, but does in readability - and whether that's a positive depends on the person.
For example, if you did lots of set-bit operations, would you make:
some_variable = some_variable | (1 << bit_position)
a function? Would it help?
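For concreteness, such a wrapper might look like the sketch below (the name set_bit is invented); whether it actually helps readability is the judgment call the answer is pointing at.

```c
#include <stdint.h>

/* Wrapping the set-bit idiom in a tiny function. The name documents
 * the intent, at the cost of one more indirection for the reader. */
static inline uint32_t set_bit(uint32_t value, unsigned bit_position) {
    return value | (UINT32_C(1) << bit_position);
}
```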
Source efficiency(?) due to the variables in the functions being destroyed (unless dynamic allocation is done)
If the source is reasonable (as in, you're not reusing variable names past their real context), then it shouldn't matter. The compiler should know exactly where a value's usage stops and where it can be ignored / destroyed.
Would the nested function slow the code flow?
In some cases, where address aliasing cannot be properly determined, it could. But it shouldn't matter in practice in most programs. By the time it starts to matter, you're probably going to be going through your application with a profiler and spotting problematic hotspots anyway.
Compilers are quite good these days at inlining functions, though. You can trust them to do at least a decent job of eliminating all the cases where the calling overhead is comparable to the length of the function itself (and many other cases).
This practice of using functions becomes really important as the amount of code you write increases. Separating code out into functions improves code hygiene and makes it easier to read. I read somewhere that there is really no point to code that is readable only by you (though in some situations that's okay, I'm assuming). If you want your code to live on, it must be maintainable, and maintainability is created by writing functions in the simplest sense possible. Also, imagine a code base that exceeds 100k lines. This is quite common; now imagine having all of that in the main function. It would be an absolute nightmare to maintain. Dividing the code into functions helps create degrees of separation so that many developers can work on different parts of the code base. So basically, the short answer is yes, it is good to use functions when necessary.
Functions should help you structure your code. The basic idea is that when you identify some place in the code which does something that can be described in a coherent, self-contained way, you should think about putting it into a function.
Pros:
Code reuse. If you do many times some sequence of operations, why don't you write it once, use it many times?
Readability: it's much easier to understand strlen(st) than while (st[i++] != 0);
Correctness: the code in the previous line is actually buggy. If it is scattered around, you may not even notice the bug, and if you fix it in one place, it will remain somewhere else. But with this code inside a function named strlen, you will know what it should do, and you can fix it once.
Efficiency: sometimes, in certain situations, compilers may do a better job when compiling code inside a function. You probably won't know in advance, though.
Cons:
Splitting code into functions just because it is A Good Thing is not a good idea. If you find it hard to give the function a good name (in your mother tongue, not only in C), that's suspicious. doThisAndThat() is probably two functions, not one; part1() is simply wrong.
A function call may cost you execution time and stack memory. This is not as severe as it sounds; most of the time you should not care about it, but it's there.
When abused, it may lead to many functions doing partial work and delegating other parts here and there. Too many arguments may impede readability too.
There are basically two types of functions: functions that perform a sequence of operations (these are called "procedures" in some contexts), and functions that do some form of calculation. These two types are often mixed in a single function, but it helps to remember the distinction.
There is another distinction between kinds of functions: those that keep state (like strtok), those that may have side effects (like printf), and those that are "pure" (like sin). Functions like strtok are essentially a special case of a different construct, called an Object in Object-Oriented Programming.
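To make the "Correctness" point above concrete, here is the buggy loop repaired once, inside a function - a sketch, not the C library's actual strlen. The original `while (st[i++] != 0);` leaves i one past the terminator, so using i as the length overcounts by one.

```c
#include <stddef.h>

/* Correct string length: stop *at* the terminator, not past it. */
size_t my_strlen(const char *st) {
    size_t i = 0;
    while (st[i] != 0)
        i++;
    return i;
}
```

Once the loop lives in a named function, the off-by-one only ever needs to be fixed in this one place.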
You should use functions that perform one logical task each, at a level of abstraction that makes the function of each function easy to logically verify. For instance:
void create_ui() {
    create_window();
    show_window();
}

void create_window() {
    create_border();
    create_menu_bar();
    create_body();
}

void create_menu_bar() {
    for (int i = 0; i < N_MENUS; i++) {
        create_menu(menus[i]);
    }
    assemble_menus();
}

void create_menu(arg) {
    ...
}
Now, as far as creating a UI is concerned, this isn't quite the way one would do it (you would probably want to pass in and return various components), but the logical structure is what I'm trying to emphasize. Break your task down into a few subtasks, and make each subtask its own function.
Don't try to avoid functions for optimization. If it's reasonable to do so, the compiler will inline them for you; if not, the overhead is still quite minimal. The gain in readability you get from this is a great deal more important than any speed you might get from putting everything in a monolithic function.
As for your title question, "as much as possible," no. Within reason, enough to see what each function does at a comfortable level of abstraction, no less and no more.
One criterion you can use: if part of the code will be reused or rewritten, put it in a function.
I guess I think of functions like legos. You have hundreds of small pieces that you can put together into a whole. As a result of all of those well designed generic, small pieces you can make anything. If you had a single lego that looked like an entire house you couldn't then use it to build a plane, or train. Similarly, one huge piece of code is not so useful.
Functions are the bricks that you use when you design your project. Well-chosen separation of functionality into small, easily testable, self-contained functions makes building and looking after your whole project easy. Their benefits WAY outweigh any possible efficiency issues you may think are there.
To be honest, the art of coding any sizeable project is in how you break it down into smaller pieces, so functions are key to that.
I'm a relatively new C programmer, and I've noticed that many conventions from other, higher-level OOP languages don't exactly hold true in C.
Is it okay to use short functions to keep your coding organized (even though a function will likely be called only once)? An example of this would be 10-15 lines in something like void init_file(void), then calling it first in main().
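A sketch of the pattern the question describes, with invented names and a hypothetical log file: a short, once-called setup function keeps main() readable even though it is never reused.

```c
#include <stdio.h>

static FILE *g_log;   /* hypothetical file-scope handle */

/* Called once, first thing in main(); returns nonzero on success. */
static int init_file(const char *path) {
    g_log = fopen(path, "w");
    return g_log != NULL;
}

static void close_file(void) {
    if (g_log)
        fclose(g_log);
}
```

main() then reads as a short list of well-named steps instead of 10-15 lines of setup detail.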
I would have to say that not only is it OK, it's generally encouraged. Just don't overly fragment the train of thought by creating myriads of tiny functions. Try to ensure that each function performs a single cohesive, well... function, with a clean interface (too many parameters can be a hint that the function is performing work which is not sufficiently separate from its caller).
Furthermore, well-named functions can serve to replace comments that would otherwise be needed. As well as providing re-use, functions can also (or instead) provide a means to organize the code and break it down into smaller units which can be more readily understood. Using functions in this way is very much like creating packages and classes/modules, though at a more fine-grained level.
Yes. Please. Don't write long functions. Write short ones that do one thing and do it well. The fact that they may only be called once is fine. One benefit is that if you name your function well, you can avoid writing comments that will get out of sync with the code over time.
If I can take the liberty to do some quoting from Code Complete:
(These reason details have been abbreviated and in spots paraphrased, for the full explanation see the complete text.)
Valid Reasons to Create a Routine
Note the reasons overlap and are not intended to be independent of each other.
Reduce complexity - The single most important reason to create a routine is to reduce a program's complexity (hide away details so you don't need to think about them).
Introduce an intermediate, understandable abstraction - Putting a section of code into a well-named routine is one of the best ways to document its purpose.
Avoid duplicate code - The most popular reason for creating a routine. Saves space and is easier to maintain (only have to check and/or modify one place).
Hide sequences - It's a good idea to hide the order in which events happen to be processed.
Hide pointer operations - Pointer operations tend to be hard to read and error prone. Isolating them into routines shifts focus to the intent of the operation instead of the mechanics of pointer manipulation.
Improve portability - Use routines to isolate nonportable capabilities.
Simplify complicated boolean tests - Putting complicated boolean tests into a function makes the code more readable because the details of the test are out of the way and a descriptive function name summarizes the purpose of the tests.
Improve performance - You can optimize the code in one place instead of several.
To ensure all routines are small? - No. With so many good reasons for putting code into a routine, this one is unnecessary. (This is the one thrown into the list to make sure you are paying attention!)
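As a small illustration of the "Simplify complicated boolean tests" point above, here is a sketch with invented names; the day check is deliberately rough (it ignores month lengths), since the point is the naming, not calendar arithmetic.

```c
#include <stdbool.h>

struct date { int year, month, day; };

static bool is_valid_month(int m) { return m >= 1 && m <= 12; }
static bool is_valid_day(int d)   { return d >= 1 && d <= 31; }

/* The descriptive name summarizes the compound test; the details
 * are out of the way in the helpers above. */
static bool is_plausible_date(struct date d) {
    return d.year > 0 && is_valid_month(d.month) && is_valid_day(d.day);
}
```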
And one final quote from the text (Chapter 7: High-Quality Routines)
One of the strongest mental blocks to creating effective routines is a reluctance to create a simple routine for a simple purpose. Constructing a whole routine to contain two or three lines of code might seem like overkill, but experience shows how helpful a good small routine can be.
If a group of statements can be thought of as a thing, then make them a function.
I think it is more than OK - I would recommend it! Short, easy-to-verify functions with well-thought-out names lead to code which is more self-documenting than long, complex functions.
Any compiler worth using will be able to inline these calls to generate efficient code if needed.
Functions are absolutely necessary to stay organized. You need to design the solution first, and then split it into functions according to the different pieces of functionality you need. A segment of code which is used multiple times probably needs to be written as a function.
I think you should first consider the problem at hand, break down its components, and try writing a function for each component. When writing a function, check whether some code segment does the same thing elsewhere - then break that out into a sub-function - and a sub-module is likewise a candidate for another function. But at some point this breaking up should stop, and where depends on you. Generally, do not make too many overly big functions, nor too many overly small ones.
When constructing a function, aim for a design with high cohesion and low coupling.
EDIT:
You might also want to consider separate modules. For example, if you need a stack or queue for some application, make it a separate module whose functions can be called from other functions. This way you avoid re-coding commonly used facilities, by programming them as a group of functions stored separately.
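A minimal sketch of such a stand-alone stack module (fixed capacity, invented names): a small, self-contained group of functions that any other part of the program can call, and that can be tested in isolation.

```c
#include <stddef.h>

#define STACK_MAX 64

struct int_stack {
    int data[STACK_MAX];
    size_t top;
};

static void stack_init(struct int_stack *s) { s->top = 0; }

static int stack_empty(const struct int_stack *s) { return s->top == 0; }

/* Returns 1 on success, 0 if the stack is full. */
static int stack_push(struct int_stack *s, int v) {
    if (s->top == STACK_MAX)
        return 0;
    s->data[s->top++] = v;
    return 1;
}

/* Returns 1 and writes the popped value to *out, or 0 if empty. */
static int stack_pop(struct int_stack *s, int *out) {
    if (s->top == 0)
        return 0;
    *out = s->data[--s->top];
    return 1;
}
```

In a real project the struct and prototypes would live in a header, with the definitions in their own .c file.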
Yes
I follow a few guidelines:
DRY (aka DIE)
Keep Cyclomatic Complexity low
Functions should fit in a Terminal window
Each one of these principles at some point will require that a function be broken up, although I suppose #2 could imply that two functions with straight-line code should be combined. It's somewhat more common to do what is called method extraction than actually splitting a function into a top and bottom half, because the usual reason is to extract common code to be called more than once.
#1 is quite useful as a decision aid. It's the same thing as saying, as I do, "never copy code".
#2 gives you a good reason to break up a function even if there is no repeated code. If the decision logic passes a certain complexity threshold, we break it up into more functions that make fewer decisions.
It is indeed a good practice to refactor code into functions, irrespective of the language being used. Even if your code is short, it will make it more readable.
If your function is quite short, you can consider inlining it.
IBM Publib article on inlining
I'm working with a set of speech processing routines (written in C) meant to be compiled with the mex command in MATLAB. There is a C function which I'm interested in accelerating using an FPGA.
The hardware takes the specified input parameters through input ports, treats the rest of the inputs as constants to be hard-coded, and passes a particular variable from somewhere within the C function, say foo, to the output port.
I am interested in tracing the computation graph (unsure if this is the right term to use) of foo, i.e. how foo relates to intermediate computed variables, which in turn eventually depend on the input parameters and hard-coded constants. This is to allow me to flatten the logic so it can be coded in a hardware description language, as well as to remove irrelevant logic which does not affect the value of foo. The catch is that some intermediate variables are global, so tracing them is a headache.
Is there an automated tool which analyzes a given set of C headers and source files and provides a means of tracing how a specified variable is altered, with some kind of dependency graph of all the variables used?
I think what you are looking for is a tool to do value analysis.
Among the tools available to do this, I think CodeSurfer is probably the best out there. Of course, it is also quite expensive, but if you are a student, they do have an academic license program. On the open-source side, Frama-C can also do this in a more limited fashion, and it has a much, much steeper learning curve. But it is free and will get you where you want to go.
I've recently started working on developing APIs written in C. I see some subroutines which expect 8 (eight) parameters, and to me it looks ugly and cumbersome to pass 8 parameters when calling that particular subroutine. I was wondering whether a cleaner, more acceptable approach could be implemented.
If a number of arguments can be logically grouped together, you may consider creating a structure containing them and simply passing that structure as an argument. For example, instead of passing two coordinate values x and y, you could pass a POINT structure instead.
But if such a grouping isn't applicable, then any number of arguments should be fine if you really need them, although it might be a sign that your function does a little too much and that you could spread the work over more, but smaller, functions.
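A sketch of that grouping (this POINT and the helper function are illustrative, not any particular library's API): two related parameters collapse into one.

```c
/* Instead of f(int x1, int y1, int x2, int y2), group the
 * coordinates and pass two POINTs. */
typedef struct { int x, y; } POINT;

static int manhattan_distance(POINT a, POINT b) {
    int dx = a.x > b.x ? a.x - b.x : b.x - a.x;
    int dy = a.y > b.y ? a.y - b.y : b.y - a.y;
    return dx + dy;
}
```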
A large number of arguments in a function call usually points to a design problem. There are ways of apparently reducing the number of parameters, such as creating structures which are passed instead of individual variables, or having global variables. I'd recommend you DON'T do either of these things; attend to the design instead. There is no quick or easy fix, but the people who have to maintain the code will thank you for it.
Yes, 8 is almost certainly too many.
Here are some old-school software engineering terms for you: cohesion and coupling. Cohesion is how well a subroutine holds together on its own, and coupling is how clean the interfaces between your routines are (or how self-sufficient your routines are).
With coupling, generally the looser the better. Interfacing only through parameters ("data coupling") is good, low coupling, while using global variables ("common coupling") is very high coupling. When you have a high number of parameters, what has usually happened is that someone has tried to hide their common coupling under a thin veneer of data coupling. It's bad design with a paint job.
With cohesion, the higher (more cohesive) the better. Any routine that modifies eight different things is quite likely to suffer from low cohesion. I'd have to see the code to know for sure, but I'd be willing to bet it would be very difficult to clearly explain what that routine does in a short sentence. Sight unseen, I'd guess it is temporally cohesive (just a bunch of stuff that needs to be done at roughly the same time).
8 could be a proper number, or it could be that many of those 8 should belong to a proper class as members; then you could pass a single instance of the class. It's hard to tell from a high-level discussion like this.
Edit: in C, structures would play the role of classes in this case.
I suggest the usage of structures too.
Maybe you want to rethink your APIs design.
Remember that your APIs will be used by developers, and it would be hard to use an 8 parameter function call.
One pattern used by some APIs (like pthreads) is "attribute objects" that are passed into functions instead of a bunch of discrete arguments. These attribute objects are opaque structures, with functions for creating, destroying, modifying and querying them. All in all this takes more code than simply dumping 10 arguments into a function, but it's a more robust approach. Sometimes a bit of extra code that makes your code much more understandable is worth the effort.
Once again, for a good example of this pattern, see the pthreads API. I wrote a lengthy article on the design of the pthreads API a couple of months ago, and this is one of the aspects I addressed in it.
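A hedged sketch of the attribute-object pattern, with invented names modeled loosely on pthread_attr_init and friends - this is not the actual pthreads API. Callers build the attribute object through accessor functions instead of passing every option as a separate argument.

```c
#include <stdlib.h>

/* In a real API this struct would be defined only in the .c file,
 * making it opaque to callers. All names here are hypothetical. */
struct conn_attr {
    int timeout_ms;
    int retries;
};

struct conn_attr *conn_attr_create(void) {
    struct conn_attr *a = malloc(sizeof *a);
    if (a) {
        a->timeout_ms = 5000;   /* defaults */
        a->retries = 3;
    }
    return a;
}

void conn_attr_set_timeout(struct conn_attr *a, int ms) { a->timeout_ms = ms; }
int  conn_attr_get_timeout(const struct conn_attr *a)   { return a->timeout_ms; }
void conn_attr_destroy(struct conn_attr *a)             { free(a); }
```

A hypothetical `connect_with_attr(host, attr)` then takes two arguments instead of ten, and new options can be added later without breaking existing callers.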
It also seems interesting to consider this question not from the standpoint of ugliness, but from the standpoint of performance.
I know that some x86 calling conventions can use registers to pass the first two arguments and the stack for all the others. So I think that, with this type of calling convention, always passing a pointer to a structure whenever a function needs more than 2 parameters overall might make the function call faster. On Itanium, registers are always used for passing parameters to a function.
I think it might be worth testing.
If the API seems cumbersome with that many parameters, use a pointer to a structure to pass parameters that are related in some way.
I'm not going to judge your design without seeing it firsthand. There are legitimate functions that require a large number of parameters, for example:
Filters with a large number of polynomial coefficients.
Color space transforms with sub-pixel masks and LUT indices.
Geometric arithmetic for n-dimensional irregular polygons.
There are also some very poor designs that lead to large parameter prototypes. Share more information about your design if you seek more germane responses.
Another thing you can do is convert some of the parameters to state. OpenGL functions have fewer parameters because you have calls like:
glBindFramebuffer( .. ) ;
glVertexAttribPointer( .. ) ;
glBindTexture( .. ) ;
glBindBuffer( .. ) ;
glBindVertexArray( .. ) ;
// the actual draw call
glDrawArrays( .. ) ;
All of these (the glBind* calls) represent "changing state", all of which will affect the next draw call. Imagine a draw call with 20-something arguments... absolutely unmanageable!
The old Windows C API for drawing also had state, stored inside "opaque pointer" objects (HDCs, HWNDs...). An opaque pointer is basically C's way of making private data members that you can't access directly. So, for example, in the Windows drawing API, you would create an HDC opaque pointer via CreateDC. You could set the DC's internal values via the SetDC* functions, for example SetDCBrushColor.
Now that you have a DC set up with a color and everything, you can use the Rectangle function to draw into the DC. You pass the HDC as the first parameter, which carries the information about what color brush to use, etc. Rectangle then takes only 5 parameters: the HDC and the rectangle's left, top, right, and bottom coordinates.
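A minimal sketch of that opaque-pointer state pattern in plain C; the names are invented, loosely modeled on the HDC functions, not the real Windows API. In a real library, the struct definition would live in the .c file and callers would see only the typedef, making the fields truly private.

```c
#include <stdlib.h>

/* Hidden in the .c file; callers only ever hold a DC. */
struct dc {
    unsigned brush_color;
};

typedef struct dc *DC;

DC create_dc(void) {
    DC dc = malloc(sizeof *dc);
    if (dc)
        dc->brush_color = 0x000000u;   /* default: black */
    return dc;
}

void set_dc_brush_color(DC dc, unsigned rgb) { dc->brush_color = rgb; }
unsigned get_dc_brush_color(DC dc)           { return dc->brush_color; }
void delete_dc(DC dc)                        { free(dc); }

/* Because the brush color travels inside the DC, the draw call
 * needs only the DC plus the geometry. */
void draw_rectangle(DC dc, int x, int y, int w, int h) {
    (void)dc; (void)x; (void)y; (void)w; (void)h;
    /* ...would rasterize here using dc->brush_color... */
}
```

The state that would otherwise be extra parameters on every draw call is set once on the DC and reused.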