I was wondering: is there a tool I can use (on a C program) that would generate a call graph at the level of individual instructions, taking into account each instruction's dependencies on other instructions? Something like a "dependency graph", but at the level of instructions in a program. I took the idea from chapter 27 of the new Cormen book (see for example p. 778), but I won't even try to hack anything together if there's already a tool available. (If you want, Chapter 27 is online here.) Thanks for any help.
Any optimizing compiler for C should be doing this kind of control-flow analysis.
On the other hand, I have no idea how easy it is to get the graph out of one (in the standalone-tool sense).
If you're taking inspiration from Figure 27.2 on page 778 of the Cormen/Rivest book, note that it is not a call graph in the usual sense.
It is a call tree, in which the nodes are execution instances of a function, not the function itself.
It's the call tree of a particular execution of the program, elaborated with information about the variables in each instance, and information about the parallelism.
To get such a complete call tree you're going to have to basically trace the entire execution. With different arguments, you will get a different trace.
It might be easier to help if your overall goal were more clear.
I am seeking advice on how to incorporate C or C++ code into my R code to speed up an MCMC program that uses a Metropolis-Hastings algorithm. I am using an MCMC approach to model the likelihood, given various covariates, that an individual will be assigned a particular rank in a social status hierarchy by a third party (the judge): each judge (approx. 80, across 4 villages) was asked to rank a group of individuals (approx. 80, across 4 villages) based on their assessment of each individual's social status. Therefore, for each judge I have a vector of ranks corresponding to their judgement of each individual's position in the hierarchy.
To model this I assume that, when assigning ranks, judges are basing their decisions on the relative value of some latent measure of an individual's utility, u. Given this, it can then be assumed that a vector of ranks, r, produced by a given judge is a function of an unobserved vector, u, describing the utility of the individuals being ranked, where the individual with the kth highest value of u will be assigned the kth rank. I model u, using the covariates of interest, as a multivariate normally distributed variable and then determine the likelihood of the observed ranks, given the distribution of u generated by the model.
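In symbols (my own notation, just to make the setup concrete), what I am assuming is roughly:

    u_j \sim \mathcal{N}(X\beta,\ \Sigma)                 % latent utilities of the individuals, as seen by judge j (X = covariates)
    r_j(i) = \#\{\, i' : u_j(i') \ge u_j(i) \,\}          % the individual with the k-th highest utility gets the k-th rank
    L(\beta,\Sigma) = \prod_j \Pr\!\big( r_j \mid u_j \sim \mathcal{N}(X\beta,\Sigma) \big)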
In addition to estimating the effects of, at most, 5 covariates, I also estimate hyperparameters describing the variance between judges and between items. Therefore, for every iteration of the chain I evaluate a multivariate normal density approximately 8-10 times. As a result, 5000 iterations can take up to 14 hours. Obviously, I need to run the chain for far more than 5000 iterations, so I need a way to speed the process up dramatically. Given this, my questions are as follows:
(i) Am I right to assume that the best speed gains will be had by running some, if not all, of my chain in C or C++?
(ii) Assuming the answer to question (i) is yes, how do I go about this? For example, is there a way for me to retain all my R functions but simply do the looping in C or C++; i.e., can I call my R functions from C and then do the looping there?
(iii) I guess what I really want to know is how best to approach the incorporation of C or C++ code into my program.
First make sure your slow R version is correct. Debugging R code might be easier than debugging C code. Done that? Great. You now have correct code you can compare against.
Next, find out what is taking the time. Use Rprof to run your code and see what is taking the time. I did this for some code I inherited once, and discovered it was spending 90% of the time in the t() function. This was because the programmer had a matrix, A, and was doing t(A) in a zillion places. I did one tA=t(A) at the start, and replaced every t(A) with tA. Massive speedup for no effort. Profile your code first.
Now, you've found your bottleneck. Is it code you can speed up in R? Is it a loop that you can vectorise? Do that. Check your results against your gold-standard correct code. Always. Yes, I know it's hard to compare algorithms that rely on random numbers, so set the seeds the same and try again.
Still not fast enough? Okay, now maybe you need to rewrite parts (the lowest level parts, generally, and those that were taking the most time in the profiling) in C or C++ or Fortran, or if you are really going for it, in GPU code.
Again, really check the code is giving the same answers as the correct R code. Really check it. If at this stage you find any bugs anywhere in the general method, fix them in what you thought was the correct R code and in your latest version, and rerun all your tests. Build lots of automatic tests. Run them often.
Read up about code refactoring. It's called refactoring because if you tell your boss you are rewriting your code, he or she will say 'why didn't you write it correctly the first time?'. If you say you are refactoring your code, they'll say "hmmm... good". THIS ACTUALLY HAPPENS.
As others have said, Rcpp is made of win.
A complete example using R, C++ and Rcpp is provided by this blog post, which was inspired by a post on Darren Wilkinson's blog (and he has more follow-ups). The example is also included with recent releases of Rcpp in a directory RcppGibbs and should get you going.
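To give a flavour of the pattern (this is not the RcppGibbs code itself, just a minimal made-up sketch of moving an MCMC inner loop into C++):

    // gibbs_sketch.cpp -- toy illustration of an MCMC inner loop in C++,
    // compiled from R with Rcpp::sourceCpp("gibbs_sketch.cpp").
    #include <Rcpp.h>
    #include <cmath>
    using namespace Rcpp;

    // [[Rcpp::export]]
    NumericMatrix gibbs_sketch(int n_iter, double shape, double rate) {
        NumericMatrix out(n_iter, 2);        // one row per iteration: (x, y)
        double x = 1.0, y = 1.0;
        for (int i = 0; i < n_iter; ++i) {
            // toy full conditionals; replace with your own
            x = R::rgamma(shape, 1.0 / (rate + y));                  // R::rgamma takes (shape, scale)
            y = R::rnorm(1.0 / (1.0 + x), 1.0 / std::sqrt(1.0 + x));
            out(i, 0) = x;
            out(i, 1) = y;
        }
        return out;
    }

From R: library(Rcpp); sourceCpp("gibbs_sketch.cpp"); draws <- gibbs_sketch(5000, 3, 2). The point is simply that the looping and the random number generation stay in compiled code.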
I have a blog post which discusses exactly this topic which I suggest you take a look at:
http://darrenjw.wordpress.com/2011/07/31/faster-gibbs-sampling-mcmc-from-within-r/
(this post is more relevant than the post of mine that Dirk refers to).
I think the best way currently to integrate C or C++ is the Rcpp package by Dirk Eddelbuettel. You can find a lot of information on his website. There is also a talk at Google, available on YouTube, that might be interesting.
Check out this project:
https://github.com/armstrtw/rcppbugs
Also, here is a link to the R/Fin 2012 talk:
https://github.com/downloads/armstrtw/rcppbugs/rcppbugs.pdf
I would suggest benchmarking each step of the MCMC sampler and identifying the bottleneck. If you put each full conditional or M-H step into a function, you can use the R compiler package, which might give you a 5-10% speed gain. The next step is to use Rcpp.
I think it would be really nice to have a general-purpose Rcpp function which generates just one single draw using the M-H algorithm, given a likelihood function (something along the lines of the sketch below).
However, with Rcpp some things become difficult if you only know the R language: non-standard random distributions (especially truncated ones) and using arrays. You have to think more like a C programmer there.
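Roughly, such a single-draw function might look like this (all names are made up, and note that calling back into an R likelihood from C++ eats much of the speed gain, so this only illustrates the pattern):

    // mh_step.cpp -- one random-walk Metropolis-Hastings update, sketch only.
    // The log-likelihood is an ordinary R function passed in from the R side.
    #include <Rcpp.h>
    #include <cmath>
    using namespace Rcpp;

    // [[Rcpp::export]]
    NumericVector mh_step(NumericVector theta, Function loglik, double step_sd) {
        int p = theta.size();
        NumericVector prop = clone(theta);
        for (int j = 0; j < p; ++j)
            prop[j] += R::rnorm(0.0, step_sd);           // random-walk proposal

        double ll_cur  = as<double>(loglik(theta));      // call back into R
        double ll_prop = as<double>(loglik(prop));

        if (std::log(R::runif(0.0, 1.0)) < ll_prop - ll_cur)
            return prop;                                  // accept
        return theta;                                     // reject
    }

From R you would then call something like theta <- mh_step(theta, function(t) my_loglik(t, data), 0.1) inside your loop; the real win comes once the likelihood itself is also moved into C++.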
The multivariate normal is actually a big issue in R. dmvnorm() is very inefficient and slow; dmnorm() is faster, but in some models it would give me NaNs sooner than dmvnorm() did.
Neither takes an array of covariance matrices, so it is impossible to vectorise the code in many instances. As long as you have a common covariance and means, however, you can vectorise, which is the R-ish strategy for speeding things up (and the opposite of what you would do in C).
I have some code and I want to find the number of times each assembly instruction is executed. I don't care whether this is done through profiling or emulation, but I want high-precision results. I once came across a forum post that gave some scripting code to do this, but I lost the link. Can anyone help me brainstorm some ways to do so?
Edit:
Okay, I think I am halfway there. Following one of the posts, I have done some research on the Branch Trace Store (BTS) described in the Intel Manual, volume 3A, section 16.4.5. This feature provides a branch history. So now I need your help to find out whether there are any open-source scripts or tools to do this. Waiting for your feedback.
If your processor supports it, you can enable Branch Trace Store (BTS). BTS stores a log of all of the taken branches in a predefined area in memory. Each entry contains the branch source and destination. Using that, you can count how many times you were in each code segment.
Look at volume 3A of the Intel Software Developer's Manual, section 16.4.5 (in the current edition) for details on how to enable it.
If you do not care about performance, you can use a small trick to count that: raise a single-step exception and, upon entering your custom SEH handler, raise another one and step to the next instruction.
Maybe profiling tools like Pin or Valgrind do that for you in an easier manner; I would suggest that you take a look.
One (although slow) method would be to write your own debugger. It would breakpoint the entry point of your program, and when that was hit it would set the trace flag in EFlags in the thread context, so it would break back into the debugger on the next instruction as well. You could then use a hash table keyed by EIP to count the number of times each instruction was hit.
Only problem is that the overhead would be extreme and the application would run very slowly.
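For what it's worth, on Linux a bare-bones version of that idea can be put together with ptrace single-stepping. The following is only a rough x86-64 sketch; it counts instruction addresses, which you would then map back to the disassembly with objdump -d or addr2line:

    // step_count.cpp -- count how many times each instruction address executes,
    // by single-stepping a child process with ptrace. Linux/x86-64 sketch only.
    #include <cstdio>
    #include <map>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char* argv[]) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <program>\n", argv[0]); return 1; }

        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);    // let the parent trace us
            execv(argv[1], &argv[1]);
            _exit(127);
        }

        std::map<unsigned long long, unsigned long long> hits;  // rip -> count
        int status;
        waitpid(child, &status, 0);                          // child stops at exec

        while (!WIFEXITED(status)) {
            user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, nullptr, &regs);
            ++hits[regs.rip];                                // count this instruction address
            ptrace(PTRACE_SINGLESTEP, child, nullptr, nullptr);
            waitpid(child, &status, 0);
        }

        for (const auto& kv : hits)
            std::printf("%#llx  %llu\n", kv.first, kv.second);
        return 0;
    }

As said above, expect a huge slowdown: every single instruction costs two context switches.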
I am implementing a call-graph program for C using a Perl script. I wonder how to resolve call graphs for function pointers using the output of 'objdump'?
How do different call-graph applications resolve function pointers?
Are function pointers resolved at run time, or can this be done statically?
EDIT
How do call graphs resolve cycles in the static evaluation of a program?
It is easy to build a call graph of A-calls-B when the call statement explicitly mentions B. It is much harder to handle indirect calls, as you've noticed.
Good static analysis tools form estimates of the contents of pointer variables by propagating pointer assignments/copies/arithmetic across the program's data flows (both intra- and inter-procedural ["global"]), using a variety of schemes that are often conservative ("you get too much").
Without such an estimate, you cannot have any idea what a pointer contains and therefore simply cannot make a useful prediction (well, you can use the ultimate conservative estimate that it will go anywhere, but I think you've already rejected that solution).
Our DMS Software Reengineering Toolkit has static control-flow/dataflow/points-to/call-graph analysis that has been applied to huge systems (~25 million lines) of C code, and produced such call graphs. The machinery to do this is pretty complex, but you can find it in advanced topics in the compiler literature. I doubt you want to implement this in Perl.
This is easier when you have source code, because you at least reliably know what is code, and what is not. You're trying to do this on object code, which means you can't even eliminate data.
Using function pointers is a way of choosing the actual function to call at runtime, so in general, it wouldn't be possible to know what would actually happen statically.
However, you could look at all functions that are possible to call and perhaps show those in some way. Often the callbacks have a unique enough signature (not always).
If you want to do better, you have to analyze the source code, to see which functions are assigned to pointers to begin with.
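As a tiny made-up illustration of why, statically, you only get a candidate set:

    /* Two functions share the signature int (*)(int). A purely static tool can
       at best report that the indirect call inside apply() targets some
       address-taken function with a matching signature (here add_one or square);
       which one actually runs is decided by data known only at run time. */
    #include <stdio.h>

    static int add_one(int x) { return x + 1; }
    static int square(int x)  { return x * x; }

    static int apply(int (*f)(int), int x) { return f(x); }   /* indirect call site */

    int main(int argc, char** argv) {
        int (*f)(int) = (argc > 1) ? square : add_one;        /* chosen at run time */
        printf("%d\n", apply(f, 7));
        return 0;
    }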
Basically, as the title implies: say we have a complex program and we want to make it faster however we can. Can we somehow detect which loops or other parts of its structure take most of the time, so that we can target them for optimization?
edit: Note that, importantly, the software is assumed to be very complex, so we can't check each loop or other structure one by one and put timers in them, etc.
You're looking for a profiler. There are several around; since you mention gcc you might want to check gprof (part of binutils). There's also Google Perf Tools although I have never used them.
You can use GDB for that, by this method.
Here's a blow-by-blow example of using it to optimize a realistically complex program.
You may find "hotspots" that you can optimize, but more
generally the things that give you the greatest opportunity for saving time are mid-level function calls that you can avoid.
One example is, say, calling a function to extract information from a database, where the function is called multiple times even though, with some extra coding, the result from a prior call could be reused.
Often such calls are small and innocent-looking, and you're totally surprised to learn how much they're costing, as an overall percent of time.
Another example is doing some low-level I/O that escapes attention, but actually costs a hefty percent of clock time.
Another example is tidal waves of notifications that propagate from seemingly trivial changes to data.
Another good tool for finding these problems is Zoom.
Here's a discussion of the technical issues, but basically what to look for is:
1. It should tell you inclusive percent of time, at line-level resolution, not just at the level of functions.
a) Only knowing that a function is costly still leaves you wondering where the lines are in it that you should look at.
b) Inclusive percent tells the true cost of a line - how much bottom-line time it is responsible for, which would not be spent if the line were not there.
2. It should include both I/O (i.e. blocked) time and CPU time, not just CPU time. A tool that only considers CPU time will not see the first two problems mentioned above.
3. If your program is interactive, the tool should operate only during the time you care about, and not while waiting for user input. You don't want to include head-scratching time in your program's performance statistics.
gprof breaks it down by function. If you have many different loops in one function, it might not tell you which loop is taking the time. This is a clue to refactor ;-)
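For example (illustrative code only), splitting a two-loop function lets gprof attribute the time to each loop separately:

    // Before: gprof reports all the time under process().
    // After the split, pass_one() and pass_two() each get their own line
    // in the flat profile, so you can see which loop dominates.
    #include <vector>

    static double pass_one(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;            // loop 1
        return s;
    }

    static double pass_two(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v)                        // loop 2
            for (double y : v) s += x * y;
        return s;
    }

    double process(const std::vector<double>& v) {
        return pass_one(v) + pass_two(v);
    }

(You may need to prevent inlining, e.g. by compiling with -O0 -pg, for the split-out functions to show up separately.)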
Yesterday I was reading about debugging techniques and found Valgrind to be really interesting. It seems to use techniques from dynamic code analysis. And I followed a link from the original reference to something else called Path Profiling.
I tried Googling but I guess I am using the wrong terms to search for a good reference on these concepts. Can someone suggest a good resource taking into account that I do not have a background in compilers and programming languages?
Path Profiling is interesting as a theoretical problem. gprof is also interesting, because it deals in call graphs, cyclical subgraphs, and such. There are nice algorithms for manipulating this information and propagating measurements throughout a structure.
All of which might tempt you to think it works for finding general performance problems (though they never say it does).
However, suppose your program hangs. How do you find the problem?
What I do is get it into the infinite loop, and then interrupt (pause) it to see what it's doing. I look at the code on each level of the call stack, because I know the loop is somewhere on the stack. If it's not obvious, I just step it along until I see it repeating itself, and then I know where the problem is. I suspect almost anyone would do that.
In fact, if you stop the program while it's taking too long and examine its state several times, you can not only find infinite loops, but almost any problem where the program runs longer than you would like.
There are profiler tools based on this concept, such as Zoom and LTProf, but for my money nothing gives as much insight as thoroughly understanding representative snapshots.
You won't find good references on this technique because (oddly) not many people are aware of it, and it's too simple to publish.
There's considerably more to say on the subject.
Actually, FWIW, I "published" an article on it, but it was only reviewed by an editor, and I don't think anyone's actually read it: Dunlavey, “Performance tuning with instruction-level cost derived from call-stack sampling”, ACM SIGPLAN Notices 42, 8 (August, 2007), pp. 4-8.