Does gprof support multithreaded applications? - c

We're developing a multithreaded project. My colleague said that gprof works perfectly with multithreaded programs, with no workaround needed. I read otherwise some time ago.
http://sam.zoy.org/writings/programming/gprof.html
http://lists.gnu.org/archive/html/bug-binutils/2010-05/msg00029.html
I also read this:
How to profile multi-threaded C++ application on Linux?
So I'm guessing the workaround is no longer needed? If so, since when is it not needed?

gprof works fine unless you change where the processing happens.
Changing the processing means using coprocessors or GPUs as computing units. In the worst case you have to manually call the setitimer function for every thread, but with the latest versions (2013-14) that is not needed.
In certain cases it misbehaves, so I advise using Intel VTune, which gives more accurate and more detailed information.
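The per-thread setitimer workaround mentioned above can be sketched by wrapping thread creation so that each new thread re-arms the profiling timer read in its creator. This is only a minimal sketch assuming POSIX threads; `profiled_thread_create`, `profiled_trampoline`, and `demo_increment` are hypothetical names, not part of gprof or glibc.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/time.h>

/* Hypothetical helper names; nothing here is part of gprof or glibc. */
struct profiled_start {
    void *(*fn)(void *);      /* the thread's real entry point */
    void *arg;                /* argument for the real entry point */
    struct itimerval timer;   /* ITIMER_PROF settings read in the creating thread */
};

/* Runs first in the new thread: re-arm the profiling timer, then call fn. */
static void *profiled_trampoline(void *p)
{
    struct profiled_start s = *(struct profiled_start *)p;
    free(p);
    setitimer(ITIMER_PROF, &s.timer, NULL);
    return s.fn(s.arg);
}

/* Returns 0 or an error number, like pthread_create. */
int profiled_thread_create(pthread_t *thread, const pthread_attr_t *attr,
                           void *(*fn)(void *), void *arg)
{
    assert(fn != NULL);
    struct profiled_start *s = malloc(sizeof *s);
    if (s == NULL)
        return ENOMEM;
    s->fn = fn;
    s->arg = arg;
    if (getitimer(ITIMER_PROF, &s->timer) != 0) {
        int err = errno;
        free(s);
        return err;
    }
    int rc = pthread_create(thread, attr, profiled_trampoline, s);
    if (rc != 0)
        free(s);   /* the trampoline never ran, so release the block here */
    return rc;
}

/* Demo worker for illustration only: adds one to the int it is given. */
void *demo_increment(void *arg)
{
    int *value = arg;
    *value += 1;
    return value;
}
```

Calling `profiled_thread_create` in place of `pthread_create` is the whole change. On systems where all threads already share the profiling timer, the extra setitimer call is expected to be redundant rather than harmful.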

Related

How do I make many system calls at once with the linux kernel?

I was wondering if I could make a large number of system calls at the same time, with only one switch overhead. I need to make many (128) system calls at once. If I could do this without switching between kernel and userland 256+ times, I think it could make my (speed-sensitive) library significantly faster.
You really can't do that from an application program. What you could do is build a loadable kernel module that implements those operations and presents a simple API -- then you can change context once, do all the work, and return.
However, as with most of these sorts of optimization questions, the first thing to ask is "why do you think it's going to be necessary?" Do you have timing information etc? Have you profiled? How much of a performance issue do you really have, and is the additional complexity going to be worth the speedup?
I don't think Linux will support syscall chaining anytime soon. You might have more luck implementing this on another kernel and porting your application.
That said, it's not difficult to write a proxy to do the job in kernelspace for you, but don't expect it to be merged upstream. I've worked on real-time stuff and we had a solution like that, but it was never used in production because of support issues :/.

What free tools can I use to profile C code on Windows?

I'm a high-school student doing some C things where I'd like to profile my code to see where the actual performance bottlenecks are. I don't have much money, so I'd prefer free tools.
I like to use the MinGW/GCC compiler toolchain. This is not something I'm stuck with, but I'd prefer tools that are capable of working with this.
Features I need:
See how much total time is spent in a certain function.
Features I'd like:
See how much time a line of code takes.
Cross-platform (being able to use the same software on Linux & Mac)
See how often a function gets called (and how long each call takes on average).
See what causes the time spent (cache misses, branch mispredictions, etc).
I've tried using gprof, but I couldn't get it to work (it only shows main in the profile), and I've heard bad things about it, so what are my options?
If you want a free time-based profiler (TBP) for Windows and Linux, AMD's CodeAnalyst should do the job nicely; it also does event-based and some other metric-based forms of profiling. It works even on Intel CPUs, though I'm not sure of the quality or reliability of the branching and cache analysis there. It also has a nice UI built in Qt that shows source and assembly line-time breakdowns, and an API for embedding events for the profiler to catch, for more targeted profiling.

Linux library for profiling

Is there a Linux library that can run performance profiling within a running process?
I have a rather large Linux program that is heavily script-based. Depending on the scripts, the program can have wildly different behaviors (and performance problems). What would be nice is a low-overhead performance library that I can embed in the same process that monitors and provides real-time feedback to the process about its own performance.
OProfile would be fantastic, if I could start it within the program and keep it isolated to only that program. From the documentation I've read, it doesn't appear possible.
Does anyone know of any such library?
Thanks!
Andrew Klofas
Check out gprof - it should do what you want.
I think gperftools works well for profiling. The runtime performance penalty for CPU profile data is very small.

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However, there is something strange and I don't know if it's an anomaly or if that's normal. I get different readouts each time I run my CUDA app.
Could these readings be related to design, or is that something I should expect?
I'm not running any graphics-intensive applications on my machine, other than Windows 7.
Well, it depends on how big the differences are. One thing that can cause anomalies is the kernel scheduler. It may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved), which shows up as more execution time. If the differences are very large, I would say check your code, but if they are small, on the order of milliseconds, I wouldn't worry about it: +-10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also, Aero is kind of intensive, so that may be adding to the discrepancies you are seeing.
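The rule of thumb above (small spreads are noise, large ones deserve a look) can be checked mechanically over several runs. This is a sketch with hypothetical helper names; the tolerance is a parameter, and the 10 ms figure is only the ballpark quoted above, not a measured constant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest minus smallest reading, in milliseconds. */
double timing_spread_ms(const double *readings_ms, size_t count)
{
    assert(readings_ms != NULL && count > 0);
    double lo = readings_ms[0];
    double hi = readings_ms[0];
    for (size_t i = 1; i < count; i++) {
        if (readings_ms[i] < lo)
            lo = readings_ms[i];
        if (readings_ms[i] > hi)
            hi = readings_ms[i];
    }
    return hi - lo;
}

/* Returns 1 when the run-to-run spread is within tolerance_ms, else 0. */
int spread_within_tolerance(const double *readings_ms, size_t count,
                            double tolerance_ms)
{
    return timing_spread_ms(readings_ms, count) <= tolerance_ms;
}
```

Feeding it the elapsed times from a handful of runs and a tolerance of about 10.0 gives a quick answer to whether the variation is worth investigating.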
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

Profiling network software / Profiling software with lot of system call waiting

I'm working on complex network software and I have trouble determining how to improve the system's performance.
Specifically, one part of the software uses blocking synchronous calls. Since this part of the system also does heavy computations, it's nearly impossible to determine whether the slowness of this component is caused by these computations or by waiting for the other parts of the system.
Are there any lightweight profilers that can capture this information? I can't use a heavy-duty profiler like Valgrind since that would completely skew the results (although Valgrind would be perfect, since it captures all the required information).
I tried using OProfile but I just wasn't able to get any meaningful results out of it (perhaps if there is a concise tutorial somewhere...).
What you need is something that gives you stack samples, on wall-clock time (not just CPU time like gprof), and reports by line (not just by function) the percent of samples containing the line.
Zoom will do it, but I just do random-pausing. Here's why it works. Here's a blow-by-blow example. Here's another explanation.
Comment out your "heavy computations" and see if it's still slow. That will tell you if it's waiting on other systems over the network or the computations. The answer may not be either/or and may just be an accumulation of things.
You could also do some old-fashioned printf debugging and print the time before and after executing the function to standard output or syslog. That is about as lightweight as profiling gets.

Resources