I am using gprof to measure the time spent in each function during the execution of my program.
Last week I noticed that when CPU usage reached 100%, the program could not even start!
The code ran for almost a day and nothing changed.
CPU usage reaching 100% is inevitable in some cases, especially when I want to stress my system and test the program while it uses the maximum amount of resources, with the help of the "stress" tool: http://weather.ou.edu/~apw/projects/stress/
I have read the thread:
Alternatives to gprof
and read Mike Dunlavey's response:
What about problems that are not so localized? Do those not matter?
Don't place expectations on gprof that were never claimed for it. It
is only a measurement tool, and only of CPU-bound operations.
and also Norman Ramsey's response, which had the highest score:
Valgrind has an instruction-count profiler with a very nice visualizer called KCacheGrind. As Mike Dunlavey recommends, Valgrind counts the fraction of instructions for which a procedure is live on the stack, although I'm sorry to say it appears to become confused in the presence of mutual recursion. But the visualizer is very nice and light years ahead of gprof.
but as the thread is closed as not constructive, I was wondering whether this is the right direction to follow.
P.S. Searching Google, I didn't find anything relevant for queries like
"why gprof doesn't work when cpu reach 100 %"
Thanks in advance
All that 100% means is it's hung, and it's not doing I/O.
You're saying the program hangs when you run it with gprof, but not if you don't?
That's weird, but I wouldn't bother trying to figure it out.
As I've said over and over, I would just grab several stack samples manually.
Then the percent of time used by any routine is just the fraction of samples it appears on, more or less.
If you think you need high-precision measurements, try a stack-sampler like Zoom or OProfile.
Related
I want to measure the execution time of a C code segment on Linux.
I take one timestamp at the beginning of the code segment and one at the end.
But I don't know how to protect the code against IRQs and context switches to higher-priority tasks. The program runs in user space!
The code segment is short, so don't worry about hosing the system.
Does anyone know an easy solution for this kind of protection?
You can use getrusage(2) to get the CPU time used, rather than just measuring real time. That should get you the answer you want without having to resort to funny business like blocking other programs from running.
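A minimal sketch of that approach, assuming a POSIX system. The helper name cpu_seconds() is hypothetical (it does not come from the answer); it sums the user and system CPU time that getrusage(2) reports for the calling process. The reported resolution may be too coarse for very short code segments.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double timeval_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical helper: user + system CPU time used so far by this
 * process, in seconds, or -1.0 if getrusage() fails. */
double cpu_seconds(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1.0;
    return timeval_seconds(usage.ru_utime) + timeval_seconds(usage.ru_stime);
}
```

Call cpu_seconds() immediately before and after the code segment and subtract the two values. Time during which other tasks run in place of yours should not show up in the difference.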
As a part of my academic project I have to execute a C program.
I want to get the execution time of the program. For that I have to put all other processes in Linux to sleep for some seconds. Is there any method for doing that?
(I have tried using the time command in Linux but it is not working properly: it shows a different execution time each time I execute the same program. So I am computing the execution time as the difference between start time and end time.)
About the best way I can think of is to drop to single-user mode, which you get with
# init 1
on pretty much any distribution. This will also stop X, so you'll be on a raw console. Handling interrupts from stray mouse movement is likely to be one of the reasons for whatever variability you're seeing, so that's a good thing.
When you want your full system back, init 3 is probably the one, that or init 5.
The usual way to do this is to try to quiesce the machine as much as possible, then take several measurements and average them. It's advisable to discard the first reading, as that's likely to involve population of caches.
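The measure-several-times-and-average advice could be sketched as follows. This is only an illustration under stated assumptions: mean_run_seconds() and sample_work() are hypothetical names, and the workload is a stand-in for the program being measured. The sketch uses clock_gettime() with CLOCK_MONOTONIC for wall-clock timing.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical stand-in for the code being measured. */
void sample_work(void)
{
    volatile unsigned long sum = 0;

    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
}

/* Run work() `runs` times, discard the first (warm-up) reading and
 * return the mean of the rest in seconds; -1.0 on invalid arguments. */
double mean_run_seconds(void (*work)(void), size_t runs)
{
    double total = 0.0;

    if (work == NULL || runs < 2)
        return -1.0;
    for (size_t i = 0; i < runs; i++) {
        double start = now_seconds();

        work();
        if (i > 0)                      /* skip the cache-warming run */
            total += now_seconds() - start;
    }
    return total / (double)(runs - 1);
}
```

For example, mean_run_seconds(sample_work, 11) averages ten readings after one discarded warm-up run.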
It is impossible to get the exact execution time of a process on a system in which the scheduler switches from one process to another.
Intel processors include a register that counts clock cycles (the time-stamp counter), but even so it is impossible to measure the time exactly.
There is a book that you can find as a PDF on Google, "Computer Systems: A Programmer's Perspective". In this book a whole chapter is dedicated to time measurements.
Use the time command. The sum user + sys will give you the time your program used the CPU directly plus the time the system used the CPU on behalf of your program. I think it is what you want to know.
There will always be a difference in execution time no matter how many processes you shut down: polling, I/O, and background daemons all affect execution priority.
The academic approach would be to run a sizeable sample and take statistics. You might also want to take a look at sar to log the background activity, so that you can invalidate any readings taken while the machine was busy.
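Try executing your application with a higher priority, e.g. nice -n -20 as root (a positive value such as 20 would lower its priority instead). It may help to keep the other processes out of the way.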
nice man page
Greetings!
I have several C applications running on CentOS Linux, compiled with gcc version 4.4.4 (THREADLIB=POSIX), and I connect to the server over SSH using putty.exe.
Because my applications use a lot of threads and I need to watch a lot of information, I use many printf calls to the screen to monitor speed and status.
When I cannot focus on one item, I press the Print Screen key and paste the capture into MS Paint, which is quite easy to use!
When I print a lot of information, for example in a for loop, the application feels slower, and it runs faster if I take those printf calls out of the loop.
My question is: does too much screen output really affect the speed of the application?
And if so, apart from reducing the printf calls, what else can I do to avoid affecting the speed too much?
Thanks for any information!
I/O is slow, and the terminal tends to be an exceptionally slow I/O device. Redirecting your output to a file will likely help substantially. To illustrate, consider the following times for a million iterations:
No printf: 0.008s
To /dev/null: 0.182s
To file: 0.22s
To terminal: 2.513s
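The answer does not show the program behind those numbers. The following is only a sketch of how comparable figures could be gathered, assuming a POSIX system; time_fprintf_loop() is a hypothetical helper, not the answerer's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Write `iterations` short lines to `out` with fprintf() and return the
 * elapsed wall-clock time in seconds, or -1.0 if `out` is NULL. */
double time_fprintf_loop(FILE *out, long iterations)
{
    struct timespec start, end;

    if (out == NULL)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        fprintf(out, "%ld\n", i);
    fflush(out);                        /* include the final flush in the timing */
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_fprintf_loop(stdout, 1000000) and reporting the result on stderr lets the same binary be compared with output going to the terminal, to a file, or to /dev/null.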
Logging to screen will cause a performance impact. Try to minimize the number of times printf is called, and write the output to file instead. That should help speed up your program somewhat.
Printing to a file may gain you some speed (depending on your system configuration), although the best way would be to reduce the amount of information you log (keep in mind that input/output operations are always considered slow). Is it really important to print in every iteration of your loop? Can't you count, average or somehow summarize the information, and then print that summary at the end of the loop?
I've been searching for a Linux sampling profiler, and callgrind has come the closest to showing useful results. However the overhead is estimated at 20--100x slower than normal. Additionally, I'm only interested in time spent per function (with particular emphasis on blocking calls such as read() and write(), which no other profiler will faithfully display).
Is there a way to turn off excess options, so that just the minimum data is recorded for generating times spent in various call stacks?
Does callgrind's cachegrind heritage imply that excess stuff is being done with regards to cache profiling etc?
I assume callgrind operates like a debugger. Can this be adjusted to sample the process at intervals, rather than every single instruction?
3) Callgrind works as a dynamic translator, which instruments the original code with counting code. Instrumentation is done for each memory-access instruction in the code (for cache simulation) and, I suppose, for each jump-like instruction, to track the execution count of every basic block.
I have a small sampling profiler which acts just like a debugger: it injects a setitimer() profiling timer into the application, then intercepts every SIGALRM and prints the current $eip value.
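The timer half of such a sampler can be sketched in-process as below. This is not the answerer's profiler: it runs inside the program instead of injecting into another one, and it only counts timer ticks rather than reading the program counter, since that part is platform-specific. The names start_sampling(), stop_sampling() and samples_taken() are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t sample_count = 0;

/* SIGALRM handler: a real profiler would record the interrupted
 * program counter here; this sketch only counts the tick. */
static void on_sample(int signo)
{
    (void)signo;
    sample_count++;
}

/* Deliver SIGALRM every `interval_usec` microseconds of real time.
 * Returns 0 on success, -1 on failure or a non-positive interval. */
int start_sampling(long interval_usec)
{
    struct sigaction sa;
    struct itimerval timer;

    if (interval_usec <= 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;           /* restart interrupted system calls */
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;
    timer.it_interval.tv_sec = interval_usec / 1000000;
    timer.it_interval.tv_usec = interval_usec % 1000000;
    timer.it_value = timer.it_interval;
    return setitimer(ITIMER_REAL, &timer, NULL);
}

/* Disarm the timer. Returns 0 on success, -1 on failure. */
int stop_sampling(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_REAL, &off, NULL);
}

/* Number of timer ticks seen so far. */
long samples_taken(void)
{
    return (long)sample_count;
}
```

ITIMER_REAL is used because it is the timer that delivers SIGALRM, matching the description above; ITIMER_PROF with SIGPROF would sample on CPU time instead.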
There were earlier sampling profilers using the setitimer approach, and there is also the profil() function for something similar. It is used by glibc/gmon/gmon.c and gprof -p (to be exact, by gcc -pg). profil() can profile a single contiguous code fragment by sampling on virtual CPU time every 1 or 10 milliseconds. There is also an sprofil() function.
Check also LD_PRELOAD=/lib/libpcprofile.so PCPROFILE_OUTPUT=output.file, but I don't know whether it works or how it works.
For numbered questions:
2) "Callgrind is an extension to Cachegrind. It provides all the information that Cachegrind does, plus extra information about callgraphs." So it can provide anything that is in cachegrind, but it also allows the user to turn off cache simulation: --simulate-cache=no (it is the default value)
For speed: according to http://www.valgrind.org/docs/manual/nl-manual.html, the manual of the Nul valgrind tool (aka nulgrind), which does no additional instrumentation, the slowdown is about 5 times. This is because the program is dynamically translated by valgrind itself. So there can be no valgrind tool that works faster than nulgrind.
Have you tried gprof? It does not have as much overhead as valgrind does.
Try using Zoom from RotateRight. It has a "Thread Time" configuration that samples all threads in a single process whether they are running or blocked.
I have a small C program to calculate hashes (for hash tables). The code looks quite clean I hope, but there's something unrelated to it that's bugging me.
I can easily generate about one million hashes in about 0.2-0.3 seconds (benchmarked with /usr/bin/time). However, when I'm printf()ing them in the for loop, the program slows down to about 5 seconds.
1) Why is this?
2) How to make it faster? mmap()ing stdout maybe?
3) How is the standard C library designed with regard to this, and how may it be improved?
4) How could the kernel support it better? How would it need to be modified to make the throughput on local "files" (sockets, pipes, etc.) REALLY fast?
I'm looking forward to interesting and detailed replies. Thanks.
PS: this is for a compiler construction toolset, so don't be shy about getting into details. While that has nothing to do with the problem itself, I just wanted to point out that details interest me.
Addendum
I'm looking for more programmatic approaches for solutions and explanations. Indeed, piping does the job, but I don't have control over what the "user" does.
Of course, I'm doing a testing right now, which wouldn't be done by "normal users". BUT that doesn't change the fact that a simple printf() slows down a process, which is the problem I'm trying to find an optimal programmatic solution for.
Addendum - Astonishing results
The reference time is for plain printf() calls inside a TTY and takes about 4 mins 20 secs.
Testing under a /dev/pts (e.g. Konsole) speeds up the output to about 5 seconds.
It takes about the same amount of time when using setbuffer() in my testing code to a size of 16384, almost the same for 8192: about 6 seconds.
setbuffer() has apparently no effect when using it: it takes the same amount of time (on a TTY about 4 mins, on a PTS about 5 seconds).
The astonishing thing is, if I'm starting the test on TTY1 and then switch to another TTY, it does take just the same as on a PTS: about 5 seconds.
Conclusion: the kernel does something which has to do with accessibility and user friendliness. HUH!
Normally, it should be equally slow no matter whether you stare at the TTY while it's active or switch over to another TTY.
Lesson: when running output-intensive programs, switch to another TTY!
Unbuffered output is very slow.
By default stdout is fully buffered; however, when attached to a terminal, stdout is either unbuffered or line-buffered.
Try to switch on buffering for stdout using setvbuf(), like this:
/* Call before any other output on stdout; the buffer must outlive the stream. */
static char buffer[8192];
setvbuf(stdout, buffer, _IOFBF, sizeof(buffer));
You could store your strings in a buffer and output them to a file (or console) at the end or periodically, when your buffer is full.
If outputting to a console, scrolling is usually a killer.
If you are printf()ing to the console it's usually extremely slow. I'm not sure why but I believe it doesn't return until the console graphically shows the outputted string. Additionally you can't mmap() to stdout.
Writing to a file should be much faster (but still orders of magnitude slower than computing a hash; all I/O is slow).
You can try to redirect output in shell from console to a file. Using this, logs with gigabytes in size can be created in just seconds.
I/O is always slow in comparison to straight computation. The system has to wait for more components to be available in order to use them. It then has to wait for the response before it can carry on. Conversely, if it's simply computing, then it's only really moving data between the RAM and CPU registers.
I've not tested this, but it may be quicker to append your hashes onto a string, and then just print the string at the end. Although if you're using C, not C++, this may prove to be a pain!
3 and 4 are beyond me I'm afraid.
As I/O is always much slower than CPU computation, you might store all values in the fastest available storage first. Use RAM if you have enough; otherwise use files, although they are much slower than RAM.
Printing out the values can now be done afterwards or in parallel by another thread. So the calculation thread(s) may not need to wait until printf has returned.
I discovered long ago using this technique something that should have been obvious.
Not only is I/O slow, especially to the console, but formatting decimal numbers is not fast either. If you can put the numbers in binary into big buffers, and write those to a file, you'll find it's a lot faster.
Besides, who's going to read them? There's no point printing them all in a human-readable format if nobody needs to read all of them.
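A sketch of the binary-buffer idea, assuming the hashes are 32-bit values (the question does not say). The helper names write_hashes_binary() and read_hashes_binary() are hypothetical. The data is written in host byte order, so a file produced this way is only portable between machines with the same endianness.

```c
#include <stdint.h>
#include <stdio.h>

/* Write `count` 32-bit hashes to `out` in raw binary form with a single
 * fwrite() call. Returns the number of values actually written. */
size_t write_hashes_binary(FILE *out, const uint32_t *hashes, size_t count)
{
    if (out == NULL || hashes == NULL)
        return 0;
    return fwrite(hashes, sizeof hashes[0], count, out);
}

/* Read up to `count` hashes back from `in`. Returns the number read. */
size_t read_hashes_binary(FILE *in, uint32_t *hashes, size_t count)
{
    if (in == NULL || hashes == NULL)
        return 0;
    return fread(hashes, sizeof hashes[0], count, in);
}
```

No decimal or hexadecimal formatting happens on the write path; a separate tool can format only the values somebody actually wants to read.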
Why not create the strings on demand rather than at the point of construction? There is no point in outputting 40 screens of data in one second; how can you possibly read it? Why not create the output as required, display just the last screenful, and then produce more if the user scrolls?
Why not use sprintf to print to a string and then build a concatenated string of all the results in memory and print at the end?
By switching to sprintf you can clearly see how much time is spent in the format conversion and how much is spent displaying the result to the console and change the code appropriately.
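A sketch of formatting into memory first. It uses snprintf(), the bounds-checked variant of sprintf, and the helper name format_hashes() and the "%08x" line format are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdio.h>

/* Format `count` hashes as 8-digit hex lines into `out` (capacity `cap`
 * bytes). Stops early if the next line would not fit. Returns the number
 * of characters stored, not counting the terminating '\0'. */
size_t format_hashes(const unsigned int *hashes, size_t count,
                     char *out, size_t cap)
{
    size_t used = 0;

    if (hashes == NULL || out == NULL || cap == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int n = snprintf(out + used, cap - used, "%08x\n", hashes[i]);

        if (n < 0 || (size_t)n >= cap - used) {
            out[used] = '\0';           /* drop the truncated line */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

Timing format_hashes() separately from a single fputs(buffer, stdout) at the end shows how the cost splits between format conversion and console output.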
Console output is by definition slow, whereas creating a hash only manipulates a few bytes of memory. Console output needs to go through many layers of the operating system, which include code to handle thread/process locking and so on. Once it eventually gets to the display driver, that may be a 9600 baud device! Or, on a large bitmap display, simple functions like scrolling the screen may involve manipulating megabytes of memory.
I guess the terminal type is using some buffered output operations, so when you do a printf it does not happen to output in split micro-seconds, it is stored in the buffer memory of the terminal subsystem.
This could be impacted by other things that cause a slowdown; perhaps there's a memory-intensive operation running on the machine other than your program. In short, there are far too many things that could all be happening at the same time: paging, swapping, heavy I/O by another process, the configuration of the memory in use, maybe a memory upgrade, and so on.
It might be better to concatenate the strings until a certain limit is reached, then when it is, write it all out at once. Or even using pthreads to carry out the desired process execution.
Edited:
As for 2 and 3, they are beyond me. For 4, I am not familiar with Sun, but I do know of and have messed with Solaris. There may be a kernel option to use a virtual tty. I'll admit it's been a while since I messed with the kernel configs and recompiled it, so my memory may not be great on this; have a root around in the options to see.
user#host:/usr/src/linux $ make; make menuconfig **OR kconfig if from X**
This will fire up the kernel menu; have a dig around to find the video settings section under the devices sub-tree.
Edited:
But there's a tweak you can put into the kernel by adding a file to the proc filesystem (if such a thing exists), or possibly a switch passed to the kernel, something like this (this is imaginary and does not imply it actually exists): fastio
Hope this helps,
Best regards,
Tom.