I'm having a performance problem when I run a benchmark program on a virtual machine with 1 pCPU and 4 vCPUs. I used mpstat to investigate what might be causing the problem and how; it shows an increase in steal time, idle time and sys time.
Here is the output of mpstat on my virtual machine while I run my program:
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
all 3.14 0.00 8.00 0.00 0.00 0.00 58.29 0.00 30.57
0 4.60 0.00 8.05 0.00 0.00 0.00 64.37 0.00 22.99
1 2.27 0.00 5.68 0.00 0.00 0.00 53.41 0.00 38.64
2 3.37 0.00 8.99 0.00 0.00 0.00 61.80 0.00 25.84
3 3.45 0.00 8.05 0.00 0.00 0.00 54.02 0.00 34.48
And here is its output on a host machine (when I run the program there):
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
all 91.80 0.00 0.50 0.00 0.00 0.00 0.00 0.00 7.70
0 91.00 0.00 0.50 0.00 0.00 0.00 0.00 0.00 8.50
1 93.03 0.00 0.50 0.00 0.00 0.00 0.00 0.00 6.47
2 90.59 0.00 0.50 0.00 0.00 0.00 0.00 0.00 8.91
3 93.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.00
I know that (from ibm.com):
Steal time is the percentage of time a virtual CPU waits for a real
CPU while the hypervisor is servicing another virtual processor.
I understand that this increase in steal time is probably because my virtual machine is over-committed, but I wonder why I did not have this problem with the other programs I tested, and I don't know why the idle time increases as well.
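For reference, the same steal counter can also be read directly from /proc/stat; here is a minimal sketch (assuming Linux, where steal is the eighth value on the aggregate cpu line):

/* Minimal sketch: print the cumulative steal time (in clock ticks) from the
   aggregate "cpu" line of /proc/stat on Linux.
   Field order: user nice system idle iowait irq softirq steal ... */
#include <stdio.h>

int main(void) {
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) == 8)
        printf("steal ticks: %llu\n", steal);
    fclose(f);
    return 0;
}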
The source code for my test program is quite large, so I posted it here on GitHub. I don't know whether it is important, but this code does some busy waiting. Here is the code where the busy waiting happens:
int parsec_barrier_wait(parsec_barrier_t *barrier) {
  int master;
  int rv;

  if(barrier==NULL) return EINVAL;

  rv = pthread_mutex_lock(&barrier->mutex);
  if(rv != 0) return rv;

  //A (necessarily) unsynchronized polling loop followed by fall-back blocking
  if(!barrier->is_arrival_phase) {
    pthread_mutex_unlock(&barrier->mutex);
    volatile spin_counter_t i=0;
    while(!barrier->is_arrival_phase && i<SPIN_COUNTER_MAX) i++;
    while((rv=pthread_mutex_trylock(&barrier->mutex)) == EBUSY);
    if(rv != 0) return rv;
    //Fall back to normal waiting on condition variable if necessary
    while(!barrier->is_arrival_phase) {
      rv = pthread_cond_wait(&barrier->cond, &barrier->mutex);
      if(rv != 0) {
        pthread_mutex_unlock(&barrier->mutex);
        return rv;
      }
    }
  }

  //We are guaranteed to be in an arrival phase, proceed with barrier synchronization
  master = (barrier->n == 0); //Make first thread at barrier the master

  barrier->n++;
  if(barrier->n >= barrier->max) {
    //This is the last thread to arrive, don't wait instead
    //start a new departure phase and wake up all other threads
    barrier->is_arrival_phase = 0;
    pthread_cond_broadcast(&barrier->cond);
  } else {
    //wait for last thread to arrive (which will end arrival phase)
    //we use again an unsynchronized polling loop followed by synchronized fall-back blocking
    if(barrier->is_arrival_phase) {
      pthread_mutex_unlock(&barrier->mutex);
      volatile spin_counter_t i=0;
      while(barrier->is_arrival_phase && i<SPIN_COUNTER_MAX) i++;
      while((rv=pthread_mutex_trylock(&barrier->mutex)) == EBUSY);
      if(rv != 0) return rv;
      //Fall back to normal waiting on condition variable if necessary
      while(barrier->is_arrival_phase) {
        rv = pthread_cond_wait(&barrier->cond, &barrier->mutex);
        if(rv != 0) {
          pthread_mutex_unlock(&barrier->mutex);
          return rv;
        }
      }
    }
  }

  barrier->n--;
  //last thread to leave barrier starts a new arrival phase
  if(barrier->n == 0) {
    barrier->is_arrival_phase = 1;
    pthread_cond_broadcast(&barrier->cond);
  }
  pthread_mutex_unlock(&barrier->mutex);

  return (master ? PARSEC_BARRIER_SERIAL_THREAD : 0);
}
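For reference, this is roughly what the wait would look like with the spin phase removed and a plain POSIX barrier used instead (a minimal sketch with hypothetical names, not the PARSEC API); on an over-committed VM, spinning keeps a vCPU busy doing no useful work while the other vCPUs compete for the single pCPU:

/* Minimal sketch with hypothetical names, not the PARSEC API: a purely
   blocking barrier built on pthread_barrier_t instead of spin-then-block. */
#include <pthread.h>

static pthread_barrier_t blocking_barrier;

int blocking_barrier_init(unsigned nthreads) {
    return pthread_barrier_init(&blocking_barrier, NULL, nthreads);
}

int blocking_barrier_wait(void) {
    /* Returns PTHREAD_BARRIER_SERIAL_THREAD to exactly one waiter,
       0 to the others, or an error number on failure. */
    return pthread_barrier_wait(&blocking_barrier);
}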
Any ideas would be very much appreciated.
I'm interested in profiling the function grep_source_is_binary()[1], whose code is as follows:
static int grep_source_is_binary(struct grep_source *gs,
                                 struct index_state *istate)
{
    grep_source_load_driver(gs, istate);
    if (gs->driver->binary != -1)
        return gs->driver->binary;

    if (!grep_source_load(gs))
        return buffer_is_binary(gs->buf, gs->size);

    return 0;
}
gprof's call graph gives me this information:
0.00 1.58 304254/304254 grep_source_1 [6]
[7] 72.9 0.00 1.58 304254 grep_source_is_binary [7]
0.01 1.20 304254/304254 show_line_header [8]
0.00 0.37 303314/607568 grep_source_load [15]
This seems weird to me, as show_line_header() is not called by grep_source_is_binary() nor by any of its children. Am I misinterpreting gprof's output?
[1]: at https://github.com/git/git/blob/master/grep.c#L2183
Check whether your code is being optimized by the compiler. If it is, disable optimization by compiling with -O0.
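For instance, with optimization enabled the compiler will usually inline small static functions, so they never appear in gprof's output and their time is attributed to the caller. A minimal illustration (hypothetical code, not the code from the question):

/* Hypothetical example, not the code from the question: at -O2 the compiler
   typically inlines square() into main(), so it never shows up in gprof's
   flat profile or call graph; compiling with "gcc -pg -O0 example.c" keeps
   the call visible. */
#include <stdio.h>

static int square(int x) { return x * x; }

int main(void) {
    long long sum = 0;
    for (int i = 0; i < 100000000; i++)
        sum += square(i % 1000);
    printf("%lld\n", sum);
    return 0;
}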
I am trying to profile a C program that uses some functions from openssl/libcrypto. Everything works well when I compile and run the code without profiling information, but when I add the options to profile it with gprof, I get unexpected results from the profiling tool.
I did a lot of research but didn't find any page that solved my problem.
This is my code (named test.c):
#include <stdio.h>
#include <openssl/bn.h>
#include <openssl/rand.h>
static BIGNUM *x;
static BIGNUM *y;
static BIGNUM *z;
static BIGNUM *p;
static BN_CTX *tmp;
static unsigned int max_size;
int main(void){
    int max_bytes, res_gen;

    max_bytes = 50;
    tmp = BN_CTX_new();
    BN_CTX_init(tmp);
    x = BN_new();
    y = BN_new();
    z = BN_new();
    p = BN_new();

    RAND_load_file("/dev/urandom", max_bytes);

    max_size = 256;
    BN_rand(x, max_size, 0, 0);
    BN_rand(y, max_size, 0, 0);

    res_gen = BN_generate_prime_ex(p, max_size, 0, NULL, NULL, NULL);
    BN_mul(z, x, y, tmp);
    BN_nnmod(x, z, p, tmp);

    printf("\nOk\n");

    BN_free(x);
    BN_free(y);
    BN_free(z);
    BN_free(p);
    BN_CTX_free(tmp);
    return 0;
}
When I compile with profiling information using gcc -pg -static test.c -lcrypto -ldl, it produces the following results. I get 0% (and 0 seconds) for everything, which is unexpected.
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 1 0.00 0.00 main
Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time self children called name
0.00 0.00 1/1 __libc_start_main [4282]
[1] 0.0 0.00 0.00 1 main [1]
0.00 0.00 0/0 mcount (3495)
0.00 0.00 0/0 BN_CTX_new [275]
0.00 0.00 0/0 BN_CTX_init [274]
0.00 0.00 0/0 BN_new [372]
0.00 0.00 0/0 RAND_load_file [1636]
0.00 0.00 0/0 BN_rand [386]
0.00 0.00 0/0 BN_generate_prime_ex [331]
0.00 0.00 0/0 BN_mul [370]
0.00 0.00 0/0 BN_nnmod [378]
0.00 0.00 0/0 puts [3696]
0.00 0.00 0/0 BN_free [327]
0.00 0.00 0/0 BN_CTX_free [272]
-----------------------------------------------
Also, it seems that the profiler detects only the main function, because details for the other functions don't appear in the flat profile or the call graph.
So I would like to know whether I must compile the OpenSSL library with some options (and if so, which ones), or do something else.
gprof is a CPU-profiler. That means it cannot account for any time spent in blocking system calls like I/O, sleep, locks, etc.
Assuming the goal is to find speedups,
what a lot of people do is run the program under a debugger like GDB and take stack samples manually.
This isn't measuring anything with precision; it's pinpointing problems. Anything you see the program doing that you could avoid, if you see it on more than one sample, will give you a substantial speedup when you remove it. The fewer samples you have to take before seeing it twice, the bigger the speedup.
I have a decent understanding of x86 assembly and I know that when a function is called, all the arguments are pushed onto the stack.
I have a function which basically loops through an 8-by-8 array and calls other functions based on the values in the array. Each of these calls passes 6-10 arguments. The program (a chess AI) takes a very long time to run, and this function accounts for 20% of the running time.
So I guess my question is: what can I do to give my functions access to the variables they need in a faster way?
int row, col, i;

determineCheckValidations(eval_check, b, turn);
int * eval_check_p = &(eval_check[0][0]);

for(row = 0; row < 8; row++){
    for(col = 0; col < 8; col++, eval_check_p++){
        if (b->colors[row][col] == turn){
            int type = b->types[row][col];
            if (type == PAWN)
                findPawnMoves(b,moves_found,turn,row,col,last_move,*eval_check_p);
            else if (type == KNIGHT)
                findMappedNoIters(b,moves_found,turn,row,col,*move_map_knight, 8, *eval_check_p);
            else if (type == BISHOP)
                findMappedIters(b,moves_found,turn,row,col,*move_map_bishop, 4, *eval_check_p);
            else if (type == ROOK)
                findMappedIters(b,moves_found,turn,row,col,*move_map_rook, 4, *eval_check_p);
            else if (type == QUEEN)
                findMappedIters(b,moves_found,turn,row,col,*move_map_queen, 8, *eval_check_p);
            else if (type == KING){
                findMappedNoIters(b,moves_found,turn,row,col,*move_map_king, 8, *eval_check_p);
                findCastles(b,moves_found,turn,row,col);
            }
        }
    }
}
All the code can be found at https://github.com/AndyGrant/JChess/tree/master/_Core/_Scripts
A sample of the profile:
% cumulative self self total
time seconds seconds calls s/call s/call name
20.00 1.55 1.55 2071328 0.00 0.00 findAllValidMoves
14.84 2.70 1.15 10418354 0.00 0.00 checkMove
10.06 3.48 0.78 1669701 0.00 0.00 encodeBoard
7.23 4.04 0.56 10132526 0.00 0.00 findMappedIters
6.84 4.57 0.53 1669701 0.00 0.00 getElement
6.71 5.09 0.52 68112169 0.00 0.00 createNormalMove
You have done good work on the profiling. Now take the worst-performing function and profile it in more detail.
You may want to try different compiler optimization settings when you profile.
Try some common optimization techniques, such as loop unrolling and factoring out invariants from loops.
You may get some improvements by designing your functions with the processor's data cache in mind. Search the web for "optimizing data cache".
If the function works correctly, I recommend posting it to CodeReview.StackExchange.com.
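As an example of reducing the per-call argument traffic the question asks about, you could pack the values that stay constant across the whole move-generation pass into a context struct and pass a single pointer. A self-contained sketch with hypothetical names (not the actual JChess API):

/* Hypothetical, self-contained sketch of the idea: pass one pointer to a
   context struct holding the loop-invariant values instead of pushing
   6-10 separate arguments on every call. The names below are made up,
   not the JChess API. */
#include <stdio.h>

typedef struct {
    const int *board;   /* stand-ins for b, moves_found, turn, last_move */
    int turn;
    int last_move;
} SearchCtx;

/* before: every invariant is passed individually on each call */
static int score_many_args(const int *board, int turn, int last_move,
                           int row, int col) {
    return board[row * 8 + col] * turn + last_move;
}

/* after: the invariants travel behind a single pointer */
static int score_ctx(const SearchCtx *ctx, int row, int col) {
    return ctx->board[row * 8 + col] * ctx->turn + ctx->last_move;
}

int main(void) {
    int board[64] = {0};
    SearchCtx ctx = { board, 1, 0 };
    int sum1 = 0, sum2 = 0;
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            sum1 += score_many_args(board, 1, 0, row, col);
            sum2 += score_ctx(&ctx, row, col);
        }
    }
    printf("%d %d\n", sum1, sum2);   /* identical results */
    return 0;
}

Whether this actually helps depends on the calling convention: on x86-64 the first several integer and pointer arguments already travel in registers, so re-profile after the change before assuming a win.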
Don't assume anything.
I want to store event times.
I found these two functions, but I don't know which one is faster.
#include <sys/time.h>   /* gettimeofday() */
#include <time.h>       /* clock_gettime() */

int main()
{
    struct timespec tp;
    struct timeval tv;
    int i = 0;
    int j = 0;           /* keeps the calls from being optimized away */

    for (i = 0; i < 500000; i++)
    {
        gettimeofday(&tv, NULL);
        j += tv.tv_usec % 2;

        clock_gettime(CLOCK_HIGHRES, &tp);   /* CLOCK_HIGHRES is Solaris-specific */
        j += tp.tv_sec % 2;
    }
    return 0;
}
%Time Seconds Cumsecs #Calls msec/call Name
68.3 0.28 0.28 500000 0.0006 __clock_gettime
22.0 0.09 0.37 500000 0.0002 _gettimeofday
7.3 0.03 0.40 1000009 0.0000 _mcount
2.4 0.01 0.41 1 10. main
0.0 0.00 0.41 4 0. atexit
0.0 0.00 0.41 1 0. _exithandle
0.0 0.00 0.41 1 0. _fpsetsticky
0.0 0.00 0.41 1 0. _profil
0.0 0.00 0.41 1 0. exit
This ran on a Solaris 9 V440 box. Profile the code on your own box; this result is completely different from the result quoted in a link supplied earlier. In other words, if there is a difference it will be due to the implementation.
It will depend on your system; there's no standard for which is faster. My guess would be they're usually both about the same speed. gettimeofday is more traditional and probably a good bet if you want your code to be portable to older systems. clock_gettime has more features.
You don't care.
If you're calling either of them often enough that it matters, You're Doing It Wrong.
Profile the code that is causing specific performance problems and check how much time it spends in there. If it's too much, consider refactoring it so that it calls the function less often.
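For example, if the surrounding code only needs a coarse notion of "now", you can refresh the timestamp every N iterations and reuse the cached value in between. A minimal sketch (assuming millisecond-level staleness is acceptable):

/* Sketch: refresh the timestamp only once every 1024 iterations and reuse
   the cached value for the events in between. Assumes the loop body is
   fast enough that the cached time is acceptably stale. */
#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval cached;
    gettimeofday(&cached, NULL);

    for (long i = 0; i < 5000000; i++) {
        if ((i & 1023) == 0)          /* re-read the clock once per 1024 events */
            gettimeofday(&cached, NULL);
        /* ... record the event using the cached timestamp ... */
    }
    printf("last cached usec: %ld\n", (long)cached.tv_usec);
    return 0;
}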
It depends on the constant used with clock_gettime(). For the fastest clocks there are the CLOCK_*_COARSE constants; these timers are the fastest but are less precise.
gettimeofday() should return the same as clock_gettime(CLOCK_REALTIME).
Also, on Linux the benchmark results depend on the architecture, because Linux uses a special mechanism (the vDSO) to avoid a syscall when reading the time. This mechanism does not work on 32-bit x86; compare the strace output.
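A quick way to see the difference on your own machine is to time both clock IDs (a minimal sketch; CLOCK_MONOTONIC_COARSE is Linux-specific, and on older glibc you may need to link with -lrt):

/* Sketch: compare the per-call cost of a precise vs. a coarse clock.
   CLOCK_MONOTONIC_COARSE is Linux-specific; its values only advance at
   the granularity reported by clock_getres(). */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

static void bench(clockid_t id, const char *name) {
    struct timespec t0, t1, tmp;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000000; i++)
        clock_gettime(id, &tmp);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    printf("%-24s ~%lld ns per call\n", name, ns / 1000000);
}

int main(void) {
    struct timespec res;
    bench(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
    bench(CLOCK_MONOTONIC_COARSE, "CLOCK_MONOTONIC_COARSE");
    clock_getres(CLOCK_MONOTONIC_COARSE, &res);
    printf("coarse resolution: %ld ns\n", res.tv_nsec);
    return 0;
}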
Have a look at this discussion. It seems of good quality and very closely related to what you might be looking for, though it is a little dated.
When I run gprof on my C program it says no time accumulated for my program and shows 0 time for all function calls. However it does count the function calls.
How do I modify my program so that gprof will be able to count how much time something takes to run?
Did you specify -pg when compiling?
http://sourceware.org/binutils/docs-2.20/gprof/Compiling.html#Compiling
Once it is compiled, you run the program and then run gprof on the binary.
E.g.:
test.c:
#include <stdio.h>

int main ()
{
    int i;

    for (i = 0; i < 10000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Compiling with cc -pg test.c, then running a.out, and then running gprof a.out gives me:
granularity: each sample hit covers 4 byte(s) for 1.47% of 0.03 seconds
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.6 0.02 0.02 10000 0.00 0.00 __sys_write [10]
45.6 0.03 0.02 0 100.00% .mcount (26)
2.9 0.03 0.00 20000 0.00 0.00 __sfvwrite [6]
1.5 0.03 0.00 20000 0.00 0.00 memchr [11]
1.5 0.03 0.00 10000 0.00 0.00 __ultoa [12]
1.5 0.03 0.00 10000 0.00 0.00 _swrite [9]
1.5 0.03 0.00 10000 0.00 0.00 vfprintf [2]
What are you getting?
I tried running Kinopiko's example, except I increased the number of iterations by a factor of 100.
test.c:
#include <stdio.h>

int main ()
{
    int i;

    for (i = 0; i < 1000000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Then I took 10 stackshots (under VC, but you can use pstack). Here are the stackshots:
9 copies of this stack:
NTDLL! 7c90e514()
KERNEL32! 7c81cbfe()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
1 copy of this stack:
KERNEL32! 7c81cb96()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
In case it isn't obvious, this tells you that:
mainCRTStartup() line 206 + 25 bytes Cost ~100% of the time
main() line 7 + 14 bytes Cost ~100% of the time
printf() line 62 + 14 bytes Cost ~100% of the time
_ftbuf() line 171 + 9 bytes Cost ~100% of the time
_flush() line 162 + 23 bytes Cost ~100% of the time
_write() line 168 + 57 bytes Cost ~100% of the time
In a nutshell, the program spends ~100% of its time flushing the output buffer to disk (or the console) as part of the printf on line 7.
(What I mean by the "cost" of a line is the fraction of total time spent at the request of that line, and that's roughly the fraction of samples that contain it.
If that line could be made to take no time, such as by removing it, skipping over it, or passing its work off to an infinitely fast coprocessor, that time fraction is how much the total time would shrink. So if the execution of any of these lines of code could be avoided, time would shrink by somewhere in the range of 95% to 100%. If you were to ask "What about recursion?", the answer is It Makes No Difference.)
Now, maybe you want to know something else, like how much time is spent in the loop, for example. To find that out, remove the printf because it's hogging all the time. Maybe you want to know what % of time is spent purely in CPU time, not in system calls. To get that, just throw away any stackshots that don't end in your code.
The point I'm trying to make is that if you're looking for things you can fix to make the code run faster, the data gprof gives you, even if you understand it, is almost useless. By comparison, if some of your code is causing more wall-clock time to be spent than you would like, stackshots will pinpoint it.
One gotcha with gprof: it doesn't work with code in dynamically-linked libraries. For that, you need to use sprof. See this answer: gprof : How to generate call graph for functions in shared library that is linked to main program
First compile your application with -g, and check which CPU counters you are using.
If your application runs very quickly, the profiler could simply miss all the events, or collect fewer than required (reduce the number of events to read).
Actually, profiling should show you CPU_CLK_UNHALTED or INST_RETIRED events without any special switches. But with such data you will only be able to say how well your code is performing, via the ratio INST_RETIRED/CPU_CLK_UNHALTED (for example, 2 billion retired instructions over 4 billion unhalted cycles is an IPC of 0.5).
Try the Intel VTune profiler; it is free for 30 days and for educational use.