I have a decent understanding of x86 assembly, and I know that when a function is called, all of the arguments are pushed onto the stack.
I have a function which basically loops through an 8 by 8 array and calls some functions based on the values in the array. Each of these function calls involves passing 6-10 arguments. The program is a chess AI and takes a very long time to run, and this one function accounts for 20% of the running time.
So my question is: what can I do to give my functions access to the variables they need in a faster way?
int row, col, i;

determineCheckValidations(eval_check, b, turn);
int *eval_check_p = &(eval_check[0][0]);

for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++, eval_check_p++) {
        if (b->colors[row][col] == turn) {
            int type = b->types[row][col];
            if (type == PAWN)
                findPawnMoves(b, moves_found, turn, row, col, last_move, *eval_check_p);
            else if (type == KNIGHT)
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_knight, 8, *eval_check_p);
            else if (type == BISHOP)
                findMappedIters(b, moves_found, turn, row, col, *move_map_bishop, 4, *eval_check_p);
            else if (type == ROOK)
                findMappedIters(b, moves_found, turn, row, col, *move_map_rook, 4, *eval_check_p);
            else if (type == QUEEN)
                findMappedIters(b, moves_found, turn, row, col, *move_map_queen, 8, *eval_check_p);
            else if (type == KING) {
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_king, 8, *eval_check_p);
                findCastles(b, moves_found, turn, row, col);
            }
        }
    }
}
All the code can be found at https://github.com/AndyGrant/JChess/tree/master/_Core/_Scripts
A sample of the profile:
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
 20.00      1.55     1.55   2071328     0.00     0.00  findAllValidMoves
 14.84      2.70     1.15  10418354     0.00     0.00  checkMove
 10.06      3.48     0.78   1669701     0.00     0.00  encodeBoard
  7.23      4.04     0.56  10132526     0.00     0.00  findMappedIters
  6.84      4.57     0.53   1669701     0.00     0.00  getElement
  6.71      5.09     0.52  68112169     0.00     0.00  createNormalMove
You have done good work with your profiling. Now take the function with the worst cost and profile it in more detail.
You may want to try different compiler optimization settings when you profile.
Try some common optimization techniques, such as loop unrolling and factoring out invariants from loops.
You may get some improvements by designing your functions with the processor's data cache in mind. Search the web for "optimizing data cache".
If the function works correctly, I recommend posting it to CodeReview.StackExchange.com.
Don't assume anything.
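On the specific question of passing 6-10 arguments per call: one common approach is to gather the arguments that never change inside the loops into a single struct and pass one pointer. The sketch below is only an illustration with hypothetical names (MoveGenCtx, the placeholder Board/Moves/Move typedefs, and the findPawnMovesCtx wrapper are not from the repository); whether it helps at all depends on the calling convention (on x86-64 the first several integer arguments travel in registers, not on the stack) and on what the optimizer already does, so measure before and after.

/* Hypothetical placeholder types standing in for the project's real ones. */
typedef struct Board Board;
typedef struct Moves Moves;
typedef struct Move  Move;

/* Assumed shape of the existing function from the loop above. */
void findPawnMoves(Board *b, Moves *moves_found, int turn, int row, int col,
                   Move *last_move, int eval_check);

/* Everything that is loop-invariant lives here and is set up once. */
typedef struct {
    Board *b;
    Moves *moves_found;
    int    turn;
    Move  *last_move;
} MoveGenCtx;

/* Thin wrapper: each call now passes one pointer plus the per-square values. */
static inline void findPawnMovesCtx(const MoveGenCtx *ctx,
                                    int row, int col, int eval_check)
{
    findPawnMoves(ctx->b, ctx->moves_found, ctx->turn,
                  row, col, ctx->last_move, eval_check);
}

Inside the loop the PAWN branch would then read findPawnMovesCtx(&ctx, row, col, *eval_check_p), and the other piece types would get analogous wrappers. This also factors the loop-invariant values out of every call site, which is the same idea as hoisting invariants out of loops.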
Related
I am trying to profile a C program that uses some methods of openssl/libcrypto. Everything works well when I compile and run the code without profiling information, but when I add options to profile it with gprof, I get unexpected results from the profiling tool.
I did a lot of research but didn't find any page that solved my problem.
This is my code (named test.c):
#include <stdio.h>
#include <openssl/bn.h>
#include <openssl/rand.h>

static BIGNUM *x;
static BIGNUM *y;
static BIGNUM *z;
static BIGNUM *p;
static BN_CTX *tmp;
static unsigned int max_size;

int main(void)
{
    int max_bytes, res_gen;

    max_bytes = 50;
    tmp = BN_CTX_new();
    BN_CTX_init(tmp);
    x = BN_new();
    y = BN_new();
    z = BN_new();
    p = BN_new();

    RAND_load_file("/dev/urandom", max_bytes);

    max_size = 256;
    BN_rand(x, max_size, 0, 0);
    BN_rand(y, max_size, 0, 0);
    res_gen = BN_generate_prime_ex(p, max_size, 0, NULL, NULL, NULL);

    BN_mul(z, x, y, tmp);
    BN_nnmod(x, z, p, tmp);

    printf("\nOk\n");

    BN_free(x);
    BN_free(y);
    BN_free(z);
    BN_free(p);
    BN_CTX_free(tmp);
    return 0;
}
When I compile with profiling information using gcc -pg -static test.c -lcrypto -ldl and run the program, I get the following results: 0% (and 0 seconds) for everything, which is unexpected.
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00        1     0.00     0.00  main

Call graph

granularity: each sample hit covers 2 byte(s) no time propagated

index % time    self  children    called     name
                0.00    0.00       1/1         __libc_start_main [4282]
[1]      0.0    0.00    0.00       1         main [1]
                0.00    0.00       0/0           mcount (3495)
                0.00    0.00       0/0           BN_CTX_new [275]
                0.00    0.00       0/0           BN_CTX_init [274]
                0.00    0.00       0/0           BN_new [372]
                0.00    0.00       0/0           RAND_load_file [1636]
                0.00    0.00       0/0           BN_rand [386]
                0.00    0.00       0/0           BN_generate_prime_ex [331]
                0.00    0.00       0/0           BN_mul [370]
                0.00    0.00       0/0           BN_nnmod [378]
                0.00    0.00       0/0           puts [3696]
                0.00    0.00       0/0           BN_free [327]
                0.00    0.00       0/0           BN_CTX_free [272]
-----------------------------------------------
Also, it seems that the profiler detects only the main function, because details for the other functions don't appear in the flat profile or the call graph.
So I would like to know whether I must compile the OpenSSL library with some particular options (which ones?), or whether something else is required.
gprof is a CPU profiler, which means it cannot account for any time spent in blocking system calls like I/O, sleep, locks, etc.
Assuming the goal is to find speedups, what a lot of people do is get the program running under a debugger like GDB and take stack samples manually.
This isn't measuring with any precision; it's pinpointing problems. Anything you see the program doing that it could avoid, if you see it on more than one sample, will give you a substantial speedup when fixed. The fewer samples you have to take before seeing it twice, the bigger the speedup.
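If pausing under a debugger by hand is inconvenient, a very rough "poor man's sampler" can be sketched in plain C (a sketch only, assuming glibc's backtrace() family and POSIX signals/timers; build with something like gcc -g -rdynamic sampler.c so the symbol names are printable). It fires on a wall-clock timer, so time spent blocked in I/O or on locks shows up in the samples too:

#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() (glibc) */
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

/* Dump the current call stack to stderr each time the timer fires.
 * Frames that keep showing up across samples are where the time goes. */
static void dump_stack(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);              /* capture the current call stack */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    write(STDERR_FILENO, "----\n", 5);          /* separator between samples */
    (void)sig;
}

static void start_sampler(void)
{
    struct itimerval it;
    void *prime[1];

    backtrace(prime, 1);                        /* prime the unwinder outside the handler */

    it.it_interval.tv_sec  = 0;
    it.it_interval.tv_usec = 200000;            /* roughly 5 samples per second */
    it.it_value = it.it_interval;
    signal(SIGALRM, dump_stack);
    setitimer(ITIMER_REAL, &it, NULL);          /* wall clock, not CPU time */
}

int main(void)
{
    start_sampler();
    /* ... run the workload you want to examine here ... */
    for (volatile long i = 0; i < 2000000000L; i++)
        ;                                       /* placeholder workload */
    return 0;
}

The interpretation is the same as with manual samples: whatever appears on a large fraction of the dumped stacks is costing that fraction of the time.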
I'm having a performance problem when I run a benchmark program on a virtual machine with 1 pCPU and 4 vCPUs. I used mpstat to investigate what might be causing the problem and how, and it shows an increase in steal time, idle time, and sys time.
Here is the output of mpstat on my virtual machine when I run my program:
CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
all    3.14   0.00   8.00     0.00  0.00   0.00   58.29    0.00  30.57
  0    4.60   0.00   8.05     0.00  0.00   0.00   64.37    0.00  22.99
  1    2.27   0.00   5.68     0.00  0.00   0.00   53.41    0.00  38.64
  2    3.37   0.00   8.99     0.00  0.00   0.00   61.80    0.00  25.84
  3    3.45   0.00   8.05     0.00  0.00   0.00   54.02    0.00  34.48
And here is its output on the host machine (when I run the program):
CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
all   91.80   0.00   0.50     0.00  0.00   0.00    0.00    0.00   7.70
  0   91.00   0.00   0.50     0.00  0.00   0.00    0.00    0.00   8.50
  1   93.03   0.00   0.50     0.00  0.00   0.00    0.00    0.00   6.47
  2   90.59   0.00   0.50     0.00  0.00   0.00    0.00    0.00   8.91
  3   93.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00   7.00
I know that (from ibm.com):
Steal time is the percentage of time a virtual CPU waits for a real
CPU while the hypervisor is servicing another virtual processor.
I know this increase in steal time may be because my virtual machine is over-committed, but I wonder why I did not have this problem with the other programs I tested, and I don't know why the idle time increases as well.
The source code for my test program is quite large, so I posted it here on GitHub. I don't know whether it is important or not, but this code does some busy waiting. Here is the code where the busy waiting happens:
int parsec_barrier_wait(parsec_barrier_t *barrier) {
    int master;
    int rv;

    if (barrier == NULL) return EINVAL;

    rv = pthread_mutex_lock(&barrier->mutex);
    if (rv != 0) return rv;

    //A (necessarily) unsynchronized polling loop followed by fall-back blocking
    if (!barrier->is_arrival_phase) {
        pthread_mutex_unlock(&barrier->mutex);
        volatile spin_counter_t i = 0;
        while (!barrier->is_arrival_phase && i < SPIN_COUNTER_MAX) i++;
        while ((rv = pthread_mutex_trylock(&barrier->mutex)) == EBUSY);
        if (rv != 0) return rv;
        //Fall back to normal waiting on condition variable if necessary
        while (!barrier->is_arrival_phase) {
            rv = pthread_cond_wait(&barrier->cond, &barrier->mutex);
            if (rv != 0) {
                pthread_mutex_unlock(&barrier->mutex);
                return rv;
            }
        }
    }

    //We are guaranteed to be in an arrival phase, proceed with barrier synchronization
    master = (barrier->n == 0); //Make first thread at barrier the master
    barrier->n++;

    if (barrier->n >= barrier->max) {
        //This is the last thread to arrive, don't wait instead
        //start a new departure phase and wake up all other threads
        barrier->is_arrival_phase = 0;
        pthread_cond_broadcast(&barrier->cond);
    } else {
        //wait for last thread to arrive (which will end arrival phase)
        //we use again an unsynchronized polling loop followed by synchronized fall-back blocking
        if (barrier->is_arrival_phase) {
            pthread_mutex_unlock(&barrier->mutex);
            volatile spin_counter_t i = 0;
            while (barrier->is_arrival_phase && i < SPIN_COUNTER_MAX) i++;
            while ((rv = pthread_mutex_trylock(&barrier->mutex)) == EBUSY);
            if (rv != 0) return rv;
            //Fall back to normal waiting on condition variable if necessary
            while (barrier->is_arrival_phase) {
                rv = pthread_cond_wait(&barrier->cond, &barrier->mutex);
                if (rv != 0) {
                    pthread_mutex_unlock(&barrier->mutex);
                    return rv;
                }
            }
        }
    }

    barrier->n--;

    //last thread to leave barrier starts a new arrival phase
    if (barrier->n == 0) {
        barrier->is_arrival_phase = 1;
        pthread_cond_broadcast(&barrier->cond);
    }

    pthread_mutex_unlock(&barrier->mutex);

    return (master ? PARSEC_BARRIER_SERIAL_THREAD : 0);
}
Any ideas would be very much appreciated.
I am writing a file system for one of my classes. This function is killing my performance by a LARGE margin, and I can't figure out why. I've been staring at this code for way too long and am probably missing something very obvious. Does anyone see why this function is so slow?
int getFreeDataBlock(struct disk *d, unsigned int dataBlockNumber)
{
    if (d == NULL)
    {
        fprintf(stderr, "Invalid disk pointer to getFreeDataBlock()\n");
        errorCheck();
        return -1;
    }

    // Allocate a buffer
    char *buffer = (char *) malloc(d->blockSize * sizeof(char));
    if (buffer == NULL)
    {
        fprintf(stderr, "Out of memory.\n");
        errorCheck();
        return -1;
    }

    do {
        // Read a block from the disk
        diskread(d, buffer, dataBlockNumber);

        // Cast to appropriate struct
        struct listDataBlock *block = (struct listDataBlock *) buffer;

        unsigned int i;
        for (i = 0; i < DATABLOCK_FREE_SLOT_LENGTH; ++i)
        {
            // We are in the last datalisting block...and out of slots...break
            if (block->listOfFreeBlocks[i] == -2)
            {
                break;
            }

            if (block->listOfFreeBlocks[i] != -1)
            {
                int returnValue = block->listOfFreeBlocks[i];
                // MARK THIS AS USED NOW
                block->listOfFreeBlocks[i] = -1;
                diskwriteNoSync(d, buffer, dataBlockNumber);

                // No memory leaks
                free(buffer);
                return returnValue;
            }
        }

        // Ok, nothing in this data block, move to next
        dataBlockNumber = block->nextDataBlock;
    } while (dataBlockNumber != -1);

    // Nope, didn't find any...disk must be full
    free(buffer);
    fprintf(stderr, "DISK IS FULL\n");
    errorCheck();
    return -1;
}
As you can see from the gprof output, neither diskread() nor diskwriteNoSync() is taking an extensive amount of time.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
 99.45     12.25    12.25      2051     5.97     5.99  getFreeDataBlock
  0.24     12.28     0.03   2220903     0.00     0.00  diskread
  0.24     12.31     0.03                              threadFunc
  0.08     12.32     0.01      2048     0.00     6.00  writeHelper
  0.00     12.32     0.00      6154     0.00     0.00  diskwriteNoSync
  0.00     12.32     0.00      2053     0.00     0.00  validatePath
Or am I not understanding the output properly?
Thanks for any help.
The fact that you've been staring at this code and puzzling over the gprof output puts you in good company, because gprof and the concepts that are taught with it only work with little academic-scale programs doing no I/O. Here's the method I use.
Some excerpts from a useful post that got deleted, giving some MYTHS about profiling:
that program counter sampling is useful.
It is only useful if you have an unnecessary hotspot bottleneck such as a bubble sort of a big array of scalar values. As soon as you, for example, change it into a sort using string-compare, it is still a bottleneck, but program counter sampling will not see it because now the hotspot is in string-compare. On the other hand if it were to sample the extended program counter (the call stack), the point at which the string-compare is called, the sort loop, is clearly displayed. In fact, gprof was an attempt to remedy the limitations of pc-only sampling.
that samples need not be taken when blocked
The reasons for this myth are twofold: 1) that PC sampling is meaningless when the program is waiting, and 2) the preoccupation with accuracy of timing. However, for (1) the program may very well be waiting for something that it asked for, such as file I/O, which you need to know, and which stack samples reveal. (Obviously you want to exclude samples while waiting for user input.) For (2) if the program is waiting simply because of competition with other processes, that presumably happens in a fairly random way while it's running.
So while the program may be taking longer, that will not have a large effect on the statistic that matters, the percentage of time that statements are on the stack.
that counting of statement or function invocations is useful.
Suppose you know a function has been called 1000 times. Can you tell from that what fraction of time it costs? You also need to know how long it takes to run, on average, multiply it by the count, and divide by the total time. The average invocation time could vary from nanoseconds to seconds, so the count alone doesn't tell much. If there are stack samples, the cost of a routine or of any statement is just the fraction of samples it is on. That fraction of time is what could in principle be saved overall if the routine or statement could be made to take no time, so that is what has the most direct relationship to performance.
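To put hypothetical numbers on that last point:

1,000 calls × 2 µs/call  = 0.002 s   (invisible in a 10-second run)
1,000 calls × 20 ms/call = 20 s      (the entire problem)

The call count is identical in both cases, which is why the count by itself tells you almost nothing, while the fraction of stack samples containing the routine gives its cost directly.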
There are more where those came from.
I want to store event times.
I found these two functions, but I don't know which one is faster.
#include <sys/time.h>   /* gettimeofday() */
#include <time.h>       /* clock_gettime() */

int main()
{
    struct timespec tp;
    struct timeval  tv;
    int i = 0;
    int j = 0;

    for (i = 0; i < 500000; i++)
    {
        gettimeofday(&tv, NULL);
        j += tv.tv_usec % 2;
        clock_gettime(CLOCK_HIGHRES, &tp);  /* CLOCK_HIGHRES is Solaris-specific */
        j += tp.tv_sec % 2;
    }
    return 0;
}
%Time  Seconds  Cumsecs   #Calls   msec/call  Name
 68.3     0.28     0.28   500000      0.0006  __clock_gettime
 22.0     0.09     0.37   500000      0.0002  _gettimeofday
  7.3     0.03     0.40  1000009      0.0000  _mcount
  2.4     0.01     0.41        1     10.      main
  0.0     0.00     0.41        4      0.      atexit
  0.0     0.00     0.41        1      0.      _exithandle
  0.0     0.00     0.41        1      0.      _fpsetsticky
  0.0     0.00     0.41        1      0.      _profil
  0.0     0.00     0.41        1      0.      exit
This ran on a Solaris 9 V440 box. Profile the code on your own machine; this result is completely different from the result quoted in a link supplied earlier. In other words, if there is a difference, it will be due to the implementation.
It will depend on your system; there's no standard for which is faster. My guess would be they're usually both about the same speed. gettimeofday is more traditional and probably a good bet if you want your code to be portable to older systems. clock_gettime has more features.
You don't care.
If you're calling either of them often enough that it matters, You're Doing It Wrong.
Profile the code which is causing specific performance problems and check how much time it spends there. If it's too much, consider refactoring it to call the function less often.
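As a sketch of what "call it less often" can look like (hypothetical event-loop code, not taken from the question): read the clock once per batch of events and stamp the whole batch with that value, instead of reading it once per event.

#include <stdio.h>
#include <sys/time.h>

#define BATCH 64

/* Hypothetical event record; the question only needs a time per event. */
struct event {
    struct timeval when;
    int payload;
};

int main(void)
{
    struct event events[BATCH];
    struct timeval now;

    /* One gettimeofday() per batch instead of one per event. Whether the
     * loss of per-event precision is acceptable depends on the application. */
    gettimeofday(&now, NULL);
    for (int i = 0; i < BATCH; i++) {
        events[i].when = now;
        events[i].payload = i;
    }

    printf("stamped %d events at %ld.%06ld\n",
           BATCH, (long)now.tv_sec, (long)now.tv_usec);
    return 0;
}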
It depends on the constant used in clock_gettime(). For the fastest clocks there are the CLOCK_*_COARSE constants; these timers are the fastest but are not precise.
gettimeofday() should return the same as clock_gettime(CLOCK_REALTIME).
Also, benchmark results depend on the architecture (on Linux), since Linux has a special mechanism (the vDSO) to eliminate the syscall when reading the time. This mechanism does not work on 32-bit x86; see the strace output.
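A small sketch of the trade-off described above (the _COARSE clock IDs are Linux-specific and guarded with #ifdef here; older glibc may also need -lrt): the coarse clocks report a much larger resolution via clock_getres() in exchange for a cheaper read.

#include <stdio.h>
#include <time.h>

static void show(const char *name, clockid_t id)
{
    struct timespec res, now;
    if (clock_getres(id, &res) == 0 && clock_gettime(id, &now) == 0)
        printf("%-24s res = %ld.%09ld s, now = %ld.%09ld\n",
               name, (long)res.tv_sec, (long)res.tv_nsec,
               (long)now.tv_sec, (long)now.tv_nsec);
    else
        perror(name);
}

int main(void)
{
    show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
#ifdef CLOCK_MONOTONIC_COARSE          /* Linux-specific */
    show("CLOCK_MONOTONIC_COARSE", CLOCK_MONOTONIC_COARSE);
#endif
    show("CLOCK_REALTIME", CLOCK_REALTIME);
#ifdef CLOCK_REALTIME_COARSE           /* Linux-specific */
    show("CLOCK_REALTIME_COARSE", CLOCK_REALTIME_COARSE);
#endif
    return 0;
}

On a typical Linux box the coarse clocks report a resolution around the timer tick (milliseconds) while the regular ones report nanoseconds; whether the cheaper read matters in practice is exactly what the benchmark in the question measures.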
Have a look at this discussion. It seems to be of good quality and very closely related to what you are looking for, though it is a little dated.
When I run gprof on my C program, it says no time accumulated for my program and shows 0 time for all function calls. However, it does count the function calls.
How do I modify my program so that gprof will be able to measure how much time things take to run?
Did you specify -pg when compiling?
http://sourceware.org/binutils/docs-2.20/gprof/Compiling.html#Compiling
Once it is compiled, you run the program and then run gprof on the binary.
E.g.:
test.c:
#include <stdio.h>

int main ()
{
    int i;
    for (i = 0; i < 10000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Compile with cc -pg test.c, run a.out, then run gprof a.out, which gives me:
granularity: each sample hit covers 4 byte(s) for 1.47% of 0.03 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 45.6      0.02      0.02    10000     0.00     0.00  __sys_write [10]
 45.6      0.03      0.02        0  100.00%           .mcount (26)
  2.9      0.03      0.00    20000     0.00     0.00  __sfvwrite [6]
  1.5      0.03      0.00    20000     0.00     0.00  memchr [11]
  1.5      0.03      0.00    10000     0.00     0.00  __ultoa [12]
  1.5      0.03      0.00    10000     0.00     0.00  _swrite [9]
  1.5      0.03      0.00    10000     0.00     0.00  vfprintf [2]
What are you getting?
I tried running Kinopiko's example, except I increased the number of iterations by a factor of 100.
test.c:
#include <stdio.h>

int main ()
{
    int i;
    for (i = 0; i < 1000000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Then I took 10 stackshots (under VC, but you can use pstack). Here are the stackshots:
9 copies of this stack:
NTDLL! 7c90e514()
KERNEL32! 7c81cbfe()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
1 copy of this stack:
KERNEL32! 7c81cb96()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
In case it isn't obvious, this tells you that:
mainCRTStartup() line 206 + 25 bytes Cost ~100% of the time
main() line 7 + 14 bytes Cost ~100% of the time
printf() line 62 + 14 bytes Cost ~100% of the time
_ftbuf() line 171 + 9 bytes Cost ~100% of the time
_flush() line 162 + 23 bytes Cost ~100% of the time
_write() line 168 + 57 bytes Cost ~100% of the time
In a nutshell, the program spends ~100% of its time flushing the output buffer to disk (or the console) as part of the printf on line 7.
(What I mean by the "cost of a line" is the fraction of total time spent at the request of that line, which is roughly the fraction of samples that contain it.
If that line could be made to take no time, such as by removing it, skipping over it, or passing its work off to an infinitely fast coprocessor, that time fraction is how much the total time would shrink. So if the execution of any of these lines of code could be avoided, time would shrink by somewhere in the range of 95% to 100%. If you were to ask "What about recursion?", the answer is It Makes No Difference.)
Now, maybe you want to know something else, like how much time is spent in the loop, for example. To find that out, remove the printf because it's hogging all the time. Maybe you want to know what % of time is spent purely in CPU time, not in system calls. To get that, just throw away any stackshots that don't end in your code.
The point I'm trying to make is if you're looking for things you can fix to make the code run faster, the data gprof gives you, even if you understand it, is almost useless. By comparison, if there is some of your code that is causing more wall-clock time to be spent than you would like, stackshots will pinpoint it.
One gotcha with gprof: it doesn't work with code in dynamically-linked libraries. For that, you need to use sprof. See this answer: gprof : How to generate call graph for functions in shared library that is linked to main program
First compile your application with -g, and check which CPU counters you are using.
If your application runs very quickly, the profiler could simply miss all the events, or collect fewer than required (try reducing the number of events to read).
Actually, profiling should show you CPU_CLK_UNHALTED or INST_RETIRED events without any special switches. But with such data you will only be able to say how well your code is performing overall: INST_RETIRED / CPU_CLK_UNHALTED (instructions per cycle).
Try the Intel VTune profiler; it's free for 30 days and for educational use.