Which one is faster, gettimeofday or clock_gettime? - c

I want to store event times.
I found these two functions, but I don't know which one is faster.

#include <sys/time.h>   /* gettimeofday() */
#include <time.h>       /* clock_gettime() */

int main()
{
    struct timespec tp;
    struct timeval tv;
    int i = 0;
    int j = 0;

    for (i = 0; i < 500000; i++)
    {
        gettimeofday(&tv, NULL);
        j += tv.tv_usec % 2;
        clock_gettime(CLOCK_HIGHRES, &tp);  /* CLOCK_HIGHRES is Solaris-specific */
        j += tp.tv_sec % 2;
    }
    return 0;
}
%Time Seconds Cumsecs #Calls msec/call Name
68.3 0.28 0.28 500000 0.0006 __clock_gettime
22.0 0.09 0.37 500000 0.0002 _gettimeofday
7.3 0.03 0.40 1000009 0.0000 _mcount
2.4 0.01 0.41 1 10. main
0.0 0.00 0.41 4 0. atexit
0.0 0.00 0.41 1 0. _exithandle
0.0 0.00 0.41 1 0. _fpsetsticky
0.0 0.00 0.41 1 0. _profil
0.0 0.00 0.41 1 0. exit
This ran on a Solaris 9 V440 box. Profile the code on your own machine: this result is completely different from the result quoted in a link supplied earlier. In other words, if there is a difference, it will be due to the implementation.

It will depend on your system; there's no standard for which is faster. My guess would be they're usually both about the same speed. gettimeofday is more traditional and probably a good bet if you want your code to be portable to older systems. clock_gettime has more features.
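For a sense of what that feature difference looks like in code, here is a minimal sketch (not from the original post) that reads both clocks; note that clock_gettime() also offers CLOCK_MONOTONIC, which has no gettimeofday() equivalent:

#include <stdio.h>
#include <sys/time.h>   /* gettimeofday(), struct timeval  */
#include <time.h>       /* clock_gettime(), struct timespec */

int main(void)
{
    struct timeval  tv;
    struct timespec rt, mono;

    gettimeofday(&tv, NULL);               /* wall clock, microsecond fields */
    clock_gettime(CLOCK_REALTIME, &rt);    /* same wall clock, nanosecond fields */
    clock_gettime(CLOCK_MONOTONIC, &mono); /* monotonic clock: no gettimeofday() equivalent */

    printf("gettimeofday:    %ld.%06ld s\n", (long)tv.tv_sec, (long)tv.tv_usec);
    printf("CLOCK_REALTIME:  %ld.%09ld s\n", (long)rt.tv_sec, rt.tv_nsec);
    printf("CLOCK_MONOTONIC: %ld.%09ld s\n", (long)mono.tv_sec, mono.tv_nsec);
    return 0;
}

On older glibc you may need to link with -lrt for clock_gettime.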

You don't care.
If you're calling either of them often enough that it matters, You're Doing It Wrong.
Profile the code that is causing specific performance problems and check how much time it spends there. If it's too much, consider refactoring it to call the function less often.
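As a purely hypothetical sketch of "call it less often" (the struct and function names here are made up for illustration), you can timestamp a whole batch of events with a single clock read instead of one read per event:

#include <stddef.h>
#include <sys/time.h>

/* Hypothetical event record used only for illustration. */
struct event {
    struct timeval when;
    int payload;
};

/* Stamp a whole batch with a single gettimeofday() call instead of one per event. */
void record_batch(struct event *events, size_t batch_size)
{
    struct timeval now;
    gettimeofday(&now, NULL);      /* one clock read for the whole batch */

    for (size_t i = 0; i < batch_size; i++)
        events[i].when = now;      /* every event in the batch shares the timestamp */
}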

It depends on the constant you pass to clock_gettime(). The fastest clocks are the CLOCK_*_COARSE ones; they are the cheapest to read, but the least precise.
gettimeofday() should return the same value as clock_gettime(CLOCK_REALTIME).
Also, benchmark results depend on the architecture (on Linux), because Linux has a special mechanism (the vDSO) to eliminate the syscall needed to read the time, and it does not work on the x86 32-bit architecture. See the strace output.
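As a hedged, Linux-specific sketch of this point, the loop below compares the cost of CLOCK_REALTIME with CLOCK_REALTIME_COARSE; the exact numbers will depend on your kernel, architecture, and whether the vDSO fast path is used:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

static double elapsed_sec(clockid_t id, long iterations)
{
    struct timespec start, end, scratch;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        clock_gettime(id, &scratch);        /* the call being measured */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    const long n = 10000000;
    printf("CLOCK_REALTIME:        %.3f s for %ld calls\n",
           elapsed_sec(CLOCK_REALTIME, n), n);
#ifdef CLOCK_REALTIME_COARSE                 /* Linux-specific constant */
    printf("CLOCK_REALTIME_COARSE: %.3f s for %ld calls\n",
           elapsed_sec(CLOCK_REALTIME_COARSE, n), n);
#endif
    return 0;
}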

Have a look at this discussion. It seems of good quality and very closely related to what you might be looking for though a little dated.

Related

Unable to profile libcrypto methods using gprof

I am trying to profile a C program that uses some methods from openssl/libcrypto. Everything works well when I compile and run the code without profiling information, but when I add options to profile it with gprof, I get unexpected results from the profiling tool.
I did a lot of research but didn't find any page that solved my problem.
This is my code (named test.c):
#include <stdio.h>
#include <openssl/bn.h>
#include <openssl/rand.h>

static BIGNUM *x;
static BIGNUM *y;
static BIGNUM *z;
static BIGNUM *p;
static BN_CTX *tmp;
static unsigned int max_size;

int main(void){
    int max_bytes, res_gen;
    max_bytes = 50;
    tmp = BN_CTX_new();
    BN_CTX_init(tmp);
    x = BN_new();
    y = BN_new();
    z = BN_new();
    p = BN_new();
    RAND_load_file("/dev/urandom", max_bytes);
    max_size = 256;
    BN_rand(x, max_size, 0, 0);
    BN_rand(y, max_size, 0, 0);
    res_gen = BN_generate_prime_ex(p, max_size, 0, NULL, NULL, NULL);
    BN_mul(z, x, y, tmp);
    BN_nnmod(x, z, p, tmp);
    printf("\nOk\n");
    BN_free(x);
    BN_free(y);
    BN_free(z);
    BN_free(p);
    BN_CTX_free(tmp);
    return 0;
}
When I compile with profiling information using gcc -pg -static test.c -lcrypto -ldl, it produces the following results: I get 0% (and 0 seconds) for everything, which is unexpected.
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 1 0.00 0.00 main
Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time self children called name
0.00 0.00 1/1 __libc_start_main [4282]
[1] 0.0 0.00 0.00 1 main [1]
0.00 0.00 0/0 mcount (3495)
0.00 0.00 0/0 BN_CTX_new [275]
0.00 0.00 0/0 BN_CTX_init [274]
0.00 0.00 0/0 BN_new [372]
0.00 0.00 0/0 RAND_load_file [1636]
0.00 0.00 0/0 BN_rand [386]
0.00 0.00 0/0 BN_generate_prime_ex [331]
0.00 0.00 0/0 BN_mul [370]
0.00 0.00 0/0 BN_nnmod [378]
0.00 0.00 0/0 puts [3696]
0.00 0.00 0/0 BN_free [327]
0.00 0.00 0/0 BN_CTX_free [272]
-----------------------------------------------
Also, it seems that the profiler detects only the main function, because details for the other methods don't appear in the flat profile or the call graph.
So I would like to know whether I must compile the OpenSSL library with some options (which options?), or do something else.
gprof is a CPU-profiler. That means it cannot account for any time spent in blocking system calls like I/O, sleep, locks, etc.
Assuming the goal is to find speedups,
what a lot of people do is get it running under a debugger like GDB and take stack samples manually.
It's not measuring (with any precision). It's pinpointing problems. Anything you see it doing that you could avoid, if you see it on more than one sample, will give you a substantial speedup. The fewer samples you have to take before seeing it twice, the bigger the speedup.

Calling with Arguments versus using Globals in C

I have a decent understanding of x86 assembly and I know that when a function is called, all the arguments are pushed onto the stack.
I have a function which basically loops through an 8 by 8 array and calls some functions based on the values in the array. Each of these function calls involves 6-10 arguments being passed. This program (a chess AI) takes a very long time to run, and this function accounts for 20% of the running time.
So I guess my question is: what can I do to give my functions access to the variables they need in a faster way?
int row, col, i;
determineCheckValidations(eval_check, b, turn);
int *eval_check_p = &(eval_check[0][0]);

for (row = 0; row < 8; row++){
    for (col = 0; col < 8; col++, eval_check_p++){
        if (b->colors[row][col] == turn){
            int type = b->types[row][col];
            if (type == PAWN)
                findPawnMoves(b, moves_found, turn, row, col, last_move, *eval_check_p);
            else if (type == KNIGHT)
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_knight, 8, *eval_check_p);
            else if (type == BISHOP)
                findMappedIters(b, moves_found, turn, row, col, *move_map_bishop, 4, *eval_check_p);
            else if (type == ROOK)
                findMappedIters(b, moves_found, turn, row, col, *move_map_rook, 4, *eval_check_p);
            else if (type == QUEEN)
                findMappedIters(b, moves_found, turn, row, col, *move_map_queen, 8, *eval_check_p);
            else if (type == KING){
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_king, 8, *eval_check_p);
                findCastles(b, moves_found, turn, row, col);
            }
        }
    }
}
All the code can be found at https://github.com/AndyGrant/JChess/tree/master/_Core/_Scripts
A sample of the profile:
% cumulative self self total
time seconds seconds calls s/call s/call name
20.00 1.55 1.55 2071328 0.00 0.00 findAllValidMoves
14.84 2.70 1.15 10418354 0.00 0.00 checkMove
10.06 3.48 0.78 1669701 0.00 0.00 encodeBoard
7.23 4.04 0.56 10132526 0.00 0.00 findMappedIters
6.84 4.57 0.53 1669701 0.00 0.00 getElement
6.71 5.09 0.52 68112169 0.00 0.00 createNormalMove
You have done good work on profiling. Now take the worst-performing function and profile it in more detail.
You may want to try different compiler optimization settings when you profile.
Try some common optimization techniques, such as loop unrolling and factoring invariants out of loops (see the sketch below).
You may get some improvements by designing your functions with the processor's data cache in mind. Search the web for "optimizing data cache".
If the function works correctly, I recommend posting it to CodeReview.StackExchange.com.
Don't assume anything.
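For example, here is a hypothetical, self-contained sketch of factoring invariants out of the nested loop. It assumes the board is stored as plain 8x8 int arrays; that layout is an assumption for illustration only, not taken from the JChess sources:

/* Hypothetical board layout, for illustration only. */
struct board {
    int colors[8][8];
    int types[8][8];
};

static int count_pieces(const struct board *b, int turn)
{
    int row, col, count = 0;

    for (row = 0; row < 8; row++) {
        /* Invariant for the inner loop: hoist the row base pointers out of it,
         * so the inner loop does single-dimension indexing only. */
        const int *color_row = b->colors[row];
        const int *type_row  = b->types[row];

        for (col = 0; col < 8; col++) {
            if (color_row[col] != turn)
                continue;                /* early skip avoids re-indexing b->colors */
            count += (type_row[col] != 0);
        }
    }
    return count;
}

Whether this actually helps depends on the compiler; at higher optimization levels it may do the hoisting for you, which is why profiling with different settings matters.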

Function going too slowly... I can't see why

I am writing a file system for one of my classes. This function is killing my performance by a LARGE margin and I can't figure out why. I've been staring at this code for way too long and I am probably missing something very obvious. Does anyone see why this function is so slow?
int getFreeDataBlock(struct disk *d, unsigned int dataBlockNumber)
{
    if (d == NULL)
    {
        fprintf(stderr, "Invalid disk pointer to getFreeDataBlock()\n");
        errorCheck();
        return -1;
    }

    // Allocate a buffer
    char *buffer = (char *) malloc(d->blockSize * sizeof(char));
    if (buffer == NULL)
    {
        fprintf(stderr, "Out of memory.\n");
        errorCheck();
        return -1;
    }

    do {
        // Read a block from the disk
        diskread(d, buffer, dataBlockNumber);

        // Cast to appropriate struct
        struct listDataBlock *block = (struct listDataBlock *) buffer;

        unsigned int i;
        for (i = 0; i < DATABLOCK_FREE_SLOT_LENGTH; ++i)
        {
            // We are in the last datalisting block...and out of slots...break
            if (block->listOfFreeBlocks[i] == -2)
            {
                break;
            }

            if (block->listOfFreeBlocks[i] != -1)
            {
                int returnValue = block->listOfFreeBlocks[i];
                // MARK THIS AS USED NOW
                block->listOfFreeBlocks[i] = -1;
                diskwriteNoSync(d, buffer, dataBlockNumber);

                // No memory leaks
                free(buffer);
                return returnValue;
            }
        }

        // Ok, nothing in this data block, move to next
        dataBlockNumber = block->nextDataBlock;
    } while (dataBlockNumber != -1);

    // Nope, didn't find any...disk must be full
    free(buffer);
    fprintf(stderr, "DISK IS FULL\n");
    errorCheck();
    return -1;
}
As you can see from the gprof output, neither diskread() nor diskwriteNoSync() is taking an extensive amount of time.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
99.45 12.25 12.25 2051 5.97 5.99 getFreeDataBlock
0.24 12.28 0.03 2220903 0.00 0.00 diskread
0.24 12.31 0.03 threadFunc
0.08 12.32 0.01 2048 0.00 6.00 writeHelper
0.00 12.32 0.00 6154 0.00 0.00 diskwriteNoSync
0.00 12.32 0.00 2053 0.00 0.00 validatePath
or am I not understanding the output properly?
Thanks for any help.
The fact that you've been staring at this code and puzzling over the gprof output puts you in good company, because gprof and the concepts that are taught with it only work with little academic-scale programs doing no I/O. Here's the method I use.
Some excerpts from a useful post that got deleted, giving some MYTHS about profiling:
that program counter sampling is useful.
It is only useful if you have an unnecessary hotspot bottleneck such as a bubble sort of a big array of scalar values. As soon as you, for example, change it into a sort using string-compare, it is still a bottleneck, but program counter sampling will not see it because now the hotspot is in string-compare. On the other hand if it were to sample the extended program counter (the call stack), the point at which the string-compare is called, the sort loop, is clearly displayed. In fact, gprof was an attempt to remedy the limitations of pc-only sampling.
that samples need not be taken when blocked
The reasons for this myth are twofold: 1) that PC sampling is meaningless when the program is waiting, and 2) the preoccupation with accuracy of timing. However, for (1) the program may very well be waiting for something that it asked for, such as file I/O, which you need to know, and which stack samples reveal. (Obviously you want to exclude samples while waiting for user input.) For (2) if the program is waiting simply because of competition with other processes, that presumably happens in a fairly random way while it's running.
So while the program may be taking longer, that will not have a large effect on the statistic that matters, the percentage of time that statements are on the stack.
that counting of statement or function invocations is useful.
Suppose you know a function has been called 1000 times. Can you tell from that what fraction of time it costs? You also need to know how long it takes to run, on average, multiply it by the count, and divide by the total time. The average invocation time could vary from nanoseconds to seconds, so the count alone doesn't tell much. If there are stack samples, the cost of a routine or of any statement is just the fraction of samples it is on. That fraction of time is what could in principle be saved overall if the routine or statement could be made to take no time, so that is what has the most direct relationship to performance.
There are more where those came from.

Trying to get the time of an operation and receiving time 0 seconds

I am trying to see how long it takes for about 10000 names to be inserted into a BST (written in C).
I am reading these names from a txt file using fscanf. I have declared a file pointer (fp) in the main function and call a function in another .c file, passing fp through its arguments. I want to measure the time needed for 2, 4, 8, 16, 32, ..., 8192 names to be inserted, saving the times in a long double array. I have included the time.h library in the .c file where the function is located.
Code:
void myfunct(BulkTreePtr *Bulktree, FILE *fp, long double time[])
{
    double tstart, tend, ttemp;
    TStoixeioyTree datainput;
    int error = 0, counter = 0, index = 0, num = 2, i;

    tstart = ((double) clock()) / CLOCKS_PER_SEC;
    while (!feof(fp))
    {
        counter++;
        fscanf(fp, "%s %s", datainput.lname, datainput.fname);
        Tree_input(&((*Bulktree)->TreeRoot), datainput, &error);
        if (counter == num)
        {
            ttemp = (double) clock() / CLOCKS_PER_SEC;
            time[index] = ttemp - tstart;
            num = num * 2;
            index++;
        }
    }
    tend = ((double) clock()) / CLOCKS_PER_SEC;
    printf("Last value of ttemp is %f\n", ttemp - tstart);
    time[index] = (tend - tstart);

    num = 2;
    for (i = 0; i < 14; i++)
    {
        printf("Time after %d names is %f sec \n", num, (float) time[i]);
        num = num * 2;
    }
}
I am getting this:
Last value of ttemp is 0.000000
Time after 2 names is 0.000000 sec
Time after 4 names is 0.000000 sec
Time after 8 names is 0.000000 sec
Time after 16 names is 0.000000 sec
Time after 32 names is 0.000000 sec
Time after 64 names is 0.000000 sec
Time after 128 names is 0.000000 sec
Time after 256 names is 0.000000 sec
Time after 512 names is 0.000000 sec
Time after 1024 names is 0.000000 sec
Time after 2048 names is 0.000000 sec
Time after 4096 names is 0.000000 sec
Time after 8192 names is 0.000000 sec
Time after 16384 names is 0.010000 sec
What am I doing wrong? :S
Use clock_getres() and clock_gettime(). Most likely you will find that your system doesn't have a very fine-grained clock. Note that the system might return different numbers from gettimeofday() and clock_gettime(), but often (depending on the kernel) the digits beyond HZ resolution are lies generated to simulate time advancing.
You might find it better to do fixed-time tests: find out how many inserts you can do in 10 seconds. Or have some kind of fast reset method (memset?) and find out how many groups of 1024 insertions you can do in 10 seconds.
[EDIT]
Traditionally, the kernel is interrupted by the hardware at HZ frequency. Only when it gets this hardware interrupt does it know that time has advanced by 1/HZ of a second. The traditional value of HZ was 100, i.e. a tick of 1/100 of a second. Surprise, surprise: you saw a 1/100th-of-a-second increment in time. Some systems and kernels have recently started providing other methods of getting higher-resolution time, such as looking at the RTC device.
However, you should use the clock_gettime() function I pointed you to, together with clock_getres(), to find out how often you will get accurate time updates. Make sure your test runs across many, many multiples of clock_getres()'s value, unless you want the result to be a total lie.
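A minimal sketch of that check, using only standard POSIX calls (on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

static void show(const char *name, clockid_t id)
{
    struct timespec res, now;

    if (clock_getres(id, &res) == 0 && clock_gettime(id, &now) == 0)
        printf("%-16s resolution %ld.%09ld s, now %ld.%09ld\n",
               name, (long)res.tv_sec, res.tv_nsec,
               (long)now.tv_sec, now.tv_nsec);
    else
        perror(name);
}

int main(void)
{
    show("CLOCK_REALTIME", CLOCK_REALTIME);
    show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
    return 0;
}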
clock() returns the number of "ticks"; there are CLOCKS_PER_SEC ticks per second. For any operation which takes less than 1/CLOCKS_PER_SEC seconds, the return value of clock() will either be unchanged or changed by 1 tick.
From your results it looks like even 16384 insertions take no more than 1/100 seconds.
If you want to know how long a certain number of insertions take, try repeating them many, many times so that the total number of ticks is significant, and then divide that total time by the number of times they were repeated.
clock() returns the amount of CPU time used, not the amount of real time elapsed, but that might be what you want here. Note that the Unix standard requires CLOCKS_PER_SEC to be exactly one million (1000000), but the resolution can be much worse (e.g. it might jump by 10000 at a time). You should use clock_gettime() with the CPU-time clock if you want to measure CPU time spent, or with the monotonic clock to measure real time spent.
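Putting the repeat-many-times advice and the clock_gettime() suggestion together, here is a hedged sketch; do_work() is a made-up stand-in for a batch of your BST insertions:

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the work being measured (e.g. a batch of BST inserts). */
static void do_work(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000; i++)
        sink += i;
}

static double diff_sec(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    const int repeats = 1000;
    struct timespec cpu0, cpu1, mono0, mono1;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_MONOTONIC, &mono0);

    for (int i = 0; i < repeats; i++)
        do_work();

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &mono1);

    printf("cpu time per call:  %.9f s\n", diff_sec(cpu0, cpu1) / repeats);
    printf("real time per call: %.9f s\n", diff_sec(mono0, mono1) / repeats);
    return 0;
}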
ImageMagick includes stopwatch functions such as these.
#include "magick/MagickCore.h"
#include "magick/timer.h"
TimerInfo *timer_info;
timer_info = AcquireTimerInfo();
<your code>
printf("elapsed=%.6f sec", GetElapsedTime(timer_info));
But that only seems to have a resolution of 1/100 second. Plus it requires installing ImageMagick. I suggest this instead. It's simple and has usec resolution on Linux.
#include <sys/time.h>   /* gettimeofday() */

double timer(double start_secs)
{
    static struct timeval tv;
    static struct timezone tz;
    gettimeofday(&tv, &tz);
    double now_secs = (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
    return now_secs - start_secs;
}

double t1 = timer(0);
<your code>
printf("elapsed=%.6f sec", timer(t1));

How to modify a C program so that gprof can profile it?

When I run gprof on my C program it says no time accumulated for my program and shows 0 time for all function calls. However it does count the function calls.
How do I modify my program so that gprof will be able to count how much time something takes to run?
Did you specify -pg when compiling?
http://sourceware.org/binutils/docs-2.20/gprof/Compiling.html#Compiling
Once it is compiled, you run the program and then run gprof on the binary.
E.g.:
test.c:
#include <stdio.h>

int main ()
{
    int i;
    for (i = 0; i < 10000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Compile with cc -pg test.c, run ./a.out, then run gprof a.out; that gives me
granularity: each sample hit covers 4 byte(s) for 1.47% of 0.03 seconds
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.6 0.02 0.02 10000 0.00 0.00 __sys_write [10]
45.6 0.03 0.02 0 100.00% .mcount (26)
2.9 0.03 0.00 20000 0.00 0.00 __sfvwrite [6]
1.5 0.03 0.00 20000 0.00 0.00 memchr [11]
1.5 0.03 0.00 10000 0.00 0.00 __ultoa [12]
1.5 0.03 0.00 10000 0.00 0.00 _swrite [9]
1.5 0.03 0.00 10000 0.00 0.00 vfprintf [2]
What are you getting?
I tried running Kinopiko's example, except I increased the number of iterations by a factor of 100.
test.c:
#include <stdio.h>

int main ()
{
    int i;
    for (i = 0; i < 1000000; i++) {
        printf ("%d\n", i);
    }
    return 0;
}
Then I took 10 stackshots (under VC, but you can use pstack). Here are the stackshots:
9 copies of this stack:
NTDLL! 7c90e514()
KERNEL32! 7c81cbfe()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
1 copy of this stack:
KERNEL32! 7c81cb96()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
In case it isn't obvious, this tells you that:
mainCRTStartup() line 206 + 25 bytes Cost ~100% of the time
main() line 7 + 14 bytes Cost ~100% of the time
printf() line 62 + 14 bytes Cost ~100% of the time
_ftbuf() line 171 + 9 bytes Cost ~100% of the time
_flush() line 162 + 23 bytes Cost ~100% of the time
_write() line 168 + 57 bytes Cost ~100% of the time
In a nutshell, the program spends ~100% of its time flushing the output buffer to disk (or console) as part of the printf on line 7.
(What I mean by the "cost" of a line is the fraction of total time spent at the request of that line, and that's roughly the fraction of samples that contain it.
If that line could be made to take no time, such as by removing it, skipping over it, or passing its work off to an infinitely fast coprocessor, that time fraction is how much the total time would shrink. So if the execution of any of these lines of code could be avoided, time would shrink by somewhere in the range of 95% to 100%. If you were to ask "What about recursion?", the answer is It Makes No Difference.)
Now, maybe you want to know something else, like how much time is spent in the loop, for example. To find that out, remove the printf because it's hogging all the time. Maybe you want to know what % of time is spent purely in CPU time, not in system calls. To get that, just throw away any stackshots that don't end in your code.
The point I'm trying to make is if you're looking for things you can fix to make the code run faster, the data gprof gives you, even if you understand it, is almost useless. By comparison, if there is some of your code that is causing more wall-clock time to be spent than you would like, stackshots will pinpoint it.
One gotcha with gprof: it doesn't work with code in dynamically-linked libraries. For that, you need to use sprof. See this answer: gprof : How to generate call graph for functions in shared library that is linked to main program
First compile your application with -g, and check which CPU counters you are using.
If your application runs very quickly, the profiler could simply miss all the events, or collect fewer than required (reduce the number of events per sample).
Profiling should show you CPU_CLK_UNHALTED or INST_RETIRED events without any special switches, but with such data you'll only be able to say how well your code is performing: INST_RETIRED/CPU_CLK_UNHALTED.
Try the Intel VTune profiler - it's free for 30 days and for education.
