Calculating the time elapsed by a particular function in a C program - c

I have code in which I want to calculate the time taken by two sorting algorithms, merge sort and quick sort, to sort N numbers, in microseconds or with better precision.
The two times thus calculated will then be output to the terminal.
Code (part of the program):
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);
mergesort(extarr,0,n-1);
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
Please help me figure out how this can be done. I have tried clock_t, taking two variables start and stop and placing them above and below the function call respectively, but this doesn't help at all and always prints the difference as zero.
Please suggest some other method or function, keeping in mind that it should run without problems on any OS.
Thanks in advance for any help.

Method 1:
To calculate the total time taken by the program you can use the Linux utility "time".
Say your source file is test.cpp:
$g++ -o test test.cpp
$time ./test
The output will look like:
real 0m11.418s
user 0m0.004s
sys 0m0.004s
Method 2:
You can also use the Linux profiler "gprof" to find the time spent in different functions.
First, compile the program with the "-pg" flag:
$g++ -pg -o test test.cpp
$./test
$gprof test gmon.out
PS: gmon.out is the default profile data file; it is created when the instrumented program runs, and gprof reads it.

You can call the gettimeofday function on Linux and timeGetTime on Windows. Call these functions before and after calling your sorting function and take the difference.
Please check the man page for further details. If you are still unable to get tangible data (the time taken may be too small for small data sets), better to measure the time for 'n' iterations together and then deduce the time for a single run, or increase the size of the data set to be sorted.
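On Linux, a minimal sketch with gettimeofday might look like this (the workload here is a made-up busy loop standing in for the poster's mergesort call):
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval before, after;

    gettimeofday(&before, NULL);

    /* stand-in for mergesort(extarr, 0, n - 1); */
    volatile long sum = 0;
    for (long i = 0; i < 10000000L; i++)
        sum += i;

    gettimeofday(&after, NULL);

    long microseconds = (after.tv_sec - before.tv_sec) * 1000000L
                      + (after.tv_usec - before.tv_usec);
    printf("Sorting took %ld microseconds\n", microseconds);
    return 0;
}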

Not sure if you tried the following. I know your original post says that you have tried using CLOCKS_PER_SEC. Doing (stop - start) / CLOCKS_PER_SEC will give you seconds, but cast to double before dividing; otherwise the integer division truncates to whole seconds.
#include <time.h>

int main(void)
{
    clock_t launch = clock();
    // do work
    clock_t done = clock();
    // cast before dividing, otherwise integer division truncates to whole seconds
    double diff = (double)(done - launch) / CLOCKS_PER_SEC;
    return 0;
}

The reason you get zero as the result is likely the poor resolution of the time source you're using. These time sources typically increment by some 10 to 20 ms. This is poor, but that's the way they work. When your sorting finishes in less than this time increment, the result will be zero. You may increase this resolution into the 1 ms range by increasing the system's interrupt frequency. There is no standard way to accomplish this; Windows and Linux each have their own way.
An even higher resolution can be obtained with a high-frequency counter. Windows and Linux both provide access to such counters, but again, the code will look slightly different.
If you want one piece of code to run on both Windows and Linux, I'd recommend performing the time measurement in a loop. Run the code to be measured hundreds or even more times in a loop, and capture the time outside the loop. Divide the captured time by the number of loop cycles and you have the result.
Of course: this is for evaluation only. You don't want to have that in final code.
And: taking into account that the time resolution is in the 1 to 20 ms range, you should choose the total run time so that your measurement has decent resolution. (Hint: adjust the loop count so it runs for at least a second or so.)
Example:
clock_t start, end;
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr, n);
start = clock();
for (int i = 0; i < 100; i++) {
    mergesort(extarr, 0, n - 1);   /* note: after the first pass the array is already sorted */
}
end = clock();
/* cast before dividing, and divide by the loop count to get the per-call time */
double diff = (double)(end - start) / CLOCKS_PER_SEC / 100;
// and so on...
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);

If you are on Linux 2.6.26 or above, then getrusage(2) is the most accurate way to go:
#include <sys/time.h>
#include <sys/resource.h>
// since Linux 2.6.26
// The macro is not defined in all headers, but supported if your Linux version matches
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif
// If you are single-threaded then RUSAGE_SELF is POSIX compliant
// http://linux.die.net/man/2/getrusage
struct rusage rusage_start, rusage_stop;
getrusage(RUSAGE_THREAD, &rusage_start);
...
getrusage(RUSAGE_THREAD, &rusage_stop);
// amount of microseconds spent in user space
size_t user_time = ((rusage_stop.ru_utime.tv_sec - rusage_start.ru_utime.tv_sec) * 1000000) + rusage_stop.ru_utime.tv_usec - rusage_start.ru_utime.tv_usec;
// amount of microseconds spent in kernel space
size_t system_time = ((rusage_stop.ru_stime.tv_sec - rusage_start.ru_stime.tv_sec) * 1000000) + rusage_stop.ru_stime.tv_usec - rusage_start.ru_stime.tv_usec;
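For instance, a minimal self-contained sketch built around that fragment might look like this (the workload is a made-up busy loop standing in for the sort):
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif

/* stand-in workload for mergesort/quicksort */
static void busy_work(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 50000000UL; i++)
        sum += i;
}

int main(void)
{
    struct rusage rusage_start, rusage_stop;

    getrusage(RUSAGE_THREAD, &rusage_start);
    busy_work();
    getrusage(RUSAGE_THREAD, &rusage_stop);

    long user_us = (rusage_stop.ru_utime.tv_sec - rusage_start.ru_utime.tv_sec) * 1000000L
                 + (rusage_stop.ru_utime.tv_usec - rusage_start.ru_utime.tv_usec);
    printf("user time: %ld microseconds\n", user_us);
    return 0;
}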

Related

Why does my Linux app get stopped every 0.5 seconds?

I have a 16 core Linux machine that is idle. If I run a trivial, single threaded C program that sits in a loop reading the cycle counter forever (using the rdtsc instruction), then every 0.5 seconds, I see a 0.17 ms jump in the timer value. In other words, every 0.5 seconds it seems that my application is stopped for 0.17ms. I would like to understand why this happens and what I can do about it. I understand Linux is not a real time operating system. I'm just trying to understand what is going on, so I can make the best use of what Linux provides.
I found someone else's software for measuring this problem - https://github.com/nokia/clocktick_jumps. Its results are consistent with my own.
To answer the "tell us what specific problem you are trying to solve" question - I work on high-speed networking applications using DPDK. About 60 million packets arrive per second. I need to decide what size to make the RX buffers and have reasons that the number I pick is sensible. The answer to this question is one part of that puzzle.
My code looks like this:
// Build with gcc -O2 -Wall
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

int main() {
    // Bad way to learn frequency of cycle counter.
    unsigned long long t1 = __rdtsc();
    usleep(1000000);
    double millisecs_per_tick = 1e3 / (double)(__rdtsc() - t1);

    // Loop forever. Print message if any iteration takes unusually long.
    t1 = __rdtsc();
    while (1) {
        unsigned long long t2 = __rdtsc();
        double delta = t2 - t1;
        delta *= millisecs_per_tick;
        if (delta > 0.1) {
            printf("%4.2f - Delay of %.2f ms.\n", (double)t2 * millisecs_per_tick, delta);
        }
        t1 = t2;
    }
    return 0;
}
I'm running on Ubuntu 16.04, amd64. My processor is an Intel Xeon X5672 @ 3.20GHz.
I expect your system is scheduling another process to run on the same CPU, and your process is either preempted or moved to another core, with some timing penalty.
You can find the reason by digging into kernel events happening at the same time. For example systemtap, or perf can give you some insight. I'd start with the scheduler events to eliminate that one first: https://github.com/jav/systemtap/blob/master/tapset/scheduler.stp
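One possible starting point (an assumption on my part: that perf is installed and your binary is named test; sched:sched_switch is the standard scheduler tracepoint):
$perf record -e sched:sched_switch -p $(pidof test) sleep 5
$perf script
The perf script output then shows, for that five-second window, which task the kernel switched to each time your process was descheduled.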

Using multiple threads to calculate data, but it doesn't reduce the time

My CPU has four cores (macOS). I use 4 threads to calculate an array, but the calculation time is not reduced. If I don't use multithreading, the calculation takes about 52 seconds. But even if I use 4 threads, or 2 threads, the time doesn't change.
(I know why this happens now. The problem is that I used clock() to measure the time. That is wrong in a multithreaded program, because clock() sums the CPU time of all threads instead of reporting the real elapsed time. When I use time() to measure the time, the result is correct.)
The output of using 2 threads:
id 1 use time = 43 sec to finish
id 0 use time = 51 sec to finish
time for round 1 = 51 sec
id 1 use time = 44 sec to finish
id 0 use time = 52 sec to finish
time for round 2 = 52 sec
id 1 and id 0 are thread 1 and thread 0; "time for round" is the time for both threads to finish. If I don't use multithreading, the time for a round is also about 52 seconds.
This is the part of calling 4 threads:
for (i = 1; i <= round; i++)
{
    time_round_start = clock();
    for (j = 0; j < THREAD_NUM; j++)
    {
        cal_arg[j].roundth = i;
        pthread_create(&thread_t_id[j], NULL, Multi_Calculate, &cal_arg[j]);
    }
    for (j = 0; j < THREAD_NUM; j++)
    {
        pthread_join(thread_t_id[j], NULL);
    }
    time_round_end = clock();
    int round_time = (int)((time_round_end - time_round_start) / CLOCKS_PER_SEC);
    printf("time for round %d = %d sec\n", i, round_time);
}
This is the code inside the thread function:
void *Multi_Calculate(void *arg)
{
    struct multi_cal_data cal = *((struct multi_cal_data *)arg);
    int p_id = cal.thread_id;
    int i = 0;
    int root_level = 0;
    int leaf_addr = 0;
    int neighbor_root_level = 0;
    int neighbor_leaf_addr = 0;
    Neighbor *locate_neighbor = (Neighbor *)malloc(sizeof(Neighbor));
    printf("id:%d, start:%d end:%d,round:%d\n", p_id, cal.start_num, cal.end_num, cal.roundth);
    for (i = cal.start_num; i <= cal.end_num; i++)
    {
        root_level = i / NUM_OF_EACH_LEVEL;
        leaf_addr = i % NUM_OF_EACH_LEVEL;
        if (root_addr[root_level][leaf_addr].node_value != i)
        {
            // ignore, because this is a gap: no such node
        }
        else
        {
            int k = 0;
            locate_neighbor = root_addr[root_level][leaf_addr].head;
            double tmp_credit = 0;
            for (k = 0; k < root_addr[root_level][leaf_addr].degree; k++)
            {
                neighbor_root_level = locate_neighbor->neighbor_value / NUM_OF_EACH_LEVEL;
                neighbor_leaf_addr = locate_neighbor->neighbor_value % NUM_OF_EACH_LEVEL;
                tmp_credit += root_addr[neighbor_root_level][neighbor_leaf_addr].g_credit[cal.roundth - 1] / root_addr[neighbor_root_level][neighbor_leaf_addr].degree;
                locate_neighbor = locate_neighbor->next;
            }
            root_addr[root_level][leaf_addr].g_credit[cal.roundth] = tmp_credit;
        }
    }
    return 0;
}
The array is very large; each thread calculates part of the array.
Is there something wrong with my code?
It could be a bug, but if you feel the code is correct, then the overhead of parallelization (mutexes and such) might mean the overall performance (runtime) is the same as for the non-parallelized code at this number of elements.
It might be an interesting study to run the looped, single-threaded code and the threaded code against very large arrays (100k elements?) and see whether the results start to diverge, with the parallel/threaded code becoming faster.
Amdahl's law, also known as Amdahl's argument,[1] is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
https://en.wikipedia.org/wiki/Amdahl%27s_law
You don't always gain speed by multi-threading a program. There is a certain amount of overhead that comes with threading. Unless there are enough inefficiencies in the non-threaded code to make up for the overhead, you'll not see an improvement. A lot can be learned about how multi-threading works even if the program you write ends up running slower.
I know why this happens now. The problem is that I used clock() to measure the time. That is wrong in a multithreaded program, because clock() adds up the CPU time of all the threads rather than measuring the real (wall-clock) time. When I use time() to measure the time, the result is correct.
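A minimal sketch of the difference, assuming a POSIX system (the worker here is a made-up busy loop, not the Multi_Calculate from the question; compile with gcc -pthread): clock() charges the CPU time of both threads, while clock_gettime(CLOCK_MONOTONIC) reports the wall-clock time that actually elapsed.
#include <stdio.h>
#include <time.h>
#include <pthread.h>

/* hypothetical CPU-bound worker, standing in for the real calculation */
static void *worker(void *arg)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 200000000UL; i++)
        sum += i;
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    struct timespec ws, we;
    clock_t cs, ce;

    clock_gettime(CLOCK_MONOTONIC, &ws);   /* wall-clock start */
    cs = clock();                          /* summed CPU-time start */

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    ce = clock();
    clock_gettime(CLOCK_MONOTONIC, &we);

    double wall = (we.tv_sec - ws.tv_sec) + 1e-9 * (we.tv_nsec - ws.tv_nsec);
    double cpu  = (double)(ce - cs) / CLOCKS_PER_SEC;
    printf("wall time: %.2f s, summed CPU time: %.2f s\n", wall, cpu);
    return 0;
}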

OpenMP for beginners

I just got started with OpenMP; I wrote a little C code in order to check whether what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes the PC to do such an operation using 1 and 2 CPUs; to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however, I don't get the expected execution times; my results are in fact:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
These correspond to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't implicitly mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which takes time as well. Especially for very basic operations, such as the single multiplication you are doing, the performance of a sequential (single-core) program will be better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory, and most certainly not of shared cache between CPUs.
So you should start reading about general algorithm performance. Making use of multiple cores via shared cache is, in my opinion, the essence.
Today's computers have reached a stage where the CPU is much faster than a memory allocation, read, or write. This means that when using multiple cores you'll only see a benefit if you use things like shared cache, because data distribution, initialization of the threads, and managing them take time as well. To really see a performance speedup (see the link; an essential term in parallel computing) you should program an algorithm with a heavy accent on computation, not on memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from using multiple cores, test it on a matrix-matrix multiplication with big matrices, such as 10,000 x 10,000. And plot some graphs of input size (matrix size) against time and of matrix size against GFLOPS, and compare the multicore version with the sequential one.
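For illustration, here is a minimal sketch of such a benchmark (my own example, not the poster's code): a naive matrix-matrix multiplication timed with omp_get_wtime(), using a smaller size so it finishes quickly. Compile with gcc -fopenmp and vary OMP_NUM_THREADS as above.
#include <stdio.h>
#include <omp.h>

#define N 800   /* increase towards 10000 for a serious benchmark */

int main(void)
{
    static double a[N][N], b[N][N], c[N][N];   /* static: keep the ~15 MB off the stack */

    /* fill the inputs with something non-trivial */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }

    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    double elapsed = omp_get_wtime() - start;

    printf("N = %d, wall time = %.3f s, c[0][0] = %.1f\n", N, elapsed, c[0][0]);
    return 0;
}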
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the number of cores/threads within the code itself, either directly on the #pragma line, #pragma omp parallel for num_threads(2), or by calling the omp_set_num_threads function: omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to run the program multiple times and then take the mean of all the runtimes, or something like that. Running the respective programs only once will not give you a meaningful reading of the time used. Always call multiple times in a row, and don't forget to also vary the input data.
I suggest writing a test.c file which calls your actual function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for (int i = 0; i < executiontimes; i++) {
    function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test harness: add some rand() for your values, seed them with srand(), etc. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.
The function clock() returns elapsed CPU time, which includes the ticks of all threads in the process. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
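For example, applied to the code in the question, the timed kernel could look like the sketch below (my adaptation, not the original code: the loop runs from 0 to N-1 and prints a[N-1], since indexing a[N] would be one past the end of the array; compile with gcc -fopenmp):
#include <stdio.h>
#include <omp.h>

#define N 1000000

static int a[N];   /* file scope, so the 4 MB array is not on the stack */

int main(void)
{
    double start = omp_get_wtime();   /* wall-clock time, unlike clock() */

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2 * i;
    }

    double elapsed = omp_get_wtime() - start;
    printf("Kernel Time: %e s\n", elapsed);
    printf("a[N-1] = %d\n", a[N - 1]);
    return 0;
}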
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

elapsed time in C

#include <time.h>

time_t start, end;
time(&start);
// code here
time(&end);
double dif = difftime(end, start);
printf("Elapsed time is %.2lf seconds.", dif);
I'm getting 0.000 for both the start and end times, and I don't understand the source of the error.
Also, is it better to use time(&start) and time(&end), or start = clock() and end = clock(), for computing the elapsed time?
On most (practically all?) systems, time() only has a granularity of one second, so any sub-second lengths of time can't be measured with it. If you're on Unix, try using gettimeofday instead.
If you do want to use clock() make sure you understand that it measures CPU time only. Also, to convert to seconds, you need to divide by CLOCKS_PER_SEC.
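To make the granularity point concrete, here is a minimal sketch (my own example) that times the same busy loop with both time() and clock(): time() will typically report 0 seconds for sub-second work, while clock() resolves it.
#include <stdio.h>
#include <time.h>

int main(void)
{
    time_t t_start, t_end;
    clock_t c_start, c_end;

    time(&t_start);
    c_start = clock();

    /* stand-in workload that runs well under a second */
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; i++)
        x += i * 0.5;

    c_end = clock();
    time(&t_end);

    printf("time():  %.2f s\n", difftime(t_end, t_start));                      /* usually 0.00 */
    printf("clock(): %.4f s\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);    /* sub-second CPU time */
    return 0;
}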
Short excerpts of code typically don't take long enough to run for profiling purposes. A common technique is to repeat the call many, many (millions of) times and then divide the resulting time delta by the iteration count. Pseudo-code:
count = 10,000,000
start = readCurrentTime()
loop count times:
myCode()
end = readCurrentTime()
elapsedTotal = end - start
elapsedForOneIteration = elapsedTotal / count
If you want accuracy, you can discount the loop overhead. For example:
loop count times:
myCode()
myCode()
and measure elapsed1 (2 x count iterations + loop overhead)
loop count times:
myCode()
and measure elapsed2 (count iterations + loop overhead)
actualElapsed = elapsed1 - elapsed2
(count iterations -- because rest of terms cancel out)
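A C translation of that idea might look like this (a sketch; myCode() is a placeholder for whatever you are measuring, and gettimeofday is used as the time source):
#include <stdio.h>
#include <sys/time.h>

/* placeholder for the code being measured */
static void myCode(void)
{
    volatile int x = 0;
    x++;
}

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const long count = 10000000L;

    /* pass 1: two calls per iteration */
    double t0 = now_seconds();
    for (long i = 0; i < count; i++) { myCode(); myCode(); }
    double elapsed1 = now_seconds() - t0;

    /* pass 2: one call per iteration */
    t0 = now_seconds();
    for (long i = 0; i < count; i++) { myCode(); }
    double elapsed2 = now_seconds() - t0;

    /* loop overhead cancels out; what remains is count iterations of myCode */
    double per_call = (elapsed1 - elapsed2) / count;
    printf("~%.1f ns per call\n", per_call * 1e9);
    return 0;
}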
time has (at best) one-second resolution. If your code runs in much less than that, you aren't likely to see a difference.
Use a profiler (such as gprof on *nix, Instruments on OS X; for Windows, see "Profiling C code on Windows when using Eclipse") to time your code.
The code you're running between the measurements is too fast. I just tried your code, printing the numbers from 0 to 99,999, and I got
Elapsed time is 1.00 seconds.
Your code is taking less than a second to run.

Analyzing time complexity of a function written in C

I was implementing the Longest Common Subsequence problem in C. I wish to compare the execution times of the recursive version of the solution and the dynamic programming version. How can I find the time taken to run the LCS function in both versions for various inputs? Also, can I use SciPy to plot these values on a graph and infer the time complexity?
Thanks in advance,
Razor
For the second part of your question: the short answer is yes, you can. You need to get the two data sets (one for each solution) into a format that is convenient to parse from Python. Something like:
x y z
on each line, where x is the sequence length, y is the time taken by the dynamic solution, and z is the time taken by the recursive solution.
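From the C side, a sketch of emitting those rows might be (time_dynamic and time_recursive are stand-ins for however you measure one run at length n; replace their bodies with your real timing code):
#include <stdio.h>

/* stand-ins for the real measurements at sequence length n, in seconds */
static double time_dynamic(int n)   { return 1e-6 * n * n; }
static double time_recursive(int n) { return 1e-7 * n * n * n; }

int main(void)
{
    FILE *out = fopen("timings.txt", "w");
    if (out == NULL)
        return 1;
    for (int n = 10; n <= 200; n += 10)
        fprintf(out, "%d %f %f\n", n, time_dynamic(n), time_recursive(n));
    fclose(out);
    return 0;
}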
Then, in Python:
# Load these from your data sets.
sequence_lengths = ...
recursive_times = ...
dynamic_times = ...
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
p1 = ax.plot(sequence_lengths, recursive_times, 'r', linewidth=2)
p2 = ax.plot(sequence_lengths, dynamic_times, 'b', linewidth=2)
plt.xlabel('Sequence length')
plt.ylabel('Time')
plt.title('LCS timing')
plt.grid(True)
plt.show()
The simplest way to account for processor time is to use the clock() function from time.h:
clock_t start, elapsed;
start = clock();
run_test();
elapsed = clock() - start;
printf("Elapsed clock cycles: %ld\n", (long)elapsed);
Since you are simply comparing different implementations, you don't need to convert the clocks into real time units.
There are various ways to do it. One of the simpler ones is to find some code that does high-resolution (microsecond or smaller) timing of intervals. Wrap calls to the start-timer and stop-timer functions around the call to the LCS function, then print the resulting elapsed time:
#include "timer.h"
Clock clk;
char elapsed[32];
clk_start(&clk);
lcs_recursive();
clk_stop(&clk);
printf("Elapsed time (recursive): %s\n",
clk_elapsed_us(&clk, elapsed, sizeof(elapsed)));
Similarly for the lcs_dynamic() function.
If the time for a single iteration is too small, then wrap a loop around the function. I usually put the timing code into a function, and then call that a few times to get consistent results.
I can point you to the package illustrated.
Yes, you could feed the results, with care, into a graphing package such as SciPy. Clearly, you'd have to parameterize the test size, and time the code several times at each size.
