GCC performance - c

I am doing parallel programming with MPI on Beowulf cluster. We wrote parallel algorithm for simulated annealing. It works fine. We expect 15 time faster execution than with serial code. But we did some execution of serial C code on different architectures and operating systems just so we could have different data sets for performance measurement. We have used this Random function in our code. We use GCC on both windows and ubuntu linux. We figured out that execution takes much longer on linuxes, and we don't know why. Can someone compile this code on linux and windows with gcc and try to explain me.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main (int argc, char** argv){
double Random();
int k,NUM_ITERATIONS = 10;
clock_t start_time = clock();
NUM_ITERATIONS=atoi(argv[1]);
// iniciranje random generatora
srand(time(NULL));
for(k=0; k<NUM_ITERATIONS; k++){
double raa = Random();
}
clock_t end_time = clock();
printf("Time of algorithm execution: %lf seconds\n", ((double) (end_time - start_time)) / CLOCKS_PER_SEC);
return 0;
}
// generate random number bettwen 0 and 1
double Random(){
srand(rand());
double a = rand();
return a/RAND_MAX;
}
If I execute it with 100 000 000 as argument for NUM_ITERATIONS, I get 20 times slower execution on linux than on windows. Tested on machine with same architecture with dual boot win + ubuntu linux. We need help as this Random function is bottleneck for what we want to show with our data.

On Linux gcc, the call to srand(rand()); within the Random function accounts for more than 98 % of the time.
It is not needed for the generation of random numbers, at least not within the loop. You already call srand() once, it's enough.

I would investigate other random number generators available. Many exist that have been well tested and perform better than the standard library random functions, both in terms of speed of execution and in terms of pseudo-randomness. I have also implemented my own RNG for a graduate class, but I wouldn't use it in production code. Go with something that has been vetted by the community. Random.org is a good resource for testing whatever RNG you select.

Related

OpenMP for beginners

I just got started with openMP; I wrote a little C code in order to check if what I have studied is correct. However I found some troubles; here is the main.c code
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes to the PC to do such operation using 1 and 2 CPUs; in order to to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
Both corresponding to N=1 and N=2. as you can see when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't implicitly mean, that you're going to get better performance.
OpenMP has to manage the data distribution among you're cores which is going to take time as well. Especially for very basic operations such as only a single multiplication you are doing, performance of a sequential (single core) program will be better.
Second, by going through every element of you're array only once and not doing anything else, you make no use of cache memory and most certainly not of shared cache between cpu's.
So you should start reading some things about general algorithm performance. To make use of multiple cores using shared cache is in my opinion the essence.
Todays computers have come to a stage where the CPU is so much faster than a memory allocation, read or write. This means when using multiple cores, you'll only have a benefit if you use things like shared cache, because the data distribution,initialization of the threads and managing them will use time as well. To really see a performance speedup (See the link, essential term in parallel computing) you should program an algorithm which has a heavy accent on computation not on memory; this has to do with locality (another important term).
So if you wanna experience a big performance boost by using multiple cores test it on a matrix-matrix-multiplication on big matrices such as 10'000*10'000. And plot some graphs with inputsize(matrix-size) to time and matrix-size to gflops and compare the multicore with the sequential version.
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the numbers of cores/threads within the code itself either directly at the #pragma line #pragma omp parallel for num_threads(2) or using the omp_set_num_threads function omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to always run the program multiple times and then take the mean of all the runtimes or something like that. Running the respective programs only once will not give you a meaningful reading of used time. Always call multiple times in a row. Not to forget to also alternate the quality of data.
I suggest writing a test.c file, which takes your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm, add some rand() for your values etc. seed them with srand() etc. If you have more questions on the subject or to my answer leave a comment and I'll try to explain further by adding more explanations.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total cpu time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

Run code for exactly one second

I would like to know how I can program something so that my program runs as long as a second lasts.
I would like to evaluate parts of my code and see where the time is spend most so I am analyzing parts of it.
Here's the interesting part of my code :
int size = 256
clock_t start_benching = clock();
for (uint32_t i = 0;i < size; i+=4)
{
myarray[i];
myarray[i+1];
myarray[i+2];
myarray[i+3];
}
clock_t stop_benching = clock();
This just gives me how long the function needed to perform all the operations.
I want to run the code for one second and see how many operations have been done.
This is the line to print the time measurement:
printf("Walking through buffer took %f seconds\n", (double)(stop_benching - start_benching) / CLOCKS_PER_SEC);
A better approach to benchmarking is to know the % of time spent on each section of the code.
Instead of making your code run for exactly 1 second, make stop_benchmarking - start_benchmarking the total run time - Take the time spent on any part of the code and divide by the total runtime to get a value between 0 and 1. Multiply this value by 100 and you have the % of time consumed at that specific section.
Non-answer advice: Use an actual profiler to profile the performance of code sections.
On *nix you can set an alarm(2) with a signal handler that sets a global flag to indicate the elapsed time. The Windows API provides something similar with SetTimer.
#include <unistd.h>
#include <signal.h>
int time_elapsed = 0;
void alarm_handler(int signal) {
time_elapsed = 1;
}
int main() {
signal(SIGALRM, &alarm_handler);
alarm(1); // set alarm time-out to 1 second
do {
// stuff...
} while (!time_elapsed);
return 0;
}
In more complicated cases you can use setitimer(2) instead of alarm(2), which lets you
use microsecond precision and
choose between counting
wall clock time,
user CPU time, or
user and system CPU time.

Use all PC resources to make intensive calculations

I'm writing in C a little script that will generate random numbers.
When I run it for 60 seconds, it generates only 17 million numbers, and in my PC task manager, I see that it uses almost 0% of the resources.
Can someone please, give me a piece of code or link that allow me to use the full resources of my PC to generate trillions of random numbers in a few seconds? Maybe multi-threaded?
Note: if you know a simple way (no heavy CUDA SDK) to use the Nvidia GPU rather than the CPU, it will be good too!
EDIT
Here's my code :
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
char getRandLetter(void);
int main()
{
int i,num=0,p=0;
char randhash[12];
time_t start,stop;
start = time(NULL);
while(1) {
for(i=0; i<12; i++){
randhash[i]=getRandLetter();
}
num++;
stop = time(NULL);
double diff = difftime(stop, start);
if (diff >= 60) {
printf("60 seconds passed... NUM=%d", num);
start = time(NULL);
break;
}
}
return 0;
}
char getRandLetter() {
static int range = 'Z'-'A'+1;
return rand()%range + 'A';
}
Note: i have a killer pc with i7 and a killer geforce :p So i just need to exploit these resources.
Generating random numbers should not be CPU bound. Functions that generate random numbers usually make a call to the system kernel, which gets random numbers from physical sources of entropy (the network, keyboard, mouse, etc.). The best you can do with computation is pseudo random numbers.
Effectively, there should be ways to use more of your CPU to generate "random" numbers quicker, but those random numbers wouldn't be as high quality as those plucked from physical sources of entropy (and using those sources doesn't flex the CPU much if at all) because CPUs tend to be quite deterministic (as they should be).
Some additional info on:
http://en.wikipedia.org/wiki/Random_number_generation

Computing algorithm running time in C

I am using the time.h lib in c to find the time taken to run an algorithm. The code structure is somewhat as follows :-
#include <time.h>
int main()
{
time_t start,end,diff;
start = clock();
//ALGORITHM COMPUTATIONS
end = clock();
diff = end - start;
printf("%d",diff);
return 0;
}
The values for start and end are always zero. Is it that the clock() function does't work? Please help.
Thanks in advance.
Not that it doesn't work. In fact, it does. But it is not the right way to measure time as the clock () function returns an approximation of processor time used by the program. I am not sure about other platforms, but on Linux you should use clock_gettime () with CLOCK_MONOTONIC flag - that will give you the real wall time elapsed. Also, you can read TSC, but be aware that it won't work if you have a multi-processor system and your process is not pinned to a particular core. If you want to analyze and optimize your algorithm, I'd recommend you use some performance measurement tools. I've been using Intel's vTune for a while and am quite happy. It will show you not only what part uses the most cycles, but highlight memory problems, possible parallelism issues etc. You may be very surprised with the results. For example, most of the CPU cycles might be spent waiting for memory bus. Hope it helps!
UPDATE: Actually, if you run later versions of Linux, it might provide CLOCK_MONOTONIC_RAW, which is a hardware-based clock that is not a subject to NTP adjustments. Here is a small piece of code you can use:
stopwatch.hpp
stopwatch.cpp
Note that clock() returns the execution time in clock ticks, as opposed to wall clock time. Divide a difference of two clock_t values by CLOCKS_PER_SEC to convert the difference to seconds. The actual value of CLOCKS_PER_SEC is a quality-of-implementation issue. If it is low (say, 50), your process would have to run for 20ms to cause a nonzero return value from clock(). Make sure your code runs long enough to see clock() increasing.
I usually do it this way:
clock_t start = clock();
clock_t end;
//algo
end = clock();
printf("%f", (double)(end - start));
Consider the code below:
#include <stdio.h>
#include <time.h>
int main()
{
clock_t t1, t2;
t1 = t2 = clock();
// loop until t2 gets a different value
while(t1 == t2)
t2 = clock();
// print resolution of clock()
printf("%f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
return 0;
}
Output:
$ ./a.out
10.000000 ms
Might be that your algorithm runs for a shorter amount of time than that.
Use gettimeofday for higher resolution timer.

Execution time of C program

I have a C program that aims to be run in parallel on several processors. I need to be able to record the execution time (which could be anywhere from 1 second to several minutes). I have searched for answers, but they all seem to suggest using the clock() function, which then involves calculating the number of clocks the program took divided by the Clocks_per_second value.
I'm not sure how the Clocks_per_second value is calculated?
In Java, I just take the current time in milliseconds before and after execution.
Is there a similar thing in C? I've had a look, but I can't seem to find a way of getting anything better than a second resolution.
I'm also aware a profiler would be an option, but am looking to implement a timer myself.
Thanks
CLOCKS_PER_SEC is a constant which is declared in <time.h>. To get the CPU time used by a task within a C application, use:
clock_t begin = clock();
/* here, do your time-consuming job */
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
Note that this returns the time as a floating point type. This can be more precise than a second (e.g. you measure 4.52 seconds). Precision depends on the architecture; on modern systems you easily get 10ms or lower, but on older Windows machines (from the Win98 era) it was closer to 60ms.
clock() is standard C; it works "everywhere". There are system-specific functions, such as getrusage() on Unix-like systems.
Java's System.currentTimeMillis() does not measure the same thing. It is a "wall clock": it can help you measure how much time it took for the program to execute, but it does not tell you how much CPU time was used. On a multitasking systems (i.e. all of them), these can be widely different.
If you are using the Unix shell for running, you can use the time command.
doing
$ time ./a.out
assuming a.out as the executable will give u the time taken to run this
In plain vanilla C:
#include <time.h>
#include <stdio.h>
int main()
{
clock_t tic = clock();
my_expensive_function_which_can_spawn_threads();
clock_t toc = clock();
printf("Elapsed: %f seconds\n", (double)(toc - tic) / CLOCKS_PER_SEC);
return 0;
}
You functionally want this:
#include <sys/time.h>
struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);
/* stuff to do! */
gettimeofday(&tv2, NULL);
printf ("Total time = %f seconds\n",
(double) (tv2.tv_usec - tv1.tv_usec) / 1000000 +
(double) (tv2.tv_sec - tv1.tv_sec));
Note that this measures in microseconds, not just seconds.
Most of the simple programs have computation time in milli-seconds. So, i suppose, you will find this useful.
#include <time.h>
#include <stdio.h>
int main(){
clock_t start = clock();
// Execuatable code
clock_t stop = clock();
double elapsed = (double)(stop - start) * 1000.0 / CLOCKS_PER_SEC;
printf("Time elapsed in ms: %f", elapsed);
}
If you want to compute the runtime of the entire program and you are on a Unix system, run your program using the time command like this time ./a.out
(All answers here are lacking, if your sysadmin changes the systemtime, or your timezone has differing winter- and sommer-times. Therefore...)
On linux use: clock_gettime(CLOCK_MONOTONIC_RAW, &time_variable);
It's not affected if the system-admin changes the time, or you live in a country with winter-time different from summer-time, etc.
#include <stdio.h>
#include <time.h>
#include <unistd.h> /* for sleep() */
int main() {
struct timespec begin, end;
clock_gettime(CLOCK_MONOTONIC_RAW, &begin);
sleep(1); // waste some time
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
printf ("Total time = %f seconds\n",
(end.tv_nsec - begin.tv_nsec) / 1000000000.0 +
(end.tv_sec - begin.tv_sec));
}
man clock_gettime states:
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since some unspecified starting point. This clock is not affected by discontinuous jumps in the system time
(e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.
Thomas Pornin's answer as macros:
#define TICK(X) clock_t X = clock()
#define TOCK(X) printf("time %s: %g sec.\n", (#X), (double)(clock() - (X)) / CLOCKS_PER_SEC)
Use it like this:
TICK(TIME_A);
functionA();
TOCK(TIME_A);
TICK(TIME_B);
functionB();
TOCK(TIME_B);
Output:
time TIME_A: 0.001652 sec.
time TIME_B: 0.004028 sec.
A lot of answers have been suggesting clock() and then CLOCKS_PER_SEC from time.h. This is probably a bad idea, because this is what my /bits/time.h file says:
/* ISO/IEC 9899:1990 7.12.1: <time.h>
The macro `CLOCKS_PER_SEC' is the number per second of the value
returned by the `clock' function. */
/* CAE XSH, Issue 4, Version 2: <time.h>
The value of CLOCKS_PER_SEC is required to be 1 million on all
XSI-conformant systems. */
# define CLOCKS_PER_SEC 1000000l
# if !defined __STRICT_ANSI__ && !defined __USE_XOPEN2K
/* Even though CLOCKS_PER_SEC has such a strange value CLK_TCK
presents the real value for clock ticks per second for the system. */
# include <bits/types.h>
extern long int __sysconf (int);
# define CLK_TCK ((__clock_t) __sysconf (2)) /* 2 is _SC_CLK_TCK */
# endif
So CLOCKS_PER_SEC might be defined as 1000000, depending on what options you use to compile, and thus it does not seem like a good solution.
#include<time.h>
#include<stdio.h>
int main(){
clock_t begin=clock();
int i;
for(i=0;i<100000;i++){
printf("%d",i);
}
clock_t end=clock();
printf("Time taken:%lf",(double)(end-begin)/CLOCKS_PER_SEC);
}
This program will work like charm.
You have to take into account that measuring the time that took a program to execute depends a lot on the load that the machine has in that specific moment.
Knowing that, the way of obtain the current time in C can be achieved in different ways, an easier one is:
#include <time.h>
#define CPU_TIME (getrusage(RUSAGE_SELF,&ruse), ruse.ru_utime.tv_sec + \
ruse.ru_stime.tv_sec + 1e-6 * \
(ruse.ru_utime.tv_usec + ruse.ru_stime.tv_usec))
int main(void) {
time_t start, end;
double first, second;
// Save user and CPU start time
time(&start);
first = CPU_TIME;
// Perform operations
...
// Save end time
time(&end);
second = CPU_TIME;
printf("cpu : %.2f secs\n", second - first);
printf("user : %d secs\n", (int)(end - start));
}
Hope it helps.
Regards!
ANSI C only specifies second precision time functions. However, if you are running in a POSIX environment you can use the gettimeofday() function that provides microseconds resolution of time passed since the UNIX Epoch.
As a side note, I wouldn't recommend using clock() since it is badly implemented on many(if not all?) systems and not accurate, besides the fact that it only refers to how long your program has spent on the CPU and not the total lifetime of the program, which according to your question is what I assume you would like to measure.
I've found that the usual clock(), everyone recommends here, for some reason deviates wildly from run to run, even for static code without any side effects, like drawing to screen or reading files. It could be because CPU changes power consumption modes, OS giving different priorities, etc...
So the only way to reliably get the same result every time with clock() is to run the measured code in a loop multiple times (for several minutes), taking precautions to prevent the compiler from optimizing it out: modern compilers can precompute the code without side effects running in a loop, and move it out of the loop., like i.e. using random input for each iteration.
After enough samples are collected into an array, one sorts that array, and takes the middle element, called median. Median is better than average, because it throws away extreme deviations, like say antivirus taking up all CPU up or OS doing some update.
Here is a simple utility to measure execution performance of C/C++ code, averaging the values near median: https://github.com/saniv/gauge
I'm myself still looking for a more robust and faster way to measure code. One could probably try running the code in controlled conditions on bare metal without any OS, but that will give unrealistic result, because in reality OS does get involved.
x86 has these hardware performance counters, which including the actual number of instructions executed, but they are tricky to access without OS help, hard to interpret and have their own issues ( http://archive.gamedev.net/archive/reference/articles/article213.html ). Still they could be helpful investigating the nature of the bottle neck (data access or actual computations on that data).
Every solution's are not working in my system.
I can get using
#include <time.h>
double difftime(time_t time1, time_t time0);
Some might find a different kind of input useful: I was given this method of measuring time as part of a university course on GPGPU-programming with NVidia CUDA (course description). It combines methods seen in earlier posts, and I simply post it because the requirements give it credibility:
unsigned long int elapsed;
struct timeval t_start, t_end, t_diff;
gettimeofday(&t_start, NULL);
// perform computations ...
gettimeofday(&t_end, NULL);
timeval_subtract(&t_diff, &t_end, &t_start);
elapsed = (t_diff.tv_sec*1e6 + t_diff.tv_usec);
printf("GPU version runs in: %lu microsecs\n", elapsed);
I suppose you could multiply with e.g. 1.0 / 1000.0 to get the unit of measurement that suits your needs.
If you program uses GPU or if it uses sleep() then clock() diff gives you smaller than actual duration. It is because clock() returns the number of CPU clock ticks. It only can be used to calculate CPU usage time (CPU load), but not the execution duration. We should not use clock() to calculate duration. We still should use gettimeofday() or clock_gettime() for duration in C.
perf tool is more accurate to be used in order to collect and profile the running program. Use perf stat to show all information related to the program being executed.
As simple as possible by using function-like macro
#include <stdio.h>
#include <time.h>
#define printExecTime(t) printf("Elapsed: %f seconds\n", (double)(clock()-(t)) / CLOCKS_PER_SEC)
int factorialRecursion(int n) {
return n == 1 ? 1 : n * factorialRecursion(n-1);
}
int main()
{
clock_t t = clock();
int j=1;
for(int i=1; i <10; i++ , j*=i);
printExecTime(t);
// compare with recursion factorial
t = clock();
j = factorialRecursion(10);
printExecTime(t);
return 0;
}
Comparison of execution time of bubble sort and selection sort
I have a program which compares the execution time of bubble sort and selection sort.
To find out the time of execution of a block of code compute the time before and after the block by
clock_t start=clock();
…
clock_t end=clock();
CLOCKS_PER_SEC is constant in time.h library
Example code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main()
{
int a[10000],i,j,min,temp;
for(i=0;i<10000;i++)
{
a[i]=rand()%10000;
}
//The bubble Sort
clock_t start,end;
start=clock();
for(i=0;i<10000;i++)
{
for(j=i+1;j<10000;j++)
{
if(a[i]>a[j])
{
int temp=a[i];
a[i]=a[j];
a[j]=temp;
}
}
}
end=clock();
double extime=(double) (end-start)/CLOCKS_PER_SEC;
printf("\n\tExecution time for the bubble sort is %f seconds\n ",extime);
for(i=0;i<10000;i++)
{
a[i]=rand()%10000;
}
clock_t start1,end1;
start1=clock();
// The Selection Sort
for(i=0;i<10000;i++)
{
min=i;
for(j=i+1;j<10000;j++)
{
if(a[min]>a[j])
{
min=j;
}
}
temp=a[min];
a[min]=a[i];
a[i]=temp;
}
end1=clock();
double extime1=(double) (end1-start1)/CLOCKS_PER_SEC;
printf("\n");
printf("\tExecution time for the selection sort is %f seconds\n\n", extime1);
if(extime1<extime)
printf("\tSelection sort is faster than Bubble sort by %f seconds\n\n", extime - extime1);
else if(extime1>extime)
printf("\tBubble sort is faster than Selection sort by %f seconds\n\n", extime1 - extime);
else
printf("\tBoth algorithms have the same execution time\n\n");
}

Resources