I have been tasked with implementing three different functions, get_current_time_seconds1, 2 and 3, and then I have to estimate the resolution of the various functions. How would I estimate this?
Which timing function would you suggest using? And what do the compiler options -O0 -lrt mean when I have to compile with gcc -O0 -lrt timing.c -o timing?
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L
#define LIMIT_I 1000
#define LIMIT_J 1000
double get_current_time_seconds1()
{
/* Get current time using time() and localtime() */
time_t t = time(NULL);
struct tm *tm = localtime(&t);
printf("%s\n", asctime(tm));
return (double) t; /* seconds since the Unix epoch */
}
double get_current_time_seconds2()
{
struct timespec start,stop;
clock_gettime(CLOCK_REALTIME, &start);
clock_gettime(CLOCK_REALTIME, &stop);
double x = (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec) / 1e9; /* seconds */
printf("%lf\n", x);
return x;
}
double get_current_time_seconds3()
{
uint64_t diff;
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
sleep(5);
clock_gettime(CLOCK_MONOTONIC, &end);
diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
printf("elapsed time = %llu nanoseconds\n", (long long unsigned int)diff);
return (double) diff;
}
How would I estimate this? Which timing function would you suggest to use?
If you want to get the resolution (precision) of the various clocks, you can use the clock_getres function, passing in the various CLOCK_ id types, for example:
#include <stdio.h>
#include <time.h>
static void printres(clockid_t id)
{
struct timespec ts;
int rc = clock_getres(id, &ts);
printf("clock id: %d\n", (unsigned int)id);
if (rc != 0) {
printf("Error: %d\n", rc);
return;
}
printf("tv_sec = %lu\ntv_nsec = %lu\n", ts.tv_sec, ts.tv_nsec);
}
int main(int argc, char** argv)
{
printres(CLOCK_REALTIME);
printres(CLOCK_MONOTONIC);
printres(CLOCK_PROCESS_CPUTIME_ID);
printres(CLOCK_THREAD_CPUTIME_ID);
return 0;
}
On my system, tv_nsec is 1000 for all but CLOCK_THREAD_CPUTIME_ID, for which tv_nsec is 1. This means that the resolution of the other clock types is 1 microsecond (1000 nanoseconds), while the resolution of CLOCK_THREAD_CPUTIME_ID is 1 nanosecond.
For your first function, which uses time() and localtime(), the resolution would be 1 second, as time() returns the time since the Unix epoch in whole seconds.
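If you also want to cross-check the reported resolution empirically, a minimal sketch (the function name estimate_resolution_ns is mine, not part of any API) is to call clock_gettime back to back many times and keep the smallest nonzero difference:
#include <stdio.h>
#include <time.h>

/* Smallest nonzero difference between consecutive readings, in nanoseconds. */
static long estimate_resolution_ns(clockid_t id)
{
    struct timespec a, b;
    long best = -1;
    long d;
    int i;
    for (i = 0; i < 100000; ++i) {
        clock_gettime(id, &a);
        clock_gettime(id, &b);
        d = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
        if (d > 0 && (best < 0 || d < best))
            best = d;
    }
    return best;
}

int main(void)
{
    printf("CLOCK_REALTIME  ~ %ld ns\n", estimate_resolution_ns(CLOCK_REALTIME));
    printf("CLOCK_MONOTONIC ~ %ld ns\n", estimate_resolution_ns(CLOCK_MONOTONIC));
    return 0;
}
Keep in mind that this mixes the clock's tick size with the overhead of the call itself, so treat it as a practical figure rather than the exact hardware resolution.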
What do the compiler options -O0 -lrt mean when I have to compile with gcc -O0 -lrt timing.c -o timing?
For compilers like gcc and clang, the -O option tells the compiler to optimize the code to the specified level, so -O0 means not to optimize at all; this is usually useful when debugging code.
The -l option says to link against the specified library, so -lrt says to link against the rt library (the "real time" library); this is necessary on some systems because the clock_gettime family of functions is defined in that library.
I hope that can help.
Related
I need to compare the performance of various pthread constructs like mutexes, semaphores and read-write locks, and also the corresponding serial programs, by designing some experiments. The main problem is deciding how to measure the execution time of the code for the analysis.
I have read about some C functions like clock(), gettimeofday() etc. From what I could understand, we can use clock() to get the actual number of CPU cycles used by a program (by subtracting the value returned by the function at the start and end of the code whose time we want to measure), while gettimeofday() returns the wall-clock time for the execution of the program.
But the problem is that total CPU cycles do not appear to be a good criterion to me, as they would sum the CPU time taken across all the parallel running threads (so clock() is not good, in my view). Wall-clock time is not good either, since there might be other processes running in the background, so the time finally depends on how the threads get scheduled (so gettimeofday() is also not good, in my view).
Some other functions that I know of do much the same as the two above. So I wanted to know if there is some function I can use for my analysis, or whether I am wrong somewhere in my conclusions above.
From the Linux clock_gettime man page:
CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
Per-process CPU-time clock (measures CPU time consumed by all
threads in the process).
CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
Thread-specific CPU-time clock.
I believe clock() was at some point implemented as clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...), but I see it's implemented using times() in glibc.
So if you want to measure thread-specific CPU time you can use clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) on GNU/Linux systems.
Never use gettimeofday or clock_gettime(CLOCK_REALTIME, ...) to measure the execution time of a program. Don't even think about that. gettimeofday is the "wall clock": you can display it on the wall in your room. If you want to measure the flow of time, forget gettimeofday.
If you want, you can even stay fully POSIX-compatible by using pthread_getcpuclockid inside your thread and passing the clock id it returns to clock_gettime.
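For example, a minimal sketch of that approach (the worker function and its busy loop are just placeholders so there is something to measure):
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *worker(void *arg)
{
    volatile unsigned long s = 0;   /* volatile so the loop is not optimized away */
    unsigned long i;
    clockid_t cid;
    struct timespec ts;
    (void)arg;

    for (i = 0; i < 100000000UL; ++i)
        s += i;

    /* CPU time consumed by this thread only */
    pthread_getcpuclockid(pthread_self(), &cid);
    clock_gettime(cid, &ts);
    printf("thread CPU time: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
Compile with -pthread (and -lrt on older glibc).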
I am not sure that summing an array is a good test: you do not need a mutex or anything else to sum an array with multiple threads, each thread just has to sum a dedicated part of the array, and there are a lot of memory accesses for little CPU computation. Example (the values of SZ and NTHREADS are given when compiling); the measured time is the real time (monotonic):
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
static int Arr[SZ];
void * thSum(void * a)
{
int s = 0, i;
int sup = *((int *) a) + SZ/NTHREADS;
for (i = *((int *) a); i != sup; ++i)
s += Arr[i];
*((int *) a) = s;
return NULL;
}
int main()
{
int i;
for (i = 0; i != SZ; ++i)
Arr[i] = rand();
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
int s = 0;
for (i = 0; i != SZ; ++i)
s += Arr[i];
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("mono thread : %d %lf\n", s,
(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);
clock_gettime(CLOCK_MONOTONIC, &t0);
int n[NTHREADS];
pthread_t ths[NTHREADS];
for (i = 0; i != NTHREADS; ++i) {
n[i] = SZ / NTHREADS * i;
if (pthread_create(&ths[i], NULL, thSum, &n[i])) {
printf("cannot create thread %d\n", i);
return -1;
}
}
int s2 = 0;
for (i = 0; i != NTHREADS; ++i) {
pthread_join(ths[i], NULL);
s2 += n[i];
}
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("%d threads : %d %lf\n", NTHREADS, s2,
(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);
}
Compilations and executions:
(array of 100,000,000 elements)
/tmp % gcc -DSZ=100000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035217
2 threads : 563608529 0.020407
/tmp % ./a.out
mono thread : 563608529 0.034991
2 threads : 563608529 0.022659
/tmp % gcc -DSZ=100000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035212
4 threads : 563608529 0.014234
/tmp % ./a.out
mono thread : 563608529 0.035184
4 threads : 563608529 0.014163
/tmp % gcc -DSZ=100000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035229
8 threads : 563608529 0.014971
/tmp % ./a.out
mono thread : 563608529 0.035142
8 threads : 563608529 0.016248
(array of 1,000,000,000 elements)
/tmp % gcc -DSZ=1000000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.343761
2 threads : -1471389927 0.197303
/tmp % ./a.out
mono thread : -1471389927 0.346682
2 threads : -1471389927 0.197669
/tmp % gcc -DSZ=1000000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346859
4 threads : -1471389927 0.130639
/tmp % ./a.out
mono thread : -1471389927 0.346506
4 threads : -1471389927 0.130751
/tmp % gcc -DSZ=1000000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346954
8 threads : -1471389927 0.123572
/tmp % ./a.out
mono thread : -1471389927 0.349652
8 threads : -1471389927 0.127059
As you can see, the execution time is not simply divided by the number of threads; the bottleneck is probably the access to memory.
I am working on a machine with an Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz. It supports SSE4.2.
I have written C code to perform an XOR operation over string bits. But I want to write corresponding SIMD code and check for a performance improvement. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define LENGTH 10
unsigned char xor_val[LENGTH];
void oper_xor(unsigned char *r1, unsigned char *r2)
{
unsigned int i;
for (i = 0; i < LENGTH; ++i)
{
xor_val[i] = (unsigned char)(r1[i] ^ r2[i]);
printf("%d",xor_val[i]);
}
}
int main() {
int i;
clock_t start, stop;
double cur_time;
start = clock();
oper_xor("1110001111", "0000110011");
stop = clock();
cur_time = ((double) stop-start) / CLOCKS_PER_SEC;
printf("Time used %f seconds.\n", cur_time / 100);
for (i = 0; i < LENGTH; ++i)
printf("%d",xor_val[i]);
printf("\n");
return 0;
}
On compiling and running this sample code I get the output shown below. The time is 0 here, but in the actual project it consumes a significant amount of time.
gcc xor_scalar.c -o xor_scalar
pan88: ./xor_scalar
1110111100 Time used 0.000000 seconds.
1110111100
How can I start writing corresponding SIMD code for SSE4.2?
The Intel Compiler and any OpenMP compiler support #pragma simd and #pragma omp simd, respectively. These are your best bet to get the compiler to do SIMD code generation for you. If that fails, you can use intrinsics or, as a last resort, inline assembly.
Note that the printf function calls will almost certainly interfere with vectorization, so you should remove them from any loops in which you want to see SIMD.
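If you do go the intrinsics route, here is a minimal sketch of the idea (not a drop-in replacement for your oper_xor: it assumes the buffer length is a multiple of 16, pads your 10-byte inputs up to one SSE block, and uses only SSE2 intrinsics, which your SSE4.2 CPU supports):
#include <stdio.h>
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_xor_si128, _mm_storeu_si128 */

/* XOR two byte buffers 16 bytes at a time; n is assumed to be a multiple of 16.
   A real version would handle the tail with a scalar loop. */
static void oper_xor_sse(const unsigned char *r1, const unsigned char *r2,
                         unsigned char *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(r1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(r2 + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_xor_si128(a, b));
    }
}

int main(void)
{
    /* Same inputs as the scalar version, zero-padded to one 16-byte block */
    unsigned char r1[16] = "1110001111";
    unsigned char r2[16] = "0000110011";
    unsigned char out[16];
    int i;
    oper_xor_sse(r1, r2, out, sizeof out);
    for (i = 0; i < 10; ++i)   /* print the 10 meaningful bytes */
        printf("%d", out[i]);
    printf("\n");              /* prints 1110111100, as before */
    return 0;
}
Compile with something like gcc -O3 -msse2 (or -march=native), and time the loop without any printf inside it, for the reason given above.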
I'm writing a simple program, and it gives me an error when I pass a non-integer value to the sleep() function. Is there another function, or a different way to use sleep(), that will allow me to pause the program temporarily for less than a second (or, say, 2.5)?
mario.c:34:13: error: implicit declaration of function 'usleep' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
usleep(30000);
^
usleep will pause for a particular number of microseconds. For example:
usleep(500000);
will pause for half a second.
On some systems, and depending on compiler configuration etc, you may need to define _GNU_SOURCE before including unistd.h (ideally, at the start of the source file), with:
#define _GNU_SOURCE
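Putting it together, a minimal complete example of the half-second pause (assuming a glibc-style system where the feature-test macro is needed):
#define _GNU_SOURCE          /* may be needed for usleep on some systems */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("pausing...\n");
    usleep(500000);          /* 500000 microseconds = 0.5 seconds */
    printf("done\n");
    return 0;
}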
The POSIX.1-2001 standard defines the function nanosleep(), which is perfect for this. Unlike sleep() and usleep(), nanosleep() does not interfere with signals.
Current Linux C libraries tend to implement usleep() as a wrapper around nanosleep() anyway; you can see this if you strace a test program (say, run strace /bin/sleep 1).
However, as nanosleep() is part of the POSIX.1-2001 standard, it is definitely the one to recommend.
For example:
#define _POSIX_C_SOURCE 200112L
#include <time.h>
#include <errno.h>
/* Sleep; returns un-slept (leftover) time.
*/
double dsleep(const double seconds)
{
const long sec = (long)seconds;
const long nsec = (long)((seconds - (double)sec) * 1e9);
struct timespec req, rem;
if (sec < 0L)
return 0.0;
if (sec == 0L && nsec <= 0L)
return 0.0;
req.tv_sec = sec;
if (nsec <= 0L)
req.tv_nsec = 0L;
else
if (nsec <= 999999999L)
req.tv_nsec = nsec;
else
req.tv_nsec = 999999999L;
rem.tv_sec = 0;
rem.tv_nsec = 0;
if (nanosleep(&req, &rem) == -1) {
if (errno == EINTR)
return (double)rem.tv_sec + (double)rem.tv_nsec / 1000000000.0;
else
return seconds;
} else
return 0.0;
}
Most of the logic above is there to make sure (regardless of floating-point rounding mode et cetera) that the nanoseconds field is always within [0, 999999999], inclusive, and that it is safe to call the function with negative values (in which case it will just return zero).
If you want to specify the duration in integer milliseconds, you could use
#define _POSIX_C_SOURCE 200112L
#include <time.h>
#include <errno.h>
/* Sleep; returns un-slept (leftover) time.
*/
long msleep(const long ms)
{
struct timespec req, rem;
if (ms <= 0L)
return 0L;
req.tv_sec = ms / 1000L;
req.tv_nsec = (ms % 1000L) * 1000000L;
rem.tv_sec = 0;
rem.tv_nsec = 0;
if (nanosleep(&req, &rem) == -1) {
if (errno == EINTR)
return (long)rem.tv_sec * 1000L + rem.tv_nsec / 1000000L;
else
return ms;
} else
return 0L;
}
It is not necessary to zero out rem, above, but written this way, these two functions are extremely robust, even against an occasional system hiccup or library bug.
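For example, assuming dsleep() from above is in the same file, a caller might look like this:
#include <stdio.h>

double dsleep(const double seconds);   /* defined above */

int main(void)
{
    double left = dsleep(2.5);         /* sleep for 2.5 seconds */
    if (left > 0.0)
        printf("interrupted, %.3f seconds left unslept\n", left);
    else
        printf("slept the full 2.5 seconds\n");
    return 0;
}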
The apparent problem with the original suggestion to use usleep was overlooking _GNU_SOURCE, which is a given on Linux for almost any POSIX extension.
This compiles without warning for me, using c99 with the strict warning flags:
#include <unistd.h>
#include <time.h>
int main(void)
{
usleep(1);
nanosleep(0,0);
return 0;
}
and (extracted from other scripts)
ACTUAL_GCC=c99 gcc-stricter -D_GNU_SOURCE -c foo.c
#!/bin/sh
OPTS="-O -Wall -Wmissing-prototypes -Wstrict-prototypes -Wshadow -Wconversion
-W
-Wbad-function-cast
-Wcast-align
-Wcast-qual
-Wmissing-declarations
-Wnested-externs
-Wpointer-arith
-Wwrite-strings
-ansi
-pedantic
"
${ACTUAL_GCC:-gcc} $OPTS "$@"
The usleep and nanosleep functions are extensions, as noted. More commonly, select or poll are used. Those functions are used to wait a specified interval of time for activity on one or more open files. As a special case, select does not even have to wait for a file; it can simply use the time interval. Like nanosleep, select uses a structure for the time interval to allow fractional-second values, but its resolution starts at microseconds while nanosleep's starts at nanoseconds.
These functions are all part of the Linux C library.
ncurses (like any X/Open curses) provides a function napms which waits for milliseconds. Internally, ncurses uses (since 1998) nanosleep or usleep depending on their availability (they are extensions after all), or an internal function that uses select or poll, also depending on availability.
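A minimal sketch of the select-based approach just described, using only the timeout and no file descriptors (fsleep is my name for it, not a standard function):
#include <stdio.h>
#include <sys/select.h>

/* Pause for a fractional number of seconds using select's timeout only. */
static void fsleep(double seconds)
{
    struct timeval tv;
    tv.tv_sec = (long)seconds;
    tv.tv_usec = (long)((seconds - (double)tv.tv_sec) * 1e6);
    select(0, NULL, NULL, NULL, &tv);
}

int main(void)
{
    printf("pausing 0.25 s...\n");
    fsleep(0.25);
    printf("done\n");
    return 0;
}
Unlike the nanosleep-based versions above, this sketch does not report how much time was left if the call is interrupted by a signal.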
Use this macro to pause for non-integer seconds:
#include "unistd.h"
#define mysleep(s) usleep(s*1000000)
Note that sleep() takes a whole number of seconds, not milliseconds, so it cannot pause for a fraction of a second. For sub-second pauses pass microseconds to usleep(), as described above: usleep(1000000) pauses for one second, usleep(2500000) for 2.5 seconds, and usleep(500000) for half a second.
This will let you specify a delay as a float in seconds.
The function loops (busy-waits) until the difference between start and stop is greater than the requested delay.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void floatdelay ( float delay);
int main()
{
float pause = 1.2;
floatdelay ( pause);
return 0;
}
void floatdelay ( float delay)
{
clock_t start;
clock_t stop;
start = clock();
stop = clock();
while ( ( (float)( stop - start)/ CLOCKS_PER_SEC) < delay) {
stop = clock();
}
}
Question
I am testing a simple code which calculates Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks if a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I have read the overhead is usually negligible, and the highest overhead I came across was about 6%, yet I measured around 30% overhead. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follows:
#iter C (fPIC) C C/C(fPIC)
1 0.01 0.01 1.00
100 0.04 0.03 0.75
200 0.06 0.04 0.67
500 0.15 0.1 0.67
1000 0.28 0.19 0.68
2000 0.56 0.37 0.66
4000 1.11 0.72 0.65
8000 2.21 1.47 0.67
16000 4.42 2.88 0.65
32000 8.8 5.77 0.66
64000 17.6 11.53 0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
int it;
for(it = 0; it < 11; ++it)
{
clock_t start = clock();
fractal(iterNumber[it]);
clock_t end = clock();
double millis = (end - start)*1000 / CLOCKS_PER_SEC/(double)1000;
printf("Iter: %d, time: %lf \n", iterNumber[it], millis);
}
return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
*res_re = a_re*b_re - a_im*b_im;
*res_im = a_re*b_im + a_im*b_re;
}
void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
double zPrev_re = P_re;
double zPrev_im = P_im;
double zNext_re = 0;
double zNext_im = 0;
double* p_zNext_re = &zNext_re;
double* p_zNext_im = &zNext_im;
int i;
for(i = 1; i <= iter; ++i)
{
sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
zNext_re = zNext_re + C_re;
zNext_im = zNext_im + C_im;
if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
{
return false;
}
zPrev_re = zNext_re;
zPrev_im = zNext_im;
}
return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
int noIterations = iter;
double xMin = -1.8;
double xMax = 1.6;
double yMin = -1.3;
double yMax = 0.8;
int xDim = 512;
int yDim = 384;
double P_re, P_im;
int nop;
int x, y;
for(x = 0; x < xDim; ++x)
for(y = 0; y < yDim; ++y)
{
P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
if(isMandelbrot(P_re, P_im, noIterations))
nop = x+y;
}
printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments), so a few words of explanation. First I compiled the program only as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua, and saw big time differences, but couldn't understand why they were growing with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only I don't need the .so), the times are very similar to C without -fPIC. I have checked this in a few configurations over the last few days, and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
The very same assumption can be made if you compile without -fPIC, i.e. the compiler can safely assume that a global symbol defined in the executable cannot be interposed because the lookup scope begins with the executable itself which is then followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
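Concretely, a small self-contained illustration of the static suggestion above (using one of the helpers from fractal.c): a symbol with internal linkage cannot be interposed, so GCC may inline it even under -fPIC.
#include <stdio.h>

/* Same helper as in fractal.c, but with internal linkage */
static void multiplyComplex(double a_re, double a_im, double b_re, double b_im,
                            double *res_re, double *res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}

int main(void)
{
    double re, im;
    multiplyComplex(1.0, 2.0, 3.0, 4.0, &re, &im);
    printf("%f %f\n", re, im);   /* -5.000000 10.000000 */
    return 0;
}
In fractal.c you would do the same for sqComplex, isInSet and isMandelbrot, keeping only fractal() (and whatever your Lua code calls) externally visible.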
As other people have already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest; this allows GCC to optimize hidden functions more aggressively (see the sketch below);
use private symbol aliases (__attribute__((alias ("__f")));) to refer to library functions from within the library; this again unties GCC's hands;
the previous suggestion can be automated with the -fno-semantic-interposition flag added in recent GCC versions.
It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
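A sketch of the visibility approach from the list above (the names lib_entry and square are mine, purely for illustration): compile the library with -fvisibility=hidden so every symbol defaults to hidden, then mark only the intended entry points as default-visible.
/* demo.c: build with  gcc -O3 -fPIC -fvisibility=hidden -shared demo.c -o libdemo.so */
#define EXPORT __attribute__((visibility("default")))

/* External linkage but hidden visibility (from -fvisibility=hidden):
   not exported from the .so, so GCC can still inline it under -fPIC. */
double square(double x)
{
    return x * x;
}

EXPORT double lib_entry(double x)   /* the only symbol users of the library see */
{
    return square(x) + 1.0;
}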
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
In general, link time optimisations can lead to significant reductions in code size (~6%) (link to paper on link time optimisations in gold), and thus run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses this as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC try -fpic instead (link to gcc docs). For x86 it might not make a difference, but you need to check this for yourself or ask on gcc-help.
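For example, reusing the commands from the question, the LTO builds would look something like this:
gcc -O3 -flto -fPIC fractalMain.c fractal.c -o ffpic_lto
gcc -O3 -flto fractalMain.c fractal.c -o f_lto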
How can I demonstrate to students the usefulness of the likely and unlikely compiler hints (__builtin_expect)?
Can you write some sample code which will be several times faster with these hints than the same code without them?
Here is the one I use, a really inefficient implementation of the Fibonacci numbers:
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
uint64_t fib(uint64_t n)
{
if (opt(n == 0 || n == 1)) { /* opt is supplied via -Dopt= on the gcc command line */
return n;
} else {
return fib(n - 2) + fib(n - 1);
}
}
int main(int argc, char **argv)
{
int i, max = 45;
clock_t tm;
if (argc == 2) {
max = atoi(argv[1]);
assert(max > 0);
} else {
assert(argc == 1);
}
tm = -clock();
for (i = 0; i <= max; ++i)
printf("fib(%d) = %" PRIu64 "\n", i, fib(i));
tm += clock();
printf("Time elapsed: %.3fs\n", (double)tm / CLOCKS_PER_SEC);
return 0;
}
To demonstrate, using GCC:
~% gcc -O2 -Dopt= -o test-nrm test.c
~% ./test-nrm
...
fib(45) = 1134903170
Time elapsed: 34.290s
~% gcc -O2 -Dopt=unlikely -o test-opt test.c
~% ./test-opt
...
fib(45) = 1134903170
Time elapsed: 33.530s
A few hundred milliseconds less. This gain is due to the programmer-aided branch prediction.
But now, for what the programmer should really be doing instead:
~% gcc -O2 -Dopt= -fprofile-generate -o test.prof test.c
~% ./test.prof
...
fib(45) = 1134903170
Time elapsed: 77.530s    (this run is slowed down by profile generation)
~% gcc -O2 -Dopt= -fprofile-use -o test.good test.c
~% ./test.good
fib(45) = 1134903170
Time elapsed: 17.760s
With compiler-aided runtime profiling, we managed to reduce from the original 34.290s to 17.760s. Much better than with programmer-aided branch prediction!
From this blog post: I think likely and unlikely are mostly obsolete. Very cheap CPUs (the ARM Cortex A20 in the example) have branch predictors, and there is no penalty regardless of whether the jump is taken or not. When you introduce likely/unlikely, the results will be either the same or worse (because the compiler has generated more instructions).