I need to compare the performance of various pthread constructs (mutexes, semaphores, read-write locks) against the corresponding serial programs by designing some experiments. The main problem is deciding how to measure the execution time of the code for the analysis.
I have read about some C functions like clock(), gettimeofday() etc. From what I could understand - we can use clock() to get the actual number of CPU cycles used by a program (by subtracting value returned by the function at the start and end of the code whose time we want to measure), gettimeofday() returns the wall-clock time for the execution of the program.
But the problem is that total CPU time does not appear to be a good criterion to me, as it sums the CPU time taken across all the threads running in parallel (so clock() is not good, in my view). Wall-clock time is not good either, since there might be other processes running in the background, so the measured time ultimately depends on how the threads get scheduled (so gettimeofday() is also not good, in my view).
The other functions that I know of do much the same as the two above. So I wanted to know whether there is some function I can use for my analysis, or whether I am wrong somewhere in my conclusions above.
From linux clock_gettime:
CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
Per-process CPU-time clock (measures CPU time consumed by all
threads in the process).
CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
Thread-specific CPU-time clock.
I believed clock() was implemented somewhere as clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...), but I see it's implemented using times() in glibc.
So if you want to measure thread-specific CPU time, you can use clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) on GNU/Linux systems.
Never use gettimeofday or clock_gettime(CLOCK_REALTIME, ...) to measure the execution time of a program. Don't even think about it. gettimeofday is the "wall clock": you can display it on the wall in your room, and it can be adjusted while your program is running. If you want to measure the flow of time, forget gettimeofday.
If you want, you can even stay fully POSIX-compatible by calling pthread_getcpuclockid inside your thread and passing its returned clock_id value to clock_gettime.
I am not sure that summing an array is a good test: you do not need any mutex to sum an array with multiple threads, since each thread just has to sum a dedicated part of the array, and there are a lot of memory accesses for little CPU computation. Example (the values of SZ and NTHREADS are given when compiling); the measured time is the real (monotonic) time:
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
static int Arr[SZ];
void * thSum(void * a)
{
int s = 0, i;
int sup = *((int *) a) + SZ/NTHREADS;
for (i = *((int *) a); i != sup; ++i)
s += Arr[i];
*((int *) a) = s;
return NULL; /* a void * function must return a value */
}
int main()
{
int i;
for (i = 0; i != SZ; ++i)
Arr[i] = rand();
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
int s = 0;
for (i = 0; i != SZ; ++i)
s += Arr[i];
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("mono thread : %d %lf\n", s,
(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);
clock_gettime(CLOCK_MONOTONIC, &t0);
int n[NTHREADS];
pthread_t ths[NTHREADS];
for (i = 0; i != NTHREADS; ++i) {
n[i] = SZ / NTHREADS * i;
if (pthread_create(&ths[i], NULL, thSum, &n[i])) {
printf("cannot create thread %d\n", i);
return -1;
}
}
int s2 = 0;
for (i = 0; i != NTHREADS; ++i) {
pthread_join(ths[i], NULL);
s2 += n[i];
}
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("%d threads : %d %lf\n", NTHREADS, s2,
(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);
}
Compilations and executions:
(array of 100.000.000 elements)
/tmp % gcc -DSZ=100000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035217
2 threads : 563608529 0.020407
/tmp % ./a.out
mono thread : 563608529 0.034991
2 threads : 563608529 0.022659
/tmp % gcc -DSZ=100000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035212
4 threads : 563608529 0.014234
/tmp % ./a.out
mono thread : 563608529 0.035184
4 threads : 563608529 0.014163
/tmp % gcc -DSZ=100000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035229
8 threads : 563608529 0.014971
/tmp % ./a.out
mono thread : 563608529 0.035142
8 threads : 563608529 0.016248
(array of 1.000.000.000 elements)
/tmp % gcc -DSZ=1000000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.343761
2 threads : -1471389927 0.197303
/tmp % ./a.out
mono thread : -1471389927 0.346682
2 threads : -1471389927 0.197669
/tmp % gcc -DSZ=1000000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346859
4 threads : -1471389927 0.130639
/tmp % ./a.out
mono thread : -1471389927 0.346506
4 threads : -1471389927 0.130751
/tmp % gcc -DSZ=1000000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346954
8 threads : -1471389927 0.123572
/tmp % ./a.out
mono thread : -1471389927 0.349652
8 threads : -1471389927 0.127059
As you can see, the execution time is not simply divided by the number of threads; the bottleneck is probably memory access.
Hello, I am trying to learn OpenMP and I am confused by the results. I have pi.c:
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 6
static long num_steps = 1000000000;
double step;
int main(){
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
double start_time, run_time;
omp_set_num_threads(NUM_THREADS);
start_time = omp_get_wtime();
#pragma omp parallel
{
int i, id, nthrds; double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i = id, sum[id] = 0.0; i < num_steps; i+=nthrds){
x = (i+0.5)*step;
sum[id] += 4.0 / (1.0+x*x);
}
}
for(i = 0, pi = 0.0; i < nthreads; i++){
pi += step * sum[i];
}
run_time = omp_get_wtime();
printf("[PI %f TIME %.4fs ON %d THREADS]\n", pi, (run_time - start_time), nthreads);
}
When I compile with gcc -fopenmp -Wall -Wextra pi.c I get these results:
[PI 3.141593 TIME 3.8663s ON 1 THREADS]
[PI 3.141593 TIME 7.9291s ON 2 THREADS]
[PI 3.141593 TIME 8.4961s ON 3 THREADS]
[PI 3.141593 TIME 10.8343s ON 4 THREADS]
[PI 3.141593 TIME 9.7167s ON 5 THREADS]
[PI 3.141593 TIME 10.0182s ON 6 THREADS]
But when I compile with gcc -fopenmp -Ofast -Wall -Wextra pi.c I get the results I expected:
[PI 3.141593 TIME 1.8380s ON 1 THREADS]
[PI 3.141593 TIME 0.7553s ON 2 THREADS]
[PI 3.141593 TIME 0.5525s ON 3 THREADS]
[PI 3.141593 TIME 0.3930s ON 4 THREADS]
[PI 3.141593 TIME 0.3694s ON 5 THREADS]
[PI 3.141593 TIME 0.3287s ON 6 THREADS]
-O2 and -O3 behave similarly to -Ofast, while -O1 gives results similar to compiling without optimizations, with more threads giving worse results.
Short answer: No, you don't need -Ofast to run OpenMP properly.
If you do man gcc, you can see
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations.
It also enables optimizations that are not valid for all standard-compliant
programs. It turns on -ffast-math and the Fortran-specific
-fno-protect-parens and -fstack-arrays.
So basically, -Ofast turns on the -O3 optimisations plus some others, hence the faster code.
If you check -ffast-math (set by -Ofast) in the manual, you can see:
-ffast-math
This option causes the preprocessor macro "__FAST_MATH__" to be defined.
This option is not turned on by any -O option besides -Ofast since it can
result in incorrect output for programs that depend on an exact implementation
of IEEE or ISO rules/specifications for math functions. It may, however, yield
faster code for programs that do not require the guarantees of these
specifications.
The key point here is that -Ofast disregards strict standards compliance, and it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions (which is also mentioned in the comments).
So to summarize, -Ofast can give you faster results if you don't care about standards compliance or catastrophic cancellation (as mentioned by zwol in the comments), and there could be other issues that I don't know of.
I have two operating systems on my PC with an i7-3770 @ 3.40 GHz. One is the latest Linux Kubuntu 18.04, the other is Windows 10 Pro, running from the same HDD.
I have tested a simple, fun program written in C that does some arithmetic calculations from number theory. On Kubuntu it was compiled with gcc 7.3.0; on Windows with gcc 5.2.0, built by the MinGW-W64 project.
The result is surprising: the program ran 4 times slower on Linux than on Windows.
On Windows the elapsed time is just 6 seconds. On Linux the elapsed time is 24 seconds! On the same hardware.
I tried compiling on Kubuntu with some CPU-specific options like "gcc -corei7" etc., but nothing helped. The program uses the "math.h" library, so it is compiled with "-lm" on both systems. The source code is the same.
Is there a reason for this slow speed under Linux?
Furthermore, I have compiled the same code on an older 32-bit machine with a Core Duo T2250 @ 1.73 GHz under Linux Mint 19 with gcc 7.3.0. The elapsed time was 28 seconds! Not much different from the 64-bit machine running at double the frequency under Linux.
The source code is below; you can compile it and test it.
/* Program for playing with sigma(n) and tau(n) functions */
/* Compilation of code: "gcc name.c -o name -lm" */
#include <stdio.h>
#include <math.h>
#include <time.h>
int main(void)
{
double i, nq, x, zacatek, konec, p;
double odx, soucet, delitel, celkem, ZM;
unsigned long cas1, cas2;
i=(double)0; soucet=(double)0; celkem=(double)0; nq=(double)0;
zacatek=(double)1; konec=(double)1000000; x=zacatek;
ZM=(double)16 / (double)10;
printf("\n Program for playing with sigma(n) and tau(n) functions \n");
printf("---------------------------------------------------------\n");
printf("Calculation is running in range from %.0lf to %.0lf\n\n\n", zacatek, konec);
printf("Finding numbers which have sigma(n)/n = %.3lf\n\n", ZM);
cas1=time(NULL);
while (x <= konec) {
i=1; celkem=0; nq=0;
odx=sqrt(x)+1;
while (i <= odx) {
if (fmod(x, i)==0) {
nq++;
celkem=celkem+x/i+i;
}
i++;
}
nq=2*nq-1;
if ((odx-floor(odx))==0) {celkem=celkem-odx;}
if (fabs(celkem - (ZM*x)) < 0.001) {
printf("%.0lf has sum of all divisors = %.3lf times the number itself (%.0lf, %.0lf)\n", x, ZM, celkem, nq+1);
}
x++;
}
cas2=time(NULL);
printf("\n\nProgram ended.\n\n");
printf("Elapsed time %lu seconds.\n\n", cas2-cas1);
return (0);
}
I thought I'd first share this here to get your opinions before doing anything else. While designing an algorithm, I found that the performance of gcc-compiled code for some simple code was catastrophic compared to clang's.
How to reproduce
Create a test.c file containing this code :
#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
int main(int argc, char *argv[]) {
const uint64_t size = 1000000000;
const size_t alloc_mem = size * sizeof(uint8_t);
uint8_t *mem = (uint8_t*)malloc(alloc_mem);
for (uint_fast64_t i = 0; i < size; i++)
mem[i] = (uint8_t) (i >> 7);
uint8_t block = 0;
uint_fast64_t counter = 0;
uint64_t total = 0x123456789abcdefllu;
uint64_t receiver = 0;
for(block = 1; block <= 8; block ++) {
printf("%u ...\n", block);
counter = 0;
while (counter < size - 8) {
__builtin_memcpy(&receiver, &mem[counter], block);
receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
total += ((receiver * 0x321654987cbafedllu) >> 48);
counter += block;
}
}
printf("=> %llu\n", total);
return EXIT_SUCCESS;
}
gcc
Compile and run :
gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m23.367s
user 0m22.634s
sys 0m0.495s
info :
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)
So we get about 23s of user time. Now let's do the same with cc (clang on macOS) :
clang
cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m9.832s
user 0m9.310s
sys 0m0.442s
info :
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
That's about 2.4x faster! Any thoughts?
I replaced the __builtin_memcpy function by memcpy to test things out and this time the compiled code runs in about 34s on both sides - consistent and slower as expected.
It would appear that the combination of __builtin_memcpy and bitmasking is interpreted very differently by both compilers.
I had a look at the assembly code, but couldn't see anything standing out that would explain such a drop in performance as I'm not an asm expert.
Edit 03-05-2018 :
Posted this bug : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719.
I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (linux) system.
If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same perf as clang (making block a constant is essential to optimize the code), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on cases. Still, feel free to report it to gcc, maybe they'll tweak the unrolling heuristics some day and an extra testcase could help.
Once unrolling is taken care of, gcc does ok for some values (block equals 4 or 8 in particular), but much worse for some others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. Gcc seems to have trouble with memcpy(,,3), it works much better if you always read 8 bytes (the next line already takes care of the extra bytes IIUC). Another thing that could be reported to gcc.
Remark: I feel a little bit stupid about this, but this might help someone
So, I am trying to improve the performance of a program by using parallelism. However, I am encountering an issue with the measured speedup. I have 4 CPUs:
~% lscpu
...
CPU(s): 4
...
However, the speedup is much lower than fourfold. Here is a minimal working example, with a sequential version, a version using OpenMP and a version using POSIX threads (to be sure it is not due to either implementation).
Purely sequential (add_seq.c):
#include <stddef.h>
int main() {
for (size_t i = 0; i < (1ull<<36); i += 1) {
__asm__("add $0x42, %%eax" : : : "eax");
}
return 0;
}
OpenMP (add_omp.c):
#include <stddef.h>
int main() {
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < (1ull<<36); i += 1) {
__asm__("add $0x42, %%eax" : : : "eax");
}
return 0;
}
POSIX threads (add_pthread.c):
#include <pthread.h>
#include <stddef.h>
void* f(void* x) {
(void) x;
const size_t count = (1ull<<36) / 4;
for (size_t i = 0; i < count; i += 1) {
__asm__("add $0x42, %%eax" : : : "eax");
}
return NULL;
}
int main() {
pthread_t t[4];
for (size_t i = 0; i < 4; i += 1) {
pthread_create(&t[i], NULL, f, NULL);
}
for (size_t i = 0; i < 4; i += 1) {
pthread_join(t[i], NULL);
}
return 0;
}
Makefile:
CFLAGS := -O3 -fopenmp
LDFLAGS := -O3 -lpthread # just to be sure
all: add_seq add_omp add_pthread
So, now, running this (using zsh's time builtin):
% make -B && time ./add_seq && time ./add_omp && time ./add_pthread
cc -O3 -fopenmp -O3 -lpthread add_seq.c -o add_seq
cc -O3 -fopenmp -O3 -lpthread add_omp.c -o add_omp
cc -O3 -fopenmp -O3 -lpthread add_pthread.c -o add_pthread
./add_seq 24.49s user 0.00s system 99% cpu 24.494 total
./add_omp 52.97s user 0.00s system 398% cpu 13.279 total
./add_pthread 52.92s user 0.00s system 398% cpu 13.266 total
Checking CPU frequency, sequential code has maximum CPU frequency of 2.90 GHz, and parallel code (all versions) has uniform CPU frequency of 2.60 GHz. So counting billions of instructions:
>>> 24.494 * 2.9
71.0326
>>> 13.279 * 2.6
34.5254
>>> 13.266 * 2.6
34.4916
So, all in all, threaded code is only running twice as fast as sequential code, although it is using four times as much CPU time. Why is it so?
Remark: the assembly for add_omp.c seemed less efficient, since it implemented the for loop by incrementing a register and comparing it to the number of iterations, rather than decrementing and directly checking ZF; however, this had no effect on performance.
Well, the answer is quite simple: there are really only two CPU cores:
% lscpu
...
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
...
So, although htop shows four CPUs, two are virtual and only there because of hyper-threading. Since the core idea of hyper-threading is to share the resources of a single core between two hardware threads, it does not help run similar code faster (it is only useful when the two threads use different resources).
So, in the end, what happens is that time/clock() measures the usage of each logical core as that of the underlying physical core. Since all report ~100% usage, we get a ~400% usage, although it only represents a twofold speedup.
Up until then, I was convinced this computer contained 4 physical cores, and had completely forgotten to check about hyperthreading.
How can I demonstrate to students the usefulness of the likely and unlikely compiler hints (__builtin_expect)?
Can you write a sample program that runs several times faster with these hints than without them?
Here is the one I use, a really inefficient implementation of the Fibonacci numbers:
#include <stdio.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>
#include <stdlib.h> /* for atoi() */
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
uint64_t fib(uint64_t n)
{
if (opt(n == 0 || n == 1)) {
return n;
} else {
return fib(n - 2) + fib(n - 1);
}
}
int main(int argc, char **argv)
{
int i, max = 45;
clock_t tm;
if (argc == 2) {
max = atoi(argv[1]);
assert(max > 0);
} else {
assert(argc == 1);
}
tm = -clock();
for (i = 0; i <= max; ++i)
printf("fib(%d) = %" PRIu64 "\n", i, fib(i));
tm += clock();
printf("Time elapsed: %.3fs\n", (double)tm / CLOCKS_PER_SEC);
return 0;
}
To demonstrate, using GCC:
~% gcc -O2 -Dopt= -o test-nrm test.c
~% ./test-nrm
...
fib(45) = 1134903170
Time elapsed: 34.290s
~% gcc -O2 -Dopt=unlikely -o test-opt test.c
~% ./test-opt
...
fib(45) = 1134903170
Time elapsed: 33.530s
A few hundred milliseconds less. This gain is due to the programmer-aided branch prediction.
But now, for what the programmer should really be doing instead:
~% gcc -O2 -Dopt= -fprofile-generate -o test.prof test.c
~% ./test.prof
...
fib(45) = 1134903170
Time elapsed: 77.530s (this run is slowed down by profile generation)
~% gcc -O2 -Dopt= -fprofile-use -o test.good test.c
~% ./test.good
fib(45) = 1134903170
Time elapsed: 17.760s
With compiler-aided runtime profiling, we managed to reduce from the original 34.290s to 17.760s. Much better than with programmer-aided branch prediction!
From this blog post: I think likely and unlikely are mostly obsolete. Even very cheap CPUs (ARM Cortex A20 in the example) have branch predictors, and there is no penalty whether the jump is taken or not. When you introduce likely/unlikely, the results will be either the same or worse (because the compiler has generated more instructions).