Learning sample of likely() and unlikely() compiler hints in C

How can I demonstrate to students the usefulness of the likely and unlikely compiler hints (__builtin_expect)?
Can you write some sample code that runs several times faster with these hints than the same code without them?

Here is the one I use, a really inefficient implementation of the Fibonacci numbers:
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
#include <inttypes.h>
#include <time.h>
#include <assert.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

/* opt is supplied on the compiler command line: -Dopt= or -Dopt=unlikely */
uint64_t fib(uint64_t n)
{
    if (opt(n == 0 || n == 1)) {
        return n;
    } else {
        return fib(n - 2) + fib(n - 1);
    }
}

int main(int argc, char **argv)
{
    int i, max = 45;
    clock_t tm;

    if (argc == 2) {
        max = atoi(argv[1]);
        assert(max > 0);
    } else {
        assert(argc == 1);
    }

    tm = -clock();
    for (i = 0; i <= max; ++i)
        printf("fib(%d) = %" PRIu64 "\n", i, fib(i));
    tm += clock();

    printf("Time elapsed: %.3fs\n", (double)tm / CLOCKS_PER_SEC);
    return 0;
}
To demonstrate, using GCC:
~% gcc -O2 -Dopt= -o test-nrm test.c
~% ./test-nrm
...
fib(45) = 1134903170
Time elapsed: 34.290s
~% gcc -O2 -Dopt=unlikely -o test-opt test.c
~% ./test-opt
...
fib(45) = 1134903170
Time elapsed: 33.530s
A few hundred milliseconds less. This gain is due to the programmer-aided branch prediction.
But now, here is what the programmer should really be doing instead:
~% gcc -O2 -Dopt= -fprofile-generate -o test.prof test.c
~% ./test.prof
...
fib(45) = 1134903170
Time elapsed: 77.530s    (this run is slowed down by profile generation)
~% gcc -O2 -Dopt= -fprofile-use -o test.good test.c
~% ./test.good
fib(45) = 1134903170
Time elapsed: 17.760s
With compiler-aided runtime profiling, we managed to reduce from the original 34.290s to 17.760s. Much better than with programmer-aided branch prediction!

From this blog post. I think likely and unlikely are mostly obsolete. Even very cheap CPUs (the Allwinner A20 in the example) have branch predictors, so there is no penalty regardless of whether the jump is taken or not. When you introduce likely/unlikely, the results will be either the same or worse (because the compiler has generated more instructions).
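For reference, these macros are commonly defined (Linux-kernel style) with a double negation, so that any non-zero condition is normalized to exactly 1 before being compared with the expected value. A minimal sketch, not taken from the code above:

#include <stddef.h>

/* Sketch of the usual likely/unlikely definitions.
 * !!(x) turns any truthy value (a pointer, 42, ...) into exactly 1,
 * so it matches the constant passed as the "expected" argument. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Typical use: annotate the rare path, e.g. error handling. */
int process(const char *buf)
{
    if (unlikely(buf == NULL))
        return -1;      /* cold path */
    return 0;           /* hot path */
}

Newer GCC releases (9 and later) also offer __builtin_expect_with_probability for finer-grained hints.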

Related

Sequences in C in a time efficient manner

I am making a simulation that updates at every timestep. I nearly 'kill' the virtual organisms in my grid (it's a cellular automaton) every 20000 timesteps. I want to write out data at killing_time - 10000 and killing_time - 100 for each of the 200 times I kill. I could write a for loop and iterate from 1 to 200 like this:
for (i = 1; i <= 200; i++)
{
    if (Time % (i * killing_time - 10000) == 0 || Time % (i * killing_time - 100) == 0)
    {
etcetera. But then I would have to loop from 1 to 200 at every timestep and do this calculation. How do I do this in an intelligent manner? Bram
As far as I understand your problem, this could be something like this...
/**
  gcc -std=c99 -o prog_c prog_c.c \
      -pedantic -Wall -Wextra -Wconversion \
      -Wc++-compat -Wwrite-strings -Wold-style-definition -Wvla \
      -g -O0 -UNDEBUG -fsanitize=address,undefined
**/

#include <stdio.h>

void
simulate(void)
{
    const int killing_period = 20000;
    int killing_time = killing_period;
    int Time = 1;
    for (int kill_event = 1; kill_event <= 200; ++kill_event)
    {
        for (; Time < killing_time; ++Time)
        {
            if (Time == (killing_time - 10000) || Time == (killing_time - 100))
            {
                printf("display at %d\n", Time);
            }
            // recompute the grid here
        }
        printf("kill at %d\n", Time);
        killing_time += killing_period;
    }
    printf("DONE\n");
}

int
main(void)
{
    simulate();
    return 0;
}
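If the grid update has to stay in its own per-timestep loop, another option is a constant-time modular check, so there is no need to scan the 200 kill events at each timestep. A minimal sketch under the same assumption of a fixed 20000-step killing period (maybe_write_data is a hypothetical helper, not from the question):

#include <stdio.h>

/* With killing_time = k * 20000 for k = 1..200, the write-out points
 * killing_time - 10000 and killing_time - 100 are exactly the timesteps
 * where Time % 20000 equals 10000 or 19900. */
static void maybe_write_data(int Time)
{
    const int killing_period = 20000;
    if (Time > 200 * killing_period)
        return;                      /* past the last kill event */
    int phase = Time % killing_period;
    if (phase == killing_period - 10000 || phase == killing_period - 100)
        printf("display at %d\n", Time);
}

int main(void)
{
    for (int Time = 1; Time <= 200 * 20000; ++Time) {
        maybe_write_data(Time);
        /* recompute the grid here */
    }
    return 0;
}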

Confusion about Timing Measurements

I have been tasked with implementing three different functions, get_current_time_seconds1, 2 and 3, and then I have to estimate the resolution of each of them. How would I estimate this?
Which timing function would you suggest using? What do the compiler options -O0 -lrt mean when I have to compile with gcc -O0 -lrt timing.c -o timing?
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define BILLION 1000000000L
#define LIMIT_I 1000
#define LIMIT_J 1000

double get_current_time_seconds1()
{
    /* Get the current time using time()/localtime() */
    time_t t = time(NULL);
    struct tm *tm = localtime(&t);
    printf("%s\n", asctime(tm));
    return (double) t;  /* seconds since the Unix epoch */
}

double get_current_time_seconds2()
{
    struct timespec start, stop;
    clock_gettime(CLOCK_REALTIME, &start);
    clock_gettime(CLOCK_REALTIME, &stop);
    double x = (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec);
    printf("%lf\n", x);
    return (double) x;
}

double get_current_time_seconds3()
{
    uint64_t diff;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    sleep(5);
    clock_gettime(CLOCK_MONOTONIC, &end);
    diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
    printf("elapsed time = %llu nanoseconds\n", (long long unsigned int)diff);
    return (double) diff;
}
How would I estimate this? Which timing function would you suggest using?
If you want to get the resolution (precision) of the various timing elements, you can use the clock_getres function, passing in the various CLOCK_ id types, for example:
#include <stdio.h>
#include <time.h>

static void printres(clockid_t id)
{
    struct timespec ts;
    int rc = clock_getres(id, &ts);
    printf("clock id: %d\n", (unsigned int)id);
    if (rc != 0) {
        printf("Error: %d\n", rc);
        return;
    }
    printf("tv_sec = %lu\ntv_nsec = %lu\n", ts.tv_sec, ts.tv_nsec);
}

int main(int argc, char** argv)
{
    printres(CLOCK_REALTIME);
    printres(CLOCK_MONOTONIC);
    printres(CLOCK_PROCESS_CPUTIME_ID);
    printres(CLOCK_THREAD_CPUTIME_ID);
    return 0;
}
On my system, tv_nsec is 1000 for all clocks except CLOCK_THREAD_CPUTIME_ID, where it is 1. This means the resolution of the other clock types is 1 microsecond (1000 nanoseconds), while the resolution of CLOCK_THREAD_CPUTIME_ID is 1 nanosecond.
For your first function, which calls time()/localtime(), the resolution is 1 second, since time() returns the time since the Unix epoch in whole seconds.
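You can also estimate the resolution empirically instead of (or in addition to) asking clock_getres: read a clock repeatedly until the reported value changes, and keep the smallest step you observe. A minimal sketch, not part of the original answer:

#include <stdio.h>
#include <time.h>

/* Spin until clock_gettime() reports a new value and return that step
 * in nanoseconds; the smallest step over many trials approximates the
 * effective resolution of the clock. */
static long measure_step_ns(clockid_t id)
{
    struct timespec a, b;
    clock_gettime(id, &a);
    do {
        clock_gettime(id, &b);
    } while (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);
    return (long)(b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    long best = measure_step_ns(CLOCK_MONOTONIC);
    for (int i = 0; i < 1000; ++i) {
        long step = measure_step_ns(CLOCK_MONOTONIC);
        if (step < best)
            best = step;
    }
    printf("smallest observed step: %ld ns\n", best);
    return 0;
}

(On older glibc this also needs -lrt, which ties in with the second question below.)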
What do the compiler options -O0 -lrt mean when I have to compile with gcc -O0 -lrt timing.c -o timing?
For compilers like gcc and clang, the -O option sets the optimization level used when compiling, so -O0 means not to optimize the code at all; this is usually useful when debugging.
The -l option tells the linker to link against the specified library, so -lrt links against the rt library, the POSIX "real-time" library; this is necessary on some systems because functions such as clock_gettime are implemented there.
I hope that can help.

gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

I thought I'd first share this here to get your opinions before doing anything else. While designing an algorithm, I found that the performance of gcc-compiled code for some simple code was catastrophic compared to clang's.
How to reproduce
Create a test.c file containing this code:
#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = (uint8_t*)malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);

    uint8_t block = 0;
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    for (block = 1; block <= 8; block++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }
    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}
gcc
Compile and run:
gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m23.367s
user 0m22.634s
sys 0m0.495s
info:
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)
So we get about 23s of user time. Now let's do the same with cc (clang on macOS):
clang
cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377
real 0m9.832s
user 0m9.310s
sys 0m0.442s
info:
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
That's roughly 2.5x faster! Any thoughts?
I replaced __builtin_memcpy with memcpy to test things out, and this time the compiled code runs in about 34s with both compilers: consistent, and slower as expected.
It would appear that the combination of __builtin_memcpy and bitmasking is interpreted very differently by the two compilers.
I had a look at the assembly code but, not being an asm expert, couldn't see anything standing out that would explain such a drop in performance.
Edit 03-05-2018: posted this bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719.
I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (linux) system.
If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same performance as clang (making block a constant is essential to optimize the code; see the sketch below), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on the case. Still, feel free to report it to gcc; maybe they'll tweak the unrolling heuristics some day, and an extra testcase could help.
Once unrolling is taken care of, gcc does ok for some values (block equal to 4 or 8 in particular), but much worse for others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. Gcc seems to have trouble with memcpy(,,3); it works much better if you always read 8 bytes (the next line already takes care of the extra bytes, IIUC). Another thing that could be reported to gcc.
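To make the suggestion concrete, here is a cut-down, self-contained sketch of the unrolling idea. It assumes gcc 8+ for #pragma GCC unroll and uses a smaller buffer than the original test.c so it runs quickly; it is an illustration, not the exact program from the bug report:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const uint64_t size = 100000000;          /* 10x smaller than the original */
    uint8_t *mem = malloc(size);
    for (uint64_t i = 0; i < size; i++)
        mem[i] = (uint8_t)(i >> 7);

    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    /* Fully unrolling the 8-iteration loop over block turns the length
     * argument of __builtin_memcpy into a compile-time constant in every
     * unrolled body, which is what lets gcc specialize the copies. */
#pragma GCC unroll 16
    for (uint8_t block = 1; block <= 8; block++) {
        uint64_t counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - (block << 3)));
            total += (receiver * 0x321654987cbafedllu) >> 48;
            counter += block;
        }
    }
    printf("=> %llu\n", (unsigned long long)total);
    free(mem);
    return EXIT_SUCCESS;
}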

Lower than expected speedup when using multithreading

Remark: I feel a little bit stupid about this, but this might help someone
So, I am trying to improve the performance of a program by using parallelism. However, I am encountering an issue with the measured speedup. I have 4 CPUs:
~% lscpu
...
CPU(s): 4
...
However, the speedup is much lower than fourfold. Here is a minimal working example, with a sequential version, a version using OpenMP and a version using POSIX threads (to be sure it is not due to either implementation).
Purely sequential (add_seq.c):
#include <stddef.h>

int main() {
    for (size_t i = 0; i < (1ull << 36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}
OpenMP (add_omp.c):
#include <stddef.h>

int main() {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < (1ull << 36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}
POSIX threads (add_pthread.c):
#include <pthread.h>
#include <stddef.h>

void* f(void* x) {
    (void) x;
    const size_t count = (1ull << 36) / 4;
    for (size_t i = 0; i < count; i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return NULL;
}

int main() {
    pthread_t t[4];
    for (size_t i = 0; i < 4; i += 1) {
        pthread_create(&t[i], NULL, f, NULL);
    }
    for (size_t i = 0; i < 4; i += 1) {
        pthread_join(t[i], NULL);
    }
    return 0;
}
Makefile:
CFLAGS := -O3 -fopenmp
LDFLAGS := -O3 -lpthread # just to be sure
all: add_seq add_omp add_pthread
So, now, running this (using zsh's time builtin):
% make -B && time ./add_seq && time ./add_omp && time ./add_pthread
cc -O3 -fopenmp -O3 -lpthread add_seq.c -o add_seq
cc -O3 -fopenmp -O3 -lpthread add_omp.c -o add_omp
cc -O3 -fopenmp -O3 -lpthread add_pthread.c -o add_pthread
./add_seq 24.49s user 0.00s system 99% cpu 24.494 total
./add_omp 52.97s user 0.00s system 398% cpu 13.279 total
./add_pthread 52.92s user 0.00s system 398% cpu 13.266 total
Checking the CPU frequency, the sequential code runs at the maximum CPU frequency of 2.90 GHz, while the parallel code (all versions) runs at a uniform 2.60 GHz. So, counting billions of clock cycles:
>>> 24.494 * 2.9
71.0326
>>> 13.279 * 2.6
34.5254
>>> 13.266 * 2.6
34.4916
So, all in all, the threaded code only runs twice as fast as the sequential code, although it uses four times as much CPU time. Why is that?
Remark: the assembly for add_omp.c seemed less efficient, since it implemented the for-loop by incrementing a register and comparing it to the number of iterations, rather than decrementing and directly checking ZF; however, this had no effect on performance.
Well, the answer is quite simple: there are really only two CPU cores:
% lscpu
...
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
...
So, although htop shows four CPUs, two are virtual and only there because of hyperthreading. Since the core idea of hyper-threading is to share the resources of a single core between two threads, it does not help run similar code faster (it is only useful when the two threads use different resources).
So, in the end, what happens is that time/clock() measures the usage of each logical core as that of the underlying physical core. Since all report ~100% usage, we get a ~400% usage, although it only represents a twofold speedup.
Up until then, I was convinced this computer contained 4 physical cores, and had completely forgotten to check about hyperthreading.
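If you want to check this from code rather than from lscpu, note that the usual API only reports logical CPUs. A minimal sketch (it only illustrates why the CPU count alone is misleading):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_NPROCESSORS_ONLN counts *logical* CPUs (hardware threads).
     * With 2 cores and 2-way hyper-threading it reports 4, exactly like
     * the CPU(s) line of lscpu; the physical core count has to come from
     * lscpu ("Core(s) per socket") or /proc/cpuinfo ("cpu cores"). */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical CPUs: %ld\n", logical);
    return 0;
}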

gcc __builtin_expect doesn't seem to generate correct code

The following two code snippets produce exactly the same assembly code, even though the branches are annotated with different branch predictions.
Let's say that we have test0.c
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

int bar0();
int bar1();
int bar2();
int bar3();

int foo(int arg0) {
    if (likely(arg0 > 100)) {
        return bar0();
    } else if (likely(arg0 < -100)) {
        return bar1();
    } else if (likely(arg0 > 0)) {
        return bar2();
    } else {
        return bar3();
    }
}
and test1.c
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

int bar0();
int bar1();
int bar2();
int bar3();

int foo(int arg0) {
    if (unlikely(arg0 > 100)) {
        return bar0();
    } else if (unlikely(arg0 < -100)) {
        return bar1();
    } else if (unlikely(arg0 > 0)) {
        return bar2();
    } else {
        return bar3();
    }
}
As you can see by comparing the two snippets, they have different branch predictions for each branch (likely() vs. unlikely()).
However, when compiled on a Linux box (Ubuntu 12.04 32-bit, gcc 4.6.3), the two sources produce virtually the same output.
$ gcc -c -S -o test0.s test0.c
$ gcc -c -S -o test1.s test1.c
$ diff test0.s test1.s
1c1
< .file "test0.c"
---
> .file "test1.c"
If anyone can explain this, it will be a big help.
Thanks for your help in advance!
The two files you've posted are identical -- I assume this isn't what you've really done.
Compile with -O2 or higher; you need to turn on optimisation. This should then generate different code.
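For example, rerunning the same diff with optimization enabled (a sketch of the commands only; the actual assembly will vary by gcc version and target):
$ gcc -O2 -S -o test0.s test0.c
$ gcc -O2 -S -o test1.s test1.c
$ diff test0.s test1.s
With -O2 the diff should show more than just the .file line, typically because the block that __builtin_expect marks as likely is laid out as the fall-through path.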
I did some measurements on an ARM Cortex-A7 board (Allwinner sun7i A20) with gcc 6.3 (-O3), and there was no performance difference between likely() and unlikely(), even though it was clear from other tests that taking a branch was more expensive than not taking it, even in the case of perfect branch prediction.

Resources