using -O2 decreases time of bubble sort C program - c

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void sort();

int main() {
    int i;
    for (i = 0; i < 100000; i++) {
        sort();
    }
}

void sort() {
    int i, j, k, array[100], l = 99, m;
    for (i = 0; i < 100; i++) {
        array[i] = rand() % 1000 + 1;
    }
    for (k = 0; k < 99; k++) {
        for (j = 0; j < l; j++) {
            if (array[j + 1] > array[j]) {
                int temp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = temp;
            }
        }
        l--;
    }
    for (m = 0; m < 100; m++) {
        printf("%d ", array[m]);
    }
}
On the Linux shell, I compile with gcc sort.c -o sort and then run time ./sort >> out.
If I instead compile with gcc -O2 sort.c -o sort (and similarly -O3 and -O4), the time keeps decreasing. How do the optimization options work? Please explain in terms of real time, user time and system time.
PS: The code might be a little inefficient. Kindly ignore that.

Optimization options do their work between the reading of the source code and the writing of the binary machine instructions.
GCC is a multi-phase compiler, where the phases roughly consist of:
1. Creating "tokens" from the input text.
2. Arranging those tokens into an abstract syntax tree.
3. Pruning the abstract syntax tree.
4. Creating register-based instructions, assuming an infinite number of CPU registers.
5. Mapping those virtual registers onto the actual registers available.
6. Writing the binary information out in the loader's expected format.
Optimizations can affect a number of these phases; typically they become active in steps 3 through 5 above. There are many optimizations, including:
Constant folding – Evaluate constant subexpressions in advance.
Strength reduction – Replace slow operations with faster equivalents.
Null sequences – Delete useless operations.
Combine operations – Replace several operations with one equivalent.
Algebraic laws – Use algebraic laws to simplify or reorder instructions.
Special case instructions – Use instructions designed for special operand cases.
Address mode operations – Use address modes to simplify code.
Loop unrolling – Replace a loop with an equivalent sequence of straight-line instructions.
Partial loop unrolling – Reduce the number of times the loop condition is evaluated while preserving the overall behavior.
Note that these are not all the optimizations that might be performed, but it starts to give you an idea.
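To make two of these concrete, here is a rough sketch (the exact transformations depend on the compiler and the target) of what constant folding and strength reduction might effectively do to a small function:

/* What the programmer wrote: */
int area(int n) {
    int w = 3 * 4;      /* constant folding: 3 * 4 becomes 12 at compile time   */
    int h = n * 2;      /* strength reduction: n * 2 may become n << 1 or n + n */
    return w * h;
}

/* Roughly what the optimizer may treat it as, written back out as C:
   int area(int n) { return 24 * n; }
*/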
For example, if the compiler sees
int s = 3;
while (s < 6) {
    printf("%d\n", s);
    s++;
}
and the flags are set to unroll loops, then it might write CPU instructions equivalent to
printf("%d\n", 3);
printf("%d\n", 4);
printf("%d\n", 5);
Those instructions might look more verbose to us humans, but the machine code can be smaller and faster, because there is no longer any need to look up the now-erased value of s, add one to it, or store the updated value back into RAM.
GCC arranges the optimizations into categories, ranging from "safe" to "risky". -O2 is a good compromise between speed and safety. Higher -O numbers are riskier.

The -O compiler flag controls the amount of compiler optimization that you wish the compiler to perform. In short, building the project will take longer but the resulting executable should be faster. For more information, type man gcc into the command prompt or gcc -c -Q -O3 --help=optimizers for specific information regarding the optimizations performed for a particular flag.

-O stands for optimize: gcc will automatically take the steps necessary to optimize your program. You can read more about the specific steps that GCC takes to optimize your program here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
But essentially, -O2 is more optimized than -O1, and -O3 more than -O2. This can come with trade-offs in compiled binary size: the resulting binary could use more space but run faster, and vice versa. You can actually paste your code into https://godbolt.org/, pick a compiler from the dropdown and type -O1 (or any other optimization option) in the flags box, and godbolt will show you what the resulting assembly looks like. You will be able to see a difference between -O1 and -O2; namely, the -O2 generated code is usually shorter and uses a lot of shortcuts to implement your algorithm.
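For instance, here is a small, self-contained function (purely illustrative, not taken from the question) that you could paste into godbolt to compare the output at different levels; at higher optimization levels many compilers vectorize the loop or even replace it with a closed-form expression:

/* Paste into https://godbolt.org/ and compare the -O1, -O2 and -O3 output. */
long sum_upto(long n) {
    long s = 0;
    for (long i = 0; i < n; ++i)
        s += i;           /* sum 0 + 1 + ... + (n-1) */
    return s;
}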

gcc offers a number of optimization flags. You can see what each one does specifically here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
There's always a trade-off with optimizations, whether it is increased compile time, increased memory use during compilation, etc.
There are dozens of optimizations enabled by the -O2 flag, so it might not be immediately clear which specific ones affect the sorting. Instead of -O2, you can try each optimization individually, for example using the -falign-loops flag, to see whether that is the one providing the performance increase.

Related

Benchmarking C bubblesort performance compared to Julia

I wanted to create a formal comparison between C and Julia performance. For this purpose I wanted to compare different sorting algorithms, starting with bubble sort. In Julia I wrote it like:
using BenchmarkTools
function bubble_sort(v::AbstractArray{T}) where T<:Real
    for _ in 1:length(v)-1
        for i in 1:length(v)-1
            if v[i] > v[i+1]
                v[i], v[i+1] = v[i+1], v[i]
            end
        end
    end
    return v
end

v = rand(Int32, 100_000)
@timed bubble_sort(v)
In the case of the C code (I don't know how to program in C, so I apologize for the code):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static void swap(int *xp, int *yp){
    int temp = *xp;
    *xp = *yp;
    *yp = temp;
}

void bubble_sort(int arr[], int n){
    int i, j;
    for (j = 0; j < n - 1; j++){
        for (i = 0; i < n - 1; i++){
            if (arr[i] > arr[i+1]){
                swap(&arr[i], &arr[i+1]);
            }
        }
    }
}

int main(){
    int arr_sz = 100000;
    int arr[arr_sz], i;
    for (i = 0; i < arr_sz; i++){
        arr[i] = rand();
    }
    double cpu_time_used;
    clock_t begin = clock();
    bubble_sort(arr, arr_sz);
    clock_t end = clock();
    cpu_time_used = ((double) (end - begin)) / CLOCKS_PER_SEC;
    printf("time %f\n", cpu_time_used);
    return 0;
}
The performance difference is (on my computer):
Julia: 20 s
C: ~50 s
I suppose that I have made a big mistake in the C code but am not able to find it, or is Julia just faster at loops?
Update: performance optimization
Changed the type to Int32 in Julia so it matches the C code
Made the swap function static (+1 s improvement on average)
Compiler optimization flags (detailed below)
Instead of plain gcc main.c, I've used different optimization flags, as well as the clang compiler. Results:
Compiler / command      Time (s)
Julia                   19.13
gcc -O main.c           47.58
gcc -O1 main.c          15.98
gcc -O2 main.c          19.52
gcc -O3 main.c          19.20
gcc -Os main.c          17.72
clang -O0 main.c        51.59
clang -O1 main.c        16.78
clang -O2 main.c        13.53
clang -O3 main.c        13.57
clang -Ofast main.c     12.39
clang -Os main.c        18.85
clang -Oz main.c        15.64
clang -Og main.c        16.37
It seems like this question may have been reopened after you discovered that your initial measurements were taken on code compiled for debugging, rather than fully optimised, with different sets of data, different compiler platforms and different integer representations.
I suppose that I have made a big mistake in the C code but am not able to find it, or is Julia just faster at loops?
I can answer this somewhat cringy (in my opinion) double-barreled question with a quote from the C standard: "The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant." In short, there's no speed in C; that's an attribute that occurs in implementations of C. We can't reproduce your speed without having your implementation (including your hardware, for example).
It's very possible that Julia has similar clauses in its spec. The gist is: some nifty optimisations may determine that your sorted arrays don't have any necessary side effects, and so the sorting may theoretically be optimised away. I'd expect both programs to output somewhere near 0.0 in that case. This is your perfectly optimal compiler: one that spots code that has no actual impact upon the logic of the program, and optimises away the dead code.
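If you want to guard a benchmark against exactly that, one common trick is to give the result an observable use, for example by printing a value derived from the sorted array. A sketch (it assumes the OP's bubble_sort is linked in; the particular checksum is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void bubble_sort(int arr[], int n);   /* the OP's function, assumed available */

int main(void) {
    int n = 100000;
    int *arr = malloc(n * sizeof *arr);
    for (int i = 0; i < n; i++)
        arr[i] = rand();

    clock_t begin = clock();
    bubble_sort(arr, n);
    clock_t end = clock();

    /* Using the sorted data gives the call a visible side effect, so a
       conforming optimizer cannot discard the sort as dead code. */
    printf("checksum: %d\n", arr[0] ^ arr[n / 2] ^ arr[n - 1]);
    printf("time %f\n", (double)(end - begin) / CLOCKS_PER_SEC);
    free(arr);
    return 0;
}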
We haven't always had loop-invariant code motion, and so it stands to reason that there may be a fifth element here: your compiler's version. You'll probably get different statistics if the underlying llvm is different, for example:
LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old.
-- source
Perhaps one day you'll update this question with output that reads 0.0s for both programs. Then this question has truly lost its point.
It's hard to tell what more is being asked here, @cbk. The comments section managed to reduce the runtime of the C program significantly with those four improvements. The question kind of doesn't even make sense here anymore, because it largely cancels itself out by answering itself at the end.
Perhaps this is just one of those cases where a newcomer ought to have answered their own question with an actual answer (you can do that), rather than rotting the question with edits that answer it... Nonetheless, it's a question that now shows up unanswered in the list of questions. I'd vote to close for the reason "The question should include more details", but since the OP seems to have glossed over the solution, the details we really need are more along the lines of "What didn't you understand about the comments that answer your original question?" And the question has varied so much in apparent meaning... are we going to have a close/reopen war?

How to use AVX instructions to optimize ReLU written in C

I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}

int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i<batch_size*output_dim; ++i) {
        Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, if you have MKL then use the following, otherwise plain g++ -O3.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop, as I found by compiling with the option -fopt-info-vec-all:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if there is aliasing, e.g. if Amem == Z_curr+1; but Amem == Z_curr should, in this case, be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
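In plain C, the same branch-free form might look like the sketch below (using fmax from <math.h> in place of std::max; z > 0 ? z : 0.1*z is equivalent to max(z, 0.1*z), because 0.1*z >= z exactly when z <= 0). Whether the call compiles down to a packed max instruction can depend on the compiler and on math flags such as -ffast-math:

#include <math.h>

/* Branch-free leaky ReLU: no control flow in the loop body, which makes
   auto-vectorization much easier for the compiler. */
void relu_max(double* __restrict Amem, const double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i) {
        Amem[i] = fmax(Z_curr[i], 0.1 * Z_curr[i]);
    }
}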
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and have a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);                               // load 8 doubles
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());    // lanes with z < 0
        // multiply the negative lanes by 0.1, keep the other lanes unchanged
        __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could do the remaining elements with a masked or overlapping store (or maybe you have use-cases where the size is guaranteed to be a multiple of 8, and you can leave the cleanup loop away).
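As a rough, untested sketch of that masked-store idea for the tail (it assumes bo - i is between 1 and 7 when the main loop exits, and it mirrors the logic of the main loop above):

    // handle the 1..7 leftover elements with a masked load/store instead of a scalar loop
    if (i < bo) {
        __mmask8 tail = (__mmask8)((1u << (bo - i)) - 1);        // low (bo - i) bits set
        __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);     // inactive lanes are zeroed
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
        _mm512_mask_storeu_pd(Amem + i, tail, res);              // only the tail lanes are written
    }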

Lower than expected speedup when using multithreading

Remark: I feel a little bit stupid about this, but this might help someone
So, I am trying to improve the performance of a program by using parallelism. However, I am encountering an issue with the measured speedup. I have 4 CPUs:
~% lscpu
...
CPU(s): 4
...
However, the speedup is much lower than fourfold. Here is a minimal working example, with a sequential version, a version using OpenMP and a version using POSIX threads (to be sure it is not due to either implementation).
Purely sequential (add_seq.c):
#include <stddef.h>

int main() {
    for (size_t i = 0; i < (1ull<<36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}
OpenMP (add_omp.c):
#include <stddef.h>

int main() {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < (1ull<<36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}
POSIX threads (add_pthread.c):
#include <pthread.h>
#include <stddef.h>

void* f(void* x) {
    (void) x;
    const size_t count = (1ull<<36) / 4;
    for (size_t i = 0; i < count; i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return NULL;
}

int main() {
    pthread_t t[4];
    for (size_t i = 0; i < 4; i += 1) {
        pthread_create(&t[i], NULL, f, NULL);
    }
    for (size_t i = 0; i < 4; i += 1) {
        pthread_join(t[i], NULL);
    }
    return 0;
}
Makefile:
CFLAGS := -O3 -fopenmp
LDFLAGS := -O3 -lpthread # just to be sure
all: add_seq add_omp add_pthread
So, now, running this (using zsh's time builtin):
% make -B && time ./add_seq && time ./add_omp && time ./add_pthread
cc -O3 -fopenmp -O3 -lpthread add_seq.c -o add_seq
cc -O3 -fopenmp -O3 -lpthread add_omp.c -o add_omp
cc -O3 -fopenmp -O3 -lpthread add_pthread.c -o add_pthread
./add_seq 24.49s user 0.00s system 99% cpu 24.494 total
./add_omp 52.97s user 0.00s system 398% cpu 13.279 total
./add_pthread 52.92s user 0.00s system 398% cpu 13.266 total
Checking the CPU frequency, the sequential code runs at the maximum CPU frequency of 2.90 GHz, and the parallel code (all versions) runs at a uniform 2.60 GHz. So, counting billions of cycles:
>>> 24.494 * 2.9
71.0326
>>> 13.279 * 2.6
34.5254
>>> 13.266 * 2.6
34.4916
So, all in all, threaded code is only running twice as fast as sequential code, although it is using four times as much CPU time. Why is it so?
Remark: the assembly for add_omp.c seemed less efficient, since it implemented the for-loop by incrementing a register and comparing it to the number of iterations, rather than decrementing and directly checking the zero flag; however, this had no effect on performance.
Well, the answer is quite simple: there are really only two CPU cores:
% lscpu
...
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
...
So, although htop shows four CPUs, two are virtual and only there because of hyper-threading. Since the core idea of hyper-threading is sharing the resources of a single core between two threads, it does not help run identical code faster (it is only useful when the two threads use different resources).
So, in the end, what happens is that time/clock() measures the usage of each logical core as that of the underlying physical core. Since all report ~100% usage, we get a ~400% usage, although it only represents a twofold speedup.
Up until then, I was convinced this computer contained 4 physical cores, and had completely forgotten to check about hyperthreading.
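As a side note, querying the CPU count programmatically has the same pitfall: sysconf reports logical CPUs (hardware threads), not physical cores. A minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* On the 2-core / 4-thread machine above this prints 4, not 2;
       for physical cores you still need lscpu or /proc/cpuinfo. */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical CPUs: %ld\n", ncpu);
    return 0;
}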

Initializing array in C - execution time

int a[5] = {0};
VS
typedef struct
{
    int a[5];
} ArrStruct;

ArrStruct arrStruct;
int it, sizeA = sizeof(arrStruct.a)/sizeof(int);
for (it = 0; it < sizeA; ++it)
    arrStruct.a[it] = 0;
Does initializing with a for loop take more execution time? If so, why?
It depends upon the compiler and the optimization flags.
On recent GCC (e.g. 4.8 or 4.9) with gcc -O3 (or probably even -O1 or -O2) it should not matter, since the same code would be emitted (GCC even has an optimization which transforms such a loop into a builtin memset, which is then optimized further).
On some compilers, it could happen that int a[5] = {0}; is faster, because the compiler might emit e.g. vector instructions (or, on x86, rep stosw) to clear the array.
The best thing is to examine the generated (gimple representation and) assembler code (e.g. with gcc -fdump-tree-gimple -O3 -fverbose-asm -mtune=native -S) and to benchmark. In most cases it does not matter. Be sure to enable optimizations when compiling.
Generally, don't worry about such micro-optimizations; a good optimizing compiler will do better than you have time to do by hand.
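If you want to check this yourself, here is a minimal sketch you could feed to gcc -O3 -S so you can compare the assembly generated for the two forms (use() is just a hypothetical opaque external function that keeps the arrays from being optimized away entirely):

int use(int *p);   /* defined elsewhere; prevents the arrays from being removed */

int with_braces(void) {
    int a[5] = {0};
    return use(a);
}

int with_loop(void) {
    int a[5];
    for (int it = 0; it < 5; ++it)
        a[it] = 0;
    return use(a);
}

With optimization enabled, both functions typically compile to the same handful of zeroing stores (or a memset call for larger arrays).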
It depends on the scope of the variables. For a static or global variable, the first initialization
int a[5]={0};
may be done at compile time, while the loop is run at, well, run time. Thus there is no "execution" associated with the former.
You may find the discussion of this question (and in particular this answer) interesting.

Ensuring tail-call optimization in C [duplicate]

It seems to me that it would work perfectly well to do tail-recursion optimization in both C and C++, yet while debugging I never seem to see a frame stack that indicates this optimization. That is kind of good, because the stack tells me how deep the recursion is. However, the optimization would be kind of nice as well.
Do any C++ compilers do this optimization? Why? Why not?
How do I go about telling the compiler to do it?
For MSVC: /O2 or /Ox
For GCC: -O2 or -O3
How about checking if the compiler has done this in a certain case?
For MSVC, enable PDB output to be able to trace the code, then inspect the generated code
For GCC..?
I'd still take suggestions for how to determine if a certain function is optimized like this by the compiler (even though I find it reassuring that Konrad tells me to assume it)
It is always possible to check if the compiler does this at all by making an infinite recursion and checking if it results in an infinite loop or a stack overflow (I did this with GCC and found out that -O2 is sufficient), but I want to be able to check a certain function that I know will terminate anyway. I'd love to have an easy way of checking this :)
After some testing, I discovered that destructors ruin the possibility of making this optimization. It can sometimes be worth it to change the scoping of certain variables and temporaries to make sure they go out of scope before the return-statement starts.
If any destructor needs to be run after the tail-call, the tail-call optimization can not be done.
All current mainstream compilers perform tail call optimisation fairly well (and have done for more than a decade), even for mutually recursive calls such as:
int bar(int, int);

int foo(int n, int acc) {
    return (n == 0) ? acc : bar(n - 1, acc + 2);
}

int bar(int n, int acc) {
    return (n == 0) ? acc : foo(n - 1, acc + 1);
}
Letting the compiler do the optimisation is straightforward: Just switch on optimisation for speed:
For MSVC, use /O2 or /Ox.
For GCC, Clang and ICC, use -O3.
An easy way to check if the compiler did the optimisation is to perform a call that would otherwise result in a stack overflow — or looking at the assembly output.
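A minimal sketch of such a check follows; the recursion depth is arbitrary, just large enough that an unoptimised build will overflow an ordinary stack, while a build with tail-call optimisation runs in constant stack space:

#include <stdio.h>

/* Tail-recursive accumulator: with the optimisation applied this becomes a loop. */
static long count_down(long n, long acc) {
    if (n == 0)
        return acc;
    return count_down(n - 1, acc + 1);   /* tail call */
}

int main(void) {
    printf("%ld\n", count_down(100000000L, 0));   /* ~100 million frames without TCO */
    return 0;
}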
As an interesting historical note, tail call optimisation for C was added to GCC in the course of a diploma thesis by Mark Probst. The thesis describes some interesting caveats in the implementation. It's worth reading.
As well as the obvious (compilers don't do this sort of optimization unless you ask for it), there is a complexity about tail-call optimization in C++: destructors.
Given something like:
int fn(int j, int i)
{
    if (i <= 0) return j;
    Funky cls(j, i);
    return fn(j, i-1);
}
The compiler can't (in general) tail-call optimize this because it needs to call the destructor of cls after the recursive call returns.
Sometimes the compiler can see that the destructor has no externally visible side effects (so it can be done early), but often it can't.
A particularly common form of this is where Funky is actually a std::vector or similar.
gcc 4.3.2 completely inlines this function (a crappy/trivial atoi() implementation) into main(). The optimization level is -O1. I notice that if I play around with it (even changing it from static to extern), the tail recursion goes away pretty fast, so I wouldn't depend on it for program correctness.
#include <stdio.h>
static int atoi(const char *str, int n)
{
    if (str == 0 || *str == 0)
        return n;
    return atoi(str+1, n*10 + *str-'0');
}

int main(int argc, char **argv)
{
    for (int i = 1; i != argc; ++i)
        printf("%s -> %d\n", argv[i], atoi(argv[i], 0));
    return 0;
}
Most compilers don't do any kind of optimisation in a debug build.
If using VC, try a release build with PDB info turned on - this will let you trace through the optimised app and you should hopefully see what you want then. Note, however, that debugging and tracing an optimised build will jump you around all over the place, and often you cannot inspect variables directly as they only ever end up in registers or get optimised away entirely. It's an "interesting" experience...
As Greg mentions, compilers won't do it in debug mode. It's OK for debug builds to be slower than a release build, but they shouldn't crash more often: and if you depend on a tail-call optimization, they may do exactly that. Because of this it is often best to rewrite the tail call as a normal loop. :-(
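For example, the tail-recursive atoi shown above can be rewritten as an explicit loop; the behaviour is the same, but debug builds no longer depend on the optimiser (a sketch, renamed my_atoi only to avoid clashing with the earlier definition):

static int my_atoi(const char *str, int n)
{
    /* Same logic as the recursive version, expressed as iteration. */
    while (str != 0 && *str != 0) {
        n = n * 10 + *str - '0';
        ++str;
    }
    return n;
}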
