About the combination of OpenMP and -Ofast - c

I implemented OpenMP parallelization in a for loop containing a sum that is the main cause of my code's slowness. When I did so, I found that the final results were not the same as those I obtained with the non-parallelized code (which is written in C). So at first one might think "well, I just didn't implement the parallelization correctly", but the curious thing is that when I run the parallelized code with the -Ofast optimization, the results are suddenly correct.
That would be:
-O0 correct
-Ofast correct
OMP -O0 wrong
OMP -O1 wrong
OMP -O2 wrong
OMP -O3 wrong
OMP -Ofast correct!
What could -Ofast be doing that fixes an error which only appears when I enable OpenMP?
Any recommendations on what I could check or test?
Thanks!
EDIT
Here I include the smallest version of my code that still reproduces the problem.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#define LENGTH 100
#define R 50.0
#define URD 1.0/sqrt(2.0)
#define PI (4.0*atan(1.0)) //pi
const gsl_rng_type * Type;
gsl_rng * item;
double CalcDeltaEnergy(double **M,int sx,int sy){
double DEnergy,r,zz;
int k,j;
double rrx,rry;
int rx,ry;
double Energy, Cpm, Cmm, Cmp, Cpp;
DEnergy = 0;
//OpenMP parallelization:
#pragma omp parallel for reduction (+:DEnergy)
for (int index = 0; index < LENGTH*LENGTH; index++){
k = index % LENGTH;
j = index / LENGTH;
zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));
for (rx = -1; rx <= 1; rx++){
for (ry = -1; ry <= 1; ry++){
rrx = (sx - k - rx*LENGTH)*URD;
rry = (sy - j - ry*LENGTH)*URD;
r = sqrt(rrx*rrx + rry*rry + zz);
if(r != 0 && r <= R){
Cpm = sqrt((rrx+0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy])))*(rrx+0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy]))) + (rry+0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy])))*(rry+0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy]))) + zz);
Cmm = sqrt((rrx-0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy])))*(rrx-0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy]))) + (rry-0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy])))*(rry-0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy]))) + zz);
Cpp = sqrt((rrx+0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy])))*(rrx+0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy]))) + (rry+0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy])))*(rry+0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy]))) + zz);
Cmp = sqrt((rrx-0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy])))*(rrx-0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy]))) + (rry-0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy])))*(rry-0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy]))) + zz);
Cpm = 1.0/Cpm;
Cmm = 1.0/Cmm;
Cpp = 1.0/Cpp;
Cmp = 1.0/Cmp;
Energy = (Cpm + Cmm - Cpp - Cmp)/(0.702*0.702); // S=cte=1
DEnergy -= 2.0*Energy;
}
}
}
}
return DEnergy;
}
void Initialize(double **M){
double random;
for(int i=0;i<(LENGTH-1);i=i+2){
for(int j=0;j<(LENGTH-1);j=j+2) {
random=gsl_rng_uniform(item);
if (random<0.5) M[i][j]=PI/4.0;
else M[i][j]=5.0*PI/4.0;
random=gsl_rng_uniform(item);
if (random<0.5) M[i][j+1]=3.0*PI/4.0;
else M[i][j+1]=7.0*PI/4.0;
random=gsl_rng_uniform(item);
if (random<0.5) M[i+1][j]=3.0*PI/4.0;
else M[i+1][j]=7.0*PI/4.0;
random=gsl_rng_uniform(item);
if (random<0.5) M[i+1][j+1]=PI/4.0;
else M[i+1][j+1]=5.0*PI/4.0;
}
}
}
int main(){
//Choose and initiaze the random number generator
gsl_rng_env_setup();
Type = gsl_rng_default; //default=mt19937, ran2, lxs0
item = gsl_rng_alloc (Type);
double **S; //site matrix
S = (double **) malloc(LENGTH*sizeof(double *));
for (int i = 0; i < LENGTH; i++)
S[i] = (double *) malloc(LENGTH*sizeof(double ));
//Initialization
Initialize(S);
int l,m;
for (int cl = 0; cl < LENGTH*LENGTH; cl++) {
l = gsl_rng_uniform_int(item, LENGTH); // RNG[0, LENGTH-1]
m = gsl_rng_uniform_int(item, LENGTH); // RNG[0, LENGTH-1]
printf("%lf\n", CalcDeltaEnergy(S, l, m));
}
//Free memory
for (int i = 0; i < LENGTH; i++)
free(S[i]);
free(S);
return 0;
}
I compile with:
g++ [optimization] -lm test.c -o test.x -lgsl -lgslcblas -fopenmp
and run with:
GSL_RNG_SEED=123; ./test.x > test.dat
Comparing the outputs for the different optimization levels, one can see what I stated above.

Disclaimer: I have little to no experience with OpenMP
It's probably a race condition you run into when using OpenMP.
You'll need to declare all those work variables private to the OpenMP loop. One thread may calculate their values for a certain value of index, only to have them promptly overwritten by another thread working on a different value of index: variables such as k, j, rrx, rry etc. are shared between the threads.
Instead of using a pragma like
#pragma omp parallel for private(k,j,zz,rx,ry,rrx,rry,r,Cpm,Cmm,Cpp,Cmp,Energy) reduction (+:DEnergy)
(credit to Zulan's comment below:) you can also declare the variables inside the parallel region, as locally as possible. This makes them implicitly private, is less prone to initialization issues, and is easier to reason about.
(You could even consider putting everything inside the outer for-loop (over index) in a function: the function call overhead is minimal compared to the calculations.)
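For illustration, here is a minimal, self-contained sketch of that "declare inside the loop" approach (a simplified reduction, not the question's energy formula): every per-iteration variable is declared inside the loop body, so it is implicitly private, and only the reduction variable is shared.
#include <stdio.h>
#include <math.h>
#define LENGTH 100
int main(void){
    double sum = 0.0;
    /* k, j and zz are declared inside the loop, so each thread gets its own
       copies; only sum is shared, and that only through the reduction clause */
    #pragma omp parallel for reduction(+:sum)
    for (int index = 0; index < LENGTH*LENGTH; index++){
        int k = index % LENGTH;
        int j = index / LENGTH;
        double zz = 0.5*(1.0 - pow(-1.0, k + j));
        sum += sqrt((double)(k*k + j*j) + zz);
    }
    printf("%f\n", sum);
    return 0;
}
Compile with, e.g., gcc -O2 -fopenmp sketch.c -lm.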
As to why -Ofast together with OpenMP actually produces correct output:
My guess is: mostly luck. Here's what -Ofast does (gcc manual):
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math [...]
Here's the section on -ffast-math:
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
Thus, the sqrt, cos and sin calls will likely be a lot speedier. My guess is that, in this case, the calculations of the variables inside the outer loop don't bite each other, since the individual threads are so fast that they don't conflict. But that is a very hand-wavy explanation and a guess.
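One concrete thing to check (my suggestion, not something verified in the question) is whether -O3 combined with -ffast-math behaves like -Ofast here, since according to the manual text above that is the main thing -Ofast adds:
g++ -O3 -ffast-math -lm test.c -o test.x -lgsl -lgslcblas -fopenmp
If this build also produces the "correct" output while plain -O3 does not, the difference really comes from the fast-math transformations rather than from the optimization level itself.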

Related

Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization

I'm trying to parallelize the inner loop of a program that has data dependencies (min) outside the scope of the loops. I'm having an issue where the residual calculations occur outside the scope of the inner j loop. The code gets errors if the "#pragma omp parallel" part is included on the j loop, even if the loop doesn't run at all because the k value is too low, say (1,2,3) for example.
for (i = 0; i < 10; i++)
{
#pragma omp parallel for shared(min) private (j, a, b, storer, arr) //
for (j = 0; j < k-4; j += 4)
{
mm_a = _mm_load_ps(&x[j]);
mm_b = _mm_load_ps(&y[j]);
mm_a = _mm_add_ps(mm_a, mm_b);
_mm_store_ps(storer, mm_a);
#pragma omp critical
{
if (storer[0] < min)
{
min = storer[0];
}
if (storer[1] < min)
{
min = storer[1];
}
//etc
}
}
do
{
#pragma omp critical
{
if (x[j]+y[j] < min)
{
min = x[j]+y[j];
}
}
}
} while (j++ < (k - 1));
round_min = min
}
The j-based loop is a parallel loop, so you cannot use j after the loop. This is especially true since you explicitly put j as private, so it is only visible locally in each thread and not outside the parallel region. You can explicitly compute the position where the residual iterations start, (k-4+3)/4*4, just after the parallel loop.
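A minimal self-contained sketch of that idea (hypothetical data, and plain C in place of the SSE intrinsics, just to show where the residual loop picks up):
#include <stdio.h>
#include <float.h>
int main(void){
    enum { K = 10 };                        /* hypothetical problem size */
    float x[K], y[K];
    for (int j = 0; j < K; j++){ x[j] = (float)j; y[j] = 0.5f*(float)j; }
    float min = FLT_MAX;
    /* blocked loop over full groups of 4, as in the question's j-loop */
    #pragma omp parallel for reduction(min:min)
    for (int j = 0; j < K - 4; j += 4)
        for (int lane = 0; lane < 4; lane++){
            float v = x[j + lane] + y[j + lane];
            if (v < min) min = v;
        }
    /* residual elements: j from the blocked loop is private and gone here,
       so recompute where that loop stopped */
    for (int j = (K - 4 + 3) / 4 * 4; j < K; j++){
        float v = x[j] + y[j];
        if (v < min) min = v;
    }
    printf("min = %f\n", min);
    return 0;
}
This needs -fopenmp and an OpenMP 3.1 compiler for the min reduction.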
Furthermore, here are a few important points:
You may not really need to vectorize the code yourself: you can use omp simd reduction. OpenMP can do all the boring job of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code will also likely be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
Critical sections (omp critical) are very costly. In your case this will just annihilate any possible improvement related to the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and can easily be used here.
Here is a corrected code taking into account the above points:
for (i = 0; i < 10; i++)
{
// Assume min is already initialized correctly here
#pragma omp parallel for simd reduction(min:min) private(j)
for (j = 0; j < k; ++j)
{
const float tmp = x[j] + y[j];
if(tmp < min)
min = tmp;
}
// Use min here
}
The above code is vectorized correctly on x86 architectures by GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise -openmp:experimental).

Why is iterating through an array backwards faster than forward in C

I'm studying for an exam and am trying to follow this problem:
I have the following C code to do some array initialisation:
int i, n = 61440;
double x[n];
for(i=0; i < n; i++) {
x[i] = 1;
}
But the following runs faster (0.5s difference in 1000 iterations):
int i, n = 61440;
double x[n];
for(i=n-1; i >= 0; i--) {
x[i] = 1;
}
I first thought that it was due to the loop accessing the n variable, thus having to do more reads (as suggested here, for example: Why is iterating through an array backwards faster than forwards). But even if I change the n in the first loop to a hard-coded value, or conversely move the 0 in the bottom loop into a variable, the performance remains the same. I also tried changing the loops to do only half the work (go from 0 to < 30720, or from n-1 to >= 30720), to eliminate any special treatment of the 0 value, but the bottom loop is still faster.
I assume it is because of some compiler optimisations? But everything I look up about the generated machine code suggests that < and >= ought to be equal.
Thankful for any hints or advice! Thank you!
Edit: Makefile, for compiler details (this is part of a multi-threading exercise, hence the OpenMP, though in this case it's all running on 1 core, without any OpenMP instructions in the code)
#CC = gcc
CC = /opt/rh/devtoolset-2/root/usr/bin/gcc
OMP_FLAG = -fopenmp
CFLAGS = -std=c99 -O2 -c ${OMP_FLAG}
LFLAGS = -lm
.SUFFIXES : .o .c
.c.o:
${CC} ${CFLAGS} -o $@ $*.c
sblas:sblas.o
${CC} ${OMP_FLAG} -o $@ $@.o ${LFLAGS}
Edit2: I redid the experiment with n * 100, getting the same results:
Forward: ~170s
Backward: ~120s
Similar to the previous values of 1.7s and 1.2s, just times 100
Edit3: Minimal Example - the changes described above were all localized to the vector update method. This is the default forward version, which takes longer than the backward version for(i = limit - 1; i >= 0; i--)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
void vector_update(double a[], double b[], double x[], int limit);
/* SBLAS code */
void *main() {
int n = 1024*60;
int nsteps = 1000;
int k;
double a[n], b[n], x[n];
double vec_update_start;
double vec_update_time = 0;
for(k = 0; k < nsteps; k++) {
// Loop over whole program to get reasonable execution time
// (simulates a time-steping code)
vec_update_start = omp_get_wtime();
vector_update(a, b, x, n);
vec_update_time = vec_update_time + (omp_get_wtime() - vec_update_start);
}
printf( "vector update time = %f seconds \n \n", vec_update_time);
}
void vector_update(double a[], double b[], double x[] ,int limit) {
int i;
for (i = 0; i < limit; i++ ) {
x[i] = 0.0;
a[i] = 3.142;
b[i] = 3.142;
}
}
Edit4: the CPU is AMD quad-core Opteron 8378. The machine uses 4 of those, but I'm using only one on the main processor (core ID 0 in the AMD architecture)
It's not the backward iteration but the comparison with zero that causes the loop in the second case to run faster.
for(i=n-1; i >= 0; i--) {
Comparison with zero can be done with a single assembly instruction whereas comparison with any other number takes multiple instructions.
The main reason is that your compiler isn't very good at optimising. In theory there's no reason that a better compiler couldn't have converted both versions of your code into the exact same machine code instead of letting one be slower.
Everything beyond that depends on what the resulting machine code is and what it's running on. This can include differences in RAM and/or CPU speeds, differences in cache behaviour, differences in hardware prefetching (and number of prefetchers), differences in instruction costs and instruction pipelining, differences in speculation, etc. Note that (in theory) this doesn't exclude the possibility that (on most computers, but not on your computer) the machine code your compiler generates for the forward loop is faster than the machine code it generates for the backward loop (your sample size isn't large enough to be statistically significant, unless you're working on embedded systems or game consoles where all computers that run the code are identical).

Is it possible to effectively parallelise a brute-force attack on 4 different password patterns?

In the context of my homework task I need to smart brute-force a set of passwords. Every password in the set matches one of four possible masks:
%%##
##%%
#%%#
%##%
( # - a numeric character, % - a lowercase alpha character ).
At this point I am doing something like this to run over only one pattern ( the 1st one ) in multithreading:
// Compile: $ gcc test.c -o test -fopenmp -O3 -std=c99
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>
int main() {
const char alp[26] = "abcdefghijklmnopqrstuvwxyz";
const char num[10] = "0123456789";
register int i;
char pass[4];
#pragma omp parallel for private(pass)
for (i = 0; i < 67600; i++) {
pass[3] = num[i % 10];
pass[2] = num[i / 10 % 10];
pass[1] = alp[i / 100 % 26];
pass[0] = alp[i / 2600 % 26];
/* Slow password processing here */
}
return 0;
}
But, unfortunately, that technique has nothing to do with searching passwords with different patterns.
So my question is:
Is there a way to construct an effective set of parallel for instructions in order to run the attack simultaneously on each password pattern?
Help is much appreciated.
The trick here is to note that all four password options are simply rotations/shifts of each other.
That is, for the example password qr34 and the patterns you mention, you are looking at:
qr34 %%## #Original potential password
4qr3 #%%# #Rotate 1 place right
34qr ##%% #Rotate 2 places right
r34q %##% #Rotate 3 places right
Given this, you can use the same generation technique as in your first question.
For each potential password generated, check the potential password as well as the next three shifts of that password.
Note that the following code relies on an interesting property of C/C++: if the truth value of a statement can be deduced early, no further execution takes place. That is, given the statement if(A || B || C), if A is false, then B must be evaluated; however, if B is true, then C is never evaluated.
This means that we can have A=CheckPass(pass) and B=CheckPass(RotatePass(pass)) and C=CheckPass(RotatePass(pass)) with the guarantee that the password will only be rotated as many times as necessary.
Note that this scheme requires that each thread have its own, private copy of the potential password.
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>
int PassCheck(char *pass){
return strncmp(pass, "4qr3", 4)==0;
}
//Rotate string one character to the right
char* RotateString(char *str, int len){
char lastchr = str[len-1];
for(int i=len-1;i>0;i--)
str[i]=str[i-1];
str[0] = lastchr;
return str;
}
int main(){
const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
const char num[11] = "0123456789";
char goodpass[4] = "----"; //Provide a default password to indicate an error state
#pragma omp parallel for collapse(4)
for(int i = 0; i < 26; i++)
for(int j = 0; j < 26; j++)
for(int m = 0; m < 10; m++)
for(int n = 0; n < 10; n++){
char pass[4] = {alph[i],alph[j],num[m],num[n]};
if(
PassCheck(pass) ||
PassCheck(RotateString(pass,4)) ||
PassCheck(RotateString(pass,4)) ||
PassCheck(RotateString(pass,4))
){
//It is good practice to use `critical` here in case two
//passwords are somehow both valid. This won't arise in
//your code, but is worth thinking about.
#pragma omp critical
{
memcpy(goodpass, pass, 4);
//#pragma omp cancel for //Escape for loops!
}
}
}
printf("Password was '%.4s'.\n",goodpass);
return 0;
}
I notice that you are generating your password using
pass[3] = num[i % 10];
pass[2] = num[i / 10 % 10];
pass[1] = alp[i / 100 % 26];
pass[0] = alp[i / 2600 % 26];
This sort of technique is occasionally useful, especially in scientific programming, but usually only for addressing convenience and memory locality.
For instance, an array of arrays where an element is accessed as a[y][x] can be written as a flat-array with elements accessed as a[y*width+x]. This gives a speed gain, but only because the memory is contiguous.
In your case, this indexing does not produce any speed gains, but does make it more difficult to reason about how your program works. I would avoid it for this reason.
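For illustration only (hypothetical sizes, nothing from the question), the flat-array equivalence mentioned above looks like this:
#include <stdio.h>
#include <stdlib.h>
int main(void){
    const int width = 4, height = 3;      /* hypothetical dimensions */
    /* one contiguous allocation; element (y, x) lives at index y*width + x */
    double *a = malloc((size_t)height * width * sizeof(double));
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            a[y*width + x] = y + 0.1*x;   /* plays the role of a[y][x] */
    printf("element (2,3): %f\n", a[2*width + 3]);
    free(a);
    return 0;
}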
It's been said that "premature optimization is the root of all evil". This is especially true of micro-optimizations such as the one you're trying here. The biggest speed gains come from high-level algorithmic decisions, not from fiddly stuff. The -O3 compilation flag does almost everything you'll ever need at this level to make your code fast.
Micro-optimizations assume that doing something convoluted in your high-level code will somehow enable you to out-smart the compiler. This is not a good assumption since the compiler is often quite smart and will be even smarter tomorrow. Your time is very valuable: don't use it on this stuff unless you have a clear justification. (Further discussion of "premature optimization" is here.)

Parallelization for Monte Carlo pi approximation

I am writing a C program to parallelize a pi approximation with OpenMP. I think my code works fine, with convincing output. I am running it with 4 threads now. What I am not sure about is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the thread actions in this code?
The code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
double sample_interval(double a, double b) {
double x = ((double) rand())/((double) RAND_MAX);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int i;
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
{
srand(time(NULL));
for (int i=0; i < N; ++i)
{
x = sample_interval(-1.,1.);
y = sample_interval(-1.,1.);
z = ((x*x)+(y*y));
if (z<= 1)
{
counter++;
}
}
}
double approx_pi = 4.0 * counter/ (double)N;
printf("%i %1.6e %1.6e\n ", N, 4.0 * counter/ (double)N, fabs(4.0 * counter/ (double)N - pi) / pi);
return 0;
}
Also, I was wondering whether the seed for the random number generator should be set inside or outside the parallel region. My output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
Which looks OK for now. Your suggestions regarding the race condition and seed placement are most welcome.
There are a few problems in your code that I can see. The main one, from my standpoint, is that it isn't parallelized. Or more precisely, you didn't enable the parallelism you introduced with OpenMP when compiling it. Here is how one can see that:
The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here, no #pragma omp parallel for, only a #pragma omp parallel). Therefore, considering you set the number of threads to be 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time for getting a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for, for example; see the sketch just below this list)
But that is if / when your code is correct elsewhere, which it isn't yet.
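As a sketch of that second option, here is a fragment meant to replace the parallel region in the question's main() (the rand() problem discussed next still applies to it):
/* worksharing: the N iterations are split among the threads instead of
   every thread running all N of them */
#pragma omp parallel for private(x, y, z) reduction(+:counter) num_threads(NumThreads)
for (int i = 0; i < N; ++i)
{
    x = sample_interval(-1.,1.);
    y = sample_interval(-1.,1.);
    z = ((x*x)+(y*y));
    if (z <= 1)
    {
        counter++;
    }
}
/* counter now corresponds to N samples in total, so 4.0*counter/N estimates pi directly */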
As BitTickler already hinted, rand() isn't thread-safe. So you have to go for another random number generator which will allow you to privatize its state. That could be rand_r(), for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly go past the period of the RNG and repeat the same sequence over and over again. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue in the sense that you want the sequences running in parallel to be uncorrelated with each other. And just using a different seed value per thread doesn't guarantee that (although with a wide enough RNG, you have a bit of headroom for that).
Anyway, bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
Initialize its state per-thread based on the thread id for example, while keeping in mind that this won't ensure a proper decorrelation of the random series in some circumstances (and the larger the number of times you call the functions, the more likely you are to finally have overlapping series).
This done (along with a few minor fixes), your code becomes for example as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
typedef struct drand48_data RNGstate;
double sample_interval(double a, double b, RNGstate *state) {
double x;
drand48_r(state, &x);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
time_t ctime = time(NULL);
#pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
{
RNGstate state;
srand48_r(ctime+omp_get_thread_num(), &state);
for (int i=0; i < N; ++i) {
x = sample_interval(-1, 1, &state);
y = sample_interval(-1, 1, &state);
z = ((x*x)+(y*y));
if (z<= 1) {
counter++;
}
}
}
double approx_pi = 4.0 * counter / (NumThreads * N);
printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);
return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp

Speed up C program without using conditional compilation

We are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only inconvenient, since we need to recompile every time we make a different choice, but it also makes the code hard to read. It's now time to start a new version, and we are evaluating whether we can avoid conditional compilation.
Here is a very artificial example that shows the effect:
/* program_define */
#include <stdio.h>
#include <stdlib.h>
#define skip 10
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the variable skip is an example for a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip.
Let's look at another version of the program:
/* program_variable */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the value for skip is passed as a command line parameter. This adds great flexibility. However, this program is much slower:
$ time ./program_define 1000 10
50004989999950500
real 0m25.973s
user 0m25.937s
sys 0m0.019s
vs.
$ time ./program_variable 1000 10
50004989999950500
real 0m50.829s
user 0m50.738s
sys 0m0.042s
What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?
Any help is greatly appreciated!
Comments:
As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace, with a more efficient construct, an if that evaluates a variable which is set once and then never changed (say, a command-line option) inside a function that is called literally billions of times. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there were a nicer way that does not require recompilation.
You can take a look at libdivide, which does fast division when the divisor isn't known until runtime (libdivide is an open-source library for optimizing integer division).
If you calculate a % b using a - b * (a / b) (but with libdivide) you might find that it's faster.
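As a rough sketch of that idea applied to program_variable (assuming libdivide's C header with its libdivide_s32_gen / libdivide_s32_do functions; check the library's documentation for the exact API and types):
#include <stdio.h>
#include <stdlib.h>
#include "libdivide.h"
int main(int argc, char** argv) {
    int i, j;
    long result = 0;
    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);
    /* precompute the division "magic numbers" for skip once, up front */
    struct libdivide_s32_t fast_skip = libdivide_s32_gen(skip);
    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            /* j % skip computed as j - skip*(j/skip), with the division
               done by libdivide instead of a hardware divide */
            int q = libdivide_s32_do(j, &fast_skip);
            if (i + (j - skip * q) == 0) {
                continue;
            }
            result += i + j;
        }
    }
    printf("%ld\n", result);
    return 0;
}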
I ran your program_variable code on my system to get a baseline of performance:
$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500
real 0m55.531s
user 0m55.484s
sys 0m0.033s
If I compile test1.c with -O3, then I get:
$ time ./a.out 1000 10
50004989999950500
real 0m54.305s
user 0m54.246s
sys 0m0.030s
In a third test, I manually set the values of limit and skip:
int limit = 1000, skip = 10;
I then re-run the test:
$ gcc -Wall test2.c
$ time ./a.out
50004989999950500
real 0m54.312s
user 0m54.282s
sys 0m0.019s
Taking out the atoi() calls doesn't make much of a difference. But if I compile with -O3 optimizations turned on, then I get a speed bump:
$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500
real 0m26.756s
user 0m26.724s
sys 0m0.020s
Adding a #define macro for an ersatz atoi() function helped a little, but didn't do much:
#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
for (int i = iLen - 1; i >= 0; --i) \
{ iOut += ((zString[i] - 48) * j); \
j = j*10;}}
...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);
And testing:
$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500
real 0m53.514s
user 0m53.473s
sys 0m0.025s
The expensive part seems to be those atoi() calls, if that's the only difference between -O3 compilation.
Perhaps you could write one binary, which loops through tests of various values of limit and skip, something like:
#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
/* per-limit, per-skip test */
If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use fprintf() to write your output to a per-limit, per-skip file output, if you want results in separate files.
You could try using the GCC likely/unlikely builtins (e.g. here) or profile guided optimization (e.g. here). Also, do you intend (i + j) % 10 or i + (j % 10)? The % operator has higher precedence, so your code as written is testing the latter.
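For the likely/unlikely part, a minimal sketch using GCC's __builtin_expect (whether the hint actually helps here would have to be measured):
#include <stdio.h>
#include <stdlib.h>
#define unlikely(x) __builtin_expect(!!(x), 0)
int main(int argc, char** argv) {
    int i, j;
    long result = 0;
    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);
    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            /* tell GCC the skip branch is expected to be taken rarely */
            if (unlikely(i + j % skip == 0)) {
                continue;
            }
            result += i + j;
        }
    }
    printf("%ld\n", result);
    return 0;
}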
I'm a bit familiar with the program Niels is asking about.
There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to preprocessor statements is much, much more involved. In the end, it is not just about executing a modulo operation or a simple division. It is about keeping or skipping certain procedure calls, executing an operation between two other operations, defining the size of an array, etc.
All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.
Dirk
If you used C++ instead of C, you could use templates so that things can be calculated at compile time; even recursion is possible.
Please have a look at C++ template metaprogramming.
A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter
#!/bin/sh
skip=$1
out=program_skip$skip
if [ ! -x $out ]; then
gcc -O3 -Dskip=$skip -o $out test.c
fi
time $out 1000
I also got about a 2× slowdown between program_define and program_variable, 26.2s vs. 49.0s. I then tried
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j, r;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0, r = 0; j < limit; ++j, ++r) {
if (r == skip) r = 0;
if (i + r == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.
Another possibility would be to eliminate using the modulus operator:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
int current = 0;
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (++current == skip) {
current = 0;
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
If that is the actual code, you have a few ways to optimize it:
(i + j % 10==0) is only true when i==0, so you can skip that entire mod operation when i>0. Also, since i + j only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hits skip (as has been pointed out in other answers).
You can also have all possible function implementations already in the program and, at runtime, change a function pointer to select the function you are actually using.
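A minimal sketch of that function-pointer idea, using hypothetical per-skip specialisations of the toy loop (the names run_skip10/run_skip20 are made up for illustration, and only two values of skip are handled here):
#include <stdio.h>
#include <stdlib.h>
/* hypothetical specialisations: the compiler sees skip as a constant in each */
static long run_skip10(int limit) {
    long result = 0;
    for (int i = 0; i < 10000000; ++i)
        for (int j = 0; j < limit; ++j) {
            if (i + j % 10 == 0) continue;
            result += i + j;
        }
    return result;
}
static long run_skip20(int limit) {
    long result = 0;
    for (int i = 0; i < 10000000; ++i)
        for (int j = 0; j < limit; ++j) {
            if (i + j % 20 == 0) continue;
            result += i + j;
        }
    return result;
}
int main(int argc, char** argv) {
    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);
    /* select the specialised routine once, outside the hot path */
    long (*run)(int) = (skip == 10) ? run_skip10 : run_skip20;
    printf("%ld\n", run(limit));
    return 0;
}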
You can use macros to avoid that you have to write duplicate code:
#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}
MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)
If you need too many of these macros (hundreds?), you can write a code generator that writes the source file automatically for a range of values.
I didn't compile nor test the code, but maybe you see the principle.
You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use
register int skip;
