Strange float behaviour in OpenMP - c

I am running the following OpenMP code
#pragma omp parallel shared(S2,nthreads,chunk) private(a,b,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("\nNumber of threads = %d\n", nthreads);
    }

    #pragma omp for schedule(dynamic,chunk) reduction(+:S2)
    for (a = 0; a < NREC; a++) {
        for (b = 0; b < NLIG; b++) {
            S2 = S2 + cos(1 + sin(atan(sin(sqrt(a*2 + b*5) + cos(a) + sqrt(b)))));
        }
    } // end for a
} /* end of parallel section */
For NREC=NLIG=1024 and higher values, on an 8-core board, I get up to 7x speedup. The problem is that the final result for the variable S2 differs by 1 to 5% from the exact result obtained in the serial version. What could be the reason? Should I use some specific compilation options to avoid this strange float behaviour?

The order of additions/subtractions of floating-point numbers can affect the accuracy.
To take a simple example, suppose your machine stores two significant decimal digits and you compute 1 + 0.04 + 0.04.
If you do the left addition first, you get 1.04, which is rounded to 1.0. The second addition gives 1.0 again, so the final result is 1.0.
If you do the right addition first, you get 0.08. Added to 1, this gives 1.08, which is rounded to 1.1.
For maximum accuracy, it's best to add values from small to large.
Another cause could be that floating-point registers on the CPU may hold more bits than a float in main memory (the x87 registers, for example, are 80 bits wide). Hence, if an intermediate result is kept in a register, it is more accurate, but if it gets spilled to memory it is truncated.
See also this question in the C++ FAQ.
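A small demonstration of the ordering effect (a sketch; the harmonic-style series and term count are illustration values, not from the question, and fast-math options may defeat it):

#include <stdio.h>

int main(void) {
    enum { N = 10000000 };

    /* Large-to-small: big terms first; later tiny terms are partly rounded away. */
    float forward = 0.0f;
    for (int i = 1; i <= N; i++)
        forward += 1.0f / (float)i;

    /* Small-to-large: tiny terms accumulate first, so they are not lost. */
    float backward = 0.0f;
    for (int i = N; i >= 1; i--)
        backward += 1.0f / (float)i;

    printf("forward  (large to small): %.7f\n", forward);
    printf("backward (small to large): %.7f\n", backward);
    return 0;
}

The two printed sums differ visibly, and the small-to-large order is the more accurate one.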

It is known that machine floating-point operations lose accuracy when two large values are subtracted (or two large values with different signs are added), yielding a small difference as the result. Thus, summing a sequence of oscillating sign may introduce severe error on each iteration. Another flawed case is when the magnitudes of the two operands differ greatly: the smaller operand is effectively cancelled out.
It might be useful to separate positive and negative operands, and perform summation of each group separately, then add (subtract) the group results.
If accuracy is crucial, it will probably also require pre-sorting each of the groups and performing two sums inside each: the first sum goes from the middle towards the largest values (the head), the second from the smallest (the tail) towards the middle. The resulting group sum is then the sum of the two partial runs.
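A sketch of the grouping idea (simplified to a single small-to-large pass per group instead of the two-run scheme described above; grouped_sum and cmp_magnitude are illustrative names, and error handling is omitted):

#include <stdlib.h>
#include <math.h>

/* Ascending order of magnitude, for qsort. */
static int cmp_magnitude(const void *a, const void *b) {
    double x = fabs(*(const double *)a), y = fabs(*(const double *)b);
    return (x > y) - (x < y);
}

/* Sum positive and negative operands separately, each from smallest
   magnitude to largest, and combine the two partial results at the
   end so that cancellation happens only once. */
double grouped_sum(const double *v, size_t n) {
    double *pos = malloc(n * sizeof *pos);
    double *neg = malloc(n * sizeof *neg);
    size_t np = 0, nn = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= 0) pos[np++] = v[i];
        else           neg[nn++] = v[i];
    }
    qsort(pos, np, sizeof *pos, cmp_magnitude);
    qsort(neg, nn, sizeof *neg, cmp_magnitude);
    double sp = 0.0, sn = 0.0;
    for (size_t i = 0; i < np; i++) sp += pos[i];
    for (size_t i = 0; i < nn; i++) sn += neg[i];
    free(pos);
    free(neg);
    return sp + sn;   /* the single large-magnitude cancellation */
}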

Related

I am multiplying two complex numbers whose imaginary parts are both zero. I was expecting the result to be purely real, but I get an imaginary part.

I am multiplying the reciprocal of the determinant of a matrix by the transposed cofactor matrix to get the inverse matrix. Some of the values in the transposed cofactor matrix have an imaginary part that is not equal to zero, but very close to it.
I am trying to replicate code originally written in MATLAB, so there are exact target values I am trying to achieve; any differences in the values propagate through the rest of the calculations, resulting in very different final values. Is it possible to do? Or will there always be differences between the two codes' calculations?
(I have revised my code to show the small values.) This is the function and the output.
void MatrixScalarMultiply(int r, int c, double complex x, double complex mat[r][c],
                          double complex result[r][c]) {
    for (int R = 0; R < r; R++) {
        for (int C = 0; C < c; C++) {
            printf("%.16g%+.16gi times %.16g%+.16gi\n",
                   creal(x), cimag(x), creal(mat[R][C]), cimag(mat[R][C]));
            result[R][C] = x * mat[R][C];
            printf("result[%d][%d]:%.16g%+.16gi\n",
                   R, C, creal(result[R][C]), cimag(result[R][C]));
        }
    }
}
output:
1122579414.726753+0i times 0.0004943535237422733-2.632898458153072e-21i
result[0][0]:554951.0893507092-2.955637610188447e-12i
I am multiplying two complex numbers whose imaginary parts are both zero,
As OP found out by using exponential notation, the imaginary parts were not both zero.
... any differences in the values propagate through the rest of the calculations, resulting in very different final values. Is it possible to do?
Yes, it is possible, yet often not likely, for a floating-point calculation on two platforms to produce exactly the same result. A more reasonable approach is to tolerate a small difference. What constitutes a small difference depends on the calculation, which has not yet been shown.
Or will there always be differences between the two codes' calculations?
No, they will not always differ. Again, what constitutes a small difference depends on the calculation, which has not yet been shown.
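A minimal sketch of such a tolerance test (complex_close and the tolerance value are illustrative names, not from OP's code):

#include <complex.h>
#include <math.h>
#include <stdbool.h>

/* Treat two complex values as equal when their distance is within an
   absolute tolerance chosen for the calculation at hand. */
static bool complex_close(double complex a, double complex b, double tol) {
    return cabs(a - b) <= tol;
}

For example, complex_close(result[0][0], expected, 1e-9) would compare a computed entry against the MATLAB target value within 1e-9.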
... can see that the imaginary part of the second number is actually a very small number. Is there a way I can round those off to zero?
Yes, code could round as it did when using the "%+.16f" format in an earlier version of the question. That rounded the displayed value, not mat[R][C].
Instead of attempting to "round those off to zero", consider analyzing the code to determine what tolerance is achievable. A simple, though not mathematically rigorous, approach adjusts the various input real and imaginary arguments by 1 unit in the last place (ULP), both up and down, with nextafter(), and observes the range of outputs.
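A minimal sketch of that perturbation idea (the input value is taken from the output above; feeding the perturbed values back into the real calculation is left as a placeholder):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.0004943535237422733;        /* one input from the output above */
    double up   = nextafter(x,  INFINITY);   /* x moved 1 ULP towards +infinity */
    double down = nextafter(x, -INFINITY);   /* x moved 1 ULP towards -infinity */
    /* Re-run the calculation with `up` and `down` in place of `x` and use
       the spread of the outputs to judge a sensible tolerance. */
    printf("%.17g\n%.17g\n%.17g\n", down, x, up);
    return 0;
}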
Alternatively, the algorithm and actual code should be posted in a separate question to help analyze why the imaginary part does not meet OP's expectations.

OpenMP: parallelizing (Approximation) Code makes it not only faster, but more accurate. Why?

I have parallelized a simple code for numerically calculating the integral of a function. I use it with the function y = 2*sqrt(1-x^2) from -1 to 1. This integral equals Pi.
The algorithm is the simplest way to compute an integral; I guess everyone learned it in school. I "draw" rectangles of small width under the function and sum their areas.
The sequential algorithm is:
double calc_integral_seq(int left_bound, int right_bound) {
    int i;
    double x, sum = 0.0;
    double step = 1.0 / (double)STEPS;
    for (i = left_bound*STEPS; i < right_bound*STEPS; i++) {
        x = (i + 0.5) * step;
        sum += f(x);
    }
    return sum * step;
}
Now, when I parallelize this code (for instance by only using the for-loop construct #pragma omp parallel for private(x) reduction(+:sum)), the algorithm is much faster for huge values of STEPS.
But it is also more accurate! How can that be? This is a deterministic algorithm; it should calculate exactly the same value, or am I wrong? How can this be explained?
This is a rounding issue. Whenever you add a very small number to a very large one, there is a rounding error, because the small change cannot be represented accurately in a floating-point number with a large exponent. The rounding error per addition grows as the running sum grows.
By doing the computation in parallel, each local sum does not grow as large as the single sum in the serial loop, so locally there is less rounding error. Also, when the local results are summed into the global sum, they are much closer together in magnitude, so there is less rounding.
General algorithms to avoid floating-point rounding errors are Kahan summation and pairwise summation.
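For reference, a minimal Kahan (compensated) summation variant of the serial loop above (a sketch; f and STEPS are taken from the question's code, and aggressive fast-math optimization can defeat the compensation):

/* Kahan summation: carry a compensation term for the low-order bits
   that a plain `sum += f(x)` would discard on each addition. */
double calc_integral_kahan(int left_bound, int right_bound) {
    double sum = 0.0, c = 0.0;           /* c holds the lost low-order part */
    double step = 1.0 / (double)STEPS;
    for (int i = left_bound*STEPS; i < right_bound*STEPS; i++) {
        double x = (i + 0.5) * step;
        double y = f(x) - c;             /* correct the next addend */
        double t = sum + y;              /* big + small: low bits of y are lost */
        c = (t - sum) - y;               /* algebraically zero; captures the loss */
        sum = t;
    }
    return sum * step;
}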

Execution time of different operators

I was reading Knuth's The Art of Computer Programming and I noticed that he indicates that the DIV command takes 6 times longer than the ADD command in his MIX assembly language.
To test the relevancy to modern architecture, I wrote the following code snippet:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    clock_t start;
    unsigned int ia = 0, ib = 0, ic = 0;
    int i;
    float fa = 0.0, fb = 0.0, fc = 0.0;
    int sample_size = 100000;
    if (argc > 1)
        sample_size = atoi(argv[1]);

#define TEST(OP) \
    start = clock();\
    for (i = 0; i < sample_size; ++i)\
        ic += (ia++) OP ((ib--)+1);\
    printf("%d,", (int)(clock() - start))

    TEST(+);
    TEST(*);
    TEST(/);
    TEST(%);
    TEST(>>);
    TEST(<<);
    TEST(&);
    TEST(|);
    TEST(^);
#undef TEST

// TEST must be redefined for floating point types
#define TEST(OP) \
    start = clock();\
    for (i = 0; i < sample_size; ++i)\
        fc += (fa+=0.5) OP ((fb-=0.5)+1);\
    printf("%d,", (int)(clock() - start))

    TEST(+);
    TEST(*);
    TEST(/);
#undef TEST

    printf("\n");
    return ic+fc; // to prevent optimization!
}
I then generated 4000 test samples (each containing a sample size of 100000 operations of each type) using this command line:
for i in {1..4000}; do ./test >> output.csv; done
Finally, I opened the results in Excel and graphed the averages. What I found was rather surprising.
The actual averages were (from left-to-right): 463.36475,437.38475,806.59725,821.70975,419.56525,417.85725,426.35975,425.9445,423.792,549.91975,544.11825,543.11425
Overall, this is what I expected (division and modulo are slow, as are the floating-point operations).
My question is: why do both integer and floating-point multiplication execute faster than their addition counterparts? It is a small factor, but it is consistent across numerous tests. In TAOCP, Knuth lists ADD as taking 2 units of time while MUL takes 10. Did something change in CPU architecture since then?
Different instructions take different amounts of time on the same CPU; and the same instructions can take different amounts of time on different CPUs. For example, for Intel's original Pentium 4 shifting was relatively expensive and addition was quite fast, so adding a register to itself was faster than shifting a register left by 1; and for Intel's recent CPUs shifting and addition are roughly the same speed (shifting is faster than it was on the original Pentium 4 and addition slower, in terms of "cycles").
To complicate things more, different CPUs may be able to do more or less at the same time, and have other differences that affect performance.
In theory (and not necessarily in practice):
Shifting and boolean operations (AND, OR, XOR) should be fastest (each bit can be done in parallel). Addition and subtraction should be next (relatively simple, but all bits of the result can't be done in parallel because of the carry from one pair of bits to the next).
Multiplication should be much slower as it involves many additions, but some of those additions can be done in parallel. For a simple example (using decimal digits not binary) something like 12 * 34 (with multiple digits) can be broken down into "single digit" form and becomes 2*4 + 2*3 * 10 + 1*4 * 10 + 1*3 * 100; where all "single digit" multiplications can be done in parallel, then 2 additions can be done in parallel, then the last addition can be done.
Division is mostly "compare and subtract if larger, repeated". It's the slowest because it can't be done in parallel (the results of the subtraction are needed for the next comparison). Modulo is the remainder of a division and essentially identical to division (and for most CPUs it's actually the same instruction - e.g. a DIV instruction gives you a quotient and a remainder).
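To illustrate that "compare and subtract if larger, repeated" structure, here is a textbook restoring-division sketch for 32-bit unsigned integers (an illustration of the principle only, not how any particular CPU implements DIV; the divisor must be non-zero):

#include <stdint.h>

/* Restoring division: one compare-and-subtract per bit. Each step needs
   the previous step's remainder, so the steps cannot run in parallel. */
static uint32_t divide_restoring(uint32_t dividend, uint32_t divisor,
                                 uint32_t *remainder) {
    uint32_t quotient = 0, rem = 0;
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((dividend >> bit) & 1);  /* bring down the next bit */
        if (rem >= divisor) {                        /* compare ...             */
            rem -= divisor;                          /* ... subtract if larger  */
            quotient |= (uint32_t)1 << bit;
        }
    }
    *remainder = rem;  /* like a hardware DIV, this yields quotient and remainder */
    return quotient;
}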
For floating point; each number has 2 parts (significand and exponent), so things get a little more complicated. Floating point shifting is actually adding to or subtracting from the exponent (and should cost roughly the same as integer addition/subtraction). For floating point addition, subtraction and boolean operations you need to equalise the exponents, and after that you do the operation on the significands alone (and the "equalising" and "doing the operation" can't be done in parallel). Multiplication is multiplying the significands and adding the exponents (and adjusting the bias), where both parts can be done in parallel so the total cost is whichever is slowest (multiplying the significands); so it's as fast as integer multiplication. Division is dividing the significands and subtracting the exponents (and adjusting the bias), where both parts can be done in parallel and total cost is whichever is slowest (dividing the significands); so it's as fast as integer division.
Note: I've simplified in various places to make it much easier to understand.
To test the execution time, look at the instructions produced in the assembly listing and consult the processor's documentation for those instructions, noting whether the operation is performed by the FPU or directly in code.
Then, add up the execution time for each instruction.
However, if the CPU is pipelined or multithreaded, the operation could take MUCH less time than calculated.
It is true that division and modulo (a division operation) are slower than addition. The reason behind this is the design of the ALU (Arithmetic Logic Unit). The ALU is a combination of parallel adders and logic circuits. Division is performed by repeated subtraction and therefore needs more levels of subtraction logic, making division slower than addition. The propagation delays of the gates involved in division add the cherry on top.

Double precision computations

I am trying to compute numerically (using analytical formulae) the values of the following sequence of integrals:
I(k,t) = int_0^{N/2-1} u^k e^(-i*u*delta*t) du
where "i" is the imaginary unit. For small k, this integral can be computed by hand, but for larger k it is more convenient to notice that there is an iterative relationship between the terms of sequence that can be derived by integration by parts. This is implemented below by the function i1.
void i1(int N, double t, double delta, double complex **result) {
    unsigned int k;
    (*result) = (double complex *)malloc(sizeof(double complex) * N);
    if (t == 0) {
        for (k = 0; k < N; k++) {
            (*result)[k] = pow(N-2, k+1) / (pow(2, k+1) * (k+1));
        }
    }
    else {
        (*result)[0] = 2/(delta*t) * sin(delta*(N-2)*t/4) * cexp(-I*(N-2)*t*delta/4);
        for (k = 1; k < N; k++) {
            (*result)[k] = I/(delta*t) * (pow(N-2, k)/pow(2, k) * cexp(-I*delta*(N-2)*t/2) - k*(*result)[k-1]);
        }
    }
}
The problem is that in my case t is very small (1e-12) and delta is typically around 1e6. When testing the case N=4, I noticed some weird results appearing for k=3: the results were suddenly very large, much larger than they should be, since the norm of an integral is always smaller than the integral of the norm. The results of the test are printed below:
I1(0,1.0000e-12)=1.0000000000e+00+-5.0000000000e-07I
Norm=1.0000000000e+00
compare = 1.0000000000e+00
I1(1,1.0000e-12)=5.0000000000e-01+-3.3328895199e-07I
Norm=5.0000000000e-01
compare = 5.0000000000e-01
I1(2,1.0000e-12)=3.3342209601e-01+-2.5013324745e-07I
Norm=3.3342209601e-01
compare = 3.3333333333e-01
I1(3,1.0000e-12)=2.4960025766e-01+-2.6628804517e+02I
Norm=2.6628816215e+02
compare = 2.5000000000e-01
k=3 not being particularly big, I computed the value of the integral by hand using a calculator and the analytical formula, and obtained the same larger-than-expected result for the imaginary part in that case. I also realized that if I changed the order of the terms, the result changed. It therefore appears to be a precision problem: in the iterative process there is a subtraction of very large but almost equal terms, and, following what was said in this thread: How to divide tiny double precision numbers correctly without precision errors?, this can cause small errors to be amplified. However, I am finding it difficult to see how to resolve the issue in my case, and I was also wondering if someone could briefly explain why this occurs.
You have to be very careful with floating point addition and subtraction.
Suppose a decimal floating-point format with 6 digits of precision (to keep things simple). Adding/subtracting a small number to/from a large one discards some or even all of the smaller. So:
5.00000E+9 + 1.45678E+4 = (5.00000 + 0.0000145678)E+9 = 5.00001E+9
which is as good as it gets. But if you add a series of small numbers to a large one, then you may be better off adding the small numbers together first, and adding the result to the large number.
Subtraction of similar size numbers is another way of losing precision. So:
5.12346E+4 - 5.12345E+4 = 1.00000E-1
Now, each of the two numbers can be at best its real value +/- half the least significant digit, in this case +/-0.5E-1, which is a relative error of about +/-1E-6. The result of the subtraction still carries that +/-0.5E-1 error (we cannot reduce it!), which is now a relative error of +/-0.5 !!!
Multiplication and division are much better behaved -- until you over-/under-flow.
But as soon as you are doing anything iterative with add/subtract, keep saying (loudly) to yourself: floating point numbers are not (entirely) like real numbers.
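A tiny double-precision demonstration of that loss (values chosen purely for illustration):

#include <stdio.h>

int main(void) {
    /* a is the closest double to 1 + 1e-15; its representation error is tiny
       relative to a, but after the subtraction it dominates the result. */
    double a = 1.0 + 1e-15;
    double b = 1.0;
    printf("a - b    = %.17g\n", a - b);  /* ~1.1102e-15, about 11% above 1e-15 */
    printf("expected = %.17g\n", 1e-15);
    return 0;
}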

Floating-point multiplication performing slower depending on operands in C

I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the value of the boundaries (usually 1000), while the rest of the elements are 0 for the Zero type and 1 for the NonZero type.
The code stores the matrix from the file in two allocated matrices of the same size. Then it performs an operation on every element of one matrix using its own value and the values of its neighbours (4 additions and 1 multiplication), and stores the result in the second matrix. Once the computation is finished, the matrix pointers are swapped and the same operation is performed a finite number of times. Here is the core code:
#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]

for (cur_time = 0; cur_time < timeSteps; cur_time++) {
    for (i = 1; i < rows-1; i++) {
        for (j = 1; j < cols-1; j++) {
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Swap pointers for the next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}
The case I am presenting uses a fixed amount of 500 timeSteps (outer iterations) and a matrix size of 8192 rows and 8192 columns, but the problem persists when changing the number of timeSteps or the matrix size. Note that I only measure the time of this concrete part of the algorithm, so neither reading the matrix from the file nor anything else affects the time measurement.
What happens is that I get different times depending on which type of matrix I use, obtaining much worse performance with the Zero type (every other matrix performs the same as the NonZero type, as I have already tried generating a matrix full of random values).
I am certain it is the multiplication operation: if I remove it and leave only the additions, both perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".
This behaviour is certainly weird to me, as I thought that the timing of floating-point operations was independent of the operands' values, which does not look like the case here. I have also tried to capture and show SIGFPE exceptions in case that was the problem, but I obtained no results.
In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.
The problem has already mostly been diagnosed, but I will write up exactly what happens here.
Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).
The smallest normalized single-precision value is 2^-126; when cur_time = 55, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain denormal values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating point data, explaining the slowdown.
When the grid is initially filled with constant data of 1.0, the data never gets too small, and so the denormal stall is avoided.
Note that changing the data type to double would delay, but not alleviate the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.
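A quick sketch that checks both thresholds by repeatedly multiplying by 0.2 (FLT_MIN and DBL_MIN are the smallest normal values from <float.h>; the expected step numbers are the ones derived above):

#include <float.h>
#include <stdio.h>

int main(void) {
    float  f = 1.0f; int f_step = 0;   /* first step where the float goes subnormal  */
    double d = 1.0;  int d_step = 0;   /* first step where the double goes subnormal */
    for (int t = 1; t <= 500; t++) {
        f *= 0.2f;
        d *= 0.2;
        if (!f_step && f < FLT_MIN) f_step = t;
        if (!d_step && d < DBL_MIN) d_step = t;
    }
    printf("float:  step %d\n", f_step);   /* expect 55  */
    printf("double: step %d\n", d_step);   /* expect 441 */
    return 0;
}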
At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the <fenv.h> header in the C library.
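On x86 with SSE, one way to do this is via the control-register intrinsics (a sketch; these macros come from Intel's <xmmintrin.h>/<pmmintrin.h>, not from <fenv.h>):

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* FTZ: denormal *results* become zero. DAZ: denormal *inputs* are treated
   as zero. Together they avoid the slow denormal path entirely. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

Call this once at the start of the computation thread; the mode is per-thread state held in the MXCSR register.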
Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (0x1.0p-126f, the smallest normal number). This would also prevent denormals from arising in the computation.
Maybe your ZeroMatrix uses the typical storage scheme for sparse matrices: store every non-zero value in a linked list. If that is the case, it is quite understandable why it performs worse than a typical array-based storage scheme: it needs to walk the linked list once for every operation you perform. In that case you can maybe speed the process up by using a matrix-multiply algorithm that accounts for the matrix being sparse. If this is not the case, please post minimal but complete code so we can play with it.
Here is one of the possibilities for multiplying sparse matrices efficiently:
http://www.cs.cmu.edu/~scandal/cacm/node9.html
