How can I find out about runtime errors in OpenCL kernel code? For example, when A[0] == 0:
__kernel void eval(__global float* A, __global float* C) {
    C[0] = 3 / A[0];
}
After the kernel is evaluated, C[0] == 3.
I tried a -Werror-like parameter for clBuildProgram, but it finds errors only at the compilation stage (C[0] = 3/0 -> 1 error generated: division by zero is undefined).
Unfortunately there is no way to have OpenCL serve you runtime errors such as division by zero on a silver platter. A division like 3.0f/A[0] is a perfectly valid operation. It is up to the user to ensure, through algorithm design, that the input A[0] is never zero. For large kernels, this may not be trivial. And if this cannot be ensured, you have to check the results for Inf/NaN and manually throw the error. Alternatively, if A>=0 is distance and A!=0 cannot be ensured, a common trick is to add a tolerance value in the division: 3.0f/(A[0]+1E-9f)
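A minimal host-side sketch of such a check after reading the results back from the device (the function name `first_non_finite` and the buffers here are hypothetical, not part of any OpenCL API):

```c
#include <math.h>   /* isnan, isinf */
#include <stddef.h>

/* Scan a result buffer read back from the device; return the index of the
 * first non-finite value, or -1 if all values are finite. */
static long first_non_finite(const float *results, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (isnan(results[i]) || isinf(results[i]))
            return (long)i;
    }
    return -1;
}
```

If this reports a hit, the host can raise its own error instead of silently using the bad value.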
Same with array out of bounds exception: there is no runtime error, but in this case it's likely that the application will crash. You have to make sure by algorithm design that the array index is always valid.
The following 3 lines give imprecise results with "gcc -Ofast -march=skylake":
int32_t i = -5;
const double sqr_N_min_1 = (double)i * i;
1. - ((double)i * i) / sqr_N_min_1
Obviously, sqr_N_min_1 becomes 25., and in the 3rd line (-5 * -5) / 25 should become 1., so that the overall result from the 3rd line is exactly 0.. Indeed, this is true for the compiler options "gcc -O3 -march=skylake".
But with "-Ofast" the last line yields -2.081668e-17 instead of 0., and with values of i other than -5 (e.g. 6 or 7) it produces other very small positive or negative deviations from 0..
My question is: Where exactly is the source of this imprecision?
To investigate this, I wrote a small test program in C:
#include <stdint.h> /* int32_t */
#include <stdio.h>

#define MAX_SIZE 10
double W[MAX_SIZE];

int main( int argc, char *argv[] )
{
    volatile int32_t n = 6; /* try 6 7 or argv[1][0]-'0' */
    double *w = W;
    int32_t i = 1 - n;
    const int32_t end = n - 1;
    const double sqr_N_min_1 = (double)i * i;
    /* Here is the crucial part. The loop avoids the compiler replacing it with constants: */
    do {
        *w++ = 1. - ((double)i * i) / sqr_N_min_1;
    } while ( (i += 2) <= end );
    /* Then, show the results (only the 1st and last output line matters): */
    w = W;
    i = 1 - n;
    do {
        fprintf( stderr, "%e\n", *w++ );
    } while ( (i += 2) <= end );
    return( 0 );
}
Godbolt shows me the assembly produced by an "x86-64 gcc9.3" with the option "-Ofast -march=skylake" vs. "-O3 -march=skylake". Please, inspect the five columns of the website (1. source code, 2. assembly with "-Ofast", 3. assembly with "-O3", 4. output of 1st assembly, 5. output of 2nd assembly):
Godbolt site with five columns
As you can see, the differences in the assembly are obvious, but I can't figure out where exactly the imprecision comes from. So the question is: which assembler instruction(s) are responsible for this?
A follow-up question is: Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Comments and another answer have pointed out the specific transformation that's happening in your case, with a reciprocal and an FMA instead of a division.
Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Not in general.
-Ofast is (currently) a synonym for -O3 -ffast-math.
See https://gcc.gnu.org/wiki/FloatingPointMath
Part of -ffast-math is -funsafe-math-optimizations, which as the name implies, can change numerical results. (With the goal of allowing more optimizations, like treating FP math as associative to allow auto-vectorizing the sum of an array with SIMD, and/or unrolling with multiple accumulators, or even just rearranging a sequence of operations within one expression to combine two separate constants.)
This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the -ffast-math sub-options, only the safe ones like -fno-math-errno / -fno-trapping-math. (See How to force GCC to assume that a floating-point expression is non-negative?)
There's no way of formulating your source to avoid all possible problems.
Possibly you could use volatile tmp vars all over the place to defeat optimization between statements, but that would make your code slower than regular -O3 with the default -fno-fast-math. And even then, calls to library functions like sin or log may resolve to versions that assume the args are finite, not NaN or infinity, because of -ffinite-math-only.
GCC issue with -Ofast? points out another effect: isnan() is optimized into a compile-time 0.
From the comments, it seems that, for -O3, the compiler computes 1. - ((double)i * i) / sqr_N_min_1:
Convert i to double and square it.
Divide that by sqr_N_min_1.
Subtract that from 1.
and, for -Ofast, computes it:
Prior to the loop, calculate the reciprocal of sqr_N_min_1.
Convert i to double and square it.
Compute the fused multiply-subtract of 1 minus the square times the reciprocal.
The latter improves speed because it performs the division only once, and multiplication is much faster than division on the target processors. On top of that, the fused operation is faster than a separate multiplication and subtraction.
The error occurs because the reciprocal operation introduces a rounding error that is not present in the original expression (1/25 is not exactly representable in a binary format, while 25/25 of course is). This is why the compiler does not make this optimization when it is attempting to provide strict floating-point semantics.
Additionally, simply multiplying the reciprocal by 25 would erase the error. (This is somewhat by “chance,” as rounding errors vary in complicated ways. 1./25*25 produces 1, but 1./49*49 does not.) But the fused operation produces a more accurate result (it produces the result as if the product were computed exactly, with rounding occurring only after the subtraction), so it preserves the error.
Problem
While running automated CI tests I found some code which breaks if gcc's optimization level is set to -O2. The code should increment a counter if a double value crosses a threshold in either direction.
Workaround
Going down to -O1 or using -ffloat-store option works around the problem.
Example
Here is a small example which shows the same problem.
The update() function should return true whenever a sequence of *pNextState * 1e-6 crosses a threshold of 0.03.
I used call by reference because the values are part of a large struct in the full code.
The idea behind using < and >= is that if a sequence hits the value exactly, the function should return 1 this time and return 0 the next cycle.
test.h:
extern int update(double * pState, double * pNextState);
test.c:
#include "test.h"
int update(double * pState, double * pNextState_scaled) {
    static double threshold = 0.03;
    double oldState = *pState;
    *pState = *pNextState_scaled * 1e-6;
    return oldState < threshold && *pState >= threshold;
}
main.c:
#include <stdio.h>
#include <stdlib.h>
#include "test.h"
int main(void) {
    double state = 0.01;
    double nextState1 = 20000.0;
    double nextState2 = 30000.0;
    double nextState3 = 40000.0;
    printf("%d\n", update(&state, &nextState1));
    printf("%d\n", update(&state, &nextState2));
    printf("%d\n", update(&state, &nextState3));
    return EXIT_SUCCESS;
}
Using gcc with at least -O2 the output is:
0
0
0
Using gcc with -O1, -O0 or -ffloat-store produces the desired output
0
1
0
As I understand the problem from debugging, it arises when the compiler optimizes away the local variable oldState on the stack and instead compares against an intermediate result held in a floating-point register with higher precision (80 bit), while the value *pState is a tiny bit smaller than the threshold.
If the value used for the comparison is stored in 64-bit precision, the logic can't miss the threshold crossing. Because of the multiplication by 1e-6, the result is probably kept in a floating-point register.
Would you consider this a gcc bug?
clang does not show the problem.
I am using gcc version 9.2.0 on an Intel Core i5, Windows and msys2.
Update
It is clear to me that floating-point comparison is not exact, and I would consider the following result as valid:
0
0
1
The idea was that, if (*pState >= threshold) == false in one cycle then comparing the same value (oldstate = *pState) against the same threshold in a subsequent call (*pState < threshold) has to be true.
[Disclaimer: This is a generic, shoot-from-the-hip answer. Floating-point issues can be subtle, and I have not analyzed this one carefully. Once in a while, suspicious-looking code like this can be made to work portably and reliably after all, and per the accepted answer, that appears to be the case here. Nevertheless, the general-purpose answer stands in the, well, general case.]
I would consider this a bug in the test case, not in gcc. This sounds like a classic example of code that's unnecessarily brittle with respect to exact floating-point equality.
I would recommend either:
rewriting the test case, or perhaps
removing the test case.
I would not recommend:
working around it by switching compilers
working around it by using different optimization levels or other compiler options
submitting compiler bug reports [although in this case it appears there was a compiler bug; it needn't be submitted, as it already has been]
I've analysed your code, and my take is that it is sound according to the standard, but you are hit by gcc bug 323, about which you may find more accessible information in the gcc FAQ.
A way to modify your function and render it robust in presence of gcc bug is to store the fact that the previous state was below the threshold instead (or in addition) of storing that state. Something like this:
int update(int* pWasBelow, double* pNextState_scaled) {
    static double const threshold = 0.03;
    double const nextState = *pNextState_scaled * 1e-6;
    int const wasBelow = *pWasBelow;
    *pWasBelow = nextState < threshold;
    return wasBelow && !*pWasBelow;
}
Note that this does not guarantee reproducibility. You may get 0 1 0 in one set-up and 0 0 1 in another, but you'll detect the transition sooner or later.
I am putting this as an answer because I don't think I can do real code in a comment, but @SteveSummit should get the credit - I likely would not have found this without their comment above.
The general advice is: don't do exact comparisons with floating point values, and it seems like that's what this is doing. If a computed value is almost exactly 0.03 but due to internal representations or optimizations, it's ever so slightly off and not exactly, then it's going to look like a threshold crossing.
So one can resolve this by adding an epsilon for how close one can be to the threshold without having considered to cross it yet.
#include <math.h> /* for fabs() */

int update(double * pState, double * pNextState_scaled) {
    static const double threshold = 0.03;
    static const double close_enough = 0.0000001; // or whatever
    double oldState = *pState;
    *pState = *pNextState_scaled * 1e-6;
    // if either value is too close to the threshold, it's not a crossing
    if (fabs(oldState - threshold) < close_enough) return 0;
    if (fabs(*pState - threshold) < close_enough) return 0;
    return oldState < threshold && *pState >= threshold;
}
I imagine you'd have to know your application in order to tune this value appropriately, but a couple of orders of magnitude smaller than the value you're comparing with seems to be in the right neighborhood.
I came across some code online in which a PID controller is implemented for Arduino. I am confused by the implementation. I have a basic understanding of how PID works; however, my sources of confusion are: why is hexadecimal used for m_prevError? What does the value 0x80000000L represent? And why is there a right shift by 10 when calculating the velocity?
// ServoLoop Constructor
ServoLoop::ServoLoop(int32_t proportionalGain, int32_t derivativeGain)
{
    m_pos = RCS_CENTER_POS;
    m_proportionalGain = proportionalGain;
    m_derivativeGain = derivativeGain;
    m_prevError = 0x80000000L;
}

// ServoLoop Update
// Calculates new output based on the measured
// error and the current state.
void ServoLoop::update(int32_t error)
{
    long int velocity;
    char buf[32];
    if (m_prevError != 0x80000000)
    {
        velocity = (error*m_proportionalGain + (error - m_prevError)*m_derivativeGain) >> 10;
        m_pos += velocity;
        if (m_pos > RCS_MAX_POS)
        {
            m_pos = RCS_MAX_POS;
        }
        else if (m_pos < RCS_MIN_POS)
        {
            m_pos = RCS_MIN_POS;
        }
    }
    m_prevError = error;
}
Shifting a binary number to the right by 1 has the same effect as dividing its decimal value by 2 (discarding the remainder); shifting right by 10 therefore divides by 2^10 = 1024. As in any basic control loop, this acts as a scaling of the velocity, so that the returned value is in a range suitable for re-use by other methods.
The L in 0x80000000L declares that value as long. The value 0x80000000 itself serves as an initial (sentinel) value for the error. You should also look over the full program to see how things work and what value gets assigned to something like error.
Shifting to the right has the effect of dividing by a power of two; in this case >> 10 divides by 1024. But a real division would be clearer, and the compiler would optimize it into a shift anyway, so I find this shift ugly.
The intent is to implement some float math without actually using floating-point numbers - it is a kind of fixed-point calculation, where the fractional part is about 10 bits. To understand, assuming (to simplify) that the derivative coefficient is 0, an m_proportionalGain set to 1024 would mean 1, while 512 would mean 0.5. In fact, in the case of proportional=1024 and error=100, the formula would give
100*1024 / 1024 = 100
(gain=1), while proportional=512 would give
100*512 / 1024 = 50
(gain=0.5).
As for the previous error: m_prevError set to 0x80000000 is simply a special value which is checked in the loop to see if there "already is" a previous error. If not, i.e. if m_prevError has the special value, the entire loop body is skipped once; in other words, it serves the purpose of skipping the first update after creation of the object. Not very clever, I suppose; I would prefer to simply set the previous error equal to 0 and skip the check in ::update() completely. Using special values as flags has the problem that sometimes the calculations result in the special value itself - that would be a big bug. If absolutely needed, it is better to use a true flag.
All in all, I think this is a poor PID algorithm, as it completely lacks the integral part; it seems that the variable m_pos is intended for this integrative purpose - it is managed roughly that way, but never used, only set. Nevertheless this algorithm can work; it all depends on the target system and the desired performance: in most situations, this algorithm leaves a residual error.
#define MAX_DIM 100

float a[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float d[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
/*
* I fill these arrays with some values
*/
for (int i = 0; i < MAX_DIM; i += 1) {
    for (int j = 0; j < MAX_DIM; j += 4) {
        for (int k = 0; k < MAX_DIM; k += 4) {
            __m128 result  = _mm_load_ps(&d[i][j]);
            __m128 a_line  = _mm_load_ps(&a[i][k]);
            __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
            __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
            __m128 b_line2 = _mm_loadu_ps(&b[k][j+2]);
            __m128 b_line3 = _mm_loadu_ps(&b[k][j+3]);
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_line0));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_line1));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_line2));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_line3));
            _mm_store_ps(&d[i][j], result);
        }
    }
}
I made the above code to do matrix multiplication using SSE. The code works as follows: I take 4 elements from a row of a, multiply them by 4 elements from a column of b, then move to the next 4 elements in the row of a and the next 4 elements in the column of b.
I get an error, Segmentation fault (core dumped), and I don't really know why.
I use gcc 5.4.0 on ubuntu 16.04.5
Edit :
The segmentation fault was solved by _mm_loadu_ps.
Also, there is something wrong with the logic; I will be grateful if someone helps me find it.
The segmentation fault was solved by _mm_loadu_ps Also there is something wrong with logic...
You're loading 4 overlapping windows on b[k][j+0..7]. (This is why you needed loadu).
Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64, so all four loads come from the same cache line (for performance). Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.
I take 4 elements from row from a multiply it by 4 elements from a column from b
I'm not sure your text description describes your code.
Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of SSE matrix-matrix multiplication (That question is using 32-bit integer, not 32-bit float, but same layout problem. See that for a working loop structure.)
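A sketch of what the corrected loads look like, assuming the intent was the standard row-major broadcast formulation (each scalar a[i][k+m] broadcast against a 4-wide row chunk of b; names and the small N are illustrative, not the asker's actual code):

```c
#include <xmmintrin.h>  /* SSE */

#define N 8  /* any multiple of 4 */

/* d += a * b, all row-major NxN floats. Each a[i][k+m] is broadcast and
 * multiplied by b[k+m][j..j+3] -- four consecutive ROWS of b, not four
 * overlapping windows of one row. */
static void matmul_sse(float a[N][N], float b[N][N], float d[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 4) {
            __m128 result = _mm_loadu_ps(&d[i][j]);
            for (int k = 0; k < N; k += 4) {
                __m128 a_line = _mm_loadu_ps(&a[i][k]);
                __m128 b0 = _mm_loadu_ps(&b[k + 0][j]);
                __m128 b1 = _mm_loadu_ps(&b[k + 1][j]);
                __m128 b2 = _mm_loadu_ps(&b[k + 2][j]);
                __m128 b3 = _mm_loadu_ps(&b[k + 3][j]);
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b0));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b1));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b2));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b3));
            }
            _mm_storeu_ps(&d[i][j], result);
        }
    }
}
```

This uses unaligned loads for simplicity; with the arrays aligned by 64 as suggested above, the aligned variants can be used instead.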
To debug this, put a simple pattern of values into b[]. Like
#include <stdalign.h>
alignas(64) float b[MAX_DIM][MAX_DIM] = {
    0000, 0001, 0002, 0003, 0004, ...,
    0100, 0101, 0102, ...,
    0200, 0201, 0202, ...,
};
// i.e. for (...) b[i][j] = 100 * i + j;
Then when you step through your code in the debugger, you can see what values end up in your vectors.
For your a[][] values, maybe use 90000.0 + 100 * i + j so if you're looking at registers (instead of C variables) you can still tell which values are a and which are b.
Related:
Ulrich Drepper's What Every Programmer Should Know About Memory shows an optimized matmul with cache-blocking, with SSE intrinsics for double-precision. Should be straightforward to adapt for float.
How does BLAS get such extreme performance? (You might want to just use an optimized matmul library; tuning matmul for optimal cache-blocking is non-trivial but important)
Matrix Multiplication with blocks
Poor maths performance in C vs Python/numpy has some links to other questions
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
I am looking for a simple portable implementation of log1p. I have come across two implementations.
The first one appears as Theorem 4 here: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
An implementation of the above:
double log1p(double p)
{
    volatile double y = p;
    return ( (1 + y) == 1 ) ? y : y * ( log(1 + y) / ( (1 + y) - 1 ) );
}
The second one is in GSL http://fossies.org/dox/gsl-1.16/log1p_8c_source.html
double gsl_log1p (const double x)
{
    volatile double y, z;
    y = 1 + x;
    z = y - 1;
    return log(y) - (z - x) / y; /* cancels errors with IEEE arithmetic */
}
Is there a reason to prefer one over the other?
I have tested these two approaches using a log() implementation with a maximum error of < 0.51 ulps, comparing to a multi-precision arithmetic library. Using that log() implementation as a building block for the two log1p() variants, I found the maximum error of Goldberg's version to be < 2.5 ulps, while the maximum error in the GSL variant was < 1.5 ulps. This indicates that the latter is significantly more accurate.
In terms of special case handling, the Goldberg variant showed one mismatch, in that it returns a NaN for an input of +infinity, whereas the correct result is +infinity. There were three mismatches for special cases with the GSL implementation: Inputs of -1 and +infinity delivered a NaN, while the correct results should be -infinity and +infinity, respectively. Also, for an input of -0 this code returned +0, whereas the correct result is -0.
It is difficult to assess performance without knowledge of the distribution of the inputs. As others have pointed out in comments, Goldberg's version is potentially faster when many arguments are close to zero, as it skips the expensive call to log() for such arguments.
There is no sure answer between the two. The GNU Scientific Library is quite robust, well used, and actively supported across all late versions of gcc. You are not likely to run across too many surprises with its use. As for any other code you scrape up, there is absolutely no reason not to use it after you have validated its logic and are satisfied with its level/manner of error checking. The small downside to GSL is that it is another library you must carry around and, depending on how widespread the use of your code will be, it can provide more of a challenge for users on other platforms. That is about the size of it.
The best piece of code is the one that most closely meets the requirements of your project.