How do I use the compiler intrinsic __fmul_?

I am writing a massively parallel GPU application using CUDA. I have been optimizing it by hand. I received a 20% performance increase with __fdividef_(x, y), and according to the CUDA C Programming Guide (section C.2.1), using similar functions for multiplication and addition is also beneficial.
The function is documented as __fmul_[rn,rz,ru,rd](x,y).
__fdividef(x,y) was not documented with brackets in its name. What do those brackets mean?
If I run the simple code:
int t = __fmul_(5,4);
I get a compiler error saying __fmul_ is undefined. I have the CUDA runtime included, so I don't think it is a setup issue; rather, it's something to do with those square brackets. How do I correctly use this function? Thank you.
EDIT: I should clarify that the compiler is the CUDA compiler, NVCC.

The brackets mean you must pick one of the rounding-mode suffixes; there is no function named __fmul_ itself. The available signatures are __fmul_rn, __fmul_rz, __fmul_ru, and __fmul_rd.

CUDA Programming Guide explains the suffixes:
_rd: round down.
_rn: round to nearest even.
_ru: round up.
_rz: round towards zero.
See CUDA's Single Precision Intrinsics documentation for details on these functions.
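A minimal sketch of calling one of the suffixed variants (the kernel and variable names here are made up for illustration). Note that these intrinsics are device-code-only and return float, so the result should be stored in a float rather than an int as in the question's snippet:

```cuda
__global__ void mul_kernel(float *out, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* __fmul_rn: single-precision multiply, rounded to nearest even.
           Only callable from device code; returns float. */
        out[i] = __fmul_rn(a[i], b[i]);
    }
}
```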

Related

How do BLAS/cuBLAS treat the factors alpha and beta in their routines?

Many linear algebra routines have constants such as alpha and beta as arguments. For example cublas?GEMM performs the following operation:
C := alpha*op(A)*op(B) + beta*C
Suppose I set beta to 0.
Will the cuBLAS still perform an unnecessary scalar-matrix multiplication and matrix-matrix addition? What about other libraries such as BLAS/LAPACK/MKL?
If the unnecessary operations are skipped: do I need to do anything to ensure this, or is it avoided automatically?
Are there other values of alpha/beta for which there are optimizations? For example, suppose I instead set beta=1: will the scaling-by-beta operation be skipped?
Why do the cuBLAS documentation and the BLAS documentation specify these factors in DGEMM as const double, while in examples a plain double value is passed to them? What's the difference?
I would be surprised if these libraries did waste operations in the manner I described, but I didn't find an explicit discussion about it anywhere other than the cuBLAS documentation mentioning:
if beta == 0 then C does not have to be a valid input.
Even the reference implementation optimises here. No serious implementation will do the operation regardless of the values of alpha or beta.
No it will not.
N/A
Just leave beta=0. to ignore C, and beta=1. to skip the scaling.
The reason is compatibility with FORTRAN. There were no const variables in FORTRAN prior to F90. The BLAS interfaces were defined before F90, and everybody sticks to the conventions. If you want a C interface with proper keywords, look at the C-specific interfaces such as cblas_dgemm.
Here is the reference implementation of DGEMM:
http://www.netlib.org/lapack/explore-html/d7/d2b/dgemm_8f_source.html. Look for "Quick return if possible." and "if (alpha.eq.zero)", etc.
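To make the quick-return behavior concrete, here is a hypothetical C sketch of the checks such an implementation performs before doing any arithmetic (the function name and row-major square-matrix layout are my simplifications; the real Netlib DGEMM is Fortran and more general):

```c
#include <stddef.h>

/* Sketch of DGEMM-style quick-return handling: C := alpha*A*B + beta*C
   for square, row-major, non-transposed matrices. */
static void gemm_sketch(size_t n, double alpha, const double *a,
                        const double *b, double beta, double *c)
{
    /* "Quick return if possible": nothing to do at all. */
    if (n == 0 || (alpha == 0.0 && beta == 1.0))
        return;

    /* alpha == 0 degenerates to scaling C; A and B are never read. */
    if (alpha == 0.0) {
        for (size_t i = 0; i < n * n; i++)
            c[i] = (beta == 0.0) ? 0.0 : beta * c[i];
        return;
    }

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            /* beta == 0: C is output-only and never read, which is why
               it need not hold valid data on entry. */
            c[i * n + j] = alpha * acc
                         + ((beta == 0.0) ? 0.0 : beta * c[i * n + j]);
        }
}
```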

How does an AVR perform floating point arithmetic

I'm trying to implement support for double and float, and the corresponding basic arithmetic, on a CPU without an FPU.
I know that this is possible on all AVR ATmega controllers, and an ATmega also has no FPU. So here comes the question: how does it work? Are there any suggestions for literature or links with explanations and examples?
In the best case, I'd like to support code like this:
double twice ( double x )
{
    return x*x;
}
Many thanks in advance,
Alex
Here are AVR related links with explanations and examples for implementing soft double:
You will find one double-precision floating-point lib here.
Another one can be found in the last message here.
Double is very slow, so if speed is a concern, you might opt for fixed-point math. Just read my messages in this thread:
This post may be interesting for you: Floating point calculations in a processor with no FPU
As stated:
Your compiler may provide support, or you may need to roll your own.
There are freely-available implementations, too.
If it's for an ATmega, you probably don't have to write anything yourself. The already-available libraries are probably optimized much further than you could manage on your own. If you need more performance, you could consider converting from floating point to fixed point; you should consider this anyway. If you can get the job done in fixed point, stay away from floating point.
AVR toolchains use avr-libc, which you can download and examine.
There is a math library, but that is not what you are looking for. That contains the standard functions defined in math.h.
Instead, you need the functions that perform multiplication and the like. These are also in avr-libc, under fplib, and are written in assembly language. But the user doesn't call them directly; when the compiler comes across a multiplication involving floats, it inserts a call to the correct function.
You can browse through the AVR fplib to get an idea of what to do, but you are going to have to write your own assembly or bit-twiddling code for your processor.
You need to find out which floating-point standard your processor and language use (IEEE 754, perhaps?). You'll also need to know whether the system is little-endian or big-endian.
I am assuming your system doesn't have a C compiler already; otherwise, all the floating-point operations would already be implemented, and that twice() function (which actually squares) would work just fine as it is.
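As a rough illustration of what such a library does internally, here is a minimal C sketch of a software single-precision multiply. It handles normal numbers only: no NaN/Inf/subnormals, and it truncates instead of rounding to nearest, both of which a real soft-float library must handle.

```c
#include <stdint.h>
#include <string.h>

/* Software IEEE-754 single-precision multiply (normals only), the kind of
   routine an FPU-less target links in for every float '*'. */
static float softmul(float x, float y)
{
    uint32_t xb, yb;
    memcpy(&xb, &x, sizeof xb);
    memcpy(&yb, &y, sizeof yb);

    uint32_t sign = (xb ^ yb) & 0x80000000u;             /* sign of product */
    int32_t  exp  = (int32_t)((xb >> 23) & 0xFFu)        /* add exponents,  */
                  + (int32_t)((yb >> 23) & 0xFFu) - 127; /* drop one bias   */
    uint64_t mx = (xb & 0x7FFFFFu) | 0x800000u;          /* implicit lead 1 */
    uint64_t my = (yb & 0x7FFFFFu) | 0x800000u;

    uint64_t prod = mx * my;            /* 48-bit significand product */
    if (prod & (1ull << 47)) {          /* product in [2,4): renormalize */
        prod >>= 1;
        exp++;
    }

    uint32_t rb = sign | ((uint32_t)exp << 23)
                | ((uint32_t)(prod >> 23) & 0x7FFFFFu);
    float r;
    memcpy(&r, &rb, sizeof r);
    return r;
}
```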

Vectorized Trig functions in C?

I'm looking to calculate highly parallelized trig functions (in block of like 1024), and I'd like to take advantage of at least some of the parallelism that modern architectures have.
When I compile a block
for (int i = 0; i < SIZE; i++) {
    arr[i] = sin((float)i / 1024);
}
GCC won't vectorize it, and says
not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);
Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.
With just a simple Taylor series up to the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers; with a 9th-order series, only a single bit off for the last two of 1600 values, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mention of any libraries or the like.
A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?
EDIT: I found a library called SLEEF (http://shibatch.sourceforge.net/), described in this paper, which uses SIMD instructions to calculate several elementary functions. It uses SSE- and AVX-specific code, but I don't think it would be hard to turn it into standard C loops.
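The Taylor-series approach described above can be sketched like this (a minimal version: the exact polynomial and its useful range are my assumptions, and real libraries perform range reduction first):

```c
/* Vectorizable 11th-order odd-polynomial approximation of sin, in Horner
   form. Because the loop body is pure arithmetic (no libm call), GCC can
   auto-vectorize it at -O3, unlike a loop that calls sinf(). Reasonable
   accuracy only on roughly [-pi, pi]; no range reduction is done. */
void sin_taylor(const float *x, float *out, int n)
{
    for (int i = 0; i < n; i++) {
        float v  = x[i];
        float v2 = v * v;
        /* sin v ~= v - v^3/3! + v^5/5! - v^7/7! + v^9/9! - v^11/11! */
        out[i] = v * (1.0f + v2 * (-1.0f/6.0f + v2 * (1.0f/120.0f
                 + v2 * (-1.0f/5040.0f + v2 * (1.0f/362880.0f
                 + v2 * (-1.0f/39916800.0f))))));
    }
}
```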
Since you said you were using GCC, it looks like there are some options:
http://gruntthepeon.free.fr/ssemath/
This uses SSE and SSE2 instructions to implement it.
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
This has an alternate implementation. Some of the comments are pretty good.
That said, I'd probably look into GPGPU for a solution, maybe writing it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that might make that easier.
https://code.google.com/p/slmath/
https://code.google.com/p/thrust/
Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.
What platform are you using? Many libraries of this sort already exist:
Intel provides the Vector Math Library (VML) with icc.
Apple provides the vForce library as part of the Accelerate framework.
HP provides its own Vector Math Library for Itanium (and many other architectures, too).
Sun provided libmvec with their compiler tools.
...
Instead of the Taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision in fewer steps.
My answer was to create my own library to do exactly this, called vectrig: https://github.com/jeremysalwen/vectrig

optimizing with IEEE floating point - guaranteed mathematical identities?

I am having some trouble with IEEE floating point rules preventing compiler optimizations that seem obvious. For example,
char foo(float x) {
    if (x == x)
        return 1;
    else
        return 0;
}
cannot be optimized to just return 1 because NaN == NaN is false. Okay, fine, I guess.
However, I want to write code such that the optimizer can actually fix things up for me. Are there mathematical identities that hold for all floats? For example, I would be willing to write !(x - x) if it meant the compiler could assume it held all the time (though that also isn't the case).
I see some reference to such identities on the web, for example here, but I haven't found any organized information, including in a light scan of the IEEE 754 standard.
It'd also be fine if I could get the optimizer to assume isnormal(x) without generating additional code (in gcc or clang).
Clearly I'm not actually going to write (x == x) in my source code, but I have a function that's designed for inlining. The function may be declared as foo(float x, float y), but often x is 0, or y is 0, or x and y are both z, etc. The floats represent onscreen geometric coordinates. These are all cases where if I were coding by hand without use of the function I'd never distinguish between 0 and (x - x), I'd just hand-optimize stupid stuff away. So, I really don't care about the IEEE rules in what the compiler does after inlining my function, and I'd just as soon have the compiler ignore them. Rounding differences are also not very important since we're basically doing onscreen drawing.
I don't think -ffast-math is an option for me, because the function appears in a header file, and it is not appropriate that the .c files that use the function compile with -ffast-math.
Another reference that might be of some use for you is a really nice article on floating-point optimization in Game Programming Gems volume 2, by Yossarian King. You can read the article here. It discusses the IEEE format in quite detail, taking into account implementations and architecture, and provides many optimization tricks.
I think you are always going to struggle to make computer floating-point arithmetic behave like mathematical real-number arithmetic, and I suggest that you don't try. I suggest you are making a type error in trying to compare two fp numbers for equality. Since fp numbers are, in the overwhelming majority, approximations, you should accept this and use approximate equality as your test.
Computer integers exist for equality testing of numerical values.
Well, that's what I think; go ahead and fight the machine (well, all the machines, actually) if you wish.
Now, to answer some parts of your question:
-- for every mathematical identity you are familiar with from real-number arithmetic, there are counterexamples in the domain of floating-point numbers, whether IEEE or otherwise;
-- 'clever' programming almost always makes it more difficult for a compiler to optimise code than straightforward programming;
-- it seems that you are doing some graphics programming: in the end the coordinates of points in your conceptual space are going to be mapped to pixels on a screen; pixels always have integer coordinates; your translation from conceptual space to screen space defines your approximate-equality function
Regards
Mark
If you can assume that floating-point numbers used in this module will not be Inf/NaN, you can compile it with -ffinite-math-only (in GCC). This may "improve" the codegen for examples like the one you posted.
You could compare for bitwise equality. Although you might get bitten by some values that are equivalent but bitwise different, it will catch all the cases of true equality you mentioned. And I am not sure the compiler will recognize what you are doing and remove it when inlining (which I believe is what you are after), but that is easily checked.
What happened when you tried it the obvious way and profiled it, or examined the generated asm?
If the function is inlined with values known at the call site, the optimizer has this information available. For example: foo(0, y).
You may be surprised at the work you don't have to do, but at the very least profiling or looking at what the compiler actually does with the code will give you more information and help you figure out where to proceed next.
That said, if you know certain things that the optimizer can't figure out itself, you can write multiple versions of the function, and specify the one you want to call. This is something of a hassle, but at least with inline functions they will all be specified together in one header. It's also quite a bit easier than the next step, which is using inline asm to do exactly what you want.
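A hypothetical sketch of that multiple-versions idea (the lerp functions here are invented for illustration): the caller, rather than the IEEE-constrained optimizer, encodes the knowledge that an argument is zero.

```c
/* The general lerp cannot legally be simplified by an IEEE-conforming
   compiler when a == 0, because a + t*(b - a) and t*b can differ (e.g.
   for signed zeros). A specialized inline variant applies that knowledge
   explicitly at call sites where the caller knows a == 0. */
static inline float lerp(float a, float b, float t)
{
    return a + t * (b - a);
}

/* Caller knows a == 0: the add and subtract disappear entirely. */
static inline float lerp_from_zero(float b, float t)
{
    return t * b;
}
```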

initialize a variable statically (at compile time)

1) I've got many constants in my C algo.
2) my code works both in floating-point and fixed-point.
Right now, these constants are initialized by a function, float2fixed, which does nothing in floating-point mode, while in fixed-point mode it finds their fixed-point representation. For instance, 0.5f stays 0.5f when working in floating point, whereas it uses the pow() routine and becomes 32768 when working in fixed point with a Qx.16 representation.
That's easy to maintain, but it actually takes a lot of time to compute these constants in fixed-point mode (pow is a floating-point function). In C++, I'd use some metaprogramming so the compiler computes these values at compile time and there's no hit at run time. But in C, that's not possible. Or is it? Does anybody know of such a trick? Is any compiler clever enough to do that?
Looking forward to any answers.
A
Rather than using (unsigned)(x*pow(2,16)) to do your fixed-point conversion, write it as (unsigned)(0.5f * (1 << 16)).
This should be acceptable as a compile-time constant expression, since it involves only built-in operators.
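That suggestion can be wrapped in a macro; here is a sketch (the macro name mirrors the question's float2fixed, and the +/-0.5 rounding adjustment is my addition):

```c
#include <stdint.h>

/* Qx.16 conversion using only built-in operators, so it is a valid
   compile-time constant expression: no pow(), no run-time init function.
   The conditional +/-0.5 rounds to nearest instead of truncating. */
#define FLOAT2FIXED(x) ((int32_t)((x) * (1 << 16) + ((x) >= 0 ? 0.5 : -0.5)))

static const int32_t HALF_Q16 = FLOAT2FIXED(0.5f);        /* 32768 */
static const int32_t PI_Q16   = FLOAT2FIXED(3.14159265f); /* 205887 */
```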
When using fixed point, you can write a program that takes your floating-point values and converts them into correct, constant initializers for the fixed-point type, so you effectively add a step to the compilation that generates the fixed-point values.
One advantage of this is that you can then define and declare your constants with const, so they won't change at run time; with the initialization functions, of course, the values have to be modifiable because they are calculated at startup.
I mean, write a simple program that scans for formulaic lines such as:
const double somename = 3.14159;
it would read that and generate:
const fixedpoint_t somename = { ...whatever is needed... };
You design the operation to make it easy to manage for both notations; maybe your converter always reads the file and sometimes rewrites it. A make rule for the generated file might look like:
datafile.c: datafile.constants converter
    converter datafile.constants > datafile.c
In plain C, there's not much you can do. You need to do the conversion at some point, and the compiler doesn't give you any way to call user-provided functions at compile time. Theoretically, you could try to coax the preprocessor into doing it for you, but that's the quick road to total insanity (you'd have to implement pow() in macros, which is pretty hideous).
Some options I can think of:
Maintain a persistent cache on disk. At least then it'd only be slow once, though you still have to load it, make sure it's not corrupt, etc.
As mentioned in another comment, use template metaprogramming anyway and compile with a C++ compiler. Most C works just fine (arguably better) with a C++ compiler.
Hmm, I guess that's about all I can think of. Good luck.
Recent versions of GCC (around 4.3) added the ability to use GMP and MPFR to do some compile-time optimizations, evaluating more complex functions whose arguments are constant. That approach leaves your code simple and portable, and trusts the compiler to do the heavy lifting.
Of course, there are limits to what it can do, and it is hard to know whether it optimized a given instance without looking at the assembly. But it might be worth checking out. Here's a link to the description in the changelog.
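For instance (a sketch; whether folding actually happens depends on the GCC version, its GMP/MPFR support, and the optimization flags), a constant-argument pow() call like the question's is a candidate for compile-time evaluation:

```c
#include <math.h>

/* With a GCC built against GMP/MPFR (>= 4.3) and optimization enabled, a
   libm call whose arguments are all constants is typically evaluated at
   compile time, so no pow() call appears in the generated code. Whether
   that happened in a given build is only verifiable by reading the asm. */
static double half_q16(void)
{
    return 0.5 * pow(2.0, 16.0);   /* candidate for compile-time folding */
}
```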
