Just curious how these are implemented. I can't see where I would start. Do they work directly on the float's/double's bits?
Also where can I find the source code of functions from math.h? All I find are either headers with prototypes or files with functions that call other functions from somewhere else.
EDIT: Part of the message was lost after editing the title. What I meant in particular were the ceil() and floor() functions.
If you're interested in seeing source code for algorithms for this kind of thing, then fdlibm - the "Freely Distributable libm", originally from Sun, and the reference implementation for Java's math libraries - might be a good place to start. (For casual browsing, it's certainly a better place to start than GNU libc, where the pieces are scattered around various subdirectories - math/, sysdeps/ieee754/, etc.)
fdlibm assumes that it's working with an IEEE 754 format double, and if you look at the implementations - for example, the core of the implementation of log() - you'll see that they use all sorts of clever tricks, often using a mixture of both standard double arithmetic, and knowledge of the bit representation of a double.
(And if you're interested in algorithms for supporting basic IEEE 754 floating point arithmetic, such as might be used for processors without hardware floating point support, take a look at John R. Hauser's SoftFloat.)
As for your edit: in general, ceil() and floor() might well be implemented in hardware; for example, on x86, GCC (with optimisations enabled) generates code using the frndint instruction with appropriate fiddling of the FPU control word to set the rounding mode. But fdlibm's pure software implementations (s_ceil.c, s_floor.c) do work using the bit representation directly.
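For a concrete sense of what "using the bit representation directly" means, here is a simplified sketch in the spirit of fdlibm's s_floor.c. It is not fdlibm's actual code: it skips the inexact-flag bookkeeping and only handles NaN/Inf by falling through the "already an integer" case.

#include <stdint.h>
#include <string.h>

/* Simplified floor() sketch: use the unbiased exponent to find how many
   mantissa bits lie below the binary point, then clear them (bumping the
   value down by one first if it is negative and not already integral). */
double my_floor(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                 /* grab the raw IEEE-754 bits */

    int exp = (int)((bits >> 52) & 0x7FF) - 1023;   /* unbiased exponent */

    if (exp < 0)                                    /* |x| < 1 */
        return (bits >> 63) ? (x == 0.0 ? x : -1.0) : 0.0;
    if (exp >= 52)                                  /* already integral (or Inf/NaN) */
        return x;

    uint64_t frac = ((uint64_t)1 << (52 - exp)) - 1;
    if ((bits & frac) == 0)                         /* no fractional part */
        return x;
    if (bits >> 63)                                 /* negative: round toward -infinity */
        bits += (uint64_t)1 << (52 - exp);
    bits &= ~frac;                                  /* clear the fraction bits */

    memcpy(&x, &bits, sizeof x);
    return x;
}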
math.h is part of the Standard C Library.
If you are interested in source code, the GNU C Library (glibc) is available to inspect.
EDIT TO ADD:
As others have said, math functions are typically implemented at the hardware level.
Basic operations like addition and division are almost always implemented by machine instructions. The exceptions are mostly small processors, like the 8048 family, which use a library to implement the operations for which there is no simple machine instruction sequence.
Math functions like sin(), sqrt(), log(), etc. are almost always implemented in the runtime library. A few rare CPUs, like the Cray, have a square root instruction.
Tell us which particular implementation (gcc, MSVC, etc./Mac, Linux, etc.) you are using and someone will direct you precisely where to look.
On many platforms (such as any modern x86-compatible), many of the maths functions are implemented directly in the floating-point hardware (see for example http://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions). Not all of them are used, though (as I learnt through comments to other answers here). But, for instance, the sqrt library function is often implemented directly in terms of the SSE hardware instruction.
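To illustrate that last point (a sketch of my own, not what any particular libm literally contains): with SSE, a single-precision square root is one hardware instruction, reachable from C via an intrinsic.

#include <xmmintrin.h>

/* Essentially what sqrtf boils down to here: a single SQRTSS instruction. */
float hw_sqrtf(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}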
For some ideas on how the underlying algorithms work, you might try reading Numerical Recipes, which is available as PDFs somewhere.
A lot of it is done on processors these days.
The chip I cut my teeth on didn't even have a multiply instruction (z80)
We had to approximate stuff using the concept of the Taylor Series.
About 1/2 the way down this page, you can see how sin and cos are approximated.
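To make the idea concrete, here is a small sketch of my own (not the code from that page): the first few Taylor terms for sin(x), which is only reasonable for small |x|; a real routine would first reduce the argument into a narrow range around zero.

/* sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!, evaluated Horner-style. */
double taylor_sin(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0 * (1.0 - x2 / 42.0)));
}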
While modern CPUs do have hardware implementations of common transcendental functions like sin, cos, etc., they are rarely used as is. It may be for portability reasons, speed, accuracy, etc. Instead, approximation algorithms are used.
Related
I'm trying to implement support for double and float and the corresponding basic arithmetic on a CPU without an FPU.
I know that it is possible on all AVR ATmega controllers. An ATmega also has no FPU. So here comes the question: how does it work? Are there any suggestions for literature or links with explanations and examples?
In the best case I will provide support for code like this:
double twice ( double x )
{
    return x*x;
}
Many thanks in advance,
Alex
Here are AVR related links with explanations and examples for implementing soft double:
You will find one double floating point lib here.
Another one can be found in the last message here.
Double is very slow, so if speed is a concern, you might opt for fixed-point math. Just read my messages in this thread:
This post may be interesting for you: Floating point calculations in a processor with no FPU
As stated:
Your compiler may provide support, or you may need to roll your own.
There are freely-available implementations, too.
If it's for an ATmega, you probably don't have to write anything yourself. The already available libraries are probably optimized much further than you could manage yourself. If you need more performance, you could consider converting the floating-point values to fixed point. You should consider this anyway: if you can get the job done in fixed point, you should stay away from floating point.
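To give a flavour of what the fixed-point alternative looks like, here is a toy sketch (the Q8.8 format and names are made up for illustration, not any particular library): the value is an integer scaled by 256, and a multiply just needs a wider intermediate and a shift.

#include <stdint.h>

typedef int16_t q8_8;                       /* 8 integer bits, 8 fraction bits */

static q8_8 q_mul(q8_8 a, q8_8 b)
{
    return (q8_8)(((int32_t)a * b) >> 8);   /* widen, multiply, scale back down */
}

/* Example: 1.5 * 2.25 = 3.375  ->  q_mul(384, 576) == 864, i.e. 3.375 * 256. */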
Avr uses AVR Libc, which you can download and examine.
There is a math library, but that is not what you are looking for. That contains the standard functions defined in math.h.
Instead, you need functions that perform multiplication and things like that. These are also in Libc, under fplib, and written in assembly language. But the user doesn't call them directly. Instead, when the compiler comes across a multiplication involving floats, it will choose the correct function to insert.
You can browse through the AVR fplib to get an idea of what to do, but you are going to have to write your own assembly or bit-twiddling code for your processor.
You need to find out what standard your processor and language are using for floating point. IEEE, perhaps? And you'll also need to know if the system is little-endian or big-endian.
I am assuming your system doesn't have a C compiler already. Otherwise, all the floating point operations would already be implemented. That "twice()" function (actually square()) would work just fine as it is.
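If you do end up writing your own bit-twiddling code, the first building block tends to look like this sketch (the struct and names are my own invention, not AVR Libc's): unpack an IEEE-754 single into sign, exponent and mantissa, after which multiplication is "multiply the mantissas, add the exponents, renormalize and round".

#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;       /* 0 or 1 */
    int32_t  exponent;   /* unbiased */
    uint32_t mantissa;   /* implicit leading 1 restored (normal numbers only) */
} unpacked_float;

static unpacked_float unpack(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* grab the raw IEEE-754 bits */

    unpacked_float u;
    u.sign     = bits >> 31;
    u.exponent = (int32_t)((bits >> 23) & 0xFF) - 127;
    u.mantissa = (bits & 0x7FFFFF) | 0x800000;
    return u;
}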
I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code.
To my knowledge, any x86 architecture since the Intel 8087 uses a FPU unit prepared to handle IEEE-754 floating point numbers, and I cannot see any reason why the result would be different in different architectures. However, if they were different (namely due to different compiler or different optimization level), would there be some way to produce bit-exact results by just configuring the compiler?
Table of contents:
C/C++
asm
Creating real-life software that achieves this.
In C or C++:
No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.
(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs came with an FPU, but in 2016 that's Deathstation 9000 territory.)
related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):
Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.
If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.
First problem: only the "basic operations" + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5 ulp of error, correctly rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.
Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.
C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
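As a small illustration of that choice (a sketch; note that some compilers ignore the pragma and control contraction with flags such as -ffp-contract instead):

#include <math.h>

#pragma STDC FP_CONTRACT OFF        /* forbid contracting a*b + c into an FMA */

double mac_rounded(double a, double b, double c)
{
    return a * b + c;               /* a*b must now be rounded before the add */
}

double mac_fused(double a, double b, double c)
{
    return fma(a, b, c);            /* explicitly ask for the single-rounding FMA */
}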
On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)
So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to a 53- or 24-bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
In assembly:
With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.
Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.
I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.
Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).
SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.
In real life:
So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...
As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.
However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.
In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.
Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.
If the compiler and architecture is compliant to IEEE standards, yes.
For instance, gcc is IEEE compliant if configured properly. If you use the -ffast-math flag, it will not be IEEE compliant.
See http://www.validlab.com/goldberg/paper.pdf page 25.
If you want to know exactly what exactness you can rely on when using an IEEE 754-1985 hardware/compiler pair, you need to purchase the standard paper on the IEEE site. Unfortunately, this is not publicly available.
I wrote an Ansi C compiler for a friend's custom 16-bit stack-based CPU several years ago but I never got around to implementing all the data types. Now I would like to finish the job so I'm wondering if there are any math libraries out there that I can use to fill the gaps. I can handle 16-bit integer data types since they are native to the CPU and therefore I have all the math routines (i.e. +, -, *, /, %) done for them. However, since his CPU does not handle floating point, I have to implement floats/doubles myself. I also have to implement the 8-bit and 32-bit data types (both integer and floats/doubles). I'm pretty sure this has been done and redone many times and since I'm not particularly looking forward to reinventing the wheel I would appreciate it if someone would point me at a library that can help me out.
Now I was looking at GMP but it seems to be overkill (library must be absolutely huge, not sure my custom compiler would be able to handle it) and it takes numbers in the form of strings which would be wasteful for obvious reasons. For example :
mpz_set_str(x, "7612058254738945", 10);
mpz_set_str(y, "9263591128439081", 10);
mpz_mul(result, x, y);
This seems simple enough, I like the api... but I would rather pass in an array rather than a string. For example, if I wanted to multiply two 32-bit longs together I would like to be able to pass it two arrays of size two where each array contains two 16-bit values that actually represent a 32-bit long and have the library place the output into an output array. If I needed floating point then I should be able to specify the precision as well.
This may seem like asking for too much but I'm asking in the hopes that someone has seen something like this.
Many thanks in advance!
Let's divide the answer.
8-bit arithmetic
This one is very easy. In fact, C already talks about this under the term "integer promotion". This means that if you have 8-bit data and you want to do an operation on them, you simply pad them with zero (or one if signed and negative) to make them 16-bit. Then you proceed with the normal 16-bit operation.
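In code, the padding is just a widening (sign-extending) conversion; a tiny sketch:

#include <stdint.h>

int8_t add8(int8_t a, int8_t b)
{
    int16_t wa = a;                 /* sign-extended to the native 16-bit width */
    int16_t wb = b;
    return (int8_t)(wa + wb);       /* do the 16-bit add, truncate back to 8 bits */
}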
32-bit arithmetic
Note: as far as the standard is concerned, you don't really need to have 32-bit integers.
This could be a bit tricky, but it is still not worth using a library for. For each operation, you would need to take a look at how you learned to do them in elementary school in base 10, and then do the same in base 2^16 for 2-digit numbers (each digit being one 16-bit integer). Once you understand the analogy with simple base-10 math (and hence the algorithms), you would need to implement them in the assembly of your CPU.
This basically means loading the most significant 16 bits in one register, and the least significant in another register. Then follow the algorithm for each operation and perform it. You would most likely need to get help from the overflow and other flags.
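Written out in C rather than assembly (the struct and names are made up for illustration; on the real CPU this would be an add followed by an add-with-carry using the carry flag), 32-bit addition from two 16-bit "digits" looks like this:

#include <stdint.h>

typedef struct { uint16_t lo, hi; } u32_pair;   /* a 32-bit value as two 16-bit digits */

static u32_pair add32(u32_pair a, u32_pair b)
{
    u32_pair r;
    r.lo = (uint16_t)(a.lo + b.lo);
    uint16_t carry = (r.lo < a.lo);             /* low half wrapped around -> carry out */
    r.hi = (uint16_t)(a.hi + b.hi + carry);     /* propagate the carry, like base-10 long addition */
    return r;
}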
Floating point arithmetic
Note: as far as the standard is concerned, you don't really need to conform to IEEE 754.
There are various libraries already written for software emulated floating points. You may find this gcc wiki page interesting:
GNU libc has a third implementation, soft-fp. (Variants of this are also used for Linux kernel math emulation on some targets.) soft-fp is used in glibc on PowerPC --without-fp to provide the same soft-float functions as in libgcc. It is also used on Alpha, SPARC and PowerPC to provide some ABI-specified floating-point functions (which in turn may get used by GCC); on PowerPC these are IEEE quad functions, not IBM long double ones.
Performance measurements with EEMBC indicate that soft-fp (as speeded up somewhat using ideas from ieeelib) is about 10-15% faster than fp-bit and ieeelib about 1% faster than soft-fp, testing on IBM PowerPC 405 and 440. These are geometric mean measurements across EEMBC; some tests are several times faster with soft-fp than with fp-bit if they make heavy use of floating point, while others don't make significant use of floating point. Depending on the particular test, either soft-fp or ieeelib may be faster; for example, soft-fp is somewhat faster on Whetstone.
One answer could be to take a look at the source code for glibc and see if you could salvage what you need.
I'm working on a micro-controller that contains access to floating-point operations.
I need to make use of a power function. The problem is, there isn't enough memory to support the pow and sqrt function. This is because the microcontroller doesn't support FP operations natively, and produces a large number of instructions to use them. I can still multiply and divide floating point numbers.
Architecture: Freescale HCS12 (16-bit)
If you mentioned the architecture, you might get a more specific answer.
The linux kernel still has the old x87 IEEE-754 math emulation library for i386 and i486 processors without a hardware floating point unit, under: arch/x86/math-emu/
There are a lot of resources online for floating point routines implemented for PIC micros, and AVR libc has a floating point library - though it's in AVR assembly.
glibc has implementations for pow functions in sysdeps/ieee754. Obviously, the compiler must handle the elementary floating point ops using hardware instructions or emulation / function calls.
Make your own function that multiplies repeatedly in a loop.
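For example, a minimal version of that suggestion, assuming the exponent is an integer (the name powi is my own):

/* Integer powers only: repeated multiplication, plus one divide for negative exponents. */
float powi(float base, int exp)
{
    float result = 1.0f;
    int n = exp < 0 ? -exp : exp;

    while (n--)
        result *= base;

    return exp < 0 ? 1.0f / result : result;
}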
Are the old tricks (lookup table, approx functions) for creating faster implementations of sqrt() still useful, or is the default implementation as fast as it is going to get with modern compilers and hardware?
Rule 1: profile before optimizing
Before investing any effort in the belief that you can beat the optimizer, you must profile everything and discover where the bottleneck really lies. In general, it is unlikely that sqrt() itself is your bottleneck.
Rule 2: replace the algorithm before replacing a standard function
Even if sqrt() is the bottleneck, then it is still reasonably likely that there are algorithmic approaches (such as sorting distances by length squared which is easily computed without a call to any math function) that can eliminate the need to call sqrt() in the first place.
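For instance, a toy sketch of that idea: to decide which of two points is closer, compare squared distances and never call sqrt() at all (sqrt is monotonic, so the comparison comes out the same).

typedef struct { double x, y; } point;

static double dist2(point a, point b)       /* squared distance: no sqrt needed */
{
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

static int first_is_closer(point p, point a, point b)
{
    return dist2(p, a) < dist2(p, b);
}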
What the compiler does for you if you do nothing else
Many modern C compilers are willing to inline CRT functions at higher optimization levels, making the natural expression including calls to sqrt() as fast as it needs to be.
In particular, I checked MinGW gcc v3.4.5 and it replaced a call to sqrt() with inline code that shuffled the FPU state and at the core used the FSQRT instruction. Thanks to the way that the C standard interacts with IEEE 754 floating point, it did have to follow the FSQRT with some code to check for exceptional conditions and a call to the real sqrt() function from the runtime library so that floating point exceptions can be handled by the library as required by the standard.
With sqrt() inline and used in the context of a larger all-double expression, the result is as efficient as possible given the constraints of standards compliance and preservation of full precision.
For this (very common) combination of compiler and target platform and given no knowledge of the use case, this result is pretty good, and the code is clear and maintainable.
In practice, any tricks will make the code less clear, and likely less maintainable. After all, would you rather maintain (-b + sqrt(b*b - 4.*a*c)) / (2*a) or an opaque block of inline assembly and tables?
Also, in practice, you can generally count on the compiler and library authors to take good advantage of your platform's capabilities, and usually to know more than you do about the subtleties of optimizations.
However, on rare occasions, it is possible to do better.
One such occasion is in calculations where you know how much precision you really need and also know that you aren't depending on the C standard's floating point exception handling and can get along with what the hardware platform supplies instead.
Edit: I rearranged the text a bit to put emphasis on profiling and algorithms as suggested by Jonathan Leffler in comments. Thanks, Jonathan.
Edit2: Fixed precedence typo in the quadratic example spotted by kmm's sharp eyes.
Sqrt is basically unchanged on most systems. It's a relatively slow operation, but the total system speeds have improved, so it may not be worth trying to use "tricks".
The decision to optimize it with approximations for the (minor) gains this can achieve are really up to you. Modern hardware has eliminated some of the need for these types of sacrifices (speed vs. precision), but in certain situations, this is still valuable.
I'd use profiling to determine whether this is "still useful".
If you have proven that the call to sqrt() in your code is a bottleneck with a profiler then it may be worth trying to create an optimized version. Otherwise it's a waste of time.
This probably is the fastest method of computing the square root:
float fastsqrt(float val) {
    union
    {
        int tmp;
        float val;
    } u;

    u.val = val;
    u.tmp -= 1<<23;   /* Remove last bit so 1.0 gives 1.0 */
                      /* tmp is now an approximation to logbase2(val) */
    u.tmp >>= 1;      /* divide by 2 */
    u.tmp += 1<<29;   /* add 64 to exponent: (e+127)/2 = (e/2)+63, */
                      /* that represents (e/2)-64 but we want e/2 */
    return u.val;
}
wikipedia article
This probably is the fastest method of computing the inverse square root. Assume at most 0.00175228 error.
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;              /* reinterpret the float's bits as an int */
    i = 0x5f3759df - (i>>1);        /* magic constant gives a first guess at 1/sqrt(x) */
    x = *(float*)&i;
    return x*(1.5f - xhalf*x*x);    /* one Newton-Raphson step to refine the guess */
}
This is (very roughly) about 4 times faster than (float)(1.0/sqrt(x))
wikipedia article
It is generally safe to assume that the standard library developers are quite clever, and have written performant code. You're unlikely to be able to match them in general.
So the question becomes, do you know something that'll let you do a better job? I'm not asking about special algorithms for computing the square root (the standard library developers know of these too, and if they were worthwhile in general, they'd have used them already), but do you have any specific information about your use case that changes the situation?
Do you only need limited precision? If so, you can speed it up compared to the standard library version, which has to be accurate.
Or do you know that your application will always run on a specific type of CPU? Then you can look at how efficient that CPU's sqrt instruction is, and see if there are better alternatives. Of course, the downside to this is that if I run your app on another CPU, your code might turn out slower than the standard sqrt().
Can you make assumptions in your code, that the standard library developers couldn't?
You're unlikely to be able to come up with a better solution to the problem "implement an efficient replacement for the standard library sqrt".
But you might be able to come up with a solution to the problem "implement an efficient square root function for this specific situation".
Why not? You'll probably learn a lot!
I find it very hard to believe that the sqrt function is your application's bottleneck because of the way modern computers are designed. Assuming this isn't a question in reference to some crazy low-end processor, you take a tremendous speed hit to access memory outside of your CPU caches, so unless your algorithm is doing math on very few numbers (enough that they all basically fit within the L1 and L2 caches) you're not going to notice any speedup from optimizing any of your arithmetic.
I still find it useful even now, though this is the context of normalizing a million+ vectors every frame in response to deforming meshes.
That said, I'm generally not creating my own optimizations but relying on a crude approximation of inverse square root provided as a SIMD instruction: rsqrtps. That is still really useful in speeding up some real-world cases if you're willing to sacrifice precision for speed. Using rsqrtps can actually reduce the entirety of the operation which includes deforming and normalizing vertex normals to almost half the time, but at the cost of the precision of the results (that said, in ways that can barely be noticed by the human eye).
I've also found that the fast inverse sqrt, often incorrectly credited to John Carmack, still improves performance in scalar cases, though I don't use it much nowadays. It's generally natural to get some speed boost if you're willing to sacrifice accuracy. That said, I wouldn't even attempt to beat C's sqrt if you aren't trying to sacrifice precision for speed.
You generally have to sacrifice the generality of the solution (like its precision) if you want to beat standard implementations, and that tends to apply whether it's a mathematical function or, say, malloc. I can easily beat malloc with a narrowly-applicable free list lacking thread-safety that's suitable for very specific contexts. It's another thing to beat it with a general-purpose allocator which can allocate variable-sized chunks of memory and free any one of them at any given time.