Looking for Ansi C89 arbitrary precision math library - c

I wrote an ANSI C compiler for a friend's custom 16-bit stack-based CPU several years ago, but I never got around to implementing all the data types. Now I would like to finish the job, so I'm wondering if there are any math libraries out there that I can use to fill the gaps. I can handle 16-bit integer data types since they are native to the CPU, and therefore I have all the math routines (i.e. +, -, *, /, %) done for them. However, since his CPU does not handle floating point, I have to implement floats/doubles myself. I also have to implement the 8-bit and 32-bit data types (both integer and floats/doubles). I'm pretty sure this has been done and redone many times, and since I'm not particularly looking forward to reinventing the wheel, I would appreciate it if someone would point me at a library that can help me out.
Now I was looking at GMP but it seems to be overkill (library must be absolutely huge, not sure my custom compiler would be able to handle it) and it takes numbers in the form of strings which would be wasteful for obvious reasons. For example :
mpz_set_str(x, "7612058254738945", 10);
mpz_set_str(y, "9263591128439081", 10);
mpz_mul(result, x, y);
This seems simple enough, I like the api... but I would rather pass in an array rather than a string. For example, if I wanted to multiply two 32-bit longs together I would like to be able to pass it two arrays of size two where each array contains two 16-bit values that actually represent a 32-bit long and have the library place the output into an output array. If I needed floating point then I should be able to specify the precision as well.
This may seem like asking for too much but I'm asking in the hopes that someone has seen something like this.
Many thanks in advance!

Let's divide the answer.
8-bit arithmetic
This one is very easy. In fact, C already talks about this under the term "integer promotion". This means that if you have 8-bit data and you want to do an operation on them, you simply widen them to 16 bits: pad the upper byte with zeros (zero extension), or with ones if the value is signed and negative (sign extension). Then you proceed with the normal 16-bit operation.
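As a minimal sketch of that widening step, assuming the 8-bit value sits in the low byte of a 16-bit word (the function names here are made up for illustration):

```c
/* Widen an 8-bit value held in the low byte of a 16-bit word.
   Hypothetical helper names; unsigned short is assumed to hold 16 bits. */

/* Unsigned 8-bit value: clear the upper byte. */
unsigned short zero_extend8(unsigned short w)
{
    return (unsigned short)(w & 0x00FFu);
}

/* Signed 8-bit value (two's complement): copy bit 7 into the upper byte. */
unsigned short sign_extend8(unsigned short w)
{
    w = (unsigned short)(w & 0x00FFu);
    if (w & 0x0080u)
        w = (unsigned short)(w | 0xFF00u);
    return w;
}
```

After widening, the existing 16-bit routines can be used unchanged.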
32-bit arithmetic
Note: as far as the standard is concerned, int does not need to be 32 bits wide (16 bits is enough), but long must be at least 32 bits, so you will need 32-bit arithmetic to support long.
This could be a bit tricky, but it is still not worth using a library for. For each operation, take a look at how you learned to do it in elementary school in base 10, and then do the same in base 2^16 for two-digit numbers (each digit being one 16-bit integer). Once you understand the analogy with simple base-10 math (and hence the algorithms), you would need to implement them in the assembly language of your CPU.
This basically means loading the most significant 16 bits into one register and the least significant 16 bits into another. Then follow the algorithm for each operation. You would most likely need help from the carry, overflow and other flags.
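The schoolbook approach can be sketched in portable C before translating it to assembly. This is only an illustration under stated assumptions: the array layout (w[0] = low half, w[1] = high half) and the names add32, mul32 and mul16x16 are invented here, and the 16x16 product is split into 8-bit pieces so that no intermediate needs more than 16 bits.

```c
/* Sketch: 32-bit unsigned arithmetic built from 16-bit "digits".
   Layout (an assumption of this example): w[0] = low half, w[1] = high half. */
typedef unsigned short u16;   /* assumed to hold values 0..0xFFFF */

#define MASK16 0xFFFFu

/* out = a + b (mod 2^32); out may alias a or b. */
void add32(const u16 a[2], const u16 b[2], u16 out[2])
{
    u16 lo = (u16)(((unsigned int)a[0] + b[0]) & MASK16);
    u16 carry = (u16)(lo < a[0]);          /* wrapped around => carry out */
    u16 hi = (u16)(((unsigned int)a[1] + b[1] + carry) & MASK16);
    out[0] = lo;
    out[1] = hi;
}

/* Full 16x16 -> 32 product using only partial products that fit in 16 bits. */
static void mul16x16(u16 a, u16 b, u16 *hi, u16 *lo)
{
    unsigned int al = a & 0xFFu, ah = (a >> 8) & 0xFFu;
    unsigned int bl = b & 0xFFu, bh = (b >> 8) & 0xFFu;
    unsigned int p0 = al * bl;             /* weight 2^0  */
    unsigned int p1 = al * bh;             /* weight 2^8  */
    unsigned int p2 = ah * bl;             /* weight 2^8  */
    unsigned int p3 = ah * bh;             /* weight 2^16 */
    unsigned int mid = (p1 + p2) & MASK16;
    unsigned int mid_carry = mid < p1;     /* worth 2^24 in the product */
    unsigned int l = (p0 + ((mid & 0xFFu) << 8)) & MASK16;
    unsigned int l_carry = l < p0;
    *lo = (u16)l;
    *hi = (u16)((p3 + (mid >> 8) + (mid_carry << 8) + l_carry) & MASK16);
}

/* out = low 32 bits of a * b; out may alias a or b. */
void mul32(const u16 a[2], const u16 b[2], u16 out[2])
{
    u16 hi, lo;
    unsigned int cross;
    mul16x16(a[0], b[0], &hi, &lo);
    /* Only the low 16 bits of the cross terms land inside the 32-bit result. */
    cross = ((unsigned int)a[0] * b[1] + (unsigned int)a[1] * b[0]) & MASK16;
    out[0] = lo;
    out[1] = (u16)((hi + cross) & MASK16);
}
```

The comparison `lo < a[0]` stands in for the carry flag; on real hardware an add-with-carry instruction would do the same job.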
Floating point arithmetic
Note: as far as the standard is concerned, you don't really need to conform to IEEE 754.
There are various libraries already written for software emulated floating points. You may find this gcc wiki page interesting:
GNU libc has a third implementation, soft-fp. (Variants of this are also used for Linux kernel math emulation on some targets.) soft-fp is used in glibc on PowerPC --without-fp to provide the same soft-float functions as in libgcc. It is also used on Alpha, SPARC and PowerPC to provide some ABI-specified floating-point functions (which in turn may get used by GCC); on PowerPC these are IEEE quad functions, not IBM long double ones.
Performance measurements with EEMBC indicate that soft-fp (as speeded up somewhat using ideas from ieeelib) is about 10-15% faster than fp-bit and ieeelib about 1% faster than soft-fp, testing on IBM PowerPC 405 and 440. These are geometric mean measurements across EEMBC; some tests are several times faster with soft-fp than with fp-bit if they make heavy use of floating point, while others don't make significant use of floating point. Depending on the particular test, either soft-fp or ieeelib may be faster; for example, soft-fp is somewhat faster on Whetstone.
One answer could be to take a look at the source code for glibc and see if you could salvage what you need.


Standard guarantees for using floating point arithmetic to represent integer operations

I am working on some code to be run on a very heterogeneous cluster. The program performs interval arithmetic using 3, 4, or 5 32 bit words (unsigned ints) to represent high precision boundaries for the intervals. It seems to me that representing some words in floating point in some situations may produce a speedup. So, my question is two parts:
1) Are there any guarantees in the C11 standard as to what range of integers will be represented exactly, and what range of input pairs would have their products represented exactly? One multiplication error could entirely change the results.
2) Is this even a reasonable approach? It seems that the separation of floating point and integer processing within the processor would allow data to be running through both pipelines simultaneously, improving throughput. I don't know much about hardware though, so I'm not sure that the pipelines for integers and floating points actually are all that separate, or, if they are, if they can be used simultaneously.
I understand that the effectiveness of this sort of thing is platform dependent, but right now I am concerned about the reliability of the approach. If it is reliable, I can benchmark it and see, but I am having trouble proving reliability. Secondly, perhaps this sort of approach shows little promise, and if so I would like to know so I can focus elsewhere.
I don't know about the Standard, but it seems that you can assume all your processors are using the normal IEEE floating point format. In this case, it's pretty easy to determine whether your calculations are correct. The first integer not representable by the 32-bit float format is 16777217 (2^24 + 1), so if all your intermediate results are less than that (in absolute value), float will be fine.
The reverse is also true: if any intermediate result is greater than 2^24 (in absolute value) and odd, float representation will alter it, which is unacceptable for you.
If you are worried specifically about multiplications, look at how the multiplicands are limited. If one is limited by 2^11, and the other by 2^13, you will be fine (just barely). If, for example, both are limited by 2^16, there almost certainly is a problem. To prove it, find a test case that causes their product to exceed 2^24 and be odd.
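Such a test case can be checked mechanically. The following is a rough sketch, assuming IEEE single-precision float and operands below 2^16 (so the exact product fits in an unsigned long); product_exact_in_float is a made-up helper name.

```c
/* Returns 1 if a * b computed in float equals the exact integer product.
   Assumes a and b are below 2^16, so the exact product fits in an
   unsigned long (which C guarantees is at least 32 bits wide). */
int product_exact_in_float(unsigned long a, unsigned long b)
{
    unsigned long exact = a * b;
    /* volatile forces the product to be rounded to float precision,
       even on targets that keep temporaries in wider registers */
    volatile float f = (float)a * (float)b;
    return (unsigned long)f == exact;
}
```

For example, 2048 * 8192 is exactly 2^24 and survives, while 4097 * 4097 = 16785409 is odd and above 2^24, so it does not.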
Everything you need to know about how far you can go and still have integer precision is available through the macros defined in <float.h>. They give the exact description of the floating point types: FLT_RADIX for the radix, FLT_MANT_DIG for the number of digits in the significand, etc.
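As a small sketch of how those macros can be used: every integer whose magnitude is at most FLT_RADIX raised to the power FLT_MANT_DIG is exactly representable in float, and likewise with DBL_MANT_DIG for double. The function names below are invented for this example.

```c
#include <float.h>

/* Largest n such that every integer in [-n, n] is exactly representable
   in float: FLT_RADIX raised to the power FLT_MANT_DIG. */
double float_exact_int_limit(void)
{
    double limit = 1.0;
    int i;
    for (i = 0; i < FLT_MANT_DIG; i++)
        limit *= FLT_RADIX;
    return limit;
}

/* Same bound for double, using DBL_MANT_DIG. */
double double_exact_int_limit(void)
{
    double limit = 1.0;
    int i;
    for (i = 0; i < DBL_MANT_DIG; i++)
        limit *= FLT_RADIX;
    return limit;
}
```

On an implementation with IEEE 754 types these evaluate to 2^24 and 2^53 respectively; a product is safe if both the operands and the result stay within the bound.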
As you say, whether or not such an approach is efficient will depend on the platform. Be aware that this depends heavily on the particular processor you have, not only on the processor family. From one Intel or AMD processor variant to another there can already be significant differences. So you would basically benchmark all possibilities and have code that decides on program startup which variant to use.

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code.
To my knowledge, any x86 architecture since the Intel 8087 uses a FPU unit prepared to handle IEEE-754 floating point numbers, and I cannot see any reason why the result would be different in different architectures. However, if they were different (namely due to different compiler or different optimization level), would there be some way to produce bit-exact results by just configuring the compiler?
In C or C++:
No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.
(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs included an FPU, but in 2016 that's Deathstation 9000 territory.)
related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):
Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.
If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.
First problem: Only the "basic operations" (+ - * / and sqrt) are required to return "correctly rounded" results, i.e. <= 0.5ulp of error, correctly rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.
Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.
C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT OFF) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
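A short sketch of both knobs follows. The function names are made up for illustration, and note as a caveat that some compilers ignore the pragma and need a command-line switch instead (GCC, for instance, has -ffp-contract=off).

```c
#include <float.h>

/* Ask the compiler not to fuse a*b + c into a single FMA.
   Compilers that do not implement this pragma may warn and ignore it. */
#pragma STDC FP_CONTRACT OFF

double mul_add(double a, double b, double c)
{
    return a * b + c;   /* a*b is rounded to double before the add */
}

/* Report how the implementation evaluates floating point expressions:
   0 = in the operand type, 1 = float as double, 2 = everything as
   long double, -1 = indeterminable. Returns -1 if the macro is missing
   (it was introduced in C99). */
int eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return (int)FLT_EVAL_METHOD;
#else
    return -1;
#endif
}
```

Code that needs reproducible results can check eval_method() at startup and refuse to run, or switch code paths, when it is not 0.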
On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)
So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to 53 or 24 bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
In assembly:
With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.
Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.
I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.
Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).
SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.
In real life:
So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...
As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.
However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.
In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.
Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.
If the compiler and architecture are compliant with IEEE standards, yes.
For instance, gcc is IEEE compliant if configured properly. If you use the -ffast-math flag, it will not be IEEE compliant.
See http://www.validlab.com/goldberg/paper.pdf page 25.
If you want to know exactly what exactness you can rely on when using an IEEE 754-1985 hardware/compiler pair, you need to purchase the standard from the IEEE site. Unfortunately, it is not publicly available.

How to do floating point calculations with integers

I have a coprocessor attached to the main processor. Some floating point calculations need to be done in the coprocessor, but it does not support hardware floating point instructions, and emulation is too slow.
Now one way is to have the main processor scale the floating point values so that they can be represented as integers, send them to the coprocessor, which performs some calculations, and scale the values back on return. However, that wouldn't work most of the time, as the numbers would eventually become too big or too small for the range of those integers. So my question is: what is the fastest way of doing this properly?
You are saying emulation is too slow. I guess you mean emulation of floating point. The only remaining alternative if scaled integers are not sufficient, is fixed point math but it's not exactly fast either, even though it's much faster than emulated float.
Also, you are never going to escape the fact that with both scaled integers, and fixed point math, you are going to get less dynamic range than with floating point.
However, if your range is known in advance, the fixed point math implementation can be tuned for the range you need.
Here is an article on fixed point. The gist of the trick is deciding how to split the variable, how many bits for the low and high part of the number.
A full implementation of fixed point for C can be found here. (BSD license.) There are others.
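To make the "how to split the variable" idea concrete, here is a minimal sketch of a Q16.16 format (16 integer bits, 16 fractional bits). The type and function names are invented for this example and are not taken from the linked library; the split itself is the tunable part.

```c
#include <stdint.h>

/* Q16.16 fixed point: the stored integer is value * 2^16.
   Changing Q_FRAC_BITS trades range for resolution. */
typedef int32_t q16_16;

#define Q_FRAC_BITS 16
#define Q_ONE ((q16_16)1 << Q_FRAC_BITS)

q16_16 q_from_int(int32_t i)
{
    return (q16_16)(i * Q_ONE);        /* caller keeps |i| below 32768 */
}

/* Addition and subtraction are plain integer operations. */
q16_16 q_add(q16_16 a, q16_16 b)
{
    return a + b;
}

/* The raw product carries 32 fractional bits; divide once by 2^16 to
   get back to 16. Division is used instead of a shift so that negative
   values are well defined (truncation toward zero). */
q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) / Q_ONE);
}

/* Pre-scale the dividend so the quotient keeps 16 fractional bits. */
q16_16 q_div(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * Q_ONE) / b);   /* b must be nonzero */
}
```

The 64-bit intermediates are the costly part on a small processor; if the value range is known in advance, a narrower format can avoid them.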
In addition to @Amigable Clark Kant's suggestion, Anthony Williams' fixed point math library provides a C++ fixed class that can be used almost interchangeably with float or double and on ARM gives a 5x performance improvement over software floating point. It includes a complete fixed point version of the standard math library including trig and log functions etc. using the CORDIC algorithm.

How to avoid FPU when given float numbers?

Well, this is not at all an optimization question.
I am writing a (for now) simple Linux kernel module in which I need to find the average of some positions. These positions are stored as floating point (i.e. float) variables. (I am the author of the whole thing, so I can change that, but I'd rather keep the precision of float and not get involved in that if I can avoid it).
Now, these position values are stored (or at least used to) in the kernel simply for storage. One user application writes these data (through shared memory (I am using RTAI, so yes I have shared memory between kernel and user spaces)) and others read from it. I assume read and write from float variables would not use the FPU so this is safe.
By safe, I mean avoiding FPU in the kernel, not to mention some systems may not even have an FPU. I am not going to use kernel_fpu_begin/end, as that likely breaks the real-time-ness of my tasks.
Now in my kernel module, I really don't need much precision (since the positions are averaged anyway), but I would need it up to say 0.001. My question is, how can I portably turn a floating point number to an integer (1000 times the original number) without using the FPU?
I thought about manually extracting the number from the float's bit-pattern, but I'm not sure if it's a good idea as I am not sure how endian-ness affects it, or even if floating points in all architectures are standard.
If you want to tell gcc to use a software floating point library there's apparently a switch for that, albeit perhaps not turnkey in the standard environment:
Using software floating point on x86 linux
In fact, this article suggests that linux kernel and its modules are already compiled with -msoft-float:
That said, @PaulR's suggestion seems most sensible. And if you offer an API which does whatever conversions you like then I don't see why it's any uglier than anything else.
The SoftFloat software package has the function float32_to_int32 that does exactly what you want (it implements IEEE 754 in software).
In the end it will be useful to have some sort of floating point support in a kernel anyway (be it hardware or software), so including this in your project would most likely be a wise decision. It's not too big either.
Really, I think you should just change your module's API to use data that's already in integer format, if possible. Having floating point types in a kernel-user interface is just a bad idea when you're not allowed to use floating point in kernelspace.
With that said, if you're using single-precision float, it's essentially ALWAYS going to be IEEE 754 single precision, and the endianness should match the integer endianness. As far as I know this is true for all archs Linux supports. With that in mind, just treat them as unsigned 32-bit integers and extract the bits to scale them. I would scale by 1024 rather than 1000 if possible; doing that is really easy. Just start with the mantissa bits (bits 0-22), "or" on bit 23, then right shift if the exponent (after subtracting the bias of 127) is less than 23 and left shift if it's greater than 23. You'll need to handle the cases where the right shift amount is 32 or more (which C doesn't define; you have to just special-case the zero result) or where the left shift is sufficiently large to overflow (in which case you'll probably want to clamp the output).
If you happen to know your values won't exceed a particular range, of course, you might be able to eliminate some of these checks. In fact, if your values never exceed 1 and you can pick the scaling, you could pick it to be 2^23 and then you could just use ((float_bits & 0x7fffff)|0x800000) directly as the value when the exponent is zero, and otherwise right-shift.
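The bit extraction described above can be sketched as follows, using integer operations only. This assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits) and a scale of 1024; the function name, the clamping policy and the NaN handling are choices made for this example. In kernel code the fixed-width types would be u32/s32 rather than the <stdint.h> names.

```c
#include <stdint.h>

/* Convert the raw bits of an IEEE 754 single to trunc(value * 1024)
   without touching the FPU. Out-of-range values are clamped. */
int32_t float_bits_to_q10(uint32_t bits)
{
    uint32_t negative = bits >> 31;
    int32_t  exponent = (int32_t)((bits >> 23) & 0xFF);
    uint32_t mantissa = bits & 0x7FFFFFu;
    int32_t  shift;
    uint32_t magnitude;

    if (exponent == 0)                /* zero or denormal: far below 1/1024 */
        return 0;
    if (exponent == 255) {            /* infinity or NaN */
        if (mantissa != 0)
            return 0;                 /* NaN: arbitrary choice for this sketch */
        return negative ? -INT32_MAX : INT32_MAX;
    }

    mantissa |= 0x800000u;            /* restore the implicit leading 1 */

    /* value = mantissa * 2^(exponent - 127 - 23); add 10 for the 1024 scale */
    shift = exponent - 127 - 23 + 10;

    if (shift > 7)                    /* 24 + 7 = 31 bits is the most that fits */
        magnitude = (uint32_t)INT32_MAX;
    else if (shift >= 0)
        magnitude = mantissa << shift;
    else if (shift > -24)
        magnitude = mantissa >> -shift;
    else
        magnitude = 0;                /* every significant bit is shifted out */

    return negative ? -(int32_t)magnitude : (int32_t)magnitude;
}
```

Because the scale is folded into the shift count, the threshold between right and left shifts moves from an unbiased exponent of 23 down to 13. Averaging then happens on the integers, and the result can be handed back to user space as a count of 1/1024 units.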
You can use rational numbers instead of floats. The operations (multiplication, addition) can be implemented without loss in accuracy too.
If you really only need 1/1000 precision, you can just store x*1000 as a long integer.

Fortran/C Interlanguage problems: results differ in the 14th digit

I have to use C and Fortran together to do some simulations. In their course I use the same memory in both programming language parts, by defining a pointer in C to access memory allocated by Fortran.
The datatype of the problematic variable is
for Fortran, and
for C. The results of the same calculations now differ in the respective programming languages, and I need to directly compare them and get a zero. All calculations are done only with the above accuracies. The difference is always in the 13-14th digit.
What would be a good way to resolve this? Any compiler-flags? Just cut-off after some digits?
Many thanks!
Floating point is not perfectly accurate. Ever. Even cos(x) == cos(y) can be false if x == y.
So when doing your comparisons, take this into account, and allow the values to differ by some small epsilon value.
This is a problem with the inaccuracy of floating point numbers: they become inaccurate at a certain decimal place. You usually compare them either by rounding them to a digit that you know will be in the accurate area, or by providing an epsilon of appropriate value (small enough to not impact further calculations, and big enough to take care of the inaccuracy while comparing).
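An epsilon comparison of that kind might look like the sketch below. The function name and the two-tolerance design are choices made for this example: the relative tolerance handles values of any magnitude, and the absolute tolerance covers results that should be zero.

```c
/* Local absolute value, so the example needs no math library. */
static double abs_double(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns 1 if a and b agree to within rel_tol relative to the larger
   magnitude, or to within abs_tol in absolute terms. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff = abs_double(a - b);
    double mag_a = abs_double(a);
    double mag_b = abs_double(b);
    double scale = mag_a > mag_b ? mag_a : mag_b;

    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

With results that differ in the 13th or 14th digit, a relative tolerance around 1e-12 would treat the C and Fortran values as equal; the right figure depends on how the error accumulates in the actual computation.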
One thing you might check is to be sure that the FPU control word is the same in both cases. If it is set to 53-bit precision in one case and 64-bit in the other, it would likely produce different results. You can use the instructions fstcw and fldcw to read and load the control word value. Nonetheless, as others have mentioned, you should not depend on the accuracy being identical even if you can make it work in one situation.
Perfect portability is very difficult to achieve in floating point operations. Changing the order of the machine instructions might change the rounding. One compiler might keep values in registers, while another copies them to memory, which can change the precision. Currently the Fortran and C languages allow a certain amount of latitude. The IEEE module of Fortran 2008, when implemented, will allow requiring more specific and therefore more portable floating point computations.
Since you are compiling for an x86 architecture, it's likely that one of the compilers is maintaining intermediate values in floating point registers, which are 80 bits as opposed to the 64 bits of a C double.
For GCC, you can supply the -ffloat-store option to inhibit this optimisation. You may also need to change the code to explicitly store some intermediate results in double variables. Some experimentation is likely in order.
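As a rough illustration of storing an intermediate explicitly, the sketch below sums three values and forces the first partial sum through a double variable. The volatile qualifier is an addition made for this example, to stop the compiler from keeping the value in a wider register; the function name is hypothetical.

```c
/* Sum three doubles, rounding the first partial sum to 64-bit double
   before it is used again. */
double sum3_rounded(double a, double b, double c)
{
    volatile double partial = a + b;   /* stored to memory as a 64-bit double */
    return partial + c;
}
```

Applying this to every intermediate is tedious and slows the code down, which is why a compiler option is usually tried first.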
