How to do floating point calculations with integers - c

I have a coprocessor attached to the main processor. Some floating point calculations need to be done on the coprocessor, but it does not support hardware floating point instructions, and emulation is too slow.
One way is to have the main processor scale the floating point values so that they can be represented as integers, send them to the coprocessor, which performs some calculations, and scale those values back on return. However, that wouldn't work most of the time, as the numbers would eventually become too big or too small to fit in the range of those integers. So my question is: what is the fastest way of doing this properly?

You say emulation is too slow. I guess you mean emulation of floating point. If scaled integers are not sufficient, the only remaining alternative is fixed point math, and it's not exactly fast either, even though it's much faster than emulated floating point.
Also, you are never going to escape the fact that with both scaled integers and fixed point math you get less dynamic range than with floating point.
However, if your range is known in advance, the fixed point math implementation can be tuned for the range you need.
Here is an article on fixed point. The gist of the trick is deciding how to split the variable: how many bits for the low (fractional) part and how many for the high (integer) part of the number.
A full implementation of fixed point for C can be found here. (BSD license.) There are others.
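As an illustration of that split, here is a minimal sketch of a Q16.16 format in C (16 integer bits, 16 fractional bits). The type and function names are invented for this example; a real library would add saturation, conversion helpers and more operations:

    #include <stdint.h>

    /* Q16.16 fixed point: the upper 16 bits hold the integer part,
     * the lower 16 bits hold the fraction. */
    typedef int32_t fix16_t;
    #define FIX16_ONE (1 << 16)                    /* 1.0 in Q16.16 */

    static fix16_t fix16_from_int(int x)           { return (fix16_t)(x * FIX16_ONE); }
    static fix16_t fix16_add(fix16_t a, fix16_t b) { return a + b; }

    /* Multiply in a wider type, then shift back down to Q16.16. */
    static fix16_t fix16_mul(fix16_t a, fix16_t b)
    {
        return (fix16_t)(((int64_t)a * b) >> 16);
    }

With this split you get a range of roughly +/-32768 at a resolution of 1/65536; choosing a different split trades range for precision.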

In addition to @Amigable Clark Kant's suggestion, Anthony Williams' fixed point math library provides a C++ fixed class that can be used almost interchangeably with float or double, and on ARM gives a 5x performance improvement over software floating point. It includes a complete fixed point version of the standard math library, including trig and log functions etc., using the CORDIC algorithm.

Related

Standard guarantees for using floating point arithmetic to represent integer operations

I am working on some code to be run on a very heterogeneous cluster. The program performs interval arithmetic using 3, 4, or 5 32-bit words (unsigned ints) to represent high precision boundaries for the intervals. It seems to me that representing some words in floating point may, in some situations, produce a speedup. So, my question is in two parts:
1) Are there any guarantees in the C11 standard as to what range of integers will be represented exactly, and what range of input pairs would have their products represented exactly? One multiplication error could entirely change the results.
2) Is this even a reasonable approach? It seems that the separation of floating point and integer processing within the processor would allow data to be running through both pipelines simultaneously, improving throughput. I don't know much about hardware though, so I'm not sure that the pipelines for integers and floating points actually are all that separate, or, if they are, if they can be used simultaneously.
I understand that the effectiveness of this sort of thing is platform dependent, but right now I am concerned about the reliability of the approach. If it is reliable, I can benchmark it and see, but I am having trouble proving reliability. Secondly, perhaps this sort of approach shows little promise, and if so I would like to know so I can focus elsewhere.
Thanks!
I don't know about the Standard, but it seems that you can assume all your processors use the normal IEEE floating point format. In that case, it's pretty easy to determine whether your calculations are correct. The first integer not representable in the 32-bit float format is 16777217 (2^24 + 1), so if all your intermediate results are less than that (in absolute value), float will be fine.
The reverse is also true: if any intermediate result is greater than 2^24 (in absolute value) and odd, float representation will alter it, which is unacceptable for you.
If you are worried specifically about multiplications, look at how the multiplicands are limited. If one is limited by 2^11, and the other by 2^13, you will be fine (just barely). If, for example, both are limited by 2^16, there almost certainly is a problem. To prove it, find a test case that causes their product to exceed 2^24 and be odd.
Everything you need to know about how far you can go while still retaining integer precision is available to you through the macros defined in <float.h>. There you have the exact description of the floating point types: FLT_RADIX for the radix, FLT_MANT_DIG for the number of mantissa digits, etc.
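As a small sketch of how those macros can be used (assuming FLT_RADIX is 2, as on IEEE machines), the exact-integer limit discussed above can be derived directly from FLT_MANT_DIG:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* Assuming FLT_RADIX == 2: every integer with magnitude up to
         * 2^FLT_MANT_DIG is exactly representable in a float. */
        long exact_limit = 1L << FLT_MANT_DIG;     /* 16777216 for IEEE single precision */

        printf("FLT_RADIX = %d, FLT_MANT_DIG = %d\n", FLT_RADIX, FLT_MANT_DIG);
        printf("integers are exact in float up to %ld\n", exact_limit);

        /* A product of integers stays exact as long as it is below that limit,
         * e.g. multiplicands bounded by 2^11 and 2^13. */
        return 0;
    }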
As you say, whether or not such an approach is efficient will depend on the platform. You should be aware that this depends very much on the particular processor you have, not only the processor family; from one Intel or AMD processor variant to another there can already be noticeable differences. So you'd basically have to benchmark all possibilities and have code that decides at program startup which variant to use.

What is the difference between a fixed point and a floating point processor? How is float handled in each kind?

I am a bit confused about how floating point operations are handled on a processor that does not support floating point operations.
Also, how is a floating point processor different from a fixed point processor?
In which cases are IEEE floating point formats used?
First off, there are a number of different floating point formats, for various reasons. (Some) DSPs do not use IEEE for performance reasons, since it carries a lot of extra baggage (which most folks never use).
In elementary school we learned how to count, then we learned how to add, which is just a shortcut for counting; then we learned to multiply, which is just a shortcut for adding. Likewise, subtraction and division are shortcuts for counting down rather than counting up. We also learned to do all of our math one column at a time, so if you have a processor that can do at least 1-bit math operations, you can do addition, subtraction, multiplication and division as wide (as many bits per operand) as you desire. It may take a lot of operations, but it is quite doable, and anyone who made it through grade school has the toolbox/skill set to do such a thing.
Floating point is a middle school thing: manipulate a decimal point and use powers of some base, e.g. (1.3 * 10^5) + (1.5 * 10^5). We know we have to get the powers of 10 the same, and then we can just do basic elementary addition with the decimal points lined up. Multiplication is even easier: you don't have to line up the decimal points, you just do the math on the significant digits and simply add the exponents.
When your processor has a multiply instruction, it is just a shortcut for you having to do multiple additions (the shortcuts usually involve multiple additions). Depending on how many clock cycles the designers want to get the multiply operation down to, they use an increasingly large amount of chip real estate. Likewise for division; that is why you don't see divide on a lot of instruction sets, and don't see multiply on some of the ones that don't have divide. It is a cost trade-off: performance vs. power, chip real estate, yield, etc.
Floating point is just an extension of that: at the core of a floating point operation you still have the fixed point operations. A floating point multiply requires a fixed point multiplication, an addition and some adjustment. A floating point addition requires some adjustment, an addition, and some more adjustment.
Now, which processors have FPUs and which don't? Which processors with an FPU support IEEE and which don't? That is as easy to find as the information above, but I will leave you to work that out yourself.
If you are, for example, able to do math operations using scientific notation (1.345*10^4 + 2.456*10^6, or 2.3*10^6 * 4.5*10^7), then you should be able to break down the math steps involved and write your own soft float routines. They won't be optimized, but you can see how a CPU that doesn't have an FPU, or a programmer who doesn't want to use the FPU, can do floating point operations. You have to be able to think in terms of base 2 rather than base 10, though, which makes the problem significantly easier: 1.101001*2^4 + 1.010101*2^5. In particular, multiplies get really easy.
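To make those two steps concrete, here is a toy sketch in C of a "significand times a power of two" value. The struct and function names are invented for this example, and signs, rounding and normalisation are deliberately left out:

    #include <stdint.h>

    /* value = sig * 2^exp, with sig kept as a plain integer */
    struct softf { uint64_t sig; int exp; };

    /* Addition: shift the operand with the smaller exponent so the binary
     * points line up, then do an ordinary integer add. */
    static struct softf softf_add(struct softf a, struct softf b)
    {
        if (a.exp < b.exp) { struct softf t = a; a = b; b = t; }
        int d = a.exp - b.exp;
        b.sig = d < 64 ? b.sig >> d : 0;           /* align to a's exponent */
        return (struct softf){ a.sig + b.sig, a.exp };
    }

    /* Multiplication: multiply the significands and add the exponents. */
    static struct softf softf_mul(struct softf a, struct softf b)
    {
        return (struct softf){ a.sig * b.sig, a.exp + b.exp };
    }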
When floating point is not supported in hardware, the calculations are done by heavily optimised pieces of assembly code, usually from a library.
One Google search away you could have found this about fixed-point. I assume you can find info about IEEE floating point yourself ;-)
Good luck!
An explanation and an arbitrary fixed point library can be found here.

Sine function in Matlab/C/Java or any other program in digital systems

How do Matlab/C generate a sine wave? I mean, do they store the values for every angle? But if they did, there would be infinitely many values that need to be stored.
There are many, many routines for calculating sines of angles using only the basic arithmetic operators available on any modern digital computer. One such, but only one example, is the CORDIC algorithm. I don't know what algorithm(s) Matlab uses for trigonometric functions.
Computers don't simply look up the value of a trigonometric function in a stored table of values, though some of the algorithms in current use do look up critical values in stored tables. What those tables contain is algorithm-specific and would probably not accord with a naive expectation of a trig table.
Note that this question has been asked many times on SO, this answer seems to be the pick of them.
No, they generally don't. And even if they did, no, there are not "infinite values". Digital finite (real) computers don't deal well with the infinite, but that applies just as well to the angle (the input to the sine function). You can't ask for "every" angle, since the angle itself must be expressed in a finite set of bits.
I think using a Taylor series is common, as are other solutions based on interpolation.
Also, modern CPUs have instructions for computing sine, but that of course just pushes the question of how it's implemented down a level or three.
See also this very similar question for more details.
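To illustrate the Taylor series approach mentioned above, here is a minimal sketch in C. Real library implementations first reduce the argument to a small range and use carefully tuned polynomials, so treat this purely as an illustration:

    /* sin(x) ~= x - x^3/3! + x^5/5! - ..., built up term by term */
    static double sine_taylor(double x, int terms)
    {
        double term = x;                 /* current term, starting with x */
        double sum  = x;
        for (int n = 1; n < terms; n++) {
            /* next term = previous term * (-x^2) / ((2n) * (2n + 1)) */
            term *= -x * x / ((2.0 * n) * (2.0 * n + 1.0));
            sum  += term;
        }
        return sum;
    }

For small |x| a handful of terms already gives very good accuracy; far from zero the series converges slowly, which is why argument reduction matters.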
Before the '90s, computers would numerically calculate sine and other trigonometric functions using the basic operations: addition, subtraction, multiplication and division. The algorithms for calculating them were usually included in basic compiler libraries. At some point it became possible to add an optional hardware coprocessor, the floating point unit (FPU). My understanding of the FPU is that it did have hard values of the trig functions stored on it. The speed of calculating trig functions increased dramatically with the inclusion of the FPU. Since the '90s, however, the FPU has been bundled with the CPU.
I can't seem to find any explicit description of precisely how sine and other trig functions are implemented by the FPU; generally that is an implementation detail left to the electrical engineer designing the chip. However, it really is only necessary to know the values from 0 to pi/2. Everything else can be easily calculated from those values.
EDIT: Ok here is the implementation used on the FPU: CORDIC, which I found at this answer.

How to avoid FPU when given float numbers?

Well, this is not at all an optimization question.
I am writing a (for now) simple Linux kernel module in which I need to find the average of some positions. These positions are stored as floating point (i.e. float) variables. (I am the author of the whole thing, so I can change that, but I'd rather keep the precision of float and not get involved in that if I can avoid it.)
Now, these position values are stored (or at least used to be) in the kernel simply for storage. One user application writes this data (through shared memory; I am using RTAI, so yes, I have shared memory between kernel and user space) and others read from it. I assume reads and writes of float variables do not use the FPU, so this is safe.
By safe, I mean avoiding FPU in the kernel, not to mention some systems may not even have an FPU. I am not going to use kernel_fpu_begin/end, as that likely breaks the real-time-ness of my tasks.
Now, in my kernel module I really don't need much precision (since the positions are averaged anyway), but I would need it up to, say, 0.001. My question is: how can I portably turn a floating point number into an integer (1000 times the original number) without using the FPU?
I thought about manually extracting the number from the float's bit pattern, but I'm not sure if it's a good idea, as I am not sure how endianness affects it, or even whether the floating point format is the same on all architectures.
If you want to tell gcc to use a software floating point library there's apparently a switch for that, albeit perhaps not turnkey in the standard environment:
Using software floating point on x86 linux
In fact, this article suggests that the Linux kernel and its modules are already compiled with -msoft-float:
http://www.linuxsmiths.com/blog/?p=253
That said, @PaulR's suggestion seems most sensible. And if you offer an API which does whatever conversions you like, then I don't see why it's any uglier than anything else.
The SoftFloat software package has the function float32_to_int32 that does exactly what you want (it implements IEEE 754 in software).
In the end it will be useful to have some sort of floating point support in a kernel anyway (be it hardware or software), so including this in your project would most likely be a wise decision. It's not too big either.
Really, I think you should just change your module's API to use data that's already in integer format, if possible. Having floating point types in a kernel-user interface is just a bad idea when you're not allowed to use floating point in kernelspace.
With that said, if you're using single-precision float, it's essentially ALWAYS going to be IEEE 754 single precision, and the endianness should match the integer endianness. As far as I know this is true for all archs Linux supports. With that in mind, just treat them as unsigned 32-bit integers and extract the bits to scale them. I would scale by 1024 rather than 1000 if possible; doing that is really easy. Just start with the mantissa bits (bits 0-22), "or" on bit 23, then right shift if the exponent (after subtracting the bias of 127) is less than 23 and left shift if it's greater than 23. You'll need to handle the cases where the right shift amount is greater than 32 (which C wouldn't allow; you have to just special-case the zero result) or where the left shift is sufficiently large to overflow (in which case you'll probably want to clamp the output).
If you happen to know your values won't exceed a particular range, of course, you might be able to eliminate some of these checks. In fact, if your values never exceed 1 and you can pick the scaling, you could pick it to be 2^23 and then you could just use ((float_bits & 0x7fffff)|0x800000) directly as the value when the exponent is zero, and otherwise right-shift.
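A minimal sketch of that bit-extraction idea, assuming IEEE 754 single precision with endianness matching the integers. The function name is invented; it scales by 1024 (2^10) rather than 1000, and with the scale factor folded in, the shift is taken relative to 23 - 10 = 13. Only non-negative normal values (and zero) are handled; negatives, subnormals, infinities and NaNs would need extra cases:

    #include <stdint.h>
    #include <string.h>

    static int32_t float_to_scaled_int(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);                /* reinterpret the bits, no FPU involved */

        uint32_t mant = (bits & 0x7fffff) | 0x800000;  /* mantissa with the implicit bit 23 set */
        int exp = (int)((bits >> 23) & 0xff) - 127;    /* unbiased exponent */
        int shift = exp - 13;                          /* 13 = 23 mantissa bits - 10 bits of scale */

        if (shift <= -32)
            return 0;                                  /* too small; also covers f == 0.0f */
        if (shift < 0)
            return (int32_t)(mant >> -shift);
        if (shift > 30)
            return INT32_MAX;                          /* definitely out of range: clamp */
        uint64_t scaled = (uint64_t)mant << shift;     /* widen so the shift cannot overflow */
        return scaled > (uint64_t)INT32_MAX ? INT32_MAX : (int32_t)scaled;
    }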
You can use rational numbers instead of floats. The operations (multiplication, addition) can be implemented without loss in accuracy too.
If you really only need 1/1000 precision, you can just store x*1000 as a long integer.

Fortran/C Interlanguage problems: results differ in the 14th digit

I have to use C and Fortran together to do some simulations. In the course of these, I use the same memory in both language parts, by defining a pointer in C to access memory allocated by Fortran.
The datatype of the problematic variable is
real(kind=8)
for Fortran, and
double
for C. The results of the same calculations now differ between the two languages, and I need to compare them directly and get zero difference. All calculations are done only with the above precisions. The difference is always in the 13th-14th digit.
What would be a good way to resolve this? Any compiler flags? Just cut off after some digits?
Many thanks!
Floating point is not perfectly accurate. Ever. Even cos(x) == cos(y) can be false when x == y (for example when one result is kept in an extended-precision register while the other has been rounded and stored to memory).
So when doing your comparisons, take this into account, and allow the values to differ by some small epsilon value.
This is a problem with the inaccuracy of floating point numbers: they will be inaccurate at a certain place. You usually compare them either by rounding them to a digit that you know will be in the accurate area, or by providing an epsilon of appropriate value (small enough not to impact further calculations, and big enough to take care of the inaccuracy while comparing).
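A minimal sketch of such an epsilon comparison; the relative tolerance of 1e-12 is an assumption chosen to absorb a difference in the 13th-14th significant digit and should be tuned to your own data:

    #include <math.h>
    #include <stdbool.h>

    static bool nearly_equal(double a, double b)
    {
        double tol = 1e-12 * fmax(fabs(a), fabs(b));   /* relative tolerance */
        return fabs(a - b) <= tol;
    }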
One thing you might check is to be sure that the FPU control word is the same in both cases. If it is set to 53-bit precision in one case and 64-bit in the other, it would likely produce different results. You can use the instructions fstcw and fldcw to read and load the control word value. Nonetheless, as others have mentioned, you should not depend on the accuracy being identical even if you can make it work in one situation.
Perfect portability is very difficult to achieve in floating point operations. Changing the order of the machine instructions might change the rounding. One compiler might keep values in registers, while another copies them to memory, which can change the precision. Currently the Fortran and C languages allow a certain amount of latitude here. The IEEE module of Fortran 2008, when implemented, will allow requiring more specific and therefore more portable floating point computations.
Since you are compiling for an x86 architecture, it's likely that one of the compilers is maintaining intermediate values in floating point registers, which are 80 bits as opposed to the 64 bits of a C double.
For GCC, you can supply the -ffloat-store option to inhibit this optimisation. You may also need to change the code to explicitly store some intermediate results in double variables. Some experimentation is likely in order.
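As a small illustration of explicitly storing intermediate results in double variables (the function and variable names are invented for this example): with -ffloat-store a plain double temporary is rounded to 64 bits; declaring it volatile, as below, forces the store even without that flag:

    /* Force the partial sum down to 64-bit double precision before it is
     * used further, instead of letting it live in an 80-bit x87 register. */
    double dot3(const double *a, const double *b)
    {
        volatile double partial = a[0] * b[0] + a[1] * b[1];
        return partial + a[2] * b[2];
    }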
