Floating type value in ATMega8 - c

My microcontroller doesn't process floating-point values, so how can I do operations on floating-point values using int?
For example, I have a value stored in a register, a = 5.
Now I want to multiply it by 0.65 and store the result in another register, c.
How do I do it?
Using int drops the fractional part of the result, while using float makes the output display a "?".

You are mixing up multiple problems.
First: even if your target controller does not contain a floating point unit (FPU), the calculations can be done with software libraries. Usage of these libraries usually happens automatically, and you can do calculations in float.
These libraries are relatively costly in code size and execution speed: even if you only add simple float arithmetic, you will notice a large increase in code size.
The second problem is the output via the printf routines. Since floating-point support is normally not needed, it is stripped out to save code size. You can explicitly activate it by adding the libraries libprintf_flt.a and libm.a and using the linker option -Wl,-u,vfprintf.
Alternatively, you can use the ftoa function.
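If the required precision is as modest as in the 0.65 example, a third option is to avoid float entirely and use integer scaling (fixed point). A minimal sketch; the scale factor of 100 is my arbitrary choice for two decimal digits:

#include <stdint.h>

int main(void)
{
    uint8_t a = 5;

    /* Represent 0.65 as the ratio 65/100 and stay in integer arithmetic.
       5 * 65 = 325, so the result is 3.25. */
    uint16_t scaled = (uint16_t)a * 65;  /* widen first to avoid 8-bit overflow */
    uint8_t c_int  = scaled / 100;       /* integer part: 3 */
    uint8_t c_frac = scaled % 100;       /* hundredths: 25 */

    (void)c_int;
    (void)c_frac;
    return 0;
}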

Related

Pow implementation for double

I am developing code that will be used for motion control and I am having an issue with the pow function.
I am using VS2010 as IDE.
Here is my issue:
I have:
double p = 100.0000;
double d = 1000.0000;
t1 = pow(p/(8.0000*d), 1.00/4.000);
When evaluating this last expression, I don't get the best approximation as the result. Only the first 7 decimal digits are correct; the subsequent digits are all garbage.
I am guessing that the pow function casts its input to float and proceeds with the calculation.
Am I right?
If so, is there any code I can get "inspired" with to reimplement pow for a better precision?
Edit: Solved.
After all, I was having problems with FPU config bits, caused by Direct3D which was being used by OGRE 3D framework.
If using OGRE, on the config GUI, just set "Floating-point mode=Consistent".
If using raw Direct3D, when calling CreateDevice, make sure to pass "D3DCREATE_FPU_PRESERVE" flag to it.
You may be using a library that is changing the default precision of the FPU to single-precision. Then all floating-point operations, even on doubles, will actually be performed as single-precision operations.
As a test, you can try calling _controlfp( _CW_DEFAULT, 0xfffff ); (you need to include <float.h>) before performing the calculation to see if you get the correct result. This will reset the floating-point control word to default values. Note that it will reset other settings as well, which may cause issues.
One common library that changes the floating-point precision is Direct3D 9 (and maybe other versions too): By default, it changes the FPU to single-precision when creating a device. If you use it, specify the flag D3DCREATE_FPU_PRESERVE when creating the device to prevent it from changing the FPU precision.
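For concreteness, a minimal sketch of that test on MSVC (_controlfp and _CW_DEFAULT come from <float.h>; the call is exactly the one quoted above):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Reset the FPU control word to its defaults (precision and
       round-to-nearest) before the computation. Note that this also
       resets the exception masks, which may have side effects. */
    _controlfp(_CW_DEFAULT, 0xfffff);

    double t1 = pow(100.0 / (8.0 * 1000.0), 1.0 / 4.0);
    printf("t1=%.15f\n", t1);
    return 0;
}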
How did you determine you're only getting 7 digits of precision? Are you printing t1 and specifying the correct output format? On my machine, with VS2010, the following code:
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <iomanip>

int main()
{
    double p = 100.0000;
    double d = 1000.0000;
    double t1 = pow(p/(8.0000*d), 1.00/4.000);
    printf("t1=%.15f\n", t1);                                  // C
    std::cout << "t1=" << std::setprecision(15) << t1 << '\n'; // C++
}
Produces this output:
t1=0.334370152488211
t1=0.334370152488211

How to avoid FPU when given float numbers?

Well, this is not at all an optimization question.
I am writing a (for now) simple Linux kernel module in which I need to find the average of some positions. These positions are stored as floating-point (i.e. float) variables. (I am the author of the whole thing, so I can change that, but I'd rather keep the precision of float and not get involved in that if I can avoid it.)
Now, these position values are kept in the kernel simply for storage (or at least they used to be). One user application writes these data (through shared memory (I am using RTAI, so yes, I have shared memory between kernel and user space)) and others read from them. I assume reads and writes of float variables do not use the FPU, so this is safe.
By safe, I mean avoiding FPU in the kernel, not to mention some systems may not even have an FPU. I am not going to use kernel_fpu_begin/end, as that likely breaks the real-time-ness of my tasks.
Now in my kernel module I really don't need much precision (since the positions are averaged anyway), but I would need it up to, say, 0.001. My question is: how can I portably turn a floating-point number into an integer (1000 times the original number) without using the FPU?
I thought about manually extracting the number from the float's bit pattern, but I'm not sure it's a good idea, as I am not sure how endianness affects it, or even whether floating point is standard across all architectures.
If you want to tell gcc to use a software floating point library there's apparently a switch for that, albeit perhaps not turnkey in the standard environment:
Using software floating point on x86 linux
In fact, this article suggests that linux kernel and its modules are already compiled with -msoft-float:
http://www.linuxsmiths.com/blog/?p=253
That said, @PaulR's suggestion seems most sensible. And if you offer an API which does whatever conversions you like, then I don't see why it's any uglier than anything else.
The SoftFloat software package has the function float32_to_int32 that does exactly what you want (it implements IEEE 754 in software).
In the end it will be useful to have some sort of floating point support in a kernel anyway (be it hardware or software), so including this in your project would most likely be a wise decision. It's not too big either.
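A hedged sketch of what using it could look like with the SoftFloat 2 API (float32 is the library's typedef for a raw bit pattern, 0x447A0000 is the IEEE 754 encoding of 1000.0f; header and type names vary between SoftFloat versions):

#include "softfloat.h"  /* Berkeley SoftFloat; exact path/version assumed */

/* Compute (int)(x * 1000) from a raw single-precision bit pattern,
   entirely in software, so the FPU is never touched. */
int32 position_times_1000(float32 raw)
{
    const float32 thousand = 0x447A0000;  /* bit pattern of 1000.0f */
    return float32_to_int32(float32_mul(raw, thousand));
}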
Really, I think you should just change your module's API to use data that's already in integer format, if possible. Having floating point types in a kernel-user interface is just a bad idea when you're not allowed to use floating point in kernelspace.
With that said, if you're using single-precision float, it's essentially ALWAYS going to be IEEE 754 single precision, and the endianness should match the integer endianness. As far as I know this is true for all archs Linux supports. With that in mind, just treat the values as unsigned 32-bit integers and extract the bits to scale them. I would scale by 1024 rather than 1000 if possible; doing that is really easy: start with the mantissa bits (bits 0-22), "or" on bit 23, then right-shift if the exponent (after subtracting the bias of 127) is less than 23 and left-shift if it's greater than 23. You'll need to handle the cases where the right-shift amount is 32 or more (which C wouldn't allow; you have to just special-case the zero result) or where the left shift is large enough to overflow (in which case you'll probably want to clamp the output).
If you happen to know your values won't exceed a particular range, of course, you might be able to eliminate some of these checks. In fact, if your values never exceed 1 and you can pick the scaling, you could pick it to be 2^23 and then you could just use ((float_bits & 0x7fffff)|0x800000) directly as the value when the exponent is zero, and otherwise right-shift.
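A rough sketch of that extraction, under this answer's assumptions (IEEE 754 single precision, matching endianness, the raw bits arriving as a uint32_t from shared memory; denormals, infinities and NaNs only get the crude zero/clamp treatment described above):

#include <stdint.h>

/* Scale an IEEE 754 single-precision bit pattern by 1024 using only
   integer operations. */
static int32_t float_bits_times_1024(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    int32_t exp   = (int32_t)((bits >> 23) & 0xff) - 127;  /* remove bias */
    uint32_t mant = (bits & 0x7fffff) | 0x800000;          /* implicit leading 1 */

    if ((bits & 0x7fffffff) == 0)
        return 0;  /* +0.0 and -0.0 */

    /* The encoded value is mant * 2^(exp - 23); scaling by 1024 adds 10. */
    int shift = exp - 23 + 10;
    uint32_t mag;
    if (shift >= 8)
        mag = 0x7fffffff;      /* would overflow 31 bits: clamp */
    else if (shift >= 0)
        mag = mant << shift;
    else if (shift > -32)
        mag = mant >> -shift;  /* C forbids shifts of 32 or more... */
    else
        mag = 0;               /* ...so special-case the all-zero result */

    return sign ? -(int32_t)mag : (int32_t)mag;
}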
You can use rational numbers instead of floats. The operations (multiplication, addition) can be implemented without loss in accuracy too.
If you really only need 1/1000 precision, you can just store x*1000 as a long integer.

Fortran/C Interlanguage problems: results differ in the 14th digit

I have to use C and Fortran together to do some simulations. In the course of these, I use the same memory in both languages, by defining a pointer in C to access memory allocated by Fortran.
The datatype of the problematic variable is
real(kind=8)
for Fortran, and
double
for C. The results of the same calculations now differ between the two languages, and I need to compare them directly and get a difference of zero. All calculations are done only with the above precision. The difference is always in the 13th-14th digit.
What would be a good way to resolve this? Any compiler flags? Just cutting off after some digits?
Many thanks!
Floating point is not perfectly accurate. Ever. Even cos(x) == cos(y) can be false if x == y.
So when doing your comparisons, take this into account, and allow the values to differ by some small epsilon value.
This is a problem with the inaccuracy of floating point numbers - they will be inaccurate at a certain place. You usually compare them either by rounding to a digit that you know lies within the accurate range, or by providing an epsilon of appropriate value (small enough not to impact further calculations, and big enough to absorb the inaccuracy while comparing).
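A minimal sketch of such an epsilon comparison (the tolerance 1e-12 is an arbitrary choice suited to values near 1; in practice you would scale it to your data):

#include <math.h>
#include <stdio.h>

/* Compare two doubles up to an absolute tolerance. */
static int nearly_equal(double a, double b, double eps)
{
    return fabs(a - b) <= eps;
}

int main(void)
{
    double fortran_result = 0.12345678901234567;
    double c_result       = 0.12345678901234573;  /* differs in the 14th digit */

    puts(nearly_equal(fortran_result, c_result, 1e-12) ? "equal" : "different");
    return 0;
}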
One thing you might check is to be sure that the FPU control word is the same in both cases. If it is set to 53-bit precision in one case and 64-bit in the other, it would likely produce different results. You can use the instructions fstcw and fldcw to read and load the control word value. Nonetheless, as others have mentioned, you should not depend on the accuracy being identical even if you can make it work in one situation.
Perfect portability is very difficult to achieve in floating point operations. Changing the order of the machine instructions might change the rounding. One compiler might keep values in registers, while another copies them to memory, which can change the precision. Currently the Fortran and C languages allow a certain amount of latitude here. The IEEE module of Fortran 2008, when implemented, will allow requiring more specific and therefore more portable floating point computations.
Since you are compiling for an x86 architecture, it's likely that one of the compilers is maintaining intermediate values in floating point registers, which are 80 bits as opposed to the 64 bits of a C double.
For GCC, you can supply the -ffloat-store option to inhibit this optimisation. You may also need to change the code to explicitly store some intermediate results in double variables. Some experimentation is likely in order.
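To illustrate the second suggestion, a small sketch (the expression is made up): assigning an intermediate to a named double variable asks for a rounding to 64 bits, which is what -ffloat-store enforces for every assignment. Whether the store actually happens without that flag can depend on the optimization level.

#include <stdio.h>

int main(void)
{
    double a = 1.0 / 3.0;

    /* May be evaluated entirely in 80-bit x87 registers: */
    double direct = a * 3.0 - 1.0;

    /* Storing the intermediate rounds it to a 64-bit double before
       the subtraction: */
    double product = a * 3.0;
    double stored  = product - 1.0;

    printf("direct=%.20g stored=%.20g\n", direct, stored);
    return 0;
}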

Different Truncation Results When Casting

I'm having some difficulty predicting how my C code will truncate results. Refer to the following:
float fa,fb,fc;
short ia,ib;
fa=160;
fb=0.9;
fc=fa*fb;
ia=(short)fc;
ib=(short)(fa*fb);
The results are ia=144, ib=143.
I can understand the reasoning for either result, but I don't understand why the two calculations are treated differently. Can anyone refer me to where this behaviour is defined or explain the difference?
Edit: the results are compiled with MS Visual C++ Express 2010 on Intel core i3-330m. I get the same results on gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) under Virtual Box on the same machine.
The compiler is allowed to use more precision for a subexpression like fa*fb than it uses when assigning to a float variable like fc. So it's the fc= part which is very slightly changing the result (and happening to then make a difference in the integer truncation).
aschepler explained the mechanics of what's going on well, but the fundamental problem with your code is using a value which does not exist as a float in code that depends upon the value of its approximation in an unstable way. If you want to multiply by 0.9 (the actual number 0.9=9/10, not the floating point value 0.9 or 0.9f) you should multiply by 9 then divide by 10, or forget about floating point types and use a decimal arithmetic library.
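A tiny sketch of that suggestion applied to the question's numbers:

#include <stdio.h>

int main(void)
{
    int fa = 160;

    /* Multiply by the exact rational 9/10 in integer arithmetic instead
       of by a float approximation of 0.9: */
    short ib = (short)((fa * 9) / 10);  /* exactly 144, deterministically */

    printf("%d\n", ib);
    return 0;
}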
A cheap and dirty way around the problem, when the unstable points are isolated as in your example here, is to just add a value (typically 0.5) which you know will be larger than the error but smaller than the difference from the next integer before truncating.
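For example, applied to the original snippet (0.5 playing the role of the fudge value just described):

#include <stdio.h>

int main(void)
{
    float fa = 160;
    float fb = 0.9f;

    /* Nudge the product past the representation error before truncating: */
    short ib = (short)(fa * fb + 0.5f);  /* 144 instead of a possible 143 */

    printf("%d\n", ib);
    return 0;
}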
This is compiler dependent. On mine (gcc 4.4.3) it produces the same result for both expressions, namely 144, probably because the identical expression is optimized away.
Others have explained well what happened. In other words, the difference probably arises because your compiler internally promotes the floats to 80-bit FPU registers before performing the multiplication, then converts back either to float or to short.
If my hypothesis is true, then if you write ib = (short)(float)(fa * fb); you should get the same result as when casting fc to short.

casting doubles to integers in order to gain speed

In Redis (http://code.google.com/p/redis) there are scores associated with elements, used to keep these elements sorted. These scores are doubles, even though many users actually sort by integers (for instance unix times).
When the database is saved, we need to write these doubles to disk. This is what is currently used:
snprintf((char*)buf+1,sizeof(buf)-1,"%.17g",val);
Additionally, infinity and not-a-number conditions are checked in order to represent these in the final database file as well.
Unfortunately, converting a double into its string representation is pretty slow. We have a function in Redis that converts an integer into a string representation in a much faster way, so my idea was to check whether a double can be cast to an integer without loss of data, and if so, use the integer-to-string function.
For this to provide a good speedup, of course, the test for integer "equivalence" must be fast. So I used a trick that is probably undefined behavior but that worked very well in practice. Something like this:
double x = ... some value ...
if (x == (double)((long long)x))
use_the_fast_integer_function((long long)x);
else
use_the_slow_snprintf(x);
In my reasoning, the double casting above converts the double into a long long, and then back into a double. If the range fits and there is no decimal part, the number will survive the round trip and be exactly the same as the initial number.
As I wanted to make sure this will not break things in some system, I joined #c on freenode and I got a lot of insults ;) So I'm now trying here.
Is there a standard way to do what I'm trying to do without going outside ANSI C? Otherwise, is the above code supposed to work on all the POSIX systems that Redis currently targets? That is, archs where Linux / Mac OS X / *BSD / Solaris run nowadays?
What I can add in order to make the code saner is an explicit check for the range of the double before trying the cast at all.
Thank you for any help.
Perhaps some old-fashioned fixed point math could help you out. If you converted your double to a fixed point value, you would still get decimal precision, and converting to a string is as easy as with ints, with the addition of a single shift function.
Another thought would be to roll your own snprintf() function. Doing the conversion from double to int is natively supported by many FPU units so that should be lightning fast. Converting that to a string is simple as well.
Just a few random ideas for you.
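A rough sketch of the fixed-point idea, with arbitrary choices on my part (a Q48.16 layout, four decimal digits, non-negative values only for brevity):

#include <stdint.h>
#include <stdio.h>

/* Format a non-negative Q48.16 fixed-point value as "integer.fraction". */
static void fixed16_to_str(int64_t v, char *buf, size_t len)
{
    int64_t ipart  = v >> 16;                 /* integer part */
    uint32_t fbits = (uint32_t)(v & 0xffff);  /* 16 fraction bits */

    /* Map the fraction onto four decimal digits: fbits * 10000 / 2^16. */
    uint32_t dec = (uint32_t)(((uint64_t)fbits * 10000) >> 16);
    snprintf(buf, len, "%lld.%04u", (long long)ipart, dec);
}

int main(void)
{
    char buf[32];
    fixed16_to_str((int64_t)(3.25 * 65536), buf, sizeof buf);
    printf("%s\n", buf);  /* prints 3.2500 */
    return 0;
}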
The problem with doing that is that the comparisons won't work out the way you'd expect. Just because one floating point value is less than another doesn't mean that its representation as an integer will be less than the other's. Also, I see you comparing one of the (former) double values for equality. Due to rounding and representation errors in the low-order bits, you almost never want to do that.
If you are just looking for some kind of key to do something like hashing on, it would probably work out fine. If you actually care about which values really have greater or lesser value, it's a bad idea.
I don't see a problem with the casts, as long as x is within the range of long long. Maybe you should check out the modf() function which separates a double into its integral and fractional part. You can then add checks against (double)LLONG_MIN and (double)LLONG_MAX for the integral part to make sure. Though there may be difficulties with the precision of double.
But before doing anything of this, have you made sure it actually is a bottleneck by measuring its performance? And is the percentage of integer values high enough that it would really make a difference?
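A hedged sketch of that range-checked fast path (serialize_score is a made-up name, and snprintf with %lld stands in for Redis's faster integer routine; the strict < against (double)LLONG_MAX is deliberate, since that constant rounds up to 2^63 and the conversion back would overflow at the boundary):

#include <limits.h>
#include <math.h>
#include <stdio.h>

/* Write val into buf, taking the fast integer path when safe. */
void serialize_score(char *buf, size_t len, double val)
{
    double ipart;
    if (modf(val, &ipart) == 0.0 &&
        ipart >= (double)LLONG_MIN && ipart < (double)LLONG_MAX) {
        snprintf(buf, len, "%lld", (long long)ipart);  /* fast path */
    } else {
        snprintf(buf, len, "%.17g", val);              /* slow path */
    }
}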
Your test is perfectly fine (assuming you have already separately handled infinities and NaNs by this point) - and it's probably one of the very few occasions when you really do want to compare floats for equality. It doesn't invoke undefined behaviour - even if x is outside the range of long long, you'll just get an "implementation-defined result", which is OK here.
The only fly in the ointment is that negative zero will end up as positive zero (because negative zero compares equal to positive zero).
