Strange float rounding [duplicate] - c

This question already has answers here:
Why does division result in zero instead of a decimal?
(5 answers)
Closed 8 years ago.
I'm programming a microcontroller atmega168 (8 bit).
I want to do something like:
float A = cos(- (2/3) * M_PI);
Including of course math.h (#define M_PI 3.14159265358979323846)
As a result, instead of getting -0.5, I get 1.
I check the result over a serial connection to my PC, which I am sure also works for float numbers, because if I set
A= -0.50;
I receive the correct result.
PS. I cannot use double...also because i don't see the reason of doing so
Help me please!

2/3 is evaluated using integer arithmetic, so it evaluates to 0. You mean to use floating-point division:
float A = cos(-(2.0/3.0) * M_PI);
If you want float literals use an f suffix:
float A = cos(-(2.0f/3.0f) * M_PI);
Do note however, that the M_PI macro, expanded here, is a double literal. Is that what you want?
I presume the real code doesn't look quite like this. If this really is your code then you would write float A = -0.5f and move on. I guess the real code has variables.
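The difference is easy to check on a PC before flashing the microcontroller. The following is a minimal sketch; the function names are made up for illustration, and M_PI is defined by hand in case math.h does not provide it:

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* 2/3 is integer division and yields 0, so this is cos(0), i.e. 1. */
float cos_with_int_division(void)
{
    return (float)cos(-(2 / 3) * M_PI);
}

/* 2.0f/3.0f is floating-point division, so this is cos(-2*pi/3),
   which is approximately -0.5. */
float cos_with_float_division(void)
{
    return (float)cos(-(2.0f / 3.0f) * M_PI);
}
```

Sending both return values over the serial link should show 1 for the first function and a value very close to -0.5 for the second.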

How much precision do you need, and where are the numbers coming from? If your goal is to compute the cosine of 120 degrees, just set A to -0.5. If your original numbers are not expressed as radians, and if you don't need an absolutely precise result, a table-based approximation may be more useful than the built-in trig functions, since you can strike whatever balance between table size, execution speed, and precision best fits your needs. Note also that if your original numbers are integers, you may be able to compute trig functions without using any floating-point values [e.g. one could have a function that accepts angles from 0-65535 and returns values from -16384 to +16384]. Integer math is often much faster than floating-point, so such a function could be a major performance win.
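One possible shape for such an integer-only function is sketched below. The names isin16 and icos16, the 17-entry quarter-wave table and the linear interpolation are all assumptions made for illustration; a real design would pick the table size to match the precision it needs.

```c
#include <stdint.h>

/* Quarter-wave table: round(16384 * sin(i * 90/16 degrees)), i = 0..16. */
static const int16_t quarter_sine[17] = {
        0,  1606,  3196,  4756,  6270,  7723,  9102, 10394, 11585,
    12665, 13623, 14449, 15137, 15679, 16069, 16305, 16384
};

/* angle: 0..65535 covers one full turn; result: -16384..+16384. */
int16_t isin16(uint16_t angle)
{
    uint16_t quadrant = angle >> 14;     /* 0..3 */
    uint16_t pos = angle & 0x3FFF;       /* position inside the quadrant */

    if (quadrant & 1)                    /* quadrants 1 and 3 run backwards */
        pos = 16384 - pos;

    uint16_t idx = pos >> 10;            /* table index, 0..16 */
    int32_t frac = pos & 0x3FF;          /* 0..1023 between two entries */
    int32_t value = quarter_sine[idx];

    if (idx < 16)                        /* linear interpolation */
        value += ((quarter_sine[idx + 1] - quarter_sine[idx]) * frac) >> 10;

    /* quadrants 2 and 3 are the negative half of the wave */
    return (int16_t)((quadrant & 2) ? -value : value);
}

/* cos(a) = sin(a + quarter turn); the uint16_t cast wraps around. */
int16_t icos16(uint16_t angle)
{
    return isin16((uint16_t)(angle + 16384u));
}
```

With this table, icos16(21845) (roughly 120 degrees) comes out within a few dozen counts of -8192, i.e. -0.5 on the 16384 scale. A larger table or a better interpolation would tighten that at the cost of flash space.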


Floating point operations with no library

I am looking for an efficient way to properly do mathematical operations with floating-point values. As I am working in embedded C, I don't want to use any extra library for the float data type.
As far as I understand, the correct way here would be to treat a floating value as raw binary (sign, exponent, mantissa), and do the operations like that. But I cannot find any examples of how exactly that works.
I am looking for an explanation of how to do the following with no float data type:
Given a variable int x that can have values from 0 to 10000.
y = x * 0.720 + 84.234;
y = y / 2.5;
Thank you for your time internet
Floating point libraries are not required for the example operations you have suggested, and while avoiding floating point code on an embedded system without an FPU is often advisable, doing that by implementing your own floating point encoding will save you nothing and will likely be less efficient, less comprehensible and more error prone than using the compiler's built-in FP support.
Instead, you need to avoid floating-point code entirely, and use fixed-point encoding. In many cases that can be done ad-hoc for individual expressions, but if your application is math intensive (involving trig, logs, sqrt, exponentiation for example) you might choose a fixed-point library or implement your own.
Floating-point dependency is trivially eradicated in the examples you have suggested; for example:
// y = x * 0.720 + 84.234
// Where x_x1000 = real value * 1000
int y_x1000 = (x_x1000 * 720) / 1000 + 84234 ;
or more efficiently using binary-fixed-point and a 10 bit fractional part:
// y = x * 0.720 + 84.234
// Where x_q10 = real value * 1024
int32_t y_q10 = ((x_q10 * 737) >> 10) + 86256 ; // parentheses needed: + binds tighter than >>
Although you might consider int64_t for greater numeric range - in which case you might also use more fractional bits for greater precision too.
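As a rough sketch of both lines from the question in Q10, assuming x is a plain integer in the range 0 to 10000 as stated (the function and macro names here are invented for illustration):

```c
#include <stdint.h>

#define Q10_SHIFT  10        /* 10 fractional bits, scale = 1024 */
#define GAIN_Q10   737       /* 0.720  * 1024, rounded */
#define OFFSET_Q10 86256     /* 84.234 * 1024, rounded */

/* Returns (x * 0.720 + 84.234) / 2.5 as a Q10 value, for x in 0..10000. */
int32_t convert_q10(int32_t x)
{
    /* x is an unscaled integer, so x * GAIN_Q10 is already in Q10
       and stays well inside the int32_t range for x <= 10000. */
    int32_t y_q10 = x * GAIN_Q10 + OFFSET_Q10;

    /* Dividing by 2.5 is the same as multiplying by 2 and dividing by 5. */
    return (y_q10 * 2) / 5;
}

/* Rounds a non-negative Q10 value to the nearest integer. */
int32_t q10_to_int(int32_t v_q10)
{
    return (v_q10 + (1 << (Q10_SHIFT - 1))) >> Q10_SHIFT;
}
```

Because x itself is not pre-scaled here, the intermediate product fits in 32 bits. The result is only as good as the rounded constants: for x = 10000 this gives 2913 after rounding, against an exact value of about 2913.69, so more fractional bits would be needed if that error matters.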
If you are doing a lot of intensive fixed-point maths, you would do well to consider a library or implement one using CORDIC algorithms. An example of such a library can be found at https://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html, although it is C++ - the clear advantage being that by defining a fixed class and extensive operator overloading, existing floating-point code can largely be converted to fixed point by replacing double or float keywords with fixed and compiling as C++ - even if the code is otherwise non-OOP and entirely C-like.

How to properly round up doubles in C? [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
For some reason, ceil(x) rounds up round numbers to x+1.
For example:
double og_grade = 50;
double fact_grade = ceil(og_grade*1.1);
og_grade*1.1 should be 55.0000, but ceil(og_grade*1.1) returns 56.0000
Note that og_grade is always a whole number.
I tried the ceil(x) function, but for some reason when x is already
round, it rounds it up to x+1
No, that would mean that the implementation of ceil was defective. Not impossible but extremely unlikely.
It's likely that the x for which this effect is observed is in fact not integral, and the decimal portion is omitted from the formatting or debugger.
Assuming IEEE754, the closest double to 1.1 is slightly larger than that; this most likely accounts for your result.
In your case, given that og_grade is in fact a whole number, your best bet is to use an int for og_grade, and multiply by 11 instead; the subsequent rounding checks are then both trivial and exact.
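A minimal sketch of that integer approach, assuming og_grade is non-negative and small enough that og_grade * 11 does not overflow (the function name is hypothetical):

```c
/* Ceiling of og_grade * 1.1 using integers only, for og_grade >= 0.
   og_grade * 1.1 equals og_grade * 11 / 10; adding 9 before the
   division rounds any non-zero remainder up. */
int raised_grade(int og_grade)
{
    return (og_grade * 11 + 9) / 10;
}
```

For og_grade = 50 this computes (550 + 9) / 10 = 55, with no floating-point representation error involved.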

Simple floating point multiplication not giving expected result [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
When given the input 150, I expect the output to be the mathematically correct answer 70685.7750, but I am getting the wrong output 70685.7812.
#include<stdio.h>
int main()
{
float A,n,R;
n=3.14159;
scanf("%f",&R);
A=n*(R*R);
printf("A=%.4f\n",A);
}
float and double values cannot represent most numbers exactly. The main reason is that memory is limited, while most non-integers would need infinitely many digits.
The best example is PI. You can specify as many digits as you want, but it will still be an approximation.
This limited precision is the reason for the following rule:
when working with float and double numbers, do not check for equality (m == n), but check that the absolute difference between them is smaller than a certain error (fabs(m - n) < e)
Please note, as mentioned in the comments too, that the above rule is not "the mother rule of all rules". There are other rules also.
Careful analysis must be done for each particular situation, in order to have a properly working application.
(Thanks @EricPostpischil for the reminder)
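The tolerance rule can be written as a small helper. This is only a sketch: the name nearly_equal is made up, and a suitable value of e depends entirely on the magnitudes and error sources in the application.

```c
#include <math.h>
#include <stdbool.h>

/* True when m and n differ by less than the absolute tolerance e. */
bool nearly_equal(double m, double n, double e)
{
    return fabs(m - n) < e;
}
```

For example, nearly_equal(0.1 + 0.2, 0.3, 1e-9) holds even though the two sides are not bit-for-bit identical. An absolute tolerance like this does not scale with the size of the operands, which is one reason it is not a universal rule.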
It is common for a variable of type float to be an IEEE-754 32-bit floating point number.
The number 3.14159 cannot be stored exactly in an IEEE-754 32-bit float - the closest value is approximately 3.14159012. 150 * 150 * 3.14159012 is 70685.7777, and the closest value to this that can be represented in a 32-bit float is 70685.78125, which you are then printing with %.4f so you see 70685.7812.
Another way of thinking about this is that your n value only ends up being accurate to the sixth significant figure, so - as you are just calculating a series of multiplications - your result is also only accurate to the sixth significant figure (i.e. 70685.8). (In the general case this can be worse - for example, subtraction of two close values can lead to a large increase in the relative error.)
If you switch to using variables of type double (and change the scanf() to use %lf), then you will likely get the answer you are after. double is typically a 64-bit float, which means that the error in the representation of your n values and the result is small enough not to affect the fourth decimal place.
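To compare the two types side by side, here is a sketch that computes the same area in float and in double. The function names are invented, and it assumes float and double are IEEE-754 32-bit and 64-bit types:

```c
/* Area with everything in float, as in the question. */
float area_float(float r)
{
    const float n = 3.14159f;
    return n * (r * r);
}

/* Same formula with everything in double. */
double area_double(double r)
{
    const double n = 3.14159;
    return n * (r * r);
}
```

Printed with %.4f for an input of 150, the float version should show the 70685.7812 from the question, while the double version should show 70685.7750.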
Have you heard that float and double values aren't always perfectly accurate, have limited precision? Have you heard that type float gives you the equivalent of only about 7 decimal digits' worth of precision? This is what that means. Your expected and actual answers, 70685.7750 and 70685.7812, differ in the seventh digit, just about as expected.
I expect the output to be the mathematically correct answer
I am sorry to disappoint you, but that's your mistake. As a general rule, when you're doing floating-point arithmetic, you will never get the mathematically correct answer, you will always get a limited-precision approximation of the mathematically correct answer.
The canonical SO answers to this sort of question are collected at Is floating point math broken?. You might want to read some of those answers for more enlightenment.

Sin function without using math.h library

I've got an assignment for FOP to make a scientific calculator, we haven't been taught about the math.h library! my basic approach for one of the function SIN was this
but i'm failing to make this work
#include <stdio.h>
int main()
{
int input;
float pi;
double degree;
double sinx;
long int powerseven;
long int powerfive;
long int powerthree;
input = 5;
degree= (input*pi)/180;
pi=3.142;
powerseven=(degree*degree*degree*degree*degree*degree*degree);
powerfive=(degree*degree*degree*degree*degree);
powerthree=(degree*degree*degree);
sinx = (degree - (powerthree/6) + (powerfive/120) - (powerseven/5040));
printf("%ld", sinx);
getchar();
}
Your code almost works. You have a few problems:
You are using pi before initializing it. I suggest using a more accurate value of pi such as 3.14159265359.
powerseven, powerfive and powerthree should be defined as double instead of as long int. You are losing precision by storing these values in an integer type. Also, when you divide an integer value by an integer value (such as powerthree/6) the remainder is lost. For instance, 9/6 is 1.
Since sinx is a double you should be using printf("%f", sinx);
vacawama covered most of the technical C-language reasons your program isn't working. I'll attempt to cover some algorithmic ones. Using a fixed, finite number of Taylor series terms to compute sine is going to lose precision quickly as the argument gets farther away from the point at which you did the series expansion, i.e. zero.
To avoid this problem, you want to use the periodicity of the sine function to reduce your argument to a bounded interval. If your input is in radians, this is actually a difficult problem in itself, since pi is not representable in floating point. But as long as you're working in degrees, you can perform argument reduction by repeatedly subtracting the greatest power-of-two multiple of 360 that's less than the argument, until your result is in the interval [0,360). (If you could use the standard library, you could just use fmod for this.)
Once your argument is in a bounded interval, you can just choose an approximation that's sufficiently precise on that interval. A Taylor series approximation is certainly one approach you can use at this point, but not the only one.
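Putting both answers together, a sine in degrees without math.h could look roughly like this. It is a sketch under several assumptions: the input is finite, the name sin_degrees is made up, the reduced angle is additionally folded into [-90, 90] using the symmetry of sine, and ten Taylor terms are used, which is more than enough on that interval.

```c
static const double PI_APPROX = 3.14159265359;

/* Sine of an angle given in degrees, using only basic arithmetic. */
double sin_degrees(double deg)
{
    double sign = 1.0;
    if (deg < 0.0) {                 /* sin(-x) = -sin(x) */
        deg = -deg;
        sign = -1.0;
    }

    /* Reduce to [0, 360) by subtracting power-of-two multiples of 360,
       largest first. */
    double m = 360.0;
    while (m * 2.0 <= deg)
        m *= 2.0;
    while (m >= 360.0) {
        if (deg >= m)
            deg -= m;
        m /= 2.0;
    }

    /* Fold into [-90, 90] using the symmetry of sine. */
    if (deg > 180.0)
        deg -= 360.0;
    if (deg > 90.0)
        deg = 180.0 - deg;
    else if (deg < -90.0)
        deg = -180.0 - deg;

    /* Taylor series around zero; each term is built from the previous one. */
    double x = deg * PI_APPROX / 180.0;
    double term = x;
    double sum = x;
    for (int k = 1; k <= 9; ++k) {
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sign * sum;
}
```

All intermediate values are double, so none of the precision loss from the long int variables in the question occurs, and the result can be printed with printf("%f", ...).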

unusual output from pow

The following C code
int main(){
int n=10;
int t1=pow(10,2);
int t2=pow(n,2);
int t3=2*pow(n,2);
printf("%d\n",t1);
printf("%d\n",t2);
printf("%d\n",t3);
return (0);
}
gives the following output
100
99
199
I am using a devcpp compiler.
It does not make any sense, right?
Any ideas?
(That pow(10,2) is maybe something
like 99.9999 does not explain the first
output. Moreover, I got the same
output even if I include math.h)
You are using a poor-quality math library. A good math library returns exact results for values that are exactly representable.
Generally, math library routines must be approximations both because floating-point formats cannot exactly represent the exact mathematical results and because computing the various functions is difficult. However, for pow, there are a limited number of results that are exactly representable, such as 10^2. A good math library will ensure that these results are returned correctly. The library you are using fails to do that.
Store the results of the computations as doubles. Print them as doubles, using %f instead of %d. You will see that the 99 is really more like 99.999997, and this should make more sense.
In general, when working with any floating point math, you should assume results will be approximate; that is, a little off in either direction. So when you want exact results - like you did here - you're going to have trouble.
You should always understand the return type of functions before you use them. See, e.g. cplusplus.com:
double pow (double base, double exponent); /* C90 */
From other answers I understand there are situations when you can expect pow or other floating-point math to be precise. Once you understand the necessary imprecision that plagues floating point math, please consult these.
Your variables t1, t2 and t3 should be of type double because pow() returns double.
But if you do want them to be of type int, use the round() function.
int t1 = pow(10,2);
int t2 = round(pow(n,2));
int t3 = 2 * round(pow(n,2));
round() turns a returned value such as 99.9... into 100.0, so t2 becomes 100 when converted to int, and t3 becomes 2 * 100.0 = 200.
The output will be:
100
100
200
This works because the round function returns the integer value nearest to x, rounding halfway cases away from zero, regardless of the current rounding direction.
UPDATE: Here is comment from math.h:
/* Excess precision when using a 64-bit mantissa for FPU math ops can
cause unexpected results with some of the MSVCRT math functions. For
example, unless the function return value is stored (truncating to
53-bit mantissa), calls to pow with both x and y as integral values
sometimes produce a non-integral result. ... */
