I use C to do computation using the following code:
#include <stdio.h>
#include <math.h>
void main() {
float x = 3.104924e-33;
int i = 6000, j = 1089;
float value, value_inv;
value = sqrt(x / ((float)i * j));
value_inv = 1. / value;
printf("value = %e\n", value);
printf("value_inv = %e\n", value_inv);
}
We can see that value should in fact be about 2.18e-20, which does not exceed the range of the float data type in C. So why does the computer give me
value = 0.000000e+00
value_inv = inf
Does anybody know why it happens and how to solve this problem without changing data type to double?
OP's float apparently does not support sub-normals. C allows non-support.
This may be an implementation detail or due to a compiler option. Without changing to double, look to a different compiler or different options. Look at options concerning sub-normal support, the precision used for intermediate calculations, and optimization levels (which sometimes short-change edge cases like this).
On my machine, which does handle sub-normals, using C11, FLT_TRUE_MIN (the smallest non-zero float) is smaller than FLT_MIN (the smallest normal non-zero float).
#include<float.h>
float xx = x/((float)i*j);
printf("xx = %e %e %e\n",xx, FLT_MIN, FLT_TRUE_MIN);
Output
xx = 4.751943e-40 1.175494e-38 1.401298e-45
In OP's case, without sub-normal support, xx became 0.0f and led to the undesired output.
Using double math will handle the small intermediate float values.
value = sqrt(x/(1.0*i*j)); // Form product with `double` math
value_inv = 1.0f/value; // Here we can just use float math
printf("value = %e\n",value);
printf("value_inv = %e\n",value_inv);
Output
value = 2.179897e-20
value_inv = 4.587373e+19
On my computer (Ryzen 2700X, x86_64) the results are:
value = 2.179897e-020
value_inv = 4.587373e+019
You can try 1.f instead of 1., which is actually a double:
value_inv = 1.f/value;
Apparently your system doesn't support enough digits for float. On my system the output is:
value = 2.179895e-020
value_inv = 4.587376e+019
I got the answer by myself.
I should change sqrt(x/((float)i*j)) to sqrt((double)x/((double)i*j)). After this, I can get correct result:
value = 2.179897e-20
value_inv = 4.587373e+19
There is no reason to use float instead of double for such computations:
3.104924e-33 is a double constant, it gets converted to float upon assignment, with a potential loss of precision
sqrt gets a double argument and returns a double value. Implicit conversions occur again with potential loss of precision.
1. / value computes with the type double because 1. has this type. value gets converted before the division and the result is converted to float to store to value_inv.
value and value_inv are implicitly converted to double when passed to printf.
All these conversions may incur a loss of precision or even truncation to 0. You should instead always use double unless there is a strong requirement to use float:
#include <stdio.h>
#include <math.h>
int main() {
double x = 3.104924e-33;
int i = 6000, j = 1089;
double value, value_inv;
value = sqrt(x / ((double)i * j));
value_inv = 1. / value;
printf("value = %e\n", value);
printf("value_inv = %e\n", value_inv);
return 0;
}
If for some reason you are required to use float, be careful to avoid unneeded conversions:
#include <stdio.h>
#include <math.h>
int main() {
float x = 3.104924e-33F;
int i = 6000, j = 1089;
float value, value_inv;
value = sqrtf(x / ((float)i * j));
value_inv = 1.F / value;
printf("value = %e\n", value);
printf("value_inv = %e\n", value_inv);
return 0;
}
Related
Currently I'm doing my practice exercises for the C language, and I've run into a question about the function pow() in C/C++.
#include <stdio.h>
#include <math.h>
int main(){
double k = 0.2;
printf("2.6^k = %f\n", pow(2.6, k));
printf("-2.6^k = %f\n", pow(-2.6, k));
}
OUTPUT:
2.6^k = 1.210583
-2.6^k = -1.#IND00
In this example, -2.6 to the power of 0.2 is a complex number (edit: I originally thought it was not), but the output says (I think) that the number is indeterminate.
And in my practice there is the following:
[image: the formula from the exercise]
I implemented this like that:
/* e = 2.1783; x = -2.6 */
result = pow(cos(pow(x,0.2) - pow(e,-x + sqrt(3))) + 1.61,2);
But due to (-x + sqrt(3)) being a negative number, it outputs:
-1.#IND00
The value 0.2 cannot be represented exactly in binary floating point. So what you have is not actually 0.2 but a value slightly more than that. This yields a complex result so pow returns NaN.
Reading into this further, section 7.12.7.4 of the C standard regarding the pow function states:
double pow(double x, double y);
A domain error occurs if x is finite
and negative and y is finite and not an integer value.
In the event of a domain error, an implementation-defined value is returned. While MSVC doesn't seem to document what it does in this case, it apparently returns NaN. In the case of Linux, the man pages explicitly state that NaN is returned in this case.
With complex number math, (-2.6)^0.2 is 0.979382 + 0.711563*i.
pow(-2.6, k) does not have a real answer.
Alternative: use complex math:
#include <complex.h>
#include <math.h>
#include <stdio.h>
int main(void) {
double k = 0.2;
printf("2.6^k = %f\n", pow(2.6, k));
complex double y = cpow(-2.6, k);
printf("-2.6^k = %f %f*i\n", creal(y), cimag(y));
}
Output
2.6^k = 1.210583
-2.6^k = 0.979382 0.711563*i
EDITED TO BE REOPENED
People, how are you doing?
So, I am trying to assign the value 0.000010 to a variable, but it becomes a completely different number, and it shouldn't be a case of overflow, given the type. And it is important that it really be 0.000010, because it is used in a condition.
In the code below, it is the variable dif. During debugging, as a double, 0.000010 becomes 4.571853192736056e-315. As a float, it becomes 9.99999975e-06. If I print it after the assignment, it gives me the right value (0.000010), but the debugger shows me those other values.
EDIT TO HELP COMPREHENSION:
What am I supposed to do? I have a Pi value calculated with the Gregory-Leibniz series (Pi = 4 - 4/3 + 4/5 - 4/7 + ...). Each operation (-4/3 and +4/5, for example) is an iteration. I need to approximate this Pi to the constant M_PI, from the math.h library, with a maximum difference of X (a number entered by the user). For example, 100002 iterations of the series are necessary to approximate Pi and M_PI with a difference of 0.000010. So, in this example, the user chose dif = 0.000010 and got 100002 iterations.
The problem, as I said, is that the variable dif (as double or float) can't get to be 0.000010 (DEBUG IMAGES AFTER THE CODE).
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
long int n = 0, iteractions = 0;
float Pi1 = 4.0, Pi2 = 0.0, sub = 0.0, sum = 0.0;
double dif = 0.0;
printf("Type the difference to be observed: ");
scanf("%f", &dif);
Pi1 = 4;
sub = Pi1 - M_PI;
for(n=1; sub >= dif; n++){
Pi2 = (pow(-1,n)*4)/(2*n + 1);
sum = Pi1 + Pi2;
Pi1 = sum;
sub = Pi1 - M_PI;
iteractions = iteractions + 1;
}
printf("Iteractions: %ld \n", iteractions);
return 0;
}
Image:
As Carcigenicate asked: What specifically is "it"? What is "it" "becoming"?
I suspect maybe you mean "iteracao" (because it's the only thing you're printing), and I suspect maybe it's "huge" because the loop isn't behaving as you expect.
In any case:
Please read this article:
https://floating-point-gui.de/
What Every Programmer Should Know About Floating-Point Arithmetic
or
Why don’t my numbers add up?
Please update your post, clarifying exactly what the problem is, where in your code it's occurring, and what you "expected" vs. what you're seeing.
The precision of a float is about 7 significant digits. You are calculating pi (3.14159...) and want to get to within a difference of 0.000010. This is right at the limit of what a float can represent. Switching to double will give you close to 15 digits of precision.
You're using the wrong format specifier for scanf:
double dif = 0.0;
printf("Type the difference to be observed: ");
scanf("%f", &dif);
The %f format specifier expects a float *, but you're passing in a double *. These point to datatypes of different sizes and different representations. Using the wrong format specifier leads to undefined behavior which is why you're getting the wrong value.
To read a double, use %lf:
scanf("%lf", &dif);
Also, the value 0.000010 cannot be represented exactly in binary floating point, so even with this fix you'll see a value that is slightly larger or slightly smaller than the entered value.
People, thank you for the help. The right code is the one below.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int n = 0, iteractions = 1;
double Pi1 = 4.0, serie = 0.0, sub = 0.0, sum = 0.0;
double dif = 0.0;
printf("Type the difference to be observed: ");
scanf("%lf", &dif);
Pi1 = 4;
sub = Pi1 - M_PI;
for(n=1; sub >= dif; n++){
serie = (pow(-1,n)*4)/(double)(2*n + 1);
iteractions = iteractions + 1;
sum = Pi1 + serie;
Pi1 = sum;
sub = Pi1 - M_PI;
if(sub < 0){
sub = -1 * sub;
}
}
printf("Iteractions: %d \n", iteractions);
return 0;
}
I am not an expert in programming, and I am facing the following issue.
I need to compute modulo between floats A and B.
So I use fmod((double)A, (double)B).
Theoretically, if A is a multiple of B, then the result is 0.0.
However, due to floating-point precision, A and B are not exactly the numbers I expected to have.
Then, the result of the modulo computation is not 0.0, but something different.
Which is problematic.
Example:
A=99999.9, but the compiler interprets it as 99999.898.
B=99.9, but the compiler interprets it as 99.900002.
fmod(A,B) expected to be 0.0, but gives actually 99.9.
So the question is: how do you usually handle this kind of situation?
Thank you
The trouble is that:
A is not 99999.9, but 99999.8984375 and
B is not 99.9, but 99.90000152587890625 and
A mod B is 99.89691162109375
OP is getting the correct answer for the arguments given.
Different arguments need to be used.
A reasonable alternative is to scale the arguments by a power of 10, round to an integer, apply %, then convert back to floating point and un-scale.
Overflow is a concern.
Since OP wants to treat numbers to the nearest 0.1, scale by 10.
#include <float.h>
#include <math.h>
#include <stdio.h>
int main(void) {
float A = 99999.9;
float B = 99.9;
printf("%.25f\n", A);
printf("%.25f\n", B);
printf("%.25f\n", fmod(A,B));
long long a = lround(A*10.0);
long long b = lround(B*10.0);
long long m = a%b;
double D = m/10.0;
printf("D = %.25f\n", D);
return 0;
}
Output
99999.8984375000000000000000000
99.9000015258789062500000000
99.8969116210937500000000000
D = 0.0000000000000000000000000
Alternative
Instead of the integer conversion used above:
long long a = lround(A*10.0);
long long b = lround(B*10.0);
long long m = a%b;
double D = m/10.0;
scale, but skip the integer conversion part:
double a = round(A*10.0);
double b = round(B*10.0);
double m = fmod(a,b);
double D = m/10.0;
I wrote this short program to test the conversion from double to int:
#include <stdio.h>
int main() {
int a;
int d;
double b = 0.41;
/* Cast from variable. */
double c = b * 100.0;
a = (int)(c);
/* Cast expression directly. */
d = (int)(b * 100.0);
printf("c = %f \n", c);
printf("a = %d \n", a);
printf("d = %d \n", d);
return 0;
}
Output:
c = 41.000000
a = 41
d = 40
Why do a and d have different values even though they are both the product of b and 100?
The C standard allows a C implementation to compute floating-point operations with more precision than the nominal type. For example, the Intel 80-bit floating-point format may be used when the type in the source code is double, for the IEEE-754 64-bit format. In this case, the behavior can be completely explained by assuming the C implementation uses long double (80 bit) whenever it can and converts to double when the C standard requires it.
I conjecture what happens in this case is:
In double b = 0.41;, 0.41 is converted to double and stored in b. The conversion results in a value slightly less than .41.
In double c = b * 100.0;, b * 100.0 is evaluated in long double. This produces a value slightly less than 41.
That expression is used to initialize c. The C standard requires that it be converted to double at this point. Because the value is so close to 41, the conversion produces exactly 41. So c is 41.
a = (int)(c); produces 41, as normal.
In d = (int)(b * 100.0);, we have the same multiplication as before. The value is the same as before, something slightly less than 41. However, this value is not assigned to or used to initialize a double, so no conversion to double occurs. Instead, it is converted to int. Since the value is slightly less than 41, the conversion produces 40.
The compiler can infer that c has to be initialized with 0.41 * 100.0 and does that better than the calculation of d.
The crux of the problem is that 0.41 is not exactly representable in IEEE 754 64-bit binary floating point. The actual value (with only enough precision to show the relevant part) is 0.409999999999999975575..., while 100 can be represented exactly. Multiplying these together should yield 40.9999999999999975575..., which is again not quite representable. In the likely case that the rounding mode is towards nearest, zero, or negative infinity, this should be rounded to 40.9999999999999964.... When cast to an int, this is rounded to 40.
The compiler is allowed to do calculations with higher precision, however, and in particular may replace the multiplication in the assignment of c with a direct store of the computed value.
Edit: I miscalculated the largest representable number less than 41, the correct value is approximately 40.99999999999999289.... As both Eric Postpischil and Daniel Fischer correctly point out, even the value calculated as a double should be rounded to 41 unless the rounding mode is towards zero or negative infinity. Do you know what the rounding mode is? It makes a difference, as this code sample shows:
#include <stdio.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
int main(void)
{
int roundMode = fegetround( );
volatile double d1;
volatile double d2;
volatile double result;
volatile int rounded;
fesetround(FE_TONEAREST);
d1 = 0.41;
d2 = 100;
result = d1 * d2;
rounded = result;
printf("nearest rounded=%i\n", rounded);
fesetround(FE_TOWARDZERO);
d1 = 0.41;
d2 = 100;
result = d1 * d2;
rounded = result;
printf("zero rounded=%i\n", rounded);
fesetround(roundMode);
return 0;
}
Output:
nearest rounded=41
zero rounded=40
Is there a function to round a float in C or do I need to write my own?
float conver = 45.592346543;
I would like to round the actual value to one decimal place, conver = 45.6.
As Rob mentioned, you probably just want to print the float to 1 decimal place. In this case, you can do something like the following:
#include <stdio.h>
#include <stdlib.h>
int main()
{
float conver = 45.592346543;
printf("conver is %0.1f\n",conver);
return 0;
}
If you want to actually round the stored value, that's a little more complicated. For one, your one-decimal-place representation will rarely have an exact analog in floating-point. If you just want to get as close as possible, something like this might do the trick:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
float conver = 45.592346543;
printf("conver is %0.1f\n",conver);
conver = conver*10.0f;
conver = (conver > (floor(conver)+0.5f)) ? ceil(conver) : floor(conver);
conver = conver/10.0f;
//If you're using C99 or better, rather than ANSI C/C89/C90, the following will also work.
//conver = roundf(conver*10.0f)/10.0f;
printf("conver is now %f\n",conver);
return 0;
}
I doubt this second example is what you're looking for, but I included it for completeness. If you do require representing your numbers in this way internally, and not just on output, consider using a fixed-point representation instead.
Sure, you can use roundf(). If you want to round to one decimal, then you could do something like: roundf(10 * x) / 10
#include <math.h>
double round(double x);
float roundf(float x);
Don't forget to link with -lm. See also ceil(), floor() and trunc().
Just to generalize Rob's answer a little, if you're not doing it on output, you can still use the same interface with sprintf().
I think there is another way to do it, though. You can try ceil() and floor() to round up and down. A nice trick is to add 0.5, so anything over 0.5 rounds up but anything under it rounds down. ceil() and floor() only work on doubles though.
EDIT: Also, for floats, you can use truncf() to truncate floats. The same +0.5 trick should work to do accurate rounding.
To print a rounded value, @Matt J answers the question well.
float x = 45.592346543;
printf("%0.1f\n", x); // 45.6
As most floating point (FP) is binary based, exact rounding to one decimal place is not possible when the mathematically correct answer is x.1, x.2, ....
To convert the FP number to the nearest 0.1 is another matter.
Overflow: Approaches that first scale by 10 (or 100, 1000, etc) may overflow for large x.
float round_tenth1(float x) {
x = x * 10.0f;
...
}
Double rounding: Adding 0.5f and then using floorf(x*10.0f + 0.5f)/10.0 returns the wrong result when the intermediate sum x*10.0f + 0.5f rounds up to a new integer.
// Fails to round 838860.4375 correctly, comes up with 838860.5
// 0.4499999880790710449 fails as it rounds to 0.5
float round_tenth2(float x) {
if (x < 0.0) {
return ceilf(x*10.0f - 0.5f)/10.0f;
}
return floorf(x*10.0f + 0.5f)/10.0f;
}
Casting to int has the obvious problem when float x is much greater than INT_MAX.
Using roundf() and family, available in <math.h> is the best approach.
float round_tenthA(float x) {
double x10 = 10.0 * x;
return (float) (round(x10)/10.0);
}
To avoid using double, simply test if the number needs rounding.
float round_tenthB(float x) {
const float limit = 1.0/FLT_EPSILON;
if (fabsf(x) < limit) {
return roundf(x*10.0f)/10.0f;
}
return x;
}
There is a round() function, also roundf(), which will round to the nearest integer expressed as a double (or as a float for roundf()). But that is not what you want.
I had the same problem and wrote this:
#include <math.h>
double db_round(double value, int nsig)
/* ===============
**
** Rounds double <value> to <nsig> significant figures. Always rounds
** away from zero, so -2.6 to 1 sig fig will become -3.0.
**
** <nsig> should be in the range 1 - 15
*/
{
double a, b;
long long i;
int neg = 0;
if(!value) return value;
if(value < 0.0)
{
value = -value;
neg = 1;
}
i = nsig - log10(value);
if(i) a = pow(10.0, (double)i);
else a = 1.0;
b = value * a;
i = b + 0.5;
value = i / a;
return neg ? -value : value;
}
You can use #define round(a) ((int)((a) + 0.5)) as a macro,
so whenever you write round(1.6) it returns 2 and whenever you write round(1.3) it returns 1. Note that this only rounds correctly for non-negative values that fit in an int.