Error when calculating max value for double variable in C

I followed the solution in "How to Calculate Double + Float Precision" but have been unable to calculate the maximum value for a variable of type double.
I run:
double dbl_max = pow(2, pow(2, 10)) * (1-pow(2, -53));
printf("%.2e", dbl_max);
Result: inf
Or:
double dbl_max = (pow(2, pow(2, 10)));
printf("%.2e", dbl_max);
Result: inf
Or:
double dbl_max = pow(2, pow(2, 9)) * (1-pow(2, -53));
printf("%.2e", dbl_max);
Result: 1.34e+154
Why isn't the calculation fitting into the variable? The top sample above works just fine for float variables.

The intermediate exponent is one too high. Change pow(2, 10) to (pow(2, 10) - 1) and it should work. You can compensate by multiplying the final result by 2.
– Tom Karzes
double dbl_max = pow(2, pow(2, 10)-1) * (1-pow(2, -53)) * 2;
printf("%.2e", dbl_max);

Related

Calculate the range of double

As part of an exercise from "The C Programming Language" I am trying to find a way to calculate the maximum possible float and the maximum possible double on my computer. The technique shown below works with floats (to calculate the max float) but not with doubles:
// max float:
float f = 1.0;
float last_f;
float step = 9.0;
while(1) {
last_f = f;
f *= (1.0 + step);
while (f == INFINITY) {
step /= 2.0;
f = last_f * (1.0 + step);
}
if (! (f > last_f) )
break;
}
printf("calculated float max : %e\n", last_f);
printf("limits.h float max : %e\n", FLT_MAX);
printf("diff : %e\n", FLT_MAX - last_f);
printf("The expected value? : %s\n\n", (FLT_MAX == last_f)? "yes":"no");
// max double:
double d = 1.0;
double last_d;
double step_d = 9.0;
while(1) {
last_d = d;
d *= (1.0 + step_d);
while (d == INFINITY) {
step_d /= 2.0;
d = last_d * (1.0 + step_d);
}
if (! (d > last_d) )
break;
}
printf("calculated double max: %e\n", last_d);
printf("limits.h double max : %e\n", DBL_MAX);
printf("diff : %e\n", DBL_MAX - last_d);
printf("The expected value? : %s\n\n", (DBL_MAX == last_d)? "yes":"no");
This results in:
calculated float max : 3.402823e+38
limits.h float max : 3.402823e+38
diff : 0.000000e+00
The expected value? : yes
calculated double max: 1.797693e+308
limits.h double max : 1.797693e+308
diff : 1.995840e+292
The expected value? : no
It looks to me like it still calculates using single precision in the second case.
What am I missing?

OP's approach works when the calculations are done with wider precision than the target type: wider than float in the first case and wider than double in the second.
In the first case, OP reports FLT_EVAL_METHOD == 0, so float calculations are done as float and double calculations as double. Note that even though step is a float, 1.0 + step is a double calculation because 1.0 is a double literal.
The code below forces the calculation to double, so I can replicate OP's problem even with my FLT_EVAL_METHOD == 2 (long double is used for internal calculations).
volatile double d = 1.0;
volatile double last_d;
volatile double step_d = 9.0;
while(1) {
last_d = d;
d *= (1.0 + step_d);
while (d == INFINITY) {
step_d /= 2.0;
volatile double sum = 1.0 + step_d;
d = last_d * sum;
//d = last_d + step_d*last_d;
}
if (! (d > last_d) ) {
break;
}
}
diff : 1.995840e+292
The expected value? : no
Instead, OP should use the following, which does not form the inexact sum 1.0 + step_d when step_d is small; rather, it forms the exact product step_d*last_d. This second form results in a more accurate calculation of the new d by providing an additional bit of calculation precision. Higher-precision FP is not needed to employ OP's approach.
d = last_d + step_d*last_d;
diff : 0x0p+0 0.000000e+00
The expected value? : yes

The expressions with the literals n.0 are all double precision floating point types. That allows the assignment to f to be calculated using a higher precision intermediate value.
It's this effect that allows the algorithm to converge in the float case.
With strict double precision floating point such convergence is not possible.
If you had used the f suffix on the literals in the float case then convergence would not occur there either.
A fix would be to use long double suffixes on the literals if your platform has a wider long double type.

Division issues in C

I don't really know how to explain this (that's why the title is so vague), but I need a way to make C divide in a certain way: the quotient should have no decimals, with the leftover reported separately as a remainder. For example:
Instead of 5.21 / .25 = 20.84
I need this 5.21 / .25 = *20* Remainder = *.21*
I found out how to find the remainder with fmod(), but how do I find the 20?
Thanks ~

How about using implicit conversions?
float k = 5.21 / .25;
int n = k;
k -= n;
results in
k = .84
n = 20
Using only ints will also do the job if you don't need the remainder:
int k = 5.21 / .25;
will automatically truncate the result and give k = 20

Use double modf(double value, double *iptr) to extract the integer portion of a FP number.
The modf functions break the argument value into integral and fractional parts, each of which has the same type and sign as the argument. C11 §7.12.6.12 2
#include <math.h>
#include <stdio.h>
int main() {
double a = 5.21;
double b = 0.25;
double q = a / b;
double r = fmod(a, b);
printf("quotient: %f\n", q);
printf("remainder: %f\n", r);
double ipart;
double fpart = modf(q, &ipart);
printf("quotient i part: %f\n", ipart);
printf("quotient f part: %f\n", fpart);
return 0;
}
Output
quotient: 20.840000
remainder: 0.210000
quotient i part: 20.000000
quotient f part: 0.840000
Using int is problematic due to a limited range, precision and sign issues.

casting signed to double different result than casting to float then double

As part of an assignment I am checking whether the expression (double) (float) x == (double) x
always returns 1 or not (x is a signed integer).
It works for every value except INT_MAX. Why is that? If I print the values, they both show the same value, even for INT_MAX.
x = INT_MAX ;
printf("Signed X: %d\n",x);
float fx1 = (float)x;
double dx1 = (double)x;
double dfx = (double)(float)x;
printf("(double) x: %g\n",dx1);
printf("(float) x: %f \n",fx1);
printf("(double)(float)x: %g\n",dfx);
if((double) (float) x == (double) x){
printf("RESULT:%d\n", ((double)(float) x == (double) x));
}
EDIT: the entire program:
#include<stdio.h>
#include<stdlib.h>
#include<limits.h>
int main(int argc, char *argv[]){
//create random values
int x = INT_MAX ;
printf("Signed X: %d\n",x);
float fx1 = (float)x;
double dx1 = (double)x;
double dfx = (double)(float)x;
printf("(double) x: %g\n",dx1);
printf("(float) x: %f \n",fx1);
printf("(double)(float)x: %g\n",dfx);
if((double) (float) x == (double) x){
printf("RESULT:%d\n", ((double)(float) x == (double) x));
}
return 0;
}//end of main function

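int and float most likely have the same number of bits in their representation, namely 32. A float has to fit a sign bit, an exponent and a mantissa into those bits, so its mantissa has fewer than the 31 bits needed to represent large int values such as INT_MAX exactly (an IEEE 754 single-precision float has a 24-bit significand). So there is a loss of precision when the value is stored in a float: (float)INT_MAX rounds up to 2147483648, while (double)INT_MAX keeps the exact value 2147483647. Printing with %g hides the difference because it shows only six significant digits.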

Manually implementing a rounding function in C

I have written a C program (part of my project) that rounds a float value to a precision specified by the user. The function looks something like this:
float round_offf (float num, int precision)
What I have done in this program is convert the float to a string and then process it.
But is there a way to keep the number as a float and implement the same thing?
Eg. num = 4.445 prec = 1 result = 4.4

Of course there is. Very simple:
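#include <math.h>
float custom_round(float num, int prec)
{
    /* Scale, round to nearest, scale back. Staying in floating point
       avoids overflowing an int for large inputs. */
    double scale = pow(10, prec);
    return (float)(round(num * scale) / scale);
}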
Edit: it seems to me that you want this because you think you can't have dynamic precision in a format string. Apparently, you can:
int precision = 3;
double pie = 3.14159265358979323846; // I'm hungry, I need a double pie
printf("Pi equals %.*lf\n", precision, pie);
This prints 3.142.

Yes:
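#include <math.h>
float round_offf(float num, int precision)
{
    /* Keep one extra digit so we can decide which way to round. */
    int power = (int)pow(10, precision + 1);
    int result = (int)(num * power);
    int last = result % 10;
    if (last >= 5)
        result += 10;   /* round half away from zero */
    else if (last <= -5)
        result -= 10;
    result /= 10;       /* drop the extra digit */
    return (float)result / (float)(power / 10);
}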

C - finding cube root of a negative number with pow function

In the real world, the cube root of a negative number should exist:
cuberoot(-1)=-1, that means (-1)*(-1)*(-1)=-1
or
cuberoot(-27)=-3, that means (-3)*(-3)*(-3)=-27
But when I calculate cube root of a negative number in C using pow function, I get nan (not a number)
double cuber;
cuber=pow((-27.),(1./3.));
printf("cuber=%f\n",cuber);
output: cuber=nan
Is there any way to calculate cube root of a negative number in C?

7.12.7.1 The cbrt functions
Synopsis
#include <math.h>
double cbrt(double x);
float cbrtf(float x);
long double cbrtl(long double x);
Description
The cbrt functions compute the real cube root of x.
If you're curious, pow can't be used to compute cube roots because one-third is not expressible as a floating-point number. You're actually asking pow to raise -27.0 to a rational power very nearly equal to 1/3; there is no real result that would be appropriate.

There is. Remember: x^(1/3) = -(-x)^(1/3). So the following should do it:
double cubeRoot(double d) {
if (d < 0.0) {
return -cubeRoot(-d);
}
else {
return pow(d,1.0/3.0);
}
}
Written without compiling, so there may be syntax errors.
Greetings,
Jost

As Stephen Canon answered, the correct function to use in this case is cbrt(). If you don't know the exponent beforehand, you can look into the cpow() function.
#include <stdio.h>
#include <math.h>
#include <complex.h>
int main(void)
{
printf("cube root cbrt: %g\n", cbrt(-27.));
printf("cube root pow: %g\n", pow(-27., 1./3.));
double complex a, b, c;
a = -27.;
b = 1. / 3;
c = cpow(a, b);
printf("cube root cpow: (%g, %g), abs: %g\n", creal(c), cimag(c), cabs(c));
return 0;
}
prints
cube root cbrt: -3
cube root pow: -nan
cube root cpow: (1.5, 2.59808), abs: 3
Keep in mind the definition of the complex power: cpow(a, b) = cexp(b* clog(a)).

Using Newton's Method:
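def cubicroot(num):
    # Work on the magnitude and restore the sign at the end.
    flag = 1
    if num < 0:
        flag = -1
        num = -num
    if num == 0:
        return 0.0  # avoids dividing by zero in the Newton step
    x0 = num / 2.
    x1 = x0 - (((x0 * x0 * x0) - num) / (3. * x0 * x0))
    # Iterate until two successive estimates agree to a small relative tolerance.
    while abs(x1 - x0) > 1e-12 * abs(x1):
        x0 = x1
        x1 = x0 - (((x0 * x0 * x0) - num) / (3. * x0 * x0))
    return x1 * flag
print(cubicroot(27))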
