Casting double -> long -> short in C (using right shift ">>")

I have a small program that does some number transformations. I want to convert a number from double to long and then reduce it to short using a right bit shift. But it gives me different results and I don't know why.
I have 3 numbers in an array; I sum them in a for loop and cast the running total to short each time.
One of the numbers has a fractional part of .000000007, more exactly 63897600.000000007. Adding this number to the total and then subtracting it again gives me different results.
I can't figure out why this occurs or how to handle this particular case.
Here is my code:
#include <stdio.h>
#define DOUBLETOLONG(number) (long)(number)
#define NEAREST(number) ((short)((number + 32768) >> 16))
#define LONGTOSHORT(number) NEAREST(DOUBLETOLONG(number))
int main() {
    int k = 0;
    double array[3] = { 41451520.000000, 63897600.000000007, -63897600.000000007 };
    double total_x = array[0];
    short j = LONGTOSHORT(total_x);
    printf("j = %d\n", j);
    for (k = 1; k < 3; k++) {
        total_x = total_x + array[k];
        j = LONGTOSHORT(total_x);
        printf("j = %d\n", j);
    }
    return 0;
}
These are the results:
j = 633
j = 1608
j = 632

41451520 + 63897600 = 105349120
In a double this integer can still be accurately represented. However, we didn't account for the fractional part 0.000000007. Let's check what the next biggest double is:
#include <stdio.h>
#include <math.h>
int main(int argc, char** argv) {
    printf("%.23f\n", nextafter(105349120.0, INFINITY));
    return 0;
}
Turns out, it's 105349120.000000014901.... Let's put those next to each other:
105349120.000000014901...
0.000000007
This means that 105349120.000000007 is closer to 105349120 than the next bigger double, so it correctly gets rounded down to 105349120.
However, when we subtract again, 105349120 - 63897600.000000007 gets rounded down, because the next smaller double below 41451520 is (nextafter(41451520.0, 0)) 41451519.999999992549.... Put them next to each other:
41451519.999999992549...
41451519.999999993
Yep, closer to the first double below 41451520 than to 41451520 itself. So it correctly gets rounded down to 41451519.999999992549....
When you convert 41451519.999999992549... to an integer it floors the number, resulting in one less than what you expect.
Floating point math is full of surprises. You should read What Every Computer Scientist Should Know About Floating-Point Arithmetic, though it may still be too advanced for now. But it's important to be aware that yes, floating point is full of surprises, but no, it isn't magic; you can learn the pitfalls.


Sum of array of floats returns different results [duplicate]

This question already has answers here:
How best to sum up lots of floating point numbers?
Is floating point math broken?
Here I have a function sum() of type float that takes a pointer t of type float and an integer size, and returns the sum of all the elements in the array. I then fill an array two ways: once with the BIG value at the first index and once with it at the last index. When I sum each version of the array I get different results. This is my code:
#include <stdlib.h>
#include <stdio.h>
#define N 1024
#define SMALL 1.0
#define BIG 100000000.0
float sum(float* t, int size) { // here I define the function sum()
    float s = 0.0;
    for (int i = 0; i < size; i++) {
        s += t[i];
    }
    return s;
}

int main() {
    float tab[N];
    for (int i = 0; i < N; i++) {
        tab[i] = SMALL;
    }
    tab[0] = BIG;
    float sum1 = sum(tab, N); // sum with the big value at index 0
    printf("sum1 = %f\n", sum1);
    tab[0] = SMALL;
    tab[N-1] = BIG;
    float sum2 = sum(tab, N); // sum with the big value at the last index
    printf("sum2 = %f\n", sum2);
    return 0;
}
After compiling the code and running it I get the following output:
sum1 = 100000000.000000
sum2 = 100001024.000000
Why do I get different results even though the arrays have the same elements (just at different indexes)?
What you're experiencing is floating point imprecision. Here's a simple demonstration.
#include <stdio.h>

int main() {
    float big = 100000000.0;
    float small = 1.0;
    printf("%f\n", big + small);
    printf("%f\n", big + (19 * small));
    return 0;
}
You'd expect 100000001.0 and 100000019.0.
$ ./test
100000000.000000
100000016.000000
Why'd that happen? Because computers don't store numbers the way we do, and floating point numbers doubly so. A float is just 32 bits, yet it can store numbers up to about 3.4×10^38 rather than the roughly 2^31 a 32-bit integer can, and it can store decimal places. How? They cheat. What a float really stores is a sign, an exponent, and a mantissa.
sign * 2^exponent * mantissa
The mantissa is what determines accuracy, and there are only 24 bits of it in a float. So large numbers lose precision.
You can read about exactly how and play around with the representation.
To solve this, either use a double, which has greater precision, or an accurate but slower arbitrary-precision library such as GMP.
Why do I get different results even though the arrays have the same elements
In floating-point math, 100000000.0 + 1.0 equals 100000000.0 and not 100000001.0, but 100000000.0 + 1024.0 does equal 100001024.0. Given the value 100000000.0, the value 1.0 is too small to show up in the available bits used to represent 100000000.0.
So when you put 100000000.0 first, all the later + 1.0 operations have no effect.
When you put 100000000.0 last, though, the preceding 1023 + 1.0 operations first add up to 1023.0, and 1023.0 is "big enough" to make a difference given the available precision of floating point math (the final result rounds to 100001024.0).

C - erroneous output after multiplication of large numbers

I'm implementing my own decrease-and-conquer method for computing a^n.
Here's the program:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
double dncpow(int a, int n)
{
    double p = 1.0;
    if (n != 0)
    {
        p = dncpow(a, n / 2);
        p = p * p;
        if (n % 2)
        {
            p = p * (double)a;
        }
    }
    return p;
}

int main()
{
    int a;
    int n;
    int a_upper = 10;
    int n_upper = 50;
    int times = 5;
    time_t t;
    srand(time(&t));
    for (int i = 0; i < times; ++i)
    {
        a = rand() % a_upper;
        n = rand() % n_upper;
        printf("a = %d, n = %d\n", a, n);
        printf("pow = %.0f\ndnc = %.0f\n\n", pow(a, n), dncpow(a, n));
    }
    return 0;
}
My code works for small values of a and n, but a mismatch in the output of pow() and dncpow() is observed for inputs such as:
a = 7, n = 39
pow = 909543680129861204865300750663680
dnc = 909543680129861348980488826519552
I'm pretty sure that the algorithm is correct, but dncpow() is giving me wrong answers.
Can someone please help me rectify this? Thanks in advance!
Simple as that, these numbers are too large for what your computer can represent exactly in a single variable. With a floating point type, there's an exponent stored separately and therefore it's still possible to represent a number near the real number, dropping the lowest bits of the mantissa.
Regarding this comment:
I'm getting similar outputs upon replacing 'double' with 'long long'. The latter is supposed to be stored exactly, isn't it?
If you call a function taking double, it won't magically operate on long long instead. Your value is simply converted to double and you'll just get the same result.
Even with a function handling long long (which has 64 bits on today's typical platforms), you can't deal with such large numbers. 64 bits aren't enough to store them. With an unsigned integer type, they will just "wrap around" to 0 on overflow. With a signed integer type, the behavior on overflow is undefined (but still somewhat likely a wrap-around). So you'll get some number that has absolutely nothing to do with your expected result. That's arguably worse than the result with a floating point type, which is merely imprecise.
For exact calculations on large numbers, the only way is to store them in an array (typically of unsigned integers like uintmax_t) and implement all the arithmetic yourself. That's a nice exercise, and a lot of work, especially when performance is of interest (the "naive" arithmetic algorithms are typically very inefficient).
For a real-life program, you won't reinvent the wheel here, as there are libraries for handling large numbers. Arguably the best known is libgmp. Read the manuals there and use it.

Allocating large memory in C (Project Euler Prob)

I was trying to solve a Project Euler problem in C.
Here is the code. It works fine for 10 values but gives a wrong output for 1000 values. I noticed that it gives the right output up to 32. I think I'm exceeding the memory or something. How do I allocate memory for such a large array?
#include <stdio.h>
int main() {
    int a[10], i, sum = 1, b = 0;
    for (i = 1; i < 10; i++) {
        a[0] = 1;
        a[i] = sum + a[i-1];
        sum = sum + a[i-1];
    }
    for (int j = 1; j > 0; j++) {
        b = b + sum % 10;
        if (sum < 10)
            break;
        sum = sum / 10;
    }
    printf("%d\n", b);
    return 0;
}
You might try computing 2^1000 as an 80-bit long double, then using sprintf to convert it to a string, then summing the digits of that string.
Why this works:
Floating-point types store numbers as a mantissa times a power of two. The mantissa can be exactly 1, and for a long double the exponent can be as large as 16383, so 2^1000 is representable exactly. printf and friends, on modern implementations, will correctly print out all the digits of a floating-point number.
Code:
#include <stdio.h>

int main() {
    char buf[1024];
    int ans = 0;
    sprintf(buf, "%.0f", 0x1.0p1000);  /* 2^1000 as a hex float constant */
    for (int i = 0; buf[i]; i++) ans += buf[i] - '0';
    printf("%i\n", ans);
}
I noticed that it gives a right output till 32
That is because the integer type you're using has 32 bits. It simply can't hold larger numbers. You can't solve this the conventional way.
Here's what I'd suggest: first estimate how many digits that number will have. Each time a number grows by a factor of 10, its decimal representation needs one more digit, so the number of digits of n is ceil(log10(n)). For 2^1000 that is ceil(log10(2^1000)) = ceil(1000*log10(2)) = 302, so you'll need 302 decimal digits to write it down.
This gives the next idea: write down the number 1 in 302 digits, i.e. 301 '0's and one '1', in a string. Then double the string 1000 times, adding it to itself just like in elementary school, carrying the overflowing digits.
EDIT I think I should point out, that the problem encountered is the whole point of this Project Euler problem. Project Euler problems all have in common, that you can not solve them by using naive programming methods. You must get creative to solve them!

Efficient exponentials with small base

I need to perform a softmax operation. That is, given a sequence of n real values ranging from -inf to +inf, I turn them into probabilities by exponentiating each value and dividing by the sum of exponentials:
for (i = 0; i < n; i++)
    p_x[i] = exp(x[i]) / sum_exp(x, n);
(don't take the code literally, I'm not summing up all exp's every iteration!)
I'm having overflow problems when values go above 700 in some extreme cases (using 8-byte doubles). I know I could use another base instead of e; however, I'm afraid calling pow will be much slower than exp (speed is critical for me).
What is the fastest way to solve this?
Use each number as the 52-bit mantissa in a 64-bit floating point number. This is simply a matter of masking, then reinterpreting the bits.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    long long val = 1234567890;
    long long mval = val & ~0xfff0000000000000ULL;  /* clear sign and exponent bits */
    double fval;
    memcpy(&fval, &mval, sizeof fval);  /* reinterpret the bits; avoids aliasing UB */
    printf("%f\n", fval);
}
b^x = e^(x * ln b)
So using a smaller base b is equivalent to multiplying your values by ln b before applying exp, and dividing again at the end.

Why does this division result in zero?

I was writing this code in C when I encountered the following problem.
#include <stdio.h>
int main()
{
    int i = 2;
    int j = 3;
    int k, l;
    float a, b;
    k = i / j * j;
    l = j / i * i;
    a = i / j * j;
    b = j / i * i;
    printf("%d %d %f %f\n", k, l, a, b);
    return 0;
}
Can anyone tell me why the code is returning zero for the first and third variables (k and a)?
Are you asking why k and a show up as zero? This is because in integer division 2/3 = 0 (the fractional part is truncated).
What I think you are experiencing is integer arithmetic. You correctly suppose l and b to be 2, but incorrectly assume that k and a will be 2 as well because it's the same kind of operation. But it's not; it's integer arithmetic (rather than floating-point arithmetic). So when you do i / j (please consider using some whitespace), the division 2 / 3 is performed in integer math: the true quotient 0.3333... is truncated to 0. Then we multiply by 3 again, and 0 * 3 = 0.
If you change i and j to be floats (or pepper your math with (float) casts), this will do what you expect.
You haven't said what you're getting or what you expect, but in this case it's probably easy to guess. When you do a = i/j*j, you're expecting the result to be 2.0 (mathematically (2/3)*3 = 2), but instead you're getting 0.0. This is because i and j are both ints, so the multiplication and (crucially) the division are done in integer math, yielding 0. You assign the result to a float, so that 0 is then converted to 0.0f.
To fix it, convert at least one operand to floating point BEFORE the division: a = (float)i / j * j;
This is due to how C treats int operands in division:
#include <stdio.h>

int main()
{
    int i = 2;
    int j = 3;
    int k, l;
    float a, b;
    k = i / j * j; // k = (2/3)*3 = 0*3 = 0
    l = j / i * i; // l = (3/2)*2 = 1*2 = 2
    a = i / j * j; // same as k
    b = j / i * i; // same as l
    printf("%d %d %f %f\n", k, l, a, b);
    return 0;
}
If you're asking why k and a are 0: i/j*j is the same as (i/j)*j. Since j is larger than i, i/j is 0 (integer division). 0*j is still 0, so the result (k) is 0. The same applies to the value of a.
It doesn't matter whether your variable is a float or not: as long as you compute integer / integer, you get 0. But because you print the result as a float, you see 0.0.
