Why no overflow when using a double variable? - C

I'm trying to understand why, in the following situation, I don't get an overflow:
double x = 1.7976931348623157E+308; //this is the max value of double
x = x + 0.5;
When checking the value of x after adding 0.5, I still get the same result.
Anyone?

Generally, if you want to add a value to a double x, the added value must be large enough relative to x to fall within the type's precision, or it won't change the value at all.
For a double you get a precision of roughly 16 significant decimal digits. So, if the added value is less than about x/1E+16, there will be no change in the result.
With a little trial and error, in your case, adding a value of 1E+292 to the given double gives a result of +INF.
double x = 1.7976931348623157E+308; //this is the max value of double
x = x + 1E+292;
printf ("\nx = %lf",x);
Result
x = 1.#INF00
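To see where that ~1E+292 threshold comes from, you can print the gap between DBL_MAX and its nearest representable neighbor; a minimal sketch using the standard nextafter function:
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double x = DBL_MAX;
    /* gap between DBL_MAX and the next smaller double: 2^971, about 2E+292 */
    double gap = x - nextafter(x, 0.0);
    printf("gap = %g\n", gap);
    printf("x + 0.5 == x -> %d\n", x + 0.5 == x);  /* 1: 0.5 is far below the gap */
    printf("x + 1E+292   -> %g\n", x + 1E+292);    /* inf: rounds up past DBL_MAX */
    return 0;
}
Anything below half that gap rounds back down to DBL_MAX; anything at or above half of it rounds up past the largest finite double and overflows to infinity.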

Consider the analogy with exponent notation.
Suppose you are allowed 4 significant digits, so the number 1234000.0 will be represented by 1.234e6.
Now try adding 0.5 which should be 1234000.5.
Even if the intermediate buffer is big enough to hold that significance, its representation within the prescribed limit is still 1.234e6.
But if the intermediate buffer can hold, say, only 7 digits, the aligned values to add are
1234000
      0
-------
1234000
so the 0.5 loses its significance even before the addition is performed. In the case of double, you can be quite sure the intermediate buffer cannot hold the equivalent of 308 digits.
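The same effect is easy to reproduce in C with a float, whose roughly 7 significant decimal digits play the role of the 4-digit format above; a minimal sketch:
#include <stdio.h>

int main(void)
{
    float f = 123400000.0f;  /* 9 significant digits; nearby floats are 8.0 apart */
    f = f + 0.5f;            /* the exact sum is not representable, rounds back down */
    printf("%.1f\n", f);     /* still 123400000.0 */
    return 0;
}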

Related

Integer division which results in less than 1

How can we use a scale factor of 1000, for example, so that a does not come out as 0 when we work with integers? This is on a 32-bit microcontroller.
Example:
uint32 a;
a = 211/555 * x;
Should we just multiply everything on the right by 1000, and then divide the final result by 1000?
You may apply the scale factor before doing the division.
In your example you are effectively doing (assuming that x=1000)
a = (211/555) * x;
which will turn out to be
a = 0*x;
If you change it around to
a = (x*211)/555;
you can force the multiplication first, creating a numerator larger than 555 which will allow a to be greater than 0.
You cannot then divide this result by 1000, though, because the true value is still less than 1, which cannot be stored in an integer data type.
You need to keep it in this form and always treat that number as having a 1000 multiplier (for example, if the units were originally kilometers, the new number is in meters), or you will have to use a type which can handle numbers less than 1 (like a float or double).
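A minimal sketch of the reordering, assuming the question's x = 1000:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t x = 1000;
    uint32_t wrong = 211 / 555 * x;  /* 211/555 truncates to 0, so wrong == 0 */
    uint32_t right = x * 211 / 555;  /* multiply first: 211000/555 == 380 */
    /* 380 carries the built-in factor of 1000, i.e. it means 0.380 */
    printf("wrong = %u, right = %u\n", wrong, right);
    return 0;
}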

How many times will this string be printed?

I saw this code on an exam; the question is how many times this string will be printed.
I thought at first it would be 10 times, but this is wrong.
Can someone tell me why my answer is wrong?
This is part of some code in the C language.
for (float x = 100000001.0f; x <= 100000010.0f; x += 1.0f) {
printf("lbc");
}
Assuming x is a 32-bit floating point:
Floating-point values have limited resolution. 100000001 is roughly 1*10^8, so the trailing 1 is already lost when the constant is stored. If you then add 1, it gets lost again, because the next representable float value is 1.00000008*10^8. You can add as many 1s as you like; the result will always be the same.
That is the reason why your code is an endless loop.
float x = 100000001.0f;
will initialize x with the nearest representable float, which is 100000000. Adding 1 to this value will lead to the same value.
If you print the value of x in the loop you will see what happens: http://ideone.com/3FJGTz
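A minimal sketch showing why the loop condition never becomes false:
#include <stdio.h>

int main(void)
{
    float x = 100000001.0f;      /* stored as 100000000: floats near 1e8 are 8 apart */
    printf("%.1f\n", x);         /* 100000000.0 */
    printf("%.1f\n", x + 1.0f);  /* still 100000000.0, so x <= 100000010.0f forever */
    return 0;
}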

Accuracy of sqrt of integers

I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n);
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was that the fractional part of a float only has 23 bits and a double 52, so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all its digits. If we let the part of y we can store be x, we can write y = x + dx; then we want to make sure that whatever dx we choose does not move us to the next integer.
sqrt(x+dx) < sqrt(x) + 1  // solve for dx
dx < 2*sqrt(x) + 1
// e.g. for x = 100, dx < 21
// sqrt(100+20) < sqrt(100) + 1
Float can store 23 bits, so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits, as long as the sqrt of what they can store is accurate, the sqrt(fraction) should be sufficient. Now let's look at what happens for integers close to UINT_MAX and their sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^32-2 and double can, they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers, as long as the sqrt of the double always rounds up, it's okay.
First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by out-of-order execution on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < n; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds (j tracks i*i via the identity (i+1)^2 = i^2 + 2i + 1).
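A quick check that the transformed loop runs the same number of iterations as i*i < n (n = 1000 is an arbitrary test value):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t n = 1000;
    uint64_t count = 0;
    /* j tracks i*i via the identity (i+1)^2 = i^2 + 2*i + 1 */
    for (uint64_t i = 0, j = 0; j < n; j += 2*i + 1, i++)
        count++;
    printf("%llu\n", (unsigned long long)count);  /* 32, since 31*31 = 961 < 1000 <= 32*32 */
    return 0;
}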
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
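A sketch checking both fixes against the counter-example above (assuming IEEE-754 double and a correctly rounded sqrt):
#include <stdint.h>
#include <stdio.h>
#include <math.h>

int main(void)
{
    uint64_t n = 0x20000000000000;  /* 2^53, the counter-example */
    uint32_t cut = (uint32_t)sqrt((double)n);
    printf("cut = 0x%x, cut*cut = 0x%llx\n",
           cut, (unsigned long long)cut * cut);  /* 0x5a82799, 0x1ffffff8eff971 */
    cut += (uint64_t)cut * cut < n;        /* the exact adjustment */
    printf("adjusted cut = 0x%x\n", cut);  /* i < cut now matches i*i < n */
    return 0;
}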
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and one can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a Taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.
All of my answer is unnecessary if you have access to IEEE 754 double-precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in the loop, and
a simple way to compute the ceiling sqrt.
Otherwise, if for some reason you have a non-IEEE-754-compliant platform, or only single precision, you can get the integer part of the square root with a simple Newton-Raphson loop. For example, in Squeak Smalltalk we have this method in Integer:
sqrtFloor
    "Return the integer part of the square root of self"
    | guess delta |
    guess := 1 bitShift: (self highBit + 1) // 2.
    [
        delta := (guess squared - self) // (guess + guess).
        delta = 0 ] whileFalse: [
        guess := guess - delta ].
    ^guess - 1
where // is the floored integer division operator.
A final guard guess*guess <= self ifTrue: [^guess]. can be avoided if the initial guess exceeds the exact solution, as is the case here.
In Smalltalk, initializing with an approximate float sqrt was not an option because its integers are arbitrarily large and might overflow the float range.
But here, you could seed the initial guess with the floating-point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
uint32_t sqrtFloor(uint64_t n)
{
    int64_t diff;
    int64_t delta;
    uint64_t guess = sqrt(n);  /* implicit conversions here... */
    /* Newton-Raphson: refine until the correction term vanishes */
    while ((delta = (diff = guess*guess - n) / (guess + guess)) != 0)
        guess -= delta;
    return guess - (diff > 0);  /* step down by one if the last guess overshot */
}
That's a few integer multiplications and divisions, but outside the main loop.
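A quick sanity check of sqrtFloor (assuming the function above is in scope) against the boundary cases from the question:
#include <assert.h>

int main(void)
{
    assert(sqrtFloor(120) == 10);  /* 10*10 = 100 <= 120 < 11*11 */
    assert(sqrtFloor(121) == 11);  /* perfect square from the question */
    assert(sqrtFloor(122) == 11);
    return 0;
}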
What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. A continued fraction is what you need; see Wikipedia.
For x > 1, the identity sqrt(x) = 1 + (x-1)/(1 + sqrt(x)) expands into the continued fraction
sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + (x-1)/(2 + ...)))
Truncating the continued fraction, i.e. cutting off the tail at each successive recursion depth, gives a sequence of approximations of sqrt(x):
s1 = 1 + (x-1)/2
s2 = 1 + (x-1)/(2 + (x-1)/2)
s3 = 1 + (x-1)/(2 + (x-1)/(2 + (x-1)/2))
...
Upper bounds appear at the odd depths, and they get tighter as the depth grows. When the distance between an upper bound and its neighboring lower bound is less than 1, that upper bound is the approximation you need: use it as the value of cut (cut must then hold a fractional value).
For very large numbers, rational arithmetic should be used, so that no precision is lost converting between integers and floating-point numbers.
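A sketch of the first two odd-depth truncations, using double for brevity (the point above is that exact rationals avoid conversion loss for very large n); the names s1 and s3 are just illustrative:
#include <stdio.h>

/* odd-depth truncations of sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + ...)) */
static double s1(double x) { return 1 + (x - 1) / 2; }
static double s3(double x) { return 1 + (x - 1) / (2 + (x - 1) / (2 + (x - 1) / 2)); }

int main(void)
{
    double x = 121.0;            /* sqrt is exactly 11 */
    printf("s1 = %f\n", s1(x));  /* 61.000000: a crude upper bound */
    printf("s3 = %f\n", s3(x));  /* ~31.49: a tighter upper bound */
    return 0;
}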

Fixed-point scaling and accuracy in multiplication

I need to perform a multiplication operation on a fixed-point variable x (unsigned 16-bit integer [U16] type with binary point 6 [BP6]) with a coefficient A, which I know will always be between 0 and 1. Code is being written in C for a 32-bit embedded platform.
I know that if I were to also make this coefficient a U16 BP6, then I would end up with a U32 BP12 from the multiplication. I want to rescale this result back down to U16 BP6, so I just lop off the first 10 bits and the last 6.
However, since the coefficient is limited in precision by the number of fractional bits, and I do not necessarily need the full 10 bits of integer, I was thinking that I could just make the coefficient variable A a U16 BP15 to yield a more precise result.
I have worked out the following example (bear with me):
Let's say that x = 172.0 (decimal) and I want to use a coefficient A = 0.82 (decimal). The ideal decimal result would be 172.0 * 0.82 = 141.04.
In binary, x = 0010101100.000000.
If I am using BP6 for A, the binary representation will be either
A_1 = 0000000000.110100 = 0.8125 or
A_2 = 0000000000.110101 = 0.828125
(depending on whether the value is rounded down or up).
Performing the binary multiplication between x and either value of A yields (leaving out leading zeroes):
A_1 * x = 10001011.110000000000 = 139.75
A_2 * x = 10001110.011100000000 = 142.4375
In both cases, trimming down the last 6 bits would not affect the result.
Now, if I expanded A to have BP15, then
A_3 = 0.110100011110110 = 0.82000732421875
and the resulting multiplication yields
A_3 * x = 10001101.000010101001000000000 = 141.041259765625
When trimming the extra 15 fractional bits, the result is
A_3 * x = 10001101.000010 = 141.03125
So it's pretty clear here that expanding the coefficient to have more fractional bits yields a more precise result (at least in my example). Is this something that will hold true in general? Is it good or bad to use in practice? Am I missing or misunderstanding something?
EDIT: I should have said "accuracy" in place of "precision" here. I am looking for a result which is closer to my expected value rather than a result which contains more fractional bits.
Having done similar code, I'd say that what you are doing will hold true in general, with the following concerns.
It is very easy to get unexpected overflow when shifting your binary point around. Rigorous testing/analysis and/or overflow-detecting code is recommended. Notable failure: Ariane 5.
You want precision, so I disagree with "lop off ... last 6". Instead, I recommend rounding your results as processing time allows: use the most significant bit of the portion being lopped off to adjust the result.
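A sketch of the multiply-and-round path, using the question's example values (x = 172.0 as U16 BP6, A ~ 0.82 as U16 BP15):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t x = 172u << 6;                            /* 172.0 as U16 BP6 */
    uint16_t A = (uint16_t)(0.82 * (1u << 15) + 0.5);  /* 26870, ~0.82 as U16 BP15 */

    uint32_t product = (uint32_t)x * A;                /* U32 BP21 intermediate */

    uint16_t truncated = (uint16_t)(product >> 15);                /* lop off 15 bits */
    uint16_t rounded   = (uint16_t)((product + (1u << 14)) >> 15); /* add half an LSB first */

    printf("truncated = %f\n", truncated / 64.0);  /* 141.031250 */
    printf("rounded   = %f\n", rounded / 64.0);    /* 141.046875, closer to 141.04 */
    return 0;
}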

Is multiplying by a decimal number, where all results are whole integers, considered floating-point math?

Sorry for the wordy title. My code is targeting a microcontroller (msp430) with no floating point unit, but this should apply to any similar MCU.
If I am multiplying a large runtime variable with what would normally be considered a floating point decimal number (1.8), is this still treated like floating point math by the MCU or compiler?
My simplified code is:
int multip = 0xf; // Can be from 0-15, not available at compile time
int holder = multip * 625; // 0 - 9375
holder = holder * 1.8; // 0 - 16875
Since the result will always be a positive whole number, is this still floating-point math as far as the MCU or compiler is concerned, or is it fixed point?
(I realize I could just multiply by 18, but that would require declaring a 32-bit long instead of a 16-bit int, then dividing and downcasting for the array it will be put in; I'm trying to skimp on memory here.)
The result is not an integer; it rounds to an integer.
9375 * 1.8000000000000000444089209850062616169452667236328125
yields
16875.0000000000004163336342344337026588618755340576171875
which rounds (in double precision floating point) to 16875.
If you write a floating-point multiply, I know of no compiler that will determine that there's a way to do that in fixed-point instead. (That does not mean they do not exist, but it ... seems unlikely.)
I assume you simplified away something important, because it seems like you could just do:
result = multip * 1125;
and get the final result directly.
I'd go for chux's formula if there's some reason you can't just multiply by 1125.
You can be confident FP code will be generated for
holder = holder * 1.8
To avoid FP and 32-bit math, given the OP values of
int multip = 0xf; // Max 15
unsigned holder = multip * 625; // Max 9375
// holder = holder * 1.8;
// alpha depends on rounding desired, e.g. 2 for round to nearest.
holder += (holder*4u + alpha)/5;
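A minimal check of that rewrite at the question's maximum value:
#include <stdio.h>

int main(void)
{
    unsigned alpha = 2;                   /* round to nearest, as noted above */
    unsigned holder = 15 * 625;           /* 9375, the maximum from the question */
    holder += (holder * 4u + alpha) / 5;  /* holder * 1.8 in integer arithmetic */
    printf("%u\n", holder);               /* 16875 */
    return 0;
}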
If int x is non-negative, you can compute x *= 1.8 rounded to nearest using only int arithmetic, without overflow unless the final result overflows, with:
x - (x+2)/5 + x
For truncation instead of round-to-nearest, use:
x - (x+4)/5 + x
If x may be negative, some additional work is needed.
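A sketch exercising both forms against the floating-point result:
#include <stdio.h>

int main(void)
{
    for (int x = 0; x <= 9375; x += 625) {
        int nearest  = x - (x + 2) / 5 + x;  /* 1.8*x, rounded to nearest */
        int truncate = x - (x + 4) / 5 + x;  /* 1.8*x, truncated */
        printf("x=%5d nearest=%6d truncate=%6d fp=%.1f\n",
               x, nearest, truncate, 1.8 * x);
    }
    return 0;
}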
