How can I get the most accurate result? - c

Given:
unsigned int a, b, c, d;
I want:
d = a * b / c;
where (a * b) may overflow, and computing (b / c) first may truncate to zero and lose accuracy.
Maybe a cast to 64-bits would get things to work, but I want to know the best way to get the most accurate result in d.
Is there any good solution?

I would either:
Cast to 64 bits, if that will work for your ranges of a, b, and c.
Use an arbitrary-precision library like GMP.
Cast to a float or double and back, if you find those results acceptable.

For best accuracy/precision you'll want to do your multiplies before your divides. As you imply, you'll want to use something with twice as many bits as an int:
uint64_t d = (uint64_t) a * (uint64_t) b; /* unsigned: the product of two 32-bit unsigned values can exceed INT64_MAX */
d /= c;
You don't need both casts, but they arguably make it a bit clearer.
Note that if c is small enough, then d can still be bigger than an unsigned int. That may or may not be an issue for you. If you're sure it isn't, you can cast down to an unsigned int at the end.
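As a minimal sketch of the widen-then-divide idea, the helper below also reports the two failure cases mentioned in this thread (c equal to zero, and a quotient that no longer fits). The name muldiv_u32 and the use of uint32_t are assumptions made for illustration; the question only says unsigned int, which is assumed here to be 32 bits wide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores a * b / c in *out and returns true,
   or returns false if c is zero or the quotient does not fit in 32 bits. */
static bool muldiv_u32(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
{
    if (c == 0)
        return false;                    /* division by zero */

    /* The product of two 32-bit values always fits in 64 bits. */
    uint64_t q = (uint64_t)a * b / c;

    if (q > UINT32_MAX)
        return false;                    /* result too large for the output type */

    *out = (uint32_t)q;
    return true;
}
```

Multiplying before dividing keeps the only truncation in the final division, which is why this form loses less accuracy than a * (b / c).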

For your problem as stated, I'd do d = (unsigned long long)a * b / c;
No sense in going to float when you only need more bits. No need to redeclare or cast everything. Casting a is enough: b and c are then converted to the wider type within the expression.

Use a float or double. In IEEE 754 floating-point arithmetic, division by zero is allowed: the result is positive or negative infinity (or NaN for 0/0).

You can always do an explicit check for overflow on a * b:
unsigned long long e = (unsigned long long) a * b;
if (e <= UINT_MAX) { /* UINT_MAX comes from <limits.h> */
    d = e / c;
} else {
    d = a * (b / c);
}
Since a, b, and c are unsigned here, comparing against UINT_MAX is enough. If they were signed and could be negative, you would have to check against INT_MAX and INT_MIN instead.
[Update] You could also check which of a and b is larger and thus loses less precision when divided by c:
if (a >= b) {
d = a / c * b;
} else {
d = a * (b / c);
}

Why not use a float or double? A float (on intel chips) is a 32-bit floating-point number, so you wouldn't necessarily need 64 bits for the operation?
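One caveat worth weighing against this suggestion: a 32-bit IEEE 754 float has only a 24-bit significand, so not every unsigned int survives the conversion. The sketch below checks that for a single value. The function name survives_float is made up for illustration, and the code assumes unsigned int is at most 32 bits wide so that double can hold it exactly.

```c
#include <stdbool.h>

/* Hypothetical check: true if v is exactly representable as a float.
   volatile forces the value to be stored at float precision. */
static bool survives_float(unsigned int v)
{
    volatile float f = (float)v;
    /* Compare in double, which represents every 32-bit integer exactly. */
    return (double)f == (double)v;
}
```

For instance, 16777216 (2^24) converts exactly, while 16777217 is rounded to a neighbouring value.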

I'd do something along the lines of the following:
if(c){
    d = (unsigned long long)a * b / c; // divide in 64 bits before narrowing back to unsigned int
}
else{
    // some error code because div by 0 is not allowed
}

Related

Integer division without changing data type [duplicate]

Can anyone explain why b gets rounded off here when I divide it by an integer although it's a float?
#include <stdio.h>
int main(void) {
int a;
float b, c, d;
a = 750;
b = a / 350;
c = 750;
d = c / 350;
printf("%.2f %.2f", b, d);
// output: 2.00 2.14
}
http://codepad.org/j1pckw0y
This is because of how the expression is evaluated. The variables b, c, and d are of type float, but in a / 350 both operands are integers, so the / operator performs integer division and yields an int (here 2). That int is then implicitly converted to float (2.0) on assignment. If you want floating-point division, make at least one operand of / a floating-point value, as follows.
#include <stdio.h>
int main() {
int a;
float b, c, d;
a = 750;
b = a / 350.0f;
c = 750;
d = c / 350;
printf("%.2f %.2f", b, d);
// output: 2.14 2.14
return 0;
}
Use casting of types:
#include <stdio.h>
int main(void) {
int a;
float b, c, d;
a = 750;
b = a / (float)350;
c = 750;
d = c / (float)350;
printf("%.2f %.2f", b, d);
// output: 2.14 2.14
}
This is another way to solve that:
#include <stdio.h>
int main(void) {
int a;
float b, c, d;
a = 750;
b = a / 350.0; //if you use 'a / 350' here,
//then it is a division of integers,
//so the result will be an integer
c = 750;
d = c / 350;
printf("%.2f %.2f", b, d);
// output: 2.14 2.14
}
However, in both cases you are telling the compiler that 350 is a floating-point value, and not an integer (350.0 is a double, while (float)350 and 350.0f are floats). Consequently, the division is carried out in floating point and the result is not truncated.
"a" is an integer; when divided by an integer, it gives you an integer. That integer result is only converted to a float afterwards, when it is assigned to "b".
You should do it like this
b = a / 350.0;
Specifically, this is not rounding your result, it's truncating toward zero. So if you divide -3/2, you'll get -1 and not -2. Welcome to integral math! Back before CPUs could do floating-point operations or the advent of math co-processors, we did everything with integral math. Even though there were libraries for floating-point math, they were too expensive (in CPU instructions) for general-purpose use, so we used a 16-bit value for the whole portion of a number and another 16-bit value for the fraction.
EDIT: my answer makes me think of the classic old man saying "when I was your age..."
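The 16-bit whole part plus 16-bit fraction layout described above is usually called 16.16 fixed point. Here is a rough sketch of what that might look like. The type and function names (fix16, fix16_from_int, fix16_mul, fix16_div) are invented for this example, and the 64-bit intermediate is a modern convenience that hardware of that era would not have had.

```c
#include <stdint.h>

typedef int32_t fix16;            /* 16 integer bits, 16 fraction bits */
#define FIX16_ONE 65536           /* the value 1.0 in 16.16 format */

/* Converts a small integer (|n| < 32768) to 16.16. */
static fix16 fix16_from_int(int n)
{
    return (fix16)((int32_t)n * FIX16_ONE);
}

/* Multiplies two 16.16 values; the 64-bit intermediate avoids overflow. */
static fix16 fix16_mul(fix16 x, fix16 y)
{
    return (fix16)(((int64_t)x * y) / FIX16_ONE);
}

/* Divides two 16.16 values; y must not be zero. */
static fix16 fix16_div(fix16 x, fix16 y)
{
    return (fix16)(((int64_t)x * FIX16_ONE) / y);
}
```

With this layout, fix16_div(fix16_from_int(750), fix16_from_int(350)) gives 140434, which is about 2.1428 once divided by 65536, so the fraction is kept even though only integer instructions are used.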
Chapter and verse
6.5.5 Multiplicative operators
...
6 When integers are divided, the result of the / operator is the algebraic quotient with any
fractional part discarded. [105] If the quotient a/b is representable, the expression
(a/b)*b + a%b shall equal a; otherwise, the behavior of both a/b and a%b is
undefined.
[105] This is often called "truncation toward zero".
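The identity in the quoted paragraph can be checked directly. This is only a small sketch, with the helper name div_identity_holds invented for the purpose; it assumes a C99 or later compiler, where truncation toward zero is required.

```c
#include <stdbool.h>

/* True when (a/b)*b + a%b == a, the identity from the quoted paragraph.
   Precondition: b != 0 and a / b is representable (so not INT_MIN / -1). */
static bool div_identity_holds(int a, int b)
{
    return (a / b) * b + a % b == a;
}
```

With truncation toward zero, -3 / 2 is -1 and -3 % 2 is -1, so (-1) * 2 + (-1) gives back -3.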
Dividing an integer by an integer gives an integer result. 1/2 yields 0; assigning this result to a floating-point variable gives 0.0. To get a floating-point result, at least one of the operands must be a floating-point type. b = a / 350.0f; should give you the result you want.
Probably the best reason is because 0xfffffffffffffff/15 would give you a horribly wrong answer...
Dividing two integers will result in an integer (whole number) result.
You need to cast one number as a float, or add a decimal to one of the numbers, like a/350.0.

How to check if a long long can fit into a double variable

I'd like to check if a long long variable can be safely cast into a double. DBL_MAX doesn't help, because there are integers smaller than that which are not representable by double, while some of integers larger than 2^53 can still fit.
Is there a reliable way to do this?
Can a compiler optimise out a statement like the one below?
(long long)((double)a) == a (where a is a long long)
This does not ask for a largest integer that can be represented as double, I ask for a general function that can check if I can exactly convert any long long value to double without errors.
OP's method is a good start.
(long long)((double)a) == a
Yet it has a problem. E.g. with long long a = LLONG_MAX, (double)a results in a rounded value exceeding LLONG_MAX, so converting it back to long long is undefined.
The following will certainly not overflow double.
(Pathological exception: LLONG_MIN exceeds -DBL_MAX).
volatile double b = (double) a;
Converting back to long long and testing against a is sufficient to meet OP's goal. We only need to ensure that b is in long long range. @gnasher729 Let us assume 2's complement and that double uses FLT_RADIX != 10. In that case, the lowest long long is a negated power of 2 and the highest is a power of 2 minus 1, so the conversion to double can be made exact with careful calculation of the long long limits, as follows.
bool check_ll(long long a) {
const double d_longLong_min = LLONG_MIN;
const double d_longLong_max_plus_1 = (LLONG_MAX/2 + 1)*2.0;
volatile double b = (double) a;
if (b < d_longLong_min || b >= d_longLong_max_plus_1) {
return false;
}
return (long long) b == a;
}
[edit simplify - more general]
A test of b near LLONG_MIN is only needed when long long does not use 2's complement
bool check_ll2(long long a) {
volatile double b = (double) a;
const double d_longLong_max_plus_1 = (LLONG_MAX/2 + 1)*2.0;
#if LLONG_MIN == -LLONG_MAX
const double d_longLong_min_minus_1 = (LLONG_MIN/2 - 1)*2.0;
if (b <= d_longLong_min_minus_1 || b >= d_longLong_max_plus_1) {
return false;
}
#else
if (b >= d_longLong_max_plus_1) {
return false;
}
#endif
return (long long) b == a;
}
I would not expect a compiler to be able to optimize out (long long)((double)a) == a. In any case, by using an intermediate volatile double, the code prevents that.
I'm not sure you can check this conversion before you cast, but fenv.h seems like it can help you for after-cast checking. FE_INEXACT can allow you to check if the operation you just performed could not be exactly stored.
http://www.cplusplus.com/reference/cfenv/FE_INEXACT/
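As an alternative to the round-trip and fenv.h approaches above, the check can also be done with integer operations alone, before any conversion happens. This is a sketch under stated assumptions, not something taken from the answers: the name fits_in_double is invented, double is assumed to be a binary format (FLT_RADIX == 2), and long long is assumed to be 64 bits wide.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: true if a converts to double without rounding. */
static bool fits_in_double(long long a)
{
    /* Magnitude as unsigned; 0ULL - x also handles LLONG_MIN without overflow. */
    unsigned long long m = (a < 0) ? 0ULL - (unsigned long long)a
                                   : (unsigned long long)a;

    /* Trailing zero bits are absorbed by the exponent, so drop them. */
    while (m != 0 && (m & 1ULL) == 0)
        m >>= 1;

#if DBL_MANT_DIG >= 64
    return true;                        /* significand at least as wide as long long */
#else
    return m < (1ULL << DBL_MANT_DIG);  /* remaining bits must fit the significand */
#endif
}
```

The idea is that a value is exactly representable when its significant bits, after stripping trailing zeros, fit in the DBL_MANT_DIG bits of the significand. That is why 2^53 + 1 fails while much larger values such as 2^62 pass.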

How will you implement pow(a,b) in C ? condition follows --

without using multiplication or division operators.
You can use only add/subtract operators.
A pointless problem, but solvable with the properties of logarithms:
pow(a,b) = exp( b * log(a) )
         = exp( exp( log(b) + log(log(a)) ) )
Take care to ensure that your exponential and logarithm functions use the same base. Note that the second form only applies when a > 1 and b > 0, so that both logarithms are defined.
Yes, I know how to use a slide rule. Learning that trick will change your perspective on logarithms.
If they are integers, it's simple to turn pow (a, b) into b multiplications of a.
pow(a, b) = a * a * a * a ... ; // do this b times
And simple to turn a * a into additions
a * a = a + a + a + a + ... ; // do this a times
If you combine them, you can make pow.
First, make mult(int a, int b), then use it to make pow.
A recursive solution :
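#include <stdio.h>
/* a1 * b1 by repeated addition; b1 must be non-negative */
int multiplication(int a1, int b1)
{
    if (b1)
        return (a1 + multiplication(a1, b1 - 1));
    else
        return 0;
}
/* a to the power b for non-negative a and b;
   named int_pow so it does not clash with the standard pow() */
int int_pow(int a, int b)
{
    if (b)
        return multiplication(a, int_pow(a, b - 1));
    else
        return 1;
}
int main(void)
{
    printf("\n %d", int_pow(5, 4)); /* prints 625 */
    return 0;
}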
You've already gotten answers purely for FP and purely for integers. Here's one for a FP number raised to an integer power:
double power(double x, int y) {
double z = 1.0;
while (y > 0) {
while (!(y&1)) {
y >>= 1; /* halve the exponent once for each squaring of x */
x *= x;
}
--y;
z = x * z;
}
return z;
}
At the moment this uses multiplication. You can implement multiplication using only bit shifts, a few bit comparisons, and addition. For integers it looks like this:
int mul(int x, int y) {
int result = 0;
while (y) {
if (y&1)
result += x;
x <<= 1;
y >>= 1;
}
return result;
}
Floating point is pretty much the same, except you have to normalize your results -- i.e., in essence, a floating point number is 1) a significand expressed as a (usually fairly large) integer, and 2) a scale factor. If you want to produce normal IEEE floating point numbers a few parts get a bit ugly though -- for example, the scale factor is stored as a "bias" number instead of any of the usual 1's complement, 2's complement, etc., so working with it is clumsy (basically, each operation you subtract off the bias, do the operation, check for overflow, and (assuming it hasn't overflowed) add the bias back on again).
Doing the job without any kind of logical tests sounds (to me) like it probably wasn't really intended. For quite a few computer architecture classes, it's interesting to reduce a problem to primitive operations you can express directly in hardware (e.g., bit shifts, bitwise-AND, -OR and -NOT, etc.) The implementation shown above fits that reasonably well (if you want to get technical, an adder takes a few gates, but it's included as a primitive in things like VHDL and Verilog anyway).
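Putting the two pieces of this answer together for the integer case, a sketch without any * or / operator might look like the following. The names mul_add and pow_add are made up for illustration. The code uses unsigned arithmetic so that overflow wraps instead of being undefined, and, like the answer above, it still relies on shifts and bit tests in addition to addition.

```c
/* Shift-and-add product of two unsigned values (no * or /). */
static unsigned mul_add(unsigned x, unsigned y)
{
    unsigned result = 0;
    while (y) {
        if (y & 1u)
            result += x;   /* add x for each set bit of y */
        x += x;            /* doubling done by addition */
        y >>= 1;
    }
    return result;
}

/* base raised to exp by square-and-multiply, built on mul_add. */
static unsigned pow_add(unsigned base, unsigned exp)
{
    unsigned result = 1;
    while (exp) {
        if (exp & 1u)
            result = mul_add(result, base);
        base = mul_add(base, base);   /* square the base */
        exp >>= 1;
    }
    return result;
}
```

For example, pow_add(5, 4) yields 625 after only a handful of additions, where the purely recursive approach needs a number of additions that grows with the size of the result.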

Floating point linear interpolation

To do a linear interpolation between two variables a and b given a fraction f, I'm currently using this code:
float lerp(float a, float b, float f)
{
return (a * (1.0 - f)) + (b * f);
}
I think there's probably a more efficient way of doing it. I'm using a microcontroller without an FPU, so floating point operations are done in software. They are reasonably fast, but it's still something like 100 cycles to add or multiply.
Any suggestions?
n.b. for the sake of clarity in the equation in the code above, we can omit specifying 1.0 as an explicit floating-point literal.
As Jason C points out in the comments, the version you posted is most likely the best choice, due to its superior precision near the edge cases:
float lerp(float a, float b, float f)
{
return a * (1.0 - f) + (b * f);
}
If we disregard precision for a while, we can simplify the expression as follows:
    a(1 − f) + bf
 = a − af + bf
 = a + f(b − a)
Which means we could write it like this:
float lerp(float a, float b, float f)
{
return a + f * (b - a);
}
In this version we've gotten rid of one multiplication, but lost some precision.
Presuming floating-point math is available, the OP's algorithm is a good one and is generally more accurate than the alternative a + f * (b - a), which loses precision when a and b differ significantly in magnitude.
For example:
// OP's algorithm
float lint1 (float a, float b, float f) {
return (a * (1.0f - f)) + (b * f);
}
// Algebraically simplified algorithm
float lint2 (float a, float b, float f) {
return a + f * (b - a);
}
In that example, presuming 32-bit floats, lint1(1.0e20, 1.0, 1.0) will correctly return 1.0, whereas lint2 will incorrectly return 0.0.
The majority of precision loss is in the addition and subtraction operators when the operands differ significantly in magnitude. In the above case, the culprits are the subtraction in b - a, and the addition in a + f * (b - a). The OP's algorithm does not suffer from this due to the components being completely multiplied before addition.
For the a=1e20, b=1 case, here is an example of differing results. Test program:
#include <stdio.h>
#include <math.h>
float lint1 (float a, float b, float f) {
return (a * (1.0f - f)) + (b * f);
}
float lint2 (float a, float b, float f) {
return a + f * (b - a);
}
int main () {
const float a = 1.0e20;
const float b = 1.0;
int n;
for (n = 0; n <= 1024; ++ n) {
float f = (float)n / 1024.0f;
float p1 = lint1(a, b, f);
float p2 = lint2(a, b, f);
if (p1 != p2) {
printf("%i %.6f %f %f %.6e\n", n, f, p1, p2, p2 - p1);
}
}
return 0;
}
Output, slightly adjusted for formatting:
f                        lint1                 lint2  lint2-lint1
0.828125  17187500894208393216  17187499794696765440  -1.099512e+12
0.890625  10937500768952909824  10937499669441282048  -1.099512e+12
0.914062   8593750447104196608   8593749897348382720  -5.497558e+11
0.945312   5468750384476454912   5468749834720641024  -5.497558e+11
0.957031   4296875223552098304   4296874948674191360  -2.748779e+11
0.972656   2734375192238227456   2734374917360320512  -2.748779e+11
0.978516   2148437611776049152   2148437474337095680  -1.374390e+11
0.986328   1367187596119113728   1367187458680160256  -1.374390e+11
0.989258   1074218805888024576   1074218737168547840  -6.871948e+10
0.993164    683593798059556864    683593729340080128  -6.871948e+10
1.000000                     1                     0  -1.000000e+00
If you are on a micro-controller without an FPU then floating point is going to be very expensive. Could easily be twenty times slower for a floating point operation. The fastest solution is to just do all the math using integers.
The number of places after the fixed binary point (http://blog.credland.net/2013/09/binary-fixed-point-explanation.html?q=fixed+binary+point) is: XY_TABLE_FRAC_BITS.
Here's a function I use:
inline uint16_t unsignedInterpolate(uint16_t a, uint16_t b, uint16_t position) {
uint32_t r1;
uint16_t r2;
/*
* Only one multiply, and one divide/shift right. Shame about having to
* cast to long int and back again.
*/
r1 = (uint32_t) position * (b-a);
r2 = (r1 >> XY_TABLE_FRAC_BITS) + a;
return r2;
}
With the function inlined it should be approx. 10-20 cycles.
If you've got a 32-bit micro-controller you'll be able to use bigger integers and get larger numbers or more accuracy without compromising performance. This function was used on a 16-bit system.
If you're coding for a microcontroller without floating-point operations, then it's better not to use floating-point numbers at all, and to use fixed-point arithmetic instead.
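As a rough illustration of the fixed-point route, here is a signed variant that also works when b is smaller than a. Everything in it is an assumption made for the sketch: the name lerp_fixed, the choice of 8 fraction bits (LERP_FRAC_BITS), and the convention that t runs from 0 to 256, with 256 meaning 1.0.

```c
#include <stdint.h>

#define LERP_FRAC_BITS 8   /* hypothetical: t = 256 represents 1.0 */

/* Interpolates between a and b; t is a fraction in 0..256 (8 fraction bits). */
static int16_t lerp_fixed(int16_t a, int16_t b, uint16_t t)
{
    int32_t diff = (int32_t)b - a;          /* may be negative */
    int32_t scaled = diff * (int32_t)t;     /* |diff| <= 65535, t <= 256: fits in 32 bits */

    /* Division instead of >> because right-shifting a negative value
       is implementation-defined in C. */
    return (int16_t)(a + scaled / (1 << LERP_FRAC_BITS));
}
```

This costs one integer multiply and one division by a power of two, which most compilers reduce to a cheap shift-based sequence.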
Since C++20 you can use std::lerp(), which is likely to be the best possible implementation for your target.
It is worth noting that the standard linear interpolation formulas f1(t)=a+t(b-a), f2(t)=b-(b-a)(1-t), and f3(t)=a(1-t)+bt are not guaranteed to be well-behaved when using floating-point arithmetic.
Namely, if a != b, it is not guaranteed that the f1(1.0) == b or that f2(0.0) == a, while for a == b, f3(t) is not guaranteed to be equal to a, when 0 < t < 1.
This function has worked for me on processors that support IEEE754 floating point when I need the results to behave well and to hit the endpoints exactly (I use it with double precision, but float should work as well):
double lerp(double a, double b, double t)
{
if (t <= 0.5)
return a+(b-a)*t;
else
return b-(b-a)*(1.0-t);
}
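To see the endpoint behaviour concretely, the sketch below places the piecewise function next to the single-expression form so that both can be evaluated at t = 0 and t = 1. The names lerp_piecewise and lerp_simple are chosen for this comparison only, and IEEE 754 doubles are assumed.

```c
/* The piecewise formulation shown above, renamed for this comparison. */
static double lerp_piecewise(double a, double b, double t)
{
    if (t <= 0.5)
        return a + (b - a) * t;
    return b - (b - a) * (1.0 - t);
}

/* The single-expression form a + t * (b - a). */
static double lerp_simple(double a, double b, double t)
{
    return a + t * (b - a);
}
```

With a = 1e20 and b = 1.0, the piecewise version returns exactly a at t = 0 and exactly b at t = 1, whereas lerp_simple returns 0.0 at t = 1, because b - a rounds to -1e20.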
If you want the final result to be an integer, it might be faster to use integers for the input as well.
int lerp_int(int a, int b, float f)
{
//float diff = (float)(b-a);
//float frac = f*diff;
//return a + (int)frac;
return a + (int)(f * (float)(b-a));
}
This does two casts and one float multiply. If a cast is faster than a float add/subtract on your platform, and if an integer answer is useful to you, this might be a reasonable alternative.

Resources