rounding shortcuts in C

I am working in C to implement pseudo-code that says:
delay = ROUND(64*(floatDelay - intDelay))
where intDelay = (int) floatDelay
The floatDelay will always be positive. Is there an advantage to using the round function from math.h:
#include <math.h>
delay = (int) round(64*(floatDelay - intDelay));
or can I use:
delay = (int)(64*(floatDelay - intDelay) + 0.5);

There aren't any advantages that I know of, except that it may not be immediately obvious to other programmers that the cast to int truncates.
Whereas with the round function, your intentions are clear.

You should always use the appropriate math libs when dealing with floating point numbers. A float may be only a very close approximation of the actual value, and that can cause weirdness.
For instance, 5f might be approximated to 4.9999999... and if you try to cast directly to int it will be truncated to 4.
To see why in depth, you should look up floating point numbers on Wikipedia. In short, instead of storing the number as a straight series of bits like an int, a float is stored in two parts: a "fraction" and an exponent, where the final value of the float is fraction * (base ^ exponent).

Either is fine, provided as you say floatDelay is positive.
It's possible that one is marginally faster than the other, though it would be hard to tell which without benchmarking, given that round() is quite possibly implemented as a compiler intrinsic. It's even more likely that any speed difference is overwhelmingly unimportant, so use whichever you feel is clearer.

For instance, 5f might be approximated to 4.9999999...
and if you try to cast directly to int it will be truncated to 4.
Is this really true?
If you make sure that you add the 0.5 before you truncate to int,
is 4.9999 really a problem?
I mean:
4.9999+0.5=5.4999 -> 5


floating point inaccuracies in c

I know floating point values are limited in the numbers they can express accurately, and I have found many sites that describe why this happens. But I have not found any information on how to deal with this problem efficiently. But I'm sure NASA isn't OK with 0.2/0.1 = 1.999998. Example:
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(void)
{
    float number = 4.20;
    float denominator = 0.25;
    printf("number = %f\n", number);
    printf("denominator = %f\n", denominator);
    printf("quotient as a float = %f should be 16.8\n", number/denominator);
    printf("the remainder of 4.20 / 0.25 = %f\n", number - ((int) number/denominator)*denominator);
    printf("now if i divide 0.20000 by 0.1 i get %f not 2\n", ( number - ((int) number/denominator)*denominator)/0.1);
    return 0;
}
number = 4.200000
denominator = 0.250000
quotient as a float = 16.799999 should be 16.8
the remainder of 4.20 / 0.25 = 0.200000
now if i divide 0.20000 by 0.1 i get 1.999998 not 2
So how do I do arithmetic with floats (or decimals or doubles) and get accurate results? I hope I haven't just missed something super obvious. Any help would be awesome! Thanks.
The solution is to not use floats for applications where you can't accept roundoff errors. Use an extended precision library (a.k.a. arbitrary precision library) like GNU MP Bignum. See this Wikipedia page for a nice list of arbitrary-precision libraries. See also the Wikipedia article on rational data types and this thread for more info.
If you are going to use floating point representations (float, double, etc.) then write code using accepted methods for dealing with roundoff errors (e.g., avoiding ==). There's lots of on-line literature about how to do this and the methods vary widely depending on the application and algorithms involved.
Floating point is pretty fine, most of the time. Here are the key things I try to keep in mind:
There's really a big difference between float and double. double gives you enough precision for most things, most of the time; float surprisingly often gives you not enough. Unless you know what you're doing and have a really good reason, just always use double.
There are some things that floating point is not good for. Although C doesn't support it natively, fixed point is often a good alternative. You're essentially using fixed point if you do your financial calculations in cents rather than dollars -- that is, if you use an int or a long int representing pennies, and remember to put a decimal point two places from the right when it's time to print out as dollars.
The algorithm you use can really matter. Naïve or "obvious" algorithms can easily end up magnifying the effects of roundoff error, while more sophisticated algorithms minimize them. One simple example is that the order you add up floating-point numbers can matter.
Never worry about 16.8 versus 16.799999. That sort of thing always happens, but it's not a problem, unless you make it a problem. If you want one place past the decimal, just print it using %.1f, and printf will round it for you. (Also don't try to compare floating-point numbers for exact equality, but I assume you've heard that by now.)
Related to the above, remember that 0.1 is not representable exactly in binary (just as 1/3 is not representable exactly in decimal). This is just one of many reasons that you'll always get what look like tiny roundoff "errors", even though they're perfectly normal and needn't cause problems.
Occasionally you need a multiple precision (MP or "bignum") library, which can represent numbers to arbitrary precision, but these are (relatively) slow and (relatively) cumbersome to use, and fortunately you usually don't need them. But it's good to know they exist, and if you're a math nerd they can be a lot of fun to use.
Occasionally a library for representing rational numbers is useful. Such a library represents, for example, the number 1/3 as the pair of numbers (1, 3), so it doesn't have the inaccuracies inherent in trying to represent that number as 0.333333333.
Others have recommended the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, which is very good, and the standard reference, although it's long and fairly technical. An easier and shorter read I can recommend is this handout from a class I used to teach: . This is a little dated by now, but it should get you started on the basics.
There isn't a good answer, and it's often a problem.
If data is integral, e.g. amounts of money in cents, then store it as integers, which can mean a double that is constrained to hold an integer number of cents rather than a rational number of dollars. But that only helps in a few circumstances.
As a general rule, you get inaccuracies when trying to divide by numbers that are close to zero. So you just have to write the algorithms to avoid or suppress such operations. There are lots of discussions of "numerically stable" versus "unstable" algorithms and it's too big a subject to do justice to it here. And then, usually, it's best to treat floating point numbers as though they have small random errors. If they ultimately represent measurements of analogue values in the real world, there must be a certain tolerance or inaccuracy in them anyway.
If you are doing maths rather than processing data, simply don't use C or C++. Use a symbolic algebra package such as Maple, which stores values such as sqrt(2) as an expression rather than a floating point number, so sqrt(2) * sqrt(2) will always give exactly 2, rather than a number very close to 2.

Why do float calculation results differ in C and on my calculator?

I am working on a problem, and the results returned by my C program are not as precise as those returned by a simple calculator.
On my calculator, when I divide 2000008 by 3, I get 666669.333333
But in my C program, I am getting 666669.312500
This is what I'm doing-
Why are the results different? What should I do to get the same result as the calculator? I tried double, but then it returns the result in a different format. Do I need to do a conversion? Please help.
See for an in-depth explanation.
In short, floating point numbers are approximations to the real numbers, and they have a limit on the number of digits they can hold. With float this limit is quite small; with double it's larger, but still not perfect.
and you'll see a better, but still not perfect result. What it boils down to is, you should never assume floating point numbers to be precise. They aren't.
Floating point numbers take a fixed amount of memory and therefore have a limited precision. Limited precision means you can't represent all possible real numbers, and that in turn means that some calculations result in rounding errors. Use double instead of float to gain extra precision, but mind you that even a double can't represent everything even if it's enough for most practical purposes.
Gunthram summarizes it very well in his answer:
What it boils down to is, you should never assume floating point numbers to be precise. They aren't.

trouble with double truncation and math in C

I'm making a function that fits balls into boxes. The code that computes the number of balls that can fit on each side of the box is below. Assume that the balls fit together as if they were cubes. I know this is not the optimal way, but just go with it.
The problem for me is that although I get numbers like 4.0000000*4.0000000*2.000000, the product is 31 instead of 32. What's going on?
Two additional things: this error only happens when the optimal side length is reached; for example, the side length is 12.2, the box thickness is .1 and the ball radius is 1.5. This leads to exactly 4 balls fitting on that side. If I DON'T cast to int, it works out, but if I do cast to int, I get the aforementioned error (31 instead of 32). Also, the print line runs once if the side length is optimal but twice if it's not. I don't know what that means.
double ballsFit(double r, double l, double w, double h, double boxthick)
{
    double ballsInL, ballsInW, ballsInH;
    int ballsinbox;
    ballsInL = (int)((l - (2*boxthick)) / (r*2));
    ballsInW = (int)((w - (2*boxthick)) / (r*2));
    ballsInH = (int)((h - (2*boxthick)) / (r*2));
    printf("LENGTH=%f\nWidth=%f\nHight=%f\nBALLS=%d\n", ballsInL, ballsInW, ballsInH, ballsinbox);
    return ballsinbox;
}
The fundamental problem is that floating-point math is inexact.
For example, the number 0.1 -- that you mention as the value of thickness in the problematic example -- cannot be represented exactly as a double. When you assign 0.1 to a variable, what gets stored is an approximation of 0.1.
I recommend that you read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
although I get numbers like 4.0000000*4.0000000*2.000000, the product is 31 instead of 32. What's going on?
It is almost certainly the case that the multiplicands (at least some of them) are not what they look like. If they were exactly 4.0, 4.0 and 2.0, their product would be exactly 32.0. If you printed out all the digits that the doubles are capable of representing, I am pretty sure you'd see lots of 9s, as in 3.99999999999... etc. As a consequence, the product is a tiny bit less than 32. The double-to-int conversion simply chops off the fractional part, so you end up with 31.
Of course, you don't always get numbers that are less than what they would be if the computation were exact; you can also get numbers that are greater than what you might expect.
Fixed precision floating point numbers, such as the IEEE-754 numbers commonly used in modern computers, cannot represent all decimal numbers accurately, much like 1/3 cannot be represented accurately in decimal.
For example 0.1 can be something along the lines of 0.100000000000000004... when converted to binary and back. The difference is small, but significant.
I have occasionally managed to (partially) deal with such issues by using extended or arbitrary precision arithmetic to maintain a degree of precision while computing and then down-converting to double for the final results. There is usually a noticeable drop in performance, but IMHO correctness is infinitely more important.
I recently used algorithms from the high-precision arithmetic libraries listed here with good results on both the precision and performance fronts.

casting doubles to integers in order to gain speed

In Redis there are scores associated with elements, in order to keep these elements sorted. These scores are doubles, even if many users actually sort by integers (for instance Unix times).
When the database is saved we need to write these doubles to disk. This is what is used currently:
Additionally infinity and not-a-number conditions are checked in order to also represent this in the final database file.
Unfortunately, converting a double into its string representation is pretty slow, while we have a function in Redis that converts an integer into a string representation much faster. So my idea was to check whether a double can be cast to an integer without loss of data, and if so, to use that function to turn the integer into a string.
For this to provide a good speedup, the test for integer "equivalence" must of course be fast. So I used a trick that is probably undefined behavior but that worked very well in practice. Something like this:
double x = ... some value ...
if (x == (double)((long long)x))
use_the_fast_integer_function((long long)x);
In my reasoning, the double cast above converts the double into a long long, and then back into a double. If the value fits the range and there is no fractional part, the number will survive the conversion and will be exactly the same as the initial number.
As I wanted to make sure this will not break things in some system, I joined #c on freenode and I got a lot of insults ;) So I'm now trying here.
Is there a standard way to do what I'm trying to do without going outside ANSI C? Otherwise, is the above code supposed to work on all the POSIX systems that Redis currently targets? That is, architectures where Linux / Mac OS X / *BSD / Solaris are running nowadays?
What I can add in order to make the code saner is an explicit check for the range of the double before trying the cast at all.
Thank you for any help.
Perhaps some old-fashioned fixed point math could help you out. If you converted your double to a fixed point value, you would still get decimal precision, and converting to a string is as easy as with ints, with the addition of a single shift function.
Another thought would be to roll your own snprintf() function. Doing the conversion from double to int is natively supported by many FPUs, so that should be lightning fast. Converting that to a string is simple as well.
Just a few random ideas for you.
The problem with doing that is that the comparisons won't work out the way you'd expect. Just because one floating point value is less than another doesn't mean that its representation as an integer will be less than the other's. Also, I see you comparing one of the (former) double values for equality. Due to rounding and representation errors in the low-order bits, you almost never want to do that.
If you are just looking for some kind of key to do something like hashing on, it would probably work out fine. If you actually care about which values really have greater or lesser value, it's a bad idea.
I don't see a problem with the casts, as long as x is within the range of long long. Maybe you should check out the modf() function which separates a double into its integral and fractional part. You can then add checks against (double)LLONG_MIN and (double)LLONG_MAX for the integral part to make sure. Though there may be difficulties with the precision of double.
But before doing anything of this, have you made sure it actually is a bottleneck by measuring its performance? And is the percentage of integer values high enough that it would really make a difference?
Your test is fine as long as x is within the range of long long (assuming you have already separately handled infinities and NaNs by this point), and it's probably one of the very few occasions when you really do want to compare floats for equality. One caveat: if x is outside the range of long long, the C standard leaves the double-to-integer conversion undefined, so the explicit range check you mention should come before the cast.
The only fly in the ointment is that negative zero will end up as positive zero (because negative zero compares equal to positive zero).

Floating point again

Yesterday I asked a floating point question, and I have another one. I am doing some computations where I use the results of the math.h (C language) sine, cosine and tangent functions.
One of the developers muttered that you have to be careful of the return values of these functions and I should not make assumptions on the return values of the gcc math functions. I am not trying to start a discussion but I really want to know what I need to watch out for when doing computations with the standard math functions.
You should not assume that the values returned will be consistent to high degrees of precision between different compiler/stdlib versions.
That's about it.
You should not expect sin(PI/6) to be equal to cos(PI/3), for example. Nor should you expect asin(sin(x)) to be equal to x, even if x is in the domain for sin. They will be close, but might not be equal.
Floating point is straightforward. Just always remember that there is an uncertainty component to all floating point operations and functions. It is usually modelled as being random, even though it usually isn't, but if you treat it as random, you'll succeed in understanding your own code. For instance:
This should be treated as if it was:
If you want an estimate of the size of the errors, you need to dig into each operation/function to find out. Different compilers, parameter choices etc. will yield different values. For instance, 0.09-0.089999 on a system with 5 digits of precision will yield an error somewhere between -0.000001 and 0.000001. This error is comparable in size with the actual result.
If you want to learn how to do floating point as precisely as possible, that's a study of its own.
The problem isn't with the standard math functions, so much as the nature of floating point arithmetic.
Very short version: don't compare two floating point numbers for equality, even with obvious, trivial identities like 10 == 10 / 3.0 * 3.0 or tan(x) == sin(x) / cos(x).
You should take care about precision:
Look at the structure of a floating-point number.
Are you on a 32-bit or a 64-bit platform?
You should read the IEEE Standard for Binary Floating-Point Arithmetic.
There are some interesting libraries, such as GMP or MPFR.
You should learn about comparing floating-point numbers.
etc.
Agreed with all of the responses that say you should not compare for equality. What you can do, however, is check if the numbers are close enough, like so:
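if (fabs(numberA - numberB) < CLOSE_ENOUGH) {
    // Equal for all intents and purposes
}
Where CLOSE_ENOUGH is some suitably small floating-point value. Note that this needs fabs() from math.h; abs() takes an int and would truncate the difference.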
