`nextafter` and `nexttoward`: why this particular interface?

What exactly is the reason behind this peculiar interface of the nextafter (and nexttoward) functions? We specify the direction by specifying the value we want to move toward.
At first sight it feels as if something non-obvious is hidden behind this idea. In my (naive) opinion, the first choice for such functions would be something like a pair of single-parameter functions nextafter(a)/nextbefore(a). The next choice would be a two-parameter function nextafter(a, dir), in which the direction dir is specified explicitly (-1 and +1, some standard enum, etc.).
But instead we have to specify a value we want to move toward. Hence a number of questions:
1. (A vague one.) There might be some clever idea or idiomatic pattern that is so valuable that it influenced this choice of interface in these standard functions. Is there?
2. What if we decide to just blindly use -DBL_MAX and +DBL_MAX as the second argument of nextafter to specify the negative and positive direction respectively? Are there any pitfalls in doing so?
3. (A refinement of 2.) If I know for sure that b is [slightly] greater than a, is there any reason to prefer nextafter(a, b) over nextafter(a, DBL_MAX)? E.g. is there a chance of better performance for the nextafter(a, b) version?
4. Is nextafter generally a heavy operation? I know that it is implementation-dependent. But, again, assuming an implementation based on IEEE 754 representations, is it fairly "difficult" to find the adjacent floating-point value?

With IEEE-754 binary floating point representations, if both arguments of nextafter are finite and the two arguments are not equal, then the result can be computed by either adding one to or subtracting one from the representation of the number reinterpreted as an unsigned integer [Note 1]. The (slight) complexity results from correctly dealing with the corner cases which do not meet those preconditions, but in general you'll find that it is extremely fast.
Aside from NaNs, the only thing that matters about the second argument is whether it is greater than, less than, or equal to the first argument.
The interface basically provides additional clarity for the corner-case results, but it is also sometimes useful in itself. In particular, the usage nextafter(x, 0), which steps toward zero regardless of sign, is often convenient. You can also take advantage of the fact that nextafter(x, x) is x to clamp the result at an arbitrary value.
The difference between nextafter and nexttoward is that the latter allows you to use the larger dynamic range of long double; again, that helps with certain corner cases.
Note 1: Strictly speaking, if the first argument is a zero of some sign and the other argument is a valid non-zero number of the opposite sign, then the first argument needs to have its sign bit flipped before the increment. But it seemed too much legalese to add that to the list, and it is still hardly a complicated transform.
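As a rough sketch of that fast path (assuming IEEE-754 binary64, and restricted to a positive, finite argument moving toward +infinity; the function name is mine, and a real implementation must also handle the zero, NaN, infinity, and sign-change corner cases described above):

#include <stdint.h>
#include <string.h>

/* Fast path only: x must be positive and finite. */
static double next_up_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* reinterpret the representation as an unsigned integer */
    bits += 1;                      /* step to the adjacent representation */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

For a negative argument moving toward +infinity, the bit pattern would be decremented instead; that is the sign handling covered by Note 1.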


When is it useful to write "0 - x" rather than "-x"?

I've occasionally noticed some C code insisting on using 0 - x to get the additive complement of x, rather than writing -x. Now, I suppose these are not equivalent for types smaller in size than int (edit: nope, apparently equivalent even then), but otherwise, is there some benefit to the former rather than the latter form?
tl;dr: 0-x is useful for scrubbing the sign of floating-point zero.
(As @Deduplicator points out in a comment:)
Many of us tend to forget that, in floating-point types, we have both a "positive zero" and a "negative zero" value: flipping the sign bit on and off leaves the same mantissa and exponent.
Well, it turns out that the two expressions behave differently on positive-signed zero, and the same on negative-signed zero, as per the following:
value of x    value of 0-x    value of -x
-0.0          0               0
0             0               -0.0
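You can observe the difference with signbit() (a minimal demo, assuming IEEE-754 signed zeros):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double pz = 0.0, nz = -0.0;
    /* signbit() returns nonzero when the sign bit is set */
    printf("%d %d\n", !!signbit(0 - pz), !!signbit(-pz)); /* prints: 0 1 */
    printf("%d %d\n", !!signbit(0 - nz), !!signbit(-nz)); /* prints: 0 0 */
    return 0;
}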
So, when x is of a floating-point type,
If you want to "forget the sign of zero", use 0-x.
If you want to "keep the sign of zero", use -x.
For integer types it shouldn't matter.
On the other hand, as @NateEldredge points out, the expressions should be equivalent even on small integer types, due to integer promotion: -x translates into a promotion of x to int, followed by applying the minus sign.
There is no technical reason to do this today. At least not with integers, and at least not in a way that a sane (according to some arbitrary definition) coder would use. Sure, it could be the case that it causes a conversion; I'm actually not 100% sure, but in that case I would use an explicit cast instead, to clearly communicate the intention.
As M.M pointed out, there were reasons in the K&R days, when =- was equivalent to -=. This had the effect that x=-y was equivalent to x=x-y instead of x=0-y. This was an undesirable effect, so the feature was removed.
Today, the reason would be readability, especially if you're writing a mathematical formula and want to point out that a parameter is zero. One example would be the distance formula: the distance from (x,y) to the origin is sqrt(pow(0-x, 2) + pow(0-y, 2)).

Standard guarantees for using floating point arithmetic to represent integer operations

I am working on some code to be run on a very heterogeneous cluster. The program performs interval arithmetic using 3, 4, or 5 32-bit words (unsigned ints) to represent high-precision boundaries for the intervals. It seems to me that representing some words in floating point may, in some situations, produce a speedup. So, my question has two parts:
1) Are there any guarantees in the C11 standard as to what range of integers will be represented exactly, and what range of input pairs would have their products represented exactly? One multiplication error could entirely change the results.
2) Is this even a reasonable approach? It seems that the separation of floating point and integer processing within the processor would allow data to be running through both pipelines simultaneously, improving throughput. I don't know much about hardware though, so I'm not sure that the pipelines for integers and floating points actually are all that separate, or, if they are, if they can be used simultaneously.
I understand that the effectiveness of this sort of thing is platform dependent, but right now I am concerned about the reliability of the approach. If it is reliable, I can benchmark it and see, but I am having trouble proving reliability. Secondly, perhaps this sort of approach shows little promise, and if so I would like to know so I can focus elsewhere.
Thanks!
I don't know about the Standard, but it seems that you can assume all your processors use the normal IEEE floating-point format. In that case, it's pretty easy to determine whether your calculations are correct. The first integer not representable by the 32-bit float format is 16777217 (2^24 + 1), so if all your intermediate results are less than that (in absolute value), float will be fine.
The reverse is also true: if any intermediate result is greater than 2^24 (in absolute value) and odd, the float representation will alter it, which is unacceptable for you.
If you are worried specifically about multiplications, look at how the multiplicands are limited. If one is limited by 2^11 and the other by 2^13, you will be fine (just barely). If, for example, both are limited by 2^16, there almost certainly is a problem. To prove it, find a test case that causes their product to exceed 2^24 and be odd.
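Such a test case is easy to construct (a minimal sketch; the 4097 * 4097 example is mine, not from the answer above):

#include <stdio.h>

int main(void)
{
    /* Both factors fit in 13 bits, but the product 16785409 is odd and
       exceeds 2^24 = 16777216, so a float cannot hold it exactly. */
    int a = 4097, b = 4097;
    float p = (float)a * (float)b;
    printf("exact: %d, float: %.1f\n", a * b, p);
    /* typically prints: exact: 16785409, float: 16785408.0 */
    return 0;
}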
Everything you need to know about how far you can go while keeping integer precision is available to you through the macros defined in <float.h>. There you have the exact description of the floating-point types: FLT_RADIX for the radix, FLT_MANT_DIG for the number of mantissa digits, etc.
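For instance, the bound discussed above can be computed from those macros rather than hard-coded (a sketch, assuming a binary format):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Every integer up to FLT_RADIX^FLT_MANT_DIG is exactly representable
       in float; for IEEE-754 single precision this is 2^24 = 16777216. */
    double limit = pow(FLT_RADIX, FLT_MANT_DIG);
    printf("exact-integer limit for float: %.0f\n", limit);
    return 0;
}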
As you say, whether or not such an approach is efficient will depend on the platform. You should be aware that this depends heavily on the particular processor variant, not only the processor family: from one Intel or AMD variant to another there can already be noticeable differences. So you'd basically benchmark all the possibilities and have code that decides at program startup which variant to use.

Why does frexp() not yield scientific notation?

Scientific notation is the common way to express a number with an explicit order of magnitude. First a nonzero digit, then a radix point, then a fractional part, and the exponent. In binary, there is only one possible nonzero digit.
Floating-point math involves an implicit first digit equal to one, then the mantissa bits "follow the radix point."
So why does frexp() put the radix point to the left of the implicit bit, and return a number in [0.5, 1) instead of scientific-notation-like [1, 2)? Is there some overflow to beware of?
Effectively it subtracts one more than the bias value specified by IEEE 754 / IEC 60559. In hardware, this potentially trades an addition for an XOR. Alone, that seems like a pretty weak argument, considering that in many cases getting back to normal will require another floating-point operation.
The rationale says:
4.5.4.2 The frexp function
The functions frexp, ldexp, and modf are primitives used by the
remainder of the library. There was some sentiment for dropping them
for the same reasons that ecvt, fcvt, and gcvt were dropped, but their
adherents rescued them for general use. Their use is problematic: on
nonbinary architectures ldexp may lose precision, and frexp may be
inefficient.
One can speculate that the “remainder of the library” was more convenient to write with frexp's convention, or was already traditionally written against this interface although it did not provide any benefit.
I know that this does not fully answer the question, but it did not quite fit inside a comment.
I should also point out that some of the choices made in the design of the C language predate IEEE 754. Perhaps the format returned by frexp made sense with the PDP-11's floating-point format(s), or any other architecture on which a function frexp was first introduced. EDIT: See also page 155 of the manual for one PDP-11 model.
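A small demonstration of the convention (the printed values assume IEEE-754 binary64):

#include <math.h>
#include <stdio.h>

int main(void)
{
    int e;
    double m = frexp(8.0, &e);        /* 8.0 == 0.5 * 2^4, so m is in [0.5, 1) */
    printf("m = %g, e = %d\n", m, e); /* prints: m = 0.5, e = 4 */

    /* ldexp reverses the decomposition: m * 2^e */
    printf("%g\n", ldexp(m, e));      /* prints: 8 */
    return 0;
}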

Which of these C "MAX" macros would be best?

I see two possible implementations for a MAX macro in C. Which of these would be best?
#define MAX(X,Y) ((X) < (Y) ? (Y) : (X))
#define MAX(X,Y) 0.5*(X+Y+ABS(X-Y))
Second one is hard to read, and actually broken. Really bad idea.
Second one always computes in double. You will get rounding errors; I certainly wouldn't expect a MAX to return a result that's equal to neither of its arguments.
When applied to integers it returns a double, quite surprising. Floats get promoted to double as well.
If you "fix" it by replacing *0.5 with /2 it'd "work" on those other types, but you'd get unexpected overflows on large integers.
I'd also recommend functions rather than macros, so that the arguments don't get evaluated twice.
There are occasionally situations where such tricky versions are appropriate. For example calculating the maximum of two integers in constant time. But they're rare, and certainly should not be used as the default implementation for MAX.
The first version is more general, efficient and easier to understand.
The second version uses a floating-point constant, which makes it specific to doubles.
It may return a wrong answer, since floating-point calculations may be rounded (a binary format cannot represent every decimal value exactly; 0.1, for example).
There are also more calculations involved.
The multiplication by 0.5 is not enclosed in parentheses, which may lead to unexpected results when the macro expands inside a larger expression.
There's also the matter of compiler optimizations but I'm not going into that.
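To illustrate the double-evaluation pitfall mentioned above, and the function alternative (a minimal sketch; max_int is an illustrative name):

#include <stdio.h>

#define MAX(X,Y) ((X) < (Y) ? (Y) : (X))

/* a function evaluates each argument exactly once */
static inline int max_int(int x, int y) { return x < y ? y : x; }

int main(void)
{
    int i = 5;
    int m = MAX(i++, 4);              /* i++ appears twice in the expansion */
    printf("m = %d, i = %d\n", m, i); /* prints: m = 6, i = 7 */

    i = 5;
    m = max_int(i++, 4);              /* i++ evaluated once */
    printf("m = %d, i = %d\n", m, i); /* prints: m = 5, i = 6 */
    return 0;
}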

casting doubles to integers in order to gain speed

In Redis (http://code.google.com/p/redis) there are scores associated with elements, used to keep those elements sorted. These scores are doubles, even if many users actually sort by integers (for instance Unix times).
When the database is saved, we need to write these doubles to disk. This is what is used currently:
snprintf((char*)buf+1,sizeof(buf)-1,"%.17g",val);
Additionally, infinity and not-a-number conditions are checked, in order to represent these in the final database file as well.
Unfortunately, converting a double into its string representation is pretty slow. We have a function in Redis that converts an integer into a string representation in a much faster way. So my idea was to check whether a double could be cast to an integer without loss of data, and if so use that faster function to turn the integer into a string.
For this to provide a good speedup, of course, the test for integer "equivalence" must be fast. So I used a trick that is probably undefined behavior but that worked very well in practice. Something like this:
double x = ...; /* some value */
if (x == (double)((long long)x))
    use_the_fast_integer_function((long long)x);
else
    use_the_slow_snprintf(x);
In my reasoning, the double cast above converts the double into a long long, and then back into a double. If the value fits the range and there is no decimal part, the number will survive the round trip and will be exactly the same as the initial number.
As I wanted to make sure this will not break things in some system, I joined #c on freenode and I got a lot of insults ;) So I'm now trying here.
Is there a standard way to do what I'm trying to do without going outside ANSI C? Otherwise, is the above code supposed to work on all the POSIX systems that Redis currently targets? That is, archs where Linux / Mac OS X / *BSD / Solaris run nowadays?
What I can add in order to make the code saner is an explicit check for the range of the double before trying the cast at all.
Thank you for any help.
Perhaps some old-fashioned fixed-point math could help you out. If you converted your double to a fixed-point value, you'd still get decimal precision, and converting to a string would be as easy as with ints, with the addition of a single shift.
Another thought would be to roll your own snprintf() function. The conversion from double to int is natively supported by many FPUs, so that should be lightning fast; converting that to a string is simple as well.
Just a few random ideas for you.
The problem with doing that is that the comparisons won't work out the way you'd expect: just because one floating-point value is less than another doesn't mean that its representation as an integer will be less than the other's. Also, I see you comparing one of the (former) double values for equality; due to rounding and representation errors in the low-order bits, you almost never want to do that.
If you are just looking for some kind of key to do something like hashing on, it would probably work out fine. If you actually care about which values really have greater or lesser value, it's a bad idea.
I don't see a problem with the casts, as long as x is within the range of long long. Maybe you should check out the modf() function which separates a double into its integral and fractional part. You can then add checks against (double)LLONG_MIN and (double)LLONG_MAX for the integral part to make sure. Though there may be difficulties with the precision of double.
But before doing anything of this, have you made sure it actually is a bottleneck by measuring its performance? And is the percentage of integer values high enough that it would really make a difference?
Your test is perfectly fine (assuming you have already separately handled infinities and NaNs by this point), and it's probably one of the very few occasions when you really do want to compare floats for equality. It doesn't invoke undefined behaviour: even if x is outside the range of long long, you'll just get an "implementation-defined result", which is OK here.
The only fly in the ointment is that negative zero will end up as positive zero (because negative zero compares equal to positive zero).
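The explicit range check mentioned earlier can be written with modf() and bounds that are exactly representable as doubles (a sketch under those assumptions; the function name is illustrative):

#include <math.h>
#include <stdbool.h>

/* Returns true and stores the value in *out when d is a whole number that
   fits in long long. Assumes infinities and NaNs were filtered out earlier. */
static bool double_fits_llong(double d, long long *out)
{
    double intpart;
    if (modf(d, &intpart) != 0.0) /* nonzero fractional part */
        return false;
    /* -2^63 is exactly representable as a double; 2^63 is used as a strict
       upper bound because (double)LLONG_MAX rounds up to 2^63. */
    if (intpart < -9223372036854775808.0 || intpart >= 9223372036854775808.0)
        return false;
    *out = (long long)intpart;
    return true;
}

Note that this, too, maps negative zero to plain zero, as described above.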
