Let's say we have these inequalities:
if (a*a+b*b>0) {
...
}
if (a*b+c*d>0) {
...
}
Obviously, both of them require 2 multiplications to evaluate.
The thing is, do we really need to calculate 2 full-precision products just to check whether these expressions are positive or not?
Is there any mathematical trickery that allows me to write those if statements without evaluating 2 products?
Will it be faster?
Or perhaps the compiler takes care of making it as fast as possible?
Am I overthinking?
EDIT:
Well, that escalated quickly.
I just want to point out that I am speaking in general terms. I don't need such a micro-optimization in any project of mine anyway.
Also, yes, I could have omitted the first one for being too trivial. Possibly the second one is more interesting.
Your "am I overthinking" question suggests to me that you haven't found this to be an actual bottleneck by really profiling your code. So I'd say yes, you're just trying to do premature optimization.
However, if this really is a major performance-critical part of your application, then the only improvement I can think of right now is the following. Since squares of real numbers can never be negative, "a squared is greater than zero" is equivalent to "a is not zero". So if comparisons are fast (well, that's relative -- faster than multiplication) on your architecture, then
if (a*a+b*b>0) {
...
}
can be written as
if (a || b) {
...
}
(provided that no corner cases arise. If the variables are integers whose squares cannot overflow, or floating-point numbers whose squares neither underflow to zero nor involve NaN, then this should be fine. If, however, there is integer overflow (wrap-around for unsigned types, undefined behaviour for signed ones) or complex numbers are involved, then you will have to perform additional checks, and at that point, it's hard to reason about the relative performance without true profiling.)
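To make those corner cases concrete, here is a minimal sketch. The function names are made up for illustration and are not from the question; it assumes IEEE 754 single-precision floats. Each pair evaluates the original test and the shortcut so the two can be compared.

```c
#include <stdbool.h>

/* Hypothetical helpers, not from the question. */

/* float: a*a can underflow to 0 for a tiny but nonzero a */
bool sum_sq_positive_f(float a, float b)
{
    /* volatile forces each product to be rounded to float, even on
       targets that evaluate expressions in higher precision */
    volatile float aa = a * a;
    volatile float bb = b * b;
    return aa + bb > 0.0f;
}

bool any_nonzero_f(float a, float b)
{
    return a != 0.0f || b != 0.0f;
}

/* unsigned: a*a wraps around modulo 2^N */
bool sum_sq_positive_u(unsigned a, unsigned b)
{
    return a * a + b * b > 0u;
}

bool any_nonzero_u(unsigned a, unsigned b)
{
    return a || b;
}
```

For a = 1e-30f and b = 0 the square underflows to zero, so the two float tests disagree; for an unsigned a equal to 2 to the power of half the bit width, a*a wraps to 0 and the unsigned tests disagree too.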
I don't have such a "clever" "optimization" for the second case in my mind, but perhaps someone else can come up with something similar -- if and only if it is absolutely necessary. Not otherwise -- code readability is preferred over performance when performance is not critical.
I'm assuming none of these expressions will overflow either because the types don't have a concept of overflow or because the values are in range. When overflow and potential wrap-around enters the picture (e.g. if a and b are unsigned int) different rules apply.
The first statement is obviously equivalent to
if (a != 0 || b != 0)
or
if (a || b)
which trades an extra branch for two multiplications and an addition.
The second statement is a bit more interesting: I'd think it could be reasonable to determine the signs of the operands and only do the actual math when a*b and c*d have opposite signs. In all other cases the condition can be determined without knowing the actual values. Whether the resulting logic is faster than the computations will depend on the types, I'd guess.
The first one will always be >= 0. It will be 0 if and only if a and b are 0, so it's equivalent to:
if (a || b) {
...
}
About the second one: if sign of a is equal to sign of b, and sign of c is equal to sign of d, then it's the same situation as above:
if (sign(a)==sign(b) && sign(c)==sign(d))
{
if ((a && b) || (c && d))
{
... > 0
}
else
{
... = 0
}
}
else
{
if (sign(a)*sign(b)==sign(c)*sign(d))
{
... <= 0
}
else
{
/* must do the actual product to find out */
}
}
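The case analysis above could be written out roughly as follows. This is only a sketch to show the logic: sign_of and sum_of_products_positive are hypothetical names, the operands are assumed to be long with products that do not overflow, and the small sign multiplications could be replaced by comparisons. Whether it beats two plain multiplications would have to be measured.

```c
#include <stdbool.h>

/* -1, 0 or +1 depending on the sign of x */
int sign_of(long x)
{
    return (x > 0) - (x < 0);
}

/* Hypothetical helper: true when a*b + c*d > 0. The full products are
   only computed when they have opposite signs. */
bool sum_of_products_positive(long a, long b, long c, long d)
{
    int s1 = sign_of(a) * sign_of(b);   /* sign of a*b */
    int s2 = sign_of(c) * sign_of(d);   /* sign of c*d */

    if (s1 >= 0 && s2 >= 0)
        return s1 > 0 || s2 > 0;        /* both non-negative: positive unless both are 0 */
    if (s1 <= 0 && s2 <= 0)
        return false;                   /* both non-positive: the sum cannot be positive */
    return a * b + c * d > 0;           /* opposite signs: the magnitudes decide */
}
```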
For an IEEE-754 compliant floating point number, the sign is the most significant bit (MSb) of each number.
For environments in which FP is emulated, there's one thing you can do to optimize the comparison a bit: you can avoid the addition if you compare one product against the negation of the other, like this:
if (a*b > -(c*d)) {
...
}
This is a bit faster because negating a floating point number only flips its sign bit, and two IEEE-754 numbers can be compared largely with integer operations on their bit patterns (the format is sign-magnitude, so negative values, the two zeros and NaNs need a little extra care). An FP-less CPU will surely do that faster than a software FP addition.
Another rewrite (assuming you are using floats, they are 32 bits wide and IEEE 754 compliant, and the same size as int; yes, this is hacky and platform-dependent).
For the first case, you can use a single bitwise 'or' and 'and' (the 'and' is used to ignore the sign bit, so that -0 still counts as zero; you can remove it if there cannot be any -0s):
if ((*(int *)&a | *(int *)&b) & 0x7FFFFFFF) { // a*a + b*b>0
...
}
I really doubt that there is any similarly branch-less magic for the second case.
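For what it's worth, the same bit test can be written without the pointer cast, which formally breaks C's aliasing rules. A minimal sketch, still assuming a 32-bit IEEE 754 float; float_bits and any_nonzero_bits are made-up names. Note that it treats NaN and very small values as nonzero, whereas a*a + b*b > 0 would be false for NaN and for squares that underflow.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bits of a float, assuming a 32-bit IEEE 754 float */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined, unlike *(int *)&f */
    return u;
}

/* True when at least one of a, b is neither +0.0 nor -0.0 */
bool any_nonzero_bits(float a, float b)
{
    /* OR the two bit patterns, then drop the sign bit so -0.0 counts as zero */
    return ((float_bits(a) | float_bits(b)) & UINT32_C(0x7FFFFFFF)) != 0;
}
```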
Related
I came across the following snippet, which I believe converts an integer into its binary representation.
Can anyone tell me why &1 is used instead of %2? Many thanks.
for (i = 0; i <= nBits; ++i) {
bits[i] = ((unsigned long) x & 1) ? 1 : 0;
x = x / 2;
}
The representation of unsigned integers is specified by the Standard: An unsigned integer with n value bits represents numbers in the range [0, 2^n), with the usual binary semantics. Therefore, the least significant bit is the remainder of the value of the integer after division by 2.
It is debatable whether it's useful to replace readable mathematics with low-level bit operations; this kind of style was popular in the 70s when compilers weren't very smart. Nowadays I think you can assume that a compiler will know that dividing by two can be realized as bit shift etc., so you can just Write What You Mean.
What the code snippet does is not convert an unsigned int into a binary number (its internal representation is already binary). It creates a bit array holding the values of the unsigned int's bits; it spreads them out over an array, if you will.
e.g. x=3 => bits[2]=0 bits[1]=1 bits[0]=1
To do this, it:
1) selects the last bit of the number and places it in the bits array (the &1 operation);
2) shifts the number to the right by one position (/2 is equivalent to >>1);
3) repeats the above operations for all the bits.
You could have used %2 instead of &1, the generated code should be the same. But I guess it's just a matter of programming style and preference. For most programmers, the &1 is a lot clearer than %2.
In your example, %2 and &1 are the same. Which one to take is probably simply a matter of taste. While %2 is probably easier to read for people with a strong mathematics background, &1 is easier to understand for people with a strong technical background.
They are equivalent in this very special case. It's an old Fortran-influenced style.
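To see the two spellings side by side, here is a small sketch. The function names and the explicit length parameter n are my own and not part of the original snippet.

```c
#include <stddef.h>

/* Hypothetical helper: store the n lowest bits of x in bits[0..n-1],
   least significant bit first, using the bitwise form. */
void spread_bits_and(unsigned long x, int *bits, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        bits[i] = (x & 1) ? 1 : 0;   /* lowest bit */
        x >>= 1;                     /* same as x / 2 for unsigned values */
    }
}

/* Same result using the arithmetic form. */
void spread_bits_mod(unsigned long x, int *bits, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        bits[i] = (int)(x % 2);      /* remainder after division by 2 */
        x /= 2;
    }
}
```

Both fill the array identically for unsigned input. For a negative signed value the two are no longer interchangeable: in C99 and later, -1 % 2 is -1, while -1 & 1 is 1 on a two's complement machine.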
I know UIKit uses CGFloat because of the resolution independent coordinate system.
But every time I want to check if for example frame.origin.x is 0 it makes me feel sick:
if (theView.frame.origin.x == 0) {
// do important operation
}
Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >?
It is a floating point type, and those have imprecision problems: 0.0000000000041 for example.
Is Objective-C handling this internally when comparing, or can it happen that an origin.x which reads as zero does not compare equal to 0?
First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.
Next, if you want to program with floating point, you should read this:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)
Now, with that said, the biggest issues with exact floating point comparisons come down to:
The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.
The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).
The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 10 has the prime factor 5 and 5 does not divide any power of 2), as well as irrational results like the square root of anything that's not a perfect square.
Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.
Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
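The 0x1fffffe + 1 example from the list above can be checked with a few lines of C. This is a sketch under the assumptions that float is IEEE 754 single precision and the default round-to-nearest mode is active; add_as_float and sum_was_rounded are names I made up.

```c
#include <stdbool.h>

/* Hypothetical helper: adds two floats and forces the sum to be rounded
   to float precision (volatile defeats excess-precision evaluation). */
float add_as_float(float x, float y)
{
    volatile float sum = x + y;
    return sum;
}

/* 0x1fffffe + 1 would need 25 significant bits, so the float result is rounded */
bool sum_was_rounded(void)
{
    float x = (float)0x1fffffe;   /* exactly representable: 24 significant bits */
    return add_as_float(x, 1.0f) == (float)0x2000000;
}
```

Sums whose bits all fit, such as 0.5f + 0.25f, come out exact.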
When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often times it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)
Edit: In particular, a magnitude-relative epsilon check should look something like:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))
Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON for doubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).
Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)
and likewise substitute DBL_MIN if using doubles.
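Wrapped up as a function for doubles, the check might look like the sketch below. The name nearly_equal and the parameter k are mine; k is the error bound in units in the last place that the caller has to choose.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper implementing the check above for doubles. */
bool nearly_equal(double x, double y, double k)
{
    double diff = fabs(x - y);
    return diff < k * DBL_EPSILON * fabs(x + y)   /* magnitude-relative test */
        || diff < DBL_MIN;                        /* fallback near zero */
}
```

With k = 4, for instance, nearly_equal(0.1 + 0.2, 0.3, 4.0) holds even though 0.1 + 0.2 == 0.3 does not, while 1.0 and 1.1 are still reported as different.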
Since 0 is exactly representable as an IEEE754 floating-point number (or using any other implementation of f-p numbers I've ever worked with) comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.
To clarify a little, a computation such as :
areal = 0.0
will (unless your language or system is broken) create a value such that (areal==0.0) returns true but another computation such as
areal = 1.386 - 2.1*(0.66)
may not.
If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.
In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.
For Angry Birds, not so dangerous.
I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.
Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And comparing if a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned that's all graphics needs.
So if you want to test if your point is outside a [0..width[ range this is just fine. Just make sure you define inclusion consistently. For example always define inside is (x>=0 && x < width). The same goes for intersection or hit tests.
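As a sketch of what "define inclusion consistently" can mean in code (in_range is a hypothetical name): with a half-open test, neighbouring ranges such as [0, w) and [w, 2w) never both claim the same coordinate, and no coordinate falls between them.

```c
#include <stdbool.h>

/* Hypothetical helper: half-open test, lo <= x < hi */
bool in_range(double x, double lo, double hi)
{
    return x >= lo && x < hi;
}
```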
However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.
Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.
Talking of perfectly representable values, you get 24 bits of significand in single precision, scaled by a power of two. So 1, 2, 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits fit in 24 bits, you are golden. So 10.625 can be represented precisely.
This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:
1) When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.
2) When any value that is not a finite sum of powers of 2 is involved. It may sound odd, but 0.1 is an infinite series in binary, and therefore any calculation involving a number like this will be imprecise from the start.
(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
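A small sketch of the -0.0 point, using signbit() from math.h (C99); the two function names are made up for illustration.

```c
#include <math.h>
#include <stdbool.h>

/* -0.0 compares equal to 0.0, so == 0 accepts both zeros */
bool is_zero(double x)
{
    return x == 0.0;
}

/* signbit() is one way to tell the two zeros apart */
bool is_negative_zero(double x)
{
    return x == 0.0 && signbit(x) != 0;
}
```

So a plain comparison with zero is not fooled by -0.0; the sign only matters if something downstream (a division, a formatted printout) looks at it.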
[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]
Precisely because of rounding errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:
#define VISIBLE_SHIFT 0.0001 // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }
However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:
K is a constant you choose such that the accumulated error of your
computations is definitely bounded by K units in the last place (and
if you're not sure you got the error bound calculation right, make K a
few times bigger than what your calculations say it should be)
So we need K. If getting K is more difficult, less intuitive than selecting my VISIBLE_SHIFT then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. Ought to be obvious how to choose K, if the 'right answer' is usable. No?
We are going to use, as the 'right answer' details:
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
Let's just try all values of K:
#include <math.h>
#include <float.h>
#include <stdio.h>
int main (void)
{
double x = 1e-13;
double y = 0.0;
double K = 1e22;
int i = 0;
for (; i < 32; i++, K = K/10.0)
{
printf ("K:%40.16lf -> ", K);
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
printf ("YES\n");
else
printf ("NO\n");
}
}
ebg#ebg$ gcc -o test test.c
ebg#ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K: 100000000000000000000.0000000000000000 -> YES
K: 10000000000000000000.0000000000000000 -> YES
K: 1000000000000000000.0000000000000000 -> YES
K: 100000000000000000.0000000000000000 -> YES
K: 10000000000000000.0000000000000000 -> YES
K: 1000000000000000.0000000000000000 -> NO
K: 100000000000000.0000000000000000 -> NO
K: 10000000000000.0000000000000000 -> NO
K: 1000000000000.0000000000000000 -> NO
K: 100000000000.0000000000000000 -> NO
K: 10000000000.0000000000000000 -> NO
K: 1000000000.0000000000000000 -> NO
K: 100000000.0000000000000000 -> NO
K: 10000000.0000000000000000 -> NO
K: 1000000.0000000000000000 -> NO
K: 100000.0000000000000000 -> NO
K: 10000.0000000000000000 -> NO
K: 1000.0000000000000000 -> NO
K: 100.0000000000000000 -> NO
K: 10.0000000000000000 -> NO
K: 1.0000000000000000 -> NO
K: 0.1000000000000000 -> NO
K: 0.0100000000000000 -> NO
K: 0.0010000000000000 -> NO
K: 0.0001000000000000 -> NO
K: 0.0000100000000000 -> NO
K: 0.0000010000000000 -> NO
K: 0.0000001000000000 -> NO
K: 0.0000000100000000 -> NO
K: 0.0000000010000000 -> NO
Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'.
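The place where the output flips is not specific to 1e-13. With y = 0, fabs(x-y) and fabs(x+y) are the same number, so the relative part of the test reduces to 1 < K * DBL_EPSILON, i.e. K has to exceed 1/DBL_EPSILON (about 4.5e15) for any nonzero x. A minimal sketch, with relative_close as a made-up name for the relative part of the check:

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* The relative part of the check used in the test program above */
bool relative_close(double x, double y, double k)
{
    return fabs(x - y) < k * DBL_EPSILON * fabs(x + y);
}
```

Calling it with x = 1e-300 instead of 1e-13 gives the same answers for K = 1e15 and K = 1e16.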
So, I'd say you have two options:
Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change', then examine your visual assets (images, etc) and judge what epsilon can be.
Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D in the process) and then use your non-intuitive judgement to select K.
The correct question: how does one compare points in Cocoa Touch?
The correct answer: CGPointEqualToPoint().
A different question: Are two calculated values the same?
The answer posted here: They are not.
How to check if they are close? If you want to check if they are close, then don't use CGPointEqualToPoint(). But, don't check to see if they are close. Do something that makes sense in the real world, like checking to see if a point is beyond a line or if a point is inside a sphere.
The last time I checked the C standard, there was no requirement for floating point operations on doubles (64 bits total, 53 bit mantissa) to be accurate to more than that precision. However, some hardware might do the operations in registers of greater precision, and the requirement was interpreted to mean no requirement to clear lower order bits (beyond the precision of the numbers being loaded into the registers). So you could get unexpected results of comparisons like this depending on what was left over in the registers from whoever slept there last.
That said, and despite my efforts to expunge it whenever I see it, the outfit where I work has lots of C code that is compiled using gcc and run on linux, and we have not noticed any of these unexpected results in a very long time. I have no idea whether this is because gcc is clearing the low-order bits for us, the 80-bit registers are not used for these operations on modern computers, the standard has been changed, or what. I'd like to know if anyone can quote chapter and verse.
You can use code like this to compare a float with zero:
if ((int)(theView.frame.origin.x * 100) == 0) {
// do important operation
}
This compares with 0.01 accuracy (the test is true whenever the absolute value is below 0.01), which is enough for a CGFloat in this case.
Another issue that may need to be kept in mind is that different implementations do things differently. One example of this that I am very familiar with is the FP units on the Sony Playstation 2. They have significant discrepancies when compared to the IEEE FP hardware in any X86 device. The cited article mentions the complete lack of support for inf and NaN, and it gets worse.
Less well known is what I came to know as the "one bit multiply" error. For certain values of float x:
y = x * 1.0;
assert(y == x);
would fail the assert. In the general case, sometimes, but not always, the result of a FP multiply on the Playstation 2 had a mantissa that was a single bit less than the equivalent IEEE mantissa.
My point being that you should not assume that porting FP code from one platform to another will produce the same results. Any given platform is internally consistent, in that results don't change on that platform; it's just that they may not agree with a different platform. E.g. CPython on x86 uses 64-bit doubles to represent floats, while CircuitPython on a Cortex M0 has to use software FP, and only uses 32-bit floats. Needless to say, that will introduce discrepancies.
A quote I learned over 40 years ago is as true today as the day I learned it. "Doing floating point maths on a computer is like moving a pile of sand. Every time you do anything, you leave a little sand behind and pick up a little dirt."
Playstation is a registered trademark of Sony Corporation.
-(BOOL)isFloatEqual:(CGFloat)firstValue secondValue:(CGFloat)secondValue{
BOOL isEqual = NO;
NSNumber *firstValueNumber = [NSNumber numberWithDouble:firstValue];
NSNumber *secondValueNumber = [NSNumber numberWithDouble:secondValue];
isEqual = [firstValueNumber isEqualToNumber:secondValueNumber];
return isEqual;
}
I am using the following comparison function to compare values to a given number of decimal places:
#include <cmath>
#include <cstdint>

// Scales both values by 10^precision, truncates them to integers and compares.
bool compare(const double value1, const double value2, const int precision)
{
int64_t magnitude = static_cast<int64_t>(std::pow(10, precision));
int64_t intValue1 = static_cast<int64_t>(value1 * magnitude);
int64_t intValue2 = static_cast<int64_t>(value2 * magnitude);
return intValue1 == intValue2;
}
// Compare 9 decimal places:
if (compare(theView.frame.origin.x, 0, 9)) {
// do important operation
}
I'd say the right thing is to declare each number as an object, and then define three things in that object: 1) an equality operator. 2) a setAcceptableDifference method. 3)the value itself. The equality operator returns true if the absolute difference of two values is less than the value set as acceptable.
You can subclass the object to suit the problem. For example, round bars of metal between 1 and 2 inches might be considered of equal diameter if their diameters differed by less than 0.0001 inches. So you'd call setAcceptableDifference with parameter 0.0001, and then use the equality operator with confidence.
I am curious to understand the logic behind the mod operation since I understand that bit-shifting operations can be performed to do different things such as bit shifting to multiply.
One way I can see it being done is by a recursive algorithm that keeps dividing until you cannot divide anymore, but this does not seem efficient.
Any ideas will be helpful. Thanks in advance!
The quick version is: it depends on the hardware, the optimizer, whether it's division by a constant or not (pdf), whether there are exceptions to be checked for (e.g. modulo by 0), if and how negative numbers are handled (this is a scary question for C++), etc...
R gave a nice, concise answer for unsigned integers, but it's difficult to understand unless you're well versed with C.
The crux of the technique illuminated by R is to strip away multiples of q until there's no more multiples of q left. We could naively do this with a simple loop:
while (p >= q) p -= q; // One liner, woohoo!
The code may be short, but for large values of p and small values of q this might take a very long time.
Better than stripping away one q at a time would be to strip away many q's at a time. Note that we actually want to strip away as many q's as possible -- that is, floor(p/q) many q's... And indeed, that's a valid technique. For unsigned integers, one would expect that p % q == p - (p / q) * q. (Note that unsigned integer division rounds down.)
But this almost feels like cheating because division and remainder operations are so intimately related. (In fact, often if hardware natively supports division, it supports a divide-and-compute-remainder operation because they're so strongly related.)
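For reference, the identity p % q == p - (p / q) * q mentioned above is a one-liner in C. A sketch, with mod_by_division as a made-up name and q assumed to be nonzero:

```c
/* Hypothetical helper: remainder via division, assuming q != 0 */
unsigned mod_by_division(unsigned p, unsigned q)
{
    return p - (p / q) * q;   /* strip floor(p/q) copies of q at once */
}
```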
Assuming we've no access to division, how shall we find a multiple of q greater than 1 to strip away? In hardware, fixed shift operations are cheap (if not practically free) and conceptually represent multiplication by a non-negative power of two. For example, shifting a bit string left by 3 is equivalent to multiplying by 8 (that is, 2^3), e.g. 5 decimal is equivalent to '101' binary. Shift '101' in binary by adding three zeroes on the right (giving '101000') and the result is 40 in decimal -- five times eight.
Likewise, shift operations are very cheap as software operations, and you'll struggle to find a controller that doesn't support them efficiently. (Some architectures such as ARM can even combine shifts with other instructions to make them 'free' a good deal of the time.)
ARMed (couldn't resist) with these shift operations, we can proceed as follows:
1) Find the largest power of two we can multiply q by without exceeding p.
2) Working from the largest power of two to the smallest, multiply q by each power of two and, if the product is less than or equal to what's left of p, subtract it from what's left of p.
3) Whatever you've got left is the remainder.
Why does this work? Because in the end you'll find that all the subtracted powers of two actually sum to floor(p / q)! Don't take my word for it, similar knowledge has been known for a very long time.
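The three steps above can be sketched directly in C. This is my own rendering rather than R's code (which is broken apart next): unlike R's version it stops shifting at the largest multiple that still fits in p. mod_by_shifts is a made-up name and q is assumed to be nonzero.

```c
/* Hypothetical helper: p % q by shift-and-subtract, assuming q != 0 */
unsigned mod_by_shifts(unsigned p, unsigned q)
{
    unsigned i = 0;

    /* Step 1: find the largest i with (q << i) <= p.
       ((p >> i) >> 1) >= q is the overflow-safe way of asking
       whether (q << (i + 1)) <= p. */
    while ((p >> i) >> 1 >= q)
        ++i;

    /* Step 2: from the largest multiple down to q itself,
       subtract each multiple that still fits. */
    for (;;) {
        if (p >= (q << i))
            p -= q << i;
        if (i == 0)
            break;
        --i;
    }

    /* Step 3: what is left is the remainder */
    return p;
}
```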
Breaking apart R's answer:
#define HI (-1U-(-1U/2))
This effectively gives you an unsigned integer with only the highest value bit set.
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
This line actually finds the highest power of two that q can be multiplied by before overflowing an unsigned integer. Starting this high isn't strictly necessary (the largest multiple that does not exceed p would do), but it doesn't change the results other than increasing the amount of execution time required.
In case you're not familiar with the C-isms in this line:
(q<<i) is a left bit shift by i. Recall this is equivalent to multiplying by 2^i.
HI & (q<<i) performs a bitwise-AND. Since HI only has its top bit populated this will only result in a non-zero value when (q<<i) is large enough to cause the top bit to be non-zero. One more shift over to the left and there'd be an integer overflow.
!(HI & (q<<i)) is 'true' when (HI & (q<<i)) is zero and 'false' otherwise.
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
This is a simple decreasing loop do { .... } while (i--);. Note that post-decrementing is used on i so the loop executes, then it checks to see if i is not zero, then it subtracts one from i, and then if its earlier check resulted in true it continues. This has the property that the loop executes its last time when i is 0. This is important because we may need to strip away an unmultiplied copy of q.
if (p >= (q<<i)) checks if the 2^i * q is less than or equal to p. If it is, p -= (q<<i) strips it away.
The remainder is left.
While most C implementations run on hardware that has a division instruction, the remainder operation can be performed roughly like this, for computing p%q, assuming unsigned values:
#define HI (-1U-(-1U/2))
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
The resulting remainder is in p.
In addition to a hardware instruction and implementation using shifts, as R.. suggests, there's also reciprocal multiplication.
This technique can be used when the right-hand side of % is a constant, known at compile time.
Reciprocal multiplication is used to implement division, but using it for % is easy, based on the formula a%b == a-(a/b)*b.
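As an illustration for one fixed divisor, here is a sketch for dividing a 32-bit unsigned value by 10. The constant 0xCCCCCCCD is ceil(2^35 / 10); compilers commonly emit something along these lines for x / 10, but the function names here are my own.

```c
#include <stdint.h>

/* Multiply by the scaled reciprocal ceil(2^35 / 10), then shift right
   by 35: this gives floor(x / 10) for every 32-bit x. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * UINT64_C(0xCCCCCCCD)) >> 35);
}

/* Remainder from the quotient, using a % b == a - (a / b) * b */
uint32_t mod10(uint32_t x)
{
    return x - div10(x) * 10u;
}
```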
Depending on the smarts of the optimizer, there is a shortcut for modulo by a power of 2. For example, a % 32 can be implemented as a & 31. In general, a % (2^N) == a & (2^N -1). This is lightning fast compared to division. Most dividers (even hardware ones) require at least 1 cycle for each bit of the result, while a logical AND takes a cycle or so (in the pipeline).
EDIT: this only works if a is unsigned !
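A quick sketch of why the unsigned restriction matters; the function names are hypothetical, and the signed part assumes C99 or later on a two's complement machine.

```c
/* Unsigned: a % 32 and a & 31 always agree */
unsigned mod32_mask(unsigned a)
{
    return a & 31u;
}

/* Signed: the remainder takes the sign of a, so it can be negative */
int mod32_signed(int a)
{
    return a % 32;
}
```

For a = -1, the remainder is -1 while -1 & 31 is 31, so the mask is only a valid replacement when a is known to be non-negative.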
For my C computation, I need a sign type (an associated operator is acceptable) that can do the following:
type sign_t = -1 | 0 | 1
integer mult_sign(integer i, sign_t s)
{
switch (s) {
case -1: return -i;
case 0: return 0;
case 1: return i;
}
}
Clarification: The value of the sign is not known at compile time!
For now, I'm using C signed ints with the values -1, 0, 1, respectively, and the operation is C multiplication myint * mysign. But I'm wondering if this has performance implications: for each mult-with-sign-operation, a hardware multiplication is employed which might be slower than negation | set to 0 | don't touch.
What would be the ideal way to do this in C?
What would be the ideal way if we take away the value 0 from the sign values (so only -1 and 1 are valid)?
Architecture specific hacks / standard nonconformness are very fine if you tell me where they are.
Multiplication seems like an awesome choice.
It's very clear and concise, and relies only on basic (well-understood) mathematical properties of integers.
You don't say much about your execution environment, but on typical desktop CPUs integer multiply has been fast (a few cycles of latency, fully pipelined) for a long time. So it's hard to come up with something faster.
Also, doing a multiply removes the need to branch to "decide" what to do, which is often much (much) better than jumping around to do something "simpler".
This is almost certainly not a performance bottleneck. Just use a plain int.
Incidentally, you do realize that taking the negative of INT_MIN will result in undefined behavior? Usually you'll just get the same thing back, but you probably want to pass -fwrapv (with GCC or Clang) or whatever your compiler's equivalent is to ensure this.
If you are on a small microcontroller without fast hardware multiplication then you should do it case-by-case. In this case also the overhead for a branch is small.
Otherwise just use the multiplication.
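If you want to experiment with a branch-free alternative anyway, one possible sketch is below. It assumes two's complement and that i is not INT_MIN (negating INT_MIN overflows here just as it does with plain negation). Whether it is any faster than i * s is something only a measurement on your target can tell.

```c
/* Hypothetical branch-free alternative to i * s for s in {-1, 0, 1} */
int mult_sign(int i, int s)
{
    int neg  = -(s < 0);    /* all ones when s == -1, else 0 */
    int keep = -(s != 0);   /* all ones when s != 0, else 0 */
    return ((i ^ neg) - neg) & keep;   /* (i ^ -1) - (-1) == -i */
}
```

If 0 is removed from the sign values, the final & keep can be dropped.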
I'm looking for an existing implementation for C or D, or advice in implementing, signed and/or unsigned integer types with floating point semantics.
That is to say, an integer type that behaves as floating point types do when doing arithmetic: Overflow produces infinity (-infinity for signed underflow) rather than wrapping around or having undefined behavior, undefined operations produce NaN, etc.
In essence, a version of floating point where the distribution of representable numbers falls evenly on the number line, instead of conglomerating around 0.
In addition, all operations should be deterministic; any given two's complement 32-bit architecture should produce the exact same result for the same computation, regardless of its implementation (whereas floating point may, and often will, produce slightly differing results).
Finally, performance is a concern, which has me worried about potential "bignum" (arbitrary-precision) solutions.
See also: Fixed-point and saturation arithmetic.
I do not know of any existing implementations of this.
But I would imagine implementing it would be a matter of (in D):
enum CheckedIntState : ubyte
{
ok,
overflow,
underflow,
nan,
}
struct CheckedInt(T)
if (isIntegral!T)
{
private T _value;
private CheckedIntState _state;
// Constructors, getters, conversion helper methods, etc.
// And a bunch of operator overloads that check the
// result on every operation and yield a CheckedInt!T
// with an appropriate state.
// You'll also want to overload opEquals and opCmp and
// make them check the state of the operands so that
// NaNs compare equal and so on.
}
Saturating arithmetic does what you want except for the part where undefined operations produce NaN; this is going to turn out to be problematic, because most saturating implementations use the full number range, and so there are not values left over to reserve for NaN. Thus, you probably can't easily build this on the back of saturating hardware instructions unless you have an additional "is this value NaN" field, which is rather wasteful.
Assuming that you're wedded to the idea of NaN values, all of the edge case detection will probably need to happen in software. For most integer operations, this is pretty straightforward, especially if you have a wider type available (let's assume long long is strictly larger than whatever integer type underlies myType):
myType add(myType x, myType y) {
if (x == positiveInfinity && y == negativeInfinity ||
x == negativeInfinity && y == positiveInfinity)
return notANumber;
/* widen before adding, so the sum itself cannot overflow */
long long wideResult = (long long)x + y;
if (wideResult >= positiveInfinity) return positiveInfinity;
if (wideResult <= negativeInfinity) return negativeInfinity;
return (myType)wideResult;
}
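A self-contained variant of the same idea, as a sketch only: it uses int as the underlying type, reserves the extreme values as markers (my own choice of encoding, not something the question specifies), assumes long long is wider than int, and also lets NaN and the infinities propagate.

```c
#include <limits.h>

/* Hypothetical encoding: the extreme int values are reserved as markers. */
#define NOT_A_NUMBER  INT_MIN
#define NEG_INFINITY  (INT_MIN + 1)
#define POS_INFINITY  INT_MAX

int checked_add(int x, int y)
{
    if (x == NOT_A_NUMBER || y == NOT_A_NUMBER)
        return NOT_A_NUMBER;                      /* NaN propagates */
    if ((x == POS_INFINITY && y == NEG_INFINITY) ||
        (x == NEG_INFINITY && y == POS_INFINITY))
        return NOT_A_NUMBER;                      /* inf + -inf is undefined */
    if (x == POS_INFINITY || y == POS_INFINITY)
        return POS_INFINITY;                      /* infinity absorbs finite values */
    if (x == NEG_INFINITY || y == NEG_INFINITY)
        return NEG_INFINITY;

    long long wide = (long long)x + y;            /* widen before adding */
    if (wide >= POS_INFINITY) return POS_INFINITY;
    if (wide <= NEG_INFINITY) return NEG_INFINITY;
    return (int)wide;
}
```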
One solution might be to implement multiple-precision arithmetic with abstract data types. The book C Interfaces and Implementations by David Hanson has a chapter (interface and implementation) of MP arithmetic.
Doing calculations using scaled integers is also a possibility. You might be able to use his arbitrary-precision arithmetic, although I believe this implementation can't overflow. You could run out of memory, but that's a different problem.
In either case, you might need to tweak the code to return exactly what you want on overflow and such.
Source code (MIT license)
That page also has a link to buy the book from amazon.com.
Half of the requirements are satisfied by saturating arithmetic, which is implemented in e.g. ARM instructions, MMX and SSE.
As also pointed out by Stephen Canon, one needs additional elements to check overflow / NaN. Some instruction sets (Atmel at least) btw have a sticky flag to test for overflows (could be used to differentiate inf from max_int). And perhaps "Q" + 0 could mark NaN.