Find next number with specific hamming weight - permutation

Given a certain integer x, I wish to compute the next higher integer y that has a certain Hamming weight w. Note that the Hamming weight of x itself does not have to be w.
So, for example x = 10 (1010) and w = 4 the result should be y = 15 (1111).
Obviously, I could achieve this by just incrementing x, but that would be very slow for large numbers. Can I achieve this by means of bit shifts somehow?

There are three cases: the Hamming weight (a.k.a. bitwise population count) is reduced, unchanged, or increased.
Increased Hamming weight
Set enough successive low-order 0-bits to achieve the desired weight.
This works because each successive low-order bit is the smallest value that you can add, and their sum is the smallest difference that will sufficiently increase the Hamming weight.
Unchanged Hamming weight
Add the value of the lowest-order 1-bit.
Increase the Hamming weight per above, if necessary.
This works because the value of the lowest set bit is the lowest value that will cause a carry to occur. Adding any lower bit-value would simply set that bit, and increase the Hamming weight.
As long as the addition "carries a one," bits will be cleared. The resulting weight must be equal or lower. If several bits are cleared due to a chain of carries, you'll need to compensate by setting low-order bits to restore the weight.
Reduced Hamming weight
Clear enough low-order 1-bits to achieve the new desired weight.
Follow the above process for unchanged weight.
This works because clearing low-order 1-bits finds the preceding number with the desired weight, by subtracting the smallest viable amount. From the preceding number of the correct weight, follow the "unchanged weight" algorithm to reach the next number.
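Putting the three cases together, here is a rough C sketch of the steps above (my own illustration, assuming w >= 1 and that a result exists within the width of a uint64_t):
#include <stdint.h>

/* Count set bits by clearing the lowest set bit until none remain. */
static int popcount64(uint64_t v)
{
    int c = 0;
    for (; v; v &= v - 1)
        c++;
    return c;
}

/* Set successive low-order 0-bits until the weight reaches w. */
static uint64_t raise_weight(uint64_t v, int w)
{
    for (uint64_t bit = 1; popcount64(v) < w; bit <<= 1)
        v |= bit;               /* setting an already-set bit changes nothing */
    return v;
}

/* Smallest y > x with popcount(y) == w, per the three cases above. */
uint64_t next_with_weight(uint64_t x, int w)
{
    if (popcount64(x) < w)      /* increased weight: just set low 0-bits */
        return raise_weight(x, w);

    uint64_t y = x;
    while (popcount64(y) > w)   /* reduced weight: step back to the      */
        y &= y - 1;             /* preceding number of weight w          */

    y += y & -y;                /* unchanged weight: add the lowest set bit */
    return raise_weight(y, w);  /* restore weight lost to carries           */
}
For x = 10 and w = 4 this returns 15, matching the example in the question.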

Related

Reduce() is not working as expected in Kotlin for sum operation for some value

val arr = arrayListOf(256741038, 623958417, 467905213, 714532089, 938071625)
arr.sort()
val max = arr.slice(1 until arr.size).reduce { x, ars -> x + ars }
I want the maximum sum of 4 out of the 5 elements in the array, but I am getting an answer that is not expected:
max = -1550499952
I don't know what's going wrong there, because it works for many cases but not for this one.
The expected output would be:
max = 2744467344
If you ever see a negative number appearing out of nowhere, that's a sign that you've got an overflow. The largest number an Int can represent is 2147483647 - add 1 to that and you get -2147483648, a negative number.
That's because signed integers represent negative numbers with a 1 in the most significant bit of the binary representation. Your largest positive number is 0111 (except with 32 bits not 4!), then you add 1 and it ticks over to 1000, the largest negative number. Then as you add to that, it moves towards zero, until you have 1111 (which is -1). Add another 1 and it overflows (there's no space to represent 10000) and you're back at zero, 0000.
Anyway, the point is you're adding lots of big numbers together and an Int can't hold the result. It keeps overflowing, so you lose the bigger digits (it can't represent more than ~2 billion), and the result can come out negative depending on where the overflow ends up, i.e. which half of the binary range it lands in.
You can fix this by using Longs instead (64-bits, max values +/- 9 quintillion, lots of room):
// note the L's to make them Longs
arrayListOf(256741038L, 623958417L, 467905213L, 714532089L, 938071625L)
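Outside Kotlin, the same wrap-around is easy to reproduce; here is a small C sketch of my own with the four largest values from the question, accumulating in unsigned arithmetic (whose overflow behaviour is well defined) and reinterpreting the 32-bit total as signed:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* The four largest values from the question. */
    uint32_t vals[] = {467905213u, 623958417u, 714532089u, 938071625u};
    uint32_t narrow = 0;   /* 32-bit accumulator */
    uint64_t wide   = 0;   /* 64-bit accumulator, plenty of room */

    for (int i = 0; i < 4; i++) {
        narrow += vals[i];
        wide   += vals[i];
    }

    /* On typical two's-complement platforms, the 32-bit total viewed as
       signed is the same bit pattern (and the same negative value) that
       Kotlin's overflowing Int addition produces. */
    printf("%" PRId32 "\n", (int32_t)narrow);   /* -1550499952 */
    printf("%" PRIu64 "\n", wide);              /* 2744467344  */
    return 0;
}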

Problem of a simple float number serialization example

I am reading the Serialization section of a tutorial, http://beej.us/guide/bgnet/html/#serialization, and I am reviewing the code that encodes a number into a portable binary form.
#include <stdint.h>

uint32_t htonf(float f)
{
    uint32_t p;
    uint32_t sign;

    if (f < 0) { sign = 1; f = -f; }
    else { sign = 0; }

    p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
    p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction

    return p;
}

float ntohf(uint32_t p)
{
    float f = ((p>>16)&0x7fff); // whole part
    f += (p&0xffff) / 65536.0f; // fraction

    if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set

    return f;
}
I ran into problems with this line p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign .
According to the original code comments, this line extracts the whole part and sign, and the next line deals with fraction part.
Then I found an image about how float is represented in memory and started the calculation by hand.
From Wikipedia Single-precision floating-point format:
So I then presumed that whole part == exponent part.
But this ((((uint32_t)f)&0x7fff)<<16) is getting the last 15 bits of the fraction part, if I go by the image above.
Now I get confused, where did I get wrong?
It's important to realize what this code is not. This code does not do anything with the individual bits of a float value. (If it did, it wouldn't be portable and machine-independent, as it claims to be.) And the "portable" string representation it creates is fixed point, not floating point.
For example, if we use this code to convert the number -123.125, we will get the binary result
10000000011110110010000000000000
or in hexadecimal
807b2000
Now, where did that number 10000000011110110010000000000000 come from? Let's break it up into its sign, whole number, and fractional parts:
1 000000001111011 0010000000000000
The sign bit is 1 because our original number was negative. 000000001111011 is the 15-bit binary representation of 123. And 0010000000000000 is 8192. Where did 8192 come from? Well, 8192 ÷ 65536 is 0.125, which was our fractional part. (More on this below.)
How did the code do this? Let's walk through it step by step.
(1) Extract sign. That's easy: it's the ordinary test if(f < 0).
(2) Extract whole-number part. That's also easy: We take our floating-point number f, and cast it to type uint32_t. When you convert a floating-point number to an integer in C, the behavior is pretty obvious: it throws away the fractional part and gives you the integer. So if f is 123.125, (uint32_t)f is 123.
(3) Extract fraction. Since we've already got the integer part, we can isolate the fraction by starting with the original floating-point number f, and subtracting the integer part. That is, 123.125 - 123 = 0.125. Then we multiply the fractional part by 65536, which is 2^16.
It may not be obvious why we multiplied by 65536 and not some other number. In one sense, it doesn't matter what number you use. The goal here is to take a fractional number f and turn it into two integers a and b such that we can recover the fractional number f again later (although perhaps approximately). The way we're going to recover the fractional number f again later is by computing
a + b / x
where x is, well, some other number. If we chose 1000 for x, we'd break 123.125 up into a and b values of 123 and 125. We're choosing 65536, or 2^16, for x because that lets us make maximal use of the 16 bits we've allocated for the fractional part in our representation. Since x is 65536, b has to be some number we can divide by 65536 in order to get 0.125. So since b / 65536 = 0.125, by simple algebra we have b = 0.125 * 65536. Make sense?
Anyway, let's now look at the actual code for performing steps 1, 2, and 3.
if (f < 0) { sign = 1; f = -f; }
Easy peasy. If f is negative, our sign bit will be 1, and we want the rest of the code to operate on the positive version of f.
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);
As mentioned, the important part here is (uint32_t)f, which just grabs the integer (whole-number) part of f. The bitmask & 0x7fff extracts the low-order 15 bits of it, throwing anything else away. (This is because our "portable representation" only allocates 15 bits for the whole-number part, meaning that numbers greater than 2^15-1, or 32767, can't be represented.) The shift << 16 moves it into the high half of the eventual uint32_t result, where it belongs. And then | (sign<<31) takes the sign bit and puts it in the high-order position where it belongs.
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
Here, (int)f recomputes the integer (whole-number) part of f, and then f - (int)f extracts the fraction. We multiply it by 65536, as explained above. There may still be a fractional part (even after the multiplication, that is), so we cast to (uint32_t) again to throw that away, retaining only the integer part. We can only handle 16 bits of fraction, so we extract those bits (discarding anything else) with & 0xffff, although this should be unnecessary since we started with a positive fractional number less than 1, and multiplied it by 65536, so we should end up with a positive number less than 65536, i.e. we shouldn't have a number that won't exactly fit in 16 bits. Finally, the p |= operation stuffs these 16 bits we've just computed into the low-order half of p, and we're done.
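To tie the walkthrough together, a minimal test harness (assuming the htonf and ntohf definitions from the question are in scope) reproduces the bit pattern derived above:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t p = htonf(-123.125f);
    printf("%08x\n", (unsigned)p);   /* prints 807b2000, as derived above */
    printf("%f\n", ntohf(p));        /* prints -123.125000                */
    return 0;
}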
Addendum: It may still not be obvious where the number 65536 came from, and why that was used instead of 10000 or some other number. So let's review two key points: we're ultimately dealing with integers here. Also, in one sense, the number 65536 actually was pretty arbitrary.
At the end of the day, any bit pattern we're working with is "really" just an integer. It's not a character, or a floating-point number, or a pointer — it's just an integer. If it has N bits, it represents integers from 0 to 2^N-1.
In the fixed-point representation we're using here, there are three subfields: a 1-bit sign, a 15-bit whole-number part, and a 16-bit fraction part.
The interpretation of the sign and whole-number parts is obvious. But the question is: how shall we represent a fraction using a 16-bit integer?
And the answer is, we pick a number, almost any number, to divide by. We can call this number a "scaling factor".
We really can pick almost any number. Suppose I chose the number 3467 as my scaling factor. Here is how I would then represent several different fractions as integers:
    ½ → 1734/3467 → 1734
    ⅓ → 1155/3467 → 1155
    0.125 → 433/3467 → 433
So my fractions ½, ⅓, and 0.125 are represented by the integers 1734, 1155, and 433. To recover my original fractions, I just divide by 3467:
    1734 → 1734 ÷ 3467 → 0.500144
    1155 → 1155 ÷ 3467 → 0.333141
    433 → 433 ÷ 3467 → 0.124891
Obviously I wasn't able to recover my original fractions exactly, but I came pretty close.
The other thing to wonder about is, where does that number 3467 "live"? If you're just looking at the numbers 1734, 1155, and 433, how do you know you're supposed to divide them by 3467? And the answer is, you don't know, at least, not just by looking at them. 3467 would have to be part of the definition of my silly fractional number format; people would just have to know, because I said so, that they had to multiply by 3467 when constructing integers to represent fractions, and divide by 3467 when recovering the original fractions.
And the other thing to look at is what the implications are of choosing various different scaling factors. The first thing is that, since in the end we're going to be using a 16-bit integer for the fractional representation, we absolutely can't use a scaling factor any greater than 65536. If we did, sometimes we'd end up with an integer greater than 65535, and it wouldn't fit in 16 bits. For example, suppose we tried to use a scaling factor of 70000, and suppose we tried to represent the fraction 0.95. Now, 0.95 is equal to 66500/70000, so our integer would be 66500, but that doesn't fit in 16 bits.
On the other hand, it turns out that ideally we don't want to use a number less than 65536, either. The smaller a number we use, the more of our 16-bit fractional representation we'll waste. When I chose 3467 in my silly example a little earlier, that meant I would represent fractions from 0/3467 = 0.00000 and 1/3467 = 0.000288 up to 3466/3467 = 0.999711. But I'd never use any of the integers from 3467 through 65535. They'd be wasted, and by not using them, I'd unnecessarily limit the precision of the fractions I could represent.
The "best" (least wasteful) scaling factor to use is 65536, although there's one other consideration, namely, which fractions do you want to be able to represent exactly? When I used 3467 as my scaling factor, I couldn't represent any of my test numbers ½, ⅓, or 0.125 exactly. If we use 65536 as the scaling factor, it turns out that we can represent fractions involving small powers of two exactly — that is, halves, quarters, eights, sixteenths, etc. — but not any other fractions, and in particular not most of the decimal fractions like 0.1. If we wanted to be able to represent decimal fractions exactly, we would have to use a scaling factor that was a power of 10. The largest power of 10 that will fit in 16 bits is 10000, and that would indeed let us exactly represent decimal fractions as small as 0.00001, although we'd waste about 5/6 (or 85%) of our 16-bit fractional range.
So if we wanted to represent decimal fractions exactly, without wasting precision, the inescapable conclusion is that we should not have allocated 16 bits for our fraction field in the first place. Better choices would have been 10 bits (ideal scaling factor 1024, we'd use 1000, wasting only 2%) or 20 bits (ideal scaling factor 1048576, we'd use 1000000, wasting about 5%).
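As a trivial check of the exactness point (nothing from the guide's code, just the arithmetic):
#include <stdio.h>

int main(void)
{
    /* With a scaling factor of 65536, small power-of-two fractions land on
       whole numbers, but decimal fractions such as 0.1 do not. */
    printf("0.125 * 65536 = %.1f\n", 0.125 * 65536);   /* 8192.0, exact     */
    printf("0.1   * 65536 = %.1f\n", 0.1 * 65536);     /* 6553.6, not exact */
    return 0;
}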
The relevant excerpts from the page are
The thing to do is to pack the data into a known format and send that over the wire for decoding. For example, to pack floats, here’s something quick and dirty with plenty of room for improvement
and
On the plus side, it’s small, simple, and fast. On the minus side, it’s not an efficient use of space and the range is severely restricted—try storing a number greater-than 32767 in there and it won’t be very happy! You can also see in the above example that the last couple decimal places are not correctly preserved.
The code is presented only as an example. It is really quick and dirty, because it packs and unpacks the float as a fixed point number with 16 bits for fractions, 15 bits for integer magnitude and one for sign. It is an example and does not attempt to map floats 1:1.
It is in fact a rather incredibly stupid algorithm: it can map 1:1 all IEEE 754 float32s within the magnitude range ~256...32767 without losing a bit of information, truncate the fractions of floats in the range 0...255 to 16 bits, and fail spectacularly for any number >= 32768. And NaNs.
As for the endianness problem: for any protocol that does not intrinsically work with integers >= 32 bits, someone needs to decide how to serialize these integers, in turn, into the other format. For example, at the lowest levels the Internet deals in 8-bit octets.
There are 24 obvious ways of mapping a 32-bit unsigned integer into 4 octets, of which 2 are now in general use, and some more historically. Of course there is a countably infinite number of (and exponentially sillier) ways of encoding them...
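For illustration, one of the two mappings in general use (big-endian, a.k.a. network byte order) can be written out explicitly so that it does not depend on the host's own byte order; this is my own sketch, not code from the guide:
#include <stdint.h>

/* Pack a 32-bit unsigned integer into 4 octets, most significant first. */
void pack_u32_be(uint32_t v, unsigned char out[4])
{
    out[0] = (v >> 24) & 0xff;
    out[1] = (v >> 16) & 0xff;
    out[2] = (v >>  8) & 0xff;
    out[3] =  v        & 0xff;
}

/* And the corresponding unpacking. */
uint32_t unpack_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] <<  8) |  (uint32_t)in[3];
}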

matrix multiplication in C and MATLAB, different result

I am using the classic 4th-order Runge-Kutta method (RK4) to solve the ODE Ax + Bu = x_dot in MATLAB and C.
A is 5x5, x is 5x1, B is 5x1, u is 1x1; u is the output of a sine function (2500 points).
The outputs of RK4 in MATLAB and C are identical up to the 45th iteration, but at the 45th iteration (of 2500) the outputs of A*x at the 2nd step of RK4 differ. Here are the matrices.
I have printed them with 30 decimals.
A and x are the same in MATLAB and C.
A = [0, 0.100000000000000005551115123126, 0, 0, 0;
     -1705.367199390822406712686643004417, -13.764624913971095665488064696547, 245874.405372532171895727515220642090, 0.000000000000000000000000000000, 902078.458362009725533425807952880859;
     0, 0, 0, 0.100000000000000005551115123126, 0;
     2.811622989796986438193471258273, 0, -572.221510883482778808684088289738, -0.048911651728553134921284595293, 0;
     0, 0, -0.100000000000000005551115123126, 0, 0]

x = [0.071662614269441649028635765717;
     45.870073568955461951190955005586;
     0.000002088948888569741376840423;
     0.002299524406171214990085571728;
     0.000098982102875767145086331744]
but the results of A*x are not the same. The second element in MATLAB is -663.792187417201375865261070430279, while in C it is -663.792187417201489552098792046309.
MATLAB
A*x = [ 4.587007356895546728026147320634
-663.792187417201375865261070430279
0.000229952440617121520692600622
0.200180438762844026268084007825
-0.000000208894888856974158859866];
C
A*x = [4.587007356895546728026147320634
-663.792187417201489552098792046309
0.000229952440617121520692600622
0.200180438762844026268084007825
-0.000000208894888856974158859866];
Though the difference is small, I need this result to do a finite difference, and at that point the discrepancy becomes more obvious.
Does anyone know why?
How many digits do you consider you need? The first 16 digits of each number are equal, which is approximately the amount of information a double can represent and store internally. You cannot get more: even if you force your printing routines to print more digits, they will print rubbish. What happens is that you have told your printing routines to produce, say, 120 digits, and they will print that many, normally by repeatedly multiplying whatever remainder is left. As the numbers are represented in base 2, you normally don't get zeros once you pass the internal precision of the number, and the printing implementations don't have to agree on the digits printed once there are no more bits represented in your number.
Suppose for a moment you have a hand calculator that only has 10 digits of precision, and you are given numbers of 120 digits. You begin to calculate and only get results with 10 digits... but you have been asked to print a report with 120-digit results. Well, as the overall calculation cannot be done with more than 10 digits, what can you do? You are using a calculator unable to give you the requested number of digits; moreover, the number of base-10 digits in a 52-bit significand is not a whole number of digits (there are 15.65355977452702215111442252567364 decimal digits in a 52-bit significand). You can fill the remaining places with zeros (incorrect, most probably), fill them with rubbish (which will never affect the leading 10-digit result), or go to Radio Shack and buy a 120-digit calculator. Floating-point printing routines use a counter specifying how many times to go round a loop and get another digit; they normally stop when the counter reaches its limit, but make no extra effort to check whether you have gone crazy and asked for a huge number of digits. If you ask for 600 digits, you simply get 600 loop iterations, but the digits will be fake.
You should expect a difference of about one part in 2^52 in a double, as that is the number of binary digits used for the significand (this is approximately 2.220446049250313080847263336181641e-16, so multiply this by the number you printed to estimate roughly how large the rounding error is). If you multiply your number 663.792187417201375865261070430279 by that, you get 1.473914740073748177152126604805902e-13, which is an estimate of where the last valid digit of that number sits. The actual error is probably far larger, due to the large number of multiplications and sums required to compute one cell. Anyway, a resolution of 1.0e-13 is very good (a subatomic difference, if the values were lengths in meters).
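As a quick sanity check of that estimate (DBL_EPSILON is the 2^-52 relative spacing mentioned above):
#include <stdio.h>
#include <float.h>

int main(void)
{
    double x = 663.792187417201375865261070430279;   /* only ~16 digits survive */
    printf("DBL_EPSILON    = %.3e\n", DBL_EPSILON);       /* 2.220e-16       */
    printf("error estimate = %.3e\n", DBL_EPSILON * x);   /* about 1.474e-13 */
    return 0;
}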
EDIT
as an example, just consider the following program:
#include <stdio.h>

int main()
{
    printf("%.156f\n", 0.1);
}
if you run it you'll get:
0.100000000000000005551115123125782702118158340454101562500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
which indeed is the exact decimal value of the internal representation of the number 0.1 that the machine uses in base-2 floating point (0.1 happens to be a periodic number when represented in base 2). Its representation is:
0.0001100110011(0011)*
so it cannot be represented exactly with 52 bits, as the pattern 0011 would have to repeat indefinitely. At some point you have to cut it off, and the printf routine then continues adding zeros to the right until it reaches the representation above. (All numbers with a finite number of digits in base 2 can be represented with a finite number of digits in base 10, but the converse is not true, because all the prime factors of 2 are in 10, but not all the prime factors of 10 are in 2.)
If you take the difference between 0.1 and 0.1000000000000000055511151231257827021181583404541015625 and divide it by 0.1, you get 5.55111512312578270211815834045414e-17, which is approximately 1/2^54, or roughly one quarter of the 1/2^52 limit I showed you above. That value is the closest number representable with a 52-bit significand to the number 0.1.

Generating random numbers in ranges from 32 bytes of random data, without bignum library

I have 32 bytes of random data.
I want to generate random numbers within variable ranges between 0-9 and 0-100.
If I used an arbitrary precision arithmetic (bignum) library, and treated the 32 bytes as a big number, I could simply do:
random = random_source % range;
random_source = random_source / range;
as often as I liked (with different ranges) until the product of the ranges nears 2^256.
Is there a way of doing this using only (fixed-size) integer arithmetic?
Certainly you can do this by doing base-256 long division (or push-up multiplication). It is just like the long division you learnt in primary school, but with bytes instead of digits. It involves doing a cascade of divides and remainders for each byte in turn. Note that you also need to be aware of how you are consuming the big number: as you consume it and it becomes smaller, there is an increasing bias against the larger values in the range. E.g. if you only have 110 left and you ask for rnd(100), the values 0-9 would each be twice as likely as the values 10-99.
But, you don't really need the bignum techniques for this, you can use ideas from arithmetic encoding compression, where you build up the single number without actually ever dealing with the whole thing.
If you start by reading 4 bytes into an unsigned uint32_t buffer, it has a range of 0..4294967295, with a non-inclusive max of 4294967296. I will refer to this synthesised value as the "carry forward", and this exclusive max value is also important to record.
[For simplicity, you might start with reading 3 bytes to your buffer, generating a max of 16M. This avoids ever having to deal with the 4G value that can't be held in a 32 bit integer.]
There are 2 ways to use this, both with accuracy implications:
Stream down:
Do your modulo range. The modulo is your random answer. The division result is your new carry forward and has a smaller range.
Say you want 0..99, so you modulo by 100, your upper part has a range max 42949672 (4294967296/100) which you carry forward for the next random request
We can't feed another byte in yet...
Say you now want 0..9, so you modulo by 10, and now your upper part has a range 0..4294967 (42949672/10)
As max is less than 16M, we can now bring in the next byte. Multiply it by the current max 4294967 and add it to the carry forward. The max is also multiplied by 256 -> 1099511552
This method has a slight bias towards small values: one time in "next max", the available range of values will not be the full range, because the last value is truncated. But by choosing to maintain 3-4 good bytes in max, that bias is minimised; it will occur at most 1 in 16 million times.
The computational cost of this algorithm is the division of both the carry forward and max by the requested range, plus a multiply each time you feed in a new byte. I assume the compiler will optimise the div and modulo into a single operation.
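A rough C sketch of this stream-down bookkeeping (my own illustration of the description above; the names are made up, and the slight bias discussed above is not corrected):
#include <stdint.h>

struct rndstream {
    const unsigned char *bytes;   /* the 32 bytes of random data        */
    int pos, len;
    uint32_t carry;               /* carry forward, uniform in [0, max) */
    uint32_t max;                 /* exclusive upper bound of carry     */
};

/* Bring in one more byte: multiply it by the current max and add it to the
   carry forward; max grows by a factor of 256, as described above. */
static void feed_byte(struct rndstream *s)
{
    if (s->pos < s->len && s->max < (1u << 24)) {
        s->carry += (uint32_t)s->bytes[s->pos++] * s->max;
        s->max *= 256;
    }
}

static void stream_init(struct rndstream *s, const unsigned char *bytes, int len)
{
    s->bytes = bytes; s->len = len; s->pos = 0;
    s->carry = 0; s->max = 1;
    feed_byte(s); feed_byte(s); feed_byte(s);   /* start with 3 bytes: max 16M */
}

/* Stream down: the modulo is the random answer; the quotient, with its
   correspondingly smaller max, is the new carry forward. */
static uint32_t stream_rand(struct rndstream *s, uint32_t range)
{
    uint32_t result = s->carry % range;
    s->carry /= range;
    s->max   /= range;
    while (s->max < (1u << 24) && s->pos < s->len)
        feed_byte(s);
    return result;
}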
Stream up:
Say you want 0..99
Divide your max by range, to get the nextmax, and divide carryforward by nextmax. Now, your random number is in the division result, and the remainder forms the value you carry forward to get the next random.
When nextmax becomes less than 16M, simply multiply both nextmax and your carry forward by 256 and add in the next byte.
The downside of this method is that, depending on the division used to generate nextmax, the top value result (i.e. 99 or 9) is heavily biased against, OR sometimes you will generate the over-value (100) - this depends on whether you round up or down when doing that first division.
The computational cost here is again 2 divides, presuming the compiler optimiser blends div and mod operations. The multiply by 256 is fast.
In both cases you could choose to say that if the input carry forward value is in this "high bias range" then you will perform a different technique. You could even oscillate between the techniques - use the second in preference, but if it generates the over-value, then use the first technique, though on its own the likelihood is that both techniques will bias for similar input random streams when the carry forward value is near max. This bias can be reduced by making the second method generate -1 as the out-of-range, but each of these fixes adds an extra multiply step.
Note that in arithmetic encoding this overflow zone is effectively discarded as each symbol is extracted. It is guaranteed during decoding that those edge values won't happen, and this contributes to the slight suboptimal compression.
/* The 32 bytes in data are treated as a base-256 numeral following a "." (a
   radix point marking where fractional digits start). This routine
   multiplies that numeral by range, updates data to contain the fractional
   portion of the product, and returns the integer portion.

   8-bit bytes are assumed, or "t /= 256" could be changed to
   "t >>= CHAR_BIT". But then you have to check the sizes of int
   and unsigned char to consider overflow.
*/
int r(int range, unsigned char *data)
{
    // Start with 0 carried from a lower position.
    int t = 0;

    // Iterate through each byte.
    for (int i = 32; 0 < i;)
    {
        --i;

        // Multiply next byte by our multiplier and add the carried data.
        t = data[i] * range + t;

        // Store the low bits of the result.
        data[i] = t;

        // Carry the high bits of the result to the next position.
        t /= 256;
    }

    // Return the bits that carried out of the multiplication.
    return t;
}
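For example, calling r() above repeatedly with different ranges (here with a made-up 32-byte buffer standing in for the real random data) yields one value in 0..9 and one in 0..99, i.e. r(10) and r(100):
#include <stdio.h>

int main(void)
{
    unsigned char data[32] = {
        0x3a, 0x91, 0x5c, 0x07, 0xde, 0x44, 0xb2, 0x19,
        0x6f, 0xe3, 0x28, 0x9d, 0x51, 0xc6, 0x0b, 0x77,
        0x84, 0x2e, 0xf1, 0x63, 0xa8, 0x15, 0xd9, 0x4c,
        0x30, 0xbe, 0x72, 0x05, 0x9a, 0xe7, 0x21, 0x8d,
    };   /* in real use, fill this with the 32 random bytes */

    printf("0..9 : %d\n", r(10, data));
    printf("0..99: %d\n", r(100, data));
    return 0;
}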

Prevent overflow/underflow in float division

I have two numbers:
FL_64 variable_number;
FL_64 constant_number;
The constant number is always the same, for example:
constant_number=(FL_64)0.0000176019966602325;
The variable number is given to me and I need to perform the division:
FL_64 result = variable_number/constant_number;
What would be the checks I need to do to variable_number in order to make sure the operation will not overflow / underflow before performing it?
Edit: FL_64 is just a typedef for double so FL_64 = double.
A Test For Overflow
Assume:
The C implementation uses IEEE-754 arithmetic with round-to-nearest-ties-to-even.
The magnitude of the divisor is at most 1, and the divisor is non-zero.
The divisor is positive.
The test and the proof below are written with the above assumptions for simplicity, but the general cases are easily handled:
If the divisor might be negative, use fabs(divisor) in place of divisor when calculating the limit shown below.
If the divisor is zero, there is no need to test for overflow, as it is already known an error (divide-by-zero) occurs.
If the magnitude exceeds 1, the division never creates a new overflow. Overflow occurs only if the dividend is already infinity (so a test would be isinf(candidate)). (With a divisor exceeding 1 in magnitude, the division could underflow. This answer does not discuss testing for underflow in that case.)
Note about notation: Expressions using non-code-format operators, such as x•y, represent exact mathematical expressions, without floating-point rounding. Expressions in code format, such as x*y, mean the computed results with floating-point rounding.
To detect overflow when dividing by divisor, we can use:
FL_64 limit = DBL_MAX * divisor;
if (-limit <= candidate && candidate <= limit)
{
    // Overflow will not occur.
}
else
{
    // Overflow will occur, or candidate or divisor is a NaN.
}
Proof:
limit will equal DBL_MAX multiplied by divisor and rounded to the nearest representable value. This is exactly DBL_MAX•divisor•(1+e) for some error e such that −2^−53 ≤ e ≤ 2^−53, by the properties of rounding to nearest plus the fact that no representable value for divisor can, when multiplied by DBL_MAX, produce a value below the normal range. (In the subnormal range, the relative error due to rounding could be greater than 2^−53. Since the product remains in the normal range, that does not occur.)
However, e = 2^−53 can occur only if the exact mathematical value of DBL_MAX•divisor falls exactly midway between two representable values, thus requiring it to have 54 significant bits (the bit that is ½ of the lowest position of the 53-bit significand of representable values is the 54th bit, counting from the leading bit). We know the significand of DBL_MAX is 0x1fffffffffffff (53 bits). Multiplying it by odd numbers produces 0x1fffffffffffff (when multiplied by 1), 0x5ffffffffffffd (by 3), and 0x9ffffffffffffb (by 5), and numbers with more significant bits when multiplied by greater odd numbers. Note that 0x5ffffffffffffd has 55 significant bits. None of these has exactly 54 significant bits. When multiplied by even numbers, the product has trailing zeros, so the number of significant bits is the same as when multiplying by the odd number that results from dividing the even number by the greatest power of two that divides it. Therefore, no product of DBL_MAX is exactly midway between two representable values, so the error e is never exactly 2^−53. So −2^−53 < e < 2^−53.
So, limit = DBL_MAX•divisor•(1+e), where e < 2^−53. Therefore limit/divisor is DBL_MAX•(1+e). Since this result is less than ½ ULP from DBL_MAX, it never rounds up to infinity, so it never overflows. So dividing any candidate that is less than or equal to limit by divisor does not overflow.
Now we will consider candidates exceeding limit. As with the upper bound, e cannot equal −2^−53, for the same reason. Then the least e can be is −2^−53 + 2^−105, because the product of DBL_MAX and divisor has at most 106 significant bits, so any increase from the midpoint between two representable values must be by at least one part in 2^105. Then, if limit < candidate, candidate is at least one part in 2^52 greater than limit, since there are 53 bits in a significand. So DBL_MAX•divisor•(1−2^−53+2^−105)•(1+2^−52) < candidate. Then candidate/divisor is at least DBL_MAX•(1−2^−53+2^−105)•(1+2^−52), which is DBL_MAX•(1+2^−53+2^−157). This exceeds the midpoint between DBL_MAX and what would be the next representable value if the exponent range were unbounded, which is the basis for the IEEE-754 rounding criterion. Therefore, it rounds up to infinity, so overflow occurs.
Underflow
Dividing by a number with magnitude less than one of course makes a number larger in magnitude, so it never underflows to zero. However, the IEEE-754 definition of underflow is that a non-zero result is tiny (in the subnormal range), either before or after rounding (whether to use before or after is implementation-defined). It is of course possible that dividing a subnormal number by a divisor less than one will produce a result still in the subnormal range. However, for this to happen, underflow must have occurred previously, to get the subnormal dividend in the first place. Therefore, underflow will never be introduced by a division by a number with magnitude less than one.
If one does wish to test for this underflow, one might do so similarly to the test for overflow, by comparing the candidate to the minimum normal (or the greatest subnormal) multiplied by divisor, but I have not yet worked through the numerical properties.
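As a concrete illustration (not part of the proof), applying the overflow test above with the constant from the question and an arbitrary example candidate:
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double divisor   = 0.0000176019966602325;
    double candidate = 1.0e300;                /* arbitrary example value */
    double limit     = DBL_MAX * fabs(divisor);

    if (-limit <= candidate && candidate <= limit)
        printf("no overflow: %g\n", candidate / divisor);
    else
        printf("overflow would occur (or an operand is NaN)\n");
    return 0;
}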
Assuming FL_64 is something like a double, you can get the maximum value, which is named DBL_MAX, from float.h.
So you want to make sure that
DBL_MAX >= variable_number/constant_number
or equally
DBL_MAX * constant_number >= variable_number
In code that could be something like
if (constant_number > 0.0 && constant_number < 1.0)
{
    if (DBL_MAX * constant_number >= variable_number)
    {
        // won't overflow
    }
    else
    {
        // will overflow
    }
}
else
{
    // add code for other ranges of constant_number
}
However, notice that floating-point calculations are imprecise, so there may be corner cases where the above code will fail.
I'm going to attempt to answer the question you asked (instead of trying to answer a different "How to detect overflow or underflow that was not prevented" question that you didn't ask).
To prevent overflow and underflow for division during the design of software:
Determine the range of the numerator and find the values with the largest and smallest absolute magnitude
Determine the range of the divisor and find the values with the largest and smallest absolute magnitude
Make sure that the largest absolute magnitude in the range of numerators, divided by the smallest absolute magnitude in the range of divisors, does not exceed the maximum representable value of the data type (e.g. FLT_MAX); that is the worst case for overflow.
Make sure that the smallest absolute magnitude in the range of numerators, divided by the largest absolute magnitude in the range of divisors, is not smaller than the minimum normal value of the data type (e.g. FLT_MIN); that is the worst case for underflow.
Note that the last few steps may need to be repeated for each possible data type until you've found the "best" (smallest) data type that prevents underflow and underflow (e.g. you might check if float satisfies the last 2 steps and find that it doesn't, then check if double satisfies the last 2 steps and find that it does).
It's also possible that you find out that no data type is able to prevent overflow and underflow, and that you have to limit the range of values that could be used for numerator or divisor, or rearrange formulas (e.g. change a (c*a)/b into a (c/b)*a) or switch to a different representation ("double double", rational numbers, ...).
Also; be aware that this provides a guarantee that (for all combinations of values within your ranges) overflow and underflow will be prevented; but doesn't guarantee that the smallest data type will be chosen if there's some kind of relationship between the magnitudes of the numerators and divisors. For a simple example, if you're doing something like b = a*a+1; result = b/a; where the magnitude of the numerator depends on the magnitude of the divisor, then you'll never get the "largest numerator with smallest divisor" or "smallest numerator with largest divisor" cases and a smaller data type (that can't handle cases that won't exist) may be suitable.
Note that you can also do checks before each individual division. This tends to make performance worse (due to the branches/checks) while causing code duplication (e.g. providing alternative code that uses double for cases when float would've caused overflow or underflow); and can't work when the largest type supported isn't large enough (you end up with an } else { // Now what??? problem that can't be solved in a way that ensures values that should work do work because typically the only thing you can do is treat it as an error condition).
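A sketch of those design-time range checks in code, using double and made-up example ranges (division_safe is just an illustrative helper name, not an established API):
#include <stdio.h>
#include <float.h>

/* Given the magnitude ranges of the numerators and divisors, confirm the
   division can never overflow or sink into the subnormal range. */
static int division_safe(double num_min, double num_max,
                         double div_min, double div_max)
{
    int no_overflow  = num_max <= DBL_MAX * div_min;   /* largest / smallest  */
    int no_underflow = num_min >= DBL_MIN * div_max;   /* smallest / largest  */
    return no_overflow && no_underflow;
}

int main(void)
{
    /* Example: numerators known to lie in [1e-3, 1e6]; the divisor is the
       question's constant, so its "range" collapses to a single value. */
    double c = 0.0000176019966602325;
    printf("%s\n", division_safe(1e-3, 1e6, c, c) ? "safe" : "not safe");
    return 0;
}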
I don't know what standard your FL_64 adheres to, but if it's anything like IEEE 754, you'll want to watch out for
Not a Number
There might be a special NaN value. Comparing it to anything (including itself) yields false, so if (variable_number == variable_number) == 0, then that's what's going on. There may be macros and functions to check for this depending on the implementation, such as in the GNU C Library.
Infinity
IEEE 754 also supports infinity (and negative infinity). This can be the result of an overflow, for instance. If variable_number is infinite and you divide it by constant_number, the result will probably be infinite again. As with NaN, the implementation usually supplies macros or functions to test for this, otherwise you could try dividing the number by something and see if it got any smaller.
Overflow
Since dividing the number by constant_number will make it bigger, the variable_number could overflow if it is already enormous. Check if it's not so big that this can happen. But depending on what your task is, the possibility of it being this large might already be excluded. The 64 bit floats in IEEE 754 go up to about 10^308. If your number overflows, it might turn into infinity.
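With a plain IEEE 754 double (which, per the edit, FL_64 is), the standard <math.h> classification macros cover the NaN and infinity checks; a small illustrative helper:
#include <math.h>
#include <stdio.h>

void classify(double variable_number)
{
    if (isnan(variable_number))
        printf("NaN input\n");
    else if (isinf(variable_number))
        printf("input is already infinite\n");
    else
        printf("finite input\n");
}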
I personally don't know the FL_64 variable type; from the name I suppose it has a 64-bit representation, but is it signed or unsigned?
Anyway, I would see a potential problem only if the type is signed; otherwise both the quotient and remainder would be representable in the same number of bits.
In the signed case, you need to check the sign of the result:
FL_64 result = variable_number/constant_number;

if ((variable_number > 0 && constant_number > 0) || (variable_number < 0 && constant_number < 0)) {
    if (result < 0) {
        //OVER/UNDER FLOW
        printf("over/under flow");
    } else {
        //NO OVER/UNDER FLOW
        printf("no over/under flow");
    }
} else {
    if (result < 0) {
        //NO OVER/UNDER FLOW
        printf("no over/under flow");
    } else {
        //OVER/UNDER FLOW
        printf("over/under flow");
    }
}
Also other cases should be checked, like division by 0. But as you mentioned constant_number is always fixed and different from 0.
EDIT:
Ok, so there could be another way to check overflow, by using the DBL_MAX value. Having the maximum representable number for a double, you can multiply it by constant_number and compute the maximum value allowed for variable_number. In the code snippet below, the first case does not cause overflow, while the second does (since variable_number is then larger than that limit). In the console output you can see that the second result is not double the first, as it mathematically should be, because the second case overflows.
#include <stdio.h>
#include <float.h>

typedef double FL_64;

int main() {
    FL_64 constant_number = (FL_64)0.0000176019966602325;
    FL_64 test = DBL_MAX * constant_number;
    FL_64 variable_number = test;
    FL_64 result;

    printf("MAX double value:\n%f\n\n", DBL_MAX);
    printf("Variable Number value:\n%f\n\n", variable_number);
    printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
    result = variable_number / constant_number;
    printf("Result:\n%f\n\n", result);

    variable_number *= 2;
    printf("Variable Number value:\n%f\n\n", variable_number);
    printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
    result = variable_number / constant_number;
    printf("Result:\n%f\n\n", result);

    return 0;
}
This is a specific-case solution, since you have a constant number. It will not work in the general case.

Resources