What operation turns floating point numbers into a "group"? - c

Might anyone be familiar with tricks and techniques to coerce the set of valid floating point numbers to be a group under a multiplication based operation?
That is, given any two floating point numbers ("double a,b"), what sequence of operations, including multiply, will turn this into another valid floating point number? (A valid floating point number is anything 1-normalized, excluding NaN, denormals and -0.0).
To put this in rough code:
double a = drand();
while ( forever )
{
double b = drand();
a = GROUP_OPERATION(a,b);
//invariant - a is a valid floating point number
}
Just multiplying by itself doesn't work, because of NaNs. Ideally this would be a straight-line approach (avoiding "if above X, divide by Y" formulations).
If this can't work for all valid floating point numbers, is there a subset for which such an operation is available?
(The model I'm looking for is akin to integer multiplication in C - no matter what two integers get multiplied together, you always get an integer back).
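As a rough illustration of why plain multiplication breaks the invariant, here is a minimal C sketch. The helper names `is_valid` and `stays_valid_under_multiply` are made up for this example, and "valid" is assumed to mean a normalized, finite double (so zeros, denormals, infinities and NaN are all rejected).

```c
#include <float.h>
#include <stdbool.h>

/* Assumed reading of "valid": finite and normalized.
   Comparisons with NaN are always false, so NaN is rejected too. */
bool is_valid(double x)
{
    return (x >= DBL_MIN && x <= DBL_MAX) ||
           (x <= -DBL_MIN && x >= -DBL_MAX);
}

/* Hypothetical helper: multiply a by b `count` times and report
   whether every intermediate product stayed valid. */
bool stays_valid_under_multiply(double a, double b, int count)
{
    for (int i = 0; i < count; ++i) {
        a *= b;
        if (!is_valid(a))
            return false;
    }
    return true;
}
```

Repeatedly multiplying by a large factor overflows to infinity and repeatedly multiplying by a tiny one underflows to zero, so the invariant fails in both directions.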

(The model I'm looking for is akin to integer multiplication in C - no matter what two integers get multiplied together, you always get an integer back).
Integers modulo 2^N do not form a group under multiplication - what integer multiplied by 2 gives 1? For integers to be a group under multiplication, you have to work modulo a prime number and exclude 0 (e.g. in Z mod 7, 2*4 = 1, so 2 and 4 are each other's inverses).
For floating point values, simple multiplication or addition saturates to +/- Infinity, and there are no values which are the inverses of infinity, so either the set is not closed, or it lacks invertibility.
If on the other hand you want something similar to integer multiplication modulo a power of 2, then multiplication will do - there are elements without an inverse, so it's not a group, but it is closed - you always get a floating point value back. For subsets of floats which are a true group, see lakshmanaraj's answer.
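The invertibility argument can be checked by brute force for small moduli. This is only a sketch; the function names are invented for the example and it assumes m is small enough that x * y does not overflow unsigned.

```c
#include <stdbool.h>

/* True if some y in [1, m) satisfies (x * y) % m == 1. */
bool has_inverse_mod(unsigned x, unsigned m)
{
    for (unsigned y = 1; y < m; ++y)
        if ((x * y) % m == 1)
            return true;
    return false;
}

/* True when every nonzero residue modulo m has a multiplicative
   inverse, i.e. {1, ..., m-1} is a group under multiplication mod m. */
bool nonzero_residues_form_group(unsigned m)
{
    for (unsigned x = 1; x < m; ++x)
        if (!has_inverse_mod(x, m))
            return false;
    return true;
}
```

With m = 7 every nonzero residue has an inverse, while with m = 8 (a power of two) the residue 2 has none.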

Floating point numbers are backed by bits. That means that you can use the integer arithmetic on the integer representation of your floating point values and you will get a group.
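Not sure this is very useful though.
/* Pick the unsigned integer type whose size matches your double. */
#include <assert.h>
#include <stdint.h>
#include <string.h>
typedef double fp_t;
typedef uint64_t bits_t;
fp_t group_operation(fp_t a, fp_t b)
{
    bits_t ia, ib, c;
    assert(sizeof(fp_t) == sizeof(bits_t));
    /* memcpy reinterprets the bits; assigning &a to an integer pointer is not valid C */
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    c = ia * ib; /* unsigned arithmetic wraps modulo 2^64 */
    memcpy(&a, &c, sizeof a); /* reinterpret, not convert; the pattern may still decode as NaN or infinity */
    return a;
}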

Floating point numbers never form a group in the sense you are talking about, because of rounding errors. Consider any of those horrible examples from numerical analysis class, like the fact that 0.1 can't be represented exactly in binary.
But then even computational ints don't form a group in that sense, since they're not closed under multiplication either. (Proof: compute the result of while true do x = x*x. At some point you'll exceed the word size, run out of resources for a BIGNUM, or something.)
update for #UnderAchievementAward:
-- added here so I can get formatting, unlike comments
Since I start with floating point (instead of "real" numbers), can't I avoid any of the 0.1 representational issues? The "x = x*x" problem is why additional operations are needed to keep the result in the valid range.
Okay, but then you're going to run into a situation where there will exist some x,y such that 0 ≤ x,y < max where xy < 0. Or something equally non-intuitive.
The point is that you can certainly define a collection of operations that will look like a group on a finite representation set, but it's going to do weird things if you try to use it as the normal arithmetic operations.

If the group operation is multiplication, then
if n is the highest bit, r1 = 1/power(2,n-1) is the smallest value that you can operate on, and the set
[r1, 2 * r1, 4 * r1, 8 * r1, ..., 1] union [-r1, -2 * r1, -4 * r1, ..., -1] union [0] will be the group that you are expecting.
For integers, [1,0,-1] is the group.
If the group operation can be anything else,
then to form a valid group of n elements, take
A(r) = cos(2*Pi*r/n) for r = 0 to n-1
with the group operation
COS(COSINV(A1)+COSINV(A2))
I don't know whether this is what you want.
Or, if you want an infinite set as a valid group, then the
simple answer is:
GROUP OPERATION = AVG(A1,A2) = (A1+A2)/2
or, given some function F that has FINV as its inverse, FINV((F(A1)+F(A2))/2).
Examples of F are log, inverse, square, etc.
double a = drand();
while ( forever )
{
double b = drand();
a = (a+b)/2;
//invariant - a is a valid floating point number
}
Or, if you want the infinite set in its digital (finite floating point) format as a valid group, then
let L be the lowest float number and H be the highest float number;
then GROUP OPERATION = AVG(A1,A2,L,H) = (A1+A2+L+H)/4.
This operation will always stay within L and H for all positive numbers.
For practical purposes you can take L as four times the lowest decimal number and H as the highest decimal number divided by 4.
double l = DBL_MIN * 4; /* from <float.h>: smallest normalized double, times 4 */
double h = DBL_MAX / 4; /* largest finite double, divided by 4 */
double a = fabs(drand()) / 4;
while ( forever )
{
double b = fabs(drand()) / 4;
a = (a+b+l+h)/4;
//invariant - a is a valid floating point number
}
This covers the subset of all positive float numbers divided by 4.
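A small C sketch of the bounded average, assuming L and H are taken as DBL_MIN * 4 and DBL_MAX / 4 from `<float.h>` and that both inputs lie in [0, DBL_MAX / 4]. The function name is made up for the example. Note that this only demonstrates closure (the result stays finite and normalized); it does not show associativity, an identity element or inverses, so it does not by itself establish a group.

```c
#include <float.h>

/* Assumes 0 <= a, b <= DBL_MAX / 4. */
double bounded_average(double a, double b)
{
    const double low = DBL_MIN * 4;   /* four times the smallest normalized double */
    const double high = DBL_MAX / 4;  /* a quarter of the largest finite double */
    /* The sum is at most about 0.75 * DBL_MAX, so it cannot overflow. */
    return (a + b + low + high) / 4;
}
```

Even at the extremes of the assumed input range the result stays between DBL_MIN and DBL_MAX / 4.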

The integers don't form a group under multiplication -- 0 doesn't have
an inverse.

Related

Matlab: binary valued random variable

Problem 1: I have the decimal representation of a rational. This is the code for generating binary number.
x(1) = rand();
[num, den] = rat(x);
q = 2^32;
x1 = num / den * q;
b = dec2bin(x1, bits);
s = str2num(b')';
UPDATE: The information about Dyadic map expressed in code as
y = mod(x*2, 1)
says that if the input, x is a binary iterate s, then the output should be binary with the bits shifted to the left by one position. But, if I give the input x = 0.1101 or x = 1101 or x= 1 (bit) still the output y is not binary.
The machine understands the input as a decimal and hence returns a decimal base number. How can I use this map to model / represent binary valued random variables?
Problem 2: (SOLVED BASED ON THE ANSWER)
Secondly, I need to do another operation involving the command
(X(:,i)>=threshold)*(X(:,i)>=threshold)';
where X is a matrix of real valued numbers and the variable
threshold = 0.5
and i is the index for the element. I keep getting this error
Error using *
Both logical inputs must be scalar.
To compute elementwise TIMES, use TIMES (.*) instead.
I tried using the .* but still I keep getting this error. How do I solve these 2 problems?
It would be helpful if code is provided.
Problem 1: I have the decimal representation of a rational.
Great. So far so good...
This is the code for generating binary number.
No, this is the code for generating the binary representation of a number. It's the same number that you represented in decimal. I know you think I'm being pedantic, but as far as I can determine, this is the source of your confusion. A number is a number regardless of the representation. Five sheep is five sheep whether you write it in binary, decimal, octal or using the fingers on Hammish's left hand (he's only got 4 left).
Let's change your code slightly.
bits = 32;
r = rand();
[num, den] = rat(r);
q = 2^bits;
x(1) = num / den;
The value stored in x(1) is a rational number. If we type disp(x(1)) in Matlab, it will show us the value of that number in decimal representation. We can change that representation to binary using the dec2bin command:
b(1,:) = dec2bin(round(x(1)*q), bits);
But it's still the same number. (Actually, it's not the same number because we've now limited the precision to bits bits instead of the native 53 bits Matlab generated it with. More on this later.)
But dec2bin returns the value represented in a character string rather than a number. If we want to implement your function and keep down this path of using the binary representation, we could do something like this:
b(1,:) = dec2bin(round(x(1)*q), bits);
for d = 2:bits
b(d,:) = [b(d-1,2:end) '0'];
end
Each left-shift of the binary representation multiplies the value by 2. By ignoring the bit that's now to the left of the binary point, I'm implicitly performing the mod operation. Since we have no additional significant digits to add to the least-significant bit of the value, I just add a zero.
This will work; you get the proper values and can perform whatever operations on them you want. You can represent them as binary or decimal, you can turn them back into fractions, whatever.
But you can achieve the same thing without conversion to a binary representation.
x(1) = num / den;
for d = 2:bits
x(d) = mod(x(d-1)*2, 1);
end
(Note that I left the value in x(1) as a fraction.)
This does exactly the same operation on the exact same numbers. The one difference is that I didn't reduce the precision of the number at the beginning so it uses the full double precision. Now if I want to take these values and represent them as binary, I can still do that (remember to force the value to the integer range first, though).
c = dec2bin(round(x*q), bits);
Here's the result of a test run of both versions:
b =
11110000011101110111110010010001
11100000111011101111100100100010
11000001110111011111001001000100
10000011101110111110010010001000
00000111011101111100100100010000
00001110111011111001001000100000
00011101110111110010010001000000
00111011101111100100100010000000
01110111011111001001000100000000
11101110111110010010001000000000
11011101111100100100010000000000
10111011111001001000100000000000
01110111110010010001000000000000
11101111100100100010000000000000
11011111001001000100000000000000
10111110010010001000000000000000
01111100100100010000000000000000
11111001001000100000000000000000
11110010010001000000000000000000
11100100100010000000000000000000
11001001000100000000000000000000
10010010001000000000000000000000
00100100010000000000000000000000
01001000100000000000000000000000
10010001000000000000000000000000
00100010000000000000000000000000
01000100000000000000000000000000
10001000000000000000000000000000
00010000000000000000000000000000
00100000000000000000000000000000
01000000000000000000000000000000
10000000000000000000000000000000
c =
11110000011101110111110010010001
11100000111011101111100100100001
11000001110111011111001001000010
10000011101110111110010010000101
00000111011101111100100100001001
00001110111011111001001000010011
00011101110111110010010000100101
00111011101111100100100001001010
01110111011111001001000010010100
11101110111110010010000100101000
11011101111100100100001001010001
10111011111001001000010010100001
01110111110010010000100101000011
11101111100100100001001010000101
11011111001001000010010100001010
10111110010010000100101000010101
01111100100100001001010000101010
11111001001000010010100001010100
11110010010000100101000010100111
11100100100001001010000101001111
11001001000010010100001010011101
10010010000100101000010100111010
00100100001001010000101001110100
01001000010010100001010011101000
10010000100101000010100111010000
00100001001010000101001110100000
01000010010100001010011101000000
10000100101000010100111010000000
00001001010000101001110100000000
00010010100001010011101000000000
00100101000010100111010000000000
01001010000101001110100000000000
The two are identical except for the fact that b runs out of precision after 32 bits and c has 53 bits of precision. You can confirm this by running the code above but casting x(1) to a single:
x(1) = single(num / den);
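The equivalence between shifting the bit string left and applying mod(x*2, 1) to the number can also be checked outside Matlab. The following is a minimal C sketch, assuming 32 fractional bits as in the example above; `dyadic_step` and `bits_to_fraction` are names invented here.

```c
#include <stdint.h>

/* One step of the dyadic map y = mod(2*x, 1), for x in [0, 1). */
double dyadic_step(double x)
{
    double y = 2.0 * x;
    return (y >= 1.0) ? y - 1.0 : y;
}

/* Interpret a 32-bit pattern k as the binary fraction k / 2^32.
   This is exact, since a double has 53 significand bits. */
double bits_to_fraction(uint32_t k)
{
    return (double)k / 4294967296.0;
}
```

Taking the first row of b as the pattern 0xF0777C91, one step of the map yields exactly the fraction for 0xE0EEF922, which is the second row of b: the same bits shifted left by one with the top bit dropped.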
Problem 1: (UPDATED)
This reflects your updates that your goal is a Dyadic Mapping.
Think of Matlab as an environment that abstracts away the notion of binary numbers. It doesn't have built-in support for numerical operations on binary numbers. In fact, it doesn't have a numerical representation of bits; it only has strings for bits. You can put a decimal number through a custom function to make it look binary, but to Matlab it's still a float. If you put x = 0.1101 through y = mod(2*x,1), it will treat x as a floating point number.
Problem 2:
I'm not sure what you're trying to do here. The error is caused by trying to matrix multiply two vectors of type logical. Matrix multiplication is only defined for numeric types. A temporary hack would be to add 0.0 to the vectors before multiplying, thus casting the values to double:
((X(:,i)>=threshold)+0.0)*((X(:,i)>=threshold)+0.0)';

Computing floating point accuracy (K&R 2-1)

I found Stevens Computing Services – K & R Exercise 2-1 a very thorough answer to K&R 2-1. This slice of the full code computes the maximum value of a float type in the C programming language.
Unfortunately my theoretical understanding of float values is quite limited. I know they are composed of a significand (mantissa) and a magnitude which is a power of 2.
#include <stdio.h>
#include <limits.h>
#include <float.h>
int main(void)
{
    float flt_a, flt_b, flt_c, flt_m;
    /* FLOAT */
    printf("\nFLOAT MAX\n");
    printf("<float.h> %E ", FLT_MAX);
    flt_a = 2.0;
    flt_b = 1.0;
    while (flt_a != flt_b) {
        flt_m = flt_b; /* MAX POWER OF 2 IN MANTISSA */
        flt_a = flt_b = flt_b * 2.0;
        flt_a = flt_a + 1.0;
    }
    flt_m = flt_m + (flt_m - 1); /* MAX VALUE OF MANTISSA */
    flt_a = flt_b = flt_c = flt_m;
    while (flt_b == flt_c) {
        flt_c = flt_a;
        flt_a = flt_a * 2.0;
        flt_b = flt_a / 2.0;
    }
    printf("COMPUTED %E\n", flt_c);
    return 0;
}
I understand that the latter part basically checks to which power of 2 it's possible to raise the significand with a three variable algorithm. What about the first part?
I can see that a progression of multiples of 2 should eventually determine the value of the significand, but I tried to trace a few small numbers to check how it should work and it failed to find the right values...
======================================================================
What are the concepts on which this program is based, and does this program get more precise as longer and non-integer numbers have to be found?
The first loop determines the number of bits contributing to the significand by finding the least power 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you can express all the integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.
By that same token, having found the first power of 2 whose float representation has worse than unit precision, the maximum float value that does have unit precision is one less. That value is recorded in variable flt_m.
The second loop then tests for the maximum exponent by starting with the maximum unit-precision value, and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float is the value before that final doubling.
Do note, by the way, that all the above supposes a base-2 floating-point representation. You are unlikely to run into anything different, but C does not actually require any specific representation.
With respect to the second part of your question,
does this program get more precise as longer and non-integer numbers have to be found?
the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision, and with numbers larger than can be represented with type int.
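The first loop's idea can be isolated into a small function. This is a sketch under the same base-2 assumption; the function name is invented, and `volatile` is used here (my addition, not part of the original program) so that intermediate results are rounded to float even on platforms that evaluate in higher precision.

```c
#include <float.h>

/* Count doublings of 1.0f until adding 1 no longer changes the value.
   The count is the n for which 2^n + 1 == 2^n in float arithmetic. */
int float_significand_bits(void)
{
    volatile float power = 1.0f;
    volatile float next = 2.0f;
    int bits = 0;
    while (next != power) {
        power = power * 2.0f;
        next = power + 1.0f;
        ++bits;
    }
    return bits;
}
```

On a binary floating-point implementation the result should agree with FLT_MANT_DIG from `<float.h>`, which is 24 for IEEE 754 single precision.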

accuracy of sqrt of integers

I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n);
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was the fractional part of float only has 23 bits and double 52 so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all the digits of y. If we let the fraction of y we can store be x we can write y = x + dx then we want to make sure that whatever dx we choose does not move us to the next integer.
sqrt(x+dx) < sqrt(x) + 1 //solve
dx < 2*sqrt(x) + 1
// e.g for x = 100 dx < 21
// sqrt(100+20) < sqrt(100) + 1
A float can store 23 bits, so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits, as long as the sqrt of what they can store is accurate, the sqrt(fraction) should be sufficient. Now let's look at what happens for integers close to UINT_MAX and the sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^32-2 and double can, they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers, as long as the sqrt of the double always rounds up, it's okay.
First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by reorder on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < n; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds.
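To make the strength reduction concrete, here is a sketch that counts iterations both ways. It relies on the identity (i+1)^2 = i^2 + 2*i + 1, so the running value j always equals i*i and is compared against n itself. The function names are invented for the example.

```c
#include <stdint.h>

/* Number of iterations of: for (i = 0; i*i < n; i++) */
uint64_t count_with_multiply(uint64_t n)
{
    uint64_t count = 0;
    for (uint64_t i = 0; i * i < n; i++)
        count++;
    return count;
}

/* Same loop, with the square maintained by additions only. */
uint64_t count_with_additions(uint64_t n)
{
    uint64_t count = 0;
    for (uint64_t i = 0, j = 0; j < n; j += 2 * i + 1, i++)
        count++;
    return count;
}
```

Both versions visit i = 0..10 for n = 120 and for n = 121.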
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and one can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a Taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.
My whole answer is moot if you have access to IEEE 754 double precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in the loop
a simple way to compute the ceiling sqrt
Otherwise, if for some reason you have a non IEEE 754 compliant platform, or only single precision, you could get the integer part of square root with a simple Newton-Raphson loop. For example in Squeak Smalltalk we have this method in Integer:
sqrtFloor
"Return the integer part of the square root of self"
| guess delta |
guess := 1 bitShift: (self highBit + 1) // 2.
[
delta := (guess squared - self) // (guess + guess).
delta = 0 ] whileFalse: [
guess := guess - delta ].
^guess - 1
Where // is operator for quotient of integer division.
Final guard guess*guess <= self ifTrue: [^guess]. can be avoided if initial guess is fed in excess of exact solution as is the case here.
Initializing with an approximate float sqrt was not an option there, because integers are arbitrarily large and might overflow.
But here, you could seed the initial guess with floating point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
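#include <math.h>
#include <stdint.h>
/* Assumes n < 2^62 so that guess*guess and the signed difference fit in int64_t. */
uint32_t sqrtFloor(uint64_t n)
{
    int64_t diff;
    int64_t delta;
    int64_t guess = (int64_t)sqrt((double)n); /* seed with the floating point estimate */
    if (guess == 0)
        return 0; /* n == 0: avoid dividing by zero below */
    /* signed arithmetic throughout, so a negative diff divides correctly */
    while( (delta = (diff = guess*guess - (int64_t)n) / (guess+guess)) != 0 )
        guess -= delta;
    return (uint32_t)(guess - (diff > 0));
}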
That's a few integer multiplications and divisions, but outside the main loop.
What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. A continued fraction is what you need; see Wikipedia.
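For x > 0, there is
\sqrt{x} = 1 + \cfrac{x-1}{2 + \cfrac{x-1}{2 + \cfrac{x-1}{2 + \cdots}}}.
To make the notation more compact, rewrite the above formula as
\sqrt{x} = 1 + \frac{x-1}{2+}\,\frac{x-1}{2+}\,\frac{x-1}{2+}\cdots
Truncating the continued fraction by removing the tail term (x-1)/2's at each recursion depth, one gets a sequence of approximations of sqrt(x) as below:
1 + \frac{x-1}{2}
1 + \cfrac{x-1}{2 + \cfrac{x-1}{2}}
1 + \cfrac{x-1}{2 + \cfrac{x-1}{2 + \cfrac{x-1}{2}}}
\vdots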
Upper bounds appear at lines with odd line numbers, and get tighter. When the distance between an upper bound and its neighboring lower bound is less than 1, that approximation is what you need. Using that value as the value of cut (here cut must be a floating point number) solves the problem.
For very large numbers, rational numbers should be used, so no precision is lost during conversion between integer and floating point numbers.
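A minimal sketch of the truncation, assuming the expansion meant here is sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + ...)) and that x >= 1. Truncating at depth k is the same as iterating a_1 = 1 + (x-1)/2, a_{k+1} = 1 + (x-1)/(1 + a_k). The function name is invented, and plain doubles are used for brevity, so the bound property only holds up to rounding; the answer's advice to use rationals for very large numbers still applies.

```c
/* k-th truncation (k >= 1) of the continued fraction for sqrt(x).
   For x >= 1, odd k gives upper bounds and even k gives lower bounds. */
double sqrt_convergent(double x, int k)
{
    double a = 1.0 + (x - 1.0) / 2.0;        /* a_1 */
    for (int i = 1; i < k; ++i)
        a = 1.0 + (x - 1.0) / (1.0 + a);     /* a_{i+1} */
    return a;
}
```

For x = 2 the first three truncations are 1.5, about 1.4 and about 1.4167, which bracket sqrt(2) from above, below and above respectively.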

Need Floating Point Precision Using Unsigned Int

I'm working with a microchip that doesn't have room for floating point precision. However, I need to account for fractional values during some equations. So far I've had good luck using the old *100 -> /100 method like so:
increment = (short int)(((value1 - value2)*100 / totalSteps));
// later in the code I loop through the number of totalSteps
// adding back the increment to arrive at the total I want at the precise
// time I need it.
newValue = oldValue + (increment / 100);
This works great for values from 0-255 divided by a totalSteps of up to 300. After 300, the fractional values to the right of the decimal place, become important, because they add up over time of course.
I'm curious if anyone has a better way to save decimal accuracy within an integer paradigm? I tried using *1000 /1000, but that didn't work at all.
Thank you in advance.
Fractions with integers is called fixed point math.
Try Googling "fixed point".
Fixed point tips and tricks are out of the scope of SO answer...
Example: 5 tap FIR filter
// C is the filter coefficients using 2.8 fixed precision.
// 2 MSB (of 10) is for integer part and 8 LSB (of 10) is the fraction part.
// Actual fraction precision here is 1/256.
int FIR_5(int* in, // input samples
int inPrec, // sample fraction precision
int* c, // filter coefficients
int cPrec) // coefficients fraction precision
{
const int coefHalf = (cPrec > 0) ? 1 << (cPrec - 1) : 0; // value of 0.5 using cPrec
int sum = 0;
for ( int i = 0; i < 5; ++i )
{
sum += in[i] * c[i];
}
// sum's precision is X.N. where N = inPrec + cPrec;
// return to original precision (inPrec)
sum = (sum + coefHalf) >> cPrec; // adding coefHalf for rounding
return sum;
}
int main()
{
const int filterPrec = 8;
int C[5] = { 8, 16, 208, 16, 8 }; // 1.0 == 256 in 2.8 fixed point. Filter value are 8/256, 16/256, 208/256, etc.
int W[5] = { 10, 203, 40, 50, 72}; // A sampling window (example)
int res = FIR_5(W, 0, C, filterPrec);
return 0;
}
Notes:
In the above example:
the samples are integers (no fraction)
the coefs have fractions of 8 bit.
8 bit fractions mean that each change of 1 is treated as 1/256. 1 << 8 == 256.
A useful notation is Y.Xu or Y.Xs, where Y is how many bits are allocated for the integer part and X for the fraction. u/s denote unsigned/signed.
when multiplying 2 fixed point numbers, their precision (size of fraction bits) are added to each other.
Example A is 0.8u, B is 0.2U. C=A*B. C is 0.10u
when dividing, use a shift operation to lower the result precision. Amount of shifting is up to you. Before lowering precision it's better to add a half to lower the error.
Example: A=129 in 0.8u which is a little over 0.5 (129/256). We want the integer part so we right shift it by 8. Before that we want to add a half which is 128 (1<<7). So A = (A + 128) >> 8 --> 1.
Without adding a half you'll get a larger error in the final result.
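The two notes about multiplying and about rounding before the shift can be sketched as small helpers. The names `round_shift` and `fx_mul` are made up for this example; values are assumed unsigned, shift counts at least 1, and the product is assumed to fit in 32 bits.

```c
#include <stdint.h>

/* Right shift with rounding to nearest: add half of the discarded
   range before shifting. Assumes shift >= 1. */
uint32_t round_shift(uint32_t value, unsigned shift)
{
    return (value + (UINT32_C(1) << (shift - 1))) >> shift;
}

/* Multiply two unsigned fixed-point values. The raw product has
   a_frac_bits + b_frac_bits fraction bits; shifting by b_frac_bits
   brings it back to a's precision. */
uint32_t fx_mul(uint32_t a, uint32_t b, unsigned b_frac_bits)
{
    return round_shift(a * b, b_frac_bits);
}
```

For instance, 0.5 in 0.8u is 128 and 0.75 in 0.2u is 3; the raw product 384 is 0.375 in 0.10u, and `fx_mul(128, 3, 2)` returns 96, which is 0.375 in 0.8u.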
Don't use this approach.
New paradigm: Do not accumulate using FP math or fixed point math. Do your accumulation and other equations with integer math. Anytime you need to get some scaled value, divide by your scale factor (100), but do the "add up" part with the raw, unscaled values.
Here's a quick attempt at a precise rational (Bresenham-esque) version of the interpolation if you truly cannot afford to directly interpolate at each step.
div_t frac_step = div(target - source, num_steps); // div() is declared in <stdlib.h>
if(frac_step.rem < 0) {
    // Annoying special case to deal with rounding towards zero:
    // switch to floor division so the remainder is non-negative.
    // Alternatively check for the error term slipping to < -num_steps as well
    frac_step.rem += num_steps;
    --frac_step.quot;
}
int error = 0;
int steps_left = num_steps; // separate counter: num_steps must stay fixed as the divisor
do {
    // Add the integer term plus an accumulated fraction
    error += frac_step.rem;
    if(error >= num_steps) {
        // Time to carry
        error -= num_steps;
        ++source;
    }
    source += frac_step.quot;
} while(--steps_left);
A major drawback compared to the fixed-point solution is that the fractional term gets rounded off between iterations if you are using the function to continually walk towards a moving target at differing step lengths.
Oh, and for the record your original code does not seem to be properly accumulating the fractions when stepping, e.g. a 1/100 increment will always be truncated to 0 in the addition no matter how many times the step is taken. Instead you really want to add the increment to a higher-precision fixed-point accumulator and then divide it by 100 (or preferably right shift to divide by a power-of-two) each iteration in order to compute the integer "position".
Do take care with the different integer types and ranges required in your calculations. A multiplication by 1000 will overflow a 16-bit integer unless one term is a long. Go through your calculations and keep track of input ranges and the headroom at each step, then select your integer types to match.
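The higher-precision accumulator idea might look like the sketch below. It assumes values in the 0-255 range from the question, a 32-bit accumulator with 16 fraction bits (a choice made for this example, not something from the original code), and a made-up function name.

```c
#include <stdint.h>

#define FRAC_BITS 16  /* assumed layout: 16 fraction bits in a 32-bit accumulator */

/* Position after k of `steps` equal increments from start towards end.
   Assumes 0 <= start, end <= 255 and 0 <= k <= steps, so the accumulator
   stays non-negative and well inside the int32_t range. */
int32_t ramp_position(int32_t start, int32_t end, int32_t steps, int32_t k)
{
    const int32_t one = (int32_t)1 << FRAC_BITS;
    const int32_t increment = (end - start) * one / steps; /* scaled step */
    int32_t acc = start * one;                             /* scaled accumulator */
    for (int32_t i = 0; i < k; ++i)
        acc += increment;          /* the fraction accumulates here */
    return (acc + one / 2) >> FRAC_BITS; /* round to nearest only when reading */
}
```

With 300 steps from 0 to 255 the per-step increment is less than 1, yet the ramp still lands on 255, because the fraction is kept in the accumulator instead of being truncated at every step.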
Maybe you can simulate floating point behaviour by storing
the value according to the IEEE 754 specification.
So you save the mantissa, exponent, and sign as unsigned int values.
For calculation you then use bitwise addition of mantissa and exponent and so on.
Multiplication and division can be replaced by bitwise addition operations.
I think it is a lot of programming work to emulate that, but it should work.
Your choice of type is the problem: short int is likely to be 16 bits wide. That's why large multipliers don't work - you're limited to +/-32767. Use a 32 bit long int, assuming that your compiler supports it. What chip is it, by the way, and what compiler?

Is Multiplying a decimal number where all results are full integers, considered Floating Point Math?

Sorry for the wordy title. My code is targeting a microcontroller (msp430) with no floating point unit, but this should apply to any similar MCU.
If I am multiplying a large runtime variable with what would normally be considered a floating point decimal number (1.8), is this still treated like floating point math by the MCU or compiler?
My simplified code is:
int multip = 0xf; // Can be from 0-15, not available at compile time
int holder = multip * 625; // 0 - 9375
holder = holder * 1.8; // 0 - 16875
Since the result will always be a positive full, real integer number, is it still floating point math as far as the MCU or compiler are concerned, or is it fixed point?
(I realize I could just multiply by 18, but that would require declaring a 32bit long instead of a 16 bit int then dividing and downcasting for the array it will be put in, trying to skimp on memory here)
The result is not an integer; it rounds to an integer.
9375 * 1.8000000000000000444089209850062616169452667236328125
yields
16875.0000000000004163336342344337026588618755340576171875
which rounds (in double precision floating point) to 16875.
If you write a floating-point multiply, I know of no compiler that will determine that there's a way to do that in fixed-point instead. (That does not mean they do not exist, but it ... seems unlikely.)
I assume you simplified away something important, because it seems like you could just do:
result = multip * 1125;
and get the final result directly.
I'd go for chux's formula if there's some reason you can't just multiply by 1125.
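For comparison, here is a sketch of both routes for the values in the question. The function names are invented; the floating-point version is what the original code effectively asks the compiler to do, and the integer version uses the fact that 625 * 1.8 is exactly 1125.

```c
/* Floating-point route: the multiply by 1.8 is done in double
   and the result is converted back to int. */
int holder_via_double(int multip)
{
    int holder = multip * 625;
    return (int)(holder * 1.8);
}

/* Integer-only route: 625 * 1.8 == 1125, so no FP code is needed. */
int holder_via_int(int multip)
{
    return multip * 1125;
}
```

For multip in 0..15 the largest result is 16875, which still fits in a 16-bit int.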
I'm confident FP code will be generated for
holder = holder * 1.8
To avoid FP and 32-bit math, given the OP values of
int multip = 0xf; // Max 15
unsigned holder = multip * 625; // Max 9375
// holder = holder * 1.8;
// alpha depends on rounding desired, e.g. 2 for round to nearest.
holder += (holder*4u + alpha)/5;
If int x is non-negative, you can compute x *= 1.8 rounded to nearest using only int arithmetic, without overflow unless the final result overflows, with:
x - (x+2)/5 + x
For truncation instead of round-to-nearest, use:
x - (x+4)/5 + x
If x may be negative, some additional work is needed.
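Wrapped as functions, the two formulas look like this. The function names are made up for the example and x is assumed non-negative, as stated above.

```c
/* x * 1.8 rounded to nearest, for non-negative x, int arithmetic only. */
int times_1_8_nearest(int x)
{
    return x - (x + 2) / 5 + x;
}

/* x * 1.8 truncated towards zero, for non-negative x. */
int times_1_8_truncated(int x)
{
    return x - (x + 4) / 5 + x;
}
```

For example, x = 3 gives 5 in both variants (5.4), while x = 1 gives 2 when rounding to nearest and 1 when truncating (1.8).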

Resources