C: Representing a fraction without floating points

C: Representing a fraction without floating points - c

I'm writing some code for an embedded system (MSP430) without hardware floating point support. Unfortunately, I will need to work with fractions in my code as I'm doing ranging, and a short-range sensor with a precision of 1m isn't a very good sensor.
I can do as much of the math as I need in ints, but by the end there are two values that I will definitely need to have fractions on; range and speed. Range will be a value between 2-500 (cm), while speed should be no higher than -10 to 10 (ms^-1). I am unsure how to represent them without floating point values, if it is possible. A simple way of rounding the fractions up or down would be best.
Some sample code I have:
voltage_difference_new = ((memval3_new - memval4_new)*3.3/4096);
where memval3_new and memval4_new are ints, but voltage_difference_new is a float.
Please let me know if more information is needed. Or if there is a blindingly easy fix.

You have rather answered your own question with the statement:
Range will be a value between 2-500 (cm),
Work in centimetre (or even millimetre) rather than metre units.
That said you don't need floating-point hardware to do floating point math; the compiler will support "soft" floating point and generate the code to perform floating point operations - it will be slower than hardware floating point or integer operations, but that may not be an issue in your application.
Nonetheless there are many reasons to avoid floating-point even with hardware support and it does not sound like your case for FP is particularly compelling, but it is hard to tell without seeing your code and a specific example. In 32 years of embedded systems development I have seldom resorted to FP even for trig, log, sqrt and digital signal processing.
A general method is to use a fixed point presentation. My earlier suggestion of using centimetres is an example of decimal fixed point, but for greater efficiency you should use binary fixed point. For example you might represent distance in 1/1024 metre units (giving > 1 mm precision). Because the fixed point is binary, all the necessary rescaling can be done with shifts rather than more expensive multiply/divide operations.
For example, say you have an 8 bit sensor generating linear output 0 to 255 corresponding to a real distance 0 to 0.5 metre.
#define Q10_SHIFT = 10 ; // 10 bits fractional (1/1024)
typedef int q10_t ;
#define ONE_METRE = (1 << Q10_SHIFT)
#define SENSOR_MAX = 255
#define RANGE_MAX = (ONE_METRE/2)
q10_t distance = read_sensor() * RANGE_MAX / SENSOR_MAX ;
distance is in Q10 fixed point representation. Performing addition and subtraction on such is normal integer arithmentic, multiply and divide require scaling:
int q10_add( q10_t a, q10_t b )
{
return a + b ;
}
int q10_sub( q10_t a, q10_t b )
{
return a - b ;
}
int q10_mul( q10_t a, q10_t b )
{
return (a * b) >> Q10_SHIFT ;
}
int q10_div( q10_t a, q10_t b )
{
return (a << Q10_SHIFT) / b ;
}
Of course you may want to be able to mix types and say multiply a q10_t by an int - providing a comprehensive library for fixed-point can get complex. Personally for that I use C++ where you have classes, function overloading and operator overloading to support more natural code. But unless your code has a great deal of general fixed point math, it may be simpler to code specific fixed point operations ad-hoc.
To take the one example you have provided:
double voltage_difference_new = ((memval3_new - memval4_new)*3.3/4096);
The floating-point there is trivially removed using millivolts:
int voltage_difference_new_mv = ((memval3_new - memval4_new) * 3300) /4096 ;
The issue then perhaps becomes one of presentation. For example if you have to present or report the value in volts to a user. In that case:
int volt_fract = abs(voltage_difference_new_mv % 1000) ;
int volt_whole = voltage_difference_new_mv / 1000 ;
printf( "%d.%04d", volt_whole, volt_fract ) ;

Related

Better precision with Arduino (floats)

I'm trying to do the Steinhart-Hart temperature calculation on an Arduino. The equation is
I solved a system of 3 equations to obtain the values of A, B and C, which are:
A = 0.0164872
B = -0.00158538
C = 3.3813e-6
When I plug these into WolframAlpha to solve for T I get a value in Kelvins that makes sense:
T=1/(0.0164872-0.00158538*log2(10000)+3.3813E-6*(log2(10000))^3) solve for T
T = 298.145 Kelvins = 77 Fahrenheit
However when I try to use this equation on my Arduino, I get a very wrong answer, I suspect because doubles do not have enough precision. Here's what I'm using:
double temp = (1 / (A + B*log(R_therm) + C*pow(log(R_therm),3)));
This returns 222 Kelvin instead, which is way off.
So, how can I do a calculation like this in Arduino?? Any advice is greatly appreciated, thanks.

Precision is not the main issue. Could even use float and powf(). A thermistor temperature calculation is not that accurate. After all the temperature is certainly not better than ±0.1°C accurate. Self heating of the thermistor is a larger factor.
OP's C code assumes log base 2, use log base e log() as the constants were derived using log base 2. #Martin R
// double temp = (1 / (A + B*log(R_therm) + C*pow(log(R_therm),3)));
double temp = (1 / (A + B*log(R_therm)/log(2) + C*pow(log(R_therm)/log(2),3)));`
Sample implementation, that avoids an unnecessary slow pow() call.
static const inv_ln2 = 1.4426950408889634073599246810019;
double ln2_R = log(R_therm)*inv_ln2;
double temp = 1.0 / (A + ln2_R*(B + C*ln2_R*ln2_R));

Yes, floating point arithmetic has limited precision on most arduinos.
Have you considered using fixed precision? If used correctly, this might give you better results. The requirement for this is to have rather narrow parameters, however, and be careful about unit conversions.
An unsigned long on arduino is 4 bytes too, so it can contain numbers up to 2^32-1. If using fixed point, you might want to replace this 1/T by something like 100000/T, where the numerator constant and T have been scaled according to the desired precision.
You will also need to keep a (mental or paper) model of the number of decimals each variable contains, in order to optimize the operation order not to lose precision.
For the log2 function, I doubt it is available out of the box for integers. You could either cast the result or reimplement it. There is plenty of ressources for this problem, even here on SO.

Notation for fixed point representation

I'm looking for a commonly understandable notation to define a fixed point number representation.
The notation should be able to define both a power-of-two factor (using fractional bits) and a generic factor (sometimes I'm forced to use this, though less efficient). And also an optional offset should be defined.
I already know some possible notations, but all of them seem to be constrained to specific applications.
For example the Simulink notation would perfectly fit my needs, but it's known only in the Simulink world. Furthermore the overloaded usage of the fixdt() function is not so readable.
TI defines a really compact Q Formats, but the sign is implicit, and it doesn't manage a generic factor (i.e. not a power-of-two).
ASAM uses a generic 6-coefficient rational function with 2nd-degree numerator and denominator polynomials (COMPU_METHOD). Very generic, but not so friendly.
See also the Wikipedia discussion.
The question is only about the notation (not efficiency of the representation nor fixed-point manipulation). So it's a matter of code readability, maintenability and testability.

Ah, yes. Having good naming annotations is absolutely critical to not introducing bugs with fixed point arithmetic. I use an explicit version of the Q notation which handles
any division between M and N by appending _Q<M>_<N> to the name of the variable. This also makes it possible to include the signedness as well. There are no run-time performance penalties for this. Example:
uint8_t length_Q2_6; // unsigned, 2 bit integer, 6 bit fraction
int32_t sensor_calibration_Q10_21; // signed (1 bit), 10 bit integer, 21 bit fraction.
/*
* Calculations with the bc program (with '-l' argument):
*
* sqrt(3)
* 1.73205080756887729352
*
* obase=16
* sqrt(3)
* 1.BB67AE8584CAA73B0
*/
const uint32_t SQRT_3_Q7_25 = 1 << 25 | 0xBB67AE85U >> 7; /* Unsigned shift super important here! */
In case someone have not fully understood why such annotation is extremely important,
Can you spot the if there is an bug in the following two examples?
Example 1:
speed_fraction = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed = fix32_umul(25, speed_fraction, speed_fraction);
tmp1 = fix32_umul(25, squared_speed, SQRT_3);
tmp2 = fix32_umul(12, tmp1 >> (25-12), motor_volt << 12);
Example 2:
speed_fraction_Q7_25 = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed_Q7_25 = fix32_umul(25, speed_fraction_Q7_25, speed_fraction_Q7_25);
tmp1_Q7_25 = fix32_umul(25, squared_speed_Q7_25, SQRT_3_Q1_31);
tmp2_Q20_12 = fix32_umul(12, tmp1_Q7_25 >> (25-12), motor_volt << 12);
Imagine if one file contained #define SQRT_3 (1 << 25 | 0xBB67AE85U >> 7) and another file contained #define SQRT_3 (1 << 31 | 0xBB67AE85U >> 1) and code was moved between those files. For example 1 this has a high chance of going unnoticed and introduce the bug that is present in example 2 which here is done deliberately and has a zero chance of being done accidentally.

Actually Q format is the most used representation in commercial applications: you use is when you need to deal with fractional numbers FAST and your processor does not have a FPU (floating point unit) is it cannot use float and double data types natively - it has to emulate instructions for them which are very expensive.
usually you use Q format to represent only the fractional part, though this not a must, you get more precision for your representation. Here's what you need to consider:
number of bits you use (Q15 uses 16 bitdata types, usually short int)
the first bit is the sign bit (out of 16 bits you are left with 15 for data value)
the rest of the bits are used to store the fractional part of your number.
since you are representing fractional numbers your value is somewhere in [0,1)
you can choose to use some bits for the integer part as well, but you would loose precision - e.g if you wanted to represent 3.3 in Q format, you would need 1 bit for sign, 2 bits for the integer part, and are left with 13 bits for the fractional part (assuming you are using 16 bits representation)-> this format is called 2Q13
Example: Say you want to represent 0.3 in Q15 format; you apply the Rule of Three:
1 = 2^15 = 32768 = 0x8000
0.3 = X
-------------
X = 0.3*32768 = 9830 = 0x666
You lost precision by doing this but at least the computation is fast now.

In C, you can't use a user defined type like a builtin one. If you want to do that, you need to use C++. In that language you can define a class for your fixed point type, overload all the arithmetic operators (+, -, *, /, %, +=, -=, *=, /=, %=, --, ++, cast to other types), so that usage of the instances of this class really behave like the builtin types.
In C, you need to do what you want explicitly. There are two basic approaches.
Approach 1: Do the fixed point adjustments in the user code.
This is overhead-free, but you need to remember to do the correct adjustments. I believe, it is easiest to just add the number of past point bits to the end of the variable name, because the type system won't do you much good, even if you typedef'd all the point positions you use. Here is an example:
int64_t a_7 = (int64_t)(7.3*(1<<7)); //a variable with 7 past point bits
int64_t b_5 = (int64_t)(3.78*(1<<5)); //a variable with 5 past point bits
int64_t sum_7 = a_7 + (b_5 << 2); //to add those two variables, we need to adjust the point position in b
int64_t product_12 = a_7 * b_5; //the product produces a number with 12 past point bits
Of course, this is a lot of hassle, but at least you can easily check at every point whether the point adjustment is correct.
Approach 2: Define a struct for your fixed point numbers and encapsulate the arithmetic on it in a bunch of functions. Like this:
typedef struct FixedPoint {
int64_t data;
uint8_t pointPosition;
} FixedPoint;
FixedPoint fixed_add(FixedPoint a, FixedPoint b) {
if(a.pointPosition >= b.PointPosition) {
return (FixedPoint){
.data = a.data + (b.data << a.pointPosition - b.pointPosition),
.pointPosition = a.pointPosition
};
} else {
return (FixedPoint){
.data = (a.data << b.pointPosition - a.pointPosition) + b.data,
.pointPosition = b.pointPosition
};
}
}
This approach is a bit cleaner in the usage, however, it introduces significant overhead. That overhead consists of:
The function calls.
The copying of the structs for parameter and result passing, or the pointer dereferences if you use pointers.
The need to calculate the point adjustments at runtime.
This is pretty much similar to the overhead of a C++ class without templates. Using templates would move some decisions back to compile time, at the cost of loosing flexibility.
This object based approach is probably the most flexible one, and it allows you to add support for non-binary point positions in a transparent way.

Need Floating Point Precision Using Unsigned Int

I'm working with a microchip that doesn't have room for floating point precision, however. I need to account for fractional values during some equations. So far I've had good luck using the old *100 -> /100 method like so:
increment = (short int)(((value1 - value2)*100 / totalSteps));
// later in the code I loop through the number of totolSteps
// adding back the increment to arrive at the total I want at the precise time
// time I need it.
newValue = oldValue + (increment / 100);
This works great for values from 0-255 divided by a totalSteps of up to 300. After 300, the fractional values to the right of the decimal place, become important, because they add up over time of course.
I'm curious if anyone has a better way to save decimal accuracy within an integer paradigm? I tried using *1000 /1000, but that didn't work at all.
Thank you in advance.

Fractions with integers is called fixed point math.
Try Googling "fixed point".
Fixed point tips and tricks are out of the scope of SO answer...
Example: 5 tap FIR filter
// C is the filter coefficients using 2.8 fixed precision.
// 2 MSB (of 10) is for integer part and 8 LSB (of 10) is the fraction part.
// Actual fraction precision here is 1/256.
int FIR_5(int* in, // input samples
int inPrec, // sample fraction precision
int* c, // filter coefficients
int cPrec) // coefficients fraction precision
{
const int coefHalf = (cPrec > 0) ? 1 << (cPrec - 1) : 0; // value of 0.5 using cPrec
int sum = 0;
for ( int i = 0; i < 5; ++i )
{
sum += in[i] * c[i];
}
// sum's precision is X.N. where N = inPrec + cPrec;
// return to original precision (inPrec)
sum = (sum + coefHalf) >> cPrec; // adding coefHalf for rounding
return sum;
}
int main()
{
const int filterPrec = 8;
int C[5] = { 8, 16, 208, 16, 8 }; // 1.0 == 256 in 2.8 fixed point. Filter value are 8/256, 16/256, 208/256, etc.
int W[5] = { 10, 203, 40, 50, 72}; // A sampling window (example)
int res = FIR_5(W, 0, C, filterPrec);
return 0;
}
Notes:
In the above example:
the samples are integers (no fraction)
the coefs have fractions of 8 bit.
8 bit fractions mean that each change of 1 is treated as 1/256. 1 << 8 == 256.
Useful notation is Y.Xu or Y.Xs. where Y is how many bits are allocated for the integer part and X for he fraction. u/s denote signed/unsigned.
when multiplying 2 fixed point numbers, their precision (size of fraction bits) are added to each other.
Example A is 0.8u, B is 0.2U. C=A*B. C is 0.10u
when dividing, use a shift operation to lower the result precision. Amount of shifting is up to you. Before lowering precision it's better to add a half to lower the error.
Example: A=129 in 0.8u which is a little over 0.5 (129/256). We want the integer part so we right shift it by 8. Before that we want to add a half which is 128 (1<<7). So A = (A + 128) >> 8 --> 1.
Without adding a half you'll get a larger error in the final result.

Don't use this approach.
New paradigm: Do not accumulate using FP math or fixed point math. Do your accumulation and other equations with integer math. Anytime you need to get some scaled value, divide by your scale factor (100), but do the "add up" part with the raw, unscaled values.

Here's a quick attempt at a precise rational (Bresenham-esque) version of the interpolation if you truly cannot afford to directly interpolate at each step.
div_t frac_step = div(target - source, num_steps);
if(frac_step.rem < 0) {
// Annoying special case to deal with rounding towards zero.
// Alternatively check for the error term slipping to < -num_steps as well
frac_step.rem = -frac_step.rem;
--frac_step.quot;
}
unsigned int error = 0;
do {
// Add the integer term plus an accumulated fraction
error += frac_step.rem;
if(error >= num_steps) {
// Time to carry
error -= num_steps;
++source;
}
source += frac_step.quot;
} while(--num_steps);
A major drawback compared to the fixed-point solution is that the fractional term gets rounded off between iterations if you are using the function to continually walk towards a moving target at differing step lengths.
Oh, and for the record your original code does not seem to be properly accumulating the fractions when stepping, e.g. a 1/100 increment will always be truncated to 0 in the addition no matter how many times the step is taken. Instead you really want to add the increment to a higher-precision fixed-point accumulator and then divide it by 100 (or preferably right shift to divide by a power-of-two) each iteration in order to compute the integer "position".
Do take care with the different integer types and ranges required in your calculations. A multiplication by 1000 will overflow a 16-bit integer unless one term is a long. Go through you calculations and keep track of input ranges and the headroom at each step, then select your integer types to match.

Maybe you can simulate floating point behaviour by saving
it using the IEEE 754 specification
So you save mantisse, exponent, and sign as unsigned int values.
For calculation you use then bitwise addition of mantisse and exponent and so on.
Multiplication and Division you can replace by bitwise addition operations.
I think it is a lot of programming staff to emulate that but it should work.

Your choice of type is the problem: short int is likely to be 16 bits wide. That's why large multipliers don't work - you're limited to +/-32767. Use a 32 bit long int, assuming that your compiler supports it. What chip is it, by the way, and what compiler?

What are the approaches to SW floating-point implementation?

I need to be able to use floating-point arithmetic under my dev environment in C (CPU: ~12 MHz Motorola 68000). The standard library is not present, meaning it is a bare-bones C and no - it isn't gcc due to several other issues
I tried getting the SoftFloat library to compile and one other 68k-specific FP library (name of which escapes me at this moment), but their dependencies cannot be resolved for this particular platform - mostly due to libc deficiencies. I spent about 8 hrs trying to overcome the linking issues, until I got to a point where I know I can't get further.
However, it took mere half an hour to come up with and implement the following set of functions that emulate floating-point functionality sufficiently for my needs.
The basic idea is that both fractional and non-fractional part are 16-bit integers, thus there is no bit manipulation.
The nonfractional part has a range of [-32767, 32767] and the fractional part [-0.9999, +0.9999] - which gives us 4 digits of precision (good enough for my floating-point needs - albeit wasteful).
It looks to me, like this could be used to make a faster, smaller - just 2 Bytes-big - alternative version of a float with ranges [-99, +99] and [-0.9, +0.9]
The question here is, what other techniques - other than IEEE - are there to make an implementation of basic floating-point functionality (+ - * /) using fixed-point functionality?
Later on, I will need some basic trigonometry, but there are lots of resources on net for that.
Since the HW has 2 MBs of RAM, I don't really care if I can save 2 bytes per soft-float (say - by reserving 9 vs 7 bits in an int). Thus - 4 bytes are good enough.
Also, from brief looking at the 68k instruction manual (and the cycle costs of each instruction), I made few early observations:
bit-shifting is slow, and unless performance is of critical importance (not the case here), I'd prefer easy debugging of my soft-float library versus 5-cycles-faster code. Besides, since this is C and not 68k ASM, it is obvious that speed is not a critical factor.
8-bit operands are as slow as 16-bit (give or take a cycle in most cases), thus it looks like it does not make much sense to compress floats for the sake of performance.
What improvements / approaches would you propose to implement floating-point in C using fixed-point without any dependency on other library/code?
Perhaps it would be possible to use a different approach and do the operations on frac & non-frac parts at the same time?
Here is the code (tested only using the calculator), please ignore the C++ - like declaration and initialization in the middle of functions (I will reformat that to C-style later):
inline int Pad (int f) // Pad the fractional part to 4 digits
{
if (f < 10) return f*1000;
else if (f < 100) return f*100;
else if (f < 1000) return f*10;
else return f;
}
// We assume fractional parts are padded to full 4 digits
inline void Add (int & b1, int & f1, int b2, int f2)
{
b1 += b2;
f1 +=f2;
if (f1 > 9999) { b1++; f1 -=10000; }
else if (f1 < -9999) { b1--; f1 +=10000; }
f1 = Pad (f1);
}
inline void Sub (int & b1, int & f1, int b2, int f2)
{
// 123.1652 - 18.9752 = 104.1900
b1 -= b2; // 105
f1 -= f2; // -8100
if (f1 < 0) { b1--; f1 +=10000; }
f1 = Pad (f1);
}
// ToDo: Implement a multiplication by float
inline void Mul (int & b1, int & f1, int num)
{
// 123.9876 * 251 = 31120.8876
b1 *=num; // 30873
long q = f1*num; //2478876
int add = q/10000; // 247
b1+=add; // 31120
f1 = q-(add*10000);//8876
f1 = Pad (f1);
}
// ToDo: Implement a division by float
inline void Div (int & b1, int & f1, int num)
{
// 123.9876 / 25 = 4.959504
int b2 = b1/num; // 4
long q = b1 - (b2*num); // 23
f1 = ((q*10000) + f1) / num; // (23000+9876) / 25 = 9595
b1 = b2;
f1 = Pad (f1);
}

You are thinking in the wrong base for a simple fixed point implementation. It is much easier if you use the bits for the decimal place. e.g. using 16 bits for the integer part and 16 bits for the decimal part (range -32767/32767, precision of 1/2^16 which is a lot higher precision than you have).
The best part is that addition and subtraction are simple (just add the two parts together). Multiplication is a little bit trickier: you have to be aware of overflow and so it helps to do the multiplication in 64 bit. You also have to shift the result after the multiplication (by however many bits are in your decimal).
typedef int fixed16;
fixed16 mult_f(fixed16 op1, fixed16 op2)
{
/* you may need to do something tricky with upper and lower if you don't
* have native 64 bit but the compiler might do it for us if we are lucky
*/
uint64_t tmp;
tmp = (op1 * op2) >> 16;
/* add in error handling for overflow if you wish - this just wraps */
return tmp & 0xFFFFFFFF;
}
Division is similar.
Someone might have implemented almost exactly what you need (or that can be hacked to make it work) that's called libfixmath

If you decide to use fixed-point, the whole number (i.e both int and fractional parts) should be in the same base. Using binary for the int part and decimal for the fractional part as above is not very optimal and slows down the calculation. Using binary fixed-point you'll only need to shift an appropriate amount after each operation instead of long adjustments like your idea. If you want to use Q16.16 then libfixmath as dave mentioned above is a good choice. If you want a different precision or floating point position such as Q14.18, Q19.13 then write your own library or modify some library for your own use. Some examples
BoostGSoC15/fixed_point
https://github.com/johnmcfarlane/cnl
See also What's the best way to do fixed-point math?
If you want a larger range then floating point maybe the better choice. Write a library as your own requirements, choose a format that is easy to implement and easy to achieve good performance in software, no need to follow IEEE 754 specifications (which is only fast with hardware implementations due to the odd number of bits and strange exponent bits' position) unless you intend to exchange data with other devices. For example a format of exp.sign.significand with 7 exponent bits followed by a sign bit and then 24 bits of significand. The exponent doesn't need to be biased, so to get the base only an arithmetic shift by 25 is needed, the sign bit will also be extended. But in case the shift is slower than a subtraction then excess-n is better.

Faster way of finding multiple of double

If have the following C function, used to determine if one number is a multiple of another to an arbirary tolerance
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x,double mod)
{
return(fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining in fabs. I'm trying to figure a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up would not be an issue, typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel, AMD on desktop, and ARM or SH4 on mobile devices. VisualStudio 2008 is the compiler.

Do you really have to use modulo for this?
Wouldn't it be possible to just result = x / mod and then check if the decimal part of result is close to 0. For instance:
11 / 5.4999 = 2.000003 ==> 0.000003 < TOLERANCE
Or something like that.

Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the cpu and compiler:
gcc has a builtin replacement for
that if you give it the right compile
flags or if you use __builtin_fmod
explicitly. This then might map the
operation on a small number of
assembler instructions.
there may be special units like SSE
on intel processors where this
operation is implemented more
efficiently
By such tricks, depending on your environment (you didn't tell which) the time may vary from some clock cycles to some hundred. I think best is to look into the documentation of your compiler and cpu for that particular operation.

The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
* If applying the exponent would eliminate the fraction bits
* then for double precision resolution it is a multiple.
* Note: lsb may require some massaging.
*/
if (exp > lsb)
return (true);
if (exp < 0)
return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.

I think you need to inspect the bowels of your C RTL fmod() function: X86 FPU's have 'FPREM/FPREM1' instructions which computes remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.

I have not tested this at all, but from the way I understand fmod this should be equivalent inlined, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (also, I don't even know for sure if this is correct).
#include <math.h>
int IsMultipleOf(double x, double mod) {
long n = x / mod; // You should probably test for /0 or NAN result here
double new_x = mod * n;
double delta = x - new_x;
return fabs(delta) < TOLERANCE; // and for NAN result from fabs
}

Maybe you can get away with long long instead of double if you have comparable scale of data. For example long long would be enough for over 60 astronomical units in micrometer resolution.

Does it need to be double precision ? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#define TOLERANCE 0.0001f
bool IsMultipleOf(float x, float mod)
{
return(fabsf(fmodf(x, mod)) < TOLERANCE);
}

I presume modulo looks a little like this on the inside:
mod(x,m) {
while (x > m) {
x = x - m
}
return x
}
I think that through some sort of search i could be optimised: eg:
fastmod(x,m) {
q = 1
while (m * q < x) {
q = q * 2
}
return mod((x - (q / 2) * m), m)
}
You might even choose to replace the finall call to mod with annother call to fastmod, adding the condition that if x < m then to return x.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

C: Representing a fraction without floating points - c

Related

Better precision with Arduino (floats)

Notation for fixed point representation

Need Floating Point Precision Using Unsigned Int

What are the approaches to SW floating-point implementation?

Faster way of finding multiple of double

Categories

Resources