I need to do some math to convert a 16-bit value received from a sensor into a real relative humidity value. It's calculated with the following formula:
RH = 125 * S / 65536 - 6
where S is the 16-bit sensor reading.
In floating-point math, that would be:
uint16_t buf = 0x7C80; // Example
float rh = ((float)buf*125 / 65536)-6;
But I want to avoid floating-point math, as my platform is FPU-less.
What is the most efficient way to calculate and store RH using integer math here? Since it's humidity, the actual value should be between 0 and 100%, but approximation errors can occasionally push rh slightly below 0 or above 100 (if I kept the float, I could just do something like if (rh<0) rh=0; else if (rh>100) rh=100;), and I only care about two digits after the decimal point (%.2f).
Currently I've solved this like this:
int16_t rhint = ((uint32_t)buf*12500 / 65536)-600;
and then working with rhint / 100 and rhint % 100. But is there perhaps a more efficient way?
You could avoid the large intermediate product by writing the right-hand side as
-6 + (128 - 4 + 1) * S / 65536
which becomes
-6 + S/512 - S/16384 + S/65536
i.e. -6 + (S >> 9) - (S >> 14) + (S >> 16) with truncating shifts. You might be able to drop the last term, and possibly the penultimate one too, depending on how precise you want the basis-point truncation to be. A sketch for the 0.01% resolution case follows.
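A minimal sketch of that decomposition, still delivering hundredths of a percent (the function name and the clamping are illustrative, not from the answer): scale the sample by 100 first, then apply the three shifts.
#include <stdint.h>

/* RH in units of 0.01 %RH, using 125/65536 = 1/512 - 1/16384 + 1/65536.
   Each shift truncates separately, so the result can differ from the
   exactly-truncated value by a count or two of 0.01 %RH. */
int16_t rh_centi(uint16_t s)
{
    uint32_t t = (uint32_t)s * 100u;   /* max 6553500, fits in 32 bits */
    int32_t rh = (int32_t)(t >> 9)     /* t / 512   */
               - (int32_t)(t >> 14)    /* t / 16384 */
               + (int32_t)(t >> 16)    /* t / 65536 */
               - 600;                  /* the -6 offset, times 100 */
    if (rh < 0) rh = 0;                /* clamp exactly as the question does */
    else if (rh > 10000) rh = 10000;
    return (int16_t)rh;
}
For the example input 0x7C80 this returns 5479, i.e. 54.79%, matching the questioner's 32-bit multiply version.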
Related
I'm trying to do the Steinhart-Hart temperature calculation on an Arduino. The equation is
1/T = A + B*ln(R) + C*(ln(R))^3
I solved a system of 3 equations to obtain the values of A, B and C, which are:
A = 0.0164872
B = -0.00158538
C = 3.3813e-6
When I plug these into WolframAlpha to solve for T I get a value in Kelvins that makes sense:
T=1/(0.0164872-0.00158538*log2(10000)+3.3813E-6*(log2(10000))^3) solve for T
T = 298.145 Kelvins = 77 Fahrenheit
However when I try to use this equation on my Arduino, I get a very wrong answer, I suspect because doubles do not have enough precision. Here's what I'm using:
double temp = (1 / (A + B*log(R_therm) + C*pow(log(R_therm),3)));
This returns 222 Kelvin instead, which is way off.
So, how can I do a calculation like this on an Arduino? Any advice is greatly appreciated, thanks.
Precision is not the main issue; you could even use float and powf(). A thermistor temperature calculation is not that accurate anyway: the temperature is certainly not better than ±0.1°C accurate, and self-heating of the thermistor is a larger factor.
The real problem is that the constants were derived using log base 2 (that is what WolframAlpha's log2 computes), but C's log() is the natural logarithm (base e), so divide by log(2) to convert (per @Martin R):
// double temp = (1 / (A + B*log(R_therm) + C*pow(log(R_therm),3)));
double temp = (1 / (A + B*log(R_therm)/log(2) + C*pow(log(R_therm)/log(2),3)));
Sample implementation that avoids an unnecessarily slow pow() call:
static const double inv_ln2 = 1.4426950408889634073599246810019; // 1/ln(2)
double log2_R = log(R_therm)*inv_ln2;  // log2(R_therm)
double temp = 1.0 / (A + log2_R*(B + C*log2_R*log2_R));
Yes, floating-point arithmetic has limited precision on most Arduinos.
Have you considered using fixed-point math? Used correctly, it can give you better results. The requirement, however, is that your parameters stay within a rather narrow range, and you have to be careful about unit conversions.
An unsigned long on Arduino is 4 bytes too, so it can hold numbers up to 2^32-1. If using fixed point, you might want to replace 1/T by something like 100000/T, where the numerator constant and T have been scaled according to the desired precision.
You will also need to keep a (mental or paper) model of the number of decimals each variable carries, in order to choose an operation order that does not lose precision.
As for the log2 function, I doubt it is available out of the box for integers. You could either cast the result or reimplement it; there are plenty of resources for this problem, even here on SO. A sketch follows.
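For instance, a minimal integer floor(log2()) can be built from a shift loop (a sketch assuming a 32-bit unsigned input; fractional bits would need an extra fixed-point refinement step):
#include <stdint.h>

/* floor(log2(x)) for x > 0; returns 0..31 for a 32-bit input */
static uint8_t ilog2_floor(uint32_t x)
{
    uint8_t r = 0;
    while (x >>= 1)  /* shift until only the leading bit has been consumed */
        ++r;
    return r;
}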
I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n);
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was that the fraction of a float has only 23 bits, and a double 52, so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Assume we want the sqrt of some number y but can't store all of its digits. If we let the part of y we can store be x, we can write y = x + dx; then we want to make sure that whatever dx turns out to be does not push the square root past the next integer:
sqrt(x+dx) < sqrt(x) + 1 //solve
dx < 2*sqrt(x) + 1
// e.g for x = 100 dx < 21
// sqrt(100+20) < sqrt(100) + 1
A float fraction holds 23 bits, so let y = 2^23 + 2^9. This is more than sufficient, since 2^9 < 2*sqrt(2^23) + 1. It's easy to show the same for double with 64-bit integers. So although they can't store all the digits, as long as the sqrt of the part they can store is accurate, the result should be sufficient. Now let's look at what happens for integers close to UINT_MAX and their square roots:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^32-2 but double can, they get different results for the sqrt. The float version is one integer larger, which is what I want. For 64-bit integers, as long as the sqrt of the double always rounds up, it's okay.
First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execution slot, it should be entirely hidden by out-of-order execution on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < n; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds.
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and one can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a Taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.
All of my answer is useless if you have access to IEEE 754 double-precision floating point, since Stephen Canon has already demonstrated both
a simple way to avoid an imul in the loop, and
a simple way to compute the ceiling sqrt.
Otherwise, if for some reason you have a non-IEEE-754-compliant platform, or only single precision, you can get the integer part of the square root with a simple Newton-Raphson loop. For example, in Squeak Smalltalk we have this method in Integer:
sqrtFloor
    "Return the integer part of the square root of self"
    | guess delta |
    guess := 1 bitShift: (self highBit + 1) // 2.
    [ delta := (guess squared - self) // (guess + guess).
      delta = 0 ] whileFalse: [
        guess := guess - delta ].
    ^guess - 1
where // is the integer-division (floored quotient) operator.
The final guard guess*guess <= self ifTrue: [^guess]. can be avoided when the initial guess is fed in excess of the exact solution, as is the case here.
(Initializing with an approximate float sqrt was not an option in Smalltalk, because its integers are arbitrarily large and the conversion might overflow.)
But here you could seed the initial guess with the floating-point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
#include <math.h>
#include <stdint.h>

uint32_t sqrtFloor(uint64_t n)
{
    int64_t diff;
    int64_t delta;
    uint64_t guess = sqrt(n); /* implicit conversions here... */
    /* Newton-Raphson refinement; the float seed is already close,
       so this converges in very few iterations */
    while ((delta = (diff = guess*guess - n) / (guess + guess)) != 0)
        guess -= delta;
    return guess - (diff > 0); /* step down if we overshot the floor */
}
That's a few integer multiplications and divisions, but outside the main loop.
What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. Continued fractions are what you need; see Wikipedia.
For x > 1, there is
sqrt(x) = 1 + (x-1)/(1 + sqrt(x)) = 1 + (x-1)/(2 + (x-1)/(2 + (x-1)/(2 + ...)))
Truncating the continued fraction, i.e. dropping the trailing (x-1)/(2 + ...) term at some recursion depth, yields a sequence of approximations of sqrt(x):
1 + (x-1)/2
1 + (x-1)/(2 + (x-1)/2)
1 + (x-1)/(2 + (x-1)/(2 + (x-1)/2))
...
The odd-depth truncations are upper bounds, the even-depth ones are lower bounds, and both get tighter with depth. When the distance between an upper bound and its neighboring lower bound is less than 1, that upper bound is the approximation you need; using it as the value of cut (kept as a rational, not an integer) solves the problem.
For very large numbers, rational arithmetic should be used throughout, so that no precision is lost in conversions between integers and floating point.
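A minimal sketch of evaluating such a truncation in integer arithmetic (the function name and the caller-chosen depth are illustrative; with 64-bit intermediates this only works for modest x, and in general a big-integer rational type is needed):
#include <stdint.h>

/* Rational upper bound num/den >= sqrt(x) for x > 1, by truncating
   sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + ...)) at an odd depth. */
static void sqrt_upper(uint64_t x, unsigned depth /* odd */,
                       uint64_t *num, uint64_t *den)
{
    uint64_t n = 2, d = 1;               /* innermost tail term: 2/1 */
    for (unsigned i = 1; i < depth; ++i) {
        /* next tail = 2 + (x-1)/(n/d) = (2n + (x-1)d) / n */
        uint64_t nn = 2*n + (x - 1)*d;   /* overflows quickly for large x */
        d = n;
        n = nn;
    }
    *num = n + (x - 1)*d;  /* 1 + (x-1)/(n/d) = (n + (x-1)d) / n */
    *den = n;
}
For example, depth 1 gives (x+1)/2 and depth 3 gives (x^2+6x+1)/(4x+4), both of which are >= sqrt(x).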
I'm working with a microchip that doesn't have room for floating-point support; however, I need to account for fractional values in some equations. So far I've had good luck using the old *100 -> /100 method, like so:
increment = (short int)(((value1 - value2)*100 / totalSteps));
// later in the code I loop through the number of totalSteps,
// adding back the increment to arrive at the total I want at the
// precise time I need it.
newValue = oldValue + (increment / 100);
This works great for values from 0-255 divided by a totalSteps of up to 300. Beyond 300, the fractional values to the right of the decimal place become important, because they add up over time, of course.
I'm curious whether anyone has a better way to preserve decimal accuracy within an integer paradigm? I tried *1000 -> /1000, but that didn't work at all.
Thank you in advance.
Fractions with integers is called fixed-point math.
Try Googling "fixed point".
Fixed-point tips and tricks are beyond the scope of an SO answer...
Example: a 5-tap FIR filter
// C holds the filter coefficients in 2.8 fixed-point precision:
// the 2 MSBs (of 10) are the integer part and the 8 LSBs are the fraction.
// The actual fraction resolution here is 1/256.
int FIR_5(int* in, // input samples
int inPrec, // sample fraction precision
int* c, // filter coefficients
int cPrec) // coefficients fraction precision
{
const int coefHalf = (cPrec > 0) ? 1 << (cPrec - 1) : 0; // value of 0.5 using cPrec
int sum = 0;
for ( int i = 0; i < 5; ++i )
{
sum += in[i] * c[i];
}
// sum's precision is X.N, where N = inPrec + cPrec
// return to original precision (inPrec)
sum = (sum + coefHalf) >> cPrec; // adding coefHalf for rounding
return sum;
}
int main()
{
const int filterPrec = 8;
int C[5] = { 8, 16, 208, 16, 8 }; // 1.0 == 256 in 2.8 fixed point. Filter values are 8/256, 16/256, 208/256, etc.
int W[5] = { 10, 203, 40, 50, 72}; // A sampling window (example)
int res = FIR_5(W, 0, C, filterPrec);
return 0;
}
Notes:
In the above example:
the samples are integers (no fraction)
the coefs have 8 fraction bits.
8 fraction bits mean that each increment of 1 represents 1/256 (1 << 8 == 256).
A useful notation is Y.Xu or Y.Xs, where Y is how many bits are allocated for the integer part, X for the fraction, and u/s denotes unsigned/signed.
when multiplying 2 fixed-point numbers, their precisions (fraction-bit counts) add up.
Example: A is 0.8u, B is 0.2u, and C = A*B is 0.10u.
when dividing the precision back down, use a right shift. How much to shift is up to you; before lowering the precision, it's better to add a half to reduce the error.
Example: A = 129 in 0.8u, which is a little over 0.5 (129/256). We want the integer part, so we right-shift by 8; before that we add a half, which is 128 (1 << 7). So A = (A + 128) >> 8 --> 1.
Without adding the half you get a larger error in the final result.
Don't use this approach.
New paradigm: do not accumulate using FP math or fixed-point math. Do your accumulation and other equations with integer math. Any time you need to get some scaled value, divide by your scale factor (100), but do the "add up" part with the raw, unscaled values, as sketched below.
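A minimal sketch of that, reusing the question's names (the function wrapper is illustrative):
#include <stdint.h>

/* Accumulate in raw x100 units; divide by the scale factor (100)
   only at the point where the scaled value is actually needed. */
int16_t interpolate(int16_t value1, int16_t value2, int16_t totalSteps)
{
    int32_t increment = ((int32_t)(value1 - value2) * 100) / totalSteps;
    int32_t position  = (int32_t)value2 * 100;        /* raw x100 units */
    for (int16_t i = 0; i < totalSteps; ++i)
        position += increment;                        /* no per-step loss */
    return (int16_t)(position / 100);                 /* scale for output */
}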
Here's a quick attempt at a precise rational (Bresenham-esque) version of the interpolation if you truly cannot afford to directly interpolate at each step.
div_t frac_step = div(target - source, num_steps);
if(frac_step.rem < 0) {
    // Annoying special case: div() rounds towards zero, so convert to a
    // floored quotient with a non-negative remainder.
    // Alternatively keep the truncated form and also check for the error
    // term slipping below -num_steps.
    frac_step.rem += num_steps;
    --frac_step.quot;
}
const unsigned int steps = num_steps; // the carry threshold must stay constant
unsigned int error = 0;
do {
    // Add the integer term plus an accumulated fraction
    error += frac_step.rem;
    if(error >= steps) {
        // Time to carry
        error -= steps;
        ++source;
    }
    source += frac_step.quot;
} while(--num_steps);
A major drawback compared to the fixed-point solution is that the fractional term gets rounded off between iterations if you are using the function to continually walk towards a moving target at differing step lengths.
Oh, and for the record, your original code does not seem to accumulate the fractions properly when stepping; e.g. a 1/100 increment will always be truncated to 0 in the addition, no matter how many times the step is taken. Instead you really want to add the increment to a higher-precision fixed-point accumulator and then divide it by 100 (or, preferably, right-shift to divide by a power of two) each iteration in order to compute the integer "position". A sketch of that follows.
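For instance (a sketch; the x256 scale and names are illustrative, chosen as a power of two so the final divide compiles to a shift):
#include <stdint.h>

/* Walk from source to target in num_steps steps, carrying the fraction
   in a x256 fixed-point accumulator so small increments are not lost. */
int16_t step_toward(int16_t source, int16_t target, int16_t num_steps)
{
    int32_t pos = (int32_t)source * 256;              /* 8 fraction bits */
    int32_t inc = ((int32_t)(target - source) * 256) / num_steps;
    for (int16_t i = 0; i < num_steps; ++i)
        pos += inc;                                   /* fractions accumulate */
    return (int16_t)(pos / 256);                      /* back to integer units */
}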
Do take care with the different integer types and ranges required in your calculations. A multiplication by 1000 will overflow a 16-bit integer unless one term is a long. Go through your calculations, keep track of the input ranges and the headroom at each step, then select your integer types to match.
Maybe you can simulate floating-point behaviour by storing values using the IEEE 754 representation.
So you save the mantissa, exponent, and sign as unsigned int values.
For calculations you then operate on the mantissa and exponent fields with integer and bitwise operations; multiplication and division largely reduce to integer operations on the mantissas plus additions and subtractions of the exponents.
I think it is a lot of programming work to emulate that, but it should work.
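A minimal sketch of the representation this answer describes (the struct and names are mine; full software-float multiply, normalization, and rounding are considerably more work):
#include <stdint.h>

/* IEEE 754 single-precision fields, held in plain unsigned integers. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 stored bits; normals add an implicit leading 1 */
} soft_float;

static soft_float sf_unpack(uint32_t bits)
{
    soft_float f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Multiplication outline: multiply the mantissas (with the implicit 1
   restored), add the unbiased exponents, XOR the signs, then
   renormalize and round. */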
Your choice of type is the problem: short int is likely to be 16 bits wide. That's why large multipliers don't work - you're limited to +/-32767. Use a 32 bit long int, assuming that your compiler supports it. What chip is it, by the way, and what compiler?
I need to perform a multiplication operation on a fixed-point variable x (unsigned 16-bit integer [U16] type with binary point 6 [BP6]) with a coefficient A, which I know will always be between 0 and 1. Code is being written in C for a 32-bit embedded platform.
I know that if I were to also make this coefficient a U16 BP6, then I would end up with a U32 BP12 from the multiplication. I want to rescale this result back down to U16 BP6, so I just lop off the first 10 bits and the last 6.
However, since the coefficient is limited in precision by the number of fractional bits, and I do not necessarily need the full 10 bits of integer, I was thinking that I could just make the coefficient variable A a U16 BP15 to yield a more precise result.
I have worked out the following example (bear with me):
Let's say that x = 172.0 (decimal) and I want to use a coefficient A = 0.82 (decimal). The ideal decimal result would be 172.0 * 0.82 = 141.04.
In binary, x = 0010101100.000000.
If I am using BP6 for A, the binary representation will be either
A_1 = 0000000000.110100 = 0.8125 or
A_2 = 0000000000.110101 = 0.828125
(depending on whether the value is rounded down or up).
Performing the binary multiplication between x and either value of A yields (leaving out leading zeroes):
A_1 * x = 10001011.110000000000 = 139.75
A_2 * x = 10001110.011100000000 = 142.4375
In both cases, trimming off the last 6 bits would not affect the result.
Now, if I expanded A to have BP15, then
A_3 = 0.110100011110110 = 0.82000732421875
and the resulting multiplication yields
A_3 * x = 10001101.000010101001000000000 = 141.041259765625
When trimming the extra 15 fractional bits, the result is
A_3 * x = 10001101.000010 = 141.03125
So it's pretty clear here that expanding the coefficient to have more fractional bits yields a more precise result (at least in my example). Is this something which will hold true in general? Is it good/bad practice? Am I missing or misunderstanding something?
EDIT: I should have said "accuracy" in place of "precision" here. I am looking for a result which is closer to my expected value rather than a result which contains more fractional bits.
Having done similar code, I'd say that what you are doing will hold true in general, with the following concerns.
It is very easy to get unexpected overflow when shifting your binary point around. Rigorous testing/analysis and/or run-time overflow detection are recommended. Notable failure: Ariane 5.
You want precision, thus I disagree with "lop off ... last 6". Instead, I recommend rounding your results as processing time allows: use the most significant bit of the portion being lopped off to adjust the result, as sketched below.
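A minimal sketch of that rounding with the question's formats (the function name is mine): add half the weight of the dropped bits before shifting.
#include <stdint.h>

/* U16 BP6 * U16 BP15 -> U32 BP21, rescaled to BP6 with rounding.
   Adding 1 << 14 (half the weight of the 15 dropped bits) rounds to
   nearest instead of truncating. */
uint16_t mul_bp6(uint16_t x_bp6, uint16_t a_bp15)
{
    uint32_t prod_bp21 = (uint32_t)x_bp6 * a_bp15;
    return (uint16_t)((prod_bp21 + (1uL << 14)) >> 15);
}
With the question's numbers (x = 11008 as U16 BP6, A_3 = 26870 as U16 BP15) this returns 9027, i.e. 141.046875, which is closer to the ideal 141.04 than the truncated 141.03125.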
Sorry for the wordy title. My code is targeting a microcontroller (msp430) with no floating point unit, but this should apply to any similar MCU.
If I am multiplying a large runtime variable with what would normally be considered a floating point decimal number (1.8), is this still treated like floating point math by the MCU or compiler?
My simplified code is:
int multip = 0xf; // Can be from 0-15, not available at compile time
int holder = multip * 625; // 0 - 9375
holder = holder * 1.8; // 0 - 16875
Since the result will always be a positive whole number, is it still floating-point math as far as the MCU or compiler is concerned, or is it fixed point?
(I realize I could just multiply by 18, but that would require declaring a 32-bit long instead of a 16-bit int, then dividing and downcasting for the array it will be put in; I'm trying to skimp on memory here.)
The result is not an integer; it rounds to an integer.
9375 * 1.8000000000000000444089209850062616169452667236328125
yields
16875.0000000000004163336342344337026588618755340576171875
which rounds (in double precision floating point) to 16875.
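A quick desktop check of this (a sketch; plain standard C):
#include <stdio.h>

int main(void)
{
    double d = 9375 * 1.8;  /* exact real product: 16875.0000000000004... */
    printf("%.17g\n", d);   /* prints 16875: the product rounded to the integer */
    printf("%d\n", (int)d); /* 16875 */
    return 0;
}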
If you write a floating-point multiply, I know of no compiler that will determine that there's a way to do that in fixed-point instead. (That does not mean they do not exist, but it ... seems unlikely.)
I assume you simplified away something important, because it seems like you could just do:
result = multip * 1125;
and get the final result directly.
I'd go for chux's formula if there's some reason you can't just multiply by 1125.
You can be confident FP code will be generated for
holder = holder * 1.8
To avoid FP and 32-bit math, given the OP's values:
int multip = 0xf;               // Max 15
unsigned holder = multip * 625; // Max 9375
// holder = holder * 1.8;
// alpha selects the rounding: e.g. 2 for round-to-nearest, 0 for truncation.
unsigned alpha = 2;
holder += (holder*4u + alpha)/5;
The peak intermediate holder*4u + alpha is 37502, which still fits a 16-bit unsigned.
If int x is non-negative, you can compute x *= 1.8 rounded to nearest using only int arithmetic, without overflow unless the final result overflows, with:
x - (x+2)/5 + x
For truncation instead of round-to-nearest, use:
x - (x+4)/5 + x
If x may be negative, some additional work is needed.
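For non-negative x, these expressions drop straight into helpers (a sketch; the names are mine):
/* x * 1.8 in pure int arithmetic, valid for x >= 0 */
static int mul_1p8_nearest(int x) { return x - (x + 2) / 5 + x; } /* round to nearest */
static int mul_1p8_trunc(int x)   { return x - (x + 4) / 5 + x; } /* truncate */
With the question's worst case, mul_1p8_nearest(9375) returns 16875, and no intermediate value exceeds the final result, so the whole computation fits 16-bit int arithmetic.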