I'm doing some programming on a 32-bit machine.
As part of an algorithm to calculate collisions between objects in 3D I have to get the result of a dot product:
//Vector3 components are signed int
signed long GaMhVecDotL(const Vector3 *p_a, const Vector3 *p_b)
{
return ((p_a->vx * p_b->vx + p_a->vy * p_b->vy + p_a->vz * p_b->vz));
}
In some cases this result overflows the 32-bit return value (signed long).
I have tried a couple of things:
- Bit-shifting the Vector3 components before sending them to this function, to reduce their size. This works in most cases, but I lose precision, and that makes the algorithm fail in some edge cases.
- Storing the result of the operation in a long long variable. Although this compiles, it doesn't seem to store the values correctly (this is for some PSX homebrew; the compiler and tools haven't been updated since the late '90s).
I actually don't need to know the full result of the dot product; I just need to know whether the result is positive, negative or 0, while preserving as much precision as possible.
Is there any way I can store the result of that operation (p_a->vx * p_b->vx + p_a->vy * p_b->vy + p_a->vz * p_b->vz) in a temp 64 bit var (or 2x32 bit) that would allow me later to check if this var is positive, negative or 0?
This uses 32-bit math (given int is 32-bit). Storing the return value in a 64-bit variable does not make the computation 64-bit:
// 32-bit math
p_a->vx * p_b->vx + p_a->vy * p_b->vy + p_a->vz * p_b->vz
Instead, use 64-bit math in the expression:
// v-----------------v multiplication now done as 64-bit
long long dot = (1LL*p_a->vx*p_b->vx) + (1LL*p_a->vy*p_b->vy) + (1LL*p_a->vz*p_b->vz);
Then check the sign:
if (dot < 0) return -1;
return dot > 0;
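Putting the two together, a minimal sketch of a sign-only dot product (the function name is illustrative; this assumes the compiler gets long long arithmetic right even where 64-bit storage and printing seemed unreliable):

/* Returns -1, 0 or 1 according to the sign of the 64-bit dot product. */
int GaMhVecDotSign(const Vector3 *p_a, const Vector3 *p_b)
{
    long long dot = (1LL * p_a->vx * p_b->vx)
                  + (1LL * p_a->vy * p_b->vy)
                  + (1LL * p_a->vz * p_b->vz);
    if (dot < 0) return -1;
    return dot > 0;
}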
Related
I'm trying to implement the following function on my STM32:
y=0.0006*x^3 - 0.054*x^2 + 2.9094*x - 2.3578
x is in range 0 to 215
To avoid the use of pow or any other functions, I have written the following code
static double tmp = 0;
static unsigned short c_m;
static unsigned short c_m_3;
static unsigned short c_m_2;
c_m_3 = c_m*c_m*c_m;
c_m_2 = c_m*c_m;
tmp = (0.0006*c_m_3) - (0.054*c_m_2) + (2.9094*c_m) - 2.3578;
dati->sch_g = tmp;
For some reason the calculation is totally wrong; for instance, if c_m = 1 I should get tmp = 0.4982, but instead I get 13.
Am I missing something?
As noted by Lundin in the comments, your microcontroller (ARM Cortex-M0) doesn't provide a floating-point unit. As a consequence you cannot rely on native floating-point math, but need a floating-point software library, e.g. this one (note: I didn't evaluate it; it was just the first I stumbled upon in a quick search!).
Alternatively – and likely preferably – you might want to do the calculations in plain integers; if you additionally convert your calculation from the pattern a*x*x*x + b*x*x + c*x + d to ((a*x + b)*x + c)*x + d (Horner's method), you even save some multiplications:
int32_t c_m = ...;
c_m = ((6 * c_m - 540) * c_m + 29094) * c_m - 23578;
Note: unsigned short would be too small to hold the result on STM32, so you need to switch to 32 bit at least! Additionally you need a signed value to be able to hold the negative result arising from c_m == 0.
Your results will now be too large by a factor of 10 000, of course. As the use case is unclear, it remains open how you want to deal with that: possibly rounding (c_m = (c_m + 5000) / 10000) or evaluating the fractional part by other means.
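A minimal sketch of the scaled-integer approach with round-to-nearest (poly_scaled is an illustrative name; for x = 1 it yields 4982, matching the expected 0.4982):

#include <stdint.h>
#include <stdio.h>

/* y = 0.0006x^3 - 0.054x^2 + 2.9094x - 2.3578, scaled by 10000 (Horner form). */
static int32_t poly_scaled(int32_t x)
{
    return ((6 * x - 540) * x + 29094) * x - 23578;
}

int main(void)
{
    for (int32_t x = 0; x <= 215; x += 43) {
        int32_t y10k = poly_scaled(x); /* e.g. x = 1 -> 4982, i.e. 0.4982 */
        /* round to nearest; x = 0 yields a negative value, so handle the sign */
        int32_t y = (y10k >= 0) ? (y10k + 5000) / 10000
                                : -((-y10k + 5000) / 10000);
        printf("x=%3ld  y~=%ld (scaled: %ld)\n", (long)x, (long)y, (long)y10k);
    }
    return 0;
}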
short is 16 bits on all STM32, so the value 215 * 215 * 215 will not fit inside one. c_m_3 = c_m*c_m*c_m; truncates the result modulo USHRT_MAX+1 (65536), as the C standard specifies for unsigned conversions:
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
Use uint32_t instead.
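For illustration, a tiny desktop program showing that truncation (hypothetical, not from the original post):

#include <stdio.h>

int main(void)
{
    unsigned short c = 215 * 215 * 215; /* 9938375 truncated modulo 65536 */
    printf("%u\n", (unsigned)c);        /* prints 42439, not 9938375 */
    return 0;
}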
short is only 16 bits; the maximum value it can hold is 65535. Therefore it will overflow if the number you want to cube is over 40. This means you must use a larger type like uint32_t.
You can also use ifs to detect overflow, which is better programming practice.
As another note, it's better to use "uint8_t" and "uint16_t" instead of "unsigned char" and "unsigned short" in embedded programming, because they're more descriptive of the data sizes.
I am writing software for a small 8-bit microcontroller in C. Part of the code is to read the ADC value of a current transformer (ZCT), and then calculate the RMS value. The current flowing through the ZCT is sinusoidal but it can be distorted. My code is as follows:
float adc_value, inst_current;
float acc_load_current; // accumulator = (I1*I1 + I2*I2 + ... + In*In)
double rms_current;
// Calculate the real instantaneous value from the ADC reading
inst_current = (adc_value/1024)*2.5; // 10-bit ADC, voltage ref. 2.5V, so the formula is: x=(adc/1024)*2.5V
// Update the RMS value with the new instantaneous value:
// Subtract 1 sample from the accumulator (sample size is 512, so divide the accumulator by 512 and subtract that)
acc_load_current -= (acc_load_current / 512);
inst_current *= inst_current; // square the instantaneous current
acc_load_current += inst_current; // add it to the accumulator
rms_current = (acc_load_current / 512); // get the mean square value (sample size is 512)
rms_current = sqrt(rms_current); // get the RMS value
// Now <rms_current> is the real RMS current
However, this involves many floating-point calculations, which adds a large burden to my small MCU. And I found that the sqrt() function does not work with my compiler.
Is there any code that could run faster?
When you need speed on a processor that lacks an FPU, your best bet is to replace floating-point calculations with fixed point. Combine this with joop's suggestion (a one-step Newton-Raphson sqrt) and you get something like this:
#include <stdint.h>

#define INITIAL 512 /* Initial value of the filter memory. */
#define SAMPLES 512
uint16_t rms_filter(uint16_t sample)
{
static uint16_t rms = INITIAL;
static uint32_t sum_squares = 1UL * SAMPLES * INITIAL * INITIAL;
sum_squares -= sum_squares / SAMPLES;
sum_squares += (uint32_t) sample * sample;
if (rms == 0) rms = 1; /* do not divide by zero */
rms = (rms + sum_squares / SAMPLES / rms) / 2;
return rms;
}
Just run your raw ADC samples through this filter. You may add a few bit-shifts here and there to get more resolution, but you have to be careful not to overflow your variables, and I doubt you really need the extra resolution.
The output of the filter is in the same unit as its input; in this case, it is the unit of your ADC: 2.5 V / 1024 ≈ 2.44 mV. If you can keep this unit in subsequent calculations, you will save cycles by avoiding unnecessary conversions. If you really need the value to be in volts (it may be an I/O requirement), then you will have to convert to floating point. If you want millivolts, you can stay in the integer realm:
uint16_t rms_in_mV = rms_filter(raw_sample) * 160000UL >> 16;
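A hypothetical usage sketch (read_adc and on_adc_sample are illustrative names, not part of the filter):

extern uint16_t read_adc(void); /* your sampling routine */

/* Call once per ADC sample; the filter output settles over ~SAMPLES samples. */
void on_adc_sample(void)
{
    uint16_t raw = read_adc();      /* 0..1023 from the 10-bit ADC */
    uint16_t rms = rms_filter(raw); /* RMS in ADC units (~2.44 mV/LSB) */
    uint16_t rms_mV = (uint16_t)((rms * 160000UL) >> 16); /* only if mV needed */
    (void)rms_mV;                   /* use as required */
}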
Since your sum-of-squares value acc_load_current does not vary very much between iterations, its square root will be almost constant. A Newton-Raphson sqrt() function normally converges in only a few iterations. By using one iteration per step, the computation is smeared out.
static double one_step_newton_raphson_sqrt(double val, double hint)
{
    double probe;
    if (hint <= 0) return val / 2; /* no usable hint yet: crude first guess */
    probe = val / hint;
    return (probe + hint) / 2;     /* one Newton-Raphson step toward sqrt(val) */
}
static double acc_load_current = 0.0; // accumulator = (I1*I1 + I2*I2 + ... + In*In)
static double rms_current = 1.0;
float adc_value, inst_current;
double tmp_rms_current;
// Calculate the real instantaneous value from the ADC reading
inst_current = (adc_value/1024)*2.5; // 10-bit ADC, voltage ref. 2.5V, so the formula is: x=(adc/1024)*2.5V
// Update the RMS value with the new instantaneous value:
// Subtract 1 sample from the accumulator (sample size is 512, so divide the accumulator by 512 and subtract that)
acc_load_current -= (acc_load_current / 512);
inst_current *= inst_current; // square the instantaneous current
acc_load_current += inst_current; // add it to the accumulator
tmp_rms_current = (acc_load_current / 512);
rms_current = one_step_newton_raphson_sqrt(tmp_rms_current, rms_current); // polish the RMS value
// Now <rms_current> is the APPROXIMATE RMS current
Notes:
I changed some of the data types from float to double (which is normal on a general-purpose machine/desktop). If double is very expensive on your microcomputer, you could change them back.
I also added static, because I did not know if the code was from a function or from a loop.
I made the function static to encourage the compiler to inline it. If it does not get inlined, you can inline it manually.
Hopefully your project is for measuring "big" AC voltages (and not something like 9 V motor control). If that's the case then you can cheat, because your error can then be within accepted limits.
Do all of the maths in integer, and use a simple lookup map for the sqrt operation, which you can pre-calculate at startup. You would only REALLY need about ~600-odd values if you are doing 3-phase; see the sketch below.
This also raises the question: do you ACTUALLY need VAC RMS or some other measure of power? (For example, can you get away with a simple box-mean algorithm?)
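A minimal sketch of that lookup idea (table size and names are illustrative; it assumes the mean-square values you feed it stay below 600^2):

#include <stdint.h>

#define SQRT_TABLE_SIZE 600

static uint32_t squares[SQRT_TABLE_SIZE];

/* Pre-calculate at startup: squares[i] = i * i. */
void sqrt_table_init(void)
{
    for (uint32_t i = 0; i < SQRT_TABLE_SIZE; i++)
        squares[i] = i * i;
}

/* Integer sqrt via binary search over the precomputed squares. */
uint16_t sqrt_lookup(uint32_t x)
{
    uint16_t lo = 0, hi = SQRT_TABLE_SIZE - 1;
    while (lo < hi) {
        uint16_t mid = (uint16_t)((lo + hi + 1) / 2);
        if (squares[mid] <= x) lo = mid;
        else hi = mid - 1;
    }
    return lo; /* floor(sqrt(x)), clamped to the table range */
}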
To find the square root, use the app note below from Microchip for 8-bit controllers:
Fast Integer Square Root
which is much faster and can find the square root in just 9 loops.
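I can't reproduce the app note verbatim, but for reference, a standard bit-by-bit integer square root works the same way (one loop iteration per result bit; 8 iterations for a 16-bit input):

#include <stdint.h>

uint16_t isqrt16(uint16_t x)
{
    uint16_t res = 0;
    uint16_t bit = 1u << 14; /* highest even power of two for 16-bit input */
    while (bit > x) bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res; /* floor(sqrt(x)) */
}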
divisions/multiplications by powers of 2
These can be done by changing the exponent only, via bit-mask operations and +/-: mask and extract the exponent into an integer, apply the bias, add or subtract log2(operand), and encode it back into your double value.
sqrt
How fast and accurate does it need to be? You can use binary search on fixed point, or use sqrt(x) = pow(x,0.5) = exp2(0.5*log2(x)); again, on fixed point this is quite easy to implement. You can temporarily turn a double into fixed point by taking the mantissa and bit-shifting it to some known exponent around your used values (handling the offset), or to 2^0 if you have enough bits; compute the sqrt there and store the result back into a double. If you do not need much precision, you can stay at the operand's exponent and do the binary search directly on the mantissa only.
[edit1] binary search in C++
//---------------------------------------------------------------------------
double f64_sqrt(double x)
{
const int h=1; // may be platform dependent MSB/LSB order
const int l=0;
DWORD b; // bit mask
int e; // exponent
union // semi result
{
double f; // 64bit floating point
DWORD u[2]; // 2x32 bit uint
} y;
// fabs
y.f=x;
y.u[h]&=0x7FFFFFFF; x=y.f;
// set safe exponent (~ abs half)
e=((y.u[h]>>20)&0x07FF)-1023;
e/=2; // here can use bit shift with copying MSB ...
y.u[h]=((e+1023)&0x07FF)<<20;
// mantissa = 0
y.u[l] =0x00000000;
y.u[h]&=0xFFF00000;
// correct exponent
if (y.f*y.f>x) { e--; y.u[h]=((e+1023)&0x07FF)<<20; }
// binary search
for (b =0x00080000;b;b>>=1) { y.u[h]|=b; if (y.f*y.f>x) y.u[h]^=b; }
for (b =0x80000000;b;b>>=1) { y.u[l]|=b; if (y.f*y.f>x) y.u[l]^=b; }
return y.f;
}
//---------------------------------------------------------------------------
It returns sqrt(abs(x)); the results match the "math.h" implementation from my C++ IDE (BDS2006 Turbo C++). The exponent starts at half of x's exponent and is corrected by 1 for values x > 1.0 if needed. The rest is pretty obvious.
The code is in C++ but it is still not optimized; it can surely be improved. If your platform does not know DWORD, use unsigned int instead. If your platform does not support 32-bit integer types, then chunk it into 4 x 16-bit values or 8 x 8-bit values. If you have 64-bit support, then use a single value to speed up the process.
Do not forget to handle the exponent as 11 bits too, so for 8-bit registers use 2 of them. The FPU operations can be avoided if you multiply and compare just the mantissas as integers. Also, the multiplication itself is cumulative, so you can reuse the previous sub-result.
[notes]
For the bit positions see the Wikipedia article on the double-precision floating-point format.
Sorry for the wordy title. My code is targeting a microcontroller (msp430) with no floating point unit, but this should apply to any similar MCU.
If I am multiplying a large runtime variable with what would normally be considered a floating point decimal number (1.8), is this still treated like floating point math by the MCU or compiler?
My simplified code is:
int multip = 0xf; // Can be from 0-15, not available at compile time
int holder = multip * 625; // 0 - 9375
holder = holder * 1.8; // 0 - 16875
Since the result will always be a positive whole number, is it still floating-point math as far as the MCU or compiler is concerned, or is it fixed point?
(I realize I could just multiply by 18, but that would require declaring a 32-bit long instead of a 16-bit int, then dividing and downcasting for the array it will be put in; trying to skimp on memory here.)
The result is not an integer; it rounds to an integer.
9375 * 1.8000000000000000444089209850062616169452667236328125
yields
16875.0000000000004163336342344337026588618755340576171875
which rounds (in double precision floating point) to 16875.
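For illustration, a small desktop check of that claim (hypothetical standalone program):

#include <stdio.h>

int main(void)
{
    double r = 9375 * 1.8;  /* the double product rounds to exactly 16875.0 */
    printf("%.13f\n", r);   /* prints 16875.0000000000000 */
    printf("%d\n", (int)r); /* truncation is safe here: prints 16875 */
    return 0;
}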
If you write a floating-point multiply, I know of no compiler that will determine that there's a way to do that in fixed-point instead. (That does not mean they do not exist, but it ... seems unlikely.)
I assume you simplified away something important, because it seems like you could just do:
result = multip * 1125;
and get the final result directly.
I'd go for chux's formula if there's some reason you can't just multiply by 1125.
FP code will certainly be generated for
holder = holder * 1.8
To avoid FP and 32-bit math, given the OP's values of
int multip = 0xf; // Max 15
unsigned holder = multip * 625; // Max 9375
// holder = holder * 1.8;
// alpha depends on rounding desired, e.g. 2 for round to nearest.
holder += (holder*4u + alpha)/5;
If int x is non-negative, you can compute x *= 1.8 rounded to nearest using only int arithmetic, without overflow unless the final result overflows, with:
x - (x+2)/5 + x
For truncation instead of round-to-nearest, use:
x - (x+4)/5 + x
If x may be negative, some additional work is needed.
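A quick desktop check of both rewrites (purely illustrative):

#include <stdio.h>

int main(void)
{
    /* spot-check the integer rewrites of x * 1.8 for non-negative x */
    for (int x = 0; x <= 9375; x += 625) {
        int nearest = x - (x + 2) / 5 + x; /* round to nearest */
        int trunc   = x - (x + 4) / 5 + x; /* truncate */
        printf("%5d: %5d %5d (exact: %.1f)\n", x, nearest, trunc, x * 1.8);
    }
    return 0;
}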
I need to be able to use floating-point arithmetic under my dev environment in C (CPU: ~12 MHz Motorola 68000). The standard library is not present, meaning it is bare-bones C, and no, it isn't gcc, due to several other issues.
I tried getting the SoftFloat library to compile, plus one other 68k-specific FP library (the name of which escapes me at the moment), but their dependencies cannot be resolved for this particular platform, mostly due to libc deficiencies. I spent about 8 hrs trying to overcome the linking issues, until I reached a point where I knew I couldn't get further.
However, it took a mere half hour to come up with and implement the following set of functions that emulate floating-point functionality sufficiently for my needs.
The basic idea is that both the fractional and non-fractional parts are 16-bit integers, so there is no bit manipulation.
The nonfractional part has a range of [-32767, 32767] and the fractional part [-0.9999, +0.9999] - which gives us 4 digits of precision (good enough for my floating-point needs - albeit wasteful).
It looks to me like this could be used to make a faster, smaller (just 2 bytes big) alternative version of a float, with ranges [-99, +99] and [-0.9, +0.9].
The question here is, what other techniques - other than IEEE - are there to make an implementation of basic floating-point functionality (+ - * /) using fixed-point functionality?
Later on, I will need some basic trigonometry, but there are lots of resources on net for that.
Since the HW has 2 MB of RAM, I don't really care if I can save 2 bytes per soft-float (say, by reserving 9 vs 7 bits in an int). Thus, 4 bytes are good enough.
Also, from briefly looking at the 68k instruction manual (and the cycle costs of each instruction), I made a few early observations:
- bit-shifting is slow, and unless performance is of critical importance (not the case here), I'd prefer easy debugging of my soft-float library over 5-cycles-faster code. Besides, since this is C and not 68k ASM, it is obvious that speed is not a critical factor.
- 8-bit operands are as slow as 16-bit ones (give or take a cycle in most cases), so it does not make much sense to compress floats for the sake of performance.
What improvements / approaches would you propose to implement floating-point in C using fixed-point without any dependency on other library/code?
Perhaps it would be possible to use a different approach and do the operations on frac & non-frac parts at the same time?
Here is the code (tested only using a calculator); please ignore the C++-like declarations and initialization in the middle of functions (I will reformat that to C style later):
inline int Pad (int f) // Pad the fractional part to 4 digits
{
if (f < 10) return f*1000;
else if (f < 100) return f*100;
else if (f < 1000) return f*10;
else return f;
}
// We assume fractional parts are padded to full 4 digits
inline void Add (int & b1, int & f1, int b2, int f2)
{
b1 += b2;
f1 +=f2;
if (f1 > 9999) { b1++; f1 -=10000; }
else if (f1 < -9999) { b1--; f1 +=10000; }
f1 = Pad (f1);
}
inline void Sub (int & b1, int & f1, int b2, int f2)
{
// 123.1652 - 18.9752 = 104.1900
b1 -= b2; // 105
f1 -= f2; // -8100
if (f1 < 0) { b1--; f1 +=10000; }
f1 = Pad (f1);
}
// ToDo: Implement a multiplication by float
inline void Mul (int & b1, int & f1, int num)
{
// 123.9876 * 251 = 31120.8876
b1 *=num; // 30873
long q = f1*num; //2478876
int add = q/10000; // 247
b1+=add; // 31120
f1 = q-(add*10000);//8876
f1 = Pad (f1);
}
// ToDo: Implement a division by float
inline void Div (int & b1, int & f1, int num)
{
// 123.9876 / 25 = 4.959504
int b2 = b1/num; // 4
long q = b1 - (b2*num); // 23
f1 = ((q*10000) + f1) / num; // (23000+9876) / 25 = 9595
b1 = b2;
f1 = Pad (f1);
}
You are thinking in the wrong base for a simple fixed-point implementation. It is much easier if you use bits for the fractional part too, e.g. 16 bits for the integer part and 16 bits for the fractional part (range -32768..32767, precision of 1/2^16, which is a lot more precision than you have).
The best part is that addition and subtraction are simple (just add the two whole values together). Multiplication is a little bit trickier: you have to be aware of overflow, so it helps to do the multiplication in 64 bits. You also have to shift the result back down after the multiplication (by however many bits are in your fraction):
#include <stdint.h>

typedef int32_t fixed16; /* Q16.16: 16 integer bits, 16 fraction bits */

fixed16 mult_f(fixed16 op1, fixed16 op2)
{
    /* you may need to do something tricky with upper and lower halves if you
     * don't have native 64 bit, but the compiler might do it for us if we
     * are lucky; note the cast BEFORE the multiply, otherwise the 32-bit
     * product overflows
     */
    int64_t tmp;
    tmp = ((int64_t)op1 * op2) >> 16;
    /* add in error handling for overflow if you wish - this just wraps */
    return (fixed16)(tmp & 0xFFFFFFFF);
}
Division is similar.
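A sketch of that division, under the same assumptions as mult_f above (pre-shift the widened numerator instead of post-shifting the product):

fixed16 div_f(fixed16 op1, fixed16 op2)
{
    /* widen and pre-shift the numerator so the quotient keeps 16 fraction
     * bits; assumes op2 != 0 - add your own divide-by-zero handling */
    int64_t tmp = ((int64_t)op1 << 16) / op2;
    return (fixed16)tmp; /* truncates; wraps on overflow */
}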
Someone might have implemented almost exactly what you need (or something that can be hacked to make it work): it's called libfixmath.
If you decide to use fixed point, the whole number (i.e. both the integer and fractional parts) should be in the same base. Using binary for the int part and decimal for the fractional part as above is not very optimal and slows down the calculation. With binary fixed point you only need to shift an appropriate amount after each operation, instead of long adjustments like in your idea. If you want to use Q16.16 then libfixmath, as dave mentioned above, is a good choice. If you want a different precision or radix-point position, such as Q14.18 or Q19.13, then write your own library or modify some library for your own use. Some examples:
BoostGSoC15/fixed_point
https://github.com/johnmcfarlane/cnl
See also What's the best way to do fixed-point math?
If you want a larger range then floating point may be the better choice. Write a library to your own requirements; choose a format that is easy to implement and that achieves good performance in software. There is no need to follow the IEEE 754 specification (which is only fast with hardware implementations, due to the odd number of bits and the strange position of the exponent bits) unless you intend to exchange data with other devices. For example: a format of exp.sign.significand with 7 exponent bits, followed by a sign bit and then 24 bits of significand. The exponent doesn't need to be biased, so to get it only an arithmetic shift by 25 is needed, and the sign bit will be extended automatically. But if shifts are slower than a subtraction, then excess-n is better.
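A sketch of how cheap field extraction gets with that layout (names and the exact packing are illustrative, and it assumes arithmetic right shift on signed values):

#include <stdint.h>

/* exp.sign.significand: 7 exponent bits (two's complement, unbiased),
 * 1 sign bit, 24 significand bits, packed into one int32_t. */
typedef int32_t softfloat;

static int32_t sf_exponent(softfloat f)
{
    return f >> 25;       /* arithmetic shift sign-extends the exponent */
}

static int32_t sf_sign(softfloat f)
{
    return (f >> 24) & 1; /* 0 = positive, 1 = negative */
}

static uint32_t sf_significand(softfloat f)
{
    return (uint32_t)f & 0x00FFFFFF;
}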
I am currently writing a fast 32.32 fixed-point math library. I succeeded at making adding, subtraction and multiplication work correctly, but I am quite stuck at division.
A little reminder for those who can't remember: a 32.32 fixed-point number is a number having 32 bits of integer part and 32 bits of fractional part.
The best algorithm I came up with needs 96-bit integer division, which is something compilers usually don't have built-ins for.
Anyway, here it goes:
G = 2^32
notation: x is the 64-bit fixed-point number, x1 is its low 32-bit half and x2 is its high half
G*(a/b) = ((a1 + a2*G) / (b1 + b2*G))*G // decompose this
G*(a/b) = (a1*G) / (b1 + b2*G) + (a2*G*G) / (b1 + b2*G)
As you can see, the (a2*G*G) is guaranteed to be larger than the regular 64-bit integer. If uint128_t's were actually supported by my compiler, I would simply do the following:
((uint128_t)x << 32) / y
Well they aren't and I need a solution. Thank you for your help.
You can decompose a larger division into multiple chunks that each do division with fewer bits. As another poster already mentioned, the algorithm can be found in TAOCP by Knuth.
However, no need to buy the book!
There is code on the Hacker's Delight website that implements the algorithm in C. It's written to do 64-bit unsigned divisions using 32-bit arithmetic only, so you can't directly cut'n'paste the code. To get from 64 to 128 bits you have to widen all types, masks and constants to twice the width: e.g. a short becomes an int, a 0xffff becomes 0xffffffffll, etc.
After this easy change you should be able to do 128-bit divisions.
The code is mirrored on GitHub, but was originally posted on Hackersdelight.org (original link no longer accessible).
Since your largest values only need 96 bits, one of the 64-bit divisions will always return zero, so you can even simplify the code a bit.
Oh, and before I forget: the code only works with unsigned values. To convert from signed to unsigned division you can do something like this (pseudo-code style):
fixpoint Divide (fixpoint a, fixpoint b)
{
// check if the integers are of different sign:
fixpoint sign_difference = a ^ b;
// do unsigned division:
fixpoint x = unsigned_divide (abs(a), abs(b));
// if the signs have been different: negate the result.
if (sign_difference < 0)
{
x = -x;
}
return x;
}
The website itself is worth checking out as well: http://www.hackersdelight.org/
By the way - nice task that you're working on.. Do you mind telling us for what you need the fixed-point library?
By the way, the ordinary shift-and-subtract algorithm for division would work as well.
If you target x86 you can implement it using MMX or SSE intrinsics. The algorithm relies only on primitive operations, so it could perform quite fast as well.
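For reference, a sketch of that shift-and-subtract (restoring) division in C, specialized to this problem; it assumes b != 0 and that the quotient fits in 64 bits (i.e. a >> 32 < b):

#include <stdint.h>

/* Restoring division: 128-bit dividend (n_hi:n_lo) by a 64-bit divisor. */
uint64_t div128by64(uint64_t n_hi, uint64_t n_lo, uint64_t b)
{
    uint64_t r = n_hi, q = 0;
    for (int i = 63; i >= 0; i--) {
        uint64_t carry = r >> 63;         /* bit shifted out of r */
        r = (r << 1) | ((n_lo >> i) & 1); /* bring in next dividend bit */
        q <<= 1;
        if (carry || r >= b) { r -= b; q |= 1; }
    }
    return q;
}

/* For the 32.32 case: quotient = div128by64(a >> 32, a << 32, b); */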
Better self-adjusting answer:
Forgive the C#-ism of the answer, but the following should work in all cases. There is likely a solution that finds the right shifts to use more quickly, but I'd have to think much more deeply than I can right now. This should be reasonably efficient though:
int upshift = 32;
ulong mask = 0xFFFFFFFF00000000;
ulong mod = x % y;
while ((mod & mask) != 0)
{
// Current upshift of the remainder would overflow... so adjust
y >>= 1;
mask <<= 1;
upshift--;
mod = x % y;
}
ulong div = ((x / y) << upshift) + (mod << upshift) / y;
Simple but unsafe answer:
This calculation can cause an overflow when upshifting the x % y remainder if that remainder has any bits set in its high 32 bits, causing an incorrect answer.
((x / y) << 32) + ((x % y) << 32) / y
The first part uses integer division and gives you the high bits of the answer (shift them back up).
The second part calculates the low bits from the remainder of the high-bit division (the bit that could not be divided any further), shifted up and then divided.
I like Nils' answer, which is probably the best. It's just long division, like we all learned in grade school, except the digits are base 2^32 instead of base 10.
However, you might also consider using Newton's approximation method for division:
x := x * (2 - D * x)
where D is the denominator; x converges to the reciprocal 1/D, and the quotient is then N * x, with N the numerator.
This uses only multiplies and adds, which you already have, and it converges very quickly, to about 1 ULP of precision. On the other hand, you won't be able to achieve the exact 0.5-ULP answer in all cases.
In any case, the tricky bit is detecting and handling the overflows.
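To illustrate the iteration (in double here, purely to show the convergence; a real 32.32 version would run the same recurrence in fixed point, and the initial guess must satisfy 0 < x0 < 2/D):

#include <stdio.h>

int main(void)
{
    double N = 355.0, D = 113.0;
    double x = 0.005;          /* rough initial guess for 1/D */
    for (int i = 0; i < 6; i++) {
        x = x * (2.0 - D * x); /* Newton step toward 1/D */
        printf("iter %d: N*x = %.12f\n", i, N * x);
    }
    /* converges to 355/113 = 3.14159292... */
    return 0;
}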
Quick-n-dirty:
Do the A/B divide with double-precision floating point.
This gives you C ~= A/B. It's only approximate, because of floating-point precision and the 53-bit mantissa.
Round off C to a representable number in your fixed point system.
Now compute (again with your fixed point) D=A-C*B. This should have significantly lower magnitude than A.
Repeat, now computing D/B with floating point. Again, round the answer to an integer, and add each division result together as you go. You can stop when the remainder is so small that your floating-point divide returns 0 after rounding.
You're still not done. Now you're very close to the answer, but the divisions weren't exact.
To finalize, you'll have to do a binary search. Using the (very good) starting estimate, check whether nudging it up or down reduces the error; you basically want to bracket the proper answer and keep dividing the range in half with new tests.
Yes, you could do Newton iteration here, but binary search will likely be easier since you need only simple multiplies and adds using your existing 32.32 precision toolkit.
This is not the most efficient method, but it's by far the easiest to code.
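A rough sketch of the idea, assuming fix_mul is the (working) 32.32 multiply from your library and that the quotient fits in the format; the binary-search polish described above is still needed afterwards:

#include <stdint.h>

extern int64_t fix_mul(int64_t x, int64_t y); /* your existing 32.32 multiply */

/* Q32.32 values stored raw in int64_t. */
int64_t fix_div_approx(int64_t a, int64_t b)
{
    int64_t q = 0, rem = a;
    for (int i = 0; i < 4 && rem != 0; i++) {
        double est = (double)rem / (double)b;      /* approximate REM / B */
        int64_t c = (int64_t)(est * 4294967296.0); /* back to raw Q32.32 */
        if (c == 0) break;                         /* remainder negligible */
        q += c;
        rem = a - fix_mul(q, b);                   /* exact residual A - Q*B */
    }
    return q; /* close to A/B; refine with the binary search above */
}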