Why does Eigen3's norm() method not give the precise answer?

I have a situation where, using the Eigen3 library, norm() does not give the answer I expect. norm() should be just the square root of the sum of the squared coefficients of a vector:
NORM = sqrt( v[1]*v[1] + v[2]*v[2] + ... + v[N]*v[N] )
However, the following function calculates the norm in two ways, with Eigen3's norm() method and by hand, and the results are slightly different:
#include <iostream>
#include <cmath>
#include <Eigen/Dense>

using Eigen::Vector3d;

void mytest()
{
    double mvec[3];
    mvec[0] = -3226.9276456286984;
    mvec[1] = 6153.3425006471571;
    mvec[2] = 2548.5894934614853;

    Vector3d v;
    v(0) = mvec[0];
    v(1) = mvec[1];
    v(2) = mvec[2];

    double normEigen = v.norm();
    double normByHand = std::sqrt(v(0)*v(0) + v(1)*v(1) + v(2)*v(2));
    // std::fabs, not the integer abs(), which would truncate the difference to 0
    double mdiff = std::fabs(normEigen - normByHand);

    std::cout.precision(17);
    std::cout << "normEigen= " << normEigen << std::endl;
    std::cout << "normByHand= " << normByHand << std::endl;
    std::cout << "mdiff= " << mdiff << std::endl;
}
The output of this function is:
normEigen= 7400.8103858007089
normByHand= 7400.8103858007107
mdiff= 1.8189894035e-12
From the 15th digit on they differ. Why? Where is some rounding happening?
Thanks in advance,
PedroC.

The calculation is one that uses floating point arithmetic. As such, the order of operations, as well as things like vectorization, can result in (usually only slightly) different results, due to different roundings, different orders of magnitude, etc.
In this case, the difference is just in the 15th digit. The maximum precision of a 64-bit floating point number is around 16 significant decimal digits, so this is about as good as agreement gets.
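As a plain C illustration (nothing Eigen-specific), summing the very same three squares in two different orders can already flip the last bits:
#include <stdio.h>

int main(void)
{
    double a = -3226.9276456286984;
    double b =  6153.3425006471571;
    double c =  2548.5894934614853;

    /* the same three squares, summed in two different orders */
    double left_to_right = (a * a + b * b) + c * c;
    double right_to_left = a * a + (b * b + c * c);

    printf("%.17g\n%.17g\n", left_to_right, right_to_left);
    return 0;
}
Depending on the platform and compiler flags, the two printed sums may differ in the last digit or two; neither is wrong, they are just different roundings of the exact value.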
If we look at the distance in ULPs using boost:
#include <boost/math/special_functions/next.hpp>
#include <iostream>
int main()
{
double normEigen = 7400.8103858007089;
double normByHand = 7400.8103858007107;
std::cout << boost::math::float_distance(normEigen, normByHand);
return 0;
}
we see that the distance (at least on my system) is 2. So the binary number is e.g. 0101...011 instead of 0101...001. Such a small difference is almost always due to the reasons I listed above.

Going deeper, I see that it is the sum of the squared values that introduces the discrepancy: when I calculate the squared norm of the whole vector, and again as the sum of the squared norms of three vectors that each keep only one term, the totals are not identical.
void mytest2()
{
    double mvec[3];
    mvec[0] = -3226.9276456286984;
    mvec[1] = 6153.3425006471571;
    mvec[2] = 2548.5894934614853;

    Vector3d v, v1, v2, v3;
    v(0) = mvec[0];
    v(1) = mvec[1];
    v(2) = mvec[2];
    v1(0) = mvec[0]; v1(1) = v1(2) = 0.0;
    v2(0) = 0.0; v2(1) = mvec[1]; v2(2) = 0.0;
    v3(0) = v3(1) = 0.0; v3(2) = mvec[2];

    double squnorm = v.squaredNorm();
    double squnorm1 = v1.squaredNorm();
    double squnorm2 = v2.squaredNorm();
    double squnorm3 = v3.squaredNorm();
    double squnormbyhand = squnorm1 + squnorm2 + squnorm3;
    double sqdiff = std::fabs(squnorm - squnormbyhand);

    std::cout.precision(17);
    std::cout << "normEigen= " << squnorm << std::endl;
    std::cout << "normByHand= " << squnormbyhand << std::endl;
    std::cout << "mdiff= " << sqdiff << std::endl;
}
The output of this function is:
normEigen= 54771994.366575643
normByHand= 54771994.366575658
mdiff= 1.49011161193847656e-8
For some reason, when adding the squared values, Eigen introduces a rounding difference.
Thanks for your answer anyway.
Pedro

Related

Arithmetic operations on 64-bit double values using ARM NEON intrinsics in ARM64

I'm trying to implement a simple 64-bit double addition operation using ARM NEON. I've come across this question, but there was no sample implementation using ARM intrinsics available in the answer. So any help in providing a complete example is greatly appreciated. Here is what I have tried so far, using integer-type registers.
Side note:
Please note that I'm using the intel/ARM_NEON_2_x86_SSE library to simulate this ARM NEON code using SSE instructions. Should I switch to native ARM NEON to test this code?
#include <iostream>
#include <arm_neon.h>
using namespace std;

int main()
{
    double Val1[2] = { 2.46574621, 0.46546221 };
    double Val2[2] = { 2.63565654, 0.46574621 };
    double Sum[2]   = { 0.0, 0.0 };
    double Sum_C[2] = { 0.0, 0.0 };

    // The doubles are reinterpreted as 64-bit integers here and added as
    // integers; this is the broken attempt being asked about.
    vst1q_s64((int64_t *)Sum,                       // Store int64x2_t
        vaddq_s64(                                  // Add int64x2_t
            vld1q_s64((const int64_t *)&Val1[0]),   // Load int64x2_t
            vld1q_s64((const int64_t *)&Val2[0]))); // Load int64x2_t

    for (size_t i = 0; i < 2; i++)
    {
        Sum_C[i] = Val1[i] + Val2[i];
        if (Sum_C[i] != Sum[i])
        {
            cout << "[Error] Sum : " << Sum[i] << " != " << Sum_C[i] << "\n";
        }
        else
            cout << "[Passed] Sum : " << Sum[i] << " == " << Sum_C[i] << "\n";
    }
    cout << "\n";
}
[Error] Sum : -1.22535e-308 != 5.1014
[Error] Sum : 1.93795e+307 != 0.931208
Double precision isn't supported on aarch32 NEON.
Therefore, if you target armv7-a while using the data type float64x2_t, it won't build.
If your test platform is an aarch64 one with a 64-bit OS installed, just exclude the aarch32 target from your makefile.
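For completeness, here is a minimal sketch of the native aarch64 version using the float64x2_t type, so the lanes are loaded, added, and stored as doubles (untested here, since it needs an aarch64 toolchain):
#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    double Val1[2] = { 2.46574621, 0.46546221 };
    double Val2[2] = { 2.63565654, 0.46574621 };
    double Sum[2];

    /* load two doubles per register, add as doubles, store the result */
    float64x2_t a = vld1q_f64(Val1);
    float64x2_t b = vld1q_f64(Val2);
    vst1q_f64(Sum, vaddq_f64(a, b));

    printf("%f %f\n", Sum[0], Sum[1]); /* expect 5.101403 0.931208 */
    return 0;
}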

Flipping Pebble Screen Issue

I'm writing a Pebble Time watch app using Pebble SDK 3.0 on the basalt platform that requires text to be displayed upside down.
The logic is:-
Write to screen
Capture screen buffer
Flip screen buffer (using the flipHV routine, see below)
Release buffer.
After a fair amount of experimentation I've got it working after a fashion, but the (black) text has what seem to be random vertical white lines through it, which I suspect is something to do with shifting bits.
The subroutine I'm using is:-
void flipHV(GBitmap *bitMap) {
GRect fbb = gbitmap_get_bounds(bitMap);
int Width = 72; // fbb.size.w;
int Height = 84; // fbb.size.h;
uint32_t *pBase = (uint32_t *)gbitmap_get_data(bitMap);
uint32_t *pTopRemainingPixel = pBase;
uint32_t *pBottomRemainingPixel = pBase + (Height * Width);
while (pTopRemainingPixel < pBottomRemainingPixel) {
uint32_t TopPixel = *pTopRemainingPixel;
uint32_t BottomPixel = *pBottomRemainingPixel;
TopPixel = (TopPixel << 16) | (TopPixel >> 16);
*pBottomRemainingPixel = TopPixel;
BottomPixel = (BottomPixel << 16) | (BottomPixel >> 16);
*pTopRemainingPixel = BottomPixel;
pTopRemainingPixel++;
pBottomRemainingPixel--;
}
}
and its purpose is to work through the screen buffer, taking the first pixel and swapping it with the last one, the second one and swapping it with the second last one, and so on.
Because each 32-bit word holds 2 pixels, I also need to rotate it through 16 bits.
I suspect that this is where the problem lies.
Can someone have a look at my code and see if they can spot what is going wrong and put me right? I should say that I'm both a C and Pebble SDK newbie, so please explain everything as if to a child!
Your assignments like
TopPixel = (TopPixel << 16) | (TopPixel >> 16)
swap pixels pair-wise
+--+--+ +--+--+
|ab|cd| => |cd|ab|
+--+--+ +--+--+
What you want instead is a full swap:
+--+--+ +--+--+
|ab|cd| => |dc|ba|
+--+--+ +--+--+
That can be done with even more bit-fiddling, e.g.
TopPixel = ((TopPixel << 24) & 0xff000000) | // move d from 0..7 to 24..31
           ((TopPixel <<  8) & 0x00ff0000) | // move c from 8..15 to 16..23
           ((TopPixel >>  8) & 0x0000ff00) | // move b from 16..23 to 8..15
           ((TopPixel >> 24) & 0x000000ff);  // move a from 24..31 to 0..7
or - way more readable(!) - by using GColor8 instead of uint32_t and a loop on a per-pixel basis:
// swap each pixel with its 180-degree counterpart; looping over only
// the top half of the rows avoids swapping anything back again
uint8_t *data = gbitmap_get_data(bmp);
uint16_t stride = gbitmap_get_bytes_per_row(bmp);
for (int16_t y = 0; y * 2 < max_y; y++) {
    for (int16_t x = 0; x <= max_x; x++) {
        GColor8 *value_1 = (GColor8 *)(data + stride * y + x);
        GColor8 *value_2 = (GColor8 *)(data + stride * (max_y - y) + (max_x - x));
        // swapping the two pixel values, could be simplified with a SWAP(a,b) macro
        GColor8 tmp = *value_1;
        *value_1 = *value_2;
        *value_2 = tmp;
    }
}
Disclaimer: I haven't compiled this code. If the row count is odd, the middle row still needs a separate half-row pass. And the whole pointer arithmetic can be tuned if you see that this is a performance bottleneck.
It turns out that I needed to replace all of the uint32_t with uint8_t and do away with the shifting.
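For reference, that final approach looks something like this (a sketch only, untested; it assumes the basalt framebuffer stores one byte per pixel and reverses the whole buffer, which amounts to the desired 180-degree flip when the stride equals the visible width):
void flipHV(GBitmap *bitMap) {
    GRect fbb = gbitmap_get_bounds(bitMap);
    int stride = gbitmap_get_bytes_per_row(bitMap);
    uint8_t *pTop = gbitmap_get_data(bitMap);
    uint8_t *pBottom = pTop + fbb.size.h * stride - 1;
    // swap the first pixel with the last, the second with the
    // second-last, and so on; one byte per pixel, so no bit shifting
    while (pTop < pBottom) {
        uint8_t tmp = *pTop;
        *pTop++ = *pBottom;
        *pBottom-- = tmp;
    }
}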

Storing input values in structs for fastest comparison later

I'm sampling eight input ports and comparing the values up to ten times a second.
These inputs will be XOR'd against a similar field indicating which signals are "active low", then an AND operation will mask out input signals that are not going to be compared (though all signals are sampled, whether compared or not).
So this is an example of the sampling. I've created a struct where the signals will be stored and then saved in memory. This struct contains a lot of other values, so replacing the whole struct is not an option. Anyway, these input values need to be saved in an efficient way so that later I can perform fast XOR and AND operations with my masks.
void SampleData(){
// These are not all the values to be sampled, only the inputs
currentSample.i0 = RD13_bit;
currentSample.i1 = RD12;
currentSample.i2 = RD11;
currentSample.i3 = RD10;
currentSample.i4 = RE12;
currentSample.i5 = RE13;
currentSample.i6 = RF8;
currentSample.i7 = RF9;
}
This is an example of the comparison I need:
void checkInputSignals() {
    activated = ((inputValues ^ activeLowInputs) & activeInputsMask);
    if (activated) {
        importantMethod();
    }
}
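For example, with some made-up values:
unsigned char inputValues      = 0x2D; /* 0b00101101, sampled inputs          */
unsigned char activeLowInputs  = 0x05; /* 0b00000101, these signals invert    */
unsigned char activeInputsMask = 0x0F; /* 0b00001111, compare only these four */

/* 0x2D ^ 0x05 = 0x28; 0x28 & 0x0F = 0x08, nonzero, so importantMethod() runs */
unsigned char activated = (inputValues ^ activeLowInputs) & activeInputsMask;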
I've tried a bit-field, but I couldn't get the operators to work, and I have no knowledge about the efficiency of bit-fields. Efficiency in this project is not focused on memory, but on speed and comfort. How should I store my three fields? If it helps, I am using a dsPIC33EP microprocessor.
If using a char or uint8_t, my sample method would look like this, right? And this does not seem to be the most elegant solution.
unsigned char inputValues;
void SampleData(){
currentSample.i0 = RD13_bit;
currentSample.i1 = RD12;
currentSample.i2 = RD11;
currentSample.i3 = RD10;
currentSample.i4 = RE12;
currentSample.i5 = RE13;
currentSample.i6 = RF8;
currentSample.i7 = RF9;
// Pack the bits for the masking (i7 ends up in bit 7, i0 in bit 0).
// Assign rather than +=, or stale bits from the previous sample survive:
inputValues = currentSample.i7;
inputValues = (inputValues << 1) + currentSample.i6;
inputValues = (inputValues << 1) + currentSample.i5;
inputValues = (inputValues << 1) + currentSample.i4;
inputValues = (inputValues << 1) + currentSample.i3;
inputValues = (inputValues << 1) + currentSample.i2;
inputValues = (inputValues << 1) + currentSample.i1;
inputValues = (inputValues << 1) + currentSample.i0;
}
And I would have to do the same for my masks, for example:
void ConfigureActiveLowInputs(){
    // same bit order as inputValues: I7 ends up in bit 7, I0 in bit 0
    activeLowInputs = currentCalibration->I7_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I6_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I5_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I4_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I3_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I2_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I1_activeLow;
    activeLowInputs = (activeLowInputs << 1) + currentCalibration->I0_activeLow;
}
There must be a better solution than bit shifting?
Some things I think you need to know.
Don't use bit fields. Apart from being non-portable, they make this kind of bit-twiddling harder, not easier.
Don't use run-time shifts. Get the compiler to do your work.
Do read code, study and practice. Learning bit-twiddling can be hard, and from your code I don't think you're quite there yet.
If we're going to help there are some things we need to know.
You mention 8 ports. Are they single bit ports, or single ports with multiple bits?
You mention 3 fields. What are they?
Your sample code uses + operators, which are rarely used in bit operations. Why?
In C the code usually ends up with a set of macros and defines, plus a few small functions. It's all quite simple, generates good code, and runs fast without too much effort. If we only knew what you were trying to do.
You seem to be storing individual bits in separate structure members, and then packing them into a word on the fly to be able to apply masks; but it is probably more efficient to pack them into a word up front, and use a mask to access the individual bits when necessary.
The members i0, i1 etc. are probably unnecessary. It would be simpler to pack the bits directly into a uint8_t member, then write functions or macros to return individual bits where necessary.
uint8_t SampleData(void)
{
    return (RD13_bit << 7) |
           (RD12 << 6) |
           (RD11 << 5) |
           (RD10 << 4) |
           (RE12 << 3) |
           (RE13 << 2) |
           (RF8  << 1) |
            RF9;
}
Then:
currentSample.i = SampleData();
Then you can apply masks to that directly. If you need to access individual bits (and if you don't, why make them separate members in the first place?) then, for example:
#include <stdbool.h>
#define GETBIT( word, bit ) (((word) & (1u << (bit))) != 0)
bool i6 = GETBIT( currentSample.i, 6 );
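The calibration masks can be packed the same way, so that the bit positions always line up (a sketch reusing the question's currentCalibration fields, with I0_activeLow landing in bit 7 to match RD13_bit above):
uint8_t BuildActiveLowMask(void)
{
    return (uint8_t)( (currentCalibration->I0_activeLow << 7) |
                      (currentCalibration->I1_activeLow << 6) |
                      (currentCalibration->I2_activeLow << 5) |
                      (currentCalibration->I3_activeLow << 4) |
                      (currentCalibration->I4_activeLow << 3) |
                      (currentCalibration->I5_activeLow << 2) |
                      (currentCalibration->I6_activeLow << 1) |
                       currentCalibration->I7_activeLow );
}
Then activeLowInputs = BuildActiveLowMask(); and the XOR/AND comparison operates on bits that correspond one-to-one.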

Calculate x^0.19029 with low memory using a lookup table?

I'm writing a C program for a PIC micro-controller which needs to do a very specific exponential function. I need to calculate the following:
A = k . (1 - (p/p0)^0.19029)
k and p0 are constant, so it's all pretty simple apart from finding x^0.19029
(p/p0) ratio would always be in the range 0-1.
It works well if I add in math.h and use the power function, except that this uses up all of the available 16 kB of program memory. Talk about bloatware! (Rest of the program without the power function: ~20% flash memory usage; add math.h and the power function: 100%.)
I'd like the program to do some other things as well. I was wondering if I can write a special case implementation for x^0.19029, maybe involving iteration and some kind of lookup table.
My idea is to generate a look-up table for the function x^0.19029, with perhaps 10-100 values of x in the range 0-1. The code would find a close match, then (somehow) iteratively refine it by re-scaling the lookup table values. However, this is where I get lost because my tiny brain can't visualise the maths involved.
Could this approach work?
Alternatively, I've looked at using exp(x) and ln(x), which can be implemented with Taylor expansions. b^x can then be found with:
b^x = (e^(ln b))^x = e^(x·ln(b))
(See: Wikipedia - Powers via Logarithms)
This looks a bit tricky and complicated to me, though. Am I likely to get the implementation smaller than the compiler's math library, and can I simplify it for my special case (i.e. base always in 0-1, exponent always 0.19029)?
Note that RAM usage is OK at the moment, but I've run low on Flash (used for code storage). Speed is not critical. Somebody has already suggested that I use a bigger micro with more flash memory, but that sounds like profligate wastefulness!
[EDIT] I was being lazy when I said "(p/p0) ratio would always be in the range 0-1". Actually it will never reach 0, and I did some calculations last night and decided that a range of 0.3 - 1 would be quite adequate! This means that some of the simpler solutions below should be suitable. Also, the "k" in the above is 44330, and I'd like the error in the final result to be less than 0.1. I guess that means the error in (p/p0)^0.19029 needs to be less than 1/443300, or 2.256e-6.
Use splines. The relevant part of the function varies approximately like the 5th root, so the problematic zone is close to p/p0 = 0. There is mathematical theory on how to optimally place the knots of splines to minimize the error (see Carl de Boor: A Practical Guide to Splines). Usually one constructs the spline in B-form ahead of time (using toolboxes such as Matlab's spline toolbox, also written by C. de Boor), then converts it to piecewise polynomial representation for fast evaluation.
In C. de Boor, PGS, the function g(x) = sqrt(x + 1) is actually taken as an example (Chapter 12, Example II). This is exactly what you need here. The book comes back to this case a few times, since it is admittedly a hard problem for any interpolation scheme due to the infinite derivatives at x = -1. All software from PGS is available for free as PPPACK in netlib, and most of it is also part of SLATEC (also from netlib).
Edit (Removed)
(Multiplying by x once does not significantly help, since it only regularizes the first derivative, while all other derivatives at x = 0 are still infinite.)
Edit 2
My feeling is that optimally constructed splines (following de Boor) will be best (and fastest) for relatively low accuracy requirements. If the accuracy requirements are high (say 1e-8), one may be forced to get back to the algorithms that mathematicians have been researching for centuries. At this point, it may be best to simply download the sources of glibc and copy (provided GPL is acceptable) whatever is in
glibc-2.19/sysdeps/ieee754/dbl-64/e_pow.c
Since we don't have to include the whole math.h, there shouldn't be a problem with memory, but we will only marginally profit from having a fixed exponent.
Edit 3
Here is an adapted version of e_pow.c from netlib, as found by @Joni. This seems to be the grandfather of glibc's more modern implementation mentioned above. The old version has two advantages: (1) it is public domain, and (2) it uses a limited number of constants, which is beneficial if memory is a tight resource (glibc's version defines over 10000 lines of constants!). The following is completely standalone code, which calculates x^0.19029 for 0 <= x <= 1 to double precision (I tested it against Python's power function and found that at most 2 bits differed):
#define __LITTLE_ENDIAN
#ifdef __LITTLE_ENDIAN
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
#else
#define __HI(x) *(int*)&x
#define __LO(x) *(1+(int*)&x)
#endif
static const double
bp[] = {1.0, 1.5,},
dp_h[] = { 0.0, 5.84962487220764160156e-01,}, /* 0x3FE2B803, 0x40000000 */
dp_l[] = { 0.0, 1.35003920212974897128e-08,}, /* 0x3E4CFDEB, 0x43CFD006 */
zero = 0.0,
one = 1.0,
two = 2.0,
two53 = 9007199254740992.0, /* 0x43400000, 0x00000000 */
/* poly coefs for (3/2)*(log(x)-2s-2/3*s**3 */
L1 = 5.99999999999994648725e-01, /* 0x3FE33333, 0x33333303 */
L2 = 4.28571428578550184252e-01, /* 0x3FDB6DB6, 0xDB6FABFF */
L3 = 3.33333329818377432918e-01, /* 0x3FD55555, 0x518F264D */
L4 = 2.72728123808534006489e-01, /* 0x3FD17460, 0xA91D4101 */
L5 = 2.30660745775561754067e-01, /* 0x3FCD864A, 0x93C9DB65 */
L6 = 2.06975017800338417784e-01, /* 0x3FCA7E28, 0x4A454EEF */
P1 = 1.66666666666666019037e-01, /* 0x3FC55555, 0x5555553E */
P2 = -2.77777777770155933842e-03, /* 0xBF66C16C, 0x16BEBD93 */
P3 = 6.61375632143793436117e-05, /* 0x3F11566A, 0xAF25DE2C */
P4 = -1.65339022054652515390e-06, /* 0xBEBBBD41, 0xC5D26BF1 */
P5 = 4.13813679705723846039e-08, /* 0x3E663769, 0x72BEA4D0 */
lg2 = 6.93147180559945286227e-01, /* 0x3FE62E42, 0xFEFA39EF */
lg2_h = 6.93147182464599609375e-01, /* 0x3FE62E43, 0x00000000 */
lg2_l = -1.90465429995776804525e-09, /* 0xBE205C61, 0x0CA86C39 */
ovt = 8.0085662595372944372e-0017, /* -(1024-log2(ovfl+.5ulp)) */
cp = 9.61796693925975554329e-01, /* 0x3FEEC709, 0xDC3A03FD =2/(3ln2) */
cp_h = 9.61796700954437255859e-01, /* 0x3FEEC709, 0xE0000000 =(float)cp */
cp_l = -7.02846165095275826516e-09, /* 0xBE3E2FE0, 0x145B01F5 =tail of cp_h*/
ivln2 = 1.44269504088896338700e+00, /* 0x3FF71547, 0x652B82FE =1/ln2 */
ivln2_h = 1.44269502162933349609e+00, /* 0x3FF71547, 0x60000000 =24b 1/ln2*/
ivln2_l = 1.92596299112661746887e-08; /* 0x3E54AE0B, 0xF85DDF44 =1/ln2 tail*/
double pow0p19029(double x)
{
double y = 0.19029e+00;
double z,ax,z_h,z_l,p_h,p_l;
double y1,t1,t2,r,s,t,u,v,w;
int i,j,k,n;
int hx,hy,ix,iy;
unsigned lx,ly;
hx = __HI(x); lx = __LO(x);
hy = __HI(y); ly = __LO(y);
ix = hx&0x7fffffff; iy = hy&0x7fffffff;
ax = x;
/* special value of x */
if(lx==0) {
if(ix==0x7ff00000||ix==0||ix==0x3ff00000){
z = ax; /*x is +-0,+-inf,+-1*/
return z;
}
}
s = one; /* s (sign of result -ve**odd) = -1 else = 1 */
double ss,s2,s_h,s_l,t_h,t_l;
n = ((ix)>>20)-0x3ff;
j = ix&0x000fffff;
/* determine interval */
ix = j|0x3ff00000; /* normalize ix */
if(j<=0x3988E) k=0; /* |x|<sqrt(3/2) */
else if(j<0xBB67A) k=1; /* |x|<sqrt(3) */
else {k=0;n+=1;ix -= 0x00100000;}
__HI(ax) = ix;
/* compute ss = s_h+s_l = (x-1)/(x+1) or (x-1.5)/(x+1.5) */
u = ax-bp[k]; /* bp[0]=1.0, bp[1]=1.5 */
v = one/(ax+bp[k]);
ss = u*v;
s_h = ss;
__LO(s_h) = 0;
/* t_h=ax+bp[k] High */
t_h = zero;
__HI(t_h)=((ix>>1)|0x20000000)+0x00080000+(k<<18);
t_l = ax - (t_h-bp[k]);
s_l = v*((u-s_h*t_h)-s_h*t_l);
/* compute log(ax) */
s2 = ss*ss;
r = s2*s2*(L1+s2*(L2+s2*(L3+s2*(L4+s2*(L5+s2*L6)))));
r += s_l*(s_h+ss);
s2 = s_h*s_h;
t_h = 3.0+s2+r;
__LO(t_h) = 0;
t_l = r-((t_h-3.0)-s2);
/* u+v = ss*(1+...) */
u = s_h*t_h;
v = s_l*t_h+t_l*ss;
/* 2/(3log2)*(ss+...) */
p_h = u+v;
__LO(p_h) = 0;
p_l = v-(p_h-u);
z_h = cp_h*p_h; /* cp_h+cp_l = 2/(3*log2) */
z_l = cp_l*p_h+p_l*cp+dp_l[k];
/* log2(ax) = (ss+..)*2/(3*log2) = n + dp_h + z_h + z_l */
t = (double)n;
t1 = (((z_h+z_l)+dp_h[k])+t);
__LO(t1) = 0;
t2 = z_l-(((t1-t)-dp_h[k])-z_h);
/* split up y into y1+y2 and compute (y1+y2)*(t1+t2) */
y1 = y;
__LO(y1) = 0;
p_l = (y-y1)*t1+y*t2;
p_h = y1*t1;
z = p_l+p_h;
j = __HI(z);
i = __LO(z);
/*
* compute 2**(p_h+p_l)
*/
i = j&0x7fffffff;
k = (i>>20)-0x3ff;
n = 0;
if(i>0x3fe00000) { /* if |z| > 0.5, set n = [z+0.5] */
n = j+(0x00100000>>(k+1));
k = ((n&0x7fffffff)>>20)-0x3ff; /* new k for n */
t = zero;
__HI(t) = (n&~(0x000fffff>>k));
n = ((n&0x000fffff)|0x00100000)>>(20-k);
if(j<0) n = -n;
p_h -= t;
}
t = p_l+p_h;
__LO(t) = 0;
u = t*lg2_h;
v = (p_l-(t-p_h))*lg2+t*lg2_l;
z = u+v;
w = v-(z-u);
t = z*z;
t1 = z - t*(P1+t*(P2+t*(P3+t*(P4+t*P5))));
r = (z*t1)/(t1-two)-(w+z*w);
z = one-(r-z);
__HI(z) += (n<<20);
return s*z;
}
Clearly, 50+ years of research have gone into this, so it's probably very hard to do any better. (One has to appreciate that there are 0 loops, only 2 divisions, and only 6 if statements in the whole algorithm!) The reason for this is, again, the behavior at x = 0, where all derivatives diverge, which makes it extremely hard to keep the error under control: I once had a spline representation with 18 knots that was good up to x = 1e-4, with absolute and relative errors < 5e-4 everywhere, but going to x = 1e-5 ruined everything again.
So, unless the requirement to go arbitrarily close to zero is relaxed, I recommend using the adapted version of e_pow.c given above.
Edit 4
Now that we know that the domain 0.3 <= x <= 1 is sufficient, and that we have very low accuracy requirements, Edit 3 is clearly overkill. As @MvG has demonstrated, the function is so well behaved that a polynomial of degree 7 is sufficient to satisfy the accuracy requirements, which can be considered a single spline segment.
The question arises as to how much better we can still do. It would be interesting to find the polynomial of a given degree that minimizes the maximum error in the interval of interest. The answer is the minimax polynomial, which can be found using the Remez algorithm, as implemented in the Boost library. I like @MvG's idea to clamp the value at x = 1 to 1, which I will do as well. Here is minimax.cpp:
#include <iostream>
#include <iomanip>
#define TARG_PREC 64
#define WORK_PREC (TARG_PREC*2)
#include <boost/multiprecision/cpp_dec_float.hpp>
typedef boost::multiprecision::number<boost::multiprecision::cpp_dec_float<WORK_PREC> > dtype;
using boost::math::pow;
#include <boost/math/tools/remez.hpp>
boost::shared_ptr<boost::math::tools::remez_minimax<dtype> > p_remez;
dtype f(const dtype& x) {
static const dtype one(1), y(0.19029);
return one - pow(one - x, y);
}
void out(const char *descr, const dtype& x, const char *sep="") {
std::cout << descr << boost::math::tools::real_cast<double>(x) << sep << std::endl;
}
int main() {
dtype a(0), b(0.7); // range to optimise over
bool rel_error(false), pin(true);
int orderN(7), orderD(0), skew(0), brake(50);
int prec = 2 + (TARG_PREC * 3010LL)/10000;
std::cout << std::scientific << std::setprecision(prec);
p_remez.reset(new boost::math::tools::remez_minimax<dtype>(
&f, orderN, orderD, a, b, pin, rel_error, skew, WORK_PREC));
out("Max error in interpolated form: ", p_remez->max_error());
p_remez->set_brake(brake);
unsigned i, count(50);
for (i = 0; i < count; ++i) {
std::cout << "Stepping..." << std::endl;
dtype r = p_remez->iterate();
out("Maximum Deviation Found: ", p_remez->max_error());
out("Expected Error Term: ", p_remez->error_term());
out("Maximum Relative Change in Control Points: ", r);
}
boost::math::tools::polynomial<dtype> n = p_remez->numerator();
for(i = n.size(); i--; ) {
out("", n[i], ",");
}
}
Since all parts of boost that we use are header-only, simply build with:
c++ -O3 -I<path/to/boost/headers> minimax.cpp -o minimax
We finally get the coefficients, which, after multiplication by 44330, are:
24538.3409, -42811.1497, 34300.7501, -11284.1276, 4564.5847, 3186.7541, 8442.5236, 0.
The error plot confirms that this is really the best possible degree-7 polynomial approximation: all extrema are of equal magnitude (0.06659).
Should the requirements ever change (while still keeping well away from 0!), the C++ program above can be simply adapted to spit out the new optimal polynomial approximation.
Instead of a lookup table, I'd use a polynomial approximation:
1 - x^0.19029 ≈ - 1073365.91783·x^15 + 8354695.40833·x^14 - 29422576.6529·x^13 + 61993794.537·x^12 - 87079891.4988·x^11 + 86005723.842·x^10 - 61389954.7459·x^9 + 32053170.1149·x^8 - 12253383.4372·x^7 + 3399819.97536·x^6 - 672003.142815·x^5 + 91817.6782072·x^4 - 8299.75873768·x^3 + 469.530204564·x^2 - 16.6572179869·x + 0.722044145701
Or in code:
double f(double x) {
double fx;
fx = - 1073365.91783;
fx = fx*x + 8354695.40833;
fx = fx*x - 29422576.6529;
fx = fx*x + 61993794.537;
fx = fx*x - 87079891.4988;
fx = fx*x + 86005723.842;
fx = fx*x - 61389954.7459;
fx = fx*x + 32053170.1149;
fx = fx*x - 12253383.4372;
fx = fx*x + 3399819.97536;
fx = fx*x - 672003.142815;
fx = fx*x + 91817.6782072;
fx = fx*x - 8299.75873768;
fx = fx*x + 469.530204564;
fx = fx*x - 16.6572179869;
fx = fx*x + 0.722044145701;
return fx;
}
I computed this in sage using the least squares approach:
f(x) = 1-x^(19029/100000) # your function
d = 16 # number of terms, i.e. degree + 1
A = matrix(d, d, lambda r, c: integrate(x^r*x^c, (x, 0, 1)))
b = vector([integrate(x^r*f(x), (x, 0, 1)) for r in range(d)])
A.solve_right(b).change_ring(RDF)
Here is a plot of the error this will entail:
Blue is the error from my 16 term polynomial, while red is the error you'd get from piecewise linear interpolation with 16 equidistant values. As you can see, both errors are quite small for most parts of the range, but will become really huge close to x=0. I actually clipped the plot there. If you can somehow narrow the range of possible values, you could use that as the domain for the integration, and obtain an even better fit for the relevant range. At the cost of worse fit outside, of course. You could also increase the number of terms to obtain a closer fit, although that might also lead to higher oscillations.
I guess you can also combine this approach with the one Stefan posted: use his to split the domain into several parts, then use mine to find a close low degree polynomial for each part.
Update
Since you updated the specification of your question, with regard to both the domain and the error, here is a minimal solution to fit those requirements:
44330·(1 - x^0.19029) ≈ + 23024.9160933·(1-x)^7 - 39408.6473636·(1-x)^6 + 31379.9086193·(1-x)^5 - 10098.7031260·(1-x)^4 + 4339.44098317·(1-x)^3 + 3202.85705860·(1-x)^2 + 8442.42528906·(1-x)
double f(double x) {
double fx, x1 = 1. - x;
fx = + 23024.9160933;
fx = fx*x1 - 39408.6473636;
fx = fx*x1 + 31379.9086193;
fx = fx*x1 - 10098.7031260;
fx = fx*x1 + 4339.44098317;
fx = fx*x1 + 3202.85705860;
fx = fx*x1 + 8442.42528906;
fx = fx*x1;
return fx;
}
I integrated x from 0.293 to 1 or equivalently 1 - x from 0 to 0.707 to keep the worst oscillations outside the relevant domain. I also omitted the constant term, to ensure an exact result at x=1. The maximal error for the range [0.3, 1] now occurs at x=0.3260 and amounts to 0.0972 < 0.1. Here is an error plot, which of course has bigger absolute errors than the one above due to the scale factor k=44330 which has been included here.
I can also state that the first three derivatives of the function will have constant sign over the range in question, so the function is monotonic, convex, and in general pretty well-behaved.
Not meant to answer the question, but it illustrates the Road Not To Go, and thus may be helpful:
This quick-and-dirty C code calculates pow(i, 0.19029) for 0.000 to 1.000 in steps of 0.01. The first half displays the error, in percents, when stored as 1/65536ths (as that theoretically provides slightly over 4 decimals of precision). The second half shows both interpolated and calculated values in steps of 0.001, and the difference between these two.
It kind of looks okay if you read from the bottom up, all 100s and 99.99s there, but about the first 20 values from 0.001 to 0.020 are worthless.
#include <stdio.h>
#include <math.h>
float powers[102];
int main (void)
{
int i, as_int;
double as_real, low, high, delta, approx, calcd, diff;
printf ("calculating and storing:\n");
for (i=0; i<=101; i++)
{
as_real = pow(i/100.0, 0.19029);
as_int = (int)round(65536*as_real);
powers[i] = as_real;
diff = 100*as_real/(as_int/65536.0);
printf ("%.5f %.5f %.5f ~ %.3f\n", i/100.0, as_real, as_int/65536.0, diff);
}
printf ("\n");
printf ("-- interpolating in 1/10ths:\n");
for (i=0; i<1000; i++)
{
as_real = i/1000.0;
low = powers[i/10];
high = powers[1+i/10];
delta = (high-low)/10.0;
approx = low + (i%10)*delta;
calcd = pow(as_real, 0.19029);
diff = 100.0*approx/calcd;
printf ("%.5f ~ %.5f = %.5f +/- %.5f%%\n", as_real, approx, calcd, diff);
}
return 0;
}
You can find a complete, correct standalone implementation of pow in fdlibm. It's about 200 lines of code, about half of which deal with special cases. If you remove the code that deals with special cases you're not interested in I doubt you'll have problems including it in your program.
LutzL's answer is a really good one: calculate your power as (x^1.52232)^(1/8), computing the inner power by spline interpolation or another method. The eighth root deals with the pathological non-differentiable behavior near zero. I took the liberty of mocking up an implementation this way. The code below, however, only does a linear interpolation for x^1.52232; you'd need to get the full coefficients using your favorite numerical mathematics tools. You'll be adding scarcely 40 lines of code to get your needed power, plus however many knots you choose to use for your spline, as dictated by your required accuracy.
Don't be scared by the #include <math.h>; it's just for benchmarking the code.
#include <stdio.h>
#include <math.h>
double my_sqrt(double x) {
/* Newton's method for a square root. */
int i = 0;
double res = 1.0;
if (x > 0) {
for (i = 0; i < 10; i++) {
res = 0.5 * (res + x / res);
}
} else {
res = 0.0;
}
return res;
}
double my_152232(double x) {
/* Cubic spline interpolation for x ** 1.52232. */
int i = 0;
double res = 0.0;
/* coefs[i] will give the cubic polynomial coefficients between x =
i and x = i+1. Out of laziness, the below numbers give only a
linear interpolation. You'll need to do some work and research
to get the spline coefficients. */
double coefs[3][4] = {{0.0, 1.0, 0.0, 0.0},
{-0.872526, 1.872526, 0.0, 0.0},
{-2.032706, 2.452616, 0.0, 0.0}};
if ((x >= 0) && (x < 3.0)) {
i = (int) x;
/* Horner's method cubic. */
res = ((coefs[i][3] * x + coefs[i][2]) * x + coefs[i][1]) * x
+ coefs[i][0];
} else if (x >= 3.0) {
/* Scaled x ** 1.5 once you go off the spline. */
res = 1.024824 * my_sqrt(x * x * x);
}
return res;
}
double my_019029(double x) {
return my_sqrt(my_sqrt(my_sqrt(my_152232(x))));
}
int main() {
int i;
double x = 0.0;
for (i = 0; i < 1000; i++) {
x = 1e-2 * i;
printf("%f %f %f \n", x, my_019029(x), pow(x, 0.19029));
}
return 0;
}
EDIT: If you're just interested in a small region like [0,1], even simpler is to peel off one sqrt(x) and compute x^1.02232, which is quite well behaved, using a Taylor series:
double my_152232(double x) {
double part_050000 = my_sqrt(x);
double part_102232 = 1.02232 * x + 0.0114091 * x * x - 3.718147e-3 * x * x * x;
return part_102232 * part_050000;
}
This gets you within 1% of the exact power for approximately [0.1,6], though getting the singularity exactly right is always a challenge. Even so, this three-term Taylor series gets you within 2.3% for x = 0.001.

I need a spatial index in C

I'm working on my gEDA fork and want to get rid of the existing simple tile-based system [1] in favour of a real spatial index [2].
An algorithm that efficiently finds points is not enough: I need to find objects with non-zero extent. Think in terms of objects having bounding rectangles, that pretty much captures the level of detail I need in the index. Given a search rectangle, I need to be able to efficiently find all objects whose bounding rectangles are inside, or that intersect, the search rectangle.
The index can't be read-only: gschem is a schematic capture program, and the whole point of it is to move things around the schematic diagram. So things are going to be a'changing. While I can afford insertion to be a bit more expensive than searching, it can't be too much more expensive, and deletion must also be both possible and reasonably cheap. But the most important requirement is the asymptotic behaviour: searching should be O(log n) if it can't be O(1). Insertion and deletion should preferably be O(log n), but O(n) would be okay. I definitely don't want anything worse than O(n) per action (obviously O(n log n) is expected for an all-objects operation).
What are my options? I don't feel clever enough to evaluate the various options. Ideally there'd be some C library that will do all the clever stuff for me, but I'll mechanically implement an algorithm I may or may not fully understand if I have to. gEDA uses glib by the way, if that helps to make a recommendation.
Footnotes:
[1] Standard gEDA divides a schematic diagram into a fixed number (currently 100) of "tiles" which serve to speed up searches for objects in a bounding rectangle. This is obviously good enough to make most schematics fast enough to search, but the way it's done causes other problems: far too many functions require a pointer to a de-facto global object. The tile geometry is also fixed: it would be possible to defeat the tiling system completely simply by panning (and possibly zooming) to an area covered by only one tile.
[2] A legitimate answer would be to keep elements of the tiling system, but to fix its weaknesses: teaching it to span the entire space, and to sub-divide when necessary. But I'd like others to add their two cents before I autocratically decide that this is the best way.
A nice data structure for a mix of points and lines would be an R-tree or one of its derivatives (e.g. R*-Tree or a Hilbert R-Tree). Given you want this index to be dynamic and serializable, I think using SQLite's R*-Tree module would be a reasonable approach.
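A minimal sketch of what that looks like through SQLite's C API (the table and column names are mine, and it needs a SQLite build with the R*-Tree module enabled, i.e. SQLITE_ENABLE_RTREE):
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    sqlite3_open(":memory:", &db);

    /* one row per object: its id and its bounding rectangle */
    sqlite3_exec(db,
        "CREATE VIRTUAL TABLE obj_index USING rtree("
        "id, minX, maxX, minY, maxY);", 0, 0, 0);
    sqlite3_exec(db,
        "INSERT INTO obj_index VALUES(1, 10.0, 20.0, 30.0, 40.0);", 0, 0, 0);

    /* all objects whose bounding box touches the search rectangle
       (0,0)-(15,35); inserts and deletes are ordinary SQL too */
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db,
        "SELECT id FROM obj_index "
        "WHERE minX <= 15.0 AND maxX >= 0.0 "
        "AND minY <= 35.0 AND maxY >= 0.0;", -1, &stmt, 0);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("hit: %lld\n", (long long)sqlite3_column_int64(stmt, 0));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}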
If you can tolerate C++, libspatialindex has a mature and flexible R-tree implementation which supports dynamic inserts/deletes and serialization.
Your needs sound very similar to what is used in collision detection algorithms for games and physics simulations. There are several open source C++ libraries that handle this in 2-D (Box2D) or 3-D (Bullet physics). Although your question is for C, you may find their documentation and implementations useful.
Usually this is split into two phases:
A fast broad phase that approximates objects by their axis-aligned bounding box (AABB), and determines pairs of AABBs that touch or overlap.
A slower narrow phase that calculates the points of geometric overlap for pairs of objects whose AABBs touch or overlap.
Physics engines also use spatial coherence to further reduce the pairs of objects that are compared, but this optimization probably won't help your application.
The broad phase is usually implemented with an O(N log N) algorithm like sweep and prune. You may be able to accelerate this by using it in conjunction with the current tile approach (one of Nvidia's GPU Gems chapters describes this hybrid approach). The narrow phase is quite costly for each pair, and may be overkill for your needs. The GJK algorithm is often used for convex objects in this step, although faster algorithms exist for more specialized cases (e.g. box/circle and box/sphere collisions).
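The broad-phase test itself is tiny. A sketch of the axis-aligned overlap predicate (type and function names are mine):
#include <stdbool.h>

typedef struct {
    double min_x, min_y;
    double max_x, max_y;
} AABB;

/* true if the two rectangles touch or overlap; this is also exactly the
   test a search rectangle needs against an object's bounding rectangle */
static bool aabb_overlap(const AABB *a, const AABB *b)
{
    return a->min_x <= b->max_x && b->min_x <= a->max_x &&
           a->min_y <= b->max_y && b->min_y <= a->max_y;
}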
This sounds like an application well-suited to a quadtree (assuming you are interested only in 2D). The quadtree is hierarchical (good for searching) and its spatial resolution is dynamic (allowing higher resolution in areas that need it).
I've always rolled my own quadtrees, but here is a library that appears reasonable: http://www.codeproject.com/Articles/30535/A-Simple-QuadTree-Implementation-in-C
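To give a sense of the shape of the data structure, here is a bare-bones node layout and query walk (a sketch with names of my choosing; a real implementation also needs insert, split, and delete logic):
#define QT_MAX_OBJECTS 8   /* split threshold, chosen arbitrarily */

typedef struct BBox {
    double min_x, min_y, max_x, max_y;
} BBox;

typedef struct QuadNode {
    BBox bounds;                     /* region of space this node covers   */
    struct QuadNode *child[4];       /* NW/NE/SW/SE; all NULL while a leaf */
    void *objects[QT_MAX_OBJECTS];   /* objects stored at this node        */
    int n_objects;
} QuadNode;

static int bbox_overlap(const BBox *a, const BBox *b)
{
    return a->min_x <= b->max_x && b->min_x <= a->max_x &&
           a->min_y <= b->max_y && b->min_y <= a->max_y;
}

/* visit every object stored under nodes whose region touches the query;
   the caller re-checks each visited object's exact bounding box */
static void qt_search(const QuadNode *n, const BBox *query,
                      void (*visit)(void *obj))
{
    if (n == NULL || !bbox_overlap(&n->bounds, query))
        return;
    for (int i = 0; i < n->n_objects; i++)
        visit(n->objects[i]);
    for (int i = 0; i < 4; i++)
        qt_search(n->child[i], query, visit);
}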
It is easy to do. It's hard to do fast. This sounds like a problem I worked on, where there was a vast list of min,max values and, given a value, it had to return how many min,max pairs overlapped that value. You just have it in two dimensions. So you do it with two trees, one for each direction, then do an intersection on the results. This is really fast.
#include <iostream>
#include <fstream>
#include <map>
using namespace std;
typedef unsigned int UInt;
class payLoad {
public:
UInt starts;
UInt finishes;
bool isStart;
bool isFinish;
payLoad ()
{
starts = 0;
finishes = 0;
isStart = false;
isFinish = false;
}
};
typedef map<UInt,payLoad> ExtentMap;
//==============================================================================
class Extents
{
ExtentMap myExtentMap;
public:
void ReadAndInsertExtents ( const char* fileName )
{
UInt start, finish;
ExtentMap::iterator EMStart;
ExtentMap::iterator EMFinish;
ifstream efile ( fileName);
cout << fileName << " filename" << endl;
// loop on the extraction itself, not eof(), so the last pair isn't processed twice
while (efile >> start >> finish) {
//cout << start << " start " << finish << " finish" << endl;
EMStart = myExtentMap.find(start);
if (EMStart==myExtentMap.end()) {
payLoad pay;
pay.isStart = true;
myExtentMap[start] = pay;
EMStart = myExtentMap.find(start);
}
EMFinish = myExtentMap.find(finish);
if (EMFinish==myExtentMap.end()) {
payLoad pay;
pay.isFinish = true;
myExtentMap[finish] = pay;
EMFinish = myExtentMap.find(finish);
}
EMStart->second.starts++;
EMFinish->second.finishes++;
EMStart->second.isStart = true;
EMFinish->second.isFinish = true;
// for (EMStart=myExtentMap.begin(); EMStart!=myExtentMap.end(); EMStart++)
// cout << "| key " << EMStart->first << " count " << EMStart->second.value << " S " << EMStart->second.isStart << " F " << EMStart->second.isFinish << endl;
}
efile.close();
UInt count = 0;
for (EMStart=myExtentMap.begin(); EMStart!=myExtentMap.end(); EMStart++)
{
count += EMStart->second.starts - EMStart->second.finishes;
EMStart->second.starts = count + EMStart->second.finishes;
}
// for (EMStart=myExtentMap.begin(); EMStart!=myExtentMap.end(); EMStart++)
// cout << "||| key " << EMStart->first << " count " << EMStart->second.starts << " S " << EMStart->second.isStart << " F " << EMStart->second.isFinish << endl;
}
void ReadAndCountNumbers ( const char* fileName )
{
UInt number, count;
ExtentMap::iterator EMStart;
ExtentMap::iterator EMTemp;
if (myExtentMap.empty()) return;
ifstream nfile ( fileName);
cout << fileName << " filename" << endl;
// again, loop on the extraction itself rather than eof()
while (nfile >> number)
{
count = 0;
//cout << number << " number ";
EMStart = myExtentMap.find(number);
EMTemp = myExtentMap.end();
if (EMStart==myExtentMap.end()) { // if we don't find the number then create one so we can find the nearest number.
payLoad pay;
myExtentMap[ number ] = pay;
EMStart = EMTemp = myExtentMap.find(number);
if ((EMStart!=myExtentMap.begin()) && (!EMStart->second.isStart))
{
EMStart--;
}
}
if (EMStart->first < number) {
while (!EMStart->second.isFinish) {
//cout << "stepped through looking for end - key" << EMStart->first << endl;
EMStart++;
}
if (EMStart->first >= number) {
count = EMStart->second.starts;
//cout << "found " << count << endl;
}
}
else if (EMStart->first==number) {
count = EMStart->second.starts;
}
cout << count << endl;
//cout << "| count " << count << " key " << EMStart->first << " S " << EMStart->second.isStart << " F " << EMStart->second.isFinish<< " V " << EMStart->second.value << endl;
if (EMTemp != myExtentMap.end())
{
myExtentMap.erase(EMTemp->first);
}
}
nfile.close();
}
};
//==============================================================================
int main (int argc, char* argv[]) {
Extents exts;
exts.ReadAndInsertExtents ( "..//..//extents.txt" );
exts.ReadAndCountNumbers ( "..//../numbers.txt" );
return 0;
}
The extents test file was 1.5 MB of:
0 200000
1 199999
2 199998
3 199997
4 199996
5 199995
....
99995 100005
99996 100004
99997 100003
99998 100002
99999 100001
The numbers file was like:
102731
104279
109316
104859
102165
105762
101464
100755
101068
108442
107777
101193
104299
107080
100958
.....
Even reading the two files from disk (the extents were 1.5 MB and the numbers 780 kB), and with this really large number of values and lookups, this runs in a fraction of a second. Held entirely in memory it would be lightning quick.
