How does this implementation of 1D IDCT work? - c

I have an implementation of an Inverse Discrete Cosine Transform and I'm trying to figure out how they got to this code. So far, I've figured out that this is probably an optimized implementation of the Cooley-Tukey radix-2 decimation-in-time algorithm, applied to a DCT instead of a DFT (Discrete Fourier Transform).
However, I'm still at a loss about exactly what happens at each stage. I figured that the Wx constants are probably the twiddle factors.
Can anybody provide a reference to an explanation, or provide some explanation to this code?
//Twiddle factors
#define W1 2841 /* 2048*sqrt(2)*cos(1*pi/16) */
#define W2 2676 /* 2048*sqrt(2)*cos(2*pi/16) */
#define W3 2408 /* 2048*sqrt(2)*cos(3*pi/16) */
#define W5 1609 /* 2048*sqrt(2)*cos(5*pi/16) */
#define W6 1108 /* 2048*sqrt(2)*cos(6*pi/16) */
#define W7 565 /* 2048*sqrt(2)*cos(7*pi/16) */
//Inverse Discrete Cosine Transform on a row of 8 DCT coefficients.
NJ_INLINE void njRowIDCT(int* blk) {
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;
    int t;
    if (!((x1 = blk[4] << 11)
        | (x2 = blk[6])
        | (x3 = blk[2])
        | (x4 = blk[1])
        | (x5 = blk[7])
        | (x6 = blk[5])
        | (x7 = blk[3])))
    {
        blk[0] = blk[1] = blk[2] = blk[3] = blk[4] = blk[5] = blk[6] = blk[7] = blk[0] << 3;
        return;
    }
    x0 = (blk[0] << 11) + 128; //For rounding at fourth stage
    //First stage
    /*What exactly are we doing here? Do the x values have a meaning?*/
    x8 = W7 * (x4 + x5);
    x4 = x8 + (W1 - W7) * x4;
    x5 = x8 - (W1 + W7) * x5;
    x8 = W3 * (x6 + x7);
    x6 = x8 - (W3 - W5) * x6;
    x7 = x8 - (W3 + W5) * x7;
    //Second stage
    x8 = x0 + x1;
    x0 -= x1;
    x1 = W6 * (x3 + x2);
    x2 = x1 - (W2 + W6) * x2;
    x3 = x1 + (W2 - W6) * x3;
    x1 = x4 + x6;
    x4 -= x6;
    x6 = x5 + x7;
    x5 -= x7;
    //Third stage
    x7 = x8 + x3;
    x8 -= x3;
    x3 = x0 + x2;
    x0 -= x2;
    x2 = (181 * (x4 + x5) + 128) >> 8;
    x4 = (181 * (x4 - x5) + 128) >> 8;
    //Fourth stage
    blk[0] = (x7 + x1) >> 8; //bit shift is to emulate 8 bit fixed point precision
    blk[1] = (x3 + x2) >> 8;
    blk[2] = (x0 + x4) >> 8;
    blk[3] = (x8 + x6) >> 8;
    blk[4] = (x8 - x6) >> 8;
    blk[5] = (x0 - x4) >> 8;
    blk[6] = (x3 - x2) >> 8;
    blk[7] = (x7 - x1) >> 8;
}

I'm not an expert at DCTs but I have written a few FFT implementations in my time so I'm going to take a stab at answering this. Please take the following with a pinch of salt.
void njRowIDCT(int* blk)
You correctly say that the algorithm appears to be an 8-length radix-2 DCT that uses fixed-point arithmetic with 24:8 precision. I'm guessing the precision because the last stage right-shifts by 8 to get the desired output (that and the tell-tale comment ;)
Because it's 8-length, its power is 3 (2^3 = 8), meaning there are 3 stages in the DCT. So far this is all very similar to FFTs. The "fourth stage" seems to be just a scaling to recover the original precision after the fixed-point arithmetic.
As far as I can see, the input stage is the bit-reversal from the input array blk to the local variables x0-x7. x8 seems to be a temporary variable. Sorry I can't be more descriptive than that.
Bit reversal stage
Update
Take a look at DSP For Scientists and Engineers. It gives a clear and precise explanation of signal processing topics. This chapter is on the DCT (please skip to p497).
The Wn (twiddle factors) correspond to the Basis Functions in this chapter, although note this is describing an 8x8 (2D) DCT.
With regard to the 3 stages that I mentioned, compare to the description of an 8-point FFT:
The FFT is performing butterflies on the bit-reversed input array (which are essentially complex multiply-adds), multiplying one path by the Wn or twiddle factor along the way. The FFT is performed in stages. I still haven't worked out what your DCT code is doing, but decomposing it into a diagram like this may help.
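To make the butterfly idea concrete, here is a minimal sketch of a single radix-2 decimation-in-time butterfly in C (the struct and function names are my own, purely illustrative):
typedef struct { double re, im; } cplx;

/* One radix-2 DIT butterfly: (a, b) -> (a + w*b, a - w*b), where w is the
   twiddle factor for this stage/position. */
static void butterfly(cplx *a, cplx *b, cplx w)
{
    cplx t = { w.re * b->re - w.im * b->im,   /* t = w * b */
               w.re * b->im + w.im * b->re };
    b->re = a->re - t.re;  b->im = a->im - t.im;
    a->re += t.re;         a->im += t.im;
}
An N-point DIT FFT applies log2(N) such stages to the bit-reversed input; the DCT code above appears to be doing an analogous (but real-valued and DCT-specific) decomposition.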
That or someone who knows what they're talking about step up ;-)
I'll revisit this page and edit as I decipher more code

Both the DFT and the DCT are just linear transforms, which can be represented as a single complex matrix multiply (sometimes pruned for strictly real input). So you could just combine the above equations to get the formula for each final term, which should end up equivalent to one row of the linear transform matrix (ignoring rounding issues). Then see how the above code sequence is manually doing a common-subexpression optimization or refactoring between and/or within row calculations.
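One concrete way to see this (and to check the fixed-point routine) is to compare it against a direct evaluation of the 8-point inverse DCT definition. A rough reference sketch in C (my own code, not from the original source; the optimized routine folds extra constant scale factors into its fixed-point arithmetic, so the two should only agree up to an overall scale and rounding):
#include <math.h>

/* Reference (slow) 8-point inverse DCT, orthonormal DCT-III convention:
   out[x] = sum_u c(u) * in[u] * cos((2x+1)*u*pi/16),
   with c(0) = sqrt(1/8) and c(u>0) = sqrt(2/8). */
static void idct8_reference(const double in[8], double out[8])
{
    const double pi = 3.14159265358979323846;
    for (int x = 0; x < 8; x++) {
        double s = 0.0;
        for (int u = 0; u < 8; u++) {
            double c = (u == 0) ? sqrt(1.0 / 8.0) : sqrt(2.0 / 8.0);
            s += c * in[u] * cos((2 * x + 1) * u * pi / 16.0);
        }
        out[x] = s;
    }
}
Feeding the same 8 coefficients through both and comparing the outputs (after dividing out the constant scale) is a quick way to verify which factorization the optimized code implements.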


Is there a way to improve this pygame colour filter algorithm [duplicate]

I've made a function to find a color within an image, and return x, y. Now I need to add a new function, where I can find a color with a given tolerance. Should be easy?
Code to find color in image, and return x, y:
from PIL import ImageGrab

def FindColorIn(r, g, b, xmin, xmax, ymin, ymax):
    image = ImageGrab.grab()
    for x in range(xmin, xmax):
        for y in range(ymin, ymax):
            px = image.getpixel((x, y))
            if px[0] == r and px[1] == g and px[2] == b:
                return x, y

def FindColor(r, g, b):
    image = ImageGrab.grab()
    size = image.size
    pos = FindColorIn(r, g, b, 1, size[0], 1, size[1])
    return pos
Outcome:
Taking from the answers, the normal methods of comparing two colors are Euclidean distance or Chebyshev distance.
I decided to mostly use (squared) Euclidean distance, and multiple different color spaces: LAB, deltaE (LCH), XYZ, HSL, and RGB. In my code, most color spaces use squared Euclidean distance to compute the difference.
For example, with LAB, RGB and XYZ a simple squared Euclidean distance does the trick:
if ((X-X1)^2 + (Y-Y1)^2 + (Z-Z1)^2) <= (Tol^2) then
...
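Note that the ^ in the snippet above means "to the power of" (in C that operator would be XOR, so the squares have to be spelled out). A self-contained C sketch of the same test, with names of my own choosing:
#include <stdbool.h>

/* True if two points in some colour space (XYZ, LAB, RGB, ...) are within tol,
   using squared Euclidean distance so that no sqrt is needed. */
static bool within_tolerance(double x0, double y0, double z0,
                             double x1, double y1, double z1, double tol)
{
    double dx = x0 - x1, dy = y0 - y1, dz = z0 - z1;
    return dx * dx + dy * dy + dz * dz <= tol * tol;
}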
LCH and HSL are a little more complicated, as both have a cylindrical hue, but a little math solves that; then it's on to using squared Euclidean distance here as well.
In most of these cases I've added "separate parameters" for the tolerance of each channel (using 1 global tolerance, and alternative "modifiers": HueTol := Tolerance * hueMod or LightTol := Tolerance * LightMod).
It seems like colorspaces built on top of XYZ (LAB, LCH) perform best in many of my scenarios. HSL yields very good results in some cases, though, and it's much cheaper to convert to from RGB; RGB is also great and fills most of my needs.
Computing distances between RGB colours, in a way that's meaningful to the eye, isn't as easy as just taking the Euclidean distance between the two RGB vectors.
There is an interesting article about this here: http://www.compuphase.com/cmetric.htm
The example implementation in C is this:
typedef struct {
unsigned char r, g, b;
} RGB;
double ColourDistance(RGB e1, RGB e2)
{
long rmean = ( (long)e1.r + (long)e2.r ) / 2;
long r = (long)e1.r - (long)e2.r;
long g = (long)e1.g - (long)e2.g;
long b = (long)e1.b - (long)e2.b;
return sqrt((((512+rmean)*r*r)>>8) + 4*g*g + (((767-rmean)*b*b)>>8));
}
It shouldn't be too difficult to port to Python.
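For example, a small tolerance check built on top of it might look like this (my own sketch; the threshold of 50 is arbitrary and would need tuning):
#include <math.h>
#include <stdio.h>

int main(void)
{
    RGB target = {200, 30, 30}, candidate = {210, 40, 25};
    double tol = 50.0;   /* arbitrary threshold, tune to taste */

    if (ColourDistance(target, candidate) <= tol)
        printf("close enough\n");
    else
        printf("too different\n");
    return 0;
}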
EDIT:
Alternatively, as suggested in this answer, you could use HLS and HSV. The colorsys module seems to have functions to make the conversion from RGB. Its documentation also links to these pages, which are worth reading to understand why RGB Euclidean distance doesn't really work:
http://www.poynton.com/ColorFAQ.html
http://www.cambridgeincolour.com/tutorials/color-space-conversion.htm
EDIT 2:
According to this answer, this library should be useful: http://code.google.com/p/python-colormath/
Here is an optimized Python version adapted from Bruno's answer:
def ColorDistance(rgb1, rgb2):
    '''d = {} distance between two colors(3)'''
    rm = 0.5 * (rgb1[0] + rgb2[0])
    d = sum((2 + rm, 4, 3 - rm) * (rgb1 - rgb2) ** 2) ** 0.5
    return d
usage:
>>> import numpy
>>> rgb1 = numpy.array([1,1,0])
>>> rgb2 = numpy.array([0,0,0])
>>> ColorDistance(rgb1,rgb2)
2.5495097567963922
Instead of this:
if px[0] == r and px[1] == g and px[2] == b:
Try this:
if max(map(lambda a,b: abs(a-b), px, (r,g,b))) < tolerance:
Where tolerance is the maximum difference you're willing to accept in any of the color channels.
What it does is to subtract each channel from your target values, take the absolute values, then the max of those.
Assuming that rtol, gtol, and btol are the tolerances for r,g, and b respectively, why not do:
if abs(px[0] - r) <= rtol and \
   abs(px[1] - g) <= gtol and \
   abs(px[2] - b) <= btol:
    return x, y
Here's a vectorised Python (numpy) version of Bruno and Developer's answers (i.e. an implementation of the approximation derived here) that accepts a pair of numpy arrays of shape (x, 3) where individual rows are in [R, G, B] order and individual colour values ∈[0, 1].
You can reduce it to a two-liner at the expense of readability. I'm not entirely sure whether it's the most optimised version possible, but it should be good enough.
import numpy as np

def colour_dist(fst, snd):
    rm = 0.5 * (fst[:, 0] + snd[:, 0])
    drgb = (fst - snd) ** 2
    t = np.array([2 + rm, 4 + 0 * rm, 3 - rm]).T
    return np.sqrt(np.sum(t * drgb, 1))
It was evaluated against Developer's per-element version above, and produces the same results (save for floating precision errors in two cases out of one thousand).
Here is a cleaner Python implementation of the function stated here. The function takes two image paths, reads them using cv.imread, and outputs a matrix with each cell holding the colour difference between the corresponding pixels. You can easily change it to just match two colours.
import numpy as np
import cv2 as cv

def col_diff(img1, img2):
    img_bgr1 = cv.imread(img1).astype(np.float64)  # since opencv reads as B, G, R
    img_bgr2 = cv.imread(img2).astype(np.float64)  # cast to float to avoid uint8 overflow
    r_m = 0.5 * (img_bgr1[:, :, 2] + img_bgr2[:, :, 2])
    delta_rgb = np.square(img_bgr1 - img_bgr2)
    cols_diffs = (delta_rgb[:, :, 2] * (2 + r_m / 256) + delta_rgb[:, :, 1] * 4 +
                  delta_rgb[:, :, 0] * (2 + (255 - r_m) / 256))
    cols_diffs = np.sqrt(cols_diffs)
    # normalize the values to the range [0, 1]
    cols_diffs_min = np.min(cols_diffs)
    cols_diffs_max = np.max(cols_diffs)
    cols_diffs_normalized = (cols_diffs - cols_diffs_min) / (cols_diffs_max - cols_diffs_min)
    return np.sqrt(cols_diffs_normalized)
Simple:
def eq_with_tolerance(a, b, t):
    return a - t <= b <= a + t

def FindColorIn(r, g, b, xmin, xmax, ymin, ymax, tolerance=0):
    image = ImageGrab.grab()
    for x in range(xmin, xmax):
        for y in range(ymin, ymax):
            px = image.getpixel((x, y))
            if eq_with_tolerance(r, px[0], tolerance) and eq_with_tolerance(g, px[1], tolerance) and eq_with_tolerance(b, px[2], tolerance):
                return x, y
from pyautogui source code
def pixelMatchesColor(x, y, expectedRGBColor, tolerance=0):
    r, g, b = screenshot().getpixel((x, y))
    exR, exG, exB = expectedRGBColor
    return (abs(r - exR) <= tolerance) and (abs(g - exG) <= tolerance) and (abs(b - exB) <= tolerance)
you just need a little fix and you're ready to go.
Here is a simple function that does not require any libraries:
def color_distance(rgb1, rgb2):
    rm = 0.5 * (rgb1[0] + rgb2[0])
    rd = ((2 + rm) * (rgb1[0] - rgb2[0])) ** 2
    gd = (4 * (rgb1[1] - rgb2[1])) ** 2
    bd = ((3 - rm) * (rgb1[2] - rgb2[2])) ** 2
    return (rd + gd + bd) ** 0.5
assuming that rgb1 and rgb2 are RGB tuples

Matlab: Help understanding sinusoidal curve fit

I have an unknown sine wave with some noise that I am trying to reconstruct. The ultimate goal is to come up with a C algorithm to find the amplitude, dc offset, phase, and frequency of a sine wave but I am prototyping in Matlab (Octave actually) first. The sine wave is of the form
y = a + b*sin(c + 2*pi*d*t)
a = dc offset
b = amplitude
c = phase shift (rad)
d = frequency
I have found this example and in the comments John D'Errico presents a method for using Least Squares to fit a sine wave to data. It is a neat little algorithm and works remarkably well but I am having difficulties understanding one aspect. The algorithm is as follows:
Algorithm
Suppose you have a sine wave of the form:
(1) y = a + b*sin(c+d*x)
Using the identity
(2) sin(u+v) = sin(u)*cos(v) + cos(u)*sin(v)
We can rewrite (1) as
(3) y = a + b*sin(c)*cos(d*x) + b*cos(c)*sin(d*x)
Since b*sin(c) and b*cos(c) are constants, these can be wrapped into constants b1 and b2.
(4) y = a + b1*cos(d*x) + b2*sin(d*x)
This is the equation that is used to fit the sine wave. A function is created to generate regression coefficients and a sum-of-squares residual error.
(5) cfun = @(d) [ones(size(x)), sin(d*x), cos(d*x)] \ y;
(6) sumerr2 = @(d) sum((y - [ones(size(x)), sin(d*x), cos(d*x)] * cfun(d)) .^ 2);
Next, sumerr2 is minimized for the frequency d using fminbnd with lower limit l1 and upper limit l2.
(7) dopt = fminbnd(sumerr2, l1, l2);
Now a, b, and c can be computed. The coefficients to compute a, b, and c are given from (4) at dopt
(8) abb = cfun(dopt);
The dc offset is simply the first value
(9) a = abb(1);
A trig identity is used to find b
(10) sin(u)^2 + cos(u)^2 = 1
(11) b = sqrt(b1^2 + b2^2)
(12) b = norm(abb([2 3]));
Finally the phase offset is found
(13) b1 = b*cos(c)
(14) c = acos(b1 / b);
(15) c = acos(abb(2) / b);
Question
What is going on in (5) and (6)? Can someone break down what is happening in pseudo-code or perhaps perform the same function in a more explicit way?
(5) cfun = @(d) [ones(size(x)), sin(d*x), cos(d*x)] \ y;
(6) sumerr2 = @(d) sum((y - [ones(size(x)), sin(d*x), cos(d*x)] * cfun(d)) .^ 2);
Also, given (4) shouldn't it be:
[ones(size(x)), cos(d*x), sin(d*x)]
Code
Here is the Matlab code in full. Blue line is the actual signal. Green line is the reconstructed signal.
close all
clear all
y = [111,140,172,207,243,283,319,350,383,414,443,463,483,497,505,508,503,495,479,463,439,412,381,347,311,275,241,206,168,136,108,83,63,54,45,43,41,45,51,63,87,109,137,168,204,239,279,317,348,382,412,439,463,479,496,505,508,505,495,483,463,441,414,383,350,314,278,245,209,175,140,140,110,85,63,51,45,41,41,44,49,63,82,105,135,166,200,236,277,313,345,379,409,438,463,479,495,503,508,503,498,485,467,444,415,383,351,318,281,247,211,174,141,111,87,67,52,45,42,41,45,50,62,79,104,131,163,199,233,273,310,345,377,407,435,460,479,494,503,508,505,499,486,467,445,419,387,355,319,284,249,215,177,143,113,87,67,55,46,43,41,44,48,63,79,102,127,159,191,232,271,307,343,373,404,437,457,478,492,503,508,505,499,488,470,447,420,391,360,323,287,254,215,182,147,116,92,70,55,46,43,42,43,49,60,76,99,127,159,191,227,268,303,339,371,401,431,456,476,492,502,507,507,500,488,471,447,424,392,361,326,287,287,255,220,185,149,119,92,72,55,47,42,41,43,47,57,76,95,124,156,189,223,258,302,337,367,399,428,456,476,492,502,508,508,501,489,471,451,425,396,364,328,294,259,223,188,151,119,95,72,57,46,43,44,43,47,57,73,95,124,153,187,222,255,297,335,366,398,426,451,471,494,502,507,508,502,489,474,453,428,398,367,332,296,262,227,191,154,124,95,75,60,47,43,41,41,46,55,72,94,119,150,183,215,255,295,331,361,396,424,447,471,489,500,508,508,502,492,475,454,430,401,369,335,299,265,228,191,157,126,99,76,59,49,44,41,41,46,55,72,92,118,147,179,215,252,291,328,360,392,422,447,471,488,499,507,508,503,493,477,456,431,403]';
fs = 100e3;
N = length(y);
t = (0:1/fs:N/fs-1/fs)';
cfun = @(d) [ones(size(t)), sin(2*pi*d*t), cos(2*pi*d*t)]\y;
sumerr2 = @(d) sum((y - [ones(size(t)), sin(2*pi*d*t), cos(2*pi*d*t)] * cfun(d)) .^ 2);
dopt = fminbnd(sumerr2, 2300, 2500);
abb = cfun(dopt);
a = abb(1);
b = norm(abb([2 3]));
c = acos(abb(2) / b);
d = dopt;
y_reconstructed = a + b*sin(2*pi*d*t - c);
figure(1)
hold on
title('Signal Reconstruction')
grid on
plot(t*1000, y, 'b')
plot(t*1000, y_reconstructed, 'g')
ylim = get(gca, 'ylim');
xlim = get(gca, 'xlim');
text(xlim(1), ylim(2) - 15, [num2str(b) ' cos(2\pi * ' num2str(d) 't - ' ...
num2str(c * 180/pi) ') + ' num2str(a)]);
hold off
(5) and (6) define anonymous functions that can be used within the optimisation code. cfun(d) builds the design matrix [ones, sin(d*x), cos(d*x)] and solves the linear least-squares problem for the coefficients of (4); the backslash operator returns the least-squares solution when the system is overdetermined, and d is the optimisation parameter that will be varied. Similarly, sumerr2 is another anonymous function, this time returning a scalar: the sum of squared residuals for that value of d. That scalar is the error that fminbnd minimises.
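Since the stated end goal is a C implementation, here is a rough sketch of what cfun(d) amounts to for a single trial frequency d, written out explicitly (illustrative code of my own; Matlab's backslash actually uses a QR factorisation, which is numerically more robust than the normal equations used here):
#include <math.h>

/* For a fixed trial frequency d, fit y ~ c[0] + c[1]*sin(2*pi*d*t) + c[2]*cos(2*pi*d*t)
   in the least-squares sense by forming and solving the 3x3 normal equations
   (A'A)c = A'y, where the rows of A are [1, sin(2*pi*d*t[i]), cos(2*pi*d*t[i])]. */
static void fit_for_frequency(const double *t, const double *y, int n,
                              double d, double c[3])
{
    const double two_pi = 6.283185307179586;
    double AtA[3][3] = {{0.0}}, Aty[3] = {0.0};

    for (int i = 0; i < n; i++) {
        double row[3] = { 1.0, sin(two_pi * d * t[i]), cos(two_pi * d * t[i]) };
        for (int r = 0; r < 3; r++) {
            Aty[r] += row[r] * y[i];
            for (int k = 0; k < 3; k++)
                AtA[r][k] += row[r] * row[k];
        }
    }

    /* Solve the 3x3 system with Cramer's rule (adequate for a small, well-conditioned system). */
    double det = AtA[0][0] * (AtA[1][1] * AtA[2][2] - AtA[1][2] * AtA[2][1])
               - AtA[0][1] * (AtA[1][0] * AtA[2][2] - AtA[1][2] * AtA[2][0])
               + AtA[0][2] * (AtA[1][0] * AtA[2][1] - AtA[1][1] * AtA[2][0]);
    for (int col = 0; col < 3; col++) {
        double M[3][3];
        for (int r = 0; r < 3; r++)
            for (int k = 0; k < 3; k++)
                M[r][k] = (k == col) ? Aty[r] : AtA[r][k];
        c[col] = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])) / det;
    }
}
sumerr2(d) is then just the sum over all samples of (y[i] - (c[0] + c[1]*sin(2*pi*d*t[i]) + c[2]*cos(2*pi*d*t[i])))^2, and fminbnd performs a scalar search over d for the value that minimises that number.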

Calculate (x exponent 0.19029) with low memory using lookup table?

I'm writing a C program for a PIC micro-controller which needs to do a very specific exponential function. I need to calculate the following:
A = k . (1 - (p/p0)^0.19029)
k and p0 are constant, so it's all pretty simple apart from finding x^0.19029
(p/p0) ratio would always be in the range 0-1.
It works well if I add in math.h and use the power function, except that uses up all of the available 16 kB of program memory. Talk about bloatware! (Rest of program without power function = ~20% flash memory usage; add math.h and power function, =100%).
I'd like the program to do some other things as well. I was wondering if I can write a special case implementation for x^0.19029, maybe involving iteration and some kind of lookup table.
My idea is to generate a look-up table for the function x^0.19029, with perhaps 10-100 values of x in the range 0-1. The code would find a close match, then (somehow) iteratively refine it by re-scaling the lookup table values. However, this is where I get lost because my tiny brain can't visualise the maths involved.
Could this approach work?
Alternatively, I've looked at using Exp(x) and Ln(x), which can be implemented with a Taylor expansion. b^x can then be found with:
b^x = (e^(ln b))^x = e^(x.ln(b))
(See: Wikipedia - Powers via Logarithms)
This looks a bit tricky and complicated to me, though. Am I likely to get the implementation smaller than the compiler's math library, and can I simplify it for my special case (i.e. base = 0-1, exponent always 0.19029)?
Note that RAM usage is OK at the moment, but I've run low on Flash (used for code storage). Speed is not critical. Somebody has already suggested that I use a bigger micro with more flash memory, but that sounds like profligate wastefulness!
[EDIT] I was being lazy when I said "(p/p0) ratio would always be in the range 0-1". Actually it will never reach 0, and I did some calculations last night and decided that in fact a range of 0.3 - 1 would be quite adequate! This means that some of the simpler solutions below should be suitable. Also, the "k" in the above is 44330, and I'd like the error in the final result to be less than 0.1. I guess that means an error in the (p/p0)^0.19029 needs to be less than 1/443300 or 2.256e-6
Use splines. The relevant part of the function is shown in the figure below. It varies approximately like the 5th root, so the problematic zone is close to p / p0 = 0. There is mathematical theory on how to optimally place the knots of splines to minimize the error (see Carl de Boor: A Practical Guide to Splines). Usually one constructs the spline in B-form ahead of time (using toolboxes such as Matlab's spline toolbox - also written by C. de Boor), then converts to piecewise polynomial representation for fast evaluation.
In C. de Boor, PGS, the function g(x) = sqrt(x + 1) is actually taken as an example (Chapter 12, Example II). This is exactly what you need here. The book comes back to this case a few times, since it is admittedly a hard problem for any interpolation scheme due to the infinite derivatives at x = -1. All software from PGS is available for free as PPPACK in netlib, and most of it is also part of SLATEC (also from netlib).
Edit (Removed)
(Multiplying by x once does not significantly help, since it only regularizes the first derivative, while all other derivatives at x = 0 are still infinite.)
Edit 2
My feeling is that optimally constructed splines (following de Boor) will be best (and fastest) for relatively low accuracy requirements. If the accuracy requirements are high (say 1e-8), one may be forced to get back to the algorithms that mathematicians have been researching for centuries. At this point, it may be best to simply download the sources of glibc and copy (provided GPL is acceptable) whatever is in
glibc-2.19/sysdeps/ieee754/dbl-64/e_pow.c
Since we don't have to include the whole math.h, there shouldn't be a problem with memory, but we will only marginally profit from having a fixed exponent.
Edit 3
Here is an adapted version of e_pow.c from netlib, as found by @Joni. This seems to be the grandfather of glibc's more modern implementation mentioned above. The old version has two advantages: (1) It is public domain, and (2) it uses a limited number of constants, which is beneficial if memory is a tight resource (glibc's version defines over 10000 lines of constants!). The following is completely standalone code, which calculates x^0.19029 for 0 <= x <= 1 to double precision (I tested it against Python's power function and found that at most 2 bits differed):
#define __LITTLE_ENDIAN
#ifdef __LITTLE_ENDIAN
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
#else
#define __HI(x) *(int*)&x
#define __LO(x) *(1+(int*)&x)
#endif
static const double
bp[] = {1.0, 1.5,},
dp_h[] = { 0.0, 5.84962487220764160156e-01,}, /* 0x3FE2B803, 0x40000000 */
dp_l[] = { 0.0, 1.35003920212974897128e-08,}, /* 0x3E4CFDEB, 0x43CFD006 */
zero = 0.0,
one = 1.0,
two = 2.0,
two53 = 9007199254740992.0, /* 0x43400000, 0x00000000 */
/* poly coefs for (3/2)*(log(x)-2s-2/3*s**3 */
L1 = 5.99999999999994648725e-01, /* 0x3FE33333, 0x33333303 */
L2 = 4.28571428578550184252e-01, /* 0x3FDB6DB6, 0xDB6FABFF */
L3 = 3.33333329818377432918e-01, /* 0x3FD55555, 0x518F264D */
L4 = 2.72728123808534006489e-01, /* 0x3FD17460, 0xA91D4101 */
L5 = 2.30660745775561754067e-01, /* 0x3FCD864A, 0x93C9DB65 */
L6 = 2.06975017800338417784e-01, /* 0x3FCA7E28, 0x4A454EEF */
P1 = 1.66666666666666019037e-01, /* 0x3FC55555, 0x5555553E */
P2 = -2.77777777770155933842e-03, /* 0xBF66C16C, 0x16BEBD93 */
P3 = 6.61375632143793436117e-05, /* 0x3F11566A, 0xAF25DE2C */
P4 = -1.65339022054652515390e-06, /* 0xBEBBBD41, 0xC5D26BF1 */
P5 = 4.13813679705723846039e-08, /* 0x3E663769, 0x72BEA4D0 */
lg2 = 6.93147180559945286227e-01, /* 0x3FE62E42, 0xFEFA39EF */
lg2_h = 6.93147182464599609375e-01, /* 0x3FE62E43, 0x00000000 */
lg2_l = -1.90465429995776804525e-09, /* 0xBE205C61, 0x0CA86C39 */
ovt = 8.0085662595372944372e-0017, /* -(1024-log2(ovfl+.5ulp)) */
cp = 9.61796693925975554329e-01, /* 0x3FEEC709, 0xDC3A03FD =2/(3ln2) */
cp_h = 9.61796700954437255859e-01, /* 0x3FEEC709, 0xE0000000 =(float)cp */
cp_l = -7.02846165095275826516e-09, /* 0xBE3E2FE0, 0x145B01F5 =tail of cp_h*/
ivln2 = 1.44269504088896338700e+00, /* 0x3FF71547, 0x652B82FE =1/ln2 */
ivln2_h = 1.44269502162933349609e+00, /* 0x3FF71547, 0x60000000 =24b 1/ln2*/
ivln2_l = 1.92596299112661746887e-08; /* 0x3E54AE0B, 0xF85DDF44 =1/ln2 tail*/
double pow0p19029(double x)
{
double y = 0.19029e+00;
double z,ax,z_h,z_l,p_h,p_l;
double y1,t1,t2,r,s,t,u,v,w;
int i,j,k,n;
int hx,hy,ix,iy;
unsigned lx,ly;
hx = __HI(x); lx = __LO(x);
hy = __HI(y); ly = __LO(y);
ix = hx&0x7fffffff; iy = hy&0x7fffffff;
ax = x;
/* special value of x */
if(lx==0) {
if(ix==0x7ff00000||ix==0||ix==0x3ff00000){
z = ax; /*x is +-0,+-inf,+-1*/
return z;
}
}
s = one; /* s (sign of result -ve**odd) = -1 else = 1 */
double ss,s2,s_h,s_l,t_h,t_l;
n = ((ix)>>20)-0x3ff;
j = ix&0x000fffff;
/* determine interval */
ix = j|0x3ff00000; /* normalize ix */
if(j<=0x3988E) k=0; /* |x|<sqrt(3/2) */
else if(j<0xBB67A) k=1; /* |x|<sqrt(3) */
else {k=0;n+=1;ix -= 0x00100000;}
__HI(ax) = ix;
/* compute ss = s_h+s_l = (x-1)/(x+1) or (x-1.5)/(x+1.5) */
u = ax-bp[k]; /* bp[0]=1.0, bp[1]=1.5 */
v = one/(ax+bp[k]);
ss = u*v;
s_h = ss;
__LO(s_h) = 0;
/* t_h=ax+bp[k] High */
t_h = zero;
__HI(t_h)=((ix>>1)|0x20000000)+0x00080000+(k<<18);
t_l = ax - (t_h-bp[k]);
s_l = v*((u-s_h*t_h)-s_h*t_l);
/* compute log(ax) */
s2 = ss*ss;
r = s2*s2*(L1+s2*(L2+s2*(L3+s2*(L4+s2*(L5+s2*L6)))));
r += s_l*(s_h+ss);
s2 = s_h*s_h;
t_h = 3.0+s2+r;
__LO(t_h) = 0;
t_l = r-((t_h-3.0)-s2);
/* u+v = ss*(1+...) */
u = s_h*t_h;
v = s_l*t_h+t_l*ss;
/* 2/(3log2)*(ss+...) */
p_h = u+v;
__LO(p_h) = 0;
p_l = v-(p_h-u);
z_h = cp_h*p_h; /* cp_h+cp_l = 2/(3*log2) */
z_l = cp_l*p_h+p_l*cp+dp_l[k];
/* log2(ax) = (ss+..)*2/(3*log2) = n + dp_h + z_h + z_l */
t = (double)n;
t1 = (((z_h+z_l)+dp_h[k])+t);
__LO(t1) = 0;
t2 = z_l-(((t1-t)-dp_h[k])-z_h);
/* split up y into y1+y2 and compute (y1+y2)*(t1+t2) */
y1 = y;
__LO(y1) = 0;
p_l = (y-y1)*t1+y*t2;
p_h = y1*t1;
z = p_l+p_h;
j = __HI(z);
i = __LO(z);
/*
* compute 2**(p_h+p_l)
*/
i = j&0x7fffffff;
k = (i>>20)-0x3ff;
n = 0;
if(i>0x3fe00000) { /* if |z| > 0.5, set n = [z+0.5] */
n = j+(0x00100000>>(k+1));
k = ((n&0x7fffffff)>>20)-0x3ff; /* new k for n */
t = zero;
__HI(t) = (n&~(0x000fffff>>k));
n = ((n&0x000fffff)|0x00100000)>>(20-k);
if(j<0) n = -n;
p_h -= t;
}
t = p_l+p_h;
__LO(t) = 0;
u = t*lg2_h;
v = (p_l-(t-p_h))*lg2+t*lg2_l;
z = u+v;
w = v-(z-u);
t = z*z;
t1 = z - t*(P1+t*(P2+t*(P3+t*(P4+t*P5))));
r = (z*t1)/(t1-two)-(w+z*w);
z = one-(r-z);
__HI(z) += (n<<20);
return s*z;
}
Clearly, 50+ years of research have gone into this, so it's probably very hard to do any better. (One has to appreciate that there are 0 loops, only 2 divisions, and only 6 if statements in the whole algorithm!) The reason for this is, again, the behavior at x = 0, where all derivatives diverge, which makes it extremely hard to keep the error under control: I once had a spline representation with 18 knots that was good up to x = 1e-4, with absolute and relative errors < 5e-4 everywhere, but going to x = 1e-5 ruined everything again.
So, unless the requirement to go arbitrarily close to zero is relaxed, I recommend using the adapted version of e_pow.c given above.
Edit 4
Now that we know that the domain 0.3 <= x <= 1 is sufficient, and that we have very low accuracy requirements, Edit 3 is clearly overkill. As @MvG has demonstrated, the function is so well behaved that a polynomial of degree 7 is sufficient to satisfy the accuracy requirements, which can be considered a single spline segment. @MvG's solution minimizes the integral error, which already looks very good.
The question arises as to how much better we can still do. It would be interesting to find the polynomial of a given degree that minimizes the maximum error in the interval of interest. The answer is the minimax polynomial, which can be found using the Remez algorithm, as implemented in the Boost library. I like @MvG's idea to clamp the value at x = 1 to 1, which I will do as well. Here is minimax.cpp:
#include <iostream>
#include <iomanip>
#define TARG_PREC 64
#define WORK_PREC (TARG_PREC*2)
#include <boost/multiprecision/cpp_dec_float.hpp>
typedef boost::multiprecision::number<boost::multiprecision::cpp_dec_float<WORK_PREC> > dtype;
using boost::math::pow;
#include <boost/math/tools/remez.hpp>
boost::shared_ptr<boost::math::tools::remez_minimax<dtype> > p_remez;
dtype f(const dtype& x) {
static const dtype one(1), y(0.19029);
return one - pow(one - x, y);
}
void out(const char *descr, const dtype& x, const char *sep="") {
std::cout << descr << boost::math::tools::real_cast<double>(x) << sep << std::endl;
}
int main() {
dtype a(0), b(0.7); // range to optimise over
bool rel_error(false), pin(true);
int orderN(7), orderD(0), skew(0), brake(50);
int prec = 2 + (TARG_PREC * 3010LL)/10000;
std::cout << std::scientific << std::setprecision(prec);
p_remez.reset(new boost::math::tools::remez_minimax<dtype>(
&f, orderN, orderD, a, b, pin, rel_error, skew, WORK_PREC));
out("Max error in interpolated form: ", p_remez->max_error());
p_remez->set_brake(brake);
unsigned i, count(50);
for (i = 0; i < count; ++i) {
std::cout << "Stepping..." << std::endl;
dtype r = p_remez->iterate();
out("Maximum Deviation Found: ", p_remez->max_error());
out("Expected Error Term: ", p_remez->error_term());
out("Maximum Relative Change in Control Points: ", r);
}
boost::math::tools::polynomial<dtype> n = p_remez->numerator();
for(i = n.size(); i--; ) {
out("", n[i], ",");
}
}
Since all parts of boost that we use are header-only, simply build with:
c++ -O3 -I<path/to/boost/headers> minimax.cpp -o minimax
We finally get the coefficients, which are after multiplication by 44330:
24538.3409, -42811.1497, 34300.7501, -11284.1276, 4564.5847, 3186.7541, 8442.5236, 0.
The following error plot demonstrates that this is really the best possible degree-7 polynomial approximation, since all extrema are of equal magnitude (0.06659):
Should the requirements ever change (while still keeping well away from 0!), the C++ program above can be simply adapted to spit out the new optimal polynomial approximation.
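Assuming the coefficients above are listed from the coefficient of (1-x)^7 down to the (pinned) constant term, evaluating the fit is just a Horner chain, for example (a sketch of my own, mirroring the Horner evaluation used in the next answer):
/* Approximates A = 44330 * (1 - (p/p0)^0.19029) for 0.3 <= p/p0 <= 1,
   using the degree-7 minimax coefficients above. */
static double approx_A(double x)   /* x = p/p0 */
{
    double u = 1.0 - x;
    double fx = 24538.3409;
    fx = fx * u - 42811.1497;
    fx = fx * u + 34300.7501;
    fx = fx * u - 11284.1276;
    fx = fx * u + 4564.5847;
    fx = fx * u + 3186.7541;
    fx = fx * u + 8442.5236;
    fx = fx * u;               /* constant term is 0, so x = 1 maps exactly to 0 */
    return fx;
}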
Instead of a lookup table, I'd use a polynomial approximation:
1 - x^0.19029 ≈ - 1073365.91783x^15 + 8354695.40833x^14 - 29422576.6529x^13 + 61993794.537x^12 - 87079891.4988x^11 + 86005723.842x^10 - 61389954.7459x^9 + 32053170.1149x^8 - 12253383.4372x^7 + 3399819.97536x^6 - 672003.142815x^5 + 91817.6782072x^4 - 8299.75873768x^3 + 469.530204564x^2 - 16.6572179869x + 0.722044145701
Or in code:
double f(double x) {
double fx;
fx = - 1073365.91783;
fx = fx*x + 8354695.40833;
fx = fx*x - 29422576.6529;
fx = fx*x + 61993794.537;
fx = fx*x - 87079891.4988;
fx = fx*x + 86005723.842;
fx = fx*x - 61389954.7459;
fx = fx*x + 32053170.1149;
fx = fx*x - 12253383.4372;
fx = fx*x + 3399819.97536;
fx = fx*x - 672003.142815;
fx = fx*x + 91817.6782072;
fx = fx*x - 8299.75873768;
fx = fx*x + 469.530204564;
fx = fx*x - 16.6572179869;
fx = fx*x + 0.722044145701;
return fx;
}
I computed this in sage using the least squares approach:
f(x) = 1-x^(19029/100000) # your function
d = 16 # number of terms, i.e. degree + 1
A = matrix(d, d, lambda r, c: integrate(x^r*x^c, (x, 0, 1)))
b = vector([integrate(x^r*f(x), (x, 0, 1)) for r in range(d)])
A.solve_right(b).change_ring(RDF)
Here is a plot of the error this will entail:
Blue is the error from my 16 term polynomial, while red is the error you'd get from piecewise linear interpolation with 16 equidistant values. As you can see, both errors are quite small for most parts of the range, but will become really huge close to x=0. I actually clipped the plot there. If you can somehow narrow the range of possible values, you could use that as the domain for the integration, and obtain an even better fit for the relevant range. At the cost of worse fit outside, of course. You could also increase the number of terms to obtain a closer fit, although that might also lead to higher oscillations.
I guess you can also combine this approach with the one Stefan posted: use his to split the domain into several parts, then use mine to find a close low degree polynomial for each part.
Update
Since you updated the specification of your question, with regard to both the domain and the error, here is a minimal solution to fit those requirements:
44330(1 - x^0.19029) ≈ + 23024.9160933(1-x)^7 - 39408.6473636(1-x)^6 + 31379.9086193(1-x)^5 - 10098.7031260(1-x)^4 + 4339.44098317(1-x)^3 + 3202.85705860(1-x)^2 + 8442.42528906(1-x)
double f(double x) {
double fx, x1 = 1. - x;
fx = + 23024.9160933;
fx = fx*x1 - 39408.6473636;
fx = fx*x1 + 31379.9086193;
fx = fx*x1 - 10098.7031260;
fx = fx*x1 + 4339.44098317;
fx = fx*x1 + 3202.85705860;
fx = fx*x1 + 8442.42528906;
fx = fx*x1;
return fx;
}
I integrated x from 0.293 to 1 or equivalently 1 - x from 0 to 0.707 to keep the worst oscillations outside the relevant domain. I also omitted the constant term, to ensure an exact result at x=1. The maximal error for the range [0.3, 1] now occurs at x=0.3260 and amounts to 0.0972 < 0.1. Here is an error plot, which of course has bigger absolute errors than the one above due to the scale factor k=44330 which has been included here.
I can also state that the first three derivatives of the function will have constant sign over the range in question, so the function is monotonic, convex, and in general pretty well-behaved.
Not meant to answer the question, but it illustrates the Road Not To Go, and thus may be helpful:
This quick-and-dirty C code calculates pow(i, 0.19029) for 0.000 to 1.000 in steps of 0.01. The first half displays the error, in percents, when stored as 1/65536ths (as that theoretically provides slightly over 4 decimals of precision). The second half shows both interpolated and calculated values in steps of 0.001, and the difference between these two.
It kind of looks okay if you read from the bottom up, all 100s and 99.99s there, but about the first 20 values from 0.001 to 0.020 are worthless.
#include <stdio.h>
#include <math.h>
float powers[102];
int main (void)
{
int i, as_int;
double as_real, low, high, delta, approx, calcd, diff;
printf ("calculating and storing:\n");
for (i=0; i<=101; i++)
{
as_real = pow(i/100.0, 0.19029);
as_int = (int)round(65536*as_real);
powers[i] = as_real;
diff = 100*as_real/(as_int/65536.0);
printf ("%.5f %.5f %.5f ~ %.3f\n", i/100.0, as_real, as_int/65536.0, diff);
}
printf ("\n");
printf ("-- interpolating in 1/10ths:\n");
for (i=0; i<1000; i++)
{
as_real = i/1000.0;
low = powers[i/10];
high = powers[1+i/10];
delta = (high-low)/10.0;
approx = low + (i%10)*delta;
calcd = pow(as_real, 0.19029);
diff = 100.0*approx/calcd;
printf ("%.5f ~ %.5f = %.5f +/- %.5f%%\n", as_real, approx, calcd, diff);
}
return 0;
}
You can find a complete, correct standalone implementation of pow in fdlibm. It's about 200 lines of code, about half of which deal with special cases. If you remove the code that deals with special cases you're not interested in I doubt you'll have problems including it in your program.
LutzL's answer is a really good one: calculate your power as (x^1.52232)^(1/8), computing the inner power by spline interpolation or another method. The eighth root deals with the pathological non-differentiable behavior near zero. I took the liberty of mocking up an implementation this way. The below, however, only does a linear interpolation to do x^1.52232, and you'd need to get the full coefficients using your favorite numerical mathematics tools. You'll be adding scarcely 40 lines of code to get your needed power, plus however many knots you choose to use for your spline, as dictated by your required accuracy.
Don't be scared by the #include <math.h>; it's just for benchmarking the code.
#include <stdio.h>
#include <math.h>
double my_sqrt(double x) {
/* Newton's method for a square root. */
int i = 0;
double res = 1.0;
if (x > 0) {
for (i = 0; i < 10; i++) {
res = 0.5 * (res + x / res);
}
} else {
res = 0.0;
}
return res;
}
double my_152232(double x) {
/* Cubic spline interpolation for x ** 1.52232. */
int i = 0;
double res = 0.0;
/* coefs[i] will give the cubic polynomial coefficients between x =
i and x = i+1. Out of laziness, the below numbers give only a
linear interpolation. You'll need to do some work and research
to get the spline coefficients. */
double coefs[3][4] = {{0.0, 1.0, 0.0, 0.0},
{-0.872526, 1.872526, 0.0, 0.0},
{-2.032706, 2.452616, 0.0, 0.0}};
if ((x >= 0) && (x < 3.0)) {
i = (int) x;
/* Horner's method cubic. */
res = ((coefs[i][3] * x + coefs[i][2]) * x + coefs[i][1]) * x
    + coefs[i][0];
} else if (x >= 3.0) {
/* Scaled x ** 1.5 once you go off the spline. */
res = 1.024824 * my_sqrt(x * x * x);
}
return res;
}
double my_019029(double x) {
return my_sqrt(my_sqrt(my_sqrt(my_152232(x))));
}
int main() {
int i;
double x = 0.0;
for (i = 0; i < 1000; i++) {
x = 1e-2 * i;
printf("%f %f %f \n", x, my_019029(x), pow(x, 0.19029));
}
return 0;
}
EDIT: If you're just interested in a small region like [0,1], even simpler is to peel off one sqrt(x) and compute x^1.02232, which is quite well behaved, using a Taylor series:
double my_152232(double x) {
double part_050000 = my_sqrt(x);
double part_102232 = 1.02232 * x + 0.0114091 * x * x - 3.718147e-3 * x * x * x;
return part_102232 * part_050000;
}
This gets you within 1% of the exact power for approximately [0.1,6], though getting the singularity exactly right is always a challenge. Even so, this three-term Taylor series gets you within 2.3% for x = 0.001.

Decrease in instructions retired after loop Unrolling

I have an O(N^4) image-processing loop and, after profiling it (using Intel VTune 2013), I see that the number of instructions retired is reduced drastically after unrolling. I need help understanding this behavior on a multicore architecture. (I'm using an Intel Xeon X5365 - it has 8 cores with a shared L2 cache for every 2 cores.) Also, the number of branch mispredictions has increased drastically!
/////////////// EDIT /////////////// A sample of my non-unrolled code is shown below:
for(imageNo =0; imageNo<496;imageNo++){
for (unsigned int k=0; k<256; k++)
{
double z = O_L + (double)k * R_L;
for (unsigned int j=0; j<256; j++)
{
double y = O_L + (double)j * R_L;
for (unsigned int i=0; i<256; i++)
{
double x[1] = {O_L + (double)i * R_L} ;
double w_n = (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]) ;
double u_n = ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
double v_n = ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);
for(int loop=0; loop<1;loop++)
{
px_x[loop] = (int) floor(u_n);
px_y[loop] = (int) floor(v_n);
alpha[loop] = u_n - px_x[loop] ;
beta[loop] = v_n - px_y[loop] ;
}
///////////////////(i,j) pixels ///////////////////////////////
if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)
pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
else
pixel_1[0] = 0.0;
if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)
pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
else
pixel_1[2] = 0.0;
/////////////////// (i+1, j) pixels/////////////////////////
if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
else
pixel_1[1] = 0.0;
if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
else
pixel_1[3] = 0.0;
pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0] + (1.0 - alpha[0]) * beta[0] * pixel_1[1]
+ alpha[0] * (1.0 - beta[0]) * pixel_1[2] + alpha[0] * beta[0] * pixel_1[3];
f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
}
}
}
}
I'm unrolling the innermost loop by 4 iterations. (You will get a general idea of how I stripped the loop: basically I created an array Array[4] and filled the respective values in it.) Doing the math, I'm reducing the total number of iterations by 75%. Say there are 4 loop-handling instructions for every iteration (load i, inc i, cmp i, jle loop); the total number of instructions after unrolling should then reduce by (256-64)*4*256*256*496 = 24.96G.
The profiled results are:
Before unrolling: instructions retired: 3.1603T; branch mispredictions: 96 million
After unrolling: instructions retired: 2.642240T; branch mispredictions: 144 million
The number of instructions retired decreased by 518.06G. I have no clue how this is happening. I would appreciate any help regarding this (even if it is only a remote possibility for its occurrence). Also, I would like to know why the branch mispredictions are increasing. Thanks in advance!
It is not clear where gcc would be reducing the number of instructions. It is possible that increased register pressure might encourage gcc to use load+operate instructions (so the same number of primitive operations but fewer instructions). The index for f_L would only be incremented once per innermost loop, but this would only save 6.2G (3*64*256*256*496) instructions. (By the way, the loop overhead should only be three instructions since i should remain in a register.)
The following pseudo-assembly (for a RISC-like ISA) using a two-way unrolling shows how an increment can be saved:
// the address of f_L[k * L * L + j * L + i] is in r1
// (float)(1.0 / (w_n * w_n) * pix_1) results are in f1 and f2
load-single f9 [r1]; // load float at address in r1 to register f9
add-single f9 f9 f1; // f9 = f9 + f1
store-single [r1] f9; // store float in f9 to address in r1
load-single f10 4[r1]; // load float at address of r1+4 to f10
add-single f10 f10 f2; // f10 = f10 + f2
store-single 4[r1] f10; // store float in f10 to address of r1+4
add r1 r1 #8; // increase the address by 8 bytes
The trace of two iterations of the non-unrolled version would look more like:
load-single f9 [r1]; // load float at address of r1 to f9
add-single f9 f9 f2; // f9 = f9 + f2
store-single [r1] f9; // store float in f9 to address of r1
add r1 r1 #4; // increase the address by 4 bytes
...
load-single f9 [r1]; // load float at address of r1 to f9
add-single f9 f9 f2; // f9 = f9 + f2
store-single [r1] f9; // store float in f9 to address of r1
add r1 r1 #4; // increase the address by 4 bytes
Because memory addressing instructions commonly include adding an immediate offset (Itanium is an unusual exception) and the pipelines are not generally implemented to optimize the case when the immediate is zero, using a non-zero immediate offset is generally "free". (It certainly reduces the number of instructions, 7 vs. 8 in this case, but generally it also improves performance.)
With respect to branch prediction, according to Agner Fog's The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers (PDF), the Core2 microarchitecture's branch predictor uses an 8-bit global history. This means that it tracks the results for the last 8 branches and uses these 8 bits (along with bits from the instruction address) to index a table. This allows correlations between nearby branches to be recognized.
For your code, the branch corresponding to, e.g., the 8th previous branch is not the same branch in each iteration (since short-circuiting is used), so it is not easy to conceptualize how well correlations would be recognized.
Some correlations in branches are obvious. If px_x[0]>=0 is true, px_x[0]+1>=0 will also be true. If px_x[0] <(int)threadCopy[0].S_x is true, then px_x[0]+1 <(int)threadCopy[0].S_x is likely to be true.
If the unrolling is done such that px_x[n] is tested for all four values of n then these correlations would be pushed farther away so that the results are not used by the branch predictor.
Some optimization possibilities
Although you did not ask about any optimization possibilities, I am going to offer some avenues for exploration.
First, for the branches, if not being strictly portable is OK, the test x>=0 && x<y can be simplified to (unsigned)x<(unsigned)y. This is not strictly portable because, e.g., a machine could theoretically represent negative numbers in a sign-magnitude format with the most significant bit as the sign and negative indicated by a zero bit. For the common representations of signed integers, such a reinterpreting cast will work as long as y is a positive signed integer since a negative x value will have the most significant bit set and so be larger than y interpreted as an unsigned integer.
Second, the number of branches can be significantly reduced by using the 100% correlations for either px_x or px_y:
if ((unsigned) px_y[0]<(unsigned int)threadCopy[0].S_y)
{
if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
else
pixel_1[0] = 0.0;
if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
else
pixel_1[2] = 0.0;
}
if ((unsigned)px_y[0]+1<(unsigned int)threadCopy[0].S_y)
{
if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
else
pixel_1[1] = 0.0;
if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
else
pixel_1[3] = 0.0;
}
(If the above section of code is replicated for unrolling, it should probably be replicated as a block rather than interleaving tests for different px_x and px_y values to allow the px_y branch to be near the px_y+1 branch and the first px_x branch to be near the other px_x branch and the px_x+1 branches.)
Another possible optimization is changing the calculation of w_n into a calculation of its reciprocal. This would change a multiply and three divisions into four multiplies and one division. Division is much more expensive than multiplication. In addition, calculating an approximate reciprocal is much more SIMD-friendly since there are usually reciprocal estimate instructions that provide a starting point which can be refined by the Newton-Raphson method.
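In the inner loop above that might look something like the following (a sketch only, reusing the names from the posted code):
/* One division instead of three; rw_n is the reciprocal of the old w_n. */
double rw_n = 1.0 / (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]);
double u_n = (A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) * rw_n;
double v_n = (A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) * rw_n;
/* ... */
f_L[k * L * L + j * L + i] += (float)(rw_n * rw_n * pix_1);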
If even worse obfuscation of the code is acceptable, you might consider changing code like double y = O_L + (double)j * R_L; into double y = O_L; ... y += R_L;. (I ran a test, and gcc does not seem to recognize this optimization, probably because of the use of floating point and the cast to double.) Thus:
for(int imageNo =0; imageNo<496;imageNo++){
double z = O_L;
for (unsigned int k=0; k<256; k++)
{
double y = O_L;
for (unsigned int j=0; j<256; j++)
{
double x[1]; x[0] = O_L;
for (unsigned int i=0; i<256; i++)
{
...
x[0] += R_L ;
} // end of i loop
y += R_L;
} // end of j loop
z += R_L;
} // end of k loop
} // end of imageNo loop
I am guessing that this would only modestly improve performance, so the obfuscation cost would be high relative to the benefit.
Another change that might be worth trying is incorporating some of the pix_1 calculation into the sections conditionally setting pixel_1[] values. This would significantly obfuscate the code and might not have much benefit. In addition, it might make autovectorization by the compiler more difficult. (With conditionally setting the values to the appropriate I_n or to zero, an SIMD comparison could set each element to -1 or 0 and a simple and with the I_n value would provide the correct value. In this case, the overhead of forming the I_n vector would probably not be worthwhile given that Core2 only supports 2-wide double precision SIMD, but with gather support or even just a longer vector the tradeoffs might change.)
However, this change would increase the size of basic blocks and reduce the amount of computation when any of px_x and px_y are out of range (I am guessing this is uncommon, so the benefit would be very small at best).
double pix_1 = 0.0;
double alpha_diff = 1.0 - alpha;
if ((unsigned) px_y[0]<(unsigned int)threadCopy[0].S_y)
{
double beta_diff = 1.0 - beta;
if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
pix_1 += alpha_diff * beta_diff
* threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
// no need for else statement since pix_1 is already zeroed and not
// adding the pixel_1[0] factor is the same as zeroing pixel_1[0]
if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
pix_1 += alpha * beta_diff
* threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
}
if ((unsigned)px_y[0]+1<(unsigned int)threadCopy[0].S_y)
{
if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
pix_1 += alpha_diff * beta
* threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
pix_1 += alpha * beta
* threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
}
Ideally, code like yours would be vectorized, but I do not know how to get gcc to recognize the opportunities, how to express the opportunities using intrinsics, nor whether significant effort at manually vectorizing this code would be worthwhile with an SIMD width of only two.
I am not a programmer (just someone who likes learning and thinking about computer architecture) and I have a significant inclination toward micro-optimization (as clear from the above), so the above proposals should be considered in that light.

Algorithm for triangulation of point travelling around a circle

Given the following system:
Where:
A: Point that exists anywhere on the edge of a circle with radius r on the xz plane.
θ: The angle between the positive-x-axis and a vector from the origin to point A. This should range from -PI/2 to PI/2.
B1: A point at the intersection of the circle and the positive x-axis at a height of h1.
B2: A point at the intersection of the circle and the positive z-axis at a height of h2.
d1: Distance between B1 and A.
d2: Distance between B2 and A.
Assuming:
h1, h2, and r are known constants.
d1 and d2 are known variables.
How do I find θ?
This will eventually be implemented in C in an embedded system where I have reasonably fast functions for arctan2, sine, and cosine. As such, performance is definitely a priority, and estimations can be used if they are correct to about 3 decimal places (which is how accurate my trig functions are).
However, even given a mathematical algorithm, I'm sure I could work out the specific implementation.
For what it's worth, I got about as far as:
(d1^2 - h1^2) / r = (sin(θ))^2 + (cos(θ))^2
(d2^2 - h2^2) / r = (sin(PI/4 - θ))^2 + (cos(PI/4 - θ))^2
Before I realized that, mathematically, this is way out of my league.
This isn't a full answer but a start of one.
There are two easy simplifications you can make.
Let H1 and H2 be the points in your plane below B1 and B2.
Since you know h1 and d1, h2 and d2, you can calculate the 2 distances A-H1 and A-H2 (with Pythagoras).
Now you have reduced the puzzle to a plane.
Furthermore, you don't really need to look at both H1 and H2. Given the distance A-H1, there are only 2 possible locations for A, which are mirrored in the x-axis. Then you can find which of the two it is by seeing if the A-H2 distance is above or below the threshold distance H2-H1.
That seems to be a good beginning :-)
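To make that concrete: in the plane, A = (r·cos θ, r·sin θ), H1 = (r, 0) and H2 = (0, r), so the in-plane squared distances are c1² = |A−H1|² = 2r²(1 − cos θ) and c2² = |A−H2|² = 2r²(1 − sin θ). That gives cos θ and sin θ directly, and the atan2 the question already has available recovers θ. A minimal sketch (my own code, assuming θ is measured from the positive x-axis towards the positive z-axis):
#include <math.h>

/* theta from the two measured distances d1, d2; h1, h2, r are the known constants. */
static double theta_from_distances(double d1, double d2,
                                   double h1, double h2, double r)
{
    double c1sq = d1 * d1 - h1 * h1;   /* squared in-plane distance A-H1 */
    double c2sq = d2 * d2 - h2 * h2;   /* squared in-plane distance A-H2 */
    double cos_theta = 1.0 - c1sq / (2.0 * r * r);
    double sin_theta = 1.0 - c2sq / (2.0 * r * r);
    return atan2(sin_theta, cos_theta);
}
Because d1 and d2 are measured, the recovered sine and cosine will not be perfectly consistent, but atan2 of the slightly inconsistent pair still gives a sensible estimate, and no square roots are needed.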
Employing @Rhialto's approach, with additional simplifications and tests for corner cases:
// c1 is the signed chord distance A to (B1 projected to the xz plane)
// c1*c1 + h1*h1 = d1*d1
// c1 = +/- sqrt(d1*d1 - h1*h1) (choose sign later)
// c1 = Cord(r, theta) = fabs(r*2*sin(theta/2))
// theta = asin(c1/(r*2))*2
//
// theta is < 0 when d2 > sqrt(h2*h2 + sqrt(2)*r*sqrt(2)*r)
// theta is < 0 when d2 > sqrt(h2*h2 + 2*r*r)
// theta is < 0 when d2*d2 > h2*h2 + 2*r*r
#define h1 (0.1)
#define h2 (0.25)
#define r (1.333)
#define h1Squared (h1*h1)
#define rSquared (r*r)
#define rSquaredTimes2 (r*r*2)
#define rTimes2 (r*2)
#define d2Big (h2*h2 + 2*r*r)
// Various steps to avoid issues with d1 < 0, d2 < 0, d1 ~= h1 and theta near pi
double zashu(double d1, double d2) {
double c1Squared = d1*d1 - h1Squared;
if (c1Squared < 0.0)
c1Squared = 0.0; // _May_ be needed when in select times |d1| ~= |h1|
double a = sqrt(c1Squared) / rTimes2;
double theta = (a <= 1.0) ? asin(a)*2.0 : asin(1.0)*2.0; // Possible a is _just_ greater than 1.0
if (d2*d2 > d2Big) // this could be done with fabs(d2) > pre_computed_sqrt(d2Big)
theta = -theta;
return theta;
}
