I have been trying to implement the optimized rasterizer outlined in this blog:
The naive approach, outlined in his prior blog post, calculates the determinants (for barycentric weights) at each pixel. His optimized version takes advantage of the fact that, for three points a, b, c, the determinant function
(b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
can be rewritten as
A * c.x + B * c.y + C
where A = (a.y - b.y), B = (b.x - a.x), C = a.x * b.y - a.y * b.x.
Since during traversal, the points a, b are two points on the triangle, c is the only point whose values change.
For any edge function E which outputs the weight for the corresponding triangle partition made by it and the point c, we can express an x difference and y difference in discrete steps.
E(c.x, c.y) = A * c.x + B * c.y + C
E(c.x + 1, c.y) - E(c.x, c.y) = A, and E(c.x, c.y + 1) - E(c.x, c.y) = B
So at each iteration, instead of recalculating the determinant, we can compute the three determinants once for the first c and then increment each of them by the A or B that corresponds to its edge.
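As a quick sanity check of the rewrite above, here is a minimal C sketch that evaluates both forms for one edge. The vec2 type and the names det2 and edge_fn are made up for this illustration and are not from the blog.

```c
/* Hypothetical 2D point type, used only for this sketch. */
typedef struct { double x, y; } vec2;

/* Determinant of (b - a, c - a), written as in the first formula. */
static double det2(vec2 a, vec2 b, vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* The same quantity in the form A * c.x + B * c.y + C. */
static double edge_fn(vec2 a, vec2 b, vec2 c) {
    double A = a.y - b.y;
    double B = b.x - a.x;
    double C = a.x * b.y - a.y * b.x;
    return A * c.x + B * c.y + C;
}
```

Evaluating edge_fn at c, at c shifted by one in x, and at c shifted by one in y should show differences of exactly A and B.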
I'm currently trying to implement this in my own rasterizer, but I quickly noticed an issue with the formula. My triangles are fed to the draw function in screen space, so a low y value means high up whereas in vector space it means low down. I thought I would account for this by multiplying every y value in the formula by -1.
This gave me the following formula for the determinant of a, b, c:
(b.x - a.x) * (a.y - c.y) - (a.y - b.y) * (c.x - a.x)
From which I derived the following A, B, C:
A = b.y - a.y, B = a.x - b.x, C = b.x * a.y - a.x * b.y
In my tests, using this new determinant formula (calculated at every pixel) works fine. And for the first point c in traversal, it is equivalent to
A * c.x + B * c.y + C
But as it continues traversing the triangle's bounding box, the step-incremented determinant values go out of sync with the directly calculated determinant values. Somehow this means that the step sizes A and B are faulty, which makes no sense to me.
The only two causes of this problem I can think of are that I calculated A, B, and C incorrectly, or that I am not mapping from vector space to screen space in a way that preserves area or orientation.
But just in case, here is all of my code for the rasterizer:
typedef float* point_t;
typedef float* triangle_t;
typedef struct edge {
point_t tail, tip;
float step_x, step_y;
int is_top_left;
} edge_t;
/* ... */
/* tail is the beginning of the edge (i.e. a), tip is the end of the edge (b), and c is variable */
static float init_edge(edge_t* edge, point_t tail, point_t tip, point_t origin) {
    edge->tail = tail;
    edge->tip = tip;
    edge->is_top_left = is_top_left(tail, tip);
    float A = tip[1] - tail[1];
    float B = tail[0] - tip[0];
    float C = tip[0] * tail[1] - tail[0] * tip[1];
    /* step sizes */
    edge->step_x = A;
    edge->step_y = B;
    /* edge function output at origin */
    return A * origin[0] + B * origin[1] + C;
}

static float det(point_t a, point_t b, point_t c) {
    return (b[0] - a[0]) * (a[1] - c[1]) - (a[1] - b[1]) * (c[0] - a[0]);
}
void draw_triangle(sr_pipeline_t* pipeline, triangle_t triangle) {
    /* orient triangle ccw */
    point_t v0 = (point_t)malloc(sizeof(float) * pipeline->num_attr);
    point_t v1 = (point_t)malloc(sizeof(float) * pipeline->num_attr);
    point_t v2 = (point_t)malloc(sizeof(float) * pipeline->num_attr);
    memcpy(v0, triangle, sizeof(float) * pipeline->num_attr);
    memcpy(v1, triangle + pipeline->num_attr, sizeof(float) * pipeline->num_attr);
    memcpy(v2, triangle + (2 * pipeline->num_attr), sizeof(float) * pipeline->num_attr);
    orient_ccw(&v0, &v1, &v2);
    /* find bounding box */
    float min_x = /* ... */;
    float min_y = /* ... */;
    float max_x = /* ... */;
    float max_y = /* ... */;
    /* store current point */
    point_t p = (point_t)calloc(pipeline->num_attr, sizeof(float));
    p[0] = min_x;
    p[1] = min_y;
    /* grab edge information */
    edge_t e01, e12, e20;
    float w0 = init_edge(&e12, v1, v2, p);
    float w1 = init_edge(&e20, v2, v0, p);
    float w2 = init_edge(&e01, v0, v1, p);
    /* rasterize */
    for (p[1] = min_y; p[1] <= max_y; p[1]++) {
        for (p[0] = min_x; p[0] <= max_x; p[0]++) {
            /* determinant calculated at every step (I suspect these are correct) */
            float s0 = det(v1, v2, p);
            float s1 = det(v2, v0, p);
            float s2 = det(v0, v1, p);
            if ( (s0 >= 0) && (s1 >= 0) && (s2 >= 0) ) {
                draw_point(pipeline, p);
            }
            w0 += e12.step_x;
            w1 += e20.step_x;
            w2 += e01.step_x;
        }
        w0 += e12.step_y;
        w1 += e20.step_y;
        w2 += e01.step_y;
    }
}
Code and functions that I have omitted I have verified work correctly.
To reiterate, my question is why are the values w0, w1, w2 not the same as s0, s1, s2 as they should be?
Any help is appreciated, thank you!
Rather than multiply every y value in the formula by -1, replace them with (top - y).
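Independently of how the y axis is flipped, one thing that may be worth checking in the question's loop is where the stepped values restart. If w0, w1, w2 are advanced by step_x across a whole row and then by step_y once, they describe the pixel just past the end of the row, one row down, rather than the first pixel of the next row. A common pattern is to keep a separate row-start value. The sketch below assumes that pattern; the edge_coef type and the function names are hypothetical.

```c
#include <math.h>

/* Coefficients of one edge function E(x, y) = A*x + B*y + C (hypothetical type). */
typedef struct { double A, B, C; } edge_coef;

static double edge_eval(edge_coef e, double x, double y) {
    return e.A * x + e.B * y + e.C;
}

/*
 * Walks the box [min_x, max_x] x [min_y, max_y] one pixel at a time and
 * returns the largest gap between the stepped value and direct evaluation.
 */
static double max_step_error(edge_coef e, int min_x, int min_y, int max_x, int max_y) {
    double worst = 0.0;
    double w_row = edge_eval(e, min_x, min_y);  /* value at the first pixel of the current row */
    for (int y = min_y; y <= max_y; y++) {
        double w = w_row;                       /* restart from the row start */
        for (int x = min_x; x <= max_x; x++) {
            double diff = fabs(w - edge_eval(e, x, y));
            if (diff > worst) worst = diff;
            w += e.A;                           /* one pixel to the right */
        }
        w_row += e.B;                           /* one row down, still at min_x */
    }
    return worst;
}
```

With integer-valued coefficients the returned error should be zero; with general floats a small rounding difference is to be expected.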
I'm working on a minimal ray tracer in C. I wrote a ray tracer a little while ago, so I understand the theory behind them; I just wanted to do a rewrite for cleanup purposes.
I have the necessary elements for a ray tracer, and nothing more. I've written triangle intersection, transforming pixel space coordinates to NDC (with aspect ratio and FOV accounted for), and writing out the frame buffer.
However, it does not work as expected. The image is entirely black when it should be rendering a single triangle. I've tested writing a single test pixel, and it works fine so I know it isn't an issue with the image writing code.
I've double and triple-checked the code behind the math, and it looks fine to me. Intersection code is basically a duplicate of the source code in the original Moller-Trumbore paper:
/* ray triangle intersection */
bool ray_triangle_intersect(double orig[3], double dir[3], double vert0[3],
                            double vert1[3], double vert2[3], double* t, double* u, double* v) {
    double edge1[3], edge2[3];
    double tvec[3], pvec[3], qvec[3];
    double det, inv_det;
    /* edges */
    SUB(edge1, vert1, vert0);
    SUB(edge2, vert2, vert0);
    /* determinant */
    CROSS(pvec, dir, edge2);
    /* ray in plane of triangle if near zero */
    det = DOT(edge1, pvec);
    if(det < EPSILON)
        return 0;
    SUB(tvec, orig, vert0);
    inv_det = 1.0 / det;
    /* calculate, check bounds */
    *u = DOT(tvec, pvec) * inv_det;
    if(*u < 0.0 || *u > 1.0)
        return 0;
    CROSS(qvec, tvec, edge1);
    /* calculate, check bounds */
    *v = DOT(dir, qvec) * inv_det;
    if(*v < 0.0 || *u + *v > 1.0)
        return 0;
    *t = DOT(edge2, qvec) * inv_det;
    return 1;
}
CROSS, DOT, and SUB are just macros:
#define CROSS(v,v0,v1) \
v[0] = v0[1] * v1[2] - v0[2] * v1[1]; \
v[1] = v0[2] * v1[0] - v0[0] * v1[2]; \
v[2] = v0[0] * v1[1] - v0[1] * v1[0];
#define DOT(v0,v1) (v0[0] * v1[0] + v0[1] * v1[1] + v0[2] * v1[2])
/* v = v0 - v1 */
#define SUB(v,v0,v1) \
v[0] = v0[0] - v1[0]; \
v[1] = v0[1] - v1[1]; \
v[2] = v0[2] - v1[2];
Transformation code is as follows:
double ndc[2];
screen_to_ndc(x, y, &ndc[0], &ndc[1]);
double dir[3];
dir[0] = ndc[0] * ar * tfov;
dir[1] = ndc[1] * tfov;
dir[2] = -1;
And screen_to_ndc:
void screen_to_ndc(unsigned int x, unsigned int y, double* ndcx, double* ndcy) {
    *ndcx = 2 * (((double) x + (1.0 / 2.0)) / (double) WIDTH) - 1;
    *ndcy = 1 - 2 * (((double) y + (1.0 / 2.0)) / (double) HEIGHT);
}
Any help would be appreciated.
Try reversing the orientation of your triangle. Your ray-triangle intersection code culls backfaces because it returns early when det is negative.
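If reversing the winding is not convenient, another option is a two-sided test that rejects only determinants close to zero. The following is a minimal sketch of that variant, not the asker's code: the helper functions replace the macros, and EPS is a placeholder tolerance because the question's EPSILON value is not shown.

```c
#include <math.h>
#include <stdbool.h>

#define EPS 1e-9  /* hypothetical tolerance */

static void sub3(double r[3], const double a[3], const double b[3]) {
    r[0] = a[0] - b[0]; r[1] = a[1] - b[1]; r[2] = a[2] - b[2];
}
static void cross3(double r[3], const double a[3], const double b[3]) {
    r[0] = a[1] * b[2] - a[2] * b[1];
    r[1] = a[2] * b[0] - a[0] * b[2];
    r[2] = a[0] * b[1] - a[1] * b[0];
}
static double dot3(const double a[3], const double b[3]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Two-sided ray/triangle test: accepts triangles of either winding. */
static bool ray_triangle_two_sided(const double orig[3], const double dir[3],
                                   const double v0[3], const double v1[3], const double v2[3],
                                   double *t, double *u, double *v) {
    double e1[3], e2[3], pvec[3], tvec[3], qvec[3];
    sub3(e1, v1, v0);
    sub3(e2, v2, v0);
    cross3(pvec, dir, e2);
    double det = dot3(e1, pvec);
    if (fabs(det) < EPS) return false;      /* ray (nearly) parallel to the triangle plane */
    double inv_det = 1.0 / det;
    sub3(tvec, orig, v0);
    *u = dot3(tvec, pvec) * inv_det;
    if (*u < 0.0 || *u > 1.0) return false;
    cross3(qvec, tvec, e1);
    *v = dot3(dir, qvec) * inv_det;
    if (*v < 0.0 || *u + *v > 1.0) return false;
    *t = dot3(e2, qvec) * inv_det;
    return true;
}
```

Note that this sketch, like the original, does not reject hits with negative t (behind the ray origin); the caller would need to check that.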
So I derived a rotation function like this:
I want to rotate (a, b, c) around the x axis
the value of a will not change
this is equivalent to rotating (b, c) around the origin in a 2d map
for a 2d map in polar coordinates, rotating d degrees is as simple as:
θ = θ + d
for a point P(x, y), x = Rcos(θ) and y = Rsin(θ)
so let Q be the point after rotation, then Q = (Rcos(θ + d), Rsin(θ + d))
since R^2 = x^2 + y^2 and θ = arctan(y/x):
Q = (sqrt(x^2 + y^2) * cos(arctan(y/x) + d), sqrt(x^2 + y^2) * sin(arctan(y/x) + d))
I then wrote a C function that, given a coordinate a and rot_amount (usually 1), rotates the coordinate for me.
static void xrotate_coor(t_coor *a, int rot_amount)
{
    double d;
    double e;

    d = a->y;
    e = a->z;
    if (e == 0 && d == 0)
        return ;
    if (d == 0)
    {
        a->y = sqrt(d * d + e * e) * cos(atan(INFIN) + rot_amount * M_PI / 50);
        a->z = sqrt(d * d + e * e) * sin(atan(INFIN) + rot_amount * M_PI / 50);
        return ;
    }
    a->y = sqrt(d * d + e * e) * cos(atan(e / d) + rot_amount * M_PI / 50);
    a->z = sqrt(d * d + e * e) * sin(atan(e / d) + rot_amount * M_PI / 50);
}
INFIN is a macro I set to 999999.
I am not sure if it is correct though since using this formula the shape I am rotating is getting deformed so I feel like there is a flaw in my logic somewhere...
You are experiencing the accumulation of errors in the calculations. This is caused by the nature of how numbers are represented in computers.
The typical way to handle this problem in computer graphics is to keep the object's original coordinates fixed and transform them to the position required for the frame being rendered. In your case, this would mean that rather than progressively rotating the object, you leave the object in its original position and simply calculate the rotation to the current angle around the X-axis based on where it should currently be displayed.
In other words, if you are rotating 360 degrees in total, 20 degrees at a time, display the original coordinates rotated by 20 degrees in the first iteration and by 40 degrees in the second iteration, rather than rotating the previous result by another 20 degrees each time.
... the shape I am rotating is getting deformed ...
atan(e / d) loses the four-quadrant nature of a->y, a->z. Consider that with OP's code, if y and z are both negated, the same result ensues. (@Nominal Animal)
d = -(a->y);
e = -(a->z);
atan(e / d)
Instead use a 4 quadrant arctangent.
double atan2(double y, double x);
The atan2 functions compute the value of the arc tangent of y/x, using the signs of both arguments to determine the quadrant of the return value. A domain error may occur if both arguments are zero.
Other suggested improvements below too.
#include <math.h>
static void xrotate_coor(t_coor *a, int rot_amount) {
    double d = a->y;
    double e = a->z;
    double r = hypot(d, e); // vs. sqrt(d * d + e * e)
    if (r) {
        double angle = atan2(e, d);
        angle += rot_amount * (M_PI / 50);
        a->y = r * cos(angle);
        a->z = r * sin(angle);
    }
}
An analytical solution for the length of a cubic Bezier curve seems not to exist, but that does not mean a cheap approximation cannot be coded. By cheap I mean something in the range of 50-100 ns (or less).
Does anyone know of anything like that? Maybe in two categories:
1) lower error, around 1%, but slower code;
2) higher error, around 20%, but faster code.
I searched Google a bit but did not find anything that looks like a nice solution, only approaches such as dividing the curve into N line segments and summing the N square roots, which is too slow for higher precision and probably too inaccurate for 2 or 3 segments.
Is there anything better?
Another option is to estimate the arc length as the average between the chord and the control net. In practice:
Bezier bezier = Bezier (p0, p1, p2, p3);
chord = (p3-p0).Length;
cont_net = (p0 - p1).Length + (p2 - p1).Length + (p3 - p2).Length;
app_arc_length = (cont_net + chord) / 2;
You can then recursively split your spline segment into two segments and calculate the arc length up to convergence. I tested it myself and it actually converges pretty fast. I got the idea from this forum.
Simplest algorithm: flatten the curve and tally euclidean distance. As long as you want an approximate arc length, this solution is fast and cheap. Given your curve's coordinate LUT—you're talking about speed, so I'm assuming you use those, and don't constantly recompute the coordinates—it's a simple for loop with a tally. In generic code, with a dist function that computes the euclidean distance between two points:
var arclength = 0, i;
for (i=0; i<last; i++) {
  arclength += dist(LUT[i], LUT[i+1]);
}
Done. arclength is now the approximate arc length based on the maximum number of segments you can form in the curve based on your LUT. Need things faster with a larger potential error? Control the segment count.
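var arclength = 0,
    segCount = ...,
    step = last/segCount,
    s, i, j;
for (s=0; s<segCount; s++) {
  i = Math.round(s*step);       // LUT index where this segment starts
  j = Math.round((s+1)*step);   // LUT index where this segment ends
  arclength += dist(LUT[i], LUT[j]);
}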
This is pretty much the simplest possible algorithm that still generates values that come even close to the true arc length. For anything better, you're going to have to use more expensive numerical approaches (like the Legendre-Gauss quadrature technique).
If you want to know why, hit up the arc length section of "A Primer on Bézier Curves".
In my case, a fast and valid approach is this (rewritten in C# for Unity3D):
public static float BezierSingleLength(Vector3[] points){
    // unit directions of the control polygon legs
    var p0 = points[0] - points[1];
    var p1 = points[2] - points[1];
    var p2 = new Vector3();
    var p3 = points[3]-points[2];
    var l0 = p0.magnitude;
    var l1 = p1.magnitude;
    var l3 = p3.magnitude;
    if(l0 > 0) p0 /= l0;
    if(l1 > 0) p1 /= l1;
    if(l3 > 0) p3 /= l3;
    p2 = -p1;
    // a approaches 2 when the control polygon is nearly straight
    var a = Mathf.Abs(Vector3.Dot(p0,p1)) + Mathf.Abs(Vector3.Dot(p2,p3));
    // nearly straight, or short enough: use the control polygon length
    if(a > 1.98f || l0 + l1 + l3 < (4 - a)*8) return l0+l1+l3;
    // otherwise split the curve at t = 0.5 and recurse on both halves
    var bl = new Vector3[4];
    var br = new Vector3[4];
    bl[0] = points[0];
    bl[1] = (points[0]+points[1]) * 0.5f;
    var mid = (points[1]+points[2]) * 0.5f;
    bl[2] = (bl[1]+mid) * 0.5f;
    br[3] = points[3];
    br[2] = (points[2]+points[3]) * 0.5f;
    br[1] = (br[2]+mid) * 0.5f;
    br[0] = (br[1]+bl[2]) * 0.5f;
    bl[3] = br[0];
    return BezierSingleLength(bl) + BezierSingleLength(br);
}
I worked out the closed form expression of length for a 3 point Bezier (below). I've not attempted to work out a closed form for 4+ points. This would most likely be difficult or complicated to represent and handle. However, a numerical approximation technique such as a Runge-Kutta integration algorithm (see my Q&A here for details) would work quite well by integrating using the arc length formula.
Here is some Java code for the arc length of a 3 point Bezier, with points a,b, and c.
v.x = 2*(b.x - a.x);
v.y = 2*(b.y - a.y);
w.x = c.x - 2*b.x + a.x;
w.y = c.y - 2*b.y + a.y;
uu = 4*(w.x*w.x + w.y*w.y);
if(uu < 0.00001)
return (float) Math.sqrt((c.x - a.x)*(c.x - a.x) + (c.y - a.y)*(c.y - a.y));
vv = 4*(v.x*w.x + v.y*w.y);
ww = v.x*v.x + v.y*v.y;
t1 = (float) (2*Math.sqrt(uu*(uu + vv + ww)));
t2 = 2*uu+vv;
t3 = vv*vv - 4*uu*ww;
t4 = (float) (2*Math.sqrt(uu*ww));
return (float) ((t1*t2 - t3*Math.log(t2+t1) -(vv*t4 - t3*Math.log(vv+t4))) / (8*Math.pow(uu, 1.5)));
public float FastArcLength()
{
    float arcLength = 0.0f;
    ArcLengthUtil(cp0.position, cp1.position, cp2.position, cp3.position, 5, ref arcLength);
    return arcLength;
}

private void ArcLengthUtil(Vector3 A, Vector3 B, Vector3 C, Vector3 D, uint subdiv, ref float L)
{
    if (subdiv > 0)
    {
        Vector3 a = A + (B - A) * 0.5f;
        Vector3 b = B + (C - B) * 0.5f;
        Vector3 c = C + (D - C) * 0.5f;
        Vector3 d = a + (b - a) * 0.5f;
        Vector3 e = b + (c - b) * 0.5f;
        Vector3 f = d + (e - d) * 0.5f;
        // left branch
        ArcLengthUtil(A, a, d, f, subdiv - 1, ref L);
        // right branch
        ArcLengthUtil(f, e, c, D, subdiv - 1, ref L);
    }
    else
    {
        float controlNetLength = (B-A).magnitude + (C - B).magnitude + (D - C).magnitude;
        float chordLength = (D - A).magnitude;
        L += (chordLength + controlNetLength) / 2.0f;
    }
}
First of all, you should understand the algorithm used for Bezier curves.
When I was writing a graphics-heavy program in C#, I used Beziers and often had to find the coordinates of a point on one, which seemed impossible at first glance. So I wrote a cubic Bezier function in the custom math class in my project. I will share that code with you first.
//--------------- My Custom Power Method ------------------\\
public static float FloatPowerX(float number, int power)
{
    float temp = number;
    for (int i = 0; i < power - 1; i++)
    {
        temp *= number;
    }
    return temp;
}
//--------------- Bezier Drawer Code Below ------------------\\
public static void CubicBezierDrawer(Graphics graphics, Pen pen, float[] startPointPixel, float[] firstControlPointPixel
    , float[] secondControlPointPixel, float[] endPointPixel)
{
    float[] px = new float[1111], py = new float[1111];
    float[] x = new float[4] { startPointPixel[0], firstControlPointPixel[0], secondControlPointPixel[0], endPointPixel[0] };
    float[] y = new float[4] { startPointPixel[1], firstControlPointPixel[1], secondControlPointPixel[1], endPointPixel[1] };
    int i = 0;
    for (float t = 0; t <= 1F; t += 0.001F)
    {
        px[i] = FloatPowerX((1F - t), 3) * x[0] + 3 * t * FloatPowerX((1F - t), 2) * x[1] + 3 * FloatPowerX(t, 2) * (1F - t) * x[2] + FloatPowerX(t, 3) * x[3];
        py[i] = FloatPowerX((1F - t), 3) * y[0] + 3 * t * FloatPowerX((1F - t), 2) * y[1] + 3 * FloatPowerX(t, 2) * (1F - t) * y[2] + FloatPowerX(t, 3) * y[3];
        if (i > 0)
        {
            // connect the previous sample to the current one
            graphics.DrawLine(pen, px[i - 1], py[i - 1], px[i], py[i]);
        }
        i++;
    }
}
As you can see above, this is how a Bezier function works, and it draws the same Bezier as Microsoft's Bezier function does (I've tested it). You can make it more accurate by increasing the array size and the number of steps, or draw ellipses instead of lines, and so on. It all depends on your needs and the level of accuracy you require.
Returning to the main goal, the question is how to calculate the length.
The answer is that we have lots of points, each with an x coordinate and a y coordinate, which brings a right triangle to mind. Given two points p1 and p2, we can calculate the distance between them as the hypotenuse of a right triangle. As we remember from school math, in a right triangle ABC the hypotenuse length is Sqrt(leg1 ^ 2 + leg2 ^ 2);
The same relation holds between every pair of consecutive points, so we calculate the length between the current point and the one before it (e.g. p[i - 1] and p[i]) and accumulate the sum in a variable. The code below shows this.
//--------------- My Custom Power Method ------------------\\
public static float FloatPower2(float number)
{
    return number * number;
}
//--------------- My Bezier Length Calculator Method ------------------\\
public static float CubicBezierLenghtCalculator(float[] startPointPixel
    , float[] firstControlPointPixel, float[] secondControlPointPixel, float[] endPointPixel)
{
    float[] tmp = new float[2];
    float lenght = 0;
    float[] px = new float[1111], py = new float[1111];
    float[] x = new float[4] { startPointPixel[0], firstControlPointPixel[0]
        , secondControlPointPixel[0], endPointPixel[0] };
    float[] y = new float[4] { startPointPixel[1], firstControlPointPixel[1]
        , secondControlPointPixel[1], endPointPixel[1] };
    int i = 0;
    for (float t = 0; t <= 1.0; t += 0.001F)
    {
        px[i] = FloatPowerX((1.0F - t), 3) * x[0] + 3 * t * FloatPowerX((1.0F - t), 2) * x[1] + 3F * FloatPowerX(t, 2) * (1.0F - t) * x[2] + FloatPowerX(t, 3) * x[3];
        py[i] = FloatPowerX((1.0F - t), 3) * y[0] + 3 * t * FloatPowerX((1.0F - t), 2) * y[1] + 3F * FloatPowerX(t, 2) * (1.0F - t) * y[2] + FloatPowerX(t, 3) * y[3];
        if (i > 0)
        {
            tmp[0] = Math.Abs(px[i - 1] - px[i]); // length of the horizontal leg
            tmp[1] = Math.Abs(py[i - 1] - py[i]); // length of the vertical leg
            // hypotenuse of the current right triangle, added to the running total
            lenght += (float)Math.Sqrt(FloatPower2(tmp[0]) + FloatPower2(tmp[1]));
        }
        i++;
    }
    return lenght;
}
If you want a faster calculation, just reduce the px and py array length and the loop count.
We could also reduce memory use by shrinking px and py to arrays of length 1, or to simple double variables, but I didn't do that because of the extra conditional logic it would need.
I hope this helps. If you have another question, just ask.
With best regards, Heydar - Islamic Republic of Iran.
I have a C function that computes the values of 4 sines based on time elapsed. Using gprof, I figured out that this function uses 100% (100.07% to be exact, lol) of the CPU time.
clock_gettime(CLOCK_MONOTONIC, &spec);
s = spec.tv_sec;
ms = spec.tv_nsec * 0.0000001;
etime = concatenate((long)s, ms);
int k;
for (k = 0; k < 799; ++k)
{
    double A1 = 145 * sin((RAND1 * k + etime) * 0.00333) + RAND5; // Amplitude
    double A2 = 100 * sin((RAND2 * k + etime) * 0.00333) + RAND4; // Amplitude
    double A3 = 168 * sin((RAND3 * k + etime) * 0.00333) + RAND3; // Amplitude
    double A4 = 136 * sin((RAND4 * k + etime) * 0.00333) + RAND2; // Amplitude
    double B1 = 3 + RAND1 + (sin((RAND5 * k) * etime) * 0.00216); // Period
    double B2 = 3 + RAND2 + (sin((RAND4 * k) * etime) * 0.002); // Period
    double B3 = 3 + RAND3 + (sin((RAND3 * k) * etime) * 0.00245); // Period
    double B4 = 3 + RAND4 + (sin((RAND2 * k) * etime) * 0.002); // Period
    double x = k; // Current x
    double C1 = 0.6 * etime; // X axis move
    double C2 = 0.9 * etime; // X axis move
    double C3 = 1.2 * etime; // X axis move
    double C4 = 0.8 * etime + 200; // X axis move
    double D1 = RAND1 + sin(RAND1 * x * 0.00166) * 4; // Y axis move
    double D2 = RAND2 + sin(RAND2 * x * 0.002) * 4; // Y axis move
    double D3 = RAND3 + cos(RAND3 * x * 0.0025) * 4; // Y axis move
    double D4 = RAND4 + sin(RAND4 * x * 0.002) * 4; // Y axis move
    sine1[k] = A1 * sin((B1 * x + C1) * 0.0025) + D1;
    sine2[k] = A2 * sin((B2 * x + C2) * 0.00333) + D2 + 100;
    sine3[k] = A3 * cos((B3 * x + C3) * 0.002) + D3 + 50;
    sine4[k] = A4 * sin((B4 * x + C4) * 0.00333) + D4 + 100;
}
And this is the output from gprof:
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.07      0.04     0.04
I'm currently getting a frame rate of roughly 30-31 fps using this. Now I figure there has to be a more efficient way to do this.
As you noticed I already changed all the divisions to multiplications but that had very little effect on performance.
How could I increase the performance of this math heavy function?
Besides all the other advice given in other answers, here is a pure algorithmic optimization.
In most cases, you're computing something of the form sin(k * a + b), where a and b are constants, and k is a loop variable. If you were also to compute cos(k * a + b), then you could use a 2D rotation matrix to form a recurrence relationship (in matrix form):
|cos(k*a + b)| = |cos(a) -sin(a)| * |cos((k-1)*a + b)|
|sin(k*a + b)| |sin(a) cos(a)| |sin((k-1)*a + b)|
In other words, you can calculate the value for the current iteration in terms of the value from the previous iteration. Thus, you only need to do the full trig calculation for k == 0, and the rest can be calculated via this recurrence (once you have calculated cos(a) and sin(a), which are constants). So you eliminate 75% of the trig function calls (it's not clear the same trick can be pulled for the final set of trig calls).
If you don't need all that precision, create a lookup for the sin() values you need, so if 1 degree is enough, use double sin_lookup[360], etc.. And possibly float sin_lookup[360] if float precision is sufficient.
Also, as noted in comments, at a certain point as per Keith, "You might also consider using linear interpolation between lookup values, which should give you substantially better accuracy (a reasonably continuous function rather than a step function) at a fairly small cost in performance"
EDIT: also consider changing the hardcoded A1, A2, A3, A4 pattern to arrays of size 4 and looping from 0 to 3. This should allow vectorization on many platforms and allow parallelism without needing to manage threads.
EDIT2: some code and results
(Coded in C++ just to make comparisons easy between precisions, calcs are the same in C)
#include <cmath>
#include <cstddef>
#include <vector>

class simple_trig
{
public:
  simple_trig(size_t prec) : precision(prec)
  {
    static const double PI=3.141592653589793;
    const double dprec=(double)prec;
    const double quotient=(2.0*PI)/dprec;
    rev_quotient=1.0/quotient;   // converts an angle in radians to a table position
    values.resize(prec);
    for (int i=0; i < precision; ++i)
      values[i]=std::sin(quotient*(double)i);   // one full period, sampled prec times
  }
  // Both lookups interpolate linearly between table entries and assume x >= 0.
  double sin(double x) const
  {
    double cvt=x*rev_quotient;
    int index=(int)cvt;
    double delta=cvt-(double)index;
    int lookup1=index%precision;
    int lookup2=(index+1)%precision;
    return values[lookup1]*(1.0-delta)+values[lookup2]*delta;
  }
  double cos(double x) const
  {
    double cvt=x*rev_quotient;
    int index=(int)cvt;
    double delta=cvt-(double)index;
    int lookup1=(index+precision/4)%precision;   // a quarter period ahead of sin
    int lookup2=(index+precision/4+1)%precision;
    return values[lookup1]*(1.0-delta)+values[lookup2]*delta;
  }
private:
  const size_t precision;
  double rev_quotient;
  std::vector<double> values;
};
Example results, where Low uses a table of 100 entries, Med 1000, and High 10,000:
X=0 Sin=0 Sin Low=0 Sin Med=0 Sin High=0
X=0 Cos=1 Cos Low=1 Cos Med=1 Cos High=1
X=0.5 Sin=0.479426 Sin Low=0.479389 Sin Med=0.479423 Sin High=0.479426
X=0.5 Cos=0.877583 Cos Low=0.877512 Cos Med=0.877578 Cos High=0.877583
X=1.33333 Sin=0.971938 Sin Low=0.971607 Sin Med=0.971935 Sin High=0.971938
X=1.33333 Cos=0.235238 Cos Low=0.235162 Cos Med=0.235237 Cos High=0.235238
X=2.25 Sin=0.778073 Sin Low=0.777834 Sin Med=0.778072 Sin High=0.778073
X=2.25 Cos=-0.628174 Cos Low=-0.627986 Cos Med=-0.628173 Cos High=-0.628174
X=3.2 Sin=-0.0583741 Sin Low=-0.0583689 Sin Med=-0.0583739 Sin High=-0.0583741
X=3.2 Cos=-0.998295 Cos Low=-0.998166 Cos Med=-0.998291 Cos High=-0.998295
X=4.16667 Sin=-0.854753 Sin Low=-0.854387 Sin Med=-0.854751 Sin High=-0.854753
X=4.16667 Cos=-0.519036 Cos Low=-0.518818 Cos Med=-0.519034 Cos High=-0.519036
X=5.14286 Sin=-0.90877 Sin Low=-0.908542 Sin Med=-0.908766 Sin High=-0.90877
X=5.14286 Cos=0.417296 Cos Low=0.417195 Cos Med=0.417294 Cos High=0.417296
X=6.125 Sin=-0.157526 Sin Low=-0.157449 Sin Med=-0.157526 Sin High=-0.157526
X=6.125 Cos=0.987515 Cos Low=0.987028 Cos Med=0.987512 Cos High=0.987515
X=7.11111 Sin=0.73653 Sin Low=0.736316 Sin Med=0.736527 Sin High=0.73653
X=7.11111 Cos=0.676405 Cos Low=0.676213 Cos Med=0.676403 Cos High=0.676405
X=8.1 Sin=0.96989 Sin Low=0.969741 Sin Med=0.969887 Sin High=0.96989
X=8.1 Cos=-0.243544 Cos Low=-0.24351 Cos Med=-0.243544 Cos High=-0.243544
X=9.09091 Sin=0.327701 Sin Low=0.327558 Sin Med=0.3277 Sin High=0.327701
X=9.09091 Cos=-0.944782 Cos Low=-0.944381 Cos Med=-0.944779 Cos High=-0.944782
X=10.0833 Sin=-0.611975 Sin Low=-0.611673 Sin Med=-0.611973 Sin High=-0.611975
X=10.0833 Cos=-0.790877 Cos Low=-0.790488 Cos Med=-0.790875 Cos High=-0.790877
It seems to me that the sine1, sine2, sine3 and sine4 arrays are completely independent of each other. So you are basically running a single for loop for 4 different arrays that have no dependency on one another.
Spawn 4 threads, one for each array, so you have 4 for loops running at the same time. On a multicore machine this should speed up your function dramatically. As a matter of fact, it should be close to a perfect 4x speedup (+- ...).
Actually, combining threads (consider OpenMP for this) with a lookup table for the sine is a good idea. If possible, use float instead of double and, depending on the platform, you could also use SIMD instructions, but the latter would make the use of threads unnecessary.
Here is a C++ snippet to use the rotation matrix suggested in the accepted answer.
float a = 0.343;
float b = 2.3232;
float sina{};
float cosa{};
sincosf(a, &sina, &cosa);   // sincosf is a GNU extension, not standard C++
float resSin{};
float resCos{};
for (int k = 0; k < 5; k++) {
    if (k == 0) {
        sincosf(b, &resSin, &resCos);
    } else {
        float newResCos, newResSin;
        newResCos = cosa * resCos - sina * resSin;
        newResSin = sina * resCos + cosa * resSin;
        resCos = newResCos;
        resSin = newResSin;
    }
    // resSin and resCos now hold sin(k * a + b) and cos(k * a + b)
}