Suggestions on optimizing a Z-buffer implementation?

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.
In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this...)
I ran gprof on my executable, and I end up with the following interesting lines:
  %   cumulative    self               self     total
 time    seconds   seconds    calls  ms/call  ms/call  name
43.51       9.50      9.50                             vSwap
34.86      17.11      7.61   179944     0.04     0.04  grInterpolateHLine
13.99      20.17      3.06                             grClearDepthBuffer
<snip>
 0.76      21.78      0.17      624     0.27    12.46  grScanlineFill
The function vSwap is my double-buffer swapping function, and it also performs vsyncing, so it makes sense to me that the test program will spend much of its time waiting in there. grScanlineFill is my triangle-drawing function, which creates an edge list and then calls grInterpolateHLine to actually fill in the triangle.
My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsynch overhead, then it turns out that the test program is spending something like 85% of its execution time either clearing the depth buffer, or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    for (; x1 <= x2; x1++, z += zstep) {
        if (z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
        }
    }
}
I really don't see how I can improve that, especially considering that vSetPixel is a macro.
My entire stock of ideas for optimization has been whittled down to precisely one:
Use an integer/fixed-point depth buffer.
The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.
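Edit: for concreteness, this is the kind of loop I have in mind, as an untested 16.16 fixed-point sketch (grFixDepthBuffer is hypothetical, and the caller would convert z and zstep with (int)(z * 65536.0f)):

/* Untested sketch: the same loop with a 16.16 fixed-point depth
   buffer, so the depth test becomes a plain integer compare. */
void grInterpolateHLineFix(int x1, int x2, int y, int zfix, int zstepfix, int colour) {
    for (; x1 <= x2; x1++, zfix += zstepfix) {
        if (zfix < grFixDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grFixDepthBuffer[x1 + y*VIDEO_WIDTH] = zfix;
        }
    }
}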

You should have a look at the source code to something like Quake - considering what it could achieve on a Pentium, 15 years ago. Its z-buffer implementation used spans rather than per-pixel (or fragment) depth. Otherwise, you could look at the rasterization code in Mesa.

Hard to really tell what higher-order optimizations can be done without seeing the rest of the code. I have a couple of minor observations, though.
There's no need to calculate x1 + y * VIDEO_WIDTH more than once in grInterpolateHLine. i.e.:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    for (; x1 <= x2; x1++, z += zstep, offset++) {
        if (z < grDepthBuffer[offset]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[offset] = z;
        }
    }
}
Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to use the same offset there as well, and then you only need to increment offset and not x1 in each loop iteration. Chances are this can be extended back to the function that calls grInterpolateHLine, and you would then only need to do the multiplication once per triangle.
There are some other things you could do with the depth buffer. Most of the time if the first pixel of the line either fails or passes the depth test, then the rest of the line will have the same result. So after the first test you can write a more efficient assembly block to test the entire line in one shot, then if it passes you can use a more efficient block memory setter to block-set the pixel and depth values instead of doing them one at a time. You would only need to test/set per pixel if the line is only partially occluded.
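A rough sketch of that idea in plain C (the real version would be an assembly or SIMD block; grDepthBuffer and vSetPixel are from the question, the rest is illustrative):

/* Sketch: test the whole span first; if every pixel wins the depth
   test, block-set pixels and depths with no per-pixel branching.
   Fall back to the per-pixel loop only when partially occluded. */
void grInterpolateHLineSpan(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + y * VIDEO_WIDTH;
    int i, n = x2 - x1 + 1;
    int all_pass = 1;
    float zz = z;

    for (i = 0; i < n; i++, zz += zstep)
        if (zz >= grDepthBuffer[offset + i]) { all_pass = 0; break; }

    if (all_pass) {
        for (i = 0; i < n; i++, z += zstep) {   /* block set */
            grDepthBuffer[offset + i] = z;
            vSetPixel(x1 + i, y, colour);
        }
    } else {
        for (i = 0; i < n; i++, z += zstep)     /* per-pixel fallback */
            if (z < grDepthBuffer[offset + i]) {
                grDepthBuffer[offset + i] = z;
                vSetPixel(x1 + i, y, colour);
            }
    }
}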
Also, not sure what you mean by older computer, but if your target computer is multi-core then you can break it up among multiple cores. You can do this for the buffer clearing function as well. It can help quite a bit.
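For illustration, a minimal pthreads sketch of splitting the depth-buffer clear across cores (grDepthBuffer and VIDEO_WIDTH are from the question; VIDEO_HEIGHT and NUM_THREADS are assumptions, and in practice you would keep worker threads alive rather than spawning them every frame):

#include <float.h>
#include <pthread.h>

#define NUM_THREADS 2   /* assumed core count */

typedef struct { int first, count; } ClearSlice;

static void *clear_slice(void *arg) {
    ClearSlice *s = (ClearSlice *)arg;
    for (int i = s->first; i < s->first + s->count; i++)
        grDepthBuffer[i] = FLT_MAX;
    return NULL;
}

/* Clear the depth buffer using one slice per thread. */
void grClearDepthBufferMT(void) {
    pthread_t threads[NUM_THREADS];
    ClearSlice slices[NUM_THREADS];
    int total = VIDEO_WIDTH * VIDEO_HEIGHT;
    int chunk = total / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        slices[t].first = t * chunk;
        slices[t].count = (t == NUM_THREADS - 1) ? total - t * chunk : chunk;
        pthread_create(&threads[t], NULL, clear_slice, &slices[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}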

I ended up solving this by replacing the Z-buffer with the Painter's Algorithm. I used SSE to write a Z-buffer implementation that created a bitmask with the pixels to paint (plus the range optimization suggested by Gerald), and it still ran far too slowly.
Thank you, everyone, for your input.

Related

How to use KissFFT with audio?

I have an array of 2048 samples of an audio file at 44.1 kHz and want to transform it into a spectrum for an LED effect. I don't know too much about the inner workings of the FFT, but I tried it using KissFFT:
kiss_fft_cpx *cpx_in  = malloc(FRAMES * sizeof(kiss_fft_cpx));
kiss_fft_cpx *cpx_out = malloc(FRAMES * sizeof(kiss_fft_cpx));
kiss_fft_cfg cfg = kiss_fft_alloc(FRAMES, 0, 0, 0);

for (int j = 0; j < FRAMES; j++) {
    float x = (alsa_buffer[(fft_last_index + j + BUFFER_OVERSIZE * FRAMES) % (BUFFER_OVERSIZE * FRAMES)] - offset);
    cpx_in[j] = (kiss_fft_cpx){ .r = x, .i = x };
}
kiss_fft(cfg, cpx_in, cpx_out);
My output seems really off. When I play a simple sine wave, there are multiple outputs with values way above zero, and it generally seems like the first entries are much higher. Do I have to weight the outputs?
I also don't understand how to treat the complex numbers: I'm currently putting my input values into both the real and imaginary parts, and for the output I take the absolute value. Is that right?
Also, spectrum analyzers for audio usually have logarithmic scaling, so I tried that. The problem is that the FFT output, as far as I know, isn't logarithmic: the first band might be, say, 0-100 Hz, but ideally my first LED should only cover up to about 60 Hz (a fraction of the first output's band), while the last LED would cover, say, 8 kHz to 10 kHz, which would be 20 FFT outputs.
Is there any way to make the output logarithmic? How do I limit the spectrum to 20 kHz (or know what the bands of the output are in general), and are there any other things to look out for when working with audio signals?
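Edit: for reference, here is the variant I would try next, as an untested sketch: imaginary parts zeroed and magnitudes taken from the output, with bin k covering frequencies around k * sample_rate / FRAMES (spectrum_from_real is just a name I made up):

#include <math.h>
#include "kiss_fft.h"

/* Untested sketch: real samples go into .r, .i stays 0, and the
   magnitude of output bin k corresponds to frequencies around
   k * sample_rate / FRAMES. Only the first FRAMES/2 bins are
   meaningful for real input. */
void spectrum_from_real(const float *samples, float *magnitudes) {
    kiss_fft_cpx in[FRAMES], out[FRAMES];
    kiss_fft_cfg cfg = kiss_fft_alloc(FRAMES, 0, NULL, NULL);

    for (int j = 0; j < FRAMES; j++) {
        in[j].r = samples[j];
        in[j].i = 0.0f;    /* zero imaginary part for real input */
    }
    kiss_fft(cfg, in, out);

    for (int k = 0; k < FRAMES / 2; k++)
        magnitudes[k] = sqrtf(out[k].r * out[k].r + out[k].i * out[k].i);

    kiss_fft_free(cfg);
}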

What causes the stack overflow? And how can I resolve it?

I was doing homework for a computer graphics course.
We need to use flood fill to paint an area, but no matter how I changed the stack reserve size in Visual Studio, it always crashed with a stack overflow.
void Polygon_FloodFill(HDC hdc, int x0, int y0, int fillColor, int borderColor) {
    int interiorColor;
    interiorColor = GetPixel(hdc, x0, y0);
    if ((interiorColor != borderColor) && (interiorColor != fillColor)) {
        SetPixel(hdc, x0, y0, fillColor);
        Polygon_FloodFill(hdc, x0 + 1, y0, fillColor, borderColor);
        Polygon_FloodFill(hdc, x0, y0 + 1, fillColor, borderColor);
        Polygon_FloodFill(hdc, x0 - 1, y0, fillColor, borderColor);
        Polygon_FloodFill(hdc, x0, y0 - 1, fillColor, borderColor);
    }
}
You may have too large an area to fill, which causes recursive calls to consume all of the execution stack in your program.
Your options:
grow the execution stack even further, if you can
reduce the area (how about just 100x100 or 20x20?)
stop using the execution stack and use a data structure that works similarly but can contain more elements (by being more efficient and/or being able to grow/be larger); see the sketch after this list
use a different algorithm (e.g. consider going from individual pixels to horizontal spans of pixels, there will be many fewer of the latter than the former)
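For the third option, a minimal sketch with an explicit heap-allocated stack (same GetPixel/SetPixel calls as your code; the capacity is a guess, and a robust version would grow it on demand):

#include <windows.h>
#include <stdlib.h>

typedef struct { int x, y; } Point;

/* Same fill, but with an explicit stack instead of recursion, so the
   working set lives on the heap rather than the execution stack. */
void Polygon_FloodFill_Iter(HDC hdc, int x0, int y0, int fillColor, int borderColor) {
    size_t cap = 1 << 20, top = 0;
    Point *stack = malloc(cap * sizeof *stack);
    if (!stack) return;

    stack[top++] = (Point){ x0, y0 };
    while (top > 0) {
        Point p = stack[--top];
        int c = GetPixel(hdc, p.x, p.y);
        if (c == borderColor || c == fillColor) continue;
        SetPixel(hdc, p.x, p.y, fillColor);
        if (top + 4 <= cap) {
            stack[top++] = (Point){ p.x + 1, p.y };
            stack[top++] = (Point){ p.x - 1, p.y };
            stack[top++] = (Point){ p.x, p.y + 1 };
            stack[top++] = (Point){ p.x, p.y - 1 };
        }
    }
    free(stack);
}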
What causes the stack overflow?
What is the range of x0? +/- 2,000,000,000? That is your stack depth potential.
Code does not obviously prevent going out of range unless GetPixel(out-of-range) returns a no-match value.
And how can I resolve it?
Code needs to be more selective on recursive calls.
When a row of pixels can be set, do so without recursion.
Then examine the rows above and below it, and recurse only at the start of each contiguous run of pixels that needs setting.
A promising approach would handle the middle and then look at the 4 cardinal directions.
// Pseudo code
Polygon_FloodFill(x, y, c)
    if (pixel(x,y) needs filling) {
        set pixel(x,y,c);
        for each of the 4 directions {
            // example: east
            i = 1;
            // fill the east run first
            while (pixel(x+i, y) needs filling) {
                set pixel(x+i, y, c);
                i++;
            }
            // now examine the line above the "east" run
            recursed = false;
            for (j = 1; j < i; j++) {
                if (pixel(x+j, y+1) needs filling) {
                    if (!recursed) {
                        recursed = true;
                        Polygon_FloodFill(x+j, y+1, c);
                    } else {
                        // no need to call Polygon_FloodFill; this pixel
                        // will be caught by the previous call
                    }
                } else {
                    recursed = false;
                }
            }
            // same for the line below the "east" run (y-1)
        }
        // do the same for south, west, north
    }
How many pixels are there to fill? Each pixel is one level of recursion, and each level stores all the local variables and operands of the recursive function, plus the return value and return address. So for each pixel you store this:
void Polygon_FloodFill(HDC hdc, int x0, int y0, int fillColor, int borderColor) {
    int interiorColor;
In a 32-bit environment I estimate this, in bytes:
4 Bytes  Polygon_FloodFill return address
4 Bytes  HDC hdc
4 Bytes  int x0
4 Bytes  int y0
4 Bytes  int fillColor
4 Bytes  int borderColor
4 Bytes  int interiorColor
-----------------------------------------
~ 7*4 = 28 Bytes per call
There might be even more, depending on the compiler and calling convention.
Now if your filled area is, for example, 256x256 pixels, then you need:
7*4*256*256 = 1.75 MByte
of stack memory. How much you actually have depends on the settings you compile/link with, so go to the project options and look for the stack/heap limits...
How to deal with this?
Lower the stack usage per call.
Simply don't pass parameters to your flood fill; move them to global variables instead:
HDC floodfill_hdc;
int floodfill_x0, floodfill_y0, floodfill_fillColor, floodfill_borderColor;

void _Polygon_FloodFill()
{
    // here goes your original filling code
    int interiorColor;
    ...
}

// this is what you call when you want to fill something
void Polygon_FloodFill(HDC hdc, int x0, int y0, int fillColor, int borderColor)
{
    floodfill_hdc = hdc;
    floodfill_x0 = x0;
    floodfill_y0 = y0;
    floodfill_fillColor = fillColor;
    floodfill_borderColor = borderColor;
    _Polygon_FloodFill();
}
This will allow you to fill an area roughly 14 times bigger.
Limit the recursion depth.
Add one global counter that tracks the current depth of recursion; when it hits a limit value, don't recurse. Instead, add the pixel position to a list (a simple work queue) that is processed after the current recursion stops.
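A sketch of the idea (needs_filling, set_pixel, and the list capacity are placeholders):

/* Sketch: recursion past MAX_DEPTH parks coordinates in a to-do list
   that is drained once the call tree unwinds back to depth 0.
   needs_filling() and set_pixel() stand in for the real pixel access. */
#define MAX_DEPTH 1000
#define MAX_TODO  65536

typedef struct { int x, y; } Pixel;

static int depth = 0;
static Pixel todo[MAX_TODO];
static int todo_count = 0;

void flood_fill_limited(int x, int y, int c) {
    if (!needs_filling(x, y)) return;
    if (depth >= MAX_DEPTH) {              /* defer instead of recursing */
        if (todo_count < MAX_TODO)
            todo[todo_count++] = (Pixel){ x, y };
        return;
    }
    set_pixel(x, y, c);
    depth++;
    flood_fill_limited(x + 1, y, c);
    flood_fill_limited(x - 1, y, c);
    flood_fill_limited(x, y + 1, c);
    flood_fill_limited(x, y - 1, c);
    depth--;

    if (depth == 0) {                      /* drain the deferred list */
        while (todo_count > 0) {
            Pixel p = todo[--todo_count];
            flood_fill_limited(p.x, p.y, c);
        }
    }
}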
Change the filling from pixels to lines.
This eliminates a lot of the recursive calls; as a wildly rough estimate, it drops from n recursions to around sqrt(n). You simply fill a whole line from the start point in a predetermined direction until you hit the border, so you recurse once per line instead of once per pixel. Here is an example (see [edit2]):
Paint algorithm leaving white pixels at the edges when I color
However, the function name Polygon_FloodFill implies you have the border polygon in vector form. If that's the case, filling it will be much faster with polygon rasterization techniques such as:
how to rasterize rotated rectangle (in 2d by setpixel)
but for that the polygon must be convex, so if it isn't, you need to triangulate it or break it down into convex polygons first (for example with ear clipping).

Hough Transform: improving algorithm efficiency over OpenCL

I am trying to detect a circle in a binary image using the Hough transform.
When I use OpenCV's built-in function for the circular Hough transform, it works and I can find the circle.
Now I am trying to write my own kernel code for the Hough transform, but it is very, very slow:
kernel void hough_circle(read_only image2d_t imageIn, global int* in, const int w_hough, __global int* circle)
{
    sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int gid0 = get_global_id(0);
    int gid1 = get_global_id(1);
    uint4 pixel;
    int x0 = 0, y0 = 0, r;
    int maxval = 0;

    pixel = read_imageui(imageIn, sampler, (int2)(gid0, gid1));
    if (pixel.x == 255)
    {
        for (int r = 20; r < 150; r += 2)
        {
            // int r=100;
            for (int theta = 0; theta < 360; theta += 2)
            {
                x0 = (int) round(gid0 - r * cos((float) radians((float) theta)));
                y0 = (int) round(gid1 - r * sin((float) radians((float) theta)));
                if ((x0 > 0) && (x0 < get_global_size(0)) && (y0 > 0) && (y0 < get_global_size(1)))
                    atom_inc(&in[w_hough * y0 + x0]);
            }
            if (maxval < in[w_hough * y0 + x0])
            {
                maxval = in[w_hough * y0 + x0];
                circle[0] = gid0;
                circle[1] = gid1;
                circle[2] = r;
            }
        }
    }
}
There is source code for OpenCV's OpenCL Hough implementation, but it's hard for me to extract a specific function that helps.
Can anyone offer a better source code example, or help me understand why this is so inefficient?
The code (main.cpp and kernel.cl) is compressed in a RAR file: http://www.files.com/set/527152684017e
It uses the OpenCV library to read and display the image.
Making repeated calls to sin() and cos() is computationally expensive. Since you only ever call these functions with the same 180 values of theta, you could speed things up by precalculating these values and storing them in an array.
A more robust approach would be to use the midpoint circle algorithm to find the perimeters of these circles by simple integer arithmetic.
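For example, the tables could be built once on the host side like this (plain C sketch; N_THETA matches the 180 theta values the kernel actually uses, and the arrays would then be passed to the kernel as a constant buffer):

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Precompute cos/sin for theta = 0, 2, 4, ..., 358 degrees once,
   instead of calling sin()/cos() per loop iteration in the kernel. */
#define N_THETA 180

float cos_table[N_THETA], sin_table[N_THETA];

void init_trig_tables(void) {
    for (int t = 0; t < N_THETA; t++) {
        float rad = (float)(2 * t) * (float)(M_PI / 180.0);
        cos_table[t] = cosf(rad);
        sin_table[t] = sinf(rad);
    }
}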
What you are doing is running a huge, CPU-style block of code in a single work-item; the result, as expected, is a slow kernel.
Detailed answer:
The only place where you use the work-item ID is to fetch the pixel value; if that condition is met, you run a big chunk of code. Some work-items trigger it and some don't, and the ones that do will indirectly force their whole work-group to run that code, which slows you down.
In addition, the work-items that don't enter the condition will sit idle. Depending on the image, maybe 99% of them are idle.
I would rewrite your algorithm to use one work-group per pixel.
If the condition is met, the work-group runs the algorithm; if not, the whole work-group skips it. And when a work-group does enter the condition, you have many work-items to play with, which allows a redesign where the inner for loops run in parallel.

How to realize the DRAWING process in Processing?

We all know how to draw a line in Processing.
But when we draw a line, the line is shown immediately.
What if I want to witness the drawing process, namely, to see the line move forward and gradually complete the whole line?
Here's what I want to achieve: to DRAW several lines and curves which finally turn into some pattern.
So how can I make that happen? Using an array?
Many thanks.
In Processing, all of the drawing happens in a loop. An easy way to create animated sequences like you describe is to use frameCount to drive them, and the modulo operator % is a good way to create a cycle. For example, to animate along the x axis:
void draw() {
    float x = 50;
    float y = 50;
    float lineLength = 50;
    int framesToAnimate = 60;
    line(x, y, x + float(frameCount % framesToAnimate) / framesToAnimate * lineLength, y);
}
Note: strange things will happen if you don't cast/convert to a float (integer division would truncate the fraction to zero).
I use this pretty often to animate other features such as the color.
fill(color(127 + sin(float(frameCount)/90)*127, 0, 0, 127));
If you want to get more advanced, try setting vectors and coordinates with PVector. There is a pretty good tutorial on Daniel Shiffman's site.
If you want your animation to be independent of the frame rate, you can use millis() instead. That returns the time since the sketch started, so you can make something happen at a given time in seconds.
For example:
long initialTime;

void setup() {
    size(400, 200);
    initialTime = millis();
}

void draw() {
    float x = 50;
    float y = 50;
    // adjust the multiplier to set the speed
    line(x, y,       x + (millis() - initialTime) * 0.01,  y);       // 10 px/sec
    line(x, y + 50,  x + (millis() - initialTime) * 0.05,  y + 50);  // 50 px/sec
    line(x, y + 100, x + (millis() - initialTime) * 0.001, y + 100); // 1 px/sec
}
There are also some animation libraries; I've seen impressive results with some, but I've never used them. Here is a list.

Looking for a fast outlined line rendering algorithm

I'm looking for a fast algorithm to draw an outlined line. For this application, the outline only needs to be 1 pixel wide. It should be possible, whether by default or through an option, to make two lines connect together seamlessly, if they share a common point.
Excuse the ASCII art but this is probably the best way to demonstrate it.
Normal line:
##
##
##
##
##
##
"Outlined" line:
**
*##**
**##**
**##**
**##**
**##**
**##*
**
I'm working on a dsPIC33FJ128GP802. It's a small microcontroller/digital signal processor, capable of 40 MIPS (million instructions per second.) It is only capable of integer math (add, subtract and multiply: it can do division, but it takes ~19 cycles.) It's being used to process an OSD layer at the same time and only 3-4 MIPS of the processing time is available for calculations, so speed is critical. The pixels occupy three states: black, white and transparent; and the video field is 192x128 pixels. This is for Super OSD, an open source project: http://code.google.com/p/super-osd/
The first solution I thought of was to draw 3x3 rectangles with outlined pixels on the first pass and normal pixels on the second pass, but this could be slow, as for every pixel at least 3 pixels are overwritten and the time spent drawing them is wasted. So I'm looking for a faster way. Each pixel costs around 30 cycles. The target is <50,000 cycles to draw a line 100 pixels long.
I suggest this (C/pseudocode mix):

void draw_outline(int x1, int y1, int x2, int y2)
{
    int x, y;
    double slope;
    if (abs(x2 - x1) >= abs(y2 - y1)) {
        // line closer to horizontal than vertical
        if (x2 < x1) swap_points(1, 2);
        // now x1 <= x2
        slope = 1.0 * (y2 - y1) / (x2 - x1);
        draw_pixel(x1 - 1, y1, '*');
        for (x = x1; x <= x2; x++) {
            y = y1 + round(slope * (x - x1));
            draw_pixel(x, y - 1, '*');
            draw_pixel(x, y + 1, '*');
            // here draw_line() does draw_pixel(x, y, '#');
        }
        draw_pixel(x2 + 1, y2, '*');
    }
    else {
        // same as above, but with x and y swapped
    }
}
Edit: If you want to have successive lines connect seamlessly, I think you really have to draw all the outlines in the first pass, and then the lines. I edited the code above to draw only the outlines. The draw_line() function would be exactly the same but with one single draw_pixel(x, y, '#'); instead of four draw_pixel(..., ..., '*');.
And then you just:
void draw_polyline(point p[], int n)
{
    int i;
    for (i = 0; i < n - 1; i++)
        draw_outline(p[i].x, p[i].y, p[i+1].x, p[i+1].y);
    for (i = 0; i < n - 1; i++)
        draw_line(p[i].x, p[i].y, p[i+1].x, p[i+1].y);
}
My approach would be to use Bresenham's algorithm to draw multiple lines. Looking at your ASCII art, you'll note that the outline lines are just the same as the Bresenham line, shifted 1 pixel up and down -- plus a single pixel to the left of the first point and to the right of the last.
For a generic version, you'll need to determine whether your line is flat or steep -- i.e., whether abs(y1 - y0) <= abs(x1 - x0). For steep lines, the outlines are shifted by 1 pixel to the left and right, and the closing pixels are above the starting and below the ending point.
It could be worth optimizing this by drawing the line and two outline pixels in one go for each line pixel. However, if you need seamless outlines, the simplest solution would be to first draw all outlines, then the lines themselves -- which wouldn't work with the "three-pixel-Bresenham" optimization.
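For the flat case, the one-go variant might look like this untested sketch (draw_pixel as in the answer above; assumes x0 < x1 and abs(y1 - y0) <= x1 - x0):

/* Untested sketch of the "three-pixel Bresenham" idea for the flat
   case: each step plots the line pixel plus the outline pixels above
   and below it, with single cap pixels at both ends. */
void draw_outlined_line_flat(int x0, int y0, int x1, int y1)
{
    int dx = x1 - x0;
    int dy = (y1 > y0) ? y1 - y0 : y0 - y1;
    int ystep = (y1 > y0) ? 1 : -1;
    int err = dx / 2;
    int x, y = y0;

    draw_pixel(x0 - 1, y0, '*');              /* left cap */
    for (x = x0; x <= x1; x++) {
        draw_pixel(x, y - 1, '*');
        draw_pixel(x, y,     '#');
        draw_pixel(x, y + 1, '*');
        err -= dy;
        if (err < 0) { y += ystep; err += dx; }
    }
    draw_pixel(x1 + 1, y1, '*');              /* right cap */
}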
