Hough Transform: improving algorithm efficiency over OpenCL - c

I am trying to detect a circle in a binary image using the Hough transform.
When I use OpenCV's built-in function for the circular Hough transform, it works and I can find the circle.
Now I am trying to write my own kernel code for the Hough transform, but it is very slow:
kernel void hough_circle(read_only image2d_t imageIn, global int* in,const int w_hough,__global int * circle)
int gid0 = get_global_id(0);
int gid1 = get_global_id(1);
uint4 pixel;
int x0=0,y0=0,r;
int maxval=0;
for(int r=20;r<150;r+=2)
// int r=100;
for(int theta=0; theta<360;theta+=2)
x0=(int) round(gid0-r*cos( (float) radians( (float) theta) ));
y0=(int) round(gid1-r*sin( (float) radians( (float) theta) ));
if((x0>0) && (x0<get_global_size(0)) && (y0>0)&&(y0<get_global_size(1)))
There is source code for the OpenCL Hough transform in OpenCV, but it is hard for me to extract a specific function that helps me.
Can anyone offer a better source code example, or help me understand why this is so inefficient?
The code (main.cpp and kernel.cl) is compressed in a RAR file: http://www.files.com/set/527152684017e
It uses the OpenCV library to read and display the image.

Making repeated calls to sin() and cos() is computationally expensive. Since you only ever call these functions with the same 180 values of theta, you could speed things up by precalculating these values and storing them in an array.
A more robust approach would be to use the midpoint circle algorithm to find the perimeters of these circles by simple integer arithmetic.
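A rough sketch of the midpoint circle approach in plain C, assuming the accumulator is a flat w-by-h int array. The names vote and vote_circle are hypothetical. Points on the axes and diagonals are guarded so that no cell is voted for twice by the same circle.

```c
#include <assert.h>
#include <stddef.h>

/* Add one vote at (x, y) if it lies inside the w-by-h accumulator. */
static void vote(int *acc, int w, int h, int x, int y)
{
    if (x >= 0 && x < w && y >= 0 && y < h)
        acc[(size_t)y * (size_t)w + (size_t)x]++;
}

/* Vote along the circle of radius r centred on (cx, cy), integers only. */
void vote_circle(int *acc, int w, int h, int cx, int cy, int r)
{
    assert(acc != NULL && r > 0);
    int x = r, y = 0;
    int err = 1 - r;                 /* midpoint decision variable */

    while (x >= y) {
        /* One computed point yields up to eight by symmetry. */
        vote(acc, w, h, cx + x, cy + y);
        vote(acc, w, h, cx - x, cy + y);
        if (y != 0) {                /* skip mirrored duplicates on the x axis */
            vote(acc, w, h, cx + x, cy - y);
            vote(acc, w, h, cx - x, cy - y);
        }
        if (x != y) {                /* skip duplicates on the diagonals */
            vote(acc, w, h, cx + y, cy + x);
            vote(acc, w, h, cx + y, cy - x);
            if (y != 0) {
                vote(acc, w, h, cx - y, cy + x);
                vote(acc, w, h, cx - y, cy - x);
            }
        }
        y++;
        if (err < 0) {
            err += 2 * y + 1;
        } else {
            x--;
            err += 2 * (y - x) + 1;
        }
    }
}
```

Calling vote_circle once per edge pixel and per radius replaces the theta loop entirely, and the number of votes scales with the circle's circumference instead of being fixed at 180.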

What you are doing is running a huge block of CPU-style code in a single work-item. The result, as expected, is a slow kernel.
Detailed answer:
The only place where you use the work-item ID is to read the pixel value; if that condition is met, you run a big chunk of code. Some work-items will trigger this and some won't. The ones that trigger it indirectly make the whole work-group run that code, and this slows you down.
In addition, the workitems that don't enter that condition will be idle. Depending on the image maybe 99% of them are idle.
I would rewrite your algorithm to use 1 workgroup per pixel.
If the condition is met the workgroup will run the algorithm, if it is not, the whole workgroup will skip. And in the case the workgroup enters the condition, you will have many workitems to play with. This will allow a redesign of the code such that the inner for loops run in parallel.


Why am I getting huge slowdown when parallelising with OpenMP and using static scheduling?

I'm working to parallelise a disease spread model in C using OpenMP but am only seeing massive (order of magnitude) slowdown. I'll point out at the outset that I am a complete novice with both OpenMP and C.
The code loops over every point in the simulation and checks its status (susceptible, infected, recovered) and for each status, follows an algorithm to determine its status at the next time step.
I'll give the loop for infected points for illustrative purposes. Lpoints is a list of indices for points in the simulation, Nneigh gives the number of neighbours each point has and Lneigh gives the indices of these neighbours.
for (ipoint=0;ipoint<Nland;ipoint++) { //loop over all points
  if (Lpoints_old[ipoint]==I) { //act on infected points
    /* Probability Prec of infected population recovering */
    xi = genrand();
    if (xi<Pvac) { /* This point recovers (I->R) */
      Lpoints[ipoint] = R;
      /* printf("Point %d gained immunity\n",ipoint); */
    }
    else {
      /* Probability of being blockaded by neighbours */
      nsn = 0;
      for (in=0;in<Nneigh[ipoint];in++) { /*count susceptible neighbours (nsn)*/
        //if (npoint<0) printf("Bad npoint 1: %d in=%d\n",ipoint,in);
        npoint = Lneigh[ipoint][in];
        if (Lpoints_old[npoint]==S) nsn++;
      }
      Prob = (double)nsn*Pblo;
      xi = genrand();
      if (xi<Prob) { /* The population point is blockaded (I->R)*/
        Lpoints[ipoint] = R;
      }
      else { /* Still infected */
        Lpoints[ipoint] = I;
      }
    } /*else*/
  } /*infected*/
} /*for*/
I tried to parallelise by adding #pragma omp parallel for default(shared) private(ipoint,xi,in,npoint,nsn,Prob) before the for loop. (I tried using default(none), as is generally recommended, but it wouldn't compile.) On the small grid I am using to test, the original serial code runs in about 5 seconds and the OpenMP version runs in around 50.
I have searched for ages online, and every similar problem seems to be the result of false sharing and has been solved by using static scheduling with a chunk size divisible by 8. I tried varying the chunk size to no effect whatsoever, only getting the timings back to the original order when the chunk size surpassed the size of the problem (i.e. back to running serially on one thread).
Slowdown doesn't seem any better when the problem is more appropriately scaled as far as I can tell either. I have no idea why this isn't working and what's going wrong. Any help greatly appreciated.
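One thing that may be worth checking, purely as an assumption because genrand() is not shown: if it keeps its state in a shared global, every thread updates the same state on every call, which is both a data race and a likely source of contention. The sketch below shows the general pattern of giving each thread its own generator state. rng_next and count_below are hypothetical names, and xorshift32 is only a stand-in, not the generator used in the question.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for genrand(): xorshift32 with caller-owned state. */
static double rng_next(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return (double)x / 4294967296.0;    /* uniform in (0, 1) */
}

/* Count how many of n draws fall below p; each thread owns its RNG state. */
long count_below(long n, double p, uint32_t seed)
{
    assert(n >= 0);
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Derive a distinct, non-zero state for every thread. */
        uint32_t state = seed + 0x9E3779B9u * (uint32_t)(tid + 1);
        if (state == 0)
            state = 1;
        #pragma omp for
        for (long i = 0; i < n; i++)
            if (rng_next(&state) < p)
                hits++;
    }
    return hits;
}
```

The same pattern would apply to the infection loop: declare the state inside the parallel region so it is private, and pass a pointer to it wherever genrand() is called now. Without OpenMP enabled the pragmas are ignored and the function simply runs serially.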

How to use KissFFT with audio?

I have an array of 2048 samples of an audio file at 44.1 kHz and want to transform it into a spectrum for an LED effect. I don't know too much about the inner workings of the FFT, but I tried it using KissFFT:
kiss_fft_cpx *cpx_in = malloc(FRAMES * sizeof(kiss_fft_cpx));
kiss_fft_cpx *cpx_out = malloc(FRAMES * sizeof(kiss_fft_cpx));
kiss_fft_cfg cfg = kiss_fft_alloc( FRAMES , 0 ,0,0 );
for(int j = 0;j<FRAMES;j++) {
    float x = (alsa_buffer[(fft_last_index+j+BUFFER_OVERSIZE*FRAMES)%(BUFFER_OVERSIZE*FRAMES)] - offset);
    cpx_in[j] = (kiss_fft_cpx){.r = x, .i = x};
}
kiss_fft(cfg, cpx_in, cpx_out);
My output seems really off. When I play a simple sine, there are multiple outputs with values way above zero. Also, the first entries generally seem to be much higher. Do I have to weight the outputs?
I also don't understand how to treat the complex numbers. I'm currently putting my input values in both the real and imaginary parts, and for the output I use the absolute value. Is that right?
Also, spectrum analyzers for audio usually have logarithmic scaling, so I tried that, but as far as I know the FFT output isn't logarithmic. The first band is, say, 0-100 Hz, but ideally my first LED should only cover up to about 60 Hz (a fraction of the first output band), while the last LED would cover, say, 8 kHz to 10 kHz, which in that case would be 20 FFT outputs.
Is there any way to make the output logarithmic? How do I limit the spectrum to 20 kHz (or know what the bands of the output are in general), and is there anything else to look out for when working with audio signals?

FFTW Output differs from Matlab with the same input dataset

I am developing an application that should analyze data coming from an A/D stage and find the frequency peaks in a defined frequency range (0-10kHz).
We are using the FFTW3 library, version 3.3.6, running on 64bit Slackware Linux (GCC version 5.3.0). As you can see in the piece of code included, we run the FFTW plan getting result in complex vector result[]. We have verified the operations using MATLAB. We run the FFT on MATLAB (that claims to use the same library) with exactly the same input datasets (complex signal[] as in the source code). We observe some difference between FFTW (Linux ANSI C) and MATLAB run. Each plot is done using MATLAB. In particular, we would like to understand (mag[] array):
Why is the noise floor so different?
After the main peak (at more or less 3kHz) we observe a negative peak in the Linux result, while MATLAB shows correctly a secondary peak as from the input signal.
In these examples, we do not perform any output normalization, neither in Linux nor in MATLAB. The two plots show the magnitude of the FFT results (not converted to dB).
The correct result is the MATLAB one. Does anyone have any suggestions about these differences? And how can we get results closer to MATLAB with the FFTW library?
Below is the relevant C source code, followed by the two plots.
// Part of source code:
// rup[] is filled with unsigned char data coming from an A/D conversion stage (8 bit depth)
// Sampling Frequency is 45.454 KHz
// Frequency Range: 0 - 10.0 KHz
#define CONVCOST 0.00787401574803149606
double mag[4096];
unsigned char rup[4096];
int i;
fftw_complex signal[1024];
fftw_complex result[1024];
fftw_plan plan = fftw_plan_dft_1d(1024,signal,result,FFTW_FORWARD,FFTW_ESTIMATE);
signal[i][REAL] = (double)rup[i] * CONVCOST;
signal[i][IMAG] = 0.0;
for (i = 0; i < 512; ++i)
mag[i] = sqrt(result[i][REAL] * result[i][REAL] + result[i][IMAG] * result[i][IMAG]);

Weird angle results when drone is flying

We use the L3GD20 gyro sensor and the LSM303DLHC accelerometer sensor in combination with the complementary filter to measure the angles of the drone.
If we simulate the angles of the drone with our hands, for example, if we tilt the drone forward, our x angle is positive. And if we tilt it backwards, our x angle is negative.
But if we activate our motors, the drone will always go to a negative x-angle. On top of that, the x-angle that used to be positive is now negative. Because the drone tries to compensate this angle, but the angles are inverted, the drone will never go back to its original state.
#define RAD_TO_DEG 57.29578
#define G_GAIN 0.070
#define AA 0.98
#define LOOP_TIME 0.02
void calculate_complementaryfilter(float* complementary_angles)
{
    complementary_angles[X] = AA * (complementary_angles[X] + gyro_rates[X] * LOOP_TIME) + (1 - AA) * acc_angles[X];
    complementary_angles[Y] = AA * (complementary_angles[Y] + gyro_rates[Y] * LOOP_TIME) + (1 - AA) * acc_angles[Y];
}
void convert_accelerometer_data_to_deg()
{
    acc_angles[X] = (float) atan2(acc_raw[X], (sqrt(acc_raw[Y] * acc_raw[Y] + acc_raw[Z] * acc_raw[Z]))) * RAD_TO_DEG;
    acc_angles[Y] = (float) atan2(acc_raw[Y], (sqrt(acc_raw[X] * acc_raw[X] + acc_raw[Z] * acc_raw[Z]))) * RAD_TO_DEG;
}
void convert_gyro_data_to_dps()
{
    gyro_rates[X] = (float)gyr_raw[X] * G_GAIN;
    gyro_rates[Y] = (float)gyr_raw[Y] * G_GAIN;
    gyro_rates[Z] = (float)gyr_raw[Z] * G_GAIN;
}
The problem isn't the shaking of the drone. If we put the motors on max speed and simulate the angles by hand, we get the right angles. Thus also the right compensation by the motors.
If we need to add more code, just ask.
Thank you in advance.
Standard exhaustive methodology for this kind of problem
You can go top-down or bottom-up. In this case, I'm more inclined to suspect a hardware-related problem, but it is up to you:
Power related problem
When you take the drone with your hand and run motors at full throttle, do they have propellers installed?
Motors at full speed without propellers draw only a fraction of their full-load current. When lifting the drone's weight, a voltage drop can cause the electronics to malfunction.
Alternative cause: short circuits or leakage paths?
Mechanical problem (a.k.a vibrations spoil sensor readings)
In the past, I've seen MEMS sensors suffer a lot under heavy vibrations (amplitude ±4 g). By "suffer a lot" I mean the accelerometer not even registering gravity and the gyros returning meaningless data.
If the vibrations are significant, you need either a better frame for the drone or better vibration isolation for the sensors.
SW issues (data/algorithm/implementation)
If it is definitely unrelated to power and mechanical aspects, you can log raw sensor data and process it offline to see if it makes sense.
For this, you need an offline implementation of the same algorithm that runs on the drone.
Here you will be able to discern between:
Buggy/wrong embedded implementation.
Algorithm fails under working conditions.
Other: data looks wrong (problem reading), SW not reaching time cycle constraints.
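As a rough illustration of the offline check, the sketch below replays logged samples through the same one-axis complementary filter used in the question. The log_sample struct and replay_filter are hypothetical, and the log is assumed to already contain gyro rates in deg/s and accelerometer angles in degrees, one pair per 20 ms loop.

```c
#include <assert.h>
#include <stddef.h>

#define AA        0.98f   /* filter weight, as in the question */
#define LOOP_TIME 0.02f   /* seconds per sample, as in the question */

/* One logged sample for a single axis (hypothetical log layout). */
typedef struct {
    float gyro_rate;   /* deg/s */
    float acc_angle;   /* degrees */
} log_sample;

/*
 * Replay n logged samples through the complementary filter, starting
 * from `angle`. If out is not NULL, the angle after each step is stored
 * there. Returns the final angle.
 */
float replay_filter(const log_sample *samples, size_t n, float angle, float *out)
{
    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        angle = AA * (angle + samples[i].gyro_rate * LOOP_TIME)
              + (1.0f - AA) * samples[i].acc_angle;
        if (out != NULL)
            out[i] = angle;
    }
    return angle;
}
```

Comparing out[] against the angles the drone computed in flight separates the cases: if they match but look wrong, suspect the data or the algorithm; if they differ, suspect the embedded implementation or its timing.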

Suggestions on optimizing a Z-buffer implementation?

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.
In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this . . .)
I ran gprof on my executable, and I end up with the following interesting lines:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
43.51      9.50      9.50                             vSwap
34.86     17.11      7.61   179944     0.04     0.04  grInterpolateHLine
13.99     20.17      3.06                             grClearDepthBuffer
 0.76     21.78      0.17      624     0.27    12.46  grScanlineFill
The function vSwap is my double-buffer swapping function, and it also performs vsyncing, so it makes sense to me that the test program will spend much of its time waiting in there. grScanlineFill is my triangle-drawing function, which creates an edge list and then calls grInterpolateHLine to actually fill in the triangle.
My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsync overhead, then it turns out that the test program is spending something like 85% of its execution time either clearing the depth buffer, or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    for(; x1 <= x2; x1 ++, z += zstep) {
        if(z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
        }
    }
}
I really don't see how I can improve that, especially considering that vSetPixel is a macro.
My entire stock of ideas for optimization has been whittled down to precisely one:
Use an integer/fixed-point depth buffer.
The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.
You should have a look at the source code to something like Quake - considering what it could achieve on a Pentium, 15 years ago. Its z-buffer implementation used spans rather than per-pixel (or fragment) depth. Otherwise, you could look at the rasterization code in Mesa.
It's hard to tell what higher-order optimizations can be done without seeing the rest of the code. I have a couple of minor observations, though.
There's no need to calculate x1 + y * VIDEO_WIDTH more than once in grInterpolateHLine. i.e.:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    for(; x1 <= x2; x1 ++, z += zstep, offset++) {
        if(z < grDepthBuffer[offset]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[offset] = z;
        }
    }
}
Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to use the same offset there as well, and then you only need to increment offset and not x1 in each loop iteration. Chances are this can be extended back to the function that calls grInterpolateHLine, and you would then only need to do the multiplication once per triangle.
There are some other things you could do with the depth buffer. Most of the time if the first pixel of the line either fails or passes the depth test, then the rest of the line will have the same result. So after the first test you can write a more efficient assembly block to test the entire line in one shot, then if it passes you can use a more efficient block memory setter to block-set the pixel and depth values instead of doing them one at a time. You would only need to test/set per pixel if the line is only partially occluded.
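A plain C sketch of that idea, without assembly: classify the whole span first, then take a branch-free fill path when every pixel passes and skip the span when every pixel fails. classify_span and fill_span are hypothetical names, and the colour buffer is assumed to be a flat int array, since vSetPixel is not shown. Whether this beats the single-pass loop would have to be measured on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Classify a span against the depth buffer.
 * Returns 1 if every pixel passes, -1 if every pixel fails, 0 if mixed.
 */
int classify_span(const float *depth, int count, float z, float zstep)
{
    assert(depth != NULL && count > 0);
    int passed = 0;
    for (int i = 0; i < count; i++, z += zstep)
        passed += (z < depth[i]);
    if (passed == count)
        return 1;
    if (passed == 0)
        return -1;
    return 0;
}

/*
 * Fill one span. colour and depth point at the first pixel of the span
 * in their respective buffers; c is the colour to write.
 */
void fill_span(int *colour, float *depth, int count,
               float z, float zstep, int c)
{
    int kind = classify_span(depth, count, z, zstep);

    if (kind < 0)
        return;                          /* fully hidden: nothing to write */

    if (kind > 0) {                      /* fully visible: no per-pixel test */
        for (int i = 0; i < count; i++, z += zstep) {
            colour[i] = c;
            depth[i] = z;
        }
        return;
    }

    for (int i = 0; i < count; i++, z += zstep) {   /* partially occluded */
        if (z < depth[i]) {
            colour[i] = c;
            depth[i] = z;
        }
    }
}
```

The fully visible path is the one that could be handed to a block-fill routine or vectorised, since it no longer depends on the depth buffer contents.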
Also, not sure what you mean by older computer, but if your target computer is multi-core then you can break it up among multiple cores. You can do this for the buffer clearing function as well. It can help quite a bit.
I ended up solving this by replacing the Z-buffer with the Painter's Algorithm. I used SSE to write a Z-buffer implementation that created a bitmask with the pixels to paint (plus the range optimization suggested by Gerald), and it still ran far too slowly.
Thank you, everyone, for your input.
