I am trying to speed up the execution of the following code with OpenMP. The code calculates a Mandelbrot set and outputs it to a canvas.
The code works fine single-threaded, but I want to use OpenMP to make it faster. I have tried all sorts of combinations of private and shared variables, but nothing has worked so far. The code always runs a little slower with OpenMP than without it (at 50 000 iterations, about 2 s slower).
I am using Ubuntu 16.04 and compiling with GCC.
void calculate_mandelbrot(GLubyte *canvas, GLubyte *color_buffer, uint32_t w, uint32_t h, mandelbrot_f x0, mandelbrot_f x1, mandelbrot_f y0, mandelbrot_f y1, uint32_t max_iter) {
    mandelbrot_f dx = (x1 - x0) / w;
    mandelbrot_f dy = (y1 - y0) / h;
    uint16_t esc_time;
    int i, j;
    mandelbrot_f x, y;
    //timer start
    clock_t begin = clock();
    #pragma omp parallel for private(i, j, x, y, esc_time) shared(canvas, color_buffer)
    for(i = 0; i < w; ++i) {
        x = x0 + i * dx;
        for(j = 0; j < h; ++j) {
            y = y1 - j * dy;
            esc_time = escape_time(x, y, max_iter);
            canvas[ GET_R(i, j, w) ] = color_buffer[esc_time * 3];
            canvas[ GET_G(i, j, w) ] = color_buffer[esc_time * 3 + 1];
            canvas[ GET_B(i, j, w) ] = color_buffer[esc_time * 3 + 2];
        }
    }
    //time calculation
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
}
The escape_time function which the code uses:
inline uint16_t escape_time(mandelbrot_f x0, mandelbrot_f y0, uint32_t max_iter) {
    mandelbrot_f x = 0.0;
    mandelbrot_f y = 0.0;
    mandelbrot_f xtemp;
    uint16_t iteration = 0;
    while((x*x + y*y < 4) && (iteration < max_iter)) {
        xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        iteration++;
    }
    return iteration;
}
The code is from this repository: https://github.com/hortont424/mandelbrot
First, as hinted in the comments, use omp_get_wtime() instead of clock() to measure the time: clock() gives you the CPU time accumulated across all threads, not the elapsed wall-clock time, so it can make a parallel run look slower even when it is faster. Second, if I recall correctly, this algorithm has load balancing problems (rows take very different amounts of work), so try dynamic scheduling:
//timer start
double begin = omp_get_wtime();
#pragma omp parallel for private(j, x, y, esc_time) schedule(dynamic, 1)
for(i = 0; i < w; ++i) {
    x = x0 + i * dx;
    for(j = 0; j < h; ++j) {
        y = y1 - j * dy;
        esc_time = escape_time(x, y, max_iter);
        canvas[ GET_R(i, j, w) ] = color_buffer[esc_time * 3];
        canvas[ GET_G(i, j, w) ] = color_buffer[esc_time * 3 + 1];
        canvas[ GET_B(i, j, w) ] = color_buffer[esc_time * 3 + 2];
    }
}
//time calculation
double end = omp_get_wtime();
double time_spent = end - begin; // omp_get_wtime() already returns seconds
printf("%f\n", time_spent);
As suggested, my problem was caused by using the clock() function, which measures CPU time accumulated across all threads.
Using omp_get_wtime(), which measures wall-clock time, solved my problem.
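To illustrate the difference between the two timers, here is a minimal self-contained sketch (mine, not from the thread; compile with gcc -fopenmp). With e.g. 4 threads, clock() reports roughly 4 times what omp_get_wtime() reports for the CPU-bound loop:

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(void) {
    clock_t c0 = clock();           // CPU time, accumulated across all threads
    double w0 = omp_get_wtime();    // wall-clock time

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i < 100000000L; ++i)
        sum += 1.0 / (double)i;     // CPU-bound busy work

    printf("clock():         %f s\n", (double)(clock() - c0) / CLOCKS_PER_SEC);
    printf("omp_get_wtime(): %f s\n", omp_get_wtime() - w0);
    printf("(sum = %f)\n", sum);    // keeps the loop from being optimized away
    return 0;
}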
I am trying to optimize a DAXPY procedure as best I can. However, there is a mistake in the following code that I cannot spot:
#include <stdio.h>
#include <time.h>

void daxpy(int n, double a, double *x, double *y) {
    int i;
    double y0, y1, y2, y3, x0, x1, x2, x3;
    // loop unrolling
    for (i = 0; i < n; i += 4) {
        // multiple accumulating registers
        x0 = x[i];
        x1 = x[i + 1];
        x2 = x[i + 2];
        x3 = x[i + 3];
        y0 = y[i];
        y1 = y[i + 1];
        y2 = y[i + 2];
        y3 = y[i + 3];
        y0 += a * x0;
        y1 += a * x1;
        y2 += a * x2;
        y3 += a * x3;
        y[i] = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
        y[i + 3] = y3;
    }
}

int main() {
    int n = 10000000;
    double a = 2.0;
    double x[n];
    double y[n];
    int i;
    for (i = 0; i < n; ++i) {
        x[i] = (double)i;
        y[i] = (double)i;
    }
    clock_t start = clock();
    daxpy(n, a, x, y);
    clock_t end = clock();
    double time_elapsed = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Time elapsed: %f seconds\n", time_elapsed);
    return 0;
}
When executing the binary I am getting the following error message:
Segmentation fault: 11
Using LLDB I am getting:
Process 89219 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7ff7bb2b2cc0)
frame #0: 0x0000000100003e4f daxpy.x`main at daxpy.c:38:14
35 double y[n];
36 int i;
37 for (i = 0; i < n; ++i) {
-> 38 x[i] = (double)i;
39 y[i] = (double)i;
40 }
41 clock_t start = clock();
Target 0: (daxpy.x) stopped.
But I can't spot the mistake on lines 38 and 39. Any help?
Assuming sizeof(double) == 8, you are trying to allocate 160 MB on the stack. The stack is usually not that big.
You need to allocate the arrays from the heap:
size_t n = 10000000;
double a = 2.0;
double *x = malloc(sizeof(*x) * n);
double *y = malloc(sizeof(*y) * n);
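For completeness, here is a minimal sketch (mine) of the whole corrected main(); the daxpy() from the question is unchanged, and the error check and the free() calls are additions:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void daxpy(int n, double a, double *x, double *y);   // as defined in the question

int main(void) {
    size_t n = 10000000;
    double a = 2.0;
    double *x = malloc(sizeof(*x) * n);
    double *y = malloc(sizeof(*y) * n);
    if (x == NULL || y == NULL) {       // malloc can fail for allocations this large
        free(x);
        free(y);
        return 1;
    }
    for (size_t i = 0; i < n; ++i) {
        x[i] = (double)i;
        y[i] = (double)i;
    }
    clock_t start = clock();
    daxpy((int)n, a, x, y);
    clock_t end = clock();
    printf("Time elapsed: %f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
    free(x);
    free(y);
    return 0;
}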
A separate issue: in the loop

for (i = 0; i < n; i += 4)

where n is the array size, what happens on the last iteration when n is not a multiple of 4? The loads

y1 = y[i + 1];
y2 = y[i + 2];
y3 = y[i + 3];

may access the array outside its bounds.
Hi, I'm trying to dynamically allocate a large matrix in C, but I'm running into the following error:
Exception thrown at 0x00007FF63A248571 in cdempd.exe: 0xC0000005: Access violation writing location 0x0000000000000000. occurred
sometimes it's Access violation writing location 0xFFFFFFFFB412E2A0.
double ndivx, ndivy, ndivz, nt, r, box, dx, totnode;

int main()
{
    ndivx = 19.0;
    ndivy = 19.0;
    ndivz = 19.0;
    int totnode = ndivx * ndivy * ndivz;
    r = 0.005; //diameter of sphere
    dx = 0.0025 / ndivx;
    double dx = r / ndivx; // distance between points
    int cols = 3;
    int** coords;
    coords = malloc(totnode * sizeof(int*));
    for (int i = 0; i < totnode; i++) {
        coords[i] = malloc(cols * sizeof(int));
    }
    //int* coord = (int*)malloc(totnode * cols * sizeof(int));
    // int offset = i * cols + j;
    // now mat[offset] corresponds to m(i, j)
    //create a cube of equidistant points
    int numm = 0;
    for (int i = 1; i <= ndivx; i++)
    {
        for (int j = 1; j <= ndivy; j++)
        {
            for (int k = 1; k <= ndivz; k++)
            {
                coords[numm][0] = -1.0 / 2.0 * (r) + (dx / 2.0) + (i - 1.0) * dx;
                coords[numm][1] = -1.0 / 2.0 * (r) + (dx / 2.0) + (j - 1.0) * dx;
                coords[numm][2] = -1.0 / 2.0 * (r) + (dx / 2.0) + (k - 1.0) * dx;
                numm = numm + 1;
            }
        }
    }
}
pd.r is a double 0.005, dx is a double about 0.00026315, totnode is 6859.
I've tried two methods: the one shown, and the one commented out with //. Both give me the same error. I'm using Visual Studio 2019. I'm not so familiar with C and Visual Studio, so forgive me if the question is silly. Any help would be appreciated, thank you.
Aside from some of the other errors [after correction], all values of coords are set to zero. This is because coords holds int values and not (e.g.) double, and your equation uses -1.0 / ... which will always produce a fractional value that truncates to zero when stored in an int.
Also, as David pointed out, you're indexing from 1 [vs. 0] in the for loops. This could cause access violations/segfaults.
I've changed the for loops to start from 0. And, I've adjusted the equation accordingly (using a macro).
You were defining some things, like index variables or size variables, as double instead of int (e.g. ndivx).
Also, I introduced a typedef for the coordinate values.
Here's some cleaned-up code that may help get you further:
#include <stdio.h>
#include <stdlib.h>

#if 0
double ndivx, ndivy, ndivz, nt, r, box, dx, totnode;
#endif

#if 0
typedef int coord_t;
#else
typedef double coord_t;
#endif

#define SETCOORD(_xidx, _var) \
    do { \
        coords[numm][_xidx] = -1.0 / 2.0 * r + (dx / 2.0) + (_var * dx); \
        printf("coords[%d][%d]=%g\n", numm, _xidx, (double) coords[numm][_xidx]); \
    } while (0)

int
main(void)
{
#if 1
    int ndivx;
    int ndivy;
    int ndivz;
    double r;
    double dx;
#endif

    ndivx = 19;
    ndivy = 19;
    ndivz = 19;
    int totnode = ndivx * ndivy * ndivz;

    r = 0.005;  // diameter of sphere
    dx = 0.0025 / ndivx;
#if 0
    double dx = r / ndivx;  // distance between points
#else
    dx = r / ndivx;  // distance between points
#endif

    int cols = 3;
#if 0
    int **coords;
#else
    coord_t **coords;
#endif
    coords = malloc(totnode * sizeof(coord_t *));
    for (int i = 0; i < totnode; i++) {
        coords[i] = malloc(cols * sizeof(coord_t));
    }

    // int* coord = (int*)malloc(totnode * cols * sizeof(int));
    // int offset = i * cols + j;
    // now mat[offset] corresponds to m(i, j)

    // create a cube of equidistant points
    int numm = 0;
    for (int i = 0; i < ndivx; i++) {
        for (int j = 0; j < ndivy; j++) {
            for (int k = 0; k < ndivz; k++) {
                SETCOORD(0, i);
                SETCOORD(1, j);
                SETCOORD(2, k);
                numm = numm + 1;
            }
        }
    }

    return 0;
}
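Separately, the single-allocation layout from the question's commented-out code also works once the element type is coord_t. A minimal sketch (the COORD macro name is mine), following the question's offset = i * cols + j comment:

#include <stdio.h>
#include <stdlib.h>

typedef double coord_t;

// m(i, j) maps to mat[i * cols + j], as in the question's offset comment
#define COORD(mat, i, j, cols) ((mat)[(size_t)(i) * (cols) + (j)])

int main(void) {
    int totnode = 19 * 19 * 19;
    int cols = 3;

    coord_t *coords = malloc((size_t)totnode * cols * sizeof(*coords));
    if (coords == NULL)
        return 1;

    COORD(coords, 0, 0, cols) = -0.0025;   // example write to m(0, 0)
    printf("%g\n", COORD(coords, 0, 0, cols));

    free(coords);                          // one allocation, one free
    return 0;
}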
I'm trying to optimize an n-body algorithm, and when I add #pragma acc kernels to the loop, the compiler emits some messages I don't understand.
#pragma acc kernels
for (i = 0; i < n; i++)
{
    real fx, fy, fz;
    fx = fy = fz = 0;
    real iPosx = in[i].x;
    real iPosy = in[i].y;
    real iPosz = in[i].z;
    for (j = 0; j < n; j++)
    {
        real rx, ry, rz;
        rx = in[j].x - iPosx;
        ry = in[j].y - iPosy;
        rz = in[j].z - iPosz;
        real distSqr = rx*rx + ry*ry + rz*rz;
        distSqr += SOFTENING_SQUARED;
        real s = in[j].w / POW(distSqr, 1.5);
        real3 ff;
        ff.x = rx * s;
        ff.y = ry * s;
        ff.z = rz * s;
        fx += ff.x;
        fy += ff.y;
        fz += ff.z;
    }
    force[i].x = fx;
    force[i].y = fy;
    force[i].z = fz;
}
What do the messages
"generating implicit reduction(+:fx)"
"generating implicit reduction(+:fy)"
"generating implicit reduction(+:fz)"
mean?
Thank you.
In order to parallelize the inner "j" loop, the three variables fx, fy, and fz must be in a sum reduction. The compiler has automatically detected this and has therefore implicitly added the reduction for you. It's the same as if you had declared it explicitly, for example:
#pragma acc loop reduction(+:fx,fy,fz)
for (j = 0; j < n; j++)
{
    real rx, ry, rz;
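For reference, here is a sketch (mine, based on the question's loop) of the same nest with the reductions written out explicitly using acc parallel loop; it is equivalent to what the compiler generated implicitly:

// real, real3, in, force, n, SOFTENING_SQUARED and POW are as in the question
#pragma acc parallel loop
for (i = 0; i < n; i++)
{
    real fx = 0, fy = 0, fz = 0;
    real iPosx = in[i].x;
    real iPosy = in[i].y;
    real iPosz = in[i].z;

    // the inner loop is parallelized as a sum reduction over fx, fy, fz
    #pragma acc loop reduction(+:fx,fy,fz)
    for (j = 0; j < n; j++)
    {
        real rx = in[j].x - iPosx;
        real ry = in[j].y - iPosy;
        real rz = in[j].z - iPosz;
        real distSqr = rx*rx + ry*ry + rz*rz + SOFTENING_SQUARED;
        real s = in[j].w / POW(distSqr, 1.5);
        fx += rx * s;
        fy += ry * s;
        fz += rz * s;
    }

    force[i].x = fx;
    force[i].y = fy;
    force[i].z = fz;
}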
I have coded a 1-dimensional CFD problem, but my numerical solution is coming out the same as the analytical solution (up to 6 decimal places).
I am using the TDMA method for the numerical solution, and for the analytical solution I am directly substituting the x values into the function T(x).
The analytical solution is T(x) = -(x^2)/2 + (11/21)x.
E.g. with 4 grid points:
x0 = 0.000000, x1 = 0.333333, x2 = 0.666666, x3 = 0.999999.
T(x0) = 0.000000, T(x1) = 0.119048, T(x2) = 0.126984, T(x3) = 0.023810.
For the numerical solution I have used the TDMA technique; please refer to the code below.
Enter n = 4 to reproduce the results.
#include <stdio.h>

void temp_matrix(int n, double *a, double *b, double *c, double *d, double *T);

int main() {
    int Bi = 20.0;
    int n;
    printf("%s ", "Enter the Number of total Grid Points");
    scanf("%d", &n);
    float t = (n - 1);
    double dx = 1.0 / t;
    int i;
    printf("\n");

    double q; // analytical solution below
    double z[n];
    for (i = 0; i <= n - 1; i++) {
        q = (dx) * i;
        z[i] = -(q * q) / 2 + q * (11.0 / 21);
        printf("\nT analytical %lf ", z[i]);
    }

    double b[n - 1];
    b[n - 2] = -2.0 * Bi * dx - 2.0;
    for (i = 0; i <= n - 3; i++) {
        b[i] = -2.0;
    }

    double a[n - 1];
    a[n - 2] = 2.0;
    a[0] = 0;
    for (i = 1; i < n - 2; i++) {
        a[i] = 1.0;
    }

    double c[n - 1];
    for (i = 0; i <= n - 2; i++) {
        c[i] = 1.0;
    }

    double d[n - 1];
    for (i = 0; i <= n - 2; i++) {
        d[i] = -(dx * dx);
    }

    double T[n];
    temp_matrix(n, a, b, c, d, T);
    return 0;
}

void temp_matrix(int n, double *a, double *b, double *c, double *d, double *T) {
    int i;
    double beta[n - 1];
    double gama[n - 1];
    beta[0] = b[0];
    gama[0] = d[0] / beta[0];
    for (i = 1; i <= n - 2; i++) {
        beta[i] = b[i] - a[i] * (c[i - 1] / beta[i - 1]);
        gama[i] = (d[i] - a[i] * gama[i - 1]) / beta[i];
    }

    int loop;
    for (loop = 0; loop < n - 1; loop++)
        T[0] = 0;

    T[n - 1] = gama[n - 2];
    for (i = n - 2; i >= 1; i--) {
        T[i] = gama[i - 1] - (c[i - 1] * (T[i + 1])) / beta[i - 1];
    }

    printf("\n");
    for (i = 0; i < n; i++) {
        printf("\nT numerical %lf", T[i]);
    }
}
Why is the numerical solution coming out the same as the analytical solution in C?
They differ, by about 3 bits.
Print with enough precision to see the difference.
Using the code below, we see a difference in the last hex digits of the significand of T[3]: ...620 vs ...619. This is only about 1 part in 10^15.
#include <float.h>
printf("T analytical %.*e\t%a\n", DBL_DECIMAL_DIG - 1, z[i], z[i]);
printf("T numerical %.*e\t%a\n", DBL_DECIMAL_DIG - 1, T[i], T[i]);
C allows double math to be performed with long double precision when FLT_EVAL_METHOD == 2, in which case the analytical and numerical results can come out identical. Your results may differ from mine due to that, as well as other subtle FP nuances.
printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
Output
T analytical 0.0000000000000000e+00 0x0p+0
T analytical 1.1904761904761907e-01 0x1.e79e79e79e7ap-4
T analytical 1.2698412698412700e-01 0x1.0410410410411p-3
T analytical 2.3809523809523836e-02 0x1.861861861862p-6
T numerical 0.0000000000000000e+00 0x0p+0
T numerical 1.1904761904761904e-01 0x1.e79e79e79e79ep-4
T numerical 1.2698412698412698e-01 0x1.041041041041p-3
T numerical 2.3809523809523812e-02 0x1.8618618618619p-6
FLT_EVAL_METHOD 0
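To see how small that gap is, here is a minimal standalone sketch (mine; the hex-float constants are the two T[3] values from the output above) that prints the relative difference:

#include <stdio.h>
#include <math.h>

int main(void) {
    double analytical = 0x1.861861861862p-6;   // T analytical, T[3]
    double numerical  = 0x1.8618618618619p-6;  // T numerical,  T[3]

    printf("analytical:    %a\n", analytical);
    printf("numerical:     %a\n", numerical);
    // prints a tiny relative difference (order 1e-16)
    printf("relative diff: %e\n", fabs(analytical - numerical) / fabs(analytical));
    return 0;
}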
I was doing a C assignment for parallel computing, where I have to implement a Monte Carlo simulation with an efficient thread-safe normal random generator using the Box-Muller transform. I generate 2 vectors of uniform random numbers, X and Y, with the condition that X is in (0,1] and Y is in [0,1].
But I'm not sure that my way of sampling uniform random numbers from the half-open interval (0,1] is right.
Has anyone encountered something similar?
I'm using the following code:
double* StandardNormalRandom(long int N) {
    double *X = NULL, *Y = NULL, *U = NULL;
    X = vUniformRandom_0(N / 2);
    Y = vUniformRandom(N / 2);
    #pragma omp parallel for
    for (i = 0; i < N/2; i++) {
        U[2*i]     = sqrt(-2 * log(X[i])) * sin(Y[i] * 2 * pi);
        U[2*i + 1] = sqrt(-2 * log(X[i])) * cos(Y[i] * 2 * pi);
    }
    return U;
}

double* NormalRandom(long int N, double mu, double sigma2)
{
    double *U = NULL, stdev = sqrt(sigma2);
    U = StandardNormalRandom(N);
    #pragma omp parallel for
    for (int i = 0; i < N; i++) U[i] = mu + stdev * U[i];
    return U;
}
Here is the relevant bit of my UniformRandom function, also implemented in parallel:
#pragma omp parallel for firstprivate(i)
for (long int j = 0; j < N; j++)
{
    if (i == 0) {
        int tn = omp_get_thread_num();
        I[tn] = S[tn];
        i++;
    }
    else
    {
        I[j] = (a*I[j - 1] + c) % m;
    }
}
}

#pragma omp parallel for
for (long int j = 0; j < N; j++)
    U[j] = (double)I[j] / (m + 1.0);
In the StandardNormalRandom function, I will assume that the pointer U has been allocated with size N, in which case this function looks fine to me, and so does NormalRandom.
However, for the function UniformRandom (which is missing some parts, so I'll have to assume some things): if the line I[j] = (a*I[j - 1] + c) % m; is the body of a loop under #pragma omp parallel for, then you will have some issues. Since you can't know the order in which the threads execute, the current thread (at a fixed value of j) can't rely on the value of I[j - 1], as that value could be modified at any time (I is shared by default). The LCG recurrence is inherently sequential, each value depending on the previous one, so it cannot be parallelized this way.
Hope it helps!
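One possible fix (my sketch, not part of the original answer) is to give each thread its own generator state, so no thread ever reads another thread's recurrence. For example, one small LCG per thread, seeded differently per thread; note that independently seeded per-thread streams can overlap in theory, so a dedicated parallel RNG is preferable for serious Monte Carlo work:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

/* Sketch: one independent 32-bit LCG per thread (constants from Numerical
   Recipes). All names here are mine, not from the question. */
double *vUniformRandomParallel(long int N, uint64_t seed) {
    double *U = malloc(sizeof(*U) * N);
    if (U == NULL)
        return NULL;

    #pragma omp parallel
    {
        /* per-thread state: no shared recurrence, hence no data race */
        uint64_t s = seed + 0x9e3779b97f4a7c15ULL * (uint64_t)(omp_get_thread_num() + 1);
        #pragma omp for
        for (long int j = 0; j < N; j++) {
            s = (1664525ULL * s + 1013904223ULL) & 0xffffffffULL; /* mod 2^32 */
            U[j] = ((double)s + 1.0) / 4294967296.0;              /* in (0,1] */
        }
    }
    return U;
}

int main(void) {
    double *U = vUniformRandomParallel(8, 12345);
    if (U == NULL)
        return 1;
    for (int j = 0; j < 8; j++)
        printf("%f\n", U[j]);
    free(U);
    return 0;
}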