I am trying to optimize a DAXPY procedure as much as I can. However,
there is a mistake in the following code that I cannot spot:
#include <stdio.h>
#include <time.h>
void daxpy(int n, double a, double *x, double *y) {
int i;
double y0, y1, y2, y3, x0, x1, x2, x3;
// loop unrolling
for (i = 0; i < n; i += 4) {
// multiple accumulating registers
x0 = x[i];
x1 = x[i + 1];
x2 = x[i + 2];
x3 = x[i + 3];
y0 = y[i];
y1 = y[i + 1];
y2 = y[i + 2];
y3 = y[i + 3];
y0 += a * x0;
y1 += a * x1;
y2 += a * x2;
y3 += a * x3;
y[i] = y0;
y[i + 1] = y1;
y[i + 2] = y2;
y[i + 3] = y3;
}
}
int main() {
int n = 10000000;
double a = 2.0;
double x[n];
double y[n];
int i;
for (i = 0; i < n; ++i) {
x[i] = (double)i;
y[i] = (double)i;
}
clock_t start = clock();
daxpy(n, a, x, y);
clock_t end = clock();
double time_elapsed = (double)(end - start) / CLOCKS_PER_SEC;
printf("Time elapsed: %f seconds\n", time_elapsed);
return 0;
}
When executing the binary I am getting the following error message:
Segmentation fault: 11
Using LLDB I am getting:
Process 89219 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7ff7bb2b2cc0)
frame #0: 0x0000000100003e4f daxpy.x`main at daxpy.c:38:14
35 double y[n];
36 int i;
37 for (i = 0; i < n; ++i) {
-> 38 x[i] = (double)i;
39 y[i] = (double)i;
40 }
41 clock_t start = clock();
Target 0: (daxpy.x) stopped.
But I can't spot the mistake on lines 38 and 39. Any help?
Assuming sizeof(double) == 8, you are trying to allocate 160 MB on the stack. The stack is usually not that big.
You need to allocate the arrays from the heap:
size_t n = 10000000;
double a = 2.0;
double *x = malloc(sizeof(*x) * n);
double *y = malloc(sizeof(*y) * n);
Also consider the loop for (i = 0; i < n; i += 4), where n is the array size.
What happens if n is not a multiple of 4 when you execute
y1 = y[i + 1];
y2 = y[i + 2];
y3 = y[i + 3];
You may access the arrays out of bounds.
I have the following code from Numerical Recipes in C, which calculates the incomplete beta function using a continued fraction evaluated with Lentz's method.
float betacf(float m1, float m2, float theta){
void nrerror(char error_text[]);
int k, k2, MAXIT;
float aa, c, d, del, t, qab, qam, qap;
qab = m1 + m2;
qap = m1 + 1.0;
qam = m1 - 1.0;
c = 1.0;
d = 1.0 - (qab * theta)/qap;
if (fabs(d) < FPMIN) d = FPMIN;
d = 1.0/d;
t = d;
for (k = 1; k <= MAXIT; k++) {
k2 = 2 * k;
aa = k * (m2 - k) * theta/((qam + k2) * (m1 + k2));
d = 1.0 + aa * d;
if (fabs(d) < FPMIN) d = FPMIN;
c = 1.0 + aa/c;
if (fabs(c) < FPMIN) c = FPMIN;
d = 1.0/d;
t *= d * c;
aa = -(m1 + k) * (qab + k) * theta/((m1 + k2) * (qap + k2));
d = 1.0 + aa * d;
if (fabs(d) < FPMIN) d = FPMIN;
c = 1.0 + aa/c;
if (fabs(c) < FPMIN) c=FPMIN;
d = 1.0/d;
del = d * c;
t *= del;
if (fabs(del - 1.0) < EPS) break;
}
if (k > MAXIT) nrerror("m1 or m2 too big, or MAXIT too small in betacf");
return t;
}
/* Returns the incomplete beta function Ix(a, b) */
float betai(float m1, float m2, float theta){
void nrerror(char error_text[]);
float bt;
if (theta < 0.0 || theta > 1.0){
nrerror("Bad x in routine betai");
}
if (theta == 0.0 || theta == 1.0){
bt = 0.0;
}
else {
bt = exp(gammaln(m1+m2)-gammaln(m1)-gammaln(m2)+m1*log(theta)+m2*log(1.0-theta));
}
if (theta < (m1 + 1.0)/(m1 + m2 + 2.0))
{
return (bt * betacf(m1, m2, theta)/m1);
}
else {
return (1.0 - bt * betacf(m2, m1, 1.0 - theta)/m2);
}
}
Then I wrote a main function that takes theta as input and returns a value of the incomplete beta function.
Now I need to obtain a distribution for theta in [0, 1]. Is there a way to do this without changing anything in this code? I mean just adding a for loop over theta in my main function and collecting the output of the incomplete beta function. I tried doing this, but it throws the error "Incompatible types, expected 'double' but argument is of type 'double *'". I understand the error arises because I am working with an array while my function is defined for a single value. Is there a workaround where I don't have to declare theta as an array in my function?
Failing main function
int main() {
float *theta, *result;
.....
.....
printf("Enter number of points required to describe the PDF profile:", N);
scanf("%d", &N);
theta = (float *)malloc(N*sizeof(float));
for (j = 1; j < N; j++)
theta[j] = (float)(j)/ ((float)(N) - 1.0);
result[j] = betai(m1, m2, theta);
printf("%f %f", theta[j], result[j]);
}
}
Thank you
I have made an 8x8x8 LED cube and I am writing animations for it. I am trying to write a line drawing function based on the Bresenham line algorithm. I found some code for 3D line drawing at
https://www.geeksforgeeks.org/bresenhams-algorithm-for-3-d-line-drawing/
The code was in Python, which I have more experience with. I did my best to port it to my Arduino code (running on an Adafruit ItsyBitsy M4). It only draws the first pixel of the line and then gets stuck in an infinite loop. I did some testing and found that it isn't crashing the program; it's a bug. My investigation also revealed that the variable "xs" is zero when it should be set to either 1 or -1. I believe that if that gets fixed it should work. I just don't know how to fix it, nor why it is ignoring my extremely obvious assignment of a value to "xs".
The function "setvoxel" is how the cube's pixels are turned on. It takes the x, y, and z coordinates, with z being height.
void drawline(int x1, int y1, int z1, int x2, int y2, int z2)
{
setvoxel(x1, y1, z1);
int dx = abs(x2 - x1);
int dy = abs(y2 - y1);
int dz = abs(z2 - z1);
int xs;
int ys;
int zs;
if (x2 > x1) //=========================
{
int xs = 1;
} //troublesome code
else
{
int xs = -1;
} //=========================
if (y2 > y1) //variables ys and zs might have the same problem
{
int ys = 1;
}
else
{
int ys = -1;
}
if (z2 > z1)
{
int zs = 1;
}
else
{
int zs = -1;
}
// Driving axis is the x-axis
if (dx >= dy && dx >= dz)
{
int p1 = 2 * dy - dx;
int p2 = 2 * dz - dx;
while (x1 != x2)
{
x1 += xs; // if x1 here doesn't increment from xs the program gets stuck
if (p1 >= 0)
{
y1 += ys;
p1 -= 2 * dx;
}
if (p2 >= 0)
{
z1 += zs;
p2 -= 2 * dx;
}
p1 += 2 * dy;
p2 += 2 * dz;
setvoxel(x1, y1, z1);
}
}
// Driving axis is the y-axis
else if (dy >= dx && dy >= dz)
{
int p1 = 2 * dx - dy;
int p2 = 2 * dz - dy;
while (y1 != y2)
{
y1 += ys;
if (p1 >= 0)
{
x1 += xs;
p1 -= 2 * dy;
}
if (p2 >= 0)
{
z1 += zs;
p2 -= 2 * dy;
}
p1 += 2 * dx;
p2 += 2 * dz;
setvoxel(x1, y1, z1);
}
}
// Driving axis is the z-axis
else
{
int p1 = 2 * dy - dz;
int p2 = 2 * dx - dz;
while (z1 != z2)
{
z1 += zs;
if (p1 >= 0)
{
y1 += ys;
p1 -= 2 * dz;
}
if (p2 >= 0)
{
x1 += xs;
p2 -= 2 * dz;
}
p1 += 2 * dy;
p2 += 2 * dx;
setvoxel(x1, y1, z1);
}
}
}
I managed to fix it by initializing the variables to 0 and having the if/else statements just increment or decrement the value.
I have coded a one-dimensional CFD problem, but my numerical solution comes out the same as the analytical solution (up to 6 decimal places).
I am using the TDMA method for the numerical solution, and for the analytical solution I am directly substituting the x value into the function T(x).
The analytical solution is T(x) = -(x^2)/2 + (11/21)x.
E.g., with 4 grid points:
x0 = 0.000000, x1 = 0.333333, x2 = 0.666666, x3 = 0.999999.
T(x0) = 0.000000, T(x1) = 0.119048, T(x2) = 0.126984, T(x3) = 0.023810.
For the numerical solution I have used the TDMA technique; please refer to the code below.
Enter n = 4 to reproduce the results.
#include<stdio.h>
void temp_matrix(int n, double *a, double *b, double *c, double *d, double *T);
int main() {
int Bi = 20.0;
int n;
printf("%s ", "Enter the Number of total Grid Points");
scanf("%d", &n);
float t = (n - 1);
double dx = 1.0 / t;
int i;
printf("\n");
double q; // analytical solution below
double z[n];
for (i = 0; i <= n - 1; i++) {
q = (dx) * i;
z[i] = -(q * q) / 2 + q * (11.0 / 21);
printf("\nT analytical %lf ", z[i]);
}
double b[n - 1];
b[n - 2] = -2.0 * Bi * dx - 2.0;
for (i = 0; i <= n - 3; i++) {
b[i] = -2.0;
}
double a[n - 1];
a[n - 2] = 2.0;
a[0] = 0;
for (i = 1; i < n - 2; i++) {
a[i] = 1.0;
}
double c[n - 1];
for (i = 0; i <= n - 2; i++) {
c[i] = 1.0;
}
double d[n - 1];
for (i = 0; i <= n - 2; i++) {
d[i] = -(dx * dx);
}
double T[n];
temp_matrix(n, a, b, c, d, T);
return 0;
}
void temp_matrix(int n, double *a, double *b, double *c, double *d, double *T) {
int i;
double beta[n - 1];
double gama[n - 1];
beta[0] = b[0];
gama[0] = d[0] / beta[0];
for (i = 1; i <= n - 2; i++) {
beta[i] = b[i] - a[i] * (c[i - 1] / beta[i - 1]);
gama[i] = (d[i] - a[i] * gama[i - 1]) / beta[i];
}
int loop;
for (loop = 0; loop < n - 1; loop++)
for (loop = 0; loop < n - 1; loop++)
T[0] = 0;
T[n - 1] = gama[n - 2];
for (i = n - 2; i >= 1; i--) {
T[i] = gama[i - 1] - (c[i - 1] * (T[i + 1])) / beta[i - 1];
}
printf("\n");
for (i = 0; i < n; i++) {
printf("\nT numerical %lf", T[i]);
}
}
Why is the numerical solution coming out the same as the analytical solution in C?
They differ, by about 3 bits.
Print with enough precision to see the difference.
Using the code below, we see a difference in the last hex digit of the significand of T[3]: ...620 vs ...619. This is a difference of only about 1 part in 10^15.
#include<float.h>
printf("T analytical %.*e\t%a\n", DBL_DECIMAL_DIG - 1, z[i], z[i]);
printf("T numerical %.*e\t%a\n", DBL_DECIMAL_DIG - 1, T[i], T[i]);
C allows double arithmetic to be performed at long double precision when FLT_EVAL_METHOD == 2, in which case the analytical and numerical results may come out the same. Your results may differ from mine because of that, as well as other subtle floating-point nuances.
printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
Output
T analytical 0.0000000000000000e+00 0x0p+0
T analytical 1.1904761904761907e-01 0x1.e79e79e79e7ap-4
T analytical 1.2698412698412700e-01 0x1.0410410410411p-3
T analytical 2.3809523809523836e-02 0x1.861861861862p-6
T numerical 0.0000000000000000e+00 0x0p+0
T numerical 1.1904761904761904e-01 0x1.e79e79e79e79ep-4
T numerical 1.2698412698412698e-01 0x1.041041041041p-3
T numerical 2.3809523809523812e-02 0x1.8618618618619p-6
FLT_EVAL_METHOD 0
This is a function from the LAME encoder. When I parallelize it with #pragma omp for, the result is a core dump. How should I parallelize this function?
I think the pointers are the problem with OpenMP: the threads increment the wrong memory addresses.
static void quantize_lines_xrpow(unsigned int l, FLOAT istep, const FLOAT * xp, int *pi)
{
fi_union *fi;
unsigned int remaining;
int i;
assert(l > 0);
fi = (fi_union *) pi;
l = l >> 1;
remaining = l % 2;
l = l >> 1;
double x0,x1,x2,x3;
#pragma omp parallel for private(i)
for(i=l;i>0;i--){//while (l--) {
x0 = istep * xp[0];
x1 = istep * xp[1];
x2 = istep * xp[2];
x3 = istep * xp[3];
x0 += MAGIC_FLOAT;
fi[0].f = x0;
x1 += MAGIC_FLOAT;
fi[1].f = x1;
x2 += MAGIC_FLOAT;
fi[2].f = x2;
x3 += MAGIC_FLOAT;
fi[3].f = x3;
fi[0].f = x0 + adj43asm[fi[0].i - MAGIC_INT];
fi[1].f = x1 + adj43asm[fi[1].i - MAGIC_INT];
fi[2].f = x2 + adj43asm[fi[2].i - MAGIC_INT];
fi[3].f = x3 + adj43asm[fi[3].i - MAGIC_INT];
fi[0].i -= MAGIC_INT;
fi[1].i -= MAGIC_INT;
fi[2].i -= MAGIC_INT;
fi[3].i -= MAGIC_INT;
fi += 4;
xp += 4;
};
if (remaining) {
double x0 = istep * xp[0];
double x1 = istep * xp[1];
x0 += MAGIC_FLOAT;
fi[0].f = x0;
x1 += MAGIC_FLOAT;
fi[1].f = x1;
fi[0].f = x0 + adj43asm[fi[0].i - MAGIC_INT];
fi[1].f = x1 + adj43asm[fi[1].i - MAGIC_INT];
fi[0].i -= MAGIC_INT;
fi[1].i -= MAGIC_INT;
}
}
You have many variables shared among threads (x0, x1, x2, x3, fi, xp), which results in a lot of data races. This includes the pointers (arrays) fi and xp. The threads increment these nondeterministically, so any dereference of them is an error.
I am trying to speed up the execution of the following code with OpenMP. The code calculates a Mandelbrot set and outputs it to a canvas.
The code works fine single-threaded, but I want to use OpenMP to make it faster. I have tried all sorts of combinations of private and shared variables, but nothing seems to work so far. The code always runs a little slower with OpenMP than without it (50 000 iterations: 2 s slower).
I am using Ubuntu 16.04 and compiling with GCC.
void calculate_mandelbrot(GLubyte *canvas, GLubyte *color_buffer, uint32_t w, uint32_t h, mandelbrot_f x0, mandelbrot_f x1, mandelbrot_f y0, mandelbrot_f y1, uint32_t max_iter) {
mandelbrot_f dx = (x1 - x0) / w;
mandelbrot_f dy = (y1 - y0) / h;
uint16_t esc_time;
int i, j;
mandelbrot_f x, y;
//timer start
clock_t begin = clock();
#pragma omp parallel for private(i,j,x,y, esc_time) shared(canvas, color_buffer)
for(i = 0; i < w; ++i) {
x = x0 + i * dx;
for(j = 0; j < h; ++j) {
y = y1 - j * dy;
esc_time = escape_time(x, y, max_iter);
canvas[ GET_R(i, j, w) ] = color_buffer[esc_time * 3];
canvas[ GET_G(i, j, w) ] = color_buffer[esc_time * 3 + 1];
canvas[ GET_B(i, j, w) ] = color_buffer[esc_time * 3 + 2];
}
}
//time calculation
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("%f\n",time_spent );
}
escape_time function which the code uses:
inline uint16_t escape_time(mandelbrot_f x0, mandelbrot_f y0, uint32_t max_iter) {
mandelbrot_f x = 0.0;
mandelbrot_f y = 0.0;
mandelbrot_f xtemp;
uint16_t iteration = 0;
while((x*x + y*y < 4) && (iteration < max_iter)) {
xtemp = x*x - y*y + x0;
y = 2*x*y + y0;
x = xtemp;
iteration++;
}
return iteration;
}
The code is from this repository https://github.com/hortont424/mandelbrot
First, as hinted in the comments, use omp_get_wtime() instead of clock() to measure the time (clock() gives you the number of clock ticks accumulated across all threads). Second, if I recall correctly, this algorithm has load-balancing problems, so try dynamic scheduling:
//timer start
double begin = omp_get_wtime();
#pragma omp parallel for private(j,x,y, esc_time) schedule(dynamic, 1)
for(i = 0; i < w; ++i) {
x = x0 + i * dx;
for(j = 0; j < h; ++j) {
y = y1 - j * dy;
esc_time = escape_time(x, y, max_iter);
canvas[ GET_R(i, j, w) ] = color_buffer[esc_time * 3];
canvas[ GET_G(i, j, w) ] = color_buffer[esc_time * 3 + 1];
canvas[ GET_B(i, j, w) ] = color_buffer[esc_time * 3 + 2];
}
}
//time calculation
double end = omp_get_wtime();
double time_spent = end - begin; // omp_get_wtime() already returns seconds
printf("%f\n",time_spent );
As was suggested, my problem was caused by using the clock() function, which measures CPU time.
Using omp_get_wtime() instead solved my problem.