Do-while doesn't work inside a CUDA kernel - c

OK, I'm pretty new to CUDA, and I'm kind of lost, really lost.
I'm trying to calculate pi using the Monte Carlo method, and at the end I just get one add instead of 50.
I don't want to use a "do while" on the host to call the kernel repeatedly, since that's too slow. My issue is that my code doesn't loop; it executes only once in the kernel.
Also, I'd like all the threads to access the same niter and pi, so that when some thread hits the counter limit all the others stop.
#define SEED 35791246
__shared__ int niter;
__shared__ double pi;
__global__ void calcularPi(){
double x;
double y;
int count;
double z;
count = 0;
niter = 0;
//keep looping
do{
niter = niter + 1;
//Generate random number
curandState state;
curand_init(SEED,(int)niter, 0, &state);
x = curand(&state);
y = curand(&state);
z = x*x+y*y;
if (z<=1) count++;
pi =(double)count/niter*4;
}while(niter < 50);
}
int main(void){
float tempoTotal;
//Start timer
clock_t t;
t = clock();
//call kernel
calcularPi<<<1,32>>>();
//wait while kernel finish
typeof(pi) piFinal;
cudaMemcpyFromSymbol(&piFinal, "pi", sizeof(piFinal),0, cudaMemcpyDeviceToHost);
typeof(niter) niterFinal;
cudaMemcpyFromSymbol(&niterFinal, "niter", sizeof(niterFinal),0, cudaMemcpyDeviceToHost);
//Ends timer
t = clock() - t;
tempoTotal = ((double)t)/CLOCKS_PER_SEC;
printf("Pi: %g \n", piFinal);
printf("Adds: %d \n", niterFinal);
printf("Total time: %f \n", tempoTotal);
}

There are a variety of issues with your code.
I suggest using proper CUDA error checking and running your code with cuda-memcheck to spot any runtime errors. I've omitted proper error checking in my code below for brevity of presentation, but I've run it with cuda-memcheck, which reports no runtime errors.
Your usage of curand() is probably not correct (it returns integers over a large range). For this code to work correctly, you want a floating-point quantity between 0 and 1. The correct call for that is curand_uniform().
Since you want all threads to work on the same values, you must prevent those threads from stepping on each other. One way to do that is to use atomic updates of the variables in question.
It should not be necessary to re-run curand_init on each iteration. Once per thread should be sufficient.
We don't use cudaMemcpy..Symbol operations on __shared__ variables. For convenience, and to preserve something that resembles your original code, I've elected to convert those to __device__ variables.
Here's a modified version of your code that has most of the above issues fixed:
$ cat
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <time.h>
#define ITER_MAX 5000
#define SEED 35791246
__device__ int niter;
__device__ int count;
__global__ void calcularPi(){
    double x;
    double y;
    double z;
    int lcount;
    curandState state;
    //initialize the generator once per thread, one sequence per thread
    curand_init(SEED,threadIdx.x, 0, &state);
    //keep looping
    do{
        //atomically claim the next iteration number
        lcount = atomicAdd(&niter, 1);
        //Generate random number
        x = curand_uniform(&state);
        y = curand_uniform(&state);
        z = x*x+y*y;
        if (z<=1) atomicAdd(&count, 1);
    }while(lcount < ITER_MAX);
}
int main(void){
    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();
    int count_final = 0;
    int niter_final = 0;
    //zero the device counters before the launch
    cudaMemcpyToSymbol(niter, &niter_final, sizeof(int));
    cudaMemcpyToSymbol(count, &count_final, sizeof(int));
    //call kernel
    calcularPi<<<1,32>>>();
    //wait while kernel finish
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(&count_final, count, sizeof(int));
    cudaMemcpyFromSymbol(&niter_final, niter, sizeof(int));
    //Ends timer
    double pi = count_final/(double)niter_final*4;
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", pi);
    printf("Adds: %d \n", niter_final);
    printf("Total time: %f \n", tempoTotal);
    return 0;
}
$ nvcc -o t978 -lcurand
$ cuda-memcheck ./t978
Pi: 3.12083
Adds: 5032
Total time: 0.558463
========= ERROR SUMMARY: 0 errors
I've modified the iterations to a larger number, but you can use 50 if you want for ITER_MAX.
Note that there are many criticisms that could be levelled against this code. My aim here, since it's clearly a learning exercise, is to point out what the minimum number of changes could be to get a functional code, using the algorithm you've outlined. As just one example, you might want to change your kernel launch config (<<<1,32>>>) to other, larger numbers, in order to more fully utilize the GPU.
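For comparison, here is a minimal host-only sketch of the same estimate in plain C, which may help when checking the GPU result. It is not part of the code above: `estimate_pi` and `uniform01` are hypothetical names, and `rand()` scaled into [0, 1) stands in for `curand_uniform()`. It shows why the samples have to lie in the unit square: with raw integers over a large range, `x*x+y*y <= 1` would almost never be true.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform double in [0, 1), built from rand().
   Hypothetical helper standing in for curand_uniform(). */
static double uniform01(void)
{
    return rand() / ((double)RAND_MAX + 1.0);
}

/* Monte Carlo estimate of pi: the fraction of random points in the unit
   square that fall inside the quarter circle, multiplied by 4. */
double estimate_pi(long samples, unsigned seed)
{
    long inside = 0;

    assert(samples > 0);
    srand(seed);
    for (long i = 0; i < samples; ++i) {
        double x = uniform01();
        double y = uniform01();
        if (x * x + y * y <= 1.0)
            ++inside;
    }
    return 4.0 * (double)inside / (double)samples;
}
```

Calling `estimate_pi(1000000, 35791246u)` from a `main` should give a value close to 3.14; the exact digits depend on the platform's `rand()`.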


Not quite understanding MPI

I am attempting to write a program that uses MPI to find the value of pi.
Currently I can find the sum this way:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NUMSTEPS 1000000
int main() {
int i;
double x, pi, sum = 0.0;
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
double step = 1.0/(double) NUMSTEPS;
x = 0.5 * step;
for (i=0;i<= NUMSTEPS; i++){
sum += 4.0/(1.0+x*x);
pi = step * sum;
clock_gettime(CLOCK_MONOTONIC, &end);
u_int64_t diff = 1000000000L * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
printf("PI is %.20f\n",pi);
printf("elapsed time = %llu nanoseconds\n", (long long unsigned int) diff);
return 0;
But this does not use MPI.
So I have tried to make my own in MPI. My logic is:
Split the 1000000 into equal parts based on how many processors I have
Calculate the values for each range
Send the calculated value back to the master and then divide by the number of processors. I would like to keep the main thread free and not do any work. Similar to a master-slave system.
Here's what I have currently. This doesn't seem to be working and the send/receive gives errors about incompatible variables for receive and send.
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#define NUMSTEPS 1000000
int main(int argc, char** argv) {
int comm_sz; //number of processes
int my_rank; //my process rank
// Initialize the MPI environment
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
// Slaves
if (my_rank != 0) {
// Process math then send
int i;
double x, pi, sum = 0.0;
double step = 1.0/(double) NUMSTEPS;
x = 0.5 * step;
// Find the start and end for the number
int processors = comm_sz - 1;
int thread_multi = NUMSTEPS / processors;
int start = my_rank * thread_multi;
if((my_rank - 1) != 0){
start += 1;
int end = start + thread_multi ;
for (i=start; i <= end; i++){
sum += 4.0 / (1.0 + x * x);
pi = step * sum;
MPI_Send(pi, 1.0, MPI_DOUBLE 1, 0, MPI_COMM_WORLD);
// Master
} else {
// Things in here only get called once.
double pi = 0.0;
double total = 0.0;
for (int q = 1; q < comm_sz; q++) {
total += pi;
pi = 0.0;
// Take the added totals and divide by amount of processors that processed, to get the average
double finished = total / (comm_sz - 1);
// Print sum here
printf("Pi Is: %d", finished);
// Finalize the MPI environment.
I've currently spent around 3 hours working on this. Never used MPI. Any help would be greatly appreciated.
Try compiling with more compiler warnings enabled and fix what they report; for instance, -Wall -Wextra should give you excellent clues about what the issues are.
According to the MPI_Send documentation the first argument is a pointer, so you seem to be ignoring a conversion-to-pointer diagnostic. You have the same issue in the MPI_Recv() call.
Try passing &pi instead of pi to MPI_Recv and MPI_Send and check whether that fixes the error.
As a side note, you can declare scratch variables such as pi as local variables inside the master loop to avoid side effects:
for (int q = 1; q < comm_sz; q++) {
    double pi = 0;
    /* receive worker q's result into pi here, passing &pi to MPI_Recv */
    total += pi;
}
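Independently of the send/receive errors, the partitioning described in the question can be checked without MPI. The sketch below is a serial stand-in under stated assumptions: `pi_by_workers` and `partial_sum` are hypothetical names, and the loop over `w` plays the role of the worker ranks. It recomputes x from the step index for every strip, and it adds the partial sums together (rather than averaging them), because each worker only covers its own slice of the integral.

```c
#include <assert.h>

/* Partial midpoint-rule sum of 4/(1+x^2) over steps [first, last). */
static double partial_sum(long first, long last, double step)
{
    double sum = 0.0;
    for (long i = first; i < last; ++i) {
        double x = (i + 0.5) * step;   /* midpoint of the i-th strip */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum;
}

/* Split numsteps strips across `workers` ranges and combine the results.
   In an MPI program each range would be handled by one worker rank and
   the additions below would happen on the master after each receive. */
double pi_by_workers(long numsteps, int workers)
{
    assert(numsteps > 0 && workers > 0);

    double step = 1.0 / (double)numsteps;
    long chunk = numsteps / workers;
    double total = 0.0;

    for (int w = 0; w < workers; ++w) {
        long first = w * chunk;
        /* the last worker also takes the remainder of the division */
        long last = (w == workers - 1) ? numsteps : first + chunk;
        total += partial_sum(first, last, step);
    }
    return step * total;
}
```

With `NUMSTEPS` set to 1000000 the result should agree with pi to many digits whether the worker count divides the step count evenly or not.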

Why is this not working as a disk speed test?

I wrote a very simple program in C to test the read/write speed of the storage device where the program is located. It could be an SSD, an HDD or a USB stick. But I get very inconsistent results, which is weird because the program is very simple and straightforward.
When I run it on a USB 3.0 stick it gives values like 270 MB/s [write] and 2100 MB/s [read].
For an HDD it gives similar values.
And for the SSD, it gives similar read speeds, and around 300 MB/s write speed.
This is weird, because there isn't anything complicated in the code, and I am not optimizing it either. The reported speeds don't match the normal speeds of these devices. Then again, it could be that I don't really understand how this works.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
const unsigned int N = 25000000; /// Number of floats to be written
int main(){
double time0, time1, time2;
unsigned int i, size = N*sizeof(float); /// Size of the file to be written/read in bytes
FILE *pfin, *pfout;
float *array_write, *array_read, sum, delta_write, delta_read;
array_write = (float *) malloc(N*sizeof(float)); /// Array to be written to a file
array_read = (float *) malloc(N*sizeof(float)); /// Array to be read
for(i = 0; i < N; i++)
array_write[i] = i*1.f/N; /// Filling in array with some values
time0 = omp_get_wtime();
pfout = fopen("test.dat", "wb");
fwrite(array_write, N*sizeof(float), 1, pfout);
time1 = omp_get_wtime();
pfin = fopen("test.dat", "rb");
fread(array_read, N*sizeof(float), 1, pfin);
time2 = omp_get_wtime();
sum = 0.f;
for(i = 0; i < N; i++)
sum += fabsf(array_read[i] - array_write[i]); /// Simple test to check whether it read properly or not
delta_write = time1 - time0;
delta_read = time2 - time1;
printf("delta1 = %f, delta2 = %f, size = %f Gb, diff = %f\n", delta_write, delta_read, size/1000000000.f, sum);
printf("Speed: \n Write: %f [Mb/s]\n Read: %f [Mb/s]\n", size/1000000.f/delta_write, size/1000000.f/delta_read);
return 0;
}
//// compile with gcc program.c -lgomp -lm -O0 -o program.x
Be aware that it creates a 100 MB file.

Why does OpenMP speed up a SINGLE-ITERATION loop?

I'm using the "read" benchmark from Why is writing to memory much slower than reading it?, and I added just two lines:
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
They should have no effect, because OpenMP should only parallelize the outer loop, but the code now consistently runs twice as fast.
Update: These lines aren't even necessary. Simply adding
omp_get_num_threads();
(implicitly declared) in the same place has the same effect.
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;
    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}
int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;
    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */
    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));
    c0 = clock();
    #pragma omp parallel for
    for(unsigned dummy = 0; dummy < 1; ++dummy)
    for(i = 0; i < r; ++i) {
        p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
        printf("%4ld/%4ld\r", i, r);
    }
    c1 = clock();
    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
    return 0;
}
Compiled and executed with
gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out
The wall time reported by time is 3.4s vs 7.5s.
GCC 7.3.0 (Ubuntu)
The reason for the performance difference is not actually any difference in code, but in how memory is mapped. In the fast case you are reading from zero-pages, i.e. all virtual addresses are mapped to a single physical page - so nothing has to be read from memory. In the slow case, it is not zeroed. For details see this answer from a slightly different context.
On the other hand, it is not caused by calling omp_get_num_threads or by the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp. If you just specify -fopenmp but don't use it at all, the linker will omit it.
Unfortunately I am still missing the final puzzle piece: why does linking to OpenMP change the behavior of calloc regarding zeroed pages?
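One way to probe the zero-page explanation is to time reads over freshly calloc'ed memory and then again after every page has been written once. The sketch below is only an experiment under assumptions: the function names are hypothetical, the 4096-byte page size is assumed, and whether untouched calloc memory is really backed by a shared zero page depends on the OS and the allocator. The timings are reported, not guaranteed to differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* XOR all words together, as the benchmark's do_xor does. */
static unsigned long xor_words(const unsigned long *p, size_t n)
{
    unsigned long x = 0;
    for (size_t i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}

/* Run `reps` read passes over p; return CPU seconds, store the XOR result. */
static double time_reads(const unsigned long *p, size_t n, int reps,
                         unsigned long *out)
{
    unsigned long acc = 0;
    clock_t c0 = clock();
    for (int r = 0; r < reps; ++r)
        acc ^= xor_words(p, n);
    *out = acc;
    return (double)(clock() - c0) / CLOCKS_PER_SEC;
}

/* Time reads of untouched calloc memory, then of the same buffer after one
   byte per (assumed 4 KiB) page has been written. Returns 0 if both passes
   read only zeros, 1 if not, -1 if the allocation failed. */
int compare_untouched_vs_touched(size_t bytes, double *t_untouched,
                                 double *t_touched)
{
    assert(bytes >= sizeof(unsigned long) && t_untouched && t_touched);

    size_t n = bytes / sizeof(unsigned long);
    unsigned long *p = calloc(n, sizeof *p);
    unsigned long x0, x1;
    if (p == NULL)
        return -1;

    *t_untouched = time_reads(p, n, 4, &x0);

    /* Writing through a volatile pointer keeps the compiler from dropping
       the stores; each store should force a private page to be mapped. */
    volatile unsigned char *view = (volatile unsigned char *)p;
    for (size_t off = 0; off < n * sizeof *p; off += 4096)
        view[off] = 0;

    *t_touched = time_reads(p, n, 4, &x1);
    free(p);
    return (x0 == 0 && x1 == 0) ? 0 : 1;
}
```

Printing the two timings for a buffer of a few hundred megabytes, with and without the OpenMP runtime linked in, would be one way to see whether the mapping behavior changes.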

OpenMP slower than single threaded even though embarrassingly parallelizable [duplicate]

I have optimized my function as much as I could for sequential running.
When I use OpenMP I see no gain in performance.
I tried my program on a machine with 1 core and on a machine with 8 cores, and the performance is the same.
With year set to 20, I have
1 core: 1 sec.
8 core: 1 sec.
With year set to 25 I have
1 core: 40 sec.
8 core: 40 sec.
1 core machine: my laptop's intel core 2 duo 1.8 GHz, ubuntu linux
8 core machine: 3.25 GHz, ubuntu linux
My program enumerates all the possible paths of a binomial tree and does some work on each path. So my loop size increases exponentially, and I would expect the overhead of the OpenMP threads to be negligible. In my loop, I only do a reduction on one variable. All other variables are read-only. I only use functions I wrote, and I think they are thread-safe.
I also ran Valgrind's cachegrind on my program. I don't fully understand the output, but there seem to be no cache misses or false sharing.
I compile with
gcc -O3 -g3 -Wall -c -fmessage-length=0 -lm -fopenmp -ffast-math
My complete program is below. Sorry for posting a lot of code. I'm not familiar with OpenMP or C, and I couldn't shorten my code any further without losing the main task.
How can I improve performance when I use OpenMP?
Are there some compiler flags or C tricks that will make the program run faster?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "test.h"
int main(){
int year=20;
int tradingdate0=1;
int i;
float v=0;
long n=pow(tradingdate0+1,year);
#pragma omp parallel for reduction(+:v)
return 0;
//***function on which openMP is applied
float pathvalue(long pathindex) {
    float value = -ctx.firstpremium;
    float personalaccount = ctx.personalaccountat0;
    float account = ctx.firstpremium;
    int i;
    for (i = 0; i < ctx.year-1; i++) {
        value *= ctx.accumulationfactor;
        double index = getindex(i,pathindex);
        account = account * index;
        double death = fmaxf(account,ctx.guarantee[i]);
        value += qx(i) * death;
        if (haswithdraw(i)){
            double withdraw = personalaccount*ctx.allowed;
            value += px(i) * withdraw;
            personalaccount = fmaxf(personalaccount-withdraw,0);
            account = fmaxf(account-withdraw,0);
        }
    }
    //last year
    double index = getindex(ctx.year-1,pathindex);
    account = account * index;
    return value * ctx.discountfactor;
}
int haswithdraw(int period){
    return 1;
}
float getindex(int period, long pathindex){
    int ndx = (pathindex/ctx.chunksize[period])%ctx.tradingdate;
    return ctx.stock[ndx];
}
float qx(int period){
    return 0;
}
float px(int period){
    return 1;
}
struct context ctx;
void globalinit(int year, int tradingdate0){
ctx.year = year;
ctx.tradingdate0 = tradingdate0;
ctx.firstpremium = 1;
ctx.riskfreerate = 0.06;
ctx.personalaccountat0 = 1;
ctx.allowed = 0.07;
ctx.guaranteerate = 0.03;
ctx.beta = 1;
ctx.discountfactor = exp(-ctx.riskfreerate * ctx.year);
ctx.accumulationfactor = exp(ctx.riskfreerate);
ctx.guaranteefactor = 1+ctx.guaranteerate;
int i;
void globaldel(){
float pathvalue(long pathindex);
int haswithdraw(int period);
float getindex(int period, long pathindex);
float qx(int period);
float px(int period);
struct context{
int year;
int tradingdate0;
float firstpremium;
float riskfreerate;
float volatility;
float personalaccountat0;
float allowed;
float guaranteerate;
float alpha;
float beta;
int tradingdate;
float discountfactor;
float accumulationfactor;
float guaranteefactor;
float upmove;
float downmove;
float* stock;
long* chunksize;
float* guarantee;
};
struct context ctx;
void globalinit();
void globaldel();
EDIT: I simplified all the global variables to constants. For 20 years, the program runs twice as fast (great!). I tried to set the number of threads, for example with OMP_NUM_THREADS=4 ./test, but it didn't give me any performance gain.
Could there be a problem with my gcc?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include "test.h"
int main(){
int i;
float v=0;
#pragma omp parallel for reduction(+:v)
return 0;
//function on which openMP is applied
float pathvalue(long pathindex) {
    float value = -firstpremium;
    float personalaccount = personalaccountat0;
    float account = firstpremium;
    int i;
    for (i = 0; i < year-1; i++) {
        value *= accumulationfactor;
        double index = getindex(i,pathindex);
        account = account * index;
        double death = fmaxf(account,guarantee[i]);
        value += death;
        double withdraw = personalaccount*allowed;
        value += withdraw;
        personalaccount = fmaxf(personalaccount-withdraw,0);
        account = fmaxf(account-withdraw,0);
    }
    //last year
    double index = getindex(year-1,pathindex);
    account = account * index;
    return value * discountfactor;
}
float getindex(int period, long pathindex){
    int ndx = (pathindex/chunksize[period])%tradingdate;
    return stock[ndx];
}
clock_t begin;
void starttimer(){
    begin = clock();
}
void endtimer(){
    clock_t end = clock();
    double elapsed = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\nelapsed: %f\n",elapsed);
}
float pathvalue(long pathindex);
int haswithdraw(int period);
float getindex(int period, long pathindex);
float qx(int period);
float px(int period);
void starttimer();
void endtimer();
const int year= 20 ;
const int tradingdate0= 1 ;
const float firstpremium= 1 ;
const float riskfreerate= 0.06 ;
const float volatility= 0.25 ;
const float personalaccountat0= 1 ;
const float allowed= 0.07 ;
const float guaranteerate= 0.03 ;
const float alpha= 1 ;
const float beta= 1 ;
const int tradingdate= 2 ;
const int numberofpath= 1048576 ;
const float discountfactor= 0.301194211912 ;
const float accumulationfactor= 1.06183654655 ;
const float guaranteefactor= 1.03 ;
const float upmove= 1.28402541669 ;
const float downmove= 0.778800783071 ;
const float stock[2]={1.2840254166877414, 0.7788007830714049};
const long chunksize[20]={524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1};
const float guarantee[20]={1.03, 1.0609, 1.092727, 1.1255088100000001, 1.1592740743, 1.1940522965290001, 1.2298738654248702, 1.2667700813876164, 1.304773183829245, 1.3439163793441222, 1.384233870724446, 1.4257608868461793, 1.4685337134515648, 1.512589724855112, 1.557967416600765, 1.6047064390987882, 1.6528476322717518, 1.7024330612399046, 1.7535060530771016, 1.8061112346694148};
Even if your program benefits from using OpenMP, you won't see it because you are measuring the wrong time.
clock() returns the total CPU time spent in all threads. If you run with four threads and each runs for 1/4 of the time, clock() will still return the same value since 4*(1/4) = 1. You should be measuring the wall-clock time instead.
Replace calls to clock() with omp_get_wtime() or gettimeofday(). They both provide high precision wall-clock timing.
P.S. Why are there so many people around SO using clock() for timing?
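The difference between the two kinds of time can be seen even without threads. The sketch below is a single-threaded illustration, not the poster's code: `wall_seconds`, `cpu_seconds` and `time_a_sleep` are hypothetical names, and it assumes a POSIX system for `clock_gettime` and `nanosleep`. While the process sleeps, the wall clock advances but `clock()` barely moves. With several busy threads the effect goes the other way, and the CPU figure grows faster than the wall-clock one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime and nanosleep */
#endif
#include <assert.h>
#include <time.h>

/* Wall-clock seconds read from a monotonic clock (POSIX). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* CPU seconds consumed by the process, summed over all of its threads. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Sleep for `ms` milliseconds and report how far each clock advanced. */
void time_a_sleep(long ms, double *wall_elapsed, double *cpu_elapsed)
{
    assert(ms >= 0 && wall_elapsed && cpu_elapsed);

    struct timespec req;
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    nanosleep(&req, NULL);   /* the process is idle here and uses no CPU */
    *wall_elapsed = wall_seconds() - w0;
    *cpu_elapsed = cpu_seconds() - c0;
}
```

In the posted program, swapping the `clock()` calls in `starttimer` and `endtimer` for a wall-clock read like `wall_seconds` would be the analogous change.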
It seems as if it should work. You probably need to specify the number of threads to use. You can do so by setting the OMP_NUM_THREADS environment variable, for instance to 4 in order to use 4 threads.
EDIT: I just compiled the code and I observe significant speedups when changing the number of threads.
I don't see any section in which you're specifying the number of cores OpenMP will use. It's supposed to, by default, use the number of CPUs it sees, but for my purposes, I've always forced it to use as many as I specified.
Add this line before your parallel for construct:
#pragma omp parallel num_threads(num_threads)
// Your parallel for follows here
...where num_threads is an integer between 1 and the number of cores on your machine.
EDIT: Here's the makefile used to build the code. Place this in a text file named Makefile in the same directory.
test: test.c test.h
	cc -o $@ $< -O3 -g3 -fmessage-length=0 -lm -fopenmp -ffast-math

Calculating pi in pthreads

I am trying to calculate pi using the BBP method, but my result keeps coming up as 0. The whole idea is for each thread to compute a part of it, and the per-thread sums get added up after the threads are joined.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#define NUM_THREADS 20
void *pi_function(void *p);//returns the value of pi
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER; //creates a mutex variable
double pi=0,p16=1;int k=0;
double sumvalue=0,sum=0;
pthread_t threads[NUM_THREADS]; //creates the number of threads NUM_THREADS
int iret1; //used to ensure that threads are created properly
int i;
pthread_mutex_init(&mutex1, NULL);
iret1= pthread_create(&threads[i],NULL,pie_function,(void *) i);
printf("ERROR; return code from pthread_create() is %d\n", iret1);
printf("ERROR; return code from pthread_create() is %d\n", iret1);
pi=pi+sumvalue; //my result here keeps returning 0
printf("Main: program completed. Exiting.\n");
printf("The value of pi is : %f\n",pi);
void *pie_function(void * p){
int rc;
int k=(int)p;
sumvalue += 1.0/p16 * (4.0/(8* k + 1) - 2.0/(8*k + 4)
- 1.0/(8*k + 5) - 1.0/(8*k+6));
pthread_mutex_lock( &mutex1 ); //locks the share variable pi and p16
p16 *=16;
rc=pthread_mutex_unlock( &mutex1 );
printf("ERROR; return code from pthread_create() is %d\n", rc);
For your purpose you don't need a mutex or any other complicated structure. Just have every thread compute on its own local variables. Give each thread the address of a double from which it reads its k and into which it may write its result, in the same way as you already keep a separate pthread_t variable for each thread.
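A minimal sketch of that idea, run serially so it stays self-contained: each term gets its own slot and computes its own power of 16 instead of updating a shared p16. The names `bbp_term`, `bbp_pi` and `NUM_TERMS` are hypothetical. In a pthreads version the first loop body would be the thread function and the second loop would run after the joins.

```c
#include <assert.h>

#define NUM_TERMS 20   /* mirrors NUM_THREADS in the question */

/* k-th term of the BBP series. The power of 16 is computed locally,
   so the term does not depend on any shared state. */
static double bbp_term(int k)
{
    assert(k >= 0);

    double p16 = 1.0;
    for (int j = 0; j < k; ++j)
        p16 *= 16.0;

    return (4.0 / (8 * k + 1) - 2.0 / (8 * k + 4)
          - 1.0 / (8 * k + 5) - 1.0 / (8 * k + 6)) / p16;
}

/* Fill one result slot per term, then add the slots up. */
double bbp_pi(void)
{
    double slot[NUM_TERMS];
    double pi = 0.0;

    for (int k = 0; k < NUM_TERMS; ++k)
        slot[k] = bbp_term(k);   /* the work one thread would do */

    for (int k = 0; k < NUM_TERMS; ++k)
        pi += slot[k];           /* the summation done after joining */

    return pi;
}
```

Because every slot is written by exactly one worker and read only after that worker has finished, no lock is needed in this arrangement.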
To avoid getting 0 as the output, put pi=pi+sumvalue inside the join for loop.
As written, pi=pi+sumvalue is executed only once, at a point where sumvalue is still very small and pi is 0, so the output is 0.
To get a consistent value of pi, follow the details below.
You have to make sure of two things:
Keep pi as the only global variable and remove the other globals. Limit the use of global variables, since every update to one of them becomes a critical section. Update pi as
pthread_mutex_lock(&mutex1);
pi += my_sum;
pthread_mutex_unlock(&mutex1);
You can also calculate sumvalue per thread as:
void *pie_function(void *rank) {
    long my_rank = (long) rank;
    //printf("%ld \n",my_rank);
    double factor, sumvalue = 0.0;
    long long i;
    long long n=1000000;
    long long my_n = n/thread_count; // thread_count: total number of threads
    long long my_first_i = my_n*my_rank;
    long long my_last_i = my_first_i + my_n;
    if (my_first_i % 2 == 0)
        factor = 1.0;
    else
        factor = -1.0;
    for (i = my_first_i; i < my_last_i; i++, factor = -factor)
        sumvalue += 4*factor/(2*i+1);
    // add this thread's share to the global pi under the mutex
    pthread_mutex_lock(&mutex1);
    pi += sumvalue;
    pthread_mutex_unlock(&mutex1);
    return NULL;
}
