I'm writing an implementation of the Sieve of Eratosthenes (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) on the GPU, but not something like this - http://developer-resource.blogspot.com/2008/07/cuda-sieve-of-eratosthenes.html
Method:
Creating an n-element array with default values 0/1 (0 - prime, 1 - not prime) and passing it to the GPU (I know this could be done directly in the kernel, but that's not the issue at the moment).
Each thread in a block checks multiples of a single number. Each block checks sqrt(n) possibilities in total. Each block == different interval.
Marking multiples as 1 and passing data back to the host.
Code:
#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
int n = atol(argv[1]);
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
return 0;
}
Run:
./sieve 10240000
It works correctly when n = 16, 64, 1024, 102400..., but for n = 10240000 I get an incorrect result. Where is the problem?
This code has a variety of problems, in my view.
1. You are fundamentally accessing items out of range. Consider this sequence in your kernel:
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
cache[tid - 1] = global[number];
You have (in some cases -- see below) launched a thread array exactly equal in size to your global array. So what happens when the highest-numbered thread runs the above code? number = threadIdx.x+1+blockIdx.x*blockDim.x. This number index will be one beyond the end of your array. This is true for many possible values of n. This problem would have been evident to you if you had either used proper CUDA error checking or run your code with cuda-memcheck. You should always do both of those things when you are having trouble with a CUDA code, and also before asking for help from others.
2. The code only has a chance of working correctly if the input n is a perfect square. The reason for this is contained in these lines of code (as well as dependencies in the kernel):
int n = atol(argv[1]);
int n_sqrt = floor(sqrt((double)n));
...
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
(Note that the correct function here would be atoi, not atol, but I digress...) Unless n is a perfect square, the resultant n_sqrt will be somewhat less than the actual square root of n. This will lead you to compute a total thread array that is smaller than the necessary size. (It's OK if you don't believe me at this point. Run the code I will post below and input a size like 1025, then see whether threads * blocks is of sufficient size to cover an array of 1025.)
3. As you've stated:
Each block checks in total sqrt(n) possibilities.
Hopefully this also points out the danger of a non-perfect-square n, but we must now ask: what if n is larger than the square of the largest threadblock size (1024)? The answer is that the code will not work correctly in many cases - your chosen input of 10240000, although a perfect square, exceeds 1024^2 (1048576), and it does not work for this reason. Your algorithm (which I claim is not a Sieve of Eratosthenes) requires that each block be able to check sqrt(n) possibilities, just as you stated in the question. When that is no longer possible because of the limit on threads per block, your algorithm starts to break.
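As an aside (the code below does not do this -- it deliberately keeps your original launch arithmetic so the failures stay observable), the usual way to guarantee that threads * blocks covers n elements is to round the block count up and then guard the memory accesses inside the kernel. A minimal sketch of that pattern:
int threads = min(n_sqrt, THREADS);
int blocks = (n + threads - 1) / threads;   // round up so blocks * threads >= n
// inside the kernel, guard loads/stores with something like
//   if (number < n) { ... }
// rather than returning early, since every thread must still reach __syncthreads()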
Here is code that makes some attempt to fix issue #1 above, and at least gives an explanation for the failures associated with #2 and #3:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // memset
#include <math.h>     // floor, sqrt
#define THREADS 1024
#define MAX 10240000
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
if ((blockIdx.x != (gridDim.x-1)) || (threadIdx.x != (blockDim.x-1))){
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];
}
}
int cpu_sieve(int n){
  int limit = floor(sqrt((double)n));
  int *test_arr = (int *)malloc(n*sizeof(int));
  if (test_arr == NULL) return -1;
  memset(test_arr, 0, n*sizeof(int));
  // cross off multiples of every prime up to and including sqrt(n)
  for (int i = 2; i <= limit; i++)
    if (!test_arr[i]){
      int j = i*i;
      while (j < n){   // valid indices are 0..n-1
        test_arr[j] = 1;
        j += i;
      }
    }
  int count = 0;
  for (int i = 2; i < n; i++)
    if (!test_arr[i]) count++;
  free(test_arr);
  return count;
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
if (argc != 2) {printf("must supply n as command line parameter\n"); return 1;}
int n = atoi(argv[1]);
if ((n < 1) || (n > MAX)) {printf("n out of range %d\n", n); return 1;}
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
printf("threads = %d, blocks = %d\n", threads, blocks);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
cudaCheckErrors("some error");
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
printf("CPU Sieve: %d\n", cpu_sieve(n));
return 0;
}
There are a couple of issues, I think, but here's a pointer to the actual problem: The Sieve of Eratosthenes iteratively removes multiples of already-encountered prime numbers, and you want to split the workload into thread-blocks, where each thread-block operates on a piece of shared memory (cache, in your example). Thread-blocks, however, are generally independent of all other thread-blocks and cannot easily communicate with one another. One example to illustrate the problem: the thread with index 0 in the thread-block with index 0 removes multiples of 2. Thread-blocks with index > 0 have no way to know about this.
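To make that concrete, the usual workaround is a segmented sieve: compute the base primes up to sqrt(n) on the host (or in a preliminary kernel), then let each block strike out multiples of those base primes within its own segment only, so no inter-block communication is ever needed. A minimal sketch, assuming the base primes have already been copied to the device as basePrimes[0..numBase-1] (all names here are illustrative, not from your code):
__global__ void sieve_segment(char *isComposite, const int *basePrimes, int numBase, int n)
{
    int segStart = blockIdx.x * blockDim.x;        // first number owned by this block
    int segEnd = min(segStart + blockDim.x, n);    // one past the last number
    // each thread strides over the base primes
    for (int p_idx = threadIdx.x; p_idx < numBase; p_idx += blockDim.x) {
        int p = basePrimes[p_idx];
        // first multiple of p inside this segment, but never below p*p
        int first = max(p * p, ((segStart + p - 1) / p) * p);
        for (int m = first; m < segEnd; m += p)
            isComposite[m] = 1;
    }
}
Each block then only ever writes to its own slice of isComposite.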
I am writing some code to compute the sum of the Fibonacci values up to n, as stored in an array. For only certain values of n, I get an error on calling free().
Edit: This code should now compile.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long fib(long *fibs, int n);
int main(int argc, char *argv[]) {
long num, sum;
long n;
long *fibs;
if(argc < 2) {
printf("Usage %s n\n", argv[0]);
exit(EXIT_SUCCESS);
}
n = strtol(argv[1], NULL, 10);
printf("%ld\n", n);
printf("--Allocating memory\n");
fibs = (long *) malloc(sizeof(long) * n);
printf("--Memory allocated\n");
fibs[0] = 1;
fibs[1] = 1;
sum = 0;
for(int i = 0; i <= n; i++) {
num = fib(fibs, i);
sum += num;
printf("%ld\n", num);
}
printf("%ld\n", sum);
printf("--Freeing memory\n");
free(fibs);
printf("--Memory freed\n");
}
long fib(long *fibs, int n) {
if((n == 0) || (n == 1)) {
return 1;
}
fibs[n] = fibs[n - 1] + fibs[n - 2];
return fibs[n];
}
For instance, when I call the program ./fibsum with n=5, I get a core dump.
The lines
fibs[n] = 1;
and
fibs[n] = fibs[n - 1] + fibs[n - 2];
modify memory beyond the legal range. The legal range is fibs[0] through fibs[n-1]. Because of that, the program exhibits undefined behavior. In your case, the undefined behavior manifests as a problem in the call to free.
You may want to allocate one more element than you are currently allocating. Instead of
fibs = (long *) malloc(n * sizeof(n));
use
fibs = malloc((n+1) * sizeof(n));
See an SO post on why you should not cast the return value of malloc.
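Putting those two suggestions together, a minimal sketch of the allocation (keeping the loop that runs up to and including n; the sizeof *fibs form is just one idiomatic option, not something from the original post):
/* allocate n+1 elements so that fib(fibs, n) may legally write fibs[n];
   sizeof *fibs sizes the allocation from the pointer itself, and no cast is needed in C */
fibs = malloc((n + 1) * sizeof *fibs);
if (fibs == NULL) {
    perror("malloc");
    exit(EXIT_FAILURE);
}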
Like I told you in the comments, you are overflowing when using <= instead of < in the loop. Take a look at the following example; this is by no means trying to be a "cleaned up" version of your code, I just made it work without changing too much.
#include <stdio.h>
#include <stdlib.h>
int number = 300;
long* fib(long* fibs, int n)
{
if (n == 0 || n == 1)
{
fibs[n] = 1;
}
else
{
fibs[n] = (fibs[n-1] + fibs[n-2]);
}
return fibs;
}
int main(int argc, char *argv[])
{
long* fibs = (long*)malloc(number * sizeof(long));
long sum = 0;
for (int i = 0; i < number; ++i) //changed <= to < like i said in the comments
{
sum += fib(fibs, i)[i];
printf("%d\n", sum);
}
printf("\n\nSum of everything: %ld\n", sum);
free(fibs); //no problem
return 0;
}
There are a number of problems -
i is undeclared in main.
n is not declared in main.
And most important one this loop-
for(i = 0; i <= n; i++)
You allocate memory for n items, but the loop goes from 0 to n. You forgot that it should only go up to n-1, so the behaviour is undefined.
In the case you described (n = 5), the loop should run from 0 to 4.
There are a few problems in your code; some are typing mistakes, I guess, and some are logical problems:
You forgot to declare n and i
Never cast malloc, so instead of fibs = (long *) malloc(n * sizeof(n)); try fibs = malloc(n * sizeof(long));
In the for loop, instead of for(i = 0; i <= n; i++) use for(i = 0; i < n; i++)
sum += fib(fibs, i); here, sum is long but fib() returns long *, so change the function declaration long *fib(long *fibs, int n) to long fib(long *fibs, int n)
I updated your code, ran it, and got the output: 12.
The modified code:
#include <stdio.h>
#include <stdlib.h>
long fib(long *fibs, int n) {
if((n == 0) || (n == 1)) {
fibs[n] = 1;
} else {
fibs[n] = fibs[n - 1] + fibs[n - 2];
}
return fibs[n];
}
int main(int argc, char *argv[]) {
long *fibs;
long sum = 0;
int n = 6, i;
fibs = malloc(n * sizeof(long));
long tem;
for(i = 0; i < n; i++) {
tem = fib(fibs, i);
sum += tem;
}
printf("%ld\n", sum);
free(fibs);
}
I am doing the Udacity course on parallel programming (homework 3) and cannot figure out why I can't get the maximum in the array using parallel reduction (the Udacity forums have yet to provide a solution). I am pretty certain that I have set up the arrays properly and that the algorithm is correct. I suspect that I have a problem with memory management (accessing out of bounds, incorrect array sizes, copying to and from). Please help! I am running this in the Udacity environment, not locally. Below is the code that I am currently using. For some reason, when I change the fmaxf's to fminf's it does find the minimum.
#include "reference_calc.cpp"
#include "utils.h"
#include "math.h"
#include <stdio.h>
#include <cmath>
__global__ void reduce_max_kernel(float *d_out, const float *d_logLum, int size) {
// Reduce log Lum with Max Operator
int myId = threadIdx.x + blockDim.x * blockIdx.x;
int tid = threadIdx.x;
extern __shared__ float temp[];
if (myId < size) {
temp[tid] = d_logLum[myId];
}
else {
temp[tid] = d_logLum[tid];
}
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
if (myId < size) {
temp[tid] = fmaxf(d_logLum[myId + s], d_logLum[myId]);
} else {
temp[tid] = d_logLum[tid];
}
}
__syncthreads();
}
if (tid == 0) {
d_out[blockIdx.x] = temp[0];
}
}
__global__ void reduce_max_kernel2(float *d_out, float *d_in) {
// Reduce log Lum with Max Operator
int myId = threadIdx.x + blockDim.x * blockIdx.x;
int tid = threadIdx.x;
for (unsigned int s = blockDim.x >> 1; s > 0; s >>= 1) {
if (tid < s) {
d_in[myId] = fmaxf(d_in[myId + s], d_in[myId]);
}
__syncthreads();
}
if (tid == 0) {
d_out[0] = d_in[0];
}
}
void your_histogram_and_prefixsum(const float* const d_logLuminance,
unsigned int* const d_cdf,
float &min_logLum,
float &max_logLum,
const size_t numRows,
const size_t numCols,
const size_t numBins)
{
//TODO
/*Here are the steps you need to implement
1) find the minimum and maximum value in the input logLuminance channel
store in min_logLum and max_logLum
2) subtract them to find the range
3) generate a histogram of all the values in the logLuminance channel using
the formula: bin = (lum[i] - lumMin) / lumRange * numBins
4) Perform an exclusive scan (prefix sum) on the histogram to get
the cumulative distribution of luminance values (this should go in the
incoming d_cdf pointer which already has been allocated for you) */
//int size = 1 << 18;
int points = numRows * numCols;
int logPoints = ceil(log(points)/log(2));
int sizePow = logPoints;
int size = pow(2, sizePow);
int numThreads = 1024;
int numBlocks = size / numThreads;
float *d_out;
float *d_max_out;
checkCudaErrors(cudaMalloc((void **) &d_out, numBlocks * sizeof(float)));
checkCudaErrors(cudaMalloc((void **) &d_max_out, sizeof(float)));
cudaDeviceSynchronize();
reduce_max_kernel<<<numBlocks, numThreads, sizeof(float)*numThreads>>>(d_out, d_logLuminance, points);
cudaDeviceSynchronize();
reduce_max_kernel2<<<1, numBlocks>>>(d_max_out, d_out);
float h_out_max;
checkCudaErrors(cudaMemcpy(&h_out_max, d_max_out, sizeof(float), cudaMemcpyDeviceToHost));
printf("%f\n", h_out_max);
checkCudaErrors(cudaFree(d_max_out));
checkCudaErrors(cudaFree(d_out));
}
You are trying to reproduce the reduce2 reduction kernel of the CUDA SDK reduction sample. Robert Crovella has already spotted two mistakes that you have made in your code. Besides those, I think you are also initializing the shared memory incorrectly.
Below, please find a complete working example constructed around your attempt. I have left the incorrect instructions from your approach in as comments.
#include <cstdio>
#include <cfloat>                  // FLT_MAX
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#define BLOCKSIZE 256
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { getchar(); exit(code); }
}
}
/*******************************************************/
/* CALCULATING THE NEXT POWER OF 2 OF A CERTAIN NUMBER */
/*******************************************************/
unsigned int nextPow2(unsigned int x)
{
--x;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return ++x;
}
__global__ void reduce_max_kernel(float *d_out, const float *d_logLum, int size) {
int tid = threadIdx.x; // Local thread index
int myId = blockIdx.x * blockDim.x + threadIdx.x; // Global thread index
extern __shared__ float temp[];
// --- Loading data to shared memory. All the threads contribute to loading the data to shared memory.
temp[tid] = (myId < size) ? d_logLum[myId] : -FLT_MAX;
// --- Your solution
// if (myId < size) { temp[tid] = d_logLum[myId]; } else { temp[tid] = d_logLum[tid]; }
// --- Before going further, we have to make sure that all the shared memory loads have been completed
__syncthreads();
// --- Reduction in shared memory. Only half of the threads contribute to reduction.
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s) { temp[tid] = fmaxf(temp[tid], temp[tid + s]); }
// --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
__syncthreads();
}
// --- Your solution
//for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
// if (tid < s) { if (myId < size) { temp[tid] = fmaxf(d_logLum[myId + s], d_logLum[myId]); } else { temp[tid] = d_logLum[tid]; } }
// __syncthreads();
//}
if (tid == 0) {
d_out[blockIdx.x] = temp[0];
}
}
/********/
/* MAIN */
/********/
int main()
{
const int N = 10;
thrust::device_vector<float> d_vec(N,3.f); d_vec[4] = 4.f;
int NumThreads = (N < BLOCKSIZE) ? nextPow2(N) : BLOCKSIZE;
int NumBlocks = (N + NumThreads - 1) / NumThreads;
// when there is only one warp per block, we need to allocate two warps
// worth of shared memory so that we don't index shared memory out of bounds
int smemSize = (NumThreads <= 32) ? 2 * NumThreads * sizeof(float) : NumThreads * sizeof(float);
// --- reduce2
thrust::device_vector<float> d_vec_block(NumBlocks);
reduce_max_kernel<<<NumBlocks, NumThreads, smemSize>>>(thrust::raw_pointer_cast(d_vec_block.data()), thrust::raw_pointer_cast(d_vec.data()), N);
// --- The last part of the reduction, which would be expensive to perform on the device, is executed on the host
thrust::host_vector<float> h_vec_block(d_vec_block);
float result_reduce0 = -FLT_MAX;
for (int i=0; i<NumBlocks; i++) result_reduce0 = fmax(h_vec_block[i], result_reduce0);
printf("Result = %f\n",result_reduce0);
}
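As a side note (not part of the fix above): since the data already lives in a thrust::device_vector, the whole max-reduction can also be done in one call with thrust::reduce, which is convenient as a reference to check the hand-written kernel against. A minimal sketch, assuming d_vec from the example above:
#include <thrust/reduce.h>
#include <thrust/functional.h>
// reference maximum computed entirely by Thrust
float ref_max = thrust::reduce(d_vec.begin(), d_vec.end(), -FLT_MAX, thrust::maximum<float>());
printf("Thrust reference = %f\n", ref_max);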
I have been working on this program that accomplishes this:
counts the number of occurrences of a specific integer value in a 2D array (matrix). Each position of the matrix must first be initialized to an integer value between 0 and n. Once initialized, the program searches for and counts the total number of occurrences of a specific value.
The program is run by taking in the parameters as command line arguments:
programName rows cols n c
rows – number of rows of the matrix
cols – number of columns of the matrix
n – the upper bound of the random values of the matrix, values can be 0–(n-1)
c – the value to search for in the matrix, note c must be between 0–(n-1)
After this, the program implements the search using 1 to 10 threads and displays the execution time and number of occurrences.
I seem to have all of this working as I wish; however, the problem is that whenever I enter a value over 4 on the command line for rows, I keep getting a segmentation fault.
I am at a loss as to what is causing this. Please help me understand what error in my code may be contributing to it. Thank you.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>
#define NUM_THREADS 10
int **arr;
int rows, cols, n, c, totalOccurrence, done, numThreads;
int threadCounter[10];
void *matrixThread (void *threadid)
{
long tid;
tid = (long)threadid;
long lowBound = tid * (rows / numThreads);
long highBound = lowBound + (rows / numThreads);
int localcount = 0;
if (tid == numThreads - 1)
{
highBound = rows;
}
long i;
int ic;
for (i = lowBound; i < highBound; i++)
{
for (ic = 0; ic < cols; ic++)
{
if (arr[i][ic] == c)
{
localcount++;
}
}
}
threadCounter[tid] = localcount;
pthread_exit(NULL);
}
int main (int argc, char *argv[])
{
pthread_t threads[NUM_THREADS];
if (argc != 5)
{
printf("Error: Invalid number of arguments\n");
}
else
{
rows = strtol(argv[1], NULL, 10);
cols = strtol(argv[2], NULL, 10);
n = strtol(argv[3], NULL, 10);
c = strtol(argv[4], NULL, 10);
int r, cl;
arr = (int**)malloc(rows * sizeof(int));
for (r = 0; r < rows; r++)
{
arr[r] = malloc(cols * sizeof(int));
}
int randomNum;
srand(time(NULL));
for (r = 0; r < rows; r++)
{
for (cl = 0; cl < cols; cl++)
{
randomNum = rand() % n;
arr[r][cl] = randomNum;
}
}
long rc, t;
for (numThreads = 1; numThreads <= 10; numThreads++)
{
struct timeval start,end;
double elapsed_time;
gettimeofday(&start, NULL);
for (t = 0; t < numThreads; t++)
{
rc = pthread_create(&threads[t], NULL, matrixThread, (void *)t);
if (rc)
{
printf ("Error: Thread could not be created; return %d", rc);
exit(-1);
}
}
for (t = 0; t < numThreads; t++)
{
pthread_join(threads[t], NULL);
}
totalOccurrence = 0;
int q;
for (q = 0; q < numThreads; q++)
{
totalOccurrence += threadCounter[q];
}
gettimeofday(&end, NULL);
elapsed_time = (end.tv_sec + end.tv_usec/1000000.10000) - (start.tv_sec + start.tv_usec/1000000.10000);
printf("\nNumber of threads: %d " , numThreads);
printf("Total Occurrences of %d: %d " ,c, totalOccurrence);
printf("Elapsed time: %.8f\n" , elapsed_time);
totalOccurrence = 0;
}
}
pthread_exit(NULL);
}
Here is one problem:
arr = (int**)malloc(rows * sizeof(int));
should be:
arr = (int**)malloc(rows * sizeof(int *));
The allocation of the row pointers should be done like this:
arr = (int**)malloc(rows * sizeof(int*));
The size of a data type can vary, but the size of a pointer is constant on a particular machine architecture: on a 64-bit machine a pointer is 8 bytes, while sizeof(int) is usually 4 bytes (gcc). So the original allocation reserves only half the space the row pointers actually need. That is why, when you pass more than 4 rows, it crashes with an invalid memory access.
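One defensive habit worth mentioning (a suggestion of mine, not from the original answer) is to size every allocation from the pointer being assigned, so the element type can never get out of sync with the allocation:
/* sizeof *arr is the size of one element of whatever arr points to (here int *),
   so the allocation stays correct even if the element type changes later */
arr = malloc(rows * sizeof *arr);            /* rows row pointers */
for (r = 0; r < rows; r++)
    arr[r] = malloc(cols * sizeof *arr[r]);  /* cols ints per row */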
Also, your program leaks memory, since you never free what you allocated. Free it like this at the end:
for (r = 0; r < rows; r++)
{
free (arr[r]);
}
free (arr);
I'm attempting to implement concurrent kernel launches for a very complex CUDA kernel, so I thought I'd start out with a simple example. It just launches a kernel which does a sum reduction. Simple enough. Here it is:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <cuda.h>
extern __shared__ char dsmem[];
__device__ double *scratch_space;
__device__ double NDreduceSum(double *a, unsigned short length)
{
const int tid = threadIdx.x;
unsigned short k = length;
double *b;
b = scratch_space;
for (int i = tid; i < length; i+= blockDim.x)
b[i] = a[i];
__syncthreads();
do {
k = (k + 1) / 2;
if (tid < k && tid + k < length)
b[tid] += b[tid + k];
length = k;
__syncthreads();
} while (k > 1);
return b[0];
}
__device__ double reduceSum(double *a, unsigned short length)
{
const int tid = threadIdx.x;
unsigned short k = length;
do
{
k = (k + 1) / 2;
if (tid < k && tid + k < length)
a[tid] += a[tid + k];
length = k;
__syncthreads();
}
while (k > 1);
return a[0];
}
__global__ void kernel_thing(double *ad, int size)
{
double sum_1, sum_2, sum_3;
time_t begin, end, t1, t2, t3;
scratch_space = (double *) &dsmem[0];
for (int j = 0; j < 1000000; j++) {
begin = clock();
sum_1 = NDreduceSum(ad, size);
end = clock();
}
__syncthreads();
t1 = end - begin;
begin = clock();
sum_2 = 0;
if (threadIdx.x == 0) {
for (int i = 0; i < size; i++) {
sum_2 += ad[i];
}
}
__syncthreads();
end = clock();
t2 = end - begin;
__syncthreads();
begin = clock();
sum_3 = reduceSum(ad, size);
end = clock();
__syncthreads();
t3 = end - begin;
if (threadIdx.x == 0) {
printf("Sum found: %lf and %lf and %lf. In %ld and %ld and %ld ticks.\n", sum_1, sum_2, sum_3, t1, t2, t3);
}
}
int main(int argc, char **argv)
{
int i;
const int size = 512;
double *a, *ad, *b, *bd;
double sum_a, sum_b;
cudaStream_t stream_a, stream_b;
cudaError_t result;
cudaEvent_t a_start, a_stop, b_start, b_stop;
a = (double *) malloc(sizeof(double) * size);
b = (double *) malloc(sizeof(double) * size);
srand48(time(0));
for (i = 0; i < size; i++) {
a[i] = drand48();
}
for (i = 0; i < size; i++) {
b[i] = drand48();
}
sum_a = 0;
for (i = 0; i < size; i++) {
sum_a += a[i];
}
sum_b = 0;
for (i = 0; i < size; i++) {
sum_b += b[i];
}
printf("Looking for sum_a %lf\n", sum_a);
printf("Looking for sum_b %lf\n", sum_b);
cudaEventCreate(&a_start);
cudaEventCreate(&b_start);
cudaEventCreate(&a_stop);
cudaEventCreate(&b_stop);
cudaMalloc((void **) &ad, sizeof(double) * size);
cudaMalloc((void **) &bd, sizeof(double) * size);
result = cudaStreamCreate(&stream_a);
result = cudaStreamCreate(&stream_b);
result = cudaMemcpyAsync(ad, a, sizeof(double) * size, cudaMemcpyHostToDevice, stream_a);
result = cudaMemcpyAsync(bd, b, sizeof(double) * size, cudaMemcpyHostToDevice, stream_b);
cudaEventRecord(a_start);
kernel_thing<<<1, 512, 49152, stream_a>>>(ad, size);
cudaEventRecord(a_stop);
cudaEventRecord(b_start);
kernel_thing<<<1, 512, 49152, stream_b>>>(bd, size);
cudaEventRecord(b_stop);
result = cudaMemcpyAsync(a, ad, sizeof(double) * size, cudaMemcpyDeviceToHost, stream_a);
result = cudaMemcpyAsync(b, bd, sizeof(double) * size, cudaMemcpyDeviceToHost, stream_b);
cudaEventSynchronize(a_stop);
cudaEventSynchronize(b_stop);
float a_ms = 0;
float b_ms = 0;
cudaEventElapsedTime(&a_ms, a_start, a_stop);
cudaEventElapsedTime(&b_ms, b_start, b_stop);
printf("%lf ms for A.\n", a_ms);
printf("%lf ms for B.\n", b_ms);
result = cudaStreamDestroy(stream_a);
result = cudaStreamDestroy(stream_b);
if (result != cudaSuccess) {
printf("I should probably do this after each important operation.\n");
}
/*
printf("Matrix after:\n");
for (i = 0; i < size; i++) {
printf("%lf ", a[i]);
}
printf("\n");
*/
free(a);
free(b);
cudaFree(ad);
cudaFree(bd);
return 0;
}
Compiled like so:
CFLAGS = -arch sm_35
CC = nvcc
all: parallel
parallel: parallel.cu
$(LINK.c) $^ -o $@
clean:
rm -f *.o core parallel
I'm using a single Tesla K20X.
When I run this simple example, I get the following output:
Looking for sum_a 247.983945
Looking for sum_b 248.033749
Sum found: 247.983945 and 247.983945 and 247.983945. In 3242 and 51600 and 4792 ticks.
Sum found: 248.033749 and 248.033749 and 248.033749. In 3314 and 52000 and 4497 ticks.
4645.079102 ms for A.
4630.725098 ms for B.
Application 577759 resources: utime ~8s, stime ~2s, Rss ~82764, inblocks ~406, outblocks ~967
So, as you can see, each of the kernels gets the correct results and takes around 4.5 s, which is what I got in an earlier one-kernel version. Great! However, as you can see from the aprun output, the wall time is actually around 10 s, which is much more than the one-kernel version. So, it looks like the kernels are either not launching in parallel, or I'm not getting nearly the speed-up (2x) that I was expecting from concurrent kernel launches.
To tl;dr this question:
Am I missing anything in my code example? Are the kernels actually launching in parallel?
What kind of speed-up should I expect with a Tesla K20X? Shouldn't the kernels run exactly in parallel, completing twice the work in the same time? How many kernels can I expect to run efficiently in parallel?
Thanks for your help.
The cudaEventRecord operations in between your kernels are causing serialization.
Right now the results you are getting:
4645.079102 ms for A.
4630.725098 ms for B.
are back-to-back due to this serialization.
Instead, just time the entire kernel launch sequence:
cudaEventRecord(a_start);
kernel_thing<<<1, 512, 49152, stream_a>>>(ad, size);
kernel_thing<<<1, 512, 49152, stream_b>>>(bd, size);
cudaEventRecord(a_stop);
And I think you will see an elapsed time for (a_start, a_stop) that is roughly the same as one of your previous kernels (~4600 ms), indicating more or less full concurrency. I used CUDA 6 RC, copied data back to the host rather than using printf from the kernel, and eliminated the cudaEventRecord operations between the kernel calls, and I got an overall execution time of ~4.8 s. If I didn't modify the cudaEventRecord arrangement, my execution time was ~8.3 s.
A few other notes:
I wouldn't use printf from kernel when running tests like these.
You won't get overlap of compute and cudaMemcpyAsync when the host buffer is allocated with malloc. You need to use cudaHostAlloc (see the sketch after this list).
I would start with running and understanding the concurrent kernels cuda sample first.
You may want to review the appropriate section of the programming guide
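A minimal sketch of the pinned-host-buffer change mentioned in the list above (names follow your example; treat it as an illustration, not a drop-in patch):
// pinned (page-locked) host buffers are required for cudaMemcpyAsync to overlap with kernels
double *a, *b;
cudaHostAlloc((void **)&a, sizeof(double) * size, cudaHostAllocDefault);
cudaHostAlloc((void **)&b, sizeof(double) * size, cudaHostAllocDefault);
// ... fill a and b, issue the async copies and kernel launches as before ...
cudaFreeHost(a);   // instead of free()
cudaFreeHost(b);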
I'm new to CUDA. To get my hands dirty, I tried writing a Sieve of Eratosthenes (for finding all the primes up to some number n).
There are a number of things I had to do to get it to work that it seems shouldn't have been necessary. I'm curious whether anyone knows of a more natural (and still CUDA-optimized) approach.
To collect the entries marked as prime in the isPrime array into a list, I had to do two separate kernel calls. The first counts the number of primes in each threadblock and assigns to each entry i the number of primes in that block less than i. Then I have to make a second call to add in the number of primes in all the previous blocks in order to get the final index.
But it's even worse than that, because to avoid heaps of concurrent reads, I had to store the number of primes in the block in a separate array at each of the THREADS_PER_BLOCK indices, effectively doubling the required memory for the algorithm. It seems like there should be a way to have all the threads read the same value for each block rather than having to copy it so many times.
Despite all this, there's still the problem of concurrent reads in the clearMultiples method. Especially for small primes like 2 and 3, every thread has to read the value in. Isn't there any way to deal with this?
Could anyone look at my code and tell me if there's anything obvious I could do that would be simpler or more efficient?
Is there anything I'm doing that's particularly inefficient (besides printing out all the primes at the end of course)?
Is it necessary to call synchronize after every kernel call?
Do I need to synchronize after memcpy's as well?
Finally, how come when I set THREADS_PER_BLOCK to 512 it doesn't work?
Thank you
#include <stdio.h>
#include <cuda.h>
#include <assert.h>
#include <math.h>
#define MAX_BLOCKS 256
#define THREADS_PER_BLOCK 256 //Must be a power of 2
#define BLOCK_SPACE 2 * THREADS_PER_BLOCK
__global__ void initialize(int* isPrime, int n) {
int idx = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
int step = gridDim.x * THREADS_PER_BLOCK;
int i;
for (i = idx; i <= 1; i += step) {
isPrime[i] = 0;
}
for (; i < n; i += step) {
isPrime[i] = 1;
}
}
__global__ void clearMultiples(int* isPrime, int* primeList, int startInd,
int endInd, int n) {
int yidx = blockIdx.y * blockDim.y + threadIdx.y;
int xidx = blockIdx.x * blockDim.x + threadIdx.x;
int ystep = gridDim.y * blockDim.y;
int xstep = gridDim.x * blockDim.x;
for (int pnum = startInd + yidx; pnum < endInd; pnum += ystep) {
int p = primeList[pnum];
int pstart = p * (p + xidx);
int pstep = p * xstep;
for (int i = pstart; i < n; i += pstep) {
isPrime[i] = 0;
}
}
}
__device__ void makeCounts(int* isPrime, int* addend, int start, int stop) {
__shared__ int tmpCounts[BLOCK_SPACE];
__shared__ int dumbCounts[BLOCK_SPACE];
int idx = threadIdx.x;
tmpCounts[idx] = ((start + idx) < stop) ? isPrime[start + idx] : 0;
__syncthreads();
int numEntries = THREADS_PER_BLOCK;
int cstart = 0;
while (numEntries > 1) {
int prevStart = cstart;
cstart += numEntries;
numEntries /= 2;
if (idx < numEntries) {
int i1 = idx * 2 + prevStart;
tmpCounts[idx + cstart] = tmpCounts[i1] + tmpCounts[i1 + 1];
}
__syncthreads();
}
if (idx == 0) {
dumbCounts[cstart] = tmpCounts[cstart];
tmpCounts[cstart] = 0;
}
while (cstart > 0) {
int prevStart = cstart;
cstart -= numEntries * 2;
if (idx < numEntries) {
int v1 = tmpCounts[idx + prevStart];
int i1 = idx * 2 + cstart;
tmpCounts[i1 + 1] = tmpCounts[i1] + v1;
tmpCounts[i1] = v1;
dumbCounts[i1] = dumbCounts[i1 + 1] = dumbCounts[idx + prevStart];
}
numEntries *= 2;
__syncthreads();
}
if (start + idx < stop) {
isPrime[start + idx] = tmpCounts[idx];
addend[start + idx] = dumbCounts[idx];
}
}
__global__ void createCounts(int* isPrime, int* addend, int lb, int ub) {
int step = gridDim.x * THREADS_PER_BLOCK;
for (int i = lb + blockIdx.x * THREADS_PER_BLOCK; i < ub; i += step) {
int start = i;
int stop = min(i + step, ub);
makeCounts(isPrime, addend, start, stop);
}
}
__global__ void sumCounts(int* isPrime, int* addend, int lb, int ub,
int* totalsum) {
int idx = blockIdx.x;
int s = 0;
for (int i = lb + idx; i < ub; i += THREADS_PER_BLOCK) {
isPrime[i] += s;
s += addend[i];
}
if (idx == 0) {
*totalsum = s;
}
}
__global__ void condensePrimes(int* isPrime, int* primeList, int lb, int ub,
int primeStartInd, int primeCount) {
int idx = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
int step = gridDim.x * THREADS_PER_BLOCK;
for (int i = lb + idx; i < ub; i += step) {
int term = isPrime[i];
int nextTerm = i + 1 == ub ? primeCount : isPrime[i + 1];
if (term < nextTerm) {
primeList[primeStartInd + term] = i;
}
}
}
int main(void) {
printf("Enter upper bound:\n");
int n;
scanf("%d", &n);
int *isPrime, *addend, *numPrimes, *primeList;
cudaError_t t = cudaMalloc((void**) &isPrime, n * sizeof(int));
assert(t == cudaSuccess);
t = cudaMalloc(&addend, n * sizeof(int));
assert(t == cudaSuccess);
t = cudaMalloc(&numPrimes, sizeof(int));
assert(t == cudaSuccess);
int primeBound = 2 * n / log(n);
t = cudaMalloc(&primeList, primeBound * sizeof(int));
assert(t == cudaSuccess);
int numBlocks = min(MAX_BLOCKS,
(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
initialize<<<numBlocks, THREADS_PER_BLOCK>>>(isPrime, n);
t = cudaDeviceSynchronize();
assert(t == cudaSuccess);
int bound = (int) ceil(sqrt(n));
int lb;
int ub = 2;
int primeStartInd = 0;
int primeEndInd = 0;
while (ub < n) {
if (primeEndInd > primeStartInd) {
int lowprime;
t = cudaMemcpy(&lowprime, primeList + primeStartInd, sizeof(int),
cudaMemcpyDeviceToHost);
assert(t == cudaSuccess);
int numcols = n / lowprime;
int numrows = primeEndInd - primeStartInd;
int threadx = min(numcols, THREADS_PER_BLOCK);
int thready = min(numrows, THREADS_PER_BLOCK / threadx);
int blockx = min(numcols / threadx, MAX_BLOCKS);
int blocky = min(numrows / thready, MAX_BLOCKS / blockx);
dim3 gridsize(blockx, blocky);
dim3 blocksize(threadx, thready);
clearMultiples<<<gridsize, blocksize>>>(isPrime, primeList,
primeStartInd, primeEndInd, n);
t = cudaDeviceSynchronize();
assert(t == cudaSuccess);
}
lb = ub;
ub *= 2;
if (lb >= bound) {
ub = n;
}
numBlocks = min(MAX_BLOCKS,
(ub - lb + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
createCounts<<<numBlocks, THREADS_PER_BLOCK>>>(isPrime, addend, lb, ub);
t = cudaDeviceSynchronize();
assert(t == cudaSuccess);
sumCounts<<<THREADS_PER_BLOCK, 1>>>(isPrime, addend, lb, ub, numPrimes);
t = cudaDeviceSynchronize();
assert(t == cudaSuccess);
int primeCount;
t = cudaMemcpy(&primeCount, numPrimes, sizeof(int),
cudaMemcpyDeviceToHost);
assert(t == cudaSuccess);
assert(primeCount > 0);
primeStartInd = primeEndInd;
primeEndInd += primeCount;
condensePrimes<<<numBlocks, THREADS_PER_BLOCK>>>(isPrime, primeList, lb,
ub, primeStartInd, primeCount);
t = cudaDeviceSynchronize();
assert(t == cudaSuccess);
}
int finalprimes[primeEndInd];
t = cudaMemcpy(finalprimes, primeList, primeEndInd * sizeof(int),
cudaMemcpyDeviceToHost);
assert(t == cudaSuccess);
t = cudaFree(isPrime);
assert(t == cudaSuccess);
t = cudaFree(addend);
assert(t == cudaSuccess);
t = cudaFree(numPrimes);
assert(t == cudaSuccess);
t = cudaFree(primeList);
assert(t == cudaSuccess);
for (int i = 0; i < primeEndInd; i++) {
if (i % 16 == 0)
printf("\n");
else
printf(" ");
printf("%4d", finalprimes[i]);
}
printf("\n");
return 0;
}
Answering some of your questions.
Fix your error checking as defined in the comments.
Define what you mean by "concurrent reads". You're concerned about this, but I'm not sure what you mean by it.
Is it necessary to call synchronize after every kernel call?
No, it isn't. If your code is not working correctly, synchronizing after every kernel call and then doing proper error checking will tell you if any kernels are not launching correctly. Synchronization is generally not needed for a relatively simple single-stream program like this one. The CUDA calls that need to synchronize, like cudaMemcpy, will do this automatically for you.
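For reference, a minimal sketch of the launch-checking pattern being referred to, applied to the first kernel in your code (the error handling here is illustrative, and exit requires <stdlib.h>):
initialize<<<numBlocks, THREADS_PER_BLOCK>>>(isPrime, n);
cudaError_t err = cudaGetLastError();        // catches launch/configuration errors
if (err != cudaSuccess) { fprintf(stderr, "launch: %s\n", cudaGetErrorString(err)); exit(1); }
err = cudaDeviceSynchronize();               // surfaces errors raised while the kernel runs
if (err != cudaSuccess) { fprintf(stderr, "run: %s\n", cudaGetErrorString(err)); exit(1); }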
Do I need to synchronize after memcpy's as well?
No, cudaMemcpy is synchronous in nature (it will force all CUDA calls in the same stream to complete before it begins, and it will not return control to the host thread until the copy is complete). If you don't want the blocking characteristic (not returning control to the host thread until complete), then you can use the cudaMemcpyAsync version of the call. You would use streams to get around the behavior of forcing all previous CUDA calls to complete.
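And a minimal sketch of the non-blocking variant mentioned here (illustrative only; for the copy to be truly asynchronous the host buffer must also be pinned, e.g. allocated with cudaHostAlloc):
cudaStream_t s;
cudaStreamCreate(&s);
// returns to the host thread immediately; ordered only with respect to other work in stream s
cudaMemcpyAsync(&primeCount, numPrimes, sizeof(int), cudaMemcpyDeviceToHost, s);
// block only at the point where the result is actually needed
cudaStreamSynchronize(s);
cudaStreamDestroy(s);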
Finally, how come when I set THREADS_PER_BLOCK to 512 it doesn't work?
Please define what you mean by "it doesn't work". I compiled your code with THREADS_PER_BLOCK of 512 and 256, and for an upper bound of 1000 it gave the same output in each case.