CUDA programming - passing nested structs to kernels - C

I'm new to CUDA C and I'm trying to parallelize the following piece of the slave_sort function, which, as you will see, is already parallelized to work with POSIX threads.
I have the following structs:
typedef struct {
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

struct global_memory {
    long Index; /* process ID */
    struct prefix_node prefix_tree[2 * MAX_PROCESSORS];
} *global;
void slave_sort(){
    .
    .
    .
    long *rank_me_mynum;
    struct prefix_node* n;
    struct prefix_node* r;
    struct prefix_node* l;
    .
    .
    MyNum = global->Index;
    global->Index++;
    n = &(global->prefix_tree[MyNum]);
    for (i = 0; i < radix; i++) {
        n->densities[i] = key_density[i];
        n->ranks[i] = rank_me_mynum[i];
    }
    offset = MyNum;
    level = number_of_processors >> 1;
    base = number_of_processors;
    while ((offset & 0x1) != 0) {
        offset >>= 1;
        r = n;
        l = n - 1;
        index = base + offset;
        n = &(global->prefix_tree[index]);
        if (offset != (level - 1)) {
            for (i = 0; i < radix; i++) {
                n->densities[i] = r->densities[i] + l->densities[i];
                n->ranks[i] = r->ranks[i] + l->ranks[i];
            }
        } else {
            for (i = 0; i < radix; i++) {
                n->densities[i] = r->densities[i] + l->densities[i];
            }
        }
        base += level;
        level >>= 1;
    }
MyNum is the processor number. After moving the code into a kernel, I want MyNum to be represented by blockIdx.x. The problem is that I get confused with the structs. I don't know how to pass them to the kernel. Can anyone help me?
Is the following code right?
__global__ void testkernel(prefix_node *prefix_tree, long *dev_rank_me_mynum, long *key_density, long radix)
{
    int i = threadIdx.x + blockIdx.x*blockDimx.x;
    prefix_node *n;
    prefix_node *l;
    prefix_node *r;
    long offset;
    .
    .
    .
    n = &prefix_tree[blockIdx.x];
    if((i%numthreads) == 0){
        for(int j=0; j<radix; j++){
            n->densities[j] = key_density[j + radix*blockIdx.x];
            n->ranks[i] = dev_rank_me_mynum[j + radix*blockIdx.x];
        }
        .
        .
        .
    }
int main(...){
    long *dev_rank_me_mynum;
    long *key_density;
    prefix_node *prefix_tree;
    long radix = 1024;
    cudaMalloc((void**)&dev_rank_me_mynum, radix*numblocks*sizeof(long));
    cudaMalloc((void**)&key_density, radix*numblocks*sizeof(long));
    cudaMalloc((void**)&prefix_tree, numblocks*sizeof(prefix_node));
    testkernel<<<numblocks,numthreads>>>(prefix_tree, dev_rank_me_mynum, key_density, radix);
}

The host API code you have posted in your edit looks fine. The prefix_node structure only contains statically declared arrays, so all that is needed is a single cudaMalloc call to allocate memory for the kernel to work on. Your method of passing prefix_tree to the kernel is also fine.
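For what it's worth, the same property means the results can be copied back to the host in one call as well. This is a hypothetical addition, not part of the posted code, reusing the prefix_tree device pointer and numblocks from the question:
    // Because prefix_node holds only fixed-size arrays, one cudaMemcpy
    // suffices to retrieve every node after the kernel has run.
    prefix_node *prefix_tree_h = (prefix_node *)malloc(numblocks * sizeof(prefix_node));
    cudaMemcpy(prefix_tree_h, prefix_tree, numblocks * sizeof(prefix_node),
               cudaMemcpyDeviceToHost);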
The kernel code, although incomplete and containing a couple of obvious typos, is another story. It seems that your intention is to only have a single thread per block operate on one "node" of the prefix_tree. That will be terribly inefficient and utilise only a small portion of the GPU's total capacity. For example why do this:
prefix_node *n = &prefix_tree[blockIdx.x];
if((i%numthreads) == 0){
    for(int j=0; j<radix; j++){
        n->densities[j] = key_density[j + radix*blockIdx.x];
        n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x];
    }
    .
    .
    .
}
when you could do this:
prefix_node *n = &prefix_tree[blockIdx.x];
for(int j=threadIdx.x; j<radix; j+=blockDim.x){
    n->densities[j] = key_density[j + radix*blockIdx.x];
    n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x];
}
which coalesces the memory reads and uses as many threads in the block as you choose to run, rather than just one, and should be many times faster as a result. So perhaps you should rethink your strategy of directly translating the serial C code you posted into a kernel.
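For illustration only, a minimal self-contained kernel built around that loop might look like the sketch below. It assumes one block per prefix_tree node and radix entries per node; the MAX_RADIX and PAGE_SIZE values are arbitrary stand-ins for the example, and only the density/rank copy stage is shown, not the rest of slave_sort.
    #define MAX_RADIX 1024
    #define PAGE_SIZE 4096

    typedef struct {
        long densities[MAX_RADIX];
        long ranks[MAX_RADIX];
        char pad[PAGE_SIZE];
    } prefix_node;

    // Sketch: every thread of the block strides over this block's node,
    // so the reads from key_density and dev_rank_me_mynum are coalesced.
    __global__ void testkernel(prefix_node *prefix_tree,
                               const long *dev_rank_me_mynum,
                               const long *key_density,
                               long radix)
    {
        prefix_node *n = &prefix_tree[blockIdx.x];
        for (long j = threadIdx.x; j < radix; j += blockDim.x) {
            n->densities[j] = key_density[j + radix * blockIdx.x];
            n->ranks[j]     = dev_rank_me_mynum[j + radix * blockIdx.x];
        }
    }
It would be launched exactly as in the host code above, testkernel<<<numblocks, numthreads>>>(prefix_tree, dev_rank_me_mynum, key_density, radix).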

Related

How to initialize a field in a struct from another struct? C

So I'm new to C programming and my assignment is to write a function (Max_way) that prints the driver with the longest total trip distance.
I'm using these 2 structs:
#define LEN 8

typedef struct
{
    unsigned ID;
    char name[LEN];
} Driver;

typedef struct
{
    unsigned T_id;
    char T_origin[LEN];
    char T_dest[LEN];
    unsigned T_way;
} Trip;
and a function to determine the total distance driven by a certain driver:
int Driver_way(Trip trips[], int size, unsigned id)
{
    int km = 0;
    for (int i = 0; i < size; i++)
    {
        if (id == trips[i].T_id)
        {
            km = km + trips[i].T_way;
        }
    }
    return km;
}
But when I try to print the details of a specific driver from an array of drivers, I get the correct ID and the correct distance in km, but the driver's name is not copied properly and I get a garbage string containing 1 character instead of 8.
I've also tried strcpy(max_driver.name, driver[i].name) with the same result.
void Max_way(Trip trips[], int size_of_trips, Driver drivers[], int size_of_drivers)
{
    int *km;
    int max = 0;
    Driver max_driver;
    km = (int*)malloc(sizeof(int) * (sizeof(drivers) / sizeof(Driver)));
    for (int i = 0; i < size_of_drivers; i++)
    {
        km[i] = Driver_way(trips, sizeof(trips), drivers[i].ID);
        for (int j = 1; j < size_of_drivers; j++)
        {
            if (km[j] > km[j - 1])
            {
                max = km[j];
                max_driver.ID = drivers[i].ID;
                max_driver.name = drivers[i].name;
            }
        }
    }
    printf("The driver who drove the most is:\n%d\n%s\n%d km\n", max_driver.ID, max_driver.name, max);
}
Any idea why this is happening?
Note that one cannot copy a string using a simple assignment operator; you must use strcpy (or similar) as follows:
if (km[j] > km[j - 1]) {
    max = km[j];
    max_driver.ID = drivers[i].ID;
    strcpy(max_driver.name, drivers[i].name);
}
Also note that since you were using ==, this was not even a simple assignment, but a comparison. Changing to == likely fixed a compile-time error, but it did NOT give you what you want.
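As a further illustration (a sketch only, going slightly beyond the strcpy fix), the selection can be collapsed into a single pass that keeps the best driver seen so far; it assumes size_of_trips and size_of_drivers are the real array lengths and reuses the Trip, Driver, and Driver_way definitions from the question:
    #include <stdio.h>
    #include <string.h>

    void Max_way(Trip trips[], int size_of_trips, Driver drivers[], int size_of_drivers)
    {
        int max = -1;
        Driver max_driver = {0};
        for (int i = 0; i < size_of_drivers; i++)
        {
            /* pass the real trip count, not sizeof(trips) */
            int km = Driver_way(trips, size_of_trips, drivers[i].ID);
            if (km > max)
            {
                max = km;
                max_driver.ID = drivers[i].ID;
                strcpy(max_driver.name, drivers[i].name); /* copy, don't assign */
            }
        }
        printf("The driver who drove the most is:\n%u\n%s\n%d km\n",
               max_driver.ID, max_driver.name, max);
    }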

Order array of struct by one chosen attribute in C

I'm having trouble ordering an array of a struct that contains 3 pieces of info. I want to order by one of them and have the other 2 reorganized accordingly, but I'm using heapsort and don't know how I would change the code to do that automatically. Is there any logical hint that would help me? Here's the struct:
typedef struct {
    int orig;
    int dest;
    float cost;
} infos;
Thanks. I can add any info that helps you help me!
[EDIT]
void build(Armaz x[], int n){
    int i;
    float val;
    int s;
    int f;
    for (i = 1; i < n; i++){
        val = x[i].cost;
        s = i;
        f = (s - 1) / 2;
        while (s > 0 && x[f].cost < val){
            x[s].cost = x[f].cost;
            s = f;
            f = (s - 1) / 2;
        }
        x[s].cost = val;
    }
}
void heapsort(infos x[], int n){
    build(x, n);
    int i;
    int s;
    int f;
    float vali;
    for (i = n - 1; i > 0; i--){
        vali = x[i].cost;
        x[i].cost = x[0].cost;
        f = 0;
        if (i == 1){
            s = -1;
        }
        else s = 1;
        if (i > 2 && x[2].cost > x[1].cost) s = 2;
        while (s >= 0 && vali < x[s].cost){
            x[f].cost = x[s].cost;
            f = s;
            s = 2 * f + 1;
            if (s + 1 <= i - 1 && x[s].cost < x[s + 1].cost) s++;
            if (s > i - 1) s = -1;
        }
        x[f].cost = vali;
    }
}
Even if you implement your own sorting functions, I recommend you take a closer look at how the standard qsort function works.
In short: It has an argument that is a pointer to a function used for comparison. That function receives pointers to two "objects" that should be compared, and can do whatever it wants to compare the two "objects". That means one could have three different comparison functions, one for each member of your structure, and pass the appropriate one when needed.
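For example, a cost-based comparator for the question's infos struct could look like this (a sketch, not part of the original answer); because qsort moves whole structs, orig and dest travel together with their cost automatically:
    #include <stdlib.h>

    /* compare two infos elements by their cost field, ascending */
    static int cmp_by_cost(const void *a, const void *b)
    {
        const infos *x = (const infos *)a;
        const infos *y = (const infos *)b;
        if (x->cost < y->cost) return -1;
        if (x->cost > y->cost) return 1;
        return 0;
    }

    /* usage: qsort(array, n, sizeof(infos), cmp_by_cost); */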

Simple reverb algorithm when buffer is small

I'm trying to implement the simple delay/reverb described in this post https://stackoverflow.com/a/5319085/1562784 and I have a problem. On Windows, where I record 16-bit/16 kHz samples and get 8k samples per recording callback call, it works fine. But on Linux I get much smaller chunks from the sound card, something around 150 samples. Because of that I modified the delay/reverb code to buffer samples:
#define REVERB_BUFFER_LEN 8000

static void reverb(int16_t* Buffer, int N)
{
    int i;
    float decay = 0.5f;
    static int16_t sampleBuffer[REVERB_BUFFER_LEN] = {0};
    // Make room at the end of the buffer to append new samples
    for (i = 0; i < REVERB_BUFFER_LEN - N; i++)
        sampleBuffer[i] = sampleBuffer[i + N];
    // Copy new chunk of audio samples to the end of the buffer
    for (i = 0; i < N; i++)
        sampleBuffer[REVERB_BUFFER_LEN - N + i] = Buffer[i];
    // Perform effect
    for (i = 0; i < REVERB_BUFFER_LEN - 1600; i++)
    {
        sampleBuffer[i + 1600] += (int16_t)((float)sampleBuffer[i] * decay);
    }
    // Copy output samples
    for (i = 0; i < N; i++)
        Buffer[i] = sampleBuffer[REVERB_BUFFER_LEN - N + i];
}
This results in white noise on output, so clearly I'm doing something wrong.
On Linux I record at 16-bit/16 kHz, the same as on Windows, and I'm running Linux in VMware.
Thank you!
Update:
As indicated in the answer, I was 'reverbing' old samples over and over again. A simple 'if' solved the problem:
for (i = 0; i < REVERB_BUFFER_LEN - 1600; i++)
{
    if ((i + 1600) >= REVERB_BUFFER_LEN - N)
        sampleBuffer[i + 1600] += (int16_t)((float)sampleBuffer[i] * decay);
}
Your loop that performs the actual reverb effect will be performed multiple times on the same samples, on different calls to the function. This is because you save old samples in the buffer, but you perform the reverb on all samples each time. This will likely cause them to overflow at some point.
You should only perform the reverb on the new samples, not on ones which have already been modified. I would also recommend checking for overflow and clipping to the min/max values instead of wrapping in that case.
A probably better way to perform reverb, which will work for any input buffer size, is to maintain a circular buffer of size REVERB_SAMPLES (1600 in your case), which contains the last samples.
void reverb(int16_t* buf, int len) {
    const float decay = 0.5f;   // decay factor (0.5f as in the question)
    static int16_t reverb_buf[REVERB_SAMPLES] = {0};
    static int reverb_pos = 0;
    for (int i = 0; i < len; i++) {
        int16_t new_value = buf[i] + reverb_buf[reverb_pos] * decay;
        reverb_buf[reverb_pos] = new_value;
        buf[i] = new_value;
        reverb_pos = (reverb_pos + 1) % REVERB_SAMPLES;
    }
}
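As a side note on the clipping recommended above, here is a hypothetical helper (not from the original answer) that saturates a sum to the int16_t range instead of letting it wrap:
    #include <stdint.h>

    static int16_t add_clip_s16(int32_t a, int32_t b)
    {
        int32_t s = a + b;                    // widen before adding
        if (s > INT16_MAX) return INT16_MAX;  // clip positive overflow
        if (s < INT16_MIN) return INT16_MIN;  // clip negative overflow
        return (int16_t)s;
    }

    // e.g. buf[i] = add_clip_s16(buf[i], (int32_t)(reverb_buf[reverb_pos] * decay));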

Multiplying large numbers through strings

I'm trying to write a program that will receive 2 strings representing numbers of any length
(for instance, char *a = "10000000000000";, char *b = "9999999999999999";) and multiply them.
This is what I came up with so far; I'm not sure how to continue (nullify simply fills the whole string with '0'):
char *multiply(char *hnum, const char *other)
{
    int num1 = 0, num2 = 0, carry = 0, hnumL = 0, otherL = 0, i = 0, temp1L = 0, temp2L = 0, n = 0;
    char *temp1, *temp2;
    if(!hnum || !other) return NULL;
    for(hnumL = 0; hnum[hnumL] != '\0'; hnumL++);
    for(otherL = 0; other[otherL] != '\0'; otherL++);
    temp1 = (char*)malloc(otherL + hnumL);
    if(!temp1) return NULL;
    temp2 = (char*)malloc(otherL + hnumL);
    if(!temp2) return NULL;
    nullify(temp1);
    nullify(temp2);
    hnumL--;
    otherL--;
    for(otherL; otherL >= 0; otherL--)
    {
        carry = 0;
        num1 = other[otherL] - '0';
        for(hnumL; hnumL >= 0; hnumL--)
        {
            num2 = hnum[hnumL] - '0';
            temp1[i+n] = (char)(((int)'0') + ((num1 * num2 + carry) % 10));
            carry = (num1 * num2 + carry) / 10;
            i++;
            temp1L++;
        }
        if(carry > 0)
        {
            temp1[i+n] = (char)(((int)'0') + carry);
            temp1L++;
        }
P.S. Is there a library that handles this already? I couldn't find anything like it.
On paper, you would probably do as follows:
999 x 99
--------
    8991
   8991
========
   98901
The process is to multiply individual digits starting from the right of each number and add them up, keeping a carry in mind each time ("9 times 9 equals 81, write 1, keep 8 in mind"). I'm pretty sure you covered that in elementary school, didn't you?
The process can be easily put into an algorithm:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct result
{
    int carry;
    int res;
};

/*
 * multiply two numbers between 0 and 9 into result.res. If there is a carry, put it into
 * result.carry
 */
struct result mul(int a, int b)
{
    struct result res;
    res.res = a * b;
    if (res.res > 9)
    {
        res.carry = res.res / 10;
        res.res %= 10;
    }
    else
        res.carry = 0;
    return res;
}

/*
 * add
 * adds a digit (b) to str at pos. If the result generates a carry,
 * it's added also (recursively)
 */
void add(char str[], int pos, int b)
{
    int res;
    int carry;
    res = str[pos] - '0' + b;
    if (res > 9)
    {
        carry = res / 10;
        res %= 10;
        add(str, pos - 1, carry);
    }
    str[pos] = res + '0';
}

void nullify(char *numstr, int len)
{
    while (--len >= 0)
        numstr[len] = '0';
}

int main(void)
{
    struct result res;
    char *mp1 = "999";
    char *mp2 = "999";
    char sum[strlen(mp1) + strlen(mp2) + 1];
    int i;
    int j;
    nullify(sum, strlen(mp1) + strlen(mp2));
    sum[strlen(mp1) + strlen(mp2)] = '\0'; /* terminate the result string */
    for (i = strlen(mp2) - 1; i >= 0; i--)
    {
        /* iterate from the right over the second multiplicand */
        for (j = strlen(mp1) - 1; j >= 0; j--)
        {
            /* iterate from the right over the first multiplicand */
            res = mul((mp2[i] - '0'), (mp1[j] - '0'));
            add(sum, i + j + 1, res.res); /* add sum */
            add(sum, i + j, res.carry);   /* add carry */
        }
    }
    printf("%s * %s = %s\n", mp1, mp2, sum);
    return 0;
}
This is just the same as on paper, except that you don't need to remember the individual summands, since we add everything up on the fly.
This might not be the fastest way to do it, but it doesn't need malloc() (provided you have a C99 compiler, otherwise you would need to dynamically allocate sum) and works for arbitrarily long numbers (up to the stack limit, since add() is implemented as a recursive function).
Yes, there are libraries that handle this. It's actually a pretty big subject area that a lot of research has gone into. I haven't looked through your code that closely, but I know that the library implementations of big-num operations use very efficient algorithms that you're unlikely to discover on your own. For example, the multiplication routine we all learned in grade school (pre common-core) is an O(n^2) solution, but there exist ways to solve it in roughly O(n^1.5).
The standard GNU C big-num library is GNU MP:
https://gmplib.org/
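For reference, a minimal (untested) sketch of doing the question's multiplication with GMP; it assumes the library is installed and the program is linked with -lgmp:
    #include <stdio.h>
    #include <gmp.h>

    int main(void)
    {
        mpz_t a, b, product;
        mpz_init_set_str(a, "10000000000000", 10);   /* parse decimal string */
        mpz_init_set_str(b, "9999999999999999", 10);
        mpz_init(product);
        mpz_mul(product, a, b);                      /* product = a * b */
        gmp_printf("%Zd\n", product);                /* print the full result */
        mpz_clears(a, b, product, NULL);
        return 0;
    }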

GPU code running slower than CPU version

I am working on an application which divides a string into pieces and assigns each to a block. Within each block the text is scanned character by character, and a shared array of int, D, is updated by different threads in parallel based on the character read. At the end of each iteration the last element of D is checked, and if it satisfies the condition, a global int array m is set to 1 at the position corresponding to the text. This code was executed on an NVIDIA GeForce Fermi 550, and it runs even slower than the CPU version. I have just included the kernel here:
__global__ void match(uint32_t* BB_d, const char* text_d, int n, int m, int k, int J, int lc, int start_addr, int tBlockSize, int overlap, int* matched){
    __shared__ int D[MAX_THREADS+2];
    __shared__ char Text_S[MAX_PATTERN_SIZE];
    __shared__ int DNew[MAX_THREADS+2];
    __shared__ int BB_S[4][MAX_THREADS];
    int w = threadIdx.x + 1;
    for(int i = 0; i < 4; i++)
    {
        BB_S[i][threadIdx.x] = BB_d[i*J + threadIdx.x];
    }
    {
        D[threadIdx.x] = 0;
        {
            D[w] = (1<<(k+1)) - 1;
            for(int i = 0; i < lc - 1; i++)
            {
                D[w] = (D[w] << k+2) + (1<<(k+1)) - 1;
            }
        }
        D[J+1] = (1<<((k+2)*lc)) - 1;
    }
    int startblock = (blockIdx.x == 0 ? start_addr : (start_addr + (blockIdx.x * (tBlockSize-overlap))));
    int size = (((startblock + tBlockSize) > n) ? ((n - (startblock))) : (tBlockSize));
    int copyBlock = (size/J) + ((size%J) == 0 ? 0 : 1);
    if((threadIdx.x * copyBlock) <= size)
        memcpy(Text_S + (threadIdx.x*copyBlock), text_d + (startblock + threadIdx.x*copyBlock), (((((threadIdx.x*copyBlock)) + copyBlock) > size) ? (size - (threadIdx.x*copyBlock)) : copyBlock));
    memcpy(DNew, D, (J+2)*sizeof(int));
    __syncthreads();
    uint32_t initial = D[1];
    uint32_t x;
    uint32_t mask = 1;
    for(int i = 0; i < lc - 1; i++) mask = (mask<<(k+2)) + 1;
    for(int i = 0; i < size; i++)
    {
        {
            x = ((D[w] >> (k+2)) | (D[w - 1] << ((k + 2)*(lc - 1))) | (BB_S[(((int)Text_S[i])/2)%4][w-1])) & ((1 << (k + 2)*lc) - 1);
            DNew[w] = ((D[w]<<1) | mask)
                    & (((D[w] << k+3) | mask | ((D[w+1] >> ((k+2)*(lc - 1)))<<1)))
                    & (((x + mask) ^ x) >> 1)
                    & initial;
        }
        __syncthreads();
        memcpy(D, DNew, (J+2)*sizeof(int));
        if(!(D[J] & 1<<(k + (k + 2)*(lc*J - m + k))))
        {
            matched[startblock + i] = 1;
            D[J] |= ((1<<(k + 1 + (k + 2)*(lc*J - m + k))) - 1);
        }
    }
}
I am not very familiar with CUDA, so I don't quite understand issues such as shared memory bank conflicts. Could that be the bottleneck here?
As asked, this is the code where I launch the kernels:
#include <stdio.h>
#include <assert.h>
#include <cuda.h>

#define uint32_t unsigned int
#define MAX_THREADS 512
#define MAX_PATTERN_SIZE 1024
#define MAX_BLOCKS 8
#define MAX_STREAMS 16
#define TEXT_MAX_LENGTH 1000000000

void calculateBBArray(uint32_t** BB, const char* pattern_h, int m, int k, int lc, int J){};

void checkCUDAError(const char *msg) {
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
char* getTextString() {
    FILE *input, *output;
    char c;
    char * inputbuffer = (char *)malloc(sizeof(char)*TEXT_MAX_LENGTH);
    int numchars = 0, index = 0;
    input = fopen("sequence.fasta", "r");
    c = fgetc(input);
    while(c != EOF)
    {
        inputbuffer[numchars] = c;
        numchars++;
        c = fgetc(input);
    }
    fclose(input);
    inputbuffer[numchars] = '\0';
    return inputbuffer;
}
int main(void) {
    const char pattern_h[] = "TACACGAGGAGAGGAGAAGAACAACGCGACAGCAGCAGACTTTTTTTTTTTTACAC";
    char * text_h = getTextString(); //reading text from file, supported upto 200MB currently
    int k = 13;
    int i;
    int count = 0;
    char *pattern_d, *text_d; // pointers to device memory
    char* text_new_d;
    int* matched_d;
    int* matched_new_d;
    uint32_t* BB_d;
    uint32_t* BB_new_d;
    int* matched_h = (int*)malloc(sizeof(int)*strlen(text_h));
    cudaMalloc((void **) &pattern_d, sizeof(char)*strlen(pattern_h)+1);
    cudaMalloc((void **) &text_d, sizeof(char)*strlen(text_h)+1);
    cudaMalloc((void **) &matched_d, sizeof(int)*strlen(text_h));
    cudaMemcpy(pattern_d, pattern_h, sizeof(char)*strlen(pattern_h)+1, cudaMemcpyHostToDevice);
    cudaMemcpy(text_d, text_h, sizeof(char)*strlen(text_h)+1, cudaMemcpyHostToDevice);
    cudaMemset(matched_d, 0, sizeof(int)*strlen(text_h));
    int m = strlen(pattern_h);
    int n = strlen(text_h);
    uint32_t* BB_h[4];
    unsigned int maxLc = ((((m-k)*(k+2)) > (31)) ? (31/(k+2)) : (m-k));
    unsigned int lc = 2; // Determines the number of threads per block
                         // can be varied upto maxLc for tuning performance
    if(lc > maxLc)
    {
        exit(0);
    }
    unsigned int noWordorNfa = ((m-k)/lc) + (((m-k)%lc) == 0 ? 0 : 1);
    cudaMalloc((void **) &BB_d, sizeof(int)*noWordorNfa*4);
    if(noWordorNfa >= MAX_THREADS)
    {
        printf("Error: max threads\n");
        exit(0);
    }
    calculateBBArray(BB_h, pattern_h, m, k, lc, noWordorNfa); // not included this function
    for(i = 0; i < 4; i++)
    {
        cudaMemcpy(BB_d + i*noWordorNfa, BB_h[i], sizeof(int)*noWordorNfa, cudaMemcpyHostToDevice);
    }
    int overlap = m;
    int textBlockSize = (((m+k+1) > n) ? n : (m+k+1));
    cudaStream_t stream[MAX_STREAMS];
    for(i = 0; i < MAX_STREAMS; i++) {
        cudaStreamCreate( &stream[i] );
    }
    int start_addr = 0, index = 0, maxNoBlocks = 0;
    if(textBlockSize > n)
    {
        maxNoBlocks = 1;
    }
    else
    {
        maxNoBlocks = ((1 + ((n-textBlockSize)/(textBlockSize-overlap)) + (((n-textBlockSize)%(textBlockSize-overlap)) == 0 ? 0 : 1)));
    }
    int kernelBlocks = ((maxNoBlocks > MAX_BLOCKS) ? MAX_BLOCKS : maxNoBlocks);
    int blocksRemaining = maxNoBlocks;
    printf(" maxNoBlocks %d kernel Blocks %d \n", maxNoBlocks, kernelBlocks);
    while(blocksRemaining > 0)
    {
        kernelBlocks = ((blocksRemaining > MAX_BLOCKS) ? MAX_BLOCKS : blocksRemaining);
        printf(" Calling %d Blocks with starting Address %d , textBlockSize %d \n", kernelBlocks, start_addr, textBlockSize);
        match<<<kernelBlocks, noWordorNfa, 0, stream[(index++)%MAX_STREAMS]>>>(BB_d, text_d, n, m, k, noWordorNfa, lc, start_addr, textBlockSize, overlap, matched_d);
        start_addr += kernelBlocks*(textBlockSize-overlap);
        blocksRemaining -= kernelBlocks;
    }
    cudaMemcpy(matched_h, matched_d, sizeof(int)*strlen(text_h), cudaMemcpyDeviceToHost);
    checkCUDAError("Matched Function");
    for(i = 0; i < MAX_STREAMS; i++)
        cudaStreamSynchronize( stream[i] );
    // do stuff with matched
    // ....
    // ....
    free(matched_h); cudaFree(pattern_d); cudaFree(text_d); cudaFree(matched_d);
    return 0;
}
The number of threads launched per block depends upon the length of pattern_h (it can be at most maxLc above). I expect it to be around 30 in this case. Shouldn't that be enough to see a good amount of concurrency? As for blocks, I see no point in launching more than MAX_BLOCKS (=10) at a time since the hardware can schedule only 8 simultaneously.
NOTE: I don't have GUI access.
With all the shared memory you're using, you could be running into bank conflicts if consecutive threads are not reading from consecutive addresses in the shared arrays ... that could cause serialization of the memory accesses, which in turn will kill the parallel performance of your algorithm.
I briefly looked at your code, but it looks like you're sending data back and forth to the GPU, creating a bottleneck on the bus. Did you try profiling it?
I found that I was copying the whole array DNew to D in each thread, rather than copying only the portion each thread was supposed to update, D[w]. This would cause the threads to execute serially, although I don't know if it could be called a shared memory bank conflict. Now it gives an 8-9x speedup for large enough patterns (= more threads). This is much less than what I expected. I will try to increase the number of blocks as suggested. I don't know how to increase the number of threads.
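For reference, a sketch of that change (assuming blockDim.x == J as in the launch above): instead of each thread calling memcpy(D, DNew, (J+2)*sizeof(int)), each thread writes back only its own slot, with barriers so no thread reads a half-updated D:
    __syncthreads();
    D[w] = DNew[w]; // w = threadIdx.x + 1, as in the kernel;
                    // D[0] and D[J+1] appear unchanged inside the loop,
                    // so they should not need to be copied each iteration
    __syncthreads();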
