U-Boot: Unexpected problems porting code - c

I want to extend the u-boot SPL code with some fuzzy extractor logic by adding code into {u-boot_sources}/arch/arm/cpu/armv7/omap-common/hwinit-common.c. U-boot shall be used on a PandaBoard ES (omap4460 SoC).
Thus, first I successfully implemented the code on my x86 pc and I am porting it to the ARM-based PandaBoard. The complete code can be found here (as a side note the "main" function is s_init()):
However, I am experiencing dozens of unexpected effects, which result in the code stopping during execution, u-boot stopping after reading u-boot.img, or no output being sent at all (and thus no boot).
For example, I want to call two functions (computeSyndrome, decodeErrors) inside a for-loop, which is part of another function golayDecode.
For my first problem please ignore the code below the multiline comment starting with /* >>>> These lines of code below totally break u-boot. Also only the function computeSyndrome in conjunction with the calling function golayDecode is important.
The issue: If I comment out both functions computeSyndrome and decodeErrors, everything works fine and the OS (Android) boots. However, if computeSyndrome is not commented out and thus gets executed, u-boot gets stuck after displaying reading u-boot.img.
The funny thing about it: even if I replace computeSyndrome with a bogus function that does nothing but iterate over a value or print something, u-boot gets stuck as well.
Furthermore, if I remove the multiline comment further below to also include the remaining code, u-boot doesn't display any character. (1*)
I am a beginner regarding microprocessor programming, but I cannot figure out a possible error in these 12 lines of the computeSyndrome function, or the general behaviour of u-boot at all. (2*)
Does anyone have a clue what I am missing?
1* I am using minicom to display the output of u-boot, which I receive over serial-usb-converter.
2* I am using the following compiler flags to make sure there are no errors at compile time: -Wall -Wstrict-prototypes -Wdisabled-optimization -W -pedantic
void golayDecode(volatile int x[12], volatile int y[12], volatile unsigned int golayEncodedSecret[30], volatile unsigned int s, volatile unsigned char repetitionDecodedSecretBits[360]){
printf("\n[I] - Performing Golay decoding\r\n");
volatile unsigned char secret[22] = {0};
volatile unsigned char currentByte = 0, tmpByte = 0;
volatile unsigned int golayDecodedSecret[30] ={0};
volatile int twelveBitCounter = 0;//, j = 0, k = 0, q = 0, aux = 0, found = 0, bitCounter = 0, i_2 = 7, currentSecretEncByte = 0x00;
volatile int c_hat[2] = {0}, e[2] = {0};
e[0] = s;
e[1] = 0;
for(twelveBitCounter = 0; twelveBitCounter < 30; twelveBitCounter+=2){
printf("Computing syndrome and decoding errors for bytes %03x & %03x\n", golayEncodedSecret[twelveBitCounter], golayEncodedSecret[twelveBitCounter+1]);
computeSyndrome(golayEncodedSecret[twelveBitCounter], golayEncodedSecret[twelveBitCounter+1], x, y, s);
decodeErrors(golayEncodedSecret[i], golayEncodedSecret[i+1], x, y, s);
printf("\n[D] - Reconstructing secret bytes\r\n");
/* >>>> These lines of code below totally break u-boot
for(i = 0; i < 30; i+=2){
currentSecretEncByte = golayDecodedSecret[i];
volatile int j = 11;
// Access each source bit
for(; 0<=j; j--){
volatile int currentSourceBit = (currentSecretEncByte >> j) & 0x01;
repetitionDecodedSecretBits[bitCounter] = currentSourceBit;
k = 0;
for(i = 0; i<176; i++){
tmpByte = repetitionDecodedSecretBits[i] << i_2;
currentByte = currentByte | tmpByte;
if(i_2==0){ // We collected 8 bits and created a byte
secret[k] = currentByte;
i_2 = 7;
tmpByte = 0x00;
currentByte = 0x00;
SHA256_CTX ctx;
unsigned char hash[32];
printf("\n[I] - Generating secret key K\n");
sha256_update(&ctx,secret,strlen((const char*)secret));
printf("\n[I] - This is our secret key K\n\t==================================\n\t");
/* Function for syndrome computation */
void computeSyndrome(int r0, int r1, volatile int x[12], volatile int y[12], volatile unsigned int s){
    unsigned int syndromeBitCounter, syndromeMatrixCounter, syndromeAux;
    s = 0;
    for(syndromeMatrixCounter=0; syndromeMatrixCounter<12; syndromeMatrixCounter++){
        syndromeAux = 0;
        for(syndromeBitCounter=0; syndromeBitCounter<12; syndromeBitCounter++){
            syndromeAux = syndromeAux^((x[syndromeMatrixCounter]&r0)>>syndromeBitCounter &0x01);
        }
        for(syndromeBitCounter=0; syndromeBitCounter<12; syndromeBitCounter++){
            syndromeAux = syndromeAux^((y[syndromeMatrixCounter]&r1)>>syndromeBitCounter &0x01);
        }
        s = (s<<1)^syndromeAux;
    }
}
/* Funcion to recover original byte */
void decodeErrors(int r0, int r1, volatile int x[12], volatile int y[12], volatile unsigned int s){
//printf("\n[D] - Starting to decode errors for %3x | %3x\n", r0, r1);
volatile unsigned int c_hat[2] = {0xaa}, e[2] = {0xaa};
volatile unsigned int q;
unsigned int i, j, aux, found;
//printf("Step 2\n");
e[0] = s;
e[1] = 0;
/******* STEP 3 */
//printf("Step 3\n");
i = 0;
found = 0;
if (weight(s^y[i]) <=2){
e[0] = s^y[i];
e[1] = x[i];
found = 1;
printf("\ntest 2\n");
}while ((i<12) && (!found));
if (( i==12 ) && (!found)){
/******* STEP 4 */
//printf("Step 4\n");
q = 0;
for (j=0; j<12; j++){
aux = 0;
for (i=0; i<12; i++)
aux = aux ^ ( (y[j]&s)>>i & 0x01 );
q = (q<<1) ^ aux;
/******* STEP 5 */
//printf("Step 5\n");
if (weight(q) <=3){
e[0] = 0;
e[1] = q;
/******* STEP 6 */
//printf("Step 6\n");
i = 0;
found = 0;
if (weight(q^y[i]) <=2){
e[0] = x[i];
e[1] = q^y[i];
found = 1;
}while((i<12) && (!found));
if ((i==12) && (!found)){
/******* STEP 7 */
printf("\n[E] - uncorrectable error pattern! (%3x | %3x)\n", r0, r1);
/* You can raise a flag here, or output the vector as is */
c_hat[0] = r0^e[0];
c_hat[1] = r1^e[1];
//printf("\t\tEstimated codeword = %x%x\n", c_hat[0], c_hat[1]);

Indeed, the code was a little bit too complex to be executed at this point of boot time. At this stage there is no real CRT and I only have a minimal stack.
Thus, I moved the code to board_init_f() which is still part of the SPL. It gave more stable results and my algorithm now works as expected.
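Separately from the stack constraint described above, the posted computeSyndrome() takes s by value, so whatever it computes never reaches golayDecode(). A minimal sketch of one way around that is to return the syndrome instead. The name compute_syndrome and the unsigned parameter types are assumptions made for illustration; the bit logic follows the posted function, with the two inner loops merged since XOR order does not matter.

```c
/* Hypothetical variant of computeSyndrome(): returns the 12-bit syndrome
   instead of writing it to a by-value parameter that the caller never sees. */
unsigned int compute_syndrome(unsigned int r0, unsigned int r1,
                              const unsigned int x[12],
                              const unsigned int y[12])
{
    unsigned int s = 0;

    for (unsigned int row = 0; row < 12; row++) {
        unsigned int parity = 0;

        /* Parity over the 12 bits of (x[row] & r0) and (y[row] & r1). */
        for (unsigned int bit = 0; bit < 12; bit++) {
            parity ^= ((x[row] & r0) >> bit) & 1u;
            parity ^= ((y[row] & r1) >> bit) & 1u;
        }
        /* Row 0 ends up in the most significant syndrome bit. */
        s = (s << 1) ^ parity;
    }
    return s;
}
```

The caller would then write something like s = compute_syndrome(r0, r1, x, y); and pass that value on to the decoding step.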


converting base 4 code to letters

I'm working on an assembler project and I need to translate binary machine code into a "weird" base-4 code. For example,
if I get binary code like "0000-10-01-00", I should translate it to "aacba".
I have managed to translate the code to base 4, but I don't know how to continue from there or whether this is the right way to do it.
My code is below:
void intToBase4 (unsigned int *num)
int d[7];
int j,i=0;
double x=0;
for(x=0,j=i-1; j>=0; j--)
x += d[j]*pow(10,j);
(*num)=(unsigned int)x;
I've included a little 32-bit to num to letter converter for you to grasp the basics. It works a single "32-bit number" at a time. You could use this as a basis for an array based solution like you have half way done in your example, or change the type to be bigger, or whatever. It should show you roughly what you need to do:
void intToBase4 (uint32_t num, char *outString)
{
    // There are 16 digits per num in this example
    for(int i=0; i<16; i++)
    {
        // Grab the lowest 2 bits and convert to a letter.
        *outString++ = (num & 0x03) + 'a';
        // Shift next 2 bits low
        num >>= 2;
    }
    // NUL terminate string.
    *outString = '\0';
}
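Note that the converter above emits the lowest two bits first, so the letters come out in reverse order compared with the "0000-10-01-00" to "aacba" example in the question. A minimal sketch of a most-significant-digit-first variant follows. The function name to_base4_letters and the ndigits parameter are made up for illustration, and the mapping 0..3 to 'a'..'d' is assumed from the question's example.

```c
#include <stddef.h>

/* Hypothetical helper: writes `ndigits` base-4 letters, most significant
   digit first. `out` must have room for ndigits + 1 bytes. */
void to_base4_letters(unsigned int word, size_t ndigits, char *out)
{
    for (size_t i = 0; i < ndigits; i++) {
        /* The i-th digit from the left sits (ndigits - 1 - i) bit pairs up. */
        unsigned int shift = 2u * (unsigned int)(ndigits - 1 - i);
        out[i] = (char)('a' + ((word >> shift) & 0x3u));
    }
    out[ndigits] = '\0';
}
```

For the 10-bit word 0000100100 (0x24), calling to_base4_letters(0x24, 5, buf) fills buf with "aacba".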
A bit more universal one:
value - value to decode, buff - buff where result string will be stored, numofwrds - number of fields to be decoded, ... fields sizes in bits
example "xxxxyyvvzz": - 4 bits, two bits, two bits, two bits
decode(v, buff, 4, 4, 2, 2, 2);
#include <stdarg.h>
#include <stdlib.h>

char dictionary[] = "abcdefghijklmnopqrstuvwxyz";

char *decode(unsigned int value, char *buff, int numofwrds, ...)
{
    va_list vl;
    int *fieldsizes = malloc(sizeof(int) * numofwrds);
    int bitsize = 0;
    char *result = NULL;

    if (fieldsizes != NULL)
    {
        // First pass: collect the field widths and the total bit count.
        va_start(vl, numofwrds);
        for (int i = 0; i < numofwrds; i++)
        {
            fieldsizes[i] = va_arg(vl, int);
            bitsize += fieldsizes[i];
        }
        va_end(vl);
        // Second pass: extract each field, most significant field first.
        for (int i = 0; i < numofwrds; i++)
        {
            unsigned int mask, offset;
            mask = (1u << fieldsizes[i]) - 1;
            offset = bitsize - fieldsizes[i];
            mask <<= offset;
            buff[i] = dictionary[(value & mask) >> offset];
            bitsize -= fieldsizes[i];
        }
        result = buff;
        buff[numofwrds] = '\0';
        free(fieldsizes);
    }
    return result;
}

figure out why my RC4 Implementation doesn't produce the correct result

OK, I am new to C. I have programmed in C# for around 10 years, so I am still getting used to the language. I've been doing well in learning it but I'm still having a few hiccups. Currently I'm trying to write an implementation of RC4 as used on the Xbox 360 to encrypt KeyVault/Account data.
However, I've run into a snag: the code works but it outputs incorrect data. I have provided the original C# code I am working from, which I know works, and the snippet of code from my C project. Any help / pointers will be much appreciated :)
Original C# Code :
public struct RC4Session
public byte[] Key;
public int SBoxLen;
public byte[] SBox;
public int I;
public int J;
public static RC4Session RC4CreateSession(byte[] key)
RC4Session session = new RC4Session
Key = key,
I = 0,
J = 0,
SBoxLen = 0x100,
SBox = new byte[0x100]
for (int i = 0; i < session.SBoxLen; i++)
session.SBox[i] = (byte)i;
int index = 0;
for (int j = 0; j < session.SBoxLen; j++)
index = ((index + session.SBox[j]) + key[j % key.Length]) % session.SBoxLen;
byte num4 = session.SBox[index];
session.SBox[index] = session.SBox[j];
session.SBox[j] = num4;
return session;
public static void RC4Encrypt(ref RC4Session session, byte[] data, int index, int count)
int num = index;
session.I = (session.I + 1) % 0x100;
session.J = (session.J + session.SBox[session.I]) % 0x100;
byte num2 = session.SBox[session.I];
session.SBox[session.I] = session.SBox[session.J];
session.SBox[session.J] = num2;
byte num3 = data[num];
byte num4 = session.SBox[(session.SBox[session.I] + session.SBox[session.J]) % 0x100];
data[num] = (byte)(num3 ^ num4);
while (num != (index + count));
Now Here is my own c version :
typedef struct rc4_state {
int s_box_len;
uint8_t* sbox;
int i;
int j;
} rc4_state_t;
unsigned char* HMAC_SHA1(const char* cpukey, const unsigned char* hmac_key) {
unsigned char* digest = malloc(20);
digest = HMAC(EVP_sha1(), cpukey, 16, hmac_key, 16, NULL, NULL);
return digest;
void rc4_init(rc4_state_t* state, const uint8_t *key, int keylen)
state->i = 0;
state->j = 0;
state->s_box_len = 0x100;
state->sbox = malloc(0x100);
// Init sbox.
int i = 0, index = 0, j = 0;
uint8_t buf;
while(i < state->s_box_len) {
state->sbox[i] = (uint8_t)i;
while(j < state->s_box_len) {
index = ((index + state->sbox[j]) + key[j % keylen]) % state->s_box_len;
buf = state->sbox[index];
state->sbox[index] = (uint8_t)state->sbox[j];
state->sbox[j] = (uint8_t)buf;
void rc4_crypt(rc4_state_t* state, const uint8_t *inbuf, uint8_t **outbuf, int buflen)
int idx = 0;
uint8_t num, num2, num3;
*outbuf = malloc(buflen);
if (*outbuf) { // do not forget to test for failed allocation
while(idx != buflen) {
state->i = (int)(state->i + 1) % 0x100;
state->j = (int)(state->j + state->sbox[state->i]) % 0x100;
num = (uint8_t)state->sbox[state->i];
state->sbox[state->i] = (uint8_t)state->sbox[state->j];
state->sbox[state->j] = (uint8_t)num;
num2 = (uint8_t)inbuf[idx];
num3 = (uint8_t)state->sbox[(state->sbox[state->i] + (uint8_t)state->sbox[state->j]) % 0x100];
(*outbuf)[idx] = (uint8_t)(num2 ^ num3);
printf("%02X", (*outbuf)[idx]);
Usage (c#) :
byte[] cpukey = new byte[16]
byte[] hmac_key = new byte[16]
byte[] buf = new System.Security.Cryptography.HMACSHA1(cpukey).ComputeHash(hmac_key);
MessageBox.Show(BitConverter.ToString(buf).Replace("-", ""), "");
const char cpu_key[16] = { 0xXX, 0xXX, 0xXX };
const unsigned char hmac_key[16] = { ... };
unsigned char* buf = HMAC_SHA1(cpu_key, hmac_key);
uint8_t buf2[20];
uint8_t buf3[8] = { 0x1E, 0xF7, 0x94, 0x48, 0x22, 0x26, 0x89, 0x8E }; // Encrypted Xbox 360 data
uint8_t* buf4;
// Allocated 8 bytes out.
buf4 = malloc(8);
int num = 0;
while(num < 20) {
buf2[num] = (uint8_t)buf[num]; // convert const char
rc4_state_t* rc4 = malloc(sizeof(rc4_state_t));
rc4_init(rc4, buf2, 20);
rc4_crypt(rc4, buf3, &buf4, 8);
Now I have the HMAC-SHA1 figured out; I'm using OpenSSL for that and I have confirmed that I am getting the correct HMAC/decryption key. It's just the RC4 that isn't working. I'm trying to decrypt part of the KeyVault that should == "Xbox 360" || "58626F7820333630".
The output is currently "0000008108020000". I do not get any errors during compilation. Again, any help would be great ^.^
Thanks to John's help I was able to fix it. It was an error in the C# version, thanks John!
As I remarked in comments, your main problem appeared to involve how the output buffer is managed. You have since revised the question to fix that, but I describe it anyway here, along with some other alternatives for fixing it. The remaining problem is discussed at the end.
Function rc4_crypt() allocates an output buffer for itself, but it has no mechanism to communicate a pointer to the allocated space back to its caller. Your revised usage furthermore exhibits some inconsistency with rc4_crypt() with respect to how the output buffer is expected to be managed.
There are three main ways to approach the problem.
1. Function rc4_crypt() presently returns nothing, so you could let it continue to allocate the buffer itself, and modify it to return a pointer to the allocated output buffer.
2. You could modify the type of the outbuf parameter to uint8_t ** to enable rc4_crypt() to set the caller's pointer value indirectly.
3. You could rely on the caller to manage the output buffer, and make rc4_crypt() just write the output via the pointer passed to it.
The only one of those that might be tricky for you is #2; it would look something like this:
void rc4_crypt(rc4_state_t* state, const uint8_t *inbuf, uint8_t **outbuf, int buflen) {
    *outbuf = malloc(buflen);
    if (*outbuf) { // do not forget to test for failed allocation
        // ...
        (*outbuf)[idx] = (uint8_t)(num2 ^ num3);
        // ...
    }
}
And you would use it like this:
rc4_crypt(rc4, buf3, &buf4, 8);
... without otherwise allocating any memory for buf4.
The caller in any case has the responsibility for freeing the output buffer when it is no longer needed. This is clearer when it performs the allocation itself; you should document that requirement if rc4_crypt() is going to be responsible for the allocation.
The remaining problem appears to be strictly an output problem. You are apparently relying on print statements in rc4_crypt() to report on the encrypted data. I have no problem whatever with debugging via print statements, but you do need to be careful to print the data you actually want to examine. In this case you do not. You update the joint buffer index idx at the end of the encryption loop before printing a byte from the output buffer. As a result, at each iteration you print not the encrypted byte value you've just computed, but rather an indeterminate value that happens to be in the next position of the output buffer.
Move the idx++ to the very end of the loop to fix this problem, or change it from a while loop to a for loop and increment idx in the third term of the loop control statement. In fact, I strongly recommend for loops over while loops where the former are a good fit to the structure of the code (as here); I daresay you would not have made this mistake if your loop had been structured that way.
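To make the last two points concrete, here is a minimal sketch of the third approach (the caller owns the output buffer) written with for loops, so the index is only advanced in the loop header. It is not the asker's code: the names rc4_ctx, rc4_setup and rc4_apply are invented for this illustration, and the S-box is stored inline in the struct so that no allocation is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RC4 state; uint8_t indices wrap modulo 256 on their own. */
typedef struct {
    uint8_t sbox[256];
    uint8_t i, j;
} rc4_ctx;

/* Key scheduling: fill the S-box, then permute it with the key. */
void rc4_setup(rc4_ctx *ctx, const uint8_t *key, size_t keylen)
{
    unsigned int j = 0;

    for (unsigned int i = 0; i < 256; i++)
        ctx->sbox[i] = (uint8_t)i;

    for (unsigned int i = 0; i < 256; i++) {
        j = (j + ctx->sbox[i] + key[i % keylen]) & 0xFFu;
        uint8_t tmp = ctx->sbox[i];
        ctx->sbox[i] = ctx->sbox[j];
        ctx->sbox[j] = tmp;
    }
    ctx->i = 0;
    ctx->j = 0;
}

/* XORs len bytes of `in` with the keystream into caller-provided `out`. */
void rc4_apply(rc4_ctx *ctx, const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        ctx->i = (uint8_t)(ctx->i + 1);
        ctx->j = (uint8_t)(ctx->j + ctx->sbox[ctx->i]);

        uint8_t tmp = ctx->sbox[ctx->i];
        ctx->sbox[ctx->i] = ctx->sbox[ctx->j];
        ctx->sbox[ctx->j] = tmp;

        uint8_t k = ctx->sbox[(uint8_t)(ctx->sbox[ctx->i] + ctx->sbox[ctx->j])];
        out[n] = in[n] ^ k;   /* out[n] is final here; safe to print it now */
    }
}
```

Because n changes only in the loop header, any debug print placed after the assignment to out[n] sees the byte that was just computed. A quick sanity check is the widely published test vector: key "Key" with input "Plaintext" should give BBF316E8D940AF0AD3.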

Stack around the variable 'ch' was corrupted

I am in the process of writing a decipher algorithm for the Vigenère Variant Cipher and ran into some C-specific issues (I am not too familiar with C).
I get
"Run-Time Check Failure #2 - Stack around the variable 'ch' was corrupted" error.
If I understand the error right, ch is not available when I try to read/write to it (ch in this case represents a hex value read from the text file; I have posted the code of the function below).
But, for the life of me, I can't figure out where it happens. I close the file well before I exit the function (the exception is thrown at the time I leave the function).
Can you take a look and let me know where I have it wrong? Thanks in advance.
P.S. I am tagging the question with C++ as well, as it should pretty much be the same except, maybe, for how we read the file in.
Anyways, my code below:
int getKeyLength(char *cipherTxtF){
int potKeyL = 1;
float maxFreq = 0.00;
int winKL = 1;
for (potKeyL = 1; potKeyL <= 13; potKeyL++)// loop that is going through each key size startig at 1 and ending at 13
unsigned char ch;
FILE *cipherTxtFi;
cipherTxtFi = fopen(cipherTxtF, "r");
int fileCharCount = 0;
int freqCounter[256] = { 0 };
int nThCharCount = 0;
while (fscanf(cipherTxtFi, "%02X", &ch) != EOF) {
if (ch != '\n') {
if (fileCharCount % potKeyL == 0){
int asciiInd = (int)ch;
freqCounter[asciiInd] += 1;
float frequenciesArray[256] = { 0 };
float sumq_iSq = 0;
int k;
for (k = 0; k < 256; k++){
frequenciesArray[k] = freqCounter[k] / (float)nThCharCount;
for (k = 0; k < 256; k++){
sumq_iSq += frequenciesArray[k] * frequenciesArray[k];
printf("%f \n", sumq_iSq);
if (maxFreq < sumq_iSq) {
maxFreq = sumq_iSq;
winKL = potKeyL;
return winKL;
You are trying to read a hexadecimal integer with fscanf() (format "%02X", where X means "unsigned integer in hex format") and store it into a char.
Unfortunately, fscanf() just receives the address of the char and doesn't know that you haven't provided the address of an int. As an int is larger than a char, fscanf() writes past ch and the neighbouring stack memory gets corrupted.
A solution could be:
unsigned int myhex;   // %X expects a pointer to unsigned int
while (fscanf(cipherTxtFi, "%02X", &myhex) == 1) {
    ch = (unsigned char)myhex;
    // ...
}
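The same rule can be shown in a small self-contained sketch. It uses sscanf() on a string instead of fscanf() on a file so that it runs without an input file; the conversion rules are the same. The function name parse_hex_bytes is made up for this example. The loop tests for a return value of 1 rather than "not EOF", because a matching failure returns 0 and would otherwise loop forever.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: parses up to max_out two-digit hex values from text
   into out and returns how many bytes were stored. */
size_t parse_hex_bytes(const char *text, unsigned char *out, size_t max_out)
{
    size_t count = 0;
    unsigned int value;   /* %x stores an unsigned int, never a char */
    int consumed;         /* %n reports how many characters were read */

    while (count < max_out &&
           sscanf(text, "%2x%n", &value, &consumed) == 1) {
        out[count++] = (unsigned char)value;   /* narrow only after the read */
        text += consumed;
    }
    return count;
}
```

With a C99 library, another option is the hh length modifier ("%2hhx"), which tells the scanf family that the argument really is a pointer to unsigned char.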

C arrays calling functions

Below is my code for your reference. Rx_Buf contains one of the two values from Rfid_Tag[][]. I am getting that value from another function and I want to find and confirm it. The problem is that it's not working: Rfid_Tag[][] holds different values, i.e. it is corrupted. I am not sure how a global variable is getting corrupted. I tried declaring it const and extern, but the problem remains. When I run this as a standalone program it works perfectly, but when I call this function from main.c it does not work. Can anyone please help me with this?
unsigned char Rfid_Tag[NUMBER_OF_RFID_TAGS][RFID_DATA_LENGTH]= {
int RFID_check(char *Rx_Buf)
//unsigned char ucRfidReceivedData[RFID_DATA_LENGTH]= *Rx_Buf; //{0x00,0x01,0x2,0x03,0x04,0x5,0x06,0x07,0x8,0x09,0x0A,0xB,0x0C,0x0D,0xE,0x0F,0x10};
unsigned count = 0;
int found = false;
for(int i = 0; i < NUMBER_OF_RFID_TAGS; i++){
count = 0;
for(int j = 0; j < RFID_DATA_LENGTH; j++){
if(Rfid_Tag[i][j] == Rx_Buf[j]){
//PORTR.DIR = 0xff;
//PORTR.OUTTGL = 0xff;
if(count == RFID_DATA_LENGTH){
found = true;
if(found == true){
PORTR.DIR = 0xff;
PORTR.DIR = 0xff;
return 0;
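One thing that may be worth ruling out, independent of the reported corruption: Rx_Buf is a char * while the table holds unsigned char, so on a platform where plain char is signed, any byte of 0x80 or above compares unequal in Rfid_Tag[i][j] == Rx_Buf[j] even when the bits match. The sketch below compares with memcmp(), which treats both sides as unsigned char. The tag values, the data length of 4 and the name RFID_find are placeholders invented for this example, not taken from the question.

```c
#include <string.h>

#define NUMBER_OF_RFID_TAGS 2
#define RFID_DATA_LENGTH    4   /* assumed length, for illustration only */

/* Placeholder tag values; const makes the compiler reject accidental
   writes through this name. */
static const unsigned char Rfid_Tag[NUMBER_OF_RFID_TAGS][RFID_DATA_LENGTH] = {
    { 0x01, 0x02, 0x03, 0x04 },
    { 0xDE, 0xAD, 0xBE, 0xEF },
};

/* Returns the index of the matching tag, or -1 if no tag matches. */
int RFID_find(const void *Rx_Buf)
{
    for (int i = 0; i < NUMBER_OF_RFID_TAGS; i++) {
        /* memcmp compares bytes as unsigned char, whatever the
           signedness of plain char is on this target. */
        if (memcmp(Rfid_Tag[i], Rx_Buf, RFID_DATA_LENGTH) == 0)
            return i;
    }
    return -1;
}
```

Returning the index (or -1) also lets the caller tell which tag matched, rather than only toggling a port inside the lookup function.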

GPU code running slower than CPU version

I am working on an application which divides a string into pieces and assigns each piece to a block. Within each block the text is scanned character by character, and a shared array of int, D, is updated by different threads in parallel based on the character read. At the end of each iteration the last element of D is checked, and if it satisfies the condition, a global int array, matched, is set to 1 at the position corresponding to the text. This code was executed on an NVIDIA GeForce Fermi 550, and runs even slower than the CPU version. I have just included the kernel here:
__global__ void match(uint32_t* BB_d,const char* text_d,int n, int m,int k,int J,int lc,int start_addr,int tBlockSize,int overlap ,int* matched){
__shared__ int D[MAX_THREADS+2];
__shared__ char Text_S[MAX_PATTERN_SIZE];
__shared__ int DNew[MAX_THREADS+2];
__shared__ int BB_S[4][MAX_THREADS];
int w=threadIdx.x+1;
for(int i=0;i<4;i++)
BB_S[i][threadIdx.x]= BB_d[i*J+threadIdx.x];
D[threadIdx.x] = 0;
D[w] = (1<<(k+1)) -1;
for(int i = 0; i < lc - 1; i++)
D[w] = (D[w] << k+2) + (1<<(k+1)) -1;
D[J+1] = (1<<((k+2)*lc)) - 1;
int startblock=(blockIdx.x == 0?start_addr:(start_addr+(blockIdx.x * (tBlockSize-overlap))));
int size= (((startblock + tBlockSize) > n )? ((n- (startblock))):( tBlockSize));
int copyBlock=(size/J)+ ((size%J)==0?0:1);
if((threadIdx.x * copyBlock) <= size)
memcpy(Text_S+(threadIdx.x*copyBlock),text_d+(startblock+threadIdx.x*copyBlock),(((((threadIdx.x*copyBlock))+copyBlock) > size)?(size-(threadIdx.x*copyBlock)):copyBlock));
memcpy(DNew, D, (J+2)*sizeof(int));
uint32_t initial = D[1];
uint32_t x;
uint32_t mask = 1;
for(int i = 0; i < lc - 1; i++)mask = (mask<<(k+2)) + 1;
for(int i = 0; i < size;i++)
x = ((D[w] >> (k+2)) | (D[w - 1] << ((k + 2)* (lc - 1))) | (BB_S[(((int)Text_S[i])/2)%4][w-1])) & ((1 << (k + 2)* lc) - 1);
DNew[w] = ((D[w]<<1) | mask)
& (((D[w] << k+3) | mask|((D[w +1] >>((k+2)*(lc - 1)))<<1)))
& (((x + mask) ^ x) >> 1)
& initial;
memcpy(D, DNew, (J+2)*sizeof(int));
if(!(D[J] & 1<<(k + (k + 2)*(lc*J -m + k ))))
matched[startblock+i] = 1;
D[J] |= ((1<<(k + 1 + (k + 2)*(lc*J -m + k ))) - 1);
I am not very familiar with CUDA, so I don't quite understand issues such as shared memory bank conflicts. Could that be the bottleneck here?
As asked, this is the code where I launch the kernels:
#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#define uint32_t unsigned int
#define MAX_THREADS 512
#define MAX_PATTERN_SIZE 1024
#define MAX_BLOCKS 8
#define MAX_STREAMS 16
#define TEXT_MAX_LENGTH 1000000000
void calculateBBArray(uint32_t** BB,const char* pattern_h,int m,int k , int lc , int J){};
void checkCUDAError(const char *msg) {
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
fprintf(stderr, "Cuda error: %s: %s.\n", msg,
cudaGetErrorString( err) );
char* getTextString() {
FILE *input, *output;
char c;
char * inputbuffer=(char *)malloc(sizeof(char)*TEXT_MAX_LENGTH);
int numchars = 0, index = 0;
input = fopen("sequence.fasta", "r");
c = fgetc(input);
while(c != EOF)
inputbuffer[numchars] = c;
c = fgetc(input);
inputbuffer[numchars] = '\0';
return inputbuffer;
int main(void) {
char * text_h=getTextString(); //reading text from file, supported upto 200MB currently
int k = 13;
int i;
int count=0;
char *pattern_d, *text_d; // pointers to device memory
char* text_new_d;
int* matched_d;
int* matched_new_d;
uint32_t* BB_d;
uint32_t* BB_new_d;
int* matched_h = (int*)malloc(sizeof(int)* strlen(text_h));
cudaMalloc((void **) &pattern_d, sizeof(char)*strlen(pattern_h)+1);
cudaMalloc((void **) &text_d, sizeof(char)*strlen(text_h)+1);
cudaMalloc((void **) &matched_d, sizeof(int)*strlen(text_h));
cudaMemcpy(pattern_d, pattern_h, sizeof(char)*strlen(pattern_h)+1, cudaMemcpyHostToDevice);
cudaMemcpy(text_d, text_h, sizeof(char)*strlen(text_h)+1, cudaMemcpyHostToDevice);
cudaMemset(matched_d, 0,sizeof(int)*strlen(text_h));
int m = strlen(pattern_h);
int n = strlen(text_h);
uint32_t* BB_h[4];
unsigned int maxLc = ((((m-k)*(k+2)) > (31))?(31/(k+2)):(m-k));
unsigned int lc=2; // Determines the number of threads per block
// can be varied upto maxLc for tuning performance
unsigned int noWordorNfa =((m-k)/lc) + (((m-k)%lc) == 0?0:1);
cudaMalloc((void **) &BB_d, sizeof(int)*noWordorNfa*4);
if(noWordorNfa >= MAX_THREADS)
printf("Error: max threads\n");
calculateBBArray(BB_h,pattern_h,m,k,lc,noWordorNfa); // not included this function
cudaMemcpy(BB_d+ i*noWordorNfa, BB_h[i], sizeof(int)*noWordorNfa, cudaMemcpyHostToDevice);
int overlap=m;
int textBlockSize=(((m+k+1)>n)?n:(m+k+1));
cudaStream_t stream[MAX_STREAMS];
for(i=0;i<MAX_STREAMS;i++) {
cudaStreamCreate( &stream[i] );
int start_addr=0,index=0,maxNoBlocks=0;
maxNoBlocks=((1 + ((n-textBlockSize)/(textBlockSize-overlap)) + (((n-textBlockSize)%(textBlockSize-overlap)) == 0?0:1)));
int kernelBlocks = ((maxNoBlocks > MAX_BLOCKS)?MAX_BLOCKS:maxNoBlocks);
int blocksRemaining =maxNoBlocks;
printf(" maxNoBlocks %d kernel Blocks %d \n",maxNoBlocks,kernelBlocks);
while(blocksRemaining >0)
kernelBlocks = ((blocksRemaining > MAX_BLOCKS)?MAX_BLOCKS:blocksRemaining);
printf(" Calling %d Blocks with starting Address %d , textBlockSize %d \n",kernelBlocks,start_addr,textBlockSize);
blocksRemaining -= kernelBlocks;
cudaMemcpy(matched_h, matched_d, sizeof(int)*strlen(text_h), cudaMemcpyDeviceToHost);
checkCUDAError("Matched Function");
cudaStreamSynchronize( stream[i] );
// do stuff with matched
// ....
// ....
return 0;
The number of threads launched per block depends on the length of pattern_h (it can be at most maxLc above). I expect it to be around 30 in this case. Shouldn't that be enough to see a good amount of concurrency? As for blocks, I see no point in launching more than MAX_BLOCKS (=8) at a time, since the hardware can schedule only 8 simultaneously.
NOTE: I don't have GUI access.
With all the shared memory you're using, you could be running into bank conflicts if consecutive threads are not reading from consecutive addresses in the shared arrays ... that could cause serialization of the memory accesses, which in turn will kill the parallel performance of your algorithm.
I briefly looked at your code, and it looks like you're sending data back and forth to the GPU, creating a bottleneck on the bus. Did you try profiling it?
I found that I was copying the whole array DNew to D in each thread, rather than copying only the element each thread is supposed to update, D[w]. This would cause the threads to execute serially, although I don't know if it could be called a shared memory bank conflict. Now it gives an 8-9x speedup for large enough patterns (= more threads). This is much less than what I expected. I will try to increase the number of blocks as suggested. I don't know how to increase the number of threads.
