I'm having a problem with a program. I have read about segmentation faults, but I don't understand them very well; the only thing I know is that presumably I am trying to access memory I shouldn't. The problem is that I look at my code and don't understand what I am doing wrong.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define lambda 2.0
#define g 1.0
#define Lx 100
#define F0 1.0
#define Tf 10
#define h 0.1
#define e 0.00001

FILE *file;
double F[1000][1000000];

void Inicio(double D[1000][1000000]) {
    int i;
    for (i=399; i<600; i++) {
        D[i][0]=F0;
    }
}

void Iteration (double A[1000][1000000]) {
    long int i,k;
    for (i=1; i<1000000; i++) {
        A[0][i] = A[0][i-1]
                  + e/(h*h*h*h)*g*g*(A[2][i-1] - 4.0*A[1][i-1] + 6.0*A[0][i-1] - 4.0*A[998][i-1] + A[997][i-1])
                  + 2.0*g*e/(h*h)*(A[1][i-1] - 2*A[0][i-1] + A[998][i-1])
                  + e*A[0][i-1]*(lambda - A[0][i-1]*A[0][i-1]);
        A[1][i] = A[1][i-1]
                  + e/(h*h*h*h)*g*g*(A[3][i-1] - 4.0*A[2][i-1] + 6.0*A[1][i-1] - 4.0*A[0][i-1] + A[998][i-1])
                  + 2.0*g*e/(h*h)*(A[2][i-1] - 2*A[1][i-1] + A[0][i-1])
                  + e*A[1][i-1]*(lambda - A[1][i-1]*A[1][i-1]);
        for (k=2; k<997; k++) {
            A[k][i] = A[k][i-1]
                      + e/(h*h*h*h)*g*g*(A[k+2][i-1] - 4.0*A[k+1][i-1] + 6.0*A[k][i-1] - 4.0*A[k-1][i-1] + A[k-2][i-1])
                      + 2.0*g*e/(h*h)*(A[k+1][i-1] - 2*A[k][i-1] + A[k-1][i-1])
                      + e*A[k][i-1]*(lambda - A[k][i-1]*A[k][i-1]);
        }
        A[997][i] = A[997][i-1]
                    + e/(h*h*h*h)*g*g*(A[0][i-1] - 4*A[998][i-1] + 6*A[997][i-1] - 4*A[996][i-1] + A[995][i-1])
                    + 2.0*g*e/(h*h)*(A[998][i-1] - 2*A[997][i-1] + A[996][i-1])
                    + e*A[997][i-1]*(lambda - A[997][i-1]*A[997][i-1]);
        A[998][i] = A[998][i-1]
                    + e/(h*h*h*h)*g*g*(A[1][i-1] - 4*A[0][i-1] + 6*A[998][i-1] - 4*A[997][i-1] + A[996][i-1])
                    + 2.0*g*e/(h*h)*(A[0][i-1] - 2*A[998][i-1] + A[997][i-1])
                    + e*A[998][i-1]*(lambda - A[998][i-1]*A[998][i-1]);
        A[999][i]=A[0][i];
    }
}

main() {
    long int i,j;
    Inicio(F);
    Iteration(F);
    file = fopen("P1.txt","wt");
    for (i=0; i<1000000; i++) {
        for (j=0; j<1000; j++) {
            fprintf(file,"%lf \t %.4f \t %lf\n", 1.0*j/10.0, 1.0*i, F[j][i]);
        }
    }
    fclose(file);
}
Thanks for your time.
This declaration:
double F[1000][1000000];
would occupy 8 * 1000 * 1000000 bytes on a typical x86 system (8-byte double), which is roughly 7.45 GiB. Chances are your system is running out of memory when trying to execute your code, which results in a segmentation fault.
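If you keep an array that size, one way (just a sketch; the NROWS/NCOLS names are mine) to at least fail gracefully is to allocate it on the heap and check the result:
#include <stdio.h>
#include <stdlib.h>

#define NROWS 1000
#define NCOLS 1000000

int main(void)
{
    /* about 8 GB: NROWS * NCOLS * sizeof(double); malloc lets us detect failure */
    double (*F)[NCOLS] = malloc(sizeof(double[NROWS][NCOLS]));
    if (F == NULL) {
        fprintf(stderr, "not enough memory for %zu bytes\n",
                sizeof(double[NROWS][NCOLS]));
        return 1;
    }
    /* ... use F[i][j] as before ... */
    free(F);
    return 0;
}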
Your array is occupying roughly 8 GB of memory (1,000 x 1,000,000 x sizeof(double) bytes). That might be a factor in your problem. It is a global variable rather than a stack variable, so you may be OK, but you're pushing limits here.
Writing that much data to a file is going to take a while.
You don't check that the file was opened successfully, which could be a source of trouble, too (if it did fail, a segmentation fault is very likely).
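For example, a minimal check right after the fopen() call (sketch only) would be:
file = fopen("P1.txt", "wt");
if (file == NULL) {
    perror("fopen");   /* say why the file could not be opened */
    return 1;
}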
You really should introduce some named constants for 1,000 and 1,000,000; what do they represent?
You should also write a function to do the calculation; you could use an inline function in C99 or later (or C++). The repetition in the code is excruciating to behold.
You should also use C99 notation for main(), with the explicit return type (and preferably void for the argument list when you are not using argc or argv):
int main(void)
Out of idle curiosity, I took a copy of your code, changed all occurrences of 1000 to ROWS, all occurrences of 1000000 to COLS, and then created enum { ROWS = 1000, COLS = 10000 }; (thereby reducing the problem size by a factor of 100). I made a few minor changes so it would compile cleanly under my preferred set of compilation options (nothing serious: static in front of the functions, and the main array; file becomes a local to main; error check the fopen(), etc.).
I then created a second copy and created an inline function to do the repeated calculation (and a second one to do the subscript calculations). This means that the monstrous expression is written out only once, which is highly desirable as it ensures consistency.
#include <stdio.h>
#define lambda 2.0
#define g 1.0
#define F0 1.0
#define h 0.1
#define e 0.00001
enum { ROWS = 1000, COLS = 10000 };
static double F[ROWS][COLS];
static void Inicio(double D[ROWS][COLS])
{
for (int i = 399; i < 600; i++) // Magic numbers!!
D[i][0] = F0;
}
enum { R = ROWS - 1 };
static inline int ko(int k, int n)
{
int rv = k + n;
if (rv >= R)
rv -= R;
else if (rv < 0)
rv += R;
return(rv);
}
static inline void calculate_value(int i, int k, double A[ROWS][COLS])
{
int ks2 = ko(k, -2);
int ks1 = ko(k, -1);
int kp1 = ko(k, +1);
int kp2 = ko(k, +2);
A[k][i] = A[k][i-1]
+ e/(h*h*h*h) * g*g * (A[kp2][i-1] - 4.0*A[kp1][i-1] + 6.0*A[k][i-1] - 4.0*A[ks1][i-1] + A[ks2][i-1])
+ 2.0*g*e/(h*h) * (A[kp1][i-1] - 2*A[k][i-1] + A[ks1][i-1])
+ e * A[k][i-1] * (lambda - A[k][i-1] * A[k][i-1]);
}
static void Iteration(double A[ROWS][COLS])
{
for (int i = 1; i < COLS; i++)
{
for (int k = 0; k < R; k++)
calculate_value(i, k, A);
A[999][i] = A[0][i];
}
}
int main(void)
{
FILE *file = fopen("P2.txt","wt");
if (file == 0)
return(1);
Inicio(F);
Iteration(F);
for (int i = 0; i < COLS; i++)
{
for (int j = 0; j < ROWS; j++)
{
fprintf(file,"%lf \t %.4f \t %lf\n", 1.0*j/10.0, 1.0*i, F[j][i]);
}
}
fclose(file);
return(0);
}
This program writes to P2.txt instead of P1.txt. I ran both programs and compared the output files; the output was identical. When I ran the programs on a mostly idle machine (MacBook Pro, 2.3 GHz Intel Core i7, 16 GiB 1333 MHz RAM, Mac OS X 10.7.5, GCC 4.7.1), I got reasonably but not wholly consistent timing:
Original Modified
6.334s 6.367s
6.241s 6.231s
6.315s 10.778s
6.378s 6.320s
6.388s 6.293s
6.285s 6.268s
6.387s 10.954s
6.377s 6.227s
8.888s 6.347s
6.304s 6.286s
6.258s 10.302s
6.975s 6.260s
6.663s 6.847s
6.359s 6.313s
6.344s 6.335s
7.762s 6.533s
6.310s 9.418s
8.972s 6.370s
6.383s 6.357s
However, almost all that time is spent on disk I/O. I reduced the disk I/O to just the very last row of data, so the outer I/O for loop became:
for (int i = COLS - 1; i < COLS; i++)
and the timings were vastly reduced and much more consistent:
Original Modified
0.168s 0.165s
0.145s 0.165s
0.165s 0.166s
0.164s 0.163s
0.151s 0.151s
0.148s 0.153s
0.152s 0.171s
0.165s 0.165s
0.173s 0.176s
0.171s 0.165s
0.151s 0.169s
The simplification in the code from having the ghastly expression written out just once is very beneficial, it seems to me. I'd certainly far rather have to maintain that program than the original.
What system are you running on? Do you have access to some sort of debugger (gdb, visual studio's debugger, etc.)?
That would give us valuable information, like the line of code where the program crashes... Also, the amount of memory may be prohibitive.
Additionally, may I recommend that you replace the numeric limits by named definitions?
As such:
#define DIM1_SZ 1000
#define DIM2_SZ 1000000
Use those whenever you wish to refer to the array dimension limits. It will help avoid typing errors.
Run your program under valgrind or linked against efence. That will tell you where the bad pointer is being dereferenced, and fixing all the errors they report will most likely fix your problem.
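For example (assuming GCC on Linux; program.c is a placeholder name):
$ gcc -g -O0 program.c -o program
$ valgrind ./program
$ gcc -g program.c -o program -lefence && ./program
When the program is built with -g, valgrind reports invalid reads and writes with the offending line numbers.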
Related
I wrote code for a problem on Codeforces, and even though I believe my solution has the best possible time complexity, it was exceeding the time limit on the 7th test case. After some testing, it seemed that most of the time was being taken by printf, which seemed odd, since calling printf some 3 * 10^5 times shouldn't be such a big deal. So I searched a lot and found this: https://codeforces.com/blog/entry/105687#comment-940911
I concluded that using this line at the top of my code would make printf faster:
#define __USE_MINGW_ANSI_STDIO 0
So I ran my code with the above line included and, voila, what had been exceeding the 1 s time limit earlier got accepted in merely 62 ms, with the inclusion of just one line of code.
I didn't understand most of the other material discussed in the link, like the MinGW implementations and so on.
So my question is: firstly, why does it work this way? Secondly, can I (or should I) keep using the above line in all my programs on Codeforces from now on?
P.S. I also found this blog: https://codeforces.com/blog/entry/47180
It was too confusing for me to grasp for the time being but maybe someone else can understand it and shed some light on the matter.
Also, here is the codeforces problem: https://codeforces.com/contest/1774/problem/C
Here is my solution:
https://codeforces.com/contest/1774/submission/185781891
I don't know the entire input, as Codeforces doesn't share it and it would be very large, but I do know that the value read into the tests variable is 3 and that the values read into n[0], n[1], and n[2] are 100000, 100000, and 100000.
Here is my code:
#define __USE_MINGW_ANSI_STDIO 0
#include <stdio.h>
#include <stdlib.h>
// #include <math.h>
// #include <string.h>
// #define lint long long int
// Function Declarations
int main(void)
{
int tests;
scanf("%i", &tests);
int **answers = malloc(tests * sizeof(int*));
int *n = malloc(sizeof(int) * tests);
for (int i = 0; i < tests; i++)
{
scanf("%i", &n[i]);
char *enviro = malloc((n[i]) * sizeof(int));
answers[i] = malloc((n[i] - 1) * sizeof(int));
int consec = 1; // No. of same consecutive elements at the very
// end.
scanf("%s", enviro);
answers[i][0] = 1; // Case where x = 2;
for (int x = 3; x < n[i] + 1; x++)
{
// comparing corresponding to current x vs previous x
if (enviro[x - 2] == enviro[x - 3])
{
consec++;
}
else
{
consec = 1;
}
answers[i][x - 2] = x - consec;
}
// Free loop variables
free(enviro);
}
/* if (tests == 3)
{
printf("n[%i] = %i\n", i, n[i]);
} */
for (int i = 0; i < tests; i++)
{
for (int j = 0; j < n[i] - 1; j++)
{
printf("%i ", answers[i][j]);
}
printf("\n");
free(answers[i]);
}
// Free variables
free(answers);
return 0;
}
EDIT: So I tried the following code for the same problem on codeforces (https://codeforces.com/contest/1774/submission/185788962) just to see the execution time:
// #define __USE_MINGW_ANSI_STDIO 0
#include <stdio.h>
#include <math.h>
int main(void)
{
int n = pow(10, 5);
for (int i = 0; i < n; i++)
{
printf("*");
}
}
Without #define __USE_MINGW_ANSI_STDIO 0 it gave an execution time of 374 ms. With it, it gave an execution time of 15 ms.
It seems that MinGW defines its own printf() function, __mingw_printf(). This is done to fix problems with format specifiers on some older Windows operating systems, as described in their wiki. The macro __USE_MINGW_ANSI_STDIO is set to 0 if you don't want to use MinGW's implementation, and to 1 if you do.
It also seems like MinGW's implementation is slower, so not using it will make your code faster.
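Note that for the macro to have any effect it must be defined before the first #include <stdio.h>, as it is in your code; a minimal sketch:
/* must come before <stdio.h>, otherwise the header has already made its choice */
#define __USE_MINGW_ANSI_STDIO 0
#include <stdio.h>

int main(void)
{
    printf("hello\n");
    return 0;
}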
I've written a simple benchmark to test and measure the single-precision fused multiply add performance of both processors, and OpenCL devices.
I recently added SMP support using Pthread. The CPU side is simple, it generates a couple of random matrices for inputs to ensure that the work can't be optimized out by the compiler.
The function cpu_result_matrix() creates the threads, and blocks until every thread returns using pthread_join(). It's this function that's timed to determine the performance of the device.
static float *cpu_result_matrix(struct bench_buf *in)
{
const unsigned tc = nthreads();
struct cpu_res_arg targ[tc];
float *res = aligned_alloc(16, BUFFER_SIZE * sizeof(float));
for (unsigned i = 0; i < tc; i++) {
targ[i].tid = i;
targ[i].tc = tc;
targ[i].in = in;
targ[i].ret = res;
}
pthread_t cpu_res_t[tc];
for (unsigned i = 0; i < tc; i++)
pthread_create(&cpu_res_t[i], NULL,
cpu_result_matrix_mt, (void *)&targ[i]);
for (unsigned i = 0; i < tc; i++)
pthread_join(cpu_res_t[i], NULL);
return res;
}
The actual kernel is in cpu_result_matrix_mt():
static void *cpu_result_matrix_mt(void *v_arg)
{
struct cpu_res_arg *arg = (struct cpu_res_arg *)v_arg;
const unsigned buff_size = BUFFER_SIZE;
const unsigned work_size = buff_size / arg->tc;
const unsigned work_start = arg->tid * work_size;
const unsigned work_end = work_start + work_size;
const unsigned round_cnt = ROUNDS_PER_ITERATION;
float lres;
for (unsigned i = work_start; i < work_end; i++) {
lres = 0;
float a = arg->in->a[i], b = arg->in->b[i], c = arg->in->c[i];
for (unsigned j = 0; j < round_cnt; j++) {
lres += a * ((b * c) + b);
lres += b * ((c * a) + c);
lres += c * ((a * b) + a);
}
arg->ret[i] = lres;
}
return NULL;
}
I noticed that the reported time taken for the kernel was roughly the same, regardless of how much I unrolled the inner loop.
To investigate, I made the kernel much larger by manually unrolling the inner loop until I could easily measure the wall time of the program running.
In the process, I observed that (it appears) the threads are returning before the kernel does the work it actually should do, which causes pthread_join() to stop blocking the main thread, and the execution time to appear to be much lower than it really is. I don't understand how this is possible, or how the program could continue to run and output correct results under these conditions.
Htop shows that the threads are still very much alive and working. I checked the return value of pthread_join(), and it was successful after every run. I got curious, and put a print statement in at the end of the kernel, before the return statement, and sure enough, each thread printed that it finished much sooner than it should have.
I watched ps while running the program, and it showed one thread, followed by three more, another five, then it dropped down to four.
I'm baffled, I've never seen threads act like this before.
The full source for my modified test branch is here: https://github.com/jakogut/clperf/tree/test
In the process, I observed that (it appears) the threads are returning before the kernel does the work it actually should do, which causes pthread_join() to stop blocking the main thread, and the execution time to appear to be much lower than it really is.
I'm not sure how you determined this, but looking at the assembly generated with -Ofast shows that
res[i] += a * ((b * c) + b);
res[i] += b * ((c * a) + c);
res[i] += c * ((a * b) + a);
is calculated before the inner loop. The inner loop is effectively
float t = a * ((b * c) + b) + b * ((c * a) + c) + c * ((a * b) + a);
float sum = 0;
for (unsigned j = 0; j < ROUNDS_PER_ITERATION; j++) {
sum += t;
}
res[i] = sum;
If in your timing you're expecting your inner loop to do sum += a * ((b * c) + b) + b * ((c * a) + c) + c * ((a * b) + a) each iteration when in fact it only does sum += t then your timing estimate will be much larger than what you observe.
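If the intent is to really execute the three FMA lines on every iteration, one option (a sketch, not code from the benchmark) is to add a loop-carried dependency inside cpu_result_matrix_mt() so the compiler cannot hoist the expression:
float lres = 0.0f;
for (unsigned j = 0; j < round_cnt; j++) {
    lres += a * ((b * c) + b);
    lres += b * ((c * a) + c);
    lres += c * ((a * b) + a);
    a += lres * 1e-12f;   /* a now changes every round, so nothing can be precomputed */
}
arg->ret[i] = lres;       /* the numeric result changes, which is fine for a throughput test */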
OpenMP seems to be a much better solution. It requires far less setup and complexity with problems like this that can exploit data parallelism.
static float *cpu_result_matrix(struct bench_buf *in)
{
    float *res = aligned_alloc(16, BUFFER_SIZE * sizeof(float));

    #pragma omp parallel for
    for (unsigned i = 0; i < BUFFER_SIZE; i++) {
        res[i] = 0.0f;   /* aligned_alloc() does not zero the buffer */
        float a = in->a[i], b = in->b[i], c = in->c[i];

        for (unsigned j = 0; j < ROUNDS_PER_ITERATION; j++) {
            res[i] += a * ((b * c) + b);
            res[i] += b * ((c * a) + c);
            res[i] += c * ((a * b) + a);
        }
    }

    return res;
}
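For what it's worth, this needs OpenMP enabled at compile time; with GCC that would be something like (the file name is a placeholder):
$ gcc -Ofast -fopenmp bench.c -o bench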
However, that doesn't answer why pthreads were behaving like they were in the question.
I thought a memory access (table lookup) would be faster than the multiplications and division (even compiler-optimized) done in alpha blending, but it wasn't as fast as expected.
The 16 megabytes used for the table are not an issue in this case, but it is a problem if the table lookup can be even slower than doing all the calculations on the CPU.
Can anyone explain what is happening and why? Would the table lookup win out on a slower CPU?
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>
#define COLOR_MAX UCHAR_MAX
typedef unsigned char color;
color (*blending_table)[COLOR_MAX + 1][COLOR_MAX + 1];
static color blend(unsigned int destination, unsigned int source, unsigned int a) {
return (source * a + destination * (COLOR_MAX - a)) / COLOR_MAX;
}
void initialize_blending_table(void) {
int destination, source, a;
blending_table = malloc((COLOR_MAX + 1) * sizeof *blending_table);
for (destination = 0; destination <= COLOR_MAX; ++destination) {
for (source = 0; source <= COLOR_MAX; ++source) {
for (a = 0; a <= COLOR_MAX; ++a) {
blending_table[destination][source][a] = blend(destination, source, a);
}
}
}
}
struct timer {
double start;
double end;
};
void timer_start(struct timer *self) {
self->start = clock();
}
void timer_end(struct timer *self) {
self->end = clock();
}
double timer_measure_in_seconds(struct timer *self) {
return (self->end - self->start) / CLOCKS_PER_SEC;
}
#define n 300
int main(void) {
struct timer timer;
volatile int i, j, k, l, m;
timer_start(&timer);
initialize_blending_table();
timer_end(&timer);
printf("init %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blending_table[j][k][l];
}
}
}
}
timer_end(&timer);
printf("table %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blend(j, k, l);
}
}
}
}
timer_end(&timer);
printf("function %f\n", timer_measure_in_seconds(&timer));
return EXIT_SUCCESS;
}
result
$ gcc test.c -O3
$ ./a.out
init 0.034328
table 14.176643
function 14.183924
Table lookup is not a panacea. It helps when the table is small enough, but in your case the table is very big. You write
16 megabytes used for the table is not an issue in this case
which I think is very wrong, and is possibly the source of the problem you experience. 16 megabytes is too big for L1 cache, so reading data from random indices in the table will involve the slower caches (L2, L3, etc). The penalty for cache misses is typically large; your blending algorithm must be very complex if you want your LUT solution to be faster.
Read the Wikipedia article for more info.
Your benchmark is hopelessly broken: it makes the LUT look a lot better than it actually is, because it reads the table in order.
If your performance results show that the LUT is worse than direct calculation, then when you start with real-world random access patterns and cache misses, the LUT is going to be much worse.
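A sketch of a more realistic measurement (a hypothetical loop, reusing blending_table, color and COLOR_MAX from the question, and assuming COLOR_MAX is 255 so the masking works) drives the lookup with pseudo-random indices so the cache sees a blending-like access pattern:
/* Pseudo-random indices from a simple LCG, so we measure cache behaviour
   rather than the cost of rand() itself. */
unsigned int seed = 12345u;
volatile color sink;
for (long long iter = 0; iter < 300LL * 256 * 256 * 256; ++iter) {
    seed = seed * 1664525u + 1013904223u;
    unsigned d = (seed >> 8)  & COLOR_MAX;
    unsigned s = (seed >> 16) & COLOR_MAX;
    unsigned a = (seed >> 24) & COLOR_MAX;
    sink = blending_table[d][s][a];
}
(void)sink;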
Focus on improving the computation, and enabling vectorization. It's likely to pay off far better than a table-based approach.
(source * a + destination * (COLOR_MAX - a)) / COLOR_MAX
with rearrangement becomes
(source * a + destination * COLOR_MAX - destination * a) / COLOR_MAX
which simplifies to
destination + (source - destination) * a / COLOR_MAX
which has one multiply and one division by a constant, both of which are very efficient. And it is easily vectorized.
You should also mark your helper function as inline, although a good optimizing compiler is probably inlining it anyway.
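A sketch of the rearranged blend (same names as the question; note that when (source - destination) is negative the truncating division rounds differently, so individual results can differ by one from the original formula):
static inline color blend_fast(int destination, int source, int a) {
    /* destination + (source - destination) * a / COLOR_MAX, done in signed
       arithmetic because the difference can be negative */
    return (color)(destination + (source - destination) * a / COLOR_MAX);
}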
Following is the skeleton of my setup. Executed like this, it doesn't give the correct result. This is most likely due to the async data transfers, which haven't finished when the kernel uses the data. I implemented a "failsafe" version with a preprocessor if-else. When the else branch is compiled, the program runs fine. I don't get it. Why is that?
The in1, out1, ... are just placeholders. Of course they point to different containers on each iteration of the for loop, so that the async transfers can take place. But within one iteration, the out1 used by the transfer and the one used by the kernel are the same.
cudaStream_t streams[2];
cudaEvent_t evCopied;

cudaStreamCreate(&streams[0]); // TRANSFER
cudaStreamCreate(&streams[1]); // KERNEL
cudaEventCreate(&evCopied);

// many iterations
for () {
    // Here I want overlapping of transfers with previous kernel
    cudaMemcpyAsync( out1, in1, size1, cudaMemcpyDefault, streams[0] );
    cudaMemcpyAsync( out2, in2, size2, cudaMemcpyDefault, streams[0] );
    cudaMemcpyAsync( out3, in3, size3, cudaMemcpyDefault, streams[0] );

#if 1
    // make sure host thread doesn't "run away"
    cudaStreamSynchronize( streams[1] );
    cudaEventRecord( evCopied , streams[0] );
    cudaStreamWaitEvent( streams[1] , evCopied , 0);
#else
    // this gives the correct results
    cudaStreamSynchronize( streams[0] );
    cudaStreamSynchronize( streams[1] );
#endif

    kernel<<< grid , sh_mem , streams[1] >>>(out1,out2,out3);
}
Please don't post answers suggesting a rearrangement of the setup, such as dividing the kernel into several smaller ones and issuing them in separate streams.
What you are doing -- or at least the use of an event to synchronize two streams -- should work. It is basically impossible to say why your actual code doesn't work because you have chosen not to post it, and the devil is always in the detail.
However, here is a complete, runnable example which I think is using the streams API in a fashion similar to what you are trying to do and which works correctly:
#include <cstdio>
typedef unsigned int uint;
template<uint bsz>
__global__ void kernel(uint * a, uint * b, uint * c, const uint N)
{
__shared__ volatile uint buf[bsz];
uint tid = threadIdx.x + blockIdx.x * blockDim.x;
uint stride = blockDim.x * gridDim.x;
uint val = 0;
for(uint i=tid; i<N; i+=stride) {
val += a[i] + b[i];
}
buf[threadIdx.x] = val; __syncthreads();
#pragma unroll
for(uint i=(threadIdx.x+warpSize); (threadIdx.x<warpSize)&&(i<bsz); i+=warpSize)
buf[threadIdx.x] += buf[i];
if (threadIdx.x < 16) buf[threadIdx.x] += buf[threadIdx.x+16];
if (threadIdx.x < 8) buf[threadIdx.x] += buf[threadIdx.x+8];
if (threadIdx.x < 4) buf[threadIdx.x] += buf[threadIdx.x+4];
if (threadIdx.x < 2) buf[threadIdx.x] += buf[threadIdx.x+2];
if (threadIdx.x == 0) c[blockIdx.x] += buf[0] + buf[1];
}
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
int main(void)
{
const int nruns = 16, ntransfers = 3;
const int Nb = 32, Nt = 192, Nr = 3000, N = Nr * Nb * Nt;
const size_t szNb = Nb * sizeof(uint), szN = size_t(N) * sizeof(uint);
size_t sz[4] = { szN, szN, szNb, szNb };
uint * d[ntransfers+1];
for(int i=0; i<ntransfers+1; i++)
gpuErrchk(cudaMallocHost((void **)&d[i], sz[i]));
uint * a = d[0], * b = d[1], * c = d[2], * out = d[3];
for(uint i=0; i<N; i++) {
a[i] = b[i] = 1;
if (i<Nb) c[i] = 0;
}
uint * _d[3];
for(int i=0; i<ntransfers; i++)
gpuErrchk(cudaMalloc((void **)&_d[i], sz[i]));
uint * _a = _d[0], * _b = _d[1], * _c = _d[2];
cudaStream_t stream[2];
for (int i = 0; i < 2; i++)
gpuErrchk(cudaStreamCreate(&stream[i]));
cudaEvent_t sync_event;
gpuErrchk(cudaEventCreate(&sync_event));
uint results[nruns];
for(int j=0; j<nruns; j++) {
for(int i=0; i<ntransfers; i++)
gpuErrchk(cudaMemcpyAsync(_d[i], d[i], sz[i], cudaMemcpyHostToDevice, stream[0]));
gpuErrchk(cudaEventRecord(sync_event, stream[0]));
gpuErrchk(cudaStreamWaitEvent(stream[1], sync_event, 0));
kernel<Nt><<<Nb, Nt, 0, stream[1]>>>(_a, _b, _c, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaMemcpyAsync(out, _c, szNb, cudaMemcpyDeviceToHost, stream[1]));
gpuErrchk(cudaStreamSynchronize(stream[1]));
results[j] = uint(0);
for(int i=0; i<Nb; i++) results[j]+= out[i];
}
for(int j=0; j<nruns; j++)
fprintf(stdout, "%3d: ans = %u\n", j, results[j]);
gpuErrchk(cudaDeviceReset());
return 0;
}
The kernel is a "fused vector addition/reduction", just nonsense, but it relies on the last of the three inputs being zeroed prior to kernel execution to produce the correct answer, which should simply be twice the number of input data points. As in your example, the kernel execution and asynchronous input array copying are in different streams, so the copying and the execution can potentially overlap. There is no sane reason to copy the first two large inputs at every iteration in this case, other than to introduce delay before the last copy (which is the critical one) is done and increase the chance it will incorrectly overlap with the kernel. This might be where you are going wrong, because I don't believe the CUDA memory model guarantees that it is safe to asynchronously modify memory being accessed by a running kernel. If that is what you are trying to do, then expect it to fail. But without seeing real code, it is impossible to say more.
With that out of the way, you can see for yourself that the kernel won't produce the correct result without the cudaStreamWaitEvent to synchronize the two streams prior to the kernel launch. The only difference between your pseudo code and this example is the location of the cudaStreamSynchronize on the execution stream. Here I placed it after the kernel launch, to make sure the kernel finishes before the transfer that gathers the results back to the host. That could be the critical difference, but again, no real code equals no real-code analysis...
All I can suggest is that you play with this example to get a feel for how it works. I understand that very recent versions of Nsight for Windows can profile asynchronous code without the profiling artificially serializing the execution streams. That might help you diagnose your problem if you can't work it out from this example or from your own code.
I have a simple test program in C to scramble an array of values on the heap. Sidenote: I know the random logic here has a flaw that will not allow the "displaced" value to exceed RAND_MAX, but that is not the point of this post.
The point is that when I run the code with N = 10000, every once in a while it will crash with very little information. I'm using the MinGW compiler. I can't seem to reproduce the crash for lower or higher values of N (1000 or 100000, for example).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
const int N = 10000;
int main() {
int i, rand1, rand2, temp, *values;
/* allocate values on heap and initialize */
values = malloc(N * sizeof(int));
for (i = 0; i < N; i++) {
values[i] = i + 1;
}
/* scramble */
srand(time(NULL));
for (i = 0; i < N/10; i++) {
rand1 = (int)(N*((double)rand()/(double)RAND_MAX));
rand2 = (int)(N*((double)rand()/(double)RAND_MAX));
temp = values[rand1];
values[rand1] = values[rand2];
values[rand2] = temp;
}
int displaced = 0;
for (i = 0; i < N; i++) {
if (values[i] != (i+1)) {
displaced++;
}
}
printf("%d numbers out of order\n", displaced);
free(values);
return 0;
}
It may be because rand() generates a random number from 0 to RAND_MAX inclusive, so (int)(N*((double)rand()/(double)RAND_MAX)) can be N, which exceeds the array boundary. However, I don't see why that would vary with array size (it does explain why it only crashes sometimes, though).
Try dividing by (1+(double)RAND_MAX) instead (note that the addition is done on the double, to avoid overflow, depending on the value of RAND_MAX). Although I'm not convinced that will always work, depending on the types involved; it would be safer to test for N and try again.
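A sketch of an index helper along those lines (random_index is a made-up name), which can never return n:
#include <stdlib.h>

/* Returns a pseudo-random index in [0, n). Dividing by RAND_MAX + 1.0 keeps the
   scaled value strictly below n; the loop is a belt-and-braces retry. */
static int random_index(int n) {
    int r;
    do {
        r = (int)((double)rand() / ((double)RAND_MAX + 1.0) * n);
    } while (r >= n);
    return r;
}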
Also, learn to use a tool from Is there a good Valgrind substitute for Windows? - such tools make this kind of thing easy to fix, because they tell you exactly what went wrong when you run your program.