Why is this statement in a CUDA kernel slow? - c

I am doing some computer vision stuff using CUDA. Following code takes about 20 seconds to complete.
__global__ void nlmcuda_kernel(float* fpOMul,/*other input args*/){
float fpODenoised[75];
/*Do awesome stuff to compute fpODenoised*/
//inside nested loops:(This is the statement that is the bottleneck in the code.)
fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = fpODenoised[ii * iwl +iindex];
}
if I replace that statement with
fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = 2.0f;
the code hardly takes a couple of seconds to complete.
Why is the specified statment slow and how can I make it run fast?

When you make the code change the compiler can see that all your awesome fpdenoised code is no longer needed and can optimize it out. The actual statement you modified is not the direct cause of the perf difference. You can verify this by looking at the ptx or sass code in each case.

Related

OpenMP for beginners

I just got started with openMP; I wrote a little C code in order to check if what I have studied is correct. However I found some troubles; here is the main.c code
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes to the PC to do such operation using 1 and 2 CPUs; in order to to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
Both corresponding to N=1 and N=2. as you can see when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't implicitly mean, that you're going to get better performance.
OpenMP has to manage the data distribution among you're cores which is going to take time as well. Especially for very basic operations such as only a single multiplication you are doing, performance of a sequential (single core) program will be better.
Second, by going through every element of you're array only once and not doing anything else, you make no use of cache memory and most certainly not of shared cache between cpu's.
So you should start reading some things about general algorithm performance. To make use of multiple cores using shared cache is in my opinion the essence.
Todays computers have come to a stage where the CPU is so much faster than a memory allocation, read or write. This means when using multiple cores, you'll only have a benefit if you use things like shared cache, because the data distribution,initialization of the threads and managing them will use time as well. To really see a performance speedup (See the link, essential term in parallel computing) you should program an algorithm which has a heavy accent on computation not on memory; this has to do with locality (another important term).
So if you wanna experience a big performance boost by using multiple cores test it on a matrix-matrix-multiplication on big matrices such as 10'000*10'000. And plot some graphs with inputsize(matrix-size) to time and matrix-size to gflops and compare the multicore with the sequential version.
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the numbers of cores/threads within the code itself either directly at the #pragma line #pragma omp parallel for num_threads(2) or using the omp_set_num_threads function omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to always run the program multiple times and then take the mean of all the runtimes or something like that. Running the respective programs only once will not give you a meaningful reading of used time. Always call multiple times in a row. Not to forget to also alternate the quality of data.
I suggest writing a test.c file, which takes your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm, add some rand() for your values etc. seed them with srand() etc. If you have more questions on the subject or to my answer leave a comment and I'll try to explain further by adding more explanations.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total cpu time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

Delay using timer on raspberry pi

I need to create a accurate delay (around 100us) inside a thread function. I tried using the nanosleep function but it was no accurate enough. I read some post about how to read the hardware 1MHz timer, so on my function in order to create a 100us delay y tried something like this:
prev = *timer;
do {
t = *timer;
} while ((t - prev) < 100);
However, the program seems to stay inside the loop. But if I insert a small nano sleep inside the loop it works (but loosing precision):
sleeper.tv_sec = 0;
sleeper.tv_nsec = (long)(1);
prev = *timer;
do {
nanosleep (&sleeper, &dummy);
t = *timer;
} while ((t - prev) < 500);
I tried the first version in a stand along program and it works, but in my main program, where this is inside a thread it does not.
Does anyone know what the first version (without a small nanosleep) does not work?
I'm sorry to say but Raspberry Pi's OS is not a "real-time OS". In another words, you won't get consistent 100us precision in a user space program due to inherent OS scheduling limitations. If you need that kind of precision, you should use an embedded controller like an Arduino.

Calculation time elapsed by a particular function in C program

I have a code in which i want to calculate the time taken by two sorting algorithms merge sort and quick sort to sort N numbers in microseconds or more precise.
The two times thus calculated will then we outputted to the terminal.
Code(part of code):
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);
mergesort(extarr,0,n-1);
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
Help me by providing that how it will be done.I have tried clock_t by taking two variables as start stop and keeping them above and below the function call respectively but this doesnt help at all and always print out the its difference as zero.
Please suggest some other methods or function keeping in mind that it has no problem running in any type of OS.
Thanks for any help in advance.
Method : 1
To calculate total time taken by program You can use linux utility "time".
Lets your program name is test.cpp.
$g++ -o test test.cpp
$time ./test
Output will be like :
real 0m11.418s
user 0m0.004s
sys 0m0.004s
Method : 2
You can also use linux profiling method "gprof" to find the time by different functions.
First you have to compile the program with "-pg" flag.
$g++ -pg -o test test.cpp
$./test
$gprof test gmon.out
PS : gmon.out is default file created by gprof
You can call gettimeofday function in Linux and timeGetTime in Windows. Call these functions before calling your sorting function and after calling your sorting function and take the difference.
Please check the man page for further details. If you are still unable to get some tangible data (as the time taken may be too small due to smaller data sets), better to try to measure the time together for 'n' number of iterations and then deduce the time for a single run or increase the size of the data set to be sorted.
Not sure if you tried the following. I know your original post says that you have tried utilizing the CLOCKS_PER_SEC. Using CLOCKS_PER_SEC and doing (stop-start)/CLOCKS_PER_SEC will allow you get seconds. The double will provide more precision.
#include <time.h>
main()
{
clock_t launch = clock();
//do work
clock_t done = clock();
double diff = (done - launch) / CLOCKS_PER_SEC;
}
The reason to get Zeroas the result is likely the poor resolution of the time source you're using. These time sources typically increment by some 10 to 20 ms. This is poor but that's the way they work. When your sorting is done in less that this time increment, the result will be zero. You may increase this resultion into the 1 ms regime by increasing the systems interrupt frequency. There is no standard way to accomplish this for windows and Linux. They have their individual way.
An even higher resolution can be obtained by a high frequency counter. Windows and Linux do provide access to such counters, but again, the code may look slightly different.
If you deserve one piece of code to run on windows and linux, I'd recommend to perform the time measurement in a loop. Run the code to measure hundreds or even more times in a loop
and capture the time outside the loop. Divide the captured time by the numer of loop cycles and have the result.
Of course: This is for evaluation only. You don't want to have that in final code.
And: Taking into account that the time resolution is in the 1 to 20 ms you should make a good choice of the total time to go for to get decent resolution of you measurement. (Hint: Adjust the loop count to let it go for at least a second or so.)
Example:
clock_t start, end;
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);
start = clock();
for(int i = 0; i < 100; i++){
mergesort(extarr,0,n-1);
}
end = clock();
double diff = (end - start) / CLOCKS_PER_SEC;
// and so on...
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
If you are in Linux 2.6.26 or above then getrusage(2) is the most accurate way to go:
#include <sys/time.h>
#include <sys/resource.h>
// since Linux 2.6.26
// The macro is not defined in all headers, but supported if your Linux version matches
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif
// If you are single-threaded then RUSAGE_SELF is POSIX compliant
// http://linux.die.net/man/2/getrusage
struct rusage rusage_start, rusage_stop;
getrusage(RUSAGE_THREAD, &rusage_start);
...
getrusage(RUSAGE_THREAD, &rusage_stop);
// amount of microseconds spent in user space
size_t user_time = ((rusage_stop.ru_utime.tv_sec - rusage_start.ru_stime.tv_sec) * 1000000) + rusage_stop.ru_utime.tv_usec - rusage_start.ru_stime.tv_usec;
// amount of microseconds spent in kernel space
size_t system_time = ((rusage_stop.ru_stime.tv_sec - rusage_start.ru_stime.tv_sec) * 1000000) + rusage_stop.ru_stime.tv_usec - rusage_start.ru_stime.tv_usec;

problems when creating many plans and executing plans

I am a little confused about creating many_plan by calling fftwf_plan_many_dft_r2c() and executing it with OpenMP. What I am trying to achieve here is to see if explicitly using OpenMP and organizing FFTW data could work together. ( I know I "should" use multithreaded version of fftw but I failed to get a expected speedup from it ).
My code looks like this:
/* I ignore some helper APIs */
#define N 1024*1024 //N is the total size of 1d fft
fftwf_plan p;
float * in;
fftwf_complex *out;
omp_set_num_threads(threadNum); // Suppose threadNum is 2 here
in = fftwf_alloc_real(2*(N/2+1));
std::fill(in,in+2*(N/2+1),1.1f); // just try with a random real floating numbers
out = (fftwf_complex *)&in[0]; // for in-place transformation
/* Problems start from here */
int n[] = {N/threadNum}; // according to the manual, n is the size of each "howmany" transformation
p = fftwf_plan_many_dft_r2c(1, n, threadNum, in, NULL,1 ,1, out, NULL, 1, 1, FFTW_ESTIMATE);
#pragma omp parallel for
for (int i = 0; i < threadNum; i ++)
{
fftwf_execute(p);
// fftwf_execute_dft_r2c(p,in+i*N/threadNum,out+i*N/threadNum);
}
What I got is like this:
If I use fftwf_execute(p), the program executes successfully, but the result seems not correct. ( I compare the result with the version of not using many_plan and openmp )
If I use fftwf_execute_dft_r2c(), I got segmentation fault.
Can somebody help me here? How should I partition the data across multiple threads? Or it is not correct in the first place.
Thank you in advance.
flyree
Do you properly allocate memory for out? Does this:
out = (fftwf_complex *)&in[0]; // for in-place transformation
do the same as this:
out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfOutputColumns);
You are trying to access 'p' inside your parallel block, without specifically telling openMP how to use it. It should be:
pragma omp parallel for shared(p)
If you are going to split the work up for n threads, I would think you'd explicitly want to tell omp to use n threads:
pragma omp parallel for shared(p) num_threads(n)
Does this code work without multithreading? If you removed the for loop and openMP call and executed fftwf_execute(p) just once does it work?
I don't know much about FFTW's plans for many, but it seems like p is really many plans, not one single plan. So, when you "execute" p, you are executing all plans at once, right? You don't really need to iteratively execute p.
I'm still learning about OpenMP + FFTW so I could be wrong on these. StackOverflow doesn't like it when i put a # in front of pragma, but you need one.

OPENMP F90/95 Nested DO loops - problems getting improvement over serial implementation

I've done some searching but couldn't find anything that appeared to be related to my question (sorry if my question is redundant!). Anyway, as the title states, I'm having trouble getting any improvement over the serial implementation of my code. The code snippet that I need to parallelize is as follows (this is Fortran90 with OpenMP):
do n=1,lm
do m=1,jm
do l=1,im
sum_u = 0
sum_v = 0
sum_t = 0
do k=1,lm
!$omp parallel do reduction (+:sum_u,sum_v,sum_t)
do j=1,jm
do i=1,im
exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
sum_u = sum_u + u_p(i,j,k) * exp_smoother
sum_v = sum_v + v_p(i,j,k) * exp_smoother
sum_t = sum_t + t_p(i,j,k) * exp_smoother
sum_u_pert(l,m,n) = sum_u
sum_v_pert(l,m,n) = sum_v
sum_t_pert(l,m,n) = sum_t
end do
end do
end do
end do
end do
end do
Am I running into race condition issues? Or am I simply putting the directive in the wrong place? I'm pretty new to this, so I apologize if this is an overly simplistic problem.
Anyway, without parallelization, the code is excruciatingly slow. To give an idea of the size of the problem, the lm, jm, and im indexes are 60, 401, and 501 respectively. So the parallelization is critical. Any help or links to helpful resources would be very much appreciated! I'm using xlf to compile the above code, if that's at all useful.
Thanks!
-Jen
The obvious place to put the omp pragma is at the very outside loop.
For every (l,m,n), you're calculating a convolution between your perturbed variables and an exponential smoother. Each (l,m,n) calculation is completely independant from the others, so you can put it on the outermost loop. So for instance the simplest thing
!$omp parallel do private(n,m,l,i,j,k,exp_smoother) shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p), default(none)
do n=1,lm
do m=1,jm
do l=1,im
do k=1,lm
do j=1,jm
do i=1,im
exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
sum_u_pert(l,m,n) = sum_u_pert(l,m,n) + u_p(i,j,k) * exp_smoother
sum_v_pert(l,m,n) = sum_v_pert(l,m,n) + v_p(i,j,k) * exp_smoother
sum_t_pert(l,m,n) = sum_t_pert(l,m,n) + t_p(i,j,k) * exp_smoother
end do
end do
end do
end do
end do
end do
gives me a ~6x speedup on 8 cores (using a much reduced problem size of 20x41x41). Given the amount of work there is to do in the loops, even at the smaller size, I assume the reason it's not an 8x speedup involves memory contension or false sharing; for further performance tuning you might want to explicitly break the sum arrays into sub-blocks for each thread, and combine them at the end; but depending on the problem size, having the equivalent of an extra im x jm x lm sized array might not be desirable.
It seems like there's a lot of structure in this problem you aught to be able to explot to speed up even the serial case, but it's easier to say that then to find it; playing around on pen and paper nothing comes to mind in a few minutes, but someone cleverer may spot something.
What you have is a convolution. This can be done with a Fast Fourier Transform in N log2(N) time. Your algorithm is N^2. If you use FFT, one core will probably be enough!

Resources