I am trying to solve a linear system using the following code:
#include <stdio.h>
#include <lapacke.h>
int main () {
    lapack_complex_double mat[4];
    lapack_complex_double vec[2];
    lapack_int p[2];
    mat[0] = lapack_make_complex_double(1,0);
    mat[1] = lapack_make_complex_double(1,0);
    mat[2] = lapack_make_complex_double(1,0);
    mat[3] = lapack_make_complex_double(-1,0);
    vec[0] = lapack_make_complex_double(1,0);
    vec[1] = lapack_make_complex_double(1,0);
    LAPACKE_zgetrf(LAPACK_ROW_MAJOR, 2, 2, mat, 2, p);
    LAPACKE_zgetrs(LAPACK_ROW_MAJOR, 'N', 2, 1, mat, 2, p, vec, 2);
    printf("%g %g\n", lapack_complex_double_real(vec[0]),
                      lapack_complex_double_imag(vec[0]));
    return 0;
}
For some reason, this causes an illegal memory access in LAPACKE_zgetrs (as detected by valgrind, and by my big program crashing in zgetrs with "glibc detected corruption or double free"). I did not include this in my SSCCE for brevity, but all LAPACKE routines that return do return 0.
The same code with LAPACK_COL_MAJOR runs and passes valgrind flawlessly.
My lapacke, lapack, etc. are self-built on Ubuntu 12.04. I used the following settings in the LAPACK CMake configuration:
BUILD_COMPLEX ON
BUILD_COMPLEX16 ON
BUILD_DOUBLE ON
BUILD_SHARED_LIBS ON
BUILD_SINGLE ON
BUILD_STATIC_LIBS ON
BUILD_TESTING ON
CMAKE_BUILD_TYPE Release
LAPACKE ON
LAPACKE_WITH_TMG ON
and the rest (the optimized blas/lapack and xblas) off. There were no errors during the build and all tests succeeded.
Where did I mess up?
Edit: I just tried this on Fedora 21 with the packaged lapacke. It did not reproduce the error.
Edit 2: While it does not reproduce the memory errors, it produces a wrong solution, namely (1 + 0i, 1 + 0i) for the above input (it should be (1, 0)).
After some more research and overthinking things, I found the culprit:
Using LAPACK_ROW_MAJOR changes the meaning of the ld* leading-dimension parameters. While the leading dimension of a normal Fortran array is the number of rows, switching to ROW_MAJOR changes its meaning to the number of columns. So the correct call (giving correct results) would be:
LAPACKE_zgetrs(LAPACK_ROW_MAJOR, 'N', 2, 1, mat, 2, p, vec, 1);
where the second 2 is the number of columns (not rows!) of mat, and the last parameter must equal the number of right-hand sides nrhs (not the number of variables!). This only surfaced in this particular call because all the other calls in my project deal with square matrices, so the "wrong" calls have no negative effect due to symmetry.
As usual, if you skip columns at the end of the array, the leading dimensions grow accordingly, just as they would when skipping rows in the normal column-major setting.
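To illustrate (a hypothetical sketch, not code from my program): factoring only the leading 2x2 block of a row-major 2x3 array, lda must be the full row length 3, i.e. the number of columns of the underlying array:
/* big is a row-major 2x3 array; we factor only its leading 2x2 block */
lapack_complex_double big[6];
lapack_int piv[2];
/* ... fill big ... */
LAPACKE_zgetrf(LAPACK_ROW_MAJOR, 2, 2, big, 3, piv);  /* lda = 3, the row stride */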
Obviously, this is not mentioned in the Fortran documentation. Unfortunately, I saw no such remark in the LAPACKE documentation either, which would have saved me a couple of hours of my life. :)
I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask for some help.
Here are the details.
My kernel is applied to images of different sizes (to be precise, each layer of a Laplacian pyramid).
I get normal results for larger images such as 3072 x 3072 or 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a lower limit on the dimensions, causing this problem. So I added a printf to the beginning of the kernel, as follows. This confirmed that all necessary kernel instances are executed.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
}
So after poking around for a while, I added the same printf to the end of the kernel. This showed that the printf works only for some pixel positions. For the pixel positions that produce no printf output, the calculated values in the resulting image are incorrect, so I concluded that some kernel instances terminate abnormally before completing the calculations.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
    printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
It seems that there is no problem with the kernel's calculations. If I compile the kernel with optimization turned off using the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition, with an NVIDIA P4000 it works correctly. Of course, in these cases I confirmed that the printf added at the bottom of the kernel works for all pixels.
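(For reference, -cl-opt-disable is just the build-options string passed to clBuildProgram; 'program' and 'device' below are placeholder names for my own handles:)
err = clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);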
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols),
                                   size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
                                 NULL, globalSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -1;
    // I tried with this but it didn't make any difference
    //std::this_thread::sleep_for(std::chrono::seconds(1));
    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;
    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE, 0,
                              sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols * vtMatB_GPU_LLP[nLayerIndex].rows,
                              vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
And I tried with an event, too, but it behaves the same way.
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols),
                                   size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    cl_event event;
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
                                 NULL, globalSize, NULL, 0, NULL, &event);
    if (CL_SUCCESS != err)
        return -1;
    err = clWaitForEvents(1, &event);
    if (CL_SUCCESS != err)
        return -1;
    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;
    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE, 0,
                              sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols * vtMatB_GPU_LLP[nLayerIndex].rows,
                              vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
/////// Added contents ////////////////////////////////////////////
Would you please take a look at this issue from the perspective of clFinish or clWaitForEvents? Am I missing something in this regard?
Sometimes I get fewer correct values and sometimes more.
To be more specific, let's say I'm applying the kernel to a 12 x 12 image, so there are 144 pixel values.
Sometimes I get correct values for 56 pixels.
Sometimes I get correct values for 89 pixels.
Some other time I get correct values for n (less than 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying the -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code, with no modification (other than the device selection code), runs perfectly correctly on an NVIDIA P4000.
At first, I was really suspicious of the calculation code, but the more I inspect it, the more confident I am that there's nothing wrong with it.
I know there's still a chance that an error in the calculation code causes some exception somewhere during the calculations.
I have plain C++ code for the same task, and I'm comparing the results from the two.
/////// Another added contents ////////////////////////////////////////////
I made minimal code (apart from the project template) to reproduce the phenomenon.
What's even odder is that if I install the "Intel® Distribution for GDB Target", I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows
OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped into workgroups. The workgroup size should be a multiple of 32, ideally 64 (or 8x8 pixels in 2D) to make full use of the hardware. These workgroups cannot be split, so the global range must be a multiple of the workgroup size.
What happens if the global range is not evenly divisible by the workgroup size, or is smaller than the workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 threads work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot make the global size a multiple of the workgroup size, there is still a solution: a guard clause at the very beginning of the kernel:
if(xB>=xImage||yB>=yImage) return;
This ensures that no threads access unallocated memory.
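If you do want an explicit workgroup size, a common host-side pattern (a sketch; 'queue' and 'kernel' stand for your own handles, and xImage/yImage are the image dimensions) is to round the global range up to a multiple of it and let the guard clause skip the padding threads:
const size_t local[2]  = { 8, 8 };
const size_t global[2] = { ((xImage + local[0] - 1) / local[0]) * local[0],
                           ((yImage + local[1] - 1) / local[1]) * local[1] };
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);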
As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
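A minimal sketch of that idea on the kernel side - 'flags' here is a hypothetical zero-initialised buffer with one int per work-item that you read back and inspect on the host:
__kernel void GetValueOfB(/* original parameters, */ __global int *flags)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    // ... calculation code ...
    flags[yB * get_global_size(0) + xB] = 1;  // this work-item reached the end
}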
2. Events
As you asked about events, flushing, etc.: your clFinish call certainly should suffice to ensure everything has executed - if anything, it's overkill, but especially while you're debugging other issues it's a good way to rule out queuing issues.
The clWaitForEvents() call preceding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but could be a problem on some implementations.
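In other words, something along these lines (a sketch, with 'queue' standing in for your command queue):
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
clFlush(queue);   /* make sure the command is actually submitted to the device */
err = clWaitForEvents(1, &event);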
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel-printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.
Thanks to a person from the Intel community, I was able to understand the phenomenon.
Briefly, if a single kernel instance takes too much time, 'Timeout Detection and Recovery (TDR)' stops the kernel instance.
For more information about this, you can refer to the following:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate all the people who gave me advice.
I am pretty new to MPI, so apologies if this is simple.
I have some code from a month or two ago that has been working fine, but I decided to go back and revise it. (It was written when I was just starting out, and it's not a performance critical section.) The code basically generates a random graph on one process and then shares the results with all other processes. An excerpt from the baby's-first-steps version follows:
unsigned int *graph;
if (commrank == 0) {
    graph = gengraph(params); // allocates graph memory in function
    if (commsize > 1) {
        for (int k = 1; k < commsize; k++)
            MPI_Send(graph, n*n, MPI_UNSIGNED, k, 0, MPI_COMM_WORLD);
    }
} else {
    MPI_Status recvStatus;
    graph = malloc(sizeof(unsigned int)*n*n);
    MPI_Recv(graph, n*n, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD, &recvStatus);
}
While obviously naive, this worked just fine for a while, before I chose to go back and do it in what I thought was the proper fashion:
if (commrank == 0) {
    graph = gengraph(params);
    MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
} else {
    graph = malloc(sizeof(unsigned int)*n*n);
    MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
}
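(As an aside, the two branches can be collapsed, since MPI_Bcast is called identically on every rank; a sketch, assuming n and params as in the rest of the program:)
unsigned int *graph = NULL;
if (commrank == 0)
    graph = gengraph(params);                     /* allocates n*n entries */
else
    graph = malloc(sizeof(unsigned int) * n * n);
MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);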
The problem is, I keep getting "stack smashing" errors in the second version when I compile with -O3 optimization, though it works fine when compiled unoptimized. Note that I have checked the graph allocation function multiple times and debugged it, and it appears to be fine. I have also debugged the second version, and it appears to work fine. The crash occurs later, when I try to free the graph memory. (Note that this is not a double-free error, and, again, it works fine in the naive implementation and has for some time.)
One final wrinkle: the first version also fails if, instead of using the recvStatus variable, I use MPI_STATUS_IGNORE. And, again, this only fails with -O3.
Any thoughts would be greatly appreciated. If it's any help, I'm using mpicc on top of gcc 7.5.0, but I imagine I'm doing something stupid rather than encountering a compiler problem.
I changed the mpicc compiler to Clang and used AddressSanitizer, per the suggestion of Hristo Iliev, and found an error in a subsequent MPI call (a recv with the wrong count size). This led to the undefined behavior. Notably, AddressSanitizer pinpointed the location of the error quite clearly, while valgrind only gave rather opaque indications that something was going on in MPI (as, well, it always does).
Apologies to the StackOverflow community for this, as the code above was not the culprit (not entirely surprising). It was just some standard undefined behavior due to sloppiness.
I have read a lot of posts on a similar topic but I have not yet succeeded in resolving this.
I should mention that I have simplified my code a lot for this post.
My intention is to use a C function by calling it from Fortran 77 and receiving values back from C. I mention Fortran 77 because I want to link my code to a much larger project that uses Fortran 77, but I am willing to consider solutions with other versions of Fortran if they do the job and you believe they will simplify my problem.
I have two files: Try_stack.f and client2.c.
I am compiling my code as:
gcc -c client2.c
gfortran -g Try_stack.f client2.o -o combined
My Try_stack.f file:
      program circle
      call circle2
      stop
      end

      subroutine circle2
      dimension rread(2)
      double precision r, area, rread
      external client
      area = 3.
      rread(1) = area
      rread(2) = area + 10.
      write (*,*) 'Area = ', rread(1)
      call client(rread)
      retNread = rread(1) * 2
      write (*,*) 'new nread is: ', retNread
      return
      end
And my client2.c file:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
int client_(double rread[2])
{
    double result;
    result = 1.;
    rread[1] = result;
    printf("%.2lf", rread);
    return 0;
}
After running the compiled version I am getting:
Area = 3.0000000000000000
0.00 new nread is: 6.00000000
But I wanted the value returned to the Fortran program to be 8.000 instead of 6.0000 (because Fortran sends the value 3., C adds 1. to it, and 4.0 should be returned to Fortran to be multiplied by 2.). To explain it in a simple way:
First, I want the Fortran file to send the number 3. to C (actually I want to exchange arrays).
Second, I want the C file to take the number 3. and add 1.
Third, I want C to return the result back to the Fortran file, i.e. the number 4.
Finally, I want Fortran to continue computing, in this case multiplying 4*2=8.
I have read a lot about iso_c_binding but have obviously not managed to utilise it; besides, it requires a recent version of Fortran, if my understanding is correct.
Any help will be much appreciated.
There are a lot of comments; did anyone actually compile and try to run this code?
Besides the Fortran (indices start from 1) vs. C (indices start from 0) difference, there is a typo preventing you from getting the expected result.
BTW, please use implicit none in any Fortran code!
int client_(double rread[2])
{
    double result;
    result = 1.;
    //rread[1]=result; --> typo?
    rread[0] += result;
    printf("%.2lf", rread);  /* note: this passes the array (a pointer), not a double; print rread[0] to see the value */
    return 0;
}
Area = 3.0000000000000000
0.00 new nread is: 8.0000000000000000
I am working with a GeForce 210 (compute capability 1.2) and CUDA 6.5.
I wish to print float values from my CUDA kernel. I have placed "cuPrintf.cu" and "cuPrintf.cuh" in my project directory and included them in my code. It compiles fine and runs without errors, but prints nothing. This is how I compile my code:
$ nvcc -arch=compute_12 test.cu
I read a similar question and then surrounded my kernel launch with cudaPrintfInit() and cudaPrintfDisplay().
if (cudaPrintfInit() != cudaSuccess)
    printf("cudaPrintfInit failed\n");
test_kernel<<<grid, block>>>(val);
if (cudaPrintfDisplay(stdout, true) != cudaSuccess)
    printf("cudaPrintfDisplay failed\n");
cudaPrintfEnd();
My kernel looks like this:
__global__ void test_kernel (float val){
    int i = blockIdx.x*BLOCK_X + threadIdx.x;
    int j = blockIdx.y*BLOCK_Y + threadIdx.y;
    if( j == 20 )
        cuPrintf("%f is value, %d is j", val, j);
}
On compiling and running, the output is:
cudaPrintfInit failed
cudaPrintfDisplay failed
I guess there could be a problem with the way I am compiling, or cuPrintf does not allow floats to be printed? According to the linked similar question, the problem there was the threads per block exceeding a maximum value, but my block size is 16 x 16 (so that should not be the problem). cudaPrintfInit and cudaPrintfDisplay both report failure!
I have also run the CUDA sample code "simplePrintf" which comes with the CUDA installation. That works perfectly. Help!
Formatted output is only supported by devices of compute capability 2.x and higher.
int printf(const char *format[, arg, ...]);
prints formatted output from a kernel to a host-side output stream.
Reference: CUDA C Programming Guide (2015), page 119.
see this link: https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialHelloWorld
I was able to solve the problem by running with 'cuda-memcheck'. cudaPrintf was not working because NaN values were being generated in the kernel: the denominator in some computations was becoming zero. When I avoided those cases, cudaPrintfInit and cudaPrintfDisplay started working.
I am totally stumped. I have a fairly large recursive program written in C that calls cblas_dgemm(). The result is verified independently by a program that works correctly.
C = alpha*A*B + beta*C
On repeated tests using random matrices and all possible combinations of parameters, the program gives the correct answer ONLY if abs(beta) = 2^n (1, 2, 4, 8, ...). Any value works for alpha. Any other positive/negative, odd/even value for beta gives the correct answer only 10-30% of the time.
I am using Ubuntu 10.04 and GCC 4.4.x. I have tried the system-installed blas/cblas/atlas as well as a manually compiled ATLAS.
Any hints or suggestions would be greatly appreciated. I am amazed at the wonderfully generous (and smart) folks lurking at this site.
Thanking you all in advance,
Russ
Two completely unrelated errors conspired to produce an illusory picture, which made me look for problems in the wrong place.
(1) There was a simple error in the logic of the function calling dgemm. It would have been easily fixed if I had not been chasing the wrong problem.
(2) My double-compare function, a double version of AlmostEqual2sComplement() (http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm), used an incorrectly sized integer, resulting in an incorrect TRUE under certain rare circumstances. This was the first time the error bit me!
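For reference, a sketch of what a double-precision ULP comparison with a correctly sized (64-bit) integer might look like - illustrative only, not the original function:
#include <stdint.h>
#include <string.h>

/* "Almost equal" in ULPs for doubles; the key point is using a 64-bit
   integer, not the 32-bit one from the original float version. */
static int almost_equal_ulps(double a, double b, int64_t max_ulps)
{
    int64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);   /* reinterpret the bits without aliasing issues */
    memcpy(&ib, &b, sizeof ib);
    /* Map negative values so that integer order matches floating-point order. */
    if (ia < 0) ia = INT64_MIN - ia;
    if (ib < 0) ib = INT64_MIN - ib;
    uint64_t diff = (ia > ib) ? (uint64_t)ia - (uint64_t)ib
                              : (uint64_t)ib - (uint64_t)ia;
    return diff <= (uint64_t)max_ulps;
}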
Thanks again for the useful suggestion of using the scientific method when trying to debug a program.
Russ
Yes, a full example would be handy. Here is an old example I had hanging around using GSL's sgemm variant; it should be easy to adapt to double. Please try it and see if it gives the result shown in the GSL manual:
/* from the gsl info documentation in node 'gsl cblas examples' */
/* compile via 'gcc -o $file $file.c -lgslcblas' */
/* edd 15 Nov 2003 */
#include <stdio.h>
#include <gsl/gsl_cblas.h>
int
main (void)
{
    int lda = 3;
    float A[] = { 0.11, 0.12, 0.13,
                  0.21, 0.22, 0.23 };
    int ldb = 2;
    float B[] = { 1011, 1012,
                  1021, 1022,
                  1031, 1032 };
    int ldc = 2;
    float C[] = { 0.00, 0.00,
                  0.00, 0.00 };
    /* Compute C = A B */
    cblas_sgemm (CblasRowMajor,
                 CblasNoTrans, CblasNoTrans, 2, 2, 3,
                 1.0, A, lda, B, ldb, 0.0, C, ldc);
    printf ("[ %g, %g\n", C[0], C[1]);
    printf (" %g, %g ]\n", C[2], C[3]);
    return 0;
}