OpenCL: clBuildProgram Fails without log

OpenCL: clBuildProgram Fails without log - c

What I am trying to accomplish:
I am trying to render some stuff in OpenCL and write it to the OpenGL Framebuffer (As it is the only Framebuffer I can get via Renderbuffers etc., but I will gladly accept any others I could use - You will not help me tho by telling to use glsl shaders)
The Problem:
As the Title says, the OpenCL function clBuildProgram Fails with error -11 (CL_BUILD_PROGRAM_FAILURE). This wouldn't be an issue, but the Log from the CL compiler is empty. I doublechecked my log code, but it should be fine. I posted it down below none the less, just so you can see yourself.
What I tried to fix:
Googling, of course
Reading the docs from the Khronos Groups
Checking, if my device supports the "cl_khr_gl_sharing" extension, which it does (it is contained in the string returned from clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, retSize, extensions, &retSize);)
Modifying the shader/kernel:
Made some intentional errors, to see if the logging code actually works (which it does)
And minifying the shader/kernel, to see if some things in the shader do not work as they should (as I've read, that some missing things can make the compiler for cl crash)
What I found out:
From the last point of what I tried, I noticed, that the write_imageui and read_imageui opencl functions make the compiler fail to compile my code (this is why I checked for the "cl_khr_gl_sharing" extension)
Furthermore:
My operating System is Windows 10, the C Compiler I am using is GCC (I do not know, how that could help, since the host program compiles fine, but here it is none the less)
Some Code:
The Shader/Kernel (as minified as possible to reproduce the problem, I hope also for you; the last two lines of calls are what I think causes the opencl compiler to not work; the other stuff is in there, to make it a shader, that can actually process something, once it is working):
#define ScreenWidth 1000
#define ScreenHight 1000
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;
__kernel void rainbow(__read_write image2d_t asd) {
int i = get_global_id(0);
unsigned int x = i%ScreenWidth;
unsigned int y = i/ScreenHight;
uint4 pixel;
pixel = read_imageui(asd, sampler, (int2)(x, y));
write_imageui(asd, (int2)(x, y), pixel);
}
Minified Call Code (C) which does all the initialization stuff necessary (note: the log buffer is dynamically changeable):
cl_program program = clCreateProgramWithSource(contextZ, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);
size_t retSize = 0;
clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, NULL, &retSize);
char extensions[retSize];
clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, retSize, extensions, &retSize);
printf("%s\n", extensions);
// Build the program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret == CL_BUILD_PROGRAM_FAILURE) {
l_logError("Could not build Kernel!");
// Determine the size of the log
size_t log_size;
printf(" reta: %i\n", clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size));
// Allocate memory for the log
char *log = (char *) malloc(log_size);
// Get the log
printf(" retb: %i\n", clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL));
// Print the log
printf(" ret-val: %i\n", ret);
printf("%s\n", log);
}
You might be interested in the output (Last 2 lines are caused by the Kernel not beeing built. The program could be made from the source though - look at the code):
E: Could not build Kernel!
reta: 0
retb: 0
ret-val: -11
E: Could not create Kernel!
kernel error: -45
Did anybody else have a similar problem? Any Ideas, what I should do about it? Might there be a Header for the cl Kernel/Shader, which I need to include in it? It could be possible, that my clBuildProgram call is incorrect? (I read somebody did not pass a device, so maybe something else could be missing in my code)
Make sure to tell me, if you need further details, so I can provide them (I cannot think of any other you might need right now)
Thanks in Advance for your time!
EDIT:
According to the specification, a device needs to support the CL_DEVICE_IMAGE_SUPPORT extension, which it does
I checked it using this:
cl_bool image_support = CL_FALSE;
clGetDeviceInfo(device_id, CL_DEVICE_IMAGE_SUPPORT, sizeof(cl_bool), &image_support, NULL);
printf("image_support: %i\n", image_support);
Which outputs:
image_support: 1
aka. CL_TRUE
Edit 2:
It turns out, OpenCL extensions need to be enabled in the kernel: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/EXTENSION.html
Adding #pragma OPENCL EXTENSION all : enable in the first line of the kernel/shader does results in the same issue tho
EDIT 3:
Removing the __read_write flag from the kernel image paramters or replace it with something else (like __read_only) causes the OpenCL compiler to crash or loop infinetly, as clBuildProgram never returns (or will return in like a very long time)

What I found out in the last few days may have been slightly incorrect.
My Edit 3 (from the original post/question) states, that replacing the __read_write with an __read_only causes the compiler to completly crash. This is incorrect, as I can now confirm, I just didn't add any more debug-output code after the compilation call. Adding some more debug lines after the clBuildProgram call shows, that it actually works.
I do not know, why this causes the OpenCL compiler to output literally nothing as an error, and the vendors of the driver should definetly fix this/output something (Device info in comment), to make development somewhat easier. (Even just a warning will be helpful)
I found this stackoverflow post, discussing a similar problem: OpenCL - Pass image2d_t twice to get both read and write from kernel?. This is how I even figured out, that this can cause devastating problems.
To be fair, the official docs state this, but I understood it more like they can be combined to __read_write (read_only | write_only == __read_write):
aQual in the following table refers to one of the access qualifiers. For write functions this may be write_only or read_write.
I replaced my kernel with only a write_image call (write_imageui) and set the image2d_t to be __write_only, to get minimal debug possibilities. This caused the shader/kernel to compile succesfully, but the screen is still empty. But latter is a matter for another question.

Related

It's like OpenCL kernel instance ends abruptly

I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask some help.
Here's details.
My kernel is applied to images of different size (to be precise, each layer of the Laplacian pyramid).
I get normal results for images of larger size such as 3072 x 3072, 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a bottom limit for dimensions, causing this problem. So, I added printf to the beginning of the kernel as follows. It is confirmed that all necessary kernel instances are executed.
__kernel void GetValueOfB(/* parameters */)
{
uint xB = get_global_id(0);
uint yB = get_global_id(1);
printf("(%d, %d)\n", xB, yB);
// calculation code is omitted
}
So after wandering for a while, I added the same printf to the end of the kernel. When I did this, it was confirmed that printf works only for some pixel positions. For pixel positions not output by printf, the calculated values in the resulting image are incorrect, and as a result, I concluded that some kernel instances terminate abnormally before completing the calculations.
__kernel void GetValueOfB(/* parameters */)
{
uint xB = get_global_id(0);
uint yB = get_global_id(1);
printf("(%d, %d)\n", xB, yB);
// calculation code is omitted
printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
It seems that there is no problem with the calculation of the kernel. If I compile the kernel turning off the optimization with the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition to that, with NVIDA P4000, it works correct. Of course, in theses cases, I confirmed that the printf added at the bottom of the Kernel works for all pixels.
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
...
const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
NULL, globalSize, NULL, 0, NULL, NULL);
if (CL_SUCCESS != err)
return -1;
// I tried with this but it didn't make any difference
//std::this_thread::sleep_for(std::chrono::seconds(1));
err = clFinish(_pOpenCLManager->GetCommandQueue());
if (CL_SUCCESS != err)
return -1;
err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
if (CL_SUCCESS != err)
return -1;
...
}
And I tried with event, too, but it works the same way.
for all images
{
...
const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
cl_event event;
err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
if (CL_SUCCESS != err)
return -1;
err = clWaitForEvents(1, &event);
if (CL_SUCCESS != err)
return -1;
err = clFinish(_pOpenCLManager->GetCommandQueue());
if (CL_SUCCESS != err)
return -1;
err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
if (CL_SUCCESS != err)
return -1;
...
}
/////// Added contents ////////////////////////////////////////////
Would you guys please take look at this issue in the aspect of clFinsh, or clWaitEvent. Am I missing something in this regard?
Sometimes I get less correct values and sometimes I get more correct values.
To be more specific, let's say I'm applying the kernel to 12 x 12 size image. So there're 144 pixel values.
Sometime I get correct values for 56 pixels.
Sometime I get correct values for 89 pixels.
Some other time I get correct value for n(less then 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code with no modification(other then device select code) runs perfectly correctly with NVIDIA P4000.
At first, I was really suspicious about the calculation code, but more I inspect code, more I'm confident there's nothing wrong with calculation code.
I know there's still a chance that there is an error in the calculation code so that there happen some exceptions anywhere during calculations.
I have plain C++ code for same task. I'm comparing results from those two.
/////// Another added contents ////////////////////////////////////////////
I made a minimum code(except projects template) to reproduce the phenomenon.
What's odd more is that if I install "Intel® Distribution for GDB Target" I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows

OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped in workgroups, Workgroup size should be a multiple of 32; ideally 64 to make full use of the hardware, or 8x8 pixels in 2D. These workgroups cannot be split, so the global range must be a multiple of workgroup size.
What happens if global range is not clearly divisible by workgroup size, or smaller than workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot have global size as a multiple of workgroup size, there is still a solution: a guard clause in the very beginning of the kernel:
if(xB>=xImage||yB>=yImage) return;
This ensures that no threads access unallocated memory.

As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
2. Events
As you asked about events, flushing, etc. Your clFinish call certainly should suffice to ensure everything has executed - if anything, it's overkill, but especially while you're debugging other issues it's a good way to rule out queuing issue.
The clWaitForEvents() call preceeding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but could be a problem on some implementations.
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel-printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.

Thanks to a person from intel community, I could understand the phenomenon.
Briefly, if you spend to much time on a single kernel instance, 'Timeout Detection and Recovery(TDR)' stops the kernel instance.
For more information about this, you could refer to the followings.
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate for all the people who gave me advices.

Why do I get a stack smashing error using MPI_Bcast and -O3 compiler flag, but everything works without -O3?

I am pretty new to MPI, so apologies if this is simple.
I have some code from a month or two ago that has been working fine, but I decided to go back and revise it. (It was written when I was just starting out, and it's not a performance critical section.) The code basically generates a random graph on one process and then shares the results with all other processes. An excerpt from the baby's-first-steps version follows:
unsigned int *graph;
if (commrank == 0) {
graph = gengraph(params); //allocates graph memory in function
if (commsize > 1) {
for (int k=1; k<commsize; k++)
MPI_Send(graph, n*n, MPI_UNSIGNED, k, 0, MPI_COMM_WORLD);
}
} else {
MPI_Status recvStatus;
graph = malloc(sizeof(unsigned int)*n*n);
MPI_Recv(graph, n*n, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD, &recvStatus);
}
While obviously naive, this worked just fine for a while, before I chose to go back and do it in what I thought was the proper fashion:
if (commrank == 0) {
graph = gengraph(params);
MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
} else {
graph = malloc(sizeof(unsigned int)*n*n);
MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
}
The problem is, I keep getting when "stack smashing" errors in the second version when I compile with -O3 optimization, though it works fine when compiled unoptimized. Note that I have checked the graph allocation function multiple times and debugged it, and it appears to be fine. I have also debugged the second version, and it appears to work fine. The crash occurs later when I try to free the graph memory. (Note that this is not a double free error, and, again, it works fine in the naive implementation and has for some time.)
One final wrinkle: The first version also fails if, instead of using the recvStatus variable, I instead use MPI_STATUS_IGNORE. And, again, this only fails with -O3.
Any thoughts would be greatly appreciated. If it's any help, I'm using mpicc on top of gcc 7.5.0, but I imagine I'm doing something stupid rather than encountering a compiler problem.

I changed the mpicc compiler to Clang and used Address Sanitizer, per the suggestion of #hristo-iliev, and found an error in a subsequent MPI call (a recv with the wrong count size). This led to the undefined behavior. Notably, the address sanitizer pinpointed location of the error quite clearly, while valgrind only gave rather opaque indications that something was going on in MPI (as, well, it always does).
Apologies to the StackOverflow community for this, as the code above was not the culprit (not entirely surprising). It was just some standard undefined behavior due to sloppiness.

Member value lost when passing object by pointer

I am very new to the FreeBSD world and am currently porting my terminal emulation library from Linux to FreeBSD and Mac OS. I've encountered some very strange behavior such that when I pass a struct by pointer to a subroutine the member values become zeroed out. This does not happen on Linux or Mac OS. It also does not matter if the compiler is GCC or Clang.
I've confirmed that the member value is correct before the subroutine is called and the parent struct is passed by pointer.
I've tested the same code on Linux and Mac OS and they do not exhibit the problem.
I've switched between GCC and Clang on FreeBSD and that seems to have no effect.
I've consider that stack smashing could be happening but it seems unlikely because ulimit shows that the stack size on Linux is 8M but on FreeBSD it's much larger (524 MB). I've also tried compiling with -fstack-protector-strong but none of this matters.
#include "vterm.h"
#include "vterm_private" // vterm_t and vterm_desc_t defined here
void vterm_cursor_move_backward(vterm_t* vterm) {
vterm_desc_t* v_desc = NULL;
int min_row;
int idx;
// idx = vterm_buffer_get_active(vterm);
idx = 0; // hard set to 0 just for debugging
v_desc = &vterm->vterm_desc[idx];
// printf() will display a value of zero
printf("%d\n\r", v_desc->ccol);
fflush(stdout);
}
void vterm_interpret_ctrl_char(vterm_t* vterm, const char* data) {
vterm_desc_t *v_desc = NULL;
int idx;
char verb;
// idx = vterm_buffer_get_active(vterm);
idx = 0; // hard set to 0 just for debugging
v_desc = &vterm->vterm_desc[idx];
verb = data[0];
switch (verb) {
case '\b': {
// the following printf will print a positive number
printf("%d\n\r", v_desc->ccol);
fflush(stdout);
vterm_cursor_move_backward(vterm);
break;
}
}
}
I expect the value of v_desc->ccol to be identical in both functions. Godbolt Link Github Link See files vterm_ctrl_char.c and vterm_cursor.c

After countless hours of debugging I figured out that data in the vterm_desc_t structure was actually being shifted causing the member value to be set to zero. Although, the ncurses header file is included via vterm_private.h, on FreeBSD that doesn't seem to matter. Both GCC and Clang are happy to silently compile the vterm_cursor.c translation unit with bad / incomplete alignment.
I would recommend anyone running into kind of problem to try and compile each translation unit individually which is how I unearthed it. For example gcc -S vterm_cursor.c
Thank you to everyone who took a look at this.

cudaPrintfInit and cudaPrintfDisplay failed

I am working with GeForce 210, compute capability 1.2 and CUDA 6.5.
I wish to print float values from my CUDA kernel, I have included "cuPrintf.cu" and "cuPrintf.cuh" in my project directory as well as included them in my code. It compiles fine and runs without errors, but prints nothing. This is how I compile my code :
$ nvcc -arch=compute_12 test.cu
I read similar question and then surrounded my kernel with cudaPrintfInit() and cudaPrintfDisplay().
if(cudaPrintfInit() != cudaSuccess)
printf("cudaPrintfInit failed\n");
test_kernel<<<grid, block>>>(val);
if(cudaPrintfDisplay(stdout, true) != cudaSuccess)
printf("cudaPrintfDisplay failed\n");
cudaPrintfEnd();
My kernel looks like this:
__global__ void test_kernel (float val){
i = blockIdx.x*BLOCK_X + threadIdx.x;
j = blockIdx.y*BLOCK_Y + threadIdx.y;
if( j == 20 )
cuPrintf("%f is value, %d is j", val, j);
}
On compiling and running, the output is :
cudaPrintfInit failed
cudaPrintfDisplay failed
I guess there could be a problem with the way I am compiling, or cuPrintf does not allow float to be printed? According to the attached link of the similar question, the problem was with the threads per block exceeding a max value, but my block size is 16 x 16 (so that should not be the problem). cudaPrintfInit and cudaPrintfDisplay show failed!
I have also run the CUDA sample code "simplePrintf" which comes with the CUDA installation. That works perfectly. Help!

Formatted output is only supported by devices of compute capability 2.x and higher.
int printf(const char *format[, arg, ...]);
prints formatted output from a kernel to a host-side output stream.
Reference: CUDA C Programming Guide 2015, pag 119.
see this link: https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialHelloWorld

I could solve the problem by running with 'cuda-memcheck'. cudaPrintf was not working because 'nan' values were being generated in the kernel. The denominator in some computations was becoming zero, and when I avoided those cases, cudaPrintfInit and cudaPrintfDisplay started working.

libopcodes: get the size of a instruction

I have to find out the size of a instruction which I have in memory (actually, I have a small code segment in memory and want to get the size of the first instruction).
It took me some time to find libopcodes and libbfd. I red the headers and tried to come up with a simple solution but it seems like I missunderstood something since the program always crashes:
int main(int argc, char **argv) {
disassemble_info *dis = malloc(sizeof(*dis));
assert(dis != NULL);
dis->arch = bfd_arch_i386;
dis->read_memory_func = buffer_read_memory;
dis->buffer_length = 64;
dis->buffer = malloc(dis->buffer_length);
memset(dis->buffer, 0x90, dis->buffer_length);
disassemble_init_for_target(dis);
int instr_size = print_insn_i386(0, dis);
printf("instruction size is %d\n", instr_size);
return 0;
}
The expected result would be an instruction size of 1 (nop).
EDIT:
sorry guys, I'm a stupid person.
memset(dis, 0, sizeof(*dis));

There is some code in the Linux kernel you can steal. It should work well if copied into a user mode program.
Take a look at arch/x86/lib and arch/x86/tools
There's an opcode map file there, and an awk script that reads it to produce a table in a file named innat.c. There are some other files there that use the table to implement a decoder.
It is sufficient to determine instruction sizes.
This assumes you are ok with GPL, of course.

It looks like the disassemble_info data structure requires more initialization than you have provided. From examples I have been studying, the correct way to initialize is to call init_disassemble_info().
See if that helps. Failing that, compile your program with debug info ('-g') and run gdb to diagnose where the crash occurs.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight