It's like OpenCL kernel instance ends abruptly

I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In the process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask for some help.
Here are the details.
My kernel is applied to images of different sizes (to be precise, to each layer of a Laplacian pyramid).
I get normal results for larger images such as 3072 x 3072 or 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a lower limit on dimensions that caused this problem, so I added a printf to the beginning of the kernel as follows. This confirmed that all the required kernel instances are launched.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
}
After wandering around for a while, I added the same printf to the end of the kernel. This time, the printf fired only for some pixel positions. For the pixel positions with no printf output, the calculated values in the resulting image are incorrect, so I concluded that some kernel instances terminate abnormally before completing their calculations.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
    printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
It seems that there is no problem with the calculation in the kernel itself. If I compile the kernel with optimization turned off via the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition, on an NVIDIA P4000 it works correctly. Of course, in these cases I confirmed that the printf added at the bottom of the kernel fires for all pixels.
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
                                 NULL, globalSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -1;

    // I tried with this but it didn't make any difference
    //std::this_thread::sleep_for(std::chrono::seconds(1));

    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;

    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
                              0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols * vtMatB_GPU_LLP[nLayerIndex].rows,
                              vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
I also tried using an event, but it behaves the same way.
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    cl_event event;
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
    if (CL_SUCCESS != err)
        return -1;

    err = clWaitForEvents(1, &event);
    if (CL_SUCCESS != err)
        return -1;

    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;

    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
                              0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols * vtMatB_GPU_LLP[nLayerIndex].rows,
                              vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
/////// Added contents ////////////////////////////////////////////
Would you please take a look at this issue with respect to clFinish or clWaitForEvents? Am I missing something in this regard?
Sometimes I get fewer correct values and sometimes I get more.
To be more specific, let's say I'm applying the kernel to a 12 x 12 image, so there are 144 pixel values.
Sometimes I get correct values for 56 pixels.
Sometimes I get correct values for 89 pixels.
Some other time I get correct values for n (less than 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying the -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code, with no modification other than the device-selection code, runs perfectly on an NVIDIA P4000.
At first, I was really suspicious of the calculation code, but the more I inspect it, the more confident I am that there's nothing wrong with it.
I know there's still a chance that there is an error in the calculation code that triggers some exception during the calculations.
I have plain C++ code for the same task, and I'm comparing the results from the two.
/////// Another added contents ////////////////////////////////////////////
I made a minimal piece of code (apart from the project template) that reproduces the phenomenon.
What's even odder is that if I install the "Intel® Distribution for GDB Target", I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows

OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped into workgroups. The workgroup size should be a multiple of 32, ideally 64 (or 8x8 in 2D) to make full use of the hardware. These workgroups cannot be split, so the global range must be a multiple of the workgroup size.
What happens if the global range is not evenly divisible by the workgroup size, or is smaller than the workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 threads work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot make the global size a multiple of the workgroup size, there is still a solution: a guard clause at the very beginning of the kernel:
if(xB>=xImage||yB>=yImage) return;
This ensures that no threads access unallocated memory.
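For illustration, here is a rough sketch of that pattern (not your actual code - queue, cols and rows are placeholders, and xImage/yImage would be extra kernel arguments you pass yourself): round the global range up to a multiple of the chosen local size on the host, and let the guard clause discard the padding threads.
// Host side (sketch): round the global range up to a multiple of the local size
const size_t local[2]  = { 8, 8 };
const size_t global[2] = { ((cols + local[0] - 1) / local[0]) * local[0],
                           ((rows + local[1] - 1) / local[1]) * local[1] };
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

// Kernel side (sketch): padding threads return before touching any memory
__kernel void GetValueOfB(/* other parameters, */ uint xImage, uint yImage)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    if (xB >= xImage || yB >= yImage)
        return; // guard clause
    // ... calculation ...
}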

As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
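For example, a sketch with hypothetical names (flagBuf would be an extra zero-initialised buffer you create and pass yourself, width the image width):
__kernel void GetValueOfB(/* parameters, */ __global uchar* flagBuf, uint width)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    // ... calculation ...
    flagBuf[yB * width + xB] = 1; // this work-item reached the end of the kernel
}
On the host, read flagBuf back with clEnqueueReadBuffer after clFinish and count the entries that are still 0 - those are the work-items that never reached the end.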
2. Events
As you asked about events, flushing, etc.: your clFinish call should certainly suffice to ensure everything has executed - if anything, it's overkill, but while you're debugging other issues it's a good way to rule out queuing problems.
The clWaitForEvents() call preceding it is not a great idea, though, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but it could be a problem on some implementations.
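If you do want to keep the event-based wait, a sketch of the safer ordering (queue is a placeholder for your command queue) would be:
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
clFlush(queue);                   // make sure the command is actually submitted
err = clWaitForEvents(1, &event);
clReleaseEvent(event);            // note: the posted code never releases the event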
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't trust in-kernel printf when something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.

Thanks to a person from the Intel community, I was able to understand the phenomenon.
Briefly, if a single kernel instance takes too much time, Windows' Timeout Detection and Recovery (TDR) stops it.
For more information, you can refer to the following:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate all the people who gave me advice.
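For anyone who hits the same limit and would rather not change the TDR registry keys, one common workaround is to split a long-running NDRange into several smaller submissions via the global_work_offset argument, so that each submission finishes well within the timeout. A rough sketch with placeholder names (queue, kernel, rows, cols), not my actual code:
const size_t tileRows = 256; // tune so a single slice stays well under the TDR limit
for (size_t row = 0; row < rows; row += tileRows)
{
    const size_t offset[2]    = { 0, row };
    const size_t sliceSize[2] = { cols, (rows - row < tileRows) ? (rows - row) : tileRows };
    err = clEnqueueNDRangeKernel(queue, kernel, 2, offset, sliceSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -1;
    clFinish(queue); // give the driver a chance to service the display between slices
}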

Related

Value is wrong first time pointer is dereferenced but correct after that

I have a ZYNQ Ultrascale+ MPSoC Genesys ZU dev board that I'm running my application on. I have an accelerator in the PL that is connected to the PS through a simple AXI DMA. The DMA reads the DDR memory through a normal, non-coherent, FPD slave port on the PS. The application is running on one of the A53 cores in the PS.
I've verified with an ILA that the data being written to the AXI slave port is correct. However, some of the data I'm reading back in software was incorrect. At least part of the issue before was the cache in the A53. As a temporary solution I've disabled the D-cache at the start of the program so there should be no issues there anymore. Now though, the first time I try to print/read from the array of data I receive, I get an incorrect value. Subsequent reads return the correct value. What gives? How is this happening?
Using the Vitis debugger/memory viewer, I've verified that the correct data is present at the memory location I allocated and told the DMA to write to.
Below is a watered down version of the program, removing much of the program that has no issues.
#define CACHE_LINE_SIZE 64

int main(void)
{
    Xil_DCacheDisable();

    //A bunch of DMA initialization
    ...

    //Send data to accelerator through DMA, no issues here
    ...

    float* outputCorrelation;
    const size_t outputCorrelationSizeBytes = sizeof(*outputCorrelation) * 80;
    outputCorrelation = aligned_alloc(CACHE_LINE_SIZE, outputCorrelationSizeBytes);
    if(outputCorrelation == NULL) {
        printf("Aligned Malloc failed\n");
        return XST_FAILURE;
    }

    //Initiate data receive transfer first
    int result = XAxiDma_SimpleTransfer(&axiDma, (UINTPTR) outputCorrelation, outputCorrelationSizeBytes, XAXIDMA_DEVICE_TO_DMA);
    if(result != XST_SUCCESS) {
        return result;
    }

    //Send data - assembledData allocation isn't shown as no problems here
    result = XAxiDma_SimpleTransfer(&axiDma, (UINTPTR) assembledData, sizeof(*assembledData) * inLen, XAXIDMA_DMA_TO_DEVICE);
    if(result != XST_SUCCESS) {
        return result;
    }

    //Wait for completion interrupts from DMA
    ...

    for(size_t x = 0; x < 80; x++) {
        printf("[%zu]\t%f\n", x, outputCorrelation[x]);
    }
}
The expected output is the value 4 for every element of the array.
Output:
[0] -nan
[1] 4.000000
[2] 4.000000
[3] 4.000000
[4] 4.000000
...
[79] 4.000000
If I add a print of any element of the array before the for loop, the first value becomes correct and all the values printed in the loop are perfect. What's going on here and how can I solve it?
Edit:
I had a thought that the compiler might be optimizing away the read, since none of the functions directly write to the allocated array, so I tried marking the output buffer as volatile. This did not change the behavior.
I did some more testing with my PL accelerator and tried connecting it to the LPD ports of the PS so I could use the RPU instead of the APU. Using the exact same code on the RPU instead of the APU yielded the expected result. I have a suspicion there are still some cache-coherency issues even though I disabled the D-cache when running on the APU.
Something I also didn't mention earlier is that when I single-step through my code, the issue does not exist. When still using the debugger but running through the critical sections, the issue does exist.
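For context on the cache-coherency suspicion above: with the D-cache enabled, the usual pattern on the standalone BSP is to invalidate the DMA buffer's cache lines after the device-to-memory transfer completes and before reading the buffer. A sketch (the exact xil_cache.h prototype varies slightly between BSP versions):
#include "xil_cache.h"

/* ... after the DEVICE_TO_DMA transfer has completed ... */
Xil_DCacheInvalidateRange((UINTPTR) outputCorrelation, outputCorrelationSizeBytes);

for (size_t x = 0; x < 80; x++) {
    printf("[%zu]\t%f\n", x, outputCorrelation[x]);
}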

OpenCL: clBuildProgram Fails without log

What I am trying to accomplish:
I am trying to render some stuff in OpenCL and write it to the OpenGL framebuffer (as it is the only framebuffer I can get at via renderbuffers etc., but I will gladly accept any other I could use - telling me to use GLSL shaders will not help, though).
The Problem:
As the title says, the OpenCL function clBuildProgram fails with error -11 (CL_BUILD_PROGRAM_FAILURE). This wouldn't be an issue, but the log from the CL compiler is empty. I double-checked my logging code, but it should be fine. I posted it below nonetheless, just so you can see for yourself.
What I tried to fix:
Googling, of course
Reading the docs from the Khronos Group
Checking whether my device supports the "cl_khr_gl_sharing" extension, which it does (it is contained in the string returned from clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, retSize, extensions, &retSize);)
Modifying the shader/kernel:
Made some intentional errors, to see if the logging code actually works (which it does)
And minifying the shader/kernel, to see if some things in the shader do not work as they should (as I've read that certain missing things can make the CL compiler crash)
What I found out:
From the last point above, I noticed that the write_imageui and read_imageui OpenCL functions make the compiler fail to compile my code (this is why I checked for the "cl_khr_gl_sharing" extension)
Furthermore:
My operating system is Windows 10, and the C compiler I am using is GCC (I do not know how that could help, since the host program compiles fine, but here it is nonetheless).
Some Code:
The shader/kernel (minified as much as possible while still reproducing the problem, I hope for you as well; the last two calls are what I think causes the OpenCL compiler to fail; the other stuff is in there to make it a shader that can actually process something once it is working):
#define ScreenWidth 1000
#define ScreenHight 1000

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;

__kernel void rainbow(__read_write image2d_t asd) {
    int i = get_global_id(0);
    unsigned int x = i%ScreenWidth;
    unsigned int y = i/ScreenHight;

    uint4 pixel;
    pixel = read_imageui(asd, sampler, (int2)(x, y));
    write_imageui(asd, (int2)(x, y), pixel);
}
Minified Call Code (C) which does all the initialization stuff necessary (note: the log buffer is dynamically changeable):
cl_program program = clCreateProgramWithSource(contextZ, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);

size_t retSize = 0;
clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, NULL, &retSize);
char extensions[retSize];
clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, retSize, extensions, &retSize);
printf("%s\n", extensions);

// Build the program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret == CL_BUILD_PROGRAM_FAILURE) {
    l_logError("Could not build Kernel!");

    // Determine the size of the log
    size_t log_size;
    printf(" reta: %i\n", clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size));

    // Allocate memory for the log
    char *log = (char *) malloc(log_size);

    // Get the log
    printf(" retb: %i\n", clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL));

    // Print the log
    printf(" ret-val: %i\n", ret);
    printf("%s\n", log);
}
You might be interested in the output (the last 2 lines are caused by the kernel not being built; the program could be created from the source, though - see the code):
E: Could not build Kernel!
reta: 0
retb: 0
ret-val: -11
E: Could not create Kernel!
kernel error: -45
Did anybody else have a similar problem? Any ideas what I should do about it? Might there be a header for the CL kernel/shader which I need to include in it? Could it be that my clBuildProgram call is incorrect? (I read that somebody did not pass a device, so maybe something else could be missing in my code.)
Make sure to tell me if you need further details, so I can provide them (I cannot think of any others you might need right now).
Thanks in Advance for your time!
EDIT:
According to the specification, a device needs to report image support (CL_DEVICE_IMAGE_SUPPORT), which mine does
I checked it using this:
cl_bool image_support = CL_FALSE;
clGetDeviceInfo(device_id, CL_DEVICE_IMAGE_SUPPORT, sizeof(cl_bool), &image_support, NULL);
printf("image_support: %i\n", image_support);
Which outputs:
image_support: 1
aka. CL_TRUE
Edit 2:
It turns out OpenCL extensions need to be enabled in the kernel: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/EXTENSION.html
Adding #pragma OPENCL EXTENSION all : enable in the first line of the kernel/shader results in the same issue, though.
EDIT 3:
Removing the __read_write qualifier from the kernel's image parameter, or replacing it with something else (like __read_only), causes the OpenCL compiler to crash or loop infinitely, as clBuildProgram never returns (or would only return after a very long time).
What I found out in the last few days may have been slightly incorrect.
My Edit 3 (from the original post/question) states that replacing __read_write with __read_only causes the compiler to completely crash. This is incorrect, as I can now confirm; I just didn't add any more debug-output code after the compilation call. Adding some more debug lines after the clBuildProgram call shows that it actually works.
I do not know why this causes the OpenCL compiler to output literally nothing as an error, and the vendors of the driver should definitely fix this or output something (device info in comment) to make development somewhat easier. (Even just a warning would be helpful.)
I found this Stack Overflow post discussing a similar problem: OpenCL - Pass image2d_t twice to get both read and write from kernel?. This is how I figured out that this can cause devastating problems.
To be fair, the official docs state this, but I understood it more as meaning they can be combined into __read_write (read_only | write_only == __read_write):
aQual in the following table refers to one of the access qualifiers. For write functions this may be write_only or read_write.
I replaced my kernel with only a write_image call (write_imageui) and set the image2d_t to be __write_only, to get minimal debug possibilities. This made the shader/kernel compile successfully, but the screen is still empty. The latter, however, is a matter for another question.
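For illustration, here is a sketch of the two-image variant that the linked question discusses: separate source and destination images, so each argument needs only a single access qualifier (on OpenCL 1.x the same image generally cannot be both read and written in one kernel). Names are illustrative, and the same sampler declaration as above is assumed:
__kernel void rainbow(__read_only image2d_t src, __write_only image2d_t dst) {
    int i = get_global_id(0);
    int x = i % ScreenWidth;
    int y = i / ScreenWidth;

    uint4 pixel = read_imageui(src, sampler, (int2)(x, y)); // read from the source image
    write_imageui(dst, (int2)(x, y), pixel);                // write to a separate destination image
}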

OpenCL - Local Memory

I do understand what the difference between global and local memory is in general.
But I have problems using local memory.
1) What has to be considered when transforming global-memory variables into local-memory variables?
2) How do I use local barriers?
Maybe someone can help me with a little example.
I tried to do a Jacobi computation using local memory, but I only get 0 as the result. Maybe someone can give me some advice.
Working Solution:
#define IDX(_M,_i,_j) (_M)[(_i) * N + (_j)]
#define U(_i, _j) IDX(uL, _i, _j)

__kernel void jacobi(__global VALUE* u, __global VALUE* f, __global VALUE* tmp, VALUE factor) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int iL = get_local_id(0);
    int jL = get_local_id(1);

    __local VALUE uL[(N+2)*(N+2)];
    __local VALUE fL[(N+2)*(N+2)];

    IDX(uL, iL, jL) = IDX(u, i, j);
    IDX(fL, iL, jL) = IDX(f, i, j);

    barrier(CLK_LOCAL_MEM_FENCE);

    IDX(tmp, i, j) = (VALUE)0.25 * ( U(iL-1, jL) + U(iL, jL-1) + U(iL, jL+1) + U(iL+1, jL) - factor * IDX(fL, iL, jL));
}
Thanks.
1) Query the CL_DEVICE_LOCAL_MEM_SIZE value; it is 16 kB minimum and increases on different hardware. If your local variables can fit in it, and if they are re-used many times, you should put them in local memory before use. Even if you don't, the automatic use of the L2 cache when accessing a GPU's global memory can still be effective for utilization of the cores.
If the global-to-local copy takes a significant slice of time, you can do an async work-group copy while the cores are calculating other things.
Another important point: more free local memory means more concurrent threads per core. If a GPU has 64 cores per compute unit, only 64 threads can run when all local memory is used. When more of it is free, 128, 192, ... 2560 threads can run at the same time if there are no other limitations.
A profiler can show the bottlenecks, so you can decide whether it is worth a try or not.
For example, a naive matrix-matrix multiplication using nested loops relies on the L1/L2 caches, but sub-matrices can fit in local memory. Maybe 48x48 sub-matrices of floats fit in a mid-range graphics card's compute unit and can be used N times for the whole calculation before being replaced by the next sub-matrix.
Querying CL_DEVICE_LOCAL_MEM_TYPE can return LOCAL or GLOBAL; if it returns GLOBAL, using local memory is not recommended.
Lastly, the size of any memory allocation (except __private) must be known at compile time (for the device, not the host), because the compiler must know how many wavefronts can be issued to achieve maximum performance (and/or apply other optimizations). That is why no recursive functions are allowed in OpenCL 1.2. But you can copy a function and rename it N times to get pseudo-recursion.
2) Barriers are a meeting point for all threads in a workgroup. Similar to cyclic barriers, they all stop there and wait for each other before continuing. If it is a local barrier, all workgroup threads finish any local memory operations before departing from that point. If you want to write some numbers 1, 2, 3, 4, ... to a local array, you can't be sure whether all threads have finished writing them until a local barrier is passed; after that, it is certain that the array already holds its final values.
All workgroup threads must hit the same barrier. If one cannot reach it, the kernel gets stuck or you get an error.
__local int localArray[64];  // not per thread - shared by all threads
                             // in the workgroup (per compute unit)

if(localThreadId!=0)
    localArray[localThreadId]=localThreadId; // 64 values written in O(1)
// not sure whether the 2nd thread is done writing, just like the last thread

if(localThreadId==0) // 1st core of each compute unit loads from VRAM
    localArray[localThreadId]=globalArray[globalThreadId];

barrier(CLK_LOCAL_MEM_FENCE); // probably all threads wait for the 1st thread
                              // (maybe even the 1st SIMD or
                              // could be even the whole 1st wavefront!)

// here all threads have written their own id to the local array. safe to read.
// except the first element, which is a value from global memory
// let's add that value to all other values
if(localThreadId!=0)
    localArray[localThreadId]+=localArray[0];
Working example(local work group size=64):
inputs: 0,1,2,3,4,0,0,0,0,0,0,..
__kernel void vecAdd(__global float* x)
{
    int id = get_global_id(0);
    int idL = get_local_id(0);

    __local float loc[64];
    loc[idL] = x[id];
    barrier(CLK_LOCAL_MEM_FENCE);

    float distance_square_sum = 0;
    for(int i = 0; i < 64; i++)
    {
        float diff = loc[idL] - loc[i];
        float diff_squared = diff * diff;
        distance_square_sum += diff_squared;
    }
    x[id] = distance_square_sum;
}
output: 30, 74, 246, 546, 974, 30, 30, 30...

Calculating the right number of workgroups and their size OpenCL

I'm new to OpenCL and I'm trying to understand this example program written by Apple here.
The goal of the program is to calculate the square of each element of an input array and write the result in a new array.
You can see that the input array has dimension 1024. The number of work groups is 1024 and the size of each of them is the maximum CL_KERNEL_WORK_GROUP_SIZE.
Can anybody explain to me what the point is of using so many work-items in each work group if there's no get_local_id() call in the kernel? Could they have used 1 as the size of each work group? What would the difference have been?
Thanks.
Some code to show the point:
// Get the maximum work group size for executing the kernel on the device
//
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
// Execute the kernel over the entire range of our 1d input data set
// using the maximum number of work group items for this device
//
global = count;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
Your global work size is executed in chunks of the local work size (in theory). If you set 1 as your local work group size, then only 1 thread is executed in each local work group. On GPUs, work groups map onto compute units - if you have a work group size of 1, your single thread may occupy a whole compute unit. This is really, really horribly slow.
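If the kernel uses neither get_local_id() nor local memory, another option (a sketch, assuming the same commands, kernel and count variables as in the Apple example) is to pass NULL for the local work size and let the implementation choose one:
size_t global = count;
// NULL local_work_size: the OpenCL implementation picks a suitable work-group size
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);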

libopcodes: get the size of a instruction

I have to find out the size of an instruction which I have in memory (actually, I have a small code segment in memory and want to get the size of the first instruction).
It took me some time to find libopcodes and libbfd. I read the headers and tried to come up with a simple solution, but it seems like I misunderstood something, since the program always crashes:
int main(int argc, char **argv) {
    disassemble_info *dis = malloc(sizeof(*dis));
    assert(dis != NULL);

    dis->arch = bfd_arch_i386;
    dis->read_memory_func = buffer_read_memory;
    dis->buffer_length = 64;
    dis->buffer = malloc(dis->buffer_length);
    memset(dis->buffer, 0x90, dis->buffer_length);

    disassemble_init_for_target(dis);

    int instr_size = print_insn_i386(0, dis);
    printf("instruction size is %d\n", instr_size);
    return 0;
}
The expected result would be an instruction size of 1 (nop).
EDIT:
Sorry guys, silly mistake on my part - the missing piece was zero-initializing the struct:
memset(dis, 0, sizeof(*dis));
There is some code in the Linux kernel you can steal. It should work well if copied into a user-mode program.
Take a look at arch/x86/lib and arch/x86/tools.
There's an opcode map file there, and an awk script that reads it to produce a table in a file named inat.c. There are some other files there that use the table to implement a decoder.
It is sufficient for determining instruction sizes.
This assumes you are ok with GPL, of course.
It looks like the disassemble_info data structure requires more initialization than you have provided. From examples I have been studying, the correct way to initialize is to call init_disassemble_info().
See if that helps. Failing that, compile your program with debug info ('-g') and run gdb to diagnose where the crash occurs.
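A minimal sketch of that approach (assuming an older binutils where init_disassemble_info() takes three arguments; newer releases add a fourth, styled-fprintf callback):
#include <dis-asm.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char buffer[64];
    memset(buffer, 0x90, sizeof(buffer));       /* 64 NOPs */

    disassemble_info dis;
    /* zeroes the struct and installs sane defaults, including buffer_read_memory */
    init_disassemble_info(&dis, stdout, (fprintf_ftype) fprintf);
    dis.arch = bfd_arch_i386;
    dis.mach = bfd_mach_i386_i386;
    dis.buffer = buffer;
    dis.buffer_length = sizeof(buffer);
    dis.buffer_vma = 0;
    disassemble_init_for_target(&dis);

    int instr_size = print_insn_i386(0, &dis);  /* disassemble the instruction at vma 0 */
    printf("\ninstruction size is %d\n", instr_size);
    return 0;
}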
