How to write/read a single float value (buffer) from an OpenCL device in C

There are lots of questions about how to read an array from the device, but I only want to read a single float value from the device. Or is it only possible to read arrays from the device?
I create a buffer for (float) sum like below.
ocl.sum = clCreateBuffer(context, CL_MEM_READ_WRITE, 1, NULL, &err);
Set the arg like this.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &ocl.arr);
clSetKernelArg(kernel, 1, sizeof(cl_float), &ocl.sum);
In the kernel, I calculate the sum.
kernel calculate(global arr, float sum)
{
...
sum = 100.0f;
}
How can I get the sum from the device?
float result = 0.f;
err = clEnqueueReadBuffer(queue, ocl.sum, CL_TRUE, 0, 1, &result, 0, NULL, NULL);
print(result);

Reading from the device, whether for a single value or a whole array, has to go via global memory. So the kernel signature has to be kernel calculate(..., global float *sum), and the kernel writes *sum = 100.0f;. Note also that the buffer must be created with room for the value (sizeof(cl_float), not 1 byte), and the argument must be set with sizeof(cl_mem), since what you pass is the buffer handle. Then you read it from the device the way you posted, by passing &result to clEnqueueReadBuffer with a size of sizeof(float).
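Put together, a minimal sketch (variable names follow the question; error checks elided):

ocl.sum = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float), NULL, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &ocl.arr);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &ocl.sum);  // pass the cl_mem handle, not a float

// In the kernel, write the result through the pointer into global memory:
// kernel void calculate(global float *arr, global float *sum)
// {
//     ...
//     *sum = 100.0f;
// }

float result = 0.f;
err = clEnqueueReadBuffer(queue, ocl.sum, CL_TRUE, 0, sizeof(float),
                          &result, 0, NULL, NULL);  // blocking read of one float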

Related

OpenCL program using too much memory

I have successfully implemented a Digital Down Converter using OpenCL. When implementing the interpolation part, I can only set a maximum factor of 146. Anything more crashes the program and throws the error code CL_INVALID_MEM_OBJECT (-38).
For those who don't know, interpolation is a method of constructing new data points within the range of known data points. The DDC, or digital down converter, is used to increase or decrease the sampling rate while trying to reconstruct the data points with a reconstruction filter.
Note that I am using a 1.75 MB WAV file as input. It is sampled at 44100 Hz, and my goal is to resample it at 48000 Hz (Blu-ray quality), which gives an interpolation/decimation factor of 160/147 (44100 * 160 / 147 = 48000). But any interpolation factor over 146 crashes the drivers and the program, and the -38 error is thrown as shown above.
I think the problem lies where I create the cl_mem buffers. I have about 7; here is how they are initialised and used. Assume P is 3 and Q is 2, while num_items is 918222 samples:
input = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    num_items * sizeof(float),
    NULL,
    &status);
output = clCreateBuffer(
    context,
    CL_MEM_WRITE_ONLY,
    num_items * P * sizeof(float),
    NULL,
    &status);
//Lowpass kernel parameters
inputForLowpass = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    num_items * P * sizeof(float),
    NULL,
    &status);
outputFromLowpass = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    num_items * P * sizeof(float),
    NULL,
    &status);
//Decimate kernel parameters
inputForDecimate = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    num_items * P * sizeof(float),
    NULL,
    &status);
outputFromDecimate = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    (int)(num_items * (P*1.0 / Q) * sizeof(float)),
    NULL,
    &status);
//numOfCoefficients for number of taps
coeff = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    numOfCoefficients * sizeof(float),
    NULL,
    &status);
I used the memory debugger in Visual Studio and found that the program uses 602 MB for an interpolation factor of 160, just before it crashes. (It used around 120 MB for a factor of 3, which is still a lot!) How can I bring this down? Am I using the buffers in an incorrect manner?
On top of this, I have three other memory allocations in the host code. Array simply holds the values from the WAV file, while OutputData and OutputData2 store values from the filtered input and the decimated input respectively.
Array = (float*)malloc(num_items * sizeof(float));
OutputData = (float*)malloc(num_items * P * sizeof(float));
OutputData2 = (float*)malloc((int)(num_items * (P*1.0 / Q) * sizeof(float)));
The following is an image of the memory usage in Visual Studio when P=3 (the array size is increased by a factor of 3).
Here is one of the write buffers where I get the -38 code.
status = clEnqueueWriteBuffer(
    cmdQueue,
    inputForLowpass,
    CL_FALSE,
    0,
    num_items * P * sizeof(float),
    OutputData,
    0,
    NULL,
    NULL);
printf("Input enqueueWriteBuffer for Lowpass Kernel status: %i \n", status);
Here is the lowpass kernel:
__kernel void lowpass(__global float *Array, __global float *coefficients,
                      __global float *Output, __const int numOfCoefficients)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    int min_i = max((numOfCoefficients - 1), globalId) - (numOfCoefficients - 1);
    int max_i = min_i + numOfCoefficients;
    for (int i = min_i; i < max_i; i++)
    {
        sum += Array[i] * coefficients[globalId - i];
    }
    //sum = min(sum, 0.999969482421875f);
    //sum = max(sum, -1.0f);
    Output[globalId] = sum;
}
EDIT
The error occurs because the buffer I allocate is over 512 MB, which is the maximum buffer size I can have. To fix this issue, I will have to implement some sort of memory management in my code, perhaps by processing through an 8 MB buffer at a time (see the sketch below).
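As an illustration of that idea only (a sketch: the 8 MB chunk size follows the edit, but the loop structure is an assumption, and a FIR lowpass would also need numOfCoefficients - 1 samples of overlap between chunks, which is elided here):

#define CHUNK_BYTES (8 * 1024 * 1024)  /* 8 MB per transfer */
size_t chunk = CHUNK_BYTES / sizeof(float);
size_t total = (size_t)num_items * P;  /* size of OutputData in floats */
for (size_t off = 0; off < total; off += chunk) {
    size_t count = (total - off < chunk) ? (total - off) : chunk;
    /* inputForLowpass would now be created with only chunk * sizeof(float) bytes */
    status = clEnqueueWriteBuffer(cmdQueue, inputForLowpass, CL_TRUE, 0,
                                  count * sizeof(float), OutputData + off, 0, NULL, NULL);
    /* ... set args, run the lowpass kernel over count items, read this chunk back ... */
}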
Your memory objects are not explicitly backed by anything, and that might be your issue. I'd recommend adding a flag from
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateBuffer.html
such as CL_MEM_USE_HOST_PTR, CL_MEM_COPY_HOST_PTR, or CL_MEM_ALLOC_HOST_PTR, and then specifying a host pointer. With CL_MEM_USE_HOST_PTR, the buffer is backed by the host allocation you already have, so if the buffer ends up in host memory it is not allocated a second time, which may help if your issue is just memory usage.
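A minimal sketch of that suggestion, reusing the question's OutputData allocation as the backing store (whether the runtime really avoids a second copy is implementation-dependent):

inputForLowpass = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,  /* buffer backed by existing host memory */
    num_items * P * sizeof(float),
    OutputData,                              /* host pointer; must outlive the buffer */
    &status);
/* No separate clEnqueueWriteBuffer is needed; the runtime sources the data from OutputData. */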

clEnqueueCopyBuffer with offset 1

To optimize a kernel I need to make a copy of a cl_mem object with an offset.
count_buffer3[n] = count_buffer[n+1]
is the desired result
Looking at the specification of clEnqueueCopyBuffer, it seems to be possible with a simple argument.
cl_int clEnqueueCopyBuffer(cl_command_queue command_queue,
                           cl_mem src_buffer,
                           cl_mem dst_buffer,
                           size_t src_offset,
                           size_t dst_offset,
                           size_t cb,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)
My idea was to set dst_offset to 1, so that count_buffer[0] goes to count_buffer3[1].
In my case the command looks like:
clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 1, 0, (inCount1 + 1) * sizeof(int), NULL, NULL, NULL);
So I want to copy count_buffer to count_buffer3 with an offset of 1.
The result should be like this:
count_buffer[1] = 2
count_buffer[2] = 12
count_buffer[3] = 26
count_buffer3[1] = 12
count_buffer3[2] = 26
Unfortunately, if dst_offset is 1 as shown in the example, my complete count_buffer3 object contains only "0" as int values.
If the offset is 0, the copy works fine and both count buffers are identical.
Additional Information:
Here is the initialisation of the cl_mem objects:
cl_mem count_buffer3 = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err);
errWrapper("create Buffer", err);
cl_mem count_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err);
errWrapper("create Buffer", err);
I am using Intel INDE Update 2 with Visual Studio 2013.
Am I doing something wrong here, or should the copy with offset work like this?
Edit:
I reduced the buffer size by one and the result changes.
Instead of all "0" I now get some very large numbers.
example from debug:
count_buffer[0] = 0
count_buffer[1] = 31
count_buffer[2] = 31
count_buffer3[0] = 520093696
count_buffer3[1] = 520093696
count_buffer3[2] = 520093696
It is an improvement over the "0" values, but still wrong.
Any ideas?
Thanks for the answer so far!
It's very likely clEnqueueCopyBuffer returns an error which you don't check. According to the manual:
CL_INVALID_VALUE is returned if src_offset, dst_offset, cb, src_offset + cb, or dst_offset + cb require accessing elements outside the buffer memory objects.
which seems to be your case.
You probably want to pass a copy size one element smaller than the size of your buffer:
clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 1, 0,
                    inCount1 * sizeof(int),  /* <-- one element less than the buffer */
                    0, NULL, NULL);
The offset is in bytes. You probably want an offset of sizeof count_buffer[0] and a size of (n - 1) * sizeof count_buffer[0]:
clEnqueueCopyBuffer(
    command_queue, count_buffer, count_buffer3,
    sizeof(cl_int), 0,
    inCount1 * sizeof(cl_int),
    0, NULL, NULL);
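For completeness, a sketch of the corrected call plus a read-back to verify the shifted copy (assuming the errWrapper helper and the names from the question):

err = clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3,
                          sizeof(cl_int),  /* src_offset in bytes: skip element 0 */
                          0,               /* dst_offset in bytes */
                          inCount1 * sizeof(cl_int),
                          0, NULL, NULL);
errWrapper("copy buffer", err);

/* Read back and check that count_buffer3[n] == count_buffer[n+1]. */
int *check = (int *)malloc(inCount1 * sizeof(int));
err = clEnqueueReadBuffer(command_queue, count_buffer3, CL_TRUE, 0,
                          inCount1 * sizeof(int), check, 0, NULL, NULL);
errWrapper("read buffer", err);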

Why is clEnqueueMapBuffer returning random data?

I am trying to learn about clEnqueueMapBuffer in OpenCL by writing a kernel which finds the square of the values in an input buffer, but returns only two items at a time in the output buffer using clEnqueueMapBuffer. As I understand it, this function returns a pointer in the host's memory which points to the buffer's memory on the device. Then clEnqueueUnmapMemObject must unmap this buffer to allow the kernels to continue their computations. Now, when I call clEnqueueMapBuffer, it returns random data.
Here is my kernel
__kernel void testStream(
    __global int *input_vector,
    __global int *output_vector,
    __global int *mem_flag) // informs the host when the workload is finished
{
    mem_flag[0] = 1;
}
and my source
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <CL/opencl.h>
#include "utils.h"
int main(void)
{
    unsigned int n = 24;
    int BUFF_SIZE = 2;
    // Input and output vectors
    int num_bytes = sizeof(int) * n;
    int *output_buffer = (int *) malloc(num_bytes);
    int output_buffer_offset = 0;
    int *mapped_data = NULL;
    // use mapped_flag for determining if the job on the device is finished
    int *mapped_flag = NULL;
    int *host_in = (int *) malloc(num_bytes);
    int *host_out = (int *) malloc(num_bytes);
    // Declare cl variables
    cl_mem device_in;
    cl_mem device_out;
    cl_mem device_out_flag;
    // Declare cl boilerplate
    cl_platform_id platform = NULL;
    cl_device_id device = NULL;
    cl_command_queue queue = NULL;
    cl_context context = NULL;
    cl_program program = NULL;
    cl_kernel kernel = NULL;
    // Located in utils.c -- the source is irrelevant here
    char *kernel_source = read_kernel("kernels/test.cl");
    // Initialize host_in
    int i;
    for (i = 0; i < n; i++) {
        host_in[i] = i + 1;
    }
    // Set up opencl
    cl_int error;
    error = clGetPlatformIDs(1, &platform, NULL);
    printf("clGetPlatformIDs: %d\n", (int) error);
    error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    printf("clGetDeviceIDs: %d\n", (int) error);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &error);
    printf("clCreateContext: %d\n", (int) error);
    queue = clCreateCommandQueue(context, device, 0, &error);
    printf("clCreateCommandQueue: %d\n", (int) error);
    program = clCreateProgramWithSource(context, 1,
                                        (const char **) &kernel_source, NULL, &error);
    printf("clCreateProgramWithSource: %d\n", (int) error);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "testStream", &error);
    printf("clCreateKernel: %d\n", (int) error);
    // Create the buffers
    device_in = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               num_bytes, NULL, NULL);
    device_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(int) * BUFF_SIZE, NULL, NULL);
    device_out_flag = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                     sizeof(int) * 2, NULL, NULL);
    // Write the input buffer
    clEnqueueWriteBuffer(queue, device_in, CL_FALSE, 0, num_bytes,
                         host_in, 0, NULL, NULL);
    // Set the kernel arguments
    error = clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_in);
    error = clSetKernelArg(kernel, 1, sizeof(cl_mem), &device_out);
    error = clSetKernelArg(kernel, 2, sizeof(cl_mem), &device_out_flag);
    // Execute the kernel over the entire range of data
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           (const size_t *) &n, NULL, 0, NULL, NULL);
    // Map and unmap until the flag is set to true
    int break_flag = 0;
    while (1) {
        // Map the buffers
        mapped_data = (int *) clEnqueueMapBuffer(
            queue, device_out, CL_TRUE, CL_MAP_READ, 0,
            sizeof(int) * BUFF_SIZE, 0, NULL, NULL, &error);
        mapped_flag = (int *) clEnqueueMapBuffer(
            queue, device_out_flag, CL_TRUE, CL_MAP_READ, 0,
            sizeof(int), 0, NULL, NULL, &error);
        // Extract the data out of the buffer
        printf("mapped_flag[0] = %d\n", mapped_flag[0]);
        // Set the break_flag
        break_flag = mapped_flag[0];
        // Unmap the buffers
        error = clEnqueueUnmapMemObject(queue, device_out, mapped_data,
                                        0, NULL, NULL);
        error = clEnqueueUnmapMemObject(queue, device_out_flag, mapped_flag,
                                        0, NULL, NULL);
        if (break_flag == 1) { break; }
        usleep(1000 * 1000);
    }
    return 0;
}
When I run the program, I get output similar to
clGetPlatformIDs: 0
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
clCreateProgramWithSource: 0
clCreateKernel: 0
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
Why is this happening?
Edit
I am running this code on an HP dm1z with Fedora 19 64-bit, kernel 3.13.7-100.fc19.x86_64. Here is the output from clinfo:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 AMD-APP (1214.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Board name: AMD Radeon HD 6310 Graphics
Device Topology: PCI[ B#0, D#1, F#0 ]
Max compute units: 2
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 492Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 201326592
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x00007fd434852fc0
Name: Loveland
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 1214.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1214.3)
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_amd_image2d_from_buffer_read_only
Also, it may be worth noting that when I began playing with OpenCL, I ran a test program to calculate an inner product, and that also gave weird results. Initially I thought it was an error in the program and forgot about it, but is it possible that the OpenCL implementation is faulty? If it helps, the OpenGL implementation has multiple errors, causing blocks of random data to show up on my desktop background, but that could also be a Linux problem.
You are passing NULL as the global work size to your clEnqueueNDRangeKernel call:
// Execute the kernel over the entire range of data
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, NULL, NULL, 0, NULL, NULL);
If you were checking the error code returned by this call (which you always should), you would get the error code corresponding to CL_INVALID_GLOBAL_WORK_SIZE back. You always need to specify a global work size, so your call should look something like this:
// Execute the kernel over the entire range of data
size_t global[1] = {1};
error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);
// check error == CL_SUCCESS!
Your calls to map and unmap buffers are fine; I've tested this code with the above fix and it works for me.
Your updated code to fix the above problem looks like this:
unsigned int n = 24;
...
// Execute the kernel over the entire range of data
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
(const size_t *) &n, NULL, 0, NULL, NULL);
This is not a safe way of passing the global work size parameter to the kernel. As an example, the unsigned int n variable might occupy 32 bits, whereas a size_t could be 64 bits. This means that when you pass the address of n and cast to a const size_t*, the implementation will read a 64-bit value, which will encompass the 32 bits of n plus 32 other bits that have some arbitrary value. You should either assign n to a size_t variable before passing it to clEnqueueNDRangeKernel, or just change it to be a size_t itself.
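A minimal illustration of the safer form (names follow the question):

size_t global_size = n;  // widen explicitly: size_t may be 64-bit while unsigned int is 32-bit
error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
// check error == CL_SUCCESS!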
This may or may not be related to the problems you are having. You could be accidentally launching a huge number of work-items for example, which might explain why the code appears to block on the CPU.
Here are a few remarks that come to mind:
I have the feeling that somehow you think calling clEnqueueMapBuffer will interrupt the execution of the kernel. I don't think this is correct. Once a command is launched for execution, it runs until it is completed (or fails...). Several commands can be launched concurrently, but trying to read some data while a kernel is still processing it results in undefined behavior. Besides, the way you create your command queue wouldn't let you run several commands at the same time anyway; you need the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when creating the queue to allow that (see the sketch after these remarks).
I don't know which global size you use after jprice's post, but unless you execute the kernel with only one work-item, you'll have trouble with statements like mem_flag[0] = 1;, since all the work-items will write to the same location. (I'm guessing you posted only a portion of your kernel; check whether you have other statements like that. It would actually be useful if you posted the entire kernel code.)
Since you always map and unmap the same portion of the buffers and always check the first element (of mapped_flag), and since the kernel has already completed its computation at that moment (see the first point), it is at least unsurprising that you always read the same value.
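For reference, a minimal sketch of creating a queue with that property (using the OpenCL 1.x entry point from the question):

queue = clCreateCommandQueue(context, device,
                             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &error);
// Commands may now execute concurrently; use events to enforce any ordering you need.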

OpenCL kernel doesn't seem to get global id "globally"

I am trying to convert a program of mine to OpenCL, but I am not familiar enough with it yet. Still, I am having trouble with one of my (three) kernels. It is basically a complex matrix-vector multiplication, but I am writing it myself to fit my needs better.
The problem is, I can't get the kernel to work on the GPU. I have simplified it to the bare minimum (2 lines), and it works perfectly when debugged on a CPU. But on a GPU, everything goes wrong. I'm working on a MacBook Pro, and on the NVIDIA GeForce 650M I get one result, while on the integrated Intel HD 4000 I get another. The kernel is
__kernel void Chmv_(__global float2 *H, const float alpha, __global float2 *vec,
                    const int off /* in number of elements */,
                    __local float2 *vw,
                    __global float2 *vout)
{
    int gidx = get_global_id(0);
    int gidy = get_global_id(1);
    int gs = get_global_size(0);

    vout[gidx].x += alpha * (H[gidx + gidy * gs].x * vec[gidy].x - H[gidx + gidy * gs].y * vec[gidy].y);
    vout[gidx].y += alpha * (H[gidx + gidy * gs].y * vec[gidy].x + H[gidx + gidy * gs].x * vec[gidy].y);
}
For tests, I let the matrix H be a 4x4 matrix filled with (1.0f, 0.0f), while the input vector vec has x components (0.0, 1.0, 2.0, 3.0) and y components 0. alpha is set to 2.0f. So I should get (12, 12, 12, 12) as the x output, and I do, if I use the CPU. NVIDIA gives me 6.0, while Intel gives me 4.0.
Now, closer inspection showed me that if the input vector is (0,1,2,0), NVIDIA gives me 0 as the answer, and if it is (0,1,0,3), Intel gives 0 as well. By the way, changing vec[gidy] to vec[gidx] just gives me the vector doubled. From this, it seems to me that the kernel executes correctly only in the x dimension, seeing only a single value of get_global_id(1), which is clearly not OK.
I will add the test function which calls this kernel, for inspection. Does anyone have any idea what can be going on?
void _test_() {
    cl_mem mat, vec, out;
    size_t gs[2] = { 4, 4 };
    size_t ls[2] = { 1, 4 };
    size_t cpuws[2] = { 1, 1 };
    cl_float2 *A = (cl_float2 *) calloc(gs[0] * gs[0], sizeof(cl_float2));
    cl_float2 *v = (cl_float2 *) calloc(gs[0], sizeof(cl_float2));
    cl_float2 *w = (cl_float2 *) calloc(gs[0], sizeof(cl_float2));
    int i;
    for (i = 0; i < gs[0]; i++) {
        A[i * gs[0]].x = 1.0;
        A[i * gs[0] + 1].x = 1.0; //(i<ls-1)? 1.0f:0.0f;
        A[i * gs[0] + 2].x = 1.0;
        A[i * gs[0] + 3].x = 1.0;
        v[i].x = (float) i;
        printf("%d %f %f %f %f\n%v2f\n", i, A[i * gs[0]].x, A[i * gs[0] + 1].x,
               A[i * gs[0] + 2].x, A[i * gs[0] + 3].x, v[i]);
    }
    v[2].x = 0.0f; //<--- set individually for debug

    mat = clCreateBuffer(context, CL_MEM_READ_WRITE, gs[0] * gs[0] * sizeof(cl_float2), NULL, NULL);
    vec = clCreateBuffer(context, CL_MEM_READ_WRITE, gs[0] * sizeof(cl_float2), NULL, NULL);
    out = clCreateBuffer(context, CL_MEM_READ_WRITE, gs[0] * sizeof(cl_float2), NULL, NULL);

    error = clEnqueueWriteBuffer(queue, mat, CL_TRUE, 0, gs[0] * gs[0] * sizeof(cl_float2), A, 0, NULL, NULL);
    error = clEnqueueWriteBuffer(queue, vec, CL_TRUE, 0, gs[0] * sizeof(cl_float2), v, 0, NULL, NULL);
    error = clEnqueueWriteBuffer(queue, out, CL_TRUE, 0, gs[0] * sizeof(cl_float2), w, 0, NULL, NULL);

    int offset = 0;
    float alpha = 2.0;
    error = clSetKernelArg(Chmv_, 0, sizeof(cl_mem), &mat);
    error |= clSetKernelArg(Chmv_, 1, sizeof(float), &alpha);
    error |= clSetKernelArg(Chmv_, 2, sizeof(cl_mem), &vec);
    error |= clSetKernelArg(Chmv_, 3, sizeof(int), &offset);
    error |= clSetKernelArg(Chmv_, 4, gs[0] * sizeof(cl_float2), NULL);
    error |= clSetKernelArg(Chmv_, 5, sizeof(cl_mem), &out);
    assert(error == CL_SUCCESS);

    error = clEnqueueNDRangeKernel(queue, Chmv_, 2, NULL, gs, NULL, 0, NULL, &event);
    error = clEnqueueReadBuffer(queue, out, CL_TRUE, 0, gs[0] * sizeof(cl_float2), w, 0, NULL, NULL);
    clFinish(queue);

    for (i = 0; i < gs[0]; i++) {
        printf("%f %f\n", w[i].x, w[i].y);
    }
    clReleaseMemObject(mat);
    clReleaseMemObject(vec);
    clReleaseMemObject(out);
}
You are experiencing a typical problem of multithreaded, unsafe access to a common memory zone (vout).
You have to keep in mind that all of the work-items run concurrently. This means they will read and write memory in any order.
When you execute on the CPU the problem does not show up, since the execution is effectively serialised by the hardware.
However, on the GPU some work-items read the memory of vout, increment it and write it back, while others read the memory of vout before the new value has been written by the previous work-items.
Probably all your work-items are running in parallel, since your kernel size is small; that's why you only see one of them contributing to the final result.
This is a typical parallel reduction problem. You can google it for more details. What you need is to synchronise all the threads when accessing vout, either by an atomic add (slow) or by a proper reduction (harder to code). You can check this guide; it is for CUDA, but the basic idea is the same: Reduction Guide
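As a minimal sketch of the simplest such fix (not the question's original code: it collapses the y dimension into a private loop so each output element has exactly one writer, and assumes a 1D launch with global size n):

__kernel void Chmv_rowloop(__global float2 *H, const float alpha,
                           __global float2 *vec, const int n,
                           __global float2 *vout)
{
    int gidx = get_global_id(0);
    float2 acc = (float2)(0.0f, 0.0f);
    for (int gidy = 0; gidy < n; gidy++) {   // what was dimension 1 is now a loop
        float2 h = H[gidx + gidy * n];       // same indexing as the question
        float2 v = vec[gidy];
        acc.x += alpha * (h.x * v.x - h.y * v.y);  // real part of alpha*H*v
        acc.y += alpha * (h.y * v.x + h.x * v.y);  // imaginary part
    }
    vout[gidx] += acc;                       // single writer per element: no race
}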

OpenCL Kernel Executes Slower Than Single Thread

All, I wrote a very simple OpenCL kernel which transforms an RGB image to grayscale using simple averaging.
Some background:
The image is stored in mapped memory, as a 24-bit, non-padded memory block
The output array is stored in pinned memory (mapped with clEnqueueMapBuffer) and is 8 bpp
There are two buffers allocated on the device (clCreateBuffer), one specifically for reading (which we clEnqueueWriteBuffer into before the kernel starts) and the other specifically for writing (which we clEnqueueReadBuffer from after the kernel finishes)
I am running this on a 1280x960 image. A serial version of the algorithm averages 60 ms; the OpenCL kernel averages 200 ms! I'm doing something wrong, but I have no idea how to proceed or what to optimize. (Timing my reads/writes without a kernel call, the algorithm runs in 15 ms.)
I am attaching the kernel setup (sizes and arguments) as well as the kernel
EDIT: So I wrote an even dumber kernel that does no global memory accesses inside it, and it still took 150 ms... This is still ridiculously slow. I thought maybe I was messing up global memory reads, that they have to be 4-byte aligned or something? Nope...
Edit 2: Removing all the parameters from my kernel gave me a significant speed-up... I'm confused: I thought that since I clEnqueueWriteBuffer explicitly, the kernel should be doing no memory transfer from host to device or device to host...
Edit 3: Figured it out, but I still don't understand why. If anyone could explain it, I would be glad to award them the correct answer. The problem was passing the custom structs by value. It looks like I'll need to allocate a global memory location for them and pass their cl_mems.
Kernel Call:
//Copy input to device
result = clEnqueueWriteBuffer(handles->queue, d_input_data, CL_TRUE, 0,
                              h_input.widthStep * h_input.height,
                              (void *)input->imageData, 0, 0, 0);
if (check_result(result, "opencl_rgb_to_gray", "Failed to write to input buffer on device!")) return 0;

//Set kernel arguments
result = clSetKernelArg(handles->current_kernel, 0, sizeof(OpenCLImage), (void *)&h_input);
if (check_result(result, "opencl_rgb_to_gray", "Failed to set input struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 1, sizeof(cl_mem), (void *)&d_input_data);
if (check_result(result, "opencl_rgb_to_gray", "Failed to set input data.")) return 0;
result = clSetKernelArg(handles->current_kernel, 2, sizeof(OpenCLImage), (void *)&h_output);
if (check_result(result, "opencl_rgb_to_gray", "Failed to set output struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 3, sizeof(cl_mem), (void *)&d_output_data);
if (check_result(result, "opencl_rgb_to_gray", "Failed to set output data.")) return 0;

//Determine run parameters
global_work_size[0] = input->width; //(unsigned int)((input->width / (float)local_work_size[0]) + 0.5);
global_work_size[1] = input->height; //(unsigned int)((input->height / (float)local_work_size[1]) + 0.5);
printf("Global Work Group Size: %d %d\n", global_work_size[0], global_work_size[1]);

//Call kernel
result = clEnqueueNDRangeKernel(handles->queue, handles->current_kernel, 2, 0,
                                global_work_size, local_work_size, 0, 0, 0);
if (check_result(result, "opencl_rgb_to_gray", "Failed to run kernel!")) return 0;
result = clFinish(handles->queue);
if (check_result(result, "opencl_rgb_to_gray", "Failed to finish!")) return 0;

//Copy output
result = clEnqueueReadBuffer(handles->queue, d_output_data, CL_TRUE, 0,
                             h_output.widthStep * h_output.height,
                             (void *)output->imageData, 0, 0, 0);
if (check_result(result, "opencl_rgb_to_gray", "Failed to write to output buffer on device!")) return 0;
Kernel:
typedef struct OpenCLImage_t
{
    int width;
    int widthStep;
    int height;
    int channels;
} OpenCLImage;

__kernel void opencl_rgb_kernel(OpenCLImage input, __global unsigned char *input_data,
                                OpenCLImage output, __global unsigned char *output_data)
{
    int pixel_x = get_global_id(0);
    int pixel_y = get_global_id(1);
    unsigned char *cur_in_pixel, *cur_out_pixel;
    float avg = 0;

    cur_in_pixel = (unsigned char *)(input_data + pixel_y * input.widthStep + pixel_x * input.channels);
    cur_out_pixel = (unsigned char *)(output_data + pixel_y * output.widthStep + pixel_x * output.channels);

    avg += cur_in_pixel[0];
    avg += cur_in_pixel[1];
    avg += cur_in_pixel[2];
    avg /= 3.0f;

    if (avg > 255.0)
        avg = 255.0;
    else if (avg < 0)
        avg = 0;

    *cur_out_pixel = avg;
}
The overhead of copying the struct's value to every thread that is created might be the reason for the extra time, whereas for a global memory buffer a reference is enough. Only the SDK implementer will be able to answer exactly.. :)
You may want to try a local_work_size like [64, 1, 1] in order to coalesce your memory accesses (note that 64 is a divisor of 1280).
As said before, you have to use a profiler in order to get more information. Are you using an NVIDIA card? Then download CUDA 4 (not 5), as it contains an OpenCL profiler.
Your performance must be far from the optimum. Change the local work size, change the global work size, try to process two or four pixels per thread. Can you change the way pixels are stored prior to your processing? Then split your struct into three arrays in order to coalesce memory accesses more effectively.
You can hide your memory transfers behind the GPU work; this will be easier to do with a profiler near you.
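Regarding Edit 3, a minimal sketch of the workaround the poster describes, passing the struct through a buffer instead of by value (names follow the question; handles->context and the __constant qualifier are assumptions):

// Host side: put the struct in a small device buffer once and pass the cl_mem.
cl_mem d_input_desc = clCreateBuffer(handles->context,
                                     CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(OpenCLImage), &h_input, &result);
result = clSetKernelArg(handles->current_kernel, 0, sizeof(cl_mem), (void *)&d_input_desc);

// Kernel side: take a pointer instead of a by-value struct.
// __kernel void opencl_rgb_kernel(__constant OpenCLImage *input, ...)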
