I made two functions: one flips an image from left to right and the other flips it from top to bottom. But for some reason, when I load the image, it comes out unchanged.
This is the code for flipping from left to right.
void flip_horizontal( uint8_t array[],
                      unsigned int cols,
                      unsigned int rows )
{
    unsigned int left = 0;
    unsigned int right = cols;
    for(int r = 0; r < rows; r++)
    {
        while(left != right && right > left)
        {
            int temp = array[r * cols+ left];
            array[(r * cols) + left] = array[(r * cols) + cols - right];
            array[(r * cols) + cols - right] = temp;
        }
    }
}
And this is the code for flipping from top to bottom.
void flip_vertical( uint8_t array[],
                    unsigned int cols,
                    unsigned int rows )
{
    unsigned int top = 0;
    unsigned int bottom = rows;
    for(int r = 0; r < cols; r++)
    {
        while(top != bottom && bottom > top)
        {
            int temp = array[r * rows+ top];
            array[(r * rows) + top] = array[(r * rows) + rows - bottom];
            array[(r * rows) + rows - bottom] = temp;
        }
    }
}
Try moving these inside your for loop:
unsigned int left = 0;
unsigned int right = cols;
unsigned int top = 0;
unsigned int bottom = rows;
Otherwise, you will only flip the first row/column.
There are some other issues in the way you're indexing as well but I won't spoil the fun of fixing those :-)
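For readers who want to check their own fix, here is one possible sketch of both flips. It assumes the image is stored row-major with one byte per pixel and `cols` pixels per row, which the question suggests but does not state. The indices are reset for every row (or column) and move toward each other.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirror every row in place: swap pixels from both ends toward the middle.
   Assumes row-major storage, one byte per pixel. */
void flip_horizontal(uint8_t array[], unsigned int cols, unsigned int rows)
{
    assert(array != NULL);
    if (cols < 2)
        return;                          /* nothing to swap */
    for (unsigned int r = 0; r < rows; r++) {
        unsigned int left = 0;           /* reset for every row */
        unsigned int right = cols - 1;   /* last valid column */
        while (left < right) {
            uint8_t temp = array[r * cols + left];
            array[r * cols + left] = array[r * cols + right];
            array[r * cols + right] = temp;
            left++;
            right--;
        }
    }
}

/* Mirror every column in place: swap pixels from top and bottom toward the middle. */
void flip_vertical(uint8_t array[], unsigned int cols, unsigned int rows)
{
    assert(array != NULL);
    if (rows < 2)
        return;
    for (unsigned int c = 0; c < cols; c++) {
        unsigned int top = 0;            /* reset for every column */
        unsigned int bottom = rows - 1;  /* last valid row */
        while (top < bottom) {
            /* The row index is multiplied by cols, also when walking a column. */
            uint8_t temp = array[top * cols + c];
            array[top * cols + c] = array[bottom * cols + c];
            array[bottom * cols + c] = temp;
            top++;
            bottom--;
        }
    }
}
```

Note that the vertical flip still multiplies the row index by `cols`; multiplying by `rows` only works for square images.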
Say I have some function that performs a matrix operation (like a transpose) on a float array:
void transpose(float *result, const float *input, int rows, int cols){
    int i,j;
    for(i = 0; i < rows; i++){
        for(j = 0; j < cols; j++){
            result[rows*j+i] = input[cols*i+j];
        }
    }
}
This function will work for any data type with size sizeof(float). Can this function be modified to work with arrays of arbitrary data type, or is it necessary to have separate functions for each data type of different size (e.g. transpose_8, transpose_32, etc.)?
Following a comment by Eugene Sh., you can pass void * pointers together with the size of a single element, so the same function works for all types.
You have to convert the pointers to char * so you can do pointer arithmetic on them, though.
Here's how you can do that:
#include <string.h>

void transpose(void *result, const void *input, int size, int rows, int cols)
{
    int i, j;
    char *r = result;
    const char *in = input;   /* not named 'i': that is already the row index */
    for( i = 0; i < rows; i++ )
        for( j = 0; j < cols; j++ )
            memcpy(r + size * (rows * j + i), in + size * (cols * i + j), size);
}
Can this function be modified to work with arrays of arbitrary data type?
Yes, you can pass a generic void * pointer and the size of a single element as parameters, which is exactly how qsort() handles any kind of data type.
Here's a working example:
#include <string.h>

void transpose(void *result, const void *input, size_t rows, size_t cols, size_t element_size) {
    const unsigned char *input_ptr = (const unsigned char *)input;
    unsigned char *result_ptr = (unsigned char *)result;
    size_t i, j;
    for(i = 0; i < rows; i++) {
        for(j = 0; j < cols; j++) {
            const unsigned char *in = input_ptr + element_size * (cols * i + j);
            unsigned char *res = result_ptr + element_size * (rows * j + i);
            memcpy(res, in, element_size);
        }
    }
}
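One possible way to use such a generic function without scattering sizeof through the calling code is a set of thin typed wrappers. The sketch below repeats a compact version of the generic transpose so that it compiles on its own; `struct pixel`, `transpose_double` and `transpose_pixel` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic transpose, repeated here so the snippet is self-contained. */
void transpose(void *result, const void *input,
               size_t rows, size_t cols, size_t element_size)
{
    const unsigned char *in = input;
    unsigned char *out = result;
    assert(result != input);   /* out-of-place only */
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            memcpy(out + element_size * (rows * j + i),
                   in + element_size * (cols * i + j),
                   element_size);
}

/* Hypothetical element type, to show that structs work as well. */
struct pixel { unsigned char r, g, b; };

/* Typed wrappers: the compiler checks the pointer types again and the
   element size is derived from the pointer, not typed by hand. */
void transpose_double(double *result, const double *input,
                      size_t rows, size_t cols)
{
    transpose(result, input, rows, cols, sizeof *input);
}

void transpose_pixel(struct pixel *result, const struct pixel *input,
                     size_t rows, size_t cols)
{
    transpose(result, input, rows, cols, sizeof *input);
}
```

The wrappers cost one line each and bring back the type checking that void * gives up.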
You could also do this in-place using the same swapping technique as qsort() does:
void transpose_inplace(void *input, size_t n, size_t element_size) {
    unsigned char *input_ptr = (unsigned char *)input;
    size_t i, j;
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            unsigned char *a = input_ptr + element_size * (n * i + j);
            unsigned char *b = input_ptr + element_size * (n * j + i);
            size_t size = element_size;
            while (size--) {
                unsigned char tmp = *a;
                *a++ = *b;
                *b++ = tmp;
            }
        }
    }
}
I'm using n here since to transpose in-place you need a square matrix where rows = cols = n.
I am trying to adapt a sequential function written for the CPU to an OpenCL kernel for the GPU.
The function is the well-known im2col used in many deep learning applications.
I have found some code in the OpenCV repository that implements im2col in OpenCL, but the function I have to adapt takes a batch dimension that confuses me, and it seems to be a bit different.
What should I change in the OpenCL kernel to make it produce the same result on the GPU as the CPU function does?
CPU code
int fn_im2col_cpu(int I, int WI, int HI, int B, int KW, int KH, int WO, int HO, int PW, int PH, int SW, int SH, type *in_ptr, type *out_ptr) {
  int i;  // scrolls input channels
  int w;  // scrolls channel columns (width)
  int h;  // scrolls channel rows (height)
  int kw; // scrolls filter columns (width)
  int kh; // scrolls filter rows (height)
  // we sweep all output pixels, and for each pixel we compute the associated input pixel
  #pragma omp parallel for private (kh, kw, h, w)
  for (i = 0; i < I; i++) {
    size_t out_addr = ((size_t)B * (size_t)WO * (size_t)HO * (size_t)KW * (size_t)KH * (size_t)i);
    size_t in_addr1 = (size_t)i * (size_t)B * (size_t)WI * (size_t)HI;
    for (kh = 0; kh < KH; kh++) {
      for (kw = 0; kw < KW; kw++) {
        for (h = 0; h < HO; h++) {
          int hi = h * SH - PH + kh;
          size_t in_addr2 = in_addr1 + ((size_t)hi * (size_t)B * (size_t)WI);
          for (w = 0; w < WO; w++) {
            int wi = w * SW - PW + kw;
            int force_padding = (wi < 0) || (wi >= WI) || (hi < 0) || (hi >= HI);
            if (force_padding) {
              bzero(&out_ptr[out_addr], B*sizeof(type));
            } else {
              int in_addr = in_addr2 + (wi * B);
              memcpy(&out_ptr[out_addr], &in_ptr[in_addr], B*sizeof(type));
            }
          }
        }
      }
    }
  }
  return 1;
}
OpenCL kernel from https://github.com/opencv/opencv/blob/master/modules/dnn/src/opencl/im2col.cl
__kernel void im2col(__global const float *im_src, int im_src_offset,
                     int channels, int height_inp, int width_inp,
                     int kernel_h, int kernel_w, int pad_h, int pad_w,
                     int stride_h, int stride_w,
                     int height_out, int width_out,
                     __global float *im_col, int im_col_offset
                    )
{
    int index = get_global_id(0);
    if (index >= height_out * width_out * channels)
        return;
    int j_out = index % width_out;
    int i_out = (index / width_out) % height_out;
    int c_inp = (index / width_out) / height_out;
    int c_out = c_inp * kernel_h * kernel_w;
    int i_inp = i_out * stride_h - pad_h;
    int j_inp = j_out * stride_w - pad_w;
    im_src += (c_inp * height_inp + i_inp) * width_inp + j_inp + im_src_offset;
    im_col += (c_out * height_out + i_out) * width_out + j_out + im_col_offset;
    for (int ki = 0; ki < kernel_h; ++ki)
        for (int kj = 0; kj < kernel_w; ++kj) {
            int i = i_inp + ki;
            int j = j_inp + kj;
            *im_col = (i >= 0 && j >= 0 && i < height_inp && j < width_inp) ?
                im_src[ki * width_inp + kj] : 0;
            im_col += height_out * width_out;
        }
}
Your C version folds the batch into the lowest (fastest-varying) dimension. The OpenCL version isn't using a batch at all.
You need to pass in the batch size "B", scale the pixel offsets that are added to im_src and im_col by B, and change this copy to a block copy (or just a loop) over the batch size:
for (int b=0; b<B; b++) im_col[b] = (i >= 0 && j >= 0 && i < height_inp && j < width_inp) ? im_src[(ki * width_inp + kj)*B + b] : 0;
to emulate the memcpy(..., B*sizeof(type)).
And then just stride B times more:
im_col += height_out * width_out * B;
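To check the index arithmetic before touching the GPU, the batched kernel body can be emulated in plain C and compared against a CPU reference. The sketch below is not the OpenCV kernel and not the asker's function: `im2col_ref`, `im2col_item` and `im2col_batched` are hypothetical names, float is used in place of `type`, and the reference advances the output by B after every pixel, a step the posted CPU code does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CPU reference with the batch B as the innermost dimension.
   Assumption: the output position advances by B after each pixel. */
void im2col_ref(const float *in, float *out, int I, int B, int HI, int WI,
                int KH, int KW, int PH, int PW, int SH, int SW, int HO, int WO)
{
    size_t o = 0;
    for (int c = 0; c < I; c++)
        for (int kh = 0; kh < KH; kh++)
            for (int kw = 0; kw < KW; kw++)
                for (int h = 0; h < HO; h++)
                    for (int w = 0; w < WO; w++, o += (size_t)B) {
                        int hi = h * SH - PH + kh;
                        int wi = w * SW - PW + kw;
                        if (hi < 0 || hi >= HI || wi < 0 || wi >= WI)
                            memset(&out[o], 0, (size_t)B * sizeof(float));
                        else
                            memcpy(&out[o],
                                   &in[(((size_t)c * HI + hi) * WI + wi) * B],
                                   (size_t)B * sizeof(float));
                    }
}

/* What one work-item of a batched kernel would do, written as plain C.
   'index' plays the role of get_global_id(0). */
void im2col_item(int index, const float *src, float *col, int channels, int B,
                 int HI, int WI, int KH, int KW, int PH, int PW,
                 int SH, int SW, int HO, int WO)
{
    if (index >= HO * WO * channels)
        return;
    int j_out = index % WO;
    int i_out = (index / WO) % HO;
    int c_inp = (index / WO) / HO;
    int i_inp = i_out * SH - PH;
    int j_inp = j_out * SW - PW;
    /* Output position counted in pixels; scaled by B when indexing. */
    size_t out = ((size_t)c_inp * KH * KW * HO + i_out) * WO + j_out;
    for (int ki = 0; ki < KH; ++ki)
        for (int kj = 0; kj < KW; ++kj) {
            int i = i_inp + ki;
            int j = j_inp + kj;
            int inside = i >= 0 && j >= 0 && i < HI && j < WI;
            for (int b = 0; b < B; ++b)   /* stands in for the memcpy/bzero */
                col[out * B + b] = inside
                    ? src[(((size_t)c_inp * HI + i) * WI + j) * B + b]
                    : 0.0f;
            out += (size_t)HO * WO;       /* next kernel element */
        }
}

/* Runs every work-item sequentially. */
void im2col_batched(const float *src, float *col, int channels, int B,
                    int HI, int WI, int KH, int KW, int PH, int PW,
                    int SH, int SW, int HO, int WO)
{
    for (int index = 0; index < HO * WO * channels; index++)
        im2col_item(index, src, col, channels, B, HI, WI,
                    KH, KW, PH, PW, SH, SW, HO, WO);
}
```

If both functions fill identical buffers for a few small shapes, the same index expressions can be carried over to the kernel.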
I'm using FFTW in my C code and I have run into an issue.
First, I can transform an original image into two images (magnitude and phase) and get the original image back with the inverse transform.
However, if I want a magnitude image that is centered in frequency, it no longer works: the final image has some issues.
Here are some pieces of my code. Can someone help me find the error?
EDIT: I've fixed the code to follow @francis's recommendation, but the issue is still there.
enum {
static void fft_to_spectra(fits* fit, fftw_complex *frequency_repr, double *as,
double *ps, int nbdata) {
unsigned int i;
for (i = 0; i < nbdata; i++) {
double r = creal(frequency_repr[i]);
double im = cimag(frequency_repr[i]);
as[i] = hypot(r, im);
ps[i] = atan2(im, r);
static void fft_to_freq(fits* fit, fftw_complex *frequency_repr, double *as, double *ps, int nbdata) {
unsigned int i;
for (i = 0; i < nbdata; i++) {
frequency_repr[i] = as[i] * (cos(ps[i]) + I * sin(ps[i]));
void change_symmetry(unsigned int width, unsigned int height, unsigned int i, unsigned int j, unsigned int *x,
unsigned int *y) {
if (i < width / 2 && j < height / 2) {
*x = i + width / 2;
*y = j + height / 2;
if (i >= width / 2 && j < height / 2) {
*x = i - width / 2;
*y = j + height / 2;
if (i < width / 2 && j >= height / 2) {
*x = i + width / 2;
*y = j - height / 2;
if (i >= width / 2 && j >= height / 2) {
*x = i - width / 2;
*y = j - height / 2;
static void centered(WORD *buf, unsigned int width,
unsigned int height) {
unsigned int i, j;
WORD *temp = malloc(width * height * sizeof(WORD));
for (j = 0; j < height; j++) {
for (i = 0; i < width; i++) {
unsigned int x = i;
unsigned int y = j;
change_symmetry(width, height, i, j, &x, &y);
temp[j * width + i] = buf[y * width + x];
memcpy(buf, temp, sizeof(WORD) * width * height);
static void normalisation_spectra(unsigned int w, unsigned int h, double *modulus, double *phase,
WORD *abuf, WORD *pbuf) {
unsigned int i;
for (i = 0; i < h * w; i++) {
pbuf[i] = round_to_WORD(((phase[i] + M_PI) * USHRT_MAX_DOUBLE / (2 * M_PI)));
abuf[i] = round_to_WORD((modulus[i] / w / h));
static void save_dft_information_in_gfit(fits *fit) {
strcpy(gfit.dft.ord, fit->dft.type);
strcpy(gfit.dft.ord, fit->dft.ord);
static void FFTD(fits *fit, fits *x, fits *y, int type_order, int layer) {
WORD *xbuf = x->pdata[layer];
WORD *ybuf = y->pdata[layer];
WORD *gbuf = fit->pdata[layer];
unsigned int i;
unsigned int width = fit->rx, height = fit->ry;
int nbdata = width * height;
fftw_complex *spatial_repr = fftw_malloc(sizeof(fftw_complex) * nbdata);
if (!spatial_repr) {
fftw_complex *frequency_repr = fftw_malloc(sizeof(fftw_complex) * nbdata);
if (!frequency_repr) {
/* copying image selection into the fftw data */
#ifdef _OPENMP
#pragma omp parallel for num_threads(com.max_thread) private(i) schedule(static) if(nbdata > 15000)
for (i = 0; i < nbdata; i++) {
spatial_repr[i] = (double) gbuf[i];
/* we run the Fourier Transform */
fftw_plan p = fftw_plan_dft_2d(height, width, spatial_repr, frequency_repr,
/* we compute modulus and phase */
double *modulus = malloc(nbdata * sizeof(double));
double *phase = malloc(nbdata * sizeof(double));
fft_to_spectra(fit, frequency_repr, modulus, phase, nbdata);
//We normalize the modulus and the phase
normalisation_spectra(width, height, modulus, phase, xbuf, ybuf);
if (type_order == TYPE_CENTERED) {
strcpy(x->dft.ord, "CENTERED");
centered(xbuf, width, height);
centered(ybuf, width, height);
static void FFTI(fits *fit, fits *xfit, fits *yfit, int type_order, int layer) {
WORD *xbuf = xfit->pdata[layer];
WORD *ybuf = yfit->pdata[layer];
WORD *gbuf = fit->pdata[layer];
unsigned int i;
unsigned int width = xfit->rx;
unsigned int height = xfit->ry;
int nbdata = width * height;
double *modulus = calloc(1, nbdata * sizeof(double));
double *phase = calloc(1, nbdata * sizeof(double));
if (type_order == TYPE_CENTERED) {
centered(xbuf, width, height);
centered(ybuf, width, height);
for (i = 0; i < height * width; i++) {
modulus[i] = (double) xbuf[i] * (width * height);
phase[i] = (double) ybuf[i] * (2 * M_PI / USHRT_MAX_DOUBLE);
phase[i] -= M_PI;
fftw_complex* spatial_repr = fftw_malloc(sizeof(fftw_complex) * nbdata);
if (!spatial_repr) {
fftw_complex* frequency_repr = fftw_malloc(sizeof(fftw_complex) * nbdata);
if (!frequency_repr) {
fft_to_freq(fit, frequency_repr, modulus, phase, nbdata);
fftw_plan p = fftw_plan_dft_2d(height, width, frequency_repr, spatial_repr,
for (i = 0; i < nbdata; i++) {
double pxl = creal(spatial_repr[i]) / nbdata;
gbuf[i] = round_to_WORD(pxl);
Here are my images: the original one and the FFTD(centered)->FFTI result.
The plan is created using the flag FFTW_MEASURE. Hence, several DFTs are computed during planning and the input array is likely overwritten. Here is the start of the description of planner flags in the documentation of FFTW:
FFTW_ESTIMATE specifies that, instead of actual measurements of different algorithms, a simple heuristic is used to pick a (probably sub-optimal) plan quickly. With this flag, the input/output arrays are not overwritten during planning.
FFTW_MEASURE tells FFTW to find an optimized plan by actually computing several FFTs and measuring their execution time. Depending on your machine, this can take some time (often a few seconds). FFTW_MEASURE is the default planning option.
Either switch to FFTW_ESTIMATE or create the plan before populating the input array:
/* we run the Fourier Transform */
fftw_plan p = fftw_plan_dft_2d(width, height, spatial_repr, frequency_repr,
/* copying image selection into the fftw data */
#ifdef _OPENMP
#pragma omp parallel for num_threads(com.max_thread) private(i) schedule(static) if(nbdata > 15000)
for (i = 0; i < nbdata; i++) {
spatial_repr[i] = (double) gbuf[i];
If you intend to process a single image, using FFTW_ESTIMATE is the way to go. If, on the other hand, you plan to process multiple images, creating the plan once with FFTW_MEASURE and storing it is a good option. You can then use the new-array execute functions each time an FFT is to be performed:
fftw_execute_dft(p, spatial_repr, frequency_repr);
You can test the return value of malloc() or fftw_malloc() to check whether the allocations succeeded. On failure, they return NULL. fftw_malloc() is implemented as function *X(kernel_malloc)(size_t n) in fftw-3.3.6-pl2/kernel/kalloc.c. It calls functions like memalign() or _aligned_malloc(), among others. Both return NULL on failure, just like malloc(). Finally, I did not spot any critical issue regarding memory allocation or deallocation in the piece of code you provided.
The argument double nbdata in fft_to_spectra() should likely be an integer. Valgrind might have considered it as strange...
EDIT: change_symmetry() needs to be modified for odd sizes. Something like:
void change_symmetry_forward(unsigned int width, unsigned int height, unsigned int i, unsigned int j, unsigned int *x,
        unsigned int *y) {
    *x = i + width / 2;
    if (*x>=width){
        *x =*x-width;
    }
    *y = j + height / 2;
    if (*y>=height){
        *y =*y-height;
    }
}
void change_symmetry_backward(unsigned int width, unsigned int height, unsigned int i, unsigned int j, unsigned int *x,
        unsigned int *y) {
    *x = i +width- width / 2;
    if (*x>=width){
        *x =*x-width;
    }
    *y = j +height- height / 2;
    if (*y>=height){
        *y =*y-height;
    }
}
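The reason two different shifts are needed for odd sizes can be checked without FFTW. The sketch below is an illustration under assumptions: `unsigned short` stands in for the WORD type, and `shift_image`, `center_forward` and `center_backward` are hypothetical names. It applies the forward shift by width / 2 and the backward shift by width - width / 2 as circular shifts and lets you verify that the pair is an identity on a 3 x 5 image, which shifting by width / 2 twice is not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Circular shift: out(i, j) = in((i + dx) mod width, (j + dy) mod height).
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int shift_image(unsigned short *buf, unsigned int width, unsigned int height,
                unsigned int dx, unsigned int dy)
{
    size_t n = (size_t)width * height;
    if (n == 0)
        return 0;
    unsigned short *temp = malloc(n * sizeof *temp);
    if (temp == NULL)
        return -1;
    for (unsigned int j = 0; j < height; j++)
        for (unsigned int i = 0; i < width; i++) {
            unsigned int x = (i + dx) % width;
            unsigned int y = (j + dy) % height;
            temp[(size_t)j * width + i] = buf[(size_t)y * width + x];
        }
    memcpy(buf, temp, n * sizeof *temp);
    free(temp);
    return 0;
}

/* Forward centering, same offsets as change_symmetry_forward(). */
int center_forward(unsigned short *buf, unsigned int width, unsigned int height)
{
    return shift_image(buf, width, height, width / 2, height / 2);
}

/* Inverse of center_forward(), same offsets as change_symmetry_backward(). */
int center_backward(unsigned short *buf, unsigned int width, unsigned int height)
{
    return shift_image(buf, width, height,
                       width - width / 2, height - height / 2);
}
```

For even sizes the two offsets coincide, which is why a single change_symmetry() only round-trips correctly on even dimensions.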
When I run this code, I keep getting segmentation faults. I know segmentation faults occur when there's not enough memory allocated to the array. Does anybody know where the seg fault is occurring?
void flip_horizontal( uint8_t array[],
                      unsigned int cols,
                      unsigned int rows )
{
    for(int r = 0; r < rows; r++)
    {
        unsigned int left = 0;
        unsigned int right = cols;
        int* array = malloc(sizeof(uint8_t));
        while(left != right && right > left)
        {
            int temp = array[r * cols+ left];
            array[(r * cols) + left] = array[(r * cols) + cols - right];
            array[(r * cols) + cols - right] = temp;
        }
    }
}
You made a very simple error. Your left and right indices should be moving towards each other; instead, you're incrementing both of them inside your loop.
int* array = malloc(sizeof(uint8_t));
You allocate memory and assign its address to 'array', but the allocation is only 1 byte. Since 'cols' may be greater than 1, that is not large enough. The new 'array' also shadows the function parameter, so the swaps never touch the caller's image. Removing those 2 lines fixes the first issue. BTW, the variable 'array' should not be reassigned, and you don't need a buffer to flip.
Besides, the left and right indices and the swapping loop should be:
unsigned int left = 0;
unsigned int right = cols - 1;
while(left < right) {
    int temp = array[r * cols + left];
    array[r * cols + left] = array[r * cols + right];
    array[r * cols + right] = temp;
    left++;
    right--;
}
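If a separate buffer were actually wanted, for instance to keep the original image untouched, it would have to hold every pixel rather than a single byte. A minimal sketch, assuming one byte per pixel and row-major storage; `flipped_copy` is a hypothetical helper, not part of the question's code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns a newly allocated, horizontally mirrored copy of a rows x cols
   image, or NULL if the image is empty or the allocation fails.
   The caller owns the result and must free() it. */
uint8_t *flipped_copy(const uint8_t *array, unsigned int cols, unsigned int rows)
{
    size_t count = (size_t)rows * cols;   /* total pixels, one byte each */
    if (count == 0)
        return NULL;
    uint8_t *copy = malloc(count * sizeof *copy);
    if (copy == NULL)
        return NULL;
    for (unsigned int r = 0; r < rows; r++)
        for (unsigned int c = 0; c < cols; c++)
            copy[(size_t)r * cols + c] =
                array[(size_t)r * cols + (cols - 1 - c)];
    return copy;
}
```

The allocation size is computed from rows and cols; sizeof(uint8_t) alone is always 1.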
I am using a CUDA kernel object in MATLAB in order to fill a 2D array with all '55's. The result is very strange. The 2D array only fills up to a certain point as shown below. After row 1025, the array is all zeros. Any idea what could be going wrong?
As I mentioned in the comment above, you are mistakenly offsetting the matrix rows. The code below is a full working example proving this point.
__global__ void myKern(double* masterForces, int r_max, int iterations) {
int threadsPerBlock = blockDim.x * blockDim.y;
int blockId = blockIdx.x + (blockIdx.y * gridDim.x);
int threadId = threadIdx.x + (threadIdx.y * blockDim.x);
int globalIdx = (blockId * threadsPerBlock) + threadId;
//for (int i=0; i<iterations; i++) masterForces[globalIdx * r_max + i] = 55;
for (int i=0; i<iterations; i++) masterForces[globalIdx * iterations + i] = 55;
void main() {
int ThreadBlockSize = 32;
int GridSize = 32;
int reps = 1024;
int iterations = 2000;
thrust::device_vector<double> gpuF_M(reps*iterations, 0);
int numerrors = 0;
for (int i=0; i<reps*iterations; i++) {
double test = gpuF_M[i];
if (test != 55) { printf("Error %i %f\n",i,test); numerrors++; }
printf("The number of errors is = %i\n",numerrors);
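The effect of the wrong row offset can be reproduced on the CPU. The sketch below is plain C, not CUDA, and makes an assumption the post does not confirm: that the original stride (r_max) was smaller than the number of values each thread writes. `count_untouched` is a hypothetical helper that replays the kernel's write pattern for a given stride and reports how many cells never receive the value 55.

```c
#include <assert.h>
#include <stddef.h>

/* Replays the kernel's write pattern sequentially: each of 'threads' workers
   writes 'iterations' consecutive values starting at t * stride.
   buf must hold threads * iterations doubles.
   Returns how many cells were never written. */
size_t count_untouched(double *buf, size_t threads, size_t iterations,
                       size_t stride)
{
    size_t total = threads * iterations;
    size_t missed = 0;
    assert(stride <= iterations);   /* keeps every write inside buf */
    for (size_t k = 0; k < total; k++)
        buf[k] = 0.0;
    for (size_t t = 0; t < threads; t++)
        for (size_t i = 0; i < iterations; i++)
            buf[t * stride + i] = 55.0;
    for (size_t k = 0; k < total; k++)
        if (buf[k] != 55.0)
            missed++;
    return missed;
}
```

With a stride equal to the row length every cell is written exactly once; with a smaller stride the rows overlap and the tail of the buffer stays at zero, which matches the symptom described in the question.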