I have written a block matching algorithm in C++ using OpenCV for my thesis.
It works on grayscale pictures and addresses the IplImage by absolute pixel address.
I have to divide the IplImage into blocks of the same size (8x8 pixels). To access the pixel values within the blocks, I compute the pixel address and read the pixel value like this:
for (int yBlock = 0; yBlock < maxYBlocks; yBlock++){
for (int xBlock = 0; xBlock < maxXBlocks; xBlock++){
for (int yPixel = 0; yPixel < 8; yPixel++){
for (int xPixel = 0; xPixel < 8; xPixel++){
pixelAdress = yBlock*imageWidth*8 + xBlock*8 + yPixel*imageWidth + xPixel;
unsigned char* imagePointer = (unsigned char*)(img->imageData);
pixelValue = imagePointer[pixelAdress];
}
}
}
}
I do NOT really iterate over rows and cols, and it works great!
Now I have a color IplImage (not grayscale) and don't know how to access the r, g, b pixel values.
I found this on this forum:
for( row = 0; row < img->height; row++ ){
for ( col = 0; col < img->width; col++ ){
b = (int)img->imageData[img->widthStep * row + col * 3];
g = (int)img->imageData[img->widthStep * row + col * 3 + 1];
r = (int)img->imageData[img->widthStep * row + col * 3 + 2];
}
}
but I'm not sure how to apply it to my computed pixelAdress. Is it correct to just multiply it by 3 (because I do not iterate over rows) and then add 0, 1 or 2? For example:
pixelValueR = imagePointer[pixelAdress*3 + 2];
pixelValueG = imagePointer[pixelAdress*3 + 1];
pixelValueB = imagePointer[pixelAdress*3 + 0];
or do I have to use widthStep where I used imageWidth before, like this:
pixelAdressR = yBlock*img->widthStep*8 + xBlock*8*3 + yPixel*img->widthStep + xPixel*3 + 2;
pixelAdressG = yBlock*img->widthStep*8 + xBlock*8*3 + yPixel*img->widthStep + xPixel*3 + 1;
pixelAdressB = yBlock*img->widthStep*8 + xBlock*8*3 + yPixel*img->widthStep + xPixel*3;
and so access
pixelValueR = imagePointer[pixelAdressR];
pixelValueG = imagePointer[pixelAdressG];
pixelValueB = imagePointer[pixelAdressB];
In the case of a multi-channel Mat (BGR in this example), you can access a single pixel as described here:
Vec3b intensity = img.at<Vec3b>(y, x);
uchar blue = intensity.val[0];
uchar green = intensity.val[1];
uchar red = intensity.val[2];
I'm not sure about your whole algorithm and can't test it at the moment, but for IplImages the memory is laid out like this:
1. row
baseadress + 0 = b of [0]
baseadress + 1 = g of [0]
baseadress + 2 = r of [0]
baseadress + 3 = b of [1]
etc
2. row
baseadress + widthStep + 0 = b
baseadress + widthStep + 1 = g
baseadress + widthStep + 2 = r
So if you have n*m blocks of size 8x8 of unsigned char BGR data and you want to loop over pixel [x,y] in block [bx,by], you can do it like this:
baseadress + (by*8+y)*widthStep + (bx*8+x)*3 + 0 = b
baseadress + (by*8+y)*widthStep + (bx*8+x)*3 + 1 = g
baseadress + (by*8+y)*widthStep + (bx*8+x)*3 + 2 = r
since row by*8+y starts at address baseadress + (by*8+y)*widthStep
and column bx*8+x is at address offset (bx*8+x)*3
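The offset formula above can be wrapped in a small helper. This is only a sketch: bgrOffset is a made-up name (it is not an OpenCV function), and it assumes 8x8 blocks, 3 bytes per pixel in b, g, r order, and a widthStep given in bytes per row including padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of OpenCV): byte offset of channel c
// (0 = b, 1 = g, 2 = r) of pixel (x, y) inside the 8x8 block (bx, by).
// widthStep is the number of bytes per image row, including any padding.
std::size_t bgrOffset(int bx, int by, int x, int y, int c, int widthStep) {
    const int blockSize = 8;  // assumed block edge length
    const int channels = 3;   // assumed interleaved b, g, r bytes
    assert(bx >= 0 && by >= 0 && widthStep > 0);
    assert(x >= 0 && x < blockSize && y >= 0 && y < blockSize);
    assert(c >= 0 && c < channels);
    const std::size_t row = static_cast<std::size_t>(by) * blockSize + y;
    const std::size_t col = static_cast<std::size_t>(bx) * blockSize + x;
    return row * static_cast<std::size_t>(widthStep) + col * channels + c;
}
```

With the imagePointer from the question, the red value of a pixel would then be imagePointer[bgrOffset(xBlock, yBlock, xPixel, yPixel, 2, img->widthStep)].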
For Mat (e.g. Mat img)
Grayscale (8UC1):
uchar intensity = img.at<uchar>(y, x);
Color image (BGR color ordering, the default format returned by imread):
Vec3b intensity = img.at<Vec3b>(y, x);
uchar blue = intensity.val[0];
uchar green = intensity.val[1];
uchar red = intensity.val[2];
For IplImage (e.g. IplImage* img)
Grayscale:
uchar intensity = CV_IMAGE_ELEM(img, uchar, h, w);
Color image:
uchar blue = CV_IMAGE_ELEM(img, uchar, y, x*3);
uchar green = CV_IMAGE_ELEM(img, uchar, y, x*3+1);
uchar red = CV_IMAGE_ELEM(img, uchar, y, x*3+2);
I am currently developing an OCR for sudoku and I am trying to first get a clean black and white image. I first apply a grayscale conversion, then a median filter, then Otsu's algorithm.
My problem is that my results are better when I don't apply my median filter.
Does anyone know why?
starting image
with my median filter
without my median filter
Here is the code for my median filter:
void median_filter(SDL_Surface *image) {
int width = image->w;
int height = image->h;
for (int y = 1; y < height - 1; y++) {
for (int x = 1; x < width - 1; x++) {
Uint8 gray_values[9];
int index = 0;
for (int dy = -1; dy <= 1; dy++) {
for (int dx = -1; dx <= 1; dx++) {
int pixel_offset = (y+dy) * image->pitch + (x+dx) * 4;
Uint8 r = *(Uint8 *)((Uint8 *)image->pixels + pixel_offset);
Uint8 g = *(Uint8 *)((Uint8 *)image->pixels + pixel_offset + 1);
Uint8 b = *(Uint8 *)((Uint8 *)image->pixels + pixel_offset + 2);
gray_values[index++] = (0.3 * r) + (0.59 * g) + (0.11 * b);
}
}
qsort(gray_values, 9, sizeof(Uint8), cmpfunc);
Uint8 gray = gray_values[4];
int pixel_offset = y * image->pitch + x * 4;
*(Uint8 *)((Uint8 *)image->pixels + pixel_offset) = gray;
*(Uint8 *)((Uint8 *)image->pixels + pixel_offset + 1) = gray;
*(Uint8 *)((Uint8 *)image->pixels + pixel_offset + 2) = gray;
}
}
}
You are filtering with some neighbour values that were already filtered: the three pixels above and the one on the left.
You need to write the median values into a new image. This must also include the unfiltered pixels around the edges.
If you are applying multiple filters, use one buffer as the source and another as the destination, then swap their roles for the next filter application (by passing both buffers to the filter functions).
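As a rough sketch of the source/destination idea, here is a 3x3 median written in C++ that reads from one buffer and writes to another. It works on a plain single-channel 8-bit buffer instead of an SDL_Surface, and medianFilter3x3 is a made-up name, so treat it as an illustration rather than a drop-in replacement.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 median of a single-channel 8-bit image stored row by row.
// Reads only from src and writes only to the returned buffer, so
// already-filtered pixels never feed back into later medians.
std::vector<std::uint8_t> medianFilter3x3(const std::vector<std::uint8_t>& src,
                                          int width, int height) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> dst(src);  // edge pixels are copied unfiltered
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            std::array<std::uint8_t, 9> window;
            int n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    window[n++] = src[(y + dy) * width + (x + dx)];
            // A partial sort is enough to find the middle element.
            std::nth_element(window.begin(), window.begin() + 4, window.end());
            dst[y * width + x] = window[4];
        }
    }
    return dst;
}
```

To chain several filters, pass the returned buffer as the source of the next call.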
I'm rearranging an array in my project on ARMv7. I already get the element addresses d[] in the order I expect. To make the code more efficient, I want to use NEON intrinsics in C++. My problem is that I can load the address array d[] using vld1q_s32(), but I do not know how to read the elements of this vector as addresses. The instructions I know can only duplicate a vector.
This problem has been confusing me for several days. Or can NEON simply not do this?
Thanks for your answers.
Here is my code:
void InputRearrange(int8_t* din, int8_t* dout, const int x, const int y){
int8_t* dout_array[16];
int out = 0;
dout_array[0] = din;
for(int n = 1; n < 16; n++) {//get the address of the first line in z-axis
dout_array[n] = dout_array[n - 1] + x*y;
}
for(int y_count = 0; y_count < y; y_count++) {
for(int x_count = 0; x_count < x; x_count++) {
for(int z_count = 0; z_count < 16; z_count++) {
dout[out++] = *(dout_array[z_count]++); // incrementing dout_array[z_count] moves it along the x-axis; this is the loop I want to convert to NEON intrinsics
}
}
}
}
din[] is the original array; it is logically a 3-D cube stored as a 1-D array. The cube has three axes: x, y and z (= 16). The original array din[] stores elements along the x-axis first, then the y-axis, and finally the z-axis. My code changes the order to z-axis first, then x-axis, and finally y-axis. I would like to use NEON intrinsics in the innermost for loop, but it seems that this cannot be done.
Your code rearranges a three-dimensional array int8_t (&input)[16][y][x] into int8_t (&output)[y][x][16], which is equivalent to transposing a 2D array int8_t (&in)[16][x*y] into int8_t (&out)[x*y][16].
This can definitely benefit from ARM NEON intrinsics, which can interleave/deinterleave either registers (vzip, vuzp) or memory content (vldN, vstN).
#include <arm_neon.h>

// planarizes the next 128 bytes (8 positions x 16 values) to 16 planes
void planarize(int8_t *in, int8_t *out, int xy) {
    // o_1, o_2, o_3 already point at planes 4, 8 and 12
    int8_t * o_1 = out + 4*xy;
    int8_t * o_2 = out + 8*xy;
    int8_t * o_3 = out + 12*xy;
    int8x16x4_t a = vld4q_s8(in); in+=64;
    int8x16x4_t b = vld4q_s8(in); in+=64;
    int8x16x2_t c = vuzpq_s8(a.val[0], b.val[0]);
    int8x16x2_t d = vuzpq_s8(a.val[1], b.val[1]);
    int8x16x2_t e = vuzpq_s8(a.val[2], b.val[2]);
    int8x16x2_t f = vuzpq_s8(a.val[3], b.val[3]);
    c = vuzpq_s8(c.val[0], c.val[1]);
    d = vuzpq_s8(d.val[0], d.val[1]);
    e = vuzpq_s8(e.val[0], e.val[1]);
    f = vuzpq_s8(f.val[0], f.val[1]);
    // now c = 0 16 32 48 64 80 96 112 4 20 36 52 68 84 100 116
    //         8 24 40 56 72 88 104 120 12 28 44 60 76 92 108 124
    // d = c + 1, e = d + 1, f = e + 1
    // planes 0..3
    vst1_s8(out + 0 * xy, vget_low_s8(c.val[0]));
    vst1_s8(out + 1 * xy, vget_low_s8(d.val[0]));
    vst1_s8(out + 2 * xy, vget_low_s8(e.val[0]));
    vst1_s8(out + 3 * xy, vget_low_s8(f.val[0]));
    // planes 4..7 (offsets are relative to o_1, so they run 0..3 again)
    vst1_s8(o_1 + 0 * xy, vget_high_s8(c.val[0]));
    vst1_s8(o_1 + 1 * xy, vget_high_s8(d.val[0]));
    vst1_s8(o_1 + 2 * xy, vget_high_s8(e.val[0]));
    vst1_s8(o_1 + 3 * xy, vget_high_s8(f.val[0]));
    // planes 8..11
    vst1_s8(o_2 + 0 * xy, vget_low_s8(c.val[1]));
    vst1_s8(o_2 + 1 * xy, vget_low_s8(d.val[1]));
    vst1_s8(o_2 + 2 * xy, vget_low_s8(e.val[1]));
    vst1_s8(o_2 + 3 * xy, vget_low_s8(f.val[1]));
    // planes 12..15
    vst1_s8(o_3 + 0 * xy, vget_high_s8(c.val[1]));
    vst1_s8(o_3 + 1 * xy, vget_high_s8(d.val[1]));
    vst1_s8(o_3 + 2 * xy, vget_high_s8(e.val[1]));
    vst1_s8(o_3 + 3 * xy, vget_high_s8(f.val[1]));
}
The opposite would interleave from 16 independent planes
int8x16x2_t load4(int8_t *in, int xy) {
int8x8_t a0 = vld1_s8(in);
int8x8_t a1 = vld1_s8(in + xy);
int8x8_t a2 = vld1_s8(in + 2 * xy);
int8x8_t a3 = vld1_s8(in + 3 * xy);
// zip planes 0/2 and 1/3 first, so that the final byte zip yields plane order 0 1 2 3
auto a = vzipq_s8(vcombine_s8(a0, a0), vcombine_s8(a2, a2)).val[0];
auto b = vzipq_s8(vcombine_s8(a1, a1), vcombine_s8(a3, a3)).val[0];
return vzipq_s8(a,b);
}
int8_t *store4(int8x16x2_t a, int8x16x2_t b, int8x16x2_t c, int8x16x2_t d, int8_t *out) {
int32x4x4_t A{
vreinterpretq_s32_s8(a.val[0]),
vreinterpretq_s32_s8(b.val[0]),
vreinterpretq_s32_s8(c.val[0]),
vreinterpretq_s32_s8(d.val[0])};
int32x4x4_t B{
vreinterpretq_s32_s8(a.val[1]),
vreinterpretq_s32_s8(b.val[1]),
vreinterpretq_s32_s8(c.val[1]),
vreinterpretq_s32_s8(d.val[1])};
vst4q_s32((int32_t*)out, A); out += 64;
vst4q_s32((int32_t*)out, B); out += 64;
return out;
}
void interleave(int8_t *in, int8_t *out, int xy) {
int w = xy;
while (w >= 8) {
auto a = load4(in, xy);
auto b = load4(in + 4*xy, xy);
auto c = load4(in + 8*xy, xy);
auto d = load4(in + 12*xy, xy);
in += 8;
out = store4(a,b,c,d, out);
w -= 8;
}
}
Handling the excess ((xy & 7) != 0) can be done by processing one more full block, aligned at in_ptr + xy - 8 and out_ptr + xy * 16 - 8*16.
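Since I could not test the NEON routines, it may help to compare their output against a plain scalar version. The two functions below are only a reference sketch with made-up names (interleaveRef, planarizeRef); they assume 16 planes of xy bytes each, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: planar in[16][xy] -> interleaved out[xy][16].
std::vector<std::int8_t> interleaveRef(const std::vector<std::int8_t>& in,
                                       std::size_t xy) {
    const std::size_t planes = 16;
    assert(in.size() == planes * xy);
    std::vector<std::int8_t> out(in.size());
    for (std::size_t p = 0; p < xy; ++p)
        for (std::size_t z = 0; z < planes; ++z)
            out[p * planes + z] = in[z * xy + p];
    return out;
}

// Scalar reference for the opposite direction:
// interleaved in[xy][16] -> planar out[16][xy].
std::vector<std::int8_t> planarizeRef(const std::vector<std::int8_t>& in,
                                      std::size_t xy) {
    const std::size_t planes = 16;
    assert(in.size() == planes * xy);
    std::vector<std::int8_t> out(in.size());
    for (std::size_t p = 0; p < xy; ++p)
        for (std::size_t z = 0; z < planes; ++z)
            out[z * xy + p] = in[p * planes + z];
    return out;
}
```

Running both the NEON and the scalar version on the same input (including an xy that is not a multiple of 8) and comparing the buffers byte by byte should catch ordering mistakes quickly.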
I am trying to implement a dark (not exactly) emboss filter. When I use it on the square 512x512 Lena image, the result is good.
But when I use it on an image with a rectangular shape, e.g. 1280x720, the result is all messed up. Why is that? The image format is RGB.
GOOD result with Lena 512x512 (original):
WRONG result with 1280x720 image (original not same size just for comparison):
For a 24-bit image, each row must be a multiple of 4 bytes. If the width of the image is 682, it needs padding, because 682*3 is not a multiple of 4. Try changing the image width to 680 and try again.
To pad the image rows, use the following formula:
int pad = 4 - WIDTH % 4;
if(pad == 4) pad = 0;
WIDTH += pad;
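If you would rather keep the real width and only account for the padding bytes, the row size can be computed directly in bytes. This is a small sketch assuming rows are aligned to 4 bytes as described above; paddedRowSize is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per row when each row is padded up to a multiple of 4 bytes.
std::size_t paddedRowSize(std::size_t width, std::size_t bytesPerPixel) {
    assert(bytesPerPixel > 0);
    return (width * bytesPerPixel + 3) / 4 * 4;  // round up to a multiple of 4
}
```

For a width of 682 and 3 bytes per pixel this gives 2048 bytes per row (2046 bytes of pixel data plus 2 padding bytes), while 680 and 1280 need no padding at all.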
Change the condition to fb_j < HEIGHT - 1 - FILTER_HEIGHT and fb_i < WIDTH - 1 - FILTER_WIDTH to avoid buffer overflow.
The bitmap is scanned from top to bottom. It worked fine when I switched the dimensions as follows (although I loaded the bitmap differently):
//Pixel frame_buffer[WIDTH][HEIGHT];
//Pixel temp_buffer[WIDTH][HEIGHT];
Pixel frame_buffer[HEIGHT][WIDTH];
Pixel temp_buffer[HEIGHT][WIDTH];
...
for(int fb_j = 1; fb_j < HEIGHT - 1 - FILTER_HEIGHT; fb_j++) {
for(int fb_i = 1; fb_i < WIDTH - 1 - FILTER_WIDTH; fb_i++) {
float r = 0, g = 0, b = 0;
for(int ker_i = 0; ker_i < FILTER_WIDTH; ker_i++) {
for(int ker_j = 0; ker_j < FILTER_HEIGHT; ker_j++) {
r += ((float)(frame_buffer[fb_j + ker_j][fb_i + ker_i].r / 255.0) * emboss_kernel[ker_j][ker_i]);
g += ((float)(frame_buffer[fb_j + ker_j][fb_i + ker_i].g / 255.0) * emboss_kernel[ker_j][ker_i]);
b += ((float)(frame_buffer[fb_j + ker_j][fb_i + ker_i].b / 255.0) * emboss_kernel[ker_j][ker_i]);
}
}
if(r > 1.0) r = 1.0;
else if(r < 0) r = 0;
if(g > 1.0) g = 1.0;
else if(g < 0) g = 0;
if(b > 1.0) b = 1.0;
else if(b < 0) b = 0;
// Output buffer which will be rendered after convolution
temp_buffer[fb_j][fb_i].r = (GLubyte)(r*255.0);
temp_buffer[fb_j][fb_i].g = (GLubyte)(g*255.0);
temp_buffer[fb_j][fb_i].b = (GLubyte)(b*255.0);
}
}
Also try running a direct copy for testing. Example:
temp_buffer[fb_j][fb_i].r = frame_buffer[fb_j][fb_i].r;
temp_buffer[fb_j][fb_i].g = frame_buffer[fb_j][fb_i].g;
temp_buffer[fb_j][fb_i].b = frame_buffer[fb_j][fb_i].b;
I am trying to implement a Navier-Stokes solver in 2D using CUDA. I am using Jacobi's method to solve the system of difference equations. I am dividing the code in 4x4 blocks consisting of 16x16 threads. As every inner point in my matrix (of dimension 64x64) requires its top, bottom, left and right element to compute its new value, I create a new shared matrix of 18x18 dimension for every block. I read all the values into the matrix in this fashion - The thread with indices (0, 0) will write its value into the (1, 1) element in the matrix and will also attempt to read the element above it and the one to its left if this access is not exceeding the boundary. Once this read is done, I update the values of all the internal points and then write them back into memory.
I end up getting garbage values in the matrix pn, even though all the values are initialized correctly. I honestly cannot see where I'm going wrong. Can someone help me with this?
My kernel -
__global__ void red_psi (float *psi_o, float *psi_n, float *e, float *omega, float l1)
{
// m = n = 64
int i1 = blockIdx.x;
int j1 = blockIdx.y;
int i2 = threadIdx.x;
int j2 = threadIdx.y;
int i = (i1 * blockDim.x) + i2; // Actual row of the element
int j = (j1 * blockDim.y) + j2; // Actual column of the element
int l = i * n + j;
// e_XX --> variables refer to the expanded shared memory location that accommodates the halo elements
//Current Local ID with radius offset.
int e_li = i2 + 1;
int e_lj = j2 + 1;
// Variable pointing at top and bottom neighbouring location
int e_li_prev = e_li - 1;
int e_li_next = e_li + 1;
// Variable pointing at left and right neighbouring location
int e_lj_prev = e_lj - 1;
int e_lj_next = e_lj + 1;
__shared__ float po[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
__shared__ float pn[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
__shared__ float oo[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
//__shared__ float ee[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
if (i2 < 1) // copy top and bottom halo
{
//Copy Top Halo Element
if (blockIdx.y > 0) // Boundary check
{
po[i2][e_lj] = psi_o[l - n];
//pn[i2][e_lj] = psi_n[l - n];
oo[i2][e_lj] = omega[l - n];
//printf ("i_pn[%d][%d] = %f\n", i2, e_lj, oo[i2][e_lj]);
}
//Copy Bottom Halo Element
if (blockIdx.y < (gridDim.y - 1)) // Boundary check
{
po[1 + BLOCK_SIZE][e_lj] = psi_o[l + n];
//pn[1 + BLOCK_SIZE][e_lj] = psi_n[l + n];
oo[1 + BLOCK_SIZE][e_lj] = omega[l + n];
//printf ("j_pn[%d][%d] = %f\n", 1 + BLOCK_SIZE, e_lj, oo[1 + BLOCK_SIZE][e_lj]);
}
}
if (j2 < 1) // copy left and right halo
{
if (blockIdx.x > 0) // Boundary check
{
po[e_li][j2] = psi_o[l - 1];
//pn[e_li][j2] = psi_n[l - 1];
oo[e_li][j2] = omega[l - 1];
//printf ("k_pn[%d][%d] = %f\n", e_li, j2, oo[e_li][j2]);
}
if (blockIdx.x < (gridDim.x - 1)) // Boundary check
{
po[e_li][1 + BLOCK_SIZE] = psi_o[l + 1];
//pn[e_li][1 + BLOCK_SIZE] = psi_n[l + 1];
oo[e_li][1 + BLOCK_SIZE] = omega[l + 1];
//printf ("l_pn[%d][%d] = %f\n", e_li, 1 + BLOCK_SIZE, oo[e_li][BLOCK_SIZE + 1]);
}
}
// copy current location
po[e_li][e_lj] = psi_o[l];
//pn[e_li][e_lj] = psi_n[l];
oo[e_li][e_lj] = omega[l];
//printf ("o_pn[%d][%d] = %f\n", e_li, e_lj, oo[e_li][e_lj]);
__syncthreads ();
// Checking whether we have an internal point.
if ((i >= 1 && i < (m - 1)) && (j >= 1 && j < (n - 1)))
{
//printf ("Calculating for - (%d, %d)\n", i, j);
pn[e_li][e_lj] = 0.25 * (po[e_li_next][e_lj] + po[e_li_prev][e_lj] + po[e_li][e_lj_next] + po[e_li][e_lj_prev] + h*h*oo[e_li][e_lj]);
//printf ("n_pn[%d][%d] (%d, %d), a(%d, %d) = %f\n", e_li_prev, e_lj, i1, j1, i, j, po[e_li_prev][e_lj]);
pn[e_li][e_lj] = po[e_li][e_lj] + 1.0 * (pn[e_li][e_lj] - po[e_li][e_lj]);
__syncthreads ();
psi_n[l] = pn[e_li][e_lj];
e[l] = po[e_li][e_lj] - pn[e_li][e_lj];
}
}
This is how I invoke the kernel -
dim3 threadsPerBlock (4, 4);
dim3 numBlocks (4, 4);
red_psi<<<numBlocks, threadsPerBlock>>> (d_xn, d_xx, d_e, d_w, l1);
(d_xx, d_xn, d_e, d_w are all float arrays of size 4096)
Solved: I had switched blockDim.x and blockDim.y when copying the top / bottom and the left / right halo elements.
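For bugs like this, a plain CPU version of the same update can help to check the kernel output. The following is only a sketch in C++: jacobiSweepRef is a made-up name, the grid is assumed to be m x n in row-major order like in the kernel, and the relaxation step with factor 1.0 is left out, so the last bits may differ from the GPU result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep on an m x n row-major grid, mirroring the kernel's
// update for interior points. Boundary entries of psiNew and err are
// left untouched.
void jacobiSweepRef(const std::vector<float>& psiOld,
                    const std::vector<float>& omega,
                    std::vector<float>& psiNew,
                    std::vector<float>& err,
                    std::size_t m, std::size_t n, float h) {
    assert(m >= 3 && n >= 3);
    assert(psiOld.size() == m * n && omega.size() == m * n);
    assert(psiNew.size() == m * n && err.size() == m * n);
    for (std::size_t i = 1; i + 1 < m; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            const std::size_t l = i * n + j;
            // bottom, top, right and left neighbours plus the source term
            const float updated = 0.25f * (psiOld[l + n] + psiOld[l - n] +
                                           psiOld[l + 1] + psiOld[l - 1] +
                                           h * h * omega[l]);
            psiNew[l] = updated;
            err[l] = psiOld[l] - updated;
        }
    }
}
```

Copying psi_n back from the device and comparing it with psiNew element by element shows directly which points (for example the ones next to a block border) were computed from wrong halo values.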
In my application, we need to display video frames on the screen. I use libvpx to decode video from WebM, but frames are decoded to YUV format (VPX_IMG_FMT_I420 according to the documentation). I need the output format to be RGB, and the documentation says an image can have an RGB format (VPX_IMG_FMT_RGB24). I have a formula for converting YUV->RGB:
R = Y + 1.13983 * (V - 128);
G = Y - 0.39465 * (U - 128) - 0.58060 * (V - 128);
B = Y + 2.03211 * (U - 128);
But I think VP8->YUV->RGB is too many conversions. Is there a way to set the output frame format of the conversion function?
If you can afford to use Intel's IPP library, here is a CPU-friendly piece of code that you can try in your project:
unsigned char* mpRGBBuffer;
void VPXImageToRGB24(vpx_image_t* pImage, bool isUsingBGR)
{
    const unsigned int rgbBufferSize = pImage->d_w * pImage->d_h * 3;
    // mpRGBBuffer - allocate your raw RGB buffer (rgbBufferSize bytes)...
    const IppiSize sz = { static_cast<int>(pImage->d_w), static_cast<int>(pImage->d_h) };
    const Ipp8u* src[3] = { pImage->planes[VPX_PLANE_Y], pImage->planes[VPX_PLANE_U], pImage->planes[VPX_PLANE_V] };
    int srcStep[3] = { pImage->stride[VPX_PLANE_Y], pImage->stride[VPX_PLANE_U], pImage->stride[VPX_PLANE_V] };
    // the destination step is the packed row size: 3 bytes per pixel
    if (isUsingBGR) ippiYCbCr420ToBGR_8u_P3C3R(src, srcStep, mpRGBBuffer, pImage->d_w * 3, sz);
    else ippiYCbCr420ToRGB_8u_P3C3R(src, srcStep, mpRGBBuffer, pImage->d_w * 3, sz);
}
If you don't want to use IPP, here is a link to a working piece of code that could be really useful. I tested it and it works 100%, but I'm not sure about the CPU cost.
Here is the code from the link above (in case link fails...)
inline int clamp8(int v)
{
return std::min(std::max(v, 0), 255);
}
Image VP8Decoder::convertYV12toRGB(const vpx_image_t* img)
{
Image rgbImg(img->d_w, img->d_h);
std::vector<uint8_t>& data = rgbImg.data;
uint8_t *yPlane = img->planes[VPX_PLANE_Y];
uint8_t *uPlane = img->planes[VPX_PLANE_U];
uint8_t *vPlane = img->planes[VPX_PLANE_V];
int i = 0;
for (unsigned int imgY = 0; imgY < img->d_h; imgY++) {
for (unsigned int imgX = 0; imgX < img->d_w; imgX++) {
int y = yPlane[imgY * img->stride[VPX_PLANE_Y] + imgX];
int u = uPlane[(imgY / 2) * img->stride[VPX_PLANE_U] + (imgX / 2)];
int v = vPlane[(imgY / 2) * img->stride[VPX_PLANE_V] + (imgX / 2)];
int c = y - 16;
int d = (u - 128);
int e = (v - 128);
// TODO: adjust colors ?
int r = clamp8((298 * c + 409 * e + 128) >> 8);
int g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8);
int b = clamp8((298 * c + 516 * d + 128) >> 8);
// TODO: cast instead of clamp8
data[i + 0] = static_cast<uint8_t>(r);
data[i + 1] = static_cast<uint8_t>(g);
data[i + 2] = static_cast<uint8_t>(b);
i += 3;
}
}
return rgbImg;
}