Faster way to copy a region from a bitmap in C

I am using the following way to copy a region from a bitmap in rgb565 pixel format:
void bmpcpy(size_t left, size_t top, size_t right, size_t bottom) {
size_t index = 0;
do {
do {
bmpCopy[index] = bmpSrc[(top * BMP_WIDTH) + left];
index++;
} while (++left < right);
} while (++top < bottom);
}
Is there a faster way to do the copy?

There might be faster ways using memcpy or accelerated graphics APIs, but first notice that your code is flawed:
bmpCopy and bmpSrc are not defined; it is unlikely that they should be global variables.
bmpCopy is assumed to have a stride of right - left pixels, which is not necessarily correct because of alignment constraints.
left is not reset for each row, so only the first row is copied correctly.
the width and height of the region are assumed to be non-zero; the do/while loops copy at least one pixel even for an empty region.
Depending on the type of bmpSrc, the parity and size of the width, and the alignment of the source and destination pointers, it might be more efficient to copy multiple pixels at a time using a larger type.
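A minimal sketch of the memcpy approach, addressing the points above. It assumes 16-bit rgb565 pixels and passes the buffers and their strides (in pixels) as parameters; the function name and all parameter names are hypothetical and not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy the rectangle [left, right) x [top, bottom) from src into dst.
 * src_stride and dst_stride are row lengths in pixels (assumed parameters),
 * so padded rows are handled. The first copied pixel lands at dst[0].
 */
static void bmp_copy_region(uint16_t *dst, size_t dst_stride,
                            const uint16_t *src, size_t src_stride,
                            size_t left, size_t top,
                            size_t right, size_t bottom)
{
    assert(left <= right && top <= bottom);

    size_t width = right - left;
    if (width == 0)
        return; /* empty region: nothing to copy */

    /* One memcpy per row; the column offset is applied fresh for every row. */
    for (size_t y = top; y < bottom; y++) {
        memcpy(dst + (y - top) * dst_stride,
               src + y * src_stride + left,
               width * sizeof *src);
    }
}
```

For example, bmp_copy_region(copy, 2, bitmap, 4, 1, 1, 3, 3) copies a 2x2 block out of a bitmap that is 4 pixels wide. Whether this beats the hand-written loop depends on the platform's memcpy, so it is worth measuring.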

Related

C (OSDev) - How could I shift the contents of a 32-bit framebuffer upwards efficiently?

I'm working on writing a hobby operating system. Currently a large struggle that I'm having is attempting to scroll the framebuffer upwards.
It's simply a 32-bit linear framebuffer.
I have access to a few tools that could be helpful:
some of the mem* functions from libc: memset, memcpy, memmove, and memcmp
direct access to the framebuffer
the width, height, and size in bytes, of said framebuffer
a previous attempt that managed to scroll it up a few lines, albeit EXTREMELY slowly, it took roughly 25 seconds to scroll the framebuffer up by 5 pixels
speaking of which, my previous attempt:
for (uint64_t i = 0; i != atoi(numLines); i++) {
for (uint64_t j = 0; j != bootboot.fb_width; j++) {
for (uint64_t k = 1; k != bootboot.fb_size; k++) {
((uint32_t *)&fb)[k - 1] = ((uint32_t *)&fb)[k];
}
}
}
A few things to note about the above:
numLines is a variable passed into the function, it's a char * that contains the number of lines to scroll up by, in a string. I eventually want this to be the number of actual text lines to scroll up by, but for now treating this as how many pixels to scroll up by is sufficient.
the bootboot struct is provided by the bootloader I use, it contains a few variables that could be of use: fb_width (the width of the framebuffer), fb_height (the height of the framebuffer), and fb_size (the size, in bytes, of the framebuffer)
the fb variable that I'm using the address of is also provided by the bootloader I use, it is a single byte that is placed at the first byte of the framebuffer, hence the need to cast it into a uint32_t * before using it.
Any and all help would be appreciated.
If I read the code correctly, what's happening with the triple nested loops is:
For every line to scroll,
For every pixel that the framebuffer is wide,
For every pixel in the entire framebuffer,
Move that pixel backwards by one.
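Essentially you're moving each pixel one pixel distance at a time, so it's no wonder it takes so long to scroll the framebuffer. The total number of pixel moves is (numLines * fb_width * fb_size), so if your framebuffer is 1024x768 and you treat fb_size as the pixel count, that's 5*1024*1024*768 moves, which is 4,026,531,840 moves. That's roughly 5000 times the amount of work required. (Since fb_size is actually a size in bytes, the inner loop indexes four times as many uint32_t elements as the framebuffer holds, so the real count is higher still and the loop runs past the end of the framebuffer.)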
Instead, you'll want to loop over the framebuffer only once, calculate that pixel's start and its end pointer, and only do the move once. Or you can calculate the source, destination, and size of the move once and then use memmove. Here's my attempt at this (with excessive comments):
// Convert string to integer
uint32_t numLinesInt = atoi(numLines);
// The destination of the move is just the top of the framebuffer
uint32_t* destination = (uint32_t*)&fb;
// Start the move from the top of the framebuffer plus however
// many lines we want to scroll.
uint32_t* source = (uint32_t*)&fb +
(numLinesInt * bootboot.fb_width);
// The total number of pixels to move is the size of the
// framebuffer minus the amount of lines we want to scroll.
uint32_t pixelSize = (bootboot.fb_height - numLinesInt)
* bootboot.fb_width;
// The total number of bytes is that times the size of one pixel.
uint32_t byteSize = pixelSize * sizeof(uint32_t);
// Do the move
memmove(destination, source, byteSize);
I haven't tested this, and I'm making a number of assumptions about how your framebuffer is laid out, so please make sure it works before using it. :)
(P.S. Also, if you put atoi(numLines) inside the end condition of the for loop, atoi will be called every time through the loop, instead of once at the beginning like you intended.)
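The same idea as a self-contained function, in case it is useful as a starting point. This is only a sketch: the function name, the pitch parameter (row length in pixels, which may be larger than the visible width on some hardware) and the fill colour are assumptions of mine, not something the bootloader provides in this form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Scroll a 32-bit linear framebuffer up by `lines` rows.
 * pitch is the distance between row starts in pixels (assumed parameter).
 * The rows uncovered at the bottom are filled with `fill`.
 */
static void fb_scroll_up(uint32_t *fb, size_t width, size_t height,
                         size_t pitch, size_t lines, uint32_t fill)
{
    assert(pitch >= width);

    if (lines > height)
        lines = height; /* scrolling further than the screen just clears it */

    /* Source and destination overlap, so memmove rather than memcpy. */
    memmove(fb, fb + lines * pitch,
            (height - lines) * pitch * sizeof *fb);

    /* Clear the rows that scrolled into view at the bottom. */
    for (size_t y = height - lines; y < height; y++)
        for (size_t x = 0; x < width; x++)
            fb[y * pitch + x] = fill;
}
```

A call might look like fb_scroll_up((uint32_t *)&fb, bootboot.fb_width, bootboot.fb_height, bootboot.fb_width, 5, 0), assuming rows are tightly packed; if the real row length differs from the width, pass that as pitch instead.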
Currently a large struggle that I'm having is attempting to scroll the framebuffer upwards.
The first problem is that the framebuffer is typically much slower to access than RAM (especially reads); so you want to do all the drawing in a buffer in RAM and then blit it efficiently (with a smaller number of much larger writes).
Once you have a buffer in RAM, you can make the buffer bigger than the screen. E.g. for a 1024 x 768 video mode you might have a 1024 x 1024 buffer. In that case small amounts of scrolling can often be done using the same "blit it efficiently" function; but sometimes you'll have to scroll the buffer in RAM.
To scroll the buffer in RAM you can cheat - treat it as a circular buffer and map a second copy into virtual memory immediately after the first. This allows you to (e.g.) copy 768 lines starting from the middle of the first copy without caring about hitting the end of the first buffer. The end result is that you can scroll the buffer in RAM without moving any data or changing the "blit it efficiently" function.
As a bonus, this also minimizes "tearing" artifacts. E.g. often you want to scroll the pixel data up and add more pixel data to the bottom, then blit it (without the user seeing an intermediate "half finished" frame).
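Here is a rough sketch of the circular-buffer idea. Note that it does not implement the double-mapping trick itself (that needs page-table support in your kernel); as a simpler stand-in, the blit is split into at most two memcpy calls where the visible window wraps around the end of the buffer. The ScrollBuf type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t *pixels; /* back buffer in RAM, rows * width pixels */
    size_t width;     /* pixels per row */
    size_t rows;      /* rows in the back buffer (>= visible rows) */
    size_t top;       /* back-buffer row currently shown at the screen top */
} ScrollBuf;

/* Scrolling only moves the start index; no pixel data is copied. */
static void scrollbuf_scroll(ScrollBuf *sb, size_t lines)
{
    sb->top = (sb->top + lines) % sb->rows;
}

/* Copy `visible` rows, starting at sb->top and wrapping, to the framebuffer. */
static void scrollbuf_blit(const ScrollBuf *sb, uint32_t *fb, size_t visible)
{
    assert(visible <= sb->rows);

    /* Rows available before the end of the back buffer is reached. */
    size_t first = sb->rows - sb->top;
    if (first > visible)
        first = visible;

    memcpy(fb, sb->pixels + sb->top * sb->width,
           first * sb->width * sizeof *fb);
    /* Remaining rows (possibly zero) come from the start of the buffer. */
    memcpy(fb + first * sb->width, sb->pixels,
           (visible - first) * sb->width * sizeof *fb);
}
```

With the second mapping in place, the two memcpy calls would collapse into one, which is the point of the trick described above.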

Can I compare 2 SDL_Surfaces (whether they are the same or not)?

While making a game with SDL2 in C, I have to compare 2 SDL_Surfaces to check a win condition, but I couldn't find a way to do so.
It seems you're interested in comparing two SDL_Surfaces, so here is how you do it. There is probably a better way to solve your specific problem, but anyways:
From the SDL Wiki, SDL_Surface has members of interest format, w, h, pitch, pixels.
format represents the pixel encoding information
format->format is the specific enumeration constant specifying a given encoding
w represents the number of pixels in a row of pixels in the surface
h represents the number of rows of pixels in the surface
pitch represents the byte length of a row
pixels is an array with all the pixel data
If you want to compare two SDL_Surfaces, you need to compare the pixels against one another. But first we should check that the pixel encoding and the dimensions match:
int SDL_Surfaces_comparable(SDL_Surface *s1, SDL_Surface *s2) {
    // format is a pointer to an SDL_PixelFormat in SDL2, so its fields are reached with ->
    return (s1->format->format == s2->format->format && s1->w == s2->w && s1->h == s2->h);
}
If SDL_Surfaces_comparable evaluates to true, we can check if two surfaces are equal by comparing the pixels fields byte by byte.
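int SDL_Surfaces_equal(SDL_Surface *s1, SDL_Surface *s2) {
    // a flat byte-by-byte comparison only makes sense if both rows have the same byte length
    if (!SDL_Surfaces_comparable(s1, s2) || s1->pitch != s2->pitch) {
        return 0;
    }
    // pitch is already the byte length of a row, so the # of bytes we want to check is pitch * rows
    int len = s1->pitch * s1->h;
    // pixels is a void *, so cast it once before indexing
    const uint8_t *p1 = (const uint8_t *)s1->pixels;
    const uint8_t *p2 = (const uint8_t *)s2->pixels;
    int i; // declared outside the loop because it is used after it
    for (i = 0; i < len; i++) {
        // check if any two pixel bytes are unequal
        if (p1[i] != p2[i])
            break;
    }
    // return true if we finished our loop without finding non-matching data
    return i == len;
}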
This assumes that pixel data is serialized as bytes without any padding, or that the padding is zeroed. I couldn't find any SDLPixel structure, so I'm assuming this is the standard way to compare pixels. I did find this link, which seems to verify my approach.
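If the padding at the end of each row cannot be trusted to match, one alternative is to compare only the meaningful bytes of every row with memcmp. The sketch below deliberately takes raw pointers and sizes instead of SDL types so that it stands on its own; the function name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Compare two pixel buffers row by row.
 * pitch_a / pitch_b: byte distance between row starts in each buffer.
 * row_bytes: number of meaningful bytes per row (padding is ignored).
 * Returns 1 if all rows match, 0 otherwise.
 */
static int pixel_rows_equal(const void *a, size_t pitch_a,
                            const void *b, size_t pitch_b,
                            size_t row_bytes, size_t rows)
{
    assert(row_bytes <= pitch_a && row_bytes <= pitch_b);

    const uint8_t *pa = a;
    const uint8_t *pb = b;
    for (size_t y = 0; y < rows; y++) {
        if (memcmp(pa + y * pitch_a, pb + y * pitch_b, row_bytes) != 0)
            return 0; /* first differing row ends the comparison */
    }
    return 1;
}
```

For two SDL2 surfaces that passed SDL_Surfaces_comparable, the call would presumably be pixel_rows_equal(s1->pixels, s1->pitch, s2->pixels, s2->pitch, s1->w * s1->format->BytesPerPixel, s1->h), locking the surfaces first if SDL_MUSTLOCK says so.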

Accessing portions of a dynamic array in C?

I know, another dynamic array question, this one is a bit different though so maybe it'll be worth answering. I am making a terrain generator in C with SDL, I am drawing 9 chunks surrounding the screen, proportional to the screen size, that way terrains can be generated easier in the future.
This means that I have to be able to resize the array at any given point, so I made a dynamic array (at least according to an answer I found on stack it is) and everything SEEMS to work fine, nothing is crashing, it even draws a single tile....but just one. I am looking at it and yeah, sure enough it's iterating through the array but only writing to one portion of memory. I am using a struct called Tile that just holds the x, y, w, and h of a rectangle.
This is the code I am using to allocate the array
Tile* TileMap = (Tile*)malloc(0 * sizeof(Tile*));
int arrayLen = sizeof(TileMap);
TileMap = (Tile*)realloc(TileMap, (totalTiles) * sizeof(Tile));
arrayLen = sizeof(totalTiles * sizeof(Tile));
The totalTiles are just the number of tiles that I have calculated previously are on the screen, I've checked the math and it's correct, and it even allocates the proper amount of memory. Here is the code I use to initialize the array:
//Clear all elements to zero.
for (int i = 0; i < arrayLen; i++)
{
Tile tile = {};
TileMap[i] = tile;
}
So what's weird to me is it is considering the size of a tile (16 bytes) * the totalTiles (78,000) is equaling 4....When I drill down into the array, it only has one single rect in it that gets cleared as well, so then when I go calculate the sizes of each tile:
//Figure out Y and heights
for (int i = startY; i <= (startY*(-1)) * 2; i += TILE_HEIGHT)
{
TileMap[i].y = i * TILE_HEIGHT;
TileMap[i].h = TILE_HEIGHT;
//Figure out X and widths
for (int j = startX; j <= (startX*(-1)) * 2; j += TILE_WIDTH)
{
TileMap[i].x = i * TILE_WIDTH;
TileMap[i].w = TILE_WIDTH;
}
}
*Side note, the startX is the negative offset I am using to draw chunks behind the camera, so I times it by -1 to make it positive and then time it by two to get one chunk in front of the camera
Alright, so obviously that only initializes one, and here is the render code
for (int i = 0; i < totalTiles; i++)
{
SDL_Rect currentTile;
currentTile.x = TileMap[i].x;
currentTile.y = TileMap[i].y;
currentTile.w = TileMap[i].w;
currentTile.h = TileMap[i].h;
SDL_RenderDrawRect(renderer, &currentTile);
}
free(TileMap);
So what am I doing wrong here? I mean I literally am just baffled right now...And before Vectors get recommended in place of dynamic arrays, I don't really like using them and I want to learn to deal with stuff like this, not just implement some simple fix.
Lots of confusion (which is commonplace with C pointers).
The following code doesn't provide the expected answer: arrayLen = sizeof(totalTiles * sizeof(Tile));
totalTiles * sizeof(Tile) is not a type but an expression, and sizeof applied to an expression yields the size of that expression's type, not its value (see molbnilo's comment). Here the type is size_t, so the result is 4 on a 32-bit build, whatever totalTiles is.
Anyway, proper answer should be :
arrayLen = totalTiles;
Because that's what you need in your next loop :
//Clear all elements to zero.
for (int i = 0; i < arrayLen; i++)
{
Tile tile = {};
TileMap[i] = tile;
}
You don't need the size of the table, you need its number of elements.
There are other confusions in your sample. They don't directly impact the rest of the code, but it is better to correct them:
Tile* TileMap = (Tile*)malloc(0 * sizeof(Tile*)); : avoid allocating a size of 0 (and the element size should be sizeof(Tile), not sizeof(Tile*)).
int arrayLen = sizeof(TileMap); : no, it's not the array length, just the size of the pointer (hence 4 bytes in a 32-bit binary). Remember TileMap is not defined as an array, but as a pointer to memory allocated with malloc() and then realloc().
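To make the point concrete, here is a small sketch that keeps the element count next to the pointer, since sizeof cannot recover it. The Tile layout (four ints) is taken from the question's description; the TileMap wrapper struct, the function names and the row-major grid fill are my own assumptions about what was intended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int x, y, w, h; } Tile;

typedef struct {
    Tile *tiles;  /* heap-allocated array */
    size_t count; /* number of elements, tracked by hand */
} TileMap;

/* Allocate cols * rows tiles and lay them out as a grid. Returns 0 on failure. */
static int tilemap_init(TileMap *map, size_t cols, size_t rows,
                        int tile_w, int tile_h)
{
    assert(cols > 0 && rows > 0);

    /* calloc zero-initialises, so no separate clearing loop is needed. */
    map->tiles = calloc(cols * rows, sizeof *map->tiles);
    if (map->tiles == NULL) {
        map->count = 0;
        return 0;
    }
    map->count = cols * rows;

    /* Index with a row/column pair, not with pixel coordinates. */
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            Tile *t = &map->tiles[r * cols + c];
            t->x = (int)c * tile_w;
            t->y = (int)r * tile_h;
            t->w = tile_w;
            t->h = tile_h;
        }
    }
    return 1;
}

static void tilemap_free(TileMap *map)
{
    free(map->tiles);
    map->tiles = NULL;
    map->count = 0;
}
```

The render loop can then run from 0 to map.count. Any camera offset such as startX would be added to x and y when drawing rather than used as an array index.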

Fastest way to traverse columns in a multidimensional array in C

I'm currently working on a program to solve the red/blue computation; program is written in C.
Description of the problem is here : http://www.cs.utah.edu/~mhall/cs4961f10/CS4961-L9.pdf
tl;dr you have a grid of colors (red/blue/white), first red cells move to the right according to certain rules, then blue cells move down according to other rules.
I've got my program working and giving correct output, and I'm now trying to see if I can't speed it up at all.
Using Intel's VTune Amplifier (this is for a parallel programming course, and we're doing pthreads in visual studio with parallel studio integrated), I've identified that the biggest hotspot in my code is when moving blue cells.
Implementation details: grid is stored as a dynamically allocated int **, set up this way
globalBoard = malloc(sizeof(int *) * size);
for (i = 0; i < size; i++)
{
globalBoard[i] = malloc(sizeof(int) * size);
for (j = 0; j < size; j++)
globalBoard[i][j] = rand() % 3;
}
After some research, I believe the cause of the hotspot (almost 4 times as much CPU time as moving red cells) is cache misses when traversing column by column.
I understand that under the hood, this grid will be stored as a 1d array, so when I move red cells to the right and go row by row, I'm most often checking contiguous values, so the CPU doesn't need to load new values into the cache as often, whereas going column by column results in jumping around through the array by amounts that only increase as the size of the board does.
All that being said, I want this particular section to go faster. Here's the code as it stands now :
void blueStep(int col)
{
int i;
int local[size];
for (i = 0; i < size; i++) local[i] = globalBoard[i][col]; /* keep i++ out of the subscript: reading and modifying i in one expression is undefined */
for (i = 0; i < size; i++)
{
if (i < size - 1)
{
if (globalBoard[i][col] == 2 && globalBoard[i + 1][col] == 0)
{
local[i++] = 0;
local[i] = 2;
}
}
else
{
if (globalBoard[i][col] == 2 && globalBoard[0][col] == 0)
{
local[i++] = 0;
local[0] = 2;
}
}
}
for (i = 0; i < size; i++)
globalBoard[i][col] = local[i];
}
Here, col is which column to work on and size is how big the grid is (it's always square).
I was thinking that I might be able to do some kind of fancy pointer arithmetic to speed this up, and was reading this : http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/pointer.html.
Looking at that, I feel like I might need to change how I declare the grid in order to take advantage of 2d array pointer arithmetic, but I'm still not sure how I would go about traversing columns using that method.
Any help with that, or any other suggestions of fast ways to go through a column are welcome.
UPDATE: After a bit more research and discussion, it would seem my assumptions were incorrect. Turns out it's actually taking almost twice as long to write the results back to the global array than it is to loop over columns, due to false sharing. That said, I'm still somewhat curious to see if there are any better ways of doing column traversal.
I think the answer is to process the grid in tiles. You can do a very quick tile move, either down or right, in a 16x16 or 32x32 tile. The two moves will be effectively the same, and run at the same speed: read all values into XMM registers, process, write. You may want to investigate the MASKMOVDQU instruction here. If I understand the nature of the problem, you can overlap tiles by one row/column and this will work okay if you process them in the usual (scan) order. If not, you have to handle stitching the tiles separately.
There is no truly fast way to do this in C code. However, you can try (1) changing your board type to be a uint8_t, (2) replacing all if .. statements with arithmetic, like this: value = (mask & value) | (~mask & newvalue), and (3) turning on maximum loop unrolling and auto-vectorization in the compiler options. This will give you a nice speedup - especially avoiding conditionals.
EDIT In addition to tiles that can fit in registers, you can also do a second level of tiles sized to fit in your cache. I think the combination will run at roughly your memory bandwidth.
EDIT Or, make your board type be two bits: pack four cells to a byte. Goes nicely with the replacing if statements with arithmetic idea :)
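As a rough illustration of suggestions (1) and (2), here is a branch-free blue step over a single column that has already been gathered into a contiguous uint8_t array. The cell encoding (0 = white, 2 = blue) follows the question's code; the function names and the separate in/out arrays are assumptions of mine, and whether the compiler actually vectorizes the loop has to be checked on your setup.

```c
#include <stddef.h>
#include <stdint.h>

enum { WHITE = 0, RED = 1, BLUE = 2 };

/* mask must be 0xFF (take if_set) or 0x00 (take if_clear). */
static inline uint8_t select_u8(uint8_t mask, uint8_t if_set, uint8_t if_clear)
{
    return (uint8_t)((mask & if_set) | (~mask & if_clear));
}

/*
 * One blue step on a column of n cells, wrapping at the bottom.
 * Every decision is based on the input column, so all cells move at once.
 */
static void blue_step_column(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t cur   = in[i];
        uint8_t above = in[(i + n - 1) % n];
        uint8_t below = in[(i + 1) % n];

        /* 0xFF if this blue cell moves down and leaves a white cell behind. */
        uint8_t leaves = (uint8_t)-((cur == BLUE) & (below == WHITE));
        /* 0xFF if the blue cell above moves into this white cell. */
        uint8_t enters = (uint8_t)-((cur == WHITE) & (above == BLUE));

        uint8_t v = select_u8(leaves, WHITE, cur);
        out[i] = select_u8(enters, BLUE, v);
    }
}
```

The modulo in the neighbour lookups is there only to keep the sketch short; handling the first and last cells outside the loop would likely be cheaper.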

Efficient algorithm for pixel neighborhood difference in OpenCV

I was reading this and this paper about hand/head tracking. They both talk about detecting motion computing the difference in a neighborhood of each pixel and comparing the result with a threshold:
Quoting from the first paper:
We use the temporal differencing method described in Ref. [41], which computes the absolute value of differences in the neighborhood surrounding each pixel, and then derive the accumulated difference by summing the difference of all neighboring pixels. When the accumulated difference is above a predetermined threshold, the pixel is assigned to the moving region.
Is there an efficient way to do it (possibly in OpenCV)?
The code I wrote is pretty naive and, besides losing the real-time, seems not to give better results than a simpler pixel-to-pixel difference:
template<class T> class Image {
private:
IplImage* imgp;
public:
Image(IplImage* img=0) {imgp=img;}
~Image(){imgp=0;}
void operator=(IplImage* img) {imgp=img;}
inline T* operator[](const int rowIndx) {
return ((T *)(imgp->imageData + rowIndx*imgp->widthStep));}
};
typedef Image<unsigned char> BwImage;
typedef Image<float> BwImageFloat;
void computeMovingRegion( IplImage* prev, IplImage* cur, IplImage *mov) {
BwImage _prev(prev);
BwImage _cur(cur);
BwImage _mov(mov);
for (int i = 3; i<prev->height-3; i++) {
for (int j=3; j<prev->width-3; j++) {
int res=0;
for (int k=i-3; k<i+3; k++)
for (int n=j-3; n<j+3; n++)
res += abs(_cur[k][n] -_prev[k][n]);
if (res>2000) {
_mov[i][j]=_cur[i][j];
}
else
_mov[i][j]=0;
}
}
}
Images are in grayscale. Don't think it matters, but I'm using MacOS 10.8 and Xcode 4.4.2.
You should be able to remove a lot of the redundancy if you first calculate the absolute difference image (i.e. abs(_cur[] - _prev[])) and then just iterate over this. There are a lot more optimisations you can do beyond this, but this would be a good start for relatively little effort.
Also note that your loop indexing looks wrong - if you want to do a 7x7 neighbourhood operation it should be:
for (int k=i-3; k<=i+3; k++)
for (int n=j-3; n<=j+3; n++)
...
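One way to take this further, sketched here in plain C on raw 8-bit grayscale buffers rather than IplImage: compute the absolute difference once per pixel and accumulate it into a summed-area table (integral image), so that each neighbourhood sum costs four lookups regardless of window size. The summed-area table is my own addition to the suggestion above, and the function name, the tightly packed row layout and the parameters are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * mov[p] = cur[p] where the summed |cur - prev| over the
 * (2*radius+1) x (2*radius+1) window around p exceeds threshold, else 0.
 * Border pixels without a full window are set to 0.
 * Images are width * height bytes, rows tightly packed (assumed layout).
 * Returns 0 if the temporary table cannot be allocated.
 */
static int moving_region(const uint8_t *prev, const uint8_t *cur, uint8_t *mov,
                         size_t width, size_t height,
                         int radius, uint32_t threshold)
{
    assert(radius >= 0);
    size_t r = (size_t)radius;
    size_t iw = width + 1; /* table has an extra zero row and column */

    /* uint32_t is enough while 255 * width * height fits in 32 bits. */
    uint32_t *sat = calloc(iw * (height + 1), sizeof *sat);
    if (sat == NULL)
        return 0;

    /* sat[(y+1)*iw + (x+1)] = sum of |cur - prev| over rows 0..y, cols 0..x */
    for (size_t y = 0; y < height; y++) {
        uint32_t row_sum = 0;
        for (size_t x = 0; x < width; x++) {
            size_t p = y * width + x;
            row_sum += (uint32_t)abs((int)cur[p] - (int)prev[p]);
            sat[(y + 1) * iw + (x + 1)] = sat[y * iw + (x + 1)] + row_sum;
        }
    }

    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            uint8_t out = 0;
            if (y >= r && x >= r && y + r < height && x + r < width) {
                size_t y0 = y - r, y1 = y + r + 1;
                size_t x0 = x - r, x1 = x + r + 1;
                uint32_t sum = sat[y1 * iw + x1] - sat[y0 * iw + x1]
                             - sat[y1 * iw + x0] + sat[y0 * iw + x0];
                if (sum > threshold)
                    out = cur[y * width + x];
            }
            mov[y * width + x] = out;
        }
    }

    free(sat);
    return 1;
}
```

A radius of 3 gives the 7x7 neighbourhood discussed above, with 2000 as the threshold from the question. OpenCV has its own routines for absolute differences and box sums, which are probably the better choice if you stay within the library.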
