Graphic buffer - horizontal and vertical filling, performance - c

I wonder if it is possible to solve a certain problem.
In short: get optimal performance by filling the buffer not only line by line but also column by column.
Description below:
A graphic buffer is given (i.e. intended to hold a bitmap)
#define WIDTH 320
#define HEIGHT 256
typedef struct {
    unsigned char r, g, b, a;
} sRGBA;
sRGBA* bufor_1;
bufor_1 = (sRGBA*)malloc(WIDTH*HEIGHT*sizeof(sRGBA));
There is no problem with filling it horizontally, line by line, because that is the 'cache friendly' case, which is the best one, e.g. floor and ceiling raycasting:
bufor_1 = (sRGBA*)malloc(WIDTH*HEIGHT*sizeof(sRGBA));
for (int y = 0; y < HEIGHT; ++y)
    for (int x = 0; x < WIDTH; ++x)
        bufor_1[x+y*WIDTH].r = 100;
The difference in performance appears when we want to fill such a buffer vertically, i.e. column by column, e.g. for wall rendering, which is done like this:
bufor_1 = (sRGBA*)malloc(WIDTH*HEIGHT*sizeof(sRGBA));
for (int x = 0; x < WIDTH; ++x)
    for (int y = 0; y < HEIGHT; ++y)
        bufor_1[x+y*WIDTH].r = 100;
The question that arises is whether it is possible to somehow combine efficient line-by-line and column-by-column filling.
From a few tests that I have performed, it turned out that if the buffer is presented as two-dimensional, i.e.
column-by-column filling is even faster than line-by-line in a one-dimensional one - but then it is the other way around, i.e. filling such a two-dimensional buffer line by line will be inefficient.
Solutions I was thinking about:
rotate the buffer 90 degrees; unfortunately it takes too much time, at least with the algorithms that I checked, unless there is some mega-fast O(1) way
some sort of buffer remapping so that some table contains pointers to the next pixels in the column, but it probably won't be 'cache friendly' or even worse - I haven't checked anyway
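One remapping in the spirit of the last idea is a tiled layout: the buffer is stored as small square tiles, so that neighbouring pixels in both directions tend to land in the same few cache lines. The sketch below is only an illustration under assumptions. TILE and tiled_index are hypothetical names, the 8-pixel tile edge is an assumed value, and whether this helps on a given machine has to be measured.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH  320
#define HEIGHT 256
#define TILE   8   /* assumed tile edge; WIDTH and HEIGHT must be multiples of it */

/* Map (x, y) to an index in a buffer stored as TILE x TILE tiles.
   Tiles are laid out row by row; pixels inside a tile are laid out row by row. */
size_t tiled_index(int x, int y)
{
    assert(x >= 0 && x < WIDTH && y >= 0 && y < HEIGHT);
    size_t tiles_per_row = WIDTH / TILE;
    size_t tile   = (size_t)(y / TILE) * tiles_per_row + (size_t)(x / TILE);
    size_t within = (size_t)(y % TILE) * TILE + (size_t)(x % TILE);
    return tile * (TILE * TILE) + within;
}
```

A pixel would then be written as bufor_1[tiled_index(x, y)].r = 100; in either loop order. The cost is that the buffer is no longer in plain row order, so it has to be converted back (or consumed by something that understands the tiling) before it is displayed.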


How to loop through blocks of pixels with minimum number of for loops

I have an image of width * height pixels in which i want to loop through blocks of pixels, say block size of 10 * 10. How can i do this with minimum number of loops?
I have tried by first looping through each column, then through each row and took the starting x and y position from this two outer loops. Then the loop goes from start position of the block and loops till the block size and manipulates the pixels. This consumes four nested loops.
for (int i = 0; i < Width; i+=Block_Size) {
    for (int j = 0; j < Height; j+=Block_Size) {
        for (int x = i; x < i + Block_Size; x++) {
            for (int y = j; y < j + Block_Size; y++) {
                //Get pixel values within the block
            }
        }
    }
}
How can i do this with minimum number of loops?
You can reduce the number of loops by completely unrolling as many loop levels as you like. For fixed raster dimensions, you could unroll them all, yielding a (probably lengthy) implementation with zero loops. For known Block_Size you can unroll one or both of the inner loops regardless of whether the overall dimensions are known, yielding as few as two loops remaining.
But why would you consider such a thing? The question seems to assume that there would be some kind of inherent advantage to reducing the depth of loop nest, but that's not necessarily true, and whatever effect there might be is likely to be small.
I'm inclined to guess that you've studied a bit of computational complexity theory, and taken away the idea that deep loop nests necessarily yield poorly-scaling performance, or even that deep loop nests have inherently poor performance, period. These are misconceptions, albeit relatively common ones, and they anyway look at the problem backwards.
The primary consideration in how the performance of your loop nest scales is how many times the body of the innermost loop (the //Get pixel values within the block part) is executed. You'll have roughly the same performance for any reasonable approach that causes it to be executed exactly once for every pixel in the raster, regardless of how many loops are involved. With that being the case, code clarity should be your goal, and your original four-loop nest is pretty clear.
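To make that concrete, here is a minimal sketch of the four-loop nest that also clamps the last block in each direction, so the image dimensions need not be multiples of the block size. The function name, the callback type and the visit counter are assumptions added for illustration; the counter only exists to show that the body runs exactly once per pixel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback invoked once per pixel. */
typedef void (*pixel_fn)(int x, int y, void *ctx);

/* Visit every pixel block by block; returns how many times the body ran. */
long for_each_block_pixel(int width, int height, int block,
                          pixel_fn fn, void *ctx)
{
    assert(width >= 0 && height >= 0 && block > 0);
    long visits = 0;
    for (int by = 0; by < height; by += block) {
        /* Clamp the block to the image so partial edge blocks are handled. */
        int y_end = (by + block < height) ? by + block : height;
        for (int bx = 0; bx < width; bx += block) {
            int x_end = (bx + block < width) ? bx + block : width;
            for (int y = by; y < y_end; y++) {
                for (int x = bx; x < x_end; x++) {
                    if (fn != NULL)
                        fn(x, y, ctx);
                    visits++;
                }
            }
        }
    }
    return visits;
}
```

For a 25 x 13 image with 10 x 10 blocks the body runs 325 times, the same as a plain two-loop scan over all pixels; only the visiting order differs.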
It is possible to achieve this with three loops, but in order to do that you will need to store information about where each block of pixels starts and how many blocks of pixels there are in total!
Independent of that, both the width as well as the height of the image have to be multiples of your Block_Size.
Here is how it is possible with three loops:
int numberOfBlocks = x; // placeholder: total number of blocks
int pixelBlockStartingPoints[numberOfBlocks] = { startingPoint1, startingPoint2, ... };
for(int i = 0; i < numberOfBlocks; i++){
    for(int j = pixelBlockStartingPoints[i]; j < pixelBlockStartingPoints[i] + Block_Size; j++){
        for(int k = pixelBlockStartingPoints[i]; k < pixelBlockStartingPoints[i] + Block_Size; k++){
            // Get Pixel-Data
        }
    }
}

matrix optimization - segmentation fault when using intrinsics and loop unrolling

I'm currently trying to optimize matrix operations with intrinsics and loop unrolling. I get a segmentation fault that I can't figure out. Here is the code I changed:
const int UNROLL = 4;
void outer_product(matrix *vec1, matrix *vec2, matrix *dst) {
    assert(vec1->dim.cols == 1 && vec2->dim.cols == 1 && vec1->dim.rows == dst->dim.rows && vec2->dim.rows == dst->dim.cols);
    __m256 tmp[4];
    for (int x = 0; x < UNROLL; x++) {
        tmp[x] = _mm256_setzero_ps();
    }
    for (int i = 0; i < vec1->dim.rows; i+=UNROLL*8) {
        for (int j = 0; j < vec2->dim.rows; j++) {
            __m256 row2 = _mm256_broadcast_ss(&vec2->data[j][0]);
            for (int x = 0; x<UNROLL; x++) {
                tmp[x] = _mm256_mul_ps(_mm256_load_ps(&vec1->data[i+x*8][0]), row2);
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }
        }
    }
}
void matrix_multiply(matrix *mat1, matrix *mat2, matrix *dst) {
    assert (mat1->dim.cols == mat2->dim.rows && dst->dim.rows == mat1->dim.rows && dst->dim.cols == mat2->dim.cols);
    for (int i = 0; i < mat1->dim.rows; i+=UNROLL*8) {
        for (int j = 0; j < mat2->dim.cols; j++) {
            __m256 tmp[4];
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_setzero_ps();
            }
            for (int k = 0; k < mat1->dim.cols; k++) {
                __m256 mat2_s = _mm256_broadcast_ss(&mat2->data[k][j]);
                for (int x = 0; x < UNROLL; x++) {
                    tmp[x] = _mm256_add_ps(tmp[x], _mm256_mul_ps(_mm256_load_ps(&mat1->data[i+x*8][k]), mat2_s));
                }
            }
            for (int x = 0; x < UNROLL; x++) {
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }
        }
    }
}
Here is the matrix struct. I didn't modify it.
typedef struct shape {
int rows;
int cols;
} shape;
typedef struct matrix {
shape dim;
float** data;
} matrix;
I used gdb to find out which line caused the segmentation fault, and it looked like it was _mm256_load_ps(). Am I indexing into the matrix in a wrong way, so that it cannot load from the correct address? Or is it a memory alignment problem?
In at least one place, you're doing 32-byte alignment-required loads with a stride of only 4 bytes. I think that's not what you actually meant to do, though:
for (int k = 0; k < mat1->dim.cols; k++) {
for (int x = 0; x < UNROLL; x++) {
_mm256_load_ps loads 8 contiguous floats, i.e. it loads data[i+x*8][k] to data[i+x*8][k+7]. I think you want data[i+x][k*8], and loop over k in the inner-most loop.
If you need unaligned loads / stores, use _mm256_loadu_ps / _mm256_storeu_ps. But prefer aligning your data to 32B, and pad the storage layout of your matrix so the row stride is a multiple of 32 bytes. (The actual logical dimensions of the array don't have to match the stride; it's fine to leave padding at the end of each row out to a multiple of 16 or 32 bytes. This makes loops much easier to write.)
You're not even using a 2D array (you're using an array of pointers to arrays of float), but the syntax looks the same as for float A[100][100], even though the meaning in asm is very different. Anyway, in Fortran 2D arrays the indexing goes the other way, where incrementing the left-most index takes you to the next position in memory. But in C, varying the left index by one takes you to a whole new row. (Pointed to by a different element of float **data, or in a proper 2D array, one row stride away.) Of course you're striding by 8 rows because of this mixup combined with using x*8.
Speaking of the asm, you get really bad results for this code especially with gcc, where it reloads 4 things for every vector, I think because it's not sure the vector stores don't alias the pointer data. Assign things to local variables to make sure the compiler can hoist them out of loops. (e.g. const float *mat1dat = mat1->data;.) Clang does slightly better, but the access pattern in the source is inherently bad and requires pointer-chasing for each inner-loop iteration to get to a new row, because you loop over x instead of k. I put it up on the Godbolt compiler explorer.
But really you should optimize the memory layout first, before trying to manually vectorize it. It might be worth transposing one of the arrays, so you can loop over contiguous memory for rows of one matrix and columns of the other while doing the dot product of a row and column to calculate one element of the result. Or it could be worth doing c[Arow,Bcol] += a_value_from_A * b[Arow,Bcol] inside an inner loop instead of transposing up front (but that's a lot of memory traffic). But whatever you do, make sure you're not striding through non-contiguous accesses to one of your matrices in the inner loop.
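As a scalar sketch of the second option above (accumulating into a row of the result inside the inner loop), assuming flat row-major arrays rather than the asker's float** layout: the function name and its parameters are made up for illustration. The point is only that the inner loop walks both b and c contiguously, which is the kind of code a compiler has a chance of auto-vectorizing.

```c
#include <assert.h>
#include <stddef.h>

/* c (n x m) = a (n x k_dim) * b (k_dim x m), all flat row-major arrays.
   Loop order i, k, j keeps the inner loop on contiguous memory in b and c. */
void matmul_ikj(const float *a, const float *b, float *c,
                size_t n, size_t k_dim, size_t m)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n * m; i++)
        c[i] = 0.0f;
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < k_dim; k++) {
            float a_ik = a[i * k_dim + k];      /* one value from A */
            for (size_t j = 0; j < m; j++)
                c[i * m + j] += a_ik * b[k * m + j];
        }
    }
}
```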
You'll also want to ditch the array-of-pointers thing and do manual 2D indexing (data[row * row_stride + col]) so your data is all in one contiguous block instead of having each row allocated separately. Making this change first, before you spend any time manually vectorizing, seems to make the most sense.
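A minimal sketch of what such a layout could look like, with one contiguous 32-byte-aligned block and a row stride padded to a multiple of 8 floats. The padded_matrix type and both function names are hypothetical and not part of the asker's code. It assumes a C11 library that provides aligned_alloc (some toolchains, MSVC among them, need a different call).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for the float** matrix. */
typedef struct {
    int    rows, cols;
    size_t stride;   /* floats per stored row, a multiple of 8 (32 bytes) */
    float *data;     /* element (r, c) lives at data[r * stride + c] */
} padded_matrix;

/* Round the logical column count up to a multiple of 8 floats. */
size_t padded_stride(int cols)
{
    return ((size_t)cols + 7) / 8 * 8;
}

/* Returns 0 on success, -1 if the allocation fails. */
int padded_matrix_init(padded_matrix *m, int rows, int cols)
{
    assert(m != NULL && rows > 0 && cols > 0);
    m->rows = rows;
    m->cols = cols;
    m->stride = padded_stride(cols);
    /* stride is a multiple of 8 floats, so bytes is a multiple of 32,
       which aligned_alloc requires for a 32-byte alignment. */
    size_t bytes = (size_t)rows * m->stride * sizeof(float);
    m->data = aligned_alloc(32, bytes);
    if (m->data == NULL)
        return -1;
    assert((uintptr_t)m->data % 32 == 0);
    memset(m->data, 0, bytes);   /* padding is zero, so it is harmless in sums */
    return 0;
}
```

With this layout every row starts on a 32-byte boundary, so an aligned load from &data[r * stride + c] is valid whenever c is a multiple of 8. Release the storage with free(m.data).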
gcc or clang with -O3 should do a not-terrible job of auto-vectorizing scalar C, especially if you compile with -ffast-math. (You might remove -ffast-math after you're done manually vectorizing, but use it while tuning with auto-vectorization).
See also: How does BLAS get such extreme performance?
Also see my comments on "Poor maths performance in C vs Python/numpy" for another bad-memory-layout problem.
See also: How to optimize matrix multiplication (matmul) code to run fast on a single processor core.
You might manually vectorize before or after looking at cache-blocking, but when you do, see "Matrix Multiplication with blocks".

Filling in random positions in a huge 2D array

Is there a neat algorithm that I can use to fill in random positions in a huge 2D n x n array with m integers without filling in an occupied position? Where , and
Kind of like this pseudo code:
int n;
int m;
void init(int new_n, int new_m) {
    n = new_n;
    m = new_m;
}
void create_grid() {
    int grid[n][n];
    int x, y;
    for(x = 0; x < n; x ++) {
        for(y = 0; y < n; y ++) {
            grid[x][y] = 0;
        }
    }
}
void populate_grid(int grid[n][n]) {
    int i = 1;
    int x, y;
    while(i <= m) {
        x = get_pos();
        y = get_pos();
        if(grid[x][y] == 0) {
            grid[x][y] = i;
            i ++;
        }
    }
}
int get_pos() {
    return random() % n; /* valid indices are 0 .. n-1 */
}
... but more efficient for bigger n's and m's. Especially if m is bigger and more positions are occupied, it would take longer to generate a random position that isn't occupied.
Unless the filling factor really gets large, you shouldn't worry about hitting occupied positions.
Assuming for instance that half of the cells are already filled, you have a 50% chance of first hitting a filled cell, 25% of hitting two filled ones in a row, 12.5% of hitting three... On average, it takes... two attempts to find an empty place! (More generally, if only a fraction 1/M of the cells is free, the average number of attempts rises to M.)
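The averages quoted above follow from the geometric distribution. As a short derivation, assume each attempt is an independent uniform draw and treat the free fraction p as fixed while searching; T is the number of attempts up to and including the first free cell:

```latex
\[
P(T = k) = (1-p)^{k-1}\,p, \qquad
E[T] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p = \frac{1}{p}.
\]
```

With p = 1/2 this gives E[T] = 2, and with p = 1/M it gives E[T] = M.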
If you absolutely want to avoid having to test the cells, you can work by initializing an array with the indexes of the free cells. Then instead of choosing a random cell, you choose a random entry in the array, between 1 and L (the length of the list, initially N²).
After having chosen an entry, you set the corresponding cell, you move the last element in the list to the random position, and set L= L-1. This way, the list of free positions is kept up-to-date.
Note that this process is probably less efficient than blind attempts.
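The free-list bookkeeping described above could be sketched as follows. This is an illustration under assumptions, not a tuned implementation: the function name is made up, the grid is passed as a flat n*n array, indices are 0-based instead of 1..L, and rand() is used for brevity even though its range can be too small for a truly huge grid.

```c
#include <assert.h>
#include <stdlib.h>

/* Place the values 1..m into distinct random cells of a flat n*n grid.
   Returns 0 on success, -1 if the free list cannot be allocated. */
int populate_free_list(int *grid, int n, int m)
{
    assert(grid != NULL && n > 0 && m >= 0);
    assert((long long)m <= (long long)n * n);
    size_t len = (size_t)n * (size_t)n;        /* L: number of free cells */
    size_t *free_cells = malloc(len * sizeof *free_cells);
    if (free_cells == NULL)
        return -1;
    for (size_t i = 0; i < len; i++) {
        free_cells[i] = i;                     /* every cell starts out free */
        grid[i] = 0;
    }
    for (int v = 1; v <= m; v++) {
        /* Assumption: modulo bias and the limited range of rand() are ignored. */
        size_t pick = (size_t)rand() % len;
        grid[free_cells[pick]] = v;
        /* Move the last free entry into the hole and shrink the list. */
        free_cells[pick] = free_cells[--len];
    }
    free(free_cells);
    return 0;
}
```

Each placement is O(1), but the list costs as much memory as the grid itself and has to be initialized up front, which is why blind attempts are usually cheaper unless the grid ends up nearly full.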
To generate pseudo-random positions without repeats, you can do something like this:
for (int y=0; y<n; ++y) {
    for(int x=0; x<n; ++x) {
        int u=x,v=y;
        u = (u+hash(v))%n;
        v = (v+hash(u))%n;
        u = (u+hash(v))%n;
        // (u,v) is now the scrambled position for (x,y)
    }
}
For this to work properly, hash(x) needs to be a good pseudo-random hash function that produces positive numbers that won't overflow when you add to a number between 0 and n.
This is a version of the Feistel structure, which is commonly used to make cryptographic ciphers like DES.
The trick is that each step like u = (u+hash(v))%n; is invertible -- you can get your original u back by doing u = (u-hash(v))%n (I mean you could if the % operator worked with negative numbers the way everyone wishes it did)
Since you can invert the operations to get the original x,y back from each u,v output, each distinct x,y MUST produce a distinct u,v.
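A small self-contained sketch of the three rounds, using unsigned arithmetic so that the overflow and negative-modulo concerns do not arise. The mix function is an arbitrary integer mixer chosen for illustration (an assumption, not a recommendation), and scramble is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary integer mixer standing in for hash(); any deterministic
   function works for the bijection, quality only affects how random
   the result looks. */
uint32_t mix(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* Map (x, y) in [0, n) x [0, n) to a distinct (u, v) in the same range. */
void scramble(uint32_t x, uint32_t y, uint32_t n,
              uint32_t *out_u, uint32_t *out_v)
{
    assert(n > 0 && n <= UINT32_MAX / 2);   /* so u + (hash % n) cannot wrap */
    assert(x < n && y < n);
    uint32_t u = x, v = y;
    u = (u + mix(v) % n) % n;   /* each round changes one coordinate  */
    v = (v + mix(u) % n) % n;   /* using only the other one, so it    */
    u = (u + mix(v) % n) % n;   /* can be undone by subtracting again */
    *out_u = u;
    *out_v = v;
}
```

Walking x and y over the whole grid and writing to (u, v) visits every cell exactly once; to place only m integers, stop after the first m positions.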

Accessing portions of a dynamic array in C?

I know, another dynamic array question, this one is a bit different though so maybe it'll be worth answering. I am making a terrain generator in C with SDL, I am drawing 9 chunks surrounding the screen, proportional to the screen size, that way terrains can be generated easier in the future.
This means that I have to be able to resize the array at any given point, so I made a dynamic array (at least according to an answer I found on stack it is) and everything SEEMS to work fine, nothing is crashing, it even draws a single tile....but just one. I am looking at it and yeah, sure enough it's iterating through the array but only writing to one portion of memory. I am using a struct called Tile that just holds the x, y, w, and h of a rectangle.
This is the code I am using to allocate the array
Tile* TileMap = (Tile*)malloc(0 * sizeof(Tile*));
int arrayLen = sizeof(TileMap);
TileMap = (Tile*)realloc(TileMap, (totalTiles) * sizeof(Tile));
arrayLen = sizeof(totalTiles * sizeof(Tile));
The totalTiles are just the number of tiles that I have calculated previously are on the screen, I've checked the math and it's correct, and it even allocates the proper amount of memory. Here is the code I use to initialize the array:
//Clear all elements to zero.
for (int i = 0; i < arrayLen; i++) {
    Tile tile = {};
    TileMap[i] = tile;
}
So what's weird to me is it is considering the size of a tile (16 bytes) * the totalTiles (78,000) is equaling 4....When I drill down into the array, it only has one single rect in it that gets cleared as well, so then when I go calculate the sizes of each tile:
//Figure out Y and heights
for (int i = startY; i <= (startY*(-1)) * 2; i += TILE_HEIGHT) {
    TileMap[i].y = i * TILE_HEIGHT;
    TileMap[i].h = TILE_HEIGHT;
    //Figure out X and widths
    for (int j = startX; j <= (startX*(-1)) * 2; j += TILE_WIDTH) {
        TileMap[i].x = i * TILE_WIDTH;
        TileMap[i].w = TILE_WIDTH;
    }
}
*Side note, the startX is the negative offset I am using to draw chunks behind the camera, so I times it by -1 to make it positive and then time it by two to get one chunk in front of the camera
Alright, so obviously that only initializes one, and here is the render code
for (int i = 0; i < totalTiles; i++) {
    SDL_Rect currentTile;
    currentTile.x = TileMap[i].x;
    currentTile.y = TileMap[i].y;
    currentTile.w = TileMap[i].w;
    currentTile.h = TileMap[i].h;
    SDL_RenderDrawRect(renderer, &currentTile);
}
So what am I doing wrong here? I mean I literally am just baffled right now...And before Vectors get recommended in place of dynamic arrays, I don't really like using them and I want to learn to deal with stuff like this, not just implement some simple fix.
Lots of confusion (which is commonplace with C pointers).
The following code doesn't give the expected result: arrayLen = sizeof(totalTiles * sizeof(Tile));
totalTiles * sizeof(Tile) is not a type but an expression, so sizeof gives the size of that expression's result type (size_t, which is 4 bytes on a 32-bit build), not the size of the allocation. (Edit: see molbnilo's comment below.)
Anyway, proper answer should be :
arrayLen = totalTiles;
Because that's what you need in your next loop :
//Clear all elements to zero.
for (int i = 0; i < arrayLen; i++) {
    Tile tile = {};
    TileMap[i] = tile;
}
You don't need the size of the table, you need its number of elements.
There are other confusions in your sample, they don't directly impact the rest of the code, but better correct them :
Tile* TileMap = (Tile*)malloc(0 * sizeof(Tile*)); : avoid allocating a size of 0.
int arrayLen = sizeof(TileMap); : no, it's not the array length, just the size of the pointer (hence 4 bytes in 32-bit binaries). Remember TileMap is not defined as a table, but as a pointer allocated with malloc() and then realloc().
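Putting these points together, a minimal sketch that keeps the element count next to the pointer, since sizeof cannot recover it. The Tile field layout, the TileArray type and both function names are assumptions for illustration. The layout function also indexes tiles by their position in the array rather than by pixel coordinate, which is a separate issue in the question's loops that is only sketched here.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y, w, h; } Tile;   /* assumed field layout */

/* Hypothetical wrapper: the count travels with the pointer. */
typedef struct {
    Tile  *tiles;
    size_t count;   /* number of elements, not bytes */
} TileArray;

/* Allocate count zero-initialized tiles. Returns 0 on success, -1 on failure. */
int tile_array_init(TileArray *a, size_t count)
{
    assert(a != NULL && count > 0);
    a->tiles = calloc(count, sizeof *a->tiles);
    a->count = (a->tiles != NULL) ? count : 0;
    return (a->tiles != NULL) ? 0 : -1;
}

/* Lay the tiles out as a grid with cols tiles per row, starting at
   (start_x, start_y). The array index is row * cols + col. */
void tile_array_layout(TileArray *a, size_t cols,
                       int start_x, int start_y, int tile_w, int tile_h)
{
    assert(a != NULL && cols > 0);
    for (size_t i = 0; i < a->count; i++) {
        size_t row = i / cols, col = i % cols;
        a->tiles[i].x = start_x + (int)col * tile_w;
        a->tiles[i].y = start_y + (int)row * tile_h;
        a->tiles[i].w = tile_w;
        a->tiles[i].h = tile_h;
    }
}
```

The render loop can then run from 0 to a.count, and free(a.tiles) releases the storage.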

Linear algorithm to fill a contour

I am trying to make an algorithm that will fill a contour in linear complexity. I know such an algorithm exists. I've read somewhere that it has to do with the number of crossings, but there is a special case that I haven't had great luck in solving yet.
So far I have tried using the following algorithm. Note that I can't access previous elements (to the left) because they will/may be overwritten:
for (int y = blob->miny; y < blob->maxy; ++y)
int NumberOfBorderCrossings = 0;
unsigned int NextElem = 0;
unsigned int NextNextElem = 0;
for (int x = blob->minx-1; x < blob->maxx-1; ++x)
NextElem = CV_IMAGE_ELEM(labelimg,unsigned int,y,x+1);
NextNextElem = CV_IMAGE_ELEM(labelimg,unsigned int,y,x+2);
if (CV_IMAGE_ELEM(labelimg,unsigned int,y,x) != label)
if (NextElem == label && NextNextElem != label)
if (NumberOfBorderCrossings%2)
CV_IMAGE_ELEM(labelimg,unsigned int,y,x) = label;
The result I get is the following. The input is to the right (all non-black pixels must be copied), and the erroneous output is to the left. Note again that I only have the contour of the image to the right (not rendered).
It appears that you're looking for a general Polygon filling algorithm. Your line crossing counting algorithm will break where it hits single points and horizontal and vertical lines. Have a look at Quickfill for a possible alternative.
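For a raster contour like the one in the question, one approach that avoids crossing counts altogether is to flood-fill the outside from the border of the bounding box and then treat everything that was not reached as inside. This is not Quickfill and not the asker's algorithm, just a sketch of a different technique under assumptions: the image is a plain byte array where nonzero means contour, the contour is closed, and fill_contour is a made-up name. Each pixel is pushed at most once, so the work is linear in the number of pixels of the bounding box.

```c
#include <assert.h>
#include <stdlib.h>

/* img holds w*h bytes, nonzero = contour pixel. On return every pixel
   that cannot be reached from the border without crossing the contour
   is set to 1. Returns 0 on success, -1 on allocation failure. */
int fill_contour(unsigned char *img, int w, int h)
{
    assert(img != NULL && w > 0 && h > 0);
    size_t count = (size_t)w * (size_t)h;
    unsigned char *outside = calloc(count, 1);
    size_t *stack = malloc(count * sizeof *stack);
    if (outside == NULL || stack == NULL) {
        free(outside);
        free(stack);
        return -1;
    }
    size_t top = 0;
    /* Seed the flood with every non-contour pixel on the border. */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (x != 0 && y != 0 && x != w - 1 && y != h - 1)
                continue;
            size_t i = (size_t)y * w + x;
            if (!img[i] && !outside[i]) {
                outside[i] = 1;
                stack[top++] = i;
            }
        }
    }
    /* 4-connected flood: it cannot slip through a diagonal contour step. */
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (top > 0) {
        size_t i = stack[--top];
        int x = (int)(i % (size_t)w), y = (int)(i / (size_t)w);
        for (int d = 0; d < 4; d++) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx < 0 || ny < 0 || nx >= w || ny >= h)
                continue;
            size_t j = (size_t)ny * w + nx;
            if (!img[j] && !outside[j]) {
                outside[j] = 1;
                stack[top++] = j;
            }
        }
    }
    for (size_t i = 0; i < count; i++)
        if (!outside[i])
            img[i] = 1;   /* contour or interior */
    free(outside);
    free(stack);
    return 0;
}
```

It needs a scratch mask and a stack the size of the bounding box, so it trades memory for not having to special-case single points and horizontal or vertical runs.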
