Resizing single 1 pixel wide bitmap strip - faster than this example? (for Raycaster algorithm)

I am attaching the picture example and my current code.
My question is: can I make resizing/stretching/interpolating a single vertical bitmap strip faster
than by using another for-loop?
The current code already looks close to optimal:
for the current strip size on the screen, iterate from start height to end height. Get the corresponding
pixel from the texture and add it to the output buffer. Add the step to get the next pixel.
Here is the essential part of my code:
inline void RC_Raycast_Walls()
{
    // casting ray for every width pixel
    for (u_int16 rx = 0; rx < RC_render_width_i; ++rx)
    {
        // ..
        // traversing thru map of grid
        // finding intersecting point
        // calculating height of strip in screen
        // ..
        // step size for next pixel in texture
        float32 tex_step_y = RC_texture_size_f / (float32)pp_wall_height;
        // starting texture coordinate
        float32 tex_y = (float32)(pp_wall_start - RC_player_pitch - player_z_div_wall_distance - RC_render_height_d2_i + pp_wall_height_d2) * tex_step_y;
        // drawing walls into buffer <- ENTERING ANOTHER LOOP only for SINGLE STRIP
        for (int16 ry = pp_wall_start; ry < pp_wall_end; ++ry)
        {
            // cast the texture coordinate to integer, and mask with (texHeight - 1) in case of overflow
            u_int16 tex_y_safe = (u_int16)tex_y & RC_texture_size_m1_i;
            tex_y += tex_step_y;
            u_int32 texture_current_pixel = texture_pixels[RC_texture_size_i * tex_y_safe + tex_x];
            u_int32 output_pixel_index = rx + ry * RC_render_width_i;
            output_buffer[output_pixel_index] =
                (((texture_current_pixel >> 16 & 0x0ff) * intensity_value) >> 8) << 16 |
                (((texture_current_pixel >> 8 & 0x0ff) * intensity_value) >> 8) << 8 |
                (((texture_current_pixel & 0x0ff) * intensity_value) >> 8);
        }
    }
}
Maybe some bigger stepping, like 2 instead of 1? Then every second line is empty,
and adding another line of code to fill that empty space results in the same performance.
I would not like to have doubled pixels, and interpolating between two of them would, I think, take even
longer.
Thank You in Advance!
PS:
It's based on the Lodev raycaster algorithm:
https://lodev.org/cgtutor/raycasting.html

You do not need floats at all
You can use DDA on integers without multiplication and division. These days floating point is not as slow as it used to be, but your conversion between float and int might be ... See these QAs (both use this kind of DDA):
DDA line with subpixel
DDA based rendering routines
use LUT for applying Intensity
It looks like each color channel c is 8 bit and intensity i is fixed point in the range <0,1>, so you can precompute every combination into something like this:
u_int8 LUT[256][256];
for (int c=0;c<256;c++)
    for (int i=0;i<256;i++)
        LUT[c][i]=(c*i)>>8;
use pointers or union to access RGB channels instead of bit operations
My favorite is union:
union color
{
    u_int32 dd;    // 1x 32bit RGBA
    u_int16 dw[2]; // 2x 16bit
    u_int8 db[4];  // 4x 8bit (individual channels)
};
texture coordinates
Again, it looks like you are doing too many operations. For example, in [RC_texture_size_i * tex_y_safe + tex_x], if your texture size is 128 you can bit-shift left by 7 bits instead of multiplying. Yes, on modern CPUs this is not an issue; however, the whole thing can be replaced by a simple LUT. You can store a pointer to each horizontal scanline of the texture and rewrite the access as [tex_y_safe][tex_x].
So based on #2,#3 rewrite your color computation to this:
color c;
c.dd=texture_current_pixel;
c.db[0]=LUT[c.db[0]][intensity_value];
c.db[1]=LUT[c.db[1]][intensity_value];
c.db[2]=LUT[c.db[2]][intensity_value];
output_buffer[output_pixel_index]=c.dd;
As you can see, it's just a bunch of memory transfers instead of multiple bit-shift, bit-mask and bit-or operations. You can also use a pointer to color instead of texture_current_pixel and output_buffer[output_pixel_index] to speed things up a little more.
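The LUT and union ideas above can be combined into one small, self-contained sketch. It assumes a little-endian machine and 0x00RRGGBB pixels (so db[0], db[1], db[2] are blue, green, red), and it uses the stdint types in place of the question's u_int8/u_int32. The names build_lut, shade_lut and shade_shift are made up for this illustration; shade_shift is the shift-and-mask formula from the question, kept only as a reference to compare against.

```c
#include <stdint.h>

/* lut[c][i] = (c * i) >> 8 for an 8-bit channel c and an 8-bit intensity i. */
uint8_t lut[256][256];

/* One 32-bit pixel, viewed whole or as four bytes.
   Assumption: little-endian 0x00RRGGBB, so db[0]=B, db[1]=G, db[2]=R. */
union color {
    uint32_t dd;
    uint8_t  db[4];
};

/* Fill the table once at startup (64 KB of RAM). */
void build_lut(void)
{
    for (int c = 0; c < 256; c++)
        for (int i = 0; i < 256; i++)
            lut[c][i] = (uint8_t)((c * i) >> 8);
}

/* Table-based shading: three table reads, no shifts or masks. */
uint32_t shade_lut(uint32_t pixel, uint8_t intensity)
{
    union color c;
    c.dd = pixel;
    c.db[0] = lut[c.db[0]][intensity];
    c.db[1] = lut[c.db[1]][intensity];
    c.db[2] = lut[c.db[2]][intensity];
    return c.dd;
}

/* Reference: the shift-and-mask formula from the question. */
uint32_t shade_shift(uint32_t p, uint32_t intensity)
{
    return ((((p >> 16) & 0xff) * intensity) >> 8) << 16 |
           ((((p >>  8) & 0xff) * intensity) >> 8) <<  8 |
            (((p        & 0xff) * intensity) >> 8);
}
```

Call build_lut() once before rendering. Whether the table version is actually faster depends on cache behaviour, so it is worth timing both on the target machine.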
And finally see this:
Ray Casting with different height size
Which is my version of the raycast using VCL.
Now, before changing anything, measure the performance you have now by timing how long a frame takes to render. Then, after each change in the code, measure whether it actually improves performance. If it doesn't, go back to the old version of the code, as predicting what is fast on today's platforms is sometimes hard.
Also for resize much better visual results are obtained by using mipmaps ... that usually eliminates the weird noise while moving
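To make the "DDA on integers" suggestion concrete, here is a minimal sketch of stretching one texture column onto one screen strip with an error accumulator, so the inner loop has no float-to-int conversion, multiplication or division. The function name draw_strip_dda and its parameters are hypothetical, not taken from the linked answers, and it only handles a strip that starts at the first texel; a strip clipped at the top of the screen would need err and src pre-advanced.

```c
#include <stdint.h>

/* Stretch or shrink src_h texels onto dst_h output pixels.
   dst_pitch / src_pitch: elements between vertically adjacent pixels
   (the screen width for the output buffer, the texture width for the texture). */
void draw_strip_dda(uint32_t *dst, int dst_pitch, int dst_h,
                    const uint32_t *src, int src_pitch, int src_h)
{
    int err = 0;                   /* accumulated texture advance, in units of 1/dst_h texel */

    for (int y = 0; y < dst_h; ++y) {
        *dst = *src;               /* nearest-texel copy */
        dst += dst_pitch;
        err += src_h;
        while (err >= dst_h) {     /* step the source by whole texels */
            err -= dst_h;
            src += src_pitch;
        }
    }
}
```

When the wall is taller than the texture the while loop runs at most once per pixel; when it is shorter, it skips texels. Intensity shading can be applied to *src before the store.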

Related

Generating Logarithmically Spaced Values on an Operation Limited Microcontroller

I've recently come across a problem where, using a cheap 16 bit uC (MSP430 series), I've had to generate a logarithmically spaced output value based on the 10 bit ADC read. The reason for this is that I require fine grain control at the low end of the integer space, while, at the same time, requiring the use of the larger values, though at less precision, (to me, the difference between 2^15 and 2^16 in my feedback loop is of little consequence). I've never done this before and I had no luck finding examples online, so I came up with a little scheme to do this on my operation-limited uC.
With my method here, the ADC result is linearly interpolated between the two closest integer powers-of-two via only integer multiplication/addition/summation and bitwise shifting, (outlined below).
My question is, is there a better, (faster/less operations), way than this to generate a smooth, (or smooth-ish), set of data logarithmically spaced over the integer resolution? I haven't found anything online, hence my attempt at coming up with something from scratch in the first place.
N is the logarithmic resolution of the micro controller, (here assumed to be 16 bit). M is the integer resolution of the ADC, (here assumed to be 10 bit). ADC_READ is the value read by the ADC at a given time. On a uC that supports floating point operations, doing this is trivial:
x = N #16
y = (float) ADC_READ / M #ADC_READ/1024
result = 2 ^ ( x * y ) #2^(ADC_READ/64)
In all of the plots below, this is the "Ideal" set of values. The "Resultant" values are generated by variations of the following:
unsigned int returnValue( unsigned int adcRead ){
    unsigned int e;
    unsigned int a;
    unsigned int rise;
    unsigned int base;
    unsigned int xoffset;
    unsigned int yoffset;
    unsigned int result;
    e = adcRead >> 6;        // which power of two we are in (0..15)
    a = 1u << e;             // value at the start of that interval
    rise = a;                // ( 1 << (e + 1) ) - ( 1 << e ), written so that e = 15 never shifts by 16
    base = e << 6;
    xoffset = adcRead - base;
    yoffset = ( rise >> rise_shift ) * (xoffset >> offset_shift); //this is an operation to prevent rolling over. rise_shift + offset_shift = log2(M/N), here = 6
    result = a + yoffset;
    return result;
}
The extra declarations and what not are for readability only. Assume the final product is condensed. Basically, it does as intended, with varying degrees of discretization at the low end and smoothness at the high end based on the values of rise_shift and offset_shift. Here, they are both equal to 3:
Here rise_shift = 2, offset_shift = 4
Here rise_shift = 4, offset_shift = 2
I'm interested to see if anyone has come up with or knows of anything better. Currently, I only have to run this code ~20-30 times a second, so I obviously have not encountered any delays. But, with a 16MHz clock, and using information from here, I estimate this entire operation taking at most ~110 clock cycles, or ~7us. This is on the scale of the ADC read time, which is ~4us.
Thanks
EDIT: By "better" I do not necessarily just mean faster, (it's already quite fast, apparently). Immediately, one sees that the low end has fairly drastic discretization to the integer powers of two, which results from the shifting operations to prevent roll-over. Other than a look-up table, (suggested below), the answer to how this could be improved is not immediate.
based on the 10 bit ADC read.
This ADC can output only 1024 different values (0-1023), so you can use a table of 1024 16-Bit values, which would consume 2KB Flash memory:
const uint16_t LogarithmicTable[1024] = { 0, 1, ... , 64380};
Calculating the logarithmic output is now a simple array access:
result = LogarithmicTable[ADC_READ];
You can use a tool like Excel to generate the constants in this Table for you.
It sounds like you want to compute the function 2^(n/64), which would map 1024 to 65536, just above the high end, but maps anything up to 64 to zero (or one, depending on rounding). Other exponential functions could avoid the low-end discretization, but it's not clear whether that would help the functionality.
We can factor 2^(n/64) into 2^floor(n/64) × 2^((n mod 64)/64). Usually multiplying by an integer power of 2 involves a left shift, but because the other side is a fraction between one and two, we're better off doing a right shift.
uint16_t exp_table[ 64 ] = {
    32768u,
    pow( 2, 1./64 ) * 32768u,
    pow( 2, 2./64 ) * 32768u,
    ...
};
uint16_t adc_exp( uint16_t linear ) {
    return exp_table[ linear % 64 ] >> ( 15 - linear / 64 );
}
This loses no precision against a full, 2-kilobyte table. To save more space, use linear interpolation.
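As a rough illustration of the linear-interpolation variant, the sketch below shrinks the table to 9 entries (8 segments per octave) and interpolates between neighbours. The table size, the rounded constants and the name adc_exp_interp are assumptions made for this example; the last entry is 65536, so the table is declared 32-bit here even though every interpolated result still fits in 16 bits.

```c
#include <stdint.h>

/* round(2^(k/8) * 32768) for k = 0..8. Assumed 9-entry table for this sketch. */
const uint32_t exp_coarse[9] = {
    32768u, 35734u, 38968u, 42495u, 46341u, 50535u, 55109u, 60097u, 65536u
};

/* linear: 10-bit ADC reading, 0..1023. Returns roughly 2^(linear/64). */
uint16_t adc_exp_interp(uint16_t linear)
{
    uint16_t octave = linear >> 6;    /* floor(linear / 64), 0..15      */
    uint16_t frac   = linear & 63;    /* linear mod 64                  */
    uint16_t idx    = frac >> 3;      /* table segment, 0..7            */
    uint16_t sub    = frac & 7;       /* position inside segment, 0..7  */
    uint32_t lo     = exp_coarse[idx];
    uint32_t hi     = exp_coarse[idx + 1];
    uint32_t mant   = lo + (((hi - lo) * sub) >> 3);  /* 32768..65535  */

    return (uint16_t)(mant >> (15 - octave));
}
```

This trades 36 bytes of table for a small interpolation error: the top reading gives 64856 where the exact value of 2^(1023/64) is about 64830.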

math function as binary operations

I have an MP3 board attached to a ATmega microcontroller which is additionally connected to a potentiometer. The MP3 board plays MP3 data directly through its onboard speaker and therefore I am also able to set the volume of the output.
So, as you might guess, I read the value from the poti and forward it to the microcontroller. Unfortunately, the microcontroller does not increase the volume in a linear way. So, from values 0 to 128 you nearly hear nothing, and from 128 to 255 (max) the volume increases rapidly.
I found out, that the following function could solve this problem:
vol = 1 - (1 - x)^4
but x must be between 0 and 1 and the result is also between 0 and 1.
Since I am on a microcontroller, I would like to
transform this formula, so that I can use it with unsigned integers
optimize it (maybe use some cheap binary functions), because I read the poti value multiple times per second. So this function has to be calculated multiple times per second and I want to use the microcontroller for other stuff too ;-)
Maybe some of you have an idea? Would be great!
uint8_t linearize_volume(uint8_t value) {
    // ideas?
    // please don't use bigger data types than uint16_t
}
You can "pay" with memory for CPU cycles. If you have 256 bytes of ROM available to you, the cheapest way of computing such function would be building a lookup table.
Make a program that prints a list of 256 8-bit numbers with the values of your non-linear function. It does not matter how fast the program is, because you are going to run it only once. Copy the numbers the program prints into your C program as an array initializer, and perform the lookup instead of calculating the function.
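A possible generator for that table, meant to run on a PC rather than on the microcontroller, might look like the sketch below. The names volume_entry, print_volume_table and volume_lut are invented for the example; the formula is the question's vol = 1 - (1 - x)^4 scaled to 0..255 and rounded to nearest.

```c
#include <stdint.h>
#include <stdio.h>

/* One table entry: round(255 * (1 - (1 - v/255)^4)) for v in 0..255. */
uint8_t volume_entry(int v)
{
    double t = 1.0 - v / 255.0;
    double y = 255.0 * (1.0 - t * t * t * t);
    return (uint8_t)(y + 0.5);
}

/* Print a C array initializer that can be pasted into the firmware source. */
void print_volume_table(FILE *out)
{
    fprintf(out, "static const uint8_t volume_lut[256] = {\n");
    for (int v = 0; v < 256; v++)
        fprintf(out, "%3d,%s", volume_entry(v), (v % 16 == 15) ? "\n" : " ");
    fprintf(out, "};\n");
}
```

Calling print_volume_table(stdout) from a small host program produces the initializer; in the firmware, linearize_volume then reduces to a single array read such as volume_lut[value].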
You can get a decent estimate by treating the values as 8.8 fixed-point and raising to the power of four by squaring twice.
uint8_t linearize_volume(uint8_t value) {
    // Approximate 255 * (1 - (1 - x/255)^4)
    uint16_t x = 0xff - value;
    x = (x * x) >> 8;
    x = (x * x) >> 8;
    return 0xff - x;
}
First, be sure you're using a linear pot, not an audio-taper pot.
This is typical of audio outputs. The data is a sine wave, and therefore negative values are necessary. You can certainly convert negatives to positives for the sole purpose of accessing their power level, but you can't alter the sample without hearing a completely different sound.
Depending upon the output device, lower values may not pack enough power to energize your speaker much at all.
The "MP3 board" should include an ability to control the volume without your having to alter samples.
You state you read the pot and forward it to the micro. Aren't you reading the pot with the micro's ADC?

Algorithms for downscaling bitmapped fonts

This is a follow-up to this question.
I am working on a low level C app where I have to draw text. I have decided to store the font I want to use as an array (black and white, each char 128x256, perhaps), then I'd downscale it to the sizes I need with some algorithm (as grayscale, so I can have some crude font smoothing).
Note: this is a toy project, please disregard stuff like doing calculations at runtime or not.
Question is, which algorithm?
I looked up 2xSaI, but it's rather complicated. I'd like something I can read the description for and work out the code myself (I am a beginner and have been coding in C/C++ for just under a year).
Suggestions, anyone?
Thanks for your time!
Edit: Please note, the input is B&W, the output should be smoothed grayscale
Figure out the rectangle in the source image that will correspond to a destination pixel. For example, if your source image is 44x88 and your destination is 20x40, the scale factor is 2.2 in each direction, so the upper left pixel in the destination corresponds to the rectangle from (0,0) to (2.2,2.2) in the source image. Now, do an area-average over those pixels:
Area is 2.2 * 2.2 = 4.84. You'll scale the result by 1/4.84.
Pixels at (0,0), (0,1), (1,0), and (1,1) each weigh in at 1 unit.
Pixels at (0,2), (1,2), (2,0), and (2,1) each weigh in at 0.2 unit (because the rectangle only covers 20% of them).
The pixel at (2,2) weighs in at 0.04 (because the rectangle only covers 4% of it).
The total weight is of course 4*1 + 4*0.2 + 0.04 = 4.84.
This one was easy because you started with source and destination pixels lined up evenly at the edge of the image. In general, you'll have partial coverage at all 4 sides/4 corners of the sliding rectangle.
Don't bother with algorithms other than area-averaging for downscaling. Most of them are plain wrong (they result in horrible aliasing, at least with a factor smaller than 1/2) and the ones that aren't plain wrong are a good bit more painful to implement and probably won't give you better results.
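The area-averaging described above, including the partial coverage at the rectangle edges, could be sketched roughly as follows. It assumes the source is a row-major array with one byte per pixel holding 0 or 1; the function name area_average and its parameter layout are hypothetical.

```c
#include <stdint.h>

/* Grey value (0..255) of destination pixel (dx, dy) when a sw x sh
   black-and-white source is downscaled to dw x dh by area averaging. */
uint8_t area_average(const uint8_t *src, int sw, int sh,
                     int dw, int dh, int dx, int dy)
{
    /* Source rectangle covered by this destination pixel. */
    double x0 = (double)dx * sw / dw, x1 = (double)(dx + 1) * sw / dw;
    double y0 = (double)dy * sh / dh, y1 = (double)(dy + 1) * sh / dh;
    double sum = 0.0;

    for (int y = (int)y0; y < sh && y < y1; y++) {
        /* Vertical extent of source row y that lies inside the rectangle. */
        double top = y > y0 ? y : y0;
        double bot = (y + 1) < y1 ? (y + 1) : y1;
        for (int x = (int)x0; x < sw && x < x1; x++) {
            /* Horizontal extent of source column x inside the rectangle. */
            double left  = x > x0 ? x : x0;
            double right = (x + 1) < x1 ? (x + 1) : x1;
            sum += src[y * sw + x] * (right - left) * (bot - top);
        }
    }
    /* Normalise by the rectangle area and round to nearest. */
    return (uint8_t)(255.0 * sum / ((x1 - x0) * (y1 - y0)) + 0.5);
}
```

With a scale factor of 2.2, a source in which only pixel (2,2) is set contributes 0.04 / 4.84 of full brightness to the top-left destination pixel, which rounds to a grey value of 2.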
Consider that your image is an N*M B&W bitmap. For simplicity we'll treat it as char Letter[N][M], where the allowable values are 0 and 1. Now suppose you want to downscale it to unsigned char letter[n][m]. Each greyscale pixel of letter is then computed from the number of white pixels in the corresponding rectangle of the big bitmap:
char Letter[N][M];
unsigned char letter[n][m];
int rect_sz_X = N / n; // the size of rectangle that will map to a single pixel
int rect_sz_Y = M / m; // in the downscaled image
int i, j, x, y;
for (i = 0; i < n; i++) for (j = 0; j < m; j++){
    int sum = 0;
    for (x = 0; x < rect_sz_X; x++) for (y = 0; y < rect_sz_Y; y++)
        sum += Letter[i*rect_sz_X + x][j*rect_sz_Y + y];
    letter[i][j] = ( sum * 255) / (rect_sz_X * rect_sz_Y); // write pixel (i, j); index [n][m] would be out of bounds
}
Note that when the sizes aren't evenly divisible, the rectangles that map to pixels may need to overlap (the integer division above simply drops the remainder). The larger your original bitmap, the better.
Scaling a bitmapped font is the same problem as scaling any other bitmap. The general class of algorithm that you're after is interpolation. There's quite a few ways to do this - in general, the more visually accurate the result, the more complicated the algorithm. You could start by looking at (in increasing order of complexity):
Nearest-neighbour
Bilinear interpolation
Bicubic interpolation
It's pretty simple. If all you've got is a bitmapped font instead of an outline font then you have very limited choices in picking an anti-aliasing pixel color. For example, if the bitmapped font point size is exactly four times as large as the desired display point size then you can only ever get 17 distinct levels: the number of 'lit' pixels in the 4x4 mapping rectangle, 0 through 16.
Having to deal with fractional mapping is a programming exercise but not one that improves the quality.
If it is acceptable to constrain the downscaling to powers of 2 (50%, 25%, 12.5%, etc.), then a very simple and fairly good algorithm is to create each downscaled pixel as the majority vote of all the source pixels. For example, at 50%, a square of four pixels forms one downscaled pixel: if zero or one of them is on, then the output is off; if three or four are on, then the output is on. For the artistic case (two pixels on), either always choose on or off, or look at other surrounding pixels for tiebreaking.

How to map a long integer number to a N-dimensional vector of smaller integers (and fast inverse)?

Given a N-dimensional vector of small integers is there any simple way to map it with one-to-one correspondence to a large integer number?
Say, we have N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?
For the above example you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3, values will always fit within, say, 16 bits each, then you can pack them all into a 48 bit integer.
In either case the technique is very simple: you just use shift, mask and bitwise-or operations to pack/unpack the values.
Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.
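For the case from the question, three 16-bit signed values in one 48-bit number, the same shift-and-mask idea might look like the sketch below. It stores the result in a uint64_t because C has no 48-bit type; the names pack3x16 and unpack3x16 are made up here. Converting back through uint16_t to int16_t relies on two's-complement wrapping, which is implementation-defined before C23 but is what mainstream compilers do.

```c
#include <stdint.h>

/* Pack three signed 16-bit values into the low 48 bits of a uint64_t. */
uint64_t pack3x16(int16_t x1, int16_t x2, int16_t x3)
{
    /* Go through uint16_t first so negative values are not sign-extended. */
    return ((uint64_t)(uint16_t)x1 << 32) |
           ((uint64_t)(uint16_t)x2 << 16) |
            (uint64_t)(uint16_t)x3;
}

/* Inverse of pack3x16: out[0] = x1, out[1] = x2, out[2] = x3. */
void unpack3x16(uint64_t y, int16_t out[3])
{
    out[0] = (int16_t)(uint16_t)(y >> 32);
    out[1] = (int16_t)(uint16_t)(y >> 16);
    out[2] = (int16_t)(uint16_t)y;
}
```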
If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1 + C1*(e2 + C2*(e3 + C3*(... + C(n-1)*en))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k where k=n if n<=m and k=n+m if n>m--you can draw this as a picture of filling in a square like so:
1 2 5 | draw along the edge of the square this way
4 3 6 v
8 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)
I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.
To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a generalisation of this, for cases where the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise-or and the modulus with a bitwise-and).
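Written as loops, the mixed-radix packing and unpacking above could look like this sketch. Here radix[i] plays the role of MAX_E[i] + 1, and the function names pack_mixed and unpack_mixed are hypothetical. The caller is assumed to guarantee that the product of all radices fits in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack n values, each with 0 <= e[i] < radix[i], into one integer.
   e[0] ends up in the least significant position. */
uint64_t pack_mixed(const uint32_t *e, const uint32_t *radix, size_t n)
{
    uint64_t x = 0;
    for (size_t i = n; i-- > 0; )       /* from e[n-1] down to e[0] */
        x = x * radix[i] + e[i];
    return x;
}

/* Inverse of pack_mixed: peel the digits off again, starting with e[0]. */
void unpack_mixed(uint64_t x, uint32_t *e, const uint32_t *radix, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        e[i] = (uint32_t)(x % radix[i]);
        x /= radix[i];
    }
}
```

With every radix equal to 256 this reduces to byte packing: the values {3, 2, 1} pack to 0x010203.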
There are some totally non-portable ways to make this really fast using packed unions and direct memory access. That you really need this kind of speed is suspicious. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors like a GPU, for which vector support is optimized (parallel).
This naive storage does not possess any useful property that I can foresee, except that you can perform some computations (add, sub, logical bitwise operators) on the three coordinates at once, as long as you use positive integers only and you don't overflow for add and sub.
You'd better be quite sure you won't overflow (or won't go negative for sub) or the vector will become garbage.
#include <stdint.h> // for uint8_t
long x;
uint8_t * p = (uint8_t *)&x; // cast needed: &x has type long *
or
union X {
    long L;
    uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about endianness. In my experience compilers generate better code with the union because it doesn't set off their "you took the address of this, so I must keep it in RAM" rules as quickly. These rules will get set off if you try to index the array with stuff that the compiler can't optimize away.
If you do care about endianness then you need to mask and shift.
I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.

Getting RGB values for each pixel from a raw image in C

I want to read the RGB values for each pixel from a raw image. Can someone tell me how to achieve this? Thanks for help!
the format of my raw image is .CR2 which come from camera.
Assuming the image is w * h pixels, and stored in true "packed" RGB format with no alpha component, each pixel will require three bytes.
In memory, the first line of the image might be represented in awesome ASCII graphics like this:
R0 G0 B0 R1 G1 B1 R2 G2 B2 ... R(w-1) G(w-1) B(w-1)
Here, each Rn Gn and Bn represents a single byte, giving the red, green or blue component of pixel n of that scanline. Note that the order of the bytes might be different for different "raw" formats; there's no agreed-upon world standard. Different environments (graphics cards, cameras, ...) do it differently for whatever reason, you simply have to know the layout.
Reading out a pixel can then be done by this function:
typedef unsigned char byte;
void get_pixel(const byte *image, unsigned int w,
               unsigned int x,
               unsigned int y,
               byte *red, byte *green, byte *blue)
{
    /* Compute pointer to first (red) byte of the desired pixel. */
    const byte * pixel = image + w * y * 3 + 3 * x;
    /* Copy R, G and B to outputs. */
    *red = pixel[0];
    *green = pixel[1];
    *blue = pixel[2];
}
Notice how the height of the image is not needed for this to work, and how the function is free from bounds-checking. A production-quality function might be more armor-plated.
Update If you're worried this approach will be too slow, you can of course just loop over the pixels, instead:
unsigned int x, y;
const byte *pixel = /* ... assumed to be pointing at the data as per above */
for(y = 0; y < h; ++y)
{
    for(x = 0; x < w; ++x, pixel += 3)
    {
        const byte red = pixel[0], green = pixel[1], blue = pixel[2];
        /* Do something with the current pixel. */
    }
}
None of the methods posted so far are likely to work with a camera "raw" file. The file formats for raw files are proprietary to each manufacturer, and may contain exposure data, calibration constants, and white balance information, in addition to the pixel data, which will likely be in a packed format where each pixel can take up more than one byte, but less than two.
I'm sure there are open-source raw file converter programs out there that you could consult to find out the algorithms to use, but I don't know of any off the top of my head.
Just thought of an additional complication. The raw file does not store RGB values for each pixel. Each pixel records only one color. The other two colors have to be interpolated from neighboring pixels. You'll definitely be better off finding a program or library that works with your camera.
A RAW image is an uncompressed format, so you just have to point to where your pixel is (skip any header, then add the pixel size times (the number of columns times the row index, plus the column index)), and then read whatever binary data is there according to the layout of the data (with masks and shifts, you know).
That's the general procedure, for your current format you'll have to check the details.

Resources