I'm very much a beginner with NEON intrinsics. I am trying to optimize the algorithm below:
uint32_t blue = 0, red = 0, green = 0, alpha = 0, factor = 0, shift = 0;
// some initial calculation to compute factor, shift and the R G B init values; all are expected to be initialized as 16-bit unsigned values
// psrc is a 32 bpp flat pixel array and count is the total pixel count
for( int i = 0; i < count; i++ )
{
    blue  += *psrc++;
    green += *psrc++;
    red   += *psrc++;
    alpha += *psrc++;
    *pDest++ = static_cast< uint8_t >( ( blue  * factor ) >> shift );
    *pDest++ = static_cast< uint8_t >( ( green * factor ) >> shift );
    *pDest++ = static_cast< uint8_t >( ( red   * factor ) >> shift );
    *pDest++ = static_cast< uint8_t >( ( alpha * factor ) >> shift );
}
I am not sure how to do this, since I need the result in 32-bit containers while the source data is 8-bit (R G B A), and there is no single instruction that adds 8-bit values to 32-bit accumulators.
Can anyone help me out with this?
I was able to convert them to 32 bits as suggested by Paul's link and do the necessary arithmetic. Now I have:
uint32x4_t result1 = vshlq_u32(mult1281, shift);
uint32x4_t result2 = vshlq_u32(mult1282, shift);
uint32x4_t result3 = vshlq_u32(mult1283, shift);
uint32x4_t result4 = vshlq_u32(mult1284, shift);
result1/2/3/4 now contain the 32-bit (per channel) RGB values. How can I now combine result1/2/3/4 back into 8-bit (per channel) RGB channels and store them to the destination?
I still haven't fully understood the purpose of the algorithm, but of course you can optimize it using NEON:
uint32_t blue = 0, red = 0, green = 0, alpha = 0, factor = 0, shift = 0;
// your initializations.
uint32x4_t bgra = { blue, green, red, alpha };
for (int i = 0; i < count; i += 2)
{
    // load 8 8-bit values and unpack to 16-bit
    uint16x8_t src = vmovl_u8(vld1_u8(psrc + i * 4));
    // accumulate the low 4 values
    bgra = vaddw_u16(bgra, vget_low_u16(src));
    // compute the low 4 values of dst
    uint32x4_t lo = vshrq_n_u32(vmulq_u32(bgra, vdupq_n_u32(factor)), shift);
    // accumulate the high 4 values
    bgra = vaddw_u16(bgra, vget_high_u16(src));
    // compute the high 4 values of dst
    uint32x4_t hi = vshrq_n_u32(vmulq_u32(bgra, vdupq_n_u32(factor)), shift);
    // pack 8 32-bit values into 8 8-bit values
    uint8x8_t dst = vmovn_u16(vcombine_u16(vmovn_u32(lo), vmovn_u32(hi)));
    // store the result
    vst1_u8(pDest + i * 4, dst);
}
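For the intermediate form from the question (four separate uint32x4_t results), the same narrowing intrinsics can be chained directly. A minimal sketch, assuming the values already fit in 8 bits after the multiply/shift and that pDest points at the 16 output bytes:

// result1..result4 are the uint32x4_t values from the question's edit;
// each vmovn_* keeps the low half of every lane
uint16x4_t n1 = vmovn_u32(result1);
uint16x4_t n2 = vmovn_u32(result2);
uint16x4_t n3 = vmovn_u32(result3);
uint16x4_t n4 = vmovn_u32(result4);
uint8x8_t  d0 = vmovn_u16(vcombine_u16(n1, n2));  // 8 bytes: pixels 1 and 2
uint8x8_t  d1 = vmovn_u16(vcombine_u16(n3, n4));  // 8 bytes: pixels 3 and 4
vst1_u8(pDest,     d0);
vst1_u8(pDest + 8, d1);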
I am new to SIMD, writing a program that converts an image from ARGB to grayscale, and the main operation code is as follows:
void* ptr;
int* pBitmap;
posix_memalign(&ptr, 16, height * width * sizeof(int));
pBitmap = (int*)ptr;
for(row = 0; row < height; row++){
for(col = 0; col < width; col++){
int pixel = pBitmap[col + row * width];
int alpha = (pixel >> 24) & 0xff;
int red = (pixel >> 16) & 0xff;
int green = (pixel >> 8) & 0xff;
int blue = pixel & 0xff;
int bw = (int)(red * 0.299 + green * 0.587 + blue * 0.114);
pBitmap[col + row * width] = (alpha << 24) + (bw << 16) + (bw << 8) + (bw);
}
}
And this is my modified SIMD program, which is much slower than the original one.
__m128i bws;
__m128i* rec;
__m128d blues, greens, reds, alphas;
for(int i = 0; i < width * height; i += 2){
    rec = (__m128i*)(pBitmap + i);
    alphas = _mm_cvtepi32_pd(_mm_srli_epi32(*rec, 24));
    reds   = _mm_cvtepi32_pd(_mm_and_si128(_mm_srli_epi32(*rec, 16), _mm_set1_epi32(0xff)));
    greens = _mm_cvtepi32_pd(_mm_and_si128(_mm_srli_epi32(*rec, 8), _mm_set1_epi32(0xff)));
    blues  = _mm_cvtepi32_pd(_mm_and_si128(*rec, _mm_set1_epi32(0xff)));
    bws = _mm_add_epi32(_mm_cvtpd_epi32(_mm_mul_pd(reds, _mm_set_pd1(0.299))), _mm_cvtpd_epi32(_mm_mul_pd(greens, _mm_set_pd1(0.587))));
    bws = _mm_add_epi32(bws, _mm_cvtpd_epi32(_mm_mul_pd(blues, _mm_set_pd1(0.114))));
    *rec = _mm_add_epi32(_mm_add_epi32(_mm_slli_epi32(_mm_cvtpd_epi32(alphas), 24), _mm_slli_epi32(bws, 16)), _mm_add_epi32(_mm_slli_epi32(bws, 8), bws));
}
Is the reason for this result that there are more type conversions? I don't know where else I can optimize; please help me, thank you.
A few issues with your implementation.
SIMD works best when doing multiple pixels at a time in parallel. Do an Internet search "Arrays of Structures vs. Structures of Arrays" for some examples.
Why use doubles instead of single-precision? That's halving your throughput.
Most compilers do not have a way to automatically create data constants from SIMD vectors. All those calls to _mm_set_* intrinsics are doing a lot of work at runtime that you should really do at compile time.
Replace all the uses of the _mm_set_* intrinsics with something like:
union simdConstant
{
float f[4];
__m128 v;
};
static const simdConstant c_luminance = { { 0.299f, 0.587f, 0.114f, 1.f } };
static const simdConstant c_luminanceRed = { { 0.299f, 0.299f, 0.299f, 0.299f } };
Then use c_luminance.v or c_luminanceRed.v instead of _mm_set_ps or _mm_set_ps1.
See also DirectXMath which will provide numerous examples of SIMD implementations.
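Putting those points together, here is a minimal sketch (an illustration, not a drop-in replacement for your code) that processes 4 ARGB pixels per iteration with single-precision floats and precomputed constants; the function name argb_to_gray_sse and the in-place update are assumptions:

#include <emmintrin.h>  // SSE2
#include <stdint.h>

union simdConstant    { float    f[4]; __m128  v; };
union simdIntConstant { uint32_t i[4]; __m128i v; };

static const union simdConstant    c_red    = { { 0.299f, 0.299f, 0.299f, 0.299f } };
static const union simdConstant    c_green  = { { 0.587f, 0.587f, 0.587f, 0.587f } };
static const union simdConstant    c_blue   = { { 0.114f, 0.114f, 0.114f, 0.114f } };
static const union simdIntConstant c_maskFF = { { 0xff, 0xff, 0xff, 0xff } };

static void argb_to_gray_sse(uint32_t* pixels, int count)
{
    for (int i = 0; i + 4 <= count; i += 4) {
        __m128i px = _mm_loadu_si128((const __m128i*)(pixels + i));
        __m128i a  = _mm_srli_epi32(px, 24);
        __m128  r  = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 16), c_maskFF.v));
        __m128  g  = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 8),  c_maskFF.v));
        __m128  b  = _mm_cvtepi32_ps(_mm_and_si128(px, c_maskFF.v));

        // weighted sum of four pixels at once, in single precision
        __m128 bwf = _mm_add_ps(_mm_mul_ps(r, c_red.v),
                     _mm_add_ps(_mm_mul_ps(g, c_green.v),
                                _mm_mul_ps(b, c_blue.v)));
        __m128i bw = _mm_cvttps_epi32(bwf);   // truncate, like the scalar (int) cast

        // reassemble A | bw | bw | bw and store back
        __m128i out = _mm_or_si128(_mm_slli_epi32(a, 24),
                      _mm_or_si128(_mm_slli_epi32(bw, 16),
                      _mm_or_si128(_mm_slli_epi32(bw, 8), bw)));
        _mm_storeu_si128((__m128i*)(pixels + i), out);
    }
    // any remaining count % 4 pixels would be handled with the scalar code
}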
I'm taking my first steps in C, and I was trying to make a gradient color function that draws a bunch of rectangles to the screen (vertically).
This is the code so far:
void draw_gradient(uint32_t start_color, uint32_t end_color) {
int steps = 8;
int draw_height = window_height / 8;
//Change this value inside the loop to write different color
uint32_t loop_color = start_color;
for (int i = 0; i < steps; i++) {
draw_rect(0, i * draw_height, window_width, draw_height, loop_color);
}
}
Ignoring the end_color for now, I want to try passing in a simple red color like 0xFFFF0000 (ARGB), then take the red 'FF', convert it to an integer, and decrease it using the loop_color variable.
I'm not sure how to get the red value from the hex code, manipulate it as a number, and then write it back to hex. Any ideas?
So in 8 steps the code should, for example, go in hex from FF to 00, or as an integer from 255 to 0.
As you have said, your color is in ARGB format. This calculation assumes a vertical gradient, meaning it runs from top to bottom, one horizontal line at a time.
The steps are:
Get number of lines to draw; this is your rectangle height
Get A, R, G, B color components from your start and end colors
uint8_t start_a = start_color >> 24;
uint8_t start_r = start_color >> 16;
uint8_t start_g = start_color >> 8;
uint8_t start_b = start_color >> 0;
uint8_t end_a = end_color >> 24;
uint8_t end_r = end_color >> 16;
uint8_t end_g = end_color >> 8;
uint8_t end_b = end_color >> 0;
Calculate step for each of the components
float step_a = (float)(end_a - start_a) / (float)height;
float step_r = (float)(end_r - start_r) / (float)height;
float step_g = (float)(end_g - start_g) / (float)height;
float step_b = (float)(end_b - start_b) / (float)height;
Run a for loop and apply the step to each color component (the intermediate values are floats, so cast back to an integer before masking and shifting):
for (int i = 0; i < height; ++i) {
    uint32_t color = 0 |
        (((uint32_t)(start_a + i * step_a) & 0xFF) << 24) |
        (((uint32_t)(start_r + i * step_r) & 0xFF) << 16) |
        (((uint32_t)(start_g + i * step_g) & 0xFF) << 8) |
        (((uint32_t)(start_b + i * step_b) & 0xFF) << 0);
    draw_horizontal_line(i, color);
}
It is better to use float for step_x and multiply/add on each iteration. Otherwise, with integer rounding, the value may never increase because the step would always be rounded down to zero.
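Putting the steps together back into the shape of the question's draw_gradient() (just a sketch: window_width, window_height and draw_rect() are assumed to exist exactly as in the question, with one rectangle per step instead of one line):

#include <stdint.h>

void draw_gradient(uint32_t start_color, uint32_t end_color) {
    const int steps = 8;
    const int draw_height = window_height / steps;

    uint8_t start_a = start_color >> 24, end_a = end_color >> 24;
    uint8_t start_r = start_color >> 16, end_r = end_color >> 16;
    uint8_t start_g = start_color >> 8,  end_g = end_color >> 8;
    uint8_t start_b = start_color,       end_b = end_color;

    // per-step increments; float avoids the step rounding down to 0
    float step_a = (float)(end_a - start_a) / (float)steps;
    float step_r = (float)(end_r - start_r) / (float)steps;
    float step_g = (float)(end_g - start_g) / (float)steps;
    float step_b = (float)(end_b - start_b) / (float)steps;

    for (int i = 0; i < steps; i++) {
        uint32_t color =
            (((uint32_t)(start_a + i * step_a) & 0xFF) << 24) |
            (((uint32_t)(start_r + i * step_r) & 0xFF) << 16) |
            (((uint32_t)(start_g + i * step_g) & 0xFF) << 8)  |
            (((uint32_t)(start_b + i * step_b) & 0xFF) << 0);
        draw_rect(0, i * draw_height, window_width, draw_height, color);
    }
}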
I load a 32-bit image into a buffer, and I then premultiply the color values with the corresponding alpha to use for blending.
The following works but I am wondering if there is a more efficient way of doing this, even if it only results in a good-enough approximation?
The image data is a pointer of this type:
typedef struct rgba_pixel
{
uint8_t r;
uint8_t g;
uint8_t b;
uint8_t a;
} rgba_pixel;
rgba_pixel * image_data;
for ( i = 0; i < length; i++ )
{
if ( image_data[i].a == 0 )
image_data[i].r = image_data[i].g = image_data[i].b = 0;
else if ( image_data[i].a < 255 )
{
alpha_factor = image_data[i].a / 255.0;
image_data[i].r = image_data[i].r * alpha_factor;
image_data[i].g = image_data[i].g * alpha_factor;
image_data[i].b = image_data[i].b * alpha_factor;
}
}
Given that your a, r, g and b components are unsigned char, you can improve performance by turning the floating-point multiplication into integer multiplication and by using a right shift by 8 (division by 256) instead of dividing by 255:
for ( i = 0; i < length; i++ )
{
if ( image_data[i].a == 0 )
image_data[i].r = image_data[i].g = image_data[i].b = 0;
else if ( image_data[i].a < 255 )
{
image_data[i].r = (unsigned short)image_data[i].r * image_data[i].a >> 8;
image_data[i].g = (unsigned short)image_data[i].g * image_data[i].a >> 8;
image_data[i].b = (unsigned short)image_data[i].b * image_data[i].a >> 8;
}
}
This converts 1 floating-point division and 3 floating-point multiplications into 3 integer multiplications and 3 bit shifts.
Another improvement is to use a union for the pixel data:
typedef union rgba_pixel
{
struct {
uint8_t r;
uint8_t g;
uint8_t b;
uint8_t a;
};
uint32_t u32;
} rgba_pixel;
And then assign zero to r, g and b at once:
//image_data[i].r = image_data[i].g = image_data[i].b = 0;
image_data[i].u32 = 0; //use this instead
According to https://godbolt.org/ with x86-64 gcc 7.2, the latter generates fewer instructions at -O3, which of course may or may not be faster in practice.
Another thing to consider is partial loop unrolling, i.e. processing multiple (for example 4) pixels per loop iteration. If you are guaranteed that your rows are a multiple of 4 in width, you can do it even without additional checks.
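As an illustration of that unrolling (a sketch only, assuming length is a multiple of 4 and using the union version of rgba_pixel from above; the fixed-count inner loop is easy for the compiler to unroll):

for ( i = 0; i < length; i += 4 )
{
    for ( int k = 0; k < 4; k++ )
    {
        rgba_pixel *p = &image_data[i + k];
        if ( p->a == 0 )
            p->u32 = 0;   /* clears r, g, b (and the already-zero a) in one store */
        else if ( p->a < 255 )
        {
            p->r = (unsigned short)p->r * p->a >> 8;
            p->g = (unsigned short)p->g * p->a >> 8;
            p->b = (unsigned short)p->b * p->a >> 8;
        }
    }
}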
Could someone help me understand the SSE implementation of the FAST corner detection in OpenCV? I understand the algorithm but not the implementation. Could somebody walk me through the code?
The code is long, so thank you in advance.
I am using OpenCV 2.4.11 and the code goes like this:
__m128i delta = _mm_set1_epi8(-128);
__m128i t = _mm_set1_epi8((char)threshold);
__m128i m0, m1;
__m128i v0 = _mm_loadu_si128((const __m128i*)ptr);
I think the following has something to do with threshold checking, but I can't understand the use of delta:
__m128i v1 = _mm_xor_si128(_mm_subs_epu8(v0, t), delta);
v0 = _mm_xor_si128(_mm_adds_epu8(v0, t), delta);
Now it checks the neighboring 4 pixels, but again, what is the use of delta?
__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x1), _mm_cmpgt_epi8(v1, x2)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x2, v0), _mm_cmpgt_epi8(x3, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x2), _mm_cmpgt_epi8(v1, x3)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x3, v0), _mm_cmpgt_epi8(x0, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x3), _mm_cmpgt_epi8(v1, x0)));
m0 = _mm_or_si128(m0, m1);
Here it checks the continuity of the neighboring pixels. (Right?)
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
continue;
This is another puzzle for me. Why shift 8 bytes to the left? I assume the mask tells me the location of the corner candidate, but why 8 bytes?
if( (mask & 255) == 0 )
{
j -= 8;
ptr -= 8;
continue;
}
I gave up at this point...
__m128i c0 = _mm_setzero_si128(), c1 = c0, max0 = c0, max1 = c0;
for( k = 0; k < N; k++ )
{
__m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
m0 = _mm_cmpgt_epi8(x, v0);
m1 = _mm_cmpgt_epi8(v1, x);
c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);
max0 = _mm_max_epu8(max0, c0);
max1 = _mm_max_epu8(max1, c1);
}
max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));
for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
if(m & 1)
{
cornerpos[ncorners++] = j+k;
if(nonmax_suppression)
curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
}
As harold said, delta is used to make the comparison unsigned.
Let's describe this implementation step by step:
__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
...
Here it is not checking 4 adjacent pixels; it checks 4 points spaced evenly around the circle (pixel[0], pixel[4], pixel[8], pixel[12]).
They check that the "corner condition" is true for these 4 points, because if it does not hold for at least two adjacent ones of them, there cannot be a long enough run of neighboring pixels satisfying the "corner condition", so the pixel is not a corner. If the mask is zero, it means that none of the 16 pixels in the vector can be a corner, so we move on by 16 pixels.
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
continue;
If the mask is not zero, but the "corner condition" does not hold for the first 8 pixels, they step back by only 8 pixels so that the remaining pixels are checked again on the next iteration.
if( (mask & 255) == 0 )
{
j -= 8;
ptr -= 8;
continue;
}
And the final step. Here they count, in the c0 counter, the neighboring pixels that are greater than x + threshold, and in the c1 counter those that are less than x - threshold.
The masks for these conditions are generated here:
__m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
m0 = _mm_cmpgt_epi8(x, v0);
m1 = _mm_cmpgt_epi8(v1, x);
Note that if the condition is true for an element of the vector, its value is set to 0xFF, which is -1 when treated as a signed char.
c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);
If an element of the mask is -1, the corresponding counter c0 or c1 is incremented, because of the subtraction (for example c0 - (-1)). But if it is equal to zero, the counter is reset to zero by the _mm_and_si128.
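Per byte lane, that update is equivalent to this small scalar model (purely illustrative, not part of the OpenCV code):

#include <stdint.h>

/* Scalar model of c = _mm_and_si128(_mm_sub_epi8(c, m), m) for one lane.
   m is 0xFF (-1) when the "corner condition" held for this pixel, 0 otherwise. */
static uint8_t update_run_length(uint8_t c, uint8_t m)
{
    return (uint8_t)((c - m) & m);
    /* m == 0xFF: (c + 1) & 0xFF -> the run length grows by one   */
    /* m == 0x00: (c - 0) & 0x00 -> the run length resets to zero */
}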
Then they keep the maximum value of the counters:
max0 = _mm_max_epu8(max0, c0);
max1 = _mm_max_epu8(max1, c1);
So they store the maximum number of consecutive neighboring pixels that satisfy the "corner condition".
Here they determine which pixels are actually corners and which are not:
max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));
for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
if(m & 1)
{
cornerpos[ncorners++] = j+k;
if(nonmax_suppression)
curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
}
I hope this helps.
delta is a mask in which only the sign bits are set. They use it because they want to do an unsigned greater-than comparison, but there is only a signed comparison.
Adding 128 (or subtracting it, because -128 == 128) and XORing with it do the same thing (if you're working with bytes), because
a + b == (a ^ b) + ((a & b) << 1)
and if b only has the top bit set, the ((a & b) << 1) term must be zero (a & b can have the top bit set, but it's shifted out).
Then as you can see in the diagram below, subtracting 128 "shifts" the entire range down in such a way that a signed comparison will give the same result as an unsigned comparison would have given on the original range.
|0 ... 127 ... 255| unsigned
|-128 ... 0 ... 127| signed
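As a small sketch of that trick (assuming SSE2), an unsigned byte greater-than comparison can be built from the signed _mm_cmpgt_epi8 by flipping the sign bit of both operands first:

#include <emmintrin.h>

static inline __m128i cmpgt_epu8(__m128i a, __m128i b)
{
    const __m128i delta = _mm_set1_epi8((char)0x80);  /* only the sign bits set */
    return _mm_cmpgt_epi8(_mm_xor_si128(a, delta),    /* same effect as adding/subtracting 128 */
                          _mm_xor_si128(b, delta));
}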
I don't know about the rest, I hope someone else can answer that.
This is a general question about what precisely happens when I cast a very big/small SIGNED integer to a floating-point value using gcc 4.4.
I see some weird behaviour when doing the casting. Here are some examples:
MUST BE is obtained with this method:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
./btest -f float_i2f -1 0x80800001
input: 10000000100000000000000000000001
absolute value: 01111111011111111111111111111111
exponent: 10011101
mantissa: 00000000011111101111111111111111 (right shifted absolute value)
EXPECT: 11001110111111101111111111111111 (sign|exponent|mantissa)
MUST BE: 11001110111111110000000000000000 (sign ok, exponent ok, mantissa???)
./btest -f float_i2f -1 0x3f7fffe0
EXPECT: 01001110011111011111111111111111
MUST BE: 01001110011111100000000000000000
./btest -f float_i2f -1 0x80004999
EXPECT: 11001110111111111111111101101100
MUST BE: 11001110111111111111111101101101 (<- 1 added at the end)
So what bothers me is that in both examples the mantissa differs from what I get if I just shift my integer value to the right. The zeros at the end, for instance: where do they come from?
I only see this behaviour with big/small values; values in the range -2^24 to 2^24 work fine.
I wonder if someone can enlighten me as to what happens here. What are the steps to take for very big/small values?
This is a follow-up question to: function to convert float to int (huge integers), which is not as general as this one.
EDIT
Code:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
EDIT 2:
After the remarks from Adams and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (now fortunately only 1 bit off).
Now if I do a test run, I get mostly good results but a couple of rounding errors like this:
input: 0xfefffff5
result: 11001011100000000000000000000101
GOAL: 11001011100000000000000000000110 (1 too low)
input: 0x7fffff
result: 01001010111111111111111111111111
GOAL: 01001010111111111111111111111110 (1 too high)
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* mask to check which bits get shifted out when rounding */
static unsigned masks[24] = {
0, 1, 3, 7,
0xf, 0x1f,
0x3f, 0x7f,
0xff, 0x1ff,
0x3ff, 0x7ff,
0xfff, 0x1fff,
0x3fff, 0x7fff,
0xffff, 0x1ffff,
0x3ffff, 0x7ffff,
0xfffff, 0x1fffff,
0x3fffff, 0x7fffff
};
/* mask to check wether round up, or down */
static unsigned HOmasks[24] = {
0,
1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
};
int S = a & masks[8];
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
if (S > HOmasks[8]) {
/* round up */
m += 1;
} else if (S == HOmasks[8]) {
/* round down */
m = m + (m & 1);
}
/* special case where last bit of exponent is also set in mantissa
* and mantissa itself is 0 */
if (m & (0x1 << 23)) {
e += 1;
m = 0;
}
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
Does someone have any idea where the problem lies?
A 32-bit float uses some of the bits for the exponent and therefore cannot represent all 32-bit integer values exactly.
A 64-bit double can store any 32-bit integer value exactly.
Wikipedia has an abbreviated entry on IEEE 754 floating point, and there are lots of details of the internals of floating-point numbers at IEEE 754-1985 — the current standard is IEEE 754:2008. It notes that a 32-bit float uses one bit for the sign, 8 bits for the exponent, leaving 23 explicit and 1 implicit bit for the mantissa, which is why absolute values up to 2^24 can be represented exactly.
I thought it was clear that a 32-bit integer can't always be stored exactly in a 32-bit float. My question is: what happens IF I store an integer bigger than 2^24 or smaller than -2^24? And how can I replicate it?
Once the absolute values are larger than 2^24, the integer values cannot be represented exactly in the 24 effective digits of the mantissa of a 32-bit float, so only the leading 24 digits are reliably available. Floating point rounding also kicks in.
You can demonstrate with code similar to this:
#include <stdio.h>
#include <inttypes.h>
typedef union Ufloat
{
uint32_t i;
float f;
} Ufloat;
static void dump_value(uint32_t i, uint32_t v)
{
Ufloat u = { .i = v };
printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}
int main(void)
{
uint32_t lo = 1 << 23;
uint32_t hi = 1 << 28;
Ufloat u;
for (uint32_t v = lo; v < hi; v <<= 1)
{
u.f = v;
dump_value(v, u.i);
}
lo = (1 << 24) - 16;
hi = lo + 64;
for (uint32_t v = lo; v < hi; v++)
{
u.f = v;
dump_value(v, u.i);
}
return 0;
}
Sample output:
0x00800000: 0x4B000000 = 8.3886080e+06 = 0X1.000000P+23
0x01000000: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x02000000: 0x4C000000 = 3.3554432e+07 = 0X1.000000P+25
0x04000000: 0x4C800000 = 6.7108864e+07 = 0X1.000000P+26
0x08000000: 0x4D000000 = 1.3421773e+08 = 0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 = 1.6777200e+07 = 0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 = 1.6777201e+07 = 0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 = 1.6777202e+07 = 0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 = 1.6777203e+07 = 0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 = 1.6777204e+07 = 0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 = 1.6777205e+07 = 0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 = 1.6777206e+07 = 0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 = 1.6777207e+07 = 0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 = 1.6777208e+07 = 0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 = 1.6777209e+07 = 0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA = 1.6777210e+07 = 0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB = 1.6777211e+07 = 0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC = 1.6777212e+07 = 0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD = 1.6777213e+07 = 0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE = 1.6777214e+07 = 0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF = 1.6777215e+07 = 0X1.FFFFFEP+23
0x01000000: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x01000001: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x01000002: 0x4B800001 = 1.6777218e+07 = 0X1.000002P+24
0x01000003: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000004: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000005: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000006: 0x4B800003 = 1.6777222e+07 = 0X1.000006P+24
0x01000007: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x01000008: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x01000009: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x0100000A: 0x4B800005 = 1.6777226e+07 = 0X1.00000AP+24
0x0100000B: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000C: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000D: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000E: 0x4B800007 = 1.6777230e+07 = 0X1.00000EP+24
0x0100000F: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000010: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000011: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000012: 0x4B800009 = 1.6777234e+07 = 0X1.000012P+24
0x01000013: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000014: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000015: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000016: 0x4B80000B = 1.6777238e+07 = 0X1.000016P+24
0x01000017: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x01000018: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x01000019: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x0100001A: 0x4B80000D = 1.6777242e+07 = 0X1.00001AP+24
0x0100001B: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001C: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001D: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001E: 0x4B80000F = 1.6777246e+07 = 0X1.00001EP+24
0x0100001F: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000020: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000021: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000022: 0x4B800011 = 1.6777250e+07 = 0X1.000022P+24
0x01000023: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000024: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000025: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000026: 0x4B800013 = 1.6777254e+07 = 0X1.000026P+24
0x01000027: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x01000028: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x01000029: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x0100002A: 0x4B800015 = 1.6777258e+07 = 0X1.00002AP+24
0x0100002B: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002C: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002D: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002E: 0x4B800017 = 1.6777262e+07 = 0X1.00002EP+24
0x0100002F: 0x4B800018 = 1.6777264e+07 = 0X1.000030P+24
The first part of the output demonstrates that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. In fact, more precisely (but less concisely), any integer whose absolute value has no more than 24 significant binary digits (any trailing digits being zeros) can be represented exactly. The values can't necessarily be printed exactly, but that's a separate issue from storing them exactly.
The second (larger) part of the output demonstrates that up to 2^24-1, the integer values can be represented exactly. The value of 2^24 itself is also exactly representable, but 2^24+1 is not, so it appears the same as 2^24. By contrast, 2^24+2 can be represented with just 24 binary digits followed by 1 zero and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though 'round even' mode is in effect; that's why the results show 1 value then 3 values.
(I note in passing that there isn't a way to stipulate that the double passed to printf() — converted from float by the rules of default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6) — should be printed as a float; the h modifier would logically be used, but it is not defined for that purpose.)
C/C++ floats tend to be compatible with the IEEE 754 floating point standard (e.g. in gcc). The zeros come from the rounding rules.
Shifting a number to the right makes some bits from the right-hand side go away. Let's call them guard bits. Now let's call HO the highest bit and LO the lowest bit of our number. Now suppose that the guard bits are still a part of our number. If, for example, we have 3 guard bits it means that the value of our LO bit is 8 (if it is set). Now if:
value of guard bits > 0.5 * value of LO
rounds the number to the smallest possible greater value, ignoring the sign
value of 'guard bits' == 0.5 * value of LO
use current number value if LO == 0
number += 1 otherwise
value of guard bits < 0.5 * value of LO
use current number value
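A tiny sketch applying these rules (illustrative only, not the book's code): shifting an 8-bit value right by s bits with round-to-nearest, ties-to-even:

#include <stdio.h>
#include <stdint.h>

static uint8_t shift_round(uint8_t x, unsigned s)   /* assumes 1 <= s < 8 */
{
    uint8_t guard = x & ((1u << s) - 1);   /* the bits shifted out      */
    uint8_t half  = 1u << (s - 1);         /* 0.5 * value of the LO bit */
    uint8_t q     = x >> s;
    if (guard > half)  return q + 1;          /* round up                 */
    if (guard == half) return q + (q & 1);    /* tie: round to even       */
    return q;                                 /* round down               */
}

int main(void)
{
    printf("%u\n", shift_round(182, 3)); /* 182 / 8 = 22.75 -> 23                */
    printf("%u\n", shift_round(180, 3)); /* 180 / 8 = 22.5  -> 22 (ties to even) */
    return 0;
}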
Why do 3 guard bits mean the LO value is 8?
Suppose we have a binary 8 bit number:
weights: 128 64 32 16 8 4 2 1
binary num: 0 0 0 0 1 1 1 1
Let's shift it right by 3 bits:
weights: x x x 128 64 32 16 8 | 4 2 1
binary num: 0 0 0 0 0 0 0 1 | 1 1 1
As you see, with 3 guard bits the LO bit ends up at the 4th position and has a weight of 8. It is true only for the purpose of rounding. The weights have to be 'normalized' afterwards, so that the weight of LO bit becomes 1 again.
And how can I check with bit operations whether guard bits > 0.5 * value?
The fastest way is to employ lookup tables. Suppose we're working on an 8 bit number:
unsigned number;         //our number
unsigned bitsToShift;    //number of bits to shift
assert(bitsToShift < 8); //8 bits
unsigned guardMasks[8] = {0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f};
unsigned LOvalues[8] = {0, 1, 2, 4, 0x8, 0x10, 0x20, 0x40}; //already divided by 2 for faster comparison
unsigned guardBits = number & guardMasks[bitsToShift]; //value of the guard bits
number = number >> bitsToShift;
if(guardBits > LOvalues[bitsToShift]) {
    ...
} else if (guardBits == LOvalues[bitsToShift]) {
    ...
} else { //guardBits < LOvalues[bitsToShift]
    ...
}
Reference: Write Great Code, Volume 1 by Randall Hyde