Two's complement stm32 c - c

I have a value that is the "most significant byte" of a reading; it may be 0 or 255,
which should mean 0 or -1.
How can I convert 255 to -1 in one step?
I have this code, which doesn't work for me:
acc->x = ((raw_data[1]) << 8) | raw_data[0];

Assuming that a set top bit (bit 7) of the byte means negative (so 254 == -2), a widening conversion from a signed type should do:
int n = (signed char)somebyte;
so
unsigned char raw_data[2] = ...;
int msbyte = (signed char)raw_data[1]; // 255 becomes -1 on two's complement targets
acc->x = msbyte * 256 + raw_data[0]; // multiply: left-shifting a negative value is undefined
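The same idea can be sketched without ever shifting or converting a negative value. The helper name combine_s16 is made up for illustration, and the sketch assumes the low byte comes first, as raw_data[0] does in the question:

```c
#include <stdint.h>

/* Hypothetical helper: builds a signed 16-bit value from two raw bytes.
 * Assumes lo is the least significant byte and hi the most significant one. */
int16_t combine_s16(uint8_t lo, uint8_t hi)
{
    /* Assemble the bit pattern in an unsigned type first, so no negative
     * value is ever shifted. */
    uint16_t bits = (uint16_t)(((unsigned)hi << 8) | lo);

    /* Reinterpret the pattern as two's complement by hand, which avoids an
     * implementation-defined out-of-range conversion. */
    if (bits & 0x8000u)
        return (int16_t)((int32_t)bits - 65536);
    return (int16_t)bits;
}
```

With this, the assignment in the question would read acc->x = combine_s16(raw_data[0], raw_data[1]);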

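I am not sure what is required, but here are the rules for integer conversions.
If an integer is assigned to a narrower integer type, the value is truncated to the bits that fit (for a signed target the result is implementation-defined, but two's complement compilers simply keep the low bits).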
Example:
#include <stdio.h>
struct A {
    signed int c1 : 8; // explicitly signed: a plain int bit-field may be unsigned
    unsigned c2 : 8;
} a;
int main()
{
    short int i = 255; // low 8 bits all set
    a.c1 = i; // or a.c1 = 255. casting not required.
    a.c2 = i; // same as above.
    // prints -1, 255
    printf("c1: %d c2: %d\n", a.c1, a.c2);
    i = 511; // nine 1 bits
    a.c1 = i; // the 9th bit is truncated. casting not required.
    a.c2 = i; // same as above
    // prints -1, 255
    printf("c1: %d c2: %d\n", a.c1, a.c2);
    return 0;
}
If a signed 8-bit integer (such as signed char) is assigned to a wider integer type (say int), its sign bit is extended into the new upper bits.
Example:
signed char c = -1; // bit pattern 0xFF, the same bits as 255 in an unsigned char
int i = c; // i is now -1: the sign bit is copied into bits 8 to 31.

Related

Why does left-shifting an integer by 24-bit yield the wrong result?

I tried left-shifting a 32-bit integer by 24:
char *int_to_bin(int num) {
int i = 0;
static char bin[64];
while (num != 0) {
bin[i] = num % 2 + 48;
num /= 2;
i++;
}
bin[i] = '\0';
return (bin);
}
int main() {
int number = 255;
printf("number: %s\n", int_to_bin(number));
printf("shifted number: %s\n", int_to_bin(number << 24));
return 0;
}
OUTPUT:
number: 11111111
shifted number: 000000000000000000000000/
When I left-shift by 23 bits instead, I get this result:
0000000000000000000000011111111
Why is that, and what is the '/' at the end of the wrong result?
Two things:
If number has the value 255 then number << 24 has the numerical value 4278190080, which overflows a 32-bit signed integer whose largest possible value is 2147483647. Signed integer overflow is undefined behavior in C, so the result could be anything at all.
What probably happens in this case is that the result of the shift is negative. When num is negative then num % 2 may take the value -1, so you store character 47 in the string, which is /.
Bit shifting math is usually better to do with unsigned types, where overflow is well-defined (it wraps around and bits just shift off the left and vanish) and num % 2 can only be 0 or 1. (Or write num & 1 instead.)
Your int_to_bin routine puts the least-significant bits at the beginning of the string (on the left), so the result is backwards from the way people usually write numbers (with the least-significant bits on the right). You may want to rewrite it.
The shift itself does what you expect on typical platforms; you simply print the digits in the wrong order.
#include <limits.h>
#include <stdio.h>
char *int_to_bin(char *buff, int num)
{
    unsigned mask = 1U << (CHAR_BIT * sizeof(num) - 1); // start at the most significant bit
    char *wrk = buff;
    for(; mask; mask >>= 1)
    {
        *wrk++ = '0' + !!((unsigned)num & mask);
    }
    *wrk = 0;
    return buff;
}
int main()
{
    char buff[CHAR_BIT * sizeof(int) + 1];
    int number = 255;
    printf("number: %s\n", int_to_bin(buff, number));
    // shift as unsigned: 255 << 24 is not representable in a 32-bit int
    printf("shifted number: %s\n", int_to_bin(buff, (int)((unsigned)number << 24)));
    return 0;
}
Shifting a non-negative signed integer left is OK as long as the result is representable, but right-shifting a negative value is implementation-defined. Many systems use an arithmetic shift right, and the result is not the same as with a logical shift:
https://godbolt.org/z/e7f3shxd4
You are storing the digits backwards.
You are using a signed 32-bit int, while the result of 255 << 24 does not fit in its 31 value bits; use a wider or unsigned type such as unsigned long long int.
With a signed integer the result can become negative (1 << 31 typically yields INT_MIN), so num % 2 can be -1, which puts bad characters in the string.
Using unsigned long long int and storing the digits in the correct order produces the correct string.
Try rewriting the code on your own before looking at this improved version:
#include<stdio.h>
#include<stdlib.h>
char *int_to_bin( unsigned long long int num) {
int i = 0;
static char bin[65];
while (i != 64) {
bin[63-i] = num % 2 + 48;
num /= 2;
i++;
}
bin[64] = '\0';
return (bin);
}
int main() {
unsigned long long int number = 255;
printf("number 1: %s\n", int_to_bin(number));
printf("number 2: %s\n", int_to_bin(number << 24));
return 0;
}

Converting negative numbers to positive numbers but keeping positive numbers unchanged

I want to apply a bitmask to a number that will mimic the absolute value function for 2's complement encoded signed 32 bit integers. So far, I have
int absoluteValue(int x) {
int sign = x >> 31; //get most significant byte...all 1's if x is < 0, all 0's if x >= 0
int negated = (~x + 1) & sign; //negates the number if negative, sets to 0 if positive
//what should go here???
}
Am I going in the right direction? I'm not really sure where to go from here (mostly just how to apply a mask to keep the original positive value). I also don't want to use any conditional statements
Bizarre question. What about
return (negated << 1) + x;
So put together this makes:
int absoluteValue(int x) {
int sign = x >> 31; //get most significant byte...all 1's if x is < 0, all 0's if x >= 0
int negated = (~x + 1) & sign; //negates the number if negative, sets to 0 if positive
return (negated << 1) + x;
}
The last part
negated = (~x + 1) & sign;
is only correct if sign is all ones for negative x, which requires x >> 31 to be an arithmetic shift. With a logical shift, sign is 1 or 0, so the AND leaves you with either 1 or 0.
The function below takes a different route and inverts the sign of its argument. Assuming that your target uses 32-bit integers in two's complement, you can do this:
#include <stdio.h>
// assuming 32bit, 2 complement
int sign_inverse(int n)
{
int mask = ~n & 0x80000000U;
if(n == 0)
mask = 0;
return (~n + 1) | mask;
}
int main(void)
{
int a = 5;
int b = -4;
int c = 54;
int d = 0;
printf("sign_inverse(%d) = %d\n", a, sign_inverse(a));
printf("sign_inverse(%d) = %d\n", b, sign_inverse(b));
printf("sign_inverse(%d) = %d\n", c, sign_inverse(c));
printf("sign_inverse(%d) = %d\n", d, sign_inverse(d));
return 0;
}
Note that you need at least one if for the case of 0, because the mask computed for 0 is 0x80000000.
The output of this is:
$ ./b
sign_inverse(5) = -5
sign_inverse(-4) = 4
sign_inverse(54) = -54
sign_inverse(0) = 0
Please note that two's complement representation is not guaranteed (before C23), and that the behaviour of operator >> on negative signed values, where the result gets "filled" with 1-bits, is implementation-defined (cf., for example, cppreference.com/arithmetic operations):
For negative LHS, the value of LHS >> RHS is implementation-defined
where in most implementations, this performs arithmetic right shift
(so that the result remains negative). Thus in most implementations,
right shifting a signed LHS fills the new higher-order bits with the
original sign bit (i.e. with 0 if it was non-negative and 1 if it was
negative).
But if you take this as given, and if you just want to use bitwise operations and operator +, you are already going in the right direction.
The only thing is that you should use the mask you create (i.e. your sign) so that the bits of x are toggled only when x is negative. You can achieve this with the XOR operator as follows:
int x = -3000;
unsigned int mask = x >> 31;
int sign = mask & 0x01;
int positive = (x^mask) + sign;
printf("x:%d mask:%0X sign:%d positive:%d\n",x,mask,sign,positive);
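Wrapped into a function, the XOR approach might look like the sketch below. The name abs_branchless is hypothetical, and the code assumes that >> on a negative int32_t is an arithmetic shift, which is implementation-defined. It returns an unsigned value so that INT32_MIN has a representable result:

```c
#include <stdint.h>

/* Sketch: branch-free absolute value for 32-bit two's complement integers.
 * Assumes x >> 31 is an arithmetic shift (implementation-defined in C). */
uint32_t abs_branchless(int32_t x)
{
    uint32_t mask = (uint32_t)(x >> 31);   /* 0xFFFFFFFF if x < 0, else 0 */

    /* XOR flips every bit when x is negative; subtracting the mask
     * (which is then -1 modulo 2^32) adds the missing 1. */
    return ((uint32_t)x ^ mask) - mask;
}
```

All arithmetic happens in uint32_t, so nothing overflows even for the most negative input.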

Program to count the number of bits set in c

I have tried to count the number of bits set in an integer value in C.
For some values it shows the correct count, and for others it does not.
Please find the program code below:
int main()
{
int a=512,i=0,j=1,count=0,k=0;
for(i=0;i<31;i++)
{
if(k=a&j)
{
count++;
j=j<<1;
}
}
printf("the total bit set countis %d",count);
}
The output of set bit value count of 512 is showing as zero and if the value used is 511 count is showing as 9.
Please help me to correct the program.
Stanford University has a page of different ways to implement common bit-twiddling operations. They list 5 different algorithms to count the bits set, all with C examples.
https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
Their simplest implementation:
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; v >>= 1)
{
c += v & 1;
}
If you're using the gcc or clang compiler, you can use the builtin function __builtin_popcount:
unsigned int user_input = 100;
int count = __builtin_popcount(user_input); // count == 3
When I'm not looking for cross-platform code I'll use this function, since it's highly optimised.
Generally you would count bits in an unsigned integer, the reason being that you're usually checking for bits set in a register or a mask, for example. Signed integers are usually represented using two's complement, and I can't think why you'd want to count set bits in a signed integer (I would be interested to know why if you definitely do want this).
Note that in C, left-shifting a negative signed integer is undefined behaviour and right-shifting one is implementation-defined. From the C standard, section 6.5.7:
... The result of E1 << E2 is E1 left-shifted E2 bit positions; ... If E1
has a signed type and nonnegative value, and E1 << E2 is representable
in the result type, then that is the resulting value; otherwise, the
behavior is undefined.
The result of E1 >> E2 is E1 right-shifted E2
bit positions. ... If E1 has a signed type and a negative value, the
resulting value is implementation-defined ...
If you want to count 1's in an arbitrary sized unsigned integer you could use this example:
#include <stdio.h>
int main(void) {
unsigned int value = 1234;
unsigned int ones = 0;
while(value > 0) {
ones += value & 0x1;
value >>= 1;
}
printf("#Ones = %u", ones);
}
Using this example value could be unsigned char, unsigned long, whatever unsigned integer type...
Note: Do not shift signed values or floats/doubles.
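You can use the division / and modulo % operators to check the bits that are set in a non-negative integer (for a negative value, a % 2 is -1 and the test below never matches).
#include <stdio.h>
int main()
{
    int a = 512, count = 0;
    while(a != 0)
    {
        if(a % 2 == 1) // lowest bit is set
        {
            count++;
        }
        a /= 2; // drop the lowest bit
    }
    printf("The total bit set is %d", count);
    return 0;
}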
You have a couple of mistakes:
for(i=0;i<32;i++) // <<< this should be 32, not 31
{
if(k=a&j)
{
count++;
}
j=j<<1; // <<< this needs to be outside the if block
}
Note that instead of using a hard-coded value of 32 for the number of bits in an int, it would be better to do it like this (CHAR_BIT comes from <limits.h>):
for(i=0;i<sizeof(int)*CHAR_BIT;i++)
This way the code will still work if the size of an int is e.g. 16 bits or 64 bits.
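Putting both fixes together, a minimal sketch of the corrected loop could look like this. The function name count_set_bits is made up here, and the sketch uses unsigned operands so that the mask can reach the top bit without signed overflow:

```c
#include <limits.h>

/* Sketch of the corrected loop: tests every bit position of an unsigned value. */
unsigned count_set_bits(unsigned value)
{
    unsigned count = 0;
    unsigned mask = 1u;

    for (unsigned i = 0; i < sizeof value * CHAR_BIT; i++) {
        if (value & mask)
            count++;
        mask <<= 1;   /* advance on every iteration, whether or not the bit was set */
    }
    return count;
}
```

For the values from the question, count_set_bits(512) gives 1 and count_set_bits(511) gives 9.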
Although this is not C strictly speaking, you can use inline assembly to execute the x86 POPCNT instruction:
// GCC syntax (default AT&T operand order: source first, destination second)
unsigned a = 1234;
unsigned int count;
__asm__(
"popcnt %1, %0\n"
:"=r" (count)
:"r" (a)
);
return count;
According to this benchmark, calling __builtin_popcount as in idok's answer is just as fast as the above code and they both are much faster than any other C implementation. You can also check the linked repo for other solutions as well.
You are checking the value of a&j , and if a&j is 0, then you do nothing else but try again.
Your j-bitshift needs to be outside the if-then.
#include <stdio.h>
/* prints and returns the number of 1 bits in n */
unsigned int countSetBits (unsigned int n){
    unsigned int count = 0;
    while (n){
        count += n & 1;
        n >>= 1;
    }
    printf ("\n\t Number of 1's in the binary number is : %u",count);
    return count;
}
/* prints n as binary digits packed into a decimal int;
 * only suitable for small non-negative n (the packing overflows int beyond 9 binary digits) */
int dec_bin (int n){
    int binary = 0, i = 1;
    while (n != 0){
        int rem = n % 2;
        n = n / 2;
        binary = binary + (rem * i);
        i = i * 10;
    }
    printf("\n\t The converted Binary Equivalent is : %d",binary);
    return binary;
}
int main(){
    int i = 0;
    printf ("\n\t Enter the Decimal Number: ");
    if (scanf ("%d", &i) != 1)
        return 1;
    dec_bin(i);
    countSetBits ((unsigned int)i);
    return 0;
}

Set first 10 bit of int

I have a 32-bit int and I want to set the first (most significant) 10 bits to a specific number.
For example:
The 32-bit int is:
11101010101010110101100100010010
I want the first 10 bit to be the number 123, which is
0001111011
So the result would be
00011110111010110101100100010010
Does anyone know the easiest way I would be able to do this? I know that we have to do bit-shifting but I'm not good at it so I'm not sure
Thank you!
uint32_t result = (input & 0x3fffff) | (newval << 22);
0x3fffff masks out the highest 10 bits (it has the lowest 22 bits set). You have to shift your new value for the highest 10 bits by 22 places.
Convert inputs to unsigned 32-bit integers
uint32_t num = strtoul("11101010101010110101100100010010", 0, 2);
uint32_t firstbits = 123;
Keep only the lower 32-10 bits. Create the mask by shifting an unsigned long 1 left by 22 places, making 100_0000_0000_0000_0000_0000, then decrementing to 11_1111_1111_1111_1111_1111
uint32_t mask = (1UL << (32-10)) - 1;
num &= mask;
Then OR in firstbits shifted left by 32-10
num |= firstbits << (32-10);
Or in 1 line:
(num & ((1UL << (32-10)) - 1)) | (firstbits*1UL << (32-10))
Detail about firstbits*1UL. The type of firstbits is not defined by OP and may only be a 16-bit int. To ensure the code can shift and form an answer that exceeds 16 bits (the minimum width of int), multiply by 1UL to ensure the value is unsigned and at least 32 bits wide.
You can "erase" bits (set them to 0) by using a bit wise and ('&'); bits that are 0 in either value will be 0 in the result.
You can set bits to 1 by using a bit wise or ('|'); bits that are 1 in either value will be 1 in the result.
So: and your number with a value where the first 10 bits are 0 and the rest are 1; then 'or' it with the first 10 bits you want put in, and 0 for the other bits. If you need to calculate that value, then a left-shift would be the way to go.
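The AND-then-OR recipe can be sketched as a small helper. The name set_top_bits and its width parameter are assumptions made for illustration; the original question only needs width 10:

```c
#include <stdint.h>

/* Hypothetical helper: replaces the top `width` bits of a 32-bit value.
 * Assumes 1 <= width <= 31; bits of `field` above `width` are ignored. */
uint32_t set_top_bits(uint32_t value, uint32_t field, unsigned width)
{
    unsigned shift = 32u - width;
    uint32_t low_mask   = (UINT32_C(1) << shift) - 1u;  /* ones in the bits to keep */
    uint32_t field_mask = (UINT32_C(1) << width) - 1u;  /* ones in the field's width */

    /* AND erases the top bits, OR puts the new field in their place. */
    return (value & low_mask) | ((field & field_mask) << shift);
}
```

For the example in the question, set_top_bits(0xEAAB5912, 123, 10) yields 0x1EEB5912, which is the bit pattern 00011110111010110101100100010010.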
You can also take a mask and replace approach where you zero the lower bits required to hold 123 and then simply | (OR) the value with 123 to gain the final result. You can accomplish the exact same thing with shifts as shown by several other answers, or you can accomplish it with masks:
#include <stdio.h>
#ifndef BITS_PER_LONG
#define BITS_PER_LONG 64
#endif
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif
char *binpad2 (unsigned long n, size_t sz);
int main (void) {
unsigned x = 0b11101010101010110101100100010010;
unsigned mask = 0xffffff00; /* mask to zero lower 8 bits */
unsigned y = 123; /* value to replace zero bits */
unsigned masked = x & mask; /* zero the lower bits */
/* show intermediate results */
printf ("\n x : %s\n", binpad2 (x, sizeof x * CHAR_BIT));
printf ("\n & mask : %s\n", binpad2 (mask, sizeof mask * CHAR_BIT));
printf ("\n masked : %s\n", binpad2 (masked, sizeof masked * CHAR_BIT));
printf ("\n | 123 : %s\n", binpad2 (y, sizeof y * CHAR_BIT));
masked |= y; /* apply the final or with 123 */
printf ("\n final : %s\n", binpad2 (masked, sizeof masked * CHAR_BIT));
return 0;
}
/** returns pointer to binary representation of 'n' zero padded to 'sz'.
* returns pointer to string containing binary representation of
* unsigned 64-bit (or less ) value zero padded to 'sz' digits.
*/
char *binpad2 (unsigned long n, size_t sz)
{
static char s[BITS_PER_LONG + 1] = {0};
char *p = s + BITS_PER_LONG;
register size_t i;
for (i = 0; i < sz; i++)
*--p = (n>>i & 1) ? '1' : '0';
return p;
}
Output
$ ./bin/bitsset
x : 11101010101010110101100100010010
& mask : 11111111111111111111111100000000
masked : 11101010101010110101100100000000
| 123 : 00000000000000000000000001111011
final : 11101010101010110101100101111011
How about using bit fields in C combined with a union? The following structure lets you set the whole 32-bit value, the top 10 bits or the bottom 22 bits. It isn't as versatile as a generic function but you can't easily make a mistake when using it. Be aware this and most solutions may not work on all integer sizes and look out for endianness as well.
union uu {
struct {
uint32_t bottom22 : 22;
uint32_t top10 : 10;
} bits;
uint32_t value;
};
Here is an example usage:
int main(void) {
union uu myuu;
myuu.value = 999999999;
printf("value = 0x%08x\n", myuu.value);
myuu.bits.top10 = 0;
printf("value = 0x%08x\n", myuu.value);
myuu.bits.top10 = 0xfff;
printf("value = 0x%08x\n", myuu.value);
return 0;
}
The output is:
value = 0x3b9ac9ff
value = 0x001ac9ff
value = 0xffdac9ff

what happens if you cast a big int to float

this is a general question about what precisely happens when I cast a very big/small SIGNED integer to a floating point using gcc 4.4.
I see some weird behaviour when doing the casting. Here are some examples:
MUSTBE is obtained with this method:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
./btest -f float_i2f -1 0x80800001
input: 10000000100000000000000000000001
absolute value: 01111111011111111111111111111111
exponent: 10011101
mantissa: 00000000011111101111111111111111 (right shifted absolute value)
EXPECT: 11001110111111101111111111111111 (sign|exponent|mantissa)
MUST BE: 11001110111111110000000000000000 (sign ok, exponent ok,
mantissa???)
./btest -f float_i2f -1 0x3f7fffe0
EXPECT: 01001110011111011111111111111111
MUST BE: 01001110011111100000000000000000
./btest -f float_i2f -1 0x80004999
EXPECT: 11001110111111111111111101101100
MUST BE: 11001110111111111111111101101101 (<- 1 added at the end)
So what bothers me is that the mantissa in these examples is different than if I just shift my integer value to the right. The zeros at the end, for instance: where do they come from?
I only see this behaviour on big/small values. Values in the range -2^24 to 2^24 work fine.
I wonder if someone can enlighten me about what happens here. What are the steps to take for very big/small values?
This is an add on question to : function to convert float to int (huge integers) which is not as general as this one here.
EDIT
Code:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
EDIT 2:
After Adam's remarks and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (now fortunately only 1 bit off).
Now if I do a test run, I get mostly good results but a couple of rounding errors like this:
input: 0xfefffff5
result: 11001011100000000000000000000101
GOAL: 11001011100000000000000000000110 (1 too low)
input: 0x7fffff
result: 01001010111111111111111111111111
GOAL: 01001010111111111111111111111110 (1 too high)
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* mask to check which bits get shifted out when rounding */
static unsigned masks[24] = {
0, 1, 3, 7,
0xf, 0x1f,
0x3f, 0x7f,
0xff, 0x1ff,
0x3ff, 0x7ff,
0xfff, 0x1fff,
0x3fff, 0x7fff,
0xffff, 0x1ffff,
0x3ffff, 0x7ffff,
0xfffff, 0x1fffff,
0x3fffff, 0x7fffff
};
/* mask to check wether round up, or down */
static unsigned HOmasks[24] = {
0,
1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
};
int S = a & masks[8];
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
if (S > HOmasks[8]) {
/* round up */
m += 1;
} else if (S == HOmasks[8]) {
/* round down */
m = m + (m & 1);
}
/* special case where last bit of exponent is also set in mantissa
* and mantissa itself is 0 */
if (m & (0x1 << 23)) {
e += 1;
m = 0;
}
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
Does someone have any idea where the problem lies?
A 32-bit float uses some of the bits for the exponent and therefore cannot represent all 32-bit integer values exactly.
A 64-bit double can store any 32-bit integer value exactly.
Wikipedia has an abbreviated entry on IEEE 754 floating point, and lots of details of the internals of floating point numbers at IEEE 754-1985; the standard has since been revised as IEEE 754-2008. It notes that a 32-bit float uses one bit for the sign, 8 bits for the exponent, leaving 23 explicit and 1 implicit bit for the mantissa, which is why absolute values up to 2^24 can be represented exactly.
I thought that it was clear that a 32-bit integer can't always be stored exactly in a 32-bit float. My question is: what happens IF I store an integer bigger than 2^24 or smaller than -2^24? And how can I replicate it?
Once the absolute values are larger than 2^24, the integer values cannot in general be represented exactly in the 24 effective binary digits of the mantissa of a 32-bit float, so only the leading 24 binary digits are reliably available. Floating point rounding also kicks in.
You can demonstrate with code similar to this:
#include <inttypes.h>
#include <stdio.h>
typedef union Ufloat
{
uint32_t i;
float f;
} Ufloat;
static void dump_value(uint32_t i, uint32_t v)
{
Ufloat u = { .i = v };
printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}
int main(void)
{
uint32_t lo = 1 << 23;
uint32_t hi = 1 << 28;
Ufloat u;
for (uint32_t v = lo; v < hi; v <<= 1)
{
u.f = v;
dump_value(v, u.i);
}
lo = (1 << 24) - 16;
hi = lo + 64;
for (uint32_t v = lo; v < hi; v++)
{
u.f = v;
dump_value(v, u.i);
}
return 0;
}
Sample output:
0x00800000: 0x4B000000 = 8.3886080e+06 = 0X1.000000P+23
0x01000000: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x02000000: 0x4C000000 = 3.3554432e+07 = 0X1.000000P+25
0x04000000: 0x4C800000 = 6.7108864e+07 = 0X1.000000P+26
0x08000000: 0x4D000000 = 1.3421773e+08 = 0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 = 1.6777200e+07 = 0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 = 1.6777201e+07 = 0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 = 1.6777202e+07 = 0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 = 1.6777203e+07 = 0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 = 1.6777204e+07 = 0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 = 1.6777205e+07 = 0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 = 1.6777206e+07 = 0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 = 1.6777207e+07 = 0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 = 1.6777208e+07 = 0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 = 1.6777209e+07 = 0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA = 1.6777210e+07 = 0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB = 1.6777211e+07 = 0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC = 1.6777212e+07 = 0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD = 1.6777213e+07 = 0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE = 1.6777214e+07 = 0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF = 1.6777215e+07 = 0X1.FFFFFEP+23
0x01000000: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x01000001: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x01000002: 0x4B800001 = 1.6777218e+07 = 0X1.000002P+24
0x01000003: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000004: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000005: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000006: 0x4B800003 = 1.6777222e+07 = 0X1.000006P+24
0x01000007: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x01000008: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x01000009: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x0100000A: 0x4B800005 = 1.6777226e+07 = 0X1.00000AP+24
0x0100000B: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000C: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000D: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000E: 0x4B800007 = 1.6777230e+07 = 0X1.00000EP+24
0x0100000F: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000010: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000011: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000012: 0x4B800009 = 1.6777234e+07 = 0X1.000012P+24
0x01000013: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000014: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000015: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000016: 0x4B80000B = 1.6777238e+07 = 0X1.000016P+24
0x01000017: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x01000018: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x01000019: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x0100001A: 0x4B80000D = 1.6777242e+07 = 0X1.00001AP+24
0x0100001B: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001C: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001D: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001E: 0x4B80000F = 1.6777246e+07 = 0X1.00001EP+24
0x0100001F: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000020: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000021: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000022: 0x4B800011 = 1.6777250e+07 = 0X1.000022P+24
0x01000023: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000024: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000025: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000026: 0x4B800013 = 1.6777254e+07 = 0X1.000026P+24
0x01000027: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x01000028: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x01000029: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x0100002A: 0x4B800015 = 1.6777258e+07 = 0X1.00002AP+24
0x0100002B: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002C: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002D: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002E: 0x4B800017 = 1.6777262e+07 = 0X1.00002EP+24
0x0100002F: 0x4B800018 = 1.6777264e+07 = 0X1.000030P+24
The first part of the output demonstrates that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. In fact, more precisely (but less concisely), any integer where binary representation of the absolute value has no more than 24 significant digits (any trailing digits are zeros) can be represented exactly. The values can't necessarily be printed exactly, but that's a separate issue from storing them exactly.
The second (larger) part of the output demonstrates that up to 2^24-1, the integer values can be represented exactly. The value of 2^24 itself is also exactly representable, but 2^24+1 is not, so it appears the same as 2^24. By contrast, 2^24+2 can be represented with just 24 binary digits followed by 1 zero and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though round-to-nearest-even mode is in effect; that's why the results show 1 value then 3 values.
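A quick way to check whether one particular integer survives the conversion is sketched below. The function name fits_in_float is hypothetical, and the code assumes float is an IEEE 754 single and double an IEEE 754 double, so that double holds every int32_t exactly:

```c
#include <stdint.h>

/* Returns 1 if v converts to float without rounding, 0 otherwise.
 * Assumes IEEE 754 single and double precision formats. */
int fits_in_float(int32_t v)
{
    /* volatile forces the value to be rounded to float precision even on
     * targets that evaluate expressions with excess precision. */
    volatile float f = (float)v;

    /* Compare in double, which represents both sides exactly. */
    return (double)f == (double)v;
}
```

Under these assumptions the function reports 1 for 16777216 and 16777218 but 0 for 16777217, matching the pattern in the sample output.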
(I note in passing that there isn't a way to stipulate that the double passed to printf() — converted from float by the rules of default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6) be printed as a float() — the h modifier would logically be used, but is not defined.)
C/C++ floats tend to be compatible with the IEEE 754 floating point standard (e.g. in gcc). The zeros come from the rounding rules.
Shifting a number to the right makes some bits from the right-hand side go away. Let's call them guard bits. Now let's call HO the highest bit and LO the lowest bit of our number. Now suppose that the guard bits are still a part of our number. If, for example, we have 3 guard bits it means that the value of our LO bit is 8 (if it is set). Now if:
value of guard bits > 0.5 * value of LO
round the number up to the smallest possible greater value, ignoring the sign
value of 'guard bits' == 0.5 * value of LO
use current number value if LO == 0
number += 1 otherwise
value of guard bits < 0.5 * value of LO
use current number value
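The three cases above can be sketched as one function. The name shift_right_round_even is made up for illustration, and the sketch assumes a shift count between 1 and 31:

```c
#include <stdint.h>

/* Sketch: shift right by `shift` bits, rounding to nearest with ties to even.
 * Assumes 1 <= shift <= 31. */
uint32_t shift_right_round_even(uint32_t value, unsigned shift)
{
    uint32_t guard  = value & ((UINT32_C(1) << shift) - 1u); /* bits shifted out */
    uint32_t half   = UINT32_C(1) << (shift - 1);            /* 0.5 * weight of LO */
    uint32_t result = value >> shift;

    if (guard > half)
        result += 1;             /* above the halfway point: round up */
    else if (guard == half)
        result += result & 1u;   /* tie: round up only if LO is 1, making it even */

    return result;               /* below halfway: the truncated value stands */
}
```

For instance, shifting 12 right by 3 (that is 1.5) gives 2, while shifting 20 right by 3 (that is 2.5) also gives 2, because ties go to the even neighbour.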
why do 3 guard bits mean the LO value is 8 ?
Suppose we have a binary 8 bit number:
weights: 128 64 32 16 8 4 2 1
binary num: 0 0 0 0 1 1 1 1
Let's shift it right by 3 bits:
weights: x x x 128 64 32 16 8 | 4 2 1
binary num: 0 0 0 0 0 0 0 1 | 1 1 1
As you see, with 3 guard bits the LO bit ends up at the 4th position and has a weight of 8. It is true only for the purpose of rounding. The weights have to be 'normalized' afterwards, so that the weight of LO bit becomes 1 again.
And how can I check with bit operations if guard bits > 0.5 * value ??
The fastest way is to employ lookup tables. Suppose we're working on an 8 bit number:
unsigned number; //our number
unsigned bitsToShift; //number of bits to shift
assert(bitsToShift < 8); //8 bits
unsigned guardMasks[8] = {0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f};
unsigned LOvalues[8] = {0, 1, 2, 4, 0x8, 0x10, 0x20, 0x40}; //divided by 2 for faster comparison
unsigned guardBits = number & guardMasks[bitsToShift]; //value of the guard bits
number = number >> bitsToShift;
if(guardBits > LOvalues[bitsToShift]) {
...
} else if (guardBits == LOvalues[bitsToShift]) {
...
} else { //guardBits < LOvalues[bitsToShift]
...
}
Reference: Write Great Code, Volume 1 by Randall Hyde
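Applied to the original question, the rounding rules might be combined into a converter like the sketch below. This is not the asker's routine: the name int_to_float_bits is hypothetical, it works on unsigned values throughout, and it takes the guard bits from the normalised value rather than from the unshifted absolute value, which appears to be where the asker's second version goes wrong. It assumes IEEE 754 single precision with round-to-nearest, ties-to-even:

```c
#include <stdint.h>

/* Sketch: returns the IEEE 754 single-precision bit pattern of (float)x,
 * rounding to nearest with ties to even. */
uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0)
        return 0;

    uint32_t sign = (x < 0) ? UINT32_C(0x80000000) : 0u;
    uint32_t a = (x < 0) ? 0u - (uint32_t)x : (uint32_t)x;  /* |x| without overflow */

    /* Normalise: shift left until bit 31 is set, adjusting the exponent. */
    uint32_t e = 158;                         /* bias 127 + 31 */
    while (!(a & UINT32_C(0x80000000))) {
        a <<= 1;
        e--;
    }

    uint32_t m = (a >> 8) & UINT32_C(0x7FFFFF);  /* drop the implicit leading 1 */
    uint32_t guard = a & 0xFFu;                  /* the 8 bits shifted out */

    if (guard > 0x80u || (guard == 0x80u && (m & 1u)))
        m++;                                     /* nearest, ties to even */
    if (m >> 23) {                               /* rounding carried out of the mantissa */
        m = 0;
        e++;
    }
    return sign | (e << 23) | m;
}
```

For the two failing inputs quoted in the question, this sketch produces 0xCB800006 for 0xfefffff5 and 0x4AFFFFFE for 0x7fffff, which are the GOAL patterns listed there.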

Resources