C Bit-Level Int to Float Conversion Unexpected Output

Background:
I am playing around with bit-level coding (this is not homework - just curious). I found a lot of good material online and in a book called Hacker's Delight, but I am having trouble with one of the online problems.
It asks to convert an integer to a float. I used the following links as reference to work through the problem:
How to manually (bitwise) perform (float)x?
How to convert an unsigned int to a float?
http://locklessinc.com/articles/i2f/
Problem and Question:
I thought I understood the process well enough (I tried to document the process in the comments), but when I test it, I don't understand the output.
Test Cases:
float_i2f(2) returns 1073741824
float_i2f(3) returns 1077936128
I expected to see something like 2.0000 and 3.0000.
Did I mess up the conversion somewhere? I thought maybe the return value was a memory address and I was missing a step needed to access the actual number, or maybe I am printing it incorrectly. I am printing my output like this:
printf("Float_i2f ( %d ): ", 3);
printf("%u", float_i2f(3));
printf("\n");
But I thought that printing method was fine for unsigned values in C (I'm used to programming in Java).
Thanks for any advice.
Code:
/*
 * float_i2f - Return bit-level equivalent of expression (float) x
 *   Result is returned as unsigned int, but
 *   it is to be interpreted as the bit-level representation of a
 *   single-precision floating point value.
 *   Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
 *   Max ops: 30
 *   Rating: 4
 */
unsigned float_i2f(int x) {
    if (x == 0){
        return 0;
    }
    //save the sign bit for later and get the absolute value of x
    //the absolute value is needed to shift bits to put them
    //into the appropriate position for the float
    unsigned int signBit = 0;
    unsigned int absVal = (unsigned int)x;
    if (x < 0){
        signBit = 0x80000000;
        absVal = (unsigned int)-x;
    }
    //Calculate the exponent
    // Shift the input left until the high order bit is set to form the mantissa.
    // Form the floating exponent by subtracting the number of shifts from 158.
    unsigned int exponent = 158; //158 = 127 (the bias) + 31 (the high-order bit position)
    while ((absVal & 0x80000000) == 0){ //loop until the high-order bit is set
        exponent--;
        absVal <<= 1;
    }
    //find the mantissa (bit shift to the right)
    unsigned int mantissa = absVal >> 8;
    //place the exponent bits in the right place
    exponent = exponent << 23;
    //get the mantissa
    mantissa = mantissa & 0x7fffff;
    //return the reconstructed float
    return signBit | exponent | mantissa;
}

Continuing from the comment. Your code is correct; you are simply looking at the unsigned integer made up of the same bits as your IEEE-754 single-precision floating point number. The IEEE-754 single-precision format (sign, biased exponent, and mantissa) can be interpreted as a float, or those same 32 bits can be interpreted as an unsigned integer (just the number made up by the 32 bits). You are outputting the unsigned equivalent of the floating point number.
You can confirm with a simple union. For example:
#include <stdio.h>
#include <stdint.h>
typedef union {
    uint32_t u;
    float f;
} u2f;

int main (void) {
    u2f tmp = { .f = 2.0 };
    printf ("\n u : %u\n f : %f\n", tmp.u, tmp.f);
    return 0;
}
Example Usage/Output
$ ./bin/unionuf
u : 1073741824
f : 2.000000
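If you would rather not rely on union type-punning (sanctioned in C, but not in C++), memcpy performs the same reinterpretation - a minimal sketch using your own test value:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main (void) {
    uint32_t u = 1073741824u;   /* the bits float_i2f(2) returns */
    float f;
    memcpy (&f, &u, sizeof f);  /* reinterpret the same 32 bits as a float */
    printf (" u : %u\n f : %f\n", u, f);   /* u : 1073741824  f : 2.000000 */
    return 0;
}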
Let me know if you have any further questions. It's good to see that your study resulted in the correct floating point conversion. (also note the second comment regarding truncation/rounding)

I'll just chime in here, because nothing specifically about endianness has been addressed. So let's talk about it.
The construction of the value in the original question was endianness-agnostic, using shifts and other bitwise operations. This means that regardless of whether your system is big- or little-endian, the actual value will be the same. The difference will be its byte order in memory.
The generally accepted convention for serializing IEEE-754 values is big-endian byte order (although I believe there is no formal specification of this, and therefore no requirement on implementations to follow it). In memory, however, a float is stored in the platform's native byte order, which on almost all current systems matches the integer byte order.
So, you can use this approach combined with a union if and only if you know that the endianness of floats and integers on your system is the same.
On the common Intel-based architectures this is fine: integers and floats are both stored little-endian, so they share a byte order and the reinterpretation works directly. If you need the representation in big-endian byte order regardless of the host (for example, to serialize it to a file), you can repack the bytes explicitly:
uint32_t n = float_i2f( input_val );
uint8_t bytes[4] = {
    (uint8_t)((n >> 24) & 0xff),
    (uint8_t)((n >> 16) & 0xff),
    (uint8_t)((n >> 8) & 0xff),
    (uint8_t)(n & 0xff)
};
To reinterpret the value as a float on the same machine, copying the integer's own bytes is enough (memcpy is from <string.h>):
float fval;
memcpy( &fval, &n, sizeof(float) );
I'll stress that you only need to worry about this if you are trying to reinterpret your integer representation as a float or the other way round.
If you're only trying to output what the representation is in bits, then you don't need to worry. You can just display your integer in a useful form such as hex:
printf( "0x%08x\n", n );

Related

Extract k bits from any side of hex notation

int X = 0x1234ABCD;
int Y = 0xcdba4321;
// a) print the lower 10 bits of X in hex notation
int output1 = X & 0xFF;
printf("%X\n", output1);
// b) print the upper 12 bits of Y in hex notation
int output2 = Y >> 20;
printf("%X\n", output2);
I want to print the lower 10 bits of X in hex notation; since each character in hex is 4 bits, FF = 8 bits, would it be right to & with 0x2FF to get the lower 10 bits in hex notation.
Also, would shifting right by 20 drop all 20 bits at the end, and keep the upper 12 bits only?
I want to print the lower 10 bits of X in hex notation; since each character in hex is 4 bits, FF = 8 bits, would it be right to & with 0x2FF to get the lower 10 bits in hex notation.
No, that would be incorrect. You'd want to use 0x3FF to get the low 10 bits (0x2FF in binary is 1011111111, which has a hole at bit 8). If you're a little uncertain with hex values, an easier way to do that these days is via binary constants (standard since C23, and a common compiler extension before that), e.g.
// mask lowest ten bits in hex
int output1 = X & 0x3FF;
// mask lowest ten bits in binary
int output1 = X & 0b1111111111;
Also, would shifting right by 20 drop all 20 bits at the end, and keep the upper 12 bits only?
In the case of LEFT shift, zeros will be shifted in from the right, and the higher bits will be dropped.
In the case of RIGHT shift, it depends on the sign of the data type you are shifting.
// unsigned right shift
unsigned U = 0x80000000;
U = U >> 20;
printf("%x\n", U); // prints: 800
// signed right shift
int S = 0x80000000;
S = S >> 20;
printf("%x\n", S); // prints: fffff800
Signed right-shift typically shifts the highest bit in from the left. Unsigned right-shift always shifts in zero.
As an aside: IIRC the C standard is a little vague with respect to signed integer shifts. I believe it is theoretically possible for a hardware platform to shift in zeros on a signed right shift (e.g. some microcontrollers). Most typical platforms (Intel/Arm) will shift in the highest bit though.
Assuming 32 bit int, then you have the following problems:
0xcdba4321 is too large to fit inside an int. The hex constant itself will actually be unsigned int in this specific case, because of an oddball type rule in C. From there you force an implicit conversion to int, likely ending up with a negative number.
Y >> 20 right shifts a negative number, which is non-portable behavior. It can either shift in ones (arithmetic shift) or zeroes (logical shift), depending on compiler. Whereas right shifting unsigned types is well-defined and always results in logical shift.
& 0xFF masks out 8 bits, not 10.
%X expects an unsigned int, not an int.
The root of all your problems is "sloppy typing" - that is, writing int all over the place when you actually need a more suitable type. You should start using the portable types from stdint.h instead, in this case uint32_t. Also make a habit of always ending your hex constants with a u or U suffix.
A fixed program:
#include <stdio.h>
#include <stdint.h>

int main (void)
{
    uint32_t X = 0x1234ABCDu;
    uint32_t Y = 0xcdba4321u;

    printf("%X\n", X & 0x3FFu);
    printf("%X\n", Y >> (32-12));
}
The 0x3FFu mask can also be written as ( (1u<<10) - 1).
(Strictly speaking you need to printf the stdint.h types using specifiers from inttypes.h but lets not confuse the answer by introducing those at the same time.)
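If you build such masks often, a helper macro keeps the intent visible - a small sketch (the LOW_BITS name is my own):

#include <stdio.h>

// Mask of the n lowest bits; valid for 1 <= n <= 31 with 32-bit unsigned.
#define LOW_BITS(n) ((1u << (n)) - 1u)

int main (void)
{
    printf("%X\n", 0x1234ABCDu & LOW_BITS(10));  // prints 3CD
    return 0;
}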
Lots of high-value answers to this question.
Here's more info that might spark curiosity...
#include <stdio.h>
#include <stdint.h>

int main() {
    uint32_t X;

    X = 0x1234ABCDu;         // your first hex number
    printf( "%X\n", X );

    X &= ((1u<<12)-1)<<20;   // mask 12 bits, shifting mask left
    printf( "%X\n", X );

    X = 0x1234ABCDu;         // your first hex number
    X &= ~0u^(~0u>>12);
    printf( "%X\n", X );

    X = 0x0234ABCDu;         // Note leading 0 printed in two styles
    printf( "%X %08X\n", X, X );

    return 0;
}
1234ABCD
12300000
12300000
234ABCD 0234ABCD
print the upper 12 bits of Y in hex notation
To handle this when the width of int is not known, first determine the width with code like sizeof(unsigned)*CHAR_BIT. (C specifies it must be at least 16-bit.)
Best to use unsigned or mask the shifted result with an unsigned.
#include <limits.h>

int output2 = Y;
printf("%X\n", (unsigned) output2 >> (sizeof(unsigned)*CHAR_BIT - 12));
// or
printf("%X\n", (output2 >> (sizeof output2 * CHAR_BIT - 12)) & 0xFFFu);
Rare non-2's complement encoded int needs additional code - not shown.
Very rare padded int needs other bit width detection - not shown.
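For reference, a self-contained version of the width-agnostic idea (a sketch, reusing the question's Y value):

#include <stdio.h>
#include <limits.h>

int main(void) {
    unsigned Y = 0xcdba4321u;
    unsigned width = sizeof Y * CHAR_BIT;  // at least 16 by the standard
    printf("%X\n", Y >> (width - 12));     // upper 12 bits: prints CDB with 32-bit unsigned
    return 0;
}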

Cast Integer to Float using Bit Manipulation breaks on some integers in C

Working on a class assignment, I'm trying to cast an integer to a float only using bit manipulations (limited to any integer/unsigned operations incl. ||, &&. also if, while). My code is working for most values, but some values are not generating the results I'm looking for.
For example, if x is 0x807fffff, I get 0xceff0001, but the correct result should be 0xceff0000. I think I'm missing something with my mantissa and rounding, but can't quite pin it down. I've looked at some other threads on SO as well: converting-int-to-float and how-to-manually.
unsigned dl22(int x) {
    int tmin = 0x1 << 31;
    int tmax = ~tmin;
    unsigned signBit = 0;
    unsigned exponent;
    unsigned mantissa;
    int bias = 127;

    if (x == 0) {
        return 0;
    }
    if (x == tmin) {
        return 0xcf << 24;
    }
    if (x < 0) {
        signBit = x & tmin;
        x = (~x + 1);
    }

    exponent = bias + 31;
    while ( ( x & tmin) == 0 ) {
        exponent--;
        x <<= 1;
    }
    exponent <<= 23;

    int mantissaMask = ~(tmin >> 8);
    mantissa = (x >> 8) & mantissaMask;
    return (signBit | exponent | mantissa);
}
EDIT/UPDATE
Found a viable solution - see below
Your code produces the expected output for me on the example you presented. As discussed in comments, however, from C's perspective it does exhibit undefined behavior -- not just in the computation of tmin, but also, for the same reason, in the loop wherein you compute the exponent. To whatever extent this code produces results that vary from environment to environment, that will follow either from the undefined behavior or from your assumption about the size of [unsigned] int being incorrect for the C implementation in use.
Nevertheless, if we assume (unsafely)
that shifts of ints operate as if the left operand were reinterpreted as an unsigned int with the same bit pattern, operated upon, and the resulting bit pattern reinterpreted as an int, and
that int and unsigned int are at least 32 bits wide,
then your code seems correct, modulo rounding.
In the event that the absolute value of the input int has more than 24 significant binary digits (i.e. it is at least 2^24), however, some precision will be lost in the conversion. In that case the correct result will depend on the FP rounding mode you intend to implement. An incorrectly rounded result will be off by 1 unit in the last place; how many results that affects depends on the rounding mode.
Simply truncating / shifting off the extra bits as you do yields round toward zero mode. That's one of the standard rounding modes, but not the default. The default rounding mode is to round to the nearest representable number, with ties being resolved in favor of the result having least-significant bit 0 (round to even); there are also three other standard modes. To implement any mode other than round-toward-zero, you'll need to capture the 8 least-significant bits of the significand after scaling and before shifting them off. These, together with other details depending on the chosen rounding mode, will determine how to apply the correct rounding.
About half of the 32-bit two's complement numbers will be rounded differently when converted in round-to-zero mode than when converted in any one of the other modes; which numbers exhibit a discrepancy depends on which rounding mode you consider.
I didn't originally mention that I am trying to imitate a U2F union statement:
float u2f(unsigned u) {
    union {
        unsigned u;
        float f;
    } a;
    a.u = u;
    return a.f;
}
Thanks to guidance provided in the post ieee-754-bit-manipulation-rounding-error I was able to manage the rounding issues by putting the following after my while statement. This clarified the rounding that was occurring.
lsb = (x >> 8) & 1;
roundBit = (x >> 7) & 1;
stickyBitFlag = !!(x & 0x7F);
exponent <<= 23;
int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8);
mantissa &= mantissaMask;
roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);
return (signBit | exponent | mantissa) + roundBit;
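For completeness, here is that fragment folded back into the full function - a sketch assembled from the code in this thread (declarations added; it still relies on the shift assumptions discussed above):

unsigned dl22(int x) {
    int tmin = 0x1 << 31;   // still formally UB, as discussed above
    unsigned signBit = 0;
    unsigned exponent, mantissa;
    int lsb, roundBit, stickyBitFlag;
    int bias = 127;

    if (x == 0) {
        return 0;
    }
    if (x == tmin) {
        return 0xcf << 24;
    }
    if (x < 0) {
        signBit = x & tmin;
        x = (~x + 1);
    }

    exponent = bias + 31;
    while ((x & tmin) == 0) {
        exponent--;
        x <<= 1;
    }

    // capture the round/sticky information before shifting it away
    lsb = (x >> 8) & 1;
    roundBit = (x >> 7) & 1;
    stickyBitFlag = !!(x & 0x7F);

    exponent <<= 23;
    mantissa = (x >> 8) & ~(tmin >> 8);

    // round to nearest, ties to even; a carry can ripple into the exponent, which is correct
    roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);
    return (signBit | exponent | mantissa) + roundBit;
}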

Convert Raw 14 bit Two's Complement to Signed 16 bit Integer

I am doing some work in embedded C with an accelerometer that returns data as a 14 bit 2's complement number. I am storing this result directly into a uint16_t. Later in my code I am trying to convert this "raw" form of the data into a signed integer to represent / work with in the rest of my code.
I am having trouble getting the compiler to understand what I am trying to do. In the following code I'm checking if the 14th bit is set (meaning the number is negative) and then I want to invert the bits and add 1 to get the magnitude of the number.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range) {
    int16_t raw_signed;

    if(raw & _14BIT_SIGN_MASK) {
        // Convert 14 bit 2's complement to 16 bit 2's complement
        raw |= (1 << 15) | (1 << 14); // 2's complement extension
        raw_signed = -(~raw + 1);
    }
    else {
        raw_signed = raw;
    }

    uint16_t divisor;
    if(range == FXLS8471QR1_FS_RANGE_2G) {
        divisor = FS_DIV_2G;
    }
    else if(range == FXLS8471QR1_FS_RANGE_4G) {
        divisor = FS_DIV_4G;
    }
    else {
        divisor = FS_DIV_8G;
    }

    return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
This code unfortunately doesn't work. The disassembly shows me that for some reason the compiler is optimizing out my statement raw_signed = -(~raw + 1);. How do I achieve the result I desire?
The math works out on paper, but I feel like for some reason the compiler is fighting with me :(.
Converting the 14 bit 2's complement value to 16 bit signed, while maintaining the value, is simply a matter of:
int16_t accel = (int16_t)(raw << 2) / 4 ;
The left-shift pushes the sign bit into the 16 bit sign bit position, and the divide by four restores the magnitude while maintaining its sign. The divide avoids the implementation-defined behaviour of a right-shift of a signed value, but will normally compile to a single arithmetic-shift-right on instruction sets that allow it. The cast is necessary because raw << 2 is an int expression, and unless int is 16 bit, the divide would simply restore the original value.
It would be simpler however to just shift the accelerometer data left by two bits and treat it as if the sensor was 16 bit in the first place. Normalising everything to 16 bit has the benefit that the code needs no change if you use a sensor with any number of bits up-to 16. The magnitude will simply be four times greater, and the least significant two bits will be zero - no information is gained or lost, and the scaling is arbitrary in any case.
int16_t accel = raw << 2 ;
In both cases, if you want the unsigned magnitude then that is simply:
int32_t mag = (int32_t)labs( (int)accel ) ;
I would do simple arithmetic instead. The raw value is 14-bit two's complement, delivered as a number from 0 to 2^14 - 1. Test whether the number is 2^13 or above (signifying a negative value) and if so subtract 2^14.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range)
{
    int16_t raw_signed = raw;
    if(raw_signed >= 1 << 13) {
        raw_signed -= 1 << 14;
    }

    uint16_t divisor;
    if(range == FXLS8471QR1_FS_RANGE_2G) {
        divisor = FS_DIV_2G;
    }
    else if(range == FXLS8471QR1_FS_RANGE_4G) {
        divisor = FS_DIV_4G;
    }
    else {
        divisor = FS_DIV_8G;
    }

    return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
Please check my arithmetic. (Do I have 13 and 14 correct?)
Supposing that int in your particular C implementation is 16 bits wide, the expression (1 << 15), which you use in mangling raw, produces undefined behavior. In that case, the compiler is free to generate code to do pretty much anything -- or nothing -- if the branch of the conditional is taken wherein that expression is evaluated.
Also if int is 16 bits wide, then the expression -(~raw + 1) and all intermediate values will have type unsigned int == uint16_t. This is a result of "the usual arithmetic conversions", given that (16-bit) int cannot represent all values of type uint16_t. The result will have the high bit set and therefore be outside the range representable by type int, so assigning it to an lvalue of type int produces implementation-defined behavior. You'd have to consult your documentation to determine whether the behavior it defines is what you expected and wanted.
If you instead perform a 14-bit sign conversion, forcing the higher-order bits off ((~raw + 1) & 0x3fff) then the result -- the inverse of the desired negative value -- is representable by a 16-bit signed int, so an explicit conversion to int16_t is well-defined and preserves the (positive) value. The result you want is the inverse of that, which you can obtain simply by negating it. Overall:
raw_signed = -(int16_t)((~raw + 1) & 0x3fff);
Of course, if int were wider than 16 bits in your environment then I see no reason why your original code would not work as expected. That would not invalidate the expression above, however, which produces consistently-defined behavior regardless of the size of default int.
Assuming when code reaches return ((int32_t)raw_signed ..., it has a value in the [-8192 ... +8191] range:
If RAW_SCALE_FACTOR is a multiple of 4 then a little savings can be had.
So rather than
int16_t raw_signed = raw << 2;
raw_signed >>= 2;
instead
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range){
    int16_t raw_signed = raw << 2;
    uint16_t divisor;
    ...
    // return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
    return ((int32_t)raw_signed * (RAW_SCALE_FACTOR/4)) / divisor;
}
To convert the 14-bit two's-complement into a signed value, you can flip the sign bit and subtract the offset:
int16_t raw_signed = (raw ^ (1 << 13)) - (1 << 13);
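The divide-after-shift, offset-subtraction, and sign-bit-flip formulations in the answers above all compute the same sign extension; a quick exhaustive check over the 14-bit range (a sketch, assuming the usual two's complement behavior of the int16_t conversion):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    for (unsigned raw = 0; raw < (1u << 14); raw++) {
        // shift the sign up into bit 15, then divide back down
        int16_t a = (int16_t)(raw << 2) / 4;
        // subtract the offset when the 14-bit sign bit is set
        int16_t b = (int16_t)((int)raw >= (1 << 13) ? (int)raw - (1 << 14) : (int)raw);
        // flip the sign bit and subtract
        int16_t c = (int16_t)((int)(raw ^ (1u << 13)) - (1 << 13));
        if (a != b || b != c) {
            printf("mismatch at raw = %u\n", raw);
            return 1;
        }
    }
    printf("all three methods agree over the full 14-bit range\n");
    return 0;
}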

How to generate an IEEE 754 Single-precision float using only integer arithmetic?

Assuming a low end microprocessor with no floating point arithmetic, I need to generate an IEEE 754 single precision floating point format number to push out to a file.
I need to write a function that takes three integers being the sign, whole and the fraction and returns a byte array with 4 bytes being the IEEE 754 single precision representation.
Something like:
// Convert 75.65 to 4 byte IEEE 754 single precision representation
char* bytes = convert(0, 75, 65);
Does anybody have any pointers or example C code please? I'm particularly struggling to understand how to convert the mantissa.
You will need to generate the sign (1 bit), the exponent (8 bits, a biased power of 2), and the fraction/mantissa (23 bits).
Bear in mind that the fraction has an implicit leading '1' bit, which means that the most significant leading '1' bit (2^22 here) is not stored in the IEEE format. For example, given a fraction of 0x755555 (23 bits), the actual bits stored would be 0x355555 (22 bits).
Also bear in mind that the fraction is shifted so that the binary point is immediately to the right of the implicit leading '1' bit. So an IEEE 23-bit fraction of 11 0101 0101... represents the 24-bit binary fraction 1.11 0101 0101...
This means that the exponent has to be adjusted accordingly.
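To make this concrete, here is the question's 75.65 worked through by hand (my arithmetic, rounding to nearest):

75.65    = 1001011.10100110011... (binary; 0.65 is a repeating binary fraction)
         = 1.001011101001100110011... x 2^6 (normalized)
sign     = 0
exponent = 127 + 6 = 133 = 10000101
fraction = 00101110100110011001101 (23 bits, implicit leading 1 dropped, rounded up)
result   = 0 10000101 00101110100110011001101 = 0x42974CCD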
Does the value have to be written big-endian or little-endian? Reversed bit ordering?
If the output format is up to you, consider writing the value as a string instead. That way you can convert the integer easily: just write the int part and append "e0" as the exponent (or omit the exponent and write ".0").
For the binary representation, you should have a look at Wikipedia. Best is to first assemble the bitfields into a uint32_t - the structure is given in the linked article. Note that you might have to round if the integer has more than 23 value bits. Remember to normalize the generated value.
The second step will be to serialize the uint32_t to a uint8_t array; a sketch follows below. Mind the endianness of the result!
Also note to use uint8_t for the result if you really want 8 bit values; you should use an unsigned type. For the intermediate representation, using uint32_t is recommended as that will guarantee you operate on 32 bit values.
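A sketch of that serialization step, big-endian shown (the pack_be32 name is my own; pick the byte order your file format requires):

#include <stdint.h>

/* Write v into out[0..3] in big-endian byte order. */
void pack_be32(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}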
You haven't had a go yet, so no giveaways.
Remember that you can regard two 32-bit integers a and b, interpreted as the fixed-point number a.b, as a single 64-bit integer (a*2^32 + b) with an exponent of 2^-32 (where ^ denotes exponentiation).
So without doing anything you've got it in the form:
s * m * 2^e
The only problem is your mantissa is too long and your number isn't normalized.
A bit of shifting and adding/subtracting with a possible rounding step and you're done.
You can use a software floating point compiler/library.
See https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
The basic premise, with a binary32 float as the goal, is to:
Form a binary fixed-point representation of the combined whole and fractional (hundredths) parts. This code uses a structure encoding the whole and hundredths fields separately. It is important that the whole field is at least 32 bits.
Shift left/right (*2 and /2) until the MSbit is in the implied bit position, counting the shifts as you go. A robust solution also notes any non-zero bits shifted out (the "sticky" bit).
Form a biased exponent.
Round the mantissa and drop the implied bit.
Form the sign (not done here).
Combine the above to form the answer.
As sub-normals, infinities and Not-A-Number cannot result from whole, hundredths input, generating those float special cases is not addressed here.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define IMPLIED_BIT 0x00800000L

typedef struct {
    int_least32_t whole;
    int hundreth;
} x_xx;

int_least32_t covert(int whole, int hundreth) {
    assert(whole >= 0 && hundreth >= 0 && hundreth < 100);
    if (whole == 0 && hundreth == 0) return 0;

    x_xx x = { whole, hundreth };
    int_least32_t expo = 0;
    int sticky_bit = 0; // Note any 1 bits shifted out

    while (x.whole >= IMPLIED_BIT * 2) {
        expo++;
        sticky_bit |= x.hundreth % 2;
        x.hundreth /= 2;
        x.hundreth += (x.whole % 2) * (100 / 2);
        x.whole /= 2;
    }
    while (x.whole < IMPLIED_BIT) {
        expo--;
        x.hundreth *= 2;
        x.whole *= 2;
        x.whole += x.hundreth / 100;
        x.hundreth %= 100;
    }

    int32_t mantissa = x.whole;
    // Round to nearest - ties to even
    if (x.hundreth >= 100 / 2 && (x.hundreth > 100 / 2 || x.whole % 2 || sticky_bit)) {
        mantissa++;
    }
    if (mantissa >= (IMPLIED_BIT * 2)) {
        mantissa /= 2;
        expo++;
    }
    mantissa &= ~IMPLIED_BIT; // Toss MSbit as it is implied in final
    expo += 24 + 126;         // Bias: 24 bits + binary32 bias
    expo <<= 23;              // Offset
    return expo | mantissa;
}
void test_covert(int whole, int hundreths) {
    union {
        uint32_t u32;
        float f;
    } u;
    u.u32 = covert(whole, hundreths);
    volatile float best = whole + hundreths / 100.0;
    printf("%10d.%02d --> %15.6e %15.6e Same:%d\n", whole, hundreths, u.f, best,
            best == u.f);
}

#include <limits.h>
int main(void) {
    test_covert(75, 65);
    test_covert(0, 1);
    test_covert(INT_MAX, 99);
    return 0;
}
Output
75.65 --> 7.565000e+01 7.565000e+01 Same:1
0.01 --> 1.000000e-02 1.000000e-02 Same:1
2147483647.99 --> 2.147484e+09 2.147484e+09 Same:1
Known issues: sign not applied.

Convert from binary to floating point

I'm doing some exercises for Computer Science university and one of them is about converting an int array of 64 bits into its double-precision floating point value.
Understanding the first bit, the sign +/-, is quite easy. Same for the exponent, since we know that the bias is 1023.
We are having problems with the significand. How can I calculate it?
In the end, I would like to obtain the real number that the bits represent.
Computing the significand of the given 64 bits is quite easy.
According to the wiki article on IEEE 754, the significand is made up of 53 bits (bit 0 to bit 52, including the implicit leading bit).
If a longer value has to be fitted into that width, the dropped trailing bits can round it up; for example, 11110000 11110010 11111 becomes 11110000 11110011 after rounding away the last five bits.
Because a normalized significand always starts with a 1 bit, there is no need to store that leading bit; that's why only 52 bits of the significand are stored instead of 53.
Now, to compute it, you just need to take the bit range of the significand [bit 1 - bit 52] - bit 0 is always 1 - and use it:
int index_signf = 1;          // starting at 1, not 0
int significand_length = 52;
int byteArray[53];            // array containing the bits of the significand
double significand_endValue = 0;

for( ; index_signf <= significand_length ; index_signf++ )
{
    significand_endValue += byteArray[index_signf] * pow(2, -index_signf);  // pow() from <math.h>
}
significand_endValue += 1;    // add the implicit leading bit (bit 0)
Now you just have to fill byteArray accordingly before computing it, using a function like this:
int* getSignificandBits(int* array64bits){
    // returned array - static, so the storage outlives the call
    // (returning a pointer to an automatic local would be a dangling pointer)
    static int significandBitsArray[53];
    // indexes
    int i_array64bits = 0;
    int i_significandBitsArray = 1;
    // set the first bit = 1 (the implicit bit)
    significandBitsArray[0] = 1;
    // copy the 52 stored fraction bits
    // (assuming array64bits[0] is the sign bit, so the fraction occupies indexes 12 to 63)
    for(i_significandBitsArray = 1, i_array64bits = 12; i_array64bits <= 63; i_array64bits++, i_significandBitsArray++)
        significandBitsArray[i_significandBitsArray] = array64bits[i_array64bits];
    return significandBitsArray;
}
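Once the bits are extracted, the recovered value is (-1)^sign * significand * 2^(E - 1023), where E is the stored exponent field. A sketch of the final step, assuming signBit and exponent_value were decoded the same way as the significand (both names are mine), using ldexp from <math.h>:

/* value = (-1)^sign * significand * 2^(E - 1023) */
double value = (signBit ? -1.0 : 1.0) * ldexp(significand_endValue, exponent_value - 1023);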
You could just load the bits into an unsigned integer of the same size as a double, take the address of that and cast it to a void* which you then cast to a double* and dereference.
Of course, this might be "cheating" if you really are supposed to parse the floating point standard, but this is how I would have solved the problem given the parameters you've stated so far.
If you have a byte representation of an object you can copy the bytes into the storage of a variable of the right type to convert it.
#include <string.h>   /* memcpy */
#include <stdint.h>

double convert_to_double(uint64_t x) {
    double result;
    memcpy(&result, &x, sizeof(x));
    return result;
}
You will often see code like *(double *)&x used to do the conversion, but while in practice that will usually work, it is undefined behavior in C.
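A quick sanity check of convert_to_double (the constant is my worked example: the IEEE-754 bit pattern of 75.65 as a double):

printf("%f\n", convert_to_double(0x4052E9999999999Aull));  // prints 75.650000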
