Convert 2's complement to integer and calculate RMS value - C

A similar question has been asked at "Need fastest way to convert 2's complement to decimal in C", but I couldn't use it to get my answer, so I am posting this...
I have 32-bit data coming from an audio sensor in the following format:
The data format is I2S, 24-bit, 2's complement, MSB first. The data precision is 18 bits; unused bits are zeros.
Without any audio input, I am able to read the following data from the sensor:
0xFA578000
0xFA8AC000
0xFA85C000
0xFA828000
0xFA800000
0xFA7E4000
0xFA7D0000
0xFA7BC000
and so on...
I need to use these data samples to calculate their RMS value, then further use this RMS value to calculate the decibels (20 * log(rms)).
Here is my code with comments:
//I have 32 bits, with data in the most significant 24 bits.
inputVal &= 0xFFFFFF00;  //Clear the least significant 8 bits.
inputVal >>= 8;          //Data is now in the least significant 24 bits; bit 23 (the 24th bit) is the sign bit.
inputVal &= 0x00FFFFC0;  //Clear the least significant 6 bits, since the data precision is 18 bits.

//So I have 24-bit data with the 6 LSBs cleared; bit 23 is the sign bit.
//Converting from 2's complement.
const int negative = (inputVal & (1 << 23)) != 0;
int nativeInt;
if (negative)
    nativeInt = inputVal | ~((1 << 24) - 1);
else
    nativeInt = inputVal;
return (nativeInt * nativeInt); //Return the squared value for the RMS calculation.
After this, I take the average of the sum of squared values and calculate its square root to get the RMS value.
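For reference, here is a minimal sketch of that last step, assuming the squared samples have been accumulated into a double (the function name and parameters are illustrative, not from my actual code):

#include <math.h>

// RMS = sqrt(mean of the squared samples); decibels = 20 * log10(rms).
double decibels_from_sum_of_squares(double sumOfSquares, int n)
{
    double rms = sqrt(sumOfSquares / n);
    return 20.0 * log10(rms);
}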
My questions are,
Am I doing the data bit-manipulations correctly?
Is it necessary to convert the data samples from 2's complement to integer to calculate their RMS values?
***********************************************Part-2*****************************************************
Continuing further with @Johnny Johansson's answer:
It looks like all your sample values are close to -6800, so I assume that is an offset that you need to account for.
To normalize the sample set, I have calculated the mean value of the sample set and subtracted it from each value in the set.
Then I found the maximum and minimum values from the sample set and calculated the peak-to-peak value.
// I have the sample set; get the mean.
float meanval = 0;
for (int i = 0; i < actualNumberOfSamples; i++)
{
    meanval += samples[i];
}
meanval /= actualNumberOfSamples;
printf("Average is: %f\n", meanval);

// Subtract it from all samples to get a 'normalized' output.
for (int i = 0; i < actualNumberOfSamples; i++)
{
    samples[i] -= meanval;
}

// Find the 'peak to peak' max.
float minsample = 100000;
float maxsample = -100000;
float peakToPeakMax = 0.0;
for (int i = 0; i < actualNumberOfSamples; i++)
{
    minsample = fmin(minsample, samples[i]);
    maxsample = fmax(maxsample, samples[i]);
}
peakToPeakMax = (maxsample - minsample);
printf("The peak-to-peak maximum value is: %f\n", peakToPeakMax);
(This does not include the RMS part, which comes after you have correct signed integer values)
Now I calculate the RMS value by dividing the peak-to-peak value by the square root of 2.
Then, 20 * log10(rms) gives me the corresponding decibel value.
rmsValue = peak2peakValue / sqrt2;
DB_Val = 20 * log10(rmsValue);
Does the above code take care of the "offset" that you mentioned?
I have yet to find a test plan to verify the calculated decibels, but have I calculated the decibel value correctly, mathematically?

The 2's complement part seems like it should work, but it is unnecessarily complicated, since regular integers are represented using 2's complement (unless you are on some very exotic hardware). You could simply do this instead:
signed int signedInputVal = (signed int)inputVal;
signedInputVal >>= 14;
This will give you a value in the range -(2^17) to (2^17-1).
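As a concrete check, here is a small sketch applying this to the first sample word from the question (note that both the out-of-range unsigned-to-signed cast and the right shift of a negative value are implementation-defined in C, but they behave as shown on common two's-complement platforms):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t inputVal = 0xFA578000u;          // raw 32-bit word from the sensor
    int32_t sample = (int32_t)inputVal >> 14; // arithmetic shift keeps the sign
    printf("%ld\n", (long)sample);            // prints -5794
    return 0;
}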
It looks like all your sample values are close to -6800, so I assume that is an offset that you need to account for.
(This does not include the RMS part, which comes after you have correct signed integer values)

Related

How is float to int type conversion done in C? [duplicate]

I was wondering if you could help explain the process of converting an integer to a float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding of the casting from type to type will help me more at this stage.
From what I know so far, for int to float, you have to convert the integer into binary, normalize the value by finding the significand, exponent, and fraction, and then output the value as a float from there?
As for float to int, you have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?
I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.
Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single-precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0. (The Wikipedia article linked below includes a diagram illustrating this layout.)
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e. 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
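To make the decoding concrete, here is a small sketch that pulls apart the three fields of 0x40000000 exactly as described above (normalized values only; zero, subnormals, infinities and NaNs are ignored):

#include <stdint.h>
#include <stdio.h>
#include <math.h>

int main(void)
{
    uint32_t bits = 0x40000000u;
    uint32_t sign = bits >> 31;                       // bit 31
    int      exp  = (int)((bits >> 23) & 0xFF) - 127; // bits 30 .. 23, bias removed
    uint32_t frac = bits & 0x7FFFFF;                  // bits 22 .. 0
    double value  = (sign ? -1.0 : 1.0)
                  * (1.0 + frac / 8388608.0)          // restore the hidden 1; 2^23 = 8388608
                  * pow(2.0, exp);
    printf("%g\n", value);                            // prints 2
    return 0;
}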
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return. This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating-point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If a value were >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
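You can observe this rounding directly: 2^24 + 1 needs 25 significant bits, one more than the float's 24-bit significand can hold, so the conversion drops the LSB:

#include <stdio.h>

int main(void)
{
    float f = (float)16777217; // 2^24 + 1
    printf("%.1f\n", f);       // prints 16777216.0 on IEEE-754 systems
    return 0;
}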
Have you checked the IEEE 754 floating-point representation?
In 32-bit normalized form, it has a sign bit, an 8-bit exponent (excess-127), and a 23-bit mantissa written positionally except that the leading "1." is dropped (it is always there in normalized form) and the radix is 2, not 10. That is: the MSB of the stored fraction has value 1/2, the next bit 1/4, and so on.
Joe Z's answer is elegant, but its range of input values is highly limited. A 32-bit float can store all integer values from the following range:
[-2^24 ... +2^24] = [-16777216 ... +16777216]
and some other values outside this range.
The whole range would be covered by this:
#include <limits.h> // for INT_MIN

float int2float(int value)
{
    // Handles all values from [-2^24 ... 2^24];
    // outside this range only some integers may be represented exactly.
    // This method uses truncation as the 'rounding mode' during conversion.

    // We can safely reinterpret it as 0.0.
    if (value == 0) return 0.0;

    if (value == (1U << 31)) // i.e. -2^31
    {
        // -(-2^31) = -2^31, so we'll not be able to handle it below - use a constant
        // value = 0xCF000000;
        return (float)INT_MIN; // *((float*)&value); is undefined behaviour
    }

    int sign = 0;

    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }

    // A right shift of a negative signed value is implementation-defined
    // (though all compilers I know do an arithmetic shift, copying the sign
    // into the MSB), so to stay well-defined an unsigned abs_value_copy is
    // used for the shifts.
    unsigned int abs_value_copy = value;

    // find the leading one
    int bit_num = 31;
    int shift_count = 0;
    for (; bit_num >= 0; bit_num--) // >= 0 so that value == 1 (bit 0) is found too
    {
        if (abs_value_copy & (1U << bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }

    // the exponent is biased by 127
    int exp = bit_num + 127;

    // clear the leading 1 (bit #23); it is implicit and not stored
    int coeff = abs_value_copy & ~(1 << 23);

    // move the exponent into place
    exp <<= 23;

    union
    {
        int rint;
        float rfloat;
    } ret = { sign | exp | coeff };

    return ret.rfloat;
}
Of course there are other means to find the absolute value of an int (branchless ones, for example). Similarly, counting leading zeros can also be done without a branch, so treat this as an example ;-).
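A few quick sanity checks for int2float, assuming IEEE-754 single precision (the 2^24 case exercises the shift-right branch):

#include <assert.h>

int main(void)
{
    // int2float as defined above
    assert(int2float(0) == 0.0f);
    assert(int2float(1) == 1.0f);
    assert(int2float(-5) == -5.0f);
    assert(int2float(16777216) == 16777216.0f); // 2^24, still exact
    return 0;
}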

Convert Raw 14 bit Two's Complement to Signed 16 bit Integer

I am doing some work in embedded C with an accelerometer that returns data as a 14-bit 2's complement number. I am storing this result directly into a uint16_t. Later in my code I am trying to convert this "raw" form of the data into a signed integer to represent and work with in the rest of my code.
I am having trouble getting the compiler to understand what I am trying to do. In the following code I'm checking if the 14th bit is set (meaning the number is negative) and then I want to invert the bits and add 1 to get the magnitude of the number.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range) {
    int16_t raw_signed;
    if (raw & _14BIT_SIGN_MASK) {
        // Convert 14 bit 2's complement to 16 bit 2's complement
        raw |= (1 << 15) | (1 << 14); // 2's complement extension
        raw_signed = -(~raw + 1);
    }
    else {
        raw_signed = raw;
    }
    uint16_t divisor;
    if (range == FXLS8471QR1_FS_RANGE_2G) {
        divisor = FS_DIV_2G;
    }
    else if (range == FXLS8471QR1_FS_RANGE_4G) {
        divisor = FS_DIV_4G;
    }
    else {
        divisor = FS_DIV_8G;
    }
    return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
This code unfortunately doesn't work. The disassembly shows me that for some reason the compiler is optimizing out my statement raw_signed = -(~raw + 1);. How do I achieve the result I desire?
The math works out on paper, but I feel like the compiler is fighting with me :(.
Converting the 14-bit 2's complement value to 16-bit signed, while maintaining the value, is simply a matter of:
int16_t accel = (int16_t)(raw << 2) / 4 ;
The left shift pushes the sign bit into the 16-bit sign-bit position, and the divide by four restores the magnitude while maintaining its sign. The divide avoids the implementation-defined behaviour of a right shift, but will normally compile to a single arithmetic-shift-right on instruction sets that allow it. The cast is necessary because raw << 2 is an int expression, and unless int is 16 bits, the divide would otherwise simply restore the original value.
It would be simpler, however, to just shift the accelerometer data left by two bits and treat it as if the sensor were 16-bit in the first place. Normalizing everything to 16 bits has the benefit that the code needs no change if you use a sensor with any number of bits up to 16. The magnitude will simply be four times greater, and the least significant two bits will be zero - no information is gained or lost, and the scaling is arbitrary in any case.
int16_t accel = raw << 2 ;
In both cases, if you want the unsigned magnitude then that is simply:
int32_t mag = (int32_t)labs( (int)accel ) ;
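As a worked check of the shift-and-divide trick, take the 14-bit pattern 0x3FFF, which represents -1 (the out-of-range cast back to int16_t is implementation-defined, but yields -4 on two's-complement machines):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t raw = 0x3FFF;                   // 14-bit two's complement for -1
    int16_t accel = (int16_t)(raw << 2) / 4; // 0xFFFC as int16_t is -4; -4 / 4 = -1
    printf("%d\n", accel);                   // prints -1
    return 0;
}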
I would do simple arithmetic instead. The result is 14-bit signed, which arrives as a number from 0 to 2^14 - 1. Test whether the number is 2^13 or above (signifying a negative value) and then subtract 2^14.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range)
{
    int16_t raw_signed = raw;
    if (raw_signed >= 1 << 13) {
        raw_signed -= 1 << 14;
    }
    uint16_t divisor;
    if (range == FXLS8471QR1_FS_RANGE_2G) {
        divisor = FS_DIV_2G;
    }
    else if (range == FXLS8471QR1_FS_RANGE_4G) {
        divisor = FS_DIV_4G;
    }
    else {
        divisor = FS_DIV_8G;
    }
    return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
Please check my arithmetic. (Do I have 13 and 14 correct?)
Supposing that int in your particular C implementation is 16 bits wide, the expression (1 << 15), which you use in mangling raw, produces undefined behavior. In that case, the compiler is free to generate code to do pretty much anything -- or nothing -- if the branch of the conditional is taken wherein that expression is evaluated.
Also if int is 16 bits wide, then the expression -(~raw + 1) and all intermediate values will have type unsigned int == uint16_t. This is a result of "the usual arithmetic conversions", given that (16-bit) int cannot represent all values of type uint16_t. The result will have the high bit set and therefore be outside the range representable by type int, so assigning it to an lvalue of type int produces implementation-defined behavior. You'd have to consult your documentation to determine whether the behavior it defines is what you expected and wanted.
If you instead perform a 14-bit sign conversion, forcing the higher-order bits off ((~raw + 1) & 0x3fff) then the result -- the inverse of the desired negative value -- is representable by a 16-bit signed int, so an explicit conversion to int16_t is well-defined and preserves the (positive) value. The result you want is the inverse of that, which you can obtain simply by negating it. Overall:
raw_signed = -(int16_t)((~raw + 1) & 0x3fff);
Of course, if int were wider than 16 bits in your environment then I see no reason why your original code would not work as expected. That would not invalidate the expression above, however, which produces consistently-defined behavior regardless of the size of default int.
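A quick check of that expression with an illustrative value: for raw = 0x3FFF, the 14-bit pattern for -1, ~raw + 1 is 0xFFFFC001 (on a 32-bit int), masking with 0x3fff leaves 1, and negating gives -1 as desired:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t raw = 0x3FFF;
    int16_t raw_signed = -(int16_t)((~raw + 1) & 0x3fff);
    printf("%d\n", raw_signed); // prints -1
    return 0;
}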
Assuming that when the code reaches return ((int32_t)raw_signed ..., raw_signed has a value in the [-8192 ... +8191] range:
If RAW_SCALE_FACTOR is a multiple of 4 then a little savings can be had.
So rather than
int16_t raw_signed = raw << 2;
raw_signed >>= 2;
instead
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range) {
    int16_t raw_signed = raw << 2;
    uint16_t divisor;
    ...
    // return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
    return ((int32_t)raw_signed * (RAW_SCALE_FACTOR / 4)) / divisor;
}
To convert the 14-bit two's complement value to a signed value, you can flip the sign bit and subtract the offset:
int16_t raw_signed = (raw ^ (1 << 13)) - (1 << 13);
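The XOR turns the 14-bit sign bit into an excess-8192 offset, and the subtraction removes that offset. Two boundary checks:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t neg = 0x2000; // 14-bit pattern for -8192 (most negative)
    uint16_t pos = 0x1FFF; // 14-bit pattern for +8191 (most positive)
    printf("%d\n", (int16_t)((neg ^ (1 << 13)) - (1 << 13))); // prints -8192
    printf("%d\n", (int16_t)((pos ^ (1 << 13)) - (1 << 13))); // prints 8191
    return 0;
}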

Data conversion from accelerometer

Hi all, I am working on an accelerometer (BMA220), and its datasheet says that the data is in 2's complement form. So what I had to do was get that 8-bit data into any 8-bit signed char, and done.
The BMA220 has an 8-bit register of which the first 6 bits are data and the last two are zero.
void properdata(int16_t *msgData)
{
    printf("\nin proper data\n");
    int16_t temp, i;
    for (i = 0; i < 3; i++)
    {
        temp = *(msgData + i);
        printf("temp = %d sense = %d\n", temp, sense);
        temp = temp >> 2;    // only 6 bits of data
        temp = temp / sense; // decimal value * .0625 = value in g
        printf("temp = %d\n", temp);
    }
}
In this program I am taking data in an unsigned variable msgData and doing all the calculations on a signed variable. I just need to know if this is the correct way to convert the data?
After some suggestions, I changed my code to this:
void properdata(uint16_t *msgData)
{
    int arr[3];
    arr[0] = msgData[0];
    arr[1] = msgData[1];
    arr[2] = msgData[2];
    arr[0] = arr[0] / 4;
    arr[1] = arr[1] / 4;
    arr[2] = arr[2] / 4;
    printf("x = %d y = %d z = %d\n", arr[0], arr[1], arr[2]);
}
Now, in a standstill condition, I am getting data such as 61, 60 and 17. I think the data should be in the range -32 to 31, but here it is coming out of range?
"In this program I am taking data in an unsigned variable msgData."
No, you aren't. msgData is a signed variable.
"I just need to know if this is the correct way to convert the data?"
Using bitwise operators on signed variables is almost always a bug. You perform a right shift on a signed variable; this is implementation-defined behaviour, and what happens to the sign bit depends on the compiler.
I see two problems in your code:
1) As Lundin pointed out, shifting negative values to the right is dangerous, since the behaviour is compiler-specific.
2) According to the datasheet, the range of the accelerometer is +1.94 ... -2.00 g, but you try to store the value as a plain integer. At least fixed-point arithmetic is needed here (or float), or your result will just be 1, 0, -1 or -2.
The following code should take these points into account (not tested):
int16_t raw;   // the 8-bit raw value from the chip
int32_t accel; // acceleration in mg

raw = (int16_t)read_value_from_chip(); // get the 8-bit raw value from the chip
accel = (int32_t)(raw / 4) * 625;      // use division instead of a right shift
if (accel >= 0)
    accel = (accel + 5) / 10;
else
    accel = (accel - 5) / 10;
printf("%ld\n", (long)accel);          // cast so %ld matches on all platforms
Explanation:
According to the datasheet, the resolution is 62.5 mg and the most significant 6 bits hold the signed raw value.
To avoid dealing with the sign explicitly when bringing the bits into position, division is used here instead of a right shift: dividing by 4 replaces >> 2. This keeps the sign as required.
An optimizing compiler will replace this division by a bit shift if the compiler/MCU fills the vacated bits with ones when negative values are shifted right. If the compiler/MCU does not support this, an actual division will be used.
The * 625 gives the acceleration at the required resolution of 1/10 mg (1 digit is 0.1 mg); 625 is the short form of 0.0625 * 10000.
To get it in mg, the acceleration is divided by 10 (I do this here simply because mg is handier than 0.1 mg). To round correctly, half of the divisor must be added/subtracted according to the sign before dividing; here this is 10/2 = 5.
The result is now in mg.
If you want to avoid the division, you must handle negative/positive values explicitly when bringing the significant bits into place.
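For example, the register pattern 0b11000000 taken as a signed 8-bit value is -64 (the top 6 bits hold -16): -64 / 4 = -16, then -16 * 625 = -10000 in units of 0.1 mg, and the signed rounding step gives (-10000 - 5) / 10 = -1000, i.e. exactly -1 g, which matches -16 * 62.5 mg.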
Typically the spec sheet will have an example conversion or two. It might show that the value 0000 0000 (binary) is zero, and that 0100 0000 is 1.00 g. Run the example values through your code to validate.

16bit Float Multiplication in C

I'm working on a small project where I need float multiplication with 16-bit floats (half precision). Unfortunately, I'm facing some problems with the algorithm:
Example Output
1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5
100 * 4 = 100
100 * 5 = 482
The Source Code
const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;
const int bias = pow(2, exponent_length - 1) - 1;
const int exponent_mask = ((1 << 5) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10); // Was 1 << 11 before update 1
int float_mul(int f1, int f2) {
    int res_exp = 0;
    int res_frac = 0;

    int exp1 = (f1 & exponent_mask) >> fraction_length;
    int exp2 = (f2 & exponent_mask) >> fraction_length;
    int frac1 = (f1 & fraction_mask) | hidden_bit;
    int frac2 = (f2 & fraction_mask) | hidden_bit;

    // Add exponents
    res_exp = exp1 + exp2 - bias; // Remove double bias

    // Multiply significands
    res_frac = frac1 * frac2; // 11 bit * 11 bit → 22 bit!

    // Shift the 22-bit intermediate right to fit into 10 bits
    if (highest_bit_pos(res_frac) == 21) {
        res_frac >>= 11;
        res_exp += 1;
    } else {
        res_frac >>= 10;
    }
    res_frac &= ~hidden_bit; // Remove hidden bit

    // Construct float
    return (res_exp << (bits - exponent_length - 1)) | res_frac;
}
By the way: I'm storing the floats in ints because I'll later try to port this code to some kind of assembler without floating-point operations.
The Question
Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?
Disclaimer: I'm not a CompSci student, it's a leisure project ;)
Update #1
Thanks to the comment by Eric Postpischil, I noticed one problem with the code: the hidden_bit flag was off by one (it should be 1 << 10). With that change I no longer get decimal places, but some calculations are still off (e.g. 3 • 3 = 20). I assume it's the res_frac shift, as described in the answers.
Update #2
The second problem with the code was indeed the res_frac shifting. After update #1, I got wrong results whenever frac1 * frac2 produced a 22-bit result. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)
From a cursory look:
No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: 10₂ • 10₂ is 100₂, three bits, but 11₂ • 11₂ is 1001₂, four bits.)
The result is truncated instead of rounded.
Signs are ignored.
Subnormal numbers are not handled, on input or output.
11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
From a more detailed look by chux:
res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code.
There are a number of steps that are common to some or all of the floating point arithmetic operations. I suggest extracting each into a function that can be written with focus on one issue, and tested separately. Then when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.
All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.
Here are some sample operations that could be separate functions, at least until you get it working:
unpack: Take a 16 bit float and extract the exponent and significand into a struct.
pack: Undo unpack - deal with dropping the hidden bit, applying the bias to the exponent, and combining the fields into a float.
normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position.
round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard digit that is the most significant bit that will be dropped, and an additional bit indicating if there are any one bits of lower significance than the guard bit.
One problem is that you are truncating instead of rounding:
res_frac >>= 11; // Shift 22bit int right to fit into 10 bit
You should compute res_frac & 0x7ff first, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.
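A sketch of that rounding step, replacing the plain shift in the 22-bit case (the name is illustrative; note that a round-up can carry out of the 10-bit fraction, which then needs an extra normalization step not shown here):

// Reduce the 22-bit product to 11 bits: round to nearest, ties to even.
unsigned int round_nearest_even(unsigned int res_frac)
{
    unsigned int dropped = res_frac & 0x7FF; // the 11 bits about to be discarded
    res_frac >>= 11;
    if (dropped > 0x400 || (dropped == 0x400 && (res_frac & 1)))
        res_frac += 1; // round away from zero, or to the even alternative on a tie
    return res_frac;
}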

Convert from binary to floating point

I'm doing some exercises for Computer Science at university, and one of them is about converting a 64-bit int array into its double-precision floating-point value.
Understanding the first bit, the sign (+/-), is quite easy. The same goes for the exponent, since we know that the bias is 1023.
We are having problems with the significand. How can I calculate it?
In the end, I would like to obtain the real number that the bits represent.
Computing the significand from the given 64 bits is quite easy.
According to the Wikipedia article on IEEE 754, the significand is made up of 53 bits (bit 0 to bit 52, where bit 0 is the implicit leading one).
If you want to convert a number that has more bits (say 67) to your 64-bit value, it is rounded using the extra bits: for example, 11110000 11110010 11111 becomes 11110000 11110011 after rounding of the last byte, because of the other 3 bits.
Since a normalized value always has a leading bit equal to one, there is no need to store that 53rd bit; that's why only 52 bits of the significand are stored instead of 53.
Now, to compute it, you just need to take the bit range of the significand, [bit(1) .. bit(52)] (bit(0) is always 1), and use it:
int index_signf = 1;  // starting at 1, not 0
int significand_length = 52;
int byteArray[53];    // array containing the bits of the significand
double significand_endValue = 0;

for (; index_signf <= significand_length; index_signf++)
{
    significand_endValue += byteArray[index_signf] * pow(2, -index_signf); // pow() needs <math.h>
}
significand_endValue += 1; // add the implicit leading 1
Now you just have to fill byteArray accordingly before computing it, using a function like this:
void getSignificandBits(const int *array64bits, int *significandBitsArray)
{
    // significandBitsArray must have room for 53 entries. It is filled in
    // place because returning a pointer to a local array would be undefined
    // behaviour. Assumes array64bits[0] is the sign bit, array64bits[1..11]
    // the exponent and array64bits[12..63] the fraction (MSB first).

    // set the first bit = 1 (the implicit leading one)
    significandBitsArray[0] = 1;

    // fill it
    for (int i = 1; i <= 52; i++)
        significandBitsArray[i] = array64bits[11 + i];
}
You could just load the bits into an unsigned integer of the same size as a double, take the address of that and cast it to a void* which you then cast to a double* and dereference.
Of course, this might be "cheating" if you really are supposed to parse the floating point standard, but this is how I would have solved the problem given the parameters you've stated so far.
If you have a byte representation of an object you can copy the bytes into the storage of a variable of the right type to convert it.
#include <string.h>
#include <stdint.h>

double convert_to_double(uint64_t x) {
    double result;
    memcpy(&result, &x, sizeof(x)); // memcpy, and both objects are 8 bytes wide
    return result;
}
You will often see code like *(double *)&x used to do the conversion, and while in practice this will usually work, it is undefined behaviour in C.
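For instance, feeding it the bit pattern with sign 0, biased exponent 0x400 (actual exponent 1) and a zero fraction reproduces the value 2.0, assuming IEEE-754 doubles:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    // convert_to_double as defined above
    printf("%f\n", convert_to_double(0x4000000000000000ULL)); // prints 2.000000
    return 0;
}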
