As shown in the code below, I am trying to copy the bits from a long, longnum, to two doubles, d1 and d2, using two different methods: pointer casting plus dereferencing, and ANDing with an all-ones mask, respectively.
#include <stdio.h>

int main(void) {
    long longnum = 0xDDDDDDDDDDDDDDDD;
    double d1 = *((double*)(&longnum));
    double d2 = longnum & 0xFFFFFFFFFFFFFFFF;

    printf("%ld\n\n", longnum);
    printf("%lf\n\n", d1);
    printf("%lf\n", d2);
    return 0;
}
The issue is that the two doubles are not printed the same way, as shown in the output below.
-2459565876494606883
-1456815990147462891125136942359339382185244158826619267593931664968442323048246672764155341958241671875972237215762610409185128240974392406835200.000000
15987178197214945280.000000
Given the size of DBL_MAX, the maximum value of a double, it seems to me that the giant number is actually the sensible output of the two doubles printed.
double d2 = longnum & 0xFFFFFFFFFFFFFFFF;
The & mask doesn't do anything. A number ANDed with all 1's is the same number. The line above is no different from:
double d2 = longnum;
That line doesn't do any bit reinterpretation. Instead it sets d2 to the double that most closely represents the value in longnum. The value will be similar; the bit pattern will be quite different.
The best way to do what you're trying to do is with a union; unions are the idiomatic way to perform type punning in C.
union {
    long l;
    double d;
} u;

u.l = longnum;
printf("%f\n\n", u.d);
Using pointers as you did with d1 technically invokes undefined behavior. It is a common idiom and in practice will probably work fine, but type punning with pointers ought to be avoided.
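If a union feels heavyweight, memcpy is the other well-defined route: it copies the object representation without any aliasing concerns. A minimal sketch, assuming long and double are both 8 bytes (true on e.g. 64-bit Linux):

#include <stdio.h>
#include <string.h>

int main(void)
{
    long longnum = 0xDDDDDDDDDDDDDDDD;
    double d;

    memcpy(&d, &longnum, sizeof d);  /* copy the raw bit pattern, no aliasing issues */
    printf("%f\n", d);
    return 0;
}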
Your problem is:

d1 = *((double*)(&longnum));

By doing the above, you are taking the raw contents of memory and incorrectly assuming they are already in double-precision format.

If you instead do:

d1 = (double)longnum;

then the contents of longnum, an 8-byte integer, are properly converted to a double and stored in d1, and the contents of d1 will be correct.

I suggest you read the Wikipedia page on "Double-precision floating-point format": a floating-point value (or double-precision value) is stored in memory quite differently from an integer.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
Related
I have a very basic question. In my program, I am multiplying two fixed-point numbers, as given below. My inputs are in Q1.31 format and the output should be in the same format. To do this, I store the result of the multiplication in a temporary 64-bit variable and then perform some operations to get the result in the required format.
#include <stdio.h>

int conversion1(float input, int Q_FORMAT)
{
    return ((int)(input * ((1 << Q_FORMAT) - 1)));
}

int mul(int input1, int input2, int format)
{
    __int64 result;

    result = (__int64)input1 * (__int64)input2; // Q2.62 format
    result = result << 1;                       // Q1.63 format
    result = result >> (format + 1);            // Q33.31 format
    return (int)result;                         // Q1.31 format
}

int main()
{
    int Q_FORMAT = 31;
    float input1 = 0.5, input2 = 0.5;
    int q_input1, q_input2;
    int temp_mul;
    float q_muls;

    q_input1 = conversion1(input1, Q_FORMAT);
    q_input2 = conversion1(input2, Q_FORMAT);
    temp_mul = mul(q_input1, q_input2, Q_FORMAT); // Q1.31 product
    q_muls = ((float)temp_mul / ((1 << (Q_FORMAT)) - 1));

    printf("result of multiplication using q format = %f\n", q_muls);
    return 0;
}
My question is: while converting the float input to an integer (and also while converting the int output back to float), I am using (1 << Q_FORMAT) - 1 as the scale factor. But I have seen people use (1 << Q_FORMAT) directly in their code. The problem I am facing when using (1 << Q_FORMAT) is that I get the negative of the desired result.

For example, in my program:

If I use (1 << Q_FORMAT), I get -0.25 as the result.

But if I use (1 << Q_FORMAT) - 1, I get 0.25, which is correct.

Where am I going wrong? Do I need to understand any other concepts?
On common platforms, int is a two's complement 32-bit integer, providing 31 value bits (plus a sign bit). That is too narrow to represent a Q1.31 number, which requires 32 value bits (plus a sign bit).

In your example, this manifests as arithmetic overflow (formally undefined behavior) in the expression 1 << Q_FORMAT: it shifts the 1 into the sign bit of a 32-bit int.

To avoid this, you need to either use a type providing more value bits (e.g. long long) or a fixed-point format requiring fewer bits (e.g. Q1.30). You can use unsigned to fix your example, but the result will be a sign bit short of Q2.30.
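For example, a sketch of the conversion helper widened to 64 bits, so the shift cannot overflow (int64_t from <stdint.h> stands in for the __int64 used in the question):

#include <stdint.h>

/* sketch: shift in 64 bits so 1 << 31 cannot overflow a 32-bit int;
   input must stay within [-1.0, 1.0) for the result to fit in Q1.31 */
int conversion1(float input, int Q_FORMAT)
{
    return (int)(input * (float)((int64_t)1 << Q_FORMAT));
}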
Consider this program:
#include <stdio.h>

union myUnion
{
    int x;
    long double y;
};

int main()
{
    union myUnion a;
    a.x = 5;
    a.y = 3.2;
    printf("%d\n%.2Lf", a.x, a.y);
    return 0;
}
Output:
-858993459
3.20
This is fine, as the int member gets interpreted using some of the bits of the long double member. However, the reverse doesn't really apply:
#include <stdio.h>

union myUnion
{
    int x;
    long double y;
};

int main()
{
    union myUnion a;
    a.y = 3.2;
    a.x = 5;
    printf("%d\n%.2Lf", a.x, a.y);
    return 0;
}
Output:
5
3.20
The question is why the long double doesn't get reinterpreted as some garbage value (since 4 of its bytes should represent the integer)? It is not a coincidence, the program outputs 3.20 for all values of a.x, not just 5.
However, the reverse doesn't really apply
On a little endian system (least significant byte of a multi-byte value is at the lowest address), the int will correspond to the least significant bits of the mantissa of the long double. You have to print that long double with a great deal of precision to see the effect of that int on those insignificant digits.
On a big endian system, like a Power PC box, things would be different: the int part would line up with the most significant part of the long double, overlapping with the sign bit, exponent and most significant mantissa bits. Thus changes in x would have drastic effects on the observed floating-point value, even if only a few significant digits are printed. However, for small values of x, the value appears to be zero.
On a PPC64 system, the following version of the program:
int main(void)
{
union myUnion a;
a.y = 3.2;
int i;
for (i = 0; i < 1000; i++) {
a.x = i;
printf("%d -- %.2Lf\n", a.x, a.y);
}
return 0;
}
prints nothing but

0 -- 0.00
1 -- 0.00
[...]
999 -- 0.00
This is because we're creating an exponent field with all zeros, giving rise to values close to zero. However, the initial value 3.2 is completely clobbered; it doesn't just have its least significant bits ruffled.
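One way to observe the layout yourself (a sketch of mine; inspecting bytes through unsigned char is always well-defined) is to dump the union's bytes and see where the int landed:

#include <stdio.h>

union myUnion
{
    int x;
    long double y;
};

int main(void)
{
    union myUnion a;
    a.y = 3.2L;
    a.x = 5;

    /* on a little-endian machine the int occupies the lowest addressed
       bytes (the mantissa LSBs); on a big-endian machine it overlaps the
       sign/exponent end instead */
    const unsigned char *p = (const unsigned char *)&a;
    for (size_t i = 0; i < sizeof a; i++)
        printf("%02x ", p[i]);
    putchar('\n');
    return 0;
}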
The size of long double is very large. To see the effect of modifying the x field on implementations where x lines up with the LSBs of the mantissa of y (and the other bits of the union are not affected when modifying via x), you need to print the value with much higher precision.
This is only affecting the last half of the mantissa. It won't make any noticeable difference with the amount of digits you're printing. However, the difference can be seen when you print 64 digits.
This program will show the difference:
#include <stdio.h>
#include <string.h>
#include <ctype.h>

union myUnion
{
    int x;
    long double y;
};

int main()
{
    union myUnion a;

    a.y = 3.2;
    a.x = 5;
    printf("%d\n%.64Lf\n", a.x, a.y);

    a.y = 3.2;
    printf("%.64Lf\n", a.y);

    return 0;
}
My output:
5
3.1999999992549419413918193599855044340074528008699417114257812500
3.2000000000000001776356839400250464677810668945312500000000000000
Based on my knowledge of the 80-bit long double format, this overwrites half of the mantissa, which doesn't skew the result much, so this prints somewhat accurate results.
If you had done this in my program:
a.x = 0;
the result would've been:
0
3.1999999992549419403076171875000000000000000000000000000000000000
3.2000000000000001776356839400250464677810668945312500000000000000
which is only slightly different.
Answers posted by Mohit Jain, Kaz and JL2210 provide good insight into your observations and how to investigate further, but be aware that the C Standard does not guarantee this behavior:
6.2.6 Representations of types
6.2.6.1 General
6 When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values. The value of a structure or union object is never a trap representation, even though the value of a member of the structure or union object may be a trap representation.
7 When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.
As a consequence, the behavior described in the answers is not guaranteed as all the bytes of the long double y member could be modified by setting the int x member, including the bytes that are not part of the int. These bytes can take any value and the contents of y could even be a trap value, causing undefined behavior.
As commented by Kaz, gcc is more permissive than the C Standard: its documentation describes the practice as supported: "The practice of reading from a different union member than the one most recently written to (called type-punning) is common. Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type." This practice is actually condoned by a footnote in the C Standard since C11, as documented in this answer: https://stackoverflow.com/a/11996970/4593267 . Yet in my reading of this footnote, there is still no guarantee about the bytes of y that are not part of x.
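Under that strict reading, one defensible workaround (a sketch of mine, not from the cited answers) is to do the byte surgery explicitly with memcpy, which guarantees the untouched bytes survive:

#include <stdio.h>
#include <string.h>

int main(void)
{
    long double y = 3.2L;
    int x = 5;

    unsigned char bytes[sizeof y];
    memcpy(bytes, &y, sizeof y);   /* snapshot y's object representation */
    memcpy(bytes, &x, sizeof x);   /* overwrite only the int-sized prefix */
    memcpy(&y, bytes, sizeof y);   /* rebuild y; the remaining bytes are untouched */

    printf("%.64Lf\n", y);
    return 0;
}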
Here's the code:
#include <stdio.h>

union
{
    unsigned u;
    double d;
} a, b;

int main(void)
{
    printf("Enter a, b:");
    scanf("%lf %lf", &a.d, &b.d);
    if (a.d > b.d)
    {
        a.u ^= b.u ^= a.u ^= b.u;
    }
    printf("a=%g, b=%g\n", a.d, b.d);
    return 0;
}
The a.u^=b.u^=a.u^=b.u; statement should have swapped a and b if a>b, but it seems that whatever I enter, the output will always be exactly my input.
a.u^=b.u^=a.u^=b.u; causes undefined behaviour by writing to a.u twice without a sequence point. See here for discussion of this code.
You could write:
unsigned tmp;
tmp = a.u;
a.u = b.u;
b.u = tmp;
which will swap a.u and b.u. However this may not achieve the goal of swapping the two doubles, if double is a larger type than unsigned on your system (a common scenario).
It's likely that double is 64 bits, while unsigned is only 32 bits. When you swap the unsigned members of the unions, you're only getting half of the doubles.
If you change d to float, or change u to unsigned long long, it will probably work, since they're likely to be the same size.
You're also causing UB by writing to the variables twice without a sequence point. The proper way to write the XOR swap is with multiple statements.
b.u ^= a.u;
a.u ^= b.u;
b.u ^= a.u;
For more about why not to use XOR for swapping, see Why don't people use xor swaps?
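Putting it together, a sketch of the corrected program, assuming double and unsigned long long are both 64 bits (common, but worth checking on your platform):

#include <stdio.h>

union
{
    unsigned long long u;  /* now as wide as double */
    double d;
} a, b;

int main(void)
{
    printf("Enter a, b:");
    if (scanf("%lf %lf", &a.d, &b.d) != 2)
        return 1;
    if (a.d > b.d)
    {
        unsigned long long tmp = a.u;  /* swap the full 64-bit patterns */
        a.u = b.u;
        b.u = tmp;
    }
    printf("a=%g, b=%g\n", a.d, b.d);
    return 0;
}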
In the usual environment, the memory sizes of the types unsigned and double are different. That is why the variables do not appear to change.

Also, you cannot use the XOR swap on floating-point variables directly, because they are represented in memory completely differently from integers.
I saw the following piece of code in an open-source AAC decoder:
static void flt_round(float32_t *pf)
{
    int32_t flg;
    uint32_t tmp, tmp1, tmp2;

    tmp = *(uint32_t*)pf;
    flg = tmp & (uint32_t)0x00008000;
    tmp &= (uint32_t)0xffff0000;
    tmp1 = tmp;

    /* round 1/2 lsb toward infinity */
    if (flg)
    {
        tmp &= (uint32_t)0xff800000;  /* extract exponent and sign */
        tmp |= (uint32_t)0x00010000;  /* insert 1 lsb */
        tmp2 = tmp;                   /* add 1 lsb and elided one */
        tmp &= (uint32_t)0xff800000;  /* extract exponent and sign */

        *pf = *(float32_t*)&tmp1 + *(float32_t*)&tmp2 - *(float32_t*)&tmp;
    } else {
        *pf = *(float32_t*)&tmp;
    }
}
In that, is the line

*pf = *(float32_t*)&tmp;

the same as

*pf = (float32_t)tmp;

Or is there a difference, maybe in performance?
Thank you.
No, they're completely different. Say the value of tmp is 1. Their code will give *pf the value of whatever floating point number has the same binary representation as the integer 1. Your code would give it the floating point value 1.0!
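To see it concretely, here is a small sketch (using memcpy for the reinterpretation so it stays well-defined, and assuming unsigned and float are both 32-bit with IEEE 754 floats); the bit pattern of integer 1 is the smallest subnormal:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned tmp = 1;
    float f, g;

    memcpy(&f, &tmp, sizeof f);  /* reinterpretation of the bit pattern */
    g = (float)tmp;              /* numeric conversion */

    printf("%g\n", f);  /* about 1.4013e-45, the smallest subnormal */
    printf("%g\n", g);  /* 1 */
    return 0;
}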
This code is editing the value of a float, knowing it is stored using the standard IEEE 754 floating-point representation.

*(float32_t*)&tmp;

means: reinterpret the address of tmp as a pointer to a 32-bit float, and extract the value pointed to.

(float32_t)tmp;

means: convert the numeric value of tmp to a 32-bit float. The value is preserved (up to rounding), but the bit pattern is completely different.
Very different.
The first causes the bit pattern of tmp to be reinterpreted as a float.
The second causes the numerical value of tmp to be converted to float (within the accuracy that it can be represented including rounding).
Try this:
#include <stdio.h>
#include <stdint.h>

typedef float float32_t;  /* assuming float is a 32-bit IEEE 754 type */

int main(void) {
    int32_t n = 1078530011;
    float32_t f;

    f = *(float32_t*)(&n);
    printf("reinterpret the bit pattern of %d as float - f==%f\n", n, f);

    f = (float32_t)n;
    printf("cast the numerical value of %d as float - f==%f\n", n, f);

    return 0;
}
Example output:
reinterpret the bit pattern of 1078530011 as float - f==3.141593
cast the numerical value of 1078530011 as float - f==1078530048.000000
It's like thinking that

const char* str = "3568";
int a = *(int*)str;
int b = atoi(str);

will assign a and b the same values.
First, to answer the question: my_float = (float)my_int safely converts the integer to a float, according to the rules of the standard (6.3.1.4):
When a value of integer type is converted to a real floating type, if
the value being converted can be represented exactly in the new type,
it is unchanged. If the value being converted is in the range of
values that can be represented but cannot be represented exactly, the
result is either the nearest higher or nearest lower representable
value, chosen in an implementation-defined manner. If the value being
converted is outside the range of values that can be represented, the
behavior is undefined.
my_float = *(float*)&my_int on the other hand, is a dirty trick, telling the program that the binary contents of the integer should be treated as if they were a float variable, with no concerns at all.
However, the person who wrote the dirty trick was probably not aware of it leading to undefined behavior for another reason: it violates the strict aliasing rule.
To fix this bug, you either have to tell your compiler to behave in a non-standard, non-portable manner (for example gcc -fno-strict-aliasing), which I don't recommend.
Or preferably, you rewrite the code so that it doesn't rely on undefined behavior. Best way is to use unions, for which strict aliasing doesn't apply, in the following manner:
typedef union
{
    uint32_t as_int;
    float32_t as_float;
} converter_t;

uint32_t value1, value2, value3; // do something with these variables

*pf = (converter_t){value1}.as_float +
      (converter_t){value2}.as_float -
      (converter_t){value3}.as_float;
Also, it is good practice to add the following sanity check (static_assert is available since C11 via <assert.h>):

static_assert(sizeof(converter_t) == sizeof(uint32_t),
              "Unexpected padding or wrong type sizes!");
Is the de facto method for comparing arrays (in C) to use memcmp from string.h?
I want to compare arrays of ints and doubles in my unit tests.
I am unsure whether to use something like:
double a[] = {1.0, 2.0, 3.0};
double b[] = {1.0, 2.0, 3.0};
size_t n = 3;

if (!memcmp(a, b, n * sizeof(double)))
    /* arrays equal */
or to write a bespoke is_array_equal(a, b, n) type function?
memcmp would do an exact comparison, which is seldom a good idea for floats, and would not follow the rule that NaN != NaN. For sorting, that's fine, but for other purposes, you might want to do an approximate comparison such as:
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

bool dbl_array_eq(double const *x, double const *y, size_t n, double eps)
{
    for (size_t i = 0; i < n; i++)
        if (fabs(x[i] - y[i]) > eps)
            return false;
    return true;
}
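For instance, a quick usage sketch (the tolerance is up to the test):

double a[] = {1.0, 2.0, 3.0};
double b[] = {1.0, 2.0, 3.0000001};

if (dbl_array_eq(a, b, 3, 1e-6))
    /* arrays considered equal to within 1e-6 */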
Using memcmp is not generally a good idea. Let's start with the more complex and work down from there.
Though you mentioned int and double, I first want to concentrate on memcmp as a general solution, such as to compare arrays of type:
struct {
    char c;
    // 1
    int i;
    // 2
}
The main problem there is that implementations are free to add padding to structures at locations 1 and 2, making a bytewise comparison potentially false even though the important bits match perfectly.
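One way around that (a sketch of my own, with hypothetical names) is a member-wise comparison function, which never looks at the padding:

#include <stdbool.h>

struct pair {
    char c;
    int i;
};

/* compare members individually; padding bytes are never examined */
bool pair_equal(const struct pair *a, const struct pair *b)
{
    return a->c == b->c && a->i == b->i;
}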
Now down to doubles. You might think this was better as there's no padding there. However there are other problems.
The first is the treatment of NaN values. IEEE754 goes out of its way to ensure that NaN is not equal to any other value, including itself. For example, the code:
#include <stdio.h>
#include <string.h>

int main (void) {
    double d1 = 0.0 / 0.0, d2 = d1;

    if (d1 == d2)
        puts ("Okay");
    else
        puts ("Bad");

    if (memcmp (&d1, &d2, sizeof(double)) == 0)
        puts ("Okay");
    else
        puts ("Bad");

    return 0;
}
will output
Bad
Okay
illustrating the difference.
The second is the treatment of plus and minus zero. These should be considered equal for the purposes of comparison but, as the bit patterns are different, memcmp will say they are different.
Changing the declaration/initialisation of d1 and d2 in the above code to:
double d1 = 0.0, d2 = -d1;
will make this clear.
So, if structures and doubles are problematic, surely integers are okay. After all, they're always two's complement, yes?
No, actually they're not. ISO mandates one of three encoding schemes for signed integers, and the other two (ones' complement and sign/magnitude) suffer from a similar problem as doubles: the fact that both plus and minus zero exist.
So, while they should possibly be considered equal, again the bit patterns are different.
Even for unsigned integers, you have a problem (it's a problem for signed values as well). ISO states that these representations can have value bits and padding bits, and the values of the padding bits are unspecified.
So, even for what may seem the simplest case, memcmp can be a bad idea.
Replace memset with memcmp in your code, and it works.

In your case (as the sizes of both arrays are identical and known at compile time) you can even do:

memcmp(a, b, sizeof(a));
The function you're looking for is memcmp, not memset. See the answers to this question for why it might not be a good idea to memcmp an array of doubles though.
memcmp compares two blocks of memory, for the given number of bytes.

memset initialises a buffer with a given value, for the given number of bytes.

Buffers can also be compared without memcmp in the following way (the same pattern can be adapted for other data types):
#include <stdint.h>

int8_t array_1[] = { 1, 2, 3, 4 };
int8_t array_2[] = { 1, 2, 3, 4 };

uint8_t i;
uint8_t compare_result = 1;

for (i = 0; i < (sizeof(array_1) / sizeof(int8_t)); i++)
{
    if (array_1[i] != array_2[i])
    {
        compare_result = 0;
        break;
    }
}