In the below program:
union
{
int i;
float f;
} u;
Assuming a 32-bit compiler, u is allocated 4 bytes in memory.
u.f = 3.14159f;
3.14159f is represented using IEEE 754, in those 4 bytes.
printf("As integer: %08x\n", u.i);
What does u.i represent here? Is the IEEE 754 binary representation interpreted as a 4-byte signed int?
Reading from i is implementation-defined blah blah blah.
Still.
On "normal" platforms where
float is IEEE-754 binary32 format
int is 32-bit two's complement
the endianness of float and int is the same
type punning through unions is well defined (C99+)
(AKA any "regular" PC with a recent enough compiler)
you will get the integer whose bit pattern matches that of your original float, which is described e.g. here
Now, there's the sign bit that messes things up with the two's complement representation of int, so you probably want to use an unsigned type for this kind of experimentation. Also, memcpy is a safer way to perform type punning (you won't get dirty looks and discussions about the standard), so if you do something like:
float x = 1234.5678;
uint32_t x_u;
memcpy(&x_u, &x, sizeof x_u);
Now you can easily extract the various parts of the FP representation:
int sign = x_u>>31; // 0 = positive; 1 = negative
int exponent = (x_u>>23) & 0xff; // apply -127 bias to obtain actual exponent
int mantissa = x_u & ~((unsigned)-1 << 23); // low 23 bits
(notice that this ignores completely all the "magic" patterns - quiet and signaling NaNs and subnormal numbers come to mind)
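Putting those pieces together, here is a minimal complete sketch (assuming IEEE-754 binary32 float and a 4-byte uint32_t; the variable names and output format are just for illustration):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 3.14159f;
    uint32_t x_u;
    memcpy(&x_u, &x, sizeof x_u);               /* safe type punning */

    unsigned sign     = x_u >> 31;              /* 0 = positive, 1 = negative */
    unsigned exponent = (x_u >> 23) & 0xffu;    /* biased; subtract 127 for the real exponent */
    unsigned mantissa = x_u & 0x7fffffu;        /* the 23 stored fraction bits */

    printf("bits %08x  sign %u  exponent %u  mantissa %06x\n",
           (unsigned)x_u, sign, exponent, mantissa);
    return 0;
}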
According to this answer, reading from any element of the union other than the last one written is either undefined behavior or implementation defined behavior depending on the version of the standard.
If you want to examine the binary representation of 3.14159f, you can do so by casting the address of a float and then dereferencing.
#include <stdint.h>
#include <stdio.h>
int main(){
float f = 3.14159f;
printf("%x\n", *(uint32_t*) &f);
}
The output of this program is 40490fd0, which matches with the result given by this page.
As interjay correctly pointed out, the technique I present above violates the strict aliasing rule. To make the above code work correctly, one must pass the flag -fno-strict-aliasing to gcc or the equivalent flag to disable optimizations based on strict aliasing on other compilers.
Another way of viewing the bytes, which does not violate strict aliasing and does not require the flag, is to use a char * instead.
unsigned char* cp = (unsigned char*) &f;
printf("%02x%02x%02x%02x\n",cp[0],cp[1],cp[2],cp[3]);
Note that on little-endian architectures such as x86, this will produce the bytes in the opposite order from the first suggestion.
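For example, a small sketch (assuming a little-endian x86 machine, IEEE-754 binary32 float and a 4-byte uint32_t) showing the two views of the same bits:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 3.14159f;

    uint32_t u;
    memcpy(&u, &f, sizeof u);                   /* whole-value view */
    printf("%08x\n", (unsigned)u);              /* 40490fd0 */

    unsigned char *cp = (unsigned char *)&f;    /* byte-by-byte view */
    printf("%02x%02x%02x%02x\n", cp[0], cp[1], cp[2], cp[3]);   /* d00f4940 on little endian */
    return 0;
}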
GCC version 5.4.0
Ubuntu 16.04
I have noticed some weird behavior with the right shift in C, depending on whether or not I first store the value in a variable.
This code snippet prints 0xf0000000, the expected behavior:
int main() {
int x = 0x80000000;
printf("%x", x >> 3);
}
The following two code snippets print 0x10000000, which is very weird in my opinion: it looks like a logical shift is being performed on a negative number.
1.
int main() {
int x = 0x80000000 >> 3;
printf("%x", x);
}
2.
int main() {
printf("%x", (0x80000000 >> 3));
}
Any insight would be really appreciated. I do not know if it is a specific issue with my personal computer, in which case it can't be replicated, or if it is just a behavior of C.
Quoting from https://en.cppreference.com/w/c/language/integer_constant, for a hexadecimal integer constant without any suffix:
The type of the integer constant is the first type in which the value can fit, from the list of types which depends on which numeric base and which integer-suffix was used.
int
unsigned int
long int
unsigned long int
long long int (since C99)
unsigned long long int (since C99)
Also, later
There are no negative integer constants. Expressions such as -1 apply the unary minus operator to the value represented by the constant, which may involve implicit type conversions.
So, if an int has 32 bits on your machine, 0x80000000 has the type unsigned int, as it can't fit in an int and can't be negative.
The statement
int x = 0x80000000;
converts the unsigned int to an int in an implementation-defined way, but the statement
int x = 0x80000000 >> 3;
performs the right shift on the unsigned int before converting it to an int, so the results you see are different.
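A small sketch that makes the difference visible (assuming a 32-bit int; the second result relies on the implementation-defined arithmetic right shift that most compilers use):
#include <stdio.h>

int main(void)
{
    unsigned int u = 0x80000000 >> 3;           /* shift done on an unsigned int: 0x10000000 */

    int x = 0x80000000;                         /* implementation-defined conversion to int */
    unsigned int s = (unsigned int)(x >> 3);    /* shift done on a negative int:
                                                   0xf0000000 here if the shift is arithmetic */
    printf("%x\n%x\n", u, s);
    return 0;
}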
EDIT
Also, as M.M noted, the format specifier %x requires an unsigned integer argument and passing an int instead causes undefined behavior.
Right shift of a negative integer has implementation-defined behavior, so when right-shifting a negative number you can't "expect" anything.
So it is just as it is in your implementation. It is not weird.
6.5.7/5 [...] If E1 has a signed type and a negative value, the resulting value is implementation-defined.
It may also invoke UB:
6.5.7/4 [...] If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
As noted by #P__J__, the right shift is implementation-dependent, so you should not rely on it to be consistent on different platforms.
As for your specific test, which is on a single platform (possibly 32-bit Intel or another platform that uses two's complement 32-bit representation of integers), but still shows a different behavior:
GCC performs operations on literal constants using the highest precision available (usually 64-bit, but may be even more). Now, the statement x = 0x80000000 >> 3 will not be compiled into code that does a right shift at run time; instead, the compiler sees that both operands are constants and folds the expression into x = 0x10000000. For GCC, the literal 0x80000000 is NOT a negative number. It is the positive integer 2^31.
On the other hand, x = 0x80000000 tries to store the value 2^31 into x, but 32-bit storage cannot hold the positive integer 2^31 that you gave as a literal: the value is beyond the range representable by a 32-bit two's complement signed int. The high-order bit ends up in the sign bit, so the conversion effectively overflows (formally, it is implementation-defined), though you don't get a warning or error. Then, when you use x >> 3, the operation is performed at run time (not by the compiler) with 32-bit int arithmetic, and it sees x as a negative number.
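To see that the literal itself is not negative, a sketch (assuming 32-bit int and 64-bit long long):
#include <stdio.h>

int main(void)
{
    /* 0x80000000 has type unsigned int, so its value is +2147483648 */
    printf("%lld\n", (long long)0x80000000);        /* 2147483648 */

    /* converting it to int first gives an implementation-defined result,
       typically INT_MIN on two's complement machines */
    printf("%lld\n", (long long)(int)0x80000000);   /* -2147483648 here */
    return 0;
}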
I am curious what this function does and why it's useful. I know it does a type conversion from float to integer; any detailed explanation would be appreciated.
unsigned int func(float t)
{
return *(unsigned int *)&t;
}
Thanks
Assuming a float and an unsigned int are the same size, it gives an unsigned int value that has the same binary representation (the same underlying bits) as the supplied float.
The caller can then apply bitwise operations to the returned value, and access the individual bits (e.g. the sign bit, the bits that make up the exponent and mantissa) separately.
The mechanics are that (unsigned int *) converts &t into a pointer to unsigned int. The * then obtains the value at that location. That last step formally has undefined behaviour.
For an implementation (compiler) for which float and unsigned int have different sizes, the behaviour could be anything.
It returns the unsigned integer whose binary representation is the same as the binary representation of the given float.
uint_var = func(float_var);
is essentially equivalent to:
memcpy(&uint_var, &float_var, sizeof(uint_var));
Type punning like this results in undefined behavior, so code like this is not portable. However, it's not uncommon in low-level programming, where the implementation-dependent behavior of the compiler is known.
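A memcpy-based version of the same function might look like the sketch below (float_bits is a hypothetical name; the _Static_assert needs C11 and simply documents the size assumption):
#include <string.h>

unsigned int float_bits(float t)
{
    unsigned int u;
    _Static_assert(sizeof(unsigned int) == sizeof(float),
                   "float and unsigned int must have the same size");
    memcpy(&u, &t, sizeof u);       /* copy the bytes; no pointer aliasing involved */
    return u;
}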
This doesn't exactly convert a float to an int per se. On most (practically all) platforms, a float is a 32-bit entity made up of the following four bytes:
sign + 7 bits of exponent
8th bit of exponent + first 7 bits of mantissa
next 8 bits of mantissa
last 8 bits of mantissa
Whereas an unsigned is just 32 bits of number (in endianness dictated by platform).
A straight float->unsigned int conversion would try to shoehorn the actual value of the float into the closest unsigned it can fit inside. This code instead copies the bits that make up the float straight across, without trying to interpret what they mean. So 1.0f translates to 0x3f800000 (assuming the float and the unsigned share the same endianness).
The above makes a fair number of grody assumptions about platform (on some platforms, you'll have a size mismatch and could end up with truncation or even memory corruption :-( ). I'm also not exactly sure why you'd want to do this at all (maybe to do bit ops a bit easier? Serialization?). Anyway, I'd personally prefer doing an explicit memcpy() to make it more obvious what's going on.
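To see the difference between a value conversion and a bit copy, a sketch (assuming IEEE-754 binary32 and a 4-byte unsigned int):
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.0f;

    unsigned int by_value = (unsigned int)f;    /* numeric conversion: 1 */

    unsigned int by_bits;
    memcpy(&by_bits, &f, sizeof by_bits);       /* bit copy: 0x3f800000 */

    printf("value %u, bits %08x\n", by_value, by_bits);
    return 0;
}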
Hi I have two questions:
uint64_t vs double, which has a higher range limit for covering positive numbers?
How to convert double into uint64_t if only the whole number part of double is needed.
Direct casting apparently doesn't work due to how double is defined.
Sorry for any confusion, I'm talking about the 64bit double in C on a 32bit machine.
As an example:
//operation for convertion I used:
double sampleRate = (
(union { double i; uint64_t sampleRate; })
{ .i = r23u.outputSampleRate}
).sampleRate;
//the following are printouts on the command line
//                       double (outputSampleRate)   uint64_t (sampleRate)
// printed by %.16llx:   0x41886a0000000000           0x41886a0000000000
// printed by %f / %llu: 51200000.000000              4722140757530509312
So the two numbers have the same bit pattern, but when printed out as decimals, the uint64_t is totally wrong.
Thank you.
uint64_t vs double, which has a higher range limit for covering positive numbers?
uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2^64 - 1, inclusive.
Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C does not require nor even endorse that format. It is so common, however, that it is fairly safe to assume that format, and maybe to just put in some compile-time checks against the macros defining FP characteristics. I will assume for the balance of this answer that the C implementation indeed does use that representation.
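The compile-time check could be as simple as the following sketch (requires C11 for _Static_assert; the constants are the values those macros have for IEEE-754 binary64):
#include <float.h>

/* refuse to build if double does not look like IEEE-754 binary64 */
_Static_assert(FLT_RADIX == 2 && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024,
               "double is not IEEE-754 binary64 on this implementation");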
IEEE-754 binary double precision provides 53 bits of mantissa, therefore it can represent all integers between 0 and 2^53 - 1. It is a floating-point format, however, with an 11-bit binary exponent. The largest finite number it can represent is (2^53 - 1) * 2^971, just under 2^1024. In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
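For instance (a sketch assuming IEEE-754 binary64), 2^53 + 1 fits easily in a uint64_t but cannot be held exactly by a double:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t big = ((uint64_t)1 << 53) + 1;         /* exactly representable in uint64_t */
    double   d   = (double)big;                     /* rounds to 2^53 */

    printf("%llu\n", (unsigned long long)big);      /* 9007199254740993 */
    printf("%.0f\n", d);                            /* 9007199254740992 */
    return 0;
}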
How to convert double into uint64_t if only the whole number part of double is needed
You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:
double my_double = 1.2345678e48;
uint64_t my_uint;
uint64_t my_other_uint;
my_uint = my_double;
my_other_uint = (uint64_t) my_double;
Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.
The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.
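Applied to the value from the question, a sketch (using the 51200000.0 sample rate shown in the printouts above):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double outputSampleRate = 51200000.0;               /* the double printed above */
    uint64_t sampleRate = (uint64_t)outputSampleRate;   /* value conversion, not a bit copy */

    printf("%llu\n", (unsigned long long)sampleRate);   /* 51200000 */
    return 0;
}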
double can hold substantially larger numbers than uint64_t, as the value range for an 8-byte IEEE 754 double is 4.94065645841246544e-324 to 1.79769313486231570e+308 (positive or negative; taken from here). However, if you do addition of small values in that range, you will be in for a surprise: at some point the precision will no longer be able to represent, say, an addition of 1, and the result will round down to the lower value, essentially making a loop steadily incremented by 1 non-terminating.
This code for example:
#include <stdio.h>
int main()
{
    for (double i = 100000000000000000000000000000000.0; i < 1000000000000000000000000000000000000000000000000.0; ++i)
        printf("%lf\n", i);
    return 0;
}
gives me a constant output of 100000000000000005366162204393472.000000. That's also why we have nextafter and nexttoward functions in math.h. You can also find ceil and floor functions there, which, in theory, will allow you to solve your second problem: removing the fraction part.
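For example, a sketch showing nextafter advancing the value where adding 1 no longer can (1.0e32 is the loop's starting value above):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double big = 1.0e32;

    printf("%.1f\n", big + 1.0);                    /* unchanged: the +1 is lost to rounding */
    printf("%.1f\n", nextafter(big, INFINITY));     /* the next representable double upward */
    return 0;
}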
However, if you really need to hold large numbers you should look at bigint implementations instead, e.g. GMP. Bigints were designed to do operations on very large integers, and operations like an addition of one will truly increment the number even for very large values.
I'm trying to do, what I imagined to be, a fairly basic task. I have two unsigned char variables and I'm trying to combine them into a single signed int. The problem here is that the unsigned chars start as signed chars, so I have to cast them to unsigned first.
I've done this task in three IDEs: MPLAB (as this is an embedded application), MATLAB, and now Visual Studio. Visual Studio is the only one having problems with the casting.
For example, the two signed chars are -5 and 94. In MPLAB I first cast the two chars into unsigned chars:
unsigned char a = (unsigned char)-5;
unsigned char b = (unsigned char)94;
This gives me 251 and 94 respectively. I then want to do some bitshifting and concat:
int c = (int)((((unsigned int) a) << 8) | (unsigned int) b);
In MPLAB and MATLAB this gives me the right signed value of -1186. However, the exact same code in Visual Studio refuses to output the result as a signed value, only unsigned (64350). This has been checked both by stepping through the code in the debugger and by printing the result:
printf("%d\n", c);
What am I doing wrong? This is driving me insane. The application is an electronic device that collects sensor data, then stores it on an SD card for later decoding using a program written in C. I technically could do all the calculations in MPLAB and then store those on the SDCARD, but I refuse to let Microsoft win.
I understand my method of casting is very unoptimised and you could probably do it in one line, but having had this problem for a couple of days now I've tried to break the steps down as much as possible.
Any help is most appreciated!
The problem is that an int on most systems is 32 bits. If you concatenate two 8-bit quantities and store the result in a 32-bit quantity, you will get a positive integer because you are not setting the sign bit, which is the most significant bit. More specifically, you are only populating the lower 16 bits of a 32-bit integer, which will naturally be interpreted as a positive number.
You can fix this by explicitly using a 16-bit signed int.
#include <stdio.h>
#include <stdint.h>
int main() {
unsigned char a = (unsigned char)-5;
unsigned char b = (unsigned char)94;
int16_t c = (int16_t)((((unsigned int) a) << 8) | (unsigned int) b);
printf("%d\n", c);
}
Note that I am on a Linux system, so you will probably have to change stdint.h to the Microsoft equivalent, and possibly change int16_t to whatever Microsoft calls their 16-bit signed integer type, if it is different, but this should work with those modifications.
This is the correct behavior of the standard C language. When you convert an unsigned to a signed type, the language does not perform sign extension, i.e. it does not propagate the highest bit of the unsigned into the sign bit of the signed type.
You can fix your problem by casting a to a signed char, like this:
unsigned char a = (unsigned char)-5;
unsigned char b = (unsigned char)94;
int c = (signed char)a << 8 | b;
printf("%d\n", c); // Prints -1186
Now that a is treated as signed, the language propagates its top bit into the sign bit of the 32-bit int, making the result negative.
Demo on ideone.
Converting an out-of-range unsigned value to a signed value causes implementation-defined behaviour, which means that the compiler must document what it does in this situation; and different compilers can do different things.
In C99 there is also a provision that the compiler may raise a signal in this case (terminating the program if you don't have a signal handler). I believe it was undefined behaviour in C89, but C99 tightened this up a bit.
Is there some reason you can't go:
signed char x = -5;
signed char y = 94;
int c = x * 256 + y;
?
BTW if you are OK with implementation-defined behaviour, and your system has a 16-bit type then you can just go, with #include <stdint.h>,
int c = (int16_t)(x * 256 + y);
To explain, C deals in values. In math, 251 * 256 + 94 is a positive number, and C is no exception to that. The bit-shift operators are just *2 and /2 in disguise. If you want your value to be reduced (mod 65536) you have to specifically request that.
If you also think in terms of values rather than representations, you don't have to worry about things like sign bits and sign extension.
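If you do want to request that reduction explicitly, here is a fully portable sketch with no implementation-defined conversions (it reduces mod 65536 and then folds values >= 32768 into the negative range):
#include <stdio.h>

int main(void)
{
    unsigned char a = (unsigned char)-5;    /* 251 */
    unsigned char b = 94;

    unsigned v = (((unsigned)a << 8) | b) & 0xFFFFu;        /* reduce mod 65536 */
    int c = (v >= 0x8000u) ? (int)v - 0x10000 : (int)v;     /* fold into [-32768, 32767] */

    printf("%d\n", c);                      /* -1186 */
    return 0;
}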
I have this structure which I want to write to a file:
typedef struct
{
char* egg;
unsigned long sausage;
long bacon;
double spam;
} order;
This file must be binary and must be readable by any machine that has a
C99 compiler.
I looked at various approaches to this matter such as ASN.1, XDR, XML,
ProtocolBuffers and many others, but none of them fit my requirements:
small
simple
written in C
I decided then to make my own data protocol. I could handle the
following representations of integer types:
unsigned
signed in one's complement
signed in two's complement
signed in sign and magnitude
in a valid, simple and clean way (impressive, no?). However, the
real types are being a pain now.
How should I read float and double from a byte stream? The standard
says that bitwise operators (at least &, |, << and >>) are for
integer types only, which left me without hope. The only way I could
think was:
int sign;
int exponent;
unsigned long mantissa;
order my_order;
sign = read_sign();
exponent = read_exponent();
mantissa = read_mantissa();
my_order.spam = sign * mantissa * pow(10, exponent);
but that doesn't seem really efficient. I also could not find a
description of the representation of double and float. How should
one proceed here?
If you want to be as portable as possible with floats you can use frexp and ldexp:
void WriteFloat (float number)
{
int exponent;
unsigned long mantissa;
mantissa = (unsigned int) (INT_MAX * frexp(number, &exponent));
WriteInt (exponent);
WriteUnsigned (mantissa);
}
float ReadFloat ()
{
int exponent = ReadInt();
unsigned long mantissa = ReadUnsigned();
float value = (float)mantissa / INT_MAX;
return ldexp (value, exponent);
}
The idea behind this is that ldexp, frexp and INT_MAX are standard C. Also, the precision of an unsigned long is usually at least as high as the width of the mantissa (no guarantee, but it is a valid assumption, and I don't know a single architecture where that is different).
Therefore the conversion works without precision loss. The division/multiplication with INT_MAX may lose a bit of precision during conversion, but that's a compromise one can live with.
If you are using C99 you can output real numbers in portable hex using %a.
If you are using IEEE-754, why not access the float or double as an unsigned short or unsigned long, save the floating-point data as a series of bytes, and then re-convert the "specialized" unsigned short or unsigned long back to a float or double on the other side of the transmission ... the bit data would be preserved, so you should end up with the same floating-point number after transmission.
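A byte-level sketch of that idea using memcpy (double_to_bytes and bytes_to_double are hypothetical helpers; it assumes an 8-byte double and the same byte order and format on both ends of the transmission):
#include <stdio.h>
#include <string.h>

void double_to_bytes(double d, unsigned char out[8])
{
    memcpy(out, &d, 8);         /* copy the object representation */
}

double bytes_to_double(const unsigned char in[8])
{
    double d;
    memcpy(&d, in, 8);
    return d;
}

int main(void)
{
    unsigned char buf[8];
    double_to_bytes(3.14159, buf);          /* "transmit" buf ... */
    printf("%f\n", bytes_to_double(buf));   /* 3.141590 */
    return 0;
}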
This answer uses Nils Pipenbrinck's method but I have changed a few details that I think help to ensure real C99 portability. This solution lives in an imaginary context where encode_int64 and encode_int32 etc already exist.
#include <stdint.h>
#include <math.h>
#define PORTABLE_INTLEAST64_MAX ((int_least64_t)9223372036854775807) /* 2^63-1*/
/* NOTE: +-inf and nan not handled. quickest solution
* is to encode 0 for !isfinite(val) */
void encode_double(struct encoder *rec, double val) {
int exp = 0;
double norm = frexp(val, &exp);
int_least64_t scale = norm*PORTABLE_INTLEAST64_MAX;
encode_int64(rec, scale);
encode_int32(rec, exp);
}
void decode_double(struct encoder *rec, double *val) {
int_least64_t scale = 0;
int_least32_t exp = 0;
decode_int64(rec, &scale);
decode_int32(rec, &exp);
*val = ldexp((double)scale/PORTABLE_INTLEAST64_MAX, exp);
}
This is still not a real solution, inf and nan can not be encoded. Also notice that both parts of the double carry sign bits.
int_least64_t is guaranteed by the standard (int64_t is not), and we use the least permissible maximum for this type to scale the double. The encoding routines accept int_least64_t but will have to reject input that is larger than 64 bits for portability; the same goes for the 32-bit case.
The C standard doesn't define a representation for floating point types. Your best bet would be to convert them to IEEE-754 format and store them that way. Portability of binary serialization of double/float type in C++ may help you there.
Note that the C standard also doesn't specify a format for integers. While most computers you're likely to encounter will use a normal two's-complement representation with only endianness to be concerned about, it's also possible they would use a one's-complement or sign-magnitude representation, and both signed and unsigned ints may contain padding bits that don't contribute to the value.
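For the integer side, a value-based sketch avoids all of those representation issues (put_u32 and get_u32 are hypothetical helpers; the wire order here is simply chosen to be least significant byte first):
#include <stdint.h>
#include <stdio.h>

void put_u32(unsigned char out[4], uint_least32_t v)
{
    out[0] = (unsigned char)(v & 0xff);         /* least significant byte first */
    out[1] = (unsigned char)((v >> 8) & 0xff);
    out[2] = (unsigned char)((v >> 16) & 0xff);
    out[3] = (unsigned char)((v >> 24) & 0xff);
}

uint_least32_t get_u32(const unsigned char in[4])
{
    return (uint_least32_t)in[0]
         | ((uint_least32_t)in[1] << 8)
         | ((uint_least32_t)in[2] << 16)
         | ((uint_least32_t)in[3] << 24);
}

int main(void)
{
    unsigned char buf[4];
    put_u32(buf, 0x12345678);
    printf("%lx\n", (unsigned long)get_u32(buf));   /* 12345678 */
    return 0;
}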