I have this structure which I want to write to a file:
typedef struct
{
    char* egg;
    unsigned long sausage;
    long bacon;
    double spam;
} order;
This file must be binary and must be readable by any machine that has a
C99 compiler.
I looked at various approaches to this matter such as ASN.1, XDR, XML,
ProtocolBuffers and many others, but none of them fit my requirements:
small
simple
written in C
I decided then to make my own data protocol. I could handle the
following representations of integer types:
unsigned
signed in one's complement
signed in two's complement
signed in sign and magnitude
in a valid, simple and clean way (impressive, no?). However, the
real types are being a pain now.
How should I read float and double from a byte stream? The standard
says that bitwise operators (at least &, |, << and >>) are for
integer types only, which left me without hope. The only way I could
think was:
int sign;
int exponent;
unsigned long mantissa;
order my_order;
sign = read_sign();
exponent = read_exponent();
mantissa = read_mantissa();
my_order.spam = sign * mantissa * pow(10, exponent);
but that doesn't seem really efficient. I also could not find a
description of the representation of double and float. How should
one proceed here?
If you want to be as portable as possible with floats you can use frexp and ldexp:
void WriteFloat (float number)
{
    int exponent;
    unsigned long mantissa;
    mantissa = (unsigned int) (INT_MAX * frexp(number, &exponent));
    WriteInt (exponent);
    WriteUnsigned (mantissa);
}
float ReadFloat ()
{
    int exponent = ReadInt();
    unsigned long mantissa = ReadUnsigned();
    float value = (float)mantissa / INT_MAX;
    return ldexp (value, exponent);
}
The idea behind this is that ldexp, frexp and INT_MAX are standard C. Also, the precision of an unsigned long is usually at least as high as the width of the mantissa (there is no guarantee, but it is a valid assumption, and I don't know a single architecture where it doesn't hold).
Therefore the conversion works without precision loss. The division/multiplication with INT_MAX may lose a bit of precision during conversion, but that's a compromise one can live with.
If you are using C99 you can output real numbers in portable hex using %a.
If you are using IEEE-754, why not access the float or double as an unsigned short or unsigned long, save the floating-point data as a series of bytes, and then re-convert the bytes back to a float or double on the other side of the transmission? The bit data would be preserved, so you should end up with the same floating-point number after transmission.
This answer uses Nils Pipenbrinck's method but I have changed a few details that I think help to ensure real C99 portability. This solution lives in an imaginary context where encode_int64 and encode_int32 etc already exist.
#include <stdint.h>
#include <math.h>
#define PORTABLE_INTLEAST64_MAX ((int_least64_t)9223372036854775807) /* 2^63-1*/
/* NOTE: +-inf and nan not handled. quickest solution
* is to encode 0 for !isfinite(val) */
void encode_double(struct encoder *rec, double val) {
    int exp = 0;
    double norm = frexp(val, &exp);
    int_least64_t scale = norm * PORTABLE_INTLEAST64_MAX;
    encode_int64(rec, scale);
    encode_int32(rec, exp);
}
void decode_double(struct encoder *rec, double *val) {
    int_least64_t scale = 0;
    int_least32_t exp = 0;
    decode_int64(rec, &scale);
    decode_int32(rec, &exp);
    *val = ldexp((double)scale / PORTABLE_INTLEAST64_MAX, exp);
}
This is still not a real solution, inf and nan can not be encoded. Also notice that both parts of the double carry sign bits.
int_least64_t is guaranteed by the standard (int64_t is not), and we scale the double by the smallest maximum value the standard permits for that type. The encoding routines accept int_least64_t but, for portability, will have to reject input that doesn't fit in 64 bits; the same goes for the 32-bit case.
The C standard doesn't define a representation for floating point types. Your best bet would be to convert them to IEEE-754 format and store them that way. Portability of binary serialization of double/float type in C++ may help you there.
Note that the C standard also doesn't specify a format for integers. While most computers you're likely to encounter will use a normal two's-complement representation with only endianness to be concerned about, it's also possible they would use a one's-complement or sign-magnitude representation, and both signed and unsigned ints may contain padding bits that don't contribute to the value.
Related
I want to understand float serialization better. Why in this example do they multiply the mantissa by INT_MAX before casting to unsigned int?
void WriteFloat (float number)
{
    int exponent;
    unsigned long mantissa;
    mantissa = (unsigned int) (INT_MAX * frexp(number, &exponent));
    WriteInt (exponent);
    WriteUnsigned (mantissa);
}
float ReadFloat ()
{
    int exponent = ReadInt();
    unsigned long mantissa = ReadUnsigned();
    float value = (float)mantissa / INT_MAX;
    return ldexp (value, exponent);
}
The frexp() function returns a 'normalized' value in the range (±)[0.5 – 1.0). Clearly, this is not a range that can be properly represented in a variable of integral type (a simple cast of that value would always yield zero, as the range does not include ±1.0), so it has to be 'denormalized' (stretched) into a range that is fully representable.
Multiplying by INT_MAX will give (nearly) the greatest precision possible (assuming int and unsigned long have the same bit-width)†, without overflowing the range of the destination type (including the possibility of storing the representation of a negative value in that unsigned integer).
Note: One could get more precision by storing the sign of the normalized fraction, then subtracting 0.5 from its absolute value, re-applying the sign and multiplying by 2.0 * INT_MAX (I think this will be safe) … but the precision gain (1 bit) is likely not worth the extra effort in coding (and decoding) the stored value.
† On many platforms, the int and long types are the same size; however, this is not required so, as mentioned in the comments, using LONG_MAX as the multiplier/divisor would potentially offer greater precision; however, that may be overkill, depending on how many bits of mantissa there are in the source. If it's an IEEE-754 single-precision float, it will have 23 bits, so a 16-bit int type would lose out, but a 64-bit LONG_MAX would be over-cooking.
In the below program:
union
{
    int i;
    float f;
} u;
Assuming a 32-bit compiler, u is allocated 4 bytes in memory.
u.f = 3.14159f;
3.14159f is represented using IEEE 754, in those 4 bytes.
printf("As integer: %08x\n", u.i);
What does u.i represent here? Is IEEE 754 binary representation interpreted as 4 byte signed int?
Reading from i is implementation-defined blah blah blah.
Still.
On "normal" platforms where
float is IEEE-754 binary32 format
int is 32 bit 2's complement
the endianness of float and int is the same
type punning through unions is well defined (C99+)
(AKA any "regular" PC with a recent enough compiler)
you will get the integer whose bit pattern matches the one of your original float, which is described e.g. here
Now, there's the sign bit that messes up stuff with the 2's complement representation of int, so you probably want to use an unsigned type to do this kind of experimentation. Also, memcpy is a safer way to perform type-punning (you won't get dirty looks and discussions about the standard), so if you do something like:
float x = 1234.5678;
uint32_t x_u;
memcpy(&x_u, &x, sizeof x_u);
Now you can easily extract the various parts of the FP representation:
int sign = x_u >> 31;                // 0 = positive; 1 = negative
int exponent = (x_u >> 23) & 0xff;   // apply -127 bias to obtain actual exponent
int mantissa = x_u & 0x7fffff;       // low 23 bits
(notice that this ignores completely all the "magic" patterns - quiet and signaling NaNs and subnormal numbers come to mind)
According to this answer, reading from any element of the union other than the last one written is either undefined behavior or implementation defined behavior depending on the version of the standard.
If you want to examine the binary representation of 3.14159f, you can do so by casting the address of a float and then dereferencing.
#include <stdint.h>
#include <stdio.h>
int main(){
    float f = 3.14159f;
    printf("%x\n", *(uint32_t*) &f);
}
The output of this program is 40490fd0, which matches with the result given by this page.
As interjay correctly pointed out, the technique I present above violates the strict aliasing rule. To make the above code work correctly, one must pass the flag -fno-strict-aliasing to gcc or the equivalent flag to disable optimizations based on strict aliasing on other compilers.
Another way of viewing the bytes which does not violate strict aliasing and does not require the flag is using a char * instead.
unsigned char* cp = (unsigned char*) &f;
printf("%02x%02x%02x%02x\n",cp[0],cp[1],cp[2],cp[3]);
Note that on little endian architectures such as x86, this will produce bytes in the opposite order as the first suggestion.
I am curious what this function does and why it's useful. I know this does type conversion of float to integer; any detailed explanation would be appreciated.
unsigned int func(float t)
{
    return *(unsigned int *)&t;
}
Thanks
Assuming a float and an unsigned int are the same size, it gives an unsigned int value that is represented using the same binary representation (underlying bits) as the supplied float.
The caller can then apply bitwise operations to the returned value, and access the individual bits (e.g. the sign bit, the bits that make up the exponent and mantissa) separately.
The mechanics are that (unsigned int *) converts &t into a pointer to unsigned int. The * then obtains the value at that location. That last step formally has undefined behaviour.
For an implementation (compiler) for which float and unsigned int have different sizes, the behaviour could be anything.
It returns the unsigned integer whose binary representation is the same as the binary representation of the given float.
uint_var = func(float_var);
is essentially equivalent to:
memcpy(&uint_var, &float_var, sizeof(uint_var));
Type punning like this results in undefined behavior, so code like this is not portable. However, it's not uncommon in low-level programming, where the implementation-dependent behavior of the compiler is known.
This doesn't exactly convert a float to an int per se. On most (practically all) platforms, a float is a 32-bit entity with the following four bytes:
Sign+7bits of exponent
8thBitOfExponent+first7bitsOfMantissa
Next 8 of mantissa
Last 8 of mantissa
Whereas an unsigned is just 32 bits of number (in endianness dictated by platform).
A straight float->unsigned int conversion would try to shoehorn the actual value of the float into the closest unsigned it can fit inside. This code straight copies the bits that make up the float without trying to interpret what they mean. So 1.0f translates to 0x3f800000 (assuming big endian).
The above makes a fair number of grody assumptions about platform (on some platforms, you'll have a size mismatch and could end up with truncation or even memory corruption :-( ). I'm also not exactly sure why you'd want to do this at all (maybe to do bit ops a bit easier? Serialization?). Anyway, I'd personally prefer doing an explicit memcpy() to make it more obvious what's going on.
Hi I have two questions:
uint64_t vs double, which has a higher range limit for covering positive numbers?
How to convert double into uint64_t if only the whole number part of double is needed.
Direct casting apparently doesn't work due to how double is defined.
Sorry for any confusion, I'm talking about the 64-bit double in C on a 32-bit machine.
As for an example:
//operation for convertion I used:
double sampleRate = (
(union { double i; uint64_t sampleRate; })
{ .i = r23u.outputSampleRate}
).sampleRate;
//the following are printouts on command line
// double uint64_t
//printed by %.16llx %.16llx
outputSampleRate 0x41886a0000000000 0x41886a0000000000 sampleRate
//printed by %f %llu
outputSampleRate 51200000.000000 4722140757530509312 sampleRate
So the two numbers share the same bit pattern, but when printed out as decimals, the uint64_t is totally wrong.
Thank you.
uint64_t vs double, which has a higher range limit for covering positive numbers?
uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2^64 - 1, inclusive.
Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C does not require nor even endorse that format. It is so common, however, that it is fairly safe to assume that format, and maybe to just put in some compile-time checks against the macros defining FP characteristics. I will assume for the balance of this answer that the C implementation indeed does use that representation.
IEEE-754 binary double precision provides 53 bits of mantissa, therefore it can represent all integers between 0 and 2^53 - 1. It is a floating-point format, however, with an 11-bit binary exponent. The largest number it can represent is (2^53 - 1) * 2^971, or nearly 2^1024. In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
How to convert double into uint64_t if only the whole number part of double is needed
You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:
double my_double = 1.2345678e48;
uint64_t my_uint;
uint64_t my_other_uint;
my_uint = my_double;
my_other_uint = (uint64_t) my_double;
Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.
The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.
double can hold substantially larger numbers than uint64_t, as the positive value range for an 8-byte IEEE 754 double runs from about 4.94065645841246544e-324 to 1.79769313486231570e+308 (and likewise for negatives). However, if you do addition of small values in that range, you will be in for a surprise, because at some point the precision will not be able to represent e.g. an addition of 1 and will round down to the lower value, essentially making a loop steadily incremented by 1 non-terminating.
This code for example:
#include <stdio.h>
int main()
{
    for (double i = 100000000000000000000000000000000.0; i < 1000000000000000000000000000000000000000000000000.0; ++i)
        printf("%lf\n", i);
    return 0;
}
gives me a constant output of 100000000000000005366162204393472.000000. That's also why we have nextafter and nexttoward functions in math.h. You can also find ceil and floor functions there, which, in theory, will allow you to solve your second problem: removing the fraction part.
However, if you really need to hold large numbers you should look at bigint implementations instead, e.g. GMP. Bigints were designed to do operations on very large integers, and operations like an addition of one will truly increment the number even for very large values.
I implemented a fast-power algorithm to perform the operation (x^y) % m.
Since m is very large (4294434817), I used long long to store the result. However, long long still seems not enough during the operation. For example, I got a negative number for (3623752876 * 3623752876) % 4294434817.
Is there any way to figure this out?
All three of those constants are between 2^31 and 2^32.
The type unsigned long long is guaranteed to be able to store values up to at least 2^64 - 1, which exceeds the product 3623752876 * 3623752876.
So just use unsigned long long for the calculation. long long is wide enough to hold the individual constants, but not the product.
You could also use uint64_t, defined in <stdint.h>. Unlike unsigned long long, it's guaranteed to be exactly 64 bits wide. Since you don't really need an exact width of 64 bits (128-bit arithmetic would work just as well), uint_least64_t or uint_fast64_t is probably more suitable. But unsigned long long is arguably simpler, and in this case it will work correctly. (uint64_t is not guaranteed to exist, though on any C99 or later implementation it almost certainly will.)
For larger values, including intermediate results, you'll likely need to use something wider than unsigned long long, which is likely to require some kind of multi-precision arithmetic. The GNU GMP library is one possibility. Another is to use a language that has built-in support for arbitrary-width integer arithmetic (such as Python).
This answer is based on the calculation (x * x) % y although the question is not entirely clear.
Use uint64_t because, although unsigned int is large enough to hold the operands and the result, it won't hold the product.
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    unsigned x = 3623752876;
    unsigned m = 4294434817;
    uint64_t r;
    r = ((uint64_t)x * x) % m;
    printf("%u\n", (unsigned)r);
    return 0;
}
Program output:
3896043471
We can use the power of modulus arithmetic to do such calculations. The fundamental property of multiplication in modulus arithmetic for two numbers a and b states:
(a*b)%m = ((a%m)*(b%m))%m
As long as (m - 1)^2 fits into the type used for the arithmetic, the right-hand side will never overflow, no matter how large a and b are.
Since you are trying to do modular exponentiation, you can read more about it here.