C - how to store IEEE 754 double and single precision

I have to work with IEEE 754 double and single precision numbers. I have no idea how to work with them correctly.
I've got a buffer of binary data and I want to get the numbers like
uint8_t buffer[] = {................};
//data I want are at 8th position (IEEE754_t is my imaginary format)
IEEE754double_t first8bytes = *(IEEE754double_t*)(buffer + 8);
IEEE754single_t next4bytes = *(IEEE754single_t*)(buffer + 16);
What do I put instead of IEEE754double_t and IEEE754single_t? Is it possible to do it with double and float? And if so, how can I guarantee that they will be 8 and 4 bytes long on every platform?

First of all, you cannot do the pointer cast hack. There is absolutely no guarantee that your byte buffer is correctly aligned. This results in undefined behaviour. Use memcpy instead:
memcpy(&first8bytes, &buffer[8], 8);
memcpy(&next4bytes, &buffer[16], 4);
What do I put instead of IEEE754double_t and IEEE754single_t? Is it possible to do it with double and float?
Yes, it's possible, if:
double is 8 bytes, float is 4 bytes, and both use IEEE 754 representation.
The data in the buffer uses the same byte order as your host. If not, copy into a temporary unsigned integer type first and fix the endianness there.
And if so, how can I guarantee, that they will be 8 and 4 bytes long on every platform?
Use a static assertion to detect when it's not (C11 provides _Static_assert, and static_assert via <assert.h>). For example:
static_assert(sizeof(float) == 4, "invalid float size");

Related

Convert Long To Double, Unexpected Results

I am using very basic code to convert a string into a long and into a double. The CAN library I am using requires a double as an input. I am attempting to send the device ID as a double to another device on the CAN network.
If I use an input string that is 6 bytes long, the long and double values are the same. If I add a 7th byte to the string, the values are slightly different.
I do not think I am hitting a max value limit. This code is run with ceedling for an automated test. The same behaviour is seen when sending this data across my CAN communications. In main.c the issue is not observed.
The test is:
void test_can_hal_get_spn_id(void){
    struct dbc_id_info ret;
    memset(&ret, NULL_TERMINATOR, sizeof(struct dbc_id_info));

    char expected_str[8] = "smg123";
    char out_str[8];
    memset(&out_str, 0, 8);

    uint64_t long_val = 0;
    double phys = 0.0;
    memcpy(&long_val, expected_str, 8);
    phys = long_val;

    printf("long %ld \n", long_val);
    printf("phys %f \n", phys);

    uint64_t temp = (uint64_t)phys;
    memcpy(&out_str, &temp, 8);
    printf("%s\n", expected_str);
    printf("%s\n", out_str);
}
With the input = "smg123"
[test_can_hal.c]
- "long 56290670243187 "
- "phys 56290670243187.000000 "
- "smg123"
- "smg123"
With the input "smg1234"
[test_can_hal.c]
- "long 14692989459197299 "
- "phys 14692989459197300.000000 "
- "smg1234"
- "tmg1234"
Is this error just due to how floats are handled and rounded? Is there a way to test for that? Am I doing something fundamentally wrong?
Representing the char array as a double without the intermediate long solved the issue. For clarity I am using DBCPPP. I am using it in C. I should clarify my CAN library comes from NXP, DBCPPP allows my application to read a DBC file and apply the data scales and factors to my raw CAN data. DBCPPP accepts doubles for all data being encoded and returns doubles for all data being decoded.
The CAN library I am using requires a double as an input.
That sounds surprising, but if so, then why are you involving a long as an intermediary between your string and double?
If I use an input string that is 6 bytes long, the long and double values are the same. If I add a 7th byte to the string, the values are slightly different.
double is a floating point data type. To be able to represent values with a wide range of magnitudes, some of its bits are used to represent scale, and the rest to represent significant digits. A typical C implementation uses doubles with 53 bits of significand. It cannot exactly represent numbers with more than 53 significant binary digits. That's enough for 6 bytes, but not enough for 7.
I do not think I am hitting a max value limit.
Not a maximum value limit. A precision limit. A 64-bit long has smaller numeric range but more significant digits than an IEEE-754 double.
So again, what role is the long supposed to be playing in your code? If the objective is to get eight bytes of arbitrary data into a double, then why not go directly there? Example:
char expected_str[8] = "smg1234";
char out_str[8] = {0};
double phys = 0.0;
memcpy(&phys, expected_str, 8);
printf("phys %.14e\n", phys);
memcpy(&out_str, &phys, 8);
printf("%s\n", expected_str);
printf("%s\n", out_str);
Do note, however, that there is some risk when (mis)using a double this way. It is possible for the data you put in to constitute a trap representation (a signaling NaN might be such a representation, for example). Handling such a value might cause a trap, or cause the data to be corrupted, or possibly produce other misbehavior. It is also possible to run into numeric issues similar to the one in your original code.
Possibly your library provides some relevant guarantees in that area. I would certainly hope so if doubles are really its sole data type for communication. Otherwise, you could consider using multiple doubles to convey data payloads larger than 53 bits, each of which you could consider loading via your original technique.
If you have a look at the IEEE-754 Wikipedia page, you'll see that the double precision values have a precision of "[a]pproximately 16 decimal digits". And that's roughly where your problem seems to appear.
Specifically, though it's a 64-bit type, it does not have the necessary encoding to provide 2^64 distinct floating point values. There are many bit patterns that map to the same value.
For example, in single precision, NaN is encoded as an exponent field of binary 1111 1111 with a non-zero fraction (23 bits), regardless of the sign bit. That's 2 × (2^23 − 1) (over 16 million) distinct bit patterns all representing NaN.
So, yes, your "due to how floats are handled and rounded" comment is correct.
In terms of fixing it, you'll either have to limit your strings to values that can be represented by doubles exactly, or find a way to send the strings across the CAN bus.
For example (if you can't send strings), two 32-bit integers could represent an 8-character string value with zero chance of information loss.

Converting int16 to float in C

How do I convert a 16-bit int to a floating point number?
I have a signed 16-bit variable which I'm told I need to display with an accuracy of 3 decimal places, so I presume this would involve a conversion to float?
I've tried the below, which just copies my 16 bits into a float, but this doesn't seem right.
float myFloat = 0;
int16_t myInt = 0x3e00;
memcpy(&myFloat, &myInt, sizeof(int));
I've also read about the Half-precision floating-point format but am unsure how to handle this... if i need to.
I'm using GCC.
update:
The source of the data is a char array of size 2, which I get from an I2C interface. I then stitch this together into a signed int.
Can anyone help?
I have a signed 16 bit variable which i'm told i need to display with
an accuracy of 3 decimal places
If someone told you that the integer value can be displayed this way, he or she should start with a C beginners' course.
The only possibility is that the integer value has been scaled (multiplied). For example the value of 12.456 can be stored in the integer if multiplied by 1000. If this is the case:
float flv;
int intv = 12456;
flv = (float)intv / 1000.0f;
You can also print this scaled integer without converting to float:
printf("%s%d.%03d\n", intv < 0 ? "-": "", abs(intv / 1000), abs(intv % 1000));

Join two integers into one double

I need to transfer a double value (-47.1235648, for example) using sockets. Since I'll have a lot of platforms, I must convert to network byte order to ensure correct endianness on all ends... but this conversion doesn't accept double, just integer and short, so I'm 'cutting' my double into two integers to transfer, like this:
double lat = -47.848945;
int a;
int b;
a = (int)lat;
b = (int)(lat+1);
Now, I need to restore this on the other end, using as little computation as possible (I saw some examples using pow, but it looks like pow uses a lot of resources for this; I'm not sure). Is there any way to join this as simply as possible, for example with bit manipulation?
Your code makes no sense.
The typical approach is to use memcpy():
const double lat = -47.848945;
uint32_t ints[sizeof lat / sizeof (uint32_t)];
memcpy(ints, &lat, sizeof lat);
Now send the elements of ints, which are just 32-bit unsigned integers.
This of course assumes:
That you know how to send uint32_ts in a safe manner, i.e. byte per byte or using endian-conversion functions.
That all hosts share the same binary double format (typically IEEE-754).
That you somehow can manage the byte order requirements when moving to/from a pair of integers from/to a single double value (see @JohnBollinger's answer).
I interpreted your question to mean all of these assumptions were safe, that might be a bit over the top. I can't delete this answer as long as it's accepted.
It's good that you're considering differences in numeric representation, but your idea for how to deal with this problem just doesn't work reliably.
Let us suppose that every machine involved uses 64-bit IEEE-754 format for its double representation. (That's already a potential point of failure, but in practice you probably don't have to worry about failures there.) You seem to postulate that the byte order for machines' doubles will map in a consistent way onto the byte order for their integers, but that is not a safe assumption. Moreover, even where that assumption holds true, you need exactly the right kind of mapping for your scheme to work, and that is not only not safe to assume, but very plausibly will not be what you actually see.
For the sake of argument, suppose machine A, which features big-endian integers, wants to transfer a double value to machine B, which features little-endian integers. Suppose further that on B, the byte order for its double representation is the exact reverse of the order on A (which, again, is not safe to assume). Thus, if on A, the bytes of that double are in the order
S T U V W X Y Z
then we want them to be in order
Z Y X W V U T S
on B. Your approach is to split the original into a pair (STUV, WXYZ), transfer the pair in a value-preserving manner to get (VUTS, ZYXW), and then put the pair back together to get ... uh oh ...
V U T S Z Y X W
. Don't imagine fixing that by first swapping the pair. That doesn't serve your purpose because you must avoid such a swap in the event that the two communicating machines have the same byte order, and you have no way to know from just the 8 bytes whether such a swap is needed. Thus even if we make simplifying assumptions that we know to be unsafe, your strategy is insufficient for the task.
Alternatives include:
transfer your doubles as strings.
transfer your doubles as integer (significand, scale) pairs. The frexp() and ldexp() functions can help with encoding and decoding such representations.
transfer an integer-based fixed-point representation of your doubles (the same as the previous option, but with pre-determined scale that is not transferred)
I need to transfer a double value (-47.1235648, for example) using sockets.
If the platforms have potentially different codings for double, then sending a bit pattern of the double is a problem. If code wants portability, a less than "just copy the bits" approach is needed. An alternative is below.
If the platforms always have the same double format, just copy the n bits; see @Rishikesh Raje's answer for an example.
In detail, OP's problem is only loosely defined. On many platforms, a double is a binary64, yet this is not required by C. That double can represent about 2^64 different values exactly. Neither -47.1235648 nor -47.848945 is one of those. So it is possible OP does not have a strong precision concern.
"using the minimum computation as possible" implies minimal code, usually to have minimal time. For speed, any solution should be rated on order of complexity and with code profiling.
A portable method is to send via a string. This approach addresses correctness and best possible precision first and performance second. It removes endian issues, as the data is sent as a string, and there is no precision/range loss in sending the data. The receiving side, if it uses the same double format, will re-form the double exactly. A machine with a different double format still has a good string representation to do the best it can.
// needs <stdio.h>, <float.h>, <limits.h>
// some ample sized buffer
#define N (sizeof(double)*CHAR_BIT)

double x = foo();
char buf[N];
#if FLT_RADIX == 10
// Rare base-10 platforms
// If macro DBL_DECIMAL_DIG is not available, use (DBL_DIG + 3)
sprintf(buf, "%.*e", DBL_DECIMAL_DIG - 1, x);
#else
// print mantissa in hexadecimal notation with power-of-2 exponent
sprintf(buf, "%a", x);
#endif
bar_send_string(buf);
bar_send_string(buf);
To reconstitute the double
char *s = foo_get_string();
double y;
// %lf decodes strings in decimal (f), exponential (e) or hexadecimal/exponential (a) notation
if (sscanf(s, "%lf", &y) != 1) Handle_Error(s);
else use(y);
A much better idea would be to send the double directly as 8 bytes in network byte order.
You can use a union
typedef union
{
    double a;
    uint8_t bytes[8];
} DoubleUnionType;

DoubleUnionType DoubleUnion;

// Assign the double by
DoubleUnion.a = -47.848945;
Then you can make a network byte order conversion function
void htonfl(uint8_t *out, uint8_t *in)
{
#if LITTLE_ENDIAN // Use macro name as per architecture
    out[0] = in[7];
    out[1] = in[6];
    out[2] = in[5];
    out[3] = in[4];
    out[4] = in[3];
    out[5] = in[2];
    out[6] = in[1];
    out[7] = in[0];
#else
    memcpy (out, in, 8);
#endif
}
And call this function before transmission and after reception.

Convert 2 bytes or 3 bytes to a float in C

I'm working with a low-level protocol that has a high data rate and therefore uses 2 or 3 bytes to represent a float depending on the range of the number, to make the system more efficient.
I'm trying to parse these numbers but the values I get don't make sense to me, they're all zero and I don't think the device would output zero for the variables in question.
The first 5 bytes in my buffer are: FF-FF-FF-FF-FF
The first two bytes make up a float t. It should be noted that the documentation says that the bytes are little endian.
To parse t I do:
float t = 0;
memcpy(buffer, &t, 2);
The next 3 bytes make up float ax, to parse that I do:
float ax = 0;
memcpy(buffer+2, &ax, 3);
Is this the correct way to handle this? I set both t and ax to zero first in case there are random bytes hanging around.
Update
The documentation is not great. Firstly they define a Float as a 32-bit IEEE 754 floating-point number.
Then there is this quote:
To increase efficiency many of the data packets are sent as 24-bit signed integer words
because 16-bits do not provide the range/precision required for many of the quantities,
whereas 32-bit precision makes the packet much longer than required.
Then there is a table which defines t as the first 2 bytes of the buffer. It states that the range is 0-59.999. It doesn't explicitly say that it's a Float, I'm just making that assumption.
Possibly you have the arguments the wrong way around. Change as follows:
memcpy (buffer, &t, 2);
to
memcpy (&t, buffer, 2);
HTH
[edit: To clarify, for all those people voting my answer down: the question does indeed have the arguments the wrong way around. It tries to read the buffer but specifies it as the destination for memcpy.]

I don't know how to convert 16 byte hexadecimal to floating point

I have been trying to do this conversion for a long time, wandering the internet solely for the answer to this question, but I could not find it. All I got is that I can convert hexadecimal to decimal either by some serious programming or manually through math.
I am looking to convert a 16-byte hexadecimal value to floating point. If there is any way to do that, then please share. I have searched and found IEEE 754, which seems not to be working, or I am not comprehending it. Can I do it manually through an equation (I think I heard about one), or with a neat C program?
Please help! Any help would be highly appreciated.
You need to study the IEEE floating point spec.
This would be quite straightforward in Java. You have handy methods like Float.floatToRawIntBits(float x) and Float.intBitsToFloat(int x)
In C it's a bit more hacky. You can abuse a union. Unions in C reuse the same memory for two different variables. A union like
union DoubleLong {
    long l;
    double d;
} u;
would allow you to treat the same bit of memory as either a long u.l or a double u.d. Both are 8 bytes, so they take the same space. So doing u.d = M_PI; printf("%lx\n", u.l); prints the binary representation of pi, 0x400921fb54442d18.
For 16 bytes we need the union to hold an array of two 8-byte longs.
#include <stdio.h>

union Data {
    long i[2];
    long double f;
} u;

int main(int argc, char const *argv[]) {
    // Using random IPv6 address 2602:306:cecd:7130:5421:a679:6d71:a660
    // Store in two separate 8-byte longs
    u.i[0] = 0x2602306cecd7130;
    u.i[1] = 0x5421a6796d71a660;
    // Print out in hexadecimal
    printf("%.15La %lx %lx\n", u.f, u.i[0], u.i[1]);
    // Print out in decimal
    printf("%.15Le %ld %ld\n", u.f, u.i[0], u.i[1]);
    return 0;
}
One problem is that 16-byte floating point numbers might not be defined on your system. float is typically 32-bit (4 bytes) and double is 64-bit (8 bytes). There is a long double type, but on my Mac it's only 80-bit (10 bytes). It might be simpler to convert to two double precision numbers. So on my system, only the last 4 hexadecimal digits of the second number are significant.
Not all hexadecimal numbers correspond to valid floating point numbers; a lot of values correspond to NaNs. If the higher bits are 7FFF or FFFF (or 7FF, FFF for double), that will give either infinity or NaN.
