Join two integers into one double - c

I need to transfer a double value (-47.1235648, for example) using sockets. Since the code will run on many platforms, I must convert to network byte order to get the endianness right on all ends... but that conversion doesn't accept double, just integers and shorts, so I'm 'cutting' my double into two integers to transfer, like this:
double lat = -47.848945;
int a;
int b;
a = (int)lat;
b = (int)(lat+1);
Now I need to restore this on the other end, using as little computation as possible (I saw some examples using pow(), but it looks like pow() uses a lot of resources for this; I'm not sure). Is there any way to join these as simply as possible, like bit manipulation?

Your code makes no sense.
The typical approach is to use memcpy():
#include <stdint.h>
#include <string.h>

const double lat = -47.848945;
uint32_t ints[sizeof lat / sizeof (uint32_t)];
memcpy(ints, &lat, sizeof lat);
Now send the elements of ints, which are just 32-bit unsigned integers.
This of course assumes:
That you know how to send uint32_ts in a safe manner, i.e. byte per byte (a sketch follows below) or using endian-conversion functions.
That all hosts share the same binary double format (typically IEEE-754).
That you somehow can manage the byte order requirements when moving to/from a pair of integers from/to a single double value (see @JohnBollinger's answer).
I interpreted your question to mean all of these assumptions were safe, which might be a bit over the top. I can't delete this answer as long as it's accepted.
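For illustration, here is a minimal sketch of the byte-per-byte option mentioned above; send_byte() is a hypothetical stand-in for whatever your transport layer provides:
// Sketch: emit one uint32_t in network (big-endian) byte order,
// one byte at a time, independent of host endianness.
void send_u32_be(uint32_t v)
{
    send_byte((uint8_t)(v >> 24));
    send_byte((uint8_t)(v >> 16));
    send_byte((uint8_t)(v >> 8));
    send_byte((uint8_t)(v >> 0));
}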

It's good that you're considering differences in numeric representation, but your idea for how to deal with this problem just doesn't work reliably.
Let us suppose that every machine involved uses 64-bit IEEE-754 format for its double representation. (That's already a potential point of failure, but in practice you probably don't have to worry about failures there.) You seem to postulate that the byte order for machines' doubles will map in a consistent way onto the byte order for their integers, but that is not a safe assumption. Moreover, even where that assumption holds true, you need exactly the right kind of mapping for your scheme to work, and that is not only not safe to assume, but very plausibly will not be what you actually see.
For the sake of argument, suppose machine A, which features big-endian integers, wants to transfer a double value to machine B, which features little-endian integers. Suppose further that on B, the byte order for its double representation is the exact reverse of the order on A (which, again, is not safe to assume). Thus, if on A, the bytes of that double are in the order
S T U V W X Y Z
then we want them to be in order
Z Y X W V U T S
on B. Your approach is to split the original into a pair (STUV, WXYZ), transfer the pair in a value-preserving manner to get (VUTS, ZYXW), and then put the pair back together to get ... uh oh ...
V U T S Z Y X W.
Don't imagine fixing that by first swapping the pair. That doesn't serve your purpose, because you must avoid such a swap in the event that the two communicating machines have the same byte order, and you have no way to know from just the 8 bytes whether such a swap is needed. Thus, even if we make simplifying assumptions that we know to be unsafe, your strategy is insufficient for the task.
Alternatives include:
transfer your doubles as strings.
transfer your doubles as integer (significand, scale) pairs. The frexp() and ldexp() functions can help with encoding and decoding such representations (see the sketch after this list).
transfer an integer-based fixed-point representation of your doubles (the same as the previous option, but with pre-determined scale that is not transferred)
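As an illustration of the (significand, scale) option, here is a sketch that assumes the significand fits in an int64_t (true for IEEE-754 binary64, whose significand has 53 bits):
#include <math.h>
#include <stdint.h>

// Sketch: encode a double as a (significand, exponent) integer pair.
void encode_double(double x, int64_t *significand, int32_t *exponent)
{
    int exp;
    double frac = frexp(x, &exp);            // x == frac * 2^exp, 0.5 <= |frac| < 1 (or 0)
    *significand = (int64_t)ldexp(frac, 53); // scale the fraction up to an integer
    *exponent = exp - 53;                    // compensate for the scaling
}

// Sketch: rebuild the double from the transferred pair.
double decode_double(int64_t significand, int32_t exponent)
{
    return ldexp((double)significand, exponent); // significand * 2^exponent
}
The two integers can then be transferred with the usual endian-conversion machinery, since they are plain integers.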

I need to transfer a double value (-47.1235648, for example) using sockets.
If the platforms have potentially different encodings for double, then sending the bit pattern of the double is a problem. If the code is to be portable, something other than a "just copy the bits" approach is needed. An alternative is below.
If the platforms always have the same double format, just copy the n bits; see @Rishikesh Raje's answer for an example.
In detail, OP's problem is only loosely defined. On many platforms a double is a binary64, yet C does not require that. Such a double can represent about 2^64 different values exactly; neither -47.1235648 nor -47.848945 is one of them, so it is possible OP does not have a strong precision concern.
"using the minimum computation as possible" implies minimal code, usually to have minimal time. For speed, any solution should be rated on order of complexity and with code profiling.
A portable method is to send via a string. This approach addresses correctness and best possible precision first, and performance second. It removes endian issues, since the data is sent as a string, and there is no precision/range loss in sending the data. The receiving side, if it uses the same double format, will re-form the double exactly. A machine with a different double format still receives a good string representation from which to do the best it can.
#include <float.h>
#include <limits.h>
#include <stdio.h>

// some ample-sized buffer
#define N (sizeof(double)*CHAR_BIT)

double x = foo();
char buf[N];
#if FLT_RADIX == 10
// Rare base-10 platforms
// If the macro DBL_DECIMAL_DIG is not available, use (DBL_DIG + 3)
sprintf(buf, "%.*e", DBL_DECIMAL_DIG - 1, x);
#else
// Print the significand in hexadecimal notation with a power-of-2 exponent
sprintf(buf, "%a", x);
#endif
bar_send_string(buf);
To reconstitute the double
char *s = foo_get_string();
double y;
// %lf decodes strings in decimal (f), exponential (e), or hexadecimal-exponential (a) notation
if (sscanf(s, "%lf", &y) != 1) Handle_Error(s);
else use(y);

A much better idea would be to send the double directly as 8 bytes in network byte order.
You can use a union
typedef union
{
    double a;
    uint8_t bytes[8];
} DoubleUnionType;

DoubleUnionType DoubleUnion;
//Assign the double by
DoubleUnion.a = -47.848945;
Then you can make a network byte order conversion function
#include <stdint.h>
#include <string.h>

void htonfl(uint8_t *out, const uint8_t *in)
{
#if LITTLE_ENDIAN // Use the macro name appropriate for your architecture
    out[0] = in[7];
    out[1] = in[6];
    out[2] = in[5];
    out[3] = in[4];
    out[4] = in[3];
    out[5] = in[2];
    out[6] = in[1];
    out[7] = in[0];
#else
    memcpy(out, in, 8);
#endif
}
And call this function before transmission and after reception.
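For example (a sketch; send_bytes() is a hypothetical transmission call):
uint8_t wire[8];
DoubleUnion.a = -47.848945;
htonfl(wire, DoubleUnion.bytes); // reorder into network byte order
send_bytes(wire, 8);             // hand the 8 bytes to the transport layer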

Related

Convert Long To Double, Unexpected Results

I am using very basic code to convert a string into a long and then into a double. The CAN library I am using requires a double as an input. I am attempting to send the device ID as a double to another device on the CAN network.
If I use an input string that is 6 bytes long, the long and double values are the same. If I add a 7th byte to the string, the values are slightly different.
I do not think I am hitting a maximum value limit. This code is run with Ceedling for an automated test. The same behaviour is seen when sending this data across my CAN communications. In main.c the issue is not observed.
The test is:
void test_can_hal_get_spn_id(void){
    struct dbc_id_info ret;
    memset(&ret, NULL_TERMINATOR, sizeof(struct dbc_id_info));

    char expected_str[8] = "smg123";
    char out_str[8];
    memset(&out_str, 0, 8);

    uint64_t long_val = 0;
    double phys = 0.0;

    memcpy(&long_val, expected_str, 8);
    phys = long_val;

    printf("long %ld \n", long_val);
    printf("phys %f \n", phys);

    uint64_t temp = (uint64_t)phys;
    memcpy(&out_str, &temp, 8);

    printf("%s\n", expected_str);
    printf("%s\n", out_str);
}
With the input = "smg123"
[test_can_hal.c]
- "long 56290670243187 "
- "phys 56290670243187.000000 "
- "smg123"
- "smg123"
With the input "smg1234"
[test_can_hal.c]
- "long 14692989459197299 "
- "phys 14692989459197300.000000 "
- "smg1234"
- "tmg1234"
Is this error just due to how floats are handled and rounded? Is there a way to test for that? Am I doing something fundamentally wrong?
Representing the char array as a double without the intermediate long solved the issue. For clarity, I am using DBCPPP, in C; my CAN library itself comes from NXP. DBCPPP allows my application to read a DBC file and apply the scales and factors to my raw CAN data, and it accepts doubles for all data being encoded and returns doubles for all data being decoded.
The CAN library I am using requires a double as an input.
That sounds surprising, but if so, then why are you involving a long as an intermediary between your string and double?
If I use an input string of that is 6 bytes long the long and double values are the same. If I add a 7th byte to the string the values are slightly different.
double is a floating point data type. To be able to represent values with a wide range of magnitudes, some of its bits are used to represent scale, and the rest to represent significant digits. A typical C implementation uses doubles with 53 bits of significand. It cannot exactly represent numbers with more than 53 significant binary digits. That's enough for 6 bytes, but not enough for 7.
I do not think I am hitting a max value limit.
Not a maximum value limit. A precision limit. A 64-bit long has smaller numeric range but more significant digits than an IEEE-754 double.
So again, what role is the long supposed to be playing in your code? If the objective is to get eight bytes of arbitrary data into a double, then why not go directly there? Example:
char expected_str[8] = "smg1234";
char out_str[8] = {0};
double phys = 0.0;
memcpy(&phys, expected_str, 8);
printf("phys %.14e\n", phys);
memcpy(&out_str, &phys, 8);
printf("%s\n", expected_str);
printf("%s\n", out_str);
Do note, however, that there is some risk when (mis)using a double this way. It is possible for the data you put in to constitute a trap representation (a signaling NaN might be such a representation, for example). Handling such a value might cause a trap, or cause the data to be corrupted, or possibly produce other misbehavior. It is also possible to run into numeric issues similar to the one in your original code.
Possibly your library provides some relevant guarantees in that area. I would certainly hope so if doubles are really its sole data type for communication. Otherwise, you could consider using multiple doubles to convey data payloads larger than 53 bits, each of which you could consider loading via your original technique.
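A sketch of that splitting idea: carry the 8 bytes in two doubles, 4 bytes each, so every value stays well inside the 53-bit exact-integer range (the function names are hypothetical):
#include <stdint.h>
#include <string.h>

// Sketch: 8 bytes -> two doubles, each holding a value below 2^32 < 2^53.
void bytes_to_doubles(const char s[8], double *d0, double *d1)
{
    uint32_t lo, hi;
    memcpy(&lo, s, 4);
    memcpy(&hi, s + 4, 4);
    *d0 = lo; // exact conversion: the value fits in 53 bits
    *d1 = hi;
}

// Sketch: reverse the transformation on the receiving side.
void doubles_to_bytes(char s[8], double d0, double d1)
{
    uint32_t lo = (uint32_t)d0, hi = (uint32_t)d1;
    memcpy(s, &lo, 4);
    memcpy(s + 4, &hi, 4);
}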
If you have a look at the IEEE-754 Wikipedia page, you'll see that the double precision values have a precision of "[a]pproximately 16 decimal digits". And that's roughly where your problem seems to appear.
Specifically, though it's a 64-bit type, it does not have the necessary encoding to provide 2^64 distinct floating point values. There are many bit patterns that map to the same value.
For example (using single precision for illustration), NaN is encoded as an exponent field of binary 1111 1111 with a non-zero fraction (23 bits), regardless of the sign (one bit). That's 2 * (2^23 - 1), over 16 million, distinct bit patterns representing NaN.
So, yes, your "due to how floats are handled and rounded" comment is correct.
In terms of fixing it, you'll either have to limit your strings to values that can be represented by doubles exactly, or find a way to send the strings across the CAN bus.
For example (if you can't send strings), two 32-bit integers could represent an 8-character string value with zero chance of information loss.
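A sketch of that last option, built byte by byte so it does not depend on host endianness (the function names are hypothetical):
#include <stdint.h>

// Sketch: pack an 8-character value into two 32-bit integers, losslessly.
void pack8(const char s[8], uint32_t *hi, uint32_t *lo)
{
    *hi = ((uint32_t)(uint8_t)s[0] << 24) | ((uint32_t)(uint8_t)s[1] << 16)
        | ((uint32_t)(uint8_t)s[2] << 8)  |  (uint32_t)(uint8_t)s[3];
    *lo = ((uint32_t)(uint8_t)s[4] << 24) | ((uint32_t)(uint8_t)s[5] << 16)
        | ((uint32_t)(uint8_t)s[6] << 8)  |  (uint32_t)(uint8_t)s[7];
}

// Sketch: unpack the two integers back into the 8 characters.
void unpack8(char s[8], uint32_t hi, uint32_t lo)
{
    s[0] = (char)(hi >> 24); s[1] = (char)(hi >> 16);
    s[2] = (char)(hi >> 8);  s[3] = (char)hi;
    s[4] = (char)(lo >> 24); s[5] = (char)(lo >> 16);
    s[6] = (char)(lo >> 8);  s[7] = (char)lo;
}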

How to force compiler to promote variable to float when doing maths

I've got a question about math in C; quick example below:
uint32_t desired_val;
uint16_t target_us=1500;
uint32_t period_us=20000;
uint32_t tempmod=37500;
desired_val = (((target_us)/period_us) * tempmod);
At the moment, (target_us/period_us) results in 0, which makes desired_val 0 as well. I don't want to make these variables float unless I really have to. I don't need anything after the decimal point, as the result will be saved into a 32-bit register.
Is it possible to get correct results from this equation without declaring target_us or period_us as float? I want to use fixed-point calculations when possible and floating point only when it's needed.
I'm working on a Cortex-M4, if that helps.
Do the multiplication first.
You should split it into two statements with a temporary variable, to ensure the desired order of operations (parentheses ensure proper grouping, but not order).
uint64_t tempprod = (uint64_t)target_us * tempmod;
desired_val = tempprod / period_us;
I've also used uint64_t for the temporary, in case the product overflows. There's still a problem if the desired value doesn't fit into 32 bits; hopefully the data precludes that.
You'll probably have to do some casting in any case, but there's two different methods. First, stick with integers and do the multiplication first:
desired_val = ((uint64_t)target_us * tempmod) / period_us;
or do the calculations in floating point:
desired_val = (uint32_t)(((double)target_us / period_us) * tempmod);
You can do the computation with double quite easily:
desired_val = (double)target_us * tempmod / period_us;
float would be a mistake, since it has far too little precision to be reliable.
You might want to round that off to the nearest integer rather than letting it be truncated:
#include <math.h>
desired_val = round((double)target_us * tempmod / period_us);
See man round
You could, of course, do the computation using a wider integer type (for example, replacing the double with int64_t or long long). That will make rounding slightly trickier.
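For instance, a sketch of the rounded integer version: adding half the divisor before dividing rounds the quotient to the nearest integer instead of truncating it.
uint64_t prod = (uint64_t)target_us * tempmod;
desired_val = (uint32_t)((prod + period_us / 2) / period_us); // round to nearest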

Should a custom int representation of a float be run through htons before sending?

I've recently enjoyed reading Beej's Guide to Network Programming. In section 7.4 he talks about problems related to sending floats. He offers a simple (and naive) solution where he "packs" floats by converting them to uint32_t's:
uint32_t htonf(float f)
{
    uint32_t p;
    uint32_t sign;

    if (f < 0) { sign = 1; f = -f; }
    else { sign = 0; }

    p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
    p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction

    return p;
}

float ntohf(uint32_t p)
{
    float f = ((p>>16)&0x7fff); // whole part
    f += (p&0xffff) / 65536.0f; // fraction

    if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set

    return f;
}
Am I supposed to run the packed floats (that is, the results of htonf) through the standard htons before sending? If no, why not?
Beej doesn't mention this as far as I can tell. The reason I'm asking is that I cannot understand how the receiving machine can reliably reconstruct the uint32_ts that are to be passed to ntohf (the "unpacker") if the data isn't converted to network byte order before being sent.
Yes, you would also have to marshal the data in a defined byte order; the easiest way would be to use htonl.
But, aside from educational purposes, I'd really suggest staying away from this code. It has a very limited range, and silently corrupts most numbers. Also, it's really unnecessarily complicated for what it does. You might just as well multiply the float by 65536 and cast it to an int to send; cast to a float and divide by 65536.0 to receive. (As noted in a comment, it is even questionable whether the guide's code is educational: I'd say it is educational in the sense that critiquing it and/or comparing it with good code will teach you something: if nothing else, that not everything that glitters on the web is gold.)
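That simpler fixed-point scheme, as a sketch (it inherits similar range limits, roughly ±32768 with 1/65536 resolution):
// Sender: 16.16 fixed-point encoding; pass the result through htonl as usual.
int32_t wire = (int32_t)(f * 65536.0f);
// Receiver: decode after ntohl.
float f2 = (float)wire / 65536.0f;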
Almost all CPUs actually out there these days use IEEE-754 format floats, but I wouldn't use Beej's second solution either because it's unnecessarily slow; the standard library functions frexp and ldexp will reliably convert between a double and the corresponding mantissa and integer binary exponent. Or you can use ilogb* and scalb*, if you prefer that interface. You can find the appropriate bit length for the mantissa on the host machine through the macros FLT_MANT_DIG, DBL_MANT_DIG and LDBL_MANT_DIG (in float.h). [See note 1]
Coding floating point data transfer properly is a good way to start to understand floating point representations, which is definitely worthwhile. But if you just want to transmit floating point numbers over the wire and you don't have some idiosyncratic processor to support, I'd suggest just sending the raw bits of the float or double as a 4-byte or 8-byte integer (in whatever byte order you've selected as standard), restricting yourself to IEEE-754 32- and 64-bit representations (a sketch follows the note below).
Notes:
Implementation hint: frexp returns a mantissa between 0.5 and 1.0, but what you really want is an integer, so you should scale the mantissa by the correct power of 2 and subtract that from the binary exponent returned by frexp. The result is not really precision-dependent as long as you can transmit arbitrary precision integers, so you don't need to distinguish between float, double, or some other binary representation.
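A sketch of the raw-bits approach for a 32-bit float, assuming IEEE-754 on both ends (htonl/ntohl come from <arpa/inet.h> on POSIX systems):
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

// Sketch: send the raw IEEE-754 bits of a float as a big-endian uint32_t.
uint32_t float_to_wire(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); // copy the bit pattern, no conversion
    return htonl(bits);
}

// Sketch: rebuild the float from the received big-endian integer.
float wire_to_float(uint32_t wire)
{
    uint32_t bits = ntohl(wire);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}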
Run them through htonl (and vice versa), not htons.
These two functions, htonf and ntohf, are OK as far as they go (i.e., not very far at all), but their names are misleading. They produce a fixed-point 32-bit representation with 31 bits of that split up as: 15 bits of integer, 16 bits of fraction. The remaining bit holds the sign. This value is in the host's internal representation. (You could do the htonl etc right in the functions themselves, to fix this.)
Note that any float whose absolute value reaches or exceeds 32768, or is less than 2^-16 (0.0000152587890625), will be wrecked in the process of "network-izing", since those do not fit in a 15.16 format.
(Edit to add: It's better to use a packaged network-izer. Even something as old as the Sun RPC XDR routines will encode floating-point properly.)

Precisely convert float 32 to unsigned short or unsigned char

First of all, sorry if this is a duplicate; I couldn't find any question answering mine.
I'm coding a little program that will be used to convert 32-bit floating point values to short int (16-bit) and unsigned char (8-bit) values. This is for HDR imaging purposes.
From here I could get the following function (without clamping):
static inline uint8_t u8fromfloat(float x)
{
return (int)(x * 255.0f);
}
I suppose that in the same way we could get a short int by multiplying by (pow(2, 16) - 1).
But then I ended up thinking about ordered dithering, and especially Bayer dithering. To convert to uint8_t I suppose I could use a 4x4 matrix, and an 8x8 matrix for unsigned short.
I also thought of a Look-up table to speed-up the process, this way:
uint16_t LUT[0x10000]; // 2¹⁶ values contained
and store 2^16 unsigned short values corresponding to a float.
This same table could then be used for uint8_t as well, because of the implicit conversion between unsigned short ↔ unsigned int.
But wouldn't a look-up table like this be huge in memory? Also, how would one fill a table like this?
Now I'm confused, what would be best according to you?
EDIT after uwind's answer: Let's say now that I also want to do a basic color space conversion at the same time; that is, before converting to U8/U16, do a color space conversion (in float), and then shrink the result to U8/U16. Wouldn't using a LUT be more efficient in that case? And yeah, I would still have the problem of indexing into the LUT.
The way I see it, the look-up table won't help since in order to index into it, you need to convert the float into some integer type. Catch 22.
The table would require 0x10000 * sizeof (uint16_t) bytes, which is 128 KB. Not a lot by modern standards, but on the other hand cache is precious. But, as I said, the table doesn't add much to the solution since you need to convert float to integer in order to index.
You could do a table indexed by the raw bits of the float re-interpreted as integer, but that would have to be 32 bits which becomes very large (8 GB or so).
Go for the straight-forward runtime conversion you outlined.
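For instance, a minimal sketch of that straight-forward conversion for the 16-bit case, with clamping added (the question's snippet omitted it):
#include <stdint.h>

// Sketch: map x in [0, 1] onto the full uint16_t range, clamping
// out-of-range inputs and rounding to nearest.
static inline uint16_t u16fromfloat(float x)
{
    if (x <= 0.0f) return 0;
    if (x >= 1.0f) return 65535;
    return (uint16_t)(x * 65535.0f + 0.5f);
}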
Just stay with the multiplication - it'll work fine.
Practically all modern CPUs have vector instructions (SSE, AVX, ...) suited to this stuff, so you might look at programming for that, or use a compiler that automatically vectorizes your code, if possible (Intel C, also GCC). Even in cases where a table lookup is a possible solution, this can often be faster because you don't suffer from memory latency.
First, it should be noted that float has 24 bits of precision, which can in no way fit into a 16-bit int, let alone 8 bits. Second, float has a much larger range, which can't be stored in any int or even long long int.
So your question title is actually incorrect; there is no way to precisely convert any float to a short or char. What you want is to map a float value between 0 and 1 onto an 8-bit or 16-bit int range.
The code you use above will work fine. However, the value 255 is extremely unlikely to be returned, because it requires exactly 1.0 as input; products such as 254.99999 end up truncated to 254. You should round the value instead:
return (int)(x * 255.0f + .5f);
or better, use the code provided in your link for more balanced distribution
static inline uint8_t u8fromfloat_trick(float x)
{
    union { float f; uint32_t i; } u;
    u.f = 32768.0f + x * (255.0f / 256.0f);
    return (uint8_t)u.i;
}
Using a LUT wouldn't be any faster, because a table for 16-bit values is too large to fit in cache and may in fact reduce your performance greatly. The snippet above needs only 2 floating-point instructions, or only 1 with FMA, and SIMD will improve performance a further 4-32x (or more), so the LUT method would be easily outperformed, since table lookups are much harder to parallelize.

Convert 32 bits to a float value

I am working on a DSP processor to implement a BFSK frequency hopping mechanism using C on a Linux system. On the receiver part of the program, I am getting an input set of samples which I demodulate using the Goertzel algorithm to determine whether the received bit was a 0 or a 1.
Right now, I am able to detect the bits individually. But I have to return the data for processing in the form of a float array, so I need to pack every set of 32 received bits into a float value. Right now I am doing something like:
uint32_t i, j = 0, curBit, curBlk = 0;
uint32_t *outData; // initialized to the address of some pre-defined location in DSP memory
float *output;
for (i = 0; i < num_os_bits; i++) // Loop over the number of data bits
{
    // Demodulate the data and set curBit = 0x0000 or 0x8000
    curBlk = curBlk >> 1;
    curBlk = curBlk | curBit;
    bitsCounter += 1;
    if (i != 0 && (bitsCounter % 32) == 0) // 32 bits processed; save the data in the array
    {
        *(outData + j) = curBlk;
        j += 1;
        curBlk = 0;
    }
}
output = (float *)outData;
Now, the values of the output array are just the values of the outData array with 0s after the decimal point.
Example: if outData[i] = 12345, then output[i] = 12345.0000.
But while testing the program, I am generating the sample test bits from an array of floats:
float data[] = {123.12456, 45.789, 297.0956};
So after demodulation I expect the float array output to have the same values as the data array.
Is there some other method to convert 32 bits of data to a float? Should I store the received bits in a char array and then convert that to a float?
Not sure if I get your point - you sequentially obtain bits, and once you have 32 bits you want to make a float from them?
what about:
union conv32
{
    uint32_t u32; // here_write_bits
    float f32;    // here_read_float
};
Which can be used like this in-line:
float f = ((union conv32){.u32 = my_uint}).f32;
Or with a helper variable:
union conv32 x = {.u32 = my_uint};
float f = x.f32;
You need to copy the bit pattern verbatim without invoking any implicit conversions. The simplest way to do that is to take the address of the data and reinterpret it as a pointer to the intermediate "carrier" type before dereferencing it.
Consider this:
float source_float = 1234.5678f ;
uint32_t transport_bits = *((uint32_t*)&source_float);
float destination_float = *((float*)&transport_bits);
The intermediate result in transport_bits is target dependent; in my test on x86 it is 0x449a522b, this being the bit representation of a single-precision float on that platform (dependent on floating point representation and endianness).
Either way the result in destination_float is identical to source_float having been transported via the uint32_t transport_bits.
However, this does require that the floating point representation of the originator is identical to that of the receiver, which may not be a given. It is all a bit non-portable. The fact that your code does not work suggests that the representation does indeed differ between sender and receiver. Not only must the FP representation be identical, but so must the endianness. If they differ, your code may have to reorder the bytes and/or compute the floating-point equivalent by extracting the exponent and mantissa. Also, you need to be sure that the transmission order of the bits is the same order in which you are reassembling them. There are a number of opportunities to get this wrong.
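A variant of the same bit-for-bit copy that avoids the aliased pointer dereferences, sketched with memcpy:
#include <stdint.h>
#include <string.h>

float source_float = 1234.5678f;
uint32_t transport_bits;
float destination_float;

// Copy the bit pattern in and out without pointer-type punning.
memcpy(&transport_bits, &source_float, sizeof transport_bits);
memcpy(&destination_float, &transport_bits, sizeof destination_float);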