Arithmetic Bit Shift of Double Variable Data Type in C - c

I am trying to arithmetic bit shift a double data type in C. I was wondering if this is the correct way to do it:
NOTE: firdelay[ ][ ] is declared in main as
double firdelay[8][12]
void function1(double firdelay[][12]) {
int * shiftptr;
// Cast address of element of 2D matrix (type double) to integer pointer
*shiftptr = (int *) (&firdelay[0][5]);
// Dereference integer pointer and shift right by 12 bits
*shiftptr >>= 12;
}

Bitwise shifting a floating point data type will not give you the result you're looking for.
In Simulink, the Shift Arithmetic block only does bit shifting for integer data types. If you feed it a floating point type it divides the input signal by 2^N where N is the number of bits to shift specified in the mask dialog box.
EDIT:
Since you don't have the capability to perform any floating point math your options are:
understand the layout of a floating point single precision number, then figure out how to manipulate it bitwise to achieve division.
convert whatever algorithm you're porting to use fixed point data types instead of floating point
I'd recommend option 2, it's a whole lot easier than 1

Bit-shifting a floating-point data type (reinterpreted as an int) will give you gibberish (take a look at the diagrams of the binary representation here to see why).
If you want to multiply/divide by a power of 2, then you should do that explicitly.

According to the poorly worded and very unclear documentation, it seems that "bit shifting" in Simulink takes two arguments for floating point values and has the effect of multiplying a floating point value by two raised to the difference of the arguments.
You can use ldexp(double_number, bits_to_pseudo_shift) to obtain this behavior. The function ldexp is found in <math.h>.

There is no correct way to do this. Both operands of << must be of some integer type.
What you're doing is interpreting ("type-punning") a double object as if it were an int object, and then shifting the resulting int value. Even if double and int happen to be the same size, this is very unlikely to do anything useful. (And even if it is useful, it makes more sense to shift unsigned values rather than signed values).

One potential use case for this is for capturing mantissa bits, exponent bits, and sign bit if interested. For this you can use a union:
union doubleBits {
double d;
long l;
};
You can set take your double and set it in the union:
union doubleBits myUnion;
myUnion.d = myDouble;
And bit shift the long portion of the union after extracting the bits like so:
myUnion.l >>= 1;
Since the number of bits for each portion of the double is defined, this is one way to extract the underlying bit representation. This is one use case where it could be feasible to want to get the raw underlying bits. I'm not familiar with Simulink, but if that is possibly why the double was being shifted in the first place, this could be a way to achieve that behavior in C. The fact that it was always 12 bits makes me think otherwise, but just in case figured it was worth pointing out for others who stumble upon this question.

There's a way to achieve this : just add your n to the exponent part of the bitwise representation of the double.
Convert your double to long using "reinterpret" or bitwise conversion (using a union for instance).. extract bits from 52 to 63 (11 bits).. then add your shift and put back the result in the exponent.
You should take into account the special values of double (+infinity, nan or zero)
double operator_shift_left(double a,int n)
{
union
{
long long l;
double d;
} r;
r.d=a;
switch(r.l)
{
case 0x0000000000000000: // 0
case 0x8000000000000000: // -0
case 0x7FF0000000000000: // pos infnity
case 0xFFF0000000000000: // neg infnity
case 0x7FF0000000000001: // Nan
case 0x7FF8000000000001: // Nan
case 0x7FFFFFFFFFFFFFFF: // Nan
return a;
}
int nexp=(((r.l>>52)&0x7FF)+n); // new exponent
if (nexp<0) // underflow
{
r.l=r.l & 0x8000000000000000;
// returns +/- 0
return r.d;
}
if (nexp>2047) // overflow
{
r.l=(r.l & 0x8000000000000000)| 0x7FF0000000000000;
// returns +/- infinity
return r.d;
}
// returns the number with the new exponant
r.l=(r.l & 0x800FFFFFFFFFFFFF)|(((long long)nexp)<<52);
return r.d;
}
(there's probably some x64 processor instruction that does it ?)

Related

Is it defined what will happen if you shift a float?

I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning it to a ry (also a float) This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float. The cast to an int is UB for float with a whole number value outside the [INT_MIN ... INT_MAX] range.
So code is UB for about 38% of all typical float - the large valued ones, NaNs and infinities.
For typical float, a cast to int128_t is defined for nearly all float.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
// round the y position to the nearest 64th value
float round_to_64th(float x) {
if (isfinite(x)) {
float ipart;
// The modf functions break the argument value into integral and fractional parts
float frac = modff(x, &ipart);
x = ipart + roundf(frac*64)/64;
}
return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate to the nearest multiple of 64 or 2⁶.
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
The best possible recreation of the shift ops for floating points, in short, without using additional functions are the following:
Left shift:
ShiftFloat(py,6,1);
Right shift:
ShiftFloat(py,6,0);
float ShiftFloat(float x, int count, int ismultiplication)
{
float value = x;
for (int i = 0; i < count; ++i)
{
value *= (powf(0.5,(float)(ismultiplication^1)) / powf(2.0,(float)(ismultiplication)));
}
return count != 0 ? value : x;
}

getting exponent of a floating number in c

Sorry if this is already been asked, and I've seen other way of extracting the exponent of a floating point number, however this is what is given to me:
unsigned f2i(float f)
{
union {
unsigned i;
float f;
} x;
x.i = 0;
x.f = f;
return x.i;
}
I'm having trouble understanding this union datatype, because shouldn't the return x.i at the end always make f2i return a 0?
Also, what application could this data type even be useful for? For example, say I have a function:
int getexponent(float f){
}
This function is supposed to get the exponent value of the floating point number with bias of 127. I've found many ways to make this possible, however how could I manipulate the f2i function to serve this purpose?
I appreciate any pointers!
Update!!
Wow, years later and this just seem trivial.
For those who may be interested, here is the function!
int getexponent(float f) {
unsigned f2u(float f);
unsigned int ui = (f2u(f)>>23) & 0xff ;//shift over by 23 and compare to 0xff to get the exponent with the bias
int bias = 127;//initialized bias
if(ui == 0) return 1-bias; // special case 0
else if(ui == 255) return 11111111; //special case infinity
return ui - bias;
}
I'm having trouble understanding this union datatype
The union data type is a way for a programmer to indicate that some variable can be one of a number of different types. The wording of the C11 standard is something like "a union contains at most one of its members". It is used for things like parameters that may be logically one thing or another. For example, an IP address might be an IPv4 address or an IPv6 address so you might define an address type as follows:
struct IpAddress
{
bool isIPv6;
union
{
uint8_t v4[4];
uint8_t v6[16];
} bytes;
}
And you would use it like this:
struct IpAddress address = // Something
if (address.isIPv6)
{
doSomeV6ThingWith(address.bytes.v6);
}
else
{
doSomeV4ThingWith(address.bytes.v4);
}
Historically, unions have also been used to get the bits of one type into an object of another type. This is because, in a union, the members all start at the same memory address. If I just do this:
float f = 3.0;
int i = f;
The compiler will insert code to convert a float to an integer, so the exponent will be lost. However, in
union
{
unsigned int i;
float f;
} x;
x.f = 3.0;
int i = x.i;
i now contains the exact bits that represent 3.0 in a float. Or at least you hope it does. There's nothing in the C standard that says float and unsigned int have to be the same size. There's also nothing in the C standard that mandates a particular representation for float (well, annex F says floats conform to IEC 60559 , but I don't know if that counts as part of the standard). So the above code is, at best, non portable.
To get the exponent of a float the portable way is the frexpf() function defined in math.h
how could I manipulate the f2i function to serve this purpose?
Let's make the assumption that a float is stored in IEC 60559 format in 32 bits which Wkipedia thinks is the same as IEEE 754. Let's also assume that integers are stored in little endian format.
union
{
uint32_t i;
float f;
} x;
x.f = someFloat;
uint32_t bits = x.i;
bits now contains the bit pattern of the floating point number. A single precision floating point number looks like this
SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
^ ^ ^
bit 31 bit 22 bit 0
Where S is the sign bit, E is an exponent bit, M is a mantissa bit.
So having got your int32_t you just need to do some shifting and masking:
uint32_t exponentWithBias = (bits >> 23) & 0xff;
Because it's a union it means that x.i and x.f have the same address, what this allows you to do is reinterpret one data type to another. In this scenario the union is first zeroed out by x.i = 0; and then filled with f. Then x.i is returned which is the integer representation of the float f. If you would then shift that value you would get the exponent of the original f because of the way a float is laid out in memory.
I'm having trouble understanding this union datatype, because shouldn't the return x.i at the end always make f2i return a 0?
The line x.i = 0; is a bit paranoid and shouldn't be necessary. Given that unsigned int and float are both 32 bits, the union creates a single chunk of 32 bits in memory, which you can access either as a float or as the pure binary representation of that float, which is what the unsigned is for. (It would have been better to use uint32_t.)
This means that the lines x.i = 0; and x.f = f; write to the very same memory area twice.
What you end up with after the function is the pure binary notation of the float. Parsing out the exponent or any other part from there is very much implementation-defined, since it depends on floating point format and endianess. How to represent FLOAT number in memory in C might be helpful.
That union type is strongly discouraged, as it is strongly architecture dependant and compiler implementation dependant.... both things make it almost impossible to determine a correct way to achieve the information you request.
There are portable ways of doing that, and all of them have to deal with the calculation of logarithm to the base ten. If you get the integer part of the log10(x) you'll get the number you want,
int power10 = (int)log10(x);
double log10(double x)
{
return log(x)/log(10.0);
}
will give you the exponent of 10 to raise to get the number to multiply the mantissa to get the number.... if you divide the original number by the last result, you'll get the mantissa.
Be careful, as the floating point numbers are normally internally stored in a power of two's basis, which means the exponent you get stored is not a power of ten, but a power of two.

Any precision loss when converting float64 to uint64 in C? assuming only the whole number part of the data is meaningful

I have a counter field from one TCP protocol which keeps track of samples sent out. It is defined as float64. I need to translate this protocol to a different one, in which the counter is defined as uint64.
Can I safely store the float64 value in the uint64 without losing any precision on the whole number part of the data? Assuming the fractional part can be ignored.
EDIT: Sorry if I didn't describe it clearly. There is no code. I'm just looking at two different protocol documentations to see if one can be translated to the other.
Please treat float64 as double, the documentation isn't well written and is pretty old.
Many thanks.
I am assuming you are asking about 64 bit floating point values such as IEEE Binary64 as documented in https://en.wikipedia.org/wiki/Double-precision_floating-point_format .
Converting a double represented as a 64 bit IEEE floating point value to a uint64_t will not lose any precision on the integral part of the value, as long as the value itself is non negative and less than 264. But if the value is larger than 253, the representation as a double does not allow complete precision, so whatever computation led to the value probably was inaccurate anyway.
Note that the reverse is false, IEEE floats have less precision than 64 bit uint64_t, so close but different large values will convert to the same double values.
Note that a counter implemented as a 64 bit double is intrinsically limited by the precision of the floating point type. Incrementing a value larger than 253 by one is likely to have no effect at all. Using a floating point type to implement a packet counter seems a very bad idea. Using a uint64_t counter directly seem a safer bet. You only have to worry about wrap around at 264, a condition that you can check for in the unlikely case where you would actually expect to count that far.
If you cannot change the format, verify that the floating point value is within range for the conversion and store an appropriate value if it is not:
#include <stdint.h>
...
double v = get_64bit_value();
uint64_t result;
if (v < 0) {
result = 0;
} else
if (v >= (double)UINT64_MAX) {
result = UINT64_MAX;
} else {
result = (uint64_t)v;
}
Yes, precision is lost. Negative numbers cannot be converted properly to uint64 (as this type is unsigned), as well as numbers greater than 2^64-1. In all other cases, the conversion is exact (providing you look at the float64 value as exact and rounding is done correctly).

get the float number's exponents [duplicate]

This question already has answers here:
How to get the sign, mantissa and exponent of a floating point number
(7 answers)
Closed 6 years ago.
I just started learning floating point and get to know the SME stuff. I'm still very confused about the mantissa... Can somebody explain to me how can I get the exp part of the float. I am sorry if that's a super stupid and basic question but I am having a hard time understanding it...
Also how do I implement the following function... clearly my implementation is wrong. But how do I do it?
// Extract the 8-bit exponent field of single precision
// floating point number f and return it as an unsigned byte
unsigned char get_exponent_field(float f)
{
// TODO: Your code here.
int bias = 127;
int expp = (int)f;
unsigned char E = expp-bias;
return E;
}
If you want to extract the IEEE-754 single precision exponent from a float value (in excess 127 notation), you can use the float functions, or you can use a simple union with a shift and mask to do the same:
unsigned float_getexp (float f)
{
union {
unsigned u;
float f;
} uf;
uf.f = f;
return (uf.u >> 23) & 0xff;
}
If you want the actual exponent bias (i.e. the number of places the mantissa decimal is shifted during normalization prior to hidden bit removal), just subtract 127 from the value returned, or if you want that value returned, subtract it before the return.
Give it a try and let me know if you have questions. (note: the type should be unsigned for your exponent, instead of the int you have).
First, get your floating-point number and calculate its binary form by converting both the integral and fractional parts separately. Once you've got that, say you've got 11010.101(base-2). Normalize the binary string: 1.1010101 x 2^4. Next, add your excess value, say excess 15, to the exponent of the sci. not. value, which would give you 19(base-ten). Convert this to base-two; this will be your exponent.
This is just the structure of the operation, plug in your own bias, etc.

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.
All integers can have exact floating point representation if your floating point type supports the required mantissa bits. Since double uses 53 bits for mantissa, it can store all 32-bit ints exactly. After all, you could just set the value as mantissa with zero exponent.
If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C does when conversions from floating point to integer overflow... but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;
class Test
{
static void Main()
{
FloorSameInteger(long.MaxValue/2);
FloorSameInteger(long.MaxValue-2);
}
static void FloorSameInteger(long original)
{
double convertedToDouble = original;
double flooredToDouble = Math.Floor(convertedToDouble);
long flooredToLong = (long) flooredToDouble;
Console.WriteLine("Original value: {0}", original);
Console.WriteLine("Converted to double: {0}", convertedToDouble);
Console.WriteLine("Floored (as double): {0}", flooredToDouble);
Console.WriteLine("Converted back to long: {0}", flooredToLong);
Console.WriteLine();
}
}
Results:
Original value: 4611686018427387903
Converted to double:
4.61168601842739E+18Floored (as double): 4.61168601842739E+18
Converted back to long:
4611686018427387904
Original value: 9223372036854775805
Converted to double:
9.22337203685478E+18Floored (as double): 9.22337203685478E+18
Converted back to long:
-9223372036854775808
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise - there are more long values than doubles (given the NaN values) and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32 bit integers are representable as doubles.
I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.

Resources