I just started learning floating point and got to know the sign/mantissa/exponent (SME) layout. I'm still very confused about the mantissa... Can somebody explain how I can get the exponent part of a float? I am sorry if that's a basic question, but I am having a hard time understanding it...
Also how do I implement the following function... clearly my implementation is wrong. But how do I do it?
// Extract the 8-bit exponent field of single precision
// floating point number f and return it as an unsigned byte
unsigned char get_exponent_field(float f)
{
// TODO: Your code here.
int bias = 127;
int expp = (int)f;
unsigned char E = expp-bias;
return E;
}
If you want to extract the IEEE-754 single precision exponent from a float value (in excess 127 notation), you can use the float functions, or you can use a simple union with a shift and mask to do the same:
unsigned float_getexp (float f)
{
union {
unsigned u;
float f;
} uf;
uf.f = f;
return (uf.u >> 23) & 0xff;
}
If you want the actual (unbiased) exponent (i.e. the number of places the binary point of the mantissa is shifted during normalization prior to hidden bit removal), just subtract the bias of 127 from the value returned, or, if you want the function to return that value, subtract it before the return and return a signed type.
Give it a try and let me know if you have questions. (note: the type should be unsigned for your exponent, instead of the int you have).
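For the exact signature in the question, a minimal sketch could look like this. It assumes float is a 32-bit IEEE-754 single (only the size is checked at compile time) and copies the bits with memcpy rather than a union; both choices are my assumptions, not part of the original assignment.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit IEEE-754 single precision value. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Extract the 8-bit biased exponent field of f. */
unsigned char get_exponent_field(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* copy the raw bit pattern */
    return (unsigned char)((bits >> 23) & 0xff); /* bits 30..23 */
}
```

Calling get_exponent_field(1.0f) should give 127, because 1.0 is 1.0 x 2^0 and the stored field is the exponent plus the bias.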
First, get your floating-point number and calculate its binary form by converting the integral and fractional parts separately. Once you've got that, say you've got 11010.101 (base 2). Normalize the binary string: 1.1010101 x 2^4. Next, add your excess value, say excess 15, to the exponent of the normalized value, which gives you 19 (base ten). Convert this to base two; this will be your exponent field.
This is just the structure of the operation, plug in your own bias, etc.
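The normalization steps above can be sketched in code. The function name biased_exponent and its guard for non-positive or non-finite input are hypothetical choices of mine; the sketch only handles positive finite values and works by repeated halving or doubling rather than by looking at the stored bits.

```c
#include <math.h>

/* Normalize x to the form 1.xxx * 2^e and return e + bias.
   Assumption: x is positive and finite; other inputs return 0. */
int biased_exponent(double x, int bias)
{
    int e = 0;
    if (!(x > 0.0) || !isfinite(x))
        return 0;
    while (x >= 2.0) { x /= 2.0; ++e; } /* move the binary point left */
    while (x < 1.0)  { x *= 2.0; --e; } /* move the binary point right */
    return e + bias;
}
```

With 11010.101 (base 2), which is 26.625, the loop runs four times, so biased_exponent(26.625, 15) gives 19 and biased_exponent(26.625, 127) gives 131.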
I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning it to a ry (also a float) This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float. The cast to an int is UB for float with a whole number value outside the [INT_MIN ... INT_MAX] range.
So code is UB for about 38% of all typical float - the large valued ones, NaNs and infinities.
For typical float, a cast to int128_t is defined for nearly all float.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
#include <math.h> // isfinite, modff, roundf

// round the y position to the nearest 64th value
float round_to_64th(float x) {
    if (isfinite(x)) {
        float ipart;
        // The modf functions break the argument value into integral and fractional parts
        float frac = modff(x, &ipart);
        x = ipart + roundf(frac*64)/64;
    }
    return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate to the nearest multiple of 64 or 2⁶.
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
A simple recreation of the shift operations for floating-point values, without using additional functions, is the following:
Left shift:
ShiftFloat(py, 6, 1);
Right shift:
ShiftFloat(py, 6, 0);
float ShiftFloat(float x, int count, int ismultiplication)
{
    // Multiply by 2 for a left shift and by 0.5 for a right shift, count times.
    float factor = ismultiplication ? 2.0f : 0.5f;
    float value = x;
    for (int i = 0; i < count; ++i)
    {
        value *= factor;
    }
    return value;
}
Sorry if this has already been asked. I've seen other ways of extracting the exponent of a floating point number, however this is what was given to me:
unsigned f2i(float f)
{
union {
unsigned i;
float f;
} x;
x.i = 0;
x.f = f;
return x.i;
}
I'm having trouble understanding this union datatype, because shouldn't the return x.i at the end always make f2i return a 0?
Also, what application could this data type even be useful for? For example, say I have a function:
int getexponent(float f){
}
This function is supposed to get the exponent value of the floating point number with bias of 127. I've found many ways to make this possible, however how could I manipulate the f2i function to serve this purpose?
I appreciate any pointers!
Update!!
Wow, years later and this just seem trivial.
For those who may be interested, here is the function!
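int getexponent(float f) {
    unsigned f2i(float f); // the union-based function shown above
    unsigned int ui = (f2i(f) >> 23) & 0xff; // shift right by 23 and mask with 0xff to get the biased exponent
    int bias = 127; // exponent bias for single precision
    if (ui == 0) return 1 - bias; // special case: zero and denormals
    else if (ui == 255) return 11111111; // special case: infinity or NaN (a decimal sentinel value)
    return (int)ui - bias; // cast first so the subtraction is done in signed arithmetic
}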
I'm having trouble understanding this union datatype
The union data type is a way for a programmer to indicate that some variable can be one of a number of different types. The wording of the C11 standard is "the value of at most one of the members can be stored in a union object at any time". It is used for things like parameters that may be logically one thing or another. For example, an IP address might be an IPv4 address or an IPv6 address, so you might define an address type as follows:
#include <stdbool.h>
#include <stdint.h>

struct IpAddress
{
    bool isIPv6;
    union
    {
        uint8_t v4[4];
        uint8_t v6[16];
    } bytes;
};
And you would use it like this:
struct IpAddress address = // Something
if (address.isIPv6)
{
doSomeV6ThingWith(address.bytes.v6);
}
else
{
doSomeV4ThingWith(address.bytes.v4);
}
Historically, unions have also been used to get the bits of one type into an object of another type. This is because, in a union, the members all start at the same memory address. If I just do this:
float f = 3.0;
int i = f;
The compiler will insert code to convert a float to an integer, so the exponent will be lost. However, in
union
{
unsigned int i;
float f;
} x;
x.f = 3.0;
int i = x.i;
i now contains the exact bits that represent 3.0 in a float. Or at least you hope it does. There's nothing in the C standard that says float and unsigned int have to be the same size. There's also nothing in the C standard that mandates a particular representation for float (Annex F specifies IEC 60559 floats, but it only applies to implementations that define __STDC_IEC_559__). So the above code is, at best, non-portable.
The portable way to get the exponent of a float is the frexpf() function declared in math.h.
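As a rough sketch of the frexpf() route: frexpf() returns a fraction in [0.5, 1) and a power of two, so the exponent in the usual 1.xxx x 2^e form is one less than what it reports. The wrapper name portable_exponent and its handling of zero, infinities and NaN (returning 0) are my own assumptions.

```c
#include <math.h>

/* Return e such that |f| = 1.xxx * 2^e, without inspecting the bits.
   Assumption: zero, infinities and NaN simply return 0 here. */
int portable_exponent(float f)
{
    int e;
    if (f == 0.0f || !isfinite(f))
        return 0;
    frexpf(f, &e); /* f = fraction * 2^e with fraction in [0.5, 1) */
    return e - 1;  /* rescale to a leading digit of 1 */
}
```

For 3.0f, frexpf() reports 0.75 x 2^2, so portable_exponent(3.0f) gives 1; adding 127 would give the stored exponent field on an IEEE 754 machine.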
how could I manipulate the f2i function to serve this purpose?
Let's make the assumption that a float is stored in IEC 60559 format in 32 bits, which Wikipedia thinks is the same as IEEE 754. Let's also assume that integers are stored in little endian format.
union
{
uint32_t i;
float f;
} x;
x.f = someFloat;
uint32_t bits = x.i;
bits now contains the bit pattern of the floating point number. A single precision floating point number looks like this
SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
^        ^                     ^
bit 31   bit 22                bit 0
Where S is the sign bit, E is an exponent bit, M is a mantissa bit.
So having got your uint32_t you just need to do some shifting and masking:
uint32_t exponentWithBias = (bits >> 23) & 0xff;
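Putting the pieces together, here is a sketch that splits a float into all three fields using the same union approach. The struct float_fields and the function split_float are hypothetical names, and the code assumes a 32-bit IEEE 754 float as stated above.

```c
#include <stdint.h>

/* Assumption: float is a 32-bit IEEE 754 single precision value. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

struct float_fields {
    unsigned sign;     /* 1 bit  */
    unsigned exponent; /* 8 bits, still biased by 127 */
    uint32_t mantissa; /* 23 bits, hidden bit not included */
};

struct float_fields split_float(float f)
{
    union { uint32_t i; float f; } x;
    struct float_fields out;
    x.f = f;
    out.sign = x.i >> 31;
    out.exponent = (x.i >> 23) & 0xff;
    out.mantissa = x.i & 0x7fffff;
    return out;
}
```

For -6.5f, which is -1.101 x 2^2 in binary, this should give sign 1, exponent 129 and mantissa 0x500000.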
Because it's a union it means that x.i and x.f have the same address, what this allows you to do is reinterpret one data type to another. In this scenario the union is first zeroed out by x.i = 0; and then filled with f. Then x.i is returned which is the integer representation of the float f. If you would then shift that value you would get the exponent of the original f because of the way a float is laid out in memory.
I'm having trouble understanding this union datatype, because shouldn't the return x.i at the end always make f2i return a 0?
The line x.i = 0; is a bit paranoid and shouldn't be necessary. Given that unsigned int and float are both 32 bits, the union creates a single chunk of 32 bits in memory, which you can access either as a float or as the pure binary representation of that float, which is what the unsigned is for. (It would have been better to use uint32_t.)
This means that the lines x.i = 0; and x.f = f; write to the very same memory area twice.
What you end up with after the function is the pure binary notation of the float. Parsing out the exponent or any other part from there is very much implementation-defined, since it depends on floating point format and endianness. How to represent FLOAT number in memory in C might be helpful.
That union type is strongly discouraged, as it is strongly architecture dependent and compiler implementation dependent. Both things make it almost impossible to determine a correct way to obtain the information you request.
There are portable ways of doing that; one of them uses the logarithm to base ten. If you take the floor of log10(x), you'll get the number you want:
#include <math.h> // provides log10() and floor()
// log10(x) is equivalent to log(x)/log(10.0)
int power10 = (int)floor(log10(x)); // x must be positive
This gives you the power of ten by which the mantissa is multiplied to get the number. If you divide the original number by ten raised to that power, you'll get the mantissa.
Be careful, as the floating point numbers are normally internally stored in a power of two's basis, which means the exponent you get stored is not a power of ten, but a power of two.
I am working on exercise 2.1 from "The C Programming Language", where one should calculate on the local machine the range of different types like char, short, int etc., but also float and double. For everything except float and double I watch for the overflow to happen and so can calculate the max/min values. However, for floats this is still not working.
So, the question is: why does this code print the same value twice? I thought the second line should print inf.
float f = 1.0;
printf("%f\n",FLT_MAX);
printf("%f\n",FLT_MAX + f);
Try multiplying by 10, and it will overflow. The reason it doesn't overflow is the same reason why adding a small float to an already very large float doesn't actually change the value at all: it's a floating point format, meaning the number of digits of precision is limited.
Or, adding at least that last significant digit would likely work:
float f = 3.402823e38f; // FLT_MAX
f = f + 0.000001e38f; // this should result in overflow
The reason why it prints the same value twice is that 1.0 is too small to be added to FLT_MAX. A float usually has 24 bits for the mantissa (23 stored plus one implicit) and 8 bits for the exponent. If you have a very large value with an exponent of 127, you would need a mantissa of about 128 bits to be able to add 1.0.
As an example, the same problem exists with decimal (and any other) exponential values:
If you have a number with 3 significant digits like 1.00*10^6, you can't add 1 to it because this would be 1'000'001, and this requires 7 significant digits.
You could overflow a float by doubling the value repeatedly.
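Both effects can be checked with a short sketch. The function names are hypothetical, and the code assumes an IEEE 754 float with the default rounding mode; volatile is used only to force each result to be stored as a float rather than kept in a wider register.

```c
#include <float.h>
#include <math.h>

/* Returns nonzero if FLT_MAX + 1.0f differs from FLT_MAX. */
int adding_one_changes_flt_max(void)
{
    volatile float sum = FLT_MAX + 1.0f; /* 1.0f is far below half an ulp */
    return sum != FLT_MAX;
}

/* Doubles 1.0f until it overflows and returns the number of doublings. */
int doublings_until_infinity(void)
{
    volatile float f = 1.0f;
    int steps = 0;
    while (!isinf(f)) {
        f = f * 2.0f;
        ++steps;
    }
    return steps;
}
```

Under those assumptions the first function returns 0, and the second returns 128, since 2^127 is still representable but 2^128 is not.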
I am trying to arithmetic bit shift a double data type in C. I was wondering if this is the correct way to do it:
NOTE: firdelay[ ][ ] is declared in main as
double firdelay[8][12]
void function1(double firdelay[][12]) {
int * shiftptr;
// Cast address of element of 2D matrix (type double) to integer pointer
*shiftptr = (int *) (&firdelay[0][5]);
// Dereference integer pointer and shift right by 12 bits
*shiftptr >>= 12;
}
Bitwise shifting a floating point data type will not give you the result you're looking for.
In Simulink, the Shift Arithmetic block only does bit shifting for integer data types. If you feed it a floating point type it divides the input signal by 2^N where N is the number of bits to shift specified in the mask dialog box.
EDIT:
Since you don't have the capability to perform any floating point math your options are:
1. understand the layout of a floating point single precision number, then figure out how to manipulate it bitwise to achieve division.
2. convert whatever algorithm you're porting to use fixed point data types instead of floating point
I'd recommend option 2, it's a whole lot easier than 1
Bit-shifting a floating-point data type (reinterpreted as an int) will give you gibberish (take a look at the diagrams of the binary representation here to see why).
If you want to multiply/divide by a power of 2, then you should do that explicitly.
According to the poorly worded and very unclear documentation, it seems that "bit shifting" in Simulink takes two arguments for floating point values and has the effect of multiplying a floating point value by two raised to the difference of the arguments.
You can use ldexp(double_number, bits_to_pseudo_shift) to obtain this behavior. The function ldexp is found in <math.h>.
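A small sketch of the ldexp() approach follows. The wrapper name pseudo_shift is hypothetical; a positive count behaves like a left shift (multiply by a power of two) and a negative count like a right shift.

```c
#include <math.h>

/* Scale x by 2^bits; bits may be negative to divide instead. */
double pseudo_shift(double x, int bits)
{
    return ldexp(x, bits);
}
```

So the "shift right by 12" in the question would be pseudo_shift(firdelay[0][5], -12), which divides the value by 4096.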
There is no correct way to do this. Both operands of << must be of some integer type.
What you're doing is interpreting ("type-punning") a double object as if it were an int object, and then shifting the resulting int value. Even if double and int happen to be the same size, this is very unlikely to do anything useful. (And even if it is useful, it makes more sense to shift unsigned values rather than signed values).
One potential use case for this is capturing the mantissa bits, exponent bits, and sign bit, if you are interested in them. For this you can use a union (a fixed 64-bit integer type is used here because long is only 32 bits on some platforms):
union doubleBits {
    double d;
    uint64_t l; // from <stdint.h>
};
You can take your double and store it in the union:
union doubleBits myUnion;
myUnion.d = myDouble;
And bit shift the integer member of the union after extracting the bits like so:
myUnion.l >>= 1;
Since the number of bits for each portion of the double is defined, this is one way to extract the underlying bit representation. This is one use case where it could be feasible to want to get the raw underlying bits. I'm not familiar with Simulink, but if that is possibly why the double was being shifted in the first place, this could be a way to achieve that behavior in C. The fact that it was always 12 bits makes me think otherwise, but just in case figured it was worth pointing out for others who stumble upon this question.
There's a way to achieve this: just add your n to the exponent part of the bitwise representation of the double.
Convert your double to a 64-bit integer using a "reinterpret" or bitwise conversion (using a union, for instance), extract bits 52 to 62 (11 bits), then add your shift and put the result back in the exponent.
You should take into account the special values of double (+/- infinity, NaN, zero and subnormals).
#include <stdint.h>

double operator_shift_left(double a, int n)
{
    union
    {
        uint64_t l;
        double d;
    } r;
    r.d = a;
    const uint64_t sign = r.l & 0x8000000000000000ULL;
    const int oexp = (int)((r.l >> 52) & 0x7FF); // biased exponent field
    // Zero and subnormals (field 0) and infinities and NaNs (field 0x7FF)
    // are returned unchanged; subnormals are not scaled by this sketch.
    if (oexp == 0 || oexp == 0x7FF)
        return a;
    const long long nexp = (long long)oexp + n; // new exponent
    if (nexp <= 0) // underflow
    {
        r.l = sign;
        // returns +/- 0
        return r.d;
    }
    if (nexp >= 0x7FF) // overflow
    {
        r.l = sign | 0x7FF0000000000000ULL;
        // returns +/- infinity
        return r.d;
    }
    // returns the number with the new exponent
    r.l = (r.l & 0x800FFFFFFFFFFFFFULL) | ((uint64_t)nexp << 52);
    return r.d;
}
(there's probably some x64 processor instruction that does it ?)
Firstly, this IS a homework question, just to clear this up immediately. I'm not looking for a spoon fed solution of course, just maybe a little pointer to the right direction.
So, my task is to find the smallest positive integer that can not be represented as an IEEE-754 float (32 bit). I know that testing for equality on something like "5 == 5.00000000001" will fail, so I thought I'd simply loop over all the numbers and test for that in this fashion:
int main(int argc, char **argv)
{
unsigned int i; /* Loop counter. No need to inizialize here. */
/* Header output */
printf("IEEE floating point rounding failure detection\n\n");
/* Main program processing */
/* Loop over every integer number */
for (i = 0;; ++i)
{
float result = (float)i;
/* TODO: Break condition for integer wrapping */
/* Test integer representation against the IEEE-754 representation */
if (result != i)
break; /* Break the loop here */
}
/* Result output */
printf("The smallest integer that can not be precisely represented as IEEE-754"
" is:\n\t%d", i);
return 0;
}
This failed. Then I tried to subtract the integer "i" from the floating point "result" that is "i" hoping to achieve something of a "0.000000002" that I could try and detect, which failed, too.
Can someone point me out a property of floating points that I can rely on to get the desired break condition?
-------------------- Update below ---------------
Thanks for help on this one! I learned multiple things here:
My original thought was indeed correct and determined the result on the machine it was intended to be run on (Solaris 10, 32 bit), yet failed to work on my Linux systems (64 bit and 32 bit).
The changes that Hans Passant added made the program also work with my systems, there seem to be some platform differences going on here that I didn't expect,
Thanks to everyone!
The problem is that your equality test is a floating-point test. The i variable is converted to float first, and that of course produces the same float. Convert the float back to int to get an integer equality test:
float result = (float)i;
int truncated = (int)result;
if (truncated != i) break;
If it starts with the digits 16 then you found the right one. Convert it to hex and explain why that was the one that failed for a grade bonus.
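A self-contained sketch of that loop is shown below. The function name first_unrepresentable and the upper bound of 2^25 are my assumptions (the bound just keeps the float-to-integer conversion in range), and volatile forces the value to be rounded to a real float before the comparison.

```c
/* Returns the smallest unsigned integer that does not survive a round
   trip through float, or 0 if none is found below the search bound. */
unsigned first_unrepresentable(void)
{
    for (unsigned i = 0; i <= (1u << 25); ++i) {
        volatile float result = (float)i; /* rounded to float precision */
        if ((unsigned)result != i)        /* integer comparison */
            return i;
    }
    return 0;
}
```

On a machine with IEEE 754 single precision floats this returns 16777217, which is 2^24 + 1: the first integer that needs 25 significant bits.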
I think you should reason on the representation of the floating numbers as (base, sign,significand,exponent)
Here it is an excerpt from Wikipedia that can give you a clue:
A given format comprises:
* Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is most simply described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is (−1)^s × c × b^q, where b is the base (2 or 10). For example, if the sign is 1 (indicating negative), the significand is 12345, the exponent is −3, and the base is 10, then the value of the number is −12.345.
That would be FLT_MAX+1. See float.h.
Edit: or actually not. Check the modf() function in math.h