I've read that they're stored in the form of mantissa and exponent
I've read this document but I could not understand anything.
To understand how they are stored, you must first understand what they are and what kind of values they are intended to handle.
Unlike integers, a floating-point value is intended to represent extremely small values as well as extremely large. For normal 32-bit floating-point values, this corresponds to values in the range from 1.175494351 * 10^-38 to 3.40282347 * 10^+38.
Clearly, using only 32 bits, it's not possible to store every digit in such numbers.
When it comes to the representation, you can see all normal floating-point numbers as a value in the range 1.0 to (almost) 2.0, scaled with a power of two. So:
1.0 is simply 1.0 * 2^0,
2.0 is 1.0 * 2^1, and
-5.0 is -1.25 * 2^2.
So, what is needed to encode this, as efficiently as possible? What do we really need?
The sign of the expression.
The exponent
The value in the range 1.0 to (almost) 2.0. This is known as the "mantissa" or the significand.
This is encoded as follows, according to the IEEE-754 floating-point standard.
The sign is a single bit.
The exponent is stored as an unsigned integer; for 32-bit floating-point values, this field is 8 bits. 1 represents the smallest exponent and "all ones minus 1" the largest. (0 and "all ones" are used to encode special values; see below.) The value in the middle (127, in the 32-bit case) represents an exponent of zero; this offset is also known as the bias.
When looking at the mantissa (the value between 1.0 and (almost) 2.0), one sees that all possible values start with a "1" (both in the decimal and binary representation). This means there is no point in storing it. The rest of the binary digits are stored in an integer field; in the 32-bit case this field is 23 bits.
In addition to the normal floating-point values, there are a number of special values:
Zero is encoded with both exponent and mantissa as zero. The sign bit is used to represent "plus zero" and "minus zero". A minus zero is useful when the result of an operation is extremely small, but it's still important to know from which direction the operation came.
plus and minus infinity -- represented using an "all ones" exponent and a zero mantissa field.
Not a Number (NaN) -- represented using an "all ones" exponent and a non-zero mantissa.
Denormalized numbers -- numbers smaller than the smallest normal number. Represented using a zero exponent field and a non-zero mantissa. The special thing about these numbers is that the precision (i.e. the number of digits a value can contain) drops the smaller the value becomes, simply because there is no room for them in the mantissa.
Finally, the following is a handful of concrete examples (all values are in hex):
1.0 : 3f800000
-1234.0 : c49a4000
100000000000000000000000.0: 65a96816
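A small C sketch to check those values (assuming IEEE-754 single precision; memcpy is used to inspect the bits without aliasing problems):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void show(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* copy the raw bit pattern */
    printf("%15g : %08x\n", f, (unsigned)bits);
}

int main(void) {
    show(1.0f);        /* 3f800000 */
    show(-1234.0f);    /* c49a4000 */
    show(1e23f);       /* 65a96816 */
    return 0;
}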
In layman's terms, it's essentially scientific notation in binary. The formal standard (with details) is IEEE 754.
typedef struct {
    unsigned int mantissa_low:32;   /* low 32 bits of the 52-bit mantissa */
    unsigned int mantissa_high:20;  /* high 20 bits of the mantissa */
    unsigned int exponent:11;       /* biased exponent (bias is 1023) */
    unsigned int sign:1;            /* sign bit */
} tDoubleStruct;

double a = 1.2;
tDoubleStruct* b = (tDoubleStruct*)&a;  /* plain C cast; note that bit-field
                                           layout is implementation-defined */
This is an example of how memory is laid out when the compiler uses IEEE 754 double precision, which is the default for a C double on little-endian systems (e.g. Intel x86). Here it is in C-based binary form; read the Wikipedia article about double precision to understand it better.
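As a hedged usage sketch of the struct above (assuming a little-endian IEEE-754 double and a compiler that packs these bit-fields as intended; the pointer cast also technically breaks strict aliasing, so treat this as illustration only):

#include <stdio.h>

typedef struct {
    unsigned int mantissa_low:32;
    unsigned int mantissa_high:20;
    unsigned int exponent:11;
    unsigned int sign:1;
} tDoubleStruct;

int main(void) {
    double a = 1.2;
    tDoubleStruct* b = (tDoubleStruct*)&a;
    /* 1.2 is 0x3FF3333333333333, so this typically prints:
       sign=0 exponent=1023 mantissa=33333 33333333 */
    printf("sign=%u exponent=%u mantissa=%05x %08x\n",
           (unsigned)b->sign, (unsigned)b->exponent,
           (unsigned)b->mantissa_high, (unsigned)b->mantissa_low);
    return 0;
}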
There are a number of different floating-point formats. Most of them share a few common characteristics: a sign bit, some bits dedicated to storing an exponent, and some bits dedicated to storing the significand (also called the mantissa).
The IEEE floating-point standard attempts to define a single format (or rather set of formats of a few sizes) that can be implemented on a variety of systems. It also defines the available operations and their semantics. It's caught on quite well, and most systems you're likely to encounter probably use IEEE floating-point. But other formats are still in use, as well as not-quite-complete IEEE implementations. The C standard provides optional support for IEEE, but doesn't mandate it.
The mantissa represents the most significant bits of the number.
The exponent represents how many shifts are to be performed on the mantissa in order to get the actual value of the number.
The encoding specifies how the sign of the mantissa and the sign of the exponent are represented (the latter basically determines whether the mantissa is shifted to the left or to the right).
The document you refer to specifies IEEE encoding, the most widely used.
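As a tiny C illustration of that shifting, ldexp from <math.h> applies exactly this "scale by a power of two" operation:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* mantissa 1.25 with exponent 2: shift left twice, 1.25 * 2^2 = 5.0 */
    printf("%g\n", ldexp(1.25, 2));
    /* a negative exponent shifts to the right: 1.25 * 2^-2 = 0.3125 */
    printf("%g\n", ldexp(1.25, -2));
    return 0;
}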
I have found the article you referenced quite illegible (and I DO know a little about how IEEE floats work). I suggest you try the Wiki version of the explanation. It's quite clear and has various examples:
http://en.wikipedia.org/wiki/Single_precision and http://en.wikipedia.org/wiki/Double_precision
It is implementation defined, although IEEE-754 is the most common by far.
To be sure that IEEE-754 is used:
in C, use #ifdef __STDC_IEC_559__
in C++, use the std::numeric_limits<float>::is_iec559 constants
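For instance, a minimal C sketch of the first check (the C++ std::numeric_limits<float>::is_iec559 test is analogous):

#include <stdio.h>

int main(void) {
#ifdef __STDC_IEC_559__
    puts("This implementation conforms to IEC 60559 (IEEE-754) floating point.");
#else
    puts("No IEC 60559 conformance is advertised by this implementation.");
#endif
    return 0;
}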
I've written some guides on IEEE-754 at:
In Java, what does NaN mean?
What is a subnormal floating point number?
In relation to: Convert Decimal to Double
Now, I've come across many questions relating to C#'s floating-point type called decimal and I've seen how it differs from both float and double, which got me thinking whether there is an equivalent to this type in C.
In the question referred to above, I got an answer that I want to convert to C:
double trans = trackBar1.Value / 5000.0;
double trans = trackBar1.Value / 5000d;
Of course, the only change is that the second line is gone, but given what I've read about the decimal type, I want to know its C equivalent.
Question: What is the C equivalent of C#'s decimal?
C2X will standardize decimal floating point as _DecimalN: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2573.pdf
In addition, GCC implements decimal floating point as an extension; it currently supports 32-bit, 64-bit, and 128-bit decimal floats.
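A hedged sketch of that GCC extension (the _DecimalN types and the DD literal suffix are GCC-specific and need a target with decimal-float support; printf has no standard conversion for them in most C libraries, so the value is converted to double purely for display):

#include <stdio.h>

int main(void) {
    _Decimal64 a = 0.1DD;          /* exactly one tenth in decimal floating point */
    _Decimal64 sum = a + a + a;    /* 0.3, still exact in base 10 */

    /* Converting to double for printing may round, but the decimal
       arithmetic itself was exact. */
    printf("%.17g\n", (double)sum);
    return 0;
}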
Edit: much of what I said below is just plain wrong, as pointed out by phuclv in the comments. Even then, I think there's valuable information to be gained in reading that answer, so I'll leave it unedited below.
So in short: yes, there is support for Decimal floating-point values and arithmetic in the standard C language. Just check out phuclv's comment and S.S. Anne's answer.
In the C programming language, as others have commented, there's no such thing as a Decimal type, nor are there types implemented like it. The simplest type that is close to it would be double, which is most commonly implemented as an IEEE-754 compliant 64-bit floating-point type. It contains a 1-bit sign, an 11-bit exponent and a 52-bit mantissa/fraction (the Wikipedia diagram of the double-precision layout represents it quite well).
So you have the following format:
value = (-1)^sign * 1.fraction * 2^(exponent - 1023)
A more detailed explanation can be read here, but you can see that the exponent part is a power of two, which means that there will be imprecision when dealing with division and multiplication by ten. A simple explanation is that division by anything that isn't a power of two is bound to repeat digits indefinitely in base 2. Example: 1/10 = 0.1 (in base 10) = 0.00011001100110011... (in base 2). And, because computers can't store an unlimited number of digits, your operations will have to be truncated/approximated.
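A small C illustration of that truncation (assuming a typical IEEE-754 double; the exact digits may vary between implementations):

#include <stdio.h>

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += 0.1;                  /* 0.1 has no exact binary representation */

    printf("%.17g\n", sum);          /* typically 0.99999999999999989, not 1 */
    printf("%s\n", sum == 1.0 ? "equal to 1.0" : "not equal to 1.0");
    return 0;
}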
In the case of C#'s Decimal, from the documentation:
The binary representation of a Decimal number consists of a 1-bit sign, a 96-bit integer number, and a scaling factor used to divide the integer number and specify what portion of it is a decimal fraction.
This last part is important, because instead of being a multiplication by a power of two, it is a division by a power of ten. So you have the following format:
value = (-1)^sign * integer / 10^N, where N ranges from 0 to 28
Which, as you can clearly see, is a completely different implementation from above!
For instance, if you wanted to divide by a power of 10, you could do that exactly, because that just involves increasing the exponent part (N). You have to be aware of the limitation of the numbers that can be represented by Decimal, though, which is at most a measly 7.922816251426434e+28, whereas double can go up to 1.79769e+308.
Given that there are no equivalents (yet) in C to Decimal, you may wonder "what do I do?". Well, it depends. First off, is it really important for you to use a Decimal type? Can't you use a double? To answer that question, it's helpful to know why that type was created in the first place. Again, from Microsoft's documentation:
The Decimal value type is appropriate for financial calculations that require large numbers of significant integral and fractional digits and no round-off errors
And, just at the next phrase:
The Decimal type does not eliminate the need for rounding. Rather, it minimizes errors due to rounding
So you shouldn't think of Decimal as having "infinite precision", just as being a more appropriate type for calculations that generally need to be made in the decimal system(such as financial ones, as stated above).
If you still want a Decimal data type in C, you'd have to work on developing a library to support addition, subtraction, multiplication, etc., because C doesn't support operator overloading. Also, it still wouldn't have hardware support (e.g. from the x64 instruction set), so all of your operations would be slower than those on double, for example. Finally, if you still want something that supports a Decimal type in other languages (as in your final question), you may look into the Decimal TR for C++.
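To make that concrete, here is a hypothetical, minimal sketch of such a decimal-like type (an integer scaled by a power of ten, as described above); the names dec_t and dec_add are invented for illustration, and a real library would also handle rescaling, overflow and rounding:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t value;   /* scaled integer, e.g. 12345 */
    int     scale;   /* digits after the decimal point, e.g. 2 -> 123.45 */
} dec_t;

/* Add two decimals that already share the same scale. */
static dec_t dec_add(dec_t a, dec_t b) {
    dec_t r = { a.value + b.value, a.scale };
    return r;
}

int main(void) {
    dec_t price = { 1999, 2 };   /* 19.99 */
    dec_t tax   = {  160, 2 };   /*  1.60 */
    dec_t total = dec_add(price, tax);
    /* scale is 2, so split off the last two digits for display: 21.59 */
    printf("%lld.%02lld\n", (long long)(total.value / 100),
                            (long long)(total.value % 100));
    return 0;
}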
As others have pointed out, there's nothing in the C standard(s) like .NET's decimal, but, if you're working on Windows and have the Windows SDK, it's defined:
DECIMAL structure (wtypes.h)
Represents a decimal data type that provides a sign and scale for a
number (as in coordinates.)
Decimal variables are stored as 96-bit (12-byte) unsigned integers
scaled by a variable power of 10. The power of 10 scaling factor
specifies the number of digits to the right of the decimal point, and
ranges from 0 to 28.
typedef struct tagDEC {
USHORT wReserved;
union {
struct {
BYTE scale;
BYTE sign;
} DUMMYSTRUCTNAME;
USHORT signscale;
} DUMMYUNIONNAME;
ULONG Hi32;
union {
struct {
ULONG Lo32;
ULONG Mid32;
} DUMMYSTRUCTNAME2;
ULONGLONG Lo64;
} DUMMYUNIONNAME2;
} DECIMAL;
DECIMAL is used to represent an exact numeric value with a fixed precision and fixed scale.
The origin of this type is Windows' COM/OLE automation (introduced for VB/VBA/Macros, etc. so, it predates .NET, which has very good COM automation support), documented here officially: [MS-OAUT]: OLE Automation Protocol, 2.2.26 DECIMAL
It's also one of the VARIANT types (VT_DECIMAL). On the x86 architecture, its size fits right in the VARIANT (16 bytes).
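A hedged sketch of filling that structure by hand (assuming the Windows SDK headers with nameless unions enabled so that scale, sign, Hi32 and Lo64 are directly accessible; production code would normally go through the OLE Automation helpers such as VarDecFromR8 and VarDecAdd instead):

#include <windows.h>
#include <stdio.h>

int main(void) {
    DECIMAL d;
    /* Represent 123.45 as the 96-bit integer 12345 scaled by 10^2. */
    d.wReserved = 0;
    d.scale = 2;          /* two digits to the right of the decimal point */
    d.sign  = 0;          /* 0 = positive, DECIMAL_NEG = negative */
    d.Hi32  = 0;
    d.Lo64  = 12345;

    printf("scale=%u sign=%u scaled integer=%llu\n",
           (unsigned)d.scale, (unsigned)d.sign,
           (unsigned long long)d.Lo64);
    return 0;
}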
The Decimal type in C# has a precision of 28-29 digits and a size of 16 bytes. There is not even a close equivalent in C to C#'s decimal. In Java there is a BigDecimal data type that is the closest to the C# decimal data type. C#'s decimal gives you numbers like:
+/- someInteger / 10 ^ someExponent
where someInteger is a 96 bit unsigned integer and someExponent is an integer between 0 and 28.
Is Java's BigDecimal the closest data type corresponding to C#'s Decimal?
I have read:
"Like an unsigned int, but offset by −(2^(n−1) − 1), where n is the number of bits in the numeral. Aside:
Technically we could choose any bias we please, but the choice presented here is extraordinarily common." - http://inst.eecs.berkeley.edu/~cs61c/sp14/disc/00/Disc0.pdf
However, I don't get what the point is. Can someone explain this to me with examples? Also, when should I use it, given other options like one's complement, sign and magnitude, and two's complement?
Biased notation is a way of storing a range of values that doesn't start with zero.
Put simply, you take an existing representation that goes from zero to N, and then add a bias B to each number so it now goes from B to N+B.
Floating-point exponents are stored with a bias to keep the dynamic range of the type "centered" on 1.
Excess-three encoding is a technique for simplifying decimal arithmetic using a bias of three.
Two's complement notation could be considered as biased notation with a bias of INT_MIN and the most-significant bit flipped.
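A small C sketch of that floating-point case (assuming a 32-bit IEEE-754 float, whose 8-bit exponent field uses a bias of 127):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 6.0f;                     /* 6.0 = 1.5 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* copy out the raw bits */

    uint32_t stored = (bits >> 23) & 0xFF;   /* biased exponent field */
    int real = (int)stored - 127;            /* remove the bias */

    printf("stored exponent: %u, real exponent: %d\n", (unsigned)stored, real);
    /* typically prints: stored exponent: 129, real exponent: 2 */
    return 0;
}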
A "representation" is a way of encoding information so that it easy to extract details or inferences from the encoded information.
Most modern CPUs "represent" numbers using "twos complement notation". They do this because it is easy to design digital circuits that can do what amounts to arithmetic on these values quickly (add, subtract, multiply, divide, ...). Twos complement also has the nice property that one can interpret the most significant bit as either a power-of-two (giving "unsigned numbers") or as a sign bit (giving signed numbers) without changing essentially any of the hardware used to implement the arithmetic.
Older machines used other bases, e.g, quite common in the 60s were machines that represented numbers as sets of binary-coded-decimal digits stuck in 4-bit addressable nibbles (the IBM 1620 and 1401 are examples of this). So, you can represent that same concept or value different ways.
A bias just means that whatever representation you chose (for numbers), you have added a constant bias to that value. Presumably that is done to enable something to be done more effectively. I can't speak to "−(2^(n−1) − 1)" being "an extraordinarily common (bias)"; I do lots of assembly and C coding and pretty much never find a need to "bias" values.
However, there is a common example. Modern CPUs largely implement IEEE floating point, which stores floating-point numbers with a sign, an exponent and a mantissa. The exponent is a power of two, symmetric around zero, but biased by 2^(N-1) − 1 for an N-bit exponent (127 in the single-precision case).
This bias allows floating-point values with the same sign to be compared for equal/less/greater using the standard machine twos-complement instructions rather than a special floating-point instruction, which means that the use of actual floating-point compares can sometimes be avoided. (See http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm for dark-corner details.) [Thanks to @PotatoSwatter for noting the inaccuracy of my initial answer here, and making me go dig this out.]
When I run
printf("%.8f\n", 971090899.9008999);
printf("%.8f\n", 9710908999008999.0 / 10000000.0);
I get
971090899.90089989
971090899.90089977
I know why neither is exact, but what I don't understand is why doesn't the second match the first?
I thought basic arithmetic operations (+ - * /) were always as accurate as possible...
Isn't the first number a more accurate result of the division than the second?
Judging from the numbers you're using and based on the standard IEEE 754 floating point standard, it seems the left hand side of the division is too large to be completely encompassed in the mantissa (significand) of a 64-bit double.
You've got 53 bits' worth of pure integer representation (52 stored bits plus an implicit leading 1) before you start bleeding precision. 9710908999008999 needs about 54 bits in its representation, so it does not fit exactly; thus the truncation and approximation begin and your end numbers get all finagled.
EDIT: As was pointed out, the first number that has no mathematical operations done on it doesn't fit either. But, since you're doing extra math on the second one, you're introducing extra rounding errors not present with the first number. So you'll have to take that into consideration too!
Evaluating the expression 971090899.9008999 involves one operation, a conversion from decimal to the floating-point format.
Evaluating the expression 9710908999008999.0 / 10000000.0 involves three operations:
Converting 9710908999008999.0 from decimal to the floating-point format.
Converting 10000000.0 from decimal to the floating-point format.
Dividing the results of the above operations.
The second of those should be exact in any good C implementation, because the result is exactly representable. However, the other two add rounding errors.
C does not require implementations to convert decimal to floating-point as accurately as possible; it allows some slack. However, a good implementation does convert accurately, using extra precision if necessary. Thus, the single operation on 971090899.9008999 produces a more accurate result than the multiple operations.
Additionally, as we learn from a comment, the C implementation used by the OP converts 9710908999008999.0 to 9710908999008998. This is incorrect by the rules of IEEE-754 for the common round-to-nearest mode. The correct result is 9710908999009000. Both of these candidates are representable in IEEE-754 64-bit binary, and both are equidistant from the source value, 9710908999008999. The usual rounding mode is round-to-nearest, ties-to-even, meaning the candidate with the even low bit should be selected, which is 9710908999009000 (with significand 0x1.1400298aa8174), not 9710908999008998 (with significand 0x1.1400298aa8173). (IEEE 754 defines another round-to-nearest mode: ties-to-away, which selects the candidate with the larger magnitude, which is again 9710908999009000.)
The C standard permits some slack in conversions; either of these two candidates conforms to the C standard, but good implementations also conform to IEEE 754.
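One hedged way to see this on your own implementation is to print both expressions with C99's %a conversion, which shows the exact bits and makes a one-ULP difference visible if there is one:

#include <stdio.h>

int main(void) {
    /* The two expressions from the question, printed as hex floats. */
    printf("%a\n", 971090899.9008999);
    printf("%a\n", 9710908999008999.0 / 10000000.0);
    return 0;
}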
Why is this C program giving the "wrong" output?
#include <stdio.h>
#include <conio.h>   /* for getch() */

void main()
{
    float f = 12345.054321;
    printf("%f", f);
    getch();
}
Output:
12345.054688
But the output should be, 12345.054321.
I am using VC++ in VS2008.
It's giving the "wrong" answer simply because not all real values are representable by floats (or doubles, for that matter). What you'll get is an approximation based on the underlying encoding.
In order to represent every real value, even between 1.0×10^-100 and 1.1×10^-100 (a truly minuscule range), you still require an infinite number of bits.
Single-precision IEEE754 values have only 32 bits available (some of which are tasked to other things such as exponent and NaN/Inf representations) and cannot therefore give you infinite precision. They actually have 23 bits available, giving a precision of about 2^24 (there's an extra implicit bit) or just over 7 decimal digits (log10(2^24) is roughly 7.2).
I enclose the word "wrong" in quotes because it's not actually wrong. What's wrong is your understanding about how computers represent numbers (don't be offended though, you're not alone in this misapprehension).
Head on over to http://www.h-schmidt.net/FloatApplet/IEEE754.html and type your number into the "Decimal representation" box to see this in action.
If you want a more accurate number, use doubles instead of floats - these have double the number of bits available for representing values (assuming your C implementation is using IEEE754 single and double precision data types for float and double respectively).
If you want arbitrary precision, you'll need to use a "bignum" library like GMP although that's somewhat slower than native types so make sure you understand the trade-offs.
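For instance, a small comparison along those lines (output hedged; it assumes IEEE-754 float and double):

#include <stdio.h>

int main(void) {
    float  f = 12345.054321f;   /* ~7 significant decimal digits survive */
    double d = 12345.054321;    /* ~15-16 significant decimal digits survive */

    printf("float : %f\n", f);  /* typically 12345.054688 */
    printf("double: %f\n", d);  /* typically 12345.054321 */
    return 0;
}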
The decimal number 12345.054321 cannot be represented accurately as a float on your platform. The result that you are seeing is a decimal approximation to the closest number that can be represented as a float.
floats are about convenience and speed, and use a binary representation - if you care about precision use a decimal type.
To understand the problem, read What Every Computer Scientist Should Know About Floating-Point Arithmetic:
http://docs.sun.com/source/806-3568/ncg_goldberg.html
For a solution, see the Decimal Arithmetic FAQ:
http://speleotrove.com/decimal/decifaq.html
It's all to do with precision. Your number cannot be stored accurately in a float.
Single-precision floating-point values can only represent about seven significant (decimal) digits. Beyond that point, you're seeing quantization error.
What is the smallest exact representation of 1/(2^x) that can be represented in the C programming language?
On most platforms, C's double is the same as the IEEE 754 double-precision format. The closest positive normal value to zero supported there is 2^-1022 (which is equal to 1/2^1022).
However if you allow user defined types, there is no limit, as you can always express the exponent as a bigint.
Using IEEE-754 double for arithmetic, the smallest exact value 1/2^n is:
2^-1022 if your platform does not have denormal support
2^-1023 if your platform has denormal support, but you insist on computing it using 1.0 / 2^n; this is because 2^1023 is the largest representable exact power of two in double.
2^-1074 if your platform has denormal support and you don't mind directly specifying the value, eg with C99 hex floating-point notation: 0x1.0p-1074 or 0x0.0000000000001p-1022.
If you use another type, say long double on an x86 machine with a compiler that maps that to 80-bit float, the smallest value can be quite a lot smaller (2^-16446, assuming that I did my arithmetic properly =)
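A quick C check of the double-precision values listed above (assuming an IEEE-754 double, C99 hex-float constants and, for DBL_TRUE_MIN, a C11 <float.h>):

#include <stdio.h>
#include <float.h>

int main(void) {
    printf("smallest normal (DBL_MIN) = %a\n", DBL_MIN);            /* 2^-1022 */
    printf("1.0 / 0x1.0p1023          = %a\n", 1.0 / 0x1.0p1023);   /* 2^-1023, subnormal but exact */
    printf("smallest denormal         = %a\n", 0x1.0p-1074);        /* 2^-1074 */
#ifdef DBL_TRUE_MIN
    printf("DBL_TRUE_MIN              = %a\n", DBL_TRUE_MIN);       /* same as the line above */
#endif
    return 0;
}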
If you use the GNU MP library (written in C), then you can represent any value, limited only by the amount of RAM installed.
0, that is 1/(2^inf) ;)
More seriously, this is a question of exponent bits in double precision floats. I don't think the C standard itself defines the size, but IEEE 754 does define it to have 11 exponent bits.
Let's ignore denormals for a little while. Since the smallest exponent value is −1022, this should be 1/(2^1022). But then there's the case of denormals, which IIRC simply do not contain any implicit 1 bit. The denormal numbers are thus spread uniformly over the 0..1/(2^1022) range, giving 52 additional binary places IIRC. So, I THINK the final answer should be 1/(2^1074).
If you store your variable as a 64-bit negative exponent, 1/2^(2^63 - 1). :)
That's a reeeeally small number.