C, FLT_MAX value larger than 32 bits? - c

I'm currently researching how to display a float (using write), and I've run into something that confuses me.
I found that floats are stored in 32 bits, with 1 bit for the sign, 8 bits for the exponent and the rest for the mantissa.
My trouble is that when I display FLT_MAX with printf, I get 340282346638528859811704183484516925440.000000 by simply doing
printf("%f\n", FLT_MAX);
This value is bigger than INT_MAX and bigger than LLONG_MAX. How can a number with this many digits be stored in 32 bits? Is it really 32 bits, or is it system dependent? I'm on Ubuntu x86_64 GNU/Linux.
I can't understand how more than 10 digits (the length of INT_MAX) can be stored in the same number of bits.
I think the problem is related, but I also have trouble with double, which gives me
printf("%lf", DBL_MAX);
#179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
It's making the mystery bigger !
Thanks for helping, hope I was clear.

Bits are merely physical things with two states. There is no inherent meaning to them. When we use the bits to represent an integer in binary, we interpret each bit as having a value, 1 for one bit, 2 for another, 4 for another, 8 for another, and so on. There is nothing in physics, logic, or law that requires us to give them this interpretation.
When we use the bits to represent a floating-point object, we give each bit a different meaning. One bit represents the sign. Eight bits contain an encoding of the exponent. 23 bits contain an encoding of the significand.
To figure out the meaning of the bits given the floating-point encoding scheme for numbers in the normal range, we interpret the exponent bits as a binary numeral, then subtract 127, then raise two to the resulting power. (For example, “10000011” is the binary numeral for 131, so it represents 2^4 = 16.) Then we take the significand bits and append them to “1.”, forming a binary numeral such as “1.00111110000000000000000”. We convert that numeral to a number (it is 159/128), and we multiply it by the power from the exponent (producing 159/8 in this example) and apply the sign.
Since the exponent can be large, the value represented can be very large. The software that converts floating-point numbers to characters for output such as “340282346638528859811704183484516925440.000000” performs these interpretations for you.

Related

why 10000101 is both -5 and 133 [closed]

00000101 is 5
10000101 is -5
but 10000101 is also 133
I don't understand why 1 binary is able to represent 2 numbers.
Any help is appreciated, thank you.
The word “Gift” has at least two meanings. In English, it means a present you give somebody. In German, it means poison. To know which concept a speaker means, you must know which language they are speaking.
The bit string 10000101 does not mean anything by itself. It is just some bits. Bit strings have values only when we associate them with types, which are (in part) methods of associating values with bit strings. To know what value it represents, you must know which type it is being used with.
If we interpret 10000101 as a pure binary numeral, it means 1•2^7 + 1•2^2 + 1•2^0 = 128 + 4 + 1 = 133.
If we interpret 10000101 as a sign-and-magnitude representation, it means negative with magnitude 1•2^2 + 1•2^0, that is, −(4 + 1) = −5.
If we interpret 10000101 as a two’s complement representation, it means −1•2^7 + 1•2^2 + 1•2^0 = −128 + 4 + 1 = −123.
In C, every declared object and every constant has a type, and every expression built from these things has a type. The type says how to interpret the bits.
In these representations of a signed number that occupies one byte
00000101 is 5
10000101 is -5
the most significant bit is the sign bit that determines whether the stored number is positive or negative.
That is, if the sign bit is set, then the value stored in the value bits is negated.
This representation of signed values is called sign and magnitude.
It is implementation-defined how signed values are stored. Most computer architectures use the so called 2's complement representation. For such an architecture the negative number -5 is represented like
11111011 is -5
If the number is considered as having an unsigned integer type then no bit is allocated as the sign bit. In this case all bits are value bits. So for unsigned integer this representation 10000101 yields the value 133.
I don't understand why 1 binary is able to represent 2 numbers.
Pure binary numbers have no sign.
So as pure binary, 10000101 is unambiguously 133, pure and simple.
But, of course, we want to be able to handle negative numbers. So there are various ways of rigging things up so that some bit patterns represent negative numbers. But as soon as you do that, you're going to end up with bit patterns which can be interpreted two ways: as straight binary (giving a positive number) or as signed binary (giving a negative number). Any bit pattern that represents a negative number can also be interpreted as a positive number, by just not using whatever negative-number rule you were using.
Here's another way of thinking about it. Consider the word "march". It's the thing that bands do in parades. But if I capitalize the first letter, March, it's the third month of the year. So If I write "March", and I pay attention to the capitalization, it's a month, but if I ignore the capitalization, it's a verb.
Similarly, if I write 10000101, and I ignore the possibility of signedness, I get 133.
But if I pay attention to sign, I get a negative number.
Or if I interpret it as a character, I might get something else!
Here are 5 possibilities:
bit pattern   interpreted as      gives
10000101      pure binary         133
10000101      sign/magnitude      -5
10000101      ones' complement    -122
10000101      two's complement    -123
10000101      character           à
(Now, I confess, in the last row I had to cheat, by using the old MS-DOS character set. In Unicode, 10000101 does not represent a character, and in the Windows character set, it's one character that's an ellipsis, or three dots: … .)
Now, one thing you may be worried about is whether we're getting "something for nothing", by having a bit pattern that can do double duty as either a positive or a negative number. Are we cheating and extending the range somehow? The answer is that, no, we're not. Let's stay with 8-bit numbers. If we treat our 8-bit numbers as pure binary, we can cover the range from 0 to 255 (that is, 00000000 to 11111111). We can't represent the number 300, because it takes too many bits, and we can't represent the number -5, because we have no way to represent negative numbers. With 8 bits, in pure binary, we can represent only 0 to 255, and that's it.
If we switch to two's complement, we can represent any number from -128 to +127 — which is exactly 256 different numbers. We still can't represent 300 (positive or negative), because it's still too many bits. But, now, we can't represent the number 200, either, because if we try to, it's 11001000, and that's the two's complement bit pattern for -56. Similarly, we can't represent 133 (your original example), because its bit pattern is 10000101, and that's a negative number, too, -123.

If a C signed integer type is stored in 22 bits, what is the smallest value it can store?

I am learning about data allocation and am a little confused.
If you are looking for the smallest or greatest value that can be stored in a certain number of bits then does it matter what the data type is?
Wouldn't the smallest or biggest number that could be stored in 22 bits be 22 1's, positive or negative? Is the first part of this question a red herring? Wouldn't the smallest value be -4194303?
A 22-bit data element can store any one of 2^22 distinct values. What those values actually mean is a matter of interpretation. That interpretation may be imposed by a compiler or some piece of hardware, or may be under the control of the programmer, and suit some specific application.
A simple interpretation, of course, would be to treat the 22 bits as an unsigned integer, with values from 0 to (2^22)-1. A two's-complement, signed integer is a slightly more sophisticated interpretation of the same bits. Or you (or the compiler, or CPU) could divide the 22 bits up into a mantissa and exponent, and store a range of decimal numbers. The range and precision would depend on how many bits were allocated to the mantissa, and how many to the exponent.
Or you could split the bits up and use some for the numerator and some for the denominator of a fraction. Or, in fact, anything else.
Some of these interpretations of the bits are built into hardware, some are implemented by compilers or libraries, and some are entirely under the programmer's control. Not all programming languages allow the programmer to manipulate individual bits in a natural or efficient way, but some do. Sometimes, using a highly unconventional interpretation of binary data can give significant efficiency gains, but usually at the expense of readability and maintainability.
So, yes, it matters what the data type is.
There is no law (of humans, logic, or nature) that says bits must represent numbers only in the pattern that one of the bits represents 2^0, another represents 2^1, another represents 2^2, and so on (and the number represented is the sum of those values for the bits that are 1). We have choices about how to use bits to represent numbers, including:
The bits do use that pattern, and so 22 bits can represent any number from 0 to the sum of 2^0 + 2^1 + 2^2 + … + 2^21 = 2^22 − 1 = 4,194,303. The smallest representable value is 0.
The bits mostly use that pattern, but it is modified so that one bit represents −2^21 instead of +2^21. This is called two’s complement, and the smallest value representable is −2^21 = −2,097,152.
The bits represent numbers as described above except the represented value is divided by 1000. This is called fixed-point. In the first case, the value represented by all bits 1 would be 4194.303, but the smallest representable value would be 0. With a combination of two’s complement and fixed-point scaled by 1/1000, the smallest representable value would be −2097.152.
The bits represent a floating-point number, where one bit represents a sign (+ or −), certain bits represent an exponent and other information, and the remaining bits represent a significand. In common floating-point formats, when all the bits in that exponent-and-other field are 1s and the significand field bits are 0s, the number represents +∞ or −∞, according to the sign bit. In such a format, the smallest representable value is −∞.
As an example, we could designate patterns of bits to represent numbers arbitrarily. We could say that 0000000000000000000000 represents 34, 0000000000000000000001 represents −15, 0000000000000000000010 represents 5, 0000000000000000000011 represents 3+4i, and so on. The smallest representable value would be whichever of those arbitrary values is smallest.
So what the smallest representable value is depends entirely on the type, since the “type” of the data includes the scheme by which the bits represent values.
If the type is a “signed integer type,” there is still some flexibility in the representation. Most modern C implementations (and other programming languages) use the two’s complement scheme described above. But the C standard still allows two other schemes:
One’s complement: If the first bit is 1, the value represented is negative, and its magnitude is given by complementing the remaining bits and interpreting them as binary. Using six bits for an example, 101001 would be negative with the magnitude of binary 10110 = 22, so −22.
Sign-and-magnitude: If the first bit is 1, the value represented is negative, and its magnitude is given by interpreting the remaining bits as binary. Using the same bits, 101001 would be negative with the magnitude of binary 01001 = 9, so −9.
In both one’s complement and sign-and-magnitude, the smallest representable value with 22 bits is −(2^21 − 1) = −2,097,151.
To stretch the question further, C defines standard integer types but allows implementations to extend the language. An implementation could define some “signed integer type” with an arbitrary scheme for representing numbers, as long as that scheme included a sign, to make the name correct.
Without going into technical jargon about doing maths with two's complement, I'll try to explain in easy words.
First you need to raise 2 with power of 'number of bits'.
Let's take an example of an 8 bit type,
An un-signed 8-bit integer can store 2 ^ 8 = 256 values.
Since values are indexed starting from 0, so values range from 0 - 255.
Assuming you want to store signed values, so you need to get the half (simply divide it by 2),
256 / 2 = 128.
Remember we start from zero,
You might be rightly thinking you can store -127 to 127 starting from zero on both sides.
Just know that there is only zero (there is nothing like +0 or -0),
so you start with zero to positive half. 0 to 127,
that leaves you with negative half starting from -1 to -128
Hence the range will be -128 to 127.
For a 22 bit signed integer you can do the math,
2 ^ 22 = 4,194,304
4194304 / 2 = 2,097,152
-1 for positive side,
range will be, -2097152 to 2097151.
To answer your question,
-2097152 would be the smallest number you can store.
Thanks everyone for the replies. I figured it out with the help of all of your info but I will explain the answer to show exactly what gaps of knowledge I had that lead to my misunderstanding.
The data type does matter in this question because for signed data types the first bit is used to represent whether a binary number is positive or negative. In sign-and-magnitude, for example, 0111 = 7 and 1111 = -7.
signed int and unsigned int use the same number of bits (typically 32). Since an unsigned int is unsigned, the first bit isn't used to represent positive or negative, so it can represent a larger number with that extra bit. 1111 read as unsigned is 15, whereas read as signed (sign-and-magnitude) it is -7, since the leftmost bit represents the sign: 1 is negative and 0 is positive.
Now to answer "If a C signed integer type is stored in 22 bits, what is the smallest value it can store?":
If you convert binary to decimal you get 1111111111111111111111 = 4194303, which is 2^22 - 1.
This is the maximum value an unsigned 22-bit type could hold. Since our data type is signed, one bit carries the sign and only 21 bits are left for the magnitude. In two's complement this gives a smallest value of -2^21 = -2097152.
Thanks again, everyone.

C, getting the maximum float or maximum double not from <float.h>

I was working through the book "The C Programming Language", but ran into an exercise in which I should get the maximum/minimum value of a floating-point number without using any of the standard headers, such as <float.h>. Thank you
“Without using” exercises are a little bit stupid, so here is one version “without using” any header.
…
double nextafter(double, double);
double max = nextafter(1.0 / 0.0, 0.0);
…
And without using any library function, only assuming that double is mapped to IEEE 754's binary64 format (a very common choice):
…
double max = 0x1.fffffffffffffp1023;
…
Assuming a binary floating-point format, start with 2.0 and multiply it by 2.0 until you get an overflow. This determines the maximum exponent. Then, starting with x as the number you had right before the overflow, take the sum x + x/2 + x/4 + ... until adding x/q does not change the value of the number (or overflows again). This determines the maximum mantissa.
The smallest representable positive number can be found a similar way.
From wikipedia you can read up the IEEE floating point format: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This contains
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The page also contains information on how to interpret the exponent value. A value of 0xFF (255) in the exponent signifies ±infinity if the significand is zero and NaN (not a number) otherwise. The ± infinities are the largest values. The sign bit determines whether the number is +infinity or -infinity. If the question is about the largest non-infinite value, then just use the largest non-special value.
The largest non-infinite value has 24 bits of 1s in the significand and 0xFE (254) as the exponent. Since the exponent is offset, the actual value is significand * 2^(254-127), which is close to 3.402823 × 10^38 in decimal according to the Wikipedia page. If you want the minimum, just set the sign bit to get the exact same value as a negative number.
EDIT: Since this is about C, I've assumed the 32 bit IEEE float.
You can figure out the number of bits the number holds by computing sizeof(type) * CHAR_BIT (CHAR_BIT is 8 on most systems).
Then look at http://en.wikipedia.org/wiki/Double-precision_floating-point_format or http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This way you can look it up in a table based in the number of bits.
This assumes that the structure is using IEEE 754.
You could start from the IEEE definition and work from there: for example, the number of bits of exponent and the number of bits of mantissa. When you study the format, you will see that the 23 bits of mantissa actually represent 24 bits. The reason is that the mantissa is "normalised", that is, it is left shifted so that the most significant bit is always 1. This gives the maximum number of significant bits retained from a calculation. Where has the 24th bit gone? Because it is always there (except for a 0 value), it is "implied" as the 24th bit.

How to create a custom float with varying bit length

I'm writing a program that basically converts an int into a floating point with a twist, the user specifies how many bits are used for the exponent and mantissa part. I don't understand what the algorithm for taking a number and breaking it into an exponent and mantissa is though. If I go to an IEEE 754 Converter and type in 34, it changes it to exp 2^5 and mantissa 1.0625, but I don't understand how to get this conversion. I know what I'll do to then get the bits in the specified places, but I don't know how to get the correct bits in the first place

Fixed Point Multiplication of Unsigned numbers

I am trying to solve a multiplication problem with fixed point numbers. The numbers are 32 bit. My architecture is 8 bit. So here goes:
I am using 8.8 notation i.e., 8 for integer, 8 for fraction.
I have 0x0A78, which is 10.468. I take its two's complement, and the answer is FFFFF588, which I truncate to 16 bits as F588 and store. The reason is that I only want to multiply two 2-byte numbers.
Now when I multiply this F588 (negative 10.468, i.e. the negation of 0x0A78) by 0xFF4B, which is the two's complement of 0x00B5 (0.707), the answer should be 0x0766, or something like it.
What I get on the other hand is 66D8.
Now here is where it gets interesting: if I store the negative of B5 in two's complement in 32 bits, I get 0xFF5266D8, which I shift right by 8 bits and then truncate to 16 bits, and the answer is 0x5266.
On the other hand, if I instead store my negative 10.468 in 32 bits, I get 0xF58F66D8, which after shifting 8 bits and truncating becomes 8F66.
But, if I store both numbers in 32 bit formats, only then I get the correct result after shifting and truncation, which is 0x0766.
Why is this happening? I understand that loss of information is intrinsic when we go from 32 to 16 bits, but 0x07 is much different from 0x55. I will be absolutely grateful for a response.
Let’s look at just the integer representations. You have two 16-bit integers, x and y, and you form their 16-bit two’s complements. However, you keep these 16-bit complements in 32-bit objects. In 32 bits, what you have is 65536–x and 65536–y. (For example, you started with 0xa78, complemented it to make 0xfffff588, and discarded bits to get 0xf588. That equals 0x10000-0xa78.)
When you multiply these, the result is 65536•65536 – 65536•x – 65536•y + x•y.
The 65536•65536 is 2^32, so it vanishes because unsigned 32-bit arithmetic is performed modulo 2^32. You are left with – 65536•x – 65536•y + x•y.
Now you can see the problem: x•y is the product of two 16-bit values, so it flows into the high 16 bits of the 32 bits. Up there, you still have – 65536•x – 65536•y, which you do not want.
An easy way to do this multiplication is to keep all 32 bits of the complement. E.g., when you took the two’s complement of 0xa78, you got 0xfffff588. Then you discarded the high bits, keeping only 0xf588. If you do not do that, you will multiply 0xfffff588 by 0xffffff4b, and the product will be 0x766d8 which, when shifted for the fraction, will be 0x766, which is the result you want.
If the high bits are lost because you stored the two’s complement into a 16-bit object, then simply restore them when you reload the object, by extending the sign bit. That is, take bit 15 and repeat it in bits 16 to 31. An easy way to do this is to load the 16-bit object into a 16-bit signed integer, then convert the 16-bit signed integer to an unsigned 32-bit integer.
