What do arithmetic underflow and overflow mean in C programming?
Overflow
From http://en.wikipedia.org/wiki/Arithmetic_overflow:
the condition that occurs when a calculation produces a result that is greater in magnitude than that which a given register or storage location can store or represent.
So, for instance:
uint32_t x = 1UL << 31;
x *= 2; // Overflow!
Note that as #R mentions in a comment below, the C standard suggests:
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
Of course, this is a fairly idiosyncratic definition of "overflow". Most people would refer to modulo reduction (i.e. wrap-around) as "overflow".
Underflow
From http://en.wikipedia.org/wiki/Arithmetic_underflow:
the condition in a computer program that can occur when the true result of a floating point operation is smaller in magnitude (that is, closer to zero) than the smallest value representable as a normal floating point number in the target datatype.
So, for instance:
float x = 1e-30;
x /= 1e20; // Underflow!
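To actually see that underflow, here is a minimal sketch (my own, not part of the original answer). FLT_MIN from <float.h> is the smallest normal float, and the true quotient 1e-50 is smaller even than the smallest subnormal, so the result rounds all the way to zero:
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 1e-30f;
    x /= 1e20f;  // true result 1e-50 is far below FLT_MIN (~1.18e-38)

    // A nonzero result smaller than FLT_MIN would be subnormal; here the
    // result is so small it rounds all the way down to 0.0f.
    if (x != 0.0f && fabsf(x) < FLT_MIN)
        printf("subnormal: %g\n", x);
    else
        printf("x = %g (underflowed to zero)\n", x);
    return 0;
}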
Computers represent data using a fixed number of binary digits (0s and 1s), so the range of values that can be represented is limited. Many computers use 32 bits to store integers, so the largest unsigned integer that can be stored in this case is 2^32 - 1 = 4294967295. For signed integers, the first bit is used to represent the sign, so the largest value is 2^31 - 1 = 2147483647.
The situation where an integer outside the allowed range requires more bits than can be stored is called an overflow.
Similarly, with real numbers, an exponent that is too small to be stored causes an underflow.
int, the most common data type in C, is a 32-bit data type on most modern platforms. This means that each int is given 32 bits in memory. If I had the variable
int a = 2;
that would actually be represented in memory as a 32-bit binary number:
00000000000000000000000000000010.
If you have two binary numbers such as
10000000000000000000000000000000
and
10000000000000000000000000000000,
their sum would be 100000000000000000000000000000000, which is 33 bits long. However, the computer keeps only the 32 least significant bits, which are all 0. In this case the hardware signals that the true sum is greater than what can be stored in 32 bits; this is an overflow.
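Here is a small sketch (my own, not from the answer) of that same 33-bit sum using two uint32_t values, together with a common portable pre-check for unsigned wrap-around:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 0x80000000u;   // 1 followed by 31 zeros
    uint32_t b = 0x80000000u;

    // The mathematical sum needs 33 bits; only the low 32 are kept.
    uint32_t sum = a + b;
    printf("sum = %" PRIu32 "\n", sum);   // prints 0

    // Common pre-check for unsigned wrap-around before adding.
    if (a > UINT32_MAX - b)
        printf("a + b would wrap around\n");
    return 0;
}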
An underflow is basically the same thing happening in the opposite direction. The 32-bit floating-point format typically used for C's float stores 23 fraction (mantissa) bits; if the result of a calculation needs more precision than that, or is smaller in magnitude than the smallest representable value, those extra bits are lost. This shows up as an underflow and/or a loss of precision.
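As a small illustration of that precision limit (my own sketch): a 32-bit float has 24 significant bits (23 stored), so 2^24 + 1 = 16777217 is the first positive integer it cannot represent:
#include <stdio.h>

int main(void) {
    // 16777217 is rounded to the nearest representable float, 16777216.
    float f = 16777217.0f;
    printf("%.1f\n", f);                          // prints 16777216.0
    printf("%d\n", 16777216.0f == 16777217.0f);   // prints 1
    return 0;
}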
Underflow depends exclusively upon the given algorithm and the given input data, and hence the programmer has no direct control over it. Overflow, on the other hand, depends upon the programmer's arbitrary choice of the amount of memory space reserved for each stack, and this choice does influence the number of times overflow may occur.
Related
I am learning about data allocation and am a little confused.
If you are looking for the smallest or greatest value that can be stored in a certain number of bits then does it matter what the data type is?
Wouldn't the smallest or biggest number that could be stored in 22 bits be 22 1's, positive or negative? Is the first part of this question a red herring? Wouldn't the smallest value be -4194303?
A 22-bit data element can store any one of 2^22 distinct values. What those values actually mean is a matter of interpretation. That interpretation may be imposed by a compiler or some piece of hardware, or may be under the control of the programmer, and suit some specific application.
A simple interpretation, of course, would be to treat the 22 bits as an unsigned integer, with values from 0 to (2^22)-1. A two's-complement, signed integer is a slightly more sophisticated interpretation of the same bits. Or you (or the compiler, or CPU) could divide the 22 bits up into a mantissa and exponent, and store a range of decimal numbers. The range and precision would depend on how many bits were allocated to the mantissa, and how many to the exponent.
Or you could split the bits up and use some for the numerator and some for the denominator of a fraction. Or, in fact, anything else.
Some of these interpretations of the bits are built into hardware, some are implemented by compilers or libraries, and some are entirely under the programmer's control. Not all programming languages allow the programmer to manipulate individual bits in a natural or efficient way, but some do. Sometimes, using a highly unconventional interpretation of binary data can give significant efficiency gains, but usually at the expense of readability and maintainability.
So, yes, it matters what the data type is.
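To make the point about interpretation concrete, here is a minimal sketch (my own, not from the answer above) that reads one 32-bit pattern three different ways. It assumes sizeof(float) == 4 and IEEE-754 floats, which holds on typical platforms:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0xC0490FDBu;    // one arbitrary 32-bit pattern

    int32_t as_signed;
    float   as_float;               // assumes sizeof(float) == sizeof(uint32_t)
    memcpy(&as_signed, &bits, sizeof bits);   // reinterpret the same bits
    memcpy(&as_float,  &bits, sizeof bits);

    printf("as unsigned: %" PRIu32 "\n", bits);
    printf("as signed:   %" PRId32 "\n", as_signed);
    printf("as float:    %f\n", as_float);    // about -3.141593 with IEEE-754
    return 0;
}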
There is no law (of humans, logic, or nature) that says bits must represent numbers only in the pattern that one of the bits represents 2^0, another represents 2^1, another represents 2^2, and so on (and the number represented is the sum of those values for the bits that are 1). We have choices about how to use bits to represent numbers, including:
The bits do use that pattern, and so 22 bits can represent any number from 0 to 2^0 + 2^1 + 2^2 + … + 2^21 = 2^22 − 1 = 4,194,303. The smallest representable value is 0.
The bits mostly use that pattern, but it is modified so that one bit represents −2^21 instead of +2^21. This is called two's complement, and the smallest value representable is −2^21 = −2,097,152.
The bits represent numbers as described above, except the represented value is divided by 1000. This is called fixed-point. In the first case, the value represented by all bits 1 would be 4194.303, but the smallest representable value would still be 0. With a combination of two's complement and fixed-point scaled by 1/1000, the smallest representable value would be −2097.152.
The bits represent a floating-point number, where one bit represents a sign (+ or −), certain bits represent an exponent and other information, and the remaining bits represent a significand. In common floating-point formats, when all the bits in that exponent-and-other field are 1s and the significand field bits are 0s, the number represents +∞ or −∞, according to the sign bit. In such a format, the smallest representable value is −∞.
As an example, we could designate patterns of bits to represent numbers arbitrarily. We could say that 0000000000000000000000 represents 34, 0000000000000000000001 represents −15, 0000000000000000000010 represents 5, 0000000000000000000011 represents 3+4i, and so on. The smallest representable value would be whichever of those arbitrary values is smallest.
So what the smallest representable value is depends entirely on the type, since the “type” of the data includes the scheme by which the bits represent values.
If the type is a “signed integer type,” there is still some flexibility in the representation. Most modern C implementations (and other programming languages) use the two’s complement scheme described above. But the C standard still allows two other schemes:
One’s complement: If the first bit is 1, the value represented is negative, and its magnitude is given by complementing the remaining bits and interpreting them as binary. Using six bits for an example, 101001 would be negative with the magnitude of 101102 = 22, so −22.
Sign-and-magnitude: If the first bit is 1, the value represented is negative, and its magnitude is given by interpreting the remaining bits as binary. Using the same bits, 101001 would be negative with the magnitude of 01001₂ = 9, so −9.
In both one’s complement and sign-and-magnitude, the smallest representable value with 22 bits is −(221−1) = −2,097,151.
To stretch the question further, C defines standard integer types but allows implementations to extend the language. An implementation could define some “signed integer type” with an arbitrary scheme for representing numbers, as long as that scheme included a sign, to make the name correct.
Without going into technical jargon about doing maths with two's complement, I'll try to explain it in simple words.
First you raise 2 to the power of the number of bits.
Let's take an 8-bit type as an example:
An unsigned 8-bit integer can store 2^8 = 256 values.
Since counting starts from 0, the values range from 0 to 255.
If you want to store signed values, you take half of that (simply divide by 2):
256 / 2 = 128.
Remember we start from zero.
You might be thinking you could store -127 to 127, counting from zero on both sides.
But there is only one zero (there is no separate +0 or -0),
so the positive half runs from 0 to 127,
which leaves the negative half running from -1 down to -128.
Hence the range will be -128 to 127.
For a 22-bit signed integer you can do the same math:
2^22 = 4,194,304
4,194,304 / 2 = 2,097,152
subtract 1 on the positive side,
and the range will be -2,097,152 to 2,097,151.
To answer your question,
-2,097,152 is the smallest number you can store.
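If you would rather let the machine do this arithmetic, here is a small sketch (my own) that computes the two's-complement range for a given bit width; the 22-bit case matches the numbers above:
#include <stdio.h>

int main(void) {
    int bits = 22;

    // Two's-complement range for an N-bit signed integer:
    // min = -2^(N-1), max = 2^(N-1) - 1. Use long long so the
    // shift happens in a type wide enough to hold the result.
    long long min = -(1LL << (bits - 1));
    long long max =  (1LL << (bits - 1)) - 1;

    printf("%d-bit signed range: %lld to %lld\n", bits, min, max);
    // prints: 22-bit signed range: -2097152 to 2097151
    return 0;
}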
Thanks everyone for the replies. I figured it out with the help of all of your info, but I will explain the answer to show exactly what gaps of knowledge led to my misunderstanding.
The data type does matter in this question because for signed data types the first bit is used to represent whether the number is positive or negative. In a sign-magnitude reading, 0111 = 7 and 1111 = -7 (in two's complement, 1111 would be -1).
signed int and unsigned int use the same number of bits, 32 bits. Since an unsigned int has no sign, the first bit isn't used to represent positive or negative, so it can represent a larger number with that extra bit. 1111 converted to an unsigned int is 15, whereas for the signed int it was -7, since the leftmost bit represents the sign: 1 is negative and 0 is positive.
Now to answer "If a C signed integer type is stored in 22 bits, what is the smallest value it can store?":
With 22 bits there are 2^22 = 4,194,304 possible bit patterns; converting 22 ones to decimal gives 1111111111111111111111 = 4,194,303, which is the maximum value an unsigned type could hold. Since our data type is signed, it has to use one less bit for the magnitude because the first bit represents the sign. This gives us -2,097,152 as the smallest value.
Thanks again, everyone.
Why Do We have unsigned and signed int type in C?
I am studying the C language and there are two different integer types, signed and unsigned. Signed integers can represent both positive and negative numbers. Why do we need unsigned integers then?
One word answer is "range"!
When you declare a signed integer, it takes 4 bytes/ 32 bits memory (on a 32 bit system).
Out of those 32 bits, 1 bit is for the sign and the other 31 bits represent the number. That means you can represent any number from -2,147,483,648 to 2,147,483,647, i.e. about ±2^31.
What if you want to use 2,147,483,648? Go for 8 bytes? But if you are not interested in negative numbers, isn't that a waste of 4 bytes?
If you use unsigned int, all 32 bits represent your number, as you don't need to spare 1 bit for the sign. Hence, with unsigned int, you can go from 0 to 4,294,967,295, i.e. 2^32 - 1.
Same applies for other data types.
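To see the actual limits for your own compiler, a minimal sketch (my addition) using the standard <limits.h> macros:
#include <limits.h>
#include <stdio.h>

int main(void) {
    // The exact ranges are implementation-defined; <limits.h> reports
    // them for the compiler you are using.
    printf("int:          %d to %d\n", INT_MIN, INT_MAX);
    printf("unsigned int: 0 to %u\n", UINT_MAX);
    return 0;
}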
The reason is that an integer type always has a fixed size. On most systems, an int is 32 bits wide.
So whether you have a signed or an unsigned integer, it always takes the same amount of memory. And that's where signed and unsigned differ: the range.
Where an unsigned integer has a range of 0 to 4294967295 (2³²−1), the signed integer has a range of −2147483648 to 2147483647.
Unsigned types have an extra bit of value storage, allowing them a maximum magnitude of 2^(CHAR_BIT * sizeof(type)) − 1 for positive values. This is why types like size_t, which are meant to store sizes of files, strings, arrays, etc., are unsigned.
With signed integers, one bit is reserved for the sign, so if an int is 32-bits long, you only get 31 bits to store the magnitude of the number. An unsigned int does not have this restriction; the MSB is used for magnitude as well, but it comes at the expense of no longer being able to be negative.
Signed integer overflow is undefined by the C standard, whereas unsigned integer arithmetic is guaranteed to wrap around modulo 2^N (so UINT_MAX + 1 becomes zero). For example, the following code invokes undefined behavior in C:
int a = INT_MAX;
a++;
Whereas this is guaranteed to wrap-around back to zero:
unsigned int a = UINT_MAX;
a++;
Unsigned types are generally better for performing bit operations on.
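As a small illustration of that last point (my own sketch, not from the answer, assuming a 32-bit unsigned int): right-shifting an unsigned value is fully defined, while right-shifting a negative signed value is implementation-defined:
#include <stdio.h>

int main(void) {
    unsigned int u = 0xFFFFFFF0u;
    int          s = -16;

    // Right shift of an unsigned value: zeros are shifted in at the top.
    printf("%X\n", u >> 4);     // 0x0FFFFFFF

    // Right shift of a negative signed value is implementation-defined
    // (most compilers do an arithmetic shift, but it is not guaranteed).
    printf("%d\n", s >> 4);     // commonly -1, but not portable
    return 0;
}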
There are two or three reasons, given that C must offer the greatest range of possibilities to the programmer.
The first is that an unsigned integer can hold double the (positive) range of its signed counterpart. And we don't want to waste any single bit, right?
The second is that a protocol, or some data structure a program must cope with, can use unsigned values, so it is handy to have that data type.
The third is that processors actually have unsigned types, so the C language makes them available. There may be algorithms that rely on overflow (wrap-around), for example.
There can still be other motivations; I probably don't remember them all.
Personally, I make large use of unsigned integers in embedded applications. For example, using a single unsigned char as an index into a circular buffer of 256 elements makes it simple and fast to increment the index without checking for overflow, because when the index wraps it does exactly what I want it to do (reset to zero). Again, there are probably many other situations; this is just the first that comes to my mind.
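Here is a minimal sketch of that circular-buffer trick (my own code; buf_put and BUF_SIZE are names I made up, and it assumes unsigned char is 8 bits, as on typical platforms):
#include <stdio.h>

#define BUF_SIZE 256            // matches the full range of an 8-bit unsigned char

static int buffer[BUF_SIZE];
static unsigned char head;      // wraps from 255 back to 0 automatically

// Store a value and advance the index. Because unsigned char holds
// exactly 0..255 here, incrementing past 255 wraps to 0 with no check.
static void buf_put(int value) {
    buffer[head] = value;
    head++;                     // 255 + 1 wraps to 0
}

int main(void) {
    for (int i = 0; i < 300; i++)
        buf_put(i);
    printf("head is now %u\n", (unsigned)head);   // 300 % 256 = 44
    return 0;
}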
It's all about memory. Unsigned types are used to represent a greater range of positive numbers without using a larger amount of memory.
Numbers are stored on the computer in binary form. Signed values commonly use a scheme called two's complement to represent negative numbers, in which the first bit, the one that would otherwise carry the highest value, instead determines the sign.
It means that a signed type can only use N − 1 of its N bits to store the magnitude, with the remaining bit determining the sign of the value, while an unsigned type of the same width can use all of its bits to store its value, with the drawback of not being able to represent negative values.
Is a conversion from an int to a float always possible in C without the float becoming one of the special values like +Inf or -Inf?
AFAIK there is no upper limit on the range of int.
I think a 128-bit int would cause an issue for a platform with an IEEE 754 float, as that has an upper value of around the 127th power of 2.
Short answer to your question: no, it is not always possible.
But it is worthwhile to go into a bit more detail. The following paragraph shows what the standard says about integer to floating-point conversions (online C11 standard draft):
6.3.1.4 Real floating and integer
2) When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined. ...
So many integer values can be converted exactly. Some integer values may lose precision, yet a conversion is at least possible. For some values, however, the behaviour might be undefined (if, for example, an integer value cannot be represented even with the maximum exponent of the float type). Offhand, though, I cannot think of a case where this happens with the usual int and float sizes.
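As a concrete illustration of the "loses precision but the conversion is still defined" case (my own sketch, assuming a 32-bit int and an IEEE-754 binary32 float):
#include <limits.h>
#include <stdio.h>

int main(void) {
    int i = INT_MAX;    // 2147483647 with a 32-bit int
    float f = i;        // in range, so the conversion is well defined...

    // ...but a binary32 float has only 24 bits of precision, so the value
    // is rounded to a nearby representable float (commonly 2147483648.0).
    printf("i = %d\nf = %.1f\n", i, f);
    return 0;
}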
Is it always possible to convert an int to a float?
Reasonably - yes. An int will always convert to a finite float. The conversion may lose some precision for large int values.
Yet for the pedantic, an odd compiler could have trouble.
C allows for excessively wide int, not just 16-, 32- or 64-bit ones, and float is allowed to have a limited range, as small as about 1e37.
It is not the upper range of int or INT_MAX that should be of concern. It is the lower end: INT_MIN, which often has a magnitude one greater than INT_MAX.
A 124-bit int's minimum value could be about -1.06e37, so that does exceed the minimal float range.
With the common binary32 float, an int would need to be more than 128 bits to cause a float infinity.
So what test is needed to detect this rare situation?
Form an exact power-of-2 limit and perform careful math to avoid overflow or imprecision.
#include <float.h>   // FLT_MAX
#include <limits.h>  // INT_MAX, INT_MIN

#if -INT_MAX == INT_MIN
// rare non-2's-complement machine
#define INT_MAX_P1_HALF (INT_MAX/2 + 1)
_Static_assert(FLT_MAX/2 >= INT_MAX_P1_HALF, "non-2's comp.`int` range exceeds `float`");
#else
_Static_assert(-FLT_MAX <= INT_MIN, "2's complement `int` range exceeds `float`");
#endif
The standard only requires floating point representations to include a finite number as large as 10^37 (§5.2.4.2.2/12) and does not put any limit on the maximum size of an integer. So if your implementation has 128-bit integers (or even 124-bit integers), it is possible for an integer-to-float conversion to exceed the range of finite representable floating point numbers.
No, it is not always possible to convert an int to a float exactly, due to how floats work. 32-bit floats greater than 16777216 (or less than -16777216) need to be even, those greater than 33554432 (or less than -33554432) need to be evenly divisible by 4, those greater than 67108864 (or less than -67108864) need to be evenly divisible by 8, etc. The IEEE-754 float standard defines round-to-nearest-even as the default mode, but other modes exist depending upon the implementation.
Also, the largest 128-bit int, 2^128 − 1, is greater than the largest 32-bit float, which is 2^127 × 1.11111111111111111111111₂ = 2^127 × (2 − 2^−23) = 2^128 − 2^104.
given the following function:
int boof(int n) {
return n + ~n + 1;
}
What does this function return? I'm having trouble understanding exactly what is being passed in to it. If I called boof(10), would it convert 10 to base 2, and then do the bitwise operations on the binary number?
This was a question I had on a quiz recently, and I think the answer is supposed to be 0, but I'm not sure how to prove it.
note: I know how each bitwise operator works, I'm more confused on how the input is processed.
Thanks!
When n is an int, n + ~n will always result in an int that has all bits set.
Strictly speaking, the behavior of adding 1 to such an int will depend on the representation of signed numbers on the platform. The C standard supports 3 representations for signed int:
On two's complement machines (the vast majority of systems in use today), the result will be 0, since an int with all bits set is -1.
On a one's complement machine (which is pretty rare today, I believe), the result will be 1, since an int with all bits set is -0 (negative zero), which equals 0; alternatively that bit pattern may be a trap representation, giving undefined behavior.
On a sign-magnitude machine (are there really any of these still in use?), an int with all bits set is the negative number with the maximum magnitude (so the actual value will depend on the size of an int). In this case adding 1 to it will result in a negative number (the exact value, again, depends on the number of bits used to represent an int).
Note that the above ignores that it might be possible for some implementations to trap with various bit configurations that might be possible with n + ~n.
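For completeness, a quick sketch (mine) that exercises the common two's-complement case described above:
#include <stdio.h>

// The quiz function: ~n + 1 is the two's-complement negation of n,
// so on a two's-complement machine this computes n + (-n) = 0.
int boof(int n) {
    return n + ~n + 1;
}

int main(void) {
    printf("%d %d %d\n", boof(10), boof(-7), boof(0));   // prints: 0 0 0
    return 0;
}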
Bitwise operations do not convert the number to base 2; the number is already stored in binary, and all math on the CPU is done on that binary representation regardless.
What this function does is take n and add to it the two's complement negative representation of itself (~n + 1 is the two's-complement negation of n). This essentially negates the input, so anything you put in comes out as 0.
Let me explain with 8 bit numbers as this is easier to visualize.
10 is represented in binary as 00001010.
Negative numbers are stored in two's complement (NOTing the number and adding 1)
So the (~n + 1) portion for 10 looks like so:
11110101 + 1 = 11110110
So if we take n + ~n+1:
00001010 + 11110110 = 0
Notice that if we add these numbers together we get a carry out of the most significant bit, which sets the carry flag, and the 8 bits that remain are 0. (Adding a negative and a positive number together can never cause signed overflow, so this carry does not indicate an error!)
See: The CARRY and OVERFLOW flag in Binary Arithmetic
case 1:
int8_t a = -10;
int32_t b;
b = (int32_t)a;
case 2:
uint8_t a = 10;
uint32_t b;
b = (uint32_t)a;
What will b be in these two cases? Are there any guarantees? Will the 3 extra bytes be padded with 0 during the type conversion?
Clarification: larger as in more bytes.
Converting between integer types is guaranteed to be "correct". That is, if the value being converted (regardless of its type) is representable in the converted-to type, the result will be that same value.
In the first case, -10 is representable as an int32_t, so b will end up holding the 32-bit representation of -10. On 2's complement machines (virtually all modern computers), that'll have a whole lot of 1's at the top. On x86, the cbw, cwd, cwde, and cdq instructions are used to do this.
In the second case, 10 is representable as a uint32_t, so b will end up holding the 32-bit representation of 10. That'll have a whole lot of zeros at the top.
You can think of this as "sign-extension" -- when widening signed integers, the extra bits are copied from the MSB of the source operand -- but that's just implementation details. The rule is, if it can be represented in the destination type, it's represented correctly in the destination type.
The one extra guarantee that C/C++ gives you is that when narrowing unsigned types -- converting from a bigger unsigned integer to a smaller unsigned integer -- the result will be the same as chopping off the upper bits, regardless of whether the value is representable in the smaller type. For signed integers, the result is implementation-defined when the value doesn't fit (but in practice, the same thing always happens, and sometimes that means a positive value becomes negative).
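A small sketch (my own) of the two cases from the question, printing the widened values and their bit patterns; the hexadecimal output assumes a typical two's-complement platform:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Case 1: widening a signed value preserves the value; on two's-
    // complement hardware this is done by sign extension.
    int8_t  a1 = -10;
    int32_t b1 = (int32_t)a1;
    printf("b1 = %" PRId32 " (0x%08" PRIX32 ")\n", b1, (uint32_t)b1);  // -10 (0xFFFFFFF6)

    // Case 2: widening an unsigned value zero-extends.
    uint8_t  a2 = 10;
    uint32_t b2 = (uint32_t)a2;
    printf("b2 = %" PRIu32 " (0x%08" PRIX32 ")\n", b2, b2);            // 10 (0x0000000A)
    return 0;
}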