I'm new to C and still struggling to understand how overflow occurs. Let's say we have the following buggy code to determine whether one string is longer than another:
int strlonger(char *s, char *t) {
return strlen(s) - strlen(t) > 0; // let's say the first return value, strlen(s), is s1, and the second, strlen(t), is s2
}
and we know it is not going to work, as the return type of strlen() is size_t, which is an unsigned type; so when we have something like 1u - 2u > 0, the left operand overflows.
I kind of get the idea: it is something like 1u - 2u is -1, but because both s1 and s2 are unsigned int, the result should also be unsigned int, therefore it overflows.
But considering a different scenario:
int a = 1048577;
size_t b = 4096;
long long unsigned c = a * b;
since 1048577 * 4096 = 4294971392, which is out of range of int or unsigned int, shouldn't the result overflow first? Why is the value preserved just because the left-hand side c is long long unsigned, which can hold it? Wouldn't it be more sensible for this to work only when written this way:
long long unsigned a = 1048577;
long long unsigned b = 4096;
long long unsigned c = a * b;
I kind of get the idea: it is something like 1u - 2u is -1, but because both s1 and s2 are unsigned int, the result should also be unsigned int, therefore it overflows.
Not at all.
The result is whatever type you wish it to be, of course (it can be double for all I care), but that result type is not important - or at least it's not of primary importance, because it doesn't affect whether the operation itself is "OK" or not. The operation itself must be defined before you can even begin thinking about converting the result to any type at all (or leaving it in the "natural" type).
What you should focus on is whether an operation such as subtraction on two values of identical unsigned types is defined. And indeed, it always is defined. The C standard states what the result is - and it is very clear that there is no overflow. In fact, it's even clearer: the result can NEVER overflow:
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type. (ISO/IEC 9899:1999 (E) §6.2.5/9)
Not only that, but conversions between signed and unsigned integers are well defined as well, and -1 (of type int) converts to the maximum value of whatever unsigned type you convert it to. Basically, -1 converted to unsigned int is a short way of writing UINT_MAX, etc.
unsigned char uc = -1;
assert(uc == UCHAR_MAX);
unsigned short us = -1;
assert(us == USHRT_MAX);
unsigned int ui = -1;
assert(ui == UINT_MAX);
unsigned long ul = -1;
assert(ul == ULONG_MAX);
// etc.
long long unsigned c = a * b;
since 1048577 * 4096 = 4294971392, which is out of range of int or unsigned int, shouldn't the result overflow first?
The C language is simply not designed to interpret it the way you do. That's all. Most decisions in programming language design are completely arbitrary. You might be surprised, of course, that the designers made a decision different from the one you'd have made, but both are equally arbitrary.
What happens here is that the usual arithmetic conversions convert the int operand a to size_t, so the whole multiplication is performed in size_t, and because it is an unsigned type, it never overflows. The C standard says so. On your platform size_t is 64 bits wide, so 4294971392 fits, and the result is then converted to long long unsigned by the assignment. Note that the type of c plays no role in how a * b is evaluated; if size_t were only 32 bits wide, the product would wrap around before the assignment ever happened.
One could argue that doing it the way you propose is worse, because there'd be way more typing to get something that should just work. If C worked the way you wanted, you'd need to write your expression as follows:
int a = 1048577;
size_t b = 4096;
long long unsigned c = (long long unsigned)a * (long long unsigned)b;
One could argue that forcing everyone to pollute their code with endless casts that way would be unkind to say the least. C is nicer than you expect it to be.
Of course C is also full of things that are abhorrent, so you were just lucky that you asked about this and not, say, the millionth question about why gets() is bad. The truth is: gets() is like Voldemort. You don't say gets and you don't use gets and everything is fine.
Related
I thought the following code might cause an overflow since a * 65535 is larger than what unsigned short int can hold, but the result seems to be correct.
Is there some built-in mechanism in C that stores intermediary arithmetic results in a larger data type? Or is it working as an accident?
unsigned char a = 30;
unsigned short int b = a * 65535 / 100;
printf("%hu", b);
It works because all types narrower than int undergo default promotion. Since unsigned char and unsigned short are both smaller than int on your platform, they are promoted to int, and the result won't overflow as long as int has 22 bits or more (the number of bits needed to represent 30 * 65535, plus a sign bit). However, if int has fewer bits, the multiplication overflows and undefined behavior happens. It also won't work if sizeof(unsigned short) == sizeof(int), because then unsigned short is not promoted to int.
Default promotion allows operations to be done faster (because most CPUs work best with values of their native word size) and also prevents some naive overflows from happening. See
Implicit type promotion rules
Will char and short be promoted to int before being demoted in assignment expressions?
Why must a short be converted to an int before arithmetic operations in C and C++?
If I declare two max integers in C:
int a = INT_MAX;
int b = INT_MAX;
and sum them into another int:
int c = a+b;
I know there is an integer overflow but I am not sure how to handle it.
This causes undefined behavior since you are using signed integers (which cause undefined behavior if they overflow).
You will need to find a way to avoid the overflow, or if possible, switch to unsigned integers (which use wrapping overflow).
One possible solution is to switch to long integers such that no overflow occurs.
Another possibility is checking for the overflow first:
if (b > INT_MAX - a) { // assuming a and b are non-negative
// a + b would overflow, do something else
}
Note: I'm assuming here you don't actually know the exact values of a and b.
For the calculation to be meaningful, you would have to use a type large enough to hold the result. Apart from that, overflow is only a problem for signed int. If you use unsigned types, then you don't get undefined overflow, but well-defined wrap-around.
In this specific case the solution is trivial:
unsigned int c = (unsigned int)a + (unsigned int)b; // 4.29 bil
Otherwise, if you truly wish to know the signed equivalent of the raw binary value, you can do:
int c = (unsigned int)a + (unsigned int)b;
As long as the calculation is carried out on unsigned types there's no danger (and the value will fit in this case; it won't wrap around). The result of the addition is implicitly converted through assignment to the signed type of the left operand of =. This conversion is implementation-defined, in that the result depends on the signed format used. On 2's complement mainstream computers you will very likely get the value -2.
I'm currently fixing a legacy bug in C code. In the process of fixing this bug, I stored an unsigned int into an unsigned long long. But to my surprise, math stopped working when I compiled this code on a 64-bit version of GCC. I discovered that the problem was that when I assigned an int value to a long long, I got a number that looked like 0x0000000012345678, but on the 64-bit machine, that number became 0xFFFFFFFF12345678.
Can someone explain to me or point me to some sort of spec or documentation on what is supposed to happen when storing a smaller data type in a larger one and perhaps what the appropriate pattern for doing this in C is?
Update - Code Sample
Here's what I'm doing:
// Results in 0xFFFFFFFFC0000000 in 64 bit gcc 4.1.2
// Results in 0x00000000C0000000 in 32 bit gcc 3.4.6
u_long foo = 3 * 1024 * 1024 * 1024;
I think you have to tell the compiler that the number on the right is unsigned. Otherwise it thinks it's a normal signed int, and since the sign bit is set, it thinks it's negative, and then it sign-extends it into the receiver.
So do some unsigned casting on the right.
Expressions are generally evaluated independently; their results are not affected by the context in which they appear.
An integer constant like 1024 has the first of the types int, long int, and long long int into which its value fits; in the particular case of 1024 that's always int.
I'll assume here that u_long is a typedef for unsigned long (though you also mentioned long long in your question).
So given:
unsigned long foo = 3 * 1024 * 1024 * 1024;
the 4 constants in the initialization expression are all of type int, and all three multiplications are int-by-int. The result happens to be greater (by a factor of 1.5) than 2^31, which means it won't fit in an int on a system where int is 32 bits. The int result, whatever it is, will be implicitly converted to the target type unsigned long, but by that time it's too late; the overflow has already occurred.
The overflow means that your code has undefined behavior (and since this can be determined at compile time, I'd expect your compiler to warn about it). In practice, signed overflow typically wraps around, so the above will typically set foo to -1073741824. You can't count on that (and it's not what you want anyway).
The ideal solution is to avoid the implicit conversions by ensuring that everything is of the target type in the first place:
unsigned long foo = 3UL * 1024UL * 1024UL * 1024UL;
(Strictly speaking only the first operand needs to be of type unsigned long, but it's simpler to be consistent.)
Let's look at the more general case:
int a, b, c, d; /* assume these are initialized */
unsigned long foo = a * b * c * d;
You can't add a UL suffix to a variable. If possible, you should change the declarations of a, b, c, and d so they're of type unsigned long, but perhaps there's some other reason they need to be of type int. You can add casts to explicitly convert each one to the correct type. By using casts, you can control exactly when the conversions are performed:
unsigned long foo = (unsigned long)a *
(unsigned long)b *
(unsigned long)c *
(unsigned long)d;
This gets a bit verbose; you might consider applying the cast only to the leftmost operand (after making sure you understand how the expression is parsed).
NOTE: This will not work:
unsigned long foo = (unsigned long)(a * b * c * d);
The cast converts the int result to unsigned long, but only after the overflow has already occurred. It merely makes explicit the conversion that would otherwise have been performed implicitly.
Integer literals without a suffix are int if they can fit; in your case 3 and 1024 can definitely fit. This is covered in the draft C99 standard section 6.4.4.1 Integer constants; a quote of this section can be found in my answer to Are C macros implicitly cast?.
Next we have the multiplication, which performs the usual arithmetic conversions on its operands; but since they are all int, the result is too large to fit in a signed int, which results in overflow. This is undefined behavior as per section 6.5 paragraph 5, which says:
If an exceptional condition occurs during the evaluation of an expression (that is, if the
result is not mathematically defined or not in the range of representable values for its
type), the behavior is undefined.
We can discover this undefined behavior empirically using clang and the -fsanitize=undefined flags (see it live) which says:
runtime error: signed integer overflow: 3145728 * 1024 cannot be represented in type 'int'
Although on a two's complement implementation this will just end up being a negative number. One way to fix this would be to use the ul suffix:
3ul * 1024ul * 1024ul * 1024ul
So why does a negative number converted to an unsigned value give a very large unsigned value? This is covered in section 6.3.1.3 Signed and unsigned integers which says:
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.49)
Which basically means ULONG_MAX + 1 is added to the negative number, resulting in a very large unsigned value.
Can you tell me what exactly does the u after a number, for example:
#define NAME_DEFINE 1u
Small decimal integer literals like 1 in C code are of the type int. int is the same thing as signed int. One adds u or U (they are equivalent) to the literal to make it unsigned int, to prevent various unexpected bugs and strange behavior.
One example of such a bug:
On a 16-bit machine where int is 16 bits, this expression will result in a negative value:
long x = 30000 + 30000;
Both 30000 literals are int, and since both operands are int, the result will be int. A 16-bit signed int can only contain values up to 32767, so it will overflow. x will get a strange, negative value because of this, rather than 60000 as expected.
The code
long x = 30000u + 30000u;
will however behave as expected.
It is a way to define unsigned literal integer constants.
It is a way of telling the compiler that the constant 1 is meant to be used as an unsigned integer. The compiler treats any decimal number without a suffix like 'u' as having a signed int type. To avoid this confusion, it is recommended to use a suffix like 'u' when using a constant as an unsigned integer. Other similar suffixes also exist; for example, 'f' is used for float.
It means "unsigned int": the suffix gives the constant itself an unsigned type at compile time, rather than relying on a conversion later.
A decimal literal in the code (rules for octal and hexadecimal literals are different, see https://en.cppreference.com/w/c/language/integer_constant) has one of the types int, long or long long. From these, the compiler has to choose the smallest type that is large enough to hold the value. Note that the types char, signed char and short are not considered. For example:
0 // this is a zero of type int
32767 // type int
32768 // could be int or long: On systems with 16 bit integers
// the type will be long, because the value does not fit in an int there.
If you add a u suffix to such a number (a capital U will also do), the compiler will instead have to choose the smallest type from unsigned int, unsigned long and unsigned long long. For example:
0u // a zero of type unsigned int
32768u // type unsigned int: always fits into an unsigned int
100000u // unsigned int or unsigned long
The last example can be used to show the difference to a cast:
100000u // always 100000, but may be unsigned int or unsigned long
(unsigned int)100000 // always unsigned int, but not always 100000
// (e.g. if int has only 16 bit)
On a side note: There are situations, where adding a u suffix is the right thing to ensure correctness of computations, as Lundin's answer demonstrates. However, there are also coding guidelines that strictly forbid mixing of signed and unsigned types, even to the extent that the following statement
unsigned int x = 0;
is classified as non-conforming and has to be written as
unsigned int x = 0u;
This can lead to a situation where developers who deal a lot with unsigned values develop the habit of adding u suffixes to literals everywhere. But be aware that changing signedness can lead to different behavior in various contexts, for example:
(x > 0)
can (depending on the type of x) mean something different than
(x > 0u)
Luckily, the compiler / code checker will typically warn you about suspicious cases. Nevertheless, adding a u suffix should be done with consideration.
My apologies if the question seems weird. I'm debugging my code and this seems to be the problem, but I'm not sure.
Thanks!
It depends on what you want the behaviour to be. An int cannot hold many of the values that an unsigned int can.
You can cast as usual:
int signedInt = (int) myUnsigned;
but this will cause problems if the unsigned value is past the max int can hold. This means half of the possible unsigned values will result in erroneous behaviour unless you specifically watch out for it.
You should probably reexamine how you store values in the first place if you're having to convert for no good reason.
EDIT: As mentioned by ProdigySim in the comments, the maximum value is platform dependent. But you can access it with INT_MAX and UINT_MAX.
For the usual 4-byte types:
4 bytes = (4*8) bits = 32 bits
If all 32 bits are used, as in unsigned, the maximum value will be 2^32 - 1, or 4,294,967,295.
A signed int effectively sacrifices one bit for the sign, so the maximum value will be 2^31 - 1, or 2,147,483,647. Note that this is half of the other value.
An unsigned int can be converted to signed (or vice versa) with a simple cast, as shown below:
unsigned int z;
int y = 5;
z = (unsigned int)y;
Though not targeted at the question, you may want to read the following links:
signed to unsigned conversion in C - is it always safe?
performance of unsigned vs signed integers
Unsigned and signed values in C
What type-conversions are happening?
IMHO this question is an evergreen. As stated in various answers, the conversion to a signed type of an unsigned value that is not in the range [0, INT_MAX] is implementation-defined and might even raise a signal. If the unsigned value is considered to be a two's complement representation of a signed number, the most portable way is IMHO the one shown in the following code snippet:
#include <limits.h>
unsigned int u;
int i;
if (u <= (unsigned int)INT_MAX)
i = (int)u; /*(1)*/
else if (u >= (unsigned int)INT_MIN)
i = -(int)~u - 1; /*(2)*/
else
i = INT_MIN; /*(3)*/
Branch (1) is obvious and cannot invoke overflow or traps, since it
is value-preserving.
Branch (2) goes through some pains to avoid signed integer overflow:
it takes the one's complement of the value with bitwise NOT, casts it
to 'int' (which cannot overflow now), negates the value, and subtracts
one, which also cannot overflow here.
Branch (3) provides the poison we have to take on one's complement or
sign/magnitude targets, because the signed integer representation
range is smaller than the two's complement representation range.
This is likely to boil down to a simple move on a two's complement target; at least I've observed such with GCC and CLANG. Also branch (3) is unreachable on such a target -- if one wants to limit the execution to two's complement targets, the code could be condensed to
#include <limits.h>
unsigned int u;
int i;
if (u <= (unsigned int)INT_MAX)
i = (int)u; /*(1)*/
else
i = -(int)~u - 1; /*(2)*/
The recipe works with any signed/unsigned type pair, and the code is best put into a macro or inline function so the compiler/optimizer can sort it out. (In which case rewriting the recipe with a ternary operator is helpful. But it's less readable and therefore not a good way to explain the strategy.)
And yes, some of the casts to 'unsigned int' are redundant, but:
- they might help the casual reader
- some compilers issue warnings on signed/unsigned compares, because the implicit conversion causes some non-intuitive behavior by language design
If you have a variable unsigned int x;, you can convert it to an int using (int)x.
It's as simple as this:
unsigned int foo;
int bar = 10;
foo = (unsigned int)bar;
Or vice versa...
If an unsigned int and a (signed) int are used in the same expression, the signed int gets implicitly converted to unsigned. This is a rather dangerous feature of the C language, and one you therefore need to be aware of. It may or may not be the cause of your bug. If you want a more detailed answer, you'll have to post some code.
An explanation from C++ Primer, 5th Edition, page 35:
If we assign an out-of-range value to an object of unsigned type, the result is the remainder of the value modulo the number of values the target type can hold.
For example, an 8-bit unsigned char can hold values from 0 through 255, inclusive. If we assign a value outside the range, the compiler assigns the remainder of that value modulo 256.
unsigned char c = -1; // assuming 8-bit chars, c has value 255
If we assign an out-of-range value to an object of signed type, the result is undefined. The program might appear to work, it might crash, or it might produce garbage values.
Page 160:
If any operand is an unsigned type, the type to which the operands are converted depends on the relative sizes of the integral types on the machine.
...
When the signedness differs and the type of the unsigned operand is the same as or larger than that of the signed operand, the signed operand is converted to unsigned.
The remaining case is when the signed operand has a larger type than the unsigned operand. In this case, the result is machine dependent. If all values in the unsigned type fit in the large type, then the unsigned operand is converted to the signed type. If the values don't fit, then the signed operand is converted to the unsigned type.
For example, if the operands are long and unsigned int, and int and long have the same size, the long will be converted to unsigned int. If the long type has more bits, then the unsigned int will be converted to long.
I found reading this book very helpful.