I'm looking here to understand sign extension:
http://www.shrubbery.net/solaris9ab/SUNWdev/SOL64TRANS/p8.html
#include <stdio.h>

struct foo {
    unsigned int base:19, rehash:13;
};

int main(int argc, char *argv[])
{
    struct foo a;
    unsigned long addr;

    a.base = 0x40000;

    addr = a.base << 13; /* Sign extension here! */
    printf("addr 0x%lx\n", addr);

    addr = (unsigned int)(a.base << 13); /* No sign extension here! */
    printf("addr 0x%lx\n", addr);
    return 0;
}
They claim this:
------------------ 64 bit:
% cc -o test64 -xarch=v9 test.c
% ./test64
addr 0xffffffff80000000
addr 0x80000000
%
------------------ 32 bit:
% cc -o test32 test.c
% ./test32
addr 0x80000000
addr 0x80000000
%
I have 3 questions:
What is sign extension? Yes, I read the wiki, but I didn't understand it: when type promotion occurs, what's going on with sign extension?
Why the ffff... in 64-bit (referring to addr)?
When I do the type cast, why is there no sign extension?
EDIT:
4. Why is this not an issue on a 32-bit system?
The left operand of the << operator undergoes the integer promotions, so in your case it is promoted to int -- so far so good. Next, the int of value 0x40000 is multiplied by 2^13, which causes overflow and thus undefined behaviour. However, we can see what's happening: the value of the expression is now simply INT_MIN, the smallest representable int. Finally, when you convert that to an unsigned 64-bit integer, the usual modular arithmetic rules entail that the resulting value is 0xffffffff80000000. Similarly, converting to an unsigned 32-bit integer gives the value 0x80000000.
To perform the operation on unsigned values, you need to control the conversions with a cast:
(unsigned int)(a.base) << 13
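As a minimal sketch of that fix (assuming 32-bit int and the LP64 model, as in the original example; the struct and values are taken from the question):

#include <stdio.h>

struct foo {
    unsigned int base:19, rehash:13;
};

int main(void)
{
    struct foo a;
    a.base = 0x40000;
    /* The cast forces an unsigned shift: no signed overflow, no sign extension. */
    unsigned long addr = (unsigned int)a.base << 13;
    printf("addr 0x%lx\n", addr); /* 0x80000000 */
    return 0;
}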
a.base << 13
The bitwise shift operator performs the integer promotions on each of its operands.
So this is equivalent to:
(int) a.base << 13
which is a negative value of type int.
Then:
addr = (int) a.base << 13;
converts this negative signed value, ((int) a.base << 13), to the type of addr, which is unsigned long, through the integer conversions.
The integer conversion rules (C99, 6.3.1.3p2) say this is the same as doing:
addr = (long) ((int) a.base << 13);
The conversion to long performs the sign extension here because ((int) a.base << 13) is a negative signed number.
In the other case, with a cast you have something equivalent to:
addr = (unsigned long) (unsigned int) ((int) a.base << 13);
so no sign extension is performed in your second case because (unsigned int) ((int) a.base << 13) is an unsigned (and positive of course) value.
EDIT: as KerrekSB mentioned in his answer, a.base << 13 is actually not representable in an int (I assume a 32-bit int), so this expression invokes undefined behavior and the implementation has the right to behave in any other way, for example crashing.
For information: this is definitely not portable, but if you are using gcc, gcc does not treat a.base << 13 here as undefined behavior. From the gcc documentation:
"GCC does not use the latitude given in C99 only to treat certain aspects of signed '<<' as undefined, but this is subject to change."
in http://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html
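To see the two conversion paths described above side by side, here is a minimal sketch (assuming 32-bit int and 64-bit unsigned long):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int neg = INT_MIN; /* bit pattern 0x80000000 with 32-bit int */
    /* signed int -> unsigned long: modular wrap-around, which on a
       two's complement machine looks like sign extension. */
    printf("0x%lx\n", (unsigned long)neg); /* 0xffffffff80000000 */
    /* Going through unsigned int first makes the value non-negative,
       so widening just zero-fills the upper bits. */
    printf("0x%lx\n", (unsigned long)(unsigned int)neg); /* 0x80000000 */
    return 0;
}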
This is more of a question about bit-fields. Note that if you change the struct to
struct foo {
unsigned int base, rehash;
};
you get very different results.
As @JensGustedt noted in Type of unsigned bit-fields: int or unsigned int, the specification says:
If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int;
Even though you've specified that base is unsigned, the compiler converts it to a signed int when you read it. That's why you don't get sign extension when you cast it to unsigned int.
Sign extension has to do with how negative numbers are represented in binary. The most common scheme is two's complement. In this scheme, -1 is represented in 32 bits as 0xFFFFFFFF, -2 is 0xFFFFFFFE, etc.
So what should be done when we want to convert a 32-bit number to a 64-bit number, for example? If we convert 0xFFFFFFFF to 0x00000000FFFFFFFF, the numbers will have the same unsigned value (about 4 billion), but different signed values (-1 vs. 4 billion). On the other hand, if we convert 0xFFFFFFFF to 0xFFFFFFFFFFFFFFFF, the numbers will have the same signed value (-1) but different unsigned values.
The former is called zero-extension (and is appropriate for unsigned numbers); the latter is called sign-extension (and is appropriate for signed numbers). It's called "sign-extension" because the "sign bit" (the most significant, or left-most, bit) is extended, or copied, to make the number wider.
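A short sketch of the two widening behaviours just described (assuming 32-bit int and 64-bit long long):

#include <stdio.h>

int main(void)
{
    int s = -1;                   /* bit pattern 0xFFFFFFFF */
    unsigned int u = 0xFFFFFFFFu; /* same bit pattern */
    /* Signed widening copies the sign bit: sign extension. */
    printf("0x%llx\n", (unsigned long long)(long long)s); /* 0xffffffffffffffff */
    /* Unsigned widening fills with zeros: zero extension. */
    printf("0x%llx\n", (unsigned long long)u); /* 0xffffffff */
    return 0;
}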
It took me a while and a lot of reading/testing.
Maybe my beginner's way of understanding what's going on will get through to you (it's how I got it).
a.base = 0x40000 (a 1 followed by 18 zeros: 1(0)x18) -> a 19-bit bit-field
addr = a.base << 13.
Any value a.base can hold, int can hold too, so a.base is converted from a 19-bit unsigned bit-field to a 32-bit signed int (a.base is now (0)x13,1,(0)x18).
Now (a.base converted to signed int) << 13 results in 1(0)x31. Remember, it's a signed int now.
addr = 1(0)x31. addr is of unsigned long type (64-bit), so to do the assignment the right-hand value is converted to that width. Conversion from signed int to a 64-bit type sign-extends, making addr (1)x33,(0)x31.
And that's what is printed after all of those conversions you weren't even aware of:
0xffffffff80000000.
The second line prints 0x80000000 because of the cast to (unsigned int) before the conversion to 64 bits. When converting an unsigned int to a wider unsigned type there is no sign bit, so the value is just filled with leading 0's to match the size, and that's all.
What's different on 32-bit is that during the conversion from 32-bit signed int to 32-bit unsigned long their sizes match, so no sign bits are added:
1(0)x31 will stay 1(0)x31
even after conversion from int to long int (they have the same size; the value is interpreted differently, but the bits are intact).
Quotation from your link:
Any code that makes this assumption must be changed to work for both
ILP32 and LP64. While an int and a long are both 32-bits in the ILP32
data model, in the LP64 data model, a long is 64-bits.
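If in doubt about which data model you are on, a quick sketch like this will tell you (ILP32 prints 4 and 4; LP64 prints 4 and 8):

#include <stdio.h>

int main(void)
{
    printf("sizeof(int)  = %zu\n", sizeof(int));
    printf("sizeof(long) = %zu\n", sizeof(long));
    return 0;
}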
When coding in C, I accidentally found that for non-ASCII characters, after they are converted from char (1 byte) to int (4 bytes), the extra bits (3 bytes) are filled with 1 rather than 0. (For ASCII characters, the extra bits are filled with 0.) For example:
char c[] = "ā";
int i = c[0];
printf("%x\n", i);
And the result is ffffffc4, rather than c4 itself. (The UTF-8 code for ā is \xc4\x81.)
Another related issue: when performing right-shift operations (>>) on a non-ASCII character, the extra bits on the left end are also filled with 1 rather than 0, even when the char variable is explicitly converted to unsigned int (for a signed int, the extra bits are filled with 1 on my OS). For example:
#include <stdio.h>

int main(void)
{
    char c[] = "ā";
    unsigned int u_c;
    int i = c[0];
    unsigned int u_i = c[0];

    c[0] = (unsigned int)c[0] >> 1;
    u_c = (unsigned int)c[0] >> 1;
    i = i >> 1;
    u_i = u_i >> 1;

    printf("c=%x\n", (unsigned int)c[0]); // result: ffffffe2. The same with the signed int i.
    printf("u_c=%x\n", u_c);              // result: 7fffffe2.
    printf("i=%x\n", i);                  // result: ffffffe2.
    printf("u_i=%x\n", u_i);              // result: 7fffffe2.
    return 0;
}
Now I am confused by these results... Do they depend on the representations of char, int and unsigned int, are they related to my operating system (Ubuntu 14.04), or are they required by ANSI C? I have tried to compile this program with both gcc (4.8.4) and clang (3.4), and there is no difference.
Thank you so much!
It is implementation-defined whether char is signed or unsigned. On x86 computers, char is customarily a signed integer type; and on ARM it is customarily an unsigned integer type.
A signed integer will be sign-extended when converted to a larger signed type;
a signed integer converted to unsigned integer will use the modulo arithmetic to wrap the signed value into the range of the unsigned type as if by repeatedly adding or subtracting the maximum value of the unsigned type + 1.
The solution is to use/cast to unsigned char if you want the value to be portably zero-extended, or for storing small integers in range 0..255.
Likewise, if you want to store signed integers in the range -127..127 (or -128..127 on two's complement machines), use signed char.
Use char if the signedness doesn't matter - the implementation will probably have chosen the type that is the most efficient for the platform.
Likewise, for the assignment
unsigned int u_i = c[0];
since -0x3c, i.e. -60, is not in the range of unsigned int, the actual value is the value (mod UINT_MAX + 1) that falls in the range of unsigned int; in other words, we add or subtract UINT_MAX + 1 (notice that the integer promotions can play tricks here, so you might need casts in C code) until the value is in range. With a 32-bit unsigned int, UINT_MAX is 0xFFFFFFFF; add 1 to it to get 0x100000000. 0x100000000 - 0x3C is the 0xFFFFFFC4 that you saw. (Had the target been a narrower unsigned type such as uint16_t, the same modular rule would give 0xFFC4, which is then zero-extended in any wider context.)
Had you run this on a platform where char is unsigned, the result would have been 0xC4!
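A minimal sketch of the fix, assuming a two's complement machine where plain char is signed (as in the question):

#include <stdio.h>

int main(void)
{
    char c[] = "\xc4\x81";                   /* the UTF-8 bytes of "ā" */
    int sign_extended = c[0];                /* char is signed here, so -60 */
    int zero_extended = (unsigned char)c[0]; /* always 0xc4 */
    printf("%x\n", (unsigned)sign_extended); /* ffffffc4 */
    printf("%x\n", (unsigned)zero_extended); /* c4 */
    return 0;
}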
BTW in i = i >> 1;, i is a signed integer with a negative value; C11 says that the value is implementation-defined, so the actual behaviour can change from compiler to compiler. The GCC manuals state that
Signed >> acts on negative numbers by sign extension.
However, a strictly conforming program should not rely on this.
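For example, a sketch assuming GCC's arithmetic-shift behaviour for signed >> on a 32-bit two's complement machine:

#include <stdio.h>

int main(void)
{
    int i = -60; /* bit pattern 0xffffffc4 */
    /* Signed >>: implementation-defined; GCC shifts in copies of the sign bit. */
    printf("%x\n", (unsigned int)(i >> 1)); /* ffffffe2 */
    /* Unsigned >>: always shifts in zeros. */
    printf("%x\n", (unsigned int)i >> 1);   /* 7fffffe2 */
    return 0;
}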
I thought I'd found something similar in this answer but in that case they weren't assigning the result of the expression to the variable. In my case I am assigning it but the bitshift part of the expression has no effect.
unsigned leftmost1 = ((~0)>>20);
printf("leftmost1 %u\n", leftmost1);
Returns
leftmost1 4294967295
Whereas
unsigned leftmost1 = ~0;
leftmost1 = leftmost1 >> 20;
printf("leftmost1 %u\n", leftmost1);
Gives me
leftmost1 4095
I would expect separating the logic into two lines to have no impact, so why are the results different?
In the first case, you are doing a signed right shift, because ~0 results in a signed value. The exact behavior of signed right shifts is implementation-defined, but most platforms, including yours, extend the sign bit, so the shift is a no-op for your input of "all ones".
In the second case, you are doing an unsigned right shift, since leftmost1 is an unsigned value. So you shift in zeros from the left.
If you wanted to do an unsigned shift without the intermediate assignment, you can do:
(~0u) >> 20
Where the u suffix indicates an unsigned literal.
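A quick sketch contrasting the two (assuming 32-bit int and the common arithmetic-shift behaviour for signed values):

#include <stdio.h>

int main(void)
{
    /* ~0 is a signed int (-1); the shift is implementation-defined and
       commonly shifts in sign bits, leaving the value unchanged. */
    printf("%u\n", (unsigned)(~0 >> 20)); /* typically 4294967295 */
    /* ~0u is unsigned; zeros are shifted in from the left. */
    printf("%u\n", ~0u >> 20); /* 4095 */
    return 0;
}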
~0 is an int. So your first piece of code isn't equivalent to the second, it's equivalent to
int tmp = ~0;
tmp = tmp >> 20;
unsigned leftmost1 = tmp;
You're seeing the results of sign extension when you right-shift a negative number.
0 has type int. ~0 is -1 on a typical two's complement machine. Right-shifting a negative number has implementation-defined results, but a common choice is to shift in 1 bits, which for -1 leaves the number unchanged (i.e. -1 >> anything is -1 again).
You can fix this by writing 0u (which is a literal of type unsigned int). This forces the operations to be done in unsigned int, as in your second example:
unsigned leftmost1 = ~0;
This line is equivalent to unsigned leftmost1 = -1, which implicitly converts -1 (a signed int) to UINT_MAX. The following operation (leftmost1 >> 20) then uses unsigned arithmetic.
Try casting like this. ~0 has type int, which is signed, so it carries the sign bit when you shift:
unsigned leftmost1 = ((unsigned)(~0)>>20);
printf("leftmost1 %u\n", leftmost1);
I am trying to convert 65529 from an unsigned int to a signed int. I tried doing a cast like this:
unsigned int x = 65529;
int y = (int) x;
But y is still returning 65529 when it should return -7. Why is that?
It seems like you are expecting int and unsigned int to be a 16-bit integer. That's apparently not the case. Most likely, it's a 32-bit integer - which is large enough to avoid the wrap-around that you're expecting.
Note that there is no fully C-compliant way to do this because casting between signed/unsigned for values out of range is implementation-defined. But this will still work in most cases:
unsigned int x = 65529;
int y = (short) x; // If short is a 16-bit integer.
or alternatively:
unsigned int x = 65529;
int y = (int16_t) x; // This is defined in <stdint.h>
I know it's an old question, but it's a good one, so how about this?
unsigned short int x = 65529U;
short int y = *(short int*)&x;
printf("%d\n", y);
This works because we are casting the address of x to the signed version of its type, which is permitted by the C standard. Not all type punning like this (most of it, in fact) is legal. The standard says this:
An object shall have its stored value accessed only by an lvalue that has one of the following types:
the declared type of the object,
a qualified version of the declared type of the object,
a type that is the signed or unsigned type corresponding to the declared type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
a character type.
So, since we are accessing the bits of x as if they were a signed short (via the pointer), the actual conversion operation is replaced by reading what appears to be just a negative signed short, and the conversion takes place without issue. It's possible for this to go wrong on a one's complement machine, but those are so rare and so obsolete that I wouldn't even bother looking out for them.
@Mysticial got it. A short is usually 16-bit and will illustrate the answer:
#include <stdio.h>

int main(void)
{
    unsigned int x = 65529;
    int y = (int) x;
    printf("%d\n", y);

    unsigned short z = 65529;
    short zz = (short) z;
    printf("%d\n", zz);
    return 0;
}
65529
-7
Press any key to continue . . .
A little more detail. It's all about how signed numbers are stored in memory. Do a search for twos-complement notation for more detail, but here are the basics.
So let's look at 65529 decimal. It can be represented as FFF9h in hexadecimal. We can also represent that in binary as:
11111111 11111001
When we declare short zz = 65529;, the compiler interprets 65529 as a signed value. In twos-complement notation, the top bit signifies whether a signed value is positive or negative. In this case, you can see the top bit is a 1, so it is treated as a negative number. That's why it prints out -7.
For an unsigned short, we don't care about sign since it's unsigned. So when we print it out using %d, we use all 16 bits, so it's interpreted as 65529.
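You can verify the two's complement rule by hand with a small sketch: inverting the bits and adding one recovers the magnitude of a negative value:

#include <stdio.h>

int main(void)
{
    unsigned short u = 0xFFF9; /* 65529, the bit pattern shown above */
    /* Two's complement: invert and add one to get the magnitude. */
    unsigned short magnitude = (unsigned short)(~u + 1);
    printf("magnitude = %u\n", magnitude); /* 7, so 0xFFF9 reads as -7 */
    return 0;
}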
To understand why, you need to know that the CPU represents signed numbers using two's complement (maybe not all CPUs, but many do).
signed char n = 1; // 0000 0001 = 1
n = ~n + 1;        // 1111 1110 + 0000 0001 = 1111 1111 = -1
Also, the types int and unsigned int can be of different sizes depending on your CPU. When you need exact sizes, use something like this:
#include <stdint.h>
int8_t ibyte;
uint8_t ubyte;
int16_t iword;
//......
The representations of the values 65529u and -7 are identical for 16-bit ints. Only the interpretation of the bits is different.
For larger ints and these values, you need to sign extend; one way is with logical operations
int y = (int)(x | 0xffff0000u); // assumes 16-to-32 extension; x > 32767
If speed is not an issue, or divide is fast on your processor,
int y = ((int)(x * 65536u)) / 65536;
The multiply shifts left 16 bits (again, assuming 16 to 32 extension), and the divide shifts right maintaining the sign.
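A sketch of both techniques under the same assumptions (32-bit int, the low 16 bits of x holding the value; note the final unsigned-to-int conversion is itself implementation-defined, though it behaves as expected on common two's complement platforms):

#include <stdio.h>

int main(void)
{
    unsigned int x = 65529u; /* 0xFFF9: "negative" when viewed as 16-bit */
    /* Technique 1: OR in the upper bits manually. */
    int y1 = (int)(x | 0xffff0000u);
    /* Technique 2: multiply to shift left 16, then divide to shift
       right while keeping the sign. */
    int y2 = ((int)(x * 65536u)) / 65536;
    printf("%d %d\n", y1, y2); /* -7 -7 */
    return 0;
}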
You are expecting your int type to be 16 bits wide, in which case you'd indeed get a negative value. But most likely it's 32 bits wide, so a signed int can represent 65529 just fine. You can check this by printing sizeof(int).
To answer the question posted in the comment above - try something like this:
unsigned short int x = 65529U;
short int y = (short int)x;
printf("%d\n", y);
or
unsigned short int x = 65529U;
short int y = 0;
memcpy(&y, &x, sizeof(short int));
printf("%d\n", y);
Since unsigned values are used to represent positive numbers, the conversion can be done by clearing the most significant bit; the program will then not interpret the value as two's complement. One caveat is that this loses information for numbers near the max of the unsigned type.
template <typename TUnsigned, typename TSigned>
TSigned UnsignedToSigned(TUnsigned val)
{
    return val & ~(TUnsigned(1) << ((sizeof(TUnsigned) * 8) - 1));
}
I know this is an old question, but I think the responders may have misinterpreted it. I think what was intended was to convert a 16-bit sequence received as an unsigned integer (technically, an unsigned short) into a signed integer. This might happen (it recently did to me) when you need to convert something received from a network from network byte order to host byte order. In that case, use a union:
unsigned short value_from_network;
unsigned short host_val = ntohs(value_from_network);
// Now suppose host_val is 65529.
union SignedUnsigned {
short s_int;
unsigned short us_int;
};
SignedUnsigned su;
su.us_int = host_val;
short minus_seven = su.s_int;
And now minus_seven has the value -7.
#include "stdio.h"
int main()
{
int x = -13701;
unsigned int y = 3;
signed short z = x / y;
printf("z = %d\n", z);
return 0;
}
I would expect the answer to be -4567. I am getting "z = 17278".
Why does a promotion of these numbers result in 17278?
I executed this in Code Pad.
The hidden type conversions are:
signed short z = (signed short) (((unsigned int) x) / y);
When you mix signed and unsigned types the unsigned ones win. x is converted to unsigned int, divided by 3, and then that result is down-converted to (signed) short. With 32-bit integers:
(unsigned) -13701 == (unsigned) 0xFFFFCA7B // Bit pattern
(unsigned) 0xFFFFCA7B == (unsigned) 4294953595 // Re-interpret as unsigned
(unsigned) 4294953595 / 3 == (unsigned) 1431651198 // Divide by 3
(unsigned) 1431651198 == (unsigned) 0x5555437E // Bit pattern of that result
(short) 0x5555437E == (short) 0x437E // Strip high 16 bits
(short) 0x437E == (short) 17278 // Re-interpret as short
By the way, the signed keyword is unnecessary: signed short is a longer way of saying short. The only type that needs an explicit signed is char; char can be signed or unsigned depending on the platform, while all other integer types are signed by default.
Short answer: the division first promotes x to unsigned. Only then the result is cast back to a signed short.
Long answer: read this SO thread.
The problem comes from the unsigned int y. Indeed, x/y becomes unsigned. It works with:
#include "stdio.h"
int main()
{
int x = -13701;
signed int y = 3;
signed short z = x / y;
printf("z = %d\n", z);
return 0;
}
Every time you mix "large" signed and unsigned values in additive and multiplicative arithmetic operations, unsigned type "wins" and the evaluation is performed in the domain of the unsigned type ("large" means int and larger). If your original signed value was negative, it first will be converted to positive unsigned value in accordance with the rules of signed-to-unsigned conversions. In your case -13701 will turn into UINT_MAX + 1 - 13701 and the result will be used as the dividend.
Note that signed-to-unsigned conversion on a typical platform with 32-bit int results in the unsigned value 4294953595. After division by 3 you'll get 1431651198. This value is too large to be forced into a short object on a platform with a 16-bit short type. An attempt to do that results in implementation-defined behavior. So, if the properties of your platform are the same as in my assumptions, then your code produces implementation-defined behavior. Formally speaking, the "meaningless" 17278 value you are getting is nothing more than a specific manifestation of that implementation-defined behavior. It is possible that if you compiled your code with overflow checking enabled (if your compiler supports it), it would trap on the assignment.
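A minimal fix sketch: convert y to a signed type before the division so the whole evaluation stays in signed arithmetic:

#include <stdio.h>

int main(void)
{
    int x = -13701;
    unsigned int y = 3;
    /* The cast makes the division signed: -13701 / 3 = -4567,
       which fits in a short. */
    signed short z = x / (int)y;
    printf("z = %d\n", z); /* z = -4567 */
    return 0;
}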