On architectures where int is represented using multiple bytes in memory, what constraints does the C Standard impose regarding possible representations? Most current systems use either little-endian or big-endian representations, but is it possible to have a conforming system with a different representation? How different can it be?
what constraints does the C Standard impose regarding possible representations?
Three encodings are allowed: 2's complement, 1s' complement, and sign-magnitude. A non-2's-complement system could have either a -0 or a trap representation.
int must be 16 bits or wider (a range of at least [-32767...32767]). It could be 36 or 64 bits, to cite real historic examples.
but is it possible to have a conforming system with a different representation?
Example: PDP-endian (middle-endian).
0x01020304 stored as 2, 1, 4, 3. See also #chqrlie.
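As an illustration, here is a small sketch (the hex constant and the assumption that unsigned int has at least 32 bits are just for demonstration) that dumps the bytes of an unsigned int to show the byte order of whatever machine it runs on:

#include <stdio.h>

int main(void) {
    unsigned int x = 0x01020304u;   /* assumes unsigned int has at least 32 bits */
    const unsigned char *p = (const unsigned char *)&x;

    /* Dump the bytes of x in memory order: little-endian prints 04 03 02 01,
     * big-endian prints 01 02 03 04, and a PDP-11-style middle-endian layout
     * would print 02 01 04 03. */
    for (size_t i = 0; i < sizeof x; i++)
        printf("%02x ", p[i]);
    putchar('\n');
    return 0;
}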
How different can it be?
int may have padding, char cannot. I do not know of any int with padding.
int could be 1 "byte" when a "byte" is 16 bits or more.
IIRC, some graphics processors used 64-bit "byte", char, int, long, long long.
I once used a 64-bit long and unsigned long where the unsigned long had 1 padding bit such that ULONG_MAX == LONG_MAX. Compliant but unusual. In theory, UINT_MAX == INT_MAX is possible too, though I have never heard of such an implementation.
In 2020, I suspect the following are practically universal:
Endian: either big or little.
2's complement. (Next C might require this.)
"byte size" of 8 (maybe 16, 32), int is 16 or 32 bit.
No padding.
From the following citations from the standard, we see:
int has at least 16 bits.
Any ordering of bytes is permissible.
Any ordering of bits is permissible (but must match unsigned int).
The value bits are binary.
Negative values use one of the three specified methods.
C 2018 6.2.6.1 says:
1 The representations of all types are unspecified except as stated in this subclause.
2 Except for bit-fields, objects are composed of contiguous sequences of one or more bytes, the number, order, and encoding of which are either explicitly specified or implementation-defined.
4 Values stored in non-bit-field objects of any other object type [other than unsigned bit-fields and unsigned char, addressed in paragraph 3] consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes…
6.2.6.2 says:
1 For unsigned integer types other than unsigned char,… If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation;…
2 For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M ≤ N ). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
— the corresponding value with sign bit 0 is negated (sign and magnitude);
— the sign bit has the value −(2^M) (two’s complement);
— the sign bit has the value −(2^M − 1) (ones’ complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones’ complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.
5 The values of any padding bits are unspecified… For any integer type, the object representation where all the bits are zero shall be a representation of the value zero in that type.
And 5.2.4.2.1 tells us int must be able to represent at least −32767 to +32767, from which we deduce it has at least 15 value bits.
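As a sketch of how a program can query its own implementation, the macros in <limits.h> report the actual ranges; the two's-complement test below simply checks whether INT_MIN is -INT_MAX - 1:

#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("CHAR_BIT   = %d\n", CHAR_BIT);
    printf("sizeof int = %zu bytes\n", sizeof(int));
    printf("INT_MIN    = %d\nINT_MAX    = %d\n", INT_MIN, INT_MAX);

    /* INT_MIN + INT_MAX is -1 exactly when the sign bit contributes -(2^M)
     * (two's complement) and the most negative pattern is not a trap;
     * otherwise it is 0. */
    if (INT_MIN + INT_MAX == -1)
        puts("int uses two's complement");
    else
        puts("int uses ones' complement, sign-magnitude, or two's complement with a trap for the most negative pattern");
    return 0;
}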
Related
Do unsigned int and signed int have any relevance as far as storage is concerned? I know it matters when printing, i.e. -1 will be treated as 4294967295 (%d vs. %u). If we consider just the storage of the value, would signed or unsigned make a difference?
In C, you cannot have a value without a type. (Various operations are defined in terms of mathematical values, but each operation is specified to produce a result in a particular type, so, at each point in a C expression where there is a value, it has a type.) So any value is stored by storing the bytes of the object that represents it.
The C 2018 standard specifies the representations of types in 6.2.6 and of integer types specifically in 6.2.6.2. Objects are composed of one or more bits. Unsigned integers are represented with pure binary plus optional padding bits. The order of the bits is not specified. For a signed integer type, one of the bits is a sign bit, and each value bit has the same value as the same bit of the corresponding unsigned type. Some of the value bits in the unsigned type may be padding bits (unused for value) in the signed type. (But the total number of bits is the same, per 6.2.5 6.) The sign bit either indicates the value is negated or it represents the value −(2^M) or −(2^M − 1), where M is the number of value bits. (Which of those three is implementation-defined.)
Therefore, whether an integer type is signed or unsigned makes no difference regarding the interpretation of the common value bits. It affects only the interpretation of bits that are value bits in the unsigned type but a sign bit or padding bits in the signed type. (The latter are rare.)
If the value in a signed integer type is the same as the value in its corresponding unsigned integer type, they have the same value in each of their common value bits and zeros in all of their unshared sign or value bits. (The padding bits are not specified by the C standard.)
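A minimal sketch of that claim, assuming an implementation without padding bits (with padding bits the byte-wise comparison could, in principle, fail even though the values agree):

#include <stdio.h>
#include <string.h>

int main(void) {
    int          s = 1234;   /* non-negative, representable in both types */
    unsigned int u = 1234;

    /* With no padding bits, all bits of the two objects are equal: the common
     * value bits hold 1234 and the sign bit of s is 0. Padding bits, if any,
     * are unspecified and could make this comparison fail. */
    puts(memcmp(&s, &u, sizeof s) == 0 ? "identical object representation"
                                       : "representations differ (padding bits?)");
    return 0;
}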
The char type is used in the example, but the question relates to any integer type.
Can I be sure that
signed char foo = 127;
will always be binary identical to
unsigned char foo = 127;
so that it's possible to use the signed variant for a raw byte representation if the MSB is not needed?
The bits representing the values are not necessarily identical if padding bits are present but are identical if there are no padding bits.
C 2018 6.2.6.2 2 says of signed integer types:
Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type…
So the value bits in a signed integer type are the same as the value bits in the corresponding unsigned type. This leaves three more sets of bits to consider:
The sign bit.
Value bits that are in the unsigned type but not the signed type.
Padding bits.
The sign bit must be zero, because this paragraph also says:
… If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:…
Those “following ways” (sign-and-magnitude, one’s complement, two’s complement) all result in negative values if the value is not zero. Since we are told the represented value is positive, it is not negative, and so the sign bit must be zero. (We should note the question asserts the value is positive, thus excluding zero. With sign-and-magnitude or one’s complement, zero can be represented with a sign bit of one, and thus it could have different bits from an unsigned integer zero, which has all zero bits.)
Any value bits that exist in the unsigned type but not the signed type must be zero, since the value is the same in both types.
That leaves the padding bits, and that is where the correspondence fails. The values of padding bits are not specified by the C standard and therefore may differ between signed and unsigned type or even between two instances of the same value in the same type. A specific C implementation may of course define its padding bits so that signed and unsigned types with the same value always have the same padding bits, and that the padding bits correspond between the signed and unsigned types. (We could imagine that the sign bit in the signed type corresponds to a padding bit in the unsigned type instead of to a value bit.)
C 1999 has the same wording.
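In the specific case of the question, neither signed char nor unsigned char may have padding bits, so a byte-wise comparison like this sketch should always report a match:

#include <stdio.h>
#include <string.h>

int main(void) {
    signed char   s = 127;
    unsigned char u = 127;

    /* Neither signed char nor unsigned char may have padding bits, so a
     * positive value that fits in both has the same object representation. */
    puts(memcmp(&s, &u, sizeof s) == 0 ? "identical" : "different");
    return 0;
}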
This question might be very basic, but I post it here only after days of googling, for a proper basic understanding of signed integers in C.
Some say signed int has the range -32767 to 32767, and others say it has the range -32768 to 32767.
Let us have int a = 5 (signed; for the bit patterns below, consider just 1 byte):
*the 1st representation of a=5 is represented as 00000101 as a positive number and a=-5 is represented as 10000101 (so range -32767 to 32767 justified)
(here, if the msb/sign bit is 0/1 the number is positive/negative, and the rest (magnitude bits) are unchanged)
*the 2nd representation of a=5 is represented as 00000101 as a positive number and a=-5 is represented as 11111011
(the msb is considered as -128 and the rest of bits are manipulated to obtain -5) (so range -32768 to 32767 justified)
So I am confused between these two. What is the actual range of signed int in C: 1) or 2)?
It depends on your environment. Typically int can store -2147483648 to 2147483647 if it is 32 bits wide and two's complement is used, but the C specification only requires that int can store at least -32767 to 32767.
Quote from N1256 5.2.4.2.1 Sizes of integer types <limits.h>
Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— minimum value for an object of type int
INT_MIN  -32767  // −(2^15 − 1)
— maximum value for an object of type int
INT_MAX  +32767  // 2^15 − 1
Today, signed ints are usually done in two's complement notation.
The highest bit is the "sign bit", it is set for all negative numbers.
This means you have fifteen bits left to represent the magnitude.
With the highest bit unset, you can (with 16 bits total) represent the values 0..32767.
With the highest bit set, and because you already have a representation for zero, you can represent the values -1..-32768.
This is, however, implementation-defined, other representations do exist as well. The actual range limits for signed integers on your platform / for your compiler are the ones found in your environment's <limits.h>. That is the only definite authority.
On today's desktop systems, an int is usually 32 or 64 bits wide, for a correspondingly much larger range than the 16-bit 32767 / 32768 you are talking of. So either those people are talking about really old platforms, really old knowledge, embedded systems, or the minimum guaranteed range -- the standard requires only that INT_MIN be -32767 or lower and INT_MAX be +32767 or higher, the lowest common denominator.
What is the actual range of signed int in C: 1) [-32767 to 32767] or 2) [-32768 to 32767]?
The whole point of C and its advantage of high portability to old and new platforms is that code should not care.
C defines the range of int with 2 macros: INT_MIN and INT_MAX. The C spec specifies:
INT_MIN is -32,767 or less.
INT_MAX is +32,767 or more.
If code needs a 16-bit 2's complement type, use int16_t. If code needs a 32-bit or wider type, use long or int_least32_t, etc. Do not code assuming int is something that it is not defined to be.
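A brief sketch of that advice, using <limits.h> for the actual range of int and <stdint.h> when an exact or minimum width is needed (int16_t is an optional type; int_least32_t is always available):

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Ask the implementation rather than assuming a width for int. */
    printf("int range: [%d, %d]\n", INT_MIN, INT_MAX);

    int16_t       exact16 = -12345;  /* exactly 16 bits, two's complement (optional type) */
    int_least32_t wide    = 100000;  /* at least 32 bits, always available */
    printf("%d %ld\n", (int)exact16, (long)wide);
    return 0;
}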
The value 32767 is the maximum positive value you can represent in a signed 16-bit integer. The corresponding C type is short.
The int type is represented with at least as many bytes as short and at most as many bytes as long. The size of int on 16-bit processors is 2 bytes (the same as short). On 32-bit and wider architectures, the size of int is typically 4 bytes (the same as long on 32-bit platforms).
No matter the architecture, the minimum value of int is INT_MIN and the maximum value of int is INT_MAX.
Similarly, there are constants to get the minimum and maximum values for short (SHRT_MIN and SHRT_MAX), long, char, etc. You don't need to use hardcoded constants or guess what the minimum value for int is on your system.
The representation #1 is named "sign and magnitude representation". It uses the most significant bit to store the sign and the rest of the bits to store the absolute value of the number. It was used by some early computers, probably because it seemed a natural mapping of how numbers are written in mathematics. However, it is not natural for binary computers.
The representation #2 is named two's complement. The two's-complement system has the advantage that the fundamental arithmetic operations of addition, subtraction, and multiplication are identical to those for unsigned binary numbers (as long as the inputs are represented in the same number of bits and any overflow beyond those bits is discarded from the result). This is why it is the preferred encoding nowadays.
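A small sketch of that property, assuming the optional int16_t/uint16_t types are available: adding the raw 16-bit patterns and discarding the carry yields exactly the bit pattern of the signed sum.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int16_t a = -5, b = 3;
    int16_t signed_sum = (int16_t)(a + b);               /* -2 */

    /* Do the same addition on the raw 16-bit patterns, discarding any carry. */
    uint16_t ua, ub, usum;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    usum = (uint16_t)(ua + ub);

    /* int16_t is two's complement by definition, so the unsigned result has
     * exactly the same bit pattern as the signed sum. */
    printf("signed: %d, pattern: 0x%04x, match: %d\n",
           signed_sum, (unsigned)usum,
           memcmp(&usum, &signed_sum, sizeof usum) == 0);
    return 0;
}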
The C standard specifies the lowest limits for integer values. As it is written in the Standard (5.2.4.2.1 Sizes of integer types )
...Their implementation-defined values shall be equal or greater in
magnitude (absolute value) to those shown, with the same sign.
For objects of type int these lowest limits are
— minimum value for an object of type int
INT_MIN  -32767  // −(2^15 − 1)
— maximum value for an object of type int
INT_MAX  +32767  // 2^15 − 1
For the two's complement representation of integers, the number of positive values is one less than the number of negative values. So if only two bytes are used for the representation of objects of type int, then INT_MIN will be equal to -32768.
Take into account that 32768 in magnitude is greater than the value used in the Standard. So it satisfies the Standard requirement.
On the other hand, for the "sign and magnitude" representation, the limits (when 2 bytes are used) will be the same as shown in the Standard, that is, -32767 to 32767.
So the actual limits used in the implementation depend on the width of integers and their representation.
As per the C standard, the value representation of an integer type is implementation-defined. So 5 might not be represented as 00000000000000000000000000000101, or -1 as 11111111111111111111111111111111, as we usually assume for 32-bit 2's complement. So even though the operators ~, << and >> are well defined, the bit patterns they will work on are implementation-defined. The only defined bit pattern I could find was "§5.2.1/3 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.".
So my question is: is there an implementation-independent way of converting integer types to a bit pattern?
We can always start with a null character and do enough bit operations on it to get it to a desired value, but I find that too cumbersome. I also realise that practically all implementations will use a 2's complement representation, but I want to know how to do it in a pure C standard way. Personally I find this topic quite intriguing because of device-driver programming, where all code written to date assumes a particular implementation.
In general, it's not that hard to accommodate unusual platforms in most cases (if you don't want to simply assume 8-bit char, 2's complement, no padding, no trap, and truncating unsigned-to-signed conversion); the standard mostly gives enough guarantees (a few macros to inspect certain implementation details would be helpful, though).
As far as a strictly conforming program can observe (outside bit-fields), 5 is always encoded as 00...0101. This is not necessarily the physical representation (whatever this should mean), but what is observable by portable code. A machine using Gray code internally, for example, would have to emulate a "pure binary notation" for bitwise operators and shifts.
For negative values of signed types, different encodings are allowed, which leads to different (but well-defined for every case) results when re-interpreting as the corresponding unsigned type. For example, strictly conforming code must distinguish between (unsigned)n and *(unsigned *)&n for a signed integer n: They are equal for two's complement without padding bits, but different for the other encodings if n is negative.
Further, padding bits may exist, and signed integer types may have more padding bits than their corresponding unsigned counterparts (but not the other way round, type-punning from signed to unsigned is always valid). sizeof cannot be used to get the number of non-padding bits, so e.g. to get an unsigned value where only the sign-bit (of the corresponding signed type) is set, something like this must be used:
#define TYPE_PUN(to, from, x) ( *(to *)&(from){(x)} )

unsigned sign_bit = TYPE_PUN(unsigned, int, INT_MIN) &
                    TYPE_PUN(unsigned, int, -1) & ~1u;
(there are probably nicer ways) instead of
unsigned sign_bit = 1u << sizeof sign_bit * CHAR_BIT - 1;
as this may shift by more than the width. (I don't know of a constant expression giving the width, but sign_bit from above can be right-shifted until it's 0 to determine it, Gcc can constant-fold that.) Padding bits can be inspected by memcpying into an unsigned char array, though they may appear to "wobble": Reading the same padding bit twice may give different results.
If you want the bit pattern (without padding bits) of a signed integer (printed least-significant bit first):
#include <limits.h>  /* INT_MAX */
#include <stdio.h>   /* putchar */

int print_bits_u(unsigned n) {
    for(; n; n >>= 1) {
        putchar(n&1 ? '1' : '0'); // n&1 never traps
    }
    return 0;
}

int print_bits(int n) {
    return print_bits_u(*(unsigned *)&n & INT_MAX);
    /* This masks padding bits if int has more of them than unsigned int.
     * Note that INT_MAX is promoted to unsigned int here. */
}

int print_bits_2scomp(int n) {
    return print_bits_u(n);
}
print_bits gives different results for negative numbers depending on the representation used (it gives the raw bit pattern), print_bits_2scomp gives the two's complement representation (possibly with a greater width than a signed int has, if unsigned int has fewer padding bits).
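A possible way to exercise these functions (a sketch; it assumes the three functions above are defined in the same file, and the commented output is for a two's-complement machine without padding bits):

#include <stdio.h>

/* Assumes print_bits_u, print_bits and print_bits_2scomp from above are
 * defined in the same translation unit. */
int main(void) {
    print_bits_u(6u);        /* prints "011": 6 is binary 110, least-significant bit first */
    putchar('\n');
    print_bits_2scomp(-2);   /* a 0 followed by ones: the two's-complement pattern of -2 */
    putchar('\n');
    print_bits(-2);          /* raw value-bit pattern of -2; depends on the representation */
    putchar('\n');
    return 0;
}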
Care must be taken not to generate trap representations when using bitwise operators and when type-punning from unsigned to signed, see below how these can potentially be generated (as an example, *(int *)&sign_bit can trap with two's complement, and -1 | 1 can trap with ones' complement).
Unsigned-to-signed integer conversion (if the converted value isn't representable in the target type) is always implementation-defined. I would expect non-2's-complement machines to be more likely to differ from the common definition, though technically it could also become an issue on 2's-complement implementations.
From C11 (n1570) 6.2.6.2:
(1) For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.
(2) For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed
type and N in the unsigned type, then M≤N ). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
the corresponding value with sign bit 0 is negated (sign and magnitude);
the sign bit has the value -(2^M) (two's complement);
the sign bit has the value -(2^M - 1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.
To add to mafso's excellent answer, there's a part of the ANSI C rationale which talks about this:
The Committee has explicitly restricted the C language to binary architectures, on the grounds that this stricture was implicit in any case:
Bit-fields are specified by a number of bits, with no mention of “invalid integer” representation. The only reasonable encoding for such bit-fields is binary.
The integer formats for printf suggest no provision for “invalid integer” values, implying that any result of bitwise manipulation produces an integer result which can be printed by printf.
All methods of specifying integer constants — decimal, hex, and octal — specify an integer value. No method independent of integers is defined for specifying “bit-string constants.” Only a binary encoding provides a complete one-to-one mapping between bit strings and integer values.
The restriction to binary numeration systems rules out such curiosities as Gray code and makes possible arithmetic definitions of the bitwise operators on unsigned types.
The relevant part of the standard might be this quote:
3.1.2.5 Types
[...]
The type char, the signed and unsigned integer types, and the
enumerated types are collectively called integral types. The
representations of integral types shall define values by use of a pure
binary numeration system.
If you want to get the bit pattern of a given int, then bitwise operators are your friends. If you want to convert an int to its two's-complement representation, then arithmetic operators are your friends. The two representations can be different, as the representation is implementation-defined:
Std Draft 2011. 6.5/4. Some operators (the unary operator ~, and the
binary operators <<, >>, &, ^, and |, collectively described as
bitwise operators) are required to have operands that have integer
type. These operators yield values that depend on the internal
representations of integers, and have implementation-defined and
undefined aspects for signed types.
So it means that i<<1 will effectively shift the bit pattern by one position to the left, but the value produced can be different from i*2 (even for small values of i).
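One representation-independent way to obtain the two's-complement pattern of an int is to go through a conversion to unsigned, which is defined purely in terms of values (modulo UINT_MAX + 1). A sketch:

#include <limits.h>
#include <stdio.h>

/* Print the two's-complement pattern of n, most-significant bit first.
 * (unsigned)n is defined by value, modulo UINT_MAX + 1, so the result does
 * not depend on how the implementation stores n internally. */
void print_2s_complement(int n) {
    unsigned u = (unsigned)n;
    for (unsigned mask = UINT_MAX - (UINT_MAX >> 1); mask; mask >>= 1)
        putchar(u & mask ? '1' : '0');
    putchar('\n');
}

int main(void) {
    print_2s_complement(5);   /* 00...00101 */
    print_2s_complement(-1);  /* all ones */
    return 0;
}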
In a 16-bit C compiler we have 2 bytes to store an integer, and 1 byte for a character. For unsigned integers the range is 0 to 65535. For signed integers the range is -32768 to 32767. For an unsigned character, it is 0 to 255. Following the pattern of the integer types, shouldn't the signed character range be -128 to 127? So why -127 to 127? What about the remaining bit?
I think you're mixing two things:
What ranges the standard requires for signed char, int etc.
What ranges are implemented in most hardware these days.
These don't necessarily have to be the same as long as the range implemented is a superset of the range required by the standard.
According to the C standard, the implementation-defined values of SCHAR_MIN and SCHAR_MAX shall be equal or greater in magnitude (absolute value) to, and of the same sign as:
SCHAR_MIN -127
SCHAR_MAX +127
i.e. only 255 values, not 256.
However, the limits defined by a compliant implementation can be 'greater' in magnitude than these. i.e. [-128,+127] is allowed by the standard too. And since most machines represent numbers in the 2's complement form, [-128,+127] is the range you will get to see most often.
Actually, even the minimum range of int defined by the C standard is symmetric about zero. It is:
INT_MIN -32767
INT_MAX +32767
i.e. only 65535 values, not 65536.
But again, most machines use 2's complement representation, and this means that they offer the range [-32768,+32767].
While in 2's complement form it is possible to represent 256 signed values in 8 bits (i.e. [-128,+127]), there are other signed number representations where this is not possible.
In the sign-magnitude representation, one bit is reserved for the sign, so:
00000000
10000000
both mean the same thing, i.e. 0 (or rather, +0 and -0).
This means one value is wasted, and thus the sign-magnitude representation can only hold values from -127 (11111111) to +127 (01111111) in 8 bits.
In the one's complement representation (negate by doing bitwise NOT):
00000000
11111111
both mean the same thing, i.e. 0.
Again, only values from -127 (10000000) to +127 (01111111) can be represented in 8 bits.
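As a small illustration of the three encodings of -5 in 8 bits (the helper below computes the patterns arithmetically, so it does not depend on what the host machine itself uses):

#include <stdio.h>

/* Print an 8-bit pattern, most-significant bit first. */
static void print8(unsigned v) {
    for (int i = 7; i >= 0; i--)
        putchar((v >> i) & 1u ? '1' : '0');
    putchar('\n');
}

int main(void) {
    unsigned mag = 5u;               /* magnitude of -5: 00000101 */

    print8(0x80u | mag);             /* sign-magnitude:   10000101 */
    print8(0xFFu & ~mag);            /* ones' complement: 11111010 */
    print8(0x100u - mag);            /* two's complement: 11111011 */
    return 0;
}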
If the C standard required the range to be [-128, +127], it would essentially exclude machines using such representations from running C programs efficiently: they would need an additional bit to represent that range, and thus 9 bits to store signed characters instead of 8. This is why the C standard requires only [-127, +127] for signed characters: it allows implementations the freedom to choose a form of integer representation that suits their needs while still adhering to the standard efficiently. The same logic applies to int as well.