In the C programming language, if an int is 4 bytes and letters are represented in ASCII as a number (also an int), then why is a char 1 byte?
A char is one byte because the standard says so. But that's not really what you are asking. In terms of decimal values, a char (when signed and 8 bits, which is typical) can hold -128 to 127. Have a look at a table of ASCII character codes and you'll notice that the decimal values of those codes are between 0 and 127, so they fit in the non-negative values of a char. There are extended character sets that use unsigned char and values from 0 to 255.
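As a small illustration (my own sketch, not part of the answer above), this prints the range of plain char on your implementation and shows that an ASCII code fits comfortably:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    char c = 'A';   /* 'A' is code 65 in ASCII */

    /* CHAR_MIN/CHAR_MAX give the range of plain char on this
       implementation; ASCII codes 0-127 always fit. */
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    printf("'%c' is stored as the number %d\n", c, c);
    return 0;
}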
6.2.5 Types
...
3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.
...
5 An object declared as type signed char occupies the same amount of storage as a ‘‘plain’’ char object. A ‘‘plain’’ int object has the natural size suggested by the architecture of the execution environment (large enough to contain any value in the range INT_MIN to INT_MAX as defined in the header <limits.h>).
C 2012 Online Draft
Type sizes are not defined in terms of bits, but in terms of the range of values that must be represented.
The basic execution character set consists of 96 or so characters (26 uppercase Latin letters, 26 lowercase Latin letters, 10 decimal digits, 29 graphic characters, space, vertical tab, horizontal tab, line feed, form feed); 8 bits is more than sufficient to represent those.
int, OTOH, must be able to represent a much wider range of values; the minimum range as specified in the standard is [-32767..32767]¹, although on most modern implementations it’s much wider.
¹ The standard doesn’t assume two’s complement representation of signed integers, which is why INT_MIN is -32767 and not -32768.
In the C language, a char usually has a size of 8 bits.
In all the compilers that I have seen (which are, admittedly, not very many), char is taken to be large enough to hold the ASCII character set (or the so-called “extended ASCII”), and the size of the char data type is 8 bits (this includes compilers on the major desktop platforms and some embedded systems).
1 byte was sufficient to represent the whole character set.
int a = 0x11223344;
char b = (char)a;
I am new to programming and learning C. Why do I get the value of b here as 'D'?
If I want to store an integer into a char type variable, which byte of the integer will be stored?
This is not fully defined by the C standard.
In the particular situation you tried, what likely happened is that the low eight bits of 0x11223344 were stored in b, producing 0x44 (68 in decimal) in b, and printing that prints “D” because your system uses ASCII character codes, and 68 is the ASCII code for “D”.
However, you should be wary of something like this working, because it is contingent on several things, and variations are possible.
First, the C standard allows char to be signed or unsigned. It also allows char to be any width that is eight bits or greater. In most C implementations today, it is eight bits.
Second, the conversion from int to char depends on whether char is signed or unsigned and may not be fully defined by the C standard.
If char is unsigned, then the conversion is defined to wrap modulo M+1, where M is the largest value representable in char. Effectively, this is the same as taking the low byte of the value. If the unsigned char has eight bits, its M is 255, so M+1 is 256.
If char is signed and the value is out of range of the char type, the conversion is implementation-defined: It may either trap or produce an implementation-defined value. Your C implementation may wrap conversions to signed integer types similarly to how it wraps conversions to unsigned types, but another reasonable behavior is to “clamp” out-of-range values to the limits of the type, CHAR_MIN and CHAR_MAX. For example, converting −8000 to char could yield the minimum, −128, while converting 0x11223344 to char could yield the maximum, +127.
Third, the C standard does not require implementations to use ASCII. It is very common to use ASCII. (Usually, the character encoding is not just ASCII, because ASCII covers only values from 0 to 127. C implementations often use some extension beyond ASCII for values from 128 to 255.)
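A small sketch of that experiment (my own, assuming a typical implementation with an 8-bit signed char and ASCII, none of which is guaranteed, as explained above):

#include <stdio.h>

int main(void)
{
    int a = 0x11223344;
    char b = (char)a;   /* commonly keeps the low byte, 0x44, but the
                           result is implementation-defined if char is
                           signed and the value does not fit */

    printf("b as a character: %c\n", b);               /* 'D' under ASCII */
    printf("b as a number: %d (0x%hhx)\n", b, (unsigned char)b);
    return 0;
}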
In the textbook "The C Programming Language," page 9 has the line below.
"char character--a single byte"
Does this mean that a variable of type "char" can hold just one letter, number, or symbol?
I also want to understand the precise definition of the term.
My understanding is below. Is this correct?
Character: Any letter, number or symbol.
Character string: Several characters.
If it is wrong, I want the correct definition.
The formal C standard definition of character sets (5.2.1):
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
The basic character set is specified to contain:
the 26 uppercase letters of the Latin alphabet /--/
the 26 lowercase letters of the Latin alphabet /--/
the 10 decimal digits /--/
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed.
In the basic execution character set, there shall be
control characters representing alert, backspace, carriage return, and new line.
The representation of each member of the source and execution basic
character sets shall fit in a byte.
Then 6.2.5 says:
An object declared as type char is large enough to store any member of the basic execution character set.
The formal definition of a byte is very similar (3.6):
byte
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment
Furthermore, it is specified that a char is always 1 byte large (6.5.3.4):
The sizeof operator yields the size (in bytes) of its operand /--/
When sizeof is applied to an operand that has type char, unsigned char, or
signed char, (or a qualified version thereof) the result is 1.
The C standard does not however specify the number of bits in a byte, only that it has to be 8 bits or more.
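Both points are easy to check on your own machine; here is a minimal sketch (mine, not from the answer) that prints sizeof(char) and CHAR_BIT:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* sizeof(char) is 1 by definition; CHAR_BIT says how many bits
       that one byte actually contains (at least 8). */
    printf("sizeof(char) = %zu\n", sizeof(char));
    printf("CHAR_BIT     = %d\n", CHAR_BIT);
    return 0;
}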
The standard (draft n1570 for C11) says:
An object declared as type char is large enough to store any member of the basic
execution character set. If a member of the basic execution character set is stored in a
char object, its value is guaranteed to be nonnegative.
As the standard's basic character set contains all the uppercase and lowercase letters, the decimal digits and some other characters, at least 7 bits are needed to represent it. In any case, the standard mandates that the size of a char be at least 8 bits:
[The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— number of bits for smallest object that is not a bit-field (byte)
CHAR_BIT 8
A char must be individually addressable. For that reason a char is said to be a byte, and by definition sizeof(char) is 1, whatever the exact number of bits - some old mainframes used 12- or 16-bit characters.
unsigned char and signed char are integer types that use the same storage size as char. They are distinct types, yet the conversions between the 3 types are perfectly defined and never change the representation. Even though char is a distinct type, the standard requires:
The implementation shall define char to have the same range,
representation, and behavior as either signed char or unsigned char.
On common architectures, a char uses 8 bits. The values in the range 0-127 represent the ASCII character set (NB: this is not mandated by the standard, and other representations like EBCDIC have been used). Values in the other range (-128 to -1, or 128-255) are called extended characters and can represent either an ISO-8859-x (Latin) charset, or bytes of a multi-byte character set like UTF-8 or UCS-2 (the subset of UTF-16 covering Unicode characters in the 0-FFFF range). ISO-8859-1, or Latin-1, is a single-byte charset representing the Unicode characters in the range 0-255. It used to be a de facto standard, and Windows still uses CP1252 (a close variation) for Western European language systems.
TL;DR: to directly answer your question:
a char represents a symbol, at least one from the basic execution character set
a character string is, by convention, a null-terminated char array. The symbols it represents depend on the charset used, and for multi-byte charsets (like UTF-8) there is no one-to-one relation between a char and a symbol (see the sketch below)
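A tiny sketch of that last point (my own, assuming a UTF-8 execution charset): the string below contains four symbols but five char elements before the terminator, so strlen counts char units, not symbols.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "\xC3\xA9" is the UTF-8 encoding of 'e' with an acute accent:
       one symbol, but two char elements. */
    const char *s = "caf\xC3\xA9";

    printf("symbols: 4, strlen: %zu\n", strlen(s));   /* strlen is 5 */
    return 0;
}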
A char takes one byte of storage and, when signed, can represent a value between -128 and +127. It is commonly used to hold a single ASCII character. In the ASCII encoding, all printable characters are assigned values from 32 (space) to 126 (tilde, '~'), with the non-printable characters assigned to the rest of the codes.
Note that unlike the Java char (which is 16 bits wide and can hold any character in Unicode's Basic Multilingual Plane), the C char will not be able to represent all Unicode characters, such as the extended Latin ones.
Typically char is a one-byte type, and as a byte is made of 8 bits, the value range for a char is 0 to 255, or -128 to 127 if signed (one bit is effectively used for the sign).
Those 256 values are used to represent, in the case of a char, a symbol: a letter, a digit, or some special character from the ASCII table.
If you want, for example, to store a Japanese character or an emoji, which requires more than one byte (as there are far more than 256 characters), you will have to use a type that supports such a size, such as wchar_t for wide characters, or a multi-byte encoding.
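A hedged sketch of that idea: the wide-character literal below uses a universal character name; on most implementations the resulting wchar_t value is the Unicode code point, although the exact encoding of wchar_t is implementation-defined.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* \u3042 is HIRAGANA LETTER A; its code point does not fit in one
       byte, so a plain char cannot hold it. The wchar_t value is
       usually (but not necessarily) the Unicode code point 0x3042. */
    wchar_t w = L'\u3042';

    printf("sizeof(wchar_t) = %zu, value = 0x%lx\n",
           sizeof(wchar_t), (unsigned long)w);
    return 0;
}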
My computer science teacher taught us that which data type to declare depends on the size of the value a variable needs to hold. He then demonstrated adding and subtracting a number to/from a char to output a different char, and I remember he said this has something to do with ASCII codes. Can anyone explain this more specifically and clearly? So, is a char considered a number (since we can do math with it), a character, or both? Can we print out the number behind a char? How?
So, is char considered a number or a character or both?
Both. It is an integer, but that integer value represents a character, as described by the character encoding of your system. The character encoding of the system that your computer science teacher uses happens to be ASCII.
Can we print out the number behind a char? How?
C++ (as the question used to be tagged):
The behaviour of the character output stream (such as std::cout) is to print the represented character when you insert an integer of type char. But the behaviour for all other integer types is to print the integer value. So, you can print the integer value of a char by converting it to another integer type:
std::cout << (unsigned)'c';
C:
There are no templated output streams, so you don't need to do an explicit conversion to another integer type (except for the signedness). What you need is the correct format specifier for printf:
printf("%hhu", (unsigned char)'c');
hh is for an integer of size char; u is for unsigned, as you are probably interested in the unsigned representation.
A char can hold a number; it is the smallest integer type available on your machine and must have at least 8 bits. It is synonymous with a byte.
Its typical use is to store the codes of characters. Computers can only deal with numbers, so numbers are used to represent characters. Of course, you must agree on which number means which character.
C doesn't require a specific character encoding, but most systems nowadays use a superset of ASCII (this is a very old encoding using only 7 bits) like e.g. UTF-8.
So, if you have a char that holds a character and you add or subtract some value, the result will be another number that happens to be the code for a different character.
In ASCII, the characters 0-9, a-z and A-Z have consecutive code points, so adding e.g. 2 to 'A' gives 'C'.
Can we print out the number behind a char?
Of course. It just depends on whether you interpret the value in the char as just a number or as the code of a character. E.g. with printf:
printf("%c\n", 'A'); // prints the character
printf("%hhu\n", (unsigned char)'A'); // prints the number of the code
The cast to (unsigned char) is only needed because char is allowed to be either signed or unsigned, we want to treat it as unsigned here.
A char takes up a single byte. On systems with an 8-bit byte this gives it a range (assuming char is signed) of -128 to 127. You can print this value as follows:
char a = 65;
printf("a=%d\n", a);
Output:
65
The %d format specifier prints its argument as a decimal integer. If on the other hand you used the %c format specifier, this prints the character associated with the value. On systems that use ASCII, that means it prints the ASCII character associated with that number:
char a = 65;
printf("a=%c\n", a);
Output:
A
Here, the character A is printed because 65 is the ASCII code for A.
You can perform arithmetic on these numbers and print the character for the resulting code:
char a = 65;
printf("a=%c\n", a);
a = a + 1;
printf("a=%c\n", a);
Output:
A
B
In this example we first print A which is the ASCII character with code 65. We then add 1 giving us 66. Then we print the ASCII character for 66 which is B.
Every variable is stored in binary (i.e., as a number); chars are just numbers of a specific size.
They represent a character when encoded using some character encoding; a table for the ASCII standard can be found at www.asciitable.com.
As in @Igor's comment, if you run the following code, you see the ASCII character and the decimal and hexadecimal representations of your char.
char c = 'A';
printf("%c %d 0x%x", c, c, c);
Output:
A 65 0x41
As an exercise to understand it better, you could make a program to generate the ASCII Table yourself.
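One possible take on that exercise (a sketch of my own, not from the answer):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    /* Print the code in decimal and hex, plus the character itself
       (a dot stands in for non-printable control codes). */
    for (int i = 0; i < 128; i++)
        printf("%3d  0x%02x  %c\n", i, (unsigned)i, isprint(i) ? i : '.');
    return 0;
}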
My computer science teacher taught us that which data type to declare depends on the size of the value a variable needs to hold.
This is correct. Different types can represent different ranges of values. For reference, here are the various integral types and the minimum ranges they must be able to represent:
Type Minimum Range
---- -------------
signed char -127...127
unsigned char 0...255
char same as signed or unsigned char, depending on implementation
short -32767...32767
unsigned short 0...65535
int -32767...32767
unsigned int 0...65535
long -2147483647...2147483647
unsigned long 0...4294967295
long long -9223372036854775807...9223372036854775807
unsigned long long 0...18446744073709551615
An implementation may represent a larger range in a given type; for example, on many implementations the range of an int is the same as the range of a long.
C doesn't mandate a fixed size (bit width) for the basic integral types (although unsigned types are the same size as their signed equivalent); at the time C was first developed, byte and word sizes could vary between architectures, so it was easier to specify a minimum range of values that the type had to represent and leave it to the implementor to figure out how to map that onto the hardware.
C99 introduced the stdint.h header, which defines fixed-width types like int8_t (8-bit), int32_t (32-bit), etc., so you can define objects with specific sizes if necessary.
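For instance (a minimal sketch, assuming an implementation that provides the optional exact-width types):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Exact-width types are optional, but where they exist they have
       exactly the stated number of bits. */
    int8_t  tiny = 100;
    int32_t wide = 100000;

    printf("sizeof(int8_t)  = %zu\n", sizeof tiny);
    printf("sizeof(int32_t) = %zu\n", sizeof wide);
    return 0;
}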
So, is char considered a number (since we can do math with it), a character, or both?
char is an integral data type that can represent values in at least the range [0...127]¹, which is the range of encodings for the basic execution character set (upper- and lowercase Latin alphabet, decimal digits 0 through 9, and common punctuation characters). It can be used for storing and doing regular arithmetic on small integer values, but that's not the typical use case.
You can print char objects out as characters or as numeric values:
#include <limits.h> // for CHAR_MAX
#include <ctype.h>  // for isprint
...
printf( "%5s%5s\n", "dec", "char" );
printf( "%5s%5s\n", "---", "----" );
for ( char i = 0; i < CHAR_MAX; i++ )
{
    printf( "%5hhd%5c\n", i, isprint( i ) ? i : '.' );
}
That code will print out the integral value and the associated character, like so (this is ASCII, which is what my system uses):
...
65 A
66 B
67 C
68 D
69 E
70 F
71 G
72 H
73 I
...
Control characters like SOH and EOT don't have an associated printing character, so for those values the code above just prints a '.'.
By definition, a char object takes up a single storage unit (byte); the number of bits in a single storage unit must be at least 8, but could be more.
¹ Plain char may be either signed or unsigned depending on the implementation so it can represent additional values outside that range, but it must be able to represent *at least* those values.
I know that sizeof(char) will always be 1, and that this is in units of bytes, and that a byte can be any number of bits (I believe any number of bits greater than or equal to 8, but not positive on that).
I also commonly see references that mention how C data type sizes can be specified in terms of the relationship between their sizes, such as "sizeof(int) <= sizeof(long)".
My question is basically: What would "sizeof(int)" evaluate to on a system where a byte is 8 bits and an int is 39 bits (or some other value which is not evenly divisible by CHAR_BIT).
My guess is that sizeof() returns the minimum number of bytes required to store the type, so it would therefore round up to the next byte. So in my example with a 39 bit int, the result of sizeof(int) would be 5.
Is this correct?
Also, is there an easy way to determine the number of bits a particular type can hold that is 100% portable and does not require the inclusion of any headers? This is more for a learning experience than an actual application. I would just use stdint types in practice. I was thinking maybe something along the lines of declaring the variable and initializing it to ~0, then loop and left shift it until it's zero.
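As for the last part of the question, here is a minimal sketch of the shift-and-count idea (the counting itself needs no headers; <stdio.h> is only used to print the result). It counts the value bits of unsigned int; padding bits, if any, are not counted.

#include <stdio.h>

int main(void)
{
    /* Start with all value bits set, then shift right until the value
       reaches zero; the count is the number of value bits. Using an
       unsigned type keeps every shift well defined. */
    unsigned int v = ~0u;
    int bits = 0;

    while (v != 0) {
        v >>= 1;
        bits++;
    }
    printf("unsigned int has %d value bits\n", bits);
    return 0;
}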
What people often fail to understand is that, in C, there's a clear distinction between sizeof and "width".
"width" is more about binary representation, range, and overflow/wrap-around behavior. If you say an unsigned integer type is 16 bits wide, you mean it wraps around after 65535.
However, sizeof only cares about storage. Hence sizeof(T[n]) == sizeof(T)*n is maintained by allowing sizeof to include padding.
For this reason it makes little sense to try to find connections between the sizeof of a type and its arithmetic behavior: a type can have a certain range but take whatever storage space it wants.
To answer your question ("what if a 39-bit int on an 8-bit-char machine?") I'd like to use the TI C6400+ as an example, because it has a 40-bit long and an 8-bit char, which is very close.
The TI C6400+ is a byte-addressable machine, so it must define the 8-bit byte as char.
It also has a 40-bit integer type, because the ALU can operate on 40-bit integers, and they defined that type as long.
You would think sizeof(long) should be 5, right?
Well, it could be, but this CPU does not support unaligned loads very well, so for performance reasons this long type is by default aligned to 8-byte boundaries instead of 5-byte ones. Each long therefore carries 3 bytes of padding (at both the memory and the register level, because it occupies a pair of GPRs in the CPU, too), and so sizeof(long) naturally becomes 8.
Interestingly, the C6400+ C implementation also provides long long, and sizeof(long long) is also 8. But that is a truly 64-bit-wide type with the full 64-bit range, instead of 40 bits.
UPDATE
So back to the "39-bit" case.
Since 6.2.8.1 requires the alignment of all complete types to be an integer multiple of "bytes", a 39-bit integer must be padded to at least 40 bits if CHAR_BIT is 8, so the sizeof of such a type must be an integer greater than or equal to 5.
Chapter and verse:
6.2.6 Representations of types
6.2.6.1 General
The representations of all types are unspecified except as stated in this subclause.
Except for bit-fields, objects are composed of contiguous sequences of one or more bytes,
the number, order, and encoding of which are either explicitly specified or
implementation-defined.
Values stored in unsigned bit-fields and objects of type unsigned char shall be
represented using a pure binary notation.49)
Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT
bits, where n is the size of an object of that type, in bytes. The value may be copied into
an object of type unsigned char [n] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value. Values stored in bit-fields consist of m bits,
where m is the size specified for the bit-field. The object representation is the set of m
bits the bit-field comprises in the addressable storage unit holding it. Two values (other
than NaNs) with the same object representation compare equal, but values that compare
equal may have different object representations.
49) A positional representation for integers that uses the binary digits 0 and 1, in which the values
represented by successive bits are additive, begin with 1, and are multiplied by successive integral
powers of 2, except perhaps the bit with the highest position. (Adapted from the American National
Dictionary for Information Processing Systems.) A byte contains CHAR_BIT bits, and the values of
type unsigned char range from 0 to 2^CHAR_BIT − 1.
My question is basically: What would "sizeof(int)" evaluate to on a system where a byte is 8 bits and an int is 39 bits (or some other value which is not evenly divisible by CHAR_BIT).
The implementation would have to map CHAR_BIT-sized storage units onto odd-sized words such that the above requirements hold, probably with a significant performance penalty. A 39-bit word can hold up to four 8- or 9-bit storage units, so sizeof (int) would probably evaluate to 4, with int using only 32 (or 36) of the word's 39 bits; an int that really used all 39 bits would have to occupy 5 bytes.
Alternatively, the implementor can simply decide it's not worth the hassle and set CHAR_BIT to 39; everything, including individual characters, takes up one or more full words, leaving up to 31 bits unused depending on the type.
There have been real-world examples of this sort of thing in the past. One of the old 36-bit DEC machines (the PDP-10) used 36-bit words and 7-bit ASCII for character values; 5 characters could be stored in a single word, with one bit unused. All other types took up a full word. If the implementation set CHAR_BIT to 9, you could cleanly map CHAR_BIT-sized storage units onto 36-bit words, but again, that may incur a significant performance penalty if the hardware expects 5 characters per word.
Is the size of every C data type guaranteed to be a multiple of bytes?
Yes, even if the number of bits required to represent the data type is smaller than CHAR_BIT, for example struct foo { unsigned int bar : 1; }; or int16_t on a system with CHAR_BIT = 32.
sizeof(char) = 1 (guaranteed)
CHAR_BIT >= 8 ( = 8 on posix compliant system)
sizeof(anything else) = k * sizeof(char) = k where k is a whole number
From 6.5.3.4 The sizeof and _Alignof operators
The sizeof operator yields the size (in bytes) of its operand, which
may be an expression or the parenthesized name of a type. The size is
determined from the type of the operand. The result is an integer.
[...]
To add to your confusion, on a machine with CHAR_BIT = 32, sizeof(int16_t) will be 1 (strictly speaking, the exact-width int16_t cannot exist there, since it must be exactly 16 bits; int_least16_t would exist instead), and so will sizeof(int32_t), and you may actually end up allocating more bytes of memory than required (I am not sure though) if you use malloc(sizeof(int16_t) * n) instead of malloc(sizeof(int16_t[n])).
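A short sketch illustrating both points (my own, assuming an implementation that provides int16_t):

#include <stdint.h>
#include <stdio.h>

struct foo { unsigned int bar : 1; };   /* needs only one bit of value */

int main(void)
{
    /* Even a struct holding a single 1-bit bit-field occupies a whole
       number of bytes. */
    printf("sizeof(struct foo)   = %zu\n", sizeof(struct foo));

    /* sizeof of an array type is exactly the element size times the
       element count, so the two expressions below always match. */
    printf("sizeof(int16_t[10])  = %zu\n", sizeof(int16_t[10]));
    printf("10 * sizeof(int16_t) = %zu\n", 10 * sizeof(int16_t));
    return 0;
}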
Say I have a unicode character in wchar_t x;
Of course, the obvious way to convert x to ASCII is use the wctob function
But I'm wondering, since the first 255 characters of Unicode correspond with ASCII, will a cast to char consistently work across platforms?
char c = (char) x ; // cast to char, this works on Windows
The question is, will a cast to char guarantee to keep the LOW ORDER bits, or will it possibly keep the HIGH ORDER bits? (I'm concerned about a little-endian/big endian situation here, although I realize if it worked on my little endian system, it definitely should work on big endian systems).
For the sake of brevity, I use some terms loosely. To avoid much confusion, one is strongly advised to carefully study definitions of at least the following terms: ASCII, Unicode, UCS, UCS-2, UCS-4, UTF, UTF-8, UTF-16, UTF-32, character, character set, coded character set, repertoire, code unit.
The code of the character 'Q' is 81 in both ASCII and Unicode.
81 is just an integer, like any other integer. A char variable may store the number 81. A wchar_t variable may store the same number 81. We interpret 81 as 'Q' in both cases.
It does not make much sense to ask how the number 81 is preserved when cast from e.g. long to short. If it fits, then you are all set. There's no endianness or higher bits or lower bits or any of this stuff involved.
When you convert files that store characters, or streams of bytes over a network, endianness and bits and stuff begin to matter, just like with files that store (binary representations of) any old numbers.
If x does not fit in a char, then the behavior is officially "implementation-defined" and is allowed to raise a signal. If x does fit in a char, then the value is preserved (regardless of endianness).
6.3.1.3 Signed and unsigned integers
(1) When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
(2) [does not apply here]
(3) Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
For maximum portability, perform a range check first and cast only if the value is in the range SCHAR_MIN to SCHAR_MAX.
(Others have noted and I wish to repeat that ASCII extends only to character 127.)
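A hedged sketch of that advice, using a hypothetical helper (the name and fallback parameter are mine, not from the answer): it converts only when the value lies in the ASCII range 0..127, which is within SCHAR_MIN..SCHAR_MAX, so the cast is guaranteed to preserve the value.

#include <wchar.h>

/* Convert a wide character to a char only when it is plain ASCII;
   otherwise return the caller-supplied fallback. */
char wide_to_ascii(wchar_t x, char fallback)
{
    /* If wchar_t is unsigned on your platform, the first test is
       trivially true; it is kept for the signed case. */
    if (x >= 0 && x <= 127)
        return (char)x;    /* value preserved; endianness plays no role */
    return fallback;       /* not representable as ASCII */
}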
I was under the impression that the endianness of the system does not matter in this situation.
I found a really good explanation here.
I think this should help rest your fears about casting.