Subtlety in conversion of characters to integers - c

Can someone explain clearly what these lines from K&R actually mean:
"When a char is converted to an int, can it ever produce a negative
integer? The answer varies from machine to machine. The definition of
C guarantees that any character in the machine's standard printing
character set will never be negative, but arbitrary bit patterns
stored in character variables may appear to be negative on some
machines, yet positive on others".

There are two more-or-less relevant parts to the standard — ISO/IEC 9899:2011.
6.2.5 Types
¶3 An object declared as type char is large enough to store any member of the basic
execution character set. If a member of the basic execution character set is stored in a
char object, its value is guaranteed to be nonnegative. If any other character is stored in
a char object, the resulting value is implementation-defined but shall be within the range
of values that can be represented in that type.
¶15 The three types char, signed char, and unsigned char are collectively called
the character types. The implementation shall define char to have the same range,
representation, and behavior as either signed char or unsigned char.45)
45) CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be
used to distinguish the two options. Irrespective of the choice made, char is a separate type from the
other two and is not compatible with either.
That defines what your quote from K&R states. The other relevant part defines what the basic execution character set is.
5.2.1 Character sets
¶1 Two sets of characters and their associated collating sequences shall be defined: the set in
which source files are written (the source character set), and the set interpreted in the
execution environment (the execution character set). Each set is further divided into a
basic character set, whose contents are given by this subclause, and a set of zero or more
locale-specific members (which are not members of the basic character set) called
extended characters. The combined set is also called the extended character set. The
values of the members of the execution character set are implementation-defined.
¶2 In a character constant or string literal, members of the execution character set shall be
represented by corresponding members of the source character set or by escape
sequences consisting of the backslash \ followed by one or more characters. A byte with
all bits set to 0, called the null character, shall exist in the basic execution character set; it
is used to terminate a character string.
¶3 Both the basic source and basic execution character sets shall have the following
members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and
form feed. The representation of each member of the source and execution basic
character sets shall fit in a byte. In both the source and execution basic character sets, the
value of each character after 0 in the above list of decimal digits shall be one greater than
the value of the previous. In source files, there shall be some way of indicating the end of
each line of text; this International Standard treats such an end-of-line indicator as if it
were a single new-line character. In the basic execution character set, there shall be
control characters representing alert, backspace, carriage return, and new line. If any
other characters are encountered in a source file (except in an identifier, a character
constant, a string literal, a header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
¶4 A letter is an uppercase letter or a lowercase letter as defined above; in this International
Standard the term does not include other characters that are letters in other alphabets.
¶5 The universal character name construct provides a way to name other characters.
One consequence of these rules is that if a machine uses 8-bit characters and EBCDIC encoding, then plain char must be an unsigned type, since the digits have codes 240..249 in EBCDIC.
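Footnote 45 also gives a practical way to find out which choice a given implementation made. Here is a minimal sketch using <limits.h> (the messages printed are mine, not from the standard):
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* Footnote 45: CHAR_MIN is 0 if plain char is unsigned,
       and SCHAR_MIN if plain char is signed. */
#if CHAR_MIN == 0
    puts("plain char is unsigned on this implementation");
#else
    puts("plain char is signed on this implementation");
#endif
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}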

You need to understand several things first.
If I take an 8-bit value and extend it to a 16-bit value, normally you would imagine just adding a bunch of 0's on the left. For example, if I have the 8-bit value 23, in binary that's 00010111, so as a 16-bit number it's 0000000000010111, which is also 23.
It turns out that negative numbers always have a 1 in the high-order bit. (There might be weird machines for which this is not true, but it's true for any machine you're likely to use.) For example, the 8-bit value -40 is represented in binary as 11011000.
So when you convert a signed 8-bit value to a 16-bit value, if the high-order bit is 1 (that is, if the number is negative), you do not add a bunch of 0's on the left, you add a bunch of 1's instead. For example, going back to -40, we would convert 11011000 to 1111111111011000, which is the 16-bit representation of -40.
There are also unsigned numbers, that are never negative. For example, the 8-bit unsigned number 216 is represented as 11011000. (You will notice that this is the same bit pattern as the signed number -40 had.) When you convert an unsigned 8-bit number to 16 bits, you add a bunch of 0's no matter what. For example, you would convert 11011000 to 0000000011011000, which is the 16-bit representation of 216.
So, putting this all together, if you're converting an 8-bit number to 16 (or more) bits, you have to look at two things. First, is the number signed or unsigned? If it's unsigned, just add a bunch of 0's on the left. But if it's signed, you have to look at the high-order bit of the 8-bit number. If it's 0 (if the number is positive), add a bunch of 0's on the left. But if it's 1 (if the number is negative), add a bunch of 1's on the left instead. (This whole process is known as sign extension.)
The ordinary ASCII characters (like 'A' and '1' and '$') all have values less than 128, which means that their high-order bit is always 0. But "special" characters from the Latin-1 character set, or the individual bytes of UTF-8 sequences, have values of 128 or more. For this reason they're sometimes also called "high bit" or "eighth bit" characters. For example, the Latin-1 character Ø (O with a slash through it) has the value 216.
Finally, although type char in C is typically an 8-bit type, the C Standard does not specify whether it is signed or unsigned.
Putting this all together, what Kernighan and Ritchie are saying is that when we convert a char to a 16- or 32-bit integer, we don't quite know which of those rules to apply. If I'm on a machine where type char is unsigned, and I take the character Ø and convert it to an int, I'll probably get the value 216. But if I'm on a machine where type char is signed, I'll probably get the number -40.
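A minimal sketch of the difference described above. It forces the two interpretations with unsigned char and signed char (the names and values are just for illustration); plain char behaves like one or the other, depending on the implementation:
#include <stdio.h>

int main(void)
{
    unsigned char u = 0xD8;        /* bit pattern 11011000 */
    signed char   s = -40;         /* same bit pattern on two's-complement machines */
    char          c = (char)0xD8;  /* signed or unsigned: implementation-defined */

    /* Converting to int zero-extends the unsigned byte ... */
    int from_unsigned = u;         /* 216 */
    /* ... but sign-extends the signed one. */
    int from_signed = s;           /* -40 */

    printf("unsigned char 0xD8 -> int %d\n", from_unsigned);
    printf("signed char   0xD8 -> int %d\n", from_signed);
    printf("plain char    0xD8 -> int %d\n", (int)c);  /* 216 or -40 */
    return 0;
}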

Related

How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using char's in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if( str[i] == '¶' )? How do I achieve that using char's?
(edit: made a typo on SO for the comparison operator, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be locating individual multibyte characters in a C string, the most natural thing to do is to implement a C analog of your C++ approach, using the standard strstr() function. Example:
char str[] = "Some string ¶ some text ¶ to see";
char char_to_compare[] = "¶";
int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator

for (char *location = strstr(str, char_to_compare);
     location;
     location = strstr(location + char_size, char_to_compare)) {
    puts("Found!");
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling of characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
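If you would rather work with the multibyte encoding of the current locale than with raw bytes, one possibility is to decode the string with the standard mbrtowc() function and compare wide characters. This is only a sketch, assuming the compile-time and runtime encodings agree and that the locale can represent '¶' at all; the variable names are mine:
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");              /* use the environment's locale */

    const char *str = "Some string ¶ some text ¶ to see";
    const wchar_t wanted = L'¶';        /* assumes the compiler can map this character */

    mbstate_t state;
    memset(&state, 0, sizeof state);

    const char *p = str;
    const char *end = str + strlen(str);
    while (p < end) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, (size_t)(end - p), &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                      /* invalid or incomplete sequence */
        if (n == 0)
            n = 1;                      /* decoded an embedded null byte */
        if (wc == wanted)
            puts("Found!");
        p += n;
    }
    return 0;
}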
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work: if the value of '¶' is bigger than the char range, it won't fit, and a and b will end up storing some other (implementation-defined) value than that of '¶'. Regardless of which value they hold, though, they may well both hold the same value, so the comparison can still succeed.
Remember, the char type is simply a one-byte-wide (usually 8-bit) integer type, so in order to work with multibyte characters and avoid this truncation you just have to use a wider integer type (short, int, long...).
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may try to use a different character encoding than what you need. For example, trying to print the 'á' character will actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment actually uses Unicode, parsing a value of 225 for 'á' instead of the value of 160 that I want.

Are multi-character character constants valid in C? Maybe in MS VC?

While reviewing some WINAPI code intended to compile in MS Visual C++, I found the following (simplified):
char buf[4];
// buf gets filled ...
switch ((buf[0] << 8) + buf[1]) {
case 'CT':
    /* ... */
case 'SY':
    /* ... */
default:
    break;
}
Assuming 16-bit chars, I can understand the shift of buf[0] and the addition of buf[1]. What I don't gather is how the comparisons in the case clauses are intended to work.
I don't have access to Visual C++ and, of course, those yield multi-character character constant [-Wmultichar] warnings on gcc/MingW.
This is a non-portable way of storing more than one char in one int. The comparison then happens on the int values, as usual.
Note: think of the final int value as the concatenation of the ASCII values of the individual chars.
Following the wiki article, (emphasis mine)
[...] Multi-character constants (e.g. 'xy') are valid, although rarely useful — they let one store several characters in an integer (e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one). Since the order in which the characters are packed into an int is not specified, portable use of multi-character constants is difficult.
Related, C11, chapter §6.4.4.4/p10
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined. [....]
Yes, they are valid; their type is int and their value is implementation-defined.
From C11 draft, 6.4.4.4p10:
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
(emphasis added)
GCC is being cautious, and warns to let you know in case you have used it unintentionally.
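If portability is a concern, a common workaround is to build the 16-bit tag explicitly from single-character constants, whose values are well defined, instead of relying on how 'CT' happens to be packed. A hedged sketch (the TAG macro and handle() function are mine, not from the original code):
#include <stdio.h>

/* Pack two bytes in a defined order instead of relying on 'CT'. */
#define TAG(a, b) (((a) << 8) + (b))

static void handle(const unsigned char buf[4])
{
    switch (TAG(buf[0], buf[1])) {
    case TAG('C', 'T'):
        puts("CT record");
        break;
    case TAG('S', 'Y'):
        puts("SY record");
        break;
    default:
        puts("unknown record");
        break;
    }
}

int main(void)
{
    const unsigned char sample[4] = { 'C', 'T', 0, 0 };
    handle(sample);
    return 0;
}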

What Is the "character" in string's definition?

C11 defines a "string" as:
A string is a contiguous sequence of characters terminated by and
including the first null character. §7.1.1 1
It earlier defines a "character" as:
3.7 character
〈abstract〉 member of a set of elements used for the organization, control, or representation of data
3.7.1
character
single-byte character
〈C〉 bit representation that fits in a byte
3.7.2
multibyte character
sequence of one or more bytes representing a member of the extended character set ...
3.7.3
wide character
value representable by an object of type wchar_t, capable of representing any character
in the current locale
Question: What definition of "character" is being used in the definition of "string":
"character" in 3.7,
"character" in 3.7.1,
or something else?
A string is a contiguous sequence of data of type char.
The word "character" is used in two senses, abstract and practical.
From the abstract point of view, we first have to define the concept of a "set of characters", so that we can later, as 3.7 does, speak of "a member of a set of elements used for...".
This definition of "character" fits another standard: ISO/IEC 2382-1.
See ISO/IEC 2382-1 (character).
There, you can analyze a big list of terms related to "Information Representation".
MY SHORT ANSWER: "character" in the definition of "string" corresponds to C11 3.7.1.
The explanation is as follows:
CHARACTER IN THE ABSTRACT
A symbol is an intellectual convention of human beings.
So, the abstract symbol "A" is a convention which we use to recognize different "graphs" (the same letter written or printed in different shapes or typefaces) as being all "the same" thing (a piece of information, say).
Information is then represented by ordered, finite sequences of characters drawn from a set of (abstract) characters.
Next, you need to codify these abstract symbols to make their representation in information systems (computers) possible.
This is done, in general, by defining a one-to-one correspondence between integer numbers (called code points) and the corresponding characters of a given set.
An encoding scheme is a way in which a set of characters is associated with certain numbers (code points).
This encoding can change from one system to another ("A" does not have the same encoding in EBCDIC as in ASCII).
Finally, we associate a "graph" with each character+code point, that is, a written representation, which can eventually be printed or shown on screen.
The shape of the graph can change according to the font design, so it is not a good starting point for defining the term "character".
CHARACTER IN C
In 3.7.1 it seems that C11 refers to another meaning of "character", intended as a short form of "single-byte character". It is talking about code points (that is, integer numbers associated with the "abstract characters" of a given set) that fit in exactly 1 byte.
In this case, we need the definition of Byte.
In C, a byte is a unit of information storage consisting of an ordered sequence of n bits, where n is an integer greater than or equal to 8 (in general exactly 8, of course); you can find its value by checking the constant CHAR_BIT in <limits.h>.
There are data types whose size is exactly 1 byte: char, unsigned char, signed char.
The range of values of unsigned char is exactly 0...2^n - 1, where n is CHAR_BIT.
The range of values of char coincides with that of either signed char or unsigned char, but C11 doesn't say which of the two char corresponds to.
Moreover, in either case, the type char must be considered a type distinct from signed char and unsigned char.
A string is, now, a sequence of objects of type char.
WHY CHAR?
The standard defines the representation of characters in terms of char:
(6.2.5.3)
An object declared as type char is large enough to store any member of the basic
execution character set. If a member of the basic execution character set is stored in a
char object, its value is guaranteed to be nonnegative. If any other character is stored in
a char object, the resulting value is implementation-defined but shall be within the range
of values that can be represented in that type.
STRING
Now, a string in C is a contiguous sequence of (single-byte) characters terminated by the null character, which in C is always 0.
This definition can again be understood in an abstract way; however, 7.1.1 ¶1 talks about the address of the string, so a "string" must be understood as an object in memory.
A "string" object is, then, a contiguous sequence of "bytes", each one holding the code point of a character.
This follows from the fact that a "character" here is one that fits in exactly 1 byte.
It is represented in C by an array of type char, whose last element is 0.
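For instance (a trivial example of my own), the following array is a string in this sense: four bytes, the last of which is the null character:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char s[] = "abc";   /* 4 bytes: 'a', 'b', 'c', and the terminating 0 */
    printf("sizeof s = %zu, strlen(s) = %zu\n", sizeof s, strlen(s));
    return 0;
}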
MULTIBYTE CHARACTER
The definition of "multibyte" is complicated.
It refers to certain special encoding schemes that use a variable number of bytes to represent an (abstract) character.
You need information about the execution character sets in order to properly handle multibyte character sets.
However, even if you have a multibyte character, it is still represented in memory as a sequence of bytes.
That means that you will represent a multibyte string again as an array of char.
The way in which the execution system interprets these bytes is a different issue.
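A small sketch of that point, assuming a UTF-8 locale (which the standard does not guarantee): strlen() counts the bytes of the array, while mbstowcs() can count the multibyte characters those bytes encode.
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");       /* assume a UTF-8 locale for this demo */

    const char *s = "héllo";     /* 'é' may occupy more than one byte */

    size_t bytes = strlen(s);                /* counts bytes */
    size_t chars = mbstowcs(NULL, s, 0);     /* counts multibyte characters */

    printf("bytes: %zu\n", bytes);
    if (chars != (size_t)-1)
        printf("characters: %zu\n", chars);
    return 0;
}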
WIDE CHARACTER
A wide character is an element of another set of (abstract) characters, different from those represented by the type char.
It is intended that the set of "wide characters" be larger than the set of "single-byte characters",
but this is not necessarily the case.
The relevant facts of the "wide characters" are the following:
The set of "wide characters", whichever they are, can be represented by the range of values of the type wchar_t.
These characters can be different from those represented in the type char.
A "wide character" can use more than 1 byte storage.
A "wide string" is a null-terminated contiguous sequence of "wide characters".
Thus, a "wide string" is a different object than a "string".
CONCLUSION
A string has nothing to do with "wide" characters; it consists only of "single-byte" characters.
A string is a null-terminated contiguous sequence of "bytes", which in turn means objects of one of the character types (char, signed char, unsigned char), corresponding to code points of an abstract character set that fit in 1 byte.

How can a character be represented by a bit pattern containing three octal digits?

From Chapter 2 (Section 2.3, Constants) of the K&R book on the C programming language:
Certain characters can be represented in character and string
constants by escape sequences like \n (newline); these sequences look
like two characters, but represent only one. In addition, an arbitrary
byte-sized bit pattern can be specified by
'\ooo'
where ooo is one to three octal digits (0...7) or by
'\xhh'
where hh is one or more hexadecimal digits (0...9, a...f, A...F). So
we might write
#define VTAB '\013' /* ASCII vertical tab */
#define BELL '\007' /* ASCII bell character */
or, in hexadecimal,
#define VTAB '\xb' /* ASCII vertical tab */
#define BELL '\x7' /* ASCII bell character */
The part that confuses me is the following wording (emphasis mine): where ooo is one to three octal digits (0...7). If there are three octal digits, the number of bits required will be 9 (3 for each digit), which exceeds the byte length required for characters. Surely I am missing something here. What is it that I am missing?
\ooo (3 octal digits) does indeed allow the specification of 9-bit values from 0 to 111111111 (binary), i.e. 511. Whether such a value is allowed depends on the char size.
Assignments such as those below generate a warning in many environments because a char is 8 bits in those environments. Typically the highest octal sequence allowed is \377. But a char need not be 8 bits, so OP's "9 ... exceeds the byte length required for characters" is incorrect.
char *s = "\777"; //warning "Octal sequence out of range"
char c = '\777'; //warning
int i = '\777'; //warning
The 3-octal-digit constant '\141' is the same as 'a' in a typical environment where ASCII is used. But in an alternate character set, 'a' could be different. Thus, if one wanted a portable bit-pattern assignment of 01100001, one could use '\141' instead of 'a'. One could accomplish the same by writing '\x61'. In some contexts, an octal pattern may be preferred.
C11 6.4.4.4 ¶9: if no prefix is used, "The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type", which for an unprefixed character constant is unsigned char.
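A brief illustration of that point, assuming an ASCII execution character set (typical but not guaranteed):
#include <stdio.h>

int main(void)
{
    /* Under ASCII, '\141' and '\x61' denote the same code point as 'a' (97). */
    printf("'\\141' == 'a' ? %d\n", '\141' == 'a');
    printf("'\\x61' == 'a' ? %d\n", '\x61' == 'a');
    printf("'\\013' has the value %d\n", '\013');   /* vertical tab, 11 */
    return 0;
}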
The range of code numbers of characters is not defined in K&R, as far as I can remember. In the early days, it was usually the ASCII range 0...127. Nowadays it is often an 8-bit range, 0...255, but it could be wider, too. In any case, the implementation-defined limits on the char data type imply restrictions on the escape notations, too.
For example, if the range is 0...127, then \177 is the largest allowed octal escape.
The first octal digit is only allowed to go up to 3 (two bits), not 7 (three bits), if we're talking about eight-bit bytes. If we're talking about ASCII (7-bit values), the first digit can only be zero or one.
If K&R says otherwise, their description is either incomplete or incorrect.

C standard: L prefix and octal/hexadecimal escape sequences

I didn't find an explanation in the C standard of how the aforementioned escape sequences in wide strings are processed.
For example:
wchar_t *txt1 = L"\x03A9";
wchar_t *txt2 = L"\xA9\x03";
Are these somehow processed (like prefixing each byte with \x00 byte) or stored in memory exactly the same way as they are declared here?
Also, how does L prefix operate according to the standard?
EDIT:
Let's consider txt2. How would it be stored in memory: \xA9\x00\x03\x00, or \xA9\x03 as it was written? The same goes for \x03A9. Would this be considered one wide character, or 2 separate bytes which would be turned into two wide characters?
EDIT2:
Standard says:
The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape
sequence are taken to be part of the construction of a single character for an integer
character constant or of a single wide character for a wide character constant. The
numerical value of the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Now, we have a char literal:
wchar_t txt = L'\xFE\xFF';
It consists of 2 hex escape sequences, so it seems it should be treated as two wide characters. But if these are two wide characters they can't fit into one wchar_t (yet it compiles in MSVC), and in my case this sequence is treated as the following:
wchar_t foo = L'\xFFFE';
which is the only hex escape sequence and therefore the only wide char.
EDIT3:
Conclusions: each octal/hex escape sequence is treated as a separate value (wchar_t *txt2 = L"\xA9\x03"; consists of 3 elements). wchar_t txt = L'\xFE\xFF'; is not portable (it relies on an implementation-defined feature); one should use wchar_t txt = L'\xFFFE'; instead.
There's no processing. L"\x03A9" is simply an array wchar_t const[2] consisting of the two elements 0x3A9 and 0, and similarly L"\xA9\x03" is an array wchar_t const[3].
Note in particular C11 6.4.4.4/7:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can
constitute the escape sequence.
And also C++11 2.14.3/4:
There is no limit to the number of digits in a hexadecimal sequence.
Note also that when you are using a hexadecimal sequence, it is your responsibility to ensure that your data type can hold the value. C11-6.4.4.4/9 actually spells this out as a requirement, whereas in C++ exceeding the type's range is merely "implementation-defined". (And a good compiler should warn you if you exceed the type's range.)
Your code doesn't make sense, though, because the left-hand sides are neither arrays nor pointers. It should be like this:
wchar_t const * p = L"\x03A9"; // pointer to the first element of a string
wchar_t arr1[] = L"\x03A9"; // an actual array
wchar_t arr2[2] = L"\x03A9"; // ditto, but explicitly typed
std::wstring s = L"\x03A9"; // C++ only
On a tangent: This question of mine elaborates a bit on string literals and escape sequences.
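As a quick check of the answer's claim, you can print the element counts and values. This is a sketch of my own; the values here fit comfortably in wchar_t, so the result does not depend on the implementation-defined handling of out-of-range escapes:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t txt1[] = L"\x03A9";    /* 2 elements: 0x3A9 and the terminating 0 */
    wchar_t txt2[] = L"\xA9\x03";  /* 3 elements: 0xA9, 0x03, and the terminating 0 */

    printf("txt1 has %zu elements, txt1[0] = 0x%lX\n",
           sizeof txt1 / sizeof txt1[0], (unsigned long)txt1[0]);
    printf("txt2 has %zu elements, txt2[0] = 0x%lX, txt2[1] = 0x%lX\n",
           sizeof txt2 / sizeof txt2[0],
           (unsigned long)txt2[0], (unsigned long)txt2[1]);
    return 0;
}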
