In C, strings are arrays of char (char *) and characters are usually stored in char. I noticed that some functions from libc take an int argument instead of a char.
For instance, let's take the functions toupper() and tolower() that both use int. The man page says:
If c is not an unsigned char value, or EOF, the behavior of these
functions is undefined.
My guess is that with an int, toupper and tolower are able to deal with unsigned char and EOF. But in fact EOF is in practice (is there any rule about its value?) a value that can be stored in a char, and since those functions won't transform EOF into something else, I'm wondering why toupper does not simply take a char as argument.
In any case why do we need to accept something that is not a character (such as EOF)? Could someone provide me a relevant use case?
This is similar to fputc or putchar, which also take an int that is converted to an unsigned char anyway.
I am looking for the precise motivations for that choice. I want to be convinced; I don't want to have to answer that I don't know if someone asks me one day.
C11 7.4
The header <ctype.h> declares several functions useful for classifying and mapping
characters. In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the
macro EOF. If the argument has any other value, the behavior is
undefined.
C11 7.21.1
EOF
which expands to an integer constant expression, with type int and a
negative value, ...
The C standard explicitly states that EOF is always an int with negative value. And furthermore, the signedness of the default char type is implementation-defined, so it may be unsigned and not able to store a negative value:
C11 6.2.5
If a member of the basic execution character set is stored in a char
object, its value is guaranteed to be nonnegative. If any other
character is stored in a char object, the resulting value is
implementation-defined but shall be within the range of values that
can be represented in that type.
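As a rough illustration of the use case asked about, here is a minimal sketch of a read loop that feeds the result of getc() straight into toupper(). The function name copy_upper is hypothetical and not from the question or the standard. The loop variable must be an int so that it can hold every unsigned char value as well as EOF, which is the range the <ctype.h> functions are specified for.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copies `in` to `out`, upper-casing each byte.
 * Returns the number of bytes copied. */
long copy_upper(FILE *in, FILE *out)
{
    long count = 0;
    int ch; /* int, not char: must hold every unsigned char value plus EOF */

    while ((ch = getc(in)) != EOF) {
        /* ch is in the range 0..UCHAR_MAX here, so toupper(ch) is well defined */
        putc(toupper(ch), out);
        count++;
    }
    return count;
}
```

Calling copy_upper(stdin, stdout) would echo standard input in upper case until end of file.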
Back in the day, a common coding method was:
/* example */
int GetDecimal() {
int sum = 0;
int ch;
while (isdigit(ch = getchar())) { /* isdigit(EOF) returns 0 */
sum *= 10;
sum += ch - '0';
}
ungetc(ch, stdin); /* If ch is EOF, the operation fails and the input stream is unchanged. */
return sum;
}
ch, even when it held the value EOF, could then be passed to various functions like isalpha() and tolower().
This style caused problems with putchar(EOF) which I suspect did the same as putchar(255).
The method is discouraged today for various reasons. Various models like the following are preferred.
int GetDecimal() {
int ch;
while ((ch = getchar()) != EOF && isdigit(ch)) {
...
}
...
}
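One possible way to fill in the elided parts of that model is sketched below. It is only an illustration: the name read_decimal and the FILE * parameter are assumptions made so the function can be exercised on any stream, and overflow of sum is deliberately not handled.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: reads leading decimal digits from fp and
 * returns their value (0 if there are none). */
int read_decimal(FILE *fp)
{
    int sum = 0;
    int ch;

    /* Test for EOF first, then classify: isdigit() only ever sees 0..UCHAR_MAX. */
    while ((ch = getc(fp)) != EOF && isdigit(ch)) {
        sum = sum * 10 + (ch - '0'); /* no overflow check in this sketch */
    }
    if (ch != EOF) {
        ungetc(ch, fp); /* give the first non-digit back to the stream */
    }
    return sum;
}
```

With the input "123abc", read_decimal() would return 123 and leave 'a' as the next character to be read.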
If c is not an unsigned char value, or EOF, the behavior of these functions is undefined.
But EOF is a negative int in C and some platforms (hi ARM!) have char the same as unsigned char.
Related
The book The C Programming Language by Kernighan and Ritchie, second edition states on page 43 in the chapter about Type Conversions:
Another example of char to int conversion is the function lower, which maps a single character to lower case for the ASCII character set. If the character is not an upper case letter, lower returns it unchanged.
/* lower: convert c to lower case; ASCII only */
int lower(int c)
{
if (c >= 'A' && c <= 'Z')
return c + 'a' - 'A';
else
return c;
}
It isn't mentioned explicitly in the text so I'd like to make sure I understand it correctly: The conversion happens when you call the lower function with a variable of type char, doesn't it? Especially, the expression
c >= 'A'
has nothing to do with a conversion from int to char since a character constant like 'A'
is handled as an int internally from the start, isn't it? Edit: Or is this different (e.g. a character constant being treated as a char) for ANSI C, which the book covers?
Character constants have type int, as you expected, so you are correct that there are no promotions to int in this function.
Any promotion that may occur would happen if a variable of type char is passed to this function, and this is most likely what the text is referring to.
The type of character constants is int in both the current C17 standard (section 6.4.4.4p10):
An integer character constant has type
int
And in the C89 / ANSI C standard (section 3.1.3.4 under Semantics):
An integer character constant has type int
The latter of which is what K&R Second Edition refers to.
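A quick way to see this for yourself is to compare sizes, as in the sketch below. The function names are made up for the illustration, and the check assumes a platform where sizeof(int) is greater than 1 (otherwise sizes cannot tell the two types apart). It must be compiled as C, since the first check gives a different answer in C++.

```c
/* Returns 1 if a character constant has the size of int (true in C, not in C++). */
int char_constant_has_int_size(void)
{
    return sizeof('A') == sizeof(int);
}

/* Returns 1 if a char operand is promoted to int inside an arithmetic expression. */
int char_operand_is_promoted(void)
{
    char c = 'A';
    return sizeof(c + 0) == sizeof(int);
}

/* Returns 1 if a char object itself still occupies a single byte. */
int char_object_is_one_byte(void)
{
    char c = 'A';
    return sizeof c == 1;
}
```

All three functions should return 1 on a conforming C compiler under the assumption stated above.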
K&R C is old. Really old. Many particulars of K&R C are no longer true in up-to-date standard C.
In standard, up-to-date C11, there is no conversion to/from char in the function you posted:
/* lower: convert c to lower case; ASCII only */
int lower(int c)
{
if (c >= 'A' && c <= 'Z')
return c + 'a' - 'A';
else
return c;
}
The function accepts int arguments as int c, and per 6.4.4.4 Character constants of the C standard, character literals are of type int.
Thus the entire lower function, as posted, under C11 deals entirely with int values.
The conversion, if any, is done when the function is called:
char upperA = 'A';
// this will implicitly promote the upperA char
// value to an int value
char lowerA = lower( upperA );
Note that this is one of the differences between C and C++. In C++, character literals are of type char, not int.
How exactly is this function an example of a char to int conversion?
/* lower: convert c to lower case; ASCII only */
int lower(int c) {
if (c >= 'A' && c <= 'Z')
return c + 'a' - 'A';
else
return c;
}
It is not an example of a char to int conversion - technically incorrect by the author.
The text goes on to discuss tolower(c) as an alternative to lower(), as it "works" correctly even if [A-Z] are not consecutively encoded, as in EBCDIC.
What is not discussed is that tolower() and the other functions (is...()) are only specified for int values in the unsigned char range and EOF (C11 §7.4 ¶1). Other values invoke undefined behavior (UB).
It is this requirement that makes these Standard C library functions conceptually char to int conversions as only values in the (about) char range are specified and the result is int.
Now look at code where char conversion does occur.
void my_strtolower1(char *s) {
while (*s) {
*s = lower(*s); // conversion `char` to `int` and `int` to `char`.
s++;
}
}
void my_strtolower2(char *s) {
while (*s) {
*s = tolower(*s); // conversion `char` to `int` and `int` to `char`.
s++;
}
}
void my_strtolower3(char *s) {
while (*s) {
// conversion `char` to `unsigned char` to `int` and `int` to `char`.
*s = tolower((unsigned char) *s);
s++;
}
}
my_strtolower1() is well defined, yet not functionally correct on rare machines where [A-Z,a-z] are not consecutive.
my_strtolower2() has the expected functionality, except for technically undefined behavior when *s < 0 (and not EOF).
my_strtolower3() has the expected functionality without UB when *s < 0.
I have recently been reading The C Programming Language by Kernighan.
There is an example that defines a variable of type int but uses getchar() to store a value in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes constant character literals like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
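The two failure cases can be reproduced without any input at all, as in this small sketch. The function names are invented for the illustration. The first result depends on the implementation: it assumes the common situation where EOF is -1 and converting 255 to an 8-bit signed char yields -1.

```c
#include <stdio.h>

/* Returns 1 if a data byte, once squeezed into a signed char, compares equal to EOF. */
int byte_mistaken_for_eof(unsigned char byte)
{
    /* Implementation-defined for values above SCHAR_MAX; typically 255 becomes -1. */
    signed char narrowed = (signed char)byte;
    return narrowed == EOF;
}

/* Returns 1 if EOF still compares equal to EOF after being stored in an unsigned char. */
int eof_survives_unsigned_char(void)
{
    /* Well defined: the value is reduced modulo UCHAR_MAX + 1 (255 when EOF is -1). */
    unsigned char narrowed = (unsigned char)EOF;
    return narrowed == EOF;
}
```

On such a system, byte_mistaken_for_eof(255) reports a false end of file, while eof_survives_unsigned_char() returns 0 because the real end of file has been lost.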
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function and the philosophy back then was closer to how the language gets translated to assembly than type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C before C99, int was the default return type (i.e. if a function had no declaration, compilers assumed that it returned int), and returning a value is done using a register - therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number '1'. (For historical reasons, lots of control characters came first - so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
You can do arithmetic on a char directly: it is an integer type, and its value is promoted to int in expressions. So, to convert a LC letter to UC in ASCII, you can do something like:
char letter;
....
if (letter >= 'a' && letter <= 'z') { /* ASCII only */
letter = letter - 32; /* computed as int, then converted back to char */
}
No cast is needed; the compiler performs the promotion to int implicitly.
In the end, the type 'char' is just a small integer type, since ASCII assigns a unique integer to each letter.
There is the following code from part of a function that I have been given:
char ch;
ch = fgetc(fp);
if (ch == EOF)
return -1;
Where fp is a pointer-to-FILE/stream passed as a parameter to the function.
However, having checked the usage of fgetc(),getc() and getchar(), it seems that they all return type int rather than type char because EOF does not fit in the values 0-255 that are used in a char, and so is usually < 0 (e.g. -1). However, this leads me to ask three questions:
If getchar() returns int, why is char c; c = getchar(); a valid usage of the function? Does C automatically type cast to char in this case, and in the case that getchar() is replaced with getc(fp) or fgetc(fp)?
What would happen in the program when fgetc() or the other two functions return EOF? Would it again try and cast to char like before but then fail? What gets stored in ch, if anything?
If EOF is not actually a character, how is ch == EOF a valid comparison, since EOF cannot be represented by a char variable?
If getchar() returns int, why is char c; c = getchar(); a valid usage of the function?
It's not. Just because you can write it and the compiler (somehow) lets you compile it does not make the code valid.
I believe the above answers all the questions.
Just to add, in case EOF is returned, it cannot be stored in a char. Signedness of a char is implementation defined, thus, as per chapter 6.3.1.3, C11
When a value with integer type is converted to another integer type other than _Bool, if
the value can be represented by the new type, it is unchanged.
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.
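For comparison, a corrected version of the fragment from the question might look like the sketch below. The wrapper name first_byte_or_minus_one is hypothetical. Keeping the result in an int until after the EOF test means no narrowing conversion happens before the comparison.

```c
#include <stdio.h>

/* Hypothetical wrapper: returns the next byte of fp as a non-negative int,
 * or -1 on end of file or read error. */
int first_byte_or_minus_one(FILE *fp)
{
    int ch = fgetc(fp); /* int: can hold 0..UCHAR_MAX and EOF without loss */

    if (ch == EOF)
        return -1;
    return ch; /* safe to convert to char from here on if needed */
}
```

A byte with the value 0xFF is returned as 255 by this function and is not confused with end of file.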
I am working through methods of input and output in C, and I have been presented with a segment of code that has an element that I cannot understand. The purport of this code is to show how the 'echoing' and 'buffered' input/outputs work, and in the code, they have a type 'int' declared for, as I understand, characters:
#include <stdio.h>
int main(void){
int ch; //This is what I do not get: why is this type 'int'?
while((ch = getchar()) != '\n'){
putchar(ch);
}
return 0;
}
I'm not on firm footing with type casting as it is, and this 'int' / 'char' discrepancy is undermining all notions that I have regarding data types and compatibility.
getchar() returns an int type because it is designed to be able to return a value that cannot be represented by char to indicate EOF. (C11 §7.21.1 ¶3 and §7.21.7.6 ¶3)
Your looping code should take into account that getchar() might return EOF:
while((ch = getchar()) != EOF){
if (ch != '\n') putchar(ch);
}
The getc, fgetc and getchar functions return int because they are capable of handling binary data, as well as providing an in-band signal of an error or end-of-data condition.
Except on certain embedded platforms which have an unusual byte size, the type int is capable of representing all of the byte values from 0 to UCHAR_MAX as positive values. In addition, it can represent negative values, such as the value of the constant EOF.
The type unsigned char would only be capable of representing the values 0 to UCHAR_MAX, and so the functions would not be able to use the return value as a way of indicating the inability to read another byte of data. The value EOF is convenient because it can be treated as if it were an input symbol; for instance it can be included in a switch statement which handles various characters.
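The switch idea can be sketched as follows. The enum and the function name classify_input are invented for this example. Because the argument is an int, EOF can sit alongside ordinary character constants as just another case label.

```c
#include <stdio.h>

/* Hypothetical categories for the value returned by getc()/getchar(). */
enum input_kind { INPUT_END, INPUT_NEWLINE, INPUT_BLANK, INPUT_OTHER };

enum input_kind classify_input(int ch)
{
    switch (ch) {
    case EOF:  return INPUT_END;     /* treated like any other input symbol */
    case '\n': return INPUT_NEWLINE;
    case ' ':
    case '\t': return INPUT_BLANK;
    default:   return INPUT_OTHER;
    }
}
```

A caller could write classify_input(getchar()) and handle end of input in the same place as every other symbol.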
There is a little bit more to this because in the design of C, values of short and char type (signed or unsigned) undergo promotion when they are evaluated in expressions.
In classic C, before prototypes were introduced, when you pass a char to a function, it's actually an int value which is passed. Concretely:
int func(c)
char c;
{
/* ... */
}
This kind of old style definition does not introduce information about the parameter types.
When we call this as func(c), where c has type char, the expression c is subject to the usual promotion, and becomes a value of type int. This is exactly the type which is expected by the above function definition. A parameter of type char actually passes through as a value of type int. If we write an ISO C prototype declaration for the above function, it has to be, guess what:
int func(int); /* not int func(char) */
Another legacy is that character constants like 'A' actually have type int and not char. It is noteworthy that this changes in C++, because C++ has overloaded functions. Given the overloads:
void f(int);
void f(char);
we want f(3) to call the former, and f('A') to call the latter.
So the point is that the designers of C basically regarded char as being oriented toward representing a compact storage location, and the smallest addressable unit of memory. But as far as data manipulation in the processor was concerned, they were thinking of the values as being word-sized int values: that character processing is essentially data manipulation based on int.
This is one of the low-level facets of C. In machine languages on byte-addressable machines, we usually think of bytes as being units of storage, and when we load them into registers to work with them, they occupy a full register, and so become 32 bit values (or what have you). This is mirrored in the concept of promotion in C.
The return type of getchar() is int. It returns the character code of the character it's just read (the ASCII code, on ASCII systems). This is (and I know someone's gonna correct me on this) the same as the char representation, so you can freely compare them and so on.
Why is it this way? The getchar() function is ancient -- from the very earliest days of K&R C. putchar() similarly takes an int argument, when you'd think it might take a char.
Hope that helps!
I have C code that uses the standard library function isalpha() from ctype.h. This is on Visual Studio 2010 on Windows.
In the code below, if char c is '£', the isalpha call triggers an assertion failure:
char c='£';
if(isalpha(c))
{
printf ("character %c is alphabetic\n",c);
}
else
{
printf ("character %c is NOT alphabetic\n",c);
}
I can see that this might be because 8 bit ASCII does not have this character.
So how do I handle such Non-ASCII characters outside of ASCII table?
What I want to do is this: if any non-alphabetic character is found (even one that is not in the 8-bit ASCII table), I want to be able to ignore it.
You may want to cast the value sent to isalpha (and the other functions declared in <ctype.h>) to unsigned char
isalpha((unsigned char)value)
It's one of the (not so) few occasions where a cast is appropriate in C.
Edited to add an explanation.
According to the standard, emphasis is mine
7.4
1 The header <ctype.h> declares several functions useful for classifying and mapping
characters. In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.
The cast to unsigned char ensures calling isalpha() does not invoke Undefined Behaviour.
You must pass an int to isalpha(), not a char. Note the standard prototype for this function:
int isalpha(int c);
Passing an 8-bit signed character will cause the value to be converted into a negative integer, resulting in an illegal negative offset into the internal arrays typically used by isxxxx().
However you must ensure that your char is treated as unsigned when casting - you can't simply cast it directly to an int, because if it's an 8-bit character the resulting int would still be negative.
The typical way to ensure this works is to cast it to an unsigned char, and then rely on implicit type conversion to convert that into an int.
e.g.
char c = '£';
int a = isalpha((unsigned char) c);
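Applied to a whole string, the same cast gives something like the sketch below, which counts the alphabetic characters and skips everything else. The function name count_alpha is hypothetical, and which bytes count as alphabetic depends on the current locale. In the default "C" locale only the basic Latin letters do.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the alphabetic characters in s and ignores the rest. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* The cast keeps the argument in 0..UCHAR_MAX even when char is signed. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

A byte such as 0xA3 (the pound sign in ISO 8859-1) is then passed as 163 instead of a negative value, so the call stays within the range the standard defines.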
You may be compiling with wchar (UNICODE) as the character type; in that case the function to use instead of isalpha is iswalpha
http://msdn.microsoft.com/en-us/library/xt82b8z8.aspx