Why does EOF coincide with a valid char value? [duplicate] - c

This question already has answers here:
What is EOF in the C programming language?
(10 answers)
Closed 6 years ago.
As said in the comments on the answer to this question: Why gcc does not produce type mismatch warning for int and char?
both -1 and 255 are 0xFF as an 8-bit hex value on any current CPU.
But EOF is equal to -1. This is a contradiction, because the value of EOF must not coincide with any valid 8-bit character. This example demonstrates it:
#include <stdio.h>

int main(void)
{
    char c = 255;
    if (c == EOF) printf("oops\n");
    return 0;
}
On my machine it prints oops.
How can this contradiction be explained?

When you compare an int value to a char value, the char value is promoted to an int value. This promotion is automatic and part of the C language specification (see e.g. this "Usual arithmetic conversions" reference, especially point 4). Sure, the compiler could give a warning about it, but why should it, when it's a valid language construct?
There's also the problem of the signedness of char, which is implementation-defined. If char is unsigned, your condition would be false.
Also, if you read just about any reference for functions that read characters from files (for example this one for fgetc and getc), you will see that they return an int and not a char, precisely for the reasons mentioned above.
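To make the distinction concrete, here is a minimal sketch of the usual pattern (the file name input.txt is just a placeholder): the result of fgetc is kept in an int until it has been compared with EOF, so a 0xFF byte and EOF can never be confused.

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "rb");  /* hypothetical file name */
    if (fp == NULL)
        return 1;

    int c;                                /* int, not char, so EOF stays distinguishable */
    while ((c = fgetc(fp)) != EOF) {
        /* c holds an unsigned char value (0..UCHAR_MAX); a 0xFF byte
           arrives here as 255 and never compares equal to EOF (-1). */
        putchar(c);
    }

    fclose(fp);
    return 0;
}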

Related

Understanding syntax puzzle in c

In an upcoming exam in C, we have one question that gives you extra credit.
The question is always related to tricky syntax of various printing types.
Overall, I understood all the questions I have gone through, but two questions in particular had me puzzled:
What is the output of the following program?
#include <stdio.h>
void main(){
    printf("%c", '&'&'&');
}
answer: &
What is the output of the following program?
#include <stdio.h>
#include <string.h>
void main(){
    printf("%c", strcmp("***","**")*'*');
}
answer: *
As you can see the questions are quite similar.
My question is, why is this the output?
Regarding the first question: I understand that a character is, logic-wise, always TRUE and that AND-ing TRUE with TRUE gives TRUE (or 1) as well, but why would it convert 1 to '&', and not to the char equivalent of 1 from the ASCII table? (Notice the required print of %c and not %d.)
Regarding the second question: I understand that strcmp returns an int according to which string 'appears first in the dictionary', and in this example it would result in 1, but why would multiplying it by the char '*' (again, logic-wise equal to 1) result in converting (1*1=1) to the char '*'?
For the first question the expression is '&' & '&', where & is a bitwise AND operator (not a logical operator). With bitwise AND the result of x & x is x, so the result in this case is just the character '&'.
For the second question, assuming that the result of the call to strcmp() is 1, you can then simplify the expression to 1 * '*', which is just '*'. (Note that, as #rici mentions in the comments above, the result of strcmp is not guaranteed to be 1 in this case, only that it will be an integer > 0, so you should not rely on this behaviour, and the question is therefore a bad question.)
'&' is a constant of type int. '&'&'&' has the same value and type as '&' since a & a is a for any int a. So the output is equivalent to printf ("%c", '&');.
The analysis of the second snippet is more difficult. The result of strcmp is a positive number, and that is multiplied by '*' (which must be a positive number for any encoding supported by C). That's an int, but the value is implementation-defined (subject to the encoding on your platform and your platform's implementation of strcmp), and the behaviour of %c is contingent on the signedness or otherwise of char on your platform. If the result is too big to fit into a char, and char is unsigned, then the value is converted to a char with the normal wrap-around behaviour. If char is signed then the conversion is implementation-defined and an implementation-defined signal might be raised.
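A small sketch that demonstrates both expressions; note that the second one is normalised so it does not rely on strcmp returning exactly 1 (the standard only promises a positive value here):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* '&' & '&' is a bitwise AND of two identical int values, so the result is '&'. */
    printf("%c\n", '&' & '&');

    /* The exam assumes strcmp returns exactly 1 here, but the standard only
       guarantees a positive value, so normalise it before multiplying. */
    int positive = strcmp("***", "**") > 0;   /* 1, because "***" compares greater */
    printf("%c\n", positive * '*');

    return 0;
}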

C Language: Why can an int variable store a char?

I am currently reading The C Programming Language by Kernighan.
There is an example which defines a variable as int but uses getchar() to store a value into it:
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think of is ASCII and Unicode.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases where (char) EOF != EOF (for example, when char is an unsigned type).
Also, in many places where one uses a char value, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO 8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
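A small sketch of that order of operations (read into an int, test for EOF, and only then narrow to char):

#include <stdio.h>

int main(void)
{
    int c = getchar();          /* keep the full unsigned char range plus EOF */
    if (c != EOF) {
        char ch = (char)c;      /* narrowing is safe only after the EOF test */
        putchar(ch);
    }
    return 0;
}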
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function, and the philosophy back then was closer to how the language gets translated to assembly than to type correctness and readability. Keep in mind that compilers did not optimize code as much as they do today. In C, int is the default return type (i.e. if you don't have a declaration of a function, compilers will assume that it returns int), and returning a value is done using a register; therefore returning a char instead of an int would actually generate additional implicit code to mask out the extra bytes of the value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
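If you want to see the values your own implementation uses, here is a minimal sketch that prints a few of these limits (the exact numbers are implementation-defined; on typical desktop platforms they all fit in an int, which is assumed by the %d format here):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("CHAR_BIT  = %d\n", CHAR_BIT);
    printf("CHAR_MIN  = %d\n", CHAR_MIN);
    printf("CHAR_MAX  = %d\n", CHAR_MAX);
    printf("UCHAR_MAX = %d\n", UCHAR_MAX);
    printf("INT_MAX   = %d\n", INT_MAX);
    return 0;
}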
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number '1'. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
Strictly speaking, you can do arithmetic directly on a char: in an arithmetic expression the char value is promoted to int before the operation. So, to convert all LC letters to UC, you can do something like:
char letter;
....
if (letter >= 'a' && letter <= 'z') {   /* letter is lower-case */
    letter = letter - 32;               /* 'a' - 'A' == 32 in ASCII; the char is promoted to int for the subtraction */
}
A C compiler will not complain about this: the char operand is promoted to int automatically before the subtraction, so an explicit (int) cast is redundant.
But, in the end, the type char is really just a small integer type, since ASCII assigns a unique integer to each letter.
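If the goal is case conversion, the portable route is <ctype.h> rather than hand-rolled arithmetic; a minimal sketch that upper-cases standard input:

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    int c;
    /* toupper expects an unsigned char value (or EOF), which is exactly
       what getchar returns, so no manual arithmetic or casting is needed. */
    while ((c = getchar()) != EOF)
        putchar(toupper(c));
    return 0;
}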

Clarification on the use of fgetc [duplicate]

This question already has answers here:
Difference between int and char in getchar/fgetc and putchar/fputc?
(2 answers)
Closed 5 years ago.
There is the following code from part of a function that I have been given:
char ch;
ch = fgetc(fp);
if (ch == EOF)
    return -1;
Where fp is a pointer-to-FILE/stream passed as a parameter to the function.
However, having checked the usage of fgetc(), getc() and getchar(), it seems that they all return type int rather than type char, because EOF does not fit in the 0-255 range used by a char and so is usually negative (e.g. -1). This leads me to ask three questions:
1. If getchar() returns int, why is char c; c = getchar(); a valid usage of the function? Does C automatically cast to char in this case, and in the case that getchar() is replaced with getc(fp) or fgetc(fp)?
2. What would happen in the program when fgetc() or the other two functions return EOF? Would it again try to cast to char like before but then fail? What gets stored in ch, if anything?
3. If EOF is not actually a character, how is ch == EOF a valid comparison, since EOF cannot be represented by a char variable?
If getchar() returns int, why is char c; c = getchar(); a valid usage of the function?
It's not. Just because you can write it, and the compiler (somehow) lets you compile it, does not make the code valid.
I believe the above answers all the questions.
Just to add: in case EOF is returned, it cannot reliably be stored in a char. The signedness of a char is implementation-defined; thus, as per chapter 6.3.1.3, C11,
When a value with integer type is converted to another integer type other than _Bool, if
the value can be represented by the new type, it is unchanged.
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.60)
Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.
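Putting that together, a corrected version of the fragment from the question might look like the following sketch (read_one is a hypothetical helper name, and fp is assumed to be an open stream):

#include <stdio.h>

/* Hypothetical helper mirroring the question's fragment: returns -1 on
   end of file or error, otherwise the character that was read. */
static int read_one(FILE *fp)
{
    int ch = fgetc(fp);     /* int, so EOF can be represented */
    if (ch == EOF)
        return -1;
    return ch;
}

int main(void)
{
    int c = read_one(stdin);
    if (c != -1)
        putchar(c);
    return 0;
}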

Difference between storing the value returned from getchar in an int versus a char in C [duplicate]

This question already has answers here:
While (( c = getc(file)) != EOF) loop won't stop executing
(2 answers)
Closed 7 years ago.
While going through the book by Dennis Ritchie, I found that it is better to store the value returned by the getchar() function in an integer-type variable rather than a character-type variable. The reason stated was that a character-type variable cannot store the value of EOF. But when trying it out in practice, there was no such difficulty in storing the return value in a char-type variable. Also, what does the getchar() function originally return, the character or the ASCII value of the character?
EOF is end of file. You won't see the difference until you implement some file read/write operation/code.
The value of EOF is typically defined to be -1 (the standard only requires it to be a negative int).
That works well because all ASCII codes are non-negative, so EOF can't possibly clash with any real character's representation.
Unfortunately, C has a very strange feature that can cause trouble: it is implementation-defined what the range of possible values for a char variable is. On some systems it is -128 to +127, which is fine; but on other systems it is 0 to +255, which is fine for normal ASCII values, but not so hot for EOF's -1.
For one thing, the variable to hold getchar()'s return value must be an int. EOF is an out of band return value from getchar(): it is distinct from all possible char values which getchar() can return. (On modern systems, it does not reflect any actual end-of-file character stored in a file; it is a signal that no more characters are available.) getchar()'s return value must be stored in a variable larger than char so that it can hold all possible char values, and EOF.
Two failure modes are possible if, as in the fragment above, getchar()'s return value is assigned to a char.
If type char is signed, and if EOF is defined (as is usual) as -1, the character with the decimal value 255 ('\377' or '\xff' in C) will be sign-extended and will compare equal to EOF, prematurely terminating the input.
If type char is unsigned, an actual EOF value will be truncated (by having its higher-order bits discarded, probably resulting in 255 or 0xff) and will not be recognized as EOF, resulting in effectively infinite input.
The bug can go undetected for a long time, however, if chars are signed and if the input is all 7-bit characters. (Whether plain char is signed or unsigned is implementation-defined.)
References:
K&R1 Sec. 1.5 p. 14
K&R2 Sec. 1.5.1 p. 16
ISO Sec. 6.1.2.5, Sec. 7.9.1, Sec. 7.9.7.5
H&S Sec. 5.1.3 p. 116, Sec. 15.1, Sec. 15.6
CT&P Sec. 5.1 p. 70
PCS Sec. 11 p. 157
Generally it is best to store getchar()'s result in an int, guaranteeing that EOF is handled properly.
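The classic copy loop from K&R illustrates the point; a minimal, self-contained sketch:

#include <stdio.h>

int main(void)
{
    int c;                          /* must be int, not char */
    while ((c = getchar()) != EOF)  /* EOF stays distinct from every byte value */
        putchar(c);
    return 0;
}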

Can an implementation that has sizeof (int) == 1 "fully conform"? [duplicate]

This question already has answers here:
Can sizeof(int) ever be 1 on a hosted implementation?
(8 answers)
Closed 7 years ago.
According to the C standard, any characters returned by fgetc are returned in the form of unsigned char values, "converted to an int" (that quote comes from the C standard, stating that there is indeed a conversion).
When sizeof (int) == 1, many unsigned char values are outside the range of int. It is thus possible that some of those unsigned char values might end up being converted to an int value equal to EOF (the result of such a conversion being "implementation-defined or an implementation-defined signal is raised"), which would be returned despite the file not actually being in an erroneous or end-of-file state.
I was surprised to find that such an implementation actually exists. The TMS320C55x CCS manual documents UCHAR_MAX having a corresponding value of 65535, INT_MAX having 32767, fputs and fopen supporting binary mode... What's even more surprising is that it seems to describe the environment as a fully conforming, complete implementation (minus signals).
The C55x C/C++ compiler fully conforms to the ISO C standard as defined by the ISO specification ...
The compiler tools come with a complete runtime library. All library functions conform to the ISO C library standard. ...
Is such an implementation that can return a value indicating errors where there are none, really fully conforming? Could this justify using feof and ferror in the condition section of a loop (as hideous as that seems)? For example, while ((c = fgetc(stdin)) != EOF || !(feof(stdin) || ferror(stdin))) { ... }
The function fgetc() returns an int value in the range of unsigned char only when a proper character is read; otherwise it returns EOF, which is a negative value of type int.
My original answer (I changed it) assumed that there was an integer conversion to int, but this is not the case, since actually the function fgetc() is already returning a value of type int.
I think that, to be conforming, the implementation has to make fgetc() return nonnegative values in the range of int, unless EOF is returned.
In this way, the range of values from 32768 to 65535 will never be associated with character codes in the TMS320C55x implementation.
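A sketch of the defensive loop proposed in the question; on common platforms (where int is wider than unsigned char) the extra feof/ferror checks are redundant, but on an implementation like the one described they distinguish a genuine end of file from a character that merely converts to the value of EOF:

#include <stdio.h>

int main(void)
{
    int c;
    /* The feof/ferror test only matters where sizeof(int) == 1 and a real
       character could compare equal to EOF; elsewhere it is harmless. */
    while ((c = fgetc(stdin)) != EOF || !(feof(stdin) || ferror(stdin))) {
        putchar(c);
    }
    return 0;
}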
