Up to now, I have found out that we store the result of getchar() in an int because the function returns -1 at EOF. But can't a char hold -1? I think it can, because it can range from -128 to 127.
I searched through the top Google results, but the answers I found didn't satisfy me.
First of all, just to clear something up, EOF is not required to be specifically -1. The ISO C standard requires it to have a negative value.
Also note that the type char can be signed or unsigned; that is implementation-defined.
The functions fgetc, getc and getchar can be used to process binary streams, not only text streams. Bytes have values from 0 to UCHAR_MAX. The natural type for this range is unsigned char. The return value of getc couldn't be unsigned char, because EOF wouldn't be representable. It couldn't be signed char either, because signed char cannot hold the values from 0 to UCHAR_MAX. So, a wider type is chosen: the type int. On mainstream platforms, int is wider than char: it has a size of at least 2. And so it is capable of representing some negative value that can be used for EOF, and all the byte values in the range 0 to UCHAR_MAX.
In some C implementations (for systems such as some DSP chips) this is not true: int is one byte wide. That presents challenges. On such a C implementation, a range of valid byte values returned by getc just has to be negative, and one of those values clashes with EOF. Carefully written code can tell that this is the case: a value equal to EOF was returned, yet the feof and ferror functions report false: the stream is not in error, and end-of-file has not occurred. Thus, the value which looks like EOF is actually a valid byte.
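The counting argument above can be written down as a small sketch. The helper names here are made up for illustration, and the arithmetic assumes UCHAR_MAX + 2 fits in a long, which holds on mainstream platforms.

```c
#include <limits.h>
#include <stdio.h>

/* Number of distinct results getc must be able to report:
   every byte value 0..UCHAR_MAX plus the EOF sentinel.
   (Hypothetical helper; assumes the sum fits in a long.) */
static long distinct_getc_results(void)
{
    return (long)UCHAR_MAX + 1L + 1L;
}

/* Number of distinct values an unsigned char can hold. */
static long distinct_uchar_values(void)
{
    return (long)UCHAR_MAX + 1L;
}

/* Nonzero when int can hold all byte values plus a negative EOF. */
static int int_has_room_for_eof(void)
{
    return UCHAR_MAX <= INT_MAX && EOF < 0;
}
```

With 8-bit bytes that is 257 results against 256 unsigned char values, so one more value is needed than unsigned char can supply.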
getchar() and family return an integer so that the EOF -1 is distinguishable from (char)-1 or (unsigned char)255.
Chars, using ASCII encoding, are stored in one byte, which is typically 8 bits. Read as an unsigned char, a byte can therefore take the numeric values 0-255.
getchar returns a signed int to allow for -1, which is the magic number for end-of-file (EOF)
So I was writing a program on my Raspberry Pi Zero to count the frequencies of different word lengths in the input, but the program didn't stop at EOF.
So I tried this to debug:
#include <stdio.h>
#include <stdlib.h>
void main() {
char c;
while ( (c = getchar()) != EOF) {
putchar(c);
}
}
And compiled with this:
gcc test.c && ./a.out <input.txt
It printed out the input text, but then just kept printing question marks until I hit Ctrl+C. When I copied the program over onto my laptop and ran it there, everything worked fine.
I could just finish on the laptop, but I'm curious. Why can't the Pi detect when the file hit EOF?
First couple of facts:
The symbol EOF is a macro that expands to a negative integer constant expression of type int. On most implementations its value is -1.
It's implementation-defined if char is signed or unsigned. The same compiler on different platforms might have different char implementations.
Now for the long explanation about your problem:
When integer types of different sizes are used in arithmetic expressions (and a comparison counts as one here), both operands undergo the usual arithmetic conversions to get a common type (usually int).
For smaller integer types, like for example char, that involves integer promotion to convert it to an int. For this promotion the value of the char needs to be kept intact, so e.g. -1 as a char will still be -1 as an int.
Because of how negative numbers are represented on most systems, the char value of -1 is (in hexadecimal) 0xff. For a signed char, when -1 is converted to an int, it keeps the value -1 (which will be represented as 0xffffffff for a 32-bit int type).
The problem comes when char is unsigned, because then when getchar returns EOF (the value -1) the unsigned char value will be equal to 255 (the unsigned decimal representation of 0xff). And when promoted to an int the value will still be 255. And 255 != -1!
That's why the getchar return type is int and not char. It is also one of the reasons why all character-handling functions use int instead of char.
So to solve your problem, you need to change the type of the variable c to int:
int c;
Then it will work.
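To see the difference without needing a Pi, here is a rough sketch that models both declarations. The function names are invented for this example, and it assumes UCHAR_MAX <= INT_MAX (true on mainstream platforms). A compiler may warn that the first comparison is always false, which is exactly the point.

```c
#include <stdio.h>

/* Models `unsigned char c = getchar(); c == EOF` for a given return value.
   Returns 1 if the stored value would compare equal to EOF.
   Assumes UCHAR_MAX <= INT_MAX. */
static int stored_in_uchar_equals_eof(int getchar_result)
{
    /* If EOF is -1, this conversion yields UCHAR_MAX. */
    unsigned char c = (unsigned char)getchar_result;

    /* c promotes to a non-negative int, while EOF is negative. */
    return c == EOF;
}

/* Models the fix: keep the result in an int. */
static int stored_in_int_equals_eof(int getchar_result)
{
    int c = getchar_result;
    return c == EOF;
}
```

The first function returns 0 for every input, including EOF itself, which matches the endless loop seen where plain char is unsigned.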
getchar returns an int value, not a char value. A single getchar call needs some way to tell you either that it read a regular character or that there is nothing more to read, so someone long ago decided to return int, which leaves room for a value outside the character range to indicate end of file. Change char to int.
getchar's return value is supposed to be able to return any ASCII (and extended ASCII) character between 0 and 255.
In order to make the distinction between an ASCII character and EOF, EOF cannot be a value in this interval, so getchar's return type must have more than 8 bits.
int getchar(void);
So you should write
int c;
while ( (c = getchar()) != EOF) ...
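For completeness, the same loop can be wrapped in a small helper that works on any stream. This is only a sketch: copy_stream is a hypothetical name, not a standard function.

```c
#include <stdio.h>

/* Copies every byte from `in` to `out`.
   Returns the number of bytes copied, or -1 if a write fails.
   (copy_stream is a hypothetical helper for illustration.) */
static long copy_stream(FILE *in, FILE *out)
{
    long count = 0;
    int c; /* int, so that EOF stays distinct from the byte 0xFF */

    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return -1;
        count++;
    }
    return count;
}
```

Calling copy_stream(stdin, stdout) behaves like the corrected program, and a 0xFF byte in the input no longer ends the loop early.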
We often use fgetc like this:
int c;
while ((c = fgetc(file)) != EOF)
{
// do stuff
}
Theoretically, if a byte in the file has the value of EOF, this code is buggy - it will break the loop early and fail to process the whole file. Is this situation possible?
As far as I understand, fgetc internally casts a byte read from the file to unsigned char and then to int, and returns it. This will work if the range of int is greater than that of unsigned char.
What happens if it's not (probably then sizeof(int)=1)?
Will fgetc read a legitimate data equal to EOF from a file sometimes?
Will it alter the data it read from the file to avoid the single value EOF?
Will fgetc be an unimplemented function?
Will EOF be of another type, like long?
I could make my code fool-proof by an extra check:
int c;
for (;;)
{
c = fgetc(file);
if (feof(file))
break;
// do stuff
}
Is it necessary if I want maximum portability?
Yes, c = fgetc(file); if (feof(file)) does work for maximum portability. It works in general and also when the unsigned char and int have the same number of unique values. This occurs on rare platforms with char, signed char, unsigned char, short, unsigned short, int, unsigned all using the same bit width and width of range.
Note that feof(file) alone is insufficient. Code should also check for ferror(file).
int c;
for (;;)
{
c = fgetc(file);
if (c == EOF) {
if (feof(file)) break;
if (ferror(file)) break;
}
// do stuff
}
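One way to package that check is a helper that reports the three outcomes separately. This is a sketch under the assumptions in the comments; next_byte is a made-up name.

```c
#include <stdio.h>

/* Reads one byte from f into *out.
   Returns 1 on success, 0 on end-of-file, -1 on a read error.
   (next_byte is a hypothetical helper for illustration.) */
static int next_byte(FILE *f, unsigned char *out)
{
    int c = fgetc(f);

    if (c == EOF) {
        if (ferror(f))
            return -1;
        if (feof(f))
            return 0;
        /* Neither indicator is set: c is a real byte that merely
           equals EOF. Only possible where UCHAR_MAX > INT_MAX. */
    }
    *out = (unsigned char)c;
    return 1;
}
```

One caveat: the error indicator is sticky, so if it was already set by an earlier call this reports -1 even when the current read succeeded. Calling clearerr beforehand is one way to deal with that if it matters.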
The C specification says that int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is nonstandard.
The C specification also says that EOF is a negative int constant and that fgetc returns "an unsigned char converted to an int" in the event of a successful read. Since unsigned char can't have a negative value, the value of EOF can be distinguished from anything read from the stream.*
*See below for a loophole case in which this fails to hold.
Relevant standard text (from C99):
§5.2.4.2.1 Sizes of integer types <limits.h>:
[The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
[...]
minimum value for an object of type int
INT_MIN -32767
maximum value for an object of type int
INT_MAX +32767
§7.19.1 <stdio.h> - Introduction
EOF ... expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
§7.19.7.1 The fgetc function
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined)
If UCHAR_MAX ≤ INT_MAX, there is no problem: all unsigned char values will be converted to non-negative integers, so they will be distinct from EOF.
Now, there is a funny sort of loophole here: if a system has UCHAR_MAX > INT_MAX, then a system is legally allowed to convert values greater than INT_MAX to negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent that value is implementation defined), making it possible for a character read from a stream to be converted to EOF.
Systems with CHAR_BIT > 8 do exist (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure if they are broken with respect to EOF and stream functions.
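If you would rather find out at build time whether you are on such a system, something along these lines should do. It is a sketch that assumes a C11 compiler for _Static_assert, and eof_is_unambiguous is an invented name.

```c
#include <limits.h>
#include <stdio.h>

/* Refuse to compile where a byte value could collide with EOF.
   _Static_assert requires C11 or later. */
_Static_assert(UCHAR_MAX <= INT_MAX,
               "fgetc results may be ambiguous: add feof/ferror checks");

/* Run-time form of the same check, for code that must build everywhere.
   (Hypothetical helper name.) */
static int eof_is_unambiguous(void)
{
    return UCHAR_MAX <= INT_MAX;
}
```

On ordinary 8-bit-byte platforms the assertion passes silently and the plain (c = fgetc(file)) != EOF loop is safe.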
NOTE: chux's answer is the correct one in the most general case. I'm leaving this answer up because I believe both the answer and the discussion in the comments are valuable in understanding the (rare) situations in which chux's approach is necessary.
EOF is guaranteed to have a negative value (C99 7.19.1), and as you mentioned, fgetc reads its input as an unsigned char before converting to int. So those by themselves guarantee that EOF can't be read from a file.
As for your specific questions:
fgetc can't read a legitimate datum equal to EOF. In the file, there's no such thing as signed or unsigned; it's just bit sequences. It's C that interprets 1000 1111 differently depending on whether it's being treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.
Addendum: It can't read EOF for the unsigned char part, but when it converts the unsigned char to an int, if the int is not capable of representing all values of the unsigned char, then the behavior is implementation-defined (6.3.1.3).
fgetc is required by the standard for hosted implementations, but freestanding implementations are permitted to omit most of the standard library functions (some are apparently required, but I couldn't find the list.)
EOF won't require a long, since fgetc needs to be able to return it and fgetc returns an int.
As far as altering the data goes, it can't change the value exactly, but since fgetc is specified to read "characters" from the file as opposed to chars, it could potentially read in 8 bits at a time even if the system otherwise defines CHAR_BIT to be 16 (which is the minimum value it could have if sizeof(int) == 1, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2). In that case, the input character would be converted to an unsigned char that just always had its high bits 0. Then it could make the conversion to int without losing precision. (In practice, this just won't come up, since machines don't generally have 16-bit bytes.)
I'm new to C. From the book there is a sample code:
#include <stdio.h>
main() {
int c;
c = getchar();
while (c != EOF) {
putchar(c);
c = getchar();
}
}
The author writes a sentence like this:
We can't use char since c must be big enough to hold EOF in addition to any possible char. Therefore we use int.
Trying to understand, I modified the code like this:
#include <stdio.h>
main() {
char c=getchar();
while (c != EOF) {
putchar(c);
c = getchar();
}
if (c == EOF) {
putchar('*');
}
}
When I press Ctrl+D, * was printed, which means c holds EOF, which confuses me. Can anybody explain a little bit about this?
Because EOF is a special sentinel value (implementation defined, but usually -1 as an int type), it can't be distinguished from the value 255 if stored in a char variable. You need a type larger than 8 bits in order to represent all possible byte values returned by getchar(), plus the special sentinel value EOF. There are 257 different possible return values from getchar().
Also, on a related note, character literals in C like 'a' have the type int. In C++, on the other hand, character literals have the type char. So you will see characters usually passed to and returned from C Standard Library functions as int types.
When you press CTRL-D, getchar() returns a value that doesn't fit in char. So char takes as much of it as it can. Let's assume a common number for EOF: 0xFFFFFFFF, in other words, -1. When this value is assigned to char (assuming it's signed), it will get a truncated value out of it, 0xFF, which is also -1.
So your if becomes:
if ((char)-1 == (int)-1)
the (char)-1 gets promoted to int to be able to compare it with (int)-1. Since on promotion of signed values, they get sign extended (to keep the original signed value), you end up comparing -1 and -1 which is true.
That said, this is only lucky. If you actually read a character with value 0xFF, you would mistake it with EOF. Not to mention EOF may not be -1 in the first place. All of this aside, you shouldn't let your program truncate a value when assigning to a variable (unless you know what you are doing).
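Here is a small sketch of that "lucky" case. It uses signed char explicitly to model a platform where plain char is signed, and the function name is made up. The result for 255 relies on implementation-defined behavior, though two's-complement compilers such as GCC and Clang wrap it to -1.

```c
#include <stdio.h>

/* Models `char c = getchar(); c == EOF` where plain char is signed.
   Returns 1 if the stored value would compare equal to EOF.
   (Hypothetical helper for illustration.) */
static int stored_in_schar_equals_eof(int getchar_result)
{
    /* For results above SCHAR_MAX this conversion is
       implementation-defined; it typically wraps 255 to -1. */
    signed char c = (signed char)getchar_result;

    /* c is promoted to int with its value preserved. */
    return c == EOF;
}
```

Assuming EOF is -1, both EOF and the byte 255 make this return 1, so the loop stops correctly at end of input but also stops wrongly on a 0xFF byte.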
char is, at least on your system, signed. Therefore, it can hold values from -128 to 127.
EOF is -1 and is, therefore, one of them.
Your code works, as the -1 is retained. But as soon as you input the character which is equivalent to 255, you erroneously get -1 as well.
The type char is at least 8 bits wide, and whether it is signed or unsigned is implementation-defined (standard ASCII needs only 7 bits, but I have never seen char implemented like that).
EOF is implementation defined, but often -1. That is a signed number (0xFFFFFFFF on a 32-bit machine). Most compilers will probably truncate that to 0xFF to compare to a char, but that's also a valid (but rarely used) character, so you can't really be sure if you have hex value 255 or EOF (-1).
In addition, some code may be written to look for a return value of <0 to stop reading. Where char is unsigned, a char will never be less than zero.
The result of your getchar() is being stored as a char after a conversion by value. The value in this case is likely EOF (-1).
6.3.1.3-p1 Signed and unsigned integers
When a value with integer type is converted to another integer type
other than _Bool, if the value can be represented by the new type, it
is unchanged.
This is also accounted for during value comparison in your while-condition through value comparison via conversion:
6.5.9-p4 Equality Operators
If both of the operands have arithmetic type, the usual arithmetic conversions are performed. Values of complex types are equal if and
only if both their real parts are equal and also their imaginary parts
are equal. Any two values of arithmetic types from different type
domains are equal if and only if the results of their conversions to
the (complex) result type determined by the usual arithmetic
conversions are equal.
Both char and int are integer types. Both can hold the integer-value (-1) on your platform. Therefore your code "works".
Perhaps I'm overthinking this, as it seems like it should be a lot easier. I want to take a value of type int, such as is returned by fgetc(), and record it in a char buffer if it is not an end-of-file code. E.g.:
char buf;
int c = fgetc(stdin);
if (c < 0) {
/* handle end-of-file */
} else {
buf = (char) c; /* not quite right */
}
However, if the platform has signed default chars then the value returned by fgetc() may be outside the range of char, in which case casting or assigning it to (signed) char produces implementation-defined behavior (right?). Surely, though, there is tons of code out there that does exactly the equivalent of the example. Is it all relying on implementation-defined behavior and/or assuming 7-bit data?
It looks to me like if I want to be certain that the behavior of my code is defined by C to be what I want, then I need to do something like this:
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
I think that produces defined, correct behavior whether default chars are signed or unsigned, and regardless even of the size of char. Is that right? And is it really needful to do that to ensure portability?
fgetc() returns either an unsigned char value converted to int, or EOF. EOF is always < 0. Whether the system's char is signed or unsigned makes no difference.
C11dr 7.21.7.1 2
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
The concern I have about this is that it looks to be 2's complement dependent and implies that the ranges of unsigned char and char are equally wide. Both of these assumptions are certainly nearly always true today.
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
[Edit per OP comment]
Let's assume fgetc() returns no more different characters than can be stuffed into the range CHAR_MIN to CHAR_MAX; then (c - (UCHAR_MAX + 1)) would be more portable if replaced with (c - CHAR_MAX + CHAR_MIN). We do not know that (c - (UCHAR_MAX + 1)) is in range when c is CHAR_MAX + 1.
A system could exist that has a signed char range of -127 to +127 and an unsigned char range of 0 to 255 (5.2.4.2.1), but as fgetc() gets a character, it seems to either be all unsigned char or to have already limited itself to the smaller signed char range before converting to unsigned char and returning that value to the user. OTOH, if fgetc() returned 256 different characters, conversion to a narrow-ranged signed char would not be portable regardless of formula.
Practically, it's simple - the obvious cast to char always works.
But you're asking about portability...
I can't see how a real portable solution could work.
This is because the guaranteed range of char is -127 to 127, which is only 255 different values. So how could you translate the 256 possible return values of fgetc (excluding EOF), to a char, without losing information?
The best I can think of is to use unsigned char and avoid char.
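A minimal sketch of that approach, assuming UCHAR_MAX <= INT_MAX so that EOF cannot collide with a byte value (read_bytes is a name invented for this example):

```c
#include <stddef.h>
#include <stdio.h>

/* Reads up to cap bytes from f into buf; returns how many were stored.
   (read_bytes is a hypothetical helper for illustration.) */
static size_t read_bytes(FILE *f, unsigned char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    while (n < cap && (c = fgetc(f)) != EOF)
        buf[n++] = (unsigned char)c; /* value-preserving: c is 0..UCHAR_MAX */
    return n;
}
```

Because every non-EOF result of fgetc is already an unsigned char value, the store never hits an implementation-defined conversion. This is essentially a hand-rolled fread, which also takes a plain void * buffer.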
With thanks to those who responded, and having now read relevant portions of the C99 standard, I have come to agree with the somewhat surprising conclusion that storing an arbitrary non-EOF value returned by fgetc() as type char without loss of fidelity is not guaranteed to be possible. In large part, that arises from the possibility that char cannot represent as many distinct values as unsigned char.
For their part, the stdio functions guarantee that if data are written to a (binary) stream and subsequently read back, then the read back data will compare equal to the original data. That turns out to have much narrower implications than I at first thought, but it does mean that fputs() must output a distinct value for each distinct char it successfully outputs, and that whatever conversion fgets() applies to store input bytes as type char must accurately reverse the conversion, if any, by which fputs() would produce the input byte as its output. As far as I can tell, however, fputs() and fgets() are permitted to fail on any input they don't like, so it is not certain that fputs() maps every possible char value to an unsigned char.
Moreover, although fputs() and fgets() operate as if by performing sequences of fputc() and fgetc() calls, respectively, it is not specified what conversions they might perform between char values in memory and the underlying unsigned char values on the stream. If a platform's fputs() uses standard integer conversion for that purpose, however, then the correct back-conversion is as I proposed:
int c = fgetc(stream);
char buf;
if (c >= 0) buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
That arises directly from the integer conversion rules, which specify that integer values are converted to unsigned types by adding or subtracting the integer multiple of <target type>_MAX + 1 needed to bring the result into the range of the target type, supported by the constraints on representation of integer types. Its correctness for that purpose does not depend on the specific representation of char values or on whether char is treated as signed or unsigned.
However, if char cannot represent as many distinct values as unsigned char, or if there are char values that fgets() refuses to output (e.g. negative ones), then there are possible values of c that could not have resulted from a char conversion in the first place. No back-conversion argument is applicable to such bytes, and there may not even be a meaningful sense of char values corresponding to them. In any case, whether the given conversion is the correct reverse-conversion for data written by fputs() seems to be implementation defined. It is certainly implementation-defined whether buf = (char) c will have the same effect, though it does have on very many systems.
Overall, I am struck by just how many details of C I/O behavior are implementation defined. That was an eye-opener for me.
Best way to portably assign the result of fgetc() to a char in C
C2X is on the way
A sub-problem is saving an unsigned char value into a char, which may be signed. With 2's complement, that is not a problem.*1
On non-2's complement machines with signed char that do not support -0 *2, that is a problem. (I know of no such machines.)
In any case, with C2X, support for non-2's complement encoding is planned to be dropped, so as time goes on, we can eventually ignore non-2's complement issues and confidently use
int c = fgetc(stdin);
...
char buf = (c > CHAR_MAX) ? (char)(c - (UCHAR_MAX + 1)) : (char)c;
UCHAR_MAX > INT_MAX??
A 2nd portability issue not discussed is when UCHAR_MAX > INT_MAX, e.g. when all integer types are 64-bit. Some graphics processors have used a common size for all integer types.
On such unicorn machines, if (c < 0) is insufficient. Could use:
int c = fgetc(stdin);
#if UCHAR_MAX <= INT_MAX
if (c < 0) {
#else
if (c == EOF && (feof(stdin) || ferror(stdin))) {
#endif
...
Pedantically, ferror(stdin) could be true due to a prior input function and not this one which returned UCHAR_MAX, but let us not go into that rabbit-hole.
*1 In the case of int to signed char with c > CHAR_MAX, "Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." applies. With 2's complement, this overwhelmingly maps [128, 255] to [-128, -1].
*2 With non-2's complement and -0 support, the common mapping is that the least significant 8 bits remain the same. This does make for 2 zeros, yet proper handling of strings in <string.h> uses "For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value)." So -0 is not a null character, as that char is accessed as a non-zero unsigned char.
I know the following code is broken --getchar() returns an int not a char--
#include <stdio.h>
int
main(int argc, char* argv[])
{
char single_byte = getchar();
while (single_byte != EOF) {
single_byte = getchar();
printf("getchar() != EOF is %d.\n", single_byte != EOF);
if (single_byte == EOF)
printf("EOF is implemented in terms of 0x%x.\n", single_byte);
}
return 0;
}
though I would expect that a typical output of it (using /dev/urandom as the input-stream for instance) would have been at last EOF is implemented in terms of 0xff, and not the following
$ ./silly < /dev/urandom
getchar() != EOF is 1.
getchar() != EOF is 1.
// ...
getchar() != EOF is 0
EOF is implemented in terms of 0xffffffff.
Furthermore, 0xffffffff cannot be stored into a single byte ...
Thank you in advance
I know the following code is broken --getchar() returns an int not a char--
Good!
char single_byte = getchar();
This is problematic in more than one way.
I'll assume CHAR_BIT == 8 and EOF == -1. (We know EOF is negative and of type int; -1 is a typical value -- and in fact I've never heard of it having any other value.)
Plain char may be either signed or unsigned.
If it's unsigned, the value of single_byte will be either the value of the character that was just read (represented as an unsigned char and trivially converted to plain char), or the result of converting EOF to char. Typically EOF is -1, and the result of the conversion will be CHAR_MAX, or 255. You won't be able to distinguish between EOF and an actual input value of 255 -- and since /dev/urandom returns all byte values with equal probability (and never runs dry), you'll see a 0xff byte sooner or later.
But that won't terminate your input loop. Your comparison (single_byte == EOF) will never be true; since single_byte is of an unsigned type in this scenario, it can never be equal to EOF. You'll have an infinite loop, even when reading from a finite file rather than from an unlimited device like /dev/urandom. (You could have written (single_byte == (char)EOF), but of course that would not solve the underlying problem.)
Since your loop does terminate, we can conclude that plain char is signed on your system.
If plain char is signed, things are a little more complicated. If you read a character in the range 0..127, its value will be stored in single_byte. If you read a character in the range 128..255, the int value is converted to char; since char is signed and the value is out of range, the result of the conversion is implementation-defined. For most implementations, that conversion will map 128 to -128, 129 to -127, ... 255 to -1. If getchar() returns EOF, which is (typically) -1, the conversion is well defined and yields -1. So again, you can't distinguish between EOF and an input character with the value 255, which was also converted to -1.
(Actually, as of C99, the conversion can also raise an implementation-defined signal. Fortunately, as far as I know, no implementations actually do that.)
if (single_byte == EOF)
printf("EOF is implemented in terms of 0x%x.\n", single_byte);
Again, this condition will be true either if getchar() actually returned EOF or if you just read a character with the value 0xff. The %x format requires an argument of type unsigned int. single_byte is of type char, which will almost certainly be promoted to int. Now you can print an int value with an unsigned int format if the value is within the representable range of both types. But since single_byte's value is -1 (it just compared equal to EOF), it's not in that range. printf, with the "%x" format, assumes that the argument is of type unsigned int (this isn't a conversion). And 0xffffffff is the likely result of taking a 32-bit int value of -1 and assuming that it's really an unsigned int.
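If the goal is just to see the byte that was stored, converting to unsigned char before printing avoids the sign extension. A rough sketch, with format_byte being a name made up for this example:

```c
#include <stdio.h>

/* Writes the byte value of b as hex (for example "0xff") into out.
   (format_byte is a hypothetical helper for illustration.) */
static void format_byte(char b, char *out, size_t n)
{
    /* Convert to unsigned char first so a negative char does not
       sign-extend, then to unsigned int to match %x. */
    snprintf(out, n, "0x%02x", (unsigned int)(unsigned char)b);
}
```

With this, a single_byte holding -1 prints as 0xff rather than 0xffffffff. It still cannot tell you whether that -1 came from EOF or from a 0xff input byte.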
And I'll just note that storing the result of getchar() in an int object would have been a whole lot easier than analyzing what happens when you store it in a char.
EOF (end-of-file) is a macro of type int that expands to a negative integer constant expression (generally -1).
EOF is not a real character, so to let getchar() return either a valid character or EOF, it uses a hack whereby the return type is int. You have to cast it to char after you make sure it is not EOF.
This is a textbook example of poorly designed API.
It appears to be a confusion between (char) -1 and (int) -1.
getchar() returns an int with 1 of 257 different values: 0 to 255 and EOF. EOF is less than 0 (C11 7.21.1).
Typically EOF has the value of -1 and that is so in your case. Let's assume that for the following.
From time to time, when data is read from /dev/urandom, a value of 255 is read. This is not the EOF.
Given that OP performs char single_byte = getchar(), single_byte takes on the same value of (char) -1 if (int) -1 (EOF) was read or if (int) 255 was read.
When next comparing single_byte != EOF, should the result be false, we do not know whether the original return value of getchar() was -1 or 255.
Recommend a different printf()
printf("single_byte==EOF, so (int) 255 or EOF was read: 0x%hhx\n", single_byte);
Assumptions:
char is 8 bits.
EOF is -1.
EOF values are
EOF => %d => -1
EOF => %c => <prints blank space but not blank space>
EOF => %x => 0xFFFFFFFF
There is no ASCII value for EOF, so it never matches any character you type: a blank space is 0x20 (32 in decimal), and pressing Enter yields a newline, 0x0A (10 in decimal). Storing the getchar() result in a char therefore cannot give a reliable comparison with EOF.
So that piece of code will not work as written. Either keep the result in an int, or define your own value to exit the loop.