Difference between int and char in getchar/fgetc and putchar/fputc?

I am trying to learn C on my own and I'm kind of confused with getchar and putchar:
1.

#include <stdio.h>
int main(void)
{
    char c;
    printf("Enter characters : ");
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
    return 0;
}
2.

#include <stdio.h>
int main(void)
{
    int c;
    printf("Enter characters : ");
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
    return 0;
}
The C library function int putchar(int c) writes a character (an unsigned char) specified by the argument char to stdout.
The C library function int getchar(void) gets a character (an unsigned char) from stdin. This is equivalent to getc with stdin as its argument.
Does it mean putchar() accepts both int and char or either of them and for getchar() should we use an int or char?

TL;DR:
char c; c = getchar(); is wrong, broken and buggy.
int c; c = getchar(); is correct.
This applies to getc and fgetc as well, perhaps even more so, because with files one often reads until end-of-file.
Always store the return value of getchar (fgetc, getc...) (and putchar) initially into a variable of type int.
The argument to putchar can be any of int, char, signed char or unsigned char; its type doesn't matter, and all of them work the same, even though one type might result in positive and another in negative integers being passed for characters at and above \200 (128).
The reason why you must use int to store the return value of both getchar and putchar is that when the end-of-file condition is reached (or an I/O error occurs), both of them return the value of the macro EOF, which is a negative integer constant (usually -1).
For getchar, if the return value is not EOF, it is the read unsigned char zero-extended to an int. That is, assuming 8-bit characters, the values returned can be 0...255 or the value of the macro EOF; again assuming 8-bit char, there is no way to squeeze these 257 distinct values into 256 so that each of them could be identified uniquely.
Now, if you stored it into char instead, the effect would depend on whether the character type is signed or unsigned by default! This varies from compiler to compiler, architecture to architecture. If char is signed and assuming EOF is defined as -1, then both EOF and character '\377' on input would compare equal to EOF; they'd be sign-extended to (int)-1.
On the other hand, if char is unsigned (as it is by default on ARM processors, including Raspberry Pi systems, and on AIX too), there is no value that could be stored in c that would compare equal to EOF (-1); instead of breaking out on EOF, your code would print a \377 character for each EOF and loop forever.
The danger here is that with signed chars the code seems to work correctly even though it is still horribly broken: one of the legal input values (\377) is interpreted as EOF. Furthermore, C89, C99 and C11 do not mandate a value for EOF; they only say that EOF is a negative integer constant; thus instead of -1 it could just as well be, say, -224 on a particular implementation, in which case EOF stored into a char would become indistinguishable from a space (32).
gcc has the switch -funsigned-char which can be used to make the char unsigned on those platforms where it defaults to signed:
% cat test.c
#include <stdio.h>
int main(void)
{
    char c;
    printf("Enter characters : ");
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
    return 0;
}
Now we run it with signed char:
% gcc test.c && ./a.out
Enter characters : sfdasadfdsaf
sfdasadfdsaf
^D
%
Seems to be working right. But with unsigned char:
% gcc test.c -funsigned-char && ./a.out
Enter characters : Hello world
Hello world
���������������������������^C
%
That is, I tried to press Ctrl-D there many times but a � was printed for each EOF instead of breaking the loop.
Now, again, in the signed char case the program cannot distinguish between the byte 255 and EOF on Linux, which breaks it for binary data and the like:
% gcc test.c && echo -e 'Hello world\0377And some more' | ./a.out
Enter characters : Hello world
%
Only the first part up to the \0377 escape was written to stdout.
Beware that comparisons between character constants and an int containing the unsigned character value might not work as expected (e.g. the character constant 'ä' in ISO 8859-1 would mean the signed value -28). So, assuming you are writing code that reads input until 'ä' in the ISO 8859-1 codepage, you'd do
int c;
while ((c = getchar()) != EOF) {
    if (c == (unsigned char)'ä') {
        /* ... */
    }
}
Due to integer promotion, all char values fit into an int and are automatically promoted in function calls; thus you can pass any of int, char, signed char or unsigned char to putchar as an argument (though not store its return value in one), and it will work as expected.
The actual value passed in the int might be positive or even negative; for example, the character constant '\377' would be negative on an 8-bit-char system where char is signed; however, putchar (or actually fputc) will convert the value to an unsigned char. C11 7.21.7.3p2:
2 The fputc function writes the character specified by c (converted to an unsigned char) to the output stream pointed to by stream [...]
(emphasis mine)
That is, fputc is guaranteed to convert the given c as if by (unsigned char)c.
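As a small demonstration of that conversion (a sketch, assuming an 8-bit char that is signed by default), both of the following calls write the same 0xFF byte and return 255, even though one of them is handed a negative int:

#include <stdio.h>

int main(void)
{
    /* On a signed-char platform, '\377' is the int value -1; putchar
       converts its argument to unsigned char, so both calls below
       write the same 0xFF byte and return (int)(unsigned char)0xFF. */
    int r1 = putchar('\377');    /* argument may arrive as -1 */
    int r2 = putchar(0xFF);      /* argument arrives as 255 */
    printf("\n%d %d\n", r1, r2); /* prints 255 255 */
    return 0;
}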

Always use int to store the character read by getchar(), as the EOF constant is of type int. If you use char, the comparison against EOF is not reliable.
You can safely pass char to putchar() though as it will be promoted to int automatically.
Note:
Technically using char will work in most cases, but then you can't have the 0xFF character, as it will be interpreted as EOF due to the type conversion. To cover all cases, always use int. As @Ilja put it: int is needed to represent all 256 possible character values plus the EOF, 257 possible values in total, which cannot be stored in a char.
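To see the 257-value point in practice, here is a minimal byte-frequency counter (a sketch, assuming 8-bit char); the int variable is what lets the loop distinguish every byte from EOF:

#include <stdio.h>

int main(void)
{
    unsigned long counts[256] = {0};
    int c;                           /* holds 0..255 *and* EOF */
    while ((c = getchar()) != EOF)
        counts[c]++;                 /* after the EOF test, c is 0..255 */
    for (int i = 0; i < 256; i++)
        if (counts[i] != 0)
            printf("0x%02x: %lu\n", i, counts[i]);
    return 0;
}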

Related

EOF not detected by C on Raspberry Pi

So I was writing a program on my Raspberry Pi Zero to count the frequencies of different word lengths in the input, but the program didn't stop at EOF.
So I tried this to debug:
#include <stdio.h>
#include <stdlib.h>
void main() {
    char c;
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
}
And compiled with this:
gcc test.c && ./a.out <input.txt
It printed out the input text, but then just kept printing question marks until I hit Ctrl+C. When I copied the program over onto my laptop and ran it there, everything worked fine.
I could just finish on the laptop, but I'm curious. Why can't the Pi detect when the file hit EOF?
First couple of facts:
The symbol EOF is a macro that expands to a negative integer constant, typically -1. This constant has the type int.
It's implementation-defined if char is signed or unsigned. The same compiler on different platforms might have different char implementations.
Now for the long explanation about your problem:
When integer types of different sizes are used in arithmetic expressions (and comparison is considered an arithmetic operator), both operands of the expression undergo the usual arithmetic conversions to get a common type (usually int).
For smaller integer types, like for example char, that involves integer promotion to convert it to an int. For this promotion the value of the char needs to be kept intact, so e.g. -1 as a char will still be -1 as an int.
Because of how negative numbers are represented on most systems, the char value of -1 is (in hexadecimal) 0xff. For a signed char, when -1 is converted to an int, it keeps the value -1 (which will be represented as 0xffffffff for a 32-bit int type).
The problem comes when char is unsigned, because then when getchar returns EOF (the value -1) the unsigned char value will be equal to 255 (the unsigned decimal representation of 0xff). And when promoted to an int the value will still be 255. And 255 != -1!
That's why the getchar return type is int and not char. And it is one of the reasons why all character-handling functions use int instead of char.
So to solve your problem, you need to change the type of the variable c to int:
int c;
Then it will work.
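In full, a corrected version of the debug program might look like this (a minimal sketch):

#include <stdio.h>

int main(void)
{
    int c;                           /* int, so EOF (-1) is preserved */
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
    return 0;
}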
getchar returns an int value, not a char value. Since you need some way to recognise, from the single getchar return value, whether you read a regular character or whether there is nothing more to read, someone long ago decided to use int so that a value outside the range of char can be returned to indicate end of file. Change char to int.
getchar must be able to return any ASCII (and extended ASCII) character value between 0 and 255.
In order to distinguish a character from EOF, EOF cannot be a value in this interval, so getchar's return type must have more than 8 bits.
int getchar(void);
So you should write
int c;
while ( (c = getchar()) != EOF) ...

Is it possible to confuse EOF with a normal byte value when using fgetc?

We often use fgetc like this:
int c;
while ((c = fgetc(file)) != EOF)
{
    // do stuff
}
Theoretically, if a byte in the file has the value of EOF, this code is buggy - it will break the loop early and fail to process the whole file. Is this situation possible?
As far as I understand, fgetc internally casts a byte read from the file to unsigned char and then to int, and returns it. This will work if the range of int is greater than that of unsigned char.
What happens if it's not (probably then sizeof(int)=1)?
Will fgetc read a legitimate data equal to EOF from a file sometimes?
Will it alter the data it read from the file to avoid the single value EOF?
Will fgetc be an unimplemented function?
Will EOF be of another type, like long?
I could make my code fool-proof by an extra check:
int c;
for (;;)
{
    c = fgetc(file);
    if (feof(file))
        break;
    // do stuff
}
Is it necessary if I want maximum portability?
Yes, c = fgetc(file); if (feof(file)) does work for maximum portability. It works in general and also when unsigned char and int have the same number of unique values. This occurs on rare platforms where char, signed char, unsigned char, short, unsigned short, int and unsigned all use the same bit width and range.
Note that feof(file) alone is insufficient. Code should also check for ferror(file).
int c;
for (;;)
{
    c = fgetc(file);
    if (c == EOF) {
        if (feof(file)) break;
        if (ferror(file)) break;
    }
    // do stuff
}
The C specification says that int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is nonstandard.
The C specification also says that EOF is a negative int constant and that fgetc returns "an unsigned char converted to an int" in the event of a successful read. Since unsigned char can't have a negative value, the value of EOF can be distinguished from anything read from the stream.*
*See below for a loophole case in which this fails to hold.
Relevant standard text (from C99):
§5.2.4.2.1 Sizes of integer types <limits.h>:
[The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
[...]
minimum value for an object of type int: INT_MIN -32767
maximum value for an object of type int: INT_MAX +32767
§7.19.1 <stdio.h> - Introduction
EOF ... expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
§7.19.7.1 The fgetc function
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined)
If UCHAR_MAX ≤ INT_MAX, there is no problem: all unsigned char values will be converted to non-negative integers, so they will be distinct from EOF.
Now, there is a funny sort of loophole here: if a system has UCHAR_MAX > INT_MAX, then it is legally allowed to convert values greater than INT_MAX to negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent that value is implementation-defined), making it possible for a character read from a stream to be converted to EOF.
Systems with CHAR_BIT > 8 do exist (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure if they are broken with respect to EOF and stream functions.
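One way to find out whether a given implementation is exposed to this loophole is a compile-time check (a small sketch; both macros come from <limits.h>):

#include <limits.h>
#include <stdio.h>

int main(void)
{
#if UCHAR_MAX > INT_MAX
    puts("unsigned char does not fit in int: a byte may convert to EOF;");
    puts("check feof()/ferror() whenever fgetc() returns EOF");
#else
    puts("every unsigned char converts to a non-negative int: EOF is unambiguous");
#endif
    return 0;
}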
NOTE: chux's answer is the correct one in the most general case. I'm leaving this answer up because I believe both the answer and the discussion in the comments are valuable in understanding the (rare) situations in which chux's approach is necessary.
EOF is guaranteed to have a negative value (C99 7.19.1), and as you mentioned, fgetc reads its input as an unsigned char before converting to int. So those by themselves guarantee that EOF can't be read from a file.
As for your specific questions:
fgetc can't read a legitimate datum equal to EOF. In the file, there's no such thing as signed or unsigned; it's just bit sequences. It's C that interprets 1000 1111 differently depending on whether it's being treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.
Addendum: It can't read EOF for the unsigned char part, but when it converts the unsigned char to an int, if the int is not capable of representing all values of the unsigned char, then the behavior is implementation-defined (6.3.1.3).
fgetc is required by the standard for hosted implementations, but freestanding implementations are permitted to omit most of the standard library functions (some are apparently required, but I couldn't find the list.)
EOF won't require a long, since fgetc needs to be able to return it and fgetc returns an int.
As far as altering the data goes, it can't change the value exactly, but since fgetc is specified to read "characters" from the file as opposed to chars, it could potentially read in 8 bits at a time even if the system otherwise defines CHAR_BIT to be 16 (which is the minimum value it could have if sizeof(int) == 1, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2). In that case, the input character would be converted to an unsigned char that just always has its high bits 0. Then it could make the conversion to int without losing precision. (In practice, this just won't come up, since machines don't generally have 16-bit bytes.)

type casting in K&R

K&R provide this getchar() example:
int getchar(void)
{
    char c;
    return (read(0, &c, 1) == 1) ? (unsigned char) c : EOF;
}
c is cast to unsigned char here to avoid sign extension issues, but in the fputs() example...
int fputs(char *s, FILE *iop)
{
    int c;
    while (c = *s++)
        putc(c, iop);
    return ferror(iop) ? EOF : 0;
}
*s is assigned to an int without first casting to an unsigned char. Why is the cast unnecessary this time?
It is not about "sign extension issues". This implementation of getchar makes sure that all successfully read characters are returned as non-negative int values. This behavior is required by the specification of getchar, which literally says that the character read is returned as an unsigned char converted to an int, even if char is signed on the given platform. What you see there is basically a direct implementation of the getchar spec.
Meanwhile, fputs does not return any specific character values. fputs does not return c to the user; that c is a purely internal variable. It should preserve the original value of the char type on the given platform, since the value of c is then passed to putc. putc does not expect character values converted to a non-negative range; it expects original character values, which could easily be negative if char is signed.
BTW, why did you look at fputs, and not fputc? If you look at fputc, which just like getchar returns a character value, you will probably see that it is implemented similarly to getchar in that regard.
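For comparison, a write-side counterpart in the same spirit as the K&R getchar example might look like the following (a hypothetical sketch built on the POSIX write call; the real fputc of course takes a FILE *, not a fixed descriptor):

#include <unistd.h>

/* my_putchar: hypothetical putchar-style wrapper over write(2) */
int my_putchar(int c)
{
    unsigned char byte = (unsigned char) c;        /* the conversion the spec requires */
    return (write(1, &byte, 1) == 1) ? byte : -1;  /* byte as written, or EOF-like -1 */
}

Like the K&R getchar, it funnels the value through unsigned char, so the successful return is always non-negative.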
I completely misunderstood the question the first time around. The problem here is that getchar() needs to return either a char in the entire range 0-255 or EOF. On most platforms EOF = -1. In order to return both a negative value and a char, int must be used.
This is not the case in fputs. In this example, a char is being assigned to an int in the while loop. "Lower" types are promoted to "higher" types. From page 44 of K&R, The C Programming Language:
If either operand is a long double, convert the other to a long double.
Otherwise, if either operand is a double, convert the other to a double.
Otherwise, if either operand is a float, convert the other to a float.
Otherwise, convert char and short to int
Then, if either operand is a long, convert the other to long.
According to the man page,
The fputc() function writes the character c (converted to an unsigned char) to the output stream pointed to by stream.
The cast is specifically performed for you inside the function.
Aside from that, assigning a (possibly negative) char to an int and back to char is guaranteed to produce the correct result, and char to int to unsigned char is guaranteed to have the same result as a direct cast from char to unsigned char. Other out-of-range conversions to signed types are implementation-defined, but most platforms handle them by quiet binary truncation, in such a way that many programmers never worry about it at all.
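A minimal sketch of that round-trip guarantee (assuming an 8-bit char that is signed by default):

#include <stdio.h>

int main(void)
{
    char original = '\x80';                       /* negative if char is signed */
    int promoted = original;                      /* value-preserving promotion */
    char back = (char) promoted;                  /* round-trips to the original value */
    unsigned char u1 = (unsigned char) promoted;  /* same result as ...  */
    unsigned char u2 = (unsigned char) original;  /* ... the direct cast */
    printf("%d %d %u %u\n", original, back, (unsigned) u1, (unsigned) u2);
    return 0;
}

On a typical signed-char platform this prints -128 -128 128 128.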

About storing getchar() returned value inside a char-variable

I know the following code is broken --getchar() returns an int not a char--
#include <stdio.h>
int
main(int argc, char* argv[])
{
    char single_byte = getchar();
    while (single_byte != EOF) {
        single_byte = getchar();
        printf("getchar() != EOF is %d.\n", single_byte != EOF);
        if (single_byte == EOF)
            printf("EOF is implemented in terms of 0x%x.\n", single_byte);
    }
    return 0;
}
though I would have expected that a typical run (using /dev/urandom as the input stream, for instance) would eventually print EOF is implemented in terms of 0xff, and not the following:
$ ./silly < /dev/urandom
getchar() != EOF is 1.
getchar() != EOF is 1.
// ...
getchar() != EOF is 0.
EOF is implemented in terms of 0xffffffff.
Furthermore, 0xffffffff cannot be stored into a single byte ...
Thank you in advance
I know the following code is broken --getchar() returns an int not a char--
Good!
char single_byte = getchar();
This is problematic in more than one way.
I'll assume CHAR_BIT == 8 and EOF == -1. (We know EOF is negative and of type int; -1 is a typical value -- and in fact I've never heard of it having any other value.)
Plain char may be either signed or unsigned.
If it's unsigned, the value of single_byte will be either the value of the character that was just read (represented as an unsigned char and trivially converted to plain char), or the result of converting EOF to char. Typically EOF is -1, and the result of the conversion will be CHAR_MAX, or 255. You won't be able to distinguish between EOF and an actual input value of 255 -- and since /dev/urandom returns all byte values with equal probability (and never runs dry), you'll see a 0xff byte sooner or later.
But that won't terminate your input loop. Your comparison (single_byte == EOF) will never be true; since single_byte is of an unsigned type in this scenario, it can never be equal to EOF. You'll have an infinite loop, even when reading from a finite file rather than from an unlimited device like /dev/urandom. (You could have written (single_byte == (char)EOF), but of course that would not solve the underlying problem.)
Since your loop does terminate, we can conclude that plain char is signed on your system.
If plain char is signed, things are a little more complicated. If you read a character in the range 0..127, its value will be stored in single_byte. If you read a character in the range 128..255, the int value is converted to char; since char is signed and the value is out of range, the result of the conversion is implementation-defined. For most implementations, that conversion will map 128 to -128, 129 to -127, ... 255 to -1. If getchar() returns EOF, which is (typically) -1, the conversion is well defined and yields -1. So again, you can't distinguish between EOF and an input character with the value -1.
(Actually, as of C99, the conversion can also raise an implementation-defined signal. Fortunately, as far as I know, no implementations actually do that.)
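Here is a small sketch of that mapping, assuming the typical two's-complement behavior just described:

#include <stdio.h>

int main(void)
{
    for (int v = 126; v <= 130; v++) {
        char c = (char) v;             /* implementation-defined once v exceeds CHAR_MAX */
        printf("%3d -> %4d\n", v, c);  /* typically 128 -> -128, 129 -> -127, ... */
    }
    return 0;
}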
if (single_byte == EOF)
    printf("EOF is implemented in terms of 0x%x.\n", single_byte);
Again, this condition will be true either if getchar() actually returned EOF or if you just read a character with the value 0xff. The %x format requires an argument of type unsigned int. single_byte is of type char, which will almost certainly be promoted to int. Now you can print an int value with an unsigned int format if the value is within the representable range of both types. But since single_byte's value is -1 (it just compared equal to EOF), it's not in that range. printf, with the "%x" format, assumes that the argument is of type unsigned int (this isn't a conversion). And 0xffffffff is the likely result of taking a 32-bit int value of -1 and assuming that it's really an unsigned int.
And I'll just note that storing the result of getchar() in an int object would have been a whole lot easier than analyzing what happens when you store it in a char.
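For reference, the question's loop rewritten with an int, and with a cast so the %x argument has the right type (a sketch):

#include <stdio.h>

int main(void)
{
    int single_byte;                    /* int keeps EOF distinct from 0xff */
    while ((single_byte = getchar()) != EOF) {
        /* single_byte is a genuine 0..255 value here */
    }
    printf("Stopped on EOF, which is %d (0x%x as unsigned int).\n",
           EOF, (unsigned int) EOF);
    return 0;
}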
End-of-File (EOF) is a macro that expands to a negative integral constant expression of type int (generally, -1).
EOF is not a real character, so in order to allow getchar() to return either a valid character or EOF, the return type is widened to int. You should cast the result to char only after you have made sure it is not EOF.
This is a textbook example of a poorly designed API.
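In code, that order of operations looks like this (a minimal sketch):

int c = getchar();           /* keep the int result first */
if (c != EOF) {
    char ch = (char) c;      /* narrow only after the EOF check */
    putchar(ch);
}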
It appears to be a confusion between (char) -1 and (int) -1.
getchar() returns an int with 1 of 257 different values: 0 to 255 and EOF. EOF is less than 0 (C11 7.21.1).
Typically EOF has the value of -1 and that is so in your case. Let's assume that for the following.
From time to time, when data is read from /dev/urandom, a value of 255 is read. This is not the EOF.
Given that OP performs char single_byte = getchar(), single_byte takes on the same value, (char) -1, whether (int) -1 (EOF) or (int) 255 was read.
When next comparing single_byte != EOF, should the result be false, we do not know whether the original return value of getchar() was -1 or 255.
Recommend a different printf()
printf("single_byte==EOF, so (int) 255 or EOF was read: 0x%hhx\n", single_byte);
Assumptions:
char is 8 bits.
EOF is -1.
EOF values are:
EOF => %d => -1
EOF => %c => <prints what looks like a blank space, but is not a blank space>
EOF => %x => 0xFFFFFFFF
There is no ASCII value for EOF! So basically you cannot compare the getchar() output against EOF as if it were a character. The reason: if you type a blank space and press Enter, the ASCII value of a blank space is 0x20 (32 in decimal), and if you just press Enter, the ASCII value of carriage return is 0x0D (13 in decimal).
So that piece of code will not work! Either that, or you have to define a sentinel value to exit the code!

are int and char represented using the same bits internally by gcc?

I was playing around with Unicode characters (without using wchar_t support), just for fun. I'm only using the regular char data type. I noticed that when printing them in hex they showed up as full 4-byte values instead of just one byte.
For example, consider this C file:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    char *s = (char *) malloc(100);
    fgets(s, 100, stdin);
    while (s && *s != '\0') {
        printf("%x\n", *s);
        s++;
    }
    return 0;
}
After compiling with gcc and giving the 'cent' symbol (UTF-8 hex: c2 a2) as input, I get the following output:
$ ./a.out
¢
ffffffc2: ?
ffffffa2: ?
a:
So instead of just printing c2 and a2 I got the whole 4 bytes as if it's an int type.
Does this mean char is not really 1 byte in length, and ASCII just made it look like 1 byte?
Maybe the reason why the upper three bytes become 0xFFFFFF needs a bit more explanation?
The upper three bytes of the value printed for *s have a value of 0xFF due to sign extension.
The char value passed to printf is extended to an int before the call to printf.
This is due to C's default behaviour.
In the absence of signed or unsigned, the compiler can default to interpreting char as signed char or unsigned char. It is consistently one or the other unless explicitly changed with a command-line option or pragmas. In this case we can see that it is signed char.
In the absence of more information (prototypes or casts), C passes:
int: char, short, unsigned char and unsigned short are converted to int. It never passes a char, unsigned char or signed char as a single byte; it always passes an int.
unsigned int: the same size as int, so the value is passed without change.
The compiler needs to decide how to convert the smaller value to an int.
signed values: the upper bytes of the int are sign extended from the smaller value, which effectively copies the top, sign bit, upwards to fill the int. If the top bit of the smaller signed value is 0, the upper bytes are filled with 0. If the top bit of the smaller signed value is 1, the upper bytes are filled with 1. Hence printf("%x ",*s) prints ffffffc2
unsigned values are not sign extended, the upper bytes of the int are 'zero padded'
Hence the reason C can call a function without a prototype (though the compiler will usually warn about that)
So you can write, and expect this to run (though I would hope your compiler issues warnings):
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */
int main (int argc, const char * argv[]) {
    signed char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";
    printf("schar[0]=%x schar[1]=%x uchar[0]=%x uchar[1]=%x\n",
           schar[0], schar[1], uchar[0], uchar[1]);
    return 0;
}
That prints:
schar[0]=70 schar[1]=ffffff80 uchar[0]=70 uchar[1]=80
The char value is interpreted by my (Mac's gcc) compiler as signed char, so the compiler generates code to sign extend the char to an int before the printf call.
Where the signed char value has its top (sign) bit set (\x80), the conversion to int sign extends the char value. The sign extension fills in the upper bytes (in this case 3 more bytes to make a 4 byte int) with 1's, which get printed by printf as ffffff80
Where the signed char value has its top (sign) bit clear (\x70), the conversion to int still sign extends the char value. In this case the sign is 0, so the sign extension fills in the upper bytes with 0's, which get printed by printf as 70
My example shows the case where the value is unsigned char. In these two cases the value is not sign extended, because the value is unsigned. Instead it is extended to int with 0 padding. It might look like printf is only printing one byte because the adjacent three bytes of the value are 0. But it is printing the entire int; it just happens that the values are 0x00000070 and 0x00000080, because the unsigned char values were converted to int without sign extension.
You can force printf to only print the low byte of the int, by using suitable formatting (%hhx), so this correctly prints only the value in the original char:
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */
int main (int argc, const char * argv[]) {
    char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";
    printf("schar[0]=%hhx schar[1]=%hhx uchar[0]=%hhx uchar[1]=%hhx\n",
           schar[0], schar[1], uchar[0], uchar[1]);
    return 0;
}
This prints:
schar[0]=70 schar[1]=80 uchar[0]=70 uchar[1]=80
because printf interprets the %hhx to treat the int as an unsigned char. This does not change the fact that the char was sign extended to an int before printf was called. It is only a way to tell printf how to interpret the contents of the int.
In a way, for signed char *schar, the meaning of %hhx looks slightly misleading, but the %x format interprets the int as unsigned anyway, and (with my printf) there is no format to print hex for signed values (IMHO it would be confusing).
Sadly, ISO/ANSI/... don't freely publish our programming language standards, so I can't point to the specification, but searching the web might turn up working drafts. I haven't tried to find them. I would recommend "C: A Reference Manual" by Samuel P. Harbison and Guy L. Steele as a cheaper alternative to the ISO document.
HTH
No. printf is a variadic function; arguments to a variadic function are promoted to int. And in this case the char was negative, so it gets sign-extended.
%x tells printf to treat the value as an unsigned int. The char has already been promoted (with sign extension) to int by the time printf sees it; %x just prints the resulting bits as an unsigned value.
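A fixed version of the question's program, for reference (a sketch; converting through unsigned char stops the sign extension from leaking into the printed value):

#include <stdio.h>

int main(void)
{
    char s[100];
    if (fgets(s, sizeof s, stdin) == NULL)
        return 1;
    for (char *p = s; *p != '\0'; p++)
        printf("%02x\n", (unsigned char) *p);   /* prints c2 and a2 for the cent sign */
    return 0;
}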
