Can sizeof(int) ever be 1 on a hosted implementation?

My view is that a C implementation cannot satisfy the specification of certain stdio functions (particularly fputc/fgetc) if sizeof(int)==1, since the int needs to be able to hold any possible value of unsigned char or EOF (-1). Is this reasoning correct?
(Obviously sizeof(int) cannot be 1 if CHAR_BIT is 8, due to the minimum required range for int, so we're implicitly talking only about implementations with CHAR_BIT>=16, for instance DSPs, where a typical implementation would be freestanding rather than hosted, and thus not required to provide stdio.)
Edit: After reading the answers and some linked references, some thoughts on ways it might be valid for a hosted implementation to have sizeof(int)==1:
First, some citations:
7.19.7.1(2-3):
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF.
7.19.8.1(2):
The fread function reads, into the array pointed to by ptr, up to nmemb elements whose size is specified by size, from the stream pointed to by stream. For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object. The file position indicator for the stream (if defined) is advanced by the number of characters successfully read.
Thoughts:
Reading back unsigned char values outside the range of int could simply have implementation-defined behavior in the implementation. This is somewhat unsettling, as it means that using fwrite and fread to store binary structures (which, while it results in nonportable files, is supposed to be an operation you can perform portably on any single implementation) could appear to work but silently fail. I accept that an implementation might not have a usable filesystem, but it's a lot harder to accept that an implementation could have a filesystem that misbehaves as soon as you try to use it, with no way to determine that it's unusable. Now that I realize the behavior is implementation-defined and not undefined, it's not quite so unsettling, and I think this might be a valid (although undesirable) implementation.
An implementation with sizeof(int)==1 could simply define the filesystem to be empty and read-only. Then there would be no way an application could read any data written by itself, only from an input device on stdin, which could be implemented so as to give only positive char values that fit in int.
Edit (again): From the C99 Rationale, 7.4:
EOF is traditionally -1, but may be any negative integer, and hence distinguishable from any valid character code.
This seems to indicate that sizeof(int) may not be 1, or at least that such was the intention of the committee.

It is possible for an implementation to meet the interface requirements for fgetc and fputc even if sizeof(int) == 1.
The interface for fgetc says that it returns the character read as an unsigned char converted to an int. Nowhere does it say that this value cannot be EOF even though the expectation is clearly that valid reads "usually" return positive values. Of course, fgetc returns EOF on a read failure or end of stream but in these cases the file's error indicator or end-of-file indicator (respectively) is also set.
Similarly, nowhere does it say that you can't pass EOF to fputc so long as that happens to coincide with the value of an unsigned char converted to an int.
Obviously the programmer has to be very careful on such platforms. For example, this might not do a full copy:
void Copy(FILE *out, FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF)
        fputc(c, out);
}
Instead, you would have to do something like (not tested!):
void Copy(FILE *out, FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF || (!feof(in) && !ferror(in)))
        fputc(c, out);
}
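The extra feof/ferror test distinguishes a genuine end-of-file or read error (where fgetc also sets the corresponding indicator for the stream) from a successfully read character whose unsigned char to int conversion happens to compare equal to EOF.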
Of course, platforms where you will have real problems are those where sizeof(int) == 1 and the conversion from unsigned char to int is not an injection. I believe that this would necessarily be the case on platforms using sign-and-magnitude or ones' complement representation of signed integers: both representations encode zero twice, so an N-bit int can represent at most 2^N - 1 distinct values, one fewer than the 2^N values of an N-bit unsigned char.

I remember this exact same question on comp.lang.c some 10 or 15 years ago. Searching for it, I've found a more current discussion here:
http://groups.google.de/group/comp.lang.c/browse_thread/thread/9047fe9cc86e1c6a/cb362cbc90e017ac
I think there are two resulting facts:
(a) There can be implementations where strict conformance is not possible, e.g. sizeof(int)==1 with ones' complement or sign-magnitude negative values, or padding bits in the int type; i.e. not all unsigned char values can be converted to a valid int value.
(b) The typical idiom ((c=fgetc(in))!=EOF) is not portable (except when CHAR_BIT==8), as EOF is not required to be a separate value.

I don't believe the C standard directly requires that EOF be distinct from any value that could be read from a stream. At the same time, it does seem to take for granted that it will be. Some parts of the standard have conflicting requirements that I doubt can be met if EOF is a value that could be read from a stream.
For example, consider ungetc. On one hand, the specification says (§7.19.7.11):
The ungetc function pushes the character specified by c (converted to an unsigned char) back onto the input stream pointed to by stream. Pushed-back characters will be returned by subsequent reads on that stream in the reverse order of their pushing.
[ ... ]
One character of pushback is guaranteed.
On the other hand, it also says:
If the value of c equals that of the macro EOF, the operation fails and the input stream is unchanged.
So, if EOF is a value that could be read from the stream, and (for example) we do read from the stream, and immediately use ungetc to put EOF back into the stream, we get a conundrum: the call is "guaranteed" to succeed, but also explicitly required to fail.
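To make the conflict concrete, here is a hypothetical sketch (assume a platform where a successfully read character converts to the value of EOF; the function name is illustrative):

#include <stdio.h>

/* Hypothetical: fgetc succeeded (no indicator is set), but the character,
   converted from unsigned char to int, compares equal to EOF. */
void conundrum(FILE *stream)
{
    int c = fgetc(stream);
    if (c == EOF && !feof(stream) && !ferror(stream)) {
        /* One character of pushback is "guaranteed" to succeed, yet
           because c equals EOF, ungetc is explicitly required to fail. */
        int r = ungetc(c, stream);
        (void)r;
    }
}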
Unless somebody can see a way to reconcile these requirements, I'm left with considerable doubt as to whether such an implementation can conform.
In case anybody cares, N1548 (the current draft of the new C standard) retains the same requirements.

Would it not be sufficient if a nominal char which shared a bit pattern with EOF was defined as nonsensical? If, for instance, CHAR_BIT were 16 but all the allowed values occupied only the 15 least significant bits (assume a two's complement or sign-magnitude int representation). Or must everything representable in a char have meaning as such? I confess I don't know.
Sure, that would be a weird beast, but we're letting our imaginations go here, right?
R.. has convinced me that this won't hold together. Because a hosted implementation must implement stdio.h, and if fwrite is to be able to stick integers on the disk, then fgetc could return any bit pattern that would fit in a char, and that must not interfere with returning EOF. QED.

I think you are right. Such an implementation cannot distinguish a legitimate unsigned char value from EOF when using fgetc/fputc on binary streams.
If there are such implementations (this thread seems to suggest there are), they are not strictly conforming. It is possible to have a freestanding implementation with sizeof (int) == 1.
A freestanding implementation (C99 4) only needs to support the features from the standard library as specified in these headers: <float.h>, <iso646.h>, <limits.h>, <stdarg.h>, <stdbool.h>, <stddef.h>, and <stdint.h>. (Note: no <stdio.h>.) Freestanding might make more sense for a DSP or other embedded device anyway.

I'm not so familiar with C99, but I don't see anything that says fgetc must produce the full range of values of char. The obvious way to implement stdio on such a system would be to put 8 bits in each char, regardless of its capacity. The requirement on EOF is:
EOF
which expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
The situation is analogous to wchar_t and wint_t. In 7.24.1/2-3 defining wint_t and WEOF, footnote 278 says
wchar_t and wint_t can be the same integer type.
which would seem to suggest that "soft" range checking is sufficient to ensure that EOF is not in the character set.
Edit:
This wouldn't allow binary streams, since in such a case fputc and fgetc are required to perform no transformation (7.19.2/3). Binary streams are not optional; only their distinctness from text streams is optional. So it would appear that this renders such an implementation noncompliant. It would still be perfectly usable, though, as long as you don't attempt to write binary data outside the 8-bit range.

You are assuming that the EOF cannot be an actual character in the character set.
If you allow this, then sizeof(int) == 1 is OK.

The TI C55x compiler I am using has a 16-bit char and a 16-bit int, and does include a standard library. The library merely assumes an eight-bit character set, so that a char of value > 255 is not defined as a character; and when writing to an 8-bit stream device, the most significant 8 bits are discarded: for example, when writing to the UART, only the lower 8 bits are transferred to the shift register and output.
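A minimal sketch of how one might make that truncation explicit in the source rather than relying on the hardware to discard bits (the helper name is illustrative, not part of the TI library):

#include <stdio.h>

/* On a 16-bit-char target whose output device transfers only the low
   8 bits, masking documents the truncation instead of hiding it. */
void put_octet(int c, FILE *stream)
{
    fputc(c & 0xFF, stream);
}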

Related

Why does snprintf() take a size_t size limit, but returns an int number of chars printed?

The venerable snprintf() function...
int snprintf( char *restrict buffer, size_t bufsz, const char *restrict format, ... );
It returns the number of characters it prints, or rather, the number it would have printed had it not been for the buffer size limit. It takes the size of the buffer in characters/bytes.
How does it make sense for the buffer size to be size_t, but for the return type to be only an int?
If snprintf() is supposed to be able to print more than INT_MAX characters into the buffer, surely it must return an ssize_t or a size_t with (size_t)-1 indicating an error, right?
And if it is not supposed to be able to print more than INT_MAX characters, why is bufsz a size_t rather than, say, an unsigned or an int? Or - is it at least officially constrained to hold values no larger than INT_MAX?
printf predates the existence of size_t and similar "portable" types -- when printf was first standardized, the result of a sizeof was an int.
This is also why the argument read from the printf argument list for a * width or precision in the format is an int rather than a size_t.
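For illustration, the argument consumed for a * width has type int:

#include <stdio.h>

int main(void)
{
    int width = 10;              /* the * width argument has type int */
    printf("%*d\n", width, 42);  /* prints 42 right-justified in 10 columns */
    return 0;
}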
snprintf is more recent, so the size it takes as an argument was defined to be a size_t, but the return value was kept as an int to make it the same as printf and sprintf.
Note that you can print more than INT_MAX characters with these functions, but if you do, the return value is unspecified. On most platforms, an int and a size_t will both be returned in the same way (in the primary return value register), it is just that a size_t value may be out of range for an int. So many platforms actually return a size_t (or ssize_t) from all of these routines and things being out of range will generally work out ok, even though the standard does not require it.
The discrepancy between size and return has been discussed in the standards group in the thread https://www.austingroupbugs.net/view.php?id=761. Here is the conclusion posted at the end of that thread:
Further research has shown that the behavior when the return value would overflow int was clarified by WG14 in C99 by adding it into the list of undefined behaviors in Annex J. It was updated in C11 to the following text:
"J.2 Undefined behavior
The behavior is undefined in the following circumstances:
[skip]
— The number of characters or wide characters transmitted by a formatted output function (or written to an array, or that would have been written to an array) is greater than INT_MAX (7.21.6.1, 7.29.2.1)."
Please note that this description does not mention the size argument of snprintf or the size of the buffer.
How does it make sense for the buffer size to be size_t, but for the return type to be only an int?
The official C99 rationale document does not discuss these particular considerations, but presumably it's for consistency and (separate) ideological reasons:
all of the printf-family functions return an int with substantially the same significance. This was defined (for the original printf, fprintf, and sprintf) well before size_t was invented.
type size_t is in some sense the correct type for conveying sizes and lengths, so it was used for the second arguments to snprintf and vsnprintf when those were introduced (along with size_t itself) in C99.
If snprintf() is supposed to be able to print more than INT_MAX characters into the buffer, surely it must return an ssize_t or a size_t with (size_t)-1 indicating an error, right?
That would be a more internally-consistent design choice, but nope. Consistency across the function family seems to have been chosen instead. Note that none of the functions in this family have documented limits on the number of characters they can output, and their general specification implies that there is no inherent limit. Thus, they all suffer from the same issue with very long outputs.
And if it is not supposed to be able to print more than INT_MAX characters, why is bufsz a size_t rather than, say, an unsigned or an int? Or - is it at least officially constrained to hold values no larger than INT_MAX?
There is no documented constraint on the value of the second argument, other than the implicit one that it must be representable as a size_t. Not even in the latest version of the standard. But note that there is also nothing that says that type int cannot represent all the values that are representable by size_t (though indeed it can't in most implementations).
So yes, implementations will have trouble behaving according to the specifications when very large data are output via these functions, where "very large" is implementation-dependent. As a practical matter, then, one should not rely on using them to emit very large outputs in a single call (unless one intends to ignore the return value).
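In that spirit, a minimal sketch of defensive use (the helper name and format string are illustrative): check for a negative return (error) and for truncation before trusting the count.

#include <stdio.h>

/* Returns the number of characters written, or -1 on error or truncation. */
int format_value(char *buf, size_t bufsz, int value)
{
    int n = snprintf(buf, bufsz, "value=%d\n", value);
    if (n < 0)
        return -1;              /* output or encoding error */
    if ((size_t)n >= bufsz)
        return -1;              /* result did not fit; buf holds a truncated copy */
    return n;
}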
If snprintf() is supposed to be able to print more than INT_MAX characters into the buffer, surely it must return an ssize_t or a size_t with (size_t)-1 indicating an error, right?
Not quite.
C also has an Environmental limit for fprintf() and friends.
"The number of characters that can be produced by any single conversion shall be at least 4095." (C17dr § 7.21.6.1 15)
Anything over 4095 per % conversion risks portability, and so int, even at 16 bits (INT_MAX = 32767), suffices for most purposes in portable code.
Note: ssize_t is not part of the C spec; it comes from POSIX.

C Language: Why int variable can store char?

I am recently reading The C Programming Language by Kernighan.
There is an example that defines a variable as int but uses getchar() to store a character in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (such as when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
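Hence the canonical reading loop, which is reliable on the common systems just described:

#include <stdio.h>

int main(void)
{
    int c;                          /* int, not char, so EOF stays distinguishable */
    while ((c = getchar()) != EOF)  /* test for end of file before any narrowing */
        putchar(c);                 /* c can now safely be treated as a character */
    return 0;
}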
Storing an int into a char has implementation-defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for the many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function and the philosophy back then was closer to how the language gets translated to assembly than type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register - therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your implementation's <limits.h>.
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (which hardware can work with); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first: CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
If memory serves me right, arithmetic on a char actually happens in int: the char is promoted before you add or subtract. So, to convert all LC letters to UC, you can do something like:
char letter;
...
if (letter >= 'a' && letter <= 'z') {   /* letter is lower-case (ASCII) */
    letter = (int) letter - 32;         /* 'a' - 'A' == 32 in ASCII */
}
Some C compilers may warn when the int result is stored back into a char without a cast, since the conversion can narrow the value.
But, in the end, the type 'char' is just another term for int, really, since ASCII assigns a unique integer to each letter.

Can an implementation that has sizeof (int) == 1 "fully conform"? [duplicate]

This question already has answers here: Can sizeof(int) ever be 1 on a hosted implementation? (8 answers)
According to the C standard, any characters returned by fgetc are returned in the form of unsigned char values, "converted to an int" (that quote comes from the C standard, stating that there is indeed a conversion).
When sizeof (int) == 1, many unsigned char values are outside the range of int. It is thus possible that some of those unsigned char values might end up being converted to an int value (the result of the conversion being "implementation-defined or an implementation-defined signal is raised") equal to EOF, which would be returned despite the file not actually being in an error or end-of-file state.
I was surprised to find that such an implementation actually exists. The TMS320C55x CCS manual documents UCHAR_MAX having a corresponding value of 65535, INT_MAX having 32767, fputs and fopen supporting binary mode... What's even more surprising is that it seems to describe the environment as a fully conforming, complete implementation (minus signals).
The C55x C/C++ compiler fully conforms to the ISO C standard as defined by the ISO specification ...
The compiler tools come with a complete runtime library. All library functions conform to the ISO C library standard. ...
Is such an implementation that can return a value indicating errors where there are none, really fully conforming? Could this justify using feof and ferror in the condition section of a loop (as hideous as that seems)? For example, while ((c = fgetc(stdin)) != EOF || !(feof(stdin) || ferror(stdin))) { ... }
The function fgetc() returns an int value in the range of unsigned char only when a proper character is read; otherwise it returns EOF, which is a negative value of type int.
My original answer (I have since changed it) assumed that there was an integer conversion to int, but this is not the case: the function fgetc() is already returning a value of type int.
I think that, to be conforming, the implementation has to make fgetc() return nonnegative values in the range of int, unless EOF is returned.
In this way, the range of values from 32768 to 65535 would never be associated with character codes in the TMS320C55x implementation.

How many bits are read by fgetc in a stream?

How many bits are read by the function fgetc in a stream?
The man page of fgetc says that this function reads a "character", but "character" is not a clear definition to me. How many bits does a "character" contain? Is reading a character with fgetc equivalent to reading a byte?
Does it depend on the architecture of the machine and on the size of "char" or "byte"?
My objective is to read binary data from a stream portably (byte = 8 bits or byte = 16 bits). Is it a better idea to use fread/fwrite with types like uintN_t instead of fgetc, in order to control how many bits are read from the stream? Is there a better solution?
How many bits does a "character" contain?
A character contains precisely CHAR_BIT bits, an implementation-specific value defined in limits.h.
/* Number of bits in a `char'. */
# define CHAR_BIT 8
Is reading a character with fgetc equivalent to reading a byte?
Yup, fgetc reads exactly one byte.
This portability problem isn't easily solvable. The best way around it is to not make assumptions about the binary representation.
fgetc reads exactly one byte. A character type (signed char, char, unsigned char, and qualified versions) contains CHAR_BIT bits (<limits.h>), a constant that is at least 8.
Your platform has a smallest unit of data, which corresponds to the C data type char. All I/O happens in units of chars. You are guaranteed that a char can hold the values 0–127, and either 0–255 or −127–127. Everything else is platform-specific. (The actual number of bits inside a char is contained in the macro CHAR_BIT.)
That said, as long as you only write and read values within the advertised range into each char, you are guaranteed that your program will work on any conforming platform. The only thing you are not guaranteed is that the resulting data stream will be binarily identical.
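As a sketch of avoiding assumptions about the binary representation (assuming a host where CHAR_BIT == 8, and using a hypothetical helper name), one can read a 16-bit big-endian value one byte at a time:

#include <stdio.h>
#include <stdint.h>

/* Reads a 16-bit big-endian value byte by byte; returns 1 on success,
   0 on end-of-file or error. Each fgetc yields one byte (0-255) here. */
int read_u16_be(FILE *fp, uint16_t *out)
{
    int hi = fgetc(fp);
    int lo = fgetc(fp);
    if (hi == EOF || lo == EOF)
        return 0;
    *out = (uint16_t)(((unsigned)hi << 8) | (unsigned)lo);
    return 1;
}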

What is the output of fgetc in the special case where int width == CHAR_BIT?

In section 7.19.7.1 of C99, we have:
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
As I understand it, the int type can have the same width as unsigned char. In such a case, can we conclude that fgetc would only function correctly if int width > CHAR_BIT?
(with reference to the comment by blagovest), does C99 specify when the standard library is to be expected, or whether a conforming implementation can implement part but not all of the standard library?
fgetc returns EOF on an end-of-file or error condition.
Otherwise, it returns the character that was read, as an unsigned char, converted to int.
Suppose CHAR_BIT == 16 and sizeof (int) == 1, and suppose the next character read has the value 0xFFFF. Then fgetc() will return 0xFFFF converted to int.
Here it gets a little tricky. Since 0xFFFF can't be represented in type int, the result of the conversion is implementation-defined. But typically, the result will be -1, which is a typical value (in fact, the only value I've ever heard of) for EOF.
So on such a system, fgetc() can return EOF even if it successfully reads a character.
There is no contradiction here. The standard stays that fgetc() returns EOF at end-of-file or on an error. It doesn't say the reverse; returning EOF doesn't necessarily imply that there was an error or end-of-file condition.
You can still determine whether fgetc() read an actual character or not by calling feof() and ferror().
So such a system would break the typical input loop:
while ((c = fgetc(fp)) != EOF) {
    ...
}
but it wouldn't (necessarily) fail to conform to the standard.
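For completeness, a loop that stays correct even on such a system (a sketch using the feof()/ferror() checks mentioned above; the function name is illustrative):

#include <stdio.h>

/* Copies in to out; an EOF return only ends the loop once the stream's
   end-of-file or error indicator is actually set. */
void copy_stream(FILE *in, FILE *out)
{
    for (;;) {
        int c = fgetc(in);
        if (c == EOF && (feof(in) || ferror(in)))
            break;              /* genuine end of file or read error */
        fputc(c, out);
    }
}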
(with reference to the comment by blagovest), does C99 specify when the standard library is to be expected, or whether a conforming implementation can implement part but not all of the standard library?
A "hosted implementation" must support the entire standard library, including <stdio.h>.
A "freestanding implementation" needn't support <stdio.h>; only standard headers that don't declare any functions (<limits.h>, <stddef.h>, etc.). But a freestanding implementation may provide <stdio.h> if it chooses.
Typically freestanding implementations are for embedded systems, often with no operating system.
In practice, every current hosted implementation I'm aware of has CHAR_BIT==8. The implication is that in practice you can probably count on an EOF result from fgetc() actually indicating either end-of-file or an error -- but the standard doesn't guarantee it.
Yes on such a platform there would be one unsigned char value that would not be distinguishable from EOF.
unsigned char is not allowed to have padding bits, so the set of values of unsigned char would be a superset of the possible values of int.
The only hope one could have on such a platform is that plain char would be signed, so that EOF wouldn't clash with the positive char values.
This would probably not be the only problem that such a platform would have.
