I'm currently trying to make my own version of the getopt() function.
But I do not know how it manages to return a character as an int.
Is there any way I can have a look at the source code of the getopt() function?
The source code of getopt() in glibc is here: https://github.com/lattera/glibc/blob/master/posix/getopt.c
Of course there are more implementations you might look at, but this is probably the most popular one. Here's another, from FreeBSD: https://github.com/lattera/freebsd/blob/master/lib/libc/stdlib/getopt.c
The return value of the getopt(3) function is int to allow for an extra value (apart from all the possible chars it can return) to mark the end-of-options condition. This extra value is EOF (as with the getchar(3) function), which must be different from any possible char value.
To deal with this, and with the fact that different C compilers may implement char as either signed or unsigned, both functions return the character value as an unsigned byte in the range 0 to 255 (negative values are mapped to positive ones, conceptually by adding 256 to them ---this is only an illustration, as the language doesn't specify exactly how this is done---, so the negatives end up in the range 128..255), and reserve the value -1 for EOF.
If you are writing a getopt(3) function to be integrated into your system's standard C library, just check what value is used for EOF (most probably -1) and then implement your function so the values returned for your default char type don't conflict or overlap with it.
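As an illustration of that convention, here is a minimal, hedged sketch (the function my_getopt and its internal index are made up for this answer, and real option parsing such as optarg handling is omitted): the option character is returned through an unsigned char cast, so it can never collide with the -1 end-of-options value.

/* Hypothetical, simplified sketch of the return convention only;
   this is not the glibc or FreeBSD implementation. */
int my_getopt(int argc, char *const argv[], const char *optstring)
{
    static int idx = 1;                 /* hypothetical internal index */
    (void)optstring;                    /* option validation omitted   */

    if (idx >= argc || argv[idx] == NULL
        || argv[idx][0] != '-' || argv[idx][1] == '\0')
        return -1;                      /* end-of-options marker       */

    char opt = argv[idx][1];
    idx++;

    /* Cast through unsigned char so the value is always in 0..UCHAR_MAX
       and can never collide with the -1 end-of-options value, even if
       plain char is signed and the option byte has its high bit set.   */
    return (unsigned char)opt;
}

The real implementations linked above do considerably more (optarg, error reporting, permutation of argv), but the int return type plus the unsigned char cast is the part relevant to the question.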
Related
I have recently been reading The C Programming Language by Kernighan.
There is an example that declares a variable as an int but uses getchar() to store a character in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing I can think of is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And character constants like 'A' already have type int in C.
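A tiny illustration of that promotion (the exact sizes are implementation-dependent; 1 and 4 are merely typical):

#include <stdio.h>

int main(void)
{
    char c = 'A';

    /* In an arithmetic expression, c is promoted to int before use. */
    printf("%zu %zu\n", sizeof c, sizeof (c + 0));  /* typically 1 and 4 */

    /* A character constant such as 'A' already has type int in C. */
    printf("%zu\n", sizeof 'A');                    /* same as sizeof(int) */
    return 0;
}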
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
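In code, the usual idiom looks like this (a generic sketch, not taken from the book):

#include <stdio.h>

int main(void)
{
    int c;                        /* int, not char, so EOF stays distinguishable */

    while ((c = getchar()) != EOF) {
        char ch = (char)c;        /* safe to narrow only after the EOF test */
        putchar(ch);
    }
    return 0;
}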
Storing an int into a char has implementation-defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem that could have been avoided by mandating the char type to be unsigned, but the C Standard allowed for the many existing implementations where the char type was signed. It would take a vicious implementation to produce unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
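For instance, the common manual case-conversion idiom below assumes a contiguous lowercase range, which ASCII provides and EBCDIC does not (a small illustration, not part of the original answer):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = 'j';

    /* Portable: toupper() works in any execution character set. */
    printf("%c\n", toupper((unsigned char)c));

    /* Not portable: assumes 'a'..'z' are contiguous, which holds in
       ASCII but not in EBCDIC, where the letters come in groups.    */
    printf("%c\n", (char)(c - 'a' + 'A'));
    return 0;
}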
getchar is an old C standard function, and the philosophy back then was closer to how the language gets translated to assembly than to type correctness and readability. Keep in mind that compilers were not optimizing code as much as they do today. In C, int is the default return type (i.e. if you don't have a declaration of a function, older compilers will assume that it returns int), and returning a value is done through a register; therefore returning a char instead of an int would actually generate additional implicit code to mask out the extra bytes of the value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
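A quick way to see those sizes and ranges on your own implementation (a trivial sketch):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* The exact values printed depend on the implementation in use. */
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    printf("INT_MIN  = %d, INT_MAX  = %d\n", INT_MIN, INT_MAX);
    return 0;
}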
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (which the hardware can work with); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number '1'. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
If memory serves me right, you can't do arithmetic on a char as such; it is promoted to an int first. But if you treat it as an int, you can. So, to convert all LC letters to UC, you can do something like:
char letter;
....
if (letter >= 'a' && letter <= 'z') {   /* is it a lower-case letter? */
    letter = (int) letter - 32;         /* 'a' - 'A' == 32 in ASCII */
}
Some C compilers may warn about the implicit conversions involved if you do not make the cast to int explicit before adding/subtracting.
But, in the end, the type 'char' is really just a small integer type, since ASCII assigns a unique integer to each letter.
Good day,
I'm reading through some old code I've been asked to maintain, and I see a number of functions that are like so:
uint32_t myFunc(int* pI);
In the case of error conditions within the body of the function, the function exits early by returning a negative errno value, ie: return (-EINVAL);
Is this standard (i.e. C99, ANSI) C? I've read in comments that apparently doing this sets errno to EINVAL, but I can't find any documentation to support this. Wouldn't it be better to just declare the function as returning a signed type (i.e. int32_t myFunc(int* pI)) and treat negative values as error codes, rather than attempt to set errno in this manner?
Returning a negative value does not set errno. This is misinformation you picked up (or more likely, which the original author picked up) out-of-context: the mechanism by which the Linux kernel system calls report errors is returning a negative value in the range -4095 to -1, which the userspace code making the syscall then uses (in most cases) to fill in the value of errno. If you want to set errno yourself, however, just assign a value to it.
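A hedged sketch of the two styles (the function names are invented for illustration; EINVAL comes from <errno.h> on POSIX-ish systems):

#include <errno.h>

/* Kernel-style: encode the error in the return value itself. */
int parse_thing(const int *p)
{
    if (p == NULL)
        return -EINVAL;   /* caller checks for a negative return; errno untouched */
    return 0;
}

/* libc-style: return a sentinel and set errno explicitly. */
int parse_thing_errno(const int *p)
{
    if (p == NULL) {
        errno = EINVAL;   /* errno changes only because we assign to it */
        return -1;
    }
    return 0;
}

In other words, -EINVAL in the kernel style is just a number; nothing in the C language ties it to errno.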
Addressing your question: Is this standard (i.e. C99, ANSI) C?
For C99
"errno which expands to a modifiable lvalue that has type int, the
value of which is set to a positive error number by several library
functions." -- N1256 7.5p2
For POSIX
"The <errno.h> header shall define the following macros which shall
expand to integer constant expressions with type int, distinct
positive values" -- errno.h DSCRIPTION
"Values for errno are now required to be distinct positive values
rather than non-zero values. This change is for alignment with the
ISO/IEC 9899:1999 standard." -- errno.h CHANGE HISTORY Issue 6
A good discussion (and the source of these quotes) is found HERE (search "when to return EINVAL")
In section 7.19.7.1 of C99, we have:
If the end-of-file indicator for the input stream pointed to by stream is not set and a
next character is present, the fgetc function obtains that character as an unsigned
char converted to an int and advances the associated file position indicator for the
stream (if defined).
As I understand it, the int type can have the same width as unsigned char. In such a case, can we conclude that fgetc would only function correctly if the width of int is greater than CHAR_BIT?
(with reference to the comment by blagovest), does C99 specify when the standard library is to be expected, or whether a conforming implementation can implement part but not all of the standard library?
fgetc returns EOF on an end-of-file or error condition.
Otherwise, it returns the character that was read, as an unsigned char, converted to int.
Suppose CHAR_BIT == 16 and sizeof (int) == 1, and suppose the next character read has the value 0xFFFF. Then fgetc() will return 0xFFFF converted to int.
Here it gets a little tricky. Since 0xFFFF can't be represented in type int, the result of the conversion is implementation-defined. But typically, the result will be -1, which is a typical value (in fact, the only value I've ever heard of) for EOF.
So on such a system, fgetc() can return EOF even if it successfully reads a character.
There is no contradiction here. The standard says that fgetc() returns EOF at end-of-file or on an error. It doesn't say the reverse; returning EOF doesn't necessarily imply that there was an error or end-of-file condition.
You can still determine whether fgetc() read an actual character or not by calling feof() and ferror().
So such a system would break the typical input loop:
while ((c = fgetc(fp)) != EOF) {
    ...
}
but it wouldn't (necessarily) fail to conform to the standard.
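One way to harden the loop on such a system (an untested sketch, in the spirit of the Copy example in a later answer) is to treat EOF only as a hint and confirm it with feof()/ferror():

#include <stdio.h>

/* Hypothetical helper: copy fp to stdout without trusting EOF alone. */
void cat_stream(FILE *fp)
{
    for (;;) {
        int c = fgetc(fp);
        if (c == EOF && (feof(fp) || ferror(fp)))
            break;          /* genuine end of file or read error */
        putchar(c);         /* otherwise c was a real character  */
    }
}

On ordinary 8-bit-char systems the extra check is redundant, which is why the simple EOF test is the common idiom.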
(with reference to the comment by blagovest), does C99 specify when the standard library is to be expected, or whether a conforming implementation can implement part but not all of the standard library?
A "hosted implementation" must support the entire standard library, including <stdio.h>.
A "freestanding implementation" needn't support <stdio.h>; only standard headers that don't declare any functions (<limits.h>, <stddef.h>, etc.). But a freestanding implementation may provide <stdio.h> if it chooses.
Typically freestanding implementations are for embedded systems, often with no operating system.
In practice, every current hosted implementation I'm aware of has CHAR_BIT==8. The implication is that in practice you can probably count on an EOF result from fgetc() actually indicating either end-of-file or an error -- but the standard doesn't guarantee it.
Yes, on such a platform there would be at least one unsigned char value that would not be distinguishable from EOF.
unsigned char is not allowed to have padding bits, so it has at least as many distinct values as int; the conversion to int therefore cannot keep every unsigned char value distinct from EOF.
The only hope one could have on such a platform is that at least plain char would be signed, so EOF wouldn't clash with the positive char values.
This would probably not be the only problem that such a platform would have.
My view is that a C implementation cannot satisfy the specification of certain stdio functions (particularly fputc/fgetc) if sizeof(int)==1, since the int needs to be able to hold any possible value of unsigned char or EOF (-1). Is this reasoning correct?
(Obviously sizeof(int) cannot be 1 if CHAR_BIT is 8, due to the minimum required range for int, so we're implicitly only talking about implementations with CHAR_BIT>=16, for instance DSPs, where typical implementations would be a freestanding implementation rather than a hosted implementation, and thus not required to provide stdio.)
Edit: After reading the answers and some links references, some thoughts on ways it might be valid for a hosted implementation to have sizeof(int)==1:
First, some citations:
7.19.7.1(2-3):
If the end-of-file indicator for the input stream pointed to by stream is not set and a
next character is present, the fgetc function obtains that character as an unsigned
char converted to an int and advances the associated file position indicator for the
stream (if defined).
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the
fgetc function returns the next character from the input stream pointed to by stream.
If a read error occurs, the error indicator for the stream is set and the fgetc function
returns EOF.
7.19.8.1(2):
The fread function reads, into the array pointed to by ptr, up to nmemb elements
whose size is specified by size, from the stream pointed to by stream. For each
object, size calls are made to the fgetc function and the results stored, in the order
read, in an array of unsigned char exactly overlaying the object. The file position
indicator for the stream (if defined) is advanced by the number of characters successfully read.
Thoughts:
Reading back unsigned char values outside the range of int could simply have implementation-defined behavior in the implementation. This is particularly unsettling, as it means that using fwrite and fread to store binary structures (which, while it results in nonportable files, is supposed to be an operation you can perform portably on any single implementation) could appear to work but silently fail. I accept that an implementation might not have a usable filesystem, but it's a lot harder to accept that an implementation could have a filesystem that automatically invokes nasal demons as soon as you try to use it, with no way to determine that it's unusable. Now that I realize the behavior is implementation-defined and not undefined, it's not quite so unsettling, and I think this might be a valid (although undesirable) implementation.
An implementation sizeof(int)==1 could simply define the filesystem to be empty and read-only. Then there would be no way an application could read any data written by itself, only from an input device on stdin which could be implemented so as to only give positive char values which fit in int.
Edit (again): From the C99 Rationale, 7.4:
EOF is traditionally -1, but may be any negative integer, and hence distinguishable from any valid character code.
This seems to indicate that sizeof(int) may not be 1, or at least that such was the intention of the committee.
It is possible for an implementation to meet the interface requirements for fgetc and fputc even if sizeof(int) == 1.
The interface for fgetc says that it returns the character read as an unsigned char converted to an int. Nowhere does it say that this value cannot be EOF even though the expectation is clearly that valid reads "usually" return positive values. Of course, fgetc returns EOF on a read failure or end of stream but in these cases the file's error indicator or end-of-file indicator (respectively) is also set.
Similarly, nowhere does it say that you can't pass EOF to fputc so long as that happens to coincide with the value of an unsigned char converted to an int.
Obviously the programmer has to be very careful on such platforms. This might not do a full copy:
void Copy(FILE *out, FILE *in)
{
    int c;

    while ((c = fgetc(in)) != EOF)
        fputc(c, out);
}
Instead, you would have to do something like (not tested!):
void Copy(FILE *out, FILE *in)
{
    int c;

    while ((c = fgetc(in)) != EOF || (!feof(in) && !ferror(in)))
        fputc(c, out);
}
Of course, the platforms where you will have real problems are those where sizeof(int) == 1 and the conversion from unsigned char to int is not an injection. I believe that this would necessarily be the case on platforms using sign-and-magnitude or ones' complement representation of signed integers.
I remember this exact same question on comp.lang.c some 10 or 15 years ago. Searching for it, I've found a more current discussion here:
http://groups.google.de/group/comp.lang.c/browse_thread/thread/9047fe9cc86e1c6a/cb362cbc90e017ac
I think there are two resulting facts:
(a) There can be implementations where strict conformance is not possible. E.g. sizeof(int)==1 with ones' complement or sign-magnitude negative values, or padding bits in the int type, i.e. not all unsigned char values can be converted to a valid int value.
(b) The typical idiom ((c=fgetc(in))!=EOF) is not portable (except for CHAR_BIT==8), as EOF is not required to be a separate value.
I don't believe the C standard directly requires that EOF be distinct from any value that could be read from a stream. At the same time, it does seem to take for granted that it will be. Some parts of the standard have conflicting requirements that I doubt can be met if EOF is a value that could be read from a stream.
For example, consider ungetc. On one hand, the specification says (§7.19.7.11):
The ungetc function pushes the character specified by c (converted to an unsigned
char) back onto the input stream pointed to by stream. Pushed-back characters will be
returned by subsequent reads on that stream in the reverse order of their pushing.
[ ... ]
One character of pushback is guaranteed.
On the other hand, it also says:
If the value of c equals that of the macro EOF, the operation fails and the input stream is unchanged.
So, if EOF is a value that could be read from the stream, and (for example) we do read from the stream, and immediately use ungetc to put EOF back into the stream, we get a conundrum: the call is "guaranteed" to succeed, but also explicitly required to fail.
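To make the conundrum concrete, a small hypothetical sketch (the scenario only arises on a sizeof(int) == 1 platform where a genuinely read byte converts to the same int value as EOF):

#include <stdio.h>

/* Illustration only: imagine a platform where sizeof(int) == 1 and the
   byte just read converts to the same int value as EOF.               */
void illustrate_conundrum(FILE *fp)
{
    int c = fgetc(fp);                    /* a real character, but c == EOF */
    if (c == EOF && !feof(fp) && !ferror(fp)) {
        int r = ungetc(c, fp);
        /* 7.19.7.11 requires this call to fail because c equals EOF,
           yet one character of pushback is "guaranteed" to succeed.   */
        (void)r;
    }
}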
Unless somebody can see a way to reconcile these requirements, I'm left with considerable doubt as to whether such an implementation can conform.
In case anybody cares, N1548 (the current draft of the new C standard) retains the same requirements.
Would it not be sufficient if a nominal char which shared a bit pattern with EOF were defined as nonsensical? If, for instance, CHAR_BIT were 16 but all the allowed values occupied only the 15 least significant bits (assume a two's-complement or sign-magnitude int representation). Or must everything representable in a char have meaning as such? I confess I don't know.
Sure, that would be a weird beast, but we're letting our imaginations go here, right?
R.. has convinced me that this won't hold together. Because a hosted implementation must implement stdio.h and if fwrite is to be able to stick integers on the disk, then fgetc could return any bit pattern that would fit in a char, and that must not interfere with returning EOF. QED.
I think you are right. Such an implementation cannot distinguish a legitimate unsigned char value from EOF when using fgetc/fputc on binary streams.
If there are such implementations (this thread seems to suggest there are), they are not strictly conforming. It is possible to have a freestanding implementation with sizeof (int) == 1.
A freestanding implementation (C99 4) only needs to support the features from the standard library as specified in these headers: <float.h>,
<iso646.h>, <limits.h>, <stdarg.h>, <stdbool.h>, <stddef.h>, and
<stdint.h>. (Note no <stdio.h>). Freestanding might make more sense for a DSP or other embedded device anyway.
I'm not so familiar with C99, but I don't see anything that says fgetc must produce the full range of values of char. The obvious way to implement stdio on such a system would be to put 8 bits in each char, regardless of its capacity. The requirement of EOF is
EOF
which expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
The situation is analogous to wchar_t and wint_t. In 7.24.1/2-3 defining wint_t and WEOF, footnote 278 says
wchar_t and wint_t can be the same integer type.
which would seem to suggest that "soft" range checking is considered sufficient to guarantee that WEOF is not in the character set.
Edit:
This wouldn't allow binary streams, since in such a case fputc and fgetc are required to perform no transformation. (7.19.2/3) Binary streams are not optional; only their distinctness from text streams is optional. So it would appear that this renders such an implementation noncompliant. It would still be perfectly usable, though, as long as you don't attempt to write binary data outside the 8-bit range.
You are assuming that the EOF cannot be an actual character in the character set.
If you allow this, then sizeof(int) == 1 is OK.
The TI C55x compiler I am using has a 16-bit char and 16-bit int, and does include a standard library. The library merely assumes an eight-bit character set, so a char of value > 255 is not defined when interpreted as a character; and when writing to an 8-bit stream device, the most significant 8 bits are discarded: for example, when written to the UART, only the lower 8 bits are transferred to the shift register and output.
How does C handle converting between integers and characters? Say you've declared an integer variable and ask the user for a number but they input a string instead. What would happen?
The user input is treated as a string that needs to be converted to an int using atoi or another conversion function. atoi will return 0 if the string cannot be interpreted as a number because it contains letters or other non-numeric characters.
You can read a bit more at the atoi documentation on MSDN - http://msdn.microsoft.com/en-us/library/yd5xkb5c(VS.80).aspx
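A tiny illustration of that limitation (the input strings are just examples):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Both calls return 0, so a failed conversion cannot be told apart
       from a user who genuinely typed "0".                             */
    printf("%d\n", atoi("hello"));   /* 0: not a number       */
    printf("%d\n", atoi("0"));       /* 0: a legitimate zero  */
    return 0;
}

strtol(), discussed in a later answer, avoids this ambiguity.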
Uh?
You always input a string. Then you parse/convert this string to a number, with various ways (asking again, taking a default value, etc.) of handling the various errors (overflow, incorrect chars, etc.).
Another thing to note is that in C, characters and integers are "compatible" to some degree. Any character can be assigned to an int. The reverse also works, but you'll lose information if the integer value doesn't fit into a char.
char foo = 'a'; // The ascii value representation for lower-case 'a' is 97
int bar = foo; // bar now contains the value 97
bar = 255; // 255 is 0x000000ff in hexadecimal
foo = bar; // foo now contains -1 (0xff)
unsigned char foo2 = foo; // foo2 now contains 255 (0xff)
As other people have noted, the data is normally entered as a string; the only question is which function is used for doing the reading. If you're using a GUI, the function may already deal with conversion to integer and report errors in an appropriate manner. If you're working with Standard C, it is generally easier to read the value into a string (perhaps with fgets()) and then convert it. Although atoi() can be used, it is seldom the best choice; the trouble is determining whether the conversion returned zero because the user entered a legitimate representation of zero or because the string could not be converted at all.
Generally, use strtol() or one of its relatives (strtoul(), strtoll(), strtoull()); for converting floating point numbers, use strtod() or a similar function. The advantage of the integer conversion routines include:
optional base selection (for example, base 10 - decimal, base 16 - hex, base 8 - octal, or any of the above using standard C conventions: 007 for octal, 0x07 for hex, 7 for decimal).
optional error detection (by knowing where the conversion stopped); see the sketch after this list.
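A hedged sketch of that approach (reading a line with fgets() and checking both the end pointer and errno; the buffer size and messages are arbitrary):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char buf[64];                             /* arbitrary buffer size */

    if (fgets(buf, sizeof buf, stdin) != NULL) {
        char *end;
        errno = 0;
        long value = strtol(buf, &end, 10);   /* base 10; base 0 would auto-detect */

        if (end == buf)
            puts("no digits were found");
        else if (errno == ERANGE)
            puts("value out of range for long");
        else
            printf("read %ld (conversion stopped at \"%s\")\n", value, end);
    }
    return 0;
}

Because strtol() reports where it stopped, a trailing newline or stray characters can be detected by inspecting *end.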
The place I go for many of these function specifications (when I don't look at my copy of the actual C standard) is the POSIX web site (which includes C99 functions). It is Unix-centric rather than Windows-centric.
The program would crash; you need to call the atoi function.