getline in C implementation for Windows OS - c

Statement
I know there is a function called getline() on Linux/Unix operating systems.
I also want to know which other functions are available on Linux/Unix operating systems but not on Windows.
Question
Is there a getline() function I can implement myself to replace the missing one on Windows?
What resources are available for reference and reading?
ssize_t getline(char **lineptr, size_t *n, FILE *stream)

The relevant standard is IEEE Std 1003.1, also called POSIX.1, specifically its System Interfaces, with the list of affected functions here.
I recommend Linux man pages online. Do not be deterred by its name, because the C functions (described in sections 2, 3) have a section Conforming to, which specifies which standards standardize the feature. If it is C89, C99, etc., then it is included in the C standard; if POSIX.1, then in POSIX; if SuS, then in Single Unix Specification which preceded POSIX; if 4.3BSD, then in old BSD version 4.3; if SVr4, then in Unix System V release 4, and so on.
Windows implements its own extensions to C, and avoids supporting anything POSIX or SuS, but has ported some details from 4.3BSD. Use Microsoft documentation to find out.
In Linux, the C libraries expose these features if certain preprocessor macros are defined before any #include statements are done. These are described at man 7 feature_test_macros. I typically use #define _POSIX_C_SOURCE 200809L for POSIX, and occasionally #define _GNU_SOURCE for GNU C extensions.
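For example, the typical prelude looks like this (a minimal sketch; the define must come before the first #include):

#define _POSIX_C_SOURCE 200809L   /* expose getline(), strdup(), etc. */
#include <stdio.h>
#include <string.h>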
getline() is an excellent interface, and not "leaky" except perhaps when used by programmers used to Microsoft/Windows inanities, like not being able to do wide character output to console without Microsoft-only extensions (because they just didn't want to put that implementation inside fwide(), apparently).
The most common use pattern is to initialize an unallocated buffer, and a suitable line length variable:
char *line_buf = NULL;
size_t line_max = 0;
ssize_t line_len;
Then, when you read a line, the C library is free to reallocate the buffer to whatever size is needed to contain the line. For example, your read file line-by-line loop might look like this:
while (1) {
    line_len = getline(&line_buf, &line_max, stdin);
    if (line_len < 0)
        break;
    // line_buf has line_len characters of data in it, and line_buf[line_len] == '\0'.
    // If the input contained embedded '\0' bytes, then strlen(line_buf) < line_len.
    // Normally, strlen(line_buf) == line_len.
}
free(line_buf);
line_buf = NULL;
line_max = 0;
if (!feof(stdin) || ferror(stdin)) {
    // Not all of input was processed, or there was an error.
} else {
    // All input processed without I/O errors.
}
Note that free(NULL) is safe, and does nothing. This means that we can safely use free(line_buf); line_buf = NULL; line_max = 0; after the loop (in fact, at any point we want!) to discard the current line buffer. If one is needed, the next getline() or getdelim() call with the same variables will allocate a new one.
The above pattern never leaks memory, and correctly detects all errors during file processing, from I/O errors to not having enough RAM available (or allowed for the current process), although it cannot distinguish between them: only that an error occurred. It also won't report false errors, unless you break out of the loop in your own added processing code.
Thus, any claims of getline() being "leaky" are anti-POSIX, pro-Microsoft propaganda. For some reason, Microsoft has steadfastly refused to implement these in their own C library, even though they easily could.
If you want to copy parts of the line, I recommend strdup() or strndup(), also POSIX.1-2008 functions. They return a dynamically allocated copy of the string; the latter copies at most the specified number of characters (fewer, if the string ends before that). In all cases, if these functions return a non-NULL pointer, the dynamically allocated string is terminated with a nul '\0', and should be freed with free() just like the getline() buffer above, when no longer needed.
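For instance, a small sketch of keeping a copy of the first whitespace-delimited word on the current line, using the standard-C strcspn() to find its length (the names word_len and word are illustrative):

size_t word_len = strcspn(line_buf, " \t\n");  /* length of the first word */
char *word = strndup(line_buf, word_len);      /* allocated, nul-terminated copy */
if (word != NULL) {
    /* ... use word ... */
    free(word);
}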
If you have to run code on Microsoft also, a good option is to implement your own getline() on the architectures and OSes that do not provide one. (You can use the Pre-defined Compiler Macros Wiki to see how you can detect the code being compiled on a specific architecture, OS, or compiler.)
An example getline() implementation can be written on top of fgets(), growing the buffer and reading more (appending to existing buffer), until the buffer ends with a newline. It, however, cannot really handle embedded '\0' bytes in the data; to do that, and properly implement getdelim(), you need to read the data character-by-character, using e.g. fgetc().
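A minimal sketch of such a fallback follows. The name my_getline() and the _WIN32 guard are illustrative assumptions; it follows the POSIX calling convention (*lineptr is NULL or from malloc(), *n is its size), returns long instead of the POSIX ssize_t (which Windows lacks), and inherits the fgets() limitation regarding embedded '\0' bytes:

#if defined(_WIN32)   /* no system getline(); use the fallback */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long my_getline(char **lineptr, size_t *n, FILE *stream)
{
    size_t used = 0;

    if (lineptr == NULL || n == NULL || stream == NULL)
        return -1;

    if (*lineptr == NULL || *n == 0) {
        *n = 128;
        *lineptr = malloc(*n);
        if (*lineptr == NULL)
            return -1;
    }

    while (fgets(*lineptr + used, (int)(*n - used), stream)) {
        used += strlen(*lineptr + used);
        if (used > 0 && (*lineptr)[used - 1] == '\n')
            return (long)used;             /* complete line, newline included */

        if (used + 1 == *n) {              /* buffer full: grow and read more */
            char *tmp = realloc(*lineptr, *n * 2);
            if (tmp == NULL)
                return -1;
            *lineptr = tmp;
            *n *= 2;
        }
    }
    return (used > 0) ? (long)used : -1;   /* EOF: last line without newline, or nothing */
}
#endif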

Related

What is this ssize_t getline(char **lineptr, size_t *n, FILE *stream); function?

When I was looking at the C++ std::getline function in <string>, I accidentally ran man getline in my Ubuntu terminal and found this function:
ssize_t getline(char **lineptr, size_t *n, FILE *stream);
I know it's totally a different thing from std::getline. They just happen to have the same function name.
Neither APUE nor The Linux Programming Interface mentions this function. But it comes with the standard C library (#include <stdio.h>).
I did read the description, and it seems that it's just a getline function that stores bytes into a dynamically allocated buffer. Nothing special beyond that.
Could someone tell me what this function is mainly for? What's special about it? Tried Google but got nothing.
There’s nothing particularly special about this function. It’s specified by POSIX to read a single, newline-delimited line from stream, into the buffer whose address is given by lineptr and which must be large enough to hold n bytes.
The reason lineptr and n are pointers is that they are used both as input to the function and potentially output from it. If lineptr is NULL on entry, or n indicates that its size is too small to hold the line read from stream, then getline (re)allocates a buffer and updates lineptr and n with the corresponding information.
getline is easier to use than fgets because the latter stops reading when it reaches the end of the buffer. So if fgets tries to read a line longer than the provided buffer, it returns a partial read, and the caller has to read again and concatenate the parts. In such circumstances, getline simply reallocates the provided buffer as needed.
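To illustrate, a minimal sketch of the extra bookkeeping fgets imposes (buffer sizes are arbitrary, and line is assumed large enough for this example):

char chunk[64];
char line[4096] = "";
while (fgets(chunk, sizeof chunk, stdin)) {
    strcat(line, chunk);          /* append the partial read; assumes line[] fits */
    if (strchr(chunk, '\n'))
        break;                    /* the whole line has now been read */
}

With getline, a single call does all of this, growing the buffer as needed.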
As explained in the GNU C Library documentation,
Standard C has functions to do this, but they aren’t very safe: null characters and even (for gets) long lines can confuse them. So the GNU C Library provides the nonstandard getline function that makes it easy to read lines reliably.
(getline originated in the GNU C Library and was added to POSIX in the 2008 edition.)

Why strcpy_s is safer than strcpy?

When I try to use the strcpy function, Visual Studio gives me an error:
error C4996: 'strcpy': This function or variable may be unsafe. Consider using strcpy_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
After searching online and reading many answers on Stack Overflow, the summary is that strcpy_s is safer than strcpy when copying a large string into a smaller buffer.
So, I tried the following code for copying into a shorter string:
char a[50] = "void";
char b[3];
strcpy_s(b, sizeof(a), a);
printf("String = %s", b);
The code compiles successfully. However, there is still a runtime error:
So, how is strcpy_s safe?
Am I understanding the safety concept wrong?
Why is strcpy_s() "safer"? Well, it's actually quite involved. (Note that this answer ignores any specific code issues in the posted code.)
First, when MSVC tells you standard functions such as strcpy() are "deprecated", at best Microsoft is being incomplete. At worst, Microsoft is downright lying to you. Ascribe whatever motivation you want to Microsoft here, but strcpy() and a host of other functions that MSVC calls "deprecated" are standard C functions, and they are most certainly NOT deprecated by anyone other than Microsoft. So when MSVC warns you that a function required to be implemented in any conforming C compiler (most of which then flow by requirement into C++...) is "deprecated", it omits the "by Microsoft" part.
The "safer" functions that Microsoft is "helpfully" suggesting that you use - such as strcpy_s() would be standard, as they are part of the optional Annex K of the C standard, had Microsoft implemented them per the standard.
Per N1967 - Field Experience With Annex K — Bounds Checking Interfaces
Microsoft Visual Studio implements an early version of the APIs. However, the implementation is incomplete and conforms neither to C11 nor to the original TR 24731-1. For example, it doesn't provide the set_constraint_handler_s function but instead defines a _invalid_parameter_handler _set_invalid_parameter_handler(_invalid_parameter_handler) function with similar behavior but a slightly different and incompatible signature. It also doesn't define the abort_handler_s and ignore_handler_s functions, the memset_s function (which isn't part of the TR), or the RSIZE_MAX macro. The Microsoft implementation also doesn't treat overlapping source and destination sequences as runtime-constraint violations and instead has undefined behavior in such cases.
As a result of the numerous deviations from the specification the Microsoft implementation cannot be considered conforming or portable.
Outside of a few specific cases (of which strcpy() is one), whether Microsoft's version of Annex K's "safer" bounds-checking functions are safer is debatable. Per N1967 (bolding mine):
Suggested Technical Corrigendum
Despite more than a decade since the original proposal and nearly ten years since the ratification of ISO/IEC TR 24731-1:2007, and almost five years since the introduction of the Bounds checking interfaces into the C standard, no viable conforming implementations has emerged. The APIs continue to be controversial and requests for implementation continue to be rejected by implementers.
The design of the Bounds checking interfaces, though well-intentioned, suffers from far too many problems to correct. Using the APIs has been seen to lead to worse quality, less secure software than relying on established approaches or modern technologies. More effective and less intrusive approaches have become commonplace and are often preferred by users and security experts alike.
Therefore, we propose that Annex K be either removed from the next revision of the C standard, or deprecated and then removed.
Note, however, that in the case of strcpy(), strcpy_s() is actually more akin to strncpy(). strcpy() is just a bog-standard C string function that doesn't do bounds checking, but strncpy() is a perverse function: it completely fills its target buffer, starting with data from the source string and padding the rest of the target buffer with '\0' char values. Unless the source string fills the entire target buffer, in which case strncpy() will NOT terminate it with a '\0' char value.
I'll repeat that: strncpy() does not guarantee a properly terminated copy.
It's hard not to be "safer" than strncpy(). In this case strcpy_s() does not violate the principle of least astonishment like strncpy() does. I'd call that "safer".
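A minimal demonstration of that pitfall:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[4];
    strncpy(dst, "abcd", sizeof dst);  /* source fills dst exactly, so no '\0' is written */
    dst[sizeof dst - 1] = '\0';        /* without this manual fix-up, printing dst is undefined */
    printf("%s\n", dst);               /* prints "abc" */
    return 0;
}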
But using strcpy_s() - and all the other "suggested" functions - makes your code de facto non-portable, as Microsoft is the only significant implementation of any form of Annex K's bounds-checking functions.
The header definition for C is:
errno_t strcpy_s(char *dest, rsize_t dest_size, const char *src);
The invocation for your example should be:
#include <stdio.h>
#include <string.h>
char a[50] = "void";
char b[3];
strcpy_s(b, sizeof(b), a);
printf("String = %s", b);
strcpy_s needs the size of the destination, which is smaller than the source in your example.
strcpy_s(b, sizeof(b), a);
would be the way to go.
As for the safety concept, there are many checks now done, and better ways to handle errors.
In your example, had you used strcpy, you would have triggered a buffer overflow. strncpy would have copied the first 3 characters without any null terminator, which in turn could trigger a buffer overflow later (when reading, this time); strlcpy would at least have null-terminated the truncated copy.
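Where Annex K is unavailable, one portable alternative (a sketch, not the only option) is snprintf, which always nul-terminates when the size is nonzero and reports truncation through its return value:

char b[3];
int rc = snprintf(b, sizeof b, "%s", "void");
if (rc < 0 || (size_t)rc >= sizeof b) {
    /* error or truncation: b now holds "vo", a terminated prefix */
}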

Initialize a `FILE *` variable in C?

I've got several oldish code bases that initialize variables (to be able to redirect input/output at will) like
FILE *usrin = stdin, *usrout = stdout;
However, gcc (8.1.1, glibc 2.27.9000 on Fedora rawhide) balks at this (initializer is not a compile time constant). Rummaging in /usr/include/stdio.h I see:
extern FILE *stdin;
/* C89/C99 say they're macros. Make them happy */
#define stdin stdin
First, it makes no sense to me that you can't initialize variables this (rather natural) way for such use. Sure, you can do it in later code, but it is a nuisance.
Second, why is the macro expansion not a constant?
Third, what is the rationale for having them be macros?
First, it makes no sense to me that you can't initialize variables this (rather natural) way for such use.
Second, why is the macro expansion not a constant?
stdin, stdout, and stderr are pointers which are initialized during C library startup, possibly as the result of a memory allocation. Their values aren't known at compile time -- depending on how your C library works, they might not even be constants. (For instance, if they're pointers to statically allocated structures, their values will be affected by ASLR.)
Third, what is the rationale for having them be macros?
It guarantees that #ifdef stdin will be true. This might have been added for compatibility with some very old programs which needed to handle systems which lacked support for stdio.
Classically, the values for stdin, stdout and stderr were variations on the theme of:
#define stdin (&__iob[0])
#define stdout (&__iob[1])
#define stderr (&__iob[2])
These are address constants and can be used in initializers for variables at file scope:
static FILE *def_out = stdout;
However, the C standard does not guarantee that the values are address constants that can be used like that. C11 §7.21 Input/output <stdio.h>:
stderr, stdin, stdout
which are expressions of type ''pointer to FILE'' that point to the FILE objects associated, respectively, with the standard error, input, and output streams.
Sometime a decade or more ago, the GNU C Library changed their definitions so that you could no longer use stdin, stdout or stderr as initializers for variables at file scope, or static variables with function scope (though you can use them to initialize automatic variables in a function). So, old code that had worked for ages on many systems stopped working on Linux.
The macro expansion of stdin etc is either a simple identity expansion (#define stdin stdin) or equivalent (on macOS, #define stdout __stdoutp). These are variables, not address constants, so you can't copy the value of the variable in the file scope initializer. It is a nuisance, but the standard doesn't say they're address constants, so it is legitimate.
They're required to be macros because they always were macros, so it retains that much backwards compatibility with the dawn of the standard I/O library (circa 1978, long before there was a standard C library per se).
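In practice, the usual workaround is to defer the assignment to run time, e.g. at the top of main() (a minimal sketch, using the question's variable names):

#include <stdio.h>

static FILE *usrin;    /* can no longer be initialized with stdin at file scope */
static FILE *usrout;

int main(void)
{
    usrin = stdin;     /* run-time assignment is always allowed */
    usrout = stdout;
    fprintf(usrout, "output goes to usrout\n");
    return 0;
}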

How to get file size in ANSI C without fseek and ftell?

While looking for ways to find the size of a file given a FILE*, I came across this article advising against it. Instead, it seems to encourage using file descriptors and fstat.
However I was under the impression that fstat, open and file descriptors in general are not as portable (After a bit of searching, I've found something to this effect).
Is there a way to get the size of a file in ANSI C while keeping in line with the warnings in the article?
In standard C, the fseek/ftell dance is pretty much the only game in town. Anything else you'd do depends at least in some way on the specific environment your program runs in. Unfortunately said dance also has its problems as described in the articles you've linked.
I guess you could always read everything out of the file until EOF and keep count along the way, with fread() for example.
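For reference, a minimal sketch of that dance, with all the caveats discussed below (the result may be meaningless for non-seekable or text streams, and ftell reports a long, which can overflow for huge files; the helper name file_size is illustrative):

#include <stdio.h>

long file_size(FILE *fp)
{
    long size = -1;
    long old = ftell(fp);              /* remember the current position */
    if (old >= 0 && fseek(fp, 0, SEEK_END) == 0) {
        size = ftell(fp);
        fseek(fp, old, SEEK_SET);      /* restore the position */
    }
    return size;                       /* -1 on any failure */
}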
The article claims fseek(stream, 0, SEEK_END) is undefined behaviour by citing an out-of-context footnote.
The footnote appears in text dealing with wide-oriented streams, which are streams whose first operation is a wide-character operation.
This undefined behaviour stems from the combination of two paragraphs. First §7.19.2/5 says that:
— Binary wide-oriented streams have the file-positioning restrictions ascribed to both text and binary streams.
And the restrictions for file-positioning with text streams (§7.19.9.2/4) are:
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
This makes fseek(stream, 0, SEEK_END) undefined behaviour for wide-oriented streams. There is no such rule like §7.19.2/5 for byte-oriented streams.
Furthermore, when the standard says:
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
It doesn't mean it's undefined behaviour to do so. But if the stream supports it, it's ok.
Apparently this exists to allow binary files to have coarse size granularity, i.e. for the size to be a number of disk sectors rather than a number of bytes, which allows an unspecified number of zeros to magically appear at the end of binary files. SEEK_END cannot be meaningfully supported in this case. Other examples include pipes or infinite files like /dev/zero. However, the C standard provides no way to distinguish between such cases, so you're stuck with system-dependent calls if you want to account for that.
Use fstat. It requires a file descriptor, which you can get via fileno() from the FILE*. The size is then in your grasp, along with other details.
i.e.
fstat(fileno(filePointer), &buf);
Where filePointer is the FILE *
and
buf is
struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
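Putting it together, a POSIX-only sketch (none of this is ISO C; the helper name size_of is illustrative):

#include <stdio.h>
#include <sys/stat.h>   /* fstat(), struct stat (POSIX, not ISO C) */

long long size_of(FILE *filePointer)
{
    struct stat buf;
    if (fstat(fileno(filePointer), &buf) != 0)
        return -1;                     /* fstat failed; see errno */
    return (long long)buf.st_size;     /* size in bytes for regular files */
}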
The executive summary is that you must use fseek/ftell because there is no alternative (even an implementation-specific one) that is better.
The underlying issue is that the "size" of a file in bytes is not always the same as the length of the data in the file and that, in some circumstances, the length of the data is not available.
A POSIX example is what happens when you write data to a device; the operating system only knows the size of the device. Once the data has been written and the (FILE*) closed there is no record of the length of the data written. If the device is opened for read the fseek/ftell approach will either fail or give you the size of the whole device.
When the ANSI-C committee was sitting at the end of the 1980's a number of operating systems the members remembered simply did not store the length of the data in a file; rather they stored the disk blocks of the file and assumed that something in the data terminated it. The 'text' stream represents this. Opening a 'binary' stream on those files shows not only the magic terminator byte, but also any bytes beyond it that were never written but happen to be in the same disk block.
Consequently the C-90 standard was written so that it is valid to use the fseek trick; the result is a conformant program, but the result may not be what you expect. The behavior of that program is not 'undefined' in the C-90 definition and it is not 'implementation-defined' (because on UN*X it varies with the file). Neither is it 'invalid'. Rather you get a number you can't completely rely on or, maybe, depending on the parameters to fseek, -1 and an errno.
In practice if the trick succeeds you get a number that includes at least all the data, and this is probably what you want, and if the trick fails it is almost certainly someone else's fault.
John Bowler
Different OSes provide different APIs for this. For example, on Windows we have:
GetFileAttributes()
On macOS we have:
[[[NSFileManager defaultManager] attributesOfItemAtPath:someFilePath error:nil] fileSize];
But the only raw, portable method is fread and fseek:
How can I get a file's size in C?
You can't always avoid writing platform-specific code, especially when you have to deal with things that are a function of the platform. File sizes are a function of the file system, so as a rule I'd use the native filesystem API to get that information over the fseek/ftell dance. I'd create my own generic wrapper around it, so as to not pollute application logic with platform-specific details and make the code easier to port.
The article has a little problem of logic.
It (correctly) identifies that a certain usage of C functions has behavior which is not defined by ISO C. But then, to avoid this undefined behavior, the article proposes a solution: replace that usage with platform-specific functions. Unfortunately, the use of platform-specific functions is also undefined according to ISO C. Therefore, the advice does not solve the problem of undefined behavior.
The quote in my copy of the 1999 standard confirms that the alleged behavior is indeed undefined:
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END. [ISO 9899:1999 7.19.9.2 paragraph 3]
But undefined behavior does not mean "bad behavior"; it is simply behavior for which the ISO C standard gives no definition. Not all undefined behaviors are the same.
Some undefined behaviors are areas in the language where meaningful extensions can be provided. The platform fills the gap by defining a behavior.
Providing a working fseek which can seek from SEEK_END is an example of an extension in place of undefined behavior. It is possible to confirm whether or not a given platform supports fseek from SEEK_END, and if this is provisioned, then it is fine to use it.
Providing a separate function like lseek is also an extension in place of undefined behavior (the undefined behavior of calling a function which is not in ISO C and not defined in the C program). It is fine to use that, if available.
Note that those platforms which have functions like the POSIX lseek will also likely have an ISO C fseek which works from SEEK_END. Also note that on platforms where fseek on a binary file cannot seek from SEEK_END, the likely reason is that this is impossible to do (no API can be provided to do it and that is why the C library function fseek is not able to support it).
So, if fseek does provide the desired behavior on the given platform, then nothing has to be done to the program; it is a waste of effort to change it to use that platform's special function. On the other hand, if fseek does not provide the behavior, then likely nothing does, anyway.
Note that even including a nonstandard header (one not defined by ISO C) is undefined behavior, by omission of any definition of the behavior. For instance, if the following appears in a C program:
#include <unistd.h>
the behavior is not defined after that. [See References below.] The behavior of the preprocessing directive #include is defined, of course. But this creates two possibilities: either the header <unistd.h> does not exist, in which case a diagnostic is required. Or the header does exist. But in that case, the contents are not known (as far as ISO C is concerned; no such header is documented for the Library). In this case, the include directive brings in an unknown chunk of code, incorporating it into the translation unit. It is impossible to define the behavior of an unknown chunk of code.
#include <platform-specific-header.h> is one of the escape hatches in the language for doing anything whatsoever on a given platform.
In point form:
Undefined behavior is not inherently "bad" and not inherently a security flaw (though of course it can be! E.g. buffer overruns linked to the undefined behaviors in the area of pointer arithmetic and dereferencing.)
Replacing one undefined behavior with another, only for the purpose of avoiding undefined behavior, is pointless.
Undefined behavior is just a special term used in ISO C to denote things that are outside of the scope of ISO C's definition. It does not mean "not defined by anyone in the world" and doesn't imply something is defective.
Relying on some undefined behaviors is necessary for making most real-world, useful programs, because many extensions are provided through undefined behavior, including platform-specific headers and functions.
Undefined behavior can be supplanted by definitions of behavior from outside of ISO C. For instance the POSIX.1 (IEEE 1003.1) series of standards defines the behavior of including <unistd.h>. An undefined ISO C program can be a well defined POSIX C program.
Some problems cannot be solved in C without relying on some kind of undefined behavior. An example of this is a program that wants to seek so many bytes backwards from the end of a file.
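For example, the kind of operation meant in that last point, undefined by strict ISO C for binary streams but well defined on POSIX systems (fp is assumed to be an already-open, seekable binary stream):

char tail[16];
size_t got = 0;
if (fseek(fp, -(long)sizeof tail, SEEK_END) == 0)
    got = fread(tail, 1, sizeof tail, fp);   /* got < 16 if the file is shorter */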
References:
Dan Pop in comp.std.c, Dec. 2002: http://groups.google.com/group/comp.std.c/msg/534ab15a7bc4e27e?dmode=source
Chris Torek, comp.std.c, on the subject of nonstandard functions being undefined behavior, Feb. 2002: http://groups.google.com/group/comp.lang.c/msg/2fddb081336543f1?dmode=source
Chris Engebretson, comp.lang.c, April 1997: http://groups.google.com/group/comp.lang.c/msg/3a3812dbcf31de24?dmode=source
Ben Pfaff, comp.lang.c, Dec 1998 [Jestful answer citing undefinedness of the inclusion of nonstandard headers]: http://groups.google.com/group/comp.lang.c/msg/73b26e6892a1ba4f?dmode=source
Lawrence Kirby, comp.lang.c, Sep 1998 [Explains effects of nonstandard headers]: http://groups.google.com/group/comp.lang.c/msg/c85a519fc63bd388?dmode=source
Christian Bau, comp.lang.c, Sep 1997 [Explains how the undefined behavior of #include <pascal.h> can bring in a pascal keyword for linkage.] http://groups.google.com/group/comp.lang.c/msg/e2762cfa9888d5c6?dmode=source

Why does C's "fopen" take a "const char *" as its second argument?

It has always struck me as strange that the C function "fopen" takes a "const char *" as the second argument. I would think it would be easier to both read your code and implement the library's code if there were bit masks defined in stdio.h, like "IO_READ" and such, so you could do things like:
FILE* myFile = fopen("file.txt", IO_READ | IO_WRITE);
Is there a programmatic reason for the way it actually is, or is it just historic? (i.e. "That's just the way it is.")
I believe that one of the advantages of the character string instead of a simple bit-mask is that it allows for platform-specific extensions which are not bit-settings. Purely hypothetically:
FILE *fp = fopen("/dev/something-weird", "r+,bs=4096");
For this gizmo, the open() call needs to be told the block size, and different calls can use radically different sizes, etc. Granted, I/O has been organized pretty well now (such was not the case originally — devices were enormously diverse and the access mechanisms far from unified), so it seldom seems to be necessary. But the string-valued open mode argument allows for that extensibility far better.
On IBM's mainframe MVS o/s, the fopen() function does indeed take extra arguments along the general lines described here — as noted by Andrew Henle (thank you!). The manual page includes the example call (slightly reformatted):
FILE *fp = fopen("myfile2.dat", "rb+, lrecl=80, blksize=240, recfm=fb, type=record");
The underlying open() has to be augmented by the ioctl() (I/O control) call or fcntl() (file control) or functions hiding them to achieve similar effects.
One word: legacy. Unfortunately we have to live with it.
Just speculation: maybe at the time a "const char *" seemed the more flexible solution, because it is not limited in any way, whereas a bit mask could only hold 32 distinct flags. Looks like a YAGNI to me now.
More speculation: dudes were lazy, and writing "rb" requires less typing than MASK_THIS | MASK_THAT :)
Dennis Ritchie (in 1993) wrote an article about the history of C, and how it evolved gradually from B. Some of the design decisions were motivated by avoiding source changes to existing code written in B or embryonic versions of C.
In particular, Lesk wrote a 'portable I/O package' [Lesk 72] that was later reworked to become the C 'standard I/O' routines
The C preprocessor wasn't introduced until 1972/3, so Lesk's I/O package was written without it! (In very early not-yet-C, pointers fit in integers on the platforms being used, and it was totally normal to assign an implicit-int return value to a pointer.)
Many other changes occurred around 1972-3, but the most important was the introduction of the preprocessor, partly at the urging of Alan Snyder [Snyder 74]
Without #include and #define, an expression like IO_READ | IO_WRITE wasn't an option.
The options in 1972 for what fopen calls could look like in typical source, without CPP, are:
FILE *fp = fopen("file.txt", 1); // magic constant integer literals
FILE *fp = fopen("file.txt", 'r'); // character literals
FILE *fp = fopen("file.txt", "r"); // string literals
Magic integer literals are obviously horrible, so unfortunately the obviously most efficient option (which Unix later adopted for open(2)) was ruled out by lack of a preprocessor.
A character literal is obviously not extensible; presumably that was obvious to the API designers even back then. But it would have been sufficient (and more efficient) for early implementations of fopen: they only supported single-character mode strings, checking for *mode being r, w, or a. (See Keith Thompson's answer.) Apparently r+ for read+write (without truncating) came later. (See fopen(3) for the modern version.)
C did have a character data type (char was added to B in 1971 as one of the first steps in producing embryonic C, so it was still new in 1972; original B didn't have char, having been written for machines that pack multiple characters into a word, so char() was a function that indexed a string; see Ritchie's history article).
Using a single-byte string is effectively passing a char by const reference, with all the extra overhead of memory accesses, because library functions can't be inlined. (And primitive compilers probably weren't inlining anything, even trivial functions (unlike fopen) in the same compilation unit where inlining would shrink total code size; modern-style tiny helper functions rely on modern compilers to inline them.)
PS: Steve Jessop's answer with the same quote inspired me to write this.
Possibly related: strcpy() return value. strcpy was probably written pretty early, too.
I must say that I am grateful for it - I know to type "r" instead of IO_OPEN_FLAG_R or was it IOFLAG_R or SYSFLAGS_OPEN_RMODE or whatever
I'd speculate that it's one or more of the following (unfortunately, I was unable to quickly find any kind of supporting references, so this'll probably remain speculation):
Kernighan or Ritchie (or whoever came up with the interface for fopen()) just happened to like the idea of specifying the mode using a string instead of a bitmap
They may have wanted the interface to be similar to yet noticeably different from the Unix open() system call interface, so it would be at once familiar yet not mistakenly compile with constants defined for Unix instead of by the C library
For example, let's say that the mythical C-standard fopen() taking a bitmapped mode parameter used the identifier OPENMODE_READONLY to specify what today is specified by the mode string "r". Now, suppose someone made the following call in a program compiled on a Unix platform (where the header that defines O_RDONLY has been included):
fopen( "myfile", O_RDONLY);
There would be no compiler error, but unless OPENMODE_READONLY and O_RDONLY were defined to be the same bit you'd get unexpected behavior. Of course it would make sense for the C standard names to be defined the same as the Unix names, but maybe they wanted to preclude requiring this kind of coupling.
Then again, this might not have crossed their minds at all...
The earliest reference to fopen that I've found is in the first edition of Kernighan & Ritchie's "The C Programming Language" (K&R1), published in 1978.
It shows a sample implementation of fopen, which is presumably a simplified version of the code in the C standard library implementation of the time. Here's an abbreviated version of the code from the book:
FILE *fopen(name, mode)
register char *name, *mode;
{
    /* ... */
    if (*mode != 'r' && *mode != 'w' && *mode != 'a') {
        fprintf(stderr, "illegal mode %s opening %s\n",
                mode, name);
        exit(1);
    }
    /* ... */
}
Looking at the code, the mode was expected to be a 1-character string (no "rb", no distinction between text and binary). If you passed a longer string, any characters past the first were silently ignored. If you passed an invalid mode, the function would print an error message and terminate your program rather than returning a null pointer (I'm guessing the actual library version didn't do that). The book emphasized simple code over error checking.
It's hard to be certain, especially given that the book doesn't spend a lot of time explaining the mode parameter, but it looks like it was defined as a string just for convenience. A single character would have worked as well, but a string at least makes future expansion possible (something that the book doesn't mention).
Dennis Ritchie has this to say, from http://cm.bell-labs.com/cm/cs/who/dmr/chist.html
In particular, Lesk wrote a 'portable I/O package' [Lesk 72] that was later reworked to become the C 'standard I/O' routines
So I say ask Mike Lesk, post the result here as an answer to your own question, and earn stacks of points for it. Although you might want to make the question sound a bit less like criticism ;-)
The reason is simple: to allow the modes to be extended by the C implementation as it sees fit. An argument of type int would not allow that. The C99 Rationale V5-10 §7.19.5.3 (The fopen function) says, for example, that
Other specifications for files, such as record length and block size, are not specified in the Standard due to their widely varying characteristics in different operating environments.
Changes to file access modes and buffer sizes may be specified using the setvbuf function (see §7.19.5.6).
An implementation may choose to allow additional file specifications as part of the mode string argument. For instance,
file1 = fopen(file1name, "wb,reclen=80");
might be a reasonable extension on a system that provides record-oriented binary files and allows a programmer to specify record length.
Similar text exists in the C89 Rationale 4.9.5.3
Naturally, if OR-ed enum flags were used, these kinds of extensions would not be possible.
One example of fopen implementation using these parameters would be on z/OS. An example there has the following excerpt:
/* The following call opens:
     the file myfile2.dat,
     a binary file for reading and writing,
     whose record length is 80 bytes,
     and maximum length of a physical block is 240 bytes,
     fixed-length, blocked record format
     for sequential record I/O.
*/
if ( (stream = fopen("myfile2.dat", "rb+, lrecl=80,\
blksize=240, recfm=fb, type=record")) == NULL )
    printf("Could not open data file for read update\n");
Now, imagine if you had to squeeze all this information into one argument of type int!!
As Tuomas Pelkonen says, it's legacy.
Personally, I wonder if some misguided saps conceived of it as being better because it takes fewer characters to type. In the olden days, programmers' time was valued more highly than it is today, since computing was less accessible and compilers weren't as great and all that.
This is just speculation, but I can see why some people would favor saving a few characters here and there (note the lack of verbosity in any of the standard library function names... I present string.h's "strstr" and "strchr" as probably the best examples of unnecessary brevity).
