Did the meaning of the n parameter of fgets change over time? - c

The Apple developer documentation states:
Security Note for fgets: Although the fgets function provides the ability to read a limited amount of data, you must be careful when using it. Like the other functions in the “safer” column, fgets always terminates the string. However, unlike the other functions in that column, it takes a maximum number of bytes to read, not a buffer size.
The last sentence sounds wrong to me. For comparison, here is what POSIX says:
The fgets() function shall read bytes from stream into the array pointed to by s until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. A null byte shall be written immediately after the last byte read into the array.
Here is what an ISO C draft from 2005 says:
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.
The FreeBSD man page says the same as the C standard and POSIX.
This makes me think that the Apple documentation is clearly wrong. The simplest explanation is that Apple didn't know better when they published this article. But although simple, this hypothesis doesn't feel plausible to me.
Are there other reasons that Apple could deviate from the wording of the C standard?

Even early (early 1970s) versions of fgets() specified that n is the buffer size, and that the buffer will be terminated with a '\0'.
Kernighan and Ritchie reflected that correctly in all their books and documentation.
However, a number of authors of introductory texts (who I won't attempt to name, since I'm sure I'll miss some, and all deserve to be equally embarrassed) documented that up to n characters could be written to the buffer, and that the trailing '\0' might be dropped in some cases.

The fgets function reads at most size minus one bytes from the file. If the wrong value is passed as the buffer size, fgets might write out of bounds.
So the quote from the Apple documentation that you show is correct in that the value is more closely related to the number of bytes to read from the file. On the other hand, any normal code would pass the actual buffer size when calling fgets. And if that number comes from user input, it should be validated before use.
On the other hand, the documentation continues to state (thanks for the note, Sander De Dycker):
In practical terms, this means that you must always pass a size value that is one fewer than the size of the buffer to leave room for the null termination. If you do not, the fgets function will dutifully terminate the string past the end of your buffer, potentially overwriting whatever byte of data follows it.
And this is wrong. The size argument passed to fgets always includes the string terminator. At least according to the C standard.
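A minimal sketch of that point, assuming the semantics quoted from the C standard: passing the full array size is safe, because fgets stores at most n-1 characters plus the terminator. The helper name read_first_chunk and the use of tmpfile() as a stand-in input stream are illustrative assumptions, not part of the answers above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: stores `text` in a temporary file, then reads it back
   with fgets, passing the full size of the destination array as n.
   Returns the length of the string fgets produced, or -1 on failure. */
long read_first_chunk(const char *text, char *buf, int bufsize)
{
    FILE *f = tmpfile();          /* stand-in for any input stream */
    long len = -1;

    if (f == NULL)
        return -1;
    if (fputs(text, f) != EOF) {
        rewind(f);
        /* n is the buffer size: at most bufsize - 1 characters are stored,
           and the terminating '\0' always fits inside the array. */
        if (fgets(buf, bufsize, f) != NULL)
            len = (long)strlen(buf);
    }
    fclose(f);
    return len;
}
```

With char buf[8], a call such as read_first_chunk(text, buf, sizeof buf) stores no more than 7 characters followed by the terminator, so there is no need to pass sizeof buf - 1.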

Related

Why the different behavior of snprintf vs swprintf?

The C standard states the following from the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are completely equivalent to each other, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case (which means that you can't know how many characters would have been written if there had been enough space). This makes the two functions not fully interchangeable, because their behavior is different in this regard (and thus the latter can't be used for some things that the former can, such as evaluating how long the output buffer would need to be before actually creating it).
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another, significantly subtler difference between the two specifications. They are not merely carbon copies of each other with the return value as the only difference: what happens when n is too small for the string to be printed is left somewhat ambiguous.
The specification for snprintf quite clearly states that the output will be written up to n-1 characters even when n is too small, and the null character will be written at the end of it always. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free to not write all, or anything, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.
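The difference in return values can be seen with a short sketch. The helper names narrow_length_needed and wide_result_when_too_small are made up for illustration; the format string "value=%d" and the 4-element buffer are arbitrary choices.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: asks snprintf how many characters the formatted
   output needs (excluding the terminator) without writing anything. */
int narrow_length_needed(int value)
{
    return snprintf(NULL, 0, "value=%d", value);
}

/* Hypothetical helper: formats the same text into a wide buffer that is
   deliberately too small, and returns what swprintf reports. */
int wide_result_when_too_small(int value)
{
    wchar_t buf[4];
    return swprintf(buf, sizeof buf / sizeof buf[0], L"value=%d", value);
}
```

The first helper yields the required length, which can then be used to allocate a buffer. The second only yields a negative value, so the required length cannot be recovered from it. The sketch does not inspect buf after the failed call, since that is exactly the part the question describes as ambiguous.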

What's the difference between gets and scanf?

If the code is
scanf("%s\n",message)
vs
gets(message)
what's the difference? It seems that both of them read input into message.
The basic difference [in reference to your particular scenario],
scanf() with %s stops taking input upon encountering whitespace, a newline or EOF
gets() considers whitespace a part of the input string and ends the input upon encountering a newline or EOF.
However, to avoid buffer overflow errors and security risks, it's safer to use fgets().
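A small sketch of the whitespace behavior of %s, using sscanf on a string as a stand-in for scanf on stdin. The helper name first_word and the 42-byte buffer size are assumptions chosen for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: copies the first whitespace-delimited word of
   `input` into `word`, which must have room for at least 42 chars.
   Returns 1 if a word was read, EOF if `input` holds no word. */
int first_word(const char *input, char *word)
{
    /* %41s skips leading whitespace, stops at the next whitespace
       character, and stores at most 41 characters plus the '\0'. */
    return sscanf(input, "%41s", word);
}
```

Given the input "hello world", only "hello" ends up in word; a line-oriented function such as fgets would keep the space and the second word as well.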
Disambiguation: In the following context I'd consider "safe" if not leading to trouble when correctly used. And "unsafe" if the "unsafetyness" cannot be maneuvered around.
scanf("%s\n",message)
vs
gets(message)
What's the difference?
In terms of safety there is no difference: both read from standard input and might very well overflow message if the user enters more data than message provides memory for.
However, scanf() can be used safely by specifying the maximum amount of data to be scanned in:
char message[42];
...
scanf("%41s", message); /* Read at most one character fewer than the buffer
(message here) can hold, because one byte is needed
to store the C-"string"'s 0-terminator. */
With gets() it is not possible to specify the maximum number of characters to be read in, which is why it shall not be used!
The main difference is that gets reads until EOF or \n, while scanf("%s") reads until any whitespace has been encountered. scanf also provides more formatting options, but at the same time it has worse type safety than gets.
Another big difference is that scanf is a standard C function, while gets has been removed from the language, since it was both superfluous and dangerous: there was no protection against buffer overruns. The very same security flaw exists with scanf however, so neither of those two functions should be used in production code.
You should always use fgets, the C standard itself even recommends this, see C11 K.3.5.4.1
Recommended practice
6 The fgets function allows properly-written
programs to safely process input lines too long to store in the result
array. In general this requires that callers of fgets pay attention to
the presence or absence of a new-line character in the result array.
Consider using fgets (along with any needed processing based on
new-line characters) instead of gets_s.
(emphasis mine)
There are several. One is that gets() will only get character string data. Another is that gets() will get only one variable at a time. scanf() on the other hand is a much, much more flexible tool. It can read multiple items of different data types.
In the particular example you have picked, there is not much of a difference.
gets - Reads characters from stdin and stores them as a string.
scanf - Reads data from stdin and stores them according to the format specified in the scanf statement like %d, %f, %s, etc.
gets:->
gets() reads a line from stdin into the buffer pointed to by s until either a terminating newline or EOF, which it replaces with a null byte ('\0').
BUGS:->
Never use gets(). Because it is impossible to tell without knowing the data in advance how many characters gets() will read, and because gets() will continue to store characters past the end of the buffer, it is extremely dangerous to use. It has been used to break computer security. Use fgets() instead.
scanf:->
The scanf() function reads input from the standard input stream stdin;
BUG:->
scanf can cause out-of-bounds problems when it is used with arrays and strings.
With scanf you must specify a format, unlike with gets. So with gets you can enter characters, strings, numbers and spaces.
With scanf, your input ends as soon as white space is encountered.
But in your example you are using '%s', so neither gets() nor scanf() checks that the strings are valid pointers to arrays of sufficient length to hold the characters you are sending to them. Hence both can easily cause a buffer overflow.
Tip: use fgets(), but that all depends on the use case.
The idea that scanf does not accept white space is completely wrong. If you use this code, it will read white space as well:
#include <stdio.h>
int main(void)
{
    char name[25];
    printf("Enter your name :\n");
    /* %24[^\n] reads up to 24 characters, spaces included, until a newline */
    if (scanf("%24[^\n]", name) == 1)
        printf("%s", name);
    return 0;
}
Here only a newline stops the input. That means reading stops only when you press Enter.
So, there is basically no difference between scanf and gets functions. It is just a tricky way of implementation.
scanf() is much more flexible tool while gets() only gets one variable at a time.
gets() is unsafe, for example: char str[1]; gets(str);
If you input more than the buffer can hold, the behavior is undefined and the program may crash with SIGSEGV.
If you can only use gets, use a malloc'd buffer as the destination.

Scan whole line from file in C Programming

I was writing a program to input multiple lines from a file.
The problem is that I don't know the length of the lines, so I can't use fgets because I need to give the size of the buffer, and I can't use fscanf because it stops at a space.
I saw a solution that recommended using malloc and realloc for each character read, but I think there's an easier way. Then I found someone suggesting
fscanf(file,"%[^\n]",line);
Does anyone have a better solution, or can someone explain how the above works? (I haven't tested it.)
I use the GCC compiler, if that's relevant.
You can use getline(3). It allocates memory on your behalf, which you should free when you are finished reading lines.
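A minimal sketch of the getline approach, assuming a POSIX.1-2008 system (getline is not part of ISO C). The helper name longest_line is hypothetical; it only illustrates the allocate, reuse and free pattern.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: a POSIX.1-2008 system */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: returns the length of the longest line in `f`
   (including its newline, if any), or -1 if no line could be read. */
long longest_line(FILE *f)
{
    char *line = NULL;   /* getline allocates and grows this buffer */
    size_t cap = 0;
    ssize_t n;
    long longest = -1;

    while ((n = getline(&line, &cap, f)) != -1) {
        if (n > longest)
            longest = (long)n;
    }
    free(line);          /* the caller owns the buffer getline allocated */
    return longest;
}
```

The same buffer is reused for every line and grown only when a longer line shows up, so it is freed once, after the loop.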
and then i found someone suggesting using fscanf(file,"%[^\n]",line);
That's practically an unsafe version of fgets(line, sizeof line, file);. Don't do that.
If you don't know the line length, you have two options.
There's a LINE_MAX macro defined somewhere in the C library (AFAIK it's POSIX-only, but some implementations may have equivalents). It's a fair assumption that lines don't exceed that length.
You can go the "read and realloc" way, but you don't have to realloc() for every character. A conventional solution to this problem is to exponentially expand the buffer size, i. e. always double the allocated memory when it's exhausted.
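The second option can be sketched as follows. This is one possible implementation under stated assumptions, not the only way to do it: the helper name read_long_line and the initial capacity of 64 are arbitrary, and lines containing embedded '\0' bytes are not handled because strlen is used to measure each chunk.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line of any length from `f` into a
   malloc'd buffer, doubling the capacity whenever it fills up.
   The newline is kept. Returns NULL at end of file or on allocation
   failure; the caller must free the result. */
char *read_long_line(FILE *f)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    while (fgets(buf + len, (int)(cap - len), f) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                 /* complete line */
        if (len + 1 == cap) {           /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (len > 0)
        return buf;     /* last line had no trailing newline */
    free(buf);
    return NULL;
}
```

Because the capacity doubles, a line of length L needs only about log2(L / 64) calls to realloc rather than one per character.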
A simple format specifier for scanf or fscanf follows this prototype:
%specifier
For example, d is the format specifier for integers. Two other specifiers are relevant here:
[characters] is a scanset: any number of the characters specified between the brackets. A dash (-) that is not the first character may produce non-portable behavior in some library implementations.
[^characters] is a negated scanset: any number of characters, none of them specified between the brackets.
fscanf(file,"%[^\n]",line);
This reads characters until any character in the negated scanset occurs, in this case the newline character.
As others suggested you can use getline() or fgets() and see example
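The negated scanset described above can be tried out with sscanf on a string. The helper name up_to_newline and the field width of 63 are assumptions for illustration; the width is added so the conversion cannot overflow the destination.

```c
#include <stdio.h>

/* Hypothetical helper: copies everything up to (not including) the first
   newline of `input` into `line`, which must hold at least 64 chars.
   Returns 1 on success, 0 if `input` starts with a newline, and EOF if
   `input` is empty. */
int up_to_newline(const char *input, char *line)
{
    /* %63[^\n] is a negated scanset with a field width: it matches up to
       63 characters that are not '\n', spaces included. */
    return sscanf(input, "%63[^\n]", line);
}
```

Note that the conversion fails, and stores nothing, when the very first character is a newline, which is one reason an empty line needs special handling with this approach.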
The line fscanf(file,"%[^\n]",line); means that it will read anything other than \n into line. This should work with Linux and Windows text files, I think. But it may not work with files in the classic Mac OS format, which uses \r to end a line.

What is gets() equivalent in C11?

From cplusplus.com
The most recent revision of the C standard (2011) has definitively
removed this function from its specification
The function is deprecated in C++ (as of 2011 standard, which follows
C99+TC3).
I just wanted to know what is the alternative to gets() in C11 standard?
In C11 gets has been substituted by gets_s that has the following declaration:
char *gets_s(char *str, rsize_t n);
This function will read at most n-1 chars from stdin into *str. This is to avoid the buffer overflow vulnerability inherent to gets. The function fgets is also an option. From http://en.cppreference.com/w/c/io/gets:
The gets() function does not perform bounds checking, therefore this function is extremely vulnerable to buffer-overflow attacks. It cannot be used safely (unless the program runs in an environment which restricts what can appear on stdin). For this reason, the function has been deprecated in the third corrigendum to the C99 standard and removed altogether in the C11 standard. fgets() and gets_s() are the recommended replacements.
Never use gets().
Given that gets_s is defined in an extension to the standard, only optionally implemented, you should probably write your programs using fgets instead. If you use fgets on stdin your program will also compile in earlier versions of C. But keep in mind the difference in the behavior: when gets_s has read n-1 characters it keeps reading until a new line or end-of-file is reached, discarding the input. So, with gets_s you are always reading an entire line, even if only a part of it can be returned in the input buffer.
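If the "always consume the entire line" behavior is wanted with plain fgets, it can be approximated as below. This is only a sketch: the helper name read_line_discard_rest and its return convention are invented here, and it does not reproduce the runtime-constraint handling of gets_s.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line with fgets, strips the newline, and
   discards whatever part of the line did not fit into buf.
   Returns 1 if the whole line fit, 0 if it was truncated, and -1 at end
   of file or on a read error. */
int read_line_discard_rest(FILE *f, char *buf, int size)
{
    if (fgets(buf, size, f) == NULL)
        return -1;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';     /* whole line fit */
        return 1;
    }
    /* No newline stored: either the line was too long or the input ended.
       Skip the rest of the line so the next call starts on a new one. */
    int c, skipped = 0;
    while ((c = fgetc(f)) != EOF && c != '\n')
        skipped = 1;
    return skipped ? 0 : 1;
}
```

The return value lets the caller tell a truncated line from a complete one, which is the "attention to the presence or absence of a new-line character" that the standard's recommended practice asks for.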
Others have already answered the question. For the sake of completeness, this is the C standard's recommendation:
ISO9899:2011 K.3.5.4.1/6
Recommended practice
The fgets function allows properly-written programs to safely process input lines too long to store in the result
array. In general this requires that callers of fgets pay attention to
the presence or absence of a new-line character in the result array.
Consider using fgets (along with any needed processing based on
new-line characters) instead of gets_s.
So you should use fgets whenever possible.
EDIT
gets_s behavior is specified to be:
ISO9899:2011 K.3.5.4.1/4
Description
The gets_s function reads at most one less than the number of characters specified by n
from the stream pointed to by stdin, into the array pointed to by s. No additional
characters are read after a new-line character (which is discarded) or after end-of-file.
The discarded new-line character does not count towards number of characters read. A
null character is written immediately after the last character read into the array.
If end-of-file is encountered and no characters have been read into the array, or if a read
error occurs during the operation, then s[0] is set to the null character, and the other
elements of s take unspecified values.
You can use fgets or gets_s:
http://www.java2s.com/Code/C/Console/Usefgetstoreadstringfromstandardinput.htm
According to man 3 gets, use fgets.

C: Reading a text file (with variable-length lines) line-by-line using fread()/fgets() instead of fgetc() (block I/O vs. character I/O)

Is there a getline function that uses fread (block I/O) instead of fgetc (character I/O)?
There's a performance penalty to reading a file character by character via fgetc. We think that to improve performance, we can use block reads via fread in the inner loop of getline. However, this introduces the potentially undesirable effect of reading past the end of a line. At the least, this would require the implementation of getline to keep track of the "unread" part of the file, which requires an abstraction beyond the ANSI C FILE semantics. This isn't something we want to implement ourselves!
We've profiled our application, and the slow performance is isolated to the fact that we are consuming large files character by character via fgetc. The rest of the overhead actually has a trivial cost by comparison. We're always sequentially reading every line of the file, from start to finish, and we can lock the entire file for the duration of the read. This probably makes an fread-based getline easier to implement.
So, does a getline function that uses fread (block I/O) instead of fgetc (character I/O) exist? We're pretty sure it does, but if not, how should we implement it?
Update Found a useful article, Handling User Input in C, by Paul Hsieh. It's a fgetc-based approach, but it has an interesting discussion of the alternatives (starting with how bad gets is, then discussing fgets):
On the other hand the common retort from C programmers (even those considered experienced) is to say that fgets() should be used as an alternative. Of course, by itself, fgets() doesn't really handle user input per se. Besides having a bizarre string termination condition (upon encountering \n or EOF, but not \0) the mechanism chosen for termination when the buffer has reached capacity is to simply abruptly halt the fgets() operation and \0 terminate it. So if user input exceeds the length of the preallocated buffer, fgets() returns a partial result. To deal with this programmers have a couple choices; 1) simply deal with truncated user input (there is no way to feed back to the user that the input has been truncated, while they are providing input) 2) Simulate a growable character array and fill it in with successive calls to fgets(). The first solution, is almost always a very poor solution for variable length user input because the buffer will inevitably be too large most of the time because its trying to capture too many ordinary cases, and too small for unusual cases. The second solution is fine except that it can be complicated to implement correctly. Neither deals with fgets' odd behavior with respect to '\0'.
Exercise left to the reader: In order to determine how many bytes was really read by a call to fgets(), one might try by scanning, just as it does, for a '\n' and skip over any '\0' while not exceeding the size passed to fgets(). Explain why this is insufficient for the very last line of a stream. What weakness of ftell() prevents it from addressing this problem completely?
Exercise left to the reader: Solve the problem determining the length of the data consumed by fgets() by overwriting the entire buffer with a non-zero value between each call to fgets().
So with fgets() we are left with the choice of writing a lot of code and living with a line termination condition which is inconsistent with the rest of the C library, or having an arbitrary cut-off. If this is not good enough, then what are we left with? scanf() mixes parsing with reading in a way that cannot be separated, and fread() will read past the end of the string. In short, the C library leaves us with nothing. We are forced to roll our own based on top of fgetc() directly. So lets give it a shot.
So, does a getline function that's based on fgets (and doesn't truncate the input) exist?
Don't use fread. Use fgets. I take it this is a homework/class project problem so I'm not providing a complete answer, but if you say it's not, I'll give more advice. It is definitely possible to provide 100% of the semantics of GNU-style getline, including embedded null bytes, using purely fgets, but it requires some clever thinking.
OK, update since this isn't homework:
memset your buffer to '\n'.
Use fgets.
Use memchr to find the first '\n'.
If no '\n' is found, the line is longer than your buffer. Enlarge the buffer, fill the new portion with '\n', and fgets into the new portion, repeating as necessary.
If the character following '\n' is '\0', then fgets terminated due to reaching end of a line.
Otherwise, fgets terminated due to reaching EOF, the '\n' is left over from your memset, the previous character is the terminating null that fgets wrote, and the character before that is the last character of actual data read.
You can eliminate the memset and use strlen in place of memchr if you don't care about supporting lines with embedded nulls (either way, the null will not terminate reading; it will just be part of your read-in line).
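The steps above can be sketched as one function. This is one reading of the recipe, not the answerer's own code: the name read_line_nul_safe, the initial capacity of 16, and the doubling growth policy are assumptions. Extremely long lines (beyond INT_MAX bytes per chunk) are not handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line, embedded '\0' bytes included, using
   only fgets. Stores the byte count in *len_out and returns a malloc'd
   buffer (newline kept, '\0' appended), or NULL at end of file or on
   allocation failure. */
char *read_line_nul_safe(FILE *f, size_t *len_out)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    for (;;) {
        char *chunk = buf + len;
        size_t room = cap - len;            /* always at least 2 */
        memset(chunk, '\n', room);          /* step 1: fill with '\n' */
        if (fgets(chunk, (int)room, f) == NULL) {
            if (len == 0) {
                free(buf);
                return NULL;                /* nothing left to read */
            }
            break;                          /* EOF right after a full chunk */
        }
        char *nl = memchr(chunk, '\n', room);
        if (nl == NULL) {
            /* Chunk completely filled without a newline: room - 1 bytes
               of data were read. Enlarge the buffer and continue. */
            len += room - 1;
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
            continue;
        }
        if (nl + 1 < chunk + room && nl[1] == '\0')
            len += (size_t)(nl - chunk) + 1;  /* real newline from input */
        else
            len += (size_t)(nl - chunk) - 1;  /* leftover '\n': EOF case */
        break;
    }
    buf[len] = '\0';
    *len_out = len;
    return buf;
}
```

The length comes from the position of the first '\n' rather than from strlen, which is what allows a line such as "ab\0cd\n" to be reported as 6 bytes.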
There's also a way to do the same thing with fscanf and the "%123[^\n]" specifier (where 123 is your buffer limit), which gives you the flexibility to stop at non-newline characters (ala GNU getdelim). However it's probably slow unless your system has a very fancy scanf implementation.
There isn't a big performance difference between fgets and fgetc/setvbuf.
Try:
int c;
FILE *f = fopen("blah.txt","r");
setvbuf(f,NULL,_IOLBF,4096); /* !!! check other values for last parameter in your OS */
while( (c=fgetc(f))!=EOF )
{
if( c=='\n' )
...
else
...
}
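A complete variant of the same idea, for comparison. It is a sketch under assumptions: the helper name count_newlines is made up, the loop body simply counts newlines, and it requests full buffering (_IOFBF) with a 64 KiB buffer instead of the _IOLBF and 4096 used in the fragment above.

```c
#include <stdio.h>

/* Hypothetical helper: counts '\n' characters in the file at `path`,
   reading one character at a time through a 64 KiB stdio buffer.
   Returns -1 if the file cannot be opened. */
long count_newlines(const char *path)
{
    FILE *f = fopen(path, "r");
    long count = 0;
    int c;

    if (f == NULL)
        return -1;
    /* Must be called before any other operation on the stream. NULL lets
       the library allocate the buffer; 65536 is an arbitrary size. */
    setvbuf(f, NULL, _IOFBF, 65536);
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n')
            count++;
    }
    fclose(f);
    return count;
}
```

Since fgetc is served from the stdio buffer, the underlying reads already happen in blocks; whether a larger buffer helps in practice has to be measured on the target system.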

Resources