need for the last '\0' in fgets - c

I've seen several usage of fgets (for example, here) that go like this:
char buff[7]="";
(...)
fgets(buff, sizeof(buff), stdin);
The interest being that, if I supply a long input like "aaaaaaaaaaa", fgets will truncate it to "aaaaaa" here, because the 7th character will be used to store '\0'.
However, when doing this:
int i=0;
for (i=0;i<7;i++)
{
buff[i]='a';
}
printf("%s\n",buff);
I will always get 7 'a's, and the program will not crash. But if I try to write 8 'a's, it will.
As I saw it later, the reason for this is that, at least on my system, when I allocate char buff[7] (with or without =""), the 8th byte (counting from 1, not from 0) gets set to 0. From what I guess, things are done like this precisely so that a for loop with 7 writes, followed by a string formatted read, could succeed, whether the last character to be written was '\0' or not, and thus avoiding the need for the programmer to set the last '\0' himself, when writing chars individually.
From this, it follows that in the case of
fgets(buff, sizeof(buff), stdin);
and then providing a too long input, the resulting buffstring will automatically have two '\0' characters, one inside the array, and one right after it that was written by the system.
I have also observed that doing
fgets(buff,(sizeof(buff)+17),stdin);
will still work, and output a very long string, without crashing. From what I guessed, this is because fgets will keep writing until sizeof(buff)+17, and the last char to be written will precisely be a '\0', ensuring that any forthcoming string reading process would terminate properly (although the memory is messed up anyway).
But then, what about fgets(buff, (sizeof(buff)+1),stdin);? this would use up all the space that was rightfully allocated in buff, and then write a '\0' right after it, thus overwriting...the '\0' previously written by the system. In other words, yes, fgets would go out of bounds, but it can be proven that when adding only one to the length of the write, the program will never crash.
So in the end, here comes the question: why does fgets always terminates its write with a '\0', when another '\0', placed by the system right after the array, already exists? why not do like in the one by one for-loop based write, that can access the whole of the array and write anything the programmer wants, without endangering anything?
Thank you very much for your answer!
EDIT: indeed, there is no proof possible, as long as I do not know whether this 8th '\0' that mysteriously appears upon allocation of buff[7], is part of the C standard or not, specifically for string arrays. If not, then...it's just luck that it works :-)

but it can be proven that when adding only one to the length of the write, the program will never crash.
No! You can't prove that! Not in the sense of a mathematical proof. You have only shown that on your system, with your compiler, with those particular compiler settings you used, with particular environment configuration, it might not crash. This is far from a mathematical proof!
In fact the C standard itself, although it guarantees that you can get the address of "one place after the last element of an array", it also states that dereferencing that address (i.e. trying to read or write from that address) is undefined behaviour.
That means that an implementation can do everything in this case. It can even do what you expect with naive reasoning (i.e. work - but it's sheer luck), but it may also crash or it may also format your HD (if your are very, very unlucky). This is especially true when writing system software (e.g. a device driver or a program running on the bare metal), i.e. when there is no OS to shield you from the nastiest consequences of writing bad code!
Edit This should answer the question made in a comment (C99 draft standard):
7.19.7.2 The fgets function
Synopsis
#include <stdio.h>
char *fgets(char * restrict s, int n,
FILE * restrict stream);
Description
The fgets function reads at most one less than the number of characters specified by n
from the stream pointed to by stream into the array pointed to by s. No additional
characters are read after a new-line character (which is retained) or after end-of-file. A
null character is written immediately after the last character read into the array.
Returns
The fgets function returns s if successful. If end-of-file is encountered and no
characters have been read into the array, the contents of the array remain unchanged and a
null pointer is returned. If a read error occurs during the operation, the array contents are
indeterminate and a null pointer is returned.
Edit: Since it seems that the problem lies in a misunderstanding of what a string is, this is the relevant excerpt from the standard (emphasis mine):
7.1.1 Definitions of terms
A string is a contiguous sequence of characters terminated by and including the first null
character. The term multibyte string is sometimes used instead to emphasize special
processing given to multibyte characters contained in the string or to avoid confusion
with a wide string. A pointer to a string is a pointer to its initial (lowest addressed)
character. The length of a string is the number of bytes preceding the null character and
the value of a string is the sequence of the values of the contained characters, in order.

From C11 standard draft:
The fgets function reads at most one less than the number of characters specified by n
from the stream pointed to by stream into the array pointed to by s. No additional
characters are read after a new-line character (which is retained) or after end-of-file. A
null character is written immediately after the last character read into the array.
The fgets function returns s if successful. If end-of-file is encountered and no
characters have been read into the array, the contents of the array remain unchanged and a
null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.
The behaviour you describe is undefined.

Related

Does the gets() function in C automatically add a NULL character at the end of the input string?

I am writing a simple program to convert a number(+ve,32-bit) from binary to decimal. Here's my code:
int main()
{
int n=0,i=0;
char binary[33];
gets(binary);
for (i = 0; i < 33, binary[i] != '\0'; i++)
n=n*2+binary[i]-'0';
printf("%d",n);
}
If I remove binary[i]!='\0', then it gives wrong answer due to garbage values but if I don't it gives the correct answer. My question is: does the gets function automatically add a '\0' (NULL) character at the end of the string or is this just a coincidence?
Yes it does, writing past the end of binary[33] if it needs to.
Never use gets; automatic buffer overrun.
See Why is the gets function so dangerous that it should not be used? for details.
When gets was last supported (though deprecated) by the C standard, it had the following description (§ 7.19.7.7, The gets function):
The gets function reads characters from the input stream pointed to by stdin, into the
array pointed to by s, until end-of-file is encountered or a new-line character is read.
Any new-line character is discarded, and a null character is written immediately after the last character read into the array.
This means that if the string read from stdin was exactly as long as, or longer than, the array pointed to by s, gets would still (try to) append the null character to the end of the string.
Even if you are on a compiler or C standard revision that supports gets, don't use it. fgets is much safer since it requires the size of the buffer being written to as a parameter, and will not write past its end. Another difference is that it will leave the newline in the buffer, unlike gets did.

A little query, String in C

Recently I was programming in my Code Blocks and I did a little program only for hobby in C.
char littleString[1];
fflush( stdin );
scanf( "%s", littleString );
printf( "\n%s", littleString);
If I created a string of one character, why does the CodeBlocks allow me to save 13 characters?
C have no bounds-checking, writing out of bounds of arrays or dynamically allocated memory can't be checked by the compiler. Instead it will lead to undefined behavior.
To prevent buffer overflow with scanf you can tell it to only read a specific number of characters, and nothing more. So to tell it to read only one character you use the format "%1s".
As a small side-note: Remember that strings in C have an extra character in them, the terminator (character '\0'). So if you have a string that should contain one character, the size actually needs to be two characters.
LittleString is not a string. It is a char array of length one. In order for a char array to be a string, it must be null terminated with an \0. You are writing past the memory you have allotted for littleString. This is undefined behavior.Scanf just reads user input from the console and assigns it to the variable specified, in this case littleString. If you would like to control the length of user input which is assigned to the variable, I would suggest using scanf_s. Please note that scanf_s is not a C99 standard
Many functions in C is implemented without any checks for correctness of use. In other words, it is the callers responsibility that the arguments fulfill some rules set by the function.
Example: For strcpy the Linux man page says
The strcpy() function copies the string pointed to by src,
including the terminating null byte ('\0'), to the buffer
pointed to by dest. The strings may not overlap, and the
destination string dest must be large enough to receive the copy.
If you as a caller break that contract by passing a too small buffer, you'll have undefined behavior and anything can happen.
The program may crash or even do exactly what you expected in 99 out of 100 times and do something strange in 1 out of 100 times.

Why does fgets() store a \0 after the last character in a buffer?

I've been doing abit of reading through the Linux programmer's manual looking up various functions and trying to get a deeper understanding of what they are/how they work.
Looking at fgets() I read "A '\0' is stored after the last character in the buffer .
I've read through What does \0 stand for? and have a pretty solid understanding of what \0 symbolizes (a null character right ?). But what I'm struggling to grasp is its relevance to fgets(), I don't really understand why it "needs" to end with a null character.
As you already said, you are probably aware that \0 constitutes the end of all strings in C. As per the C standard, everything that is a string needs to be \0 terminated.
Since fgets() makes a string, that string, of course, will be properly null terminated.
Do note that for all string functions in C, any string you use or generate with them must be terminated with a \0 character.
Because otherwise you do not know how long the resulting string is.
One of the arguments to fgets is the maximum number of characters to read, but it's just that: a maximum. If you ask for 512 characters, but there are only 8 in the buffer, you will only get 8 characters … and a NULL in the 9th slot to demark the logical end of the C-string.
Arguably, fgets could instead have been designed to return the number of characters read, but then for most purposes you'd only have to add the NULL byte yourself manually, and the function would have to find a way to signify an error other than returning a null pointer.
From C standards:
The fgets function reads at most one less than the number of
characters specified by n from the stream pointed to by stream into
the array pointed to by s. No additional characters are read after a
new-line character (which is retained) or after end-of-file. A null
character is written immediately after the last character read into
the array.
This is to make sure that there is no buffer-overflow (characters/contents are not going beyond the provided storage) is in the created string.
As all the people before me said, fgets reads bytes from a file and makes them into a standard C string, which is null-terminated. The termination with the \0 byte reflects the fact that this function is text-oriented.
If you don't want to use null-termination for the data read from the file, it's not a string (not text), and also the end-of-line byte \n has no significance. In this case, you can use fread.
So C has two functions to read from file: fgets for text and fread for non-text (binary data).
BTW if the input file has a genuine zero-valued byte, fgets will do an uncomfortable thing: it will continue reading until it reads an end-of-line byte \n, and the output "string" will have two (or more) null-terminations. This doesn't make any sense as text, so it's another example of fgets being text-oriented and unsuitable for arbitrary data.

Does the C standard guarantee buffers are not touched past their null terminator?

In the various cases that a buffer is provided to the standard library's many string functions, is it guaranteed that the buffer will not be modified beyond the null terminator? For example:
char buffer[17] = "abcdefghijklmnop";
sscanf("123", "%16s", buffer);
Is buffer now required to equal "123\0efghijklmnop"?
Another example:
char buffer[10];
fgets(buffer, 10, fp);
If the read line is only 3 characters long, can one be certain that the 6th character is the same as before fgets was called?
The C99 draft standard does not explicitly state what should happen in those cases, but by considering multiple variations, you can show that it must work a certain way so that it meets the specification in all cases.
The standard says:
%s - Matches a sequence of non-white-space characters.252)
If no l length modifier is present, the corresponding argument shall be a
pointer to the initial element of a character array large enough to accept the
sequence and a terminating null character, which will be added automatically.
Here's a pair of examples that show it must work the way you are proposing to meet the standard.
Example A:
char buffer[4] = "abcd";
char buffer2[10]; // Note the this could be placed at what would be buffer+4
sscanf("123 4", "%s %s", buffer, buffer2);
// Result is buffer = "123\0"
// buffer2 = "4\0"
Example B:
char buffer[17] = "abcdefghijklmnop";
char* buffer2 = &buffer[4];
sscanf("123 4", "%s %s", buffer, buffer2);
// Result is buffer = "123\04\0"
Note that the interface of sscanf doesn't provide enough information to really know that these were different. So, if Example B is to work properly, it must not mess with the bytes after the null character in Example A. This is because it must work in both cases according to this bit of spec.
So implicitly it must work as you stated due to the spec.
Similar arguments can be placed for other functions, but I think you can see the idea from this example.
NOTE:
Providing size limits in the format, such as "%16s", could change the behavior. By the specification, it would be functionally acceptable for sscanf to zero out a buffer to its limits before writing the data into the buffer. In practice, most implementations opt for performance, which means they leave the remainder alone.
When the intent of the specification is to do this sort of zeroing out, it is usually explicitly specified. strncpy is an example. If the length of the string is less than the maximum buffer length specified, it will fill the rest of the space with null characters. The fact that this same "string" function could return a non-terminated string as well makes this one of the most common functions for people to roll their own version.
As far as fgets, a similar situation could arise. The only gotcha is that the specification explicitly states that if nothing is read in, the buffer remains untouched. An acceptable functional implementation could sidestep this by checking to see if there is at least one byte to read before zeroing out the buffer.
Each individual byte in the buffer is an object. Unless some part of the function description of sscanf or fgets mentions modifying those bytes, or even implies their values may change e.g. by stating their values become unspecified, then the general rule applies: (emphasis mine)
6.2.4 Storage durations of objects
2 [...] An object exists, has a constant address, and retains its last-stored value throughout its lifetime. [...]
It's this same principle that guarantees that
#include <stdio.h>
int a = 1;
int main() {
printf ("%d\n", a);
printf ("%d\n", a);
}
attempts to print 1 twice. Even though a is global, printf can access global variables, and the description of printf doesn't mention not modifying a.
Neither the description of fgets nor that of sscanf mentions modifying buffers past the bytes that actually were supposed to be written (except in the case of a read error), so those bytes don't get modified.
The standard is somewhat ambiguous on this, but I think a reasonable reading of it is that the answer is: yes, it's not allowed to write more bytes to the buffer than it read+null. On the other hand, a stricter reading/interpretation of the text could conclude that the answer is no, there's no guarantee. Here's what a publicly avaialble draft says about fgets.
char *fgets(char * restrict s, int n, FILE * restrict stream);
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.
The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.
There's a guarantee about how much it is supposed to read from the input, i.e. stop reading at newline or EOF and not read more than n-1 bytes. Although nothing is said explicitly about how much it's allowed to write to the buffer, the common knowledge is that fgets's n parameter is used to prevent buffer overflows. It's a little strange that the standard uses the ambiguous term read, which may not necessarily imply that gets can't write to the buffer more than n bytes, if you want to nitpick on the terminology it uses. But note that the same "read" terminology is used about both issues: the n-limit and the EOF/newline limit. So if you interpret the n-related "read" as a buffer-write limit, then [for consistency] you can/should interpret the other "read" the same way, i.e. not write more than what it read when string is shorter than the buffer.
On the other hand, if you distinguish between the uses of the phrase-verb "read into" (="write") and just "read", then you can't read the committee's text the same way. You are guaranteed that it won't "read into" (="write to") the array more than n bytes, but if the input string is terminated sooner by newline or EOF you're only guaranteed the rest (of the input) won't be "read", but whether that implies in won't be "read into" (="written to") the buffer is unclear under this stricter reading. The crucial issue is keyword is "into", which is elided, so the problem is whether the completion given by me in brackets in the following modified quote is the intended interpretation:
No additional characters are read [into the array] after a new-line character (which is retained) or after end-of-file.
Frankly a single postcondition stated as a formula (and would be pretty short in this case) would have been a lot more helpful than the verbiage I quoted...
I can't be bothered to try and analyze their writeup about the *scanf family, because I suspect it's going to be even more complicated given all the other things that happen in those functions; their writeup for fscanf is about five pages long... But I suspect a similar logic applies.
is it guaranteed that the buffer will not be modified beyond the null
terminator?
No, there's no guarantee.
Is buffer now required to equal "123\0efghijklmnop"?
Yes. But that's only because you've used correct parameters to your string related functions. Should you mess up buffer length, input modifiers to sscanf and such, then you program will compile. But it will most likely fail during runtime.
If the read line is only 3 characters long, can one be certain that the 6th character is the same as before fgets was called?
Yes. Once fgets() figures you have a 3 character input string it stores the input in the provided buffer, and it doesn't care about the reset of provided space at all.
Is buffer now required to equal "123\0efghijklmnop"?
Here buffer is just consists of 123 string guaranteed terminating at NUL.
Yes the memory allocated for array buffer will not get de-allocated, however you are making sure/restricting your string buffer can atmost only have 16 char elements which you can read into it at any point of time. Now depends whether you write just a single char or maximum what buffer can take.
For example:
char buffer[4096] = "abc";`
actually does something below,
memcpy(buffer, "abc", sizeof("abc"));
memset(&buffer[sizeof("abc")], 0, sizeof(buffer)-sizeof("abc"));
The standard insists that if any part of char array is initialized that is all it consists of at any moment until obeying its memory boundary.
There are no any guarantees from standard, which is why the functions sscanf and fgets are recommended to be used (with respect to the size of the buffer) as you show in your question (and using of fgets is considered preferable compared with gets).
However, some standard functions use null-terminator in their work, e.g. strlen (but I suppose you ask about string modification)
EDIT:
In your example
fgets(buffer, 10, fp);
untouching characters after 10-th is guaranteed (content and length of buffer will not be considered by fgets)
EDIT2:
Moreover, when using fgets keep in mind that '\n' will be stored in the buffers. e.g.
"123\n\0fghijklmnop"
instead of expected
"123\0efghijklmnop"
Depends on the function in use (and to a lesser degree its implementation). sscanf will start writing when it encounters its first non-whitespace character, and continue writing until its first whitespace character, where it will add a finishing 0 and return. But a function like strncpy (famously) zeroes out the rest of the buffer.
There is however nothing in the C standard which mandates how these functions behave.

About gets() in C

I am writing a C program, which has a 5-element array to store a string. And I am using gets() to get input. When I typed in more than 5 characters and then output the string, it just gave me all the characters I typed in. I know the string is terminated by a \0 so even I exceeded my array, it will still output the whole thing.
But what I am curious is where exactly gets() stores input, either buffer or just directly goes to my array?
What if I type in a long long string, will gets() try to store characters in the memories that should not be touched? Would it gives me a segment fault?
That's why gets is an evil. It does not check array bound and often invokes undefined behavior. Never use gets, instead you can use fgets.
By the way, now gets is no longer be a part of C. It has been removed in C11 standard in favor of a new safe alternative, gets_s1 (see the wiki). So, better to forget about gets.
1. C11: K.3.5.4.1 The gets_s function
Synopsis
#define _ _STDC_WANT_LIB_EXT1_ _ 1
#include <stdio.h>
char *gets_s(char *s, rsize_t n);
gets() will store the characters in the 5-element buffer. If you type in more than 4 characters, the end of string character will be missed and the result may not work well in any string operations in your program.
excerpt from man page on Ubuntu Linux
gets() reads a line from stdin into the buffer pointed to by s until
either a terminating newline or EOF, which it replaces with a null byte
('\0'). No check for buffer overrun is performed
The string is stored in the buffer and if it is too long it is stored in contiguous memory after the buffer. This can lead to unintended writing over of data or a SEGV fault or other problems. It is a security issue as it can be used to inject code into programs.
gets() stores the characters you type directly into your array and you can safely use/modify them. But indeed, as haccks and unxnut correctly state, gets doesn't care about the size of the array you give it to store its chars in, and when you type more characters than the array has space for you might eventually get a segmentation fault or some other weird results.
Just for the sake of completeness, gets() reads from a buffered file called stdin which contains the chars you typed. More specifically, it takes the chars until it reaches a newline. That newline too is put into your array and next the '\0' terminator. You should, as haccks says, use fgets which is very much alike:
char buf[100]; // the input buffer
fgets(buf, 100, stdin); // reads until it finds a newline (your enter) but never
// more than 99 chars, using the last char for the '\0'
// you can now use and modify buf

Resources