When will scanf append '\0' to user input? - c

I understand that scanf("%s",arr); will automatically append '\0' to the user input. But I thought that it was limited only to this. However even scanf("%[^\n]",arr); also appends a '\0'.(I know this scans all characters until a newline is reached). In this case will '\0' be appended after '\n' or before '\n'. Also how do we figure out when and when not '\0' will be appended? How do we scan many characters to just form a character array and not a string?

Also how do we figure out when and when not '\0' will be appended?
As scanf is a standard function, we figure that out from the C standard. From C11 draft 7.21.6.2p12:
s
... the corresponding argument shall be a pointer to the initial element of a character array large enough to accept the sequence and a terminating null character, which will be added automatically. ...
[
... the corresponding argument shall be a pointer to the initial element of a character array large enough to accept the sequence and a terminating null character, which will be added automatically. ...
There is also nicer looking and commonly used for fast reference the cppreference site.
I strongly suggest you never, ever use %s and %[ without a number specifying the "maximum field width" (unless you know you can use it). Always use %<number>s and %<number>[, to limit the number of character read and prevent overflow. So always:
char arr[20];
scanf("%19s", arr);
scanf is very unsafe function in that manner, it's easy to do stack overflow when reading strings.
How do we scan many characters to just form a character array and not a string?
I think you want:
size_t fread(void * restrict ptr,
size_t size, size_t nmemb,
FILE * restrict stream);
The fread function reads, into the array pointed to by ptr, up to nmemb elements whose size is specified by size, from the stream pointed to by stream.
It will read nmemb elements of size size to the memory pointed to by ptr. So just:
char character_array[20];
size_t number_of_chcaracter_read = fread(character_array, 20, 1, stdin);
It will not append the terminating null byte to the character_array.

Related

Why is the last character of a string not captured?

What happens to the last (nth) character of a n-character string when I try to output the string?
I've included my code, sample input and output below that highlights that the last character I input is lost.
Code:
char buffer[10];
fgets(buffer, sizeof(buffer), stdin);
printf("%s", buffer);
return 0;
Input:
aaaaaaaaab (that's 9 a's followed by 1 b)
Output:
aaaaaaaaa (9 a's)
For an array of characters to be treated as a propper string, its last character must be a null terminator (or null byte) '\0'.
The fgets function, in particular always makes sure that this character is added to the char array, so for a size argument of 10 it stores the first 9 caracters in the array and a null byte in the last available space, if the input is larger than or equal to 9.
Be aware that the unread characters, like b in your sample case, will remain in the input buffer stdin, and can disrupt future input reads.
This null byte acts as a sentinel, and is used by functions like printf to know where the string ends, needless to say that this character is not printable.
If you pass a non null terminated array of characters to printf this will amount to undefined behavior.
Many other functions in the standard library (and others) rely on this to work properly so it's imperative that you make sure that all your strings are properly null terminated.

Does the C standard guarantee buffers are not touched past their null terminator?

In the various cases that a buffer is provided to the standard library's many string functions, is it guaranteed that the buffer will not be modified beyond the null terminator? For example:
char buffer[17] = "abcdefghijklmnop";
sscanf("123", "%16s", buffer);
Is buffer now required to equal "123\0efghijklmnop"?
Another example:
char buffer[10];
fgets(buffer, 10, fp);
If the read line is only 3 characters long, can one be certain that the 6th character is the same as before fgets was called?
The C99 draft standard does not explicitly state what should happen in those cases, but by considering multiple variations, you can show that it must work a certain way so that it meets the specification in all cases.
The standard says:
%s - Matches a sequence of non-white-space characters.252)
If no l length modifier is present, the corresponding argument shall be a
pointer to the initial element of a character array large enough to accept the
sequence and a terminating null character, which will be added automatically.
Here's a pair of examples that show it must work the way you are proposing to meet the standard.
Example A:
char buffer[4] = "abcd";
char buffer2[10]; // Note the this could be placed at what would be buffer+4
sscanf("123 4", "%s %s", buffer, buffer2);
// Result is buffer = "123\0"
// buffer2 = "4\0"
Example B:
char buffer[17] = "abcdefghijklmnop";
char* buffer2 = &buffer[4];
sscanf("123 4", "%s %s", buffer, buffer2);
// Result is buffer = "123\04\0"
Note that the interface of sscanf doesn't provide enough information to really know that these were different. So, if Example B is to work properly, it must not mess with the bytes after the null character in Example A. This is because it must work in both cases according to this bit of spec.
So implicitly it must work as you stated due to the spec.
Similar arguments can be placed for other functions, but I think you can see the idea from this example.
NOTE:
Providing size limits in the format, such as "%16s", could change the behavior. By the specification, it would be functionally acceptable for sscanf to zero out a buffer to its limits before writing the data into the buffer. In practice, most implementations opt for performance, which means they leave the remainder alone.
When the intent of the specification is to do this sort of zeroing out, it is usually explicitly specified. strncpy is an example. If the length of the string is less than the maximum buffer length specified, it will fill the rest of the space with null characters. The fact that this same "string" function could return a non-terminated string as well makes this one of the most common functions for people to roll their own version.
As far as fgets, a similar situation could arise. The only gotcha is that the specification explicitly states that if nothing is read in, the buffer remains untouched. An acceptable functional implementation could sidestep this by checking to see if there is at least one byte to read before zeroing out the buffer.
Each individual byte in the buffer is an object. Unless some part of the function description of sscanf or fgets mentions modifying those bytes, or even implies their values may change e.g. by stating their values become unspecified, then the general rule applies: (emphasis mine)
6.2.4 Storage durations of objects
2 [...] An object exists, has a constant address, and retains its last-stored value throughout its lifetime. [...]
It's this same principle that guarantees that
#include <stdio.h>
int a = 1;
int main() {
printf ("%d\n", a);
printf ("%d\n", a);
}
attempts to print 1 twice. Even though a is global, printf can access global variables, and the description of printf doesn't mention not modifying a.
Neither the description of fgets nor that of sscanf mentions modifying buffers past the bytes that actually were supposed to be written (except in the case of a read error), so those bytes don't get modified.
The standard is somewhat ambiguous on this, but I think a reasonable reading of it is that the answer is: yes, it's not allowed to write more bytes to the buffer than it read+null. On the other hand, a stricter reading/interpretation of the text could conclude that the answer is no, there's no guarantee. Here's what a publicly avaialble draft says about fgets.
char *fgets(char * restrict s, int n, FILE * restrict stream);
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.
The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.
There's a guarantee about how much it is supposed to read from the input, i.e. stop reading at newline or EOF and not read more than n-1 bytes. Although nothing is said explicitly about how much it's allowed to write to the buffer, the common knowledge is that fgets's n parameter is used to prevent buffer overflows. It's a little strange that the standard uses the ambiguous term read, which may not necessarily imply that gets can't write to the buffer more than n bytes, if you want to nitpick on the terminology it uses. But note that the same "read" terminology is used about both issues: the n-limit and the EOF/newline limit. So if you interpret the n-related "read" as a buffer-write limit, then [for consistency] you can/should interpret the other "read" the same way, i.e. not write more than what it read when string is shorter than the buffer.
On the other hand, if you distinguish between the uses of the phrase-verb "read into" (="write") and just "read", then you can't read the committee's text the same way. You are guaranteed that it won't "read into" (="write to") the array more than n bytes, but if the input string is terminated sooner by newline or EOF you're only guaranteed the rest (of the input) won't be "read", but whether that implies in won't be "read into" (="written to") the buffer is unclear under this stricter reading. The crucial issue is keyword is "into", which is elided, so the problem is whether the completion given by me in brackets in the following modified quote is the intended interpretation:
No additional characters are read [into the array] after a new-line character (which is retained) or after end-of-file.
Frankly a single postcondition stated as a formula (and would be pretty short in this case) would have been a lot more helpful than the verbiage I quoted...
I can't be bothered to try and analyze their writeup about the *scanf family, because I suspect it's going to be even more complicated given all the other things that happen in those functions; their writeup for fscanf is about five pages long... But I suspect a similar logic applies.
is it guaranteed that the buffer will not be modified beyond the null
terminator?
No, there's no guarantee.
Is buffer now required to equal "123\0efghijklmnop"?
Yes. But that's only because you've used correct parameters to your string related functions. Should you mess up buffer length, input modifiers to sscanf and such, then you program will compile. But it will most likely fail during runtime.
If the read line is only 3 characters long, can one be certain that the 6th character is the same as before fgets was called?
Yes. Once fgets() figures you have a 3 character input string it stores the input in the provided buffer, and it doesn't care about the reset of provided space at all.
Is buffer now required to equal "123\0efghijklmnop"?
Here buffer is just consists of 123 string guaranteed terminating at NUL.
Yes the memory allocated for array buffer will not get de-allocated, however you are making sure/restricting your string buffer can atmost only have 16 char elements which you can read into it at any point of time. Now depends whether you write just a single char or maximum what buffer can take.
For example:
char buffer[4096] = "abc";`
actually does something below,
memcpy(buffer, "abc", sizeof("abc"));
memset(&buffer[sizeof("abc")], 0, sizeof(buffer)-sizeof("abc"));
The standard insists that if any part of char array is initialized that is all it consists of at any moment until obeying its memory boundary.
There are no any guarantees from standard, which is why the functions sscanf and fgets are recommended to be used (with respect to the size of the buffer) as you show in your question (and using of fgets is considered preferable compared with gets).
However, some standard functions use null-terminator in their work, e.g. strlen (but I suppose you ask about string modification)
EDIT:
In your example
fgets(buffer, 10, fp);
untouching characters after 10-th is guaranteed (content and length of buffer will not be considered by fgets)
EDIT2:
Moreover, when using fgets keep in mind that '\n' will be stored in the buffers. e.g.
"123\n\0fghijklmnop"
instead of expected
"123\0efghijklmnop"
Depends on the function in use (and to a lesser degree its implementation). sscanf will start writing when it encounters its first non-whitespace character, and continue writing until its first whitespace character, where it will add a finishing 0 and return. But a function like strncpy (famously) zeroes out the rest of the buffer.
There is however nothing in the C standard which mandates how these functions behave.

About gets() in C

I am writing a C program, which has a 5-element array to store a string. And I am using gets() to get input. When I typed in more than 5 characters and then output the string, it just gave me all the characters I typed in. I know the string is terminated by a \0 so even I exceeded my array, it will still output the whole thing.
But what I am curious is where exactly gets() stores input, either buffer or just directly goes to my array?
What if I type in a long long string, will gets() try to store characters in the memories that should not be touched? Would it gives me a segment fault?
That's why gets is an evil. It does not check array bound and often invokes undefined behavior. Never use gets, instead you can use fgets.
By the way, now gets is no longer be a part of C. It has been removed in C11 standard in favor of a new safe alternative, gets_s1 (see the wiki). So, better to forget about gets.
1. C11: K.3.5.4.1 The gets_s function
Synopsis
#define _ _STDC_WANT_LIB_EXT1_ _ 1
#include <stdio.h>
char *gets_s(char *s, rsize_t n);
gets() will store the characters in the 5-element buffer. If you type in more than 4 characters, the end of string character will be missed and the result may not work well in any string operations in your program.
excerpt from man page on Ubuntu Linux
gets() reads a line from stdin into the buffer pointed to by s until
either a terminating newline or EOF, which it replaces with a null byte
('\0'). No check for buffer overrun is performed
The string is stored in the buffer and if it is too long it is stored in contiguous memory after the buffer. This can lead to unintended writing over of data or a SEGV fault or other problems. It is a security issue as it can be used to inject code into programs.
gets() stores the characters you type directly into your array and you can safely use/modify them. But indeed, as haccks and unxnut correctly state, gets doesn't care about the size of the array you give it to store its chars in, and when you type more characters than the array has space for you might eventually get a segmentation fault or some other weird results.
Just for the sake of completeness, gets() reads from a buffered file called stdin which contains the chars you typed. More specifically, it takes the chars until it reaches a newline. That newline too is put into your array and next the '\0' terminator. You should, as haccks says, use fgets which is very much alike:
char buf[100]; // the input buffer
fgets(buf, 100, stdin); // reads until it finds a newline (your enter) but never
// more than 99 chars, using the last char for the '\0'
// you can now use and modify buf

what's in the left part of buf after fgets()?

I hope to know what will be in the left part of buf after fgets() been exacted. For example:
char buf[100];
fgets(buf, sizeof(buf), fp);
If one line has just 10 characters + '\n', then what will be in the left part of buf (from but[12] to buf[99])?
If execute fgets() twice, will the second input cover the first input to buf?
When fgets reads data it changes one element of the buffer at a time (simplifying assumption) until it reaches the limit or it finds a terminator in the input. All other elements in the buffer remain the same as before calling fgets (thus, they might have random data or they might leak previously read info).
#define SIZE 100
...
char buf[SIZE];
fgets reads at most SIZE - 1 characters from the given file stream and stores them in buf. The produced character string is always NULL-terminated. Parsing stops if end-of-file occurs or a newline character is found, in which case buf will contain that newline character. The left of data remains unchanged and it will contain whatever it was holding earlier ,it may be random data or anything left in the memory.And this is validated by the C standard below :
From the C11 standard:-
7.21.7.2 The fgets function
Synopsis
#include <stdio.h>
char *fgets(char * restrict s, int n, FILE * restrict stream);
Description
The fgets function reads at most one less than the number of
characters specified by n from the stream pointed to by stream into
the array pointed to by s. No additional characters are read after
a new-line character (which is retained) or after end-of-file. A null
character is written immediately after the last character read into
the array.
Returns
The fgets function returns s if successful. If end-of-file is
encountered and no characters have been read into the array, the
contents of the array remain unchanged and a null pointer is
returned. If a read error occurs during the operation, the array
contents are indeterminate and a null pointer is returned.
Emphasis mine :)
Now the second question :-
If execute fgets() twice, will the second input cover the first input to buf?
If by cover the second input you mean overwrite then yes ,obviously it will overwrite the first input ,you can execute this yourself and see.

Null termination of char array

Consider following case:
#include<stdio.h>
int main()
{
char A[5];
scanf("%s",A);
printf("%s",A);
}
My question is if char A[5] contains only two characters. Say "ab", then A[0]='a', A[1]='b' and A[2]='\0'.
But if the input is say, "abcde" then where is '\0' in that case. Will A[5] contain '\0'?
If yes, why?
sizeof(A) will always return 5 as answer. Then when the array is full, is there an extra byte reserved for '\0' which sizeof() doesn't count?
If you type more than four characters then the extra characters and the null terminator will be written outside the end of the array, overwriting memory not belonging to the array. This is a buffer overflow.
C does not prevent you from clobbering memory you don't own. This results in undefined behavior. Your program could do anything—it could crash, it could silently trash other variables and cause confusing behavior, it could be harmless, or anything else. Notice that there's no guarantee that your program will either work reliably or crash reliably. You can't even depend on it crashing immediately.
This is a great example of why scanf("%s") is dangerous and should never be used. It doesn't know about the size of your array which means there is no way to use it safely. Instead, avoid scanf and use something safer, like fgets():
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte ('\0') is stored after the last character in the buffer.
Example:
if (fgets(A, sizeof A, stdin) == NULL) {
/* error reading input */
}
Annoyingly, fgets() will leave a trailing newline character ('\n') at the end of the array. So you may also want code to remove it.
size_t length = strlen(A);
if (A[length - 1] == '\n') {
A[length - 1] = '\0';
}
Ugh. A simple (but broken) scanf("%s") has turned into a 7 line monstrosity. And that's the second lesson of the day: C is not good at I/O and string handling. It can be done, and it can be done safely, but C will kick and scream the whole time.
As already pointed out - you have to define/allocate an array of length N + 1 in order to store N chars correctly. It is possible to limit the amount of characters read by scanf. In your example it would be:
scanf("%4s", A);
in order to read max. 4 chars from stdin.
character arrays in c are merely pointers to blocks of memory. If you tell the compiler to reserve 5 bytes for characters, it does. If you try to put more then 5 bytes in there, it will just overwrite the memory past the 5 bytes you reserved.
That is why c can have serious security implementations. You have to know that you are only going to write 4 characters + a \0. C will let you overwrite memory until the program crashes.
Please don't think of char foo[5] as a string. Think of it as a spot to put 5 bytes. You can store 5 characters in there without a null, but you have to remember you need to do a memcpy(otherCharArray, foo, 5) and not use strcpy. You also have to know that the otherCharArray has enough space for those 5 bytes.
You'll end up with undefined behaviour.
As you say, the size of A will always be 5, so if you read 5 or more chars, scanf will try to write to a memory, that it's not supposed to modify.
And no, there's no reserved space/char for the \0 symbol.
Any string greater than 4 characters in length will cause scanf to write beyond the bounds of the array. The resulting behavior is undefined and, if you're lucky, will cause your program to crash.
If you're wondering why scanf doesn't stop writing strings that are too long to be stored in the array A, it's because there's no way for scanf to know sizeof(A) is 5. When you pass an array as the parameter to a C function, the array decays to a pointer pointing to the first element in the array. So, there's no way to query the size of the array within the function.
In order to limit the number of characters read into the array use
scanf("%4s", A);
There isn't a character that is reserved, so you must be careful not to fill the entire array to the point it can't be null terminated. Char functions rely on the null terminator, and you will get disastrous results from them if you find yourself in the situation you describe.
Much C code that you'll see will use the 'n' derivatives of functions such as strncpy. From that man page you can read:
The strcpy() and strncpy() functions return s1. The stpcpy() and
stpncpy() functions return a
pointer to the terminating `\0' character of s1. If stpncpy() does not terminate s1 with a NUL
character, it instead returns a pointer to s1[n] (which does not necessarily refer to a valid mem-
ory location.)
strlen also relies on the null character to determine the length of a character buffer. If and when you're missing that character, you will get incorrect results.
the null character is used for the termination of array. it is at the end of the array and shows that the array is end at that point. the array automatically make last character as null character so that the compiler can easily understand that the array is ended.
\0 is an terminator operator which terminates itself when array is full
if array is not full then \0 will be at the end of the array
when you enter a string it will read from the end of the array

Resources