Input stream reads and push backs by fscanf and scanf - c

Regarding fscanf (and I assume similarly for scanf), C17 7.21.6.2.9 states the following:
"An input item is read from the stream... An input item is defined as
the longest sequence of input characters which does not exceed any
specified field width and which is, or is a prefix of, a matching
input sequence. The first character, if any, after the input item
remains unread..."
Before reading this I had always assumed that the first character after the input item was read too, then pushed back. For example, if the input was 5X and the conversion specification was %d, both the 5 and the X would be read but the X would be pushed back. However, the quote above seems to indicate that each successive character in the input stream is being "peeked" at before it is read, so the X would never be read in the first place and a push back would never be necessary. However, footnote 289 states that fscanf pushes back at most one input character onto the input stream. So I guess my question is about what all of this really means. Does "read" mean to remove a character from the stream or could it also mean to "peek" at a character without removing it?

An input stream can push back at least 1 character.
Scanning "5X" with "%d" results in "5" being read and converted to an int 5, then saved. The "X" is read, but pushed back.
Trouble occurs with input like "-a": both the "-" and the "a" are read. C guarantees a successful push-back of "a", but whether "-" is also pushed back successfully depends on the implementation.
#include <stdio.h>
int main(void) {
    int i;
    scanf("%d", &i); // Enter -a
    printf("%c\n", getchar());
}
My output: -, not a as expected with only 1 push back. YMMV.
This is one of the reasons that it is better to read a line of user input with fgets() into a string and then parse the string, than to use (f)scanf().
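A minimal sketch of that fgets()-then-parse approach, assuming the line fits in a small fixed buffer. The helper name read_int_line and the buffer size are made up for illustration. Because the whole line is consumed before parsing, input like "-a" leaves nothing half-read in the stream.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read one line from `in` and parse it as an int.
   Returns 1 on success, 0 if the line does not start with a valid int,
   and EOF at end of input. Lines longer than 63 characters are split
   across calls; a fuller version would detect that. */
int read_int_line(FILE *in, int *out)
{
    char line[64];
    if (fgets(line, sizeof line, in) == NULL)
        return EOF;

    char *end;
    errno = 0;
    long v = strtol(line, &end, 10);
    if (end == line || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;               /* no digits (e.g. "-a") or out of range */

    *out = (int)v;              /* trailing text after the number is ignored */
    return 1;
}
```

Calling read_int_line(stdin, &i) in place of scanf("%d", &i) rejects "-a" as a whole line, so a following getchar() sees the start of the next line.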

The pushback is not always necessary. For example, if the conversion specification is %3d and the code reads three decimal digits successfully, it doesn't need to read anything more and there is no pushback.
The pushback is always the character that was read, so beyond recording where to read next, the input buffer doesn't need to change. (Using ungetc(), you can unget (push back) a character other than the one that was read.)
Reading a character means logically removing it from the stream. If it isn't a usable character, it is pushed back, so the effect is the same as peeking.
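The "read, then push back" idea can be sketched with getc() and ungetc(). The helper name peek_char is hypothetical; this only illustrates the effect and is not how any particular scanf implementation is written.

```c
#include <stdio.h>

/* Hypothetical helper: look at the next character without consuming it.
   Logically the character is read (removed) and then pushed back. */
int peek_char(FILE *stream)
{
    int c = getc(stream);       /* read: removes the character from the stream */
    if (c != EOF)
        ungetc(c, stream);      /* push back: the next read sees it again */
    return c;
}
```

After scanning "5X" with "%d", peek_char() returns 'X', and so does the next getc(), which is the observable meaning of "remains unread".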

Related

Reading input from a file in C

I came across the following question:
If a file contains the line "I am a boy\r\n" then on reading this line into the array str using fgets(). What will str contain?
[A]. "I am a boy\r\n\0"
[B]. "I am a boy\r\0"
[C]. "I am a boy\n\0"
[D]. "I am a boy"
The answer has been given as option c with the explanation
Declaration: char *fgets(char *s, int n, FILE *stream);
fgets reads characters from stream into the string s. It stops when it reads either n - 1 characters or a newline character, whichever comes first.
However, I couldn't understand how \r (carriage return) will influence fgets. I mean, shouldn't it be that first "I am a boy" is read, then on encountering \r the cursor is set to the initial position, so the "I" of "I am a boy" is overwritten by \n and the space following "I" is overwritten by \0?
Any help is deeply appreciated.
P.s: My claim is based on the explanation given on this link: https://www.quora.com/What-exactly-is-r-in-the-C-language
First, every time you see a multiple choice quiz on some programming website, I recommend you close the tab and do something productive instead such as watching videos of kittens. Because the questions seem to be just some variants of
Which of these is the first letter of the alphabet (only one is right)
A
a
6
a
the letter a
all of the above.
Carriage returns and line feeds do not affect the input read by a C program in that way. Each additional byte is just stored after the previous bytes. Otherwise, this is a very badly phrased question, as the answer could be any of A, B, C or D, or maybe none of them. Saying that C is the only one that is right is wrong.
First question is what it means if "the file contains \r"? Here I assume that the author meant that the file contains the 10 characters I am a boy followed by ASCII 13 and ASCII 10 (carriage return and line feed).
In C there are two translation modes for reading files, text mode and binary mode. On POSIX systems (all those operating systems with X in their name, except for Windows eXcePtion) these are equal - the text mode is ignored. So when you read the line into a buffer with fgets on POSIX, it will look for that line feed and store all characters as is, including the \r, so the buffer will have the following sequence of bytes: I am a boy\r\n\0. Therefore A could be true.
But on Windows, the text mode translates the carriage return and the linefeed to one newline character with ASCII value 10 in memory, so what you will have is I am a boy\n\0. Therefore C could be true. If your file was opened in binary mode, you'll still have I am a boy\r\n\0 - so how'd you claim that C is the only one that can be true?
If the line that you'd read with fgets is I am a boy\r\n (POSIX or binary mode) but you told fgets your buffer has space for only 12 characters, then you'd get 11 characters of the input and the terminating \0, and therefore you'd have I am a boy\r\0. The line feed character would remain in the stream. Therefore B could be true. B cannot be true if you indicated that the buffer has more space.
Finally any of these array contents does contain the string I am a boy, therefore D would be true in all of the cases above.
And if your buffer didn't have enough space for 10 characters and the terminator then you'd have some prefix of the contents, such as I am a bo followed by \0 which means that none of these was true.
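These cases can be checked with a small sketch. The helper fgets_demo is made up for illustration; it relies on tmpfile() opening its stream in binary update mode, so the \r is never translated regardless of platform.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "I am a boy\r\n" to a temporary binary
   stream, read it back with fgets() using a limit of n, and return the
   number of characters stored (excluding the terminating '\0'). */
size_t fgets_demo(char *buf, int n)
{
    FILE *f = tmpfile();        /* binary update mode ("wb+") */
    if (f == NULL)
        return 0;
    fputs("I am a boy\r\n", f);
    rewind(f);

    size_t len = 0;
    if (fgets(buf, n, f) != NULL)
        len = strlen(buf);
    fclose(f);
    return len;
}
```

With n of 32 the buffer holds all 12 characters (case A), with n of 12 it holds "I am a boy\r" (case B), and with n of 10 it holds only the prefix "I am a bo".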

fscanf read()s more than the number of characters I asked for

I have the following code:
#include <stdio.h>
int main(void)
{
unsigned char c;
setbuf(stdin, NULL);
scanf("%2hhx", &c);
printf("%d\n", (int)c);
return 0;
}
I set stdin to be unbuffered, then ask scanf to read up to 2 hex characters. Indeed, scanf does as asked; for example, having compiled the code above as foo:
$ echo 23 | ./foo
35
However, if I strace the program, I find that libc actually read 3 characters. Here is a partial log from strace:
$ echo 234| strace ./foo
read(0, "2", 1) = 1
read(0, "3", 1) = 1
read(0, "4", 1) = 1
35 # prints the correct result
So scanf is giving the expected result. However, this extra character being read is detectable, and it happens to break the communications protocol I am trying to implement (in my case, GDB remote debugging).
The man page for scanf says about the field width:
Reading of characters stops either when this maximum is reached or when a nonmatching character is found, whichever happens first.
This seems a bit deceptive, at least; or is it in fact a bug? Is it too much to hope that with unbuffered stdin, scanf might read no more than the amount of input I asked for?
(I'm running on Ubuntu 18.04 with glibc 2.27; I've not tried this on other systems.)
This seems a bit deceptive, at least; or is it in fact a bug?
IMO, no.
An input item is read from the stream, ... An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure. C17dr § 7.21.6.2 9
Code such as "%hhx" (without a width limit) certainly must get 1 past the hex characters to know it is done. That excess character is pushed-back into stdin for the next input operation.
To me, "The first character, if any, after the input item remains unread" implies a distinction between reading characters from the stream at the lowest level and reading them as far as the stream is concerned: a stream can push back at least 1 character, and a pushed-back character counts as "remains unread". The width limit of 2 does not prevent this, as 3 characters can be read at the lowest level and 1 pushed back.
The width of 2 limits the maximum number of characters to interpret; it is not a limit on the number of characters read at the lowest level.
Is it too much to hope that with unbuffered stdin, scanf might read no more than the amount of input I asked for?
Yes. Buffered or not, I think a stream like stdin allows characters to be pushed back, and those are then considered unread.
Anyway, expecting "%2hhx" to read no more than 2 characters is brittle, given that leading white space does not count: "These white-space characters are not counted against a specified field width."
Setting stdin to be unbuffered does not stop a stream from reading an excess character and later pushing it back.
Given "this extra character being read is detectable, and it happens to break the communications protocol" I recommend a new approach that does not use a stream.
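One possible shape for such a non-stream approach, assuming a POSIX system where read() is available on the descriptor. The helper name read_hex_byte is hypothetical, and a real protocol handler would also want to retry on EINTR.

```c
#include <ctype.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read exactly two hex digits from file descriptor
   fd with read(), consuming no other bytes. Returns 0..255, or -1 on
   error, end of input, or a non-hex character. */
int read_hex_byte(int fd)
{
    char digits[3] = {0};
    size_t got = 0;

    while (got < 2) {
        ssize_t n = read(fd, digits + got, 2 - got);
        if (n <= 0)
            return -1;          /* error or EOF before two bytes arrived */
        got += (size_t)n;
    }
    if (!isxdigit((unsigned char)digits[0]) ||
        !isxdigit((unsigned char)digits[1]))
        return -1;              /* at least one byte was not a hex digit */

    return (int)strtoul(digits, NULL, 16);
}
```

Since no stdio stream sits in between, there is no lookahead character: with "234" on the descriptor, the call returns 35 and the "4" is still there for the next read().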

using fscanf to read from a file that contains an integer array

I'm trying to read a JSON-format file that contains an integer array (e.g. [ 0, 1, 2, 3, 4 ]).
I am wondering why fscanf skips the brackets in the file and goes straight to the numbers when I use
// the type of value is integer
FILE* fp=fopen(file,"r");
fscanf( fp,"%d",&value);
I'm still new to file I/O and I have no idea why this happens. I thought whenever I call fscanf, the file pointer would move 1 position forward.
You should check return values. The fscanf in your example should return 0 since the first non-white-space character encountered is [ which cannot start a number, so parsing fails there. The assumed return value of 0 indicates that no successful conversion took place. The value of value will probably stay unchanged (I couldn't find a specific statement for that in the man page). The reading position in the file will be before the [ so that subsequent attempts to read an int from the file will fail as well.
How to read the json array:
Note that I do not handle errors in the examples below. You must do that though...
scanf format strings can contain conversion specifications which contain a list of allowed or forbidden characters. This can be used to read "away" anything which is not a number: char buf[some large enough value]; fscanf(fp, " %[^0-9]", buf);. The reading position is now before the first number.
Then you'll have a loop doing two things:
The number ahead of us can be read trivially with fscanf(fp, "%d", &value);. This will also skip possible whitespace before the number in later iterations.
Now we must deal with the comma: fscanf(fp, " %[,]", buf); ("read a comma which is optionally preceded by white space"). Now you can read the next number.
The last number will not be followed by a comma. The attempt to read a comma will therefore fail (i.e. return 0); this can be used as end-of-array indicator.
If more arrays or other stuff may follow you must read away the remaining whitespace and closing square bracket so that you leave the file position after the array for others.
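The steps above could be put together roughly as follows. This is a sketch with some error handling added, not a JSON parser: the helper name read_int_array is made up, it only handles a flat array of ints, and it spells out the digits in the scanset (plus '-') instead of using a 0-9 range so that negative numbers keep their sign.

```c
#include <stdio.h>

/* Hypothetical helper: read up to max ints from a stream holding a flat
   array such as "[ 0, 1, 2, 3, 4 ]". Returns how many ints were stored. */
size_t read_int_array(FILE *fp, int *out, size_t max)
{
    size_t count = 0;
    char sep[2];

    /* Read "away" everything that cannot start a number ("[" and spaces).
       The * suppresses assignment, so no buffer is needed here. */
    fscanf(fp, "%*[^-0123456789]");

    while (count < max && fscanf(fp, "%d", &out[count]) == 1) {
        count++;
        /* A comma means another number follows; failing to read one
           (return value 0) marks the end of the array. */
        if (fscanf(fp, " %1[,]", sep) != 1)
            break;
    }

    /* Read away optional white space and the closing bracket so the
       file position is left after the array. */
    fscanf(fp, " %1[]]", sep);
    return count;
}
```

For the example file, a call like read_int_array(fp, values, 8) stores 0 through 4 and returns 5.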

C programming language (scanf)

I have read strings with spaces in them using the following scanf() statement.
scanf("%[^\n]", &stringVariableName);
What is the meaning of the control string [^\n]?
Is this an okay way to read strings with white space?
This means "read anything until you find a '\n'".
This is OK, but it would be better to do this: "read anything until you find a '\n', or until you have read as many characters as my buffer can hold".
char stringVariableName[256] = {0};
if (scanf("%255[^\n]", stringVariableName) == 1)
...
Edit: removed & from the argument, and check the result of scanf.
The format specifier "%[^\n]" instructs scanf() to read up to but not including the newline character. From the linked reference page:
matches a non-empty sequence of character from set of characters.
If the first character of the set is ^, then all characters not
in the set are matched. If the set begins with ] or ^] then the ]
character is also included into the set.
If the string is on a single line, fgets() is an alternative but the newline must be removed as fgets() writes it to the output buffer. fgets() also forces the programmer to specify the maximum number of characters that can be read into the buffer, making it less likely for a buffer overrun to occur:
char buffer[1024];
if (fgets(buffer, 1024, stdin))
{
/* Remove newline. */
char* nl = strrchr(buffer, '\n');
if (nl) *nl = '\0';
}
It is possible to specify the maximum number of characters to read via scanf():
scanf("%1023[^\n]", buffer);
but it is impossible to forget to do it for fgets() as the compiler will complain. Though, of course, the programmer could specify the wrong size but at least they are forced to consider it.
Technically, this can't be well defined.
Matches a nonempty sequence of characters from a set of expected
characters (the scanset).
If no l length modifier is present, the corresponding argument shall
be a pointer to the initial element of a character array large enough
to accept the sequence and a terminating null character, which will be
added automatically.
Supposing the declaration of stringVariableName looks like char stringVariableName[x];, then &stringVariableName is a char (*)[x];, not a char *. The type is wrong. The behaviour is undefined. It might work by coincidence, but anything that relies on coincidence doesn't work by my definition.
The only way to form a char * using &stringVariableName is if stringVariableName is a char! This implies that the character array is only large enough to accept a terminating null character. In the event where the user enters one or more characters before pressing enter, scanf would be writing beyond the end of the character array and invoking undefined behaviour. In the event where the user merely presses enter, the %[...] directive will fail and not even a '\0' will be written to your character array.
Now, with that all said and done, I'll assume you meant this: scanf("%[^\n]", stringVariableName); (note the omitted ampersand)
You really should be checking the return value!!
A %[ directive causes scanf to retrieve a sequence of characters consisting of those specified between the [ square brackets ]. A ^ at the beginning of the set indicates that the desired set contains all characters except for those between the brackets. Hence, %[^\n] tells scanf to read as many non-'\n' characters as it can, and store them into the array pointed to by the corresponding char *.
The '\n' will be left unread. This could cause problems. An empty field will result in a match failure. In this situation, it's possible that no data will be copied into your array (not even a terminating '\0' character). For this reason (and others), you really need to check the return value!
Which manual contains information about the return values of scanf? The scanf manual.
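A sketch of how the unread '\n' and the empty-line match failure might be handled together, assuming a 256-byte buffer. The helper name read_line_256 is hypothetical, and it reads from a FILE * parameter rather than stdin only so that it is easy to try out.

```c
#include <stdio.h>

/* Hypothetical helper: read one line (without its '\n') into buf using a
   width-limited scanset. Returns 1 if a line was read (possibly empty),
   EOF at end of input. */
int read_line_256(FILE *in, char buf[256])
{
    buf[0] = '\0';              /* on a match failure nothing is stored */

    int rc = fscanf(in, "%255[^\n]", buf);
    if (rc == EOF)
        return EOF;
    /* rc == 0 here means the line was empty; buf is still "". */

    /* Consume the '\n' that %[ left unread, plus any overlong tail. */
    int c;
    while ((c = getc(in)) != '\n' && c != EOF)
        ;
    return 1;
}
```

Without the loop at the end, a second call would see the leftover '\n' first and fail to match every time.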
Other people have explained what %[^\n] means.
This is not an okay way to read strings. It is just as dangerous as the notoriously unsafe gets, and for the same reason: it has no idea how big the buffer at stringVariableName is.
The best way to read one full line from a file is getline, but not all C libraries have it. If you don't, you should use fgets, which knows how big the buffer is, and be aware that you might not get a complete line (if the line is too long for the buffer).
Reading from the man pages for scanf()...
[ Matches a non-empty sequence of characters from the
specified set of accepted characters; the next pointer must be a
pointer to char, and there must be enough room for all the characters
in the string, plus a terminating null byte. The usual skip of
leading white space is suppressed. The string is to be made up of
characters in (or not in) a particular set; the set is defined by the
characters between the open bracket [ character and a close bracket ]
character. The set excludes those characters if the first character
after the open bracket is a circumflex (^). To include a close
bracket in the set, make it the first character after the open bracket
or the circumflex; any other position will end the set. The hyphen
character - is also special; when placed between two other
characters, it adds all intervening characters to the set. To
include a hyphen, make it the last character before the final close
bracket. For instance, [^]0-9-] means the set "everything except
close bracket, zero through nine, and hyphen". The string ends with
the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
In a nutshell, [^\n] means: read everything from the input that is not a \n and store it through the matching pointer in the argument list.

Invalid output with fscanf()

The language I am using is C
I am trying to scan data from a file, and the code segment is like:
char lsm;
long unsigned int address;
int objsize;
while(fscanf(mem_trace,"%c %lx,%d\n",&lsm,&address,&objsize)!=EOF){
printf("%c %lx %d\n",lsm,address,objsize);
}
The file which I read from has the first line as follows:
S 00600aa0,1
I 004005b6,5
I 004005bb,5
I 004005c0,5
S 7ff000398,8
The results that show in stdout is:
8048350 134524916
S 600aa0 1
I 4005b6 5
I 4005bb 5
I 4005c0 5
S 7ff000398,8
Obviously, the results have an extra line which comes from nowhere. Does anybody know how this could happen?
Thx!
This works for me on the data you supply:
#include <stdio.h>
int main(void)
{
char lsm[2];
long unsigned int address;
int objsize;
while (scanf("%1s %lx,%d\n", lsm, &address, &objsize) == 3)
printf("%s %9lx %d\n", lsm, address, objsize);
return 0;
}
There are multiple changes. The simplest and least consequential is the change from fscanf() to scanf(); that's for my convenience.
One important change is the type of lsm from a single char to an array of two characters. The format string then uses %1s, which reads one character (plus NUL '\0') into the string, but it also (and this is crucial) skips leading blanks.
Another change is the use of == 3 instead of != EOF in the condition. If something goes wrong, scanf() returns the number of successful matches. Suppose that it managed to read a letter but what followed was not a hex number; it would return 1 (not EOF). Further, it would return 1 on each iteration until it could find something that matched a hex number. Always test for the number of values you expect.
The output format was tidied up with the %9lx. I was testing on a 64-bit system, so the 9-digit hex converts fine. One problem with scanf() is that if you get an overflow on a conversion, the behaviour is undefined.
Output:
S 600aa0 1
I 4005b6 5
I 4005bb 5
I 4005c0 5
S 7ff000398 8
Why did you get the results you got?
The first conversion read a space into lsm, but then failed to convert S into a hex number, so it was left behind for the next cycle. So, you got the left-over garbage printed in the address and object size columns. The second iteration read the S and was then in synchrony with the data until the last line. The newline at the end of the format (like any other white space in the format string) eats white space, which is why the last line worked despite the leading blank.
A directive that is a conversion specification defines a set of
matching input sequences, as described below for each specifier. A
conversion specification is executed in the following steps:
Input white-space characters (as specified by the isspace function)
are skipped, unless the specification includes a [, c, or n specifier.
An input item is read from the stream, unless the specification
includes an n specifier.
[...]
The first time you call fscanf, your %c reads the first blank space in the file. Your white-space character reads zero or more characters of white-space, this time zero of them. Your %lx fails to match the S character in the file, so fscanf returns. You don't check the result. Your variables contain values that they had from earlier operations.
The second time you call fscanf, your %c reads the first S character in the file. From that point on, everything else succeeds too.
Added in editing, here is the simplest change to your format string to solve your problem:
" %c %lx,%d\n"
The space at the beginning will read zero or more characters of white-space and then %c will read the first non-white-space character in the file.
Here is another format string that will also solve your problem:
" %c %lx,%d"
The reason is that if you read and discard zero or more white-space characters twice in a row, the result is the same as doing it just once.
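Combining that format string with a check on the number of conversions might look like the sketch below. The helper name read_trace_record is made up, and the leading blank on each input line is assumed from the explanation above.

```c
#include <stdio.h>

/* Hypothetical helper: parse one trace record such as " S 00600aa0,1".
   Returns 1 if all three fields were converted, 0 otherwise (including
   at end of input). */
int read_trace_record(FILE *in, char *op, unsigned long *address, int *size)
{
    /* The leading space skips any white space, including the previous
       line's '\n', before %c, which does not skip white space itself. */
    return fscanf(in, " %c %lx,%d", op, address, size) == 3;
}
```

A loop such as while (read_trace_record(mem_trace, &lsm, &address, &objsize)) then stops cleanly at end of file or at the first malformed record instead of printing stale values.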
I think that fscanf reads the first character [space] into lsm, then fails to read address and objsize because the format string doesn't match the rest of the line.
Then it prints a space, then whatever happened to be in address and objsize when they were declared.
EDIT--
fscanf consumes the white space after each call; if you call ftell you'll see:
printf("%c %lx %d %ld\n",lsm,address,objsize,ftell(mem_trace));
