What is `scanf` supposed to do with incomplete exponent-part? - c

Take for example rc = scanf("%f", &flt); with the input 42ex. An implementation of scanf would read 42e expecting a digit or sign to follow, and would only discover on reading x that none is coming. At that point, should it push back both x and e, or only the x?
The reason I ask is that GNU's libc will, on a subsequent call to gets, return ex, indicating that it pushed back both x and e, but the standard says:
An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.[245] The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
I interpret this to mean that, since 42e is a prefix of a matching input sequence (for example, 42e1 would be a matching input sequence), 42e should be treated as the input item and read, leaving only x unread. That would also be more convenient to implement if the stream supports only a single character of push-back.

Your interpretation of the standard is correct. There's even an example further down in the C standard which says that the input 100ergs of energy doesn't match the format %f%20s of %20s, because 100e fails to match %f.
But most C libraries seem to implement this differently, probably due to historical reasons. I just checked the C library on macOS and it behaves like glibc. The corresponding glibc bug was closed as WONTFIX with the following explanation from Ulrich Drepper:
This is stupidity on the ISO C committee side which goes against existing
practice. Any change can break existing code.
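To see what a particular library does with such input, a small probe along the following lines may help. This is only a sketch: probe_float is a made-up name, and it uses sscanf on a string rather than scanf on a stream, so the push-back question does not arise in quite the same way. The %n directive reports how many characters had been consumed, but only if the preceding %f succeeded.

```c
#include <stdio.h>

/* Hypothetical helper: applies "%f%n" to `input` and reports what happened.
 * Returns sscanf's result (1 if the %f conversion succeeded, 0 on a
 * matching failure, EOF on an input failure).
 * *consumed stays at -1 when the %n directive was never reached. */
int probe_float(const char *input, float *value, int *consumed)
{
    *consumed = -1;
    return sscanf(input, "%f%n", value, consumed);
}
```

For well-formed input such as "42.5 rest" this returns 1 with *consumed set to 4. For "42ex", a library that follows the standard's wording would be expected to return 0, while libraries that follow historical practice may return 1, in which case *consumed shows how much of the input they swallowed.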

Related

What does scanf("%f%c", ...) do against input `100e`?

Consider the following C code (available online as io.c):
#include <stdio.h>
int main () {
    float f;
    char c;
    scanf ("%f%c", &f, &c);
    printf ("%f \t %c", f, c);
    return 0;
}
When the input is 100f, it outputs 100.000000 f.
However, when the input is 100e, it outputs only 100.000000, with no e after it. What is going on here? Isn't 100e an invalid floating-point number?
This is (arguably) a glibc bug.
This behaviour clearly goes against the standard. However it is exhibited by other implementations. Some people consider it a bug in the standard instead.
Per the standard, An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. So 100e is an input item because it is a prefix of a matching input sequence, say, 100e1, but any longer sequence of characters from the input isn't. Further, If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. 100e is not a matching sequence so the standard requires the directive to fail.
The standard cannot tell scanf to accept 100 and continue scanning from e, as some people would expect, because stdio has a limited push-back of just one character. So having read 100e, the implementation would have to read at least one more character, say a newline to be specific, and then push back both newline and e, which it cannot always do.
I'd say this is pretty clearly a pretty unclear, gray area.
If you're an implementor of a C library (or a member of the X3J11 committee), you have to worry about this sort of thing — sometimes a lot. You have to worry about the edge cases, and sometimes the edge cases can be particularly edgy.
However, you did not tag your question with the "language lawyer" tag, so perhaps you're not worried about a scrupulously correct, official interpretation.
If you're not an implementor of a C library or a member of the X3J11 committee, I'd say: don't worry what the "right" answer here is! You don't have to worry, because you don't care, because you'd be crazy to write code which is sensitive to this question — precisely because it's such an obvious gray area. (Even if you do figure out what the right behavior here is, do you trust every implementor of every C library in the world to always implement that behavior?)
I'd say there are three things you can do in the category of "not worrying", and not writing code which is sensitive to this question.
1. Don't use scanf at all (for anything). It's an odious, imprecise, imperfect function, that's not good for anything except, perhaps, getting numbers into the first few programs you ever write while you're first learning C. After that, scanf has no use in any serious program.
2. Don't arrange your code and data such that it has to confront ambiguous input like "100e" in the first place. Where is it coming from, anyway? Is it input the user might type? Data being read in from a data file? Is it expected or unexpected, correct or incorrect input? If you're reading a data file, do you have control over the code that writes the data file? Can you guarantee that floating-point fields will always be delimited appropriately, will not occasionally have random alphabetic characters appended?
3. If you do have to parse input that might contain a valid floating-point number, might have random alphabetic characters appended, and might therefore be ambiguous like this, I'd encourage you to use strtod instead, which is likely to be both better-defined and better-implemented.
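As a rough illustration of the strtod suggestion, here is a minimal sketch (parse_leading_double is a made-up name, and range errors are not checked). strtod converts the longest initial part of the string that has the expected form and reports where it stopped, so an incomplete exponent is simply left unparsed.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a leading floating-point number from `text`.
 * On success stores the value in *out, points *rest at the first
 * unparsed character, and returns 1. Returns 0 when no number was found. */
int parse_leading_double(const char *text, double *out, const char **rest)
{
    char *end;
    double v = strtod(text, &end);

    if (end == text)        /* no conversion was performed */
        return 0;
    *out = v;
    *rest = end;
    return 1;
}
```

With the input "100e" this yields 100.0 and leaves rest pointing at "e"; with "42ex" it yields 42.0 and leaves "ex".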
Put a space between the two conversions, as in "%f %c", and when you enter the input make sure there is a space between the two values.
I am assuming you just want to print a character.
From the C Standard (6.4.4.2 Floating constants):
decimal-floating-constant:
    fractional-constant exponent-part(opt) floating-suffix(opt)
    digit-sequence exponent-part floating-suffix(opt)
and
exponent-part:
    e sign(opt) digit-sequence
    E sign(opt) digit-sequence
If you change the call of printf as follows
printf ("%e \t %d\n", f, c);
you will get the output
1.000000e+02 10
that is, the variable c has received the newline character '\n'.
It seems that this implementation of scanf treats the character e as part of the floating-point number even though no digit follows it.
According to the C Standard (7.21.6.2 The fscanf function)
9 An input item is read from the stream, unless the specification
includes an n specifier. An input item is defined as the longest
sequence of input characters which does not exceed any specified
field width and which is, or is a prefix of, a matching input
sequence.[278] The first character, if any, after the input item
remains unread.
So 100e is a matching input sequence of characters for a floating number.

Why the different behavior of snprintf vs swprintf?

The C standard states the following from the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are completely equivalent to each other, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case (which means that you can't know how many characters would have been written if there had been enough space). This makes the two functions not fully interchangeable, because their behavior is different in this regard (and thus the latter can't be used for some things that the former can, such as working out how long the output buffer needs to be before actually creating it).
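For reference, the sizing idiom that snprintf's return value makes possible looks roughly like this. It is a sketch only: format_alloc is a made-up name, and it uses vsnprintf, which follows the same return-value rule as snprintf.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd string holding the formatted
 * output, or NULL on failure. The caller frees the result. */
char *format_alloc(const char *fmt, ...)
{
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);

    /* First pass: n is zero, so nothing is written; the return value is
     * the length the output would have had. */
    int len = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);

    char *buf = NULL;
    if (len >= 0) {
        buf = malloc((size_t)len + 1);
        if (buf != NULL)
            vsnprintf(buf, (size_t)len + 1, fmt, ap2);  /* second pass */
    }
    va_end(ap2);
    return buf;
}
```

A call such as format_alloc("%d-%s", 42, "abc") returns a buffer of exactly the right size. Nothing equivalent can be written with swprintf, because its return value says nothing about the required length.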
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another significantly subtler difference between the two specifications. If you look closely, the specifications are not merely carbon copies of each other with the return value as the only difference. The other, much subtler difference is the somewhat ambiguous behavior when n is too small for the string to be printed.
The specification for snprintf quite clearly states that the output will be written up to n-1 characters even when n is too small, and the null character will be written at the end of it always. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free to not write all, or anything, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.
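Whichever reading is right, portable code probably should not rely on the contents of the buffer after a negative return. One way to cope is to grow the buffer and retry, as in the sketch below. The name wformat_alloc, the starting size, and the size cap are all assumptions made for illustration; the cap is there because a negative return does not distinguish "too small" from an encoding error.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <wchar.h>

#define WFORMAT_MAX_CHARS 65536  /* arbitrary cap, an assumption of this sketch */

/* Hypothetical helper: returns a malloc'd wide string holding the
 * formatted output, or NULL on failure. The caller frees the result. */
wchar_t *wformat_alloc(const wchar_t *fmt, ...)
{
    size_t cap = 16;
    wchar_t *buf = NULL;

    while (cap <= WFORMAT_MAX_CHARS) {
        wchar_t *grown = realloc(buf, cap * sizeof(wchar_t));
        if (grown == NULL)
            break;
        buf = grown;

        va_list ap;
        va_start(ap, fmt);
        int written = vswprintf(buf, cap, fmt, ap);
        va_end(ap);

        if (written >= 0)
            return buf;   /* everything fit, terminator included */
        cap *= 2;         /* too small or encoding error: cannot tell which */
    }
    free(buf);
    return NULL;
}
```

The buffer is only handed back after a nonnegative return, so the question of whether a terminator was written in the failure case never matters to the caller.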

Does sscanf require a null terminated string as input?

A recently discovered explanation for GTA lengthy load times(1) showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with other scanning functions (scanf(), fscanf()...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to sscanf() with an offset and a %n conversion proved to be a dominant cause for the load time.
My question is should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example does the following code invoke undefined behavior:
int test(void) {
char buf[1] = { '1' };
int v;
sscanf(buf, "%1d", &v);
return v;
}
The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?
(1) references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40
Here are the relevant parts from the C Standard:
7.21.6.7 The sscanf function
Synopsis
#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);
Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.
Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.
The input is specifically referred to as a string, so it should be null terminated.
Although none of the characters in the string beyond the initial prefix that matches the conversion specifier (and potentially the next byte, which helps determine the end of the matching sequence) are used for the conversion, these characters must be followed by a null terminator so that the input is a well-formed string, and it is conforming to call strlen() on it to determine the input length.
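Under that reading, data that is not null terminated has to be copied into a terminated buffer before it is handed to sscanf(). A minimal sketch, with a made-up helper name and an arbitrary 32-byte limit on the field length:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse an int from `len` bytes at `data`, which
 * need not be null terminated. Returns 1 on success, 0 otherwise. */
int parse_int_field(const char *data, size_t len, int *out)
{
    char tmp[32];

    if (len >= sizeof tmp)
        return 0;            /* field too long for this sketch */
    memcpy(tmp, data, len);
    tmp[len] = '\0';         /* tmp is now a well-formed string */
    return sscanf(tmp, "%d", out) == 1;
}
```

Applied to the one-byte buf from the question, parse_int_field(buf, sizeof buf, &v) stores 1 in v without sscanf() ever looking past the copied bytes.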
To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of string to a small size with strnlen() or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach.
In the meantime, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well-formed C string but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out-of-range string representations.
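A sketch of the strtol() approach, assuming whitespace-separated decimal integers (parse_longs is a made-up name). Each call resumes where the previous one stopped, so the string is walked once, and an out-of-range value is reported through errno rather than being undefined:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parses up to `max` whitespace-separated integers
 * from `text` into `out`. Stops at the first token that is not a number
 * or is out of range for long. Returns the number of values stored. */
size_t parse_longs(const char *text, long *out, size_t max)
{
    size_t count = 0;
    const char *p = text;

    while (count < max) {
        char *end;
        errno = 0;
        long v = strtol(p, &end, 10);

        if (end == p)            /* no digits: end of data or garbage */
            break;
        if (errno == ERANGE)     /* value does not fit in a long */
            break;
        out[count++] = v;
        p = end;                 /* resume right after the parsed number */
    }
    return count;
}
```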
When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.
The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it, even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations impose no such needless requirement will block any change to the Standard that would require that they add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via a predefined macro or other such means.

Did the meaning of the n parameter of fgets change over time?

The Apple developer documentation states:
Security Note for fgets: Although the fgets function provides the ability to read a limited amount of data, you must be careful when using it. Like the other functions in the “safer” column, fgets always terminates the string. However, unlike the other functions in that column, it takes a maximum number of bytes to read, not a buffer size.
The last sentence sounds wrong to me. For comparison, here is what POSIX says:
The fgets() function shall read bytes from stream into the array pointed to by s until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. A null byte shall be written immediately after the last byte read into the array.
Here is what an ISO C draft from 2005 says:
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.
The FreeBSD man page says the same as the C standard and POSIX.
This makes me think that the Apple documentation is clearly wrong. The simplest explanation is that Apple didn't know better when they published this article. But although simple, this hypothesis doesn't feel plausible to me.
Are there other reasons that Apple could deviate from the wording of the C standard?
Even early (early 1970s) versions of fgets() specified that n is the buffer size, and that the buffer will be terminated with a '\0'.
Kernighan and Ritchie reflected that correctly in all their books and documentation.
However, a number of authors of introductory texts (who I won't attempt to name, since I'm sure I'll miss some, and all deserve to be equally embarrassed) documented that up to n characters could be written to the buffer, and that the trailing '\0' might be dropped in some cases.
The fgets function reads at most the size minus one bytes from the file. If the wrong value is passed as the buffer size then fgets might write out of bounds.
So the quote from the Apple documentation that you show is correct in that the value is more related to the number of bytes to read from the file. But on the other hand any normal code would use the actual buffer size when calling fgets. And if that number is input from a user then it should be validated before use.
On the other hand, the documentation goes on to state (thanks for the note, Sander De Dycker):
In practical terms, this means that you must always pass a size value that is one fewer than the size of the buffer to leave room for the null termination. If you do not, the fgets function will dutifully terminate the string past the end of your buffer, potentially overwriting whatever byte of data follows it.
And this is wrong. The size argument passed to fgets always includes the string terminator. At least according to the C standard.
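In code, that means passing the full buffer size, typically sizeof buf, with no "minus one". A small sketch (read_line is a made-up name; stripping the newline is an extra convenience, not something fgets does):

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf, whose full capacity
 * (terminator included) is `size`, and strips a trailing newline.
 * Returns 1 if something was read, 0 on end-of-file or error. */
int read_line(FILE *in, char *buf, size_t size)
{
    if (size == 0 || size > INT_MAX)
        return 0;
    /* fgets stores at most size - 1 characters plus the null character,
     * so it never writes past buf[size - 1]. */
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

With char buf[6] and the input line "hello world", successive calls to read_line(fp, buf, sizeof buf) give "hello" and then " worl": five characters each time, with the sixth byte holding the terminator.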

How does the data flow from input stream into the input buffer during scanf() in C?

For example, when I do scanf("%s",arg); : Terminal allows me to input text until a newline is encountered but it only stores up to the first space character inside the arg variable. Rest remains in buffer.
scanf("%c", arg); : In this case also it allows me to enter text into the terminal till I give a newline character, but only one is stored in arg while the rest remains in buffer.
scanf("%[^P]", arg); : In this case, I can enter text into the terminal even after giving it a newline character until I hit a line with 'P' in it and press enter key (newline character) and then transfers everything to the input buffer.
How is it determined how much data from the input stream is to be transferred to the input buffer at a time?
Assuming that arg is of the proper type.
My understanding seems to be fundamentally wrong here. If someone can please explain this stuff, I will be very grateful.
How is it determined? It's determined by the format string itself.
The scanf function will read items until they no longer match the format specifier given. Then it stops, leaving the first "non-compliant" character still in the buffer.
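A small sketch of that rule, using fscanf on an arbitrary stream so it can be tried without a terminal (word_then_rest is a made-up name, and the 16-byte word buffer is an arbitrary choice). The %15s conversion stops at the first whitespace character, and the following fgets picks up whatever was left unread on the line:

```c
#include <stdio.h>

/* Hypothetical helper: reads one whitespace-delimited word with %s, then
 * fetches what is still unread on the same line with fgets.
 * rest_size must be at least 1. Returns 1 if a word was read, else 0. */
int word_then_rest(FILE *in, char word[16], char *rest, int rest_size)
{
    if (fscanf(in, "%15s", word) != 1)
        return 0;
    if (fgets(rest, rest_size, in) == NULL)
        rest[0] = '\0';      /* nothing was left on the stream */
    return 1;
}
```

For the input line "hello world", word receives "hello" and rest receives " world" followed by the newline, space included, because the space was the first character that did not match %s and so stayed in the stream.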
If you mean "how is it handled under the covers?", that's a different issue.
My first response to that is "it doesn't matter". The ISO standard mandates how the language works, and it describes a "virtual machine" capable of doing that. Provided you follow the rules of the machine, you don't need to worry about how things happen under the covers.
My second answer is probably more satisfying but is very implementation dependent.
For efficiency, the underlying software will probably not deliver any data to the implementation until it has a full line (though this of course is likely to be configurable, such as setting raw mode for the terminal). That means things like backspace may change the characters already entered rather than being inserted into the stream.
It may (such as with the GNU readline() library) allow all sorts of really fancy editing on the line before delivering the characters. There's nothing to stop the underlying software from even opening up a vim session to allow you to enter data, and only deliver it once you exit :-)
The buffer and primitive editing features are provided by the operating system.
If you can set the terminal into "raw mode" you will see different behavior.
E.g., characters may be available to read before enter is pressed, especially if the buffer can also be disabled.
I think it is not a matter of how much, but rather of what the format specifier says.
As per C99, chapter 7.19.6.2, paragraph 2, (for fscanf())
The fscanf function reads input from the stream pointed to by stream, under control
of the string pointed to by format that specifies the admissible input sequences and how
they are to be converted for assignment, using subsequent arguments as pointers to the
objects to receive the converted input.
And for the format specifiers, you need to refer to paragraph 12.

Resources