Proper use of fprintf - c

Is this ever acceptable?
fprintf(fp,"Just a string");
or
fprintf(fp,stringvariable);
versus
fprintf(fp,"%s","Just a string");
It seems confusing to me as the string variable (or constant) is used as the formatting versus the output itself. If the string variable had format-specific content ('%s', etc.) then the output would not be as intended.
For string-only output (no formatting) which is better?
fprintf(fp,"%s",stringvariable);
or
fputs(stringvariable,fp);

It is acceptable if you "know" the string variable to be "clean" and you don't care about the warning most modern compilers generate for that construct. It is risky because:
1. If your string contains conversion specifiers "by accident", you are invoking undefined behaviour.
2. If you read that string from somewhere, a malicious attacker could exploit point 1 above to their ends.
It's generally better to use puts() or fputs() as they avoid this problem, and consequently don't generate a warning. (puts() also tosses in an automatic '\n'.)
The *puts() functions also have (marginally) better performance. *printf(), even on nothing more than "%s" as format string, still has to parse that conversion specifier, and count the number of characters printed for its return value.
Thanks to users 'rici' and 'Grady Player' for pointing out the character counting and compiler warning. My C got a bit rusty it seems. ;-)
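As a minimal sketch of the two safe forms discussed above (the helper names write_plain and write_formatted are made up for illustration, not part of any library), both of these treat the string purely as data, so a stray '%' in it is harmless:

```c
#include <stdio.h>

/* Hypothetical helper: writes s to fp verbatim. fputs() does no format
   parsing, so '%' has no special meaning. Returns 0 on success, -1 on error. */
int write_plain(FILE *fp, const char *s)
{
    return fputs(s, fp) == EOF ? -1 : 0;
}

/* Hypothetical helper: same effect via fprintf(). The format string is a
   fixed literal and s is only ever an argument. Returns the number of
   characters written, or a negative value on error. */
int write_formatted(FILE *fp, const char *s)
{
    return fprintf(fp, "%s", s);
}
```

Calling write_plain(stdout, "100%s %n") prints the text unchanged, whereas fprintf(stdout, "100%s %n") would go looking for arguments that were never passed.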

Related

Why is it better to use `%s` to print a string using `printf` rather than printing it directly? [duplicate]

I was reading about vulnerabilities in code and came across this Format-String Vulnerability.
Wikipedia says:
Format string bugs most commonly appear when a programmer wishes to
print a string containing user supplied data. The programmer may
mistakenly write printf(buffer) instead of printf("%s", buffer). The
first version interprets buffer as a format string, and parses any
formatting instructions it may contain. The second version simply
prints a string to the screen, as the programmer intended.
I understand the problem with the printf(buffer) version, but I still don't see how an attacker can use this vulnerability to execute harmful code. Can someone please show, with an example, how this vulnerability can be exploited?
You may be able to exploit a format string vulnerability in many ways, directly or indirectly. Let's use the following as an example (assuming no relevant OS protections, which is very rare anyways):
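#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char text[1024];
    static int some_value = -72;

    strcpy(text, argv[1]); /* ignore the buffer overflow here */

    printf("This is how you print correctly:\n");
    printf("%s", text);

    printf("This is how not to print:\n");
    printf(text); /* deliberately vulnerable: text is used as the format string */

    /* the cast assumes 32-bit addresses, as in the sample runs below */
    printf("some_value # 0x%08x = %d [0x%08x]", (unsigned)(size_t)&some_value, some_value, some_value);
    return 0;
}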
The basis of this vulnerability is the behaviour of functions with variable arguments. A function which implements handling of a variable number of parameters has to read them from the stack, essentially. If we specify a format string that will make printf() expect two integers on the stack, and we provide only one parameter, the second one will have to be something else on the stack. By extension, and if we have control over the format string, we can have the two most fundamental primitives:
Reading from arbitrary memory addresses
[EDIT] IMPORTANT: I'm making some assumptions about the stack frame layout here. You can ignore them if you understand the basic premise behind the vulnerability, and they vary across OS, platform, program and configuration anyways.
It's possible to use the %s format parameter to read data. You can read the data of the original format string in printf(text), hence you can use it to read anything off the stack:
./vulnerable AAAA%08x.%08x.%08x.%08x
This is how you print correctly:
AAAA%08x.%08x.%08x.%08x
This is how not to print:
AAAAXXXXXXXX.XXXXXXXX.XXXXXXXX.41414141
some_value # 0x08049794 = -72 [0xffffffb8]
Writing to arbitrary memory addresses
You can use the %n format specifier to write to an arbitrary address (almost). Again, let's assume our vulnerable program above, and let's try changing the value of some_value, which is located at 0x08049794, as seen above:
./vulnerable $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n
This is how you print correctly:
??%08x.%08x.%08x.%n
This is how not to print:
??XXXXXXXX.XXXXXXXX.XXXXXXXX.
some_value # 0x08049794 = 31 [0x0000001f]
We've overwritten some_value with the number of bytes written before the %n specifier was encountered (man printf). We can use the format string itself, or field width to control this value:
./vulnerable $(printf "\x94\x97\x04\x08")%x%x%x%n
This is how you print correctly:
??%x%x%x%n
This is how not to print:
??XXXXXXXXXXXXXXXXXXXXXXXX
some_value # 0x08049794 = 21 [0x00000015]
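The counting behaviour of %n described above can also be observed in a well-defined setting, where the format string is a literal and a matching int * argument is actually supplied. This is only an illustrative sketch; count_before_n is a made-up helper name, and the output buffer is assumed to be large enough that nothing is truncated:

```c
#include <stdio.h>

/* Hypothetical helper: formats "<prefix><value as 8 hex digits>" into out and
   returns the count that %n stored, i.e. the number of characters produced
   before %n was reached. Assumes out is large enough to avoid truncation. */
int count_before_n(char *out, size_t outsz, const char *prefix, unsigned value)
{
    int written = -1;

    /* The format is a constant; %n consumes the int * argument we pass. */
    snprintf(out, outsz, "%s%08x%n", prefix, value, &written);
    return written;
}
```

With prefix "AAAA" and value 0x41414141 the helper returns 12: four prefix characters plus eight hex digits. In the exploit, the same mechanism runs with a pointer the attacker chose instead of &written.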
There are many possibilities and tricks to try (direct parameter access, large field width making wrap-around possible, building your own primitives), and this just touches the tip of the iceberg. I would suggest reading more articles on fmt string vulnerabilities (Phrack has some mostly excellent ones, although they may be a little advanced) or a book which touches on the subject.
Disclaimer: the examples are taken [although not verbatim] from the book Hacking: The Art of Exploitation (2nd ed.) by Jon Erickson.
It is interesting that no-one has mentioned the n$ notation supported by POSIX. If you can control the format string as the attacker, you can use notations such as:
"%200$p"
to read the 200th item on the stack (if there is one). The intention is that you should list all the n$ numbers from 1 to the maximum, and it provides a way of resequencing how the parameters appear in a format string, which is handy when dealing with I18N (L10N, G11N, M18N*).
However, some (probably most) systems are somewhat lackadaisical about how they validate the n$ values and this can lead to abuse by attackers who can control the format string. Combined with the %n format specifier, this can lead to writing at pointer locations.
* The acronyms I18N, L10N, G11N and M18N are for internationalization, localization, globalization, and multinationalization respectively. The number represents the number of omitted letters.
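For reference, this is roughly what the intended, legitimate use of the n$ notation looks like. It is a sketch that assumes a POSIX-conforming printf family (the notation is not part of ISO C), and format_swapped is a made-up helper name:

```c
#include <stdio.h>

/* Hypothetical helper: emits its two string arguments in swapped order using
   the POSIX n$ notation. Every argument from 1 to the maximum is referenced,
   as the notation requires. Returns snprintf's result. */
int format_swapped(char *out, size_t n, const char *first, const char *second)
{
    return snprintf(out, n, "%2$s %1$s", first, second);
}
```

A translator can reorder the output this way without touching the argument list. The abuse described above comes from an attacker supplying an index such as 200 for which no argument was ever passed.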
Ah, the answer is in the article!
Uncontrolled format string is a type of software vulnerability, discovered around 1999, that can be used in security exploits. Previously thought harmless, format string exploits can be used to crash a program or to execute harmful code.
A typical exploit uses a combination of these techniques to force a program to overwrite the address of a library function or the return address on the stack with a pointer to some malicious shellcode. The padding parameters to format specifiers are used to control the number of bytes output and the %x token is used to pop bytes from the stack until the beginning of the format string itself is reached. The start of the format string is crafted to contain the address that the %n format token can then overwrite with the address of the malicious code to execute.
This is because %n causes printf to write data to a variable, which is on the stack. But that means it could write to something arbitrarily. All you need is for someone to use that variable (it's relatively easy if it happens to be a function pointer, whose value you just figured out how to control) and they can make you execute anything arbitrarily.
Take a look at the links in the article; they look interesting.
I would recommend reading this lecture note about format string vulnerabilities.
It describes in detail what happens and how, and has some images that might help you to understand the topic.
AFAIK it's mainly because it can crash your program, which is considered to be a denial-of-service attack. All you need is to give an invalid address (practically anything with a few %s's is guaranteed to work), and it becomes a simple denial-of-service (DoS) attack.
Now, it's theoretically possible for that to trigger anything in the case of an exception/signal/interrupt handler, but figuring out how to do that is beyond me -- you need to figure out how to write arbitrary data to memory as well.
But why does anyone care if the program crashes, you might ask? Doesn't that just inconvenience the user (who deserves it anyway)?
The problem is that some programs are accessed by multiple users, so crashing them has a non-negligible cost. Or sometimes they're critical to the running of the system (or maybe they're in the middle of doing something very critical), in which case this can be damaging to your data. Of course, if you crash Notepad then no one might care, but if you crash CSRSS (which I believe actually had a similar kind of bug -- a double-free bug, specifically) then yeah, the entire system is going down with you.
Update:
See this link for the CSRSS bug I was referring to.
Edit:
Take note that reading arbitrary data can be just as dangerous as executing arbitrary code! If you read a password, a cookie, etc. then it's just as serious as an arbitrary code execution -- and this is trivial if you just have enough time to try enough format strings.

Comparison between the two printf statements

Please take a look at the following two C statements:
printf("a very long string");
printf("%s","a very long string");
They produce the same result, but there is definitely some difference under the hood. What is the difference, and which one is better? Please share your ideas!
If you know what the string contents are, you should use the first form because it is more compact. If the string you want to print can come from the user or from any other source such that you do not know what the string contents are, you must use the second form; otherwise, your code will be wide open to format string injection attacks.
The first printf works like this
'a' is not a special character: print it
' ' is not a special character: print it
'v' is not a special character: print it
...
'g' is not a special character: print it
The second printf works like this
'%' is a special character:
's' print the contents of the string pointed to by the 2nd parameter
The first one passes one parameter and the second passes 2, so the call is slightly faster in the first one.
But in the first one, printf() has to scan the long string for format specifications and in the second one, the format string is very short, so the actual processing is probably faster in the second one.
More important (to me anyway) is that "a very long string" is not likely to be a constant string as it is in this example. If you're printf'ing a long string, you're probably using a pointer to something that the program generated. In that case, it's a MUCH better idea to use the second form because otherwise somewhere, somehow, sometime, the long string will contain a printf format specification, and that will cause printf to go looking for another argument and your program will crash. This exact problem just happened to me about a week ago in code that we have been using for nearly 20 years.
The bottom line is that your printf format specification should always be a constant string. If you need to output a variable, use printf("%s",var) or better yet, fputs(var, stdout).
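The difference between "string as format" and "string as data" can be made visible with a harmless case. The sketch below uses two made-up helpers and snprintf so the result can be inspected; the text containing "%%" is interpreted when it is the format, and copied unchanged when it is an argument:

```c
#include <stdio.h>

/* Hypothetical helper: uses a fixed literal as the format string.
   "%%" is a format escape, so a single '%' is emitted. */
int render_literal_format(char *out, size_t n)
{
    return snprintf(out, n, "100%% done");
}

/* Hypothetical helper: treats s purely as data behind a constant "%s"
   format, so every byte of s, including any '%', is copied unchanged. */
int render_as_data(char *out, size_t n, const char *s)
{
    return snprintf(out, n, "%s", s);
}
```

render_literal_format produces the 9 characters "100% done", while render_as_data called with the same text produces all 10 characters, both percent signs included. With "%s" or "%n" in place of "%%", the first form would read arguments that do not exist.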
The first is no less efficient than the second. Since there are no format sequences and no corresponding arguments, printf() has no conversion work to do beyond copying the characters. In the second case, if the compiler isn't smart enough to catch this, you will be calling for unnecessary work (note: minuscule compared to actually sending (and reading!) the output at the terminal).
printf was designed for printing with formatting. It is more useful to provide formatting arguments for the sake of debugging although they aren't required.
%s takes a value of a const char* whereas leaving no argument just prints the literal expression.
You could still cast a different pointer to the const char* explicitly and change its contents without changing the output expression.
First of all you should define "better" better, since it is not smart enough by itself. Better in what way? Performance, maintenance, readability, extensibility ...
With the one line of code presented I would choose option 1 for almost all versions of 'better'
It's more readable
It does what it should do and nothing more (KISS principle)
It's faster (no pointless moving memory around to stuff one string into another). But unless you are doing this printf a hell of a lot of times in a loop, this is not that big a plus.

Is sscanf considered safe to use?

I have vague memories of suggestions that sscanf was bad. I know it won't overflow buffers if I use the field width specifier, so is my memory just playing tricks with me?
I think it depends on how you're using it: If you're scanning for something like int, it's fine. If you're scanning for a string, it's not (unless there was a width field I'm forgetting?).
Edit:
It's not always safe for scanning strings.
If your buffer size is a constant, then you can certainly specify it as something like %20s. But if it's not a constant, you need to specify it in the format string, and you'd need to do:
char format[80]; // Make sure this is big enough... kinda painful
sprintf(format, "%%%ds", cchBuffer - 1); // Don't miss the percent signs and - 1!
sscanf(input, format, buffer); // buffer is the destination holding cchBuffer chars. Good luck
which is possible but very easy to get wrong, like I did in my previous edit (forgot to take care of the null-terminator). You might even overflow the format string buffer.
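One way to package the approach above so the easy mistakes are checked in a single place is sketched below. scan_word is a made-up helper name, and the 32-byte size of the format buffer is an arbitrary assumption:

```c
#include <stdio.h>

/* Hypothetical helper: reads one whitespace-delimited word from input into
   buf, whose capacity is bufsz bytes including the terminator.
   Returns 1 on success, 0 otherwise. */
int scan_word(const char *input, char *buf, size_t bufsz)
{
    char format[32];
    int len;

    if (bufsz < 2)
        return 0;   /* no room for one character plus '\0' */

    /* Builds e.g. "%15s" for bufsz == 16; "%%" emits a literal '%'. */
    len = snprintf(format, sizeof format, "%%%zus", bufsz - 1);
    if (len < 0 || (size_t)len >= sizeof format)
        return 0;   /* the format string itself would not fit */

    return sscanf(input, format, buf) == 1;
}
```

Using snprintf for the format buffer addresses the "you might even overflow the format string buffer" concern, and the - 1 for the terminator lives in exactly one spot.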
The reason why sscanf might be considered bad is that it doesn't require you to specify a maximum string width for string arguments, which could result in overflows if the input read from the source string is longer. So the precise answer is: it is safe if you specify widths properly in the format string, otherwise not.
Note that as long as your buffers are at least as long as strlen(input_string)+1, there is no way the %s or %[ specifiers can overflow. You can also use field widths in the specifiers if you want to enforce stricter limits, or you can use %*s and %*[ to suppress assignment and instead use %n before and after to get the offsets in the original string, and then use those to read the resulting sub-string in-place from the input string.
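The %*s plus %n technique mentioned above might look like the following sketch. first_word_span is a made-up name; it reports where the first word sits inside the input without copying anything:

```c
#include <stdio.h>

/* Hypothetical helper: finds the first whitespace-delimited word in s and
   stores its [start, end) byte offsets. Returns 1 if a word was found. */
int first_word_span(const char *s, int *start, int *end)
{
    int b = -1, e = -1;

    /* " " skips leading whitespace, the first %n records the offset, %*s
       skips one word without storing it, the second %n records the offset
       just past that word. %n does not count toward sscanf's return value. */
    (void)sscanf(s, " %n%*s%n", &b, &e);

    if (e < 0)
        return 0;   /* no word: the second %n was never reached */

    *start = b;
    *end = e;
    return 1;
}
```

Since no characters are stored, there is no destination buffer to overflow; the caller can then process s[start] up to s[end] in place.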
Yes, it is, if you specify the string width so that there are no buffer-overflow-related problems.
Anyway, as @Mehrdad showed us, there can be problems if the buffer size isn't established at compile time. I suppose that putting a limit on the length of a string that can be supplied to sscanf could eliminate the problem.
All of the scanf functions have fundamental design flaws, only some of which could be fixed. They should not be used in production code.
Numeric conversion has full-on demons-fly-out-of-your-nose undefined behavior if a value overflows the representable range of the variable you're storing the value in. I am not making this up. The C library is allowed to crash your program just because somebody typed too many input digits. Even if it doesn't crash, it's not obliged to do anything sensible. There is no workaround.
As pointed out in several other answers, %s is just as dangerous as the infamous gets. It's possible to avoid this by using either the 'm' modifier, or a field width, but you have to remember to do that for every single text field you want to convert, and you have to wire the field widths into the format string -- you can't pass sizeof(buff) as an argument.
If the input does not exactly match the format string, sscanf doesn't tell you how many characters into the input buffer it got before it gave up. This means the only practical error-recovery policy is to discard the entire input buffer. This can be OK if you are processing a file that's a simple linear array of records of some sort (e.g. with a CSV file, "skip the malformed line and go on to the next one" is a sensible error recovery policy), but if the input has any more structure than that, you're hosed.
In C, parse jobs that aren't complicated enough to justify using lex and yacc are generally best done either with POSIX regexps (regex.h) or with hand-rolled string parsing. The strto* numeric conversion functions do have well-specified and useful behavior on overflow and do tell you how many characters of input they consumed, and string.h has lots of handy functions for hand-rolled parsers (strchr, strcspn, strsep, etc).
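As a rough sketch of the strto* approach, the made-up helper parse_long below wraps strtol and reports both overflow and how far parsing got. The return codes are an arbitrary convention chosen for this example:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parses a base-10 long from s.
   Returns 0 on success (value in *out, unparsed remainder in *rest),
   -1 if no digits were found, -2 if the value is out of range for long. */
int parse_long(const char *s, long *out, const char **rest)
{
    char *end;
    long value;

    errno = 0;                    /* strtol only sets errno on failure */
    value = strtol(s, &end, 10);

    if (end == s)
        return -1;                /* nothing was consumed */
    if (errno == ERANGE)
        return -2;                /* overflow or underflow, well defined */

    *out = value;
    *rest = end;                  /* first character not consumed */
    return 0;
}
```

Unlike the scanf family, an over-long digit string here yields a defined error code, and *rest lets a hand-rolled parser continue from the exact point where the number ended.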
There are 2 points to take care of.
The output buffer[s].
As mentioned by others, if you specify a field width in the format string that is strictly smaller than the output buffer size (leaving room for the null terminator), you are safe.
The input buffer.
Here you need to make sure that it is a null-terminated string, or that you will not read more than the input buffer size.
If the input string is not null-terminated, sscanf may read past the boundary of the buffer and crash if the memory is not allocated.

if one complains about gets(), why not do the same with scanf("%s",...)?

From man gets:
Never use gets(). Because it is
impossible to tell without knowing the
data in advance how many
characters gets() will read, and
because gets() will continue to store
characters past the end of the buffer,
it is extremely dangerous to use.
It has been used to break computer
security. Use fgets() instead.
Almost everywhere I see scanf being used in a way that should have the same problem (buffer overflow/buffer overrun): scanf("%s",string). Does the problem exist in this case? Why are there no references to it in the scanf man page? Why does gcc not warn when compiling this with -Wall?
ps: I know that there is a way to specify in the format string the maximum length of the string with scanf:
char str[10];
scanf("%9s",str);
edit: I am not asking to determine whether the preceding code is right or not. My question is: if scanf("%s",string) is always wrong, why are there no warnings and why is there nothing about it in the man page?
The answer is simply that no-one has written the code in GCC to produce that warning.
As you point out, a warning for the specific case of "%s" (with no field width) is quite appropriate.
However, bear in mind that this applies only to scanf(), vscanf(), fscanf() and vfscanf(). This format specifier can be perfectly safe with sscanf() and vsscanf(), so the warning should not be issued in that case. This means that you cannot simply add it to the existing "scanf-style-format-string" analysis code; you will have to split that into "fscanf-style-format-string" and "sscanf-style-format-string" options.
I'm sure if you produce a patch for the latest version of GCC it stands a good chance of being accepted (and of course, you will need to submit patches for the glibc header files too).
Using gets() is never safe. scanf() can be used safely, as you said in your question. However, determining if you're using it safely is a more difficult problem for the compiler to work out (e.g. if you're calling scanf() in a function where you pass in the buffer and a character count as arguments, it won't be able to tell); in that case, it has to assume that you know what you're doing.
When the compiler looks at the formatting string of scanf, it sees a string! That's assuming the formatting string is not entered at run-time. Some compilers like GCC have some extra functionality to analyze the formatting string if entered at compile time. That extra functionality is not comprehensive, because in some situations a run-time overhead is needed which is a NO NO for languages like C. For example, can you detect an unsafe usage without inserting some extra hidden code in this case:
char *str;
size_t size;
scanf("%zu", &size);
str = malloc(size);
scanf("%9s", str); // how can the compiler determine if this is a safe call?!
Of course, there are ways to write safe code with scanf if you specify the number of characters to read, and that there is enough memory to hold the string. In the case of gets, there is no way to specify the number of characters to read.
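For whole lines, the fgets() route recommended by the gets man page is usually the least error-prone, because the size is passed as an ordinary argument rather than embedded in a format string. A minimal sketch, with read_line as a made-up helper name:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads at most n - 1 characters of one line from fp
   into buf and strips the trailing newline, if present.
   Returns 1 on success, 0 on end of file or error. */
int read_line(FILE *fp, char *buf, int n)
{
    if (fgets(buf, n, fp) == NULL)
        return 0;

    /* strcspn returns the index of the first '\n', or the string length
       if there is none, so this is safe for truncated lines too. */
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

A line longer than the buffer is simply split across successive calls instead of overrunning buf, which is exactly the guarantee gets() cannot give.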
I am not sure why the man page for scanf doesn't mention the possibility of a buffer overrun, but vanilla scanf is not a secure option. A rather dated link - Link shows this as the case. Also, check this (not gcc but informative nevertheless) - Link
It may be simply that scanf will allocate space on the heap based on how much data is read in. Since it doesn't allocate the buffer and then read until the null character is read, it doesn't risk overwriting the buffer. Instead, it reads into its own buffer until the null character is found, and presumably copies that buffer into another of the correct size at the end of the read.

Resources