printf and %s in C

I understand that this code segment is supposed to have a buffer overflow vulnerability problem:
printf("File %s", my_file_name);
printf("File %s");
However, I don't get exactly why it is considered risky. Would anyone be able to shed some light on this?

The code below outputs the contents of my_file_name to the standard output:
printf("File %s", my_file_name);
If my_file_name is received from a malicious source and the program outputs to the terminal, it is possible for the malicious source to have put escape sequences in my_file_name that tell the terminal to perform non-trivial tasks such as sending terminal contents back through standard input. It is difficult but conceivable that an attacker may derive useful information from such an attack or even attempt to corrupt data by executing commands as if they were typed by the user.
Of course the second call invokes undefined behavior as you are not passing a valid string pointer as a second argument to printf.
The above scenario is probably not what you are referring to by buffer overflow vulnerability. There is no such vulnerability in the printf code itself, but if a buffer overflow flaw exists somewhere else in your code, and the actual format string can be patched via this overflow, an attacker can take advantage of printf's capabilities, especially the %n format, to poke any value to almost any location in the program's memory. This is the rationale for removing %n in printf_s, as explained in a Microsoft security paper.

The first call is fine. (Issues exist when you use a user provided format string, as in printf(s), where s is under the influence of the user. Here you use a hard-coded format string "File %s", which is not vulnerable. The contents of the string my_file_name will be treated as a regular C string and just be copied to the standard output. Of course it must be null terminated, and if the output is redirected to something else there can be side effects there, but that's not a printf issue.)
The second call is simply undefined behavior, because the number of parameters after the format string (0) does not match the number which the format string demands (1).

Related

Why is it better to use `%s` to print a string using `printf` rather than printing it directly? [duplicate]

I was reading about vulnerabilities in code and came across this Format-String Vulnerability.
Wikipedia says:
Format string bugs most commonly appear when a programmer wishes to
print a string containing user supplied data. The programmer may
mistakenly write printf(buffer) instead of printf("%s", buffer). The
first version interprets buffer as a format string, and parses any
formatting instructions it may contain. The second version simply
prints a string to the screen, as the programmer intended.
I got the problem with printf(buffer) version, but I still didn't get how this vulnerability can be used by attacker to execute harmful code. Can someone please tell me how this vulnerability can be exploited by an example?
You may be able to exploit a format string vulnerability in many ways, directly or indirectly. Let's use the following as an example (assuming no relevant OS protections, which is very rare anyways):
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char text[1024];
    static int some_value = -72;

    if (argc < 2)
        return 1; /* nothing to print */
    strcpy(text, argv[1]); /* ignore the buffer overflow here */
    printf("This is how you print correctly:\n");
    printf("%s", text);
    printf("This is how not to print:\n");
    printf(text);
    /* printing a pointer with %08x assumes a 32-bit target, as in this example */
    printf("some_value # 0x%08x = %d [0x%08x]", &some_value, some_value, some_value);
    return(0);
}
The basis of this vulnerability is the behaviour of functions with variable arguments. A function which implements handling of a variable number of parameters has to read them from the stack, essentially. If we specify a format string that will make printf() expect two integers on the stack, and we provide only one parameter, the second one will have to be something else on the stack. By extension, and if we have control over the format string, we can have the two most fundamental primitives:
Reading from arbitrary memory addresses
[EDIT] IMPORTANT: I'm making some assumptions about the stack frame layout here. You can ignore them if you understand the basic premise behind the vulnerability, and they vary across OS, platform, program and configuration anyways.
It's possible to use format parameters such as %x (and %s) to read data. The text of the original format string passed to printf(text) is itself on the stack, so you can read it back, and by extension you can read anything off the stack:
./vulnerable AAAA%08x.%08x.%08x.%08x
This is how you print correctly:
AAAA%08x.%08x.%08x.%08x
This is how not to print:
AAAAXXXXXXXX.XXXXXXXX.XXXXXXXX.41414141
some_value # 0x08049794 = -72 [0xffffffb8]
Writing to arbitrary memory addresses
You can use the %n format specifier to write to an arbitrary address (almost). Again, let's assume our vulnerable program above, and let's try changing the value of some_value, which is located at 0x08049794, as seen above:
./vulnerable $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n
This is how you print correctly:
??%08x.%08x.%08x.%n
This is how not to print:
??XXXXXXXX.XXXXXXXX.XXXXXXXX.
some_value # 0x08049794 = 31 [0x0000001f]
We've overwritten some_value with the number of bytes written before the %n specifier was encountered (man printf). We can use the format string itself, or field width to control this value:
./vulnerable $(printf "\x94\x97\x04\x08")%x%x%x%n
This is how you print correctly:
??%x%x%x%n
This is how not to print:
??XXXXXXXXXXXXXXXXXXXXXXXX
some_value # 0x08049794 = 21 [0x00000015]
There are many possibilities and tricks to try (direct parameter access, large field width making wrap-around possible, building your own primitives), and this just touches the tip of the iceberg. I would suggest reading more articles on fmt string vulnerabilities (Phrack has some mostly excellent ones, although they may be a little advanced) or a book which touches on the subject.
Disclaimer: the examples are taken [although not verbatim] from the book Hacking: The art of exploitation (2nd ed) by Jon Erickson.
It is interesting that no-one has mentioned the n$ notation supported by POSIX. If you can control the format string as the attacker, you can use notations such as:
"%200$p"
to read the 200th item on the stack (if there is one). The intention is that you should list all the n$ numbers from 1 to the maximum, and it provides a way of resequencing how the parameters appear in a format string, which is handy when dealing with I18N (L10N, G11N, M18N*).
However, some (probably most) systems are somewhat lackadaisical about how they validate the n$ values and this can lead to abuse by attackers who can control the format string. Combined with the %n format specifier, this can lead to writing at pointer locations.
* The acronyms I18N, L10N, G11N and M18N are for internationalization, localization, globalization, and multinationalization respectively. The number represents the number of omitted letters.
Ah, the answer is in the article!
Uncontrolled format string is a type of software vulnerability, discovered around 1999, that can be used in security exploits. Previously thought harmless, format string exploits can be used to crash a program or to execute harmful code.
A typical exploit uses a combination of these techniques to force a program to overwrite the address of a library function or the return address on the stack with a pointer to some malicious shellcode. The padding parameters to format specifiers are used to control the number of bytes output and the %x token is used to pop bytes from the stack until the beginning of the format string itself is reached. The start of the format string is crafted to contain the address that the %n format token can then overwrite with the address of the malicious code to execute.
This is because %n causes printf to write data to a variable, which is on the stack. But that means it could write to something arbitrarily. All you need is for someone to use that variable (it's relatively easy if it happens to be a function pointer, whose value you just figured out how to control) and they can make you execute anything arbitrarily.
Take a look at the links in the article; they look interesting.
I would recommend reading this lecture note about format string vulnerabilities.
It describes in detail what happens and how, and has some images that might help you to understand the topic.
AFAIK it's mainly because it can crash your program, which is considered to be a denial-of-service attack. All you need is to give an invalid address (practically anything with a few %s's is guaranteed to work), and it becomes a simple denial-of-service (DoS) attack.
Now, it's theoretically possible for that to trigger anything in the case of an exception/signal/interrupt handler, but figuring out how to do that is beyond me -- you need to figure out how to write arbitrary data to memory as well.
But why does anyone care if the program crashes, you might ask? Doesn't that just inconvenience the user (who deserves it anyway)?
The problem is that some programs are accessed by multiple users, so crashing them has a non-negligible cost. Or sometimes they're critical to the running of the system (or maybe they're in the middle of doing something very critical), in which case this can be damaging to your data. Of course, if you crash Notepad then no one might care, but if you crash CSRSS (which I believe actually had a similar kind of bug -- a double-free bug, specifically) then yeah, the entire system is going down with you.
Update:
See this link for the CSRSS bug I was referring to.
Edit:
Take note that reading arbitrary data can be just as dangerous as executing arbitrary code! If you read a password, a cookie, etc. then it's just as serious as an arbitrary code execution -- and this is trivial if you just have enough time to try enough format strings.

How does the data flow from input stream into the input buffer during scanf() in C?

For example, when I do scanf("%s",arg); : Terminal allows me to input text until a newline is encountered but it only stores up to the first space character inside the arg variable. Rest remains in buffer.
scanf("%c", arg); : In this case also it allows me to enter text into the terminal till I give a newline character, but only one is stored in arg while the rest remains in buffer.
scanf("%[^P]", arg); : In this case, I can enter text into the terminal even after giving it a newline character until I hit a line with 'P' in it and press enter key (newline character) and then transfers everything to the input buffer.
How is it determined how much data from the input stream is to be transferred to the input buffer at a time?
Assuming that arg is of the proper type.
My understanding seems to be fundamentally wrong here. If someone can please explain this stuff, I will be very grateful.
How is it determined? It's determined by the format string itself.
The scanf function will read items until they no longer match the format specifier given. Then it stops, leaving the first "non-compliant" character still in the buffer.
If you mean "how is it handled under the covers?", that's a different issue.
My first response to that is "it doesn't matter". The ISO standard mandates how the language works, and it describes a "virtual machine" capable of doing that. Provided you follow the rules of the machine, you don't need to worry about how things happen under the covers.
My second answer is probably more satisfying but is very implementation dependent.
For efficiency, the underlying software will probably not deliver any data to the implementation until it has a full line (though this of course is likely to be configurable, such as setting raw mode for the terminal). That means things like backspace may change the characters already entered rather than being inserted into the stream.
It may (such as with the GNU readline() library) allow all sorts of really fancy editing on the line before delivering the characters. There's nothing to stop the underlying software from even opening up a vim session to allow you to enter data, and only deliver it once you exit :-)
The buffer and primitive editing features are provided by the operating system.
If you can set the terminal into "raw mode" you will see different behavior.
For example, characters may be available to read before Enter is pressed, especially if the buffer can also be disabled.
I think it is not a question of how much data is transferred, but rather of what the format specifier says.
As per C99, chapter 7.19.6.2, paragraph 2, (for fscanf())
The fscanf function reads input from the stream pointed to by stream, under control
of the string pointed to by format that specifies the admissible input sequences and how
they are to be converted for assignment, using subsequent arguments as pointers to the
objects to receive the converted input.
And for the format specifiers, you need to refer to paragraph 12.

scanf Cppcheck warning

Cppcheck shows the following warning for scanf:
Message: scanf without field width limits can crash with huge input data. To fix this error message add a field width specifier:
%s => %20s
%i => %3i
Sample program that can crash:
#include <stdio.h>
int main()
{
    int a;
    scanf("%i", &a);
    return 0;
}
To make it crash:
perl -e 'print "5"x2100000' | ./a.out
I cannot crash this program typing "huge input data". What exactly should I type to get this crash? I also don't understand the meaning of the last line in this warning:
perl -e ...
The last line is an example command to run to demonstrate the crash with the sample program. It essentially causes perl to print "5" 2,100,000 times and then pass this to the stdin of the program "a.out" (which is meant to be the compiled sample program).
First of all, scanf() should be used for testing only, not in real-world programs, because of several issues it won't handle gracefully. For example, if you ask for "%i" and the user inputs "12345abc", the "abc" will stay in stdin and might cause following inputs to be filled without a chance for the user to change them.
Regarding this issue: scanf() knows it should read an integer value, but it doesn't know how long the input can be. The pointer could point to a 16-bit integer, a 32-bit integer, a 64-bit integer or something even bigger (which it isn't aware of). Functions with a variable number of arguments (defined with ...) don't know the exact data type of the elements passed, so scanf() has to rely on the format string (which is why the format tags are not optional, unlike in C# where you just number them, e.g. "{0} {1} {2}"). And without a given length it has to assume some length, which might be platform dependent as well (making the function even more unsafe to use).
In general, consider it possibly harmful and a starting point for buffer overflow attacks. If you'd like to secure and optimize your program, start by replacing it with alternatives.
I tried running the perl expression against the C program and it did crash here on Linux (segmentation fault).
Using the scanf function (or fscanf and sscanf) in real-world applications is usually not recommended at all, because it is not safe and it is often an opening for a buffer overrun if incorrect input data is supplied.
There are much more secure ways to input numbers in many commonly used libraries for C++ (Qt, runtime libraries for Microsoft Visual C++ etc.). You can probably find secure alternatives for the "pure" C language too.

will 'printf' always do its job?

printf("/*something else*/"); /*note that:without using \n in printf*/
I know printf() uses a buffer, and that the buffer's contents are printed when the line-buffering code sees "\n". So when we forget to use "\n" in printf(), the line buffer will, on rare occasions, not be emptied. Therefore, printf() won't do its job. Am I wrong?
The example you gave above is safe as there are no variable arguments to printf. However it is possible to specify a format string and supply variables that do not match up with the format, which can deliver unexpected (and unsafe) results. Some compilers are taking a more proactive approach with printf use case analysis, but even then one should be very, very careful when printf is used.
From my man page:
These functions return the number of characters printed (not including
the trailing \0 used to end output to strings) or a negative value
if an output error occurs, except for snprintf() and vsnprintf(), which
return the number of characters that would have been printed if the n
were unlimited (again, not including the final \0).
So, it sounds like they can fail and report it with a negative return value.
Yes, output to stdout in C (using printf) is normally line buffered when stdout is connected to a terminal. This means that printf() will collect output until either:
the buffer is full, or
the output contains a \n newline
If you want to force the output of the buffer, call fflush(stdout). This will work even if you have printed something without a newline.
Also printf and friends can fail.
Common implementations of C call malloc() in the printf family of the standard C library.
malloc can fail, and then so will printf. In UNIX the underlying write() call can be interrupted by a signal and fail with EINTR. Windows can and will do similar things.
And... Although you do not see it posted here often you should always check the return code from any system or library function that returns a value.
Like that, no. It won't always work as you expect, especially if you're using user input as the format string. If the first argument has %s or %d or other format specifiers in it, they will be parsed and replaced with values from the stack, which can easily break if it's expecting a pointer and gets an int instead.
This way tends to be a lot safer:
printf("%s", "....");
The output buffer will be flushed before exit, or before you get input, so the data will make it regardless of whether you send a \n.
printf could fail for any number of reasons. If you're deep in recursion, calling printf may blow your stack. The C and C++ standards have little to say on threading issues and calling printf while printf is executing in another thread may fail. It could fail because stdout is attached to a file and you just filled your filesystem, in which case the return value tells you there was a problem. If you call printf with a string that isn't zero terminated then bad things could happen. And printf can apparently fail if you're using buffered I/O and your buffer hasn't been flushed yet.
