Replace deprecated gets() - c

I am using the SLM toolkit by CMU-Cambridge for some baseline language modeling on language data, but when I run one of the built executables, my system detects a buffer overflow when it tries to execute one of the commands.
Based on this StackOverflow question I noticed that __gets_chk+0x179 caused the problem, and I've found two occurrences of gets/fgets in the source code (evallm.c, also available in this GitHub project someone made) but I do not know how to fix them in a proper/secure way.
The relevant parts of the error message:
*** buffer overflow detected ***: /home/CMU-Cam_Toolkit_v2/bin/evallm terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__gets_chk+0x179)[0x7f613bc719e9]
Aborted
The broken code
# declaration of input string variable
char input_string[500];
# occurence 1
...
while (fgets (wlist_entry, sizeof (wlist_entry),context_cues_fp)) { ... }
...
# occurence 2
...
while (!feof(stdin) && !told_to_quit) {
printf("evallm : ");
gets(input_string);
....
The error basically occurred when the input_string I gave to the evallm command was too long. Normally it is called from the command line and you pass arguments interactively. However, I piped all arguments together with the command (as seen in the example in the docs), and apparently my argument names sometimes took up too many bytes. When I changed the array length of input_string from 500 to 2000 the problem was solved (so I guess the error was due to occurrence 2). But I would really like to fix it by replacing gets() with getline(), since that seems to be the right way to go. Or is replacing it with fgets() also a solution? If so, what parameters should I use?
However, when trying to replace gets(), I always get compilation errors. I'm not a C programmer (Python, Java) and I'm not familiar with the syntax of getline(), so I'm struggling to find the right parameters.

In your particular case, you know that input_string is an array of 500 bytes. (Of course, you could replace that 500 with e.g. 2048)
I am paranoid and a proponent of defensive programming, so I would zero that buffer before any input, e.g.
memset(input_string, 0, sizeof(input_string));
So the buffer is cleared even when fgets has failed. In most cases that is, in principle, useless. But there are corner cases, and the devil is in the details.
So read documentation of fgets(3) and replace the gets call with
fgets(input_string, sizeof(input_string), stdin);
(you actually should handle corner cases, e.g. failure of fgets and input line longer than input_string ....)
Of course, you may want to zero the terminating newline. For that, add
size_t input_len = strlen(input_string);
if (input_len > 0 && input_string[input_len-1] == '\n') input_string[input_len-1] = '\0';
(as commented, you might clear the input_string less often, e.g. at start and on fgets failure)
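The corner cases mentioned above (fgets failing, or a line longer than the buffer) can be handled roughly as follows. This is only a sketch: read_line is a hypothetical helper name, not something from the toolkit, and discarding the excess of an over-long line is just one possible policy.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line from fp into buf (capacity cap).
 * Returns 1 on success, 0 on end of file or read error.
 * The trailing newline is removed. If the line did not fit, the excess
 * characters are read and thrown away so the next call starts on a
 * fresh line. */
int read_line(char *buf, size_t cap, FILE *fp)
{
    if (cap == 0 || fgets(buf, (int)cap, fp) == NULL) {
        if (cap > 0)
            buf[0] = '\0';          /* leave a valid empty string on failure */
        return 0;
    }
    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';        /* complete line: strip the newline */
    } else {
        int ch;                     /* truncated line, or last line without '\n' */
        while ((ch = fgetc(fp)) != EOF && ch != '\n')
            ;                       /* discard the rest of the line */
    }
    return 1;
}
```

In the evallm loop this would presumably be used as `while (!told_to_quit && read_line(input_string, sizeof input_string, stdin)) { ... }`, which also avoids testing feof() before reading.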
Notice that getline(3) is POSIX specific and manages a heap-allocated buffer. Read about C dynamic memory allocation. If you are unfamiliar with C programming, that might be tricky for you. BTW, you could even consider using readline(3) from the GNU Readline library.
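For comparison, a getline-based version could look like the sketch below. It assumes a POSIX.1-2008 system; read_line_alloc is a made-up name, and allocating a fresh buffer per line is chosen for simplicity rather than speed.

```c
#define _POSIX_C_SOURCE 200809L   /* expose getline() on strict compilers */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: returns the next line of fp in a malloc'ed buffer,
 * without its trailing newline, or NULL on end of file or error.
 * The caller must free() the result. */
char *read_line_alloc(FILE *fp)
{
    char *line = NULL;   /* NULL + 0 tells getline to allocate for us */
    size_t cap = 0;
    ssize_t len = getline(&line, &cap, fp);

    if (len < 0) {
        free(line);      /* getline may have allocated even on failure */
        return NULL;
    }
    if (len > 0 && line[len - 1] == '\n')
        line[len - 1] = '\0';
    return line;
}
```

The buffer grows to whatever the input needs, so there is no fixed 500-byte limit to overflow, at the price of having to free every line that is returned.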
The main point is your familiarity with C programming.
NB: in C, # does not start a comment, but a preprocessor directive.

You replace gets with fgets.
It's almost that simple. The difference (besides the arguments) is that with fgets there might be a newline at the end of the buffer. (Note that I say it might be there: it is missing when the line was longer than the buffer or when the input ended without one.)
I recommend this fgets reference.
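Because the newline may or may not be present, a common idiom is to cut the string at the first newline with strcspn, which does nothing harmful when there is none. A small sketch (chomp is a hypothetical name):

```c
#include <string.h>

/* Hypothetical helper: truncates s at the first '\n', if there is one.
 * strcspn returns the length of s when no newline is found, so in that
 * case the existing terminator is simply overwritten with itself. */
void chomp(char *s)
{
    s[strcspn(s, "\n")] = '\0';
}
```

After `fgets(input_string, sizeof input_string, stdin)` succeeds, calling `chomp(input_string)` leaves the same contents gets() would have produced.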

Related

Creating a user length defined array in C

I'm trying to make an array whose length is chosen at run time to hold a string. The code should eventually count the words and adjust the size of the array, but this is only a test, and I'm posting it here because I want to know whether it's good practice or a mistake, and whether there is anything I should know about or keep in mind.
Note, I talk about C, not C++
#include <stdio.h>
int main()
{
    int c,b,count;
    scanf("%d",&c);
    count=c+1;
    getchar();
    char a[count];
    for ( c=b=0 ; c!=count && b!='\n' ; c++ )
    {
        b=getchar();
        a[c]=b;
    }
    a[c]='\0';
    printf("%s",a); printf("%d",c-1);
}
I don't need change the size of the array at the execution time.
I was testing and I don't remember well why I'm using the c variable at first time instead of count directly, but I remember the first getchar was to flush the buffer, because it didn't work without the getchar.
I don't know why I need to put getchar. If I delete the getchar the program fails.
Anyway the program works fine. The first time you run, it expects a number with scanf and then expects the text.
If the text is larger than the size of the array the program will ignore it.
The number is the size of the array.
My questions are:
It is a good practice do a[variable] to do this job?
Why I need the getchar?
It will be portable? I mean, I don't know if some systems or standards don't accept this like some old C compilers or somewhat.
There are better methods?
It is a good practice do a[variable] to do this job?
It depends on someone's compiler configuration. It has been supported since C99. However since there's not a good reason to use it in such a simple program, use the standard malloc instead. Here's an in-depth discussion of the topic.
Why I need the getchar?
There's likely some input still buffered up from your terminal, and that first getchar() call is discarding it. Try printing the value it returns to the screen to see what it is; that might help us figure it out.
It will be portable?
See my answer to your first question. It will probably work on modern versions of gcc, but for example it doesn't work with Microsoft's C compiler (which is still basically on C89).
It is a good practice do a[variable] to do this job?
Where the size is determined by arbitrary user input without imposed limits, it is not good practice. A user could easily enter a very large value and overrun the stack.
Use either dynamic allocation, or check and coerce the input value to some sensible limit.
Also worth noting that VLAs are not supported in C++ or some C compilers, so the code lacks portability.
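As a rough illustration of "check the input and allocate dynamically", something like the following could replace the `char a[count];` declaration. The limit MAX_TEXT_LEN and the name alloc_text_buffer are invented for this sketch; pick whatever bound makes sense for the program.

```c
#include <stdlib.h>

#define MAX_TEXT_LEN 4096   /* arbitrary illustrative limit, not from the question */

/* Hypothetical helper: returns a zero-filled buffer that can hold
 * `requested` characters plus the terminating '\0', or NULL when the
 * request is out of range or the allocation fails.
 * The caller must free() the result. */
char *alloc_text_buffer(long requested)
{
    if (requested < 1 || requested > MAX_TEXT_LEN)
        return NULL;
    return calloc((size_t)requested + 1, 1);
}
```

Unlike a VLA, a failed or rejected request is reported through the NULL return value, so the program can print an error instead of silently overrunning the stack.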
Why I need the getchar?
The user has to enter at least a newline for scanf() to return, but the %d format specifier does not consume non-digit characters, so the newline remains buffered. However, your code is easily broken by entering additional non-digit characters: for example, "16a<newline>" will assign 16 to c, and the a will be discarded by getchar(), leaving the newline buffered as before. A better solution is:
while( getchar() != '\n' ) {}
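One caveat with a loop of this kind is that it never terminates if the input ends before a newline arrives, because getchar() then keeps returning EOF. A slightly more defensive sketch (discard_line is a hypothetical name, and it takes a FILE * only so that it is easy to try on something other than stdin):

```c
#include <stdio.h>

/* Hypothetical helper: discards input up to and including the next
 * newline, stopping early at end of file.
 * Returns how many characters were discarded before the newline. */
int discard_line(FILE *fp)
{
    int ch, n = 0;
    while ((ch = fgetc(fp)) != EOF && ch != '\n')
        n++;
    return n;
}
```

Calling `discard_line(stdin)` right after the scanf("%d", ...) call removes both stray characters such as the a in "16a" and the newline itself.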
It will be portable? I mean, I don't know if some systems or standards don't accept this like some old C compilers or somewhat.
Adoption of C99 VLAs is variable, and in C11 they are optional in any case.
There are better methods?
I hesitate to say "better", but safer and more flexible and portable ways sure. With respect to the array allocation, you could use malloc().
Using malloc or calloc would be a better choice in C
https://www.tutorialspoint.com/c_standard_library/c_function_malloc.htm

Why is it better to use `%s` to print a string using `printf` rather than printing it directly? [duplicate]

I was reading about vulnerabilities in code and came across this Format-String Vulnerability.
Wikipedia says:
Format string bugs most commonly appear when a programmer wishes to
print a string containing user supplied data. The programmer may
mistakenly write printf(buffer) instead of printf("%s", buffer). The
first version interprets buffer as a format string, and parses any
formatting instructions it may contain. The second version simply
prints a string to the screen, as the programmer intended.
I got the problem with printf(buffer) version, but I still didn't get how this vulnerability can be used by attacker to execute harmful code. Can someone please tell me how this vulnerability can be exploited by an example?
You may be able to exploit a format string vulnerability in many ways, directly or indirectly. Let's use the following as an example (assuming no relevant OS protections, which is very rare anyways):
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char text[1024];
    static int some_value = -72;
    strcpy(text, argv[1]); /* ignore the buffer overflow here */
    printf("This is how you print correctly:\n");
    printf("%s", text);
    printf("This is how not to print:\n");
    printf(text);
    printf("some_value # 0x%08x = %d [0x%08x]", &some_value, some_value, some_value);
    return(0);
}
The basis of this vulnerability is the behaviour of functions with variable arguments. A function which implements handling of a variable number of parameters has to read them from the stack, essentially. If we specify a format string that will make printf() expect two integers on the stack, and we provide only one parameter, the second one will have to be something else on the stack. By extension, and if we have control over the format string, we can have the two most fundamental primitives:
Reading from arbitrary memory addresses
[EDIT] IMPORTANT: I'm making some assumptions about the stack frame layout here. You can ignore them if you understand the basic premise behind the vulnerability, and they vary across OS, platform, program and configuration anyways.
It's possible to use the %s format parameter to read data. You can read the data of the original format string in printf(text), hence you can use it to read anything off the stack:
./vulnerable AAAA%08x.%08x.%08x.%08x
This is how you print correctly:
AAAA%08x.%08x.%08x.%08x
This is how not to print:
AAAAXXXXXXXX.XXXXXXXX.XXXXXXXX.41414141
some_value # 0x08049794 = -72 [0xffffffb8]
Writing to arbitrary memory addresses
You can use the %n format specifier to write to an arbitrary address (almost). Again, let's assume our vulnerable program above, and let's try changing the value of some_value, which is located at 0x08049794, as seen above:
./vulnerable $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n
This is how you print correctly:
??%08x.%08x.%08x.%n
This is how not to print:
??XXXXXXXX.XXXXXXXX.XXXXXXXX.
some_value # 0x08049794 = 31 [0x0000001f]
We've overwritten some_value with the number of bytes written before the %n specifier was encountered (man printf). We can use the format string itself, or field width to control this value:
./vulnerable $(printf "\x94\x97\x04\x08")%x%x%x%n
This is how you print correctly:
??%x%x%x%n
This is how not to print:
??XXXXXXXXXXXXXXXXXXXXXXXX
some_value # 0x08049794 = 21 [0x00000015]
There are many possibilities and tricks to try (direct parameter access, large field width making wrap-around possible, building your own primitives), and this just touches the tip of the iceberg. I would suggest reading more articles on fmt string vulnerabilities (Phrack has some mostly excellent ones, although they may be a little advanced) or a book which touches on the subject.
Disclaimer: the examples are taken [although not verbatim] from the book Hacking: The art of exploitation (2nd ed) by Jon Erickson.
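On the defensive side, the fix is simply to keep the format string constant and pass untrusted text as an argument. The sketch below (format_untrusted is a hypothetical name) uses snprintf so the result can be inspected, but the same holds for printf("%s", text) or fputs(text, stdout).

```c
#include <stdio.h>

/* Hypothetical helper: copies untrusted text into out (capacity cap).
 * Because the format string is the constant "%s", any conversion
 * specifiers inside `untrusted` are copied as plain characters and are
 * never interpreted. Returns what snprintf returns. */
int format_untrusted(char *out, size_t cap, const char *untrusted)
{
    return snprintf(out, cap, "%s", untrusted);
}
```

With GCC, the -Wformat-security option (used together with -Wformat) is meant to flag calls such as printf(text) where the format is not a string literal and no arguments follow.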
It is interesting that no-one has mentioned the n$ notation supported by POSIX. If you can control the format string as the attacker, you can use notations such as:
"%200$p"
to read the 200th item on the stack (if there is one). The intention is that you should list all the n$ numbers from 1 to the maximum, and it provides a way of resequencing how the parameters appear in a format string, which is handy when dealing with I18N (L10N, G11N, M18N*).
However, some (probably most) systems are somewhat lackadaisical about how they validate the n$ values and this can lead to abuse by attackers who can control the format string. Combined with the %n format specifier, this can lead to writing at pointer locations.
* The acronyms I18N, L10N, G11N and M18N are for internationalization, localization, globalization, and multinationalization respectively. The number represents the number of omitted letters.
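For context, this is what the intended, legitimate use of the n$ notation looks like: the arguments stay in one order while each (constant) format decides where they appear. The sketch assumes a POSIX C library; format_name and its parameters are invented for illustration, and some compilers warn about n$ in pedantic ISO C mode.

```c
#include <stdio.h>

/* Hypothetical helper: writes a name as "given family" or
 * "family, given" without changing the argument order.
 * Relies on the POSIX n$ extension to printf-style formats.
 * Returns what snprintf returns. */
int format_name(char *out, size_t cap, int family_first,
                const char *given, const char *family)
{
    if (family_first)
        return snprintf(out, cap, "%2$s, %1$s", given, family);
    return snprintf(out, cap, "%1$s %2$s", given, family);
}
```

Both format strings here are literals chosen by the programmer; the danger described above only arises when an attacker gets to supply the format.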
Ah, the answer is in the article!
Uncontrolled format string is a type of software vulnerability, discovered around 1999, that can be used in security exploits. Previously thought harmless, format string exploits can be used to crash a program or to execute harmful code.
A typical exploit uses a combination of these techniques to force a program to overwrite the address of a library function or the return address on the stack with a pointer to some malicious shellcode. The padding parameters to format specifiers are used to control the number of bytes output and the %x token is used to pop bytes from the stack until the beginning of the format string itself is reached. The start of the format string is crafted to contain the address that the %n format token can then overwrite with the address of the malicious code to execute.
This is because %n causes printf to write data to a variable, which is on the stack. But that means it could write to something arbitrarily. All you need is for someone to use that variable (it's relatively easy if it happens to be a function pointer, whose value you just figured out how to control) and they can make you execute anything arbitrarily.
Take a look at the links in the article; they look interesting.
I would recommend reading this lecture note about format string vulnerability.
It describes in details what happens and how, and has some images that might help you to understand the topic.
AFAIK it's mainly because it can crash your program, which is considered to be a denial-of-service attack. All you need is to give an invalid address (practically anything with a few %s's is guaranteed to work), and it becomes a simple denial-of-service (DoS) attack.
Now, it's theoretically possible for that to trigger anything in the case of an exception/signal/interrupt handler, but figuring out how to do that is beyond me -- you need to figure out how to write arbitrary data to memory as well.
But why does anyone care if the program crashes, you might ask? Doesn't that just inconvenience the user (who deserves it anyway)?
The problem is that some programs are accessed by multiple users, so crashing them has a non-negligible cost. Or sometimes they're critical to the running of the system (or maybe they're in the middle of doing something very critical), in which case this can be damaging to your data. Of course, if you crash Notepad then no one might care, but if you crash CSRSS (which I believe actually had a similar kind of bug -- a double-free bug, specifically) then yeah, the entire system is going down with you.
Update:
See this link for the CSRSS bug I was referring to.
Edit:
Take note that reading arbitrary data can be just as dangerous as executing arbitrary code! If you read a password, a cookie, etc. then it's just as serious as an arbitrary code execution -- and this is trivial if you just have enough time to try enough format strings.

if one complains about gets(), why not do the same with scanf("%s",...)?

From man gets:
Never use gets(). Because it is
impossible to tell without knowing the
data in advance how many
characters gets() will read, and
because gets() will continue to store
characters past the end of the buffer,
it is extremely dangerous to use.
It has been used to break computer
security. Use fgets() instead.
Almost everywhere I see scanf being used in a way that should have the same problem (buffer overflow/buffer overrun): scanf("%s",string). Does the problem exist in this case? Why is there no mention of it in the scanf man page? Why does gcc not warn when compiling this with -Wall?
ps: I know that there is a way to specify in the format string the maximum length of the string with scanf:
char str[10];
scanf("%9s",str);
edit: I am not asking whether the preceding code is right or not. My question is: if scanf("%s",string) is always wrong, why are there no warnings, and why is there nothing about it in the man page?
The answer is simply that no-one has written the code in GCC to produce that warning.
As you point out, a warning for the specific case of "%s" (with no field width) is quite appropriate.
However, bear in mind that this applies only to scanf(), vscanf(), fscanf() and vfscanf(). The same format specifier can be perfectly safe with sscanf() and vsscanf(), where the length of the source string is known, so the warning should not be issued in that case. This means that you cannot simply add it to the existing "scanf-style-format-string" analysis code; you will have to split that into "fscanf-style-format-string" and "sscanf-style-format-string" options.
I'm sure if you produce a patch for the latest version of GCC it stands a good chance of being accepted (and of course, you will need to submit patches for the glibc header files too).
Using gets() is never safe. scanf() can be used safely, as you said in your question. However, determining if you're using it safely is a more difficult problem for the compiler to work out (e.g. if you're calling scanf() in a function where you pass in the buffer and a character count as arguments, it won't be able to tell); in that case, it has to assume that you know what you're doing.
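One way to keep the field width and the buffer size from drifting apart is to derive both from a single constant. This is a sketch with made-up names (WORD_MAX, STR, read_word); it uses sscanf only so that it is self-contained, and the same format string works with scanf or fscanf.

```c
#include <stdio.h>

#define WORD_MAX 9            /* maximum characters stored, excluding '\0' */
#define STR2(x) #x            /* two-level stringification so that     */
#define STR(x) STR2(x)        /* STR(WORD_MAX) expands to "9"          */

/* Hypothetical helper: reads one whitespace-delimited word of at most
 * WORD_MAX characters from s into dst, which must have room for
 * WORD_MAX + 1 bytes. Returns 1 if a word was read, 0 otherwise. */
int read_word(const char *s, char dst[WORD_MAX + 1])
{
    return sscanf(s, "%" STR(WORD_MAX) "s", dst) == 1;
}
```

The format string is assembled at compile time into "%9s", so the compiler's usual format checking still sees a literal.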
When the compiler looks at the formatting string of scanf, it sees a string! That's assuming the formatting string is not entered at run-time. Some compilers like GCC have some extra functionality to analyze the formatting string if entered at compile time. That extra functionality is not comprehensive, because in some situations a run-time overhead is needed which is a NO NO for languages like C. For example, can you detect an unsafe usage without inserting some extra hidden code in this case:
char* str;
size_t size;
scanf("%zu", &size);
str = malloc(size);
scanf("%9s", str); // how can the compiler determine if this is a safe call?!
Of course, there are ways to write safe code with scanf if you specify the number of characters to read, and that there is enough memory to hold the string. In the case of gets, there is no way to specify the number of characters to read.
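When the buffer size is only known at run time, as in the snippet above, the width has to be put into the format string at run time as well. A possible sketch (read_word_dyn is a hypothetical name):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one whitespace-delimited word of at most
 * max_len characters from fp into a freshly allocated buffer.
 * Returns NULL if max_len is 0, allocation fails, or no word is read.
 * The caller must free() the result. */
char *read_word_dyn(FILE *fp, size_t max_len)
{
    char fmt[32];   /* room for '%', up to 20 digits, 's' and '\0' */
    char *buf;

    if (max_len == 0)
        return NULL;
    buf = malloc(max_len + 1);
    if (buf == NULL)
        return NULL;
    snprintf(fmt, sizeof fmt, "%%%zus", max_len);   /* e.g. "%15s" */
    if (fscanf(fp, fmt, buf) != 1) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

This is safe, but because the format is no longer a literal, it is exactly the kind of call a compiler cannot verify statically.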
I am not sure why the man page for scanf doesn't mention the probability of a buffer overrun, but vanilla scanf is not a secure option. A rather dated link - Link shows this as the case. Also, check this (not gcc but informative nevertheless) - Link
It may be simply that scanf will allocate space on the heap based on how much data is read in. Since it doesn't allocate the buffer and then read until the null character is read, it doesn't risk overwriting the buffer. Instead, it reads into its own buffer until the null character is found, and presumably copies that buffer into another of the correct size at the end of the read.

When/why is it a bad idea to use the fscanf() function?

In an answer there was an interesting statement: "It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure. I prefer to use fgets() to get each line in and then sscanf() that."
Could you expand upon when/why it might be better to use fgets() and sscanf() to read some file?
Imagine a file with three lines:
1
2b
c
Using fscanf() to read integers, the first line would read fine but on the second line fscanf() would leave you at the 'b', not sure what to do from there. You would need some mechanism to move past the garbage input to see the third line.
If you do a fgets() and sscanf(), you can guarantee that your file pointer moves a line at a time, which is a little easier to deal with. In general, you should still be looking at the whole string to report any odd characters in it.
I prefer the latter approach myself, although I wouldn't agree with the statement that "it's almost always a bad idea to use fscanf()"... fscanf() is perfectly fine for most things.
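A sketch of the fgets() plus sscanf() approach on a file like the three-line example. The name count_int_lines and the 128-byte line buffer are assumptions of this sketch; a line longer than the buffer would be seen as several pieces.

```c
#include <stdio.h>

/* Hypothetical helper: reads fp one line at a time and counts the lines
 * that contain exactly one integer and nothing else. The number of
 * rejected lines is stored in *bad. */
int count_int_lines(FILE *fp, int *bad)
{
    char line[128];
    int good = 0;

    *bad = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        int value;
        char extra;
        /* " %c" only matches if a non-space character follows the
         * number, so a result of exactly 1 means "integer, then
         * nothing but whitespace". */
        if (sscanf(line, "%d %c", &value, &extra) == 1)
            good++;
        else
            (*bad)++;
    }
    return good;
}
```

Each bad line is consumed whole by fgets(), so the garbage in "2b" cannot hold up the reading of the line after it.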
The case where this comes into play is when you match character literals. Suppose you have:
int n = fscanf(fp, "%d,%d", &i1, &i2);
Consider two possible inputs "323,A424" and "323A424".
In both cases fscanf() will return 1 and the next character read will be an 'A'. There is no way to determine if the comma was matched or not.
That being said, this only matters if finding the actual source of the error is important. In cases where knowing there is malformed input error is enough, fscanf() is actually superior to writing custom parsing code.
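The return value alone indeed cannot tell the two inputs apart. If that distinction matters, one workaround is the standard %n conversion, which stores the number of characters consumed so far and is not counted in the return value. A sketch using sscanf on a string (parse_pair and its return codes are made up for illustration):

```c
#include <stdio.h>

/* Hypothetical helper: parses "int,int" from s.
 * Returns  2 if both integers were read,
 *          1 if the comma matched but the second integer did not,
 *          0 if the first integer was read but no comma followed,
 *         -1 if not even the first integer could be read. */
int parse_pair(const char *s, int *i1, int *i2)
{
    int after_comma = -1;   /* stays -1 unless the ',' was matched */
    int n = sscanf(s, "%d,%n%d", i1, &after_comma, i2);

    if (n == 2)
        return 2;
    if (n == 1)
        return after_comma >= 0 ? 1 : 0;
    return -1;
}
```

With this, "323,A424" yields 1 and "323A424" yields 0, even though sscanf itself returns 1 for both.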
When fscanf() fails, due to an input failure or a matching failure, the file pointer (that is, the position in the file from which the next byte will be read) is left in a position other than where it would be had the fscanf() succeeded. This is typically undesirable in sequential file reads. Reading one line at a time results in the file input being predictable, while single line failures can be handled individually.
There are two reasons:
scanf() can leave stdin in a state that's difficult to predict; this makes error recovery difficult if not impossible (this is less of a problem with fscanf()); and
The entire scanf() family takes pointers as arguments, but no length limit, so they can overrun a buffer and alter unrelated variables that happen to be after the buffer, causing seemingly random memory corruption errors that are very difficult to understand, find, and debug, particularly for less experienced C programmers.
Novice C programmers are often confused about pointers and the “address-of” operator, and frequently omit the & where it's needed, or add it “for good measure” where it's not. This causes “random” segfaults that can be hard for them to find. This isn't scanf()'s fault, so I leave it off my list, but it is worth bearing in mind.
After 23 years, I still remember it being a huge pain when I started C programming and didn't know how to recognize and debug these kinds of errors, and (as someone who spent years teaching C to beginners) it's very hard to explain them to a novice who doesn't yet understand pointers and stack.
Anyone who recommends scanf() to a novice C programmer should be flogged mercilessly.
OK, maybe not mercilessly, but some kind of flogging is definitely in order ;o)
It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure. I prefer to use fgets() to get each line in and then sscanf() that.
You can always use ftell() to find out the current position in the file, and then decide what to do from there. Basically, if you know what to expect, then feel free to use fscanf().
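A sketch of that idea: remember the position before the attempt and seek back if fscanf() fails. The name try_read_int is hypothetical, and this only works on seekable streams (regular files, not pipes or terminals).

```c
#include <stdio.h>

/* Hypothetical helper: tries to read an int from fp. On failure the
 * file position is restored to where it was before the attempt.
 * Returns 1 on success, 0 otherwise. */
int try_read_int(FILE *fp, int *out)
{
    long start = ftell(fp);

    if (start < 0)
        return 0;                   /* stream is not seekable */
    if (fscanf(fp, "%d", out) == 1)
        return 1;
    fseek(fp, start, SEEK_SET);     /* undo whatever fscanf consumed */
    return 0;
}
```

After a failed call, the caller can retry from the same position with a different format, or read the offending line with fgets() for an error message.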
Basically, there's no way to tell that function not to go out of bounds for the memory area you've allocated for it.
A number of replacements have come out, like fnscanf, which is an attempt to fix those functions by specifying a maximum limit for the reader to write, thus allowing it to not overflow.

Resources