How to stop a stack buffer overflow when reading from file? - c

I'm reading from a .txt file to save it into a char array the same size as the file itself. Is this enough to stop an uncontrolled stack buffer overflow from happening?
I already tried using a fixed-size buffer, but I now understand that's the very reason the overflow is happening.
FILE *inputFP = NULL;
int ch;
long i = 0;

inputFP = fopen(input_file, "r");
if (inputFP == NULL)
    return 1;

fseek(inputFP, 0, SEEK_END);
long fileSize = ftell(inputFP);
fseek(inputFP, 0, SEEK_SET);

char buffer[fileSize+20];

while ((ch = fgetc(inputFP)) != EOF)
{
    buffer[i] = ch;
    i++;
}
buffer[i] = '\0';
fprintf(outputFP, "%s", buffer);
Things work just fine, but I worry that the input file can be so big that something bad happens.

I'm reading from a .txt file to save it into a char array the same size as the file itself. Is this enough to stop an uncontrolled stack buffer overflow from happening?
You prevent buffer overflows by avoiding writes outside your array. They are a Very Bad Thing™.
Stack overflows occur when you exhaust the available pages assigned for the stack in your thread/process/program. Typically the size of the stack is very small (consider it on the order of 1 MiB). These are also Bad, but they will only crash your program.
long fileSize = ftell(inputFP);
...
char buffer[fileSize+20];
That is a Variable Length Array (VLA). It allocates dynamic (not known at compile-time) stack space. If you use it right, you won't have buffer overflows but you will have stack overflows, since the file size is unbounded.
What you should do, instead of using VLAs, is use a fixed-size buffer and read chunks of the file, rather than the entire file. If you really need to have the entire file in memory, you can try to allocate heap memory (malloc) for it or perhaps memory map it (mmap).
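For illustration, here is a minimal sketch of that chunked approach (the filenames are placeholders for this example): the fixed-size buffer bounds memory use no matter how large the input is.

#include <stdio.h>

int main(void)
{
    FILE *in = fopen("input.txt", "r");
    FILE *out = fopen("output.txt", "w");
    if (in == NULL || out == NULL)
        return 1;

    char chunk[4096];                       // fixed size: memory use never depends on the file
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0)
        fwrite(chunk, 1, n, out);           // write only the n bytes actually read

    fclose(in);
    fclose(out);
    return 0;
}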

The way to limit buffer overflow is to carefully control the amount of memory that's written to any buffer.
If you say (in pseudocode):
filesize = compute_file_size(filename);
buffer = malloc(filesize);
read_entire_file_into(buffer, filename);
then you've got a big, gaping, potential buffer overflow problem. The fundamental problem is not that you allocated a buffer just exactly matching the size of the file (although that might be a problem). The problem is not that you computed the file's size in advance (although that might be a problem). No, the fundamental problem is that in the hypothetical call
read_entire_file_into(buffer, filename);
you did not tell the read_entire_file_into function how big the buffer was. This may have been the read_entire_file_into function's problem, not yours, but the bottom line is that functions that write an arbitrary amount of data into a fixed-size buffer, without allowing the size of that buffer to be specified, are disasters waiting to happen. That's why the notorious gets() function has been removed from the C Standard. That's why the strcpy function is discouraged, and can be used (if at all) only under carefully controlled circumstances. And that's why the %s and %[...] format specifiers to scanf are discouraged unless they carry an explicit field width.
If, on the other hand, your code looks more like this:
filesize = compute_file_size(filename);
buffer = malloc(some_random_number);
read_entire_file_into_with_limit(buffer, some_random_number, filename);
-- where the point is that the (again hypothetical) read_entire_file_into_with_limit function can be told how big the buffer is -- then in this case, even if the compute_file_size function gets the wrong answer, and even if you use a completely different size for buffer, you've ensured that you won't overflow the buffer.
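As a concrete aside on those bounded and unbounded interfaces, here is a small sketch (buf and its size are invented for the example):

#include <stdio.h>

int main(void)
{
    char buf[64];

    /* gets(buf); */                       // dangerous: gets() cannot be told buf holds 64 bytes

    if (fgets(buf, sizeof buf, stdin))     // safe: writes at most 63 chars plus '\0'
        printf("line: %s", buf);

    if (scanf("%63s", buf) == 1)           // safe: the field width bounds the write
        printf("word: %s\n", buf);

    /* scanf("%s", buf); */                // dangerous: unbounded, the same problem as gets()
    return 0;
}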
Moving from hypothetical pseudocode to real, actual code: you didn't show the part of your code that actually read something from the file. If you're calling fread or fgets to read the file, and if you're properly passing your fileSize variable to these functions as the size of your buffer, then you have adequately protected yourself against buffer overflow. But if, on the other hand, you're calling gets, or calling getc in a loop and writing characters to buffer until you reach EOF (but without checking the number of characters read against fileSize), then you do have a big, potential buffer overflow problem, and you need to rethink your strategy and rewrite your code.
There's a secondary issue with your code which is that you are allocating your buffer as a variable-length array (VLA), on the stack (so to speak). But really big stack-allocated arrays will fail -- not because of buffer overflow, but because they're literally too big. So if you actually want to read an entire file into memory, you will definitely want to use malloc, not a VLA. (And if you don't mind an operating-system-dependent solution, you might want to look into memory-mapped file techniques, e.g. the mmap call.)
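Putting those two points together, a minimal sketch might look like this (the filename is hypothetical and error handling is kept short):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "r");
    if (fp == NULL)
        return 1;

    fseek(fp, 0, SEEK_END);
    long fileSize = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    if (fileSize < 0) { fclose(fp); return 1; }

    char *buffer = malloc(fileSize + 1);           // heap allocation instead of a VLA
    if (buffer == NULL) { fclose(fp); return 1; }

    size_t got = fread(buffer, 1, fileSize, fp);   // fread is told the limit
    buffer[got] = '\0';

    /* ... use buffer ... */

    free(buffer);
    fclose(fp);
    return 0;
}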
You've updated your code, so now I can update this answer. The file-reading loop you've posted is dangerous -- in fact it's exactly what I had in mind when I wrote about
calling getc in a loop and writing characters to buffer until you reach EOF (but without checking the number of characters read against fileSize)
You should replace that code with either
while ((ch = getc(inputFP)) != EOF)
{
    if (i >= fileSize) {
        fprintf(stderr, "buffer overflow!\n");
        break;
    }
    buffer[i] = ch;
    i++;
}
or
while ((ch = getc(inputFP)) != EOF && i < fileSize)
{
    buffer[i] = ch;
    i++;
}
Or, you can take a completely different approach. Most of the time, there's no need to read an entire file into memory all at once. Most of the time, it's perfectly adequate to read the file a line at a time, or a chunk at a time, or even a character at a time, processing and writing out each piece before moving on to the next. That way, you can work on a file of any size, and you don't need to try to figure out how big the file is in advance, and you don't need to allocate a big buffer, and you don't need to worry about overflowing that buffer.
I don't have time to show you how to do that today, but there are hints and suggestions in some of the other answers.

As mentioned in the comments, using malloc() can prevent the overflow in your case.
As a side note, always try to read a file progressively rather than loading it into memory all at once. With large files you will be in trouble, because your process may not be able to allocate that much memory; for example, it is almost impossible to load a 10GB video file into memory in one piece. In addition, large data files are normally structured, so you can read them progressively in small chunks.

Related

Will assigning a large value for length of char string be an issue?

I am reading a line from a file and I do not know how long it is going to be. I know there are ways to do this with pointers, but I am specifically asking about just a plain char string. For example, if I initialize the string like this:
char string[300]; //(or bigger)
Will having large string values like this be a problem?
Any hard coded number is potentially too small to read the contents of a file. It's best to compute the size at run time, allocate memory for the contents, and then read the contents.
See Read file contents with unknown size.
char string[300]; //(or bigger)
I am not sure which of the two issues you are concerned with, so I will try to address both below:
If the string in the file is larger than 300 bytes and you try to "stick" that string into the buffer without accounting for the maximum length of your array, you will get undefined behaviour from overwriting the end of the array.
If you are just asking whether 300 bytes is too much to allocate, then no, it is not a big deal unless you are on some very restricted device. For example, in Visual Studio the default stack size (where that array would be stored) is 1 MB, if I am not wrong. The benefit of doing so is understandable: you don't need to concern yourself with freeing the memory, etc.
PS. So if you are sure the buffer size you specify is enough, this can be a fine approach, as it frees you from the memory-management issues that come with pointers and dynamic memory.
Will having large string values like this be a problem?
Absolutely.
If your application must read the entire line from a file before processing it, then you have two options.
1) Allocate a buffer large enough to hold the line of maximum allowed length. For example, the SMTP protocol does not allow lines longer than 998 characters. In that case you can allocate a static buffer of length 1001 (998 + \r + \n + \0). Once you have read a line from the file (or from a client, in the example context) which is longer than the maximum length (that is, you have read 1000 characters and the last one is not \n), you can treat it as a fatal (protocol) error and report it.
2) If there are no limitations on the length of the input line, the only thing you can do to ensure your program's robustness is to allocate buffers dynamically as the input is read. This may involve storing multiple malloc-ed buffers in a linked list, or calling realloc when buffer exhaustion is detected (this is how the getline function works, although it is not specified in the C standard, only in POSIX.1-2008).
In either case, never use gets to read the line. Call fgets instead.
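On POSIX systems, the getline function mentioned above handles the growing buffer for you. A minimal sketch, assuming a made-up filename:

#define _POSIX_C_SOURCE 200809L   /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "r");
    if (fp == NULL)
        return 1;

    char *line = NULL;    /* getline() allocates and grows this as needed */
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, fp)) != -1)
        printf("read %zd bytes: %s", len, line);

    free(line);
    fclose(fp);
    return 0;
}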
It all depends on how you read the line. For example:
char string[300];
FILE* fp = fopen(filename, "r");
//Error checking omitted
fgets(string, 300, fp);
Taken from tutorialspoint.com
The C library function char *fgets(char *str, int n, FILE *stream) reads a line from the specified stream and stores it into the string pointed to by str. It stops when either (n-1) characters are read, the newline character is read, or the end-of-file is reached, whichever comes first.
That means this will read at most 299 characters from the file. The worst case is a logical error (you might not get all the data you need); it won't cause any undefined behavior.
But, if you do:
char string[300];
int i = 0;
FILE* fp = fopen(filename, "r");
do {
    string[i] = fgetc(fp);
    i++;
} while (string[i-1] != '\n');
This can cause a segmentation fault on lines longer than 300 characters, because it will try to write past the end of the array into memory it doesn't own.

How come my C program doesn't crash when I overflow an allocated array of characters?

I have a simple C file I/O program which demonstrates reading a text file, line by line, and outputting its contents to the console:
/**
* simple C program demonstrating how
* to read an entire text file
*/
#include <stdio.h>
#include <stdlib.h>
#define FILENAME "ohai.txt"
int main(void)
{
    // open a file for reading
    FILE* fp = fopen(FILENAME, "r");

    // check for successful open
    if(fp == NULL)
    {
        printf("couldn't open %s\n", FILENAME);
        return 1;
    }

    // size of each line
    char output[256];

    // read from the file
    while(fgets(output, sizeof(output), fp) != NULL)
        printf("%s", output);

    // report the error if we didn't reach the end of file
    if(!feof(fp))
    {
        printf("Couldn't read entire file\n");
        fclose(fp);
        return 1;
    }

    // close the file
    fclose(fp);
    return 0;
}
It looks like I've allocated an array with space for 256 characters per line. Even when I fill ohai.txt with more than 1000 characters of text on the first line, the program doesn't segfault, which I assumed it would, since that far exceeds the space designated by the output[] array.
My hypothesis is that the operating system will give extra memory to the program while it has extra memory available to give. That would mean the program would only crash when the memory consumed by a line of text in ohai.txt resulted in a stack overflow.
Could someone with more experience with C and memory management support or refute my hypothesis as to why this program doesn't crash, even when the amount of characters in one line of a text file is much larger than 256?
You're not overflowing anything here: fgets won't write more than sizeof(output) characters to the buffer, and therefore will not overflow anything (see the documentation).
However, if you do overflow a buffer, you get undefined behaviour. According to the C spec, the program may do anything: crash, not crash, silently destroy important data, accidentally call rm -rf /, etc. So, don't expect a program to crash if you invoke UB.
OP's program did not crash because no buffer overflow occurred.
while(fgets(output, sizeof(output), fp) != NULL)
    printf("%s", output);
The fgets() nicely read groups of up to 255 chars (or up to a '\n'). Then printf("%s", ...) nicely printed them out. This repeated until there was no more data.
No crash, no overflow, no runs, no hits, no errors.
fgets(output, sizeof(output), fp) reads at most sizeof(output) - 1 characters in this case; otherwise it stops at a newline or at end of file.
Explanation of stacks and why this might not segfault even if you actually did overflow (and as others have pointed out the code as written will not)
Your stack pointer starts at some address, say 0x8000000. Then the runtime calls main and the pointer moves down a bit (there may be other stuff up there, so we don't know how much is on the stack at the start of main); then main moves the stack pointer some more for all of its local variables. So at this point your array has an address more than 256 bytes below 0x8000000, and you won't get a segfault unless you run all the way over main's stack frame and the stack frames of whatever other C runtime stuff called main.
So, for the sake of simplicity, assume your array ends up with its base address at 0x7fffd00. That's 768 bytes below 0x8000000, meaning you'd have to overflow by at least that much to get a segfault. (Well, you'd probably get a segfault when main returns or when you call feof, because you filled your stack frame with random characters, but we're talking about segfaults inside fgets().) Even that isn't guaranteed if something writable is mapped to the page above your stack, though it's unlikely: most OSs avoid doing that, so you will get a segfault if you overflow far enough.
If the stack ran the other way (i.e. growing upward), you'd have to run over the entirety of the maximum-size stack, which in userspace is usually quite large (the default on Linux for 32-bit x86 is 2MB), but I'm pretty sure x86 stacks grow downward, so that's not likely in your case.

File read using POSIX APIs

Consider the following piece of code for reading the contents of the file into a buffer
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#define BLOCK_SIZE 4096
int main()
{
    int fd = -1;
    ssize_t bytes_read = -1;
    int i = 0;
    char buff[50];
    // Arbitrary size for the buffer?? How to optimise?
    // Dynamic allocation is a choice, but what is the
    // right way to relate the file size to the buffer size?
    fd = open("./file-to-buff.txt", O_RDONLY);
    if (-1 == fd)
    {
        perror("Open Failed");
        return 1;
    }
    while ((bytes_read = read(fd, buff, BLOCK_SIZE)) > 0)
    {
        printf("bytes_read=%zd\n", bytes_read);
    }
    // Test the characters read from the file into the buffer. The file contains "Hello".
    while (buff[i] != '\0')
    {
        printf("buff[%d]=%c\n", i, buff[i]);
        i++;
        // buff[5]='\n' - How?
    }
    // buff[6]='\0' - How?
    close(fd);
    return 0;
}
Code Description:
The input file contains a string "Hello"
This content needs to be copied into the buffer.
The objective is achieved with the open and read POSIX APIs.
The read API uses a pointer to a buffer of an *arbitrary size* to copy the data into.
Questions:
Dynamic allocation is the method that must be used to optimize the size of the buffer. What is the right procedure to derive the buffer size from the input file size?
I see that at the end of the read operation, read has copied a newline character and a NUL character in addition to the characters "Hello". Please elaborate on this behavior of read.
Sample Output
bytes_read=6
buff[0]=H
buff[1]=e
buff[2]=l
buff[3]=l
buff[4]=o
buff[5]=
PS: The input file is a user-created file, not one created by a program (using the write API). Just mentioning it here, in case it makes any difference.
Since you want to read the whole file, the best way is to make the buffer as big as the file size. There's no point in resizing the buffer as you go. That just hurts performance without good reason.
You can get the file size in several ways. The quick-and-dirty way is to lseek() to the end of the file:
// Get size.
off_t size = lseek(fd, 0, SEEK_END); // You should check for an error return in real code
// Seek back to the beginning.
lseek(fd, 0, SEEK_SET);
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(size + 1);
The other way is to get the information using fstat():
struct stat fileStat;
fstat(fd, &fileStat); // Don't forget to check for an error return in real code
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(fileStat.st_size + 1);
To get all the needed types and function prototypes, make sure you include the needed header:
#include <sys/stat.h> // For fstat()
#include <unistd.h> // For lseek()
Note that read() does not automatically terminate the data with \0. You need to do that manually, which is why we allocate an extra character (size+1) for the buffer. The reason why there's already a \0 character there in your case is pure random chance.
Of course, since buff is now a dynamically allocated buffer, don't forget to free it again when you don't need it anymore:
free(buff);
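For completeness, here is the missing read-and-terminate step, continuing the lseek() fragment above in the same sketch style (size is the value computed there):

ssize_t n = read(fd, buff, size);   // size as computed above
if (n == -1) {
    perror("read");
    free(buff);
    close(fd);
    return 1;
}
buff[n] = '\0';   // terminate manually; read() will not do it for you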
Be aware though, that allocating a buffer that's as large as the file you want to read into it can be dangerous. Imagine if (by mistake or on purpose, doesn't matter) the file is several GB big. For cases like this, it's good to have a maximum allowable size in place. If you don't want any such limitations, however, then you should switch to another method of reading from files: mmap(). With mmap(), you can map parts of a file to memory. That way, it doesn't matter how big the file is, since you can work only on parts of it at a time, keeping memory usage under control.
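A minimal mmap() sketch, assuming the same file-to-buff.txt (and a non-empty file, since mapping zero bytes fails):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("./file-to-buff.txt", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return 1; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    fwrite(data, 1, st.st_size, stdout);   // note: the mapping is not NUL-terminated

    munmap(data, st.st_size);
    close(fd);
    return 0;
}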
1. You can get the file size with stat(filename, &st), but defining the buffer to be one page in size is just fine.
2. First, there is no NUL character after "Hello"; it must be an accident that the stack area you allocated happened to be 0 before your code executed (refer to APUE, chapter 7.6). In fact, you must initialize a local variable before using it.
I tried to generate the text file with vim, emacs and echo -n Hello > file-to-buff.txt; only vim adds a line break automatically.
You could consider allocating the buffer dynamically by first creating a buffer of a fixed size using malloc and doubling (with realloc) the size when you fill it up. This would have a good time complexity and space trade off.
At the moment you repeatedly read into the same buffer. You should advance the position in the buffer after each read; otherwise you will overwrite the buffer contents with the next section of the file.
The code you supply allocates 50 bytes for the buffer, yet passes 4096 as the size to read. This could result in a buffer overflow for any file over 50 bytes in size.
As for the '\n' and '\0': the newline is probably in the file, and the '\0' was just already in the buffer. The buffer is allocated on the stack in your code, and if that section of the stack had not been used yet, it would probably contain zeros, placed there by the operating system when your program was loaded.
The operating system makes no attempt to terminate the data read from the file, it might be binary data or in a character set that it doesn't understand. Terminating the string, if needed, is up to you.
A few other points that are more a matter of style:
You could consider using a for (i = 0; buff[i]; ++i) loop instead of a while for the printing out at the end. This way if anyone messes with the index variable i you will be unaffected.
You could close the file earlier, after you finish reading from it, to avoid having the file open for an extended period of time (and maybe forgetting to close it if some kind of error happens).
For your second question: read does not automatically add a '\0' character.
If your file is a text file, you must add a '\0' yourself after calling read, to indicate the end of the string.
In C, the end of a string is marked by this character. If read stores 4 characters, printf will print those 4 characters and then test the 5th: if it is not '\0', it will continue printing until it finds one.
That is also a source of buffer overflows.
For the '\n', it is probably in the input file.

C using fread to read an unknown amount of data

I have a text file called test.txt
Inside it will be a single number, it may be any of the following:
1
2391
32131231
3123121412
I.e. it could be any size of number, from 1 digit up to x digits.
The file will only have 1 thing in it - this number.
I want a bit of code using fread() which will read that number of bytes from the file and put it into an appropriately sized variable.
This is to run on an embedded device; I am concerned about memory usage.
How to solve this problem?
You can simply use:
char buffer[4096];
size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer), fp);
if (nbytes == 0)
    ...EOF or other error...
else
    ...process nbytes of data...
Or, in other words, provide yourself with a data space big enough for any valid data and then record how much data was actually read into the string. Note that the string will not be null terminated unless either buffer contained all zeroes before the fread() or the file contained a zero byte. You cannot rely on a local variable being zeroed before use.
It is not clear how you want to create the 'appropriately sized variable'. You might end up using dynamic memory allocation (malloc()) to provide the correct amount of space, and then return that allocated pointer from the function. Remember to check for a null return (out of memory) before using it.
If you want to avoid over-reading, fread is not the right function. You probably want fscanf with a conversion specifier along the lines of %100[0123456789]...
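A sketch of that fscanf approach, using fp as in the snippet above (note the buffer needs one byte more than the field width):

char digits[101];   /* 100 digits at most, plus the '\0' */
if (fscanf(fp, "%100[0123456789]", digits) == 1)
    printf("number as text: %s\n", digits);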
One way to achieve this is to use fseek to move your file stream location to the end of the file:
fseek(file, 0, SEEK_END);
and then using ftell to get the position of the cursor in the file — this returns the position in bytes so you can then use this value to allocate a suitably large buffer and then read the file into that buffer.
I have seen warnings saying this may not always be 100% accurate, but I've used it in several instances without a problem. I think the issues may be dependent on specific implementations of the functions on certain platforms.
Depending on how clever you need to be with the number conversion... If you do not need to be especially clever and fast, you can read it a character at a time with getc(). So,
- Start with a variable initialized to 0.
- Read a character, multiply the variable by 10, and add the new digit.
- Repeat until done.
Get a bigger-sized variable as needed along the way, or start with your largest-sized variable and then copy the value into the smallest size that fits once you finish.
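A minimal sketch of that digit-at-a-time approach (test.txt is the file from the question; overflow is deliberately not checked, as noted above):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("test.txt", "r");
    if (fp == NULL)
        return 1;

    unsigned long value = 0;   /* pick the widest type you expect to need */
    int c;
    while ((c = getc(fp)) != EOF && c >= '0' && c <= '9')
        value = value * 10 + (unsigned long)(c - '0');   /* overflow not checked */

    printf("%lu\n", value);
    fclose(fp);
    return 0;
}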

fgetc(): Reading and storing a string of unknown length

What I need to do for an assignment is:
open a file (using fopen())
read the name of a student (using fgetc())
store that name in some part of a struct
The problem I have is that I need to read an arbitrary long string into name, and I don't know how to store that string without wasting memory (or writing into non-allocated memory).
EDIT
My first idea was to allocate a 1-byte (char) memory block, then call realloc() if more bytes are needed, but this doesn't seem very efficient. Or maybe I could double the array when it is full and then, at the end, copy the chars into a new block of memory of the exact size.
Don't worry about wasting 100 or 1000 bytes; that is likely to be long enough for all names.
I'd probably just put the buffer that you're reading into on the stack.
Do worry about writing over the end of the buffer. i.e. buffer overrun. Program to prevent that!
When you come to store the name into your structure, you can malloc a buffer of exactly the length you need (don't forget to add an extra byte for the null terminator).
But if you really must store names of any length at all then you could do it with realloc.
i.e. Allocate a buffer with malloc of some size, say 50 bytes.
Then when you need more space, use realloc to increase its length. Increase the length in blocks of, say, 50 bytes, and keep track of how big the buffer is with an int so that you know when you need to grow it again. At some point you will have to decide how long that buffer is allowed to get, because it can't grow indefinitely.
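A sketch of that grow-by-50 scheme, with hypothetical names (read_name is not part of any library):

#include <stdio.h>
#include <stdlib.h>

/* Read one arbitrarily long name from fp, growing the buffer in
 * 50-byte steps. Returns a malloc'd string (caller frees) or NULL. */
char *read_name(FILE *fp)
{
    size_t cap = 50, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    int c;
    while ((c = fgetc(fp)) != EOF && c != '\n') {
        if (len + 1 >= cap) {                    /* keep room for the '\0' */
            char *tmp = realloc(buf, cap + 50);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap += 50;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return buf;
}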
You could read the string character by character until you find the end, then rewind to the beginning, allocate a buffer of the right size, and re-read it into that, but unless you are on a tiny embedded system this is probably silly. For one thing, the fgetc, fread, etc functions create buffers in the O/S anyway.
You could allocate a temporary buffer that's large enough, use a length limited read (for safety) into that, and then allocate a buffer of the precise size to copy it into. You probably want to allocate the temporary buffer on the stack rather than via malloc, unless you think it might exceed your available stack space.
If you are writing single-threaded code for a tiny system, you can allocate a scratch buffer at startup or statically and re-use it for many purposes, but be really careful that the uses can't overlap!
Given the implementation complexity of most systems, unless you really research how things work, it's entirely possible to write memory-optimized code that actually takes more memory than doing things the easy way. Variable initializations can be another surprisingly wasteful one.
My suggestion would be to allocate a buffer of sufficient size:
char name_buffer [ 80 ];
Generally, most names (at least common English names) will be less than 80 characters in size. If you feel that you may need more space than that, by all means allocate more.
Keep a counter variable to know how many characters you have already read into your buffer:
int chars_read = 0; /* local variables are not initialized automatically, so always be explicit */
At this point, read character by character with fgetc() until you either hit the end of file marker or read 80 characters (79 really, since you need room for the null terminator). Store each character you've read into your buffer, incrementing your counter variable.
int c;
while ( ( chars_read < 79 ) && ( ( c = fgetc( stdin ) ) != EOF ) ) {
    name_buffer [ chars_read ] = (char) c;
    chars_read++;
}
name_buffer [ chars_read ] = '\0'; /* terminating null character */
I am assuming here that you are reading from stdin. A more complete example would also check for errors, verify that the character you read from the stream is valid for a person's name (no numbers, for example), etc. If you try to read more data than for which you allocated space, print an error message to the console.
I understand wanting to maintain as small a buffer as possible and only allocate what you need, but part of learning how to program is understanding the trade-offs in code/data size, efficiency, and code readability. You can malloc and realloc, but it makes the code much more complex than necessary, and it introduces places where errors may come in - NULL pointers, array index out-of-bounds errors, etc. For most practical cases, allocate what should suffice for your data requirements plus a small amount of breathing room. If you find that you are encountering a lot of cases where the data exceeds the size of your buffer, adjust your buffer to accommodate it - that is what debugging and test cases are for.
