fgetc(): Reading and storing a string of unknown length - c

What I need to do for an assignment is:
open a file (using fopen())
read the name of a student (using fgetc())
store that name in some part of a struct
The problem I have is that I need to read an arbitrarily long string into name, and I don't know how to store that string without wasting memory (or writing into non-allocated memory).
EDIT
My first idea was to allocate a 1 byte (char) memory block, then call realloc() if more bytes are needed but this doesn't seem very efficient. Or maybe I could double the array if it is full and then at the end copy the chars into a new block of memory of the exact size.

Don't worry about wasting 100 or 1000 bytes which is likely to be long enough for all names.
I'd probably just put the buffer that you're reading into on the stack.
Do worry about writing over the end of the buffer. i.e. buffer overrun. Program to prevent that!
When you come to store the name into your structure you can malloc a buffer to store the name the exact length you need (don't forget to add an extra byte for the null terminator).
But if you really must store names of any length at all then you could do it with realloc.
i.e. Allocate a buffer with malloc of some size say 50 bytes.
Then when you need more space, use realloc to increase its length. Increase the length in blocks of, say, 50 bytes and keep track of the current size in an int so that you know when you need to grow it again. At some point, you will have to decide how long that buffer is going to be, because it can't grow indefinitely.
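The grow-in-blocks idea above could look roughly like the sketch below. The function name read_name and the GROW_STEP constant are made up for illustration, and stopping at a newline is an assumption about how names are delimited in the file.

```c
#include <stdio.h>
#include <stdlib.h>

#define GROW_STEP 50   /* block size from the answer above; an arbitrary choice */

/* Hypothetical helper: reads characters with fgetc() until newline or EOF.
 * Returns a malloc'ed, null-terminated string (caller frees), or NULL on
 * allocation failure or when there is nothing left to read. */
char *read_name(FILE *fp)
{
    size_t cap = GROW_STEP;   /* bytes currently allocated */
    size_t len = 0;           /* bytes currently used */
    char *buf = malloc(cap);
    int ch;                   /* int, so EOF is distinguishable from a char */

    if (buf == NULL)
        return NULL;

    while ((ch = fgetc(fp)) != EOF && ch != '\n') {
        if (len + 1 >= cap) {             /* always keep one byte for '\0' */
            char *tmp = realloc(buf, cap + GROW_STEP);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap += GROW_STEP;
        }
        buf[len++] = (char)ch;
    }

    if (len == 0 && ch == EOF) {          /* no data at all */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Assigning the result of realloc to a temporary first means the original block can still be freed if the call fails. If the exact size matters, a final realloc(buf, len + 1) would trim the unused tail.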

You could read the string character by character until you find the end, then rewind to the beginning, allocate a buffer of the right size, and re-read it into that, but unless you are on a tiny embedded system this is probably silly. For one thing, the fgetc, fread, etc. functions create buffers behind the scenes anyway (in the C library and the OS).
You could allocate a temporary buffer that's large enough, use a length limited read (for safety) into that, and then allocate a buffer of the precise size to copy it into. You probably want to allocate the temporary buffer on the stack rather than via malloc, unless you think it might exceed your available stack space.
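A minimal sketch of that temporary-buffer approach, assuming one name per line and that fgets is acceptable as the length-limited read. The struct layout, the function name read_student, and the 256-byte scratch size are all hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; only the name field matters here. */
struct student {
    char *name;   /* heap copy sized exactly to the name */
};

/* Reads one line into a stack buffer, then stores an exact-size copy.
 * Returns 0 on success, -1 on EOF, read error, or allocation failure. */
int read_student(FILE *fp, struct student *s)
{
    char tmp[256];                        /* scratch space; size is arbitrary */
    size_t len;

    if (fgets(tmp, sizeof tmp, fp) == NULL)   /* never writes past tmp */
        return -1;

    tmp[strcspn(tmp, "\n")] = '\0';       /* drop the trailing newline, if any */
    len = strlen(tmp);

    s->name = malloc(len + 1);            /* +1 for the null terminator */
    if (s->name == NULL)
        return -1;
    memcpy(s->name, tmp, len + 1);
    return 0;
}
```

A name longer than 255 characters would be cut off here rather than overflow the buffer; whether truncation is acceptable depends on the assignment.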
If you are writing single-threaded code for a tiny system you can allocate a scratch buffer on startup or statically, and re-use it for many purposes, but be really careful that your usages can't overlap!
Given the implementation complexity of most systems, unless you really research how things work it's entirely possible to write memory optimized code that actually takes more memory than doing things the easy way. Variable initializations can be another surprisingly wasteful one.

My suggestion would be to allocate a buffer of sufficient size:
char name_buffer [ 80 ];
Generally, most names (at least common English names) will be less than 80 characters in size. If you feel that you may need more space than that, by all means allocate more.
Keep a counter variable to know how many characters you have already read into your buffer:
int chars_read = 0; /* local variables are not zero-initialized automatically, so be explicit */
At this point, read character by character with fgetc() until you either hit the end of file marker or read 80 characters (79 really, since you need room for the null terminator). Store each character you've read into your buffer, incrementing your counter variable.
int ch; /* fgetc() returns an int so that EOF can be told apart from real characters */
while ( ( chars_read < 79 ) && ( ( ch = fgetc ( stdin ) ) != EOF ) ) {
    name_buffer [ chars_read ] = (char) ch;
    chars_read++;
}
name_buffer [ chars_read ] = '\0'; /* terminating null character */
I am assuming here that you are reading from stdin. A more complete example would also check for errors, verify that the character you read from the stream is valid for a person's name (no numbers, for example), etc. If the input is longer than the space you allocated, print an error message to the console.
I understand wanting to maintain as small a buffer as possible and only allocate what you need, but part of learning how to program is understanding the trade-offs in code/data size, efficiency, and code readability. You can malloc and realloc, but it makes the code much more complex than necessary, and it introduces places where errors may come in - NULL pointers, array index out-of-bounds errors, etc. For most practical cases, allocate what should suffice for your data requirements plus a small amount of breathing room. If you find that you are encountering a lot of cases where the data exceeds the size of your buffer, adjust your buffer to accommodate it - that is what debugging and test cases are for.

Related

How to stop a stack buffer overflow when reading from file?

I'm reading from a .txt file to save it to a char array of the same size as the file itself. Is this enough to stop an uncontrolled stack buffer overflow from happening?
I already tried to use a fixed size buffer, but I now understand that's the very reason why the overflow is happening.
FILE *inputFP = NULL;
inputFP = fopen(input_file, "r");
if (inputFP == NULL)
return 1;
fseek(inputFP, 0, SEEK_END);
long fileSize = ftell(inputFP);
fseek(inputFP, 0, SEEK_SET);
char buffer[fileSize+20];
while ((ch = fgetc(inputFP)) != EOF)
{
buffer[i] = ch;
i++;
}
fprintf(outputFP, buffer, "%s");
Things work just fine, but I worry that the input file can be so big that something bad happens.
I'm reading from a .txt file to save it to a char array of the same size as the file itself. Is this enough to stop an uncontrolled stack buffer overflow from happening?
You prevent buffer overflows by avoiding writes outside your array. They are a Very Bad Thing™.
Stack overflows occur when you exhaust the available pages assigned for the stack in your thread/process/program. Typically the size of the stack is very small (consider it on the order of 1 MiB). These are also Bad, but they will only crash your program.
long fileSize = ftell(inputFP);
...
char buffer[fileSize+20];
That is a Variable Length Array (VLA). It allocates dynamic (not known at compile-time) stack space. If you use it right, you won't have buffer overflows but you will have stack overflows, since the file size is unbounded.
What you should do, instead of using VLAs, is use a fixed-size buffer and read chunks of the file, rather than the entire file. If you really need to have the entire file in memory, you can try to allocate heap memory (malloc) for it or perhaps memory map it (mmap).
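Reading in chunks through a fixed-size buffer could look something like this. It is only a sketch: copy_stream is a made-up name, the 4096-byte chunk size is arbitrary, and it assumes the goal is to pass the file contents on to another stream.

```c
#include <stdio.h>

/* Hypothetical helper: copies everything from `in` to `out` through a
 * fixed-size buffer. Returns the number of bytes copied, or -1 on a
 * read or write error. */
long copy_stream(FILE *in, FILE *out)
{
    char chunk[4096];                     /* fixed size, independent of file size */
    size_t n;
    long total = 0;

    /* fread is told the capacity of chunk, so it cannot overflow it. */
    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        if (fwrite(chunk, 1, n, out) != n)
            return -1;
        total += (long)n;
    }
    return ferror(in) ? -1 : total;
}
```

The stack usage stays at about 4 KiB no matter how large the input file is, so neither kind of overflow depends on the file size.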
The way to limit buffer overflow is to carefully control the amount of memory that's written to any buffer.
If you say (in pseudocode):
filesize = compute_file_size(filename);
buffer = malloc(filesize);
read_entire_file_into(buffer, filename);
then you've got a big, gaping, potential buffer overflow problem. The fundamental problem is not that you allocated a buffer just exactly matching the size of the file (although that might be a problem). The problem is not that you computed the file's size in advance (although that might be a problem). No, the fundamental problem is that in the hypothetical call
read_entire_file_into(buffer, filename);
you did not tell the read_entire_file_into function how big the buffer was. This may have been the read_entire_file_into function's problem, not yours, but the bottom line is that functions that write an arbitrary amount of data into a fixed-size buffer, without allowing the size of that buffer to be specified, are disasters waiting to happen. That's why the notorious gets() function has been removed from the C Standard. That's why the strcpy function is disrecommended, and can be used (if at all) only under carefully-controlled circumstances. That's why the %s and %[...] format specifiers to scanf are disrecommended.
If, on the other hand, your code looks more like this:
filesize = compute_file_size(filename);
buffer = malloc(some_random_number);
read_entire_file_into_with_limit(buffer, some_random_number, filename);
-- where the point is that the (again hypothetical) read_entire_file_into_with_limit function can be told how big the buffer is -- then in this case, even if the compute_file_size function gets the wrong answer, and even if you use a completely different size for buffer, you've ensured that you won't overflow the buffer.
Moving from hypothetical pseudocode to real, actual code: you didn't show the part of your code that actually read something from the file. If you're calling fread or fgets to read the file, and if you're properly passing your fileSize variable to these functions as the size of your buffer, then you have adequately protected yourself against buffer overflow. But if, on the other hand, you're calling gets, or calling getc in a loop and writing characters to buffer until you reach EOF (but without checking the number of characters read against fileSize), then you do have a big, potential buffer overflow problem, and you need to rethink your strategy and rewrite your code.
There's a secondary issue with your code which is that you are allocating your buffer as a variable-length array (VLA), on the stack (so to speak). But really big stack-allocated arrays will fail -- not because of buffer overflow, but because they're literally too big. So if you actually want to read an entire file into memory, you will definitely want to use malloc, not a VLA. (And if you don't mind an operating-system-dependent solution, you might want to look into memory-mapped file techniques, e.g. the mmap call.)
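If the whole file really has to be in memory, a heap buffer plus a bounded fread might look like the following sketch. The name slurp is hypothetical, and the fseek/ftell size computation is the same one used in the question, with the same caveats (it can be inexact for text-mode streams on some platforms).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole of `fp` into a heap buffer.
 * On success returns the buffer (null-terminated, caller frees) and
 * stores the number of bytes actually read in *out_len. */
char *slurp(FILE *fp, size_t *out_len)
{
    long size;
    char *buf;
    size_t got;

    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0)
        return NULL;
    rewind(fp);

    buf = malloc((size_t)size + 1);       /* heap, not stack; +1 for '\0' */
    if (buf == NULL)
        return NULL;                      /* too big: fails cleanly, no crash */

    /* fread is told the capacity, so it cannot write past the buffer
     * even if the size estimate turns out to be wrong. */
    got = fread(buf, 1, (size_t)size, fp);
    buf[got] = '\0';
    *out_len = got;
    return buf;
}
```

Unlike the VLA, a failed malloc can be detected and handled, and the terminator is written at the count fread reported rather than at the estimated size.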
You've updated your code, so now I can update this answer. The file-reading loop you've posted is dangerous -- in fact it's exactly what I had in mind when I wrote about
calling getc in a loop and writing characters to buffer until you reach EOF (but without checking the number of characters read against fileSize)
You should replace that code with either
while ((ch = getc(inputFP)) != EOF)
{
if(i >= fileSize) {
fprintf(stderr, "buffer overflow!\n");
break;
}
buffer[i] = ch;
i++;
}
or
while ((ch = getc(inputFP)) != EOF && i < fileSize)
{
buffer[i] = ch;
i++;
}
Or, you can take a completely different approach. Most of the time, there's no need to read an entire file into memory all at once. Most of the time, it's perfectly adequate to read the file a line at a time, or a chunk at a time, or even a character at a time, processing and writing out each piece before moving on to the next. That way, you can work on a file of any size, and you don't need to try to figure out how big the file is in advance, and you don't need to allocate a big buffer, and you don't need to worry about overflowing that buffer.
I don't have time to show you how to do that today, but there are hints and suggestions in some of the other answers.
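As mentioned in the comments, allocating the buffer with malloc() instead of as a VLA avoids the stack overflow in your case; you still have to bound the writes to prevent a buffer overflow.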
As a side note, always try to read a file progressively rather than loading it completely into memory. With large files you will be in trouble because your process will not be able to allocate that much memory. For example, it is almost impossible to load a 10 GB video file into memory completely. In addition, large data files are normally structured so that you can read them progressively in small chunks.

Will assigning a large value for length of char string be an issue?

I am reading a line from a file and I do not know how long it is going to be. I know there are ways to do this with pointers, but I am specifically asking about just a plain char string. For example, if I initialize the string like this:
char string[300]; //(or bigger)
Will having large string values like this be a problem?
Any hard coded number is potentially too small to read the contents of a file. It's best to compute the size at run time, allocate memory for the contents, and then read the contents.
See Read file contents with unknown size.
char string[300]; //(or bigger)
I am not sure which of the two issues you are concerned with, so I will try to address both below:
If the string in the file is larger than 300 bytes and you try to "stick" that string in that buffer without accounting for the maximum length of your array, you will get undefined behaviour because you write past the end of the array.
If you are just asking whether 300 bytes is too much to allocate, then no, it is not a big deal unless you are on some very restricted device. For example, in Visual Studio the default stack size (where that array would be stored) is 1 MB, if I am not wrong. The benefit of doing so is understandable: you don't need to concern yourself with freeing it, etc.
PS. So if you are sure the buffer size you specify is enough, this can be a fine approach, as you free yourself from the memory management issues that come with pointers and dynamic memory.
Will having large string values like this be a problem?
Absolutely.
If your application must read the entire line from a file before processing it, then you have two options.
1) Allocate buffer large enough to hold the line of maximum allowed length. For example, the SMTP protocol does not allow lines longer than 998 characters. In that case you can allocate a static buffer of length 1001 (998 + \r + \n + \0). Once you have read a line from a file (or from a client, in the example context) which is longer than the maximum length (that is, you have read 1000 characters and the last one is not \n), you can treat it as a fatal (protocol) error and report it.
2) If there are no limitations on the length of the input line, the only thing you can do to ensure your program's robustness is to allocate buffers dynamically as the input is read. This may involve storing multiple malloc-ed buffers in a linked list, or calling realloc when buffer exhaustion is detected (this is how the getline function works, although it is not specified in the C standard, only in POSIX.1-2008).
In either case, never use gets to read the line. Call fgets instead.
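Option 1 could be sketched like this with fgets. The function name read_limited_line and its return codes are invented for illustration; the caller picks the buffer size (1001 in the SMTP example above).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf, which holds cap bytes.
 * Returns 1 if a complete line was read, 0 at end of file,
 * and -1 if the line did not fit (caller treats that as an error). */
int read_limited_line(FILE *fp, char *buf, size_t cap)
{
    size_t len;

    if (fgets(buf, (int)cap, fp) == NULL)
        return 0;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return 1;                         /* newline seen: the line fit */
    if (feof(fp))
        return 1;                         /* last line, no trailing newline */
    return -1;                            /* buffer filled before the newline */
}
```

With a 1001-byte buffer, fgets stores at most 1000 characters, so a line is rejected exactly when 1000 characters were read and the last one is not a newline.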
It all depends on how you read the line. For example:
char string[300];
FILE* fp = fopen(filename, "r");
//Error checking omitted
fgets(string, 300, fp);
Taken from tutorialspoint.com
The C library function char *fgets(char *str, int n, FILE *stream) reads a line from the specified stream and stores it into the string pointed to by str. It stops when either (n-1) characters are read, the newline character is read, or the end-of-file is reached, whichever comes first.
That means that this will read 299 characters from the file at most. This will cause only a logical error (because you might not get all the data you need) that won't cause any undefined behavior.
But, if you do:
char string[300];
int i = 0;
FILE* fp = fopen(filename, "r");
do {
    string[i] = fgetc(fp);
    i++;
} while (string[i - 1] != '\n'); /* nothing bounds i, so long lines overrun string */
This will cause undefined behavior (most likely a segmentation fault) on lines longer than 300 characters, because it writes past the end of the array.

C using fread to read an unknown amount of data

I have a text file called test.txt
Inside it will be a single number, it may be any of the following:
1
2391
32131231
3123121412
I.e. it could be any size of number, from 1 digit up to x digits.
The file will only have 1 thing in it - this number.
I want a bit of code using fread() which will read that number of bytes from the file and put it into an appropriately sized variable.
This is to run on an embedded device; I am concerned about memory usage.
How to solve this problem?
You can simply use:
char buffer[4096];
size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer), fp);
if (nbytes == 0)
...EOF or other error...
else
...process nbytes of data...
Or, in other words, provide yourself with a data space big enough for any valid data and then record how much data was actually read into the string. Note that the string will not be null terminated unless either buffer contained all zeroes before the fread() or the file contained a zero byte. You cannot rely on a local variable being zeroed before use.
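To deal with the missing terminator, one option is to read one byte less than the buffer holds and terminate at the count fread returns. A small sketch, with read_text as a made-up name:

```c
#include <stdio.h>

/* Hypothetical helper: reads up to cap - 1 bytes from fp into buf and
 * null-terminates the result. Returns the number of bytes read
 * (0 on EOF or error). */
size_t read_text(FILE *fp, char *buf, size_t cap)
{
    size_t n;

    if (cap == 0)
        return 0;
    n = fread(buf, 1, cap - 1, fp);       /* leave room for the terminator */
    buf[n] = '\0';                        /* terminate at the real length */
    return n;
}
```

The result can then be handed to a string function such as strtoul. On a small device the buffer only needs to be as large as the longest number you are willing to accept.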
It is not clear how you want to create the 'appropriately sized variable'. You might end up using dynamic memory allocation (malloc()) to provide the correct amount of space, and then return that allocated pointer from the function. Remember to check for a null return (out of memory) before using it.
If you want to avoid over-reading, fread is not the right function. You probably want fscanf with a conversion specifier along the lines of %100[0123456789]...
One way to achieve this is to use fseek to move your file stream location to the end of the file:
fseek(file, 0, SEEK_END);
and then using ftell to get the position of the cursor in the file — this returns the position in bytes so you can then use this value to allocate a suitably large buffer and then read the file into that buffer.
I have seen warnings saying this may not always be 100% accurate, but I've used it in several instances without a problem. I think the issues could be dependent on specific implementations of the functions on certain platforms.
Depending on how clever you need to be with the number conversion... If you do not need to be especially clever and fast, you can read it a character at a time with getc(). So,
- start with a variable initialized to 0.
- Read a character, multiply variable by 10 and add new digit.
- Then repeat until done.
Get a bigger sized variable as needed along the way or start with your largest sized variable and then copy it into the smallest size that fits after you finish.
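The multiply-by-10 loop above could be sketched as follows, starting with the largest unsigned type and rejecting anything that does not fit. The name read_number and the choice of uint64_t are assumptions; no text buffer is needed at all, which may suit the embedded case.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: parses leading decimal digits from fp into *out.
 * Returns 0 on success, -1 if there are no digits or the value overflows. */
int read_number(FILE *fp, uint64_t *out)
{
    uint64_t value = 0;
    int ch, digits = 0;

    while ((ch = getc(fp)) != EOF && ch >= '0' && ch <= '9') {
        uint64_t d = (uint64_t)(ch - '0');
        if (value > (UINT64_MAX - d) / 10)
            return -1;                    /* value * 10 + d would not fit */
        value = value * 10 + d;
        digits++;
    }
    if (digits == 0)
        return -1;
    *out = value;
    return 0;
}
```

Once the value is known, it can be copied into a narrower type if it is small enough, as suggested above.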

Writing into a file

I have a very simple question regarding file write.
I have this program:
char buf[20];
size_t nbytes;
strcpy(buf, "All that glitters is not gold\n");
fd= open("test_file.txt",O_WRONLY);
write(fd,buf,strlen(buf));
close(fd);
What I am confused about is that when I open the file test_file.txt after running this program, I see some characters like ^C^#^#^#^^^# after the line "All that glitters is not". Notice that a portion of buf is not written and those characters appear instead. Why is that so?
You're writing more than 19 chars in that buffer. Once you've done that, the behavior of your program is undefined. It could do whatever it wants.
Allocate a large enough buffer. It has to be able to fit all the letters plus a terminating 0 if you need to be able to treat it as a C string.
The string "All that glitters is not gold\n" is longer than 20 characters. I suggest you try it with a larger buffer.
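Actually, if you're going to do any nontrivial work in C I suggest you never ever use strcpy, as a general habit. Use functions like strncpy which let you specify a buffer size so that it's clear you'll never overflow. (Be aware that strncpy does not null-terminate the destination when the source is too long, so terminate it yourself or use snprintf.)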
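As one possible illustration of a size-aware copy, snprintf both respects the buffer size and reports how much space the string needed. The wrapper name copy_checked is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: copies src into dst (capacity cap bytes) and
 * reports whether it fit. Returns the copied length, or -1 if src
 * had to be truncated. dst is null-terminated whenever cap > 0. */
int copy_checked(char *dst, size_t cap, const char *src)
{
    /* snprintf returns the length src needs, not the length written. */
    int needed = snprintf(dst, cap, "%s", src);

    if (needed < 0 || (size_t)needed >= cap)
        return -1;
    return needed;
}
```

With the 20-byte buf from the question this returns -1, because the 30-character string needs 31 bytes including the terminator.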
libgcc strcpy Manual says:
If the destination string of a strcpy() is not large enough (that is, if the programmer was stupid or lazy, and failed to check the size before copying) then anything might happen. Overflowing fixed length strings is a favorite cracker technique.
Also, the strlen manual says:
The strlen() function calculates the length of the string s, not including the terminating '\0' character.
So I guess strlen() does not return what you expect it to return, and as a result the extra characters are written.
To make the thing work, you need to allocate a large enough buffer, which can hold the entire string.

Scanning a file and allocating correct space to hold the file

I am currently using fscanf to get space delimited words. I establish a char[] with a fixed size to hold each of the extracted words. How would I create a char[] with the correct number of spaces to hold the correct number of characters from a word?
Thanks.
Edit: If I do a strdup on a char[1000] and the char[1000] actually only holds 3 characters, will the strdup reserve space on the heap for 1000 or 4 (for the terminating char)?
Here is a solution involving only two allocations and no realloc:
Determine the size of the file by seeking to the end and using ftell.
Allocate a block of memory this size and read the whole file into it using fread.
Count the number of words in this block.
Allocate an array of char * able to hold pointers to this many words.
Loop through the block of text again, assigning to each pointer the address of the beginning of a word, and replacing the word delimiter at the end of the word with 0 (the null character).
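The last three steps might be sketched as below, assuming the file has already been read into a null-terminated block as described in the first two steps. The name split_words is made up, and treating any whitespace as the delimiter is an assumption.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: splits text in place on whitespace. Returns a
 * malloc'ed array of pointers into text (caller frees only the array)
 * and stores the number of words in *count. */
char **split_words(char *text, size_t *count)
{
    size_t n = 0, i = 0;
    char **words;
    char *p;

    /* Pass 1: count the words. */
    for (p = text; *p != '\0'; ) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;
        n++;
        while (*p != '\0' && !isspace((unsigned char)*p)) p++;
    }

    words = malloc((n ? n : 1) * sizeof *words);
    if (words == NULL)
        return NULL;

    /* Pass 2: record each word's start and overwrite the delimiter
     * that follows it with '\0'. */
    for (p = text; i < n; ) {
        while (isspace((unsigned char)*p)) p++;
        words[i++] = p;
        while (*p != '\0' && !isspace((unsigned char)*p)) p++;
        if (*p != '\0')
            *p++ = '\0';
    }

    *count = n;
    return words;
}
```

Every word stays inside the original block, so the only allocations are the block itself and the pointer array.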
Also, a slightly philosophical matter: If you think this approach of inserting string terminators in-place and breaking up one gigantic string to use it as many small strings is ugly, hackish, etc. then you should probably forget about programming in C and use Python or some other higher-level language. The ability to do radically-more-efficient data manipulation operations like this while minimizing the potential points of failure is pretty much the only reason anyone should be using C for this kind of computation. If you want to go and allocate each word separately, you're just making life a living hell for yourself by doing it in C; other languages will happily hide this inefficiency (and abundance of possible failure points) behind friendly string operators.
There's no one-and-only way. The idea is to just allocate a string large enough to hold the largest possible string. After you've read it, you can then allocate a buffer of exactly the right size and copy it if needed.
In addition, you can also specify a width in your fscanf format string to limit the number of characters read, to ensure your buffer will never overflow.
But if you allocated a buffer of, say, 250 characters, it's hard to imagine a single word not fitting in that buffer.
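Putting the width limit and the exact-size copy together might look like this sketch. The name next_word is hypothetical and the 250-byte buffer is the figure used above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads the next whitespace-delimited word (at most
 * 249 characters) and returns an exact-size heap copy, or NULL at end of
 * input or on allocation failure. The caller frees the result. */
char *next_word(FILE *fp)
{
    char tmp[250];
    size_t len;
    char *copy;

    if (fscanf(fp, "%249s", tmp) != 1)    /* width leaves room for '\0' */
        return NULL;

    len = strlen(tmp);
    copy = malloc(len + 1);               /* e.g. 4 bytes for a 3-letter word */
    if (copy != NULL)
        memcpy(copy, tmp, len + 1);
    return copy;
}
```

The copy is sized from strlen, not from the size of the array it came from, which is also how strdup behaves: duplicating a 3-character string held in a char[1000] allocates 4 bytes.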
char *ptr;
ptr = (char*) malloc(size_of_string + 1);
char first = ptr[0];
/* etc. */
