C buffer underflows: definition and associated risk

According to Wikipedia:
In computing, buffer underrun or buffer underflow is a state occurring when a buffer used to communicate between two devices or processes is fed with data at a lower speed than the data is being read from it.
From Apple's Secure Coding Guide:
Fundamentally, buffer underflows occur when two parts of your code disagree about the size of a buffer or the data in that buffer. For example, a fixed-length C string variable might have room for 256 bytes, but might contain a string that is only 12 bytes long.
Apple's definition complements the idea of buffer overflow.
Which of these definitions is technically more sound?
Is buffer underflow a major security concern? I have the habit of using large buffers to poll and read() from serial ports or sockets (although I do use bzero()). Is this the right thing to do?

Those are two different usages of the word "underflow". As they are describing two different things, I don't think you can compare them on technical soundness.
Buffer underflow, as per Apple's definition, could be a weakness. See http://cwe.mitre.org/data/definitions/124.html.

2) 'I do use bzero(). Is this the right thing to do?'
Almost certainly not. The system calls return how many bytes have been received. If you're absolutely certain that you are going to receive text-style data with no embedded nulls, and wish to use C-style string library calls on it, just push one null onto the end of the buffer (this often means reading one byte less than the declared buffer length, to ensure there is enough space for the null). In all other cases, just don't bother with the terminator at all. It's going to be either pointless or dangerous.
bzero() is just a waste of cycles in the case of network buffers. I don't care how many web page examples there are or how many sources say 'vars/buffers must be initialized'. It's rubbish.
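For example, a minimal sketch of that approach, assuming an already-open descriptor; the helper name read_text is hypothetical:

#include <unistd.h>

/* Sketch: read text-style data from an already-open descriptor fd
   (socket, serial port, ...) into buf, and null-terminate it using the
   byte count returned by read() instead of bzero()ing the whole buffer. */
ssize_t read_text(int fd, char *buf, size_t buflen)
{
    ssize_t n = read(fd, buf, buflen - 1);   /* leave room for the '\0' */
    if (n >= 0)
        buf[n] = '\0';                       /* terminate exactly where the data ends */
    return n;                                /* -1 on error, 0 on EOF/closed peer */
}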

Related

fgetc vs getline or fgets - which is most flexible

I am reading data from a regular file and I was wondering which would allow for the most flexibility.
I have found that fgets and getline both read in a line (one with a maximum number of characters, the other with dynamic memory allocation). In the case of fgets, if the length of the line is bigger than the given size, the rest of the line would not be read but remain buffered in the stream. With getline, I am worried that it may attempt to allocate a large block of memory for an obscenely long line.
The obvious solution for me seems to be turning to fgetc, but this comes with the problem that there will be many calls to the function, thereby resulting in the read process being slow.
Is this compromise between flexibility and efficiency unavoidable in either case, or can it be worked through?
The three functions you mention do different things:
fgetc() reads a single character from a FILE * descriptor. It buffers input, so you can process the file in a buffered way without the overhead of making a system call for each character. When your problem can be handled in a character-oriented way, it is the best choice.
fgets() reads a single line from a FILE * descriptor; it's like calling fgetc() repeatedly to fill the character array you pass to it, in order to read line by line. It has the drawback of making a partial read when your input line is longer than the buffer size you specify. This function also buffers input data, so it is very efficient. If you know that your lines will be bounded, this is the best way to read your data line by line. Sometimes you want to be able to process data with unbounded line sizes, and you must redesign your problem to work within the available memory. Then the one below is probably the better choice.
getline() is relatively new, and is not ANSI C, so it is possible you will port your program to some platform that lacks it. It's the most flexible, at the price of being the least efficient. It requires a reference to a pointer that is realloc()ated to hold more and more data. It doesn't bound the line length, at the cost of possibly consuming all the memory available on the system. Both the buffer pointer and the size of the buffer are passed by reference so they can be updated, so you know where the new string is located and what the new size is. The buffer must be free()d after use.
The reason for having three functions and not only one is that you have different needs in different cases, and selecting the most efficient one is normally the best choice.
If you plan to use only one, you'll probably end up in a situation where the function you selected as the most flexible is not the best choice, and you will probably fail.
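As a minimal sketch of the getline() behaviour described above (POSIX, not ANSI C; the file name is only an example):

#define _POSIX_C_SOURCE 200809L   /* expose getline() on POSIX systems */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("input.txt", "r");   /* example file name */
    char *line = NULL;                   /* getline() allocates/reallocates this */
    size_t cap = 0;                      /* current capacity, updated by getline() */
    ssize_t len;

    if (f == NULL)
        return 1;
    while ((len = getline(&line, &cap, f)) != -1)
        printf("%ld bytes: %s", (long)len, line);   /* line keeps its '\n' */
    free(line);                          /* must be free()d after use */
    fclose(f);
    return 0;
}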
Much is case dependent.
getline() is not part of the standard C library. Its functionality may differ depending on the implementation and which other standards it follows, which is an advantage for the standard fgetc()/fgets().
... case between flexibility and efficiency unavoidable, ...
OP is missing the higher priorities.
Functionality - If code cannot function right with the selected function, why use it? Example: fgets() and reading null characters create issues.
Clarity - without clarity, feel the wrath of the poor soul who later has to maintain the code.
would allow for the most flexibility. (?)
fgetc() allows for the most flexibility at the low level - yet helper functions using it to read lines tend to fail in corner cases.
fgets() allows for the most flexibility at the mid level - you still have to deal with long lines and those with embedded null characters, but at least the low-level slogging in the weeds is avoided.
getline() is useful when high portability is not needed and the risk of allowing the user to overwhelm resources is not a concern.
For robust handling of user/file input to read a line, create a wrapping function (e.g. int my_read_line(char *buf, size_t size, FILE *f)) and call that and only that in user code. Then when issues arise, they can be handled locally, regardless of the low-level input function selected.
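A minimal sketch of such a wrapper; the name my_read_line and its return conventions are just for illustration. It reads one line with fgets(), strips the newline, discards the rest of an over-long line, and reports EOF, success, or truncation.

#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: returns -1 on EOF/error, 0 on success,
   1 if the line did not fit in buf and was truncated. */
int my_read_line(char *buf, size_t size, FILE *f)
{
    int c;
    size_t len;

    if (fgets(buf, (int)size, f) == NULL)
        return -1;                        /* EOF or read error */
    len = strcspn(buf, "\n");             /* index of '\n' or of the '\0' */
    if (buf[len] == '\n') {
        buf[len] = '\0';                  /* strip the newline */
        return 0;
    }
    /* No newline: the line did not fit (or it is a final line with no
       newline before EOF); discard the remainder up to the newline. */
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;
    return 1;
}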

How to buffer a line in a file by using System Calls in C?

Here is my approach:
int linesize=1;
int ReadStatus;
char buff[200];
ReadStatus=read(file,buff,linesize);
while(buff[linesize-1]!='\n' && ReadStatus!=0)
{
    linesize++;
    ReadStatus=read(file,buff,linesize);
}
Is this idea right?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
What do you think?
....
I just saw a new problem. How do we read the second line? Is there any way to delimit the bytes?
Is this idea right?
No. At the heart of a comment written by Siguza, lies the summary of an issue:
1) read doesn't read lines, it just reads bytes. There's no reason buff should end with \n.
Additionally, there's no reason buff shouldn't contain multiple newline characters, and as there's no [posix] tag here there's no reason to suggest what read does, let alone whether it's a syscall. Assuming you're referring to the POSIX function, there's no error handling. Where's your logic to handle the return value/s reserved for errors?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
Providing you fix the issues mentioned above (more on that later), if you were to test this theory, you'd likely find, also at the heart of the comment by Siguza,
Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that.
To an extent, you can expect your idea to approach O(log n), but your bottleneck will be one of those cache lines (likely the one closest to your keyboard/the filesystem/whatever is feeding the stream with information). At that point, you should stop guzzling memory which other programs might need because your optimisation becomes less and less effective.
What do you think?
I think you should just STOP! You're guessing!
Once you've written your program, decide whether or not it's too slow. If it's not too slow, it doesn't need optimisation, and you probably won't shave enough nanoseconds to make optimisation worthwhile.
If it is too slow, then you should:
Use a profiler to determine what the most significant bottleneck is,
apply optimisations based on what your profiler tells you, then
use your profiler again, with the same inputs as before, to measure the effect your optimisation had.
If you don't use a profiler, your guess-work could result in slower code, or you might miss opportunities for more significant optimisations...
How do we read the second line?
Naturally, it makes sense to read character by character, rather than two hundred characters at a time, because there's no other way to stop reading the moment you reach a line terminating character.
Is there any way to delimit the bytes?
Yes. The most sensible tools to use are provided by the C standard, and syscalls are managed automatically to be most efficient based on configurations decided by the standard library devs (who are likely much better at this than you are). Those tools are:
fgets to attempt to read a line (by reading one character at a time), up to a threshold (the size of your buffer). You get to decide how large a line should be, because it's more often the case that you won't expect a user/program to input huge lines.
strchr or strcspn to detect newlines from within your buffer, in order to determine whether you read a complete line.
scanf("%*[^\n]"); to discard the remainder of an incomplete line, when you detect those.
realloc to reallocate your buffer, if you decide you want to resize it and call fgets a second time to retrieve more data rather than discarding the remainder. Note: this will have an effect on the runtime complexity of your code, not that I think you should care about that...
Other options are available for the first three. You could use fgetc (or even read one character at a time) like I did at the end of this answer, for example...
In fact, that answer is highly relevant to your question, as it does make an attempt to exponentially increase the size. I wrote another example of this here.
It should be pointed out that the reason to address these problems is not so much optimisation as the need to read a large, yet variable-size, chunk of memory. Remember, if you haven't yet written the code, it's likely you won't know whether it's worthwhile optimising it!
Suffice to say, it isn't the read function you should try to reduce your dependence upon, but the malloc/realloc/calloc functions... That's the real kicker! If you don't absolutely need to store the entire line, then don't!
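Tying the first three tools together, here is a minimal sketch that reads bounded lines and discards over-long remainders; stdin and the 256-byte buffer are example choices, and the extra "%*c" to swallow the newline is an assumption on top of the answer above.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[256];                               /* example threshold */

    while (fgets(buf, sizeof buf, stdin) != NULL) {
        size_t len = strcspn(buf, "\n");         /* position of newline, if any */
        if (buf[len] == '\n') {
            buf[len] = '\0';                     /* complete line: strip newline */
        } else {
            scanf("%*[^\n]");                    /* incomplete: discard the rest */
            scanf("%*c");                        /* and the newline itself */
        }
        printf("line: %s\n", buf);
    }
    return 0;
}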

How to decide a buffer's size

I have a program whose purpose is to read from some input text file and filter all chars which are printable (i.e., ASCII between 32 and 126) into some other output text file.
I also get as an argument "DataAmount", which means the amount of data I need to read. It may be 1B, 1K, 1M, 1G, 80000B, etc. (any natural number can be before the unit).
It is NOT the size of the input file, it is how much I need to read from the input file. And if the input file is smaller than the DataAmount, I need to re-read the file until I read exactly DataAmount bytes.
For the filtering, I read from the input file into some buffer. I filter the printable chars from that buffer into some other buffer, and write from that buffer to the output file (both buffers are the same size).
The question is, how can I decide what size is best for those two buffers, so that there will be a minimal number of calls to read() and write()?
(NOTE: I won't write the whole data in one go since it may be too big, and I won't write each byte at a time. I write from the outbuff to the output file only when the buffer is full.)
At the moment, I build the buffer size depending only on the unit:
If it's B or K, the size will be 1024.
If it's M or G, the size will be 4096.
This is not good at all, since for 1B and 100000B I'll have the same buffer size.
How can I improve this?
My personal experience is that the buffer size does not matter much as long as you are using a few kilobytes.
As you noted in your question, there is overhead in doing system calls, so doing I/O one character at a time is not terribly efficient, and you can cut that overhead down by reading and writing larger blocks. However, there are other things that take time, and any reasonable amount of buffering will drop your system call overhead down to the point where it is those other things that are taking most of the time. At that point larger buffers do not make the program significantly faster. There are also problems with making a buffer too large, so you can err in that direction too.
I would not make the buffer size dynamic as you are doing. It introduces needless complexity into the program. You can verify that by running your program with different buffer sizes, and see what kind of difference it makes.
As for the actual value to use, the stdio.h header file defines the macro BUFSIZ which is the default size for stdio buffers. That macro is a reasonable size to use.
Also note that if you are using the stdio functions to do your I/O, they already provide buffering (if you're not calling the system calls read() and write() directly, you're using stdio.) There isn't really a reason to buffer the data twice, so you can either do the I/O one character at a time and let the stdio buffers take care of it for you, or disable the stdio buffering with setvbuf().
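As a sketch of the first option above: character-at-a-time I/O that relies on stdio's own buffering, here sized to BUFSIZ via setvbuf() (the default would normally suffice); the file names are placeholders.

#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("input.txt", "r");    /* placeholder names */
    FILE *out = fopen("output.txt", "w");
    int c;

    if (in == NULL || out == NULL)
        return 1;
    /* Optional: give the streams explicit BUFSIZ-sized buffers
       (they are usually fully buffered like this by default anyway). */
    setvbuf(in,  NULL, _IOFBF, BUFSIZ);
    setvbuf(out, NULL, _IOFBF, BUFSIZ);

    while ((c = fgetc(in)) != EOF)          /* one char at a time; stdio batches
                                               the underlying read()/write() */
        if (c >= 32 && c <= 126)            /* keep only printable ASCII */
            fputc(c, out);

    fclose(out);
    fclose(in);
    return 0;
}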
If you know the input beforehand you can do some statistics and get the average, so it's not a fixed size but an approximation.
But I recommend: don't worry about the read() and write() syscalls. If you read very little data from the input and your buffer is large, you waste some bytes. If you get a big input and have a small buffer, you only have to do some extra iterations.
A medium size for the buffer would be good. For example, 512.
Once you decide on the unit, also decide whether the number of units warrants extra buffer size. Thus, once you have found the B, check the numeric value before it; that way you would not treat all the smaller units the same.
You can do a switch statement on the unit indicators, and then process within each case, based on the numeric value of that unit. As an example, for the B do an integer divide of the maximum and set the actual buffer size based on the result (again in a switch/case sequence).
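A rough sketch of that idea follows; the helper name pick_buffer_size and the specific size thresholds are arbitrary examples, not recommendations.

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: pick a buffer size from a string like "80000B" or "1M",
   looking at the numeric value as well as the unit. */
size_t pick_buffer_size(const char *data_amount)
{
    char *unit;
    unsigned long long n = strtoull(data_amount, &unit, 10);
    unsigned long long bytes;

    switch (*unit) {                         /* scale the number by its unit */
    case 'G': bytes = n * 1024ULL * 1024ULL * 1024ULL; break;
    case 'M': bytes = n * 1024ULL * 1024ULL; break;
    case 'K': bytes = n * 1024ULL; break;
    case 'B':
    default:  bytes = n; break;
    }
    if (bytes < 1024ULL)           return 1024;     /* arbitrary example thresholds */
    if (bytes < 1024ULL * 1024ULL) return 4096;
    return 65536;
}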

What is more efficient, reading word by word from file or reading a line at a time and splitting the string using C ?

I want to develop an application in C where I need to check word by word from a file on disk. I've been told that reading a line from file and then splitting it into words is more efficient as less file accesses are required. Is it true?
If you know you're going to need the entire file, you may as well be reading it in as large chunks as you can (at the extreme end, you'll memory map the entire file into memory in one go). You are right that this is because less file accesses are needed.
But if your program is not slow, then write it in the way that makes it the fastest and most bug free for you to develop. Early optimization is a grievous sin.
Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.
The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.
But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.
If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.
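A minimal sketch of the scanf() approach described above; stdin and the 100-byte word array are just example choices.

#include <stdio.h>

int main(void)
{
    char word[100];

    /* %99s skips leading whitespace, then reads at most 99 non-whitespace
       characters, leaving room for the terminating '\0'. */
    while (scanf("%99s", word) == 1)
        printf("word: %s\n", word);
    return 0;
}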
As far as splitting is concerned there is no difference with respect to performance. You are splitting using whitespace in one case and newline in another.
However, it would have an impact in the word case, in that you would need to allocate buffers M times, while in the line case it will be N times, where M>N. So if you adopt the word-split approach, try to calculate the total memory needed first, allocate a chunk of that size (so you don't end up with M fragmented chunks), and later carve the M buffers out of that chunk. Note that the same approach can be applied to line splitting, but the difference will be less visible.
This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'.
The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).
The major performance bottlenecks will likely be:
Any call to a stdio file I/O function. The fewer the calls, the less the overhead.
Dynamic memory allocation. Should be done as sparingly as possible. Ultimately, a lot of calls to malloc will cause heap fragmentation.
So what it boils down to is a classic programming consideration: you can get either quick execution time or you can get low memory usage. You can't get both, but you can find some suitable middle-ground that is most effective both in terms of execution time and memory consumption.
At one extreme, the fastest possible execution can be obtained by reading the whole file as one big chunk and loading it into dynamic memory. At the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.
You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API function calls etc etc will all matter.
Why not try a few different ways and benchmark them?
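For instance, the byte-by-byte extreme mentioned above could look like this sketch, which counts words from stdin without any dynamic allocation; the whitespace-based definition of a word is an assumption.

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int c, in_word = 0;
    unsigned long words = 0;

    /* Evaluate the input as it is read; no buffer of our own is needed
       because stdio already buffers the underlying reads. */
    while ((c = fgetc(stdin)) != EOF) {
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;                /* first character of a new word */
        }
    }
    printf("%lu words\n", words);
    return 0;
}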
Not actually an answer to your exact question (words vs lines), but if you need all the words in memory at the same time anyway, then the most efficient approach is this:
determine file size
allocate buffer for entire file plus one byte
read the entire file into the buffer, and put a '\0' in the extra byte.
make a pass over it and count how many words it has
allocate a char* (pointers to words) or int (indexes into the buffer) index array, with size matching the word count
make a 2nd pass over the buffer, store the addresses or indexes of the first letters of the words in the index array, and overwrite the other bytes in the buffer with '\0' (end-of-string marker).
If you have plenty of memory, then it's probably slightly faster to just assume the worst case for the number of words: (filesize+1) / 2 (one-letter words with one space in between, with an odd number of bytes in the file). Also, taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double its size when the word count exceeds the current capacity, will be quite efficient (since doubling means exponential growth, reallocation will not happen many times).
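A sketch of the steps above, with error handling abbreviated; the file name and the assumption that the file fits in memory are for illustration only.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("input.txt", "rb");          /* example file name */
    if (f == NULL) return 1;

    /* Steps 1-3: determine the size, allocate size + 1, read it all, add '\0'. */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    if (size < 0) return 1;
    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) return 1;
    fread(buf, 1, (size_t)size, f);              /* return value unchecked in this sketch */
    fclose(f);
    buf[size] = '\0';

    /* Step 4: count the words. */
    size_t words = 0;
    int in_word = 0;
    for (long i = 0; i < size; i++) {
        if (isspace((unsigned char)buf[i])) in_word = 0;
        else if (!in_word) { in_word = 1; words++; }
    }

    /* Steps 5-6: build the index array and terminate each word in place. */
    char **word_ptrs = malloc(words * sizeof *word_ptrs);
    if (word_ptrs == NULL) return 1;
    size_t w = 0;
    in_word = 0;
    for (long i = 0; i < size; i++) {
        if (isspace((unsigned char)buf[i])) { buf[i] = '\0'; in_word = 0; }
        else if (!in_word) { in_word = 1; word_ptrs[w++] = &buf[i]; }
    }

    for (size_t k = 0; k < words; k++)           /* each entry is now a C string */
        puts(word_ptrs[k]);
    free(word_ptrs);
    free(buf);
    return 0;
}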

I/O methods in C

I am looking for various ways of reading/writing data from stdin/stdout. Currently I know about scanf/printf, getchar/putchar and gets/puts. Are there any other ways of doing this? Also, I am interested in knowing which one is most efficient in terms of memory and space.
Thanks in Advance
fgets()
fputs()
read()
write()
And others, details can be found here: http://www.cplusplus.com/reference/clibrary/cstdio/
As for your efficiency question, take a look at this: http://en.wikipedia.org/wiki/I/O_bound
Stdio is designed to be fairly efficient no matter which way you prefer to read data. If you need to do character-by-character reads and writes, they usually expand to macros which just access the buffer except when it's full/empty. For line-by-line text io, use puts/fputs and fgets. (But NEVER use gets because there's no way to control how many bytes it will read!) The printf family (e.g. fprintf) is of course extremely useful for text because it allows you to skip constructing a temporary buffer in memory before writing (and thus lets you avoid thinking about all the memory allocation, overflow, etc. issues). fscanf tends to be much less useful, but mostly because it's difficult to use. If you study the documentation for fscanf well and learn how to use %[, %n, and the numeric specifiers, it can be very powerful!
For large blocks of text (e.g. loading a whole file into memory) or binary data, you can also use the fread and fwrite functions. You should always pass 1 for the size argument and the number of bytes to read/write for the count argument; otherwise it's impossible to tell from the return value how much was successfully read or written.
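A small sketch of the fread()/fwrite() convention described above, copying stdin to stdout; the 4096-byte chunk size is an arbitrary example.

#include <stdio.h>

int main(void)
{
    char buf[4096];                  /* arbitrary chunk size */
    size_t n;

    /* size = 1, count = number of bytes, so the return value is the
       exact number of bytes actually read or written. */
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        if (fwrite(buf, 1, n, stdout) != n)
            return 1;                /* short write: give up */
    }
    return ferror(stdin) ? 1 : 0;
}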
If you're on a reasonably POSIX-like system (pretty much anything) you can also use the lower-level io functions open, read, write, etc. These are NOT part of the C standard but part of POSIX, and non-POSIX systems usually provide the same functions but possibly with slightly-different behavior (for example, file descriptors may not be numbered sequentially 0,1,2,... like POSIX would require).
If you're looking for immediate-mode type stuff don't forget about Curses (more applicable on the *NIX side but also available on Windows)

Resources