Hi, I'm new to C and currently reading K&R. I got confused by a definition in it about text streams: "A text stream is a sequence of characters divided into lines; each line consists of zero or more characters followed by a newline character."
While trying to learn about these streams I was introduced to a new term, namely the buffer.
I just know that:
A continuous flow of data (bytes or characters) between input and output devices is a STREAM.
A temporary storage area in main memory that holds input or output data is a BUFFER.
I'm not saying I'm right; that's just my basic idea of those terms.
I want to know what a buffer and a stream actually are, and how the two work together, at the non-abstract level of a C implementation.
You have three standard streams in C: stdin, stdout, and stderr. You can also think of files you have opened with fopen, for example, as streams. stdin is generally the keyboard, stdout is generally your monitor, and stderr is generally also your monitor. But they don't have to be; they are abstractions over the hardware.
If, for example, you didn't have a keyboard but a keypad on a bank ATM, then stdin would be the keypad; if you didn't have a monitor but instead had a printer, then stdout would be the printer. You change what hardware they use with calls to your operating system. You can also change their behaviour, again through calls to your operating system, but that is beyond the scope of what you're asking.
So, in a way, think of the buffer as the memory allocated by the operating system and associated with the stream to hold the data received from the hardware. When you type at your keyboard, for example, the characters you type aren't captured directly by your program; they move from the keyboard into the buffer, and then you read the buffer.
That's why, for example, you have to hit Enter before your code starts interacting with whatever you typed: stdin is line buffered. Control passes from your program to the operating system until something sends control back to your program; in the normal situation, that is the newline character.
So, in a way, think of it like this: the stream is the device (keyboard, monitor, or a file on your hard drive), the buffer is where the data is held while the operating system has control, and you interact with the buffer while you are processing the data.
That abstraction allows you to use all of these different things in a common manner regardless of what they are, for example: fgets(str, sizeof(str), STREAM) ... stream can be any input stream, be it stdin or a file.
Taking it a step further, that's why new programmers get thrown off by a scanf for an int followed by an fgets: scanf reads the int from the buffer but leaves the \n in the buffer, then the call to fgets reads the \n that scanf left there, and the new programmer is left wondering why they were unable to input any data. So your curiosity about streams and buffers will serve you well as you move forward in your learning about C.
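Here is a minimal sketch of that pitfall (the prompt text is just for illustration):

#include <stdio.h>

int main(void)
{
    int n = 0;
    char line[100];

    printf("Enter a number, then a word: ");
    scanf("%d", &n);                  /* reads the digits, but leaves the '\n' in the buffer */

    fgets(line, sizeof line, stdin);  /* consumes only that leftover '\n', not a real line */
    printf("n = %d, fgets got: \"%s\"\n", n, line);

    /* a common fix: eat the leftover newline before calling fgets */
    /* int c; while ((c = getchar()) != '\n' && c != EOF) { } */
    return 0;
}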
Those are actually pretty good working definitions.
In practical C terms, a buffer is an array (usually of char or unsigned char type) that's used to store data, either as a result of an input operation, or before sending to output. The array can be declared as a fixed size array, such as
char buffer[SOME_BUFFER_SIZE];
or dynamically, using
char *buffer = malloc( SOME_BUFFER_SIZE * sizeof *buffer );
The advantage of using dynamic memory is that the buffer can be resized if necessary; the disadvantage is you have to manage the lifetime of that memory.
For text input/output you'd typically use arrays of char; for binary input/output you'd typically use arrays of unsigned char.
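As a rough sketch of both conventions (the file names here are made up):

#include <stdio.h>

int main(void)
{
    char line[256];               /* text buffer */
    unsigned char block[512];     /* binary buffer */

    FILE *txt = fopen("notes.txt", "r");       /* hypothetical text file */
    if (txt && fgets(line, sizeof line, txt))  /* one line of text into the char buffer */
        printf("first line: %s", line);
    if (txt)
        fclose(txt);

    FILE *bin = fopen("image.dat", "rb");      /* hypothetical binary file */
    if (bin) {
        size_t got = fread(block, 1, sizeof block, bin);  /* raw bytes into the unsigned char buffer */
        printf("read %zu bytes\n", got);
        fclose(bin);
    }
    return 0;
}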
It's fairly common for systems communicating over a network to send data in fixed-size "chunks", such that you may need several read or write operations to get all the data across. Think of a Web server and a browser - the server sends the HTML in multiple messages, and the browser stores the intermediate result in a buffer. It's only when all the data has been received that the browser renders the page:
Received from Web server     Stored in browser's input buffer
------------------------     --------------------------------
HTTP/1.1 200 OK \r\n         <!DOCTYPE HTML><html
Content-length: 20\r\n
<!DOCTYPE HTML><html

HTTP/1.1 200 OK \r\n         <!DOCTYPE HTML><html><head><title>This i
Content-length: 20\r\n
><head><title>This i

HTTP/1.1 200 OK \r\n         <!DOCTYPE HTML><html><head><title>This i
Content-length: 20\r\n       s a test</title></he
s a test</title></he

HTTP/1.1 200 OK \r\n         <!DOCTYPE HTML><html><head><title>This i
Content-length: 20\r\n       s a test</title></head><body><p>Hello, W
ad><body><p>Hello, W

HTTP/1.1 200 OK \r\n         <!DOCTYPE HTML><html><head><title>This i
Content-length: 19           s a test</title></head><body><p>Hello, W
orld!</body></html>          orld!</body></html>
No sane server sends HTML in chunks of 20 characters, but this should illustrate why and how buffers get used.
The definitions are not bad; actually, they are very good. You could add (from an object-oriented perspective) that a STREAM uses a BUFFER.
The use of a BUFFER might be necessary for performance reasons, for example, since every system call comes with a relatively high cost.
I/O system calls especially (hard disk or network access) are slow compared to memory access times, and the costs add up if each read or write consists of only a single byte.
Two common abstractions of I/O devices are:
Streams - transfer a variable number of bytes as the device becomes ready.
Blocks - transfer fixed-size records.
A buffer is just an area of memory which holds the data being transferred.
My question is regarding the following paragraph on page 15 (Section 1.5) of The ANSI C Programming Language (2e) by Kernighan and Ritchie (emphasis added):
The model of input and output supported by the standard library is very simple.
Text input or output, regardless of where it originates or where it goes to,
is dealt with as a stream of characters. A text stream is a sequence of characters divided
into lines; each line consists of zero or more characters followed by a newline character.
It is the responsibility of the library to make each input or output stream conform to
this model; the C programmer using the library need not worry about how lines are
represented outside the program.
I'm unsure of what is meant by the text in bold, especially the line "it is the responsibility of the library to make each input or output stream conform to this model." Could someone please help me understand what this means?
At first, I thought it had something to do with the line-buffering of stdin I was seeing when I call getchar() when stdin is empty, but then learned that the buffering mode varies across implementations (see here). So I don't think this is what the text in bold is referring to when it talks about conforming to the text stream model.
Consider running code like printf("hello world"); in the firmware of a USB device. Suppose that whatever characters you pass to printf are sent over USB from the device to the computer. The way the USB protocol works, the characters must be split up into groups of characters called packets. There is a maximum packet size depending on how your USB hardware and descriptors are configured. Also, for efficiency, you want to fill up the packets whenever possible, because sending a packet that is less than the maximum size means the computer will stop letting you send more data for a while. Also, if the computer doesn't receive your packet, you might need to re-send it. Also, if your USB packet buffers are already filled, you might need to wait a while until one of them gets sent.
To make programming in C a manageable task, the implementation of printf needs to handle all of these details so the user doesn't need to worry about them when they are calling printf. For example, it would be really bad if printf was only able to send a single packet of 1 to 8 bytes whenever you call it, and thus it returns an error whenever you give it more than 8 characters.
This is called an abstraction: the underlying system has some complexity (like USB endpoints, packets, buffers, retries). You don't want to think about that stuff all the time so you make a library that transforms that stuff into a more abstract interface (like a stream of characters). Or you just use a "standard library" written by someone else that takes care of that for you.
If you want a more PC-centric example... I believe that printf is implemented on many systems by calling the write system call. Since write isn't always guaranteed to actually write all of the data you give it, the implementation of printf needs to try multiple times to write the data you give it. Also, for efficiency, the printf implementation might buffer the data you give it in RAM for a while before passing it to the kernel with write. You don't generally have to worry about retrying or buffering details while programming in C because once your program terminates or you flush the buffer, the standard library makes sure all your data has been written.
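As an illustration (not the actual printf source), here is the kind of retry loop an implementation might wrap around the POSIX write call; write_all is a made-up helper name:

#include <unistd.h>   /* write() is POSIX, not standard C */

/* Keep calling write() until the whole buffer has been accepted or an error occurs. */
int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0)
            return -1;        /* real code would also check errno for EINTR and retry */
        done += (size_t)n;    /* write() may accept fewer bytes than asked for */
    }
    return 0;
}

For example, write_all(1, "hello world\n", 12) would keep calling write until all 12 bytes have been handed to the kernel.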
I'm a newbie programmer; can you help me imagine what a stream is? Is it a fixed array of bytes that transfers data from, e.g., a file to Y? And what is Y here, a buffer or something else?
In what way is the buffer related to the stream?
A stream is either a source (input stream) or sink (output stream) of data, that is available (or provided) in time (as opposed to all at once).
A buffer is an array (a piece of memory) that is used to store data temporarily. An input buffer is typically filled from an input stream by the OS; an output buffer (once filled by the programmer) is consumed by the OS.
Imagine you want to fill a tub with water. You start with a water source, like a water tank or the public waterworks, whose water can be delivered through a tap. You put a bucket under the tap and turn it on. When the bucket is full, you dump it into the tub and put it back under the tap. You repeat this until your tub is full.
Loading a file, for example, works almost the same way. You have a data source (the file on disk); you open an input stream (a programmatic construct that will give you data generally as fast as the disk can read them). You allocate a buffer (a small memory area) and tell the system to fill it from the stream. When it is full, you append it to the big chunk of allocated memory that you reserved for file contents, then let the buffer be filled again. When the whole file is read, you close the stream.
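A rough sketch of that bucket-and-tub loop in C (error handling trimmed; the file name is made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("data.txt", "rb");   /* the stream: our "water tap" */
    if (!f)
        return 1;

    char bucket[4096];        /* the small buffer that gets filled repeatedly */
    char *tub = NULL;         /* the big chunk holding the whole file */
    size_t total = 0, got;

    while ((got = fread(bucket, 1, sizeof bucket, f)) > 0) {
        tub = realloc(tub, total + got);   /* grow the destination and append the bucket */
        memcpy(tub + total, bucket, got);
        total += got;
    }
    fclose(f);                /* close the stream when the whole file has been read */

    printf("read %zu bytes\n", total);
    free(tub);
    return 0;
}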
The difference between a buffer and a stream is:
A Stream is a sequence of bytes of data that transfers information from or to a specified source.
A sequence of bytes flowing into a program is called an input stream. A sequence of bytes flowing out of the program is called an output stream.
Using streams makes I/O machine independent.
A Buffer is a sequence of bytes that are stored in memory.
In C, you often don't know when data will arrive on a stream, or how much of it. So a buffer is usually used to collect data from the stream (file, socket, device). When the buffer is full, consumers of that stream are notified and can consume data from the buffer until it is depleted, then wait for the buffer to be filled again. It is a place to store something temporarily in order to bridge the difference between input speed and output speed: while the producer is faster than the consumer, the producer can keep storing its output in the buffer; when the consumer catches up, it reads from the buffer. The buffer sits in the middle to bridge the gap.
Y in your question can be a file, a socket, or an I/O device.
Hope this solves your query :)
I am having a really hard time understanding the depths of buffering, especially in C programming, and I have searched on this topic for a long time but haven't found anything satisfying so far.
I will be a little more specific:
I do understand the concept behind it (i.e. coordinating operations by different hardware devices and minimizing the difference in speed between them), but I would appreciate a fuller explanation of these and other potential reasons for buffering (and by full I mean the longer and deeper the better). It would also be really nice to see some concrete examples of how buffering is implemented in I/O streams.
The other question is that I noticed that some of the rules of buffer flushing don't seem to be followed by my programs, as weird as that sounds. Consider the following simple fragment:
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("hallo.txt", "w");
    fputc('A', fp);
    getchar();
    fputc('A', fp);
    getchar();
    return 0;
}
The program is intended to demonstrate that impending input will flush an arbitrary stream immediately when the first getchar() is called, but this simply doesn't happen, no matter how often I try it or how I modify it. For stdout, on the other hand (with printf() for example), the stream is flushed without any input being requested, which also negates the rule. So am I understanding this rule wrongly, or is there something else to consider?
I am using Gnu GCC on Windows 8.1.
Update:
I forgot to ask: I have read on some sites how people refer to, e.g., string literals as buffers, or even arrays as buffers; is this correct, or am I missing something?
Please explain this point too.
The word buffer is used for many different things in computer science. In the more general sense, it is any piece of memory where data is stored temporarily until it is processed or copied to the final destination (or other buffer).
As you hinted in the question there are many types of buffers, but as a broad grouping:
Hardware buffers: These are buffers where data is stored before being moved to a HW device, or where data is stored while being received from the HW device until it is processed by the application. This is needed because the I/O operation usually has memory and timing requirements, and the buffer fulfils them. Think of DMA devices that read/write directly to memory: if the memory is not set up properly, the system may crash. Or sound devices that must have sub-microsecond precision or they will work poorly.
Cache buffers: These are buffers where data is grouped before writing into/read from a file/device so that the performance is generally improved.
Helper buffers: You move data into/from such a buffer, because it is easier for your algorithm.
Case #2 is that of your FILE* example. Imagine that a call to the write system call (WriteFile() in Win32) takes 1ms for just the call, plus 1us for each byte (bear with me, things are more complicated in the real world). Then, if you do:
FILE *f = fopen("file.txt", "w");
for (int i = 0; i < 1000000; ++i)
    fputc('x', f);
fclose(f);
Without buffering, this code would take 1000000 * (1ms + 1us); that's about 1000 seconds. However, with a buffer of 10000 bytes, there will be only 100 system calls of 10000 bytes each. That would be 100 * (1ms + 10000us), which is about 1.1 seconds: nearly a thousand times faster!
Note also that the OS will do its own buffering, so that the data is written to the actual device using the most efficient size. That will be a HW and cache buffer at the same time!
About your problem with flushing: files are usually flushed only when they are closed or when you flush them manually. Some streams, such as stdout when attached to a terminal, are line buffered, i.e. they are flushed whenever a '\n' is written. Also, stdin and stdout are linked on many implementations: when you read from stdin, stdout is flushed first; other files are untouched, only stdout. That is handy if you are writing an interactive program.
My case #3 is for example when you do:
FILE *f = fopen("x.txt", "r");
char buffer[1000];
fgets(buffer, sizeof(buffer), f);
int n;
sscanf(buffer, "%d", &n);
You use the buffer to hold a line from the file, and then you parse the data from the line. Yes, you could call fscanf() directly, but other APIs may not have an equivalent function, and moreover you have more control this way: you can analyze the type of line, skip comments, count lines...
Or imagine that you receive one byte at a time, for example from a keyboard. You will just accumulate characters in a buffer and parse the line when the Enter key is pressed. That is what most interactive console programs do.
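A bare-bones version of that accumulate-until-Enter loop might look like this (fixed-size line, nothing clever beyond the length check):

#include <stdio.h>

int main(void)
{
    char line[128];
    size_t len = 0;
    int c;

    /* collect characters into the buffer until Enter (or EOF, or the buffer fills up) */
    while ((c = getchar()) != EOF && c != '\n' && len < sizeof line - 1)
        line[len++] = (char)c;
    line[len] = '\0';

    printf("you typed: %s\n", line);   /* now parse/process the completed line */
    return 0;
}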
The noun "buffer" really refers to a usage, not a distinct thing. Any block of storage can serve as a buffer. The term is intentionally used in this general sense in conjunction with various I/O functions, though the docs for the C I/O stream functions tend to avoid that. Taking the POSIX read() function as an example, however: "read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf". The "buffer" in that case simply means the block of memory in which the bytes read will be recorded; it is ordinarily implemented as a char[] or a dynamically-allocated block.
One uses a buffer especially in conjunction with I/O because some devices (especially hard disks) are most efficiently read in medium-to-large sized chunks, whereas programs often want to consume that data in smaller pieces. Some other forms of I/O, such as network I/O, may inherently come in chunks, so that you must record each whole chunk (in a buffer) or else lose the part you're not immediately ready to consume. Similar considerations apply to output.
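For instance, a small sketch of a single POSIX read() into a caller-supplied buffer (the file name is invented):

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    char buf[4096];                       /* the "buffer" in read()'s sense: just a block of memory */
    int fd = open("data.bin", O_RDONLY);  /* hypothetical file */
    if (fd < 0)
        return 1;

    ssize_t n = read(fd, buf, sizeof buf);   /* up to 4096 bytes land in buf */
    if (n > 0)
        printf("got %zd bytes in one read\n", n);

    close(fd);
    return 0;
}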
As for your test program's behavior, the "rule" you hoped to demonstrate is specific to console I/O, but only one of the streams involved is connected to the console.
The first question is a bit too broad. Buffering is used in many cases, including message storage before actual usage, DMA uses, speedup usages and so on. In short, the entire buffering thing can be summarized as "save my data, let me continue execution while you do something with the data".
Sometimes you may modify buffers after passing them to functions, sometimes not. Sometimes buffers are hardware, sometimes software. Sometimes they reside in RAM, sometimes in other memory types.
So, please ask a more specific question. As a starting point, use Wikipedia; it is almost always helpful: wiki
As for the code sample, I haven't found any mention of all output buffers being flushed upon getchar. Buffers for files are generally flushed in three cases:
fflush() or equivalent
File is closed
The buffer overflows.
Since none of these happens while your program is waiting in getchar(), the file is not flushed at those points. (Normal program termination does close, and therefore flush, all open streams.)
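If you do want the first 'A' to reach hallo.txt before the program blocks in getchar(), you have to force it out yourself; a minimal variation of your fragment:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("hallo.txt", "w");
    if (!fp)
        return 1;

    fputc('A', fp);
    fflush(fp);      /* explicitly push the buffered 'A' out to the file */
    getchar();

    fputc('A', fp);
    fclose(fp);      /* closing the stream flushes whatever is still buffered */
    getchar();
    return 0;
}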
A buffer is a small area of your memory (RAM), and that area is responsible for storing information before it is sent to your program. As long as I am typing characters on the keyboard, those characters are stored in the buffer; as soon as I press the Enter key, they are transported from the buffer into your program. So, with the help of the buffer, all these characters are available to your program at once (preventing lag and slowness) and can then be sent to the output display.
I have just started learning C programming, so I'm a beginner. While learning about the standard text streams, I came across the statement that the "stdout" stream is buffered while the "stderr" stream is not buffered, but I am not able to make sense of it.
I have already read about "buffer" on this forum and I like the candy analogy, but I am not able to figure out what is meant when one says: "This stream is buffered and the other one is not." What is the effect?
What is the difference?
Update: Does it affect the speed of processing?
Buffer is a block of memory which belongs to a stream and is used to hold stream data temporarily. When the first I/O operation occurs on a file, malloc is called and a buffer is obtained. Characters that are written to a stream are normally accumulated in the buffer (before being transmitted to the file in chunks), instead of appearing as soon as they are output by the application program. Similarly, streams retrieve input from the host environment in blocks rather than on a character-by-character basis. This is done to increase efficiency, as file and console I/O is slow in comparison to memory operations.
The C standard library provides three types of buffering - unbuffered, block buffered, and line buffered. Unbuffered means that characters appear on the destination file as soon as they are written (for an output stream), or that input is read from a file on a character-by-character basis instead of in blocks (for input streams). Block buffered means that characters are saved up in the buffer and written or read as a block. Line buffered means that characters are saved up only until a newline is written into or read from the buffer.
stdin and stdout are block buffered if and only if they can be determined not to refer to an interactive device; otherwise they are line buffered (this holds for any stream). stderr is never fully buffered by default; on most implementations it is simply unbuffered.
The standard library provides functions to alter the default behaviour of streams. You can use fflush to force the data out of the output stream buffer (fflush is undefined for input streams). You can make the stream unbuffered using the setbuf function.
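A small sketch of both options on stdout (assuming stdout happens to be block buffered, e.g. redirected to a file):

#include <stdio.h>

int main(void)
{
    /* Option 1: keep the default buffering, but flush at chosen points. */
    printf("working...");        /* may sit in stdout's buffer for a while */
    fflush(stdout);              /* ...until we force it out here */
    printf(" done\n");

    /* Option 2 (must be done before any other I/O on the stream):
       setbuf(stdout, NULL);     -- makes stdout completely unbuffered */
    return 0;
}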
Buffering is collecting up many elements before writing them, or reading many elements at once before processing them. Lots of information out there on the Internet, for example, this
and other SO questions like this
EDIT in response to the question update: Yes, it's done for performance reasons. Writing to and reading from disks etc. will in any case transfer a 'block' of some sort for most devices, and there's a fair overhead in doing so, so batching these operations up can make for a dramatic performance difference.
A program writing to buffered output can perform the output in the time it takes to write to the buffer, which is typically very fast and independent of the speed of the output device, which may be slow.
With buffered output, the information is queued and a separate process deals with actually rendering the output.
With unbuffered output, the data is written directly to the output device, so it runs at the speed of the device. This is important for error output, because if the output were buffered, the process could fail before the buffered output made it to the display, and the program might terminate with no diagnostic output.
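A quick way to observe the difference (what you actually see depends on where stdout is going, e.g. a terminal vs. a redirected file):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("progress report");          /* no newline: may still be sitting in stdout's buffer */
    fprintf(stderr, "error report\n");  /* stderr is unbuffered: this appears immediately */

    abort();    /* crash before stdout is flushed: the printf text can be lost,
                   especially when stdout is redirected to a file or pipe */
}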
Why would you want to set aside a block of memory in setvbuf()?
I have no clue why you would want to send your read/write stream to a buffer.
setvbuf is not intended to redirect the output to a buffer (if you want to perform IO on a buffer you use sprintf & co.), but to tightly control the buffering behavior of the given stream.
In fact, C I/O functions don't immediately pass the data to be written to the operating system; they keep an intermediate buffer to avoid continually performing (potentially expensive) system calls, waiting for the buffer to fill before actually performing the write.
The most basic case is to disable buffering altogether (useful e.g. if writing to a log file, where you want the data to go to disk immediately after each output operation) or, on the other hand, to enable block buffering on streams where it is disabled by default (or is set to line-buffering). This may be useful to enhance output performance.
Setting a specific buffer for output can be useful if you are working with a device that is known to work well with a specific buffer size; on the other hand, you may want a small buffer to cut down on memory usage in memory-constrained environments, or to avoid losing too much data in case of a power loss without disabling buffering completely.
In C, files opened with e.g. fopen are buffered by default. You can use setvbuf to supply your own buffer, or to make the file operations completely unbuffered (as stderr is).
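A sketch of both uses (the file name and buffer size are arbitrary; note that the buffer you supply must stay valid until the stream is closed):

#include <stdio.h>

int main(void)
{
    static char mybuf[1 << 16];              /* our own 64 KiB buffer; static so it outlives the stream */

    FILE *f = fopen("out.dat", "w");
    if (!f)
        return 1;

    /* must be called before any other operation on the stream */
    setvbuf(f, mybuf, _IOFBF, sizeof mybuf); /* fully buffered, using our buffer */
    /* setvbuf(f, NULL, _IONBF, 0);          would make it completely unbuffered instead */

    for (int i = 0; i < 1000000; ++i)
        fputc('x', f);                       /* accumulates in mybuf, few actual system calls */

    fclose(f);                               /* flushes the buffer and releases the stream */
    return 0;
}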
It can be used to create fmemopen functionality on systems that don't have that function.
The size of a file's buffer can affect standard library I/O rates. There is a table in Chapter 5 of Stevens' 'Advanced Programming in the UNIX Environment' that shows I/O throughput increasing dramatically with I/O buffer size, up to roughly 16K, then leveling off. A lot of other factors can influence overall I/O throughput, so this one "tuning" knob may or may not be a cure-all. This is the main reason "why", other than turning buffering on or off.
Each FILE structure has a buffer associated with it internally. The reason for this is to reduce I/O, because real I/O operations are time-costly.
All your reads and writes will be buffered until the buffer is full; then all the buffered data is output or input in one real I/O operation.
Why would you want to set aside a block of memory in setvbuf()?
For buffering.
I have no clue why you would want to send your read/write stream to a buffer.
Neither do I, but as that's not what it does, the point is moot.
"The setvbuf() function may be used on any open stream to change its buffer" [my emphasis]. In other words, it already has a buffer, and all the function does is change that. It doesn't say anything about 'sending your read/write streams to a buffer'. I suggest you read the man page to see what it actually says, especially this part:
When an output stream is unbuffered, information appears on the destination file or terminal as soon as written; when it is block buffered many characters are saved up and written as a block; when it is line buffered characters are saved up until a newline is output or input is read from any stream attached to a terminal device (typically stdin).