I'm a newbie programmer. Can you help me imagine what a stream is? Is it a fixed array of bytes that transfers data from, e.g., a file to Y? And what is Y here, a buffer or something else?
In what way is the buffer related to the stream?
A stream is either a source (input stream) or sink (output stream) of data that is available (or provided) over time, as opposed to all at once.
A buffer is an array (a piece of memory) that is used to store data temporarily. An input buffer is typically filled from an input stream by the OS; an output buffer (once filled by the programmer) is consumed by the OS.
Imagine you want to fill a tub with water. You start with a water source, like a water tank or the public waterworks, whose water can be transferred through a tap. You put a bucket under the tap and turn it on. When the bucket is full, you dump it into the tub and put it back under the tap. You repeat that until your tub is full.
Loading a file, for example, works almost the same way. You have a data source (the file on disk); you open an input stream (a programmatic construct that will give you data generally as fast as the disk can read them). You allocate a buffer (a small memory area) and tell the system to fill it from the stream. When it is full, you append it to the big chunk of allocated memory that you reserved for file contents, then let the buffer be filled again. When the whole file is read, you close the stream.
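As a minimal sketch of that bucket-and-tub loop in C (the file name and buffer size here are arbitrary choices for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *in = fopen("data.bin", "rb");     /* the water source */
    if (in == NULL) {
        perror("fopen");
        return 1;
    }

    char bucket[4096];      /* the bucket: a small, fixed-size buffer */
    char *tub = NULL;       /* the tub: grows to hold the whole file */
    size_t total = 0;
    size_t got;

    while ((got = fread(bucket, 1, sizeof bucket, in)) > 0) {
        char *bigger = realloc(tub, total + got);
        if (bigger == NULL) {
            free(tub);
            fclose(in);
            return 1;
        }
        tub = bigger;
        memcpy(tub + total, bucket, got);   /* dump the bucket into the tub */
        total += got;
    }

    fclose(in);             /* close the stream: turn off the tap */
    printf("read %zu bytes\n", total);
    free(tub);
    return 0;
}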
The difference between a buffer and a stream:
A stream is a sequence of bytes of data that transfers information from or to a specified source.
A sequence of bytes flowing into a program is called an input stream; a sequence of bytes flowing out of the program is called an output stream.
The use of streams makes I/O machine-independent.
A buffer is a sequence of bytes that is stored in memory.
In C, I/O can be asynchronous: you don't know when you will have data, nor how much of it. So a buffer is usually used to collect data from the stream (file, socket, device). When the buffer is full, consumers of that stream are notified and can consume data from it until it is depleted, then wait for the buffer to be filled again. The buffer is a place to store something temporarily, in order to mitigate the difference between input speed and output speed: while the producer is faster than the consumer, the producer can keep storing output in the buffer; when the consumer catches up, it can read from the buffer. The buffer sits in the middle to bridge the gap.
Y in your question can be a file, a socket, or an I/O device.
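As a rough illustration of a buffer bridging two streams, here is the classic copy loop (a minimal sketch; the 4096-byte buffer size is an arbitrary choice):

#include <stdio.h>

/* Copy stdin to stdout through one fixed-size buffer: the producer
   (whatever feeds stdin) and the consumer (whatever reads stdout)
   never touch each other directly, only the buffer in between. */
int main(void)
{
    char buffer[4096];
    size_t n;

    while ((n = fread(buffer, 1, sizeof buffer, stdin)) > 0)
        fwrite(buffer, 1, n, stdout);

    return 0;
}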
Hope this solves your query :)
Related
I'm new to C. I'm currently reading K&R, and I got confused by its definition of text streams: "A text stream is a sequence of characters divided into lines; each line consists of zero or more characters followed by a newline character."
While trying to learn about these streams, I was introduced to a new term, namely buffer.
I just know that:
A continuous flow of data (bytes or characters) between input and output devices is a STREAM.
A temporary storage area in main memory used to store input or output data temporarily is a BUFFER.
I don't claim that I'm right, but that's my basic idea of those terms.
I want to know what a buffer and a stream actually are, and how these two things work together, at the non-abstract level of a C implementation.
You have three standard streams in C: stdin, stdout, and stderr. You can also think of files you have opened with fopen, for example, as streams. stdin is generally the keyboard, stdout is generally your monitor, and stderr is generally also your monitor. But they don't have to be; they are abstractions for the hardware.
If, for example, you didn't have a keyboard but a keypad on a bank ATM, then stdin would be the keypad; if you didn't have a monitor but instead had a printer, then stdout would be the printer. You change what hardware they use with calls to your operating system. You can also change their behaviour, again through calls to your operating system, but that is beyond the scope of what you're asking.
So, in a way, think of the buffer as memory allocated by the operating system, associated with the stream, to hold the data received from the hardware component. When you type at your keyboard, for example, the characters you type aren't captured directly by your program; they move from the keyboard into the buffer, and then you read the buffer.
That's why, for example, you have to hit Enter before your code starts interacting with whatever you typed into the keyboard: stdin is line-buffered. Control passes from your program to the operating system until the OS encounters something that sends control back to your program; in the normal situation, that is the newline character.
So, in a way, think of it like this: the stream is the device (keyboard, monitor, or a file on your hard drive); the buffer is where the data is held while the operating system has control; and you interact with the buffer while you are processing the data.
That abstraction allows you to use all of these different things in a common manner regardless of what they are. For example, fgets(str, sizeof(str), STREAM) ... the STREAM can be any input stream, be it stdin or a file.
Taking it a step further, that's why new programmers get thrown off by a scanf for an int followed by an fgets: scanf reads the int from the buffer but leaves the \n in the buffer. The subsequent call to fgets reads the \n that scanf left there, and the new programmer is left wondering why they were unable to input any data. So your curiosity about streams and buffers will serve you well as you move forward in learning C.
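Here is a minimal demonstration of that pitfall; the drain loop in the middle is the usual fix (remove it to see fgets come back immediately with just "\n"):

#include <stdio.h>

int main(void)
{
    int age;
    char name[64];

    printf("Age: ");
    if (scanf("%d", &age) != 1)   /* reads the digits, leaves '\n' behind */
        return 1;

    int c;                        /* drain the leftover '\n' (and anything */
    while ((c = getchar()) != '\n' && c != EOF)   /* else) out of the buffer */
        ;

    printf("Name: ");
    if (fgets(name, sizeof name, stdin) == NULL)  /* now reads a real line */
        return 1;

    printf("age=%d name=%s", age, name);
    return 0;
}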
Those are actually pretty good working definitions.
In practical C terms, a buffer is an array (usually of char or unsigned char type) that's used to store data, either as a result of an input operation, or before sending to output. The array can be declared as a fixed size array, such as
char buffer[SOME_BUFFER_SIZE];
or dynamically, using
char *buffer = malloc( SOME_BUFFER_SIZE * sizeof *buffer );
The advantage of using dynamic memory is that the buffer can be resized if necessary; the disadvantage is that you have to manage the lifetime of that memory yourself (eventually calling free()).
For text input/output you'd typically use arrays of char; for binary input/output you'd typically use arrays of unsigned char.
It's fairly common for systems communicating over a network to send data in fixed-size "chunks", such that you may need several read or write operations to get all the data across. Think of a Web server and a browser - the server sends the HTML in multiple messages, and the browser stores the intermediate result in a buffer. It's only when all the data has been received that the browser renders the page:
Received from Web server    Stored in browser's input buffer
------------------------    --------------------------------
HTTP/1.1 200 OK \r\n        <!DOCTYPE HTML><html
Content-length: 20\r\n
<!DOCTYPE HTML><html

HTTP/1.1 200 OK \r\n        <!DOCTYPE HTML><html><head><title>This i
Content-length: 20\r\n
><head><title>This i

HTTP/1.1 200 OK \r\n        <!DOCTYPE HTML><html><head><title>This i
Content-length: 20\r\n      s a test</title></he
s a test</title></he

HTTP/1.1 200 OK \r\n        <!DOCTYPE HTML><html><head><title>This i
Content-length: 20\r\n      s a test</title></head><body><p>Hello, W
ad><body><p>Hello, W

HTTP/1.1 200 OK \r\n        <!DOCTYPE HTML><html><head><title>This i
Content-length: 19          s a test</title></head><body><p>Hello, W
orld!</body></html>         orld!</body></html>
No sane server sends HTML in chunks of 20 characters, but this should illustrate why and how buffers get used.
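For completeness, a hedged sketch of the receiving side in C (using POSIX recv(); the socket descriptor and the expected total length are assumed to come from elsewhere, e.g. the Content-length header):

#include <sys/types.h>
#include <sys/socket.h>

/* Accumulate 'expected' bytes from 'sock' into 'buffer'. Each recv()
   may deliver only part of the data, so keep appending until everything
   has arrived or the peer closes the connection. Returns the number of
   bytes actually collected. */
size_t collect(int sock, char *buffer, size_t expected)
{
    size_t total = 0;

    while (total < expected) {
        ssize_t got = recv(sock, buffer + total, expected - total, 0);
        if (got <= 0)       /* error, or connection closed by the peer */
            break;
        total += (size_t)got;
    }
    return total;
}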
The definitions are not bad; actually, they are very good. You could add (from an object-oriented perspective) that a STREAM uses a BUFFER.
The use of a BUFFER might be necessary for performance reasons, for example, since every system call comes with a relatively high cost.
I/O system calls in particular are slow: hard-disk or network access is slow compared to memory access times. And the costs add up if each read or write transfers only a single byte.
Two common abstractions of I/O devices are:
Streams - transfers a variable number of bytes as the device becomes ready.
Block - transfers fixed-size records.
A buffer is just an area of memory which holds the data being transferred.
I've read that pipes need to have a limited capacity. But I don't understand why. What happens if a process writes into a pipe without a limit?
It's due to buffering. Pipes are not "magical", pipes do not ensure all processes process each individual byte or character in lockstep. Instead pipes buffer inter-process output and then pass the buffer along. And this buffer size limit is what you're referring to. In many Linux distros and in macOS the buffer size is 64KiB.
Imagine there's a process that outputs 1 GB of data every second to stdout, piped to another process that can only process 100 bytes of data every minute on stdin: those gigabytes of data have to go somewhere. If there were an infinitely sized buffer, you would quickly fill up the memory space of whatever OS component owns the pipe and then start paging out to disk - and then your pagefile on disk would fill up, and that's not good.
By having maximum buffer sizes, the output process will be notified when it's filled the buffer and it's free to handle that event however is appropriate (e.g. by pausing output if it's a random number generator, by dropping data if it's a network monitor, by crashing, etc).
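On Linux specifically, you can even query a pipe's capacity with fcntl() and F_GETPIPE_SZ; a small sketch, assuming a Linux system:

#define _GNU_SOURCE             /* F_GETPIPE_SZ is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }

    /* Ask the kernel how much the pipe can hold before a writer blocks. */
    int capacity = fcntl(fds[1], F_GETPIPE_SZ);
    printf("pipe capacity: %d bytes\n", capacity);   /* commonly 65536 */

    close(fds[0]);
    close(fds[1]);
    return 0;
}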
Internal mechanisms aside, I suspect the root issue behind the question is one of terminology. Pipes have limited capacity, but unlimited overall volume of data transferred.
The analogy to a piece of physical plumbing is pretty good: a given piece of water pipe has a characteristic internal volume defined by its length, its shape, and the cross section of its interior. At any given time, it cannot hold any more water than fits in that volume, so if you close a valve at its downstream end, water eventually (maybe immediately) stops flowing into its other end because all the available space within -- the pipe's capacity -- is full. Until and unless the pipe is permanently closed, however, there is no bound on how much water may traverse it over its lifetime.
In C, while reading into a buffer from a socket file descriptor, how do I make the read stop if a delimiter is detected? Let's assume the delimiter is a '>' character.
read(socket_filedes, buffer, MAXSZ);
/* stop if delimiter '>' is detected */
You have two options here:
Read a single byte at a time until you encounter a delimiter. This is likely to be very inefficient.
Read in a full buffer's worth of data at a time, then look for the delimiter there. When you find it, save off the remaining data in another buffer and process the data you want. When you're ready to read again, put the saved data back in the buffer and call read with the address of the next available byte. (A rough sketch follows.)
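A hedged sketch of the second option (handle_message is a hypothetical consumer, and the case of a full buffer with no delimiter is left unhandled for brevity):

#include <string.h>
#include <unistd.h>

#define MAXSZ 4096

void handle_message(const char *msg, size_t len);  /* hypothetical consumer */

void read_until_delim(int socket_filedes)
{
    static char buffer[MAXSZ];
    static size_t have = 0;     /* bytes carried over from the last call */

    ssize_t got = read(socket_filedes, buffer + have, sizeof buffer - have);
    if (got <= 0)
        return;                 /* error or end of stream */
    have += (size_t)got;

    char *delim = memchr(buffer, '>', have);
    if (delim != NULL) {
        size_t msg_len = (size_t)(delim - buffer) + 1;
        handle_message(buffer, msg_len);       /* data up to and incl. '>' */
        memmove(buffer, buffer + msg_len, have - msg_len);  /* keep the rest */
        have -= msg_len;
    }
}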
The read() function does not examine the data as it transfers it from source to buffer. You cannot force it to stop transferring data at a specific character or characters if it would not otherwise have stopped there.
On the other hand, it is important to recognize that read() does not necessarily read the full number of bytes specified in any case. That means you need to be prepared to run read() calls in a loop to collect all the data you expect; but it also means you can usually expect read() to return once it has transferred at least one byte and no more data are immediately available. Thus, if the sender stops sending data after the delimiter, then read() will probably stop reading after that delimiter.
If you cannot rely on the sender to break up the transmission as you require, then you have the option of reading one byte at a time. That can be awfully inefficient, however. The more usual solution is to do the job in two stages: (1) perform fast, block-wise read()s from the kernel into a userspace buffer, and then (2) parse the buffer contents in your userspace code. This is basically what buffered C streams do.
I am having a really hard time understanding the depths of buffering, especially in C programming. I have searched for a long time on this topic but haven't found anything satisfying so far.
I will be a little more specific:
I do understand the concept behind it (i.e., coordinating operations between different hardware devices and minimizing the difference in their speeds), but I would appreciate a fuller explanation of these and other potential reasons for buffering (the longer and deeper the better). It would also be really nice to see some concrete examples of how buffering is implemented in I/O streams.
The other question is that I noticed some buffer-flushing rules aren't followed by my programs, as weird as this sounds. Take the following simple fragment:
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("hallo.txt", "w");
    fputc('A', fp);
    getchar();
    fputc('A', fp);
    getchar();
    return 0;
}
The program is intended to demonstrate that pending input will flush an arbitrary stream immediately when the first getchar() is called, but this simply doesn't happen, no matter how often I try it or how much I modify the program. Also, for stdout (with printf(), for example) the stream is flushed without any input being requested, which also contradicts the rule. Am I misunderstanding this rule, or is there something else to consider?
I am using GNU GCC on Windows 8.1.
Update:
I forgot to ask: I have read on some sites that people refer to, e.g., string literals or even arrays as buffers; is this correct, or am I missing something?
Please explain this point too.
The word buffer is used for many different things in computer science. In the more general sense, it is any piece of memory where data is stored temporarily until it is processed or copied to the final destination (or other buffer).
As you hinted in the question there are many types of buffers, but as a broad grouping:
Hardware buffers: These are buffers where data is stored before being moved to a HW device, or where data received from the HW device is stored until it is processed by the application. This is needed because the I/O operation usually has memory and timing requirements, and these are fulfilled by the buffer. Think of DMA devices that read/write directly to memory: if the memory is not set up properly, the system may crash. Or sound devices that must be fed with sub-microsecond precision or they will work poorly.
Cache buffers: These are buffers where data is grouped before being written to, or after being read from, a file/device, so that performance is generally improved.
Helper buffers: You move data into/from such a buffer, because it is easier for your algorithm.
Case #2 is that of your FILE* example. Imagine that a call to the write system call (WriteFile() in Win32) takes 1 ms for the call itself plus 1 µs for each byte (bear with me; things are more complicated in the real world). Then, if you do:
FILE *f = fopen("file.txt", "w");
for (int i = 0; i < 1000000; ++i)
    fputc('x', f);
fclose(f);
Without buffering, this code would take 1000000 * (1 ms + 1 µs); that's about 1000 seconds. However, with a buffer of 10000 bytes, there will be only 100 system calls of 10000 bytes each. That would be 100 * (1 ms + 10000 µs), just 1.1 seconds!
Note also that the OS will do its own buffering, so that the data is written to the actual device using the most efficient size. That will be a HW and cache buffer at the same time!
About your problem with flushing: files are usually flushed only when closed or when flushed manually. Some streams, such as stdout, are typically line-buffered; that is, they are flushed whenever a '\n' is written. Also, on many implementations stdin and stdout are special: when you read from stdin, stdout is flushed (other streams are untouched, only stdout). That is handy if you are writing an interactive program.
My case #3 is for example when you do:
FILE *f = fopen("x.txt", "r");
char buffer[1000];
fgets(buffer, sizeof(buffer), f);
int n;
sscanf(buffer, "%d", &n);
You use the buffer to hold a line from the file, and then you parse the data from the line. Yes, you could call fscanf() directly, but in other APIs there may not be an equivalent function, and moreover you have more control this way: you can analyze the type of line, skip comments, count lines...
Or imagine that you receive one byte at a time, for example from a keyboard. You will just accumulate characters in a buffer and parse the line when the Enter key is pressed. That is what most interactive console programs do.
The noun "buffer" really refers to a usage, not a distinct thing. Any block of storage can serve as a buffer. The term is intentionally used in this general sense in conjunction with various I/O functions, though the docs for the C I/O stream functions tend to avoid that. Taking the POSIX read() function as an example, however: "read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf". The "buffer" in that case simply means the block of memory in which the bytes read will be recorded; it is ordinarily implemented as a char[] or a dynamically-allocated block.
One uses a buffer especially in conjunction with I/O because some devices (especially hard disks) are most efficiently read in medium-to-large-sized chunks, whereas programs often want to consume that data in smaller pieces. Some other forms of I/O, such as network I/O, may inherently arrive in chunks, so that you must record each whole chunk (in a buffer) or else lose the part you're not immediately ready to consume. Similar considerations apply to output.
As for your test program's behavior, the "rule" you hoped to demonstrate is specific to console I/O, but only one of the streams involved is connected to the console.
The first question is a bit too broad. Buffering is used in many cases, including message storage before actual usage, DMA uses, speedup usages and so on. In short, the entire buffering thing can be summarized as "save my data, let me continue execution while you do something with the data".
Sometimes you may modify buffers after passing them to functions, sometimes not. Sometimes buffers are hardware, sometimes software. Sometimes they reside in RAM, sometimes in other memory types.
So, please ask a more specific question. As a starting point, Wikipedia is almost always helpful.
As for the code sample: I haven't found any mention of all output buffers being flushed upon getchar(). Buffers for files are generally flushed in three cases:
fflush() or equivalent
File is closed
The buffer fills up.
Since none of these cases applies while your program is waiting in getchar(), the file has not been flushed yet at that point. (Normal program termination does flush and close open streams, though, so the data ends up in the file once main() returns.)
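To see a flush actually happen in the fragment above, you can trigger the first two cases explicitly; a minimal variation:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("hallo.txt", "w");
    if (fp == NULL)
        return 1;

    fputc('A', fp);
    fflush(fp);        /* case 1: explicit flush - 'A' is on disk now */
    getchar();

    fputc('A', fp);
    fclose(fp);        /* case 2: closing flushes whatever is left */
    getchar();
    return 0;
}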
A buffer is a small area in memory (RAM) responsible for storing information before it is sent to your program. As long as I am typing characters on the keyboard, those characters are stored in the buffer; as soon as I press the Enter key, they are transported from the buffer into the program. With the help of the buffer, all of those characters become available to your program at once (preventing lag and slowness) and can then be sent on to the output display screen.
I need to write around 10^3 sparse double arrays to disk (one at a time) and read them individually later in the program.
EDIT: Apologies for not framing the question clearly earlier. To be specific, I am looking to store as much as possible in memory and save the currently unused variables to disk. I am working on Linux.
The fastest way would be to buffer the I/O. Instead of writing each array individually, first copy as many as you can into a buffer. Once that buffer is full, write the entire buffer to disk, clear the buffer, and repeat. This minimizes the number of writes to the disk and increases I/O efficiency.
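A hedged sketch of that write-side batching (ARRAY_LEN and BATCH are made-up numbers, and a real program would also check fwrite's return value):

#include <stdio.h>
#include <string.h>

#define ARRAY_LEN 1024          /* doubles per array (assumed size) */
#define BATCH     16            /* arrays collected before one write */

static double batch_buf[BATCH][ARRAY_LEN];
static size_t batched = 0;

/* Queue one array; write the whole batch in a single fwrite once the
   buffer is full, instead of issuing one small write per array. */
void save_array(FILE *out, const double *array)
{
    memcpy(batch_buf[batched], array, sizeof batch_buf[0]);
    if (++batched == BATCH) {
        fwrite(batch_buf, sizeof batch_buf[0], batched, out);
        batched = 0;
    }
}

/* Call once at the end so a partially filled batch is not lost. */
void flush_batch(FILE *out)
{
    if (batched > 0) {
        fwrite(batch_buf, sizeof batch_buf[0], batched, out);
        batched = 0;
    }
}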
If you plan on reading the arrays back in sequential order, I recommend you also buffer the reads, so that each read fetches more than it immediately needs and you can work out of the buffer.
You could take it a step further and use asynchronous read/write operations so that your program can process other tasks while waiting on the disk.
If you are concerned about the size on disk it will consume, you can add another layer that will compress/uncompress the data stream as you write/read to and from the disk.
The HDF5 data format is meant to write large amounts of data to disk efficiently.
This format is used by NASA and by a large number of scientific applications:
http://www.hdfgroup.org/HDF5/
http://en.wikipedia.org/wiki/Hierarchical_Data_Format