Why do pipes have a limited capacity? - c

I've read that pipes need to have a limited capacity. But I don't understand why. What happens if a process writes into a pipe without a limit?

It's due to buffering. Pipes are not "magical": they do not make the processes on either end handle each byte or character in lockstep. Instead, a pipe buffers the writer's output and passes it along to the reader, and the size limit of that buffer is the capacity you're referring to. On Linux the default buffer size is 64 KiB, and macOS uses a buffer of up to 64 KiB.
Imagine a process that outputs 1 GB of data every second to stdout, piped to another process that can only process 100 bytes every minute on stdin. Those gigabytes of data have to go somewhere. If the buffer were infinitely sized, you would quickly fill up the memory of whatever OS component owns the pipe and then start paging out to disk, and then your pagefile on disk would fill up, and that's not good.
With a maximum buffer size, the writing process finds out when it has filled the buffer: by default its write() call simply blocks until the reader drains some data, and on a non-blocking pipe write() fails with EAGAIN instead. The process is then free to handle that event however is appropriate (e.g. by pausing output if it's a random number generator, by dropping data if it's a network monitor, by crashing, etc).
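If you want to see the limit for yourself, here is a minimal sketch (not a definitive tool) that fills a pipe nobody reads from and counts how many bytes fit before write() reports EAGAIN. The function name pipe_capacity and the 1024-byte chunk size are my own choices for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns the number of bytes an unread pipe accepted before filling up,
 * or -1 on error. The name pipe_capacity is hypothetical. */
long pipe_capacity(void)
{
    int fds[2];
    char chunk[1024] = {0};
    long total = 0;
    int failed = 0;

    if (pipe(fds) != 0)
        return -1;

    /* Make the write end non-blocking so a full pipe yields EAGAIN
     * instead of putting this process to sleep. */
    int flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0)
        failed = 1;

    while (!failed) {
        ssize_t n = write(fds[1], chunk, sizeof chunk);
        if (n < 0) {
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                failed = 1;   /* a real error, not "pipe is full" */
            break;
        }
        total += n;
    }

    close(fds[0]);
    close(fds[1]);
    return failed ? -1 : total;
}
```

On a typical Linux box I would expect this to report the default capacity, but the exact number depends on the OS and its configuration.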

Internal mechanisms aside, I suspect the root issue behind the question is one of terminology. Pipes have limited capacity, but unlimited overall volume of data transferred.
The analogy to a piece of physical plumbing is pretty good: a given piece of water pipe has a characteristic internal volume defined by its length, its shape, and the cross section of its interior. At any given time, it cannot hold any more water than fits in that volume, so if you close a valve at its downstream end then water eventually (maybe immediately) stops flowing into its other end because all the available space within -- the pipe's capacity -- is full. Until and unless the pipe is permanently closed, however, there is no bound on how much water may be able to traverse it over its lifetime.

Relationship between stream and buffer?

I'm a newbie programmer. Can you help me picture what a stream is? Is it a fixed array of bytes that transfers data from, e.g., a file to Y? And what is Y here, a buffer or something else?
In what way is the buffer related to stream?
A stream is either a source (input stream) or sink (output stream) of data, that is available (or provided) in time (as opposed to all at once).
A buffer is an array (a piece of memory) that is used to store data temporarily. An input buffer is typically filled from an input stream by the OS; an output buffer (once filled by the programmer) is consumed by the OS.
Imagine you want to fill a tub with water. You start with a water source, such as a water tank or the public waterworks, that delivers water through a tap. You put a bucket under the tap and turn it on. When the bucket is full, you dump it into the tub, and put it back under the tap. You repeat that until your tub is full.
Loading a file, for example, works almost the same way. You have a data source (the file on disk); you open an input stream (a programmatic construct that will give you data generally as fast as the disk can read them). You allocate a buffer (a small memory area) and tell the system to fill it from the stream. When it is full, you append it to the big chunk of allocated memory that you reserved for file contents, then let the buffer be filled again. When the whole file is read, you close the stream.
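The bucket-and-tub loop can be sketched in C roughly as follows. This is only an illustration under my own assumptions: the function name read_whole_file, the 4096-byte bucket size, and the variable names bucket and tub are not from any particular library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reads a whole file into one malloc'd block, one small buffer at a time.
 * Returns NULL on error; the caller frees the result. */
char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *stream = fopen(path, "rb");   /* open the input stream */
    char bucket[4096];                  /* the small temporary buffer */
    char *tub = malloc(1);              /* the big chunk that grows */
    size_t used = 0, got;

    if (!stream || !tub) {
        if (stream) fclose(stream);
        free(tub);
        return NULL;
    }
    /* Fill the bucket from the stream, then dump it into the tub. */
    while ((got = fread(bucket, 1, sizeof bucket, stream)) > 0) {
        char *grown = realloc(tub, used + got);
        if (!grown) {
            free(tub);
            fclose(stream);
            return NULL;
        }
        tub = grown;
        memcpy(tub + used, bucket, got);
        used += got;
    }
    fclose(stream);                     /* whole file read: close the stream */
    *out_len = used;
    return tub;
}
```

Note that the C library's FILE already keeps its own internal buffer, so in this sketch there are really two buckets in a row.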
The difference between a buffer and a stream is:
A stream is a sequence of bytes that carries information from or to a specified source.
A sequence of bytes flowing into a program is called an input stream. A sequence of bytes flowing out of the program is called an output stream.
Using streams makes I/O machine independent.
A buffer is a sequence of bytes stored in memory.
In C, you generally don't know in advance when data will arrive on a stream, nor how much of it there will be. So a buffer is usually used to collect data from the stream (file, socket, device). When the buffer is full, consumers of that stream are notified and can consume data from the buffer until it is depleted, then wait for the buffer to be filled again before using that data. It is a place to store something temporarily, in order to mitigate differences between the input speed and the output speed. While the producer is faster than the consumer, the producer can continue to store its output in the buffer. When the consumer speeds up, it can read from the buffer. The buffer sits in the middle to bridge the gap.
Y in your question can be a file, a socket or a device (I/O).
Hope this solves your Query :)

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My thought is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file work, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
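A rough sketch of that memory-chewing loop, assuming a plain malloc'd array of int is good enough for your platform. The name touch_memory and the passes parameter are made up for this example; a real stress job would loop for as long as the test runs rather than a fixed number of passes.

```c
#include <stdlib.h>

/* Allocates `bytes` of memory and, on each pass, writes to every 256th
 * int, starting one entry further along each time. Returns the number of
 * writes performed (0 if the allocation failed). */
size_t touch_memory(size_t bytes, unsigned passes)
{
    size_t n = bytes / sizeof(int);
    size_t writes = 0;
    /* volatile keeps the compiler from optimising the stores away */
    volatile int *lump = malloc(n * sizeof(int));

    if (!lump)
        return 0;
    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = p % 256; i < n; i += 256) {
            lump[i] = (int)p;
            writes++;
        }
    }
    free((void *)lump);
    return writes;
}
```

Whether this actually evicts file data from the cache depends on how much memory the system has and how its memory pools are configured, so treat it as an experiment, not a guarantee.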
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER: I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

Send data to multiple sockets using pipes, tee() and splice()

I'm duplicating a "master" pipe with tee() to write to multiple sockets using splice(). Naturally these pipes will get emptied at different rates depending on how much I can splice() to the destination sockets. So when I next go to add data to the "master" pipe and then tee() it again, I may have a situation where I can write 64KB to the pipe but only tee 4KB to one of the "slave" pipes. I'm guessing then that if I splice() all of the "master" pipe to the socket, I will never be able to tee() the remaining 60KB to that slave pipe. Is that true? I guess I can keep track of a tee_offset (starting at 0) which I set to the start of the "unteed" data and then don't splice() past it. So in this case I would set tee_offset to 4096 and not splice more than that until I'm able to tee it to all the other pipes. Am I on the right track here? Any tips/warnings for me?
If I understand correctly, you've got some realtime source of data that you want to multiplex to multiple sockets. You've got a single "source" pipe hooked up to whatever's producing your data, and you've got a "destination" pipe for each socket over which you wish to send the data. What you're doing is using tee() to copy data from the source pipe to each of the destination pipes and splice() to copy it from the destination pipes to the sockets themselves.
The fundamental issue you're going to hit here is if one of the sockets simply can't keep up - if you're producing data faster than you can send it, then you're going to have a problem. This isn't related to your use of pipes, it's just a fundamental issue. So, you'll want to pick a strategy to cope in this case - I suggest handling this even if you don't expect it to be common as these things often come up to bite you later. Your basic choices are to either close the offending socket, or to skip data until it's cleared its output buffer - the latter choice might be more suitable for audio/video streaming, for example.
The issue which is related to your use of pipes, however, is that on Linux the size of a pipe's buffer is somewhat inflexible. It has defaulted to 64K since Linux 2.6.11 (the tee() call was added in 2.6.17) - see the pipe manpage. Since 2.6.35 this value can be changed via the F_SETPIPE_SZ option to fcntl() (see the fcntl manpage) up to the limit specified by /proc/sys/fs/pipe-max-size, but the buffering is still more awkward to change on-demand than a dynamically allocated scheme in user-space would be. This means that your ability to cope with slow sockets will be somewhat limited - whether this is acceptable depends on the rate at which you expect to receive and be able to send data.
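For reference, resizing a pipe looks roughly like this on Linux. It is a small sketch only; resize_pipe is a name I made up, and the kernel is free to round the requested size up.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Asks the kernel to change the pipe's capacity, then reports the size
 * actually in effect. Returns -1 on error (for example if `wanted`
 * exceeds the system limit for an unprivileged process). */
int resize_pipe(int fd, int wanted)
{
    if (fcntl(fd, F_SETPIPE_SZ, wanted) < 0)
        return -1;
    return fcntl(fd, F_GETPIPE_SZ);
}
```

Either end of the pipe can be passed as fd; the value returned is the one to trust, not the one you asked for.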
Assuming this buffering strategy is acceptable, you're correct in your assumption that you'll need to track how much data each destination pipe has consumed from the source, and it's only safe to discard data which all destination pipes have consumed. This is somewhat complicated by the fact that tee() doesn't have the concept of an offset - you can only copy from the start of the pipe. The consequence of this is that you can only copy at the speed of the slowest socket, since you can't use tee() to copy to a destination pipe until some of the data has been consumed from the source, and you can't do this until all the sockets have the data you're about to consume.
How you handle this depends on the importance of your data. If you really need the speed of tee() and splice(), and you're confident that a slow socket will be an extremely rare event, you could do something like this (I've assumed you're using non-blocking IO and a single thread, but something similar would also work with multiple threads):
1. Make sure all pipes are non-blocking (use fcntl(fd, F_SETFL, O_NONBLOCK) to make each file descriptor non-blocking).
2. Initialise a read_counter variable for each destination pipe to zero.
3. Use something like epoll() to wait until there's something in the source pipe.
4. Loop over all destination pipes where read_counter is zero, calling tee() to transfer data to each one. Make sure you pass SPLICE_F_NONBLOCK in the flags.
5. Increment read_counter for each destination pipe by the amount transferred by tee(). Keep track of the lowest resultant value.
6. Find the lowest resultant value of read_counter - if this is non-zero, then discard that amount of data from the source pipe (using a splice() call with a destination opened on /dev/null, for example). After discarding data, subtract the amount discarded from read_counter on all the pipes (since this was the lowest value then this cannot result in any of them becoming negative).
7. Repeat from step 3.
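To make the middle of that loop concrete, here's a sketch of one pass over the tee() and discard steps. It is Linux-only, and the function name fanout_once, its parameters, and the idea of passing in an already opened /dev/null descriptor are all my own assumptions for illustration. Waiting with epoll() and draining the destination pipes to the sockets are left out.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <unistd.h>

/* src_rd: read end of the source pipe. devnull: fd opened on /dev/null.
 * dst_wr[i]: write end of destination pipe i. read_counter[i]: bytes at
 * the front of the source pipe already copied to pipe i.
 * Returns the number of bytes discarded from the source, or -1 on error. */
ssize_t fanout_once(int src_rd, int devnull, const int *dst_wr,
                    size_t *read_counter, int n)
{
    size_t lowest = SIZE_MAX;

    for (int i = 0; i < n; i++) {
        if (read_counter[i] == 0) {
            /* tee() copies without consuming the source data */
            ssize_t got = tee(src_rd, dst_wr[i], INT_MAX, SPLICE_F_NONBLOCK);
            if (got < 0 && errno != EAGAIN)
                return -1;
            if (got > 0)
                read_counter[i] += (size_t)got;
        }
        if (read_counter[i] < lowest)
            lowest = read_counter[i];
    }
    if (n <= 0 || lowest == 0)
        return 0;   /* at least one destination has not caught up yet */

    /* Every destination has the first `lowest` bytes: drop them. */
    ssize_t dropped = splice(src_rd, NULL, devnull, NULL, lowest,
                             SPLICE_F_NONBLOCK);
    if (dropped < 0)
        return errno == EAGAIN ? 0 : -1;
    for (int i = 0; i < n; i++)
        read_counter[i] -= (size_t)dropped;
    return dropped;
}
```

A return value of 0 with a destination still stuck at zero is the signal that a slow socket is holding everyone up.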
Note: one thing that's tripped me up in the past is that SPLICE_F_NONBLOCK affects whether the tee() and splice() operations on the pipes are non-blocking, and the O_NONBLOCK you set with fcntl() affects whether the interactions with other calls (e.g. read() and write()) are non-blocking. If you want everything to be non-blocking, set both. Also remember to make your sockets non-blocking or the splice() calls to transfer data to them might block (unless that's what you want, if you're using a threaded approach).
As you can see, this strategy has a major problem - as soon as one socket blocks up, everything halts - the destination pipe for that socket will fill up, and then the source pipe will become stagnant. So, if you reach the stage where tee() returns EAGAIN in step 4 then you'll want to either close that socket, or at least "disconnect" it (i.e. take it out of your loop) such that you don't write anything else to it until its output buffer is empty. Which you choose depends on whether your data stream can recover from having bits of it skipped.
If you want to cope with network latency more gracefully then you're going to need to do more buffering, and this is going to involve either user-space buffers (which rather negates the advantages of tee() and splice()) or perhaps disk-based buffer. The disk-based buffering will almost certainly be significantly slower than user-space buffering, and hence not appropriate given that presumably you want a lot of speed since you've chosen tee() and splice() in the first place, but I mention it for completeness.
One thing that's worth noting if you end up inserting data from user-space at any point is the vmsplice() call which can perform "gather output" from user-space into a pipe, in a similar way to the writev() call. This might be useful if you're doing enough buffering that you've split your data among multiple different allocated buffers (for example if you're using a pool allocator approach).
Finally, you could imagine swapping sockets between the "fast" scheme of using tee() and splice() and, if they fail to keep up, moving them on to a slower user-space buffering. This is going to complicate your implementation, but if you're handling large numbers of connections and only a very small proportion of them are slow then you're still reducing the amount of copying to user-space that's involved somewhat. However, this would only ever be a short-term measure to cope with transient network issues - as I said originally, you've got a fundamental problem if your sockets are slower than your source. You'd eventually hit some buffering limit and need to skip data or close connections.
Overall, I would carefully consider why you need the speed of tee() and splice() and whether, for your use-case, simply user-space buffering in memory or on disk would be more appropriate. If you're confident that the speeds will always be high, however, and limited buffering is acceptable then the approach I outlined above should work.
Also, one thing I should mention is that this will make your code extremely Linux-specific - I'm not aware of these calls being supported in other Unix variants. The sendfile() call is more restricted than splice(), but might be rather more portable. If you really want things to be portable, stick to user-space buffering.
Let me know if there's anything I've covered which you'd like more detail on.

C: Each child process reads alternate lines

I'm training a typical map-reduce architecture (in O.S. classes) and I'm free to decide how the master process will tell its N child processes to parse a log. So, I'm kind of stuck in these two possibilities:
1) count the number of rows and give X rows for each map OR
2) each map reads the line of its ID and the next line to read = current_one + number_of_existent_maps
E.g.: with 3 maps, each one is going to read these lines:
Map1: 1, 4, 7, 10, 13
Map2: 2, 5, 8, 11, 14
Map3: 3, 6, 9, 12, 15
I have to do this in order to out-perform a single process that parses the entire log file, so the way I split the job between child processes has to be consistent with this objective.
Which one do you think is best? How can I do the scanf or fgets to adapt to 1) or 2)?
I would be happy with some example code for 2), because the fork/pipes are not my problem :P
RE-EDIT:
I'm not encouraged to use select here, only between map procs and the reduce process that will be monitoring the reads. I have restrictions now and :
I want each process to read total_lines/N lines each. But it seems like I have to make map procs open the file and then read the respective lines. So here are my doubts:
1- Is it bad or even possible to make every procs open the file simultaneously or almost simultaneously? How will that help in speeding up?
2- If it isn't possible to do that, I will have a parent opening the file (instead of each child doing that) and sending each map proc a struct with a min and max limit; the map procs will then read whatever lines they are responsible for, process them and give the reduce process a result (this doesn't matter for the problem now).
How can I divide correctly the number of lines by N maps and putting them to read at the same time? I think fseek() may be a good weapon, but I don't know HOW I can use it. Help, please!
If I understood correctly, you want to have all processes reading lines from a single file. I don't recommend this, it's kinda messy, and you'll have to a) read the same parts of the file several times or b) use locking/mutex or some other mechanism to avoid that. It'll get complicated and hard to debug.
I'd have a master process read the file, and assign lines to a subprocess pool. You can use shared memory to speed this up, and reduce the need for data-copying IPC; or use threads.
As for examples, I answered a question about forking and IPC and gave a code snippet on an example function that forks and returns a pair of pipes for parent-child communication. Let me look that up (...) here it is =P Can popen() make bidirectional pipes like pipe() + fork()?
edit: I kept thinking about this =P. Here's an idea:
Have a master process spawn subprocesses with something similar to what I showed in the link above.
Each process starts by sending a byte up to the master to signal it's ready, and blocking on read().
Have the master process read a line from the file to a shared memory buffer, and block on select() on its children pipes.
When select() returns, read one of the bytes that signal readiness and send to that subprocess the offset of the line in the shared memory space.
The master process repeats (reads a line, blocks on select, reads a byte to consume the readiness event, etc.)
The children process the line in whatever way you need, then send a byte to the master to signal readiness once again.
(You can avoid the shared memory buffer if you want, and send the lines down the pipes, though it'll involve constant data-copying. If the processing of each line is computationally expensive, it won't really make a difference; but if the lines require little processing, it may slow you down).
I hope this helps!
edit 2 based on Newba's comments:
Okay, so no shared memory. Use the above model, only instead of sending down the pipe the offset of the line read in the shared memory space, send the whole line. This may sound to you like you're wasting time when you could just read it from the file, but trust me, you're not. Pipes are orders of magnitude faster than reads from regular files in a hard disk, and if you wanted subprocesses to read directly from the file, you'll run into the problem I pointed at the start of the answer.
So, master process:
1. Spawn subprocesses using something like the function I wrote (link above) that creates pipes for bidirectional communication.
2. Read a line from the file into a buffer (private, local, no shared memory whatsoever).
3. You now have data ready to be processed. Call select() to block on all the pipes that connect you to your subprocesses.
4. Choose any of the pipes that have data available, read one byte from it, and then send the line you have waiting to be processed in the buffer down the corresponding pipe (remember, we had 2 per child process, one to go up, one to go down).
5. Repeat from step 2, i.e. read another line.
Child processes:
1. When they start, they have a reading pipe and a writing pipe at their disposal. Send a byte down your writing pipe to signal the master process you are ready and waiting for data to process (this is the single byte we read in step 4 above).
2. Block on read(), waiting for the master process (that knows you are ready because of step 1) to send you data to process. Keep reading until you reach a newline (you said you were reading lines, right?). Note I'm following your model, sending a single line to each process at a time, you could send multiple lines if you wanted.
3. Process the data.
4. Return to step 1, i.e. send another byte to signal you are ready for more data.
There you go, simple protocol to assign tasks to as many subprocesses as you want. It may be interesting to run a test with 1 child, n children (where n is the number of cores in your computer) and more than n children, and compare performances.
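Here's a compact sketch of that protocol, under several assumptions of mine: the names dispatch_lines and worker_loop are invented, "processing" a line just means counting it, each child reports its count through its exit status (which only works for small counts), and lines are assumed to fit in 4095 bytes. Error handling is minimal, so take it as a starting point only.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 16

/* Child side; never returns. */
static void worker_loop(int up_wr, int down_rd)
{
    FILE *in = fdopen(down_rd, "r");
    char line[4096];
    int handled = 0;

    if (!in) _exit(127);
    for (;;) {
        if (write(up_wr, "R", 1) != 1) break;      /* signal "ready" */
        if (!fgets(line, sizeof line, in)) break;   /* EOF: no more work */
        handled++;                                  /* real processing goes here */
    }
    _exit(handled & 0x7f);
}

/* Master side: returns the total number of lines the children handled. */
int dispatch_lines(FILE *src, int nworkers)
{
    int up_rd[MAX_WORKERS], down_wr[MAX_WORKERS], total = 0;
    pid_t pid[MAX_WORKERS];
    char line[4096];

    if (nworkers < 1 || nworkers > MAX_WORKERS) return -1;
    for (int i = 0; i < nworkers; i++) {
        int up[2], down[2];
        if (pipe(up) != 0 || pipe(down) != 0) return -1;
        if ((pid[i] = fork()) < 0) return -1;
        if (pid[i] == 0) {
            close(up[0]); close(down[1]);
            for (int j = 0; j < i; j++) { close(up_rd[j]); close(down_wr[j]); }
            worker_loop(up[1], down[0]);
        }
        close(up[1]); close(down[0]);
        up_rd[i] = up[0]; down_wr[i] = down[1];
    }
    while (fgets(line, sizeof line, src)) {
        fd_set ready;
        int maxfd = -1, sent = 0;
        FD_ZERO(&ready);
        for (int i = 0; i < nworkers; i++) {
            FD_SET(up_rd[i], &ready);
            if (up_rd[i] > maxfd) maxfd = up_rd[i];
        }
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0) break;
        for (int i = 0; i < nworkers && !sent; i++) {
            char token;
            if (!FD_ISSET(up_rd[i], &ready) || read(up_rd[i], &token, 1) != 1)
                continue;
            size_t len = strlen(line), off = 0;
            while (off < len) {                     /* send the whole line down */
                ssize_t w = write(down_wr[i], line + off, len - off);
                if (w <= 0) break;
                off += (size_t)w;
            }
            sent = 1;
        }
        if (!sent) break;                           /* no live worker took it */
    }
    for (int i = 0; i < nworkers; i++) close(down_wr[i]);
    for (int i = 0; i < nworkers; i++) {
        int status;
        if (waitpid(pid[i], &status, 0) == pid[i] && WIFEXITED(status))
            total += WEXITSTATUS(status);
        close(up_rd[i]);
    }
    return total;
}
```

In a real program the children would send their results to the reduce process over another pipe instead of abusing the exit status.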
Whew, that was a long answer. I really hope I helped xD
Since each of the processes is going to have to read the file in its entirety (unless the log lines are all of the same length, which is unusual), there really isn't a benefit to your proposal 2.
If you are going to split up the work into 3, then I would do:
Measure (stat()) the size of the log file - call it N bytes.
Allocate the range of bytes 0..(N/3) to first child.
Allocate the range of bytes (N/3)+1..2(N/3) to the second child.
Allocate the range of bytes 2(N/3)+1..end to the third child.
Define that the second and third children must synchronize by reading forward to the first line break after their start position.
Define that each child is responsible for reading to the first line break on or after the end of their range.
Note that the third child (last child) may have to do more work if the log file is growing.
Then the processes are reading independent segments of the file.
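A sketch of what each child might do with its byte range. I've used half-open ranges [start, end) here instead of the inclusive ones above, which is a small variation of mine, and count_lines_in_range is a made-up name; it only counts the lines it owns, where real code would parse them.

```c
#include <stdio.h>

/* Counts the lines whose first byte lies in [start, end).
 * Returns -1 if the file cannot be opened or positioned. */
long count_lines_in_range(const char *path, long start, long end)
{
    FILE *fp = fopen(path, "rb");
    long pos = start, lines = 0;
    int c = 0;

    if (!fp)
        return -1;
    if (start > 0) {
        /* Begin one byte early: if that byte is '\n', a line starts
         * exactly at `start` and belongs to this child. Otherwise skip
         * forward to the first line break. */
        if (fseek(fp, start - 1, SEEK_SET) != 0) {
            fclose(fp);
            return -1;
        }
        pos = start - 1;
        while ((c = fgetc(fp)) != EOF) {
            pos++;
            if (c == '\n')
                break;
        }
        if (c == EOF) {
            fclose(fp);
            return 0;       /* no line starts inside this range */
        }
    }
    while (pos < end) {
        int saw_byte = 0;
        /* Read one whole line, even if it runs past `end`. */
        while ((c = fgetc(fp)) != EOF) {
            pos++;
            saw_byte = 1;
            if (c == '\n')
                break;
        }
        if (!saw_byte)
            break;          /* end of file */
        lines++;
        if (c == EOF)
            break;          /* last line had no trailing newline */
    }
    fclose(fp);
    return lines;
}
```

With N bytes and 3 children, the calls would use the ranges [0, N/3), [N/3, 2N/3) and [2N/3, N), and every line gets counted exactly once.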
(Of course, with them all sharing the file, then the system buffer pool saves rereading the disk, but the data is still copied to each of the three processes, only to have each process throw away 2/3 of what was copied as someone else's job.)
Another, more radical option:
mmap() the log file into memory.
Assign the children to different segments of the file along the lines described previously.
If you're on a 64-bit machine, this works pretty well. If your log files are not too massive (say 1 GB or less), you can do it on a 32-bit machine too. As the file size grows above 1 GB or so, you may start running into memory mapping and allocation issues, though you might well get away with it until you reach a size somewhat less than 4 GB (on a 32-bit machine). The other issue here is with growing log files. AFAIK, mmap() doesn't map extra memory as extra data is written to the file.
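A minimal sketch of the mmap() variant, assuming a POSIX system. The name count_newlines_mapped is hypothetical, and counting newlines stands in for whatever parsing each child really does on its segment.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the file read-only and counts '\n' bytes in [start, end).
 * Returns -1 on error. */
long count_newlines_mapped(const char *path, size_t start, size_t end)
{
    struct stat st;
    long count = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    size_t size = (size_t)st.st_size;
    if (size == 0) {            /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }
    const char *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;
    if (end > size)
        end = size;
    for (size_t i = start; i < end; i++)
        if (map[i] == '\n')
            count++;
    munmap((void *)map, size);
    return count;
}
```

The mapping covers the file size at the time of the fstat() call, which is the growing-log caveat mentioned above.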
Use a master and slave queue pattern.
The master sets up the slaves, which sit waiting on a queue for work items.
The master then reads the file line by line.
Each line then represents a work item that you put on the queue, with a function pointer for how to do the work.
One of the waiting slaves then takes the item off the queue.
A slave processes a work item.
When a slave has finished, it rejoins the work queue.

types of buffers

Recently an interviewer asked me about the types of buffers. What types of buffers are there? The question came up when I said I would write all the system calls to a log file to monitor the system. He said it would be slow to write each and every call to a file, and asked how to prevent that. I said I would use a buffer, and he asked me what type of buffer. Can someone explain the types of buffers to me, please?
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush (fh); fsync (fileno (fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
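Wrapped up as a function with error checks, that sequence might look like the sketch below. The name log_durably is made up for this example; it is not the library we actually wrote.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Writes msg and pushes it through both buffering layers.
 * Returns 0 on success, -1 on failure. */
int log_durably(FILE *fh, const char *msg)
{
    if (fputs(msg, fh) == EOF)
        return -1;
    if (fflush(fh) != 0)            /* C runtime buffer -> operating system */
        return -1;
    if (fsync(fileno(fh)) != 0)     /* operating system cache -> storage */
        return -1;
    return 0;
}
```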
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it was a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered that to a single fprintf-type call when it was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the fprintf-type buffering already gives you.
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 ^           ^
                                 |           |
                   Buffer next --+           +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
    create buffer of size 10240 bytes.
    set bufferEnd to end of buffer + 1
    set bufferPointer to start of buffer
    return

addToBuffer (size, message):
    while size != 0:
        xfersz = minimum (size, bufferEnd - bufferPointer)
        copy xfersz bytes from message to bufferPointer
        message = message + xfersz
        bufferPointer = bufferPointer + xfersz
        size = size - xfersz
        if bufferPointer == bufferEnd:
            write buffer to underlying media
            set bufferPointer to start of buffer
        endif
    endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
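For what it's worth, here is how that might look as user-space C, writing to a file descriptor. This is a sketch only: the struct and function names (log_buffer, log_init, log_flush, log_add) are invented, and a kernel implementation would write to its own media layer, not call write().

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define LOG_BUF_SIZE 10240

struct log_buffer {
    char data[LOG_BUF_SIZE];
    size_t used;        /* bytes currently waiting in data[] */
    size_t flushes;     /* number of physical writes so far */
    int fd;             /* where flushed data goes (an assumption) */
};

void log_init(struct log_buffer *lb, int fd)
{
    lb->used = 0;
    lb->flushes = 0;
    lb->fd = fd;
}

/* Writes out whatever is buffered. Returns 0 on success, -1 on error. */
int log_flush(struct log_buffer *lb)
{
    size_t off = 0;

    while (off < lb->used) {
        ssize_t n = write(lb->fd, lb->data + off, lb->used - off);
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    if (lb->used > 0)
        lb->flushes++;
    lb->used = 0;
    return 0;
}

/* Same logic as addToBuffer: top the buffer up, flush when it is full,
 * and carry on with the rest of the message. */
int log_add(struct log_buffer *lb, const char *msg, size_t size)
{
    while (size != 0) {
        size_t room = LOG_BUF_SIZE - lb->used;
        size_t xfersz = size < room ? size : room;

        memcpy(lb->data + lb->used, msg, xfersz);
        msg += xfersz;
        lb->used += xfersz;
        size -= xfersz;
        if (lb->used == LOG_BUF_SIZE && log_flush(lb) != 0)
            return -1;
    }
    return 0;
}
```

Calling log_flush() from a timer would give you the "flush an incomplete buffer after a quiet period" behaviour.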
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based buffers and size-based. So you could either just write whatever is in the buffer to file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4J do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
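As an illustration, a fixed-size circular FIFO of bytes can be as small as the sketch below. The names ring, ring_put and ring_get and the tiny RING_SIZE are placeholders of mine; a real driver would also need locking or interrupt masking, which is left out here.

```c
#include <stddef.h>

#define RING_SIZE 8

struct ring {
    unsigned char data[RING_SIZE];
    size_t head;    /* index of the oldest byte */
    size_t count;   /* number of bytes stored */
};

/* Appends one byte; returns -1 if the buffer is full. */
int ring_put(struct ring *r, unsigned char byte)
{
    if (r->count == RING_SIZE)
        return -1;
    r->data[(r->head + r->count) % RING_SIZE] = byte;
    r->count++;
    return 0;
}

/* Removes the oldest byte into *out; returns -1 if the buffer is empty. */
int ring_get(struct ring *r, unsigned char *out)
{
    if (r->count == 0)
        return -1;
    *out = r->data[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    r->count--;
    return 0;
}
```

No memory is allocated after the struct itself exists, which is the property that makes it usable where dynamic allocation is off limits.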
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
A particular case of neither FIFO nor LIFO buffer is the TCP segment reassembly buffers. These may hold segments received out-of-order ("from the future") which are held pending the receipt of intermediate segments not-yet-arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.
