Set pipe buffer size - C

I have a multithreaded C++ application which uses POSIX pipes to perform inter-thread communication efficiently (so I don't have to go crazy with deadlocks).
I've made the write operation non-blocking, so the writer gets an error if there is not enough space in the buffer to complete the write.
if (pipe(pipe_des) == -1)
    throw PipeException();

int flags = fcntl(pipe_des[1], F_GETFL, 0); // make the write end non-blocking
assert(flags != -1);
if (fcntl(pipe_des[1], F_SETFL, flags | O_NONBLOCK) == -1)
    throw PipeException();
Now I'd like to set the pipe buffer size to a custom value (one word, in this specific case).
I've googled for it but wasn't able to find anything useful. Is there a (preferably POSIX-compliant) way to do it?
Thanks
Lorenzo
PS: I'm on Linux (if that helps)

Since you mentioned you are on Linux and may not mind non-portability, you may be interested in the fcntl() operation F_SETPIPE_SZ, available since Linux 2.6.35.
int pipe_sz = fcntl(pipe_des[1], F_SETPIPE_SZ, sizeof(size_t));
You'll find that pipe_sz == getpagesize() after that call, since the buffer cannot be made smaller than the system page size. See fcntl(2).
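For example, a minimal sketch that shrinks the capacity and then checks what the kernel actually granted (assuming a 2.6.35+ kernel):

#define _GNU_SOURCE          /* exposes F_SETPIPE_SZ / F_GETPIPE_SZ */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pipe_des[2];
    if (pipe(pipe_des) == -1)
        return 1;

    /* Ask for the smallest possible buffer; the kernel silently rounds up to a page. */
    int granted = fcntl(pipe_des[1], F_SETPIPE_SZ, sizeof(size_t));
    if (granted == -1)
        perror("F_SETPIPE_SZ");

    printf("requested %zu, granted %d, F_GETPIPE_SZ says %d, page size %ld\n",
           sizeof(size_t), granted,
           fcntl(pipe_des[1], F_GETPIPE_SZ), (long)sysconf(_SC_PAGESIZE));
    return 0;
}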

I googled "linux pipe buffer size" and got this as the top link. Basically, the limit is 64 KB and is hard-coded.
Edit: The link is dead, and it was probably wrong anyway. The Linux pipe(7) man page says this:
A pipe has a limited capacity. If the pipe is full, then a write(2)
will block or fail, depending on whether the O_NONBLOCK flag is set
(see below). Different implementations have different limits for the
pipe capacity. Applications should not rely on a particular
capacity: an application should be designed so that a reading process
consumes data as soon as it is available, so that a writing process
does not remain blocked.
In Linux versions before 2.6.11, the capacity of a pipe was the same
as the system page size (e.g., 4096 bytes on i386). Since Linux
2.6.11, the pipe capacity is 16 pages (i.e., 65,536 bytes in a system
with a page size of 4096 bytes). Since Linux 2.6.35, the default
pipe capacity is 16 pages, but the capacity can be queried and set
using the fcntl(2) F_GETPIPE_SZ and F_SETPIPE_SZ operations. See
fcntl(2) for more information.
Anyway, the following still applies IMO:
I'm not sure why you are trying to set the limit lower; it seems like a strange idea to me. If you want the writer to wait until the reader has processed what it has written, you should use a pipe in the other direction for the reader to send back an ack.
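For example, a rough sketch of that ack scheme (the pipe names and the one-byte ack are just illustrative):

#include <unistd.h>

int data_pipe[2], ack_pipe[2];    /* both created with pipe() during setup */

/* Writer side: hand data to the reader, then block until it acknowledges. */
static void send_and_wait(const void *msg, size_t len)
{
    char ack;
    write(data_pipe[1], msg, len);    /* error handling omitted for brevity */
    read(ack_pipe[0], &ack, 1);       /* returns once the reader sends one byte back */
}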

You could use a shared-memory area (System V-like) of two words, one for sending data and the other for receiving data, and implement your pipes with that.
Other solutions, as you may have found already, involve recompiling the kernel the way you would like to have it, but I suppose that is not an option here.
Ciao!
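A minimal sketch of that idea using System V shared memory (the struct, key and field names are purely illustrative, and you would still need some synchronization on top, e.g. a semaphore or atomics):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

typedef struct {
    long to_worker;      /* word written by the producing side */
    long from_worker;    /* word written by the consuming side */
} TwoWords;

int main(void)
{
    key_t key = ftok("/tmp", 'L');    /* any key both sides agree on */
    int id = shmget(key, sizeof(TwoWords), IPC_CREAT | 0600);
    if (id == -1) { perror("shmget"); return EXIT_FAILURE; }

    TwoWords *words = shmat(id, NULL, 0);
    if (words == (void *)-1) { perror("shmat"); return EXIT_FAILURE; }

    words->to_worker = 42;    /* "send" one word */
    /* ... the peer reads words->to_worker and writes words->from_worker ... */

    shmdt(words);
    shmctl(id, IPC_RMID, NULL);       /* clean up the segment */
    return 0;
}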

Related

Is read() on a nonblocking socket "greedy" on platforms other than Linux (OSX, FreeBSD)?

Consider the following invocation of read() on a nonblocking stream-mode socket (SOCK_STREAM):
ssize_t n = read(socket_fd, buffer, size);
Assume that the remote peer will not close the connection, and will not shut down its writing half of the connection (the reading half, from a local point of view).
On Linux, a short read (n > 0 && n < size) under these circumstances means that the kernel-level read buffer has been exhausted, and an immediate follow-up invocation would normally fail with EAGAIN/EWOULDBLOCK (it would fail unless new data manages to arrive in between the two calls).
In other words, on Linux, an invocation of read() will always consume everything that is immediately available provided that size is large enough.
Likewise for write(), on Linux a short write always means that the kernel-level buffer was filled, and an immediate follow-up invocation is likely to fail with EAGAIN/EWOULDBLOCK.
Question 1: Is this also guaranteed on macOS/OSX?
Question 2: Is this also guaranteed on FreeBSD?
Question 3: Is this required/guaranteed by POSIX?
I know this is true on Linux, because of the following note in the manual page for epoll (section 7):
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)
EDIT: As a motivation for the question, consider a case where you want to process input on a number of sockets simultaneously, and for whatever reason, you want to do this by fully exhausting in-kernel buffers for each socket in turn (i.e., "depth first" rather than "breadth first"). This can obviously be done by repeating a read on a read-ready socket until it fails with EAGAIN/EWOULDBLOCK, but the last invocation would be redundant if the previous read was short, and we knew that a short read was a guarantee of exhaustion.
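For concreteness, this is the kind of drain loop I have in mind; it relies on the short-read guarantee to skip the final redundant call (the consume callback is just a placeholder):

#include <errno.h>
#include <unistd.h>

static void drain_socket(int fd, void (*consume)(const char *, size_t))
{
    char buf[65536];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            consume(buf, (size_t)n);
            if ((size_t)n < sizeof buf)
                break;              /* short read => kernel buffer exhausted (on Linux) */
        } else if (n == 0) {
            break;                  /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                  /* nothing available right now */
        } else if (errno != EINTR) {
            break;                  /* real error; handle as appropriate */
        }
    }
}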
It is guaranteed by POSIX:
data shall be returned to the user as soon as it becomes available.
... and therefore on all the other platforms you mention as well, and also Windows, OS/2, NetWare, ...
Any other implementation would be pointless.

Handling mmap and fread for big files between processes

I have two processes:
Process A maps a large file (~170 GB, whose content constantly changes) into memory for writing, with the flags MAP_NONBLOCK and MAP_SHARED:
MyDataType *myDataType = (MyDataType *)mmap(NULL, sizeof(MyDataType), PROT_WRITE, MAP_NONBLOCK | MAP_SHARED, fileDescriptor, 0);
and every second I call msync:
msync((void *)myDataType, sizeof(MyDataType), MS_ASYNC);
This section works fine.
The problem occurs when process B tries to read from the same file that process A has mapped: process A then does not respond for ~20 seconds.
Process B reads from the file roughly 1000 times, using fread() and fseek(), in small blocks (~4 bytes each time).
Most of the content the process reads is close together.
What is the cause of this problem? Is it related to page allocation? How can I solve it?
BTW, the same problem occurs when I use mmap() in process B instead of plain fread().
msync() is likely the problem. It forces the system to write to disk, blocking the kernel in a write frenzy.
In general on Linux (it's the same on Solaris BTW), it is a bad idea to use msync() too often. There is no need to call msync() for the synchronization of data between the memory map and the read()/write() I/O operations, this is a misconception that comes from obsolete HOWTOs. In reality, mmap() makes only the file system cache "visible" for a process. This means that the memory blocks the process changes are still under kernel control. Even if your process crashed, the changes would land on the disk eventually. Other processes would also still be serviced by the same buffer.
Here is another answer on the subject: mmap, msync and linux process termination
The interesting part is the link to a discussion on realworldtech where Linus Torvalds himself explains how buffer cache and memory mapping work.
PS: the fseek()/fread() pair is also probably better replaced by pread(). One system call is always better than two. Also, fread() always reads a full stdio buffer (typically 4K) and copies it into user space, so if you do several small reads without an intervening fseek(), it will serve them from its local buffer and may miss updates made by process A.
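For example, something along these lines (untested; the field size and offset are just placeholders):

#include <stdint.h>
#include <unistd.h>

/* One positioned read instead of fseek()+fread(): no seek, no stdio cache. */
static int read_field(int fd, off_t offset, uint32_t *out)
{
    ssize_t n = pread(fd, out, sizeof *out, offset);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}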
This sounds like you are suffering from I/O starvation, which has nothing to do with the method (mmap or fread) you chose. You will have to improve your (pre-)caching strategy and/or try another I/O scheduler (cfq is the default; maybe deadline delivers better overall results for you).
You can change the scheduler by writing to /sys:
echo deadline > /sys/block/<device>/queue/scheduler
Maybe you should try profiling, or even using strace, to figure out for sure where the process is spending its time. 20 seconds seems like an awfully long time to be explained by I/O in msync().
When you say A doesn't respond, what exactly do you mean?

Send data to multiple sockets using pipes, tee() and splice()

I'm duplicating a "master" pipe with tee() to write to multiple sockets using splice(). Naturally these pipes will get emptied at different rates depending on how much I can splice() to the destination sockets. So when I next go to add data to the "master" pipe and then tee() it again, I may have a situation where I can write 64KB to the pipe but only tee 4KB to one of the "slave" pipes. I'm guessing then that if I splice() all of the "master" pipe to the socket, I will never be able to tee() the remaining 60KB to that slave pipe. Is that true? I guess I can keep track of a tee_offset (starting at 0) which I set to the start of the "unteed" data and then don't splice() past it. So in this case I would set tee_offset to 4096 and not splice more than that until I'm able to tee it to all to the other pipes. Am I on the right track here? Any tips/warnings for me?
If I understand correctly, you've got some realtime source of data that you want to multiplex to multiple sockets. You've got a single "source" pipe hooked up to whatever's producing your data, and you've got a "destination" pipe for each socket over which you wish to send the data. What you're doing is using tee() to copy data from the source pipe to each of the destination pipes and splice() to copy it from the destination pipes to the sockets themselves.
The fundamental issue you're going to hit here is if one of the sockets simply can't keep up - if you're producing data faster than you can send it, then you're going to have a problem. This isn't related to your use of pipes, it's just a fundamental issue. So, you'll want to pick a strategy to cope in this case - I suggest handling this even if you don't expect it to be common as these things often come up to bite you later. Your basic choices are to either close the offending socket, or to skip data until it's cleared its output buffer - the latter choice might be more suitable for audio/video streaming, for example.
The issue which is related to your use of pipes, however, is that on Linux the size of a pipe's buffer is somewhat inflexible. It defaults to 64K since Linux 2.6.11 (the tee() call was added in 2.6.17) - see the pipe manpage. Since 2.6.35 this value can be changed via the F_SETPIPE_SZ option to fcntl() (see the fcntl manpage) up to the limit specified by /proc/sys/fs/pipe-max-size, but the buffering is still more awkward to change on-demand than a dynamically allocated scheme in user-space would be. This means that your ability to cope with slow sockets will be somewhat limited - whether this is acceptable depends on the rate at which you expect to receive and be able to send data.
Assuming this buffering strategy is acceptable, you're correct in your assumption that you'll need to track how much data each destination pipe has consumed from the source, and it's only safe to discard data which all destination pipes have consumed. This is somewhat complicated by the fact that tee() doesn't have the concept of an offset - you can only copy from the start of the pipe. The consequence of this is that you can only copy at the speed of the slowest socket, since you can't use tee() to copy to a destination pipe until some of the data has been consumed from the source, and you can't do this until all the sockets have the data you're about to consume.
How you handle this depends on the importance of your data. If you really need the speed of tee() and splice(), and you're confident that a slow socket will be an extremely rare event, you could do something like this (I've assumed you're using non-blocking IO and a single thread, but something similar would also work with multiple threads):
Make sure all pipes are non-blocking (use fcntl(d, F_SETFL, O_NONBLOCK) to make each file descriptor non-blocking).
Initialise a read_counter variable for each destination pipe to zero.
Use something like epoll() to wait until there's something in the source pipe.
Loop over all destination pipes where read_counter is zero, calling tee() to transfer data to each one. Make sure you pass SPLICE_F_NONBLOCK in the flags.
Increment read_counter for each destination pipe by the amount transferred by tee(). Keep track of the lowest resultant value.
Find the lowest resultant value of read_counter - if this is non-zero, then discard that amount of data from the source pipe (using a splice() call with a destination opened on /dev/null, for example). After discarding data, subtract the amount discarded from read_counter on all the pipes (since this was the lowest value then this cannot result in any of them becoming negative).
Repeat from step 3.
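To make steps 4 to 6 concrete, here's a rough, untested sketch of one forwarding pass (assuming a fixed set of NDEST destination pipes, everything non-blocking, and a descriptor opened on /dev/null for discarding):

#define _GNU_SOURCE          /* for tee(), splice(), SPLICE_F_NONBLOCK */
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

#define NDEST 4

/* One pass of steps 4-6 above over all destination pipes. */
static void forward_once(int src_read_fd, const int dest_write_fd[NDEST],
                         size_t read_counter[NDEST], int devnull_fd)
{
    size_t lowest = (size_t)-1;

    for (int i = 0; i < NDEST; i++) {
        if (read_counter[i] == 0) {
            ssize_t n = tee(src_read_fd, dest_write_fd[i], INT_MAX, SPLICE_F_NONBLOCK);
            if (n > 0)
                read_counter[i] += (size_t)n;
            /* n == -1 with EAGAIN means this destination pipe is full: a slow socket. */
        }
        if (read_counter[i] < lowest)
            lowest = read_counter[i];
    }

    if (lowest > 0 && lowest != (size_t)-1) {
        /* Every destination has seen the first 'lowest' bytes, so discard them
           from the source pipe by splicing them to /dev/null. */
        ssize_t discarded = splice(src_read_fd, NULL, devnull_fd, NULL,
                                   lowest, SPLICE_F_NONBLOCK);
        if (discarded > 0)
            for (int i = 0; i < NDEST; i++)
                read_counter[i] -= (size_t)discarded;
    }
}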
Note: one thing that's tripped me up in the past is that SPLICE_F_NONBLOCK affects whether the tee() and splice() operations on the pipes are non-blocking, and the O_NONBLOCK you set with fcntl() affects whether the interactions with other calls (e.g. read() and write()) are non-blocking. If you want everything to be non-blocking, set both. Also remember to make your sockets non-blocking or the splice() calls to transfer data to them might block (unless that's what you want, if you're using a threaded approach).
As you can see, this strategy has a major problem - as soon as one socket blocks up, everything halts - the destination pipe for that socket will fill up, and then the source pipe will become stagnant. So, if you reach the stage where tee() returns EAGAIN in step 4 then you'll want to either close that socket, or at least "disconnect" it (i.e. take it out of your loop) such that you don't write anything else to it until its output buffer is empty. Which you choose depends on whether your data stream can recover from having bits of it skipped.
If you want to cope with network latency more gracefully then you're going to need to do more buffering, and this is going to involve either user-space buffers (which rather negates the advantages of tee() and splice()) or perhaps a disk-based buffer. The disk-based buffering will almost certainly be significantly slower than user-space buffering, and hence not appropriate given that presumably you want a lot of speed since you've chosen tee() and splice() in the first place, but I mention it for completeness.
One thing that's worth noting if you end up inserting data from user-space at any point is the vmsplice() call which can perform "gather output" from user-space into a pipe, in a similar way to the writev() call. This might be useful if you're doing enough buffering that you've split your data among multiple different allocated buffers (for example if you're using a pool allocator approach).
Finally, you could imagine swapping sockets between the "fast" scheme of using tee() and splice() and, if they fail to keep up, moving them on to a slower user-space buffering. This is going to complicate your implementation, but if you're handling large numbers of connections and only a very small proportion of them are slow then you're still reducing the amount of copying to user-space that's involved somewhat. However, this would only ever be a short-term measure to cope with transient network issues - as I said originally, you've got a fundamental problem if your sockets are slower than your source. You'd eventually hit some buffering limit and need to skip data or close connections.
Overall, I would carefully consider why you need the speed of tee() and splice() and whether, for your use-case, simply user-space buffering in memory or on disk would be more appropriate. If you're confident that the speeds will always be high, however, and limited buffering is acceptable then the approach I outlined above should work.
Also, one thing I should mention is that this will make your code extremely Linux-specific - I'm not aware of these calls being supported in other Unix variants. The sendfile() call is more restricted than splice(), but might be rather more portable. If you really want things to be portable, stick to user-space buffering.
Let me know if there's anything I've covered which you'd like more detail on.

read() from files - blocking vs. non-blocking behavior

Let's assume we open a file using fopen() and, from the FILE pointer received, fetch the file descriptor using fileno(). Then we do lots (>10^8) of random read()s of relatively small chunks, between 4 bytes and 10 KB in size, from this file:
Is it expected behaviour that such a read() might return fewer bytes than requested, without setting errno, if the file system is
1. ext3
2. NFS
3. OCFS2
4. a combination of 2 and 3 (OCFS2 via NFS)?
My reading led me to the conclusion that it should not be possible for 1. (if the file does not have O_NONBLOCK set, if it is even possible for ext3 to have it set), but for the other three (2., 3., 4.) I'm uncertain.
(Btw: could I assume that O_NONBLOCK not being set is the default in any case?)
This question arose because I observed read()s returning fewer bytes than requested, without errno set, in case 4.
The problem with drilling this down by testing is that such behaviour happens in fewer than 1 in 1,000,000,000 cases ... which is still too often :-}
Update: The average file size is between around 1 GByte and some TBytes.
You should not assume that read() will never return fewer bytes than requested, for any filesystem. This is particularly true in the case of large reads, as POSIX.1 indicates that read() behavior for sizes larger than SSIZE_MAX is implementation-dependent. On this mainstream Unix box I'm using right now, SSIZE_MAX is 32767 bytes. The fact that read() always returns the full amount today does not mean that it will in the future.
One possible reason might be that I/O priorities are more fully fleshed out in the kernel in the future. E.g. you're trying to read from the same device as another higher priority process and the other process would get better throughput if your process wasn't causing head movement away from the sectors the other process wants. The kernel might choose to give your read() a short count to get you out of the way for a while, instead of continuing to do inefficient interleaved block reads. Stranger things have been done for the sake of I/O efficiency. What is not prohibited often becomes compulsory.
We solved the problem described as having read() return fewer bytes than requested when reading from a file located on an NFS mount pointing to an OCFS2 file system (case 4 in my question).
It is a fact that with the setup mentioned above, such read()s on file descriptors sometimes return fewer bytes than requested, without errno being set.
To have all data read, it is as simple as read()ing again and again until the requested amount of data has been read.
Moreover, such a setup sometimes makes read() fail with EIO, and even then a simple re-read() leads to success and the data arrives.
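A sketch of that re-read loop (the bounded retry on EIO reflects what we observed; the limit of 3 is arbitrary):

#include <errno.h>
#include <unistd.h>

static ssize_t read_fully(int fd, char *buf, size_t want)
{
    size_t got = 0;
    int eio_retries = 3;                /* arbitrary illustrative limit */

    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n > 0) {
            got += (size_t)n;           /* short read: just read again */
        } else if (n == 0) {
            break;                      /* genuine end of file */
        } else if (errno == EINTR) {
            continue;
        } else if (errno == EIO && eio_retries-- > 0) {
            continue;                   /* transient EIO: a re-read succeeds */
        } else {
            return -1;
        }
    }
    return (ssize_t)got;
}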
My conclusion: reading from OCFS2 via NFS makes read()ing from files behave like read()ing from sockets, which is inconsistent with the specification of read() (http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html):
When attempting to read a file (other than a pipe or FIFO) that
supports non-blocking reads and has no data currently available:
If O_NONBLOCK is set, read() shall return -1 and set errno to [EAGAIN].
If O_NONBLOCK is clear, read() shall block the calling thread until some data becomes available.
Needless to say, we never tried, nor even thought about, setting O_NONBLOCK on the file descriptors in question.

How can I buffer non-blocking IO?

When I need buffered I/O on a blocking file descriptor, I use stdio. But if I switch the file descriptor to non-blocking mode, then according to the manual, stdio buffering is unusable. After some research I see that BIO can be usable for buffering non-blocking I/O.
But maybe there are other alternatives?
I need this to avoid using threads in a multi-connection environment.
I think what you are talking about is the Reactor pattern. This is a pretty standard way of processing lots of network connections without threads, and is very common in multiplayer game server engines. Another implementation (in Python) is Twisted.
The basic algorithm is:
have a buffer for each socket
check which sockets are ready to read (select(), poll(), or just iterate)
for each socket:
call recv() and accumulate the contents into the socket's buffer until recv() returns 0 or fails with EWOULDBLOCK
call application level data handler for the socket with the contents of the buffer
clear the socket's buffer
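A rough sketch of the per-socket part in C (struct conn and handle_data are placeholder names; the socket is assumed to be non-blocking and already reported readable by select()/poll()):

#include <errno.h>
#include <sys/socket.h>

struct conn {
    int    fd;           /* non-blocking socket */
    char   buf[8192];    /* per-socket accumulation buffer */
    size_t len;
};

void handle_data(struct conn *c);    /* application-level data handler */

/* Called for each socket that select()/poll() reported as readable. */
void service_ready_conn(struct conn *c)
{
    for (;;) {
        ssize_t n = recv(c->fd, c->buf + c->len, sizeof c->buf - c->len, 0);
        if (n > 0) {
            c->len += (size_t)n;
            if (c->len == sizeof c->buf)
                break;               /* buffer full: process what we have so far */
        } else if (n == 0) {
            break;                   /* peer closed the connection */
        } else if (errno == EWOULDBLOCK || errno == EAGAIN) {
            break;                   /* kernel buffer drained for now */
        } else if (errno != EINTR) {
            break;                   /* real error: handle/close as appropriate */
        }
    }
    handle_data(c);                  /* hand the accumulated buffer to the application */
    c->len = 0;                      /* clear the socket's buffer */
}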
I see the question has been edited now, and is at least more understandable than before.
Anyway, isn't this a contradiction?
You make I/O non-blocking because you want to be able to read small amounts quickly, typically sacrificing throughput for latency.
You make it buffered because you don't care that much about latency, but want to make efficient use of the I/O subsystem by trading latency for throughput.
Doing them both at the same time seems like a contradiction, and is hard to imagine.
What are the semantics you're after? If you do this:
int fd;
char buf[1024];
ssize_t got;
fd = setup_non_blocking_io(...);
got = read(fd, buf, sizeof buf);
What behavior do you expect if there are 3 bytes available? Blocking/buffered I/O might block until it can read enough to satisfy your request; non-blocking I/O would return the 3 available bytes immediately.
Of course, if you have some protocol on top, that defines some kind of message structure so that you can know that "this I/O is incomplete, I can't parse it until I have more data", you can buffer it yourself at that level, and not pass data on upwards until a full message has been received.
Depending on the protocol, it is certainly possible that you will need to buffer your reads for a non-blocking network node (client or server).
Typically, these buffers provide multiple indexes (offsets) that both record the position of the last byte processed and last byte read (which is either the same or greater than the processed offset). And they also (should) provide richer semantics of compacting the buffer, transparent buffer size management, etc.
In Java (at least) the non-blocking network io (NIO) packages also provide a set of data structures (ByteBuffer, etc.) that are geared towards providing a general data structure.
Either such data structures exist for C, or you must roll your own. Once you have one, simply read as much data as is available and let the buffer manage issues such as overflow (e.g. reading bytes across message frame boundaries), and use the marker offset to mark off the bytes that you have processed.
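A bare-bones sketch of such a structure in C (untested; the names and the growth policy are just illustrative):

#include <stdlib.h>
#include <string.h>

typedef struct {
    char   *data;
    size_t  cap;        /* allocated size */
    size_t  read_off;   /* end of valid data read from the socket */
    size_t  proc_off;   /* bytes already consumed by the parser */
} netbuf;

/* Drop already-processed bytes and move the rest to the front ("compacting"). */
static void netbuf_compact(netbuf *b)
{
    memmove(b->data, b->data + b->proc_off, b->read_off - b->proc_off);
    b->read_off -= b->proc_off;
    b->proc_off = 0;
}

/* Ensure at least 'need' free bytes are available for the next recv(). */
static int netbuf_reserve(netbuf *b, size_t need)
{
    if (b->cap - b->read_off >= need)
        return 0;
    netbuf_compact(b);                    /* try to make room without growing */
    if (b->cap - b->read_off >= need)
        return 0;
    size_t newcap = b->cap ? b->cap * 2 : 4096;
    while (newcap - b->read_off < need)
        newcap *= 2;
    char *p = realloc(b->data, newcap);   /* transparent buffer size management */
    if (!p)
        return -1;
    b->data = p;
    b->cap = newcap;
    return 0;
}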
As Android pointed out, you will (very likely) need to create matched buffers for each open connection.
You could create a struct with buffers for each open file descriptor, then accumulate data into these buffers until recv() returns 0 or you have enough data to process.
If I understand your question correctly, you can't buffer because with non-blocking you're writing to the same buffer with multiple connections (if global) or just writing small pieces of data (if local).
In any case, your program has to be able to identify where the data is coming (possibly by file descriptor) from and buffer it accordingly.
Threading is also an option; it's not as scary as many make it out to be.
Ryan Dahl's evcom library does exactly what you want.
I use it in my job and it works great. Be aware, though, that it doesn't (yet, but coming soon) have async DNS resolving. Ryan suggests udns by Michael Tokarev for that. I'm trying to adopt udns instead of blocking getaddrinfo() now.
