From epoll's man page:
epoll is a variant of poll(2) that can be used either as an edge-triggered
or a level-triggered interface
When would one use the edge triggered option? The man page gives an example that uses it, but I don't see why it is necessary in the example.
When an FD becomes read or write ready, you might not necessarily want to read (or write) all the data immediately.
Level-triggered epoll will keep nagging you as long as the FD remains ready, whereas edge-triggered won't bother you again until the next time you get an EAGAIN (so it's more complicated to code around, but can be more efficient depending on what you need to do).
Say you're writing from a resource to an FD. If you register your interest for that FD becoming write ready as level-triggered, you'll get constant notification that the FD is still ready for writing. If the resource isn't yet available, that's a waste of a wake-up, because you can't write any more anyway.
If you were to add it as edge-triggered instead, you'd get notification that the FD was write ready once, then when the other resource becomes ready you write as much as you can. Then if write(2) returns EAGAIN, you stop writing and wait for the next notification.
The same applies for reading, because you might not want to pull all the data into user-space before you're ready to do whatever you want to do with it (thus having to buffer it, etc etc). With edge-triggered epoll you get told when it's ready to read, and then can remember that and do the actual reading "as and when".
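A minimal sketch of that read-side pattern (assuming `fd` is a nonblocking stream socket already registered with `EPOLLIN | EPOLLET`): consume until read() reports EAGAIN, then go back to epoll_wait() and wait for the next edge.

```c
#include <errno.h>
#include <unistd.h>

/* Sketch: edge-triggered read loop. Drain the descriptor until EAGAIN,
 * because with EPOLLET no further notification arrives while data that
 * was already signalled is still sitting in the kernel buffer. */
static void drain_readable(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            continue;      /* hand `buf` to the application, keep reading */
        if (n == 0)
            break;         /* peer closed the connection */
        if (errno == EINTR)
            continue;      /* interrupted by a signal; retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;         /* readiness exhausted; wait for the next edge */
        break;             /* real error: report and close fd */
    }
}
```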
In my experiments, ET doesn't guarantee that only one thread wakes up, although it often wakes up only one. The EPOLLONESHOT flag is for this purpose.
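If you need the at-most-one-thread behaviour, a sketch of the EPOLLONESHOT pattern (`epfd` and `fd` are assumed to be set up elsewhere): the descriptor is disabled after one notification and must be re-armed with EPOLL_CTL_MOD once the handler is done.

```c
#include <sys/epoll.h>

/* Sketch: EPOLLONESHOT arming/re-arming. After one notification epoll
 * disables the descriptor, so only one thread at a time handles it. */
static void arm_oneshot(int epfd, int fd, int op) /* EPOLL_CTL_ADD or _MOD */
{
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
    ev.data.fd = fd;
    epoll_ctl(epfd, op, fd, &ev);
}

/* Register once with arm_oneshot(epfd, fd, EPOLL_CTL_ADD); when a
 * worker finishes handling an event, it calls
 * arm_oneshot(epfd, fd, EPOLL_CTL_MOD) to receive the next one. */
```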
Level triggered
Use level trigger mode when you can't consume all the data in the FD and want epoll to keep triggering while data is available.
For example, if you are receiving a large file over an FD and cannot consume all of its data in one go, level-triggered mode keeps firing so you can pick up the remaining data on the next iteration; it suits this case well.
Disadvantages
thundering herd
The EPOLLEXCLUSIVE flag (since Linux 4.5) is meant to prevent the thundering herd phenomenon; a sketch of its use follows below.
lower efficiency
When a read/write event occurs on a monitored file descriptor, epoll_wait() notifies the handler to read or write. If you don't read or write all the data at once (e.g., the read/write buffer is too small), then the next call to epoll_wait() will notify you again for the file descriptor you didn't finish with; and of course, if you never read or write at all, it will keep notifying you.
If the system has a large number of ready file descriptors that you don't need to read or write, and they are returned on every call, this can greatly reduce the efficiency with which the handler retrieves the ready file descriptors it actually cares about.
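A minimal sketch of the EPOLLEXCLUSIVE mitigation mentioned above (assuming each worker owns its own epoll instance watching a shared `listen_fd`): the kernel then wakes only some (typically one) of the waiters per event instead of all of them.

```c
#include <sys/epoll.h>

/* Sketch: each worker registers the shared listening socket with
 * EPOLLEXCLUSIVE (Linux 4.5+), so an incoming connection wakes only
 * some of the waiting workers rather than the whole herd. */
static int watch_listener_exclusive(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLEXCLUSIVE; /* valid only with EPOLL_CTL_ADD */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
    return epfd;
}
```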
use cases
redis epoll: since the I/O thread of Redis is single-threaded, level trigger mode is used.
Edge triggered
Use edge triggered mode and make sure all data available is buffered and will be handled eventually.
As Chris Dodd mentioned in the comments
ET is also particularly nice with a multithreaded server on a multicore machine. You can run one thread per core and have all of them call epoll_wait on the same FD. When data comes in on an FD, exactly one thread will be woken to handle it
use cases
nginx epoll model
golang netpoll
Related
I'm trying to use serial ports in an async manner. I can use select, poll, or epoll with O_NONBLOCK to do async read and writes. But what about open and close?
I've seen close block for more than a second.
There are very few operating systems which implement true asynchronous open() and close() (specifying O_NONBLOCK to open() means "don't sleep waiting for a connection or input", not "perform the operation in the background"). Two that come to mind are QNX and the Hurd, both micro-kernel operating system designs where every syscall is by definition multiplexable and therefore asynchronous.
As to why: historically you couldn't do anything until an open() completed, so API designers never thought to make it async. More recently, if you really want it to be async, do the call from a threadpool. close() is a bit more interesting: it's actually pretty hard to make closing a file descriptor fast without throwing away valuable information whose loss amounts to data loss, e.g. "the buffered data I just tried to write out failed". But again, if you really need close() to be async, just call it from a threadpool.
As a general rule, you cannot expect high performance if you call open() and close() a lot. Both inevitably involve making the kernel run a lot of code: checking permissions, allocating kernel structures, taking locks on kernel structures, etc. Generally, for high-performance file I/O, you open the files you need at the beginning and never close them. That gets good to superb performance on most operating systems.
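A minimal sketch of the threadpool trick for close() (illustrative helper names; a real program would reuse pooled threads rather than spawn one per call):

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch: "asynchronous" close() by pushing the potentially blocking
 * syscall onto a detached thread. Note that any error close() reports
 * (e.g. a deferred write failure) is silently lost, exactly the
 * trade-off described above. */
static void *close_worker(void *arg)
{
    close((int)(intptr_t)arg);   /* may block; nobody waits on this thread */
    return NULL;
}

static void async_close(int fd)
{
    pthread_t t;
    if (pthread_create(&t, NULL, close_worker, (void *)(intptr_t)fd) == 0)
        pthread_detach(t);
    else
        close(fd);               /* fall back to closing inline */
}
```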
Some Unix code I am working on depends on being able to poll over a small number of pipes. poll is a POSIX system call that (much like the older select) allows the process to wait until one or more file descriptors is "ready" for reading or writing, which means one can proceed to do so without blocking. This is useful to implement event loops where waiting is clearly separated from the rest of the communication.
Is it possible to do the same for Windows pipe handles - wait for one or more of them to become "ready" for reading/writing?
Existing SO advice on the matter, such as answers to this question, recommend the use of completion ports. However as far as I can tell, completion ports require initiating reading/writing beforehand, and then waiting for (or being notified of) the completion of those operations. This approach does not fit the architecture of the code, which strongly separates the polling code from the reading/writing code, the latter calling into a library that uses the regular ReadFile and WriteFile on the underlying handle.
If there is no direct equivalent to poll, could one abuse completion ports to provide something similar? In other words, is it possible to create IO completion events that announce "you can now call ReadFile (WriteFile) on this handle without it blocking" and wait for them using WaitForMultipleObjects or GetQueuedCompletionStatus?
When working with Linux epoll in edge triggered mode (EPOLLET), and a read/write fails with EAGAIN/EWOULDBLOCK, it means that read/write-readiness was lost, and that a new readiness event is guaranteed to be made available via epoll_wait() as soon as readiness is regained.
Additionally, when working with Linux epoll in edge triggered mode, and a nonblocking stream-mode socket, and provided that we registered interest in EPOLLRDHUP events, and that an EPOLLRDHUP event was not already received, a short read/write (return value less than requested size) also means loss of read/write-readiness, and we can still rely on a new readiness notification when readiness is regained, even though no read/write ever failed with EAGAIN/EWOULDBLOCK.
Similarly, when working with Kqueue (macOS/FreeBSD) in edge triggered mode (EV_CLEAR), and a read/write fails with EAGAIN/EWOULDBLOCK, it means that read/write-readiness was lost, and that a new readiness event is guaranteed to be made available via kevent() as soon as readiness is regained.
Question: When working with Kqueue in edge-triggered mode, and a nonblocking stream-mode socket, and provided that we registered interest in EV_EOF events, and that an EV_EOF event was not already received, is there a similar guarantee, that a short read/write means loss of read/write-readiness, and that a new readiness event is guaranteed to be produced when readiness is regained?
EDIT: Note: Knowing that a short read means loss of read-readiness allows me (in the general case) to avoid a redundant invocation of read() just to get the EAGAIN/EWOULDBLOCK failure.
The meaning of a short read/write in the context of Linux epoll follows from this comment in the epoll(7) man page:
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)
You ask about "Kqueue in edge-triggered mode", but the kqueue documentation does not use that terminology. I think you must mean that you have enabled the EV_CLEAR flag for the event in question, with the effect that
After the event is retrieved by the user, its state is reset.
(BSD documentation for kqueue())
Furthermore, you stipulate that the program has
registered interest in EV_EOF events, and that an EV_EOF event was not already received
but EV_EOF is not an event in its own right; rather, it is a flag that some of the available filters will set when appropriate, especially EVFILT_READ.
Anyway, the core of your question is
is there a similar guarantee, that a short read/write means loss of read/write-readiness, and that a new readiness event is guaranteed to be produced when readiness is regained?
As far as I can determine, there is no guarantee that a short read signals loss of read readiness, neither for BSD nor for Linux. Indeed, the Linux docs for read(2) specifically call out receipt of a signal as a possible alternative reason for a short read.
Moreover, the usage model the Linux epoll() docs recommend for non-blocking file descriptors in edge-triggered mode is to read repeatedly until a read fails with EAGAIN, using that as the indication of loss of readiness prior to end-of-file. I recommend following the same policy with the kqueue system for events that have EV_CLEAR in effect.
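A sketch of that policy with kqueue (assuming a nonblocking stream socket): register EVFILT_READ with EV_CLEAR, and on each retrieved event drain with read() until EAGAIN, checking the EV_EOF flag on the returned event.

```c
#include <sys/event.h>
#include <errno.h>
#include <unistd.h>

/* Sketch: EV_CLEAR ("edge-triggered") kqueue registration plus the
 * drain-until-EAGAIN policy recommended above. */
static void register_read(int kq, int fd)
{
    struct kevent ch;
    EV_SET(&ch, fd, EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, NULL);
    kevent(kq, &ch, 1, NULL, 0, NULL);
}

static void on_readable(int fd, const struct kevent *ev)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* hand `buf` to the application */
    }
    if (n == 0 || (ev->flags & EV_EOF)) {
        /* peer shut down its side: clean up fd */
    } else if (errno != EAGAIN && errno != EWOULDBLOCK) {
        /* real error: handle it */
    }
    /* otherwise EAGAIN: readiness exhausted, wait for the next event */
}
```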
I recognize that you hoped to save one read() call by stopping upon a short read, but I think that presents a bona fide risk of the process leaving incoming data streams unserviced indefinitely. Furthermore, your concern is premature unless you've determined that those extra reads are responsible for a measurable, unacceptable performance drain.
The CreateIoCompletionPort function allows the creation of a new I/O completion port and the registration of file handles to an existing I/O completion port.
Then, I can use any function, like recv on a socket or ReadFile on a file, with an OVERLAPPED structure to start an asynchronous operation.
I have to check whether the function call returned synchronously although it was called with an OVERLAPPED structure and in this case handle it directly. In the other case, when ERROR_IO_PENDING is returned, I can use the GetQueuedCompletionStatus function to be notified when the operation completes.
The questions which arise are:
How can I remove a handle from the I/O completion port? For example, when I add sockets to the IOCP, how can I remove closed ones? Should I just re-register another socket with the same completion key?
Also, is there a way to make the calls ALWAYS go over the I/O completion port and not return synchronously?
And finally, is it possible for example to recv asynchronously but to send synchronously? For example when a simple echo service is implemented: Can I wait with an asynchronous recv for new data but send the response in a synchronous way so that code complexity is reduced? In my case, I wouldn't recv a second time anyways before the first request was processed.
What happens if an asynchronous ReadFile has been requested, but before it completes, a WriteFile to the same file should be processed? Will the ReadFile be cancelled with an error, so that I have to restart the read as soon as the write completes? Or do I have to cancel the ReadFile manually before writing? This question arises in combination with a communication device; the write and the read should not cause problems if they happen concurrently.
How can I remove a handle from the I/O completion port?
In my experience you can't disassociate a handle from a completion port. However, you may disable completion port notification by setting the low-order bit of your OVERLAPPED structure's hEvent field; see the documentation for GetQueuedCompletionStatus.
For example, when I add sockets to the IOCP, how can I remove closed ones? Should I just re-register another socket with the same completion key?
It is not necessary to explicitly disassociate a handle from an I/O completion port; closing the handle is sufficient. You may associate multiple handles with the same completion key; the best way to figure out which request is associated with the I/O completion is by using the OVERLAPPED structure. In fact, you may even extend OVERLAPPED to store additional data.
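The usual shape of such an extended OVERLAPPED (a sketch; the field names are illustrative): embed it as the first member of a per-I/O context and map the pointer back with CONTAINING_RECORD when the completion is dequeued.

```c
#include <windows.h>

/* Sketch: per-I/O context carrying extra data alongside OVERLAPPED. */
typedef struct IO_CONTEXT {
    OVERLAPPED ov;            /* zero it before every new request */
    char       buffer[4096];  /* buffer this particular I/O uses */
    int        operation;     /* illustrative tag: read vs. write */
} IO_CONTEXT;

/* After GetQueuedCompletionStatus(..., &key, &ovp, ...):
 *   IO_CONTEXT *ctx = CONTAINING_RECORD(ovp, IO_CONTEXT, ov); */
```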
Also, is there a way to make the calls ALWAYS go over the I/O completion port and not return synchronously?
That is the default behavior, even when ReadFile/WriteFile returns TRUE. You must explicitly call SetFileCompletionNotificationModes to tell Windows to not enqueue a completion packet when TRUE and ERROR_SUCCESS are returned.
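A sketch of that opt-out (Windows Vista and later; handle `h` is assumed to be associated with a completion port already):

```c
#include <windows.h>

/* Sketch: tell Windows not to queue a completion packet when an
 * overlapped call completes synchronously with success. After this,
 * a ReadFile/WSARecv that returns TRUE/0 must be handled inline. */
static BOOL skip_sync_completions(HANDLE h)
{
    return SetFileCompletionNotificationModes(
        h, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);
}
```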
is it possible for example to recv asynchronously but to send synchronously?
Not by using recv and send; you need to use functions that accept OVERLAPPED structures, such as WSARecv and WSASend, or alternatively ReadFile and WriteFile. It might be handier to use the latter if your code is meant to work with multiple types of I/O handles, such as both sockets and named pipes. Those functions provide a synchronous mode, so if you use them you can mix asynchronous and synchronous calls.
What happens if an asynchronous ReadFile has been requested, but before it completes, a WriteFile to the same file should be processed?
There is no implicit cancellation. As long as you're using separate OVERLAPPED structures for each read/write to a full-duplex device, I see no reason why you can't do concurrent I/O operations.
As I've already pointed out there, the commonly held belief that it is impossible to remove handles from completion ports is wrong, probably caused by the absence of any hint whatsoever on how to do this in nearly all documentation I could find. Actually, it's pretty easy:
Call NtSetInformationFile with the FileReplaceCompletionInformation enumerator value for FileInformationClass and a pointer to a FILE_COMPLETION_INFORMATION structure for the FileInformation parameter. In this structure, set the Port member to NULL (or nullptr, in C++) to disassociate the file from the port it's currently attached to (I guess if it isn't attached to any port, nothing would happen),
or set Port to a valid HANDLE to another completion port to associate the file with that one instead.
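A sketch of the detach case (the structure and the enumerator value come from the WDK's ntifs.h rather than winternl.h, so they are declared locally here; verify both against your headers, and link against ntdll.lib or resolve the function with GetProcAddress):

```c
#include <windows.h>
#include <winternl.h>

/* From the WDK (ntifs.h); not in winternl.h, so declared here. */
typedef struct _FILE_COMPLETION_INFORMATION {
    HANDLE Port;   /* NULL detaches; another port handle re-associates */
    PVOID  Key;
} FILE_COMPLETION_INFORMATION;

#define FileReplaceCompletionInformation ((FILE_INFORMATION_CLASS)61)

NTSTATUS NTAPI NtSetInformationFile(HANDLE FileHandle,
                                    PIO_STATUS_BLOCK IoStatusBlock,
                                    PVOID FileInformation, ULONG Length,
                                    FILE_INFORMATION_CLASS FileInformationClass);

/* Sketch: disassociate `file` from its current completion port. */
static NTSTATUS detach_from_port(HANDLE file)
{
    IO_STATUS_BLOCK iosb;
    FILE_COMPLETION_INFORMATION info = { NULL, NULL };
    return NtSetInformationFile(file, &iosb, &info, sizeof info,
                                FileReplaceCompletionInformation);
}
```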
First some important corrections.
If the overlapped I/O operation completes immediately (ReadFile or a similar I/O function returns success), the completion packet is nevertheless queued to the IOCP.
Also, judging from your questions, I think you are confusing the file/socket handles with the specific I/O operations issued on them.
Now, regarding your questions:
AFAIK there is no conventional way to remove a file/socket handle from the IOCP (usually you just don't have to do this). You talk about removing closed handles from the IOCP, which is absolutely incorrect. You can't remove a closed handle, because it does not reference a valid kernel object anymore!
A more correct question would be how the file/socket should be properly closed. The answer is: just close your handle. All the outstanding I/O operations (issued on this handle) will soon return with an error code (abortion). Then your completion routine (the one that calls GetQueuedCompletionStatus in a loop) should perform the needed per-I/O cleanup.
As I've already said, the I/O completion arrives at the IOCP in both the synchronous and the asynchronous case. The only situation where it does not arrive at the IOCP is when an I/O completes synchronously with an error. Anyway, if you want unified processing in that case too, you may post artificial completion data to the IOCP (use PostQueuedCompletionStatus).
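A sketch of such a unified completion loop (names illustrative): aborted I/O on a closed handle arrives here as a failed dequeue with a non-NULL OVERLAPPED pointer, and the synchronous-error case can be funnelled in with PostQueuedCompletionStatus.

```c
#include <windows.h>

/* Sketch: one loop that sees every completion, real or artificial. */
static void completion_loop(HANDLE iocp)
{
    for (;;) {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED *ov;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);

        if (!ok && ov == NULL) {
            continue;  /* the dequeue itself failed or timed out */
        } else if (!ok) {
            /* I/O finished with an error or was aborted because its
             * handle was closed: do the per-I/O cleanup for `ov` */
        } else {
            /* success: process `bytes` transferred for `ov` */
        }
    }
}

/* For a call that failed synchronously, post an artificial packet so
 * the same loop handles it:
 *   PostQueuedCompletionStatus(iocp, 0, key, &ctx->ov); */
```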
You should use WSASend and WSARecv (not recv and send) for overlapped I/O. Nevertheless, even if the socket was opened with the WSA_FLAG_OVERLAPPED flag, you are allowed to call the I/O functions without specifying an OVERLAPPED structure, in which case those functions work synchronously.
So you may decide between synchronous and asynchronous mode for every function call.
There is no problem mixing overlapped read/write requests. The only delicate point here is what happens if you try to read data from the file position you're currently writing to. The result may depend on subtle things, such as the order in which the hardware completes the I/Os, PC timing parameters, etc. Such a situation should be avoided.
How can I remove a handle from the I/O completion port? For example, when I add sockets to the IOCP, how can I remove closed ones? Should I just re-register another socket with the same completion key?
You've got it the wrong way around. You set the I/O completion port to be used by a file object - when the file object is deleted, you have nothing to worry about. The reason you're getting confused is because of the way Win32 exposes the underlying native API functionality (CreateIoCompletionPort does two very different things in one function).
Also, is there a way to make the calls ALWAYS go over the I/O completion port and not return synchronously?
This is how it's always been. Only starting with Windows Vista can you customize how the completion notifications are handled.
What happens if an asynchronous ReadFile has been requested, but before it completes, a WriteFile to the same file should be processed? Will the ReadFile be cancelled with an error, so that I have to restart the read as soon as the write completes?
I/O operations in Windows are asynchronous inherently, and requests are always queued. You may not think this is so because you have to specify FILE_FLAG_OVERLAPPED in CreateFile to turn on asynchronous I/O. However, at the native layer, synchronous I/O is really an add-on, convenience thing where the kernel keeps track of the file position for you and waits for the I/O to complete before returning.
I'm trying to understand how asynchronous file operations are emulated using threads. I've found next to nothing to read about the subject.
Is it possible that:
a process uses a thread to open a regular file (HDD).
the parent gets the file descriptor from the thread, and may then terminate the thread.
the parent uses the file descriptor with a new thread, reading X bytes from the file.
the parent gets the file descriptor back, with its seek position reflecting the current file state.
the parent may repeat these operations without needing to open or seek every time it wishes to "continue" reading a new chunk of the file?
This is just a wild guess of mine; I'd appreciate it if anybody could shed more light on how it's emulated efficiently.
UPDATE:
By efficient I actually mean that I don't want the thread to "wait" from the moment the file was opened. Think of an HTTP non-blocking daemon which serves a client with a huge file: you want to use a thread to read chunks of the file without blocking the daemon, but you don't want to keep the thread busy "waiting" for the actual transfer to take place; you want to use the thread for other blocking operations of other clients.
To understand asynchronous I/O better, it may be helpful to think in terms of overlapping operations. That is, the number of pending operations (operations that have been started but not yet completed) can simultaneously go above one.
A diagram that explains asynchronous I/O might look like this: http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
If you are using the asynchronous I/O capabilities provided by the underlying operating system, then it is possible to asynchronously read from multiple files without spawning an equal number of threads.
If your underlying operating system does not provide asynchronous I/O, or if you decide not to use it (in other words, you wish to emulate asynchronous operation using only blocking I/O, the regular Read/Write provided by the operating system), then it is necessary to spawn as many threads as there are simultaneous I/O operations. This is because when a thread makes a blocking I/O call, it cannot continue executing until the operation finishes. To start another blocking I/O operation, that operation has to be issued from another thread that is not already occupied.
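A minimal sketch of that emulation (names illustrative): the blocking pread() runs on a worker thread, and the caller is signalled through a condition variable once the chunk is in memory.

```c
#include <pthread.h>
#include <unistd.h>

/* Sketch: one emulated asynchronous read. The worker thread performs
 * the blocking pread(); `done` flips under the mutex when data is in. */
typedef struct {
    int             fd;
    void           *buf;
    size_t          len;
    off_t           offset;   /* pread() makes per-chunk seeking unnecessary */
    ssize_t         result;
    int             done;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} async_read_req;

static void *read_worker(void *arg)
{
    async_read_req *r = arg;
    ssize_t n = pread(r->fd, r->buf, r->len, r->offset); /* blocks here */

    pthread_mutex_lock(&r->mu);
    r->result = n;
    r->done = 1;
    pthread_cond_signal(&r->cv);
    pthread_mutex_unlock(&r->mu);
    return NULL;
}

/* The caller starts a request with pthread_create(&t, NULL, read_worker, r)
 * and either polls r->done under r->mu or blocks on r->cv. */
```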
When you open/create a file, fire up a thread. Now store that thread id/ptr as your file handle.
Basically the thread will do nothing except sit in a loop waiting for an "event". A semaphore would be good here. When you want to do a read, you add the read command to a queue (remember to protect the queue insertion with a critical section), return a unique id, and then increment the semaphore. If the thread is asleep it will now wake up, grab the first message off the queue and process it. When it has completed, you remove the command from the queue.
To poll whether a file read has completed you can simply check whether it's in the command queue. If it's not there, the command has completed.
Furthermore, if you want to allow synchronous reads as well, after sending the message through you can wait for an "event" to be triggered by the completion. You then check whether the unique id is in the queue; if it isn't, you return control. If it still is, you go back to waiting until the relevant unique id has been processed.
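A sketch of that queue-plus-semaphore design using POSIX primitives (names illustrative; as described above, a command stays in the queue until it completes so pollers can find it, and a completion flag covers the race where a poller looks just as the command is dequeued):

```c
#include <pthread.h>
#include <semaphore.h>

/* Sketch of the per-file worker described above: a mutex-protected
 * command queue plus a semaphore counting pending commands. */
typedef struct command {
    struct command *next;
    unsigned        id;       /* unique id handed back to the caller */
    int             opcode;   /* illustrative: read / write / close */
    int             done;     /* completion flag the caller can poll */
} command;

typedef struct {
    command        *head;
    pthread_mutex_t mu;       /* the "critical section" around the queue */
    sem_t           pending;  /* incremented once per queued command */
} command_queue;

static void *file_worker(void *arg)
{
    command_queue *q = arg;
    for (;;) {
        sem_wait(&q->pending);         /* sleep until a command arrives */
        pthread_mutex_lock(&q->mu);
        command *cmd = q->head;        /* next queued command */
        pthread_mutex_unlock(&q->mu);

        /* ...execute cmd here (the actual read/write)... */

        pthread_mutex_lock(&q->mu);
        q->head = cmd->next;           /* dequeue only after completion */
        cmd->done = 1;
        pthread_mutex_unlock(&q->mu);
    }
    return NULL;
}
```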