Recently, I have been developing some asynchronous algorithms in my research. While doing parallel performance studies, I have begun to suspect that I am not properly understanding some details of the various non-blocking MPI functions.
I've seen some insightful posts on here, namely:
MPI: blocking vs non-blocking
MPI Non-blocking Irecv didn't receive data?
There are a few things I am uncertain about, or just want to clarify, related to working with non-blocking functionality that I think will help me potentially improve the performance of my current software.
From the Nonblocking Communication part of the MPI 3.0 standard:
A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed.
...
If the send mode is standard then the send-complete call may return before a matching receive is posted, if the message is buffered. On the other hand, the send-complete may not complete until a matching receive is posted, and the message was copied into the receive buffer.
So, as a first set of questions about MPI_Isend (and similarly MPI_Irecv): it seems that to ensure a non-blocking send finishes, I need to use some mechanism to check that it is complete, because in the worst case there may not be suitable hardware to transfer the data concurrently, right? So if I never use something like MPI_Test or MPI_Wait following the non-blocking send, the MPI_Isend may never actually get its message out, right?
This question applies to some of my work because I am sending messages via MPI_Isend and not actually testing for completion until I receive the expected response message, since I want to avoid the overhead of MPI_Test calls. While this approach has been working, it seems faulty based on my reading.
Further, the second paragraph appears to say that a standard non-blocking send, MPI_Isend, may not even begin to send any of its data until the destination process has posted a matching receive. Given the availability of MPI_Probe/MPI_Iprobe, does this mean an MPI_Isend call will at least send out some preliminary metadata of the message, such as size, source, and tag, so that the probe functions on the destination process can know a message wants to be sent there, and so the destination process can actually post a corresponding receive?
Related is a question about the probe. In the Probe and Cancel section, the standard says that
MPI_IPROBE(source, tag, comm, flag, status) returns flag = true if there is a message that can be received and that matches the pattern specified by the arguments source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point in the program, and returns in status the same value that would have been returned by MPI_RECV(). Otherwise, the call returns flag = false, and leaves status undefined.
Going by the above passage, it is clear that probing will tell you whether there is an available message you can receive matching the specified source, tag, and comm. My question is: when a probe succeeds, should you assume that the data for the corresponding send has not actually been transferred yet?
After reading the standard, it seems reasonable that a message the probe is aware of need not be a message the local process has actually fully received. Given the previous details about the standard non-blocking send, it seems you would need to post a receive after a successful probe to ensure the source's non-blocking standard send completes, because there may be times when the source is sending a large message that MPI does not want to copy into some internal buffer, right? And either way, it seems that posting the receive after probing is how you ensure that you actually get the full data from the corresponding send. Is this correct?
This latter question relates to one instance in my code where I make an MPI_Iprobe call and, if it succeeds, perform an MPI_Recv call to get the message. However, I now think this could be problematic, because I had assumed that if the probe succeeds, the whole message has already arrived. That implied to me that the MPI_Recv would run quickly, since the full message would already be in local memory somewhere. I now suspect this assumption was incorrect, and some clarification would be helpful.
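For reference, here is a minimal sketch of the pattern I am describing (the function name, tag, and datatype are illustrative); the comments mark where my assumption may break down:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the probe-then-receive pattern described above. */
void poll_for_message(MPI_Comm comm, int tag)
{
    int flag = 0;
    MPI_Status status;

    /* A successful probe only guarantees a matching envelope
     * (source, tag, size) is known; the payload may not have
     * arrived locally yet. */
    MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status);
    if (flag) {
        int count;
        MPI_Get_count(&status, MPI_DOUBLE, &count);
        double *buf = malloc(count * sizeof(double));

        /* This receive may still have to pull the data over the
         * network (e.g. a rendezvous for large messages), so it
         * is not guaranteed to return quickly. */
        MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE,
                 status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        /* ... consume buf ... */
        free(buf);
    }
}
```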
The MPI standard does not mandate a progress thread. That means MPI_Isend() might do nothing at all until communications are progressed. Progress happens under the hood in most MPI subroutines; MPI_Test(), MPI_Wait() and MPI_Probe() are the most obvious ones.
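A minimal sketch of what progressing a nonblocking send can look like (do_local_work() is a placeholder for your computation):

```c
#include <mpi.h>

void do_local_work(void);  /* placeholder for application computation */

void send_with_progress(const double *buf, int n, int dest, int tag,
                        MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Isend(buf, n, MPI_DOUBLE, dest, tag, comm, &req);

    /* Without a progress thread, calls such as MPI_Test are what
     * give the library the chance to actually move the data. */
    while (!done) {
        do_local_work();
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    /* Equivalently, a final MPI_Wait(&req, MPI_STATUS_IGNORE)
     * would both progress and complete the send. */
}
```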
I am afraid you are mixing up progress and synchronous send (e.g. MPI_Ssend()).
MPI_Probe() is a local operation: it will not contact the sender to ask whether something was sent, nor will it progress the send.
Performance-wise, you should avoid unexpected messages as much as possible, meaning a receive should be posted on one end before the message is sent by the other end.
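For example, a minimal sketch of pre-posting on both ends of an exchange (names are illustrative):

```c
#include <mpi.h>

/* Post the receive before initiating the send, so the incoming
 * message can land directly in user memory rather than being
 * stashed as an "unexpected" message. */
void exchange(double *recvbuf, const double *sendbuf, int n,
              int peer, int tag, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, tag, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, tag, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```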
There is a trade-off between performance and portability here:
if you want to write portable code, then you cannot assume there is an MPI progress thread
if you want to optimize your application on a given system, you should try an MPI library that implements a progress thread on the interconnect you are using
Keep in mind that most MPI implementations send small messages in eager mode (note this is not mandated by the MPI standard, and you should not rely on it).
That means MPI_Send() will likely return immediately if the message is small enough (and "small enough" depends, among other things, on your MPI implementation, how it is tuned, and which interconnect is used).
I'm working on a client application that uses OpenSSL for TLS with non-blocking I/O. Either side of the connection may write at any time to indicate some state change, so I need to check the socket regularly for new data. If nothing has been received, my code should immediately continue with the next task: sending new data from the client to the server.
But I'm not sure how I can do this with the OpenSSL API: calling SSL_read() on a socket where no data has been received (because the other side did not send anything) always results in an error code of SSL_ERROR_WANT_READ (or even SSL_ERROR_WANT_WRITE in case of a renegotiation). In that case, may I just continue with the next task and call SSL_write() to send other data to the server?
The man page doesn't discuss this in detail. It just mentions that the arguments of the repeated read call need to be exactly the same as before. The book Network Security with OpenSSL by Viega et al. contains an example of non-blocking I/O, but the author waits for SSL_read() to succeed (reading 1 or more bytes) and only then calls SSL_write() to send data to the other side, which means that no write operation is possible until data from the other side has been received. For my purposes this would not work.
In a nutshell: if SSL_read() returns SSL_ERROR_WANT_WRITE or SSL_ERROR_WANT_READ, may I just call SSL_write() with other data before I repeat the read? And if not, how can I achieve this read-if-available-then-write behavior?
SSL_ERROR_WANT_WRITE or SSL_ERROR_WANT_READ does not necessarily indicate an error condition. It signifies that, in order to proceed, a read or write must be done first.
Hence, if your SSL_read returns either of these, you have to wait for the condition to be fulfilled, as in the example from the book you mentioned. Until then, calling any other SSL API, including SSL_write, will not work.
Perhaps you can use select/poll/epoll (if you are not already) to detect network activity on your socket.
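For instance, a rough sketch of a read-if-available helper, assuming the SSL object sits on a non-blocking socket (the function name is illustrative):

```c
#include <openssl/ssl.h>

/* Returns the number of bytes read, 0 if nothing is available yet,
 * or -1 on a real error. Assumes 'ssl' sits on a non-blocking fd. */
int try_read(SSL *ssl, char *buf, int len)
{
    int n = SSL_read(ssl, buf, len);
    if (n > 0)
        return n;

    switch (SSL_get_error(ssl, n)) {
    case SSL_ERROR_WANT_READ:
    case SSL_ERROR_WANT_WRITE:
        /* No application data ready (or a renegotiation is in
         * flight): not an error, retry after select()/poll()
         * reports the socket ready. */
        return 0;
    default:
        return -1;
    }
}
```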
I have 20 threads all sending data on a single TCP socket at a time and receiving data as well. When I run my application I don't see any synchronization issues, but according to my understanding, there may be issues when two threads simultaneously try to write to the TCP socket, or when one thread is writing while another is reading.
If my understanding is correct, why don’t I face any errors?
Sometimes when you don't look both ways before crossing the street, you still get to the other side of the street safely. That doesn't mean it will work successfully every time you do it.
Here's the thing: you say "you don't see any synchronization issues", but that's only because it happens to do what you want it to do. Flip this around -- the reason you don't see any synchronization issues is because you happen to want it to do what it happens to do. Someone who expected it to do something else would see synchronization issues with the very same code.
In other words, you flipped a coin that could come up heads or tails. You expected it to come up heads, knowing that it was not guaranteed. And it came up heads. There's no mystery -- the explanation is that you expected what it happened to do. Had you expected something else, even had it done the very same thing, it would not have done what you expected.
First, the send and receive streams of each socket are independent. There should be no problem with one thread sending while another is receiving.
If multiple threads attempt to write to one socket, the behaviour is, in general, undefined. In practice, a write call from one of the threads will enter a lock in the TCP stack state machine first, preventing any other threads from entering; it will write its data, release the lock and exit the stack, allowing write calls from other threads to proceed. This serializes individual write calls. If your protocol implementation can send each PDU with one write call, then fine. If a PDU requires more than one write call, then your outgoing PDUs can get sliced up as the write calls from the multiple threads interleave.
Making receive calls from multiple threads to one socket is just... something. Even if the stack's internal synchronization allows only one receive call per socket at a time, the streaming nature of TCP would surely split up the received data in a pseudo-arbitrary manner across the threads. Just don't do it, it's crazy.
TCP already has a mechanism for multiplexing data streams - multiple sockets. You should use them correctly.
If you need to multiplex data streams across one socket, you should add a data-routing protocol on top of TCP and implement it in a single receive thread. This thread can keep a list of virtual connections and service stream/message requests from the other threads.
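A minimal sketch of what such a routing layer could look like; the 4-byte stream-id/length framing and the deliver_to_stream() helper are invented purely for illustration:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical framing: each message on the wire is
 * [uint32 stream_id][uint32 length][payload]. One thread owns all
 * reads on the socket and routes payloads to per-stream handlers. */

void deliver_to_stream(uint32_t id, char *payload, uint32_t len); /* app-defined */

static int read_full(int fd, void *buf, size_t n)  /* loop over short reads */
{
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

void receive_loop(int fd)
{
    uint32_t hdr[2];
    while (read_full(fd, hdr, sizeof hdr) == 0) {
        uint32_t stream_id = ntohl(hdr[0]);
        uint32_t len = ntohl(hdr[1]);
        char *payload = malloc(len);
        if (payload == NULL || read_full(fd, payload, len) != 0) {
            free(payload);
            break;
        }
        deliver_to_stream(stream_id, payload, len);
    }
}
```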
When a socket is signalled as being OK to write by a call to select(), how can I know how much data I can send without blocking? (In the case of full send buffers etc.)
Does inclusion in the set returned by select() signify that the socket is ready for at least one byte of data, and will send() then return a short count of written bytes?
Or will it block when I call send() with a len parameter that is bigger than the available buffer space? If so, how do I know the maximum amount?
I'm using regular C sockets on Linux.
The send call should not block on the first call, and should send at least one byte -- assuming you are using a stream protocol and it's not interrupted by a signal, etc. However, there are really only two ways to figure out how much data you can send:
Call select after every call to send to see if more data can be sent.
Put the socket in non-blocking mode, and call send until it gives an EAGAIN or EWOULDBLOCK error.
The second option is preferred. (The third option is to do it in a different thread and simply let the thread block, which is also a good solution. In the past, threading implementations weren't as mature so non-blocking mode was seen as necessary for high-performance servers.)
You cannot know. You have to set the socket to be non-blocking, and then pay attention to the return value that tells you how much it has written.
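A sketch of that approach, assuming fd has already been made non-blocking (send_some() is an illustrative name):

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Queue as much of buf as the socket buffer will take. Returns the
 * number of bytes accepted, or -1 on a real error. Assumes fd was
 * made non-blocking, e.g. with fcntl(fd, F_SETFL, O_NONBLOCK). */
ssize_t send_some(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n > 0) {
            off += (size_t)n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;  /* buffer full: go back to select() and retry later */
        } else {
            return -1;
        }
    }
    return (ssize_t)off;
}
```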
The CreateIoCompletionPort function allows the creation of a new I/O completion port and the registration of file handles to an existing I/O completion port.
Then, I can use any function, like a recv on a socket or a ReadFile on a file, with an OVERLAPPED structure to start an asynchronous operation.
I have to check whether the function call returned synchronously even though it was called with an OVERLAPPED structure, and in that case handle it directly. In the other case, when ERROR_IO_PENDING is returned, I can use the GetQueuedCompletionStatus function to be notified when the operation completes.
The questions which arise are:
How can I remove a handle from the I/O completion port? For example, when I add sockets to the IOCP, how can I remove closed ones? Should I just re-register another socket with the same completion key?
Also, is there a way to make the calls ALWAYS go through the I/O completion port and never return synchronously?
And finally, is it possible, for example, to recv asynchronously but send synchronously? For example, when implementing a simple echo service: can I wait with an asynchronous recv for new data but send the response synchronously, to reduce code complexity? In my case, I wouldn't recv a second time anyway before the first request was processed.
What happens if an asynchronous ReadFile has been requested, but before it completes, a WriteFile to the same file should be processed? Will the ReadFile be cancelled with an error, so that I have to restart the read as soon as the write completes? Or do I have to cancel the ReadFile manually before writing? This question arises in connection with a communication device, so the write and read should not cause problems if they happen concurrently.
How can I remove a handle from the I/O completion port?
In my experience you can't disassociate a handle from a completion port. However, you may disable completion port notification by setting the low-order bit of your OVERLAPPED structure's hEvent field; see the documentation for GetQueuedCompletionStatus.
For example, when I add sockets to the IOCP, how can I remove closed ones? Should I just re-register another socket with the same completion key?
It is not necessary to explicitly disassociate a handle from an I/O completion port; closing the handle is sufficient. You may associate multiple handles with the same completion key; the best way to figure out which request is associated with the I/O completion is by using the OVERLAPPED structure. In fact, you may even extend OVERLAPPED to store additional data.
Also, is there a way to make the calls ALWAYS go over the I/O completion port and don't return synchronously?
That is the default behavior, even when ReadFile/WriteFile returns TRUE. You must explicitly call SetFileCompletionNotificationModes to tell Windows not to enqueue a completion packet when TRUE and ERROR_SUCCESS are returned.
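For example, a minimal sketch of opting out of those packets (Windows Vista and later; the function name is illustrative):

```c
#include <windows.h>

/* After this call, an operation on h that succeeds immediately
 * (returns TRUE with ERROR_SUCCESS) will not also queue a packet to
 * the completion port, so the result must be handled at the call site. */
BOOL skip_sync_completions(HANDLE h)
{
    return SetFileCompletionNotificationModes(
        h, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);
}
```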
is it possible for example to recv asynchronously but to send synchronously?
Not by using recv and send; you need functions that accept OVERLAPPED structures, such as WSARecv and WSASend, or alternatively ReadFile and WriteFile. It might be handier to use the latter if your code is meant to work with multiple types of I/O handles, such as both sockets and named pipes. Those functions provide a synchronous mode, so if you use them you can mix asynchronous and synchronous calls.
What happens if an asynchronous ReadFile has been requested, but before it completes, a WriteFile to the same file should be processed?
There is no implicit cancellation. As long as you're using separate OVERLAPPED structures for each read/write to a full-duplex device, I see no reason why you can't do concurrent I/O operations.
As I've already pointed out there, the commonly held belief that it is impossible to remove handles from completion ports is wrong, probably caused by the absence of any hint whatsoever on how to do this in nearly all the documentation I could find. Actually, it's pretty easy:
Call NtSetInformationFile with the FileReplaceCompletionInformation enumerator value for FileInformationClass and a pointer to a FILE_COMPLETION_INFORMATION structure for the FileInformation parameter. In this structure, set the Port member to NULL (or nullptr, in C++) to disassociate the file from the port it's currently attached to (I guess if it isn't attached to any port, nothing would happen),
or set Port to a valid HANDLE of another completion port to associate the file with that one instead.
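A rough sketch of that call; FILE_COMPLETION_INFORMATION and the information-class value normally come from the driver kit headers (ntifs.h), so they are re-declared here for user-mode use (the value 61 matches recent SDKs), and error handling is omitted:

```c
#include <windows.h>
#include <winternl.h>

/* Re-declaration of the ntifs.h structure for user-mode use. */
typedef struct _FILE_COMPLETION_INFORMATION {
    HANDLE Port;
    PVOID  Key;
} FILE_COMPLETION_INFORMATION;

#define FileReplaceCompletionInformation ((FILE_INFORMATION_CLASS)61)

typedef NTSTATUS (NTAPI *NtSetInformationFile_t)(
    HANDLE, PIO_STATUS_BLOCK, PVOID, ULONG, FILE_INFORMATION_CLASS);

/* Pass port == NULL to detach 'file' from its current completion
 * port, or a valid port handle to attach it to that port instead. */
NTSTATUS replace_completion_port(HANDLE file, HANDLE port, PVOID key)
{
    NtSetInformationFile_t pNtSetInformationFile =
        (NtSetInformationFile_t)GetProcAddress(
            GetModuleHandleW(L"ntdll.dll"), "NtSetInformationFile");
    IO_STATUS_BLOCK iosb;
    FILE_COMPLETION_INFORMATION info = { port, key };

    return pNtSetInformationFile(file, &iosb, &info, sizeof info,
                                 FileReplaceCompletionInformation);
}
```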
First, some important corrections.
If the overlapped I/O operation completes immediately (ReadFile or a similar I/O function returns success), the I/O completion is still scheduled to the IOCP.
Also, judging by your questions, I think you are confusing the file/socket handles with the specific I/O operations issued on them.
Now, regarding your questions:
AFAIK there is no conventional way to remove a file/socket handle from the IOCP (and usually you just don't have to). You talk about removing closed handles from the IOCP, which is absolutely incorrect: you can't remove a closed handle, because it no longer references a valid kernel object!
A more correct question would be how the file/socket should be properly closed. The answer is: just close your handle. All outstanding I/O operations (issued on this handle) will soon return with an error code (aborted). Then your completion routine (the one that calls GetQueuedCompletionStatus in a loop) should perform the per-I/O cleanup needed.
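A sketch of such a completion loop; the IO_CONTEXT type is illustrative:

```c
#include <windows.h>
#include <stdlib.h>

/* Illustrative per-I/O context: an OVERLAPPED extended with whatever
 * the operation needs (buffers, operation type, ...). */
typedef struct {
    OVERLAPPED ov;   /* must be the first member for the cast below */
    /* ... per-operation data ... */
} IO_CONTEXT;

void completion_loop(HANDLE iocp)
{
    for (;;) {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED *pov;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &pov,
                                            INFINITE);
        if (!ok && pov != NULL) {
            /* A dequeued completion that failed. After CloseHandle,
             * outstanding operations typically surface here with
             * ERROR_OPERATION_ABORTED; do the per-I/O cleanup. */
            free((IO_CONTEXT *)pov);
            continue;
        }
        if (!ok)
            break;  /* the dequeue itself failed (e.g. port closed) */

        /* ... process a successful completion via (IO_CONTEXT *)pov ... */
    }
}
```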
As I've already said, I/O completions arrive at the IOCP in both the synchronous and the asynchronous case. The only situation where a completion does not arrive at the IOCP is when an I/O completes synchronously with an error. In any case, if you want unified processing, you may post artificial completion data to the IOCP (use PostQueuedCompletionStatus).
You should use WSASend and WSARecv (not send and recv) for overlapped I/O. Nevertheless, even if the socket was opened with the WSA_FLAG_OVERLAPPED flag, you are allowed to call the I/O functions without specifying an OVERLAPPED structure, in which case they work synchronously.
So you may decide between synchronous and asynchronous mode for every function call.
There is no problem mixing overlapped read/write requests. The only delicate point is what happens if you try to read data from the file position you are currently writing to. The result may depend on subtle things, such as the order in which the hardware completes the I/Os, timing, etc. Such a situation should be avoided.
How can I remove a handle from the I/O completion port? For example, when I add sockets to the IOCP, how can I remove closed ones? Should I just re-register another socket with the same completion key?
You've got it the wrong way around. You set the I/O completion port to be used by a file object - when the file object is deleted, you have nothing to worry about. The reason you're getting confused is because of the way Win32 exposes the underlying native API functionality (CreateIoCompletionPort does two very different things in one function).
Also, is there a way to make the calls ALWAYS go over the I/O completion port and don't return synchronously?
This is how it's always been. Only starting with Windows Vista can you customize how the completion notifications are handled.
What happens if an asynchronous ReadFile has been requested, but before it completes, a WriteFile to the same file should be processed. Will the ReadFile be cancelled with an error message and I have to restart the read process as soon as the write is complete?
I/O operations in Windows are inherently asynchronous, and requests are always queued. You may not think this is so because you have to specify FILE_FLAG_OVERLAPPED in CreateFile to turn on asynchronous I/O. However, at the native layer, synchronous I/O is really an add-on convenience where the kernel keeps track of the file position for you and waits for the I/O to complete before returning.
Can we call send from one thread and recv from another on the same socket?
Can we call multiple sends in parallel from different threads on the same socket?
I know that a good design should avoid this, but I am not clear on how these system APIs behave, and I have been unable to find good documentation on this.
Any pointers in this direction would be helpful.
POSIX defines send/recv as atomic operations, so assuming you're talking about POSIX send/recv then yes, you can call them simultaneously from multiple threads and things will work.
This doesn't necessarily mean that they'll be executed in parallel -- in the case of multiple sends, the second will likely block until the first completes. You probably won't notice this much, as a send completes once it has put its data into the socket buffer.
If you're using SOCK_STREAM sockets, trying to do things in parallel is less likely to be useful, as send/recv might send or receive only part of a message, which means things could get split up.
Blocking send/recv on SOCK_STREAM sockets only block until they send or receive at least 1 byte, so the difference between blocking and non-blocking modes is not that useful here.
The socket descriptor belongs to the process, not to a particular thread. Hence, it is possible to send/receive to/from the same socket in different threads; the OS will handle the synchronization.
However, if the order of sending/receiving is semantically significant, you yourself (respectively your code) have to ensure proper sequencing between the operations in the different threads - as is always the case with threads.
I don't see how receiving in parallel could possibly accomplish anything. If you have a 3-byte message, one thread could get the first 2 bytes and another the last byte, but you'd have no way of telling which was which. Unless your messages are only a byte long, there is no way you could reliably make anything work with multiple threads receiving.
Multiple sends might work, if you sent the entire message in a single call, but I'm not sure. It's possible that one could overwrite another. There certainly wouldn't be any performance benefit to doing so.
If multiple threads need to send, you should implement a synchronized message queue: have one thread do the actual sending, reading messages from the queue, and have the other threads enqueue whole messages. The same approach works for receiving, but the receive thread would have to know the format of the messages so it can deserialize them properly.
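A minimal sketch of such a send queue using pthreads; the message layout and function names are illustrative, and short-write handling is omitted for brevity:

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>

/* One mutex-protected queue: many producers enqueue whole messages,
 * and a single writer thread is the only caller of send(). */
typedef struct msg {
    struct msg *next;
    size_t len;
    char data[];        /* flexible array member holding the payload */
} msg;

static msg *head, *tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

void enqueue(msg *m)
{
    pthread_mutex_lock(&lock);
    m->next = NULL;
    if (tail) tail->next = m; else head = m;
    tail = m;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

void *sender_thread(void *arg)
{
    int fd = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == NULL)
            pthread_cond_wait(&nonempty, &lock);
        msg *m = head;
        head = m->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&lock);

        /* Each whole message is sent by this one thread,
         * so messages cannot interleave on the wire. */
        send(fd, m->data, m->len, 0);
        free(m);
    }
    return NULL;
}
```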