I'm implementing a TCP/IP server application which uses epoll in edge-triggered mode and does non-blocking socket operations. The clients are using simple blocking operations without epoll.
I don't see how "atomic reads" can be implemented on the server side. To explain what I mean by an "atomic read", here is an example with simple blocking operations:
Both the client and the server are using 64K buffers. (On application level. They don't change the kernel level socket buffers.)
Client writes 12K data with a single write operation.
Server reads it. In this case it always reads the whole 12K when the buffers are the same size, so it can't read only half of it. This is what I call "atomic".
But in the case of epoll + non-blocking operations this can happen:
Both the client and the server are using 64K buffers. (On application level. They don't change the kernel level socket buffers.)
Client writes 12K data with a single write operation.
6K arrives at the server
epoll tells the application that data has arrived on the socket
the application reads the 6K into its buffer using a non-blocking operation.
When the read is repeated, it returns EAGAIN / EWOULDBLOCK.
In this case the read is not "atomic". It is not guaranteed that when the data was written with a single write operation, the read will return it whole, in one piece.
Is it possible to know when the data is partial? I know that one solution is to always prepend the data size, and another could be to always close and re-open the connection, but I don't want to do either of these, because I think the kernel must know that the full "package" (what is that unit called, BTW?) has not arrived yet, since it guarantees atomicity for blocking operations.
Many thanks!
TCP is stream-based, not message-oriented. Even in the case of a blocking socket, there is no guarantee that what the application sends will go out on the wire as-is in one piece; TCP decides its own course.
So it is up to the application to do an "atomic" read if it wishes. For example:
The application protocol should dictate that each message is prefixed with length bytes. The length bytes tell the peer the size of the application data of interest. Of course, the application must know where the two-byte length indicator begins.
[2 byte msg length][Data bytes of interest]
Based on this information, the application doing the read should keep polling the socket until it has received all the bytes indicated by the message-length field, and only then process the data.
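To make that concrete, here is a minimal sketch of the receiving side (assuming, for simplicity, a blocking socket and a 2-byte length prefix in network byte order; with non-blocking sockets you would buffer the partial data and return to epoll on EAGAIN instead of looping):

#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

/* Sketch: read exactly `len` bytes. Returns 0 on success, -1 on error or peer close. */
static int read_exact(int sd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = recv(sd, (char *)buf + done, len - done, 0);
        if (n == 0)
            return -1;              /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, just retry */
            return -1;
        }
        done += (size_t)n;
    }
    return 0;
}

/* Sketch: read one [2-byte length][data] message into buf (capacity cap). */
static ssize_t read_message(int sd, void *buf, size_t cap)
{
    uint16_t netlen;
    if (read_exact(sd, &netlen, sizeof netlen) < 0)
        return -1;
    uint16_t len = ntohs(netlen);   /* length assumed to be in network byte order */
    if (len > cap)
        return -1;                  /* message larger than the caller's buffer */
    if (read_exact(sd, buf, len) < 0)
        return -1;
    return len;
}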
If you need an "atomic" read rather than a partial read, you can use the MSG_PEEK flag in recv. This does not remove the data from the socket buffer: the application peeks into the socket and, based on the return value, checks whether the required number of bytes is already in the socket buffer.
ret = recv(sd, buf, MAX_CALL_DATA_SIZE, MSG_PEEK);
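For example, a sketch of how MSG_PEEK can be combined with a known message length (expected_len is a placeholder for the size learned from the length prefix; sd and buf are the same names as in the line above):

/* Sketch: peek first, and only consume the data once the whole message is there. */
ssize_t ret = recv(sd, buf, expected_len, MSG_PEEK);
if (ret > 0 && (size_t)ret >= expected_len) {
    ret = recv(sd, buf, expected_len, 0);   /* full message present: read it for real */
} else {
    /* not enough data yet (or EAGAIN on a non-blocking socket): try again after the next poll */
}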
I'm using a Raspberry Pi B+ and building a TCP server/client connection in C.
I have a few questions from the client side.
How long does Linux queue packets for the client? When a packet has been received by Linux, what happens if the client is not ready to process it, or the select/epoll call inside the loop has a 1-minute sleep? If there is a timeout, is there a way to adjust it with code or a script?
What is the internal process inside Linux when it receives a packet? (i.e., ethernet port -> kernel -> RAM -> application?)
The Raspberry Pi (with Linux), and every known Linux (or non-Linux) TCP/IP stack, works roughly like this:
You have a kernel buffer in which the kernel stores all the data from the other side; this is the data that has not yet been read by the user process. The kernel has normally already acknowledged all of this data to the other side (the acknowledgement states the last byte received and stored in that buffer). The sending side also has a buffer, where it stores all the sent data that has not yet been acknowledged by the receiver (this data must be resent in case of a timeout), plus data that is not yet inside the window advertised by the receiver. If this buffer fills up, the sender is blocked, or a partial write is reported (depending on options) to the user process.
That kernel buffer (the read buffer) allows the kernel to hold the data for the user process while the process is not reading it. If the user process cannot read it yet, it stays there until the process does a read() system call.
The amount of data the kernel can still buffer (known as the window size) is sent to the other end with each acknowledgement, so the sender knows the maximum amount of data it is allowed to send. When the buffer is full, the window size drops to zero and the receiver announces that it cannot receive more data. This allows a slow receiver to stop a fast sender from filling the network with data that cannot be delivered.
From then on (the situation with a zero window), the sender periodically (or randomly) sends a segment with no data at all (or with just one byte of data, depending on the implementation) to check whether some window has opened up, allowing it to send more data. The acknowledgement of that segment lets it start communicating again.
Everything is stopped now, but no timeout happens. Both TCPs continue talking this way until some window becomes available (meaning the receiver has read() part of the buffer).
This situation can be maintained for days without any problem: the reading process is busy and cannot read the data, and the writing process is blocked in the write call until the kernel on the sending side has buffer space to accommodate the data to be written.
When the reading process reads the data:
An ACK of the last byte received is sent, announcing a new window size larger than zero (larger by the amount freed by the reading process).
The sender receives this acknowledgement and sends that amount of data from its buffer; if this frees enough room to accommodate the data the writer has asked to write, the writer is woken up and allowed to continue sending data.
Again, timeouts normally only occur if data is lost in transit.
But...
If you are behind a NAT device, your connection state can be lost simply by not exercising the connection (the NAT device maintains a cache of the local address/port pairs that have made connections to the outside), and on the next data transfer coming from the remote device, the NAT device may (or may not) send an RST, because the packet refers to a connection it no longer knows about (the cache entry has expired).
Or, if the next packet comes from the internal device, the connection may be re-cached and continue; what happens depends on which side sends a packet first.
Nothing requires an implementation to provide a timeout for data waiting to be sent, but some implementations do, aborting the connection with an error if some data remains unacknowledged for a very long time. TCP itself specifies no timeout in this case, so it is the process's responsibility to cope with it.
TCP is specified in RFC 793, which all implementations must obey if they want communication to succeed. You can read it if you like; I think you'll get a better explanation there than the one I have given here.
So, to answer your first question: the kernel will store the data in its buffer for as long as your process is willing to wait for it. By default you just call write() on a socket, and the kernel keeps trying for as long as you (the user) don't decide to stop the process and abort the operation; in that case the kernel will probably close or reset the connection. The resources are tied to the life of the process, so as long as the process is alive and holding the connection open, the kernel will wait for it.
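If you want to put your own bound on how long the kernel waits, one common approach is to set socket options; here is a sketch (the numbers are arbitrary, sd is assumed to be a connected TCP socket, and the keepalive tuning options are Linux-specific):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Sketch: make a blocking recv() give up with EAGAIN/EWOULDBLOCK after 30 seconds. */
struct timeval tv = { .tv_sec = 30, .tv_usec = 0 };
setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

/* Sketch: enable keepalive probes so a silent, dead peer is eventually detected. */
int on = 1, idle = 60, interval = 10, count = 5;
setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle);
setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval);
setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count);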
In a TCP client/server in C, what happens if I do consecutive reads/writes in my program? Is this possible, or do I always have to follow the structure "client write -> server read -> server write -> client read"?
Is it possible to do something like the following? Or is it possible that data from the 3rd write in the server is received by the 2nd read in the client, and other bad things like that?
client.c
write (1)
read (2)
read (3)
read (4)
server.c
read (1)
write (2)
write (3)
write (4)
TCP transfers a data stream.
So N calls to write could result in M calls to read (with N>0 and M>0 and N<=number of bytes transferred and M<=number of bytes transferred).
The order in which the bytes were sent will, however, be preserved across the various writes and reads involved.
As a direct consequence, you cannot just use the plain number of calls to write and read to synchronise the processes. You need to add, for example, the number of bytes to be transferred in each step.
To cut it short: you need an application-level protocol on top of the TCP stream.
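A minimal sketch of such a protocol on the sending side (assuming a 2-byte length prefix in network byte order and a blocking socket; error handling is trimmed and the names are only illustrative):

#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Sketch: send exactly `len` bytes, retrying on short writes. */
static int send_all(int sd, const void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = send(sd, (const char *)buf + done, len - done, 0);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Sketch: send one [2-byte length][payload] message. */
static int send_message(int sd, const void *payload, uint16_t len)
{
    uint16_t netlen = htons(len);
    if (send_all(sd, &netlen, sizeof netlen) < 0)
        return -1;
    return send_all(sd, payload, len);
}

The receiver then reads the two length bytes first and keeps reading until it has collected exactly that many payload bytes; it is those byte counts, not the number of read()/write() calls, that tell the two sides where each message ends.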
Yes, this is possible. read()s will block (by default) until data is available to read on the socket. There's no need to try to arrange the reads and writes in a particular order; your program will just wait until there is something to read before the read() call returns.
Your code will result in this:
Server blocks in read()
Client write()s
Client blocks in read()
Server receives data, read() returns
Server write()s three times
Client receives data, read(1) returns
Client's read(2) is called and returns as soon as data arrive
Client's read(3) is called and returns as soon as data arrive
On a really fast link it's possible that the server write()s and the client read()s happen at roughly the same time, maybe even interleaved, but the server write()s will always be in order and the client's read()s will always be in order.
Data ordering is preserved if the sockets are SOCK_STREAM, e.g. TCP or UNIX stream sockets like you're asking about.
So the data written by write(2) will always arrive before the data written by write(3). Keep in mind, though, that TCP does not preserve message boundaries: a single read() may return the data of more than one write(), or only part of one, so read(2) lines up exactly with write(2) only if your application-level protocol makes the message sizes known.
If you used a UDP socket (SOCK_DGRAM) instead, you might find that the messages are received out of order or not at all.
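For instance, if every message in the example were a fixed-size record, each client read(N) could be made to correspond to exactly one server write(N) with a helper that keeps reading until the record is complete (a sketch; RECORD_SIZE is a placeholder):

#include <unistd.h>

#define RECORD_SIZE 128   /* placeholder: size of one fixed-length record */

/* Sketch: block until exactly `len` bytes have been read (or error/EOF). */
static ssize_t read_record(int fd, char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, buf + done, len - done);
        if (n <= 0)
            return n;     /* 0 = connection closed, -1 = error */
        done += (size_t)n;
    }
    return (ssize_t)done;
}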
When reading from a socket using read(2) and blocking I/O, how do I know that the other side (the client) has no more data to send? (By "no more data to send" I mean, for example, that the client is now waiting for a response.) At first, I thought this point is reached when fewer than count bytes are returned by read (as in read(fd, *buf, count)).
But what if the client sends the data fragmented? Reading until read returns 0 would be a solution, but as far as I know 0 is only returned when the client closes the connection; otherwise, read would just block until the connection is closed. I thought of using non-blocking I/O and a timeout for select(2), but this does not seem like a tidy solution to me.
Are there any known best practices?
The concept of "the other side has no more data to send", without either a timeout or some semantics in the transmitted data, is quite pointless. Normally, code on the client/server will be able to process data faster than the network can transmit it. So if there's no data in the receive buffer when you're trying to read() it, this just means the network has not yet transmitted everything, but you have no way to tell if the next packet will arrive within a millisecond, a second, or a day. You'd probably consider the first case as "there is more data to send", the third as "no more data to send", and the second depends on your application.
If the other side doesn't close the connection, you probably don't know when it's ready to send the next data packet either.
So unless you have specific semantics and knowledge about what the client sends, using select() and non-blocking I/O is the best you can do.
In specific cases there might be other ways. For example, if you know the client will send an XML tag, some data, and a closing tag every n seconds, you could start reading n seconds after the last packet you received, then just read on until you receive the closing tag. But as I said, this isn't a general approach, since it requires semantics on the channel.
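A sketch of the select()-with-timeout approach (the 5-second timeout is an arbitrary choice, and sd is assumed to be a connected socket):

#include <sys/select.h>
#include <sys/socket.h>

/* Sketch: wait up to 5 seconds for data; treat a timeout as
 * "the peer has nothing more to send for now". */
fd_set rfds;
struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
FD_ZERO(&rfds);
FD_SET(sd, &rfds);

int ready = select(sd + 1, &rfds, NULL, NULL, &tv);
if (ready > 0) {
    char buf[4096];
    ssize_t n = recv(sd, buf, sizeof buf, 0);
    /* n > 0: more data; n == 0: peer closed; n < 0: error */
} else if (ready == 0) {
    /* no data within the timeout: the application decides what that means */
}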
TCP is a byte-stream protocol, not a message protocol. If you want messages you really have to implement them yourself, e.g. with a length-word prefix, lines, XML, etc. You can guess with the FIONREAD option of ioctl(), but guessing is all it is, as you can't know whether the client has paused in the middle of transmission of the message, or whether the network has done so for some reason.
The protocol needs to give you a way to know when the client has finished sending a message.
Common approaches are to send the length of each message before it, or to send a special terminator after each message (similar to the NUL character at the end of strings in C).
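As an illustration of the terminator approach, here is a sketch that collects bytes until a newline arrives (the '\n' terminator is just an example; a real protocol would define its own):

#include <sys/socket.h>

/* Sketch: accumulate bytes until a '\n' terminator is seen.
 * Returns the message length (without the '\n'), or -1 on error/EOF/overflow. */
static ssize_t read_line(int sd, char *buf, size_t cap)
{
    size_t used = 0;
    while (used < cap) {
        ssize_t n = recv(sd, buf + used, 1, 0);   /* one byte at a time: simple, not efficient */
        if (n <= 0)
            return -1;
        if (buf[used] == '\n')
            return (ssize_t)used;                 /* full message received */
        used++;
    }
    return -1;   /* message too long for the buffer */
}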
I'm attempting to write a simple server using C system calls that takes unknown byte streams from unknown clients and executes specific actions depending on client input. For example, the client will send a command "multiply 2 2" and the server will multiply the numbers and return the result.
In order to avoid errors where the server reads before the client has written, I have a blocking recv() call that waits for any data using MSG_PEEK. When recv detects data to be read, I move on to non-blocking recv()s that read the stream byte by byte.
Everything works except in the corner case where the client sends no data (i.e. write(socket, "", 0); ). I was wondering how exactly I would detect that a message with no data is sent. In this case, recv() blocks forever.
Also, this post pretty much sums up my problem, but it doesn't suggest a way to detect a size 0 packet.
What value will recv() return if it receives a valid TCP packet with payload sized 0
When using TCP at the send/recv level you are not privy to the packet traffic that goes into making the stream. When you send a nonzero number of bytes over a TCP stream the sequence number increases by the number of bytes. That's how both sides know where the other is in terms of successful exchange of data. Sending multiple packets with the same sequence number doesn't mean that the client did anything (such as your write(s, "", 0) example), it just means that the client wants to communicate some other piece of information (for example, an ACK of data flowing the other way). You can't directly see things like retransmits, duplicate ACKs, or other anomalies like that when operating at the stream level.
The answer you linked says much the same thing.
Everything works except in the corner case where the client sends no data (i.e. write(socket, "", 0); ).
write(socket, "", 0) isn't even a send in the first place. It's just a local API call that does nothing on the network.
I was wondering how exactly I would detect that a message with no data is sent.
No message is sent, so there is nothing to detect.
In this case, recv() blocks forever.
I agree.
I have a blocking recv() call to wait for any data using MSG_PEEK. When recv detects data to be read, I move onto non-blocking recv()'s that read the stream byte by byte.
Instead of using recv(MSG_PEEK), you should be using select(), poll(), or epoll() to detect when data arrives, then call recv() to read it.
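A sketch of that structure with poll() (the 1000 ms timeout is arbitrary, and client_sd is a placeholder for the accepted connection's descriptor):

#include <poll.h>
#include <sys/socket.h>

/* Sketch: wait until data arrives, then read whatever is there. */
struct pollfd pfd = { .fd = client_sd, .events = POLLIN };

int ready = poll(&pfd, 1, 1000);          /* wait up to 1000 ms */
if (ready > 0 && (pfd.revents & POLLIN)) {
    char buf[4096];
    ssize_t n = recv(client_sd, buf, sizeof buf, 0);
    if (n == 0) {
        /* peer closed the connection; this, not a "0-byte message",
         * is the only way recv() returns 0 on a TCP stream */
    } else if (n > 0) {
        /* process n bytes */
    }
}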
I have an application that sends data point-to-point from a sender to a receiver over a link that can operate in simplex (one-way transmission) or duplex (two-way) mode. In simplex mode the application sends data using UDP, and in duplex mode it uses TCP. Since a write on a TCP socket may block, we are using non-blocking I/O (ioctl with FIONBIO; O_NONBLOCK and fcntl are not supported on this distribution) and the select() system call to determine when data can be written. Non-blocking I/O is used so that we can abort the send early after a timeout, should network conditions deteriorate. I'd like to use the same basic code to do the sending, but switch between TCP/UDP at a higher level of abstraction. This works great for TCP.
However, I am concerned about how non-blocking I/O works for a UDP socket. I may be reading the man pages incorrectly, but since write() may return indicating fewer bytes sent than requested, does that mean that a client will receive fewer bytes in its datagram? To send a given buffer of data, multiple writes may then be needed, which may well happen since I am using non-blocking I/O. I am concerned that this will translate into multiple UDP datagrams received by the client.
I am fairly new to socket programming, so please forgive me if I have some misconceptions here. Thank you.
Assuming a correct (not broken) UDP implementation, then each send/sendmsg/sendto will correspond to exactly one whole datagram sent and each recv/recvmsg/recvfrom will correspond to exactly one whole datagram received.
If a UDP message cannot be transmitted in its entirety, you should receive an EMSGSIZE error. A sent message might still fail due to size at some point in the network, in which case it will simply not arrive. But it will not be delivered in pieces (unless the IP stack is severely buggy).
A good rule of thumb is to keep your UDP payload size to at most 1400 bytes. That is very approximate and leaves a lot of room for various forms of tunneling so as to avoid fragmentation.
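For example, a sketch of a single datagram send (sd is assumed to be a UDP socket, addr a filled-in destination sockaddr_in, and buf/len the payload):

#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Sketch: one sendto() == one datagram; it is queued whole or not at all. */
ssize_t n = sendto(sd, buf, len, 0, (struct sockaddr *)&addr, sizeof addr);
if (n < 0) {
    if (errno == EMSGSIZE) {
        /* datagram too large: shrink the payload (e.g. to <= ~1400 bytes) */
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        /* non-blocking socket and local buffers full: retry once select() says writable */
    }
} else {
    /* for UDP, n == len: the whole datagram was queued */
}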