How to avoid TCP aggregation in 'C'? - c

I'm sending many TCP packets each of size 50 bytes over the network. Later I found it out TCP aggregates a few 50 byte packets into a single TCP packet. My question is, is there a way to avoid TCP aggregation in 'C' program?

Packing multiple sent packets into a single TCP packet is handled using an algorithm known as Nagle's algorithm. To disable it, set the TCP_NODELAY option on your socket:
int flag = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
Note that this decreases the efficiency of your network, and should be avoided unless you really do need each packet to be sent immediately.

TCP fundamentally does not provide a packet-based service at the application layer. Applications using TCP that need more than a pure stream must provide their own framing mechanism.
Whilst the TCP_NODELAY option will often cause each application-level write to be carried in a separate frame (unless it's larger than the MTU), there's no guarantee. In particular, if the recieve window has been reduced to zero (common on a connection where you are sending bulk data), then multiple writes can still be coalesced when the receive window opens up again.
Furthermore, even if the sending TCP does not coalesce your writes, there is no way to stop the receiving side from presenting the data from multiple TCP frames in one application-level read from the socket - this will particularly occur when your receiving application is not scheduled for some time, or when packet loss occurs on the connection.
If your application assumes that each application-level write on the sending end will be paired exactly with an application-level read on the receiving end, you are setting yourself up for rare and hard-to-reproduce bugs.

TCP does not, by definition, offer any packetization services. TCP is a byte-stream oriented protocol. The best you can do is to insert your own framing marks into your data. Then if your receiver can at least detect bad frames (aggregated data perhaps) and drop such data, that sort of gets you where you want to be.
The send-as-soon-as-possible approach (ala NODELAY) is not guaranteed to work because your receiver can still aggregate the data. This is more probable if any tcp data gets delayed for whatever reason

One way is to use the TCP_NODELAY flags in the send call. However it depends on the TCP stack if this is supported or not.

Related

Can a linux socket return data less than the underlying packet? [duplicate]

When will a TCP packet be fragmented at the application layer? When a TCP packet is sent from an application, will the recipient at the application layer ever receive the packet in two or more packets? If so, what conditions cause the packet to be divided. It seems like a packet won't be fragmented until it reaches the Ethernet (at the network layer) limit of 1500 bytes. But, that fragmentation will be transparent to the recipient at the application layer since the network layer will reassemble the fragments before sending the packet up to the next layer, right?
It will be split when it hits a network device with a lower MTU than the packet's size. Most ethernet devices are 1500, but it can often be smaller, like 1492 if that ethernet is going over PPPoE (DSL) because of the extra routing information, even lower if a second layer is added like Windows Internet Connection Sharing. And dialup is normally 576!
In general though you should remember that TCP is not a packet protocol. It uses packets at the lowest level to transmit over IP, but as far as the interface for any TCP stack is concerned, it is a stream protocol and has no requirement to provide you with a 1:1 relationship to the physical packets sent or received (for example most stacks will hold messages until a certain period of time has expired, or there are enough messages to maximize the size of the IP packet for the given MTU)
As an example if you sent two "packets" (call your send function twice), the receiving program might only receive 1 "packet" (the receiving TCP stack might combine them together). If you are implimenting a message type protocol over TCP, you should include a header at the beginning of each message (or some other header/footer mechansim) so that the receiving side can split the TCP stream back into individual messages, either when a message is received in two parts, or when several messages are received as a chunk.
Fragmentation should be transparent to a TCP application. Keep in mind that TCP is a stream protocol: you get a stream of data, not packets! If you are building your application based on the idea of complete data packets then you will have problems unless you add an abstraction layer to assemble whole packets from the stream and then pass the packets up to the application.
The question makes an assumption that is not true -- TCP does not deliver packets to its endpoints, rather, it sends a stream of bytes (octets). If an application writes two strings into TCP, it may be delivered as one string on the other end; likewise, one string may be delivered as two (or more) strings on the other end.
RFC 793, Section 1.5:
"The TCP is able to transfer a
continuous stream of octets in each
direction between its users by
packaging some number of octets into
segments for transmission through the
internet system."
The key words being continuous stream of octets (bytes).
RFC 793, Section 2.8:
"There is no necessary relationship
between push functions and segment
boundaries. The data in any particular
segment may be the result of a single
SEND call, in whole or part, or of
multiple SEND calls."
The entirety of section 2.8 is relevant.
At the application layer there are any number of reasons why the whole 1500 bytes may not show up one read. Various factors in the internal operating system and TCP stack may cause the application to get some bytes in one read call, and some in the next. Yes, the TCP stack has to re-assemble the packet before sending it up, but that doesn't mean your app is going to get it all in one shot (it is LIKELY will get it in one read, but it's not GUARANTEED to get it in one read).
TCP tries to guarantee in-order delivery of bytes, with error checking, automatic re-sends, etc happening behind your back. Think of it as a pipe at the app layer and don't get too bogged down in how the stack actually sends it over the network.
This page is a good source of information about some of the issues that others have brought up, namely the need for data encapsulation on an application protocol by application protocol basis Not quite authoritative in the sense you describe but it has examples and is sourced to some pretty big names in network programming.
If a packet exceeds the maximum MTU of a network device it will be broken up into multiple packets. (Note most equipment is set to 1500 bytes, but this is not a necessity.)
The reconstruction of the packet should be entirely transparent to the applications.
Different network segments can have different MTU values. In that case fragmentation can occur. For more information see TCP Maximum segment size
This (de)fragmentation happens in the TCP layer. In the application layer there are no more packets. TCP presents a contiguous data stream to the application.
A the "application layer" a TCP packet (well, segment really; TCP at its own layer doesn't know from packets) is never fragmented, since it doesn't exist. The application layer is where you see the data as a stream of bytes, delivered reliably and in order.
If you're thinking about it otherwise, you're probably approaching something in the wrong way. However, this is not to say that there might not be a layer above this, say, a sequence of messages delivered over this reliable, in-order bytestream.
Correct - the most informative way to see this is using Wireshark, an invaluable tool. Take the time to figure it out - has saved me several times, and gives a good reality check
If a 3000 byte packet enters an Ethernet network with a default MTU size of 1500 (for ethernet), it will be fragmented into two packets of each 1500 bytes in length. That is the only time I can think of.
Wireshark is your best bet for checking this. I have been using it for a while and am totally impressed

C : Linux Sockets: Recvfrom() too slow in fetching the UDP Packets

I am receiving UDP packets at the rate of 10Mbps. Each packet is formed of around 1109 bytes.
So, it makes more than 1pkt/ms that I am receving on eth0. The recvfrom() in C receives the packet and passes on the packet to Java. Java does the filtering of the packets and the necessary processing.
The bottlenecks are:
recvfrom() is too slow:fetching takes more than 10ms possibly because it does not get the CPU.
Passing of the packet from C to Java through the interface(JNI) takes 1-2 ms.
The processing of packet in Java itself takes 0.5 to 1 second depending on if database insertion or image processing needs to be done.
So, the problem is many delays add up and more than half of the packets are lost.
The possible solutions could be:
Exclude the need for C's recvfrom() completely and implement UDP fetching directly in Java (Note: On the first place, C's recvfrom() was implemented to receive raw packets and not UDP). This solution may reduce the JNI transfer delay.
Implement multi-threading on the UDP receive function in Java. But then an ID shall be required in the UDP packets for the sequence because in multi-threading the order of incoming packets is not guaranteed. (However, in this particular program, there is a need for packets to be ordered). This solution might be helpful in receiving all packets but the protocol which sends the data needs to be modified to add a sequence identifier. Due to multi-threading, the receiver might have higher chances to get the CPU and packets can be quickly fetched.
In Java, a blocking queue can be implemented as a huge buffer which stores the incoming packets. The Java parser can then use the packets from this queue to process it. However, it is not sure if the receiver function will be fast enough and put all the received packets in the queue without dropping any packet.
I would like to know which of the solutions could be optimal or a combination of the above solutions will work. Any help or suggestions would be greatly appreciated.
How long is this burst going on? Is it continuous and will go on forever? Then you need beefier hardware that can handle the load. Possibly some load-balancing where multiple servers handle the incoming data.
Does the burst only last a short wile, like in at most a second or two? Then have the lower levels read all packets as fast as it can, and put in a queue, and let the upper levels get the messages from the queue in its own time.
It sounds like you may be calling recvfrom() with your socket in blocking mode, in which case it will not return until the next packet arrives. If the packets are being sent at 10ms intervals, perhaps due to some delay on the sending side, then a loop of blocking recvfrom() calls would appear to take 10ms each. Set the socket to non-blocking and use something like select() to decide when to call it. Then profile everything to see where the real bottleneck lies. My bet would be on one or more of the JNI passthroughs.
Note that recvfrom() is not a "C" function, it is a system call. Java's functions just add layers on top of it.

Can a Non Blocking UDP write return with fewer bytes than requested?

I have an application that sends data point to point from a sender to the receiver over an link that can operate in simplex (one way transmission) or duplex modes (two way). In simplex mode, the application sends data using UDP, and in duplex it uses TCP. Since a write on TCP socket may block, we are using Non Blocking IO (ioctl with FIONBIO - O_NONBLOCK and fcntl are not supported on this distribution) and the select() system call to determine when data can be written. NIO is used so that we can abort out of send early after a timeout if needed should network conditions deteriorate. I'd like to use the same basic code to do the sending but instead change between TCP/UDP at a higher abstraction. This works great for TCP.
However I am concerned about how Non Blocking IO works for a UDP socket. I may be reading the man pages incorrectly, but since write() may return indicating fewer bytes sent than requested, does that mean that a client will receive fewer bytes in its datagram? To send a given buffer of data, multiple writes may be needed, which may be the case since I am using non blocking IO. I am concerned that this will translate into multiple UDP datagrams received by the client.
I am fairly new to socket programming so please forgive me if have some misconceptions here. Thank you.
Assuming a correct (not broken) UDP implementation, then each send/sendmsg/sendto will correspond to exactly one whole datagram sent and each recv/recvmsg/recvfrom will correspond to exactly one whole datagram received.
If a UDP message cannot be transmitted in its entirety, you should receive an EMSGSIZE error. A sent message might still fail due to size at some point in the network, in which case it will simply not arrive. But it will not be delivered in pieces (unless the IP stack is severely buggy).
A good rule of thumb is to keep your UDP payload size to at most 1400 bytes. That is very approximate and leaves a lot of room for various forms of tunneling so as to avoid fragmentation.

can one call of recv() receive data from 2 consecutive send() calls?

i have a client which sends data to a server with 2 consecutive send calls:
send(_sockfd,msg,150,0);
send(_sockfd,msg,150,0);
and the server is receiving when the first send call was sent (let's say i'm using select):
recv(_sockfd,buf,700,0);
note that the buffer i'm receiving is much bigger.
my question is: is there any chance that buf will contain both msgs? of do i need 2 recv() calls to get both msgs?
thank you!
TCP is a stream oriented protocol. Not message / record / chunk oriented. That is, all that is guaranteed is that if you send a stream, the bytes will get to the other side in the order you sent them. There is no provision made by RFC 793 or any other document about the number of segments / packets involved.
This is in stark contrast with UDP. As #R.. correctly said, in UDP an entire message is sent in one operation (notice the change in terminology: message). Try to send a giant message (several times larger than the MTU) with TCP ? It's okay, it will split it for you.
When running on local networks or on localhost you will certainly notice that (generally) one send == one recv. Don't assume that. There are factors that change it dramatically. Among these
Nagle
Underlying MTU
Memory usage (possibly)
Timers
Many others
Of course, not having a correspondence between an a send and a recv is a nuisance and you can't rely on UDP. That is one of the reasons for SCTP. SCTP is a really really interesting protocol and it is message-oriented.
Back to TCP, this is a common nuisance. An equally common solution is this:
Establish that all packets begin with a fixed-length sequence (say 32 bytes)
These 32 bytes contain (possibly among other things) the size of the message that follows
When you read any amount of data from the socket, add the data to a buffer specific for that connection. When 32 bytes are reached, read the length you still need to read until you get the message.
It is really important to notice how there are really no messages on the wire, only bytes. Once you understand it you will have made a giant leap towards writing network applications.
The answer depends on the socket type, but in general, yes it's possible. For TCP it's the norm. For UDP I believe it cannot happen, but I'm not an expert on network protocols/programming.
Yes, it can and often does. There is no way of matching up sends and receive calls when using TCP/IP. Your program logic should test the return values of both send and recv calls in a loop, which terminates when everything has been sent or recieved.

Winsock UDP packets being dropped?

We have a client/server communication system over UDP setup in windows. The problem we are facing is that when the throughput grows, packets are getting dropped. We suspect that this is due to the UDP receive buffer which is continuously being polled causing the buffer to be blocked and dropping any incoming packets. Is it possible that reading this buffer will cause incoming packets to be dropped? If so, what are the options to correct this? The system is written in C. Please let me know if this is too vague and I can try to provide more info. Thanks!
The default socket buffer size in Windows sockets is 8k, or 8192 bytes. Use the setsockopt Windows function to increase the size of the buffer (refer to the SO_RCVBUF option).
But beyond that, increasing the size of your receive buffer will only delay the time until packets get dropped again if you are not reading the packets fast enough.
Typically, you want two threads for this kind of situation.
The first thread exists solely to service the socket. In other words, the thread's sole purpose is to read a packet from the socket, add it to some kind of properly-synchronized shared data structure, signal that a packet has been received, and then read the next packet.
The second thread exists to process the received packets. It sits idle until the first thread signals a packet has been received. It then pulls the packet from the properly-synchronized shared data structure and processes it. It then waits to be signaled again.
As a test, try short-circuiting the full processing of your packets and just write a message to the console (or a file) each time a packet has been received. If you can successfully do this without dropping packets, then breaking your functionality into a "receiving" thread and a "processing" thread will help.
Yes, the stack is allowed to drop packets — silently, even — when its buffers get too full. This is part of the nature of UDP, one of the bits of reliability you give up when you switch from TCP. You can either reinvent TCP — poorly — by adding retry logic, ACK packets, and such, or you can switch to something in-between like SCTP.
There are ways to increase the stack's buffer size, but that's largely missing the point. If you aren't reading fast enough to keep buffer space available already, making the buffers larger is only going to put off the time it takes you to run out of buffer space. The proper solution is to make larger buffers within your own code, and move data from the stack's buffers into your program's buffer ASAP, where it can wait to be processed for arbitrarily long times.
Is it possible that reading this buffer will cause incoming packets to be dropped?
Packets can be dropped if they're arriving faster than you read them.
If so, what are the options to correct this?
One option is to change the network protocol: use TCP, or implement some acknowledgement + 'flow control' using UDP.
Otherwise you need to see why you're not reading fast/often enough.
If the CPU is 100% utilitized then you need to do less work per packet or get a faster CPU (or use multithreading and more CPUs if you aren't already).
If the CPU is not 100%, then perhaps what's happening is:
You read a packet
You do some work, which takes x msec of real-time, some of which is spent blocked on some other I/O (so the CPU isn't busy, but it's not being used to read another packet)
During those x msec, a flood of packets arrive and some are dropped
A cure for this would be to change the threading.
Another possibility is to do several simultaneous reads from the socket (each of your reads provides a buffer into which a UDP packet can be received).
Another possibility is to see whether there's a (O/S-specific) configuration option to increase the number of received UDP packets which the network stack is willing to buffer until you try to read them.
First step, increase the receiver buffer size, Windows pretty much grants all reasonable size requests.
If that doesn't help, your consume code seems to have some fairly slow areas. I would use threading, e.g. with pthreads and utilize a producer consumer pattern to put the incoming datagram in a queue on another thread and then consume from there, so your receive calls don't block and the buffer does not run full
3rd step, modify your application level protocol, allow for batched packets and batch packets at the sender to reduce UDP header overhead from sending a lot of small packets.
4th step check your network gear, switches, etc. can give you detailed output about their traffic statistics, buffer overflows, etc. - if that is in issue get faster switches or possibly switch out a faulty one
... just fyi, I'm running UDP multicast traffic on our backend continuously at avg. ~30Mbit/sec with peaks a 70Mbit/s and my drop rate is bare nil
Not sure about this, but on windows, its not possible to poll the socket and cause a packet to drop. Windows collects the packets separately from your polling and it shouldn't cause any drops.
i am assuming your using select() to poll the socket ? As far as i know , cant cause a drop.
The packets could be lost due to an increase in unrelated network traffic anywhere along the route, or full receive buffers. To mitigate this, you could increase the receive buffer size in Winsock.
Essentially, UDP is an unreliable protocol in the sense that packet delivery is not guaranteed and no error is returned to the sender on delivery failure. If you are worried about packet loss, it would be best to implement acknowledgment packets into your communication protocol, or to port it to a more reliable protocol like TCP. There really aren't any other truly reliable ways to prevent UDP packet loss.

Resources