How to utilize 100% of the network bandwidth with sockets? - c

I have a server and a client. They are running on different servers, and both servers have two 1000M network adapters.
I am using blocking TCP sockets in both the server and the client.
Server
Once a connection is accepted, a new thread is started to process it. The thread works like:
while (1) {
    recv();   /* receive a char */
    send();   /* send a line */
}
The client just sends a char to the server, and the server sends back a line of text about 200 bytes long.
The line is loaded into memory in advance.
Client
The client uses multiple threads to connect to the server. Once connected, each thread works like:
while (1) {
    send();   /* send a char */
    recv();   /* receive a line */
}
Bandwidth usage
When I use 100 threads in the client (with more threads the result is almost the same), I see this network traffic on the server:
tsar -l -i 1 --traffic
the result:
Time -------------traffic------------
Time bytin bytout pktin pktout
06/09/14-23:12:56 0.00 0.00 0.00 0.00
06/09/14-23:12:57 63.4M 155.3M 954.6K 954.6K
06/09/14-23:12:58 0.00 0.00 0.00 0.00
06/09/14-23:12:59 60.1M 147.3M 905.4K 905.4K
06/09/14-23:13:00 0.00 0.00 0.00 0.00
06/09/14-23:13:01 57.5M 140.8M 866.5K 866.4K
and sar -n DEV 1:
11:20:46 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
11:20:47 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:20:47 PM eth0 478215.05 478217.20 31756.46 77744.95 0.00 0.00 0.00
11:20:47 PM eth1 484318.28 484318.28 32162.05 78724.16 0.00 0.00 1.08
11:20:47 PM bond0 962533.33 962535.48 63918.51 156469.11 0.00 0.00 1.08
Question:
In theory, the max value of (bytin + bytout) could be 256M. How can I achieve that?
Any help would be great, thanks in advance.

In practice there is overhead at several layers. 1 Gbit/s Ethernet does not deliver that full rate to the application (I would guess at most 90% of it). A rule of thumb is to send or recv fairly large chunks (several kilobytes at least); sending or receiving a few hundred bytes at a time is inefficient. The question is also surely OS specific (I am assuming Linux).
Recall that by definition TCP is not a transmission of packets, but of a stream of bytes. Read the TCP wikipage. You should avoid send-ing or recv-ing a few bytes, or even a hundred of them; try to send thousands of bytes each time. Of course, a single recv on the receiving side does not (in general) correspond to a single send on the sending side, and vice versa (especially if there are routers between the two computers; routers can split or coalesce network packets, so you cannot count on one recv on the receiver per send on the sender).
Gigabit Ethernet favors jumbo frames of nearly 9000 bytes. You probably want your send buffer to be a little below that (because of the various IP and TCP overheads), so try 8 Kbytes.
The send(2) man page mentions the MSG_MORE flag for tcp(7). You could use it with care.
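For illustration only, a tiny sketch of what that could look like; fd, hdr, body and the lengths are placeholders, not anything from the question:

#include <stddef.h>
#include <sys/socket.h>

/* Sketch: MSG_MORE tells the kernel more data follows, letting two
 * small writes leave as one TCP segment instead of two. */
static void send_two_parts(int fd, const char *hdr, size_t hlen,
                           const char *body, size_t blen)
{
    send(fd, hdr, hlen, MSG_MORE);   /* kernel may hold this back ...        */
    send(fd, body, blen, 0);         /* ... until this send without the flag */
}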
System calls (see syscalls(2)) also have overhead; I'm surprised you are able to make a million of them each second. That overhead is another reason to buffer both outgoing and incoming data in sizeable pieces (e.g. 8192, 16384, or 32768 bytes each; benchmark to find the best size). And I wouldn't be surprised if the kernel prefers page-aligned data, so perhaps align your buffer to 4096 bytes (e.g. using mmap(2) or posix_memalign(3)...)
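As a rough illustration of both points (the sizes, the names and the send_all helper are illustrative, not taken from the question):

#include <stdlib.h>
#include <sys/socket.h>

/* Sketch: a page-aligned 8 KB staging buffer, filled with many ~200-byte
 * lines and flushed with one send() per batch instead of one per line. */
enum { STAGE_ALIGN = 4096, STAGE_SIZE = 8192 };

static char *make_stage(void)
{
    void *p = NULL;
    return posix_memalign(&p, STAGE_ALIGN, STAGE_SIZE) == 0 ? p : NULL;
}

static int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);
        if (n <= 0)
            return -1;            /* error or peer closed; caller decides */
        buf += n;                 /* handle short sends by advancing      */
        len -= (size_t)n;
    }
    return 0;
}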
If you care about performance, don't call send(2) with a small byte count. Change your application to send at least a few kilobytes (e.g. 4 Kbytes) per send syscall, and for recv(2) pass a buffer of at least 4 kilobytes. Sending or receiving a single byte, or a line of a hundred bytes, is inefficient. Your application should buffer such data (and perhaps split it into "application messages"...). There are libraries that do this (like 0MQ...), or you can at least terminate each message with a delimiter (a newline, perhaps), which makes it easy to split a received buffer into several incoming application messages.
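A possible sketch of that delimiter approach, assuming a newline terminates each message (the buffer size and the handle_message callback are illustrative):

#include <string.h>
#include <sys/socket.h>

/* Sketch: accumulate recv() data and hand out complete newline-terminated
 * messages; whatever is left over is kept for the next recv(). */
static char acc[65536];
static size_t acc_len;

static void read_messages(int fd, void (*handle_message)(const char *, size_t))
{
    ssize_t n = recv(fd, acc + acc_len, sizeof(acc) - acc_len, 0);
    if (n <= 0)
        return;                                   /* error or connection closed */
    acc_len += (size_t)n;

    char *start = acc, *nl;
    while ((nl = memchr(start, '\n', acc_len - (size_t)(start - acc))) != NULL) {
        handle_message(start, (size_t)(nl - start));   /* one full message */
        start = nl + 1;
    }
    acc_len -= (size_t)(start - acc);             /* keep the partial tail */
    memmove(acc, start, acc_len);
}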
My feeling is that your application is inefficient and buggy (it would probably behave badly on other networks, e.g. with routers between the two computers). You need to redesign and recode parts of it: buffer the data, and manage application messages, splitting and joining them as needed.
You should test your application on several networks, in particular through ADSL and Wi-Fi and, if possible, over long-distance links (you'll then observe that send and recv calls do not "match").

According to my math, you are relatively close to saturating the link.
As I understand it, this is one second of traffic.
Time -------------traffic------------
Time bytin bytout pktin pktout
06/09/14-23:12:57 63.4M 155.3M 954.6K 954.6K
A TCP packet sent over Ethernet has 82 bytes of overhead (42 Ethernet, 20 IP, 20 TCP), so the amount of data received is (954.6k * 82 + 63.4M) * 8 bits, which totals about 1.1 Gbit.
I would assume that with such a large number of packets there is additional overhead involved in negotiating the physical medium. Since the links are at about 50% utilization, an additional per-packet delay as small as (1 s / 954.6k) * 50% ≈ 500 ns (half a microsecond!) would account for the gap. 500 ns is the time it takes light to travel 150 meters, which isn't that far.

Related

How to understand where the UDP packets are dropped in my ubuntu C program

I am using a Python tool to generate packets on one VM and capture them in my program, which runs as a Linux process on another VM. Both VMs run Ubuntu and are on the same subnet. I notice that some of the packets never make it to my program. What is the best tool to find out where the packets are dropped?
I see that RcvbufErrors in the netstat output increases as I send new packets.
# netstat -us
IcmpMsg:
InType0: 14
InType3: 1493
InType5: 204
InType8: 54
InType13: 5
InType17: 5
OutType0: 54
OutType3: 645946
OutType8: 584
OutType14: 5
Udp:
7686124 packets received
646545 packets to unknown port received.
33928069 packet receive errors
7157259 packets sent
RcvbufErrors: 33928069
IgnoredMulti: 345772
UdpLite:
IpExt:
InMcastPkts: 4
InBcastPkts: 363522
InOctets: 13243409806
OutOctets: 8445992434
InMcastOctets: 144
InBcastOctets: 114457552
InNoECTPkts: 100191590
InECT0Pkts: 143
Well, if you think you need to find out where some packets get dropped, this might not be useful, since you already know that at least the RcvbufErrors figure is growing. This means that the NIC is able to deliver the packets to the kernel, but the kernel is unable to deliver them to the application, obviously because a fixed-size receive buffer fills up faster than the application can read the data out (or just flush it).
So, I'd say such a result just gives an impression of poor application design. Perhaps you should consider a technique to speed up packet capture in your application so that more packets are captured/examined/dropped per second. A good example is the PACKET_MMAP Rx ring, which is described exhaustively in the kernel documentation and is widely used in libpcap-based applications. The latter aspect makes it indeed reasonable to use wireshark or just tcpdump instead of your hand-made app to inspect the real capture rate, etc.
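For reference, a bare-bones TPACKET_V1 PACKET_MMAP setup might look roughly like this (ring geometry is illustrative, error handling is omitted, and it needs root/CAP_NET_RAW):

#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* PACKET_RX_RING, tpacket_req, tpacket_hdr */
#include <poll.h>
#include <sys/mman.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    struct tpacket_req req = {
        .tp_block_size = 4096,   /* multiple of the page size */
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,   /* two frames per block      */
        .tp_frame_nr   = 128,    /* block_size * block_nr / frame_size */
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    unsigned char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    for (unsigned i = 0; ; i = (i + 1) % req.tp_frame_nr) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)i * req.tp_frame_size);
        while (!(hdr->tp_status & TP_STATUS_USER)) {     /* frame not ready yet */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
        }
        /* packet bytes start at (char *)hdr + hdr->tp_mac, length hdr->tp_snaplen */
        hdr->tp_status = TP_STATUS_KERNEL;               /* hand the frame back */
    }
}

The point of the ring is that packets land in memory shared with the kernel, so the application only pays one poll() per burst instead of one syscall per packet.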

Receiving RAW socket packets with microseconds level accuracy

I am writing code that receives raw Ethernet packets (no TCP/UDP) from the server every 1 ms. For every packet received, my application has to reply with 14 raw packets. If the server doesn't receive the 14 packets before it sends its next packet (scheduled every 1 ms), it raises an alarm and the application has to break out. The server-client communication is a one-to-one link.
The server is hardware (an FPGA) which generates packets at a precise 1 ms interval. The client application runs on a Linux (RHEL/CentOS 7) machine with a 10G SolarFlare NIC.
My first version of the code looks like this:
while (1)
{
    while (1)
    {
        numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
        if (numbytes > 0)
        {
            // Some more lines here, to read packet number
            break;
        }
    }
    for (i = 0; i < 14; i++)
    {
        if (sendto(sockfd, (void *)(sym), sizeof(sym), 0, NULL, NULL) < 0)
            perror("Send failed\n");
    }
}
I measure the receive time by taking timestamps (using clock_gettime) before and after the recvfrom call; I compute the difference and print it whenever it falls outside the allowable range of 900-1100 us.
The problem I am facing is that the packet receive time fluctuates, something like this (the prints are in microseconds):
Decode Time : 1234
Decode Time : 762
Decode Time : 1593
Decode Time : 406
Decode Time : 1703
Decode Time : 257
Decode Time : 1493
Decode Time : 514
and so on..
And sometimes the decode times exceed 2000 us and the application breaks.
In this situation, the application breaks anywhere between 2 seconds and a few minutes after starting.
Options I have tried so far:
Setting affinity to a particular isolated core.
Setting scheduling priorities to maximum with SCHED_FIFO
Increase socket buffer sizes
Setting network interface interrupt affinity to the same core which processes application
Spinning over recvfrom using poll() or select() calls.
All these options give a significant improvement over the initial version of the code; now the application runs for about 1-2 hours. But this is still not enough.
A few observations:
I get a huge dump of these decode-time prints whenever I open an ssh session to the Linux machine while the application is running (which makes me think network traffic over the other 1G Ethernet interface interferes with the 10G Ethernet interface).
The application performs better on RHEL (run times of about 2-3 hours) than on CentOS (run times of about 30 minutes to 1.5 hours).
The run times also vary across Linux machines with different hardware configurations running the same OS.
Please suggest if there are any other methods to improve the run-time of the application.
Thanks in advance.
First, you need to verify the accuracy of the timestamping method, clock_gettime. Its resolution is nanoseconds, but its accuracy and precision are in question. That is not the answer to your problem, but it tells you how reliable the timestamping is before proceeding. See Difference between CLOCK_REALTIME and CLOCK_MONOTONIC? for why CLOCK_MONOTONIC should be used for your application.
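For example, the timing pattern with CLOCK_MONOTONIC might look like this (a sketch; the 900-1100 us window is the one from the question):

#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

static long elapsed_us(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000L +
           (b->tv_nsec - a->tv_nsec) / 1000L;
}

/* Sketch: time one blocking recvfrom() with CLOCK_MONOTONIC (not
 * CLOCK_REALTIME, which can jump) and flag out-of-window iterations. */
static ssize_t timed_receive(int sockfd, char *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = recvfrom(sockfd, buf, len, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long us = elapsed_us(&t0, &t1);
    if (us < 900 || us > 1100)
        printf("Decode Time : %ld\n", us);
    return n;
}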
I suspect the majority of the decode-time fluctuation is due to either a variable number of operations per decode, operating-system context switches, or IRQs.
I cannot comment on the operations per decode, since the code in your post has been simplified; that part can also be profiled and inspected.
Context switching per process can be easily inspected and monitored https://unix.stackexchange.com/a/84345
As Ron stated, these are very strict timing requirements for a network. It must be an isolated, single-purpose network. Your observation that decodes run over time when you ssh in indicates that all other traffic must be prevented. This is disturbing given the separate NICs, so I suspect IRQs are the issue. See /proc/interrupts.
To achieve consistent decode times over long intervals (hours to days) you will need to drastically simplify the OS: remove unnecessary processes and services, remove hardware, and perhaps build your own kernel, all with the goal of reducing context switches and interrupts. At that point a real-time OS should be considered. This will only improve the probability of consistent decode times, not guarantee them.
My work is developing a data acquisition system that combines an FPGA ADC, a PC, and Ethernet. Inevitably, the inconsistency of a multi-purpose PC means certain features must be moved to dedicated hardware. Consider the pros and cons of developing your application for a PC versus moving it to hardware.

UDP sendto performance over loopback

Background
I have a very high-throughput / low-latency network app (the goal is << 5 usec per packet) and I wanted to add some monitoring/metrics to it. I have heard about the statsd craze, and it seems a simple way to collect metrics and feed them into our time-series database. Sending metrics is done via a small UDP packet written to a daemon (typically running on the same server).
I wanted to characterize the effect of sending ~5-10 UDP packets in my data path to understand how much latency it would add, and was surprised at how bad it is. I know this is a very obscure micro-benchmark, but I just wanted to get a rough idea of where it lands.
The question I have
I am trying to understand why it takes so long (relatively speaking) to send a UDP packet to localhost versus a remote host. Are there any tweaks I can make to reduce the latency of sending a UDP packet? I am thinking the solution for me is to push metric collection to an auxiliary core, or to actually run the statsd daemon on a separate host.
My setup/benchmarks
CentOS 6.5 with some beefy server hardware.
The client test program I have been using is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3 gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (the IP changes per interface).
I have run this micro-benchmark with the following devices.
Some benchmark results:
loopback
Packet Size 500 // Time per sendto() 2159 nanosec // Total time 2.159518
integrated 1 Gb mobo controller
Packet Size 500 // Time per sendto() 397 nanosec // Total time 0.397234
intel ixgbe 10 Gb
Packet Size 500 // Time per sendto() 449 nanosec // Total time 0.449355
solarflare 10 Gb with userspace stack (onload)
Packet Size 500 // Time per sendto() 317 nanosec // Total time 0.317229
Writing to loopback is not an efficient way to communicate inter-process for profiling. Generally the buffer will be copied multiple times before it's processed, and you run the risk of dropping packets since you're using UDP. You're also making additional calls into the operating system, so you add the risk of a context switch (~2 us).
goal is << 5 usec per packet
Is this a hard real-time requirement, or a soft requirement? Generally when you're handling things in microseconds, profiling should have zero overhead. You're using Solarflare, so I think you're serious. The best way I know to do this is to tap into the physical line and sniff traffic for metrics; a number of products do this.
I/O to disk or the network is very slow if you incorporate it in a very tight (real-time) processing loop. A solution might be to offload the I/O to a separate lower-priority task. Let the real-time loop pass the messages to the I/O task through a queue (ideally lock-free).
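One common shape for that hand-off is a single-producer/single-consumer ring; a rough sketch with illustrative sizes and names (the real-time thread calls metrics_push, the low-priority I/O thread drains with metrics_pop and does the sendto calls):

#include <stdatomic.h>
#include <string.h>

/* Sketch: lock-free SPSC ring of fixed-size metric records. */
#define RING_SLOTS 1024           /* power of two, illustrative */
#define RECORD_LEN 64

static char ring[RING_SLOTS][RECORD_LEN];
static _Atomic unsigned head;     /* advanced only by the producer */
static _Atomic unsigned tail;     /* advanced only by the consumer */

static int metrics_push(const char *rec, size_t len)   /* real-time thread */
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == RING_SLOTS)
        return -1;                                     /* ring full: drop */
    memcpy(ring[h % RING_SLOTS], rec, len < RECORD_LEN ? len : RECORD_LEN);
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 0;
}

static int metrics_pop(char *out)                      /* I/O thread */
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t == h)
        return 0;                                      /* nothing queued */
    memcpy(out, ring[t % RING_SLOTS], RECORD_LEN);
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return 1;
}

Note the push side drops on a full ring rather than blocking, which is usually the right trade-off for metrics in a latency-critical loop.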

Different performance between sendto and recvfrom

I have noticed that there is a performance difference between sendto and recvfrom (UDP). I send about 100 Kbytes from a server to a client over WiFi (the estimated bandwidth is about 30 Mb/s in both directions), and the sending time is about 4-5 ms (it varies, but it's comparable to the ideal value of 3 ms). On the client, the receiving time is ten to fifteen times higher, around 50-60 ms. I'd like the two elapsed times to be quite similar. Any idea?
and the sending time is about 4-5 ms (it depends, but this value is comparable to the ideal one, 3ms)
30 Mb/s (where the b means bits) is approximately (give or take, to account for headers etc.) 3 MB/s (where the B means bytes). It should therefore take roughly 30 milliseconds to transmit 100 Kbytes.
The sendto returns as soon as it has written all the data into the local buffer of the sending machine's network stack. The recv obviously has to wait for the data to actually be transmitted, including the latency and the work needed in all the protocol layers.

How to test the speed for Socket?

I have written a program which forwards IP packets between two servers. How can I test the speed of the program? Thanks!
There are a number of communication metrics that may be of interest to your potential users.
Latency is the amount of time to send a message, usually quoted in microseconds for co-located devices and in milliseconds for all other scenarios. It is usually quoted as the "zero-byte latency", meaning the time required to transmit only the metadata of a message. Lower is better.
Bandwidth is measured in bits per second. It is often quoted as "peak bandwidth" and can be obtained by sending a massive amount of data over the line. Higher is better.
CPU utilization is the percent of CPU time required to transmit a message. Network protocols that can offload a message's transmission have low utilization, which means that the communication can "overlap" some other computation in the user's application, which has the effect of hiding latency. Lower is better.
All of these are measured simply by a variation of the ping test, usually called the "ping-pong":
Node 1:
    for n = 1 to MAXSIZE, stepping via n *= 2
        send a message of n bytes
        receive a response of n bytes

Node 2:
    for n = 1 to MAXSIZE, stepping via n *= 2
        receive a message of n bytes
        send a response of n bytes
There's also a "ping-ping" test, in which both nodes write to each other at the same time. This requires non-blocking communication to set up.
Just output n and the time required for each iteration. The first time is the zero-byte latency. The largest sustainable n/time is the bandwidth (convert it to bits per second to match industry convention). You can also measure the CPU utilization required to run the larger iterations, but that's a tricky topic for a whole different question.
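A minimal C sketch of the node-1 side of that loop (illustrative; it assumes fd is an already-connected TCP socket):

#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

#define MAXSIZE (1 << 20)   /* illustrative upper bound */

static int send_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t k = send(fd, p, n, 0);
        if (k <= 0) return -1;
        p += k; n -= (size_t)k;
    }
    return 0;
}

static int recv_all(int fd, char *p, size_t n)
{
    while (n > 0) {
        ssize_t k = recv(fd, p, n, 0);
        if (k <= 0) return -1;
        p += k; n -= (size_t)k;
    }
    return 0;
}

/* Node-1 ping-pong: send n bytes, wait for n bytes back, time the round trip. */
static void ping_pong(int fd)
{
    static char buf[MAXSIZE];
    for (size_t n = 1; n <= MAXSIZE; n *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        send_all(fd, buf, n);              /* message of n bytes  */
        recv_all(fd, buf, n);              /* response of n bytes */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* n / (sec / 2) approximates the one-way bandwidth for large n */
        printf("%zu bytes: %.6f s round trip\n", n, sec);
    }
}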
Take a look at iperf. You can find it at http://sourceforge.net/projects/iperf/. If you google around you will find tutorials for it. You can look at the source and might get some good ideas of how the author does it. I use it for routine testing and it is quite robust.
