I am developing my own protocol over UDP (under Linux) for a cache application (similar to memcached) which only executes INSERT/READ/UPDATE/DELETE operations on an object, and I am not sure which design would be best:
Send one request per packet. (client prepares the request and sends it to the server immediately)
Send multiple requests per packet. (client enqueues the requests in a packet and when it is full (close to the MTU size) sends it to the server)
The size of a request (i.e. the record data) can be anywhere from 32 bytes to 1400 bytes. I don't know what the average will be; it depends entirely on the user's application.
If I choose a single request per packet, I will have to manage a lot of small packets and the kernel will be interrupted many times. This will slow things down, since the kernel must save registers when switching between user space and kernel space. There will also be transmission overhead: if the user's application sends many 32-byte requests (the UDP/IP overhead is about 28 bytes per packet), network traffic will roughly double and transmission speed will suffer. However, high network traffic does not necessarily imply low performance, since the NIC has its own processor and does not make the CPU stall. An additional network card can be installed if the network becomes the bottleneck.
The big advantage of a single request per packet is that the server and client will be so simple that I will save on instructions and gain speed; at the same time I will have fewer bugs and the project will be finished earlier.
If I use multiple requests per packet, I will have fewer but bigger packets, so more useful data can be transmitted over the network. The number of system calls will drop, but the added complexity of the server will require more memory and more instructions, so it is unclear whether execution ends up faster this way. The CPU might become the bottleneck, but which is cheaper to add: a CPU or a network card?
The application will be under heavy load, around 100,000 requests per second on the latest CPUs. I am not sure which way to go. I am leaning towards 'single request per packet', but before I rewrite all the code I already wrote for multiple-request handling, I would like to ask for recommendations.
Thanks in advance.
What do you care about more: latency or bandwidth?
If latency, send the request as soon as possible even if that means a lot of "slack" at the ends of packets and more packets overall.
If bandwidth, bundle multiple requests to eliminate the "slack" and send fewer packets overall.
NOTE: The network, not the CPU, will likely be your major bottleneck in either case, unless you are running over an extremely fast network. And even then, the INSERT/READ/UPDATE/DELETE work in the database will likely cost more CPU and I/O than the packet handling itself.
Sending multiple requests per packet involves another trade-off:
On the one hand, the unreliable nature of UDP may cause you to drop multiple requests at a time, making re-transmissions more expensive.
On the other hand, the kernel will be using fewer buffers to deliver your data, reducing the chance of drops.
However, the analysis is incomplete without an understanding of the deployment architecture, such as the buffer sizes of the NICs, switches, routers, and other networking hardware.
But the recommendation is to start with a relatively simple implementation (single request per packet), and to write the code in such a way that adding complexity later will not be too difficult.
Related
I'm developing a high-performance web server that should handle ~2k simultaneous connections and 40k QPS, with a response time < 7 ms.
What it does is query a Redis server (running on the same host) and return the response to the client.
During testing, I observed that the implementation using TCP stream sockets performs much better than connecting over Unix sockets. With ~1500 connections, TCP stays at about 8 ms while Unix sockets climb to ~50 ms.
The server is written in C and based on a fixed POSIX thread pool; I use a blocking connection to Redis. My OS is CentOS 6, and the tests were performed with JMeter, wrk, and ab.
For the connection to Redis I use the hiredis library, which provides both ways of connecting.
As far as I know, a Unix socket should be at least as fast as TCP.
Does somebody have any idea what could cause such a behaviour?
Unix domain sockets are generally faster than TCP sockets over the loopback interface: on average about 2 microseconds of latency, versus about 6 microseconds for TCP.
If I run redis-benchmark with defaults (no pipelining), I see 160k requests per second, basically because the single-threaded redis-server is limited by the TCP socket: 160k requests at an average response time of 6 microseconds.
Redis achieves 320k SET/GET requests per second when using Unix domain sockets.
But there is a limit, which in fact we, at Torusware, have reached with our product Speedus, a high-performance TCP socket implementation with an average latency of 200 nanoseconds (ping us at info#torusware.com to request the Extreme Performance version). With almost zero latency, we see redis-benchmark achieve around 500k requests per second. So we can say that redis-server latency is around 2 microseconds per request on average.
If you want to answer as soon as possible and your load is below the peak redis-server performance, then avoiding pipelining is probably the best option. However, if you want to handle higher throughput, you can pipeline requests: each response takes a bit longer, but you can process more requests on the same hardware.
Thus, in the previous scenario, with a pipeline of 32 requests (buffering 32 requests before writing to the socket) you could process up to 1 million requests per second over the loopback interface. This is also the scenario where the UDS benefit is smaller, mainly because handling the pipelining itself becomes the bottleneck. In fact, 1M requests per second with a pipeline of 32 is around 31k "actual" requests per second through the socket, and we have seen that redis-server can handle 160k requests per second.
With pipelining, Unix domain sockets handle around 1.1M and 1.7M SET/GET requests per second, respectively, while TCP loopback handles 1M and 1.5M.
With pipelining, the bottleneck moves from the transport protocol to the pipeline handling.
This is in line with the information on the redis-benchmark page.
However, pipelining dramatically increases response time. With no pipelining, 100% of operations generally complete in under 1 millisecond; when pipelining 32 requests, the maximum response time is 4 milliseconds on a high-performance server, and tens of milliseconds if redis-server runs on a different machine or in a virtual machine.
So you have to trade off response time against maximum throughput.
Although this is an old question, I would like to make an addition. Other answers talk about 500k or even 1.7M responses/s. This may be achievable with Redis, but the question was:
Client --#Network#--> Webserver --#Something#--> Redis
The webserver functions as a sort of HTTP proxy in front of Redis, I assume.
This means that your request rate is also limited by how many requests the webserver itself can handle.
A limitation that is often forgotten: if you have a 100 Mbit connection, you have 100,000,000 bits per second at your disposal, but by default in frames of 1518 bits (including the required gap after the frame), i.e. roughly 190 bytes each. That means about 65k network frames per second, assuming all your responses fit in the data part of such a frame and none have to be resent due to CRC errors or lost packets.
Also, if persistent connections are not used, you need a TCP handshake for each connection, which adds 3 packets per request (two received, one sent). In that unoptimised situation, you are left with about 21k obtainable requests per second to your webserver (or 210k on a 'perfect' gigabit connection), if your response fits in one packet of 175 bytes.
So:
Persistent connections only cost a bit of memory, so enable them. In the best case this can quadruple your performance.
Reduce your response size with gzip/deflate if needed, so responses fit in as few packets as possible. (Each packet lost is a possible response lost.)
Reduce your response size by stripping unneeded 'garbage' like debug data or long XML tags.
HTTPS connections add a huge overhead (in this comparison).
Add network cards and trunk them.
If responses are always smaller than 175 bytes, use a dedicated network card for this service and reduce the network frame size to increase the number of frames sent per second.
Don't let the server do other things (like serving normal web pages).
...
I am developing a UDP client module on Solaris using C, and there are two possible designs:
(1) Create one socket and send all messages through it. The receive thread simply calls recvfrom on this socket.
(2) Create a group of sockets. When sending a message, select a socket randomly from the pool. The receive thread has to call poll or select on the whole group.
When the throughput is low, I think the first design is fine.
If the throughput is high, I am wondering whether the second design could be better,
because it spreads messages across a group of sockets, which might improve the UDP datagram delivery rate and be more efficient.
There's still only one network. You can have as many sockets, threads, whatever, as you like. The rate-determining step is the network. There is no point to this.
The question here primarily depends on how parallel the computer is (number of cores) and how parallel the algorithm is. Most likely your CPU cores are vastly faster than the network connection, and even one of them could easily saturate it. Thus, on a typical system, option (1) will give significantly better performance and lower drop rates.
This is because there is significant overhead in using a UDP port from several threads or processes: the internal locking the OS has to do to ensure packets are not multiplexed and corrupted costs performance and significantly increases the chance of packet loss, as the kernel gives up waiting on other threads and simply throws your pending packets away.
In the extreme case where your cores are very slow and your connection extremely fast (say, a 500-core supercomputer with a 10-100 Gbit fibre connection), option two could become more feasible: locking would be less frequent, because the connection is fast enough to keep many cores busy without them tripping over each other. This will not increase reliability (and may slightly decrease it), but it might increase throughput, depending on your architecture.
Overall, in nearly every case I would suggest option 1. If you really do have an extreme throughput situation, look into other methods; and if you are writing software for that kind of system, you would probably benefit from some more general training in massively parallel systems.
I hope this helps, if you have any queries please leave a comment.
I'm trying to learn UDP, and make a simple file transferring server and client.
I know TCP would potentially be better, because it has some reliability built in. However I would like to implement some basic reliability code myself.
I've decided to try and identify when packets are lost, and resend them.
What I've implemented is a system where the server will send the client a certain file in 10 byte chunks. After it sends each chunk, it waits for an acknowledgement. If it doesn't receive one in a few seconds time, it sends the chunk again.
My question is: how can a file transfer like this be done quickly? If you send a file and, let's say, there's a 25% chance any packet could be lost, then a lot of time is spent waiting for ACKs.
Is there some way around this? Or is it accepted that with high packet loss it will take a very long time? What's an accepted timeout value for the acknowledgement?
Thanks!
There are many questions in your post; I will try to address some. The main thing is to benchmark and find the bottleneck: what is the slowest operation?
I can tell you now that the bottleneck in your approach is waiting for an ACK after each chunk. Instead of acknowledging chunks, you want to acknowledge sequences. The second biggest problem is the ridiculously small chunk size: at that size there is more overhead than actual data (look up the header sizes for IP and UDP).
In conclusion:
What I've implemented is a system where the server will send the client a certain file in 10 byte chunks.
You might want to try a few hundred bytes chunks.
After it sends each chunk, it waits for an acknowledgement.
Send more chunks before requiring an acknowledgement, and label them. There is more than one way:
Instead of acknowledging chunks, acknowledge data: "I've received 5000 bytes" (TCP, traditional)
Acknowledge multiple chunks in one message: "I've received chunks 1, 5, 7, 9" (TCP with SACK)
What you've implemented is Stop-and-wait ARQ. In a high-latency network, it will inevitably be slower than some other more complex options, because it waits for a full cycle on each transmission.
For other possibilities, see Sliding Window and follow links to other variants. What you've got is basically a degenerate form of sliding window with window-size 1.
As other answers have noted, this will involve adding sequence numbers to your packets, sending additional packets while waiting for acknowledgement, and retransmitting on a more complex pattern.
If you do this, you are essentially reinventing TCP, which uses these tactics to supply a reliable connection.
You want some kind of packet numbering, so that the client can detect a lost packet by the missing number in the sequence of received packets. Then the client can request a resend of the packets it knows it is missing.
Example:
Server sends packet 1,2,3,4,5 to client. Client receives 1,4,5, so it knows 2 and 3 were lost. So client acks 1,4 and 5 and requests resend of 2 and 3.
Then you still need to work out how to handle acks / requests for resends, etc. In any case, assigning a sequence of consecutive numbers to the packets so that packet loss can be detected by "gaps" in the sequence is a decent approach to this problem.
Your question exactly describes one of the problems that TCP tries to answer. TCP's answer is particularly elegant and parsimonious, imo, so reading an English-language description of TCP might reward you.
Just to give you a ballpark idea of UDP in the real world: SNMP is a network-management protocol meant to operate over UDP. SNMP requests (around 1500 payload bytes) sent by a manager to a managed node are never explicitly acknowledged, and it works pretty well. Twenty-five percent packet loss is a huge number -- real-life packet loss is an order of magnitude smaller, at worst -- and in that broken an environment SNMP would hardly work at all. Certainly a human being operating the network management system -- the NMS -- would be on the phone to network hardware support very quickly.
When we use SNMP, we generally understand that a good value for timeout is three or four seconds, meaning that the SNMP agent in the managed network node will probably have completed its work in that time.
HTH
Have a look at the TFTP protocol. It is a UDP-based file transfer protocol with built-in ack/resend provisions.
I am writing a simple multi-drop RS485 protocol for serial communications within a distributed system. I am using an addressable model where slave devices are given a window of 20 ms to respond. The master uC polls the connected devices for updates and they respond accordingly. I've employed checksums and take the necessary overrun precautions to ensure that connected devices will not respond to malformed messages. This method has proved effective in approximately 99% of situations, but I lose the packet if a new device is introduced during a communication session. Plugging in a new device "hot" will have negative effects on the signal being monitored by the slave devices, if only for an extremely short time. I'm on the software side of engineering, but how can I mitigate this situation without trying to recreate TCP? We use a polling model because it is fast and does the job well for our application, with no need for RTOS functionality. I have an abundance of cycles on each CPU, so think in basic terms.
Sending packets over RS485 is not reliable communication; you will have to handle lost packets anyway. Of course, you won't have to reinvent TCP, but you will have to detect lost packets by means of timeout monitoring and sequence numbers. In simple applications this can be done at the application level, which keeps you far from the complexity of TCP. Since your polling model already discards all packets with an invalid checksum, this should integrate with little effort.
If you want to check for collisions, which can be caused by hot plugs or misbehaving devices, there are possible improvements. Some hardware allows you to read back your own transmission: if you find a difference between the sent and the received data, you can assume a collision and repeat the packet. This will also require some kind of sequence numbering.
Perhaps I've missed something in your question, but can't you just write the master so that if a response isn't seen from a device within the allowed time, it re-polls that device?
I'm working on a project with two clients, one sending and the other receiving UDP datagrams, between two machines wired directly to each other.
Each datagram is 1024 bytes in size and is sent using Winsock (blocking).
Both run on very fast, separate machines with 16 GB RAM, 8 CPUs, and RAID 0 drives.
I'm looking for tips to maximize my throughput. Tips should be at the Winsock level, but if you have other tips, that would be great too.
Currently I'm getting 250-400 Mbit transfer speed; I'm looking for more.
Thanks.
Since I don't know what else besides sending and receiving that your applications do it's difficult to know what else might be limiting it, but here's a few things to try. I'm assuming that you're using IPv4, and I'm not a Windows programmer.
Maximize the packet size you send. For 100 Mbit/s Ethernet the maximum frame is 1518 bytes; Ethernet uses 18 of that, IPv4 uses 20-64 (usually 20, though), and UDP uses 8 bytes. That means you should typically be able to send 1472 bytes of UDP payload per packet.
If you are using gigabit Ethernet equipment that supports it, the frame size increases to 9000 bytes (jumbo frames), so sending something closer to that size should speed things up.
If you are sending any acknowledgments from your listener to your sender then try to make sure that they are sent rarely and can acknowledge more than just one packet at a time. Try to keep the listener from having to say much, and try to keep the sender from having to wait on the listener for permission to keep sending.
On the computer that the sender application lives on consider setting up a static ARP entry for the computer that the receiver lives on. Without this every few seconds there may be a pause while a new ARP request is made to make sure that the ARP cache is up to date. Some ARP implementations may do this request well before the ARP entry expires, which would decrease the impact, but some do not.
Turn off as many users of the network as possible. If you are using an Ethernet switch, concentrate on quieting the things that introduce traffic to or from the computers and network devices your applications use (this includes broadcast messages, like many ARP requests). If it's a hub, you may want to quiet down the entire network. Windows tends to send a constant stream of junk onto networks, which in many cases isn't useful.
There may be limits on how much of the network bandwidth one application or user can have, or on how much network bandwidth the OS will let itself use. These can probably be changed in the registry if they exist.
It is not uncommon for network interface chips to not actually support the maximum bandwidth of the network at all times. There are chips that may miss packets because they are busy handling a previous packet, as well as some that just can't send packets as close together as the Ethernet specification would allow. Additionally, the rest of the system might not be able to keep up even if the NIC can.
Some things to look at:
Connected UDP sockets (some info) shortcut several operations in the kernel, so they are faster (see Stevens' UNP book for details).
Socket send and receive buffers: play with the SO_SNDBUF and SO_RCVBUF socket options to smooth out spikes and reduce packet drops.
See if you can bump up the link MTU and use jumbo frames.
Use a 1 Gbps network and upgrade your network hardware.
Test the packet limit of your hardware with an already proven piece of code such as iperf:
http://www.noc.ucf.edu/Tools/Iperf/
I'm linking a Windows build; it might be a good idea to boot off a Linux LiveCD and try a Linux build for a comparison of IP stacks.
More likely your NIC isn't performing well, try an Intel Gigabit Server Adapter:
http://www.intel.com/network/connectivity/products/server_adapters.htm
For TCP connections it has been shown that using multiple parallel connections better utilizes the link. I'm not sure whether that applies to UDP, but it might help with some of the latency issues of packet processing.
So you might want to try multiple threads of blocking calls.
As well as Nikolai's suggestion of send and receive buffers, if you can, switch to overlapped I/O and keep many receives pending; this also helps minimise the number of datagrams dropped by the stack due to lack of buffer space.
If you're looking for reliable data transfer, consider UDT.