Libevent: multithreading to handle HTTP keep-alive connections

Libevent: multithreading to handle HTTP keep-alive connections - c

I am writing an HTTP reverse-proxy in C using Libevent and I would like to implement multithreading to make use of all available CPU cores. I had a look at this example: http://roncemer.com/software-development/multi-threaded-libevent-server-example/
In this example it appears that one thread is used for the full duration of a connection, but for HTTP 1.1 I don't think this will be the most effective solution as connections are kept alive by default after each request so that they can be reused later. I have noticed that even one browser panel can open several connections to one server and keep them open until the tab is closed which would immediately exhaust the thread pool. For an HTTP 1.1 proxy there will be many open connections but only very few of them actively transferring data at a given moment.
So I was thinking of an alternative, to have one event base for all incoming connections and have the event callback functions delegate to worker threads. This way we could have many open connections and make use of a thread only when data arrives on a connection, returning it back to the pool once the data has been dealt with.
My question is: is this a suitable implementation of threads with Libevent?
Specifically – is there any need to have one event base per connection as in the example or is one for all connections sufficient?
Also – are there any other issues I should be aware of?
Currently the only problem I can see is with burstiness, when data is received in many small chunks triggering many read events per HTTP response which would lead to a lot of handing-off to worker threads. Would this be a problem? If it would be, then it could be somewhat negated using Libevent's watermarking, although I'm not sure how that works if a request arrives in two chunks and the second chunk is sufficiently small to leave the buffer size below the watermark. Would it then stay there until more data arrives?
Also, I would need to implement scheduling so that a chunk is only sent once the previous chunk has been fully sent.
The second problem I thought of is when the thread pool is exhausted, i.e. all threads are currently doing something, and another read event occurs – this would lead to the read event callback blocking. Does that matter? I thought of putting these into another queue, but surely that's exactly what happens internally in the event base. On the other hand, a second queue might be a good way to organise scheduling of the chunks without blocking worker threads.

Related

Execution Patter of Multi-Threaded Server on Linux

I like to know what should be the execution pattern of Multiple Threads of a Server to implement TCP in request-response cycle of hi-performance Server (like dozens of packets with single or no system call on Linux using Packet MMAP or some other way).
Design 1) For simplicity, Start two thread in main at the start of a Server program. one thread just getting packets directly from network interface(s) like wlan0/eth0. and once number of packets read in one cycle (using while loop with poll() in Linux). wake up the other thread using conditional variable signal call. and after waking up, other thread (sender) process and send packet as tcp response.
Design 2) Start receiver thread at the start of main program. The packet receiver thread reads packets from interfaces using while loop and poll(). When number of packets received, create sender thread and pass number of packets received in one cycle to sender as parameter. Sender thread process the packets and respond as tcp response.
(I think, Design 2 will be more easy to implement but is there any design issue or possible performance issue with this approach this is the question). Since creating buffer to pass to sender thread from receiver thread need to be allocated prior to receiving packets. So I know the size of buffer to allocate. Also in this execution pattern I am creating new thread (which will return and end execution after processing packets and responding tcp response). I like to know what will be the performance issue with this approach since I am creating new thread every time I get a batch of packet from interfaces.
In first approach I am not creating more than two threads (or limited number of threads and threads can be tracked easily for logging and debugging since I will know how many thread are initially created) In second approach I don't know how many threads are hanging around and executing concurrently.
I need any advise how real website like youtube/ or others may have handled this in there hi-performance server if they had followed this way of implementing their front facing servers.

First when going to a 'real' website the magic lies in having a load balancers and a whole bunch of worker nodes to take the load and you easily exceed the boundary of a single system. For example take a look at the following AWS reference architecture for serving web pages at scale AWS Cloud Architecture for serving web whitepaper.
That being said taking this one level down it is always interesting to look at how other well-known products have solved this issue. For example NGINX has an excellent infographic available and matching blogpost describing their architecture and threading.

select() equivalence in I/O Completion Ports

I am developing a proxy server using WinSock 2.0 in Windows. If I wanted to develop it in blocking model, select() was the way to wait for client or remote server to receive data from. Is there any applicable way to do this so using I/O Completion Ports?
I used to have two Contexts for two directions of data using I/O Completion Ports. But having a WSARecv pending couldn't receive any data from remote server! I coudn't find the problem.
Thanks in advance.
EDIT. Here's the WorkerThread Code on currently developed I/O Completion Ports. But I am asking about how to implement select() equivalence.

I/O Completion Ports provide an indication of when an I/O operation completes, they do not indicate when it is possible to initiate an operation. In many situations this doesn't actually matter. Most of the time the overlapped I/O model will work perfectly well if you assume it is always possible to initiate an operation. The underlying operating system will, in most cases, simply do the right thing and queue the data for you until it is possible to complete the operation.
However, there are some situations when this is less than ideal. For example you can always send to a socket using overlapped I/O. You can do this even when the remote peer is not reading and the TCP stack has started to use flow control and has filled the TCP window... This simply uses resources on your local machine in a completely uncontrolled manner (not entirely uncontrolled, but controlled by the peer, which is not ideal). I write about this here and in many situations you DO need to actively manage this kind of thing by tracking how many outstanding I/O write requests you have and using that as an indication of 'readiness to send'.
Likewise if you want a 'readiness to recv' indication you could issue a 'zero byte' read on the socket. This is a read which is issued with a zero length buffer. The read returns when there is data to read but no data is returned. This would give you the indication that there is data to be read on the connection but is, IMHO, pointless unless you are suffering from the very unlikely situation of hitting the I/O page lock limit, as you may as well read the data when it becomes available rather than forcing multiple kernel to user mode transitions.
In summary, you don't really need an answer to your question. You need to look at how the API works and write your code to work with it rather than trying to force the API to work in a way that other APIs that you are familiar with work.

Is threading the best way to handle 40 Clients at a time in UDP Server?

I am working on a UDP server/client application.
I want my server to be able to handle 40 clients at a time. I have thought of creating 40 threads at server side, each thread handling one client. Clients are distinguished on the basis of IP addresses and there is one thread for each unique IP address.
Whenever a client sends some data to a server, the main thread extracts the IP address of the client and decides which thread will process this specific client. Is there a better way to achieve this functionality?

There are different approaches for scale able server application, one thread per client seems good if no of clients are not many, another most efficient approach to accomplish this task is to use thread pool. These threads are work as task base when ever you have any new task assign this task to free worker thread.

Take a look at this project, I think it is very helpful to start with: http://www.codeproject.com/Articles/16935/A-Chat-Application-Using-Asynchronous-UDP-sockets
With IPAddress.Any, we specify that the server should accept client
requests coming on any interface. To use any particular interface, we
can use IPAddress.Parse (“192.168.1.1”) instead of IPAddress.Any. The
Bind function then bounds the serverSocket to this IP address. The
epSender identifies the clients from where the data is coming.
With BeginReceiveFrom, we start receiving the data that will be sent
by the client. Note that we pass epSender as the last parameter of
BeginReceiveFrom, the AsyncCallback OnReceive gets this object via the
AsyncState property of IAsyncResult, and it then processes the client
requests (login, logout, and send message to the users). Please see
the code attached to understand the implementation of OnReceive.

A better way would be to use the Proactor pattern (take a look at Boost.Asio library), instead of creating thread per client. With such an approach your application would have much better scalability and performace (especially on platforms that have native async i/o)
Besides, with this technique the threading would be de-coupled from the concurrency, meaning that you don't necessarily have to mess with multi-threading with all its complications.

Is there any benefit to using epoll with a very small number of file descriptors?

Would the following single threaded UDP client application see a performance benefit from using epoll over simply calling recvfrom/sendto on non-blocking sockets?
Let me explain the client.
I am writing single threaded UDP based client (custom protocol) that both sends and receives data using non-blocking I/O and my colleague suggested I use epoll for this.
The client sends and receives multiple packets of information that are all associated with a unique session id and multiple sessions can be run simultaneously.
If I use epoll, there will be a limited number of maybe 10-20 file descriptors which epoll_wait could wait on. Each file descriptor would be associated with one session. So that's maximum 10 - 20 sessions and this number will be enforced.
Each session has it's own state machine. From a single thread I need to run each state machine reasonably frequently and poll the associated socket as well.
In my case, I'd have to use epoll_wait with a timeout of zero or some very small value so that I can give CPU time to run the state machines for each session.
If there is data for a session then it needs to be directed to the associated state machine.
However, I can't really see much benefit of this design with such a small number of file descriptors.
The way I see it is I have two design options:
1. In my main loop using epoll I can poll the descriptors using epoll_wait with either a small timeout or no timeout.
How it handles data at this point is where I'm getting a bit stuck... either I read it right away and then throw it into a queue for each state machine to pick up when it's run, or I set a flag on the state machine to tell it that data is waiting and when the state machine runs it'll pick it up with a call to recvfrom. Or, I read the data and handle it right away and run the state machine for it.
Or...
2, Just run each state machine from the main loop and call recvfrom. If I get some data, handle it. If I don't then do whatever else the state machine requires. Is there huge overhead calling recvfrom when there is no data?
With going the epoll route I'm coding in some extra complexity. If there is a strong likelyhood for it be faster in my case then I will start doing it. However, if the second way which really simple works just as well then I would not use epoll.
Any thoughts?

No, and in fact performance will be much worse using epoll if adding and removing file descriptors from the set to poll is anything but an extremely rare event. With poll, a single syscall performs the entire operation. With epoll, you need multiple syscalls to modify the set and then wait on it.
Unless you're writing a server that's intended to scale to tens, hundreds, or thousands of thousands of long-term persistent connections, epoll is not only premature optimization, but actually a pessimization. It's also completely nonstandard and non-portable.

Problem supporting keep-alive sockets on a home-grown http server

I am currently experimenting with building an http server. The server is multi-threaded by one listening thread using select(...) and four worker threads managed by a thread pool. I'm currently managing around 14k-16k requests per second with a document length of 70 bytes, a response time of 6-10ms, on a Core I3 330M. But this is without keep-alive and any sockets I serve I immediatly close when the work is done.
EDIT: The worker threads processes 'jobs' that have been dispatched when activity on a socket is detected, ie. service requests. After a 'job' is completed, if there are no more 'jobs', we sleep until more 'jobs' gets dispatched or if there already are some available, we start processing one of these.
My problems started when I began to try to implement keep-alive support. With keep-alive activated I only manage 1.5k-2.2k requests per second with 100 open sockets. This number grows to around 12k with 1000 open sockets. In both cases the response time is somewhere around 60-90ms. I feel that this is quite odd since my current assumptions says that requests should go up, not down, and response time should hopefully go down, but definitely not up.
I've tried several different strategies for fixing the low performance:
1. Call select(...)/pselect(...) with a timeout value so that we can rebuild our FD_SET structure and listen to any additional sockets that arrived after we blocked, and service any detected socket activity.
(aside from the low performance, there's also the problem of sockets being closed while we're blocking, resulting in select(...)/pselect(...) reporting bad file descriptor.)
2. Have one listening thread that only accept new connections and one keep-alive thread that is notified via a pipe of any new sockets that arrived after we blocked and any new socket activity, and rebuild the FD_SET.
(same additional problem here as in '1.').
3. select(...)/pselect(...) with a timeout, when new work is to be done, detach the linked-list entry for the socket that has activity, and add it back when the request has been serviced. Rebuilding the FD_SET will hopefully be faster. This way we also avoid trying to listen to any bad file descriptors.
4. Combined (2.) and (3.).
-. Probably a few more, but they escape me atm.
The keep-alive sockets are stored in a simple linked List, whose add/remove methods are surrounded by a pthread_mutex lock, the function responsible for rebuilding the FD_SET also has this lock.
I suspect that it's the constant locking/unlocking of the mutex that is the main culprit here, I've tried to profile the problem but neither gprof or google-perftools has been very cooperative, either introducing extreme instability or plain refusing to gather any data att all (This could be me not knowing how to use the tools properly though.). But removing the locks risks putting the linked list in a non-sane state and probably crash or put the program into an infinite loop.
I've also suspected the select(...)/pselect(...) timeout when I've used it, but I'm pretty confident that this was not the problem since the low performance is maintained even without it.
I'm at a loss of how I should handle keep-alive sockets and I'm therefor wondering if you people out there has any suggestions on how to fix the low performance or have suggestions on any alternate methods I can use to go about supporting keep-alive sockets.
If you need any more information to be able to answer my question properly, don't hesitate to ask for it and I shall try my best to provide you with the necessary information and update the question with this new information.

Try to get rid of select completely. You can find some kind of event notification on every popular platform: kqueue/kevent on freebsd(), epoll on Linux, etc. This way you do not need to rebuild FD_SET and can add/remove watched fds anytime.

The time increase will be more visible when the client uses your socket for more then one request. If you are merely opening and closing yet still telling the client to keep alive then you have the same scenario as you did without keepalive. But now you have the overhead of the sockets sticking around.
If however you are using the sockets multiple times from the same client for multiple requests then you will lose the TCP connection overhead and gain performance that way.
Make sure your client is using keepalive properly. and likely a better way to get notification of the sockets state and data. Perhaps a poll device or queuing the requests.
http://www.techrepublic.com/article/using-the-select-and-poll-methods/1044098
This page has a patch for linux to handle a poll device. Perhaps some understanding of how it works and you can use the same technique in your application rather then rely on a device that may not be installed.

There are many alternatives:
Use processes instead of threads, and pass file descriptors via Unix sockets.
Maintain per-thread lists of sockets. You could even accept() directly on the worker threads.
etc...

Are your test clients reusing the sockets? Are they correctly handling keep alive?
I could see that case where you do the minimum change possible in your benchmarking code by just passing the keep alive header, but then not changing your code so that the socket is closed at the client end once the pay packet is received.
This would incure all the costs of keep-alive with none of the benefits.

What you are trying to do has been done before. Consider reading about the Leader-Follower network server pattern, http://www.kircher-schwanninger.de/michael/publications/lf.pdf

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight