I'm trying to architect the main event handling of a libuv-based application. The application is composed of one (or more) UDP receivers sharing a socket, whose job is to delegate the processing of incoming messages to a common worker pool.
As the protocol being handled is stateful, all packets coming from any given server should always be directed to the same worker – this constraint seems to rule out libuv's built-in thread pool.
The workers should also be able to send packets themselves.
As I am new to libuv, I wanted to share the intended architecture with you, in order to get feedback and best-practice advice on it.
– Each worker runs its very own libuv loop, allowing it to send packets over the network directly. Additionally, each worker has a dedicated concurrent queue through which messages can be sent to it.
– When a packet is received, its source address is hashed to select the corresponding worker from the pool.
– For each received message, the receiver creates a dedicated async handle on the receiver loop, to act as a callback for when processing has finished.
– The receiver notifies the worker through its async handle that a new message is available; this wakes up the worker, which then processes all enqueued messages.
– When done, the worker thread signals the async handle on the receiver loop, which causes the receiver to return the buffer to the pool and free all allocated resources (as such, the pool does not need to be thread-safe).
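To make the handoff concrete, here is a minimal sketch of the receiver side as I picture it (worker_t, hash_addr() and queue_push() are placeholders of mine, not libuv API; uv_async_send and the UDP receive callback are real libuv):

    /* Sketch of the receiver-side handoff described above. worker_t, hash_addr()
     * and queue_push() are hypothetical; the recv callback and uv_async_send are libuv. */
    #include <uv.h>

    #define NUM_WORKERS 4

    typedef struct worker_s {
        uv_loop_t  loop;    /* the worker's own event loop        */
        uv_async_t wakeup;  /* signalled when new messages arrive */
        /* concurrent message queue lives here (implementation assumed) */
    } worker_t;

    extern worker_t workers[NUM_WORKERS];
    extern void queue_push(worker_t *w, const struct sockaddr *src,
                           const char *data, size_t len);    /* assumed thread-safe */
    extern unsigned hash_addr(const struct sockaddr *addr);  /* assumed */

    /* uv_udp_recv_cb running on the receiver loop */
    static void on_recv(uv_udp_t *handle, ssize_t nread, const uv_buf_t *buf,
                        const struct sockaddr *addr, unsigned flags) {
        (void)handle; (void)flags;
        if (nread <= 0 || addr == NULL)
            return;                              /* ignore errors and empty reads */

        worker_t *w = &workers[hash_addr(addr) % NUM_WORKERS];
        queue_push(w, addr, buf->base, (size_t)nread);

        /* uv_async_send() may be called from any thread and coalesces wakeups,
         * so one persistent handle per worker is enough for the wake-up path. */
        uv_async_send(&w->wakeup);
    }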
The main questions I have would be:
– What is the overhead of creating an async handle for each received message? Is it a good design?
– Is there any built-in way to send a message to another event loop?
– Would it be better to send outgoing packets using another loop, instead of doing it right from the worker loop?
Thanks.
I would like to know what the execution pattern of a server's multiple threads should be when implementing a TCP request-response cycle in a high-performance server (e.g. reading dozens of packets with a single system call, or none, on Linux using packet MMAP or some other mechanism).
Design 1) For simplicity, start two threads in main at the start of the server program. One thread just reads packets directly from the network interface(s), e.g. wlan0/eth0, in a while loop around poll(). Once a number of packets has been read in one cycle, it wakes up the other thread with a condition-variable signal. After waking up, the other thread (the sender) processes the packets and sends the TCP responses.
Design 2) Start only the receiver thread at the start of the main program. The receiver thread reads packets from the interfaces in a while loop around poll(). Once a number of packets has been received, it creates a sender thread and passes the number of packets received in that cycle as a parameter. The sender thread processes the packets and sends the TCP responses.
(I think Design 2 will be easier to implement, but the question is whether it has any design or performance issue.) The buffer passed from the receiver thread to the sender thread needs to be allocated before packets are received, so I know the size of the buffer to allocate. In this execution pattern I am also creating a new thread each time (which returns and ends execution after processing the packets and sending the TCP responses). I would like to know what the performance cost of this approach is, since I am creating a new thread every time I get a batch of packets from the interfaces.
In the first approach I never create more than two threads (or at least a limited number of threads, which can be tracked easily for logging and debugging since I know how many are created initially). In the second approach I don't know how many threads are hanging around and executing concurrently.
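For concreteness, a rough sketch of how Design 1's handoff could look (the packet-reading detail is reduced to an assumed read_packet_batch() helper; the mutex/condition-variable pattern is the point here):

    /* Rough sketch of Design 1: a receiver thread and a sender thread exchanging a
     * batch count under a mutex + condition variable. Packet buffers are omitted. */
    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>

    extern int read_packet_batch(int fd);    /* assumed: drains the socket, returns count */

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int pending = 0;                  /* packets read but not yet processed */

    void *receiver(void *arg) {
        struct pollfd pfd = { .fd = *(int *)arg, .events = POLLIN };
        for (;;) {
            if (poll(&pfd, 1, -1) <= 0)
                continue;
            int batch = read_packet_batch(pfd.fd);

            pthread_mutex_lock(&lock);
            pending += batch;
            pthread_cond_signal(&ready);     /* wake the sender */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    void *sender(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (pending == 0)             /* guard against spurious wakeups */
                pthread_cond_wait(&ready, &lock);
            int batch = pending;
            pending = 0;
            pthread_mutex_unlock(&lock);

            /* process `batch` packets and send the TCP responses here */
            printf("processing %d packets\n", batch);
        }
        return NULL;
    }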
I would appreciate any advice on how real websites like YouTube or others may have handled this in their high-performance, front-facing servers, if they followed this way of implementing them.
First, for a 'real' website the magic lies in having load balancers and a whole bunch of worker nodes to take the load, so you easily exceed the boundary of a single system. For example, take a look at the AWS reference architecture for serving web pages at scale in the 'AWS Cloud Architecture for serving web' whitepaper.
That being said, taking this one level down, it is always interesting to look at how other well-known products have solved this issue. For example, NGINX has an excellent infographic and matching blog post describing its architecture and threading.
I have a client-server model. A multithreaded client sends a message to the server over TCP sockets. The server is also multithreaded, with each request handled by a thread from a worker pool.
Now, the server must send back the message to the client via shared-memory IPC. For example:
    multithreaded client --- GET /a.png --> server
                                              |
                                              |
                                          one worker
                                              |
                                              |
                                             \ /
                            puts the file descriptor into the shared memory
When a worker thread adds the information to the shared memory, how do I make sure that it is read by the same client that requested it?
I feel clueless here as to how to proceed. Currently, I have created one segment of shared memory and there are 20 threads on the server and 10 threads on the client.
While you can use IPC between threads, it's generally not a good idea. Threads share all memory anyway, since they are part of the same process, and there are very efficient mechanisms for communication between threads.
It might just be easier to have the same thread handle a request all the way through. That way, you don't have to hand off a request from thread to thread. However, if you have a pool of requests that are being worked on, it often makes sense to have a thread be able to "put down" a request and then later be able to have that thread or a different thread "pick up" the request.
The easiest way to do this is to make all the information related to the request live in a single structure or object. Use standard thread synchronization tools (like mutexes) to control finding the object, taking ownership of it, and so on.
So when an I/O thread receives a request, it creates a new request object, acquires a mutex, and adds it to the global collection of requests the server is working on. Worker threads can check this global collection to see which requests need work or they can be explicitly dispatched by the thread that created the request.
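As a minimal illustration of that idea (all names here are made up for the example, not from any particular library), the request object and the mutex-protected global collection might look like this:

    /* Sketch of the "single request object + global collection" idea above. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef enum { REQ_NEW, REQ_IN_PROGRESS, REQ_DONE } req_state_t;

    typedef struct request {
        int             client_fd;  /* socket the request arrived on         */
        req_state_t     state;
        char           *data;       /* request payload, owned by this object */
        size_t          len;
        struct request *next;
    } request_t;

    static pthread_mutex_t g_lock     = PTHREAD_MUTEX_INITIALIZER;
    static request_t      *g_requests = NULL;   /* global collection */

    /* Called by an I/O thread once a request has been read in. */
    void submit_request(request_t *req) {
        pthread_mutex_lock(&g_lock);
        req->state = REQ_NEW;
        req->next  = g_requests;
        g_requests = req;
        pthread_mutex_unlock(&g_lock);
    }

    /* Called by a worker thread: take ownership of one request needing work. */
    request_t *claim_request(void) {
        pthread_mutex_lock(&g_lock);
        request_t *r = g_requests;
        while (r != NULL && r->state != REQ_NEW)
            r = r->next;
        if (r != NULL)
            r->state = REQ_IN_PROGRESS;  /* ownership passes to this worker */
        pthread_mutex_unlock(&g_lock);
        return r;
    }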
I am writing an HTTP reverse-proxy in C using Libevent and I would like to implement multithreading to make use of all available CPU cores. I had a look at this example: http://roncemer.com/software-development/multi-threaded-libevent-server-example/
In this example it appears that one thread is used for the full duration of a connection, but for HTTP 1.1 I don't think this will be the most effective solution, as connections are kept alive by default after each request so that they can be reused later. I have noticed that even one browser tab can open several connections to one server and keep them open until the tab is closed, which would immediately exhaust the thread pool. For an HTTP 1.1 proxy there will be many open connections, but only very few of them actively transferring data at a given moment.
So I was thinking of an alternative: have one event base for all incoming connections and have the event callback functions delegate to worker threads. This way we could have many open connections and use a thread only when data arrives on a connection, returning it to the pool once the data has been dealt with.
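As a rough sketch of that shape (evthread_use_pthreads, BEV_OPT_THREADSAFE and the bufferevent calls are real Libevent API; task_queue_push() and the worker pool behind it are assumptions of mine):

    /* Sketch: one event_base for all connections; the read callback hands the
     * buffered data to a worker pool. The worker-pool API is assumed. */
    #include <event2/event.h>
    #include <event2/bufferevent.h>
    #include <event2/buffer.h>
    #include <event2/thread.h>
    #include <stdlib.h>

    extern void task_queue_push(evutil_socket_t fd, char *data, size_t len); /* assumed */

    static void on_read(struct bufferevent *bev, void *ctx) {
        (void)ctx;
        struct evbuffer *in = bufferevent_get_input(bev);
        size_t len = evbuffer_get_length(in);
        if (len == 0)
            return;

        char *chunk = malloc(len);
        if (chunk == NULL)
            return;
        evbuffer_remove(in, chunk, len);       /* drain what has arrived so far */
        task_queue_push(bufferevent_getfd(bev), chunk, len);  /* worker frees it */
    }

    static void on_event(struct bufferevent *bev, short events, void *ctx) {
        (void)ctx;
        if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
            bufferevent_free(bev);
    }

    /* Called once per accepted connection, all on the same shared base. */
    void add_connection(struct event_base *base, evutil_socket_t fd) {
        struct bufferevent *bev = bufferevent_socket_new(
            base, fd, BEV_OPT_CLOSE_ON_FREE | BEV_OPT_THREADSAFE);
        bufferevent_setcb(bev, on_read, NULL, on_event, NULL);
        /* bufferevent_setwatermark() could raise the read low-watermark to
         * reduce per-chunk wakeups (the burstiness concern below). */
        bufferevent_enable(bev, EV_READ);
    }

    /* Must be called once before creating the event_base. */
    int setup_threading(void) {
        return evthread_use_pthreads();
    }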
My question is: is this a suitable implementation of threads with Libevent?
Specifically – is there any need to have one event base per connection as in the example or is one for all connections sufficient?
Also – are there any other issues I should be aware of?
Currently the only problem I can see is burstiness: when data is received in many small chunks, many read events are triggered per HTTP response, which would lead to a lot of handing-off to worker threads. Would this be a problem? If so, it could be somewhat mitigated using Libevent's watermarking, although I'm not sure how that works if a request arrives in two chunks and the second chunk is small enough to leave the buffer size below the watermark. Would it then stay there until more data arrives?
Also, I would need to implement scheduling so that a chunk is only sent once the previous chunk has been fully sent.
The second problem I thought of is when the thread pool is exhausted, i.e. all threads are currently doing something, and another read event occurs – this would lead to the read event callback blocking. Does that matter? I thought of putting these into another queue, but surely that's exactly what happens internally in the event base. On the other hand, a second queue might be a good way to organise scheduling of the chunks without blocking worker threads.
Problem definition:
We are designing an application for an industrial embedded system running Linux.
The system is driven by events from the outside world. The inputs to the system could be any of the following:
A few inputs in the form of digital I/O lines (connected to the GPIOs of the processor, e.g. an e-stop).
The system runs a web server, which allows the system to be controlled via a web browser.
The system runs a TCP server. Any PC or HMI device can send commands over TCP/IP.
The system needs to drive or control RS485 slave devices over UART using Modbus. The system also needs to control a few I/O lines, such as Cooler ON/OFF.
We believe that a state machine is essential to define this application. The core application shall be a multithreaded application with the following threads:
Main thread
Thread to control the RS485 slaves.
Thread to handle events from the Web interface.
Thread to handle digital I/O events.
Thread to handle commands over TCP/IP (sockets).
For inter-thread communication, we are using pthread condition signal & wait. As per our initial design approach (one state machine in the main thread), any input event to the system (web, TCP/IP, or digital I/O) shall be relayed to the main thread, which shall communicate it to the appropriate thread for which the event is destined. A typical scenario would be getting the status of an RS485 slave through the web interface. In this case, the web interface thread shall relay the event to the main thread, which shall change state and then communicate the event to the thread that controls the RS485 slaves, which shall respond back. The main thread shall then send the response back to the web interface thread.
Questions:
1. Should each thread have its own state machine, thereby reducing the complexity of the main thread? In that case, would we still need a state machine in the main thread?
2. Can a thread processing an input event communicate directly with the thread that handles the event, bypassing the main thread? For example, could the web interface thread communicate directly with the thread controlling the RS485 slaves?
3. Is it fine to use pthread condition signals & wait for inter-thread communication, or is there a better approach?
4. How can one thread wait both for events from outside and for responses from other threads? For example, the web interface thread usually waits for events on a POSIX message queue used for inter-process communication with the web server's CGI binaries. The CGI binaries send events to the web interface thread through this message queue. While processing such an event, the web interface thread has to wait for responses from other threads; in that situation it cannot process any new event from the web interface until it has finished processing the previous event and gets back to waiting on the POSIX message queue.
Sorry for the long explanation... I hope I have put it forward in the best possible way for others to understand and help me.
I can give more input if needed.
What I always try to do with such requirements is to use one state machine, run by one 'SM' thread, which could be the main thread. This thread waits on an 'EventQueue' input producer-consumer queue with a timeout. The timeout is used to run an internal delta-queue that can provide timeout events into the state machine when they are required.
All other threads communicate their events to the state engine by pushing messages onto the EventQueue, and the SM thread processes them in a serial manner.
If an action routine in the SM decides that it must do something, it must not synchronously wait for anything, and so it must request the action by pushing a request message to the input queue of whatever thread/subsystem can perform it.
My message class (OK, struct in your C case) typically contains a 'command' enum, a 'result' enum, a data buffer pointer (in case it needs to transport bulk data), an error-message pointer (null if no error), and as much other state as is necessary to allow the asynchronous queueing of any kind of request and the return of the complete result (whether success or fail).
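In C, such a message might look roughly like this (field names purely illustrative):

    /* Rough C translation of the message described above; names are illustrative. */
    #include <stddef.h>

    typedef enum { CMD_GET_STATUS, CMD_SET_OUTPUT, CMD_TIMEOUT /* ... */ } command_t;
    typedef enum { RES_PENDING, RES_OK, RES_FAIL } result_t;

    typedef struct message {
        command_t   command;     /* what is being requested                   */
        result_t    result;      /* filled in by whoever performs the action  */
        void       *data;        /* optional bulk data                        */
        size_t      data_len;
        const char *error_msg;   /* NULL if no error                          */
        void       *reply_queue; /* where to post the completed message back  */
    } message_t;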
This message-passing, single-SM design is the only one I have found that is capable of doing such tasks in a flexible, expandable manner without entering a nightmare world of deadlocks, uncontrolled communications and unrepeatable, undebuggable interactions.
The first question that should be asked about any design is 'OK, how can the system be debugged if there is some strange problem?'. In my design above, I can answer straight away: 'we log all events dequeued by the SM thread - they all arrive serially, so we always know exactly what actions are taken based on them'. If any other design is suggested, ask the above question and, if a good answer is not immediately forthcoming, the design will never be made to work.
So:
1. If a thread, or threaded subsystem, can use a separate state machine for its own INTERNAL functionality, OK, fine. These SMs should be invisible to the rest of the system.
2. NO!
3. Use pthread condition signals & wait to implement producer-consumer blocking queues; a minimal sketch follows below.
4. One input queue per thread/subsystem. All inputs go to that queue in the form of messages. Command/state fields in each message identify the message and what should be done with it.
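A minimal sketch of such a producer-consumer blocking queue, built from a pthread mutex and condition variable (for the SM thread's timed wait mentioned above, pthread_cond_timedwait would replace pthread_cond_wait):

    /* Sketch of a producer-consumer blocking queue: one per thread/subsystem,
     * pushed to by producers, popped by the owning thread. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct qnode { void *item; struct qnode *next; } qnode_t;

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  not_empty;
        qnode_t *head, *tail;
    } queue_t;

    void queue_init(queue_t *q) {
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->not_empty, NULL);
        q->head = q->tail = NULL;
    }

    void queue_push(queue_t *q, void *item) {    /* producer side */
        qnode_t *n = malloc(sizeof *n);
        n->item = item;
        n->next = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->tail) q->tail->next = n; else q->head = n;
        q->tail = n;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    void *queue_pop(queue_t *q) {                /* consumer (e.g. SM thread) side */
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)                  /* loop guards against spurious wakeups */
            pthread_cond_wait(&q->not_empty, &q->lock);
        qnode_t *n = q->head;
        q->head = n->next;
        if (q->head == NULL) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        void *item = n->item;
        free(n);
        return item;
    }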
BTW, I would 100% do this in C++ unless shotgun-at-head :)
I have implemented a legacy embedded library that was originally written for a clone (EC115/EC270) of the Siemens ES122C terminal controller. This library and its OS included more or less what you describe. The original hardware was based on an 80186 CPU. The OS (RMOS for Siemens, FXMOS for us; don't google it, it was never published) had all the stuff needed for basic controller work.
It had preemptive multi-tasking, task-to-task communication, semaphores, timers and I/O events, but no memory protection.
I ported that stuff to the Raspberry Pi (i.e. Linux).
I used pthreads to simulate our legacy "tasks": since we had no memory protection, threads are the closest semantic match.
The rest of the implementation then revolved around the epoll API. This means that everything generates an event. An event is when something happens: a timer expires, another thread sends data, a TCP socket is connected, an IO pin changes state, etc.
This requires that all the event sources be transformed into file descriptors. Linux provides several syscalls that do exactly that:
For task-to-task communication I used classic Unix pipes.
For timer events I used the timerfd API.
For TCP communication I used normal sockets.
For serial I/O I simply opened the right device (/dev/???).
Signals were not necessary in my case, but Linux provides signalfd if needed.
I then wrapped epoll_wait to simulate the original semantics.
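A condensed sketch of that loop (dispatch_event() is an assumed routing function; epoll, timerfd and pipes themselves are standard Linux APIs):

    /* Condensed sketch of the epoll-based wait described above: one loop that
     * treats pipes, timerfds and sockets uniformly. dispatch_event() is assumed. */
    #include <sys/epoll.h>
    #include <sys/timerfd.h>
    #include <time.h>

    #define MAX_EVENTS 16

    extern void dispatch_event(int fd);   /* assumed: routes the event to its handler */

    void event_loop(int task_pipe_fd, int tcp_fd) {
        int ep = epoll_create1(0);

        /* A periodic timer as a file descriptor, so it is just another event source. */
        int timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
        struct itimerspec ts = { .it_interval = { .tv_sec = 1 },
                                 .it_value    = { .tv_sec = 1 } };
        timerfd_settime(timer_fd, 0, &ts, NULL);

        int fds[3] = { task_pipe_fd, timer_fd, tcp_fd };
        for (int i = 0; i < 3; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
            epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev);
        }

        for (;;) {
            struct epoll_event events[MAX_EVENTS];
            int n = epoll_wait(ep, events, MAX_EVENTS, -1);  /* block until something happens */
            for (int i = 0; i < n; i++)
                dispatch_event(events[i].data.fd);  /* pipe message, timer tick, socket data... */
        }
    }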
It works like a charm.
TL;DR
Take a deep look at the epoll API; it does what you probably need.
EDIT: Yes, and the advice from Martin James is very good, especially point 4. Each thread should only ever be in a loop waiting on an event via epoll_wait.
I am trying to implement a TCP server as part of a larger project. The server should be able to maintain a TCP connection with any number of clients (a minimum of 32) and service any client that requests servicing. In our scenario, once a client connects to the server it never closes the connection unless some sort of failure occurs (e.g. the machine running the client breaks down), and it repeatedly requests service from the server. The same is the case for all the other clients: each maintains a connection with the server and performs transactions. So, to sum up, the server has to maintain connections with all clients, serve each client as needed, and at the same time be able to accept any other client connections that want to connect.
I implemented the above functionality using the select() system call of the Berkeley socket API, and it works fine when we have a small number of clients (say 10). But the server needs to scale as far as possible, as we are implementing it on a 16-core machine. For that I looked through various multithreading design techniques, e.g. one thread per client, and the best one in my opinion is a thread pool design. As I was about to implement it, I ran into some problems:
If I designate the main thread to accept incoming connections and save each connection's file descriptor in a data structure, and I have a pool of threads, how do I get the threads to poll whether a particular client is requesting service or not? The design is simple enough for scenarios in which a client contacts the server, gets its service, and then closes the connection: we can pick a thread from the pool, service the client, and push the thread back into the pool for future connection handling. But when we have to service a set of clients that keep their connections open and request services intermittently, what would be the best approach? All help will be much appreciated, as I am really stuck on this.
Thanks.
Use pthreads, with one thread per CPU plus one extra thread.
The extra thread (the main thread) listens for new connections with the listen() system call, accepts the new connections with accept(), then determines which worker thread currently has the least number of connections, acquires a lock/mutex for that worker thread's "pending connections" FIFO queue, places the descriptor for the accepted connection onto the worker thread's "pending connections" FIFO queue, and sends a "check your queue" notification (e.g. using a pipe) to the worker thread.
The worker threads use select(), and send/receive data on whatever connections they have been given. If/when a worker thread receives a "check your queue" notification from the main thread, it acquires the lock/mutex for its "pending connections" FIFO queue and adds any newly accepted connections to its fd_set list.
For 1024 connections and 16 CPUs; you might end up with one main thread waiting for new connections (but doing almost nothing as you wouldn't be expecting many new connections), and 16 worker threads handling an average of 64 connections each.
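A compressed sketch of the main (accept) thread side of this design (worker_t, its fixed-size pending queue and workers_init() are illustrative; listen/accept, the notification pipe and the mutex are the standard pieces described above):

    /* Compressed sketch of the main (accept) thread described above. worker_t and
     * its pending-connections FIFO are illustrative. */
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_WORKERS 16
    #define MAX_PENDING 64

    typedef struct {
        pthread_mutex_t lock;
        int pending[MAX_PENDING];    /* FIFO of accepted fds awaiting pickup */
        int count;
        int notify_pipe[2];          /* worker select()s on notify_pipe[0]   */
        int active_connections;
    } worker_t;

    static worker_t workers[NUM_WORKERS];

    void workers_init(void) {
        for (int i = 0; i < NUM_WORKERS; i++) {
            pthread_mutex_init(&workers[i].lock, NULL);
            pipe(workers[i].notify_pipe);
        }
    }

    static worker_t *least_loaded(void) {
        worker_t *best = &workers[0];
        for (int i = 1; i < NUM_WORKERS; i++)
            if (workers[i].active_connections < best->active_connections)
                best = &workers[i];
        return best;
    }

    void accept_loop(int listen_fd) {
        for (;;) {
            int fd = accept(listen_fd, NULL, NULL);
            if (fd < 0)
                continue;

            worker_t *w = least_loaded();
            pthread_mutex_lock(&w->lock);
            if (w->count < MAX_PENDING) {
                w->pending[w->count++] = fd;
                w->active_connections++;
            } else {
                close(fd);               /* queue full; a real server would back off */
            }
            pthread_mutex_unlock(&w->lock);

            /* "check your queue": wake the worker out of its select() */
            char c = 1;
            (void)write(w->notify_pipe[1], &c, 1);
        }
    }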
One thread per client is almost certainly the best design. Make sure you always have at least one thread blocked in accept waiting for a new connection - this means that after accept succeeds, you might need to create a new thread before proceeding if it was the last one. I've found semaphores to be a great primitive for keeping track of the need to spawn new listening threads.