I'm new to threads, so please advise me if you can. I have a server that broadcasts messages to clients,
and the clients send replies back to the server. I want to handle every reply in a separate thread.
Every reply will have a message id and a thread id. How can I fill some structure with this information
from all the threads and then read it?
Also, looking at my code, is it correct to create a thread inside the while loop, or is there some way
to create a thread only when I get a reply from a client?
Did I start with the correct understanding?
int main() {
    while (1) {
        /* sendto() on the broadcast IP */
        pthread_create(&cln_thread, NULL, msg_processing, (void *) &arg);
    }
}

void *msg_processing(void *arg) {
    /* recvfrom() the client msg with the packet id and thread id */
    /* how do I store this information and then read it in main when the program finishes? */
}
Thank you
Err.. no, just create ONE thread, once only, for receiving datagrams on one socket. In the thread function, receive the datagrams in a while(true) loop. Don't let this receive thread terminate and don't create any more receive threads. Continually creating/terminating/destroying threads is inefficient, dangerous, unnecessary, error-prone, difficult-to-debug and you should try very hard to not do it, ever.
Edit:
One receive thread only - but you don't have to do the processing there. Malloc a 64K buffer, receive data into it, push the buffer pointer onto a producer-consumer queue to a pool of threads that will do the processing, then loop back and malloc again so the pointer is reseated to a fresh buffer for the next datagram. Free the *buffers in the pool threads after the processing is completed. The receive thread will be back waiting for datagrams quickly while the buffer processing runs concurrently.
If you find that the datagrams can arrive so rapidly that the processing cannot keep up, memory use will grow unchecked as more and more *buffers pile up in the queue. There are a couple ways round that. You can use a bounded queue that blocks if its capacity is reached. You could create x buffers at startup and store them on another producer-consumer 'pool queue' that the receive thread pops from, (instead of the malloc) - the processing pool threads can then push the 'used' *buffers back on to the pool queue for re-use. If the pool runs out, the receive thread will block on the pool until *buffers are returned.
I prefer the pooled-buffer approach since it caps memory-use across the whole system, avoids continual malloc/free with its fragmentation issues etc, avoids the more complex bounded queues, is easier to tweak, (the pool level is easy to change at runtime), and is easier to monitor/debug - I usually dump the pool level, (ie. queue count), to the display once a second with a timer.
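A rough sketch of such a buffer pool in C, using a mutex and condition variable (the names buf_pool, pool_get and pool_put are assumptions for illustration; each slot would be a 64K buffer malloc'd once at startup, and the separate producer-consumer queue to the processing pool is not shown):

#include <pthread.h>

#define POOL_SIZE 64          /* assumed number of preallocated 64K buffers */

/* a blocking LIFO stack of free buffers shared by the receive and pool threads */
struct buf_pool {
    void           *bufs[POOL_SIZE];
    int             count;    /* handy to dump once a second for monitoring */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
};

/* receive thread: take a free buffer, blocking if the pool is empty */
void *pool_get(struct buf_pool *p)
{
    void *buf;

    pthread_mutex_lock(&p->lock);
    while (p->count == 0)
        pthread_cond_wait(&p->not_empty, &p->lock);
    buf = p->bufs[--p->count];
    pthread_mutex_unlock(&p->lock);
    return buf;
}

/* processing threads: return a used buffer for re-use */
void pool_put(struct buf_pool *p, void *buf)
{
    pthread_mutex_lock(&p->lock);
    p->bufs[p->count++] = buf;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}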
In either case, datagrams may be lost but, if your system is overloaded such that data regularly arrives faster than it can possibly be processed, that's going to be the case anyway no matter how you design it.
One socket is fine, so why complicate matters? :)
You can find a good example of a multithreaded UDP server written in Go at the following link:
https://gist.github.com/jtblin/18df559cf14438223f93
The main idea is to use multi-core functionality to the full, so that each thread works on its own CPU core and reads UDP data into a single buffer for processing.
Related
I am working on improving the performance of a network application written in C running on linux systems.
As the program is written now, it reads a packet from a socket interface, does some processing on it, and then adds it to a send queue.
I am pretty new to multi threading programming but I am familiar with the basic concepts (mutex, conditional signals etc).
I am trying to implement a solution where a set of worker threads are passed what is read from the interface and they do the work that follows.
My question is: how can I ensure that, if the first thread reads the first packet and the second thread reads the second packet, the packets are added to the send queue in the same order they were read?
There are lots of ways to solve this. Different ways have different trade offs. Things to consider are if you want to have a static number of worker threads, how many worker threads, and how perfect you want the solution to be.
If all worker threads are to receive their data packets directly via a call to read or recv then:
pthread_mutex_lock(&the_mutex);
do {
    read_size = read(sock, buf, buf_size);
    if (read_size > 0) {
        /* tag this packet with a global sequence number while still holding
         * the lock, so the read order matches the sequence order */
        my_count = ++packet_counter;
        break;
    } else {
        /* figure out how to handle different failures here */
    }
} while (1);
pthread_mutex_unlock(&the_mutex);

/* processing happens outside the lock, in parallel with other workers */
results = do_work(buf, read_size);
enqueue_results(my_count, results);
Would work, where enqueue_results() would put the results into a priority queue that can handle wrapping around of the key (which isn't that difficult to do since you just order by last_sent_count-this_count rather than using this_count directly for the queue ordering).
Then another thread would need to wait for the next reply in sequence to become ready and send it.
You could get a lot fancier, but you should give this a try.
Just code what you want. Outbound packets can be inserted into a queue in order, and the sender can wait if the packet it needs to send next isn't at the head of the queue.
When you say you add it to the send queue, does another thread that you control remove it from this send queue to send later, or is this outside your control? If the former, you can use a priority queue for the second queue instead of a traditional first-in-first-out queue. If the key to the priority queue is some global counter, the sender will always pull the smallest/next value. If that smallest value isn't the next value to send, the sending thread can wait for it. Depending on the priority queue implementation, you can also just peek into the queue to see the next value and then conditionally wait until another insertion into the queue.
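A minimal sketch of that sender-side wait in C, assuming a priority queue keyed by the global counter from the snippet above (pq_insert, pq_peek_key, pq_empty and pq_pop are hypothetical helpers, not from either answer):

#include <pthread.h>

/* assumed to exist: a min-priority queue of results keyed by packet number */
void     pq_insert(unsigned key, void *results);
unsigned pq_peek_key(void);      /* key of the smallest entry; valid if non-empty */
int      pq_empty(void);
void    *pq_pop(void);

static pthread_mutex_t pq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pq_cond = PTHREAD_COND_INITIALIZER;
static unsigned next_to_send = 1;        /* matches the first ++packet_counter */

/* called by worker threads after do_work() */
void enqueue_results(unsigned my_count, void *results)
{
    pthread_mutex_lock(&pq_lock);
    pq_insert(my_count, results);
    pthread_cond_broadcast(&pq_cond);    /* wake the sender to re-check the head */
    pthread_mutex_unlock(&pq_lock);
}

/* sender thread: only sends when the head of the queue is the next in order */
void *sender_thread(void *arg)
{
    for (;;) {
        void *results;

        pthread_mutex_lock(&pq_lock);
        while (pq_empty() || pq_peek_key() != next_to_send)
            pthread_cond_wait(&pq_cond, &pq_lock);
        results = pq_pop();
        next_to_send++;
        pthread_mutex_unlock(&pq_lock);

        /* send 'results' on the wire here, outside the lock */
    }
    return NULL;
}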
The program could use separate receive and send queues for each thread. The receiving thread would queue packets to the processing threads in round robin order. Each processing thread would dequeue a packet from its receive queue, process the packet and queue the processed packet to its send queue. The sending thread would dequeue processed packets from the processed send queues in round robin order.
I am writing a small server that will receive data from multiple sources and process this data. The sources and the data received are significant, but no more than epoll should be able to handle quite well. However, all received data must be parsed and run through a large number of tests, which is time consuming and will block a single thread despite epoll multiplexing. Basically, the pattern should be something like the following: the IO loop receives data and bundles it into a job, sends it to the first thread available in the pool, the bundle is processed there, and the result is passed back to the IO loop for writing to file.
I have decided to go for a single IO thread and N worker threads. The IO thread for accepting tcp connections and reading data is easy to implement using the example provided at:
http://linux.die.net/man/7/epoll
Threads are also usually easy enough to deal with, but I am struggling to combine the epoll IO loop with a threadpool in an elegant manner. I am unable to find any "best practice" for using epoll with a worker pool online either, but quite a few questions regarding the same topic.
I therefore have some questions I hope someone can help me answer:
Could (and should) eventfd be used as a mechanism for 2-way synchronization between the IO thread and all the workers? For instance, is it a good idea for each worker thread to have its own epoll routine waiting on a shared eventfd (with a struct pointer, containing data/info about the job) i.e. using the eventfd as a job queue somehow? Also perhaps have another eventfd to pass results back into the IO thread from multiple worker threads?
After the IO thread is signaled about more data on a socket, should the actual recv take place on the IO thread, or should the workers recv the data on their own in order to not block the IO thread while parsing data frames etc.? In that case, how can I ensure safety, e.g. in case recv reads 1.5 frames of data in a worker thread and another worker thread receives the last 0.5 frame of data from the same connection?
If the worker thread pool is implemented through mutexes and such, will waiting for locks block the IO thread if N+1 threads are trying to use the same lock?
Are there any good practice patterns for how to build a worker thread pool around epoll with two way communication (i.e. both from IO to workers and back)?
EDIT: Can one possible solution be to update a ring buffer from the IO-loop, after update send the ring buffer index to the workers through a shared pipe for all workers (thus giving away control of that index to the first worker that reads the index off the pipe), let the worker own that index until end of processing and then send the index number back into the IO-thread through a pipe again, thus giving back control?
My application is Linux-only, so I can use Linux-only functionality in order to achieve this in the most elegant way possible. Cross platform support is not needed, but performance and thread safety is.
In my tests, one epoll instance per thread outperformed complicated threading models by far. If listener sockets are added to all epoll instances, the workers would simply accept(2) and the winner would be awarded the connection and process it for its lifetime.
Your workers could look something like this:
struct epoll_event evs[1024];
int nfds, i;

for (;;) {
    nfds = epoll_wait(worker->efd, evs, 1024, -1);
    for (i = 0; i < nfds; i++)
        ((struct socket_context *)evs[i].data.ptr)->handler(
            evs[i].data.ptr,
            evs[i].events);
}
And every file descriptor added to an epoll instance could have a struct socket_context associated with it:
void listener_handler(struct socket_context *ctx, int ev)
{
    struct socket_context *conn = malloc(sizeof(*conn));

    conn->fd = accept(ctx->fd, NULL, NULL);
    conn->handler = conn_handler;
    /* add to calling worker's epoll instance or implement some form
     * of load balancing */
}
void conn_handler(struct socket_context* ctx, int ev)
{
/* read all available data and process. if incomplete, stash
* data in ctx and continue next time handler is called */
}
void dummy_handler(struct socket_context* ctx, int ev)
{
/* handle exit condition async by adding a pipe with its
* own handler */
}
I like this strategy because:
very simple design;
all threads are identical;
workers and connections are isolated--no stepping on each other's toes or calling read(2) in the wrong worker;
no locks are required (the kernel gets to worry about synchronization on accept(2));
somewhat naturally load balanced since no busy worker will actively contend on accept(2).
And some notes on epoll:
use edge-triggered mode, non-blocking sockets and always read until EAGAIN;
avoid dup(2) family of calls to spare yourself from some surprises (epoll registers file descriptors, but actually watches file descriptions);
you can epoll_ctl(2) other threads' epoll instances safely;
use a large struct epoll_event buffer for epoll_wait(2) to avoid starvation.
Some other notes:
use accept4(2) to save a system call;
use one thread per core (1 for each physical if CPU-bound, or 1 for each logical if I/O-bound);
poll(2)/select(2) is likely faster if connection count is low.
I hope this helps.
With this model, because we only know the packet size once we have fully received the packet, we unfortunately cannot offload the receive itself to a worker thread. Instead, the best we can do is have one thread receive the data and pass off pointers to fully received packets.
The data itself is probably best held in a circular buffer; however, we will want a separate buffer for each input source (if we get a partial packet we can continue receiving from other sources without splitting up the data). The remaining question is how to inform the workers when a new packet is ready, and give them a pointer to the data in said packet. Because there is little data here, just some pointers, the most elegant way of doing this would be with POSIX message queues. These provide the ability for multiple senders and multiple receivers to write and read messages, always ensuring every message is received, and by precisely one thread.
You will want a struct resembling the one below for each data source; I shall go through the fields' purposes now.
struct DataSource
{
    int SourceFD;
    char DataBuffer[MAX_PACKET_SIZE * (THREAD_COUNT + 1)];
    char *LatestPacket;
    char *CurrentLocation;
    int SizeLeft;
};
The SourceFD is obviously the file descriptor of the data stream in question. The DataBuffer is where packet contents are held while being processed; it is a circular buffer. The LatestPacket pointer is used to temporarily hold a pointer to the most recent packet in case we receive a partial packet and move on to another source before passing the packet off. CurrentLocation stores where the latest packet ends, so that we know where to place the next one, or where to carry on in case of a partial receive. SizeLeft is the room left in the buffer; this will be used to tell whether we can fit the packet or need to circle back around to the beginning.
The receiving function will thus effectively:
Copy the contents of the packet into the buffer
Move CurrentLocation to point to the end of the packet
Update SizeLeft to account for the now decreased buffer
If we cannot fit the packet in the end of the buffer we cycle around
If there is no room there either we try again a bit later, going to another source meanwhile
If we had a partial receive store the LatestPacket pointer to point to the start of the packet and go to another stream until we get the rest
Send a message using a POSIX message queue to a worker thread so it can process the data. The message will contain a pointer to the DataSource structure so it can work on it; it also needs a pointer to the packet it is working on, and its size, both of which can be calculated when we receive the packet
The worker thread will do its processing using the received pointers and then increase SizeLeft so the receiver thread will know it can carry on filling the buffer. The atomic functions will be needed to work on the size value in the struct so we don't get race conditions with the size property (as it is possible it is written by a worker and the IO thread simultaneously, causing lost writes; see my comment below). They are listed here and are simple and extremely useful.
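As a minimal sketch of that update with the GCC atomic builtins linked below (source and packet_size are hypothetical names, assuming the DataSource struct above):

/* worker thread: done with a packet of packet_size bytes,
 * give the space back to the receiver atomically */
__sync_add_and_fetch(&source->SizeLeft, packet_size);

/* receive thread: claim space for an incoming packet */
__sync_sub_and_fetch(&source->SizeLeft, packet_size);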
Now, I have given some general background but will address the points given specifically:
Using eventfd as a synchronization mechanism is largely a bad idea: you will find yourself using a fair amount of unneeded CPU time, and it is very hard to perform any synchronization that way. Particularly if you have multiple threads pick up the same file descriptor, you could have major problems. This is in effect a nasty hack that will work sometimes, but it is no real substitute for proper synchronization.
It is also a bad idea to try to offload the receive, as explained above. You can get around the issue with complex IPC, but frankly it is unlikely that receiving I/O will take enough time to stall your application; your I/O is also likely much slower than your CPU, so receiving with multiple threads will gain little (this assumes you do not, say, have several 10-gigabit network cards).
Using mutexes or locks is a silly idea here; it fits much better into lockless coding given the low amount of (simultaneously) shared data, since you are really just handing off work and data. This will also boost performance of the receive thread and make your app far more scalable. Using the functions mentioned here http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html you can do this nicely and easily. If you did do it this way, what you would need is a semaphore: it can be posted every time a packet is received and waited on by each thread that starts a job, allowing dynamically more threads in if more packets are ready. That would have far less overhead than a homebrew solution with mutexes.
There is not really much difference here from any other thread pool: you spawn a lot of threads and then have them all block in mq_receive on the data message queue, waiting for messages. When they are done, they send their result back to the main thread, which has added the results message queue to its epoll list. It can then receive results this way; it is simple and very efficient for small data payloads like pointers. This will also use little CPU and not force the main thread to waste time managing workers.
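A rough sketch of the message-queue hand-off (the struct Job, post_job and worker names are assumptions; the queue is assumed to have been created elsewhere with mq_open and an attr.mq_msgsize of sizeof(struct Job), and the program linked with -lrt):

#include <mqueue.h>

/* hypothetical job descriptor: small enough to pass by value in a message */
struct Job {
    struct DataSource *source;
    char              *packet;
    size_t             packet_size;
};

/* IO thread: hand a fully received packet to the pool */
void post_job(mqd_t jobs_mq, struct Job *job)
{
    /* the message body is just the small Job struct itself */
    mq_send(jobs_mq, (const char *)job, sizeof(*job), 0);
}

/* worker thread: block until a job arrives, then process it */
void *worker(void *arg)
{
    mqd_t jobs_mq = *(mqd_t *)arg;
    struct Job job;

    for (;;) {
        if (mq_receive(jobs_mq, (char *)&job, sizeof(job), NULL) < 0)
            continue;  /* handle/log the error properly in real code */
        /* process job.packet / job.packet_size here, then return the
         * space with an atomic add to job.source->SizeLeft */
    }
    return NULL;
}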
Finally, your edit is fairly sensible, except for the fact that, as I have suggested, message queues are far better than pipes here, as they very efficiently signal events, guarantee a full message read and provide automatic framing.
I hope this helps, however it is late so if I missed anything or you have questions feel free to comment for clarification or more explanation.
I posted the same answer in another post: I want to wait on both a file descriptor and a mutex, what's the recommended way to do this?
==========================================================
This is a very commonly seen problem, especially when you are developing network server-side programs. Most Linux server-side programs' main loop will look like this:
epoll_add(serv_sock);
while(1){
ret = epoll_wait();
foreach(ret as fd){
req = fd.read();
resp = proc(req);
fd.send(resp);
}
}
This is a single-threaded (main thread only), epoll-based server framework. The problem is that it is single-threaded, not multi-threaded. It requires that proc() never block or run for a significant time (say 10 ms in common cases).
If proc() ever runs for a long time, WE NEED MULTIPLE THREADS, and must execute proc() in a separate thread (the worker thread).
We can submit a task to the worker thread without blocking the main thread, using a mutex-based message queue; it is fast enough.
Then we need a way to obtain the task result from a worker thread. How? We could just check the message queue directly, before or after epoll_wait(); however, that check only runs after epoll_wait() ends, and epoll_wait() usually blocks for its full timeout (say 10 ms in common cases) if none of the file descriptors it waits on is active.
For a server, 10 ms is quite a long time! Can we signal epoll_wait() to end immediately when task result is generated?
Yes! I will describe how it is done in one of my open source projects.
Create a pipe shared by all worker threads, and have epoll wait on that pipe as well. Once a task result is generated, the worker thread writes one byte into the pipe, and epoll_wait() will return almost immediately - a Linux pipe has 5 us to 20 us of latency.
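A minimal sketch of this idea in C (struct result, push_result and pop_result are hypothetical names; the main thread is assumed to have created pipe_fds with pipe() and added pipe_fds[0] to its epoll set):

#include <pthread.h>
#include <unistd.h>

struct result { struct result *next; /* ... payload ... */ };

/* a mutex-protected result list plus a pipe used only as a wakeup signal */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static struct result *result_head;
static int pipe_fds[2];

/* worker thread: publish a result and wake the epoll loop */
void push_result(struct result *r)
{
    pthread_mutex_lock(&q_lock);
    r->next = result_head;
    result_head = r;
    pthread_mutex_unlock(&q_lock);

    write(pipe_fds[1], "x", 1);      /* one byte per result */
}

/* main thread: called when epoll reports pipe_fds[0] readable */
struct result *pop_result(void)
{
    char c;
    struct result *r;

    read(pipe_fds[0], &c, 1);        /* drain one wakeup byte */
    pthread_mutex_lock(&q_lock);
    r = result_head;
    if (r)
        result_head = r->next;
    pthread_mutex_unlock(&q_lock);
    return r;
}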
In my project SSDB (a Redis-protocol-compatible on-disk NoSQL database), I created a SelectableQueue for passing messages between the main thread and worker threads. As its name suggests, a SelectableQueue has a file descriptor, which can be waited on by epoll.
SelectableQueue: https://github.com/ideawu/ssdb/blob/master/src/util/thread.h#L94
Usage in main thread:
epoll_add(serv_sock);
epoll_add(queue->fd());
while(1){
ret = epoll_wait();
foreach(ret as fd){
if(fd is worker_thread){
sock, resp = worker->pop_result();
sock.send(resp);
}
if(fd is client_socket){
req = fd.read();
worker->add_task(fd, req);
}
}
}
Usage in worker thread:
fd, req = queue->pop_task();
resp = proc(req);
queue->add_result(fd, resp);
I have to receive data from 15 different clients, each of them sending on 5 different ports: 15 * 5 sockets in total.
For each client the port numbers are defined and fixed. For example, client 1 uses ports 3001 to 3005, client 2 uses ports 3051 to 3055, etc. They have one thing in common: the first port (3001, 3051, ...) is used to send commands; the other ports send data.
After receiving the data I have to verify the checksum, keep track of received packets, re-request a packet if it is lost, and also write to files on the hard disk.
Restriction: I cannot change the above design and I cannot change from UDP to TCP.
The two methods I'm aware of after reading are:
asynchronous multiplexing using select().
Thread per socket.
I tried the first one and I'm stuck at the point where I get the data. I'm able to receive data. I have some processing to do, so I want to start a thread for each socket, or a thread per group of sockets (say all first ports, all second ports, etc., i.e. 3001, 3051, etc.).
But here, if a client sends any data then FD_ISSET becomes true, so if I start a thread there, it becomes a thread per message.
Question:
How do I add the thread code here? If I put pthread_create inside if(FD_ISSET ...), then for every message that I receive I create a thread, but I wanted a thread per socket.
while(1)
{
    int nready = 0;
    read_set = active_set;
    if((nready = select(fdmax+1, &read_set, NULL, NULL, NULL)) == -1)
    {
        printf("Select Error\n");
        perror("select");
        exit(EXIT_FAILURE);
    }
    printf("number of ready desc=%d\n", nready);

    for(index = 1; index <= 15*5; index++)
    {
        if(FD_ISSET(sock_fd[index], &read_set))
        {
            rc = recvfrom(sock_fd[index], clientmsgInfo, MSG_SIZE, 0,
                          (struct sockaddr *)&client_sockaddr_in,
                          &sockaddr_in_length);
            if(rc < 0)
                printf("socket %d down\n", sock_fd[index]);
            printf("Received packet from %s: %d\nData: %s\n\n",
                   inet_ntoa(client_sockaddr_in.sin_addr),
                   ntohs(client_sockaddr_in.sin_port),
                   clientmsgInfo);
        }
    } //for
} //while
Create the threads at program startup and divide them up to handle data, commands, etc.
How?
1. Let's say you create 2 threads, one for data and another for commands.
2. Make them sleep in the thread handler, or let them wait on a lock that the main thread has acquired; that is, the main thread holds two locks, one for each of them.
3. When any client data or command arrives at the recvfrom() in the main thread, then depending on the type of buffer (data or commands), the main thread copies the buffer into the data shared with the other threads and unlocks the mutex.
4. In the threads, lock the mutex so that the main thread won't corrupt the data; once processing is done in the threads, unlock and sleep again.
A better approach would be to have a queue that is filled by the main thread and can be accessed element-wise by the other threads.
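For instance, a minimal sketch of such a shared queue, using a pthread mutex and condition variable (the names job_queue, queue_push and queue_pop are assumptions for illustration):

#include <pthread.h>

#define QUEUE_CAP 128   /* assumed fixed capacity */

/* FIFO shared between the main thread (producer) and worker threads (consumers) */
struct job_queue {
    void           *items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
};

/* main thread: push a received buffer for the workers */
void queue_push(struct job_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    /* a real implementation would also wait on a not_full condition here */
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* worker threads: block until an item is available */
void *queue_pop(struct job_queue *q)
{
    void *item;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return item;
}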
I assume that each client context is independent of the others, ie. one client socket group can be managed on its own, and the data pulled from the sockets can be processed alone.
You express two possibilities of handling the problem:
Asynchronous multiplexing: in this setting, the sockets are all managed by one single thread. This threads selects which socket must be read next, and pulls data out of it
Thread per socket: in this scenario, you have as many threads as there are sockets, or more probably groups of sockets, i.e. clients - this is the interpretation I will build from.
In both cases, threads must keep ownership of their respective resources, meaning sockets. If you start moving sockets around between threads, you will make things more difficult than they need to be.
Outside the work that needs to be done, you will need to handle thread management:
How do threads get started?
How and when are they stopped?
What are the error handling policies?
Your question doesn't cover these issues, but they might play a significant role in your final design.
Scenario (2) seems simpler: you have one main "template" (I use the word in a general meaning here) for handling a group of sockets using select on them, and in the same thread receive and process the data. It's quite straightforward to implement, with a struct to contain the context specific data (socket ports, pointer to function for packet processing), and a single function looping on select and process, plus perhaps some other checks for errors and thread life management.
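A minimal sketch of that per-group thread template (the struct fields, the 5-socket assumption and the process callback are illustrative, not prescribed by the answer):

#include <sys/select.h>
#include <pthread.h>

/* one instance per client: the group's sockets plus its processing callback */
struct client_ctx {
    int   socks[5];               /* the 5 UDP sockets for this client */
    void (*process)(int sock);    /* packet handler for this client */
};

void *client_thread(void *arg)
{
    struct client_ctx *ctx = arg;
    fd_set rset;
    int i, maxfd;

    for (;;) {
        FD_ZERO(&rset);
        maxfd = -1;
        for (i = 0; i < 5; i++) {
            FD_SET(ctx->socks[i], &rset);
            if (ctx->socks[i] > maxfd)
                maxfd = ctx->socks[i];
        }
        if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
            continue;                          /* handle errors properly in real code */
        for (i = 0; i < 5; i++)
            if (FD_ISSET(ctx->socks[i], &rset))
                ctx->process(ctx->socks[i]);   /* recvfrom + checksum + queue */
    }
    return NULL;
}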
Scenario (1) requires a different setup: one I/O thread reads all the packets and passes them on to specialized worker threads to do the processing. If a processing error occurs, the worker threads will have to generate the ad-hoc packet to be sent to the client, and pass it to the I/O thread for sending. You will need packet queues both ways to allow communication between I/O and workers, and have the I/O thread check the worker queues somehow for resend requests. So this solution is a bit more expensive in terms of development, but reduces the I/O contention to one single point. It's also more flexible, in case some processing must be done across data coming from several clients, or if you want to chain up processing somehow. For instance, you could instead have one thread per client socket, and then another thread per client group of sockets further down the work pipeline, with each step of the pipeline interconnected by packet queues.
A blend of both solutions can of course be implemented, with one IO thread per client, and pipelined worker threads.
The advantage of both outlined solutions is the fixed number of threads: no need to spawn and destroy threads on demand (although you could design a thread pool to handle that as well).
For a solution involving moving sockets between threads, the questions are:
When should these resources be passed on? What happens after a worker thread has read a packet? Should it return the socket to the IO thread, or risk a blocking read on the socket for the next packet? If it does a select to poll the socket for more packets, we fall into scenario (2), where each client will have its own I/O thread when there is network traffic from all of them, in which case what is the gain of the initial I/O thread doing the select?
If it passes the socket back, should the IO thread wait for all workers to give back their sockets before initiating another select? If it waits, it takes the risk of making unserved clients wait for packets already in the network buffers, inducing processing lag. If it does not wait, and returns to select to avoid lag on the unserved sockets, then the served ones will have to wait for the next wake-up to see their sockets back in the select pool.
As you can see, the problem is difficult to handle. That's the reason why I recommend exclusive socket ownership by threads, as described in scenarios (1) and (2).
Your solution requires a fixed, relatively small, number of connections.
Create a helper procedure that creates thread procedures that listen on each of the five ports, block on recvfrom(), process the data, and block again. You can then call the helper 15 times to create the threads.
This avoids all polling, and allows Linux to schedule each thread when the I/O completes. No CPU used while waiting, and this can scale to somewhat larger solutions.
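A rough sketch of such a per-socket thread function (handle_packet and the buffer size are hypothetical; the socket is assumed to be created and bound before the thread is started):

#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>

#define MAX_DGRAM 2048   /* assumed maximum datagram size */

void handle_packet(int sock, const char *data, ssize_t len);  /* hypothetical: checksum, tracking, re-request, file I/O */

/* one thread per socket: block in recvfrom(), process, block again */
void *port_thread(void *arg)
{
    int sock = *(int *)arg;
    char buf[MAX_DGRAM];
    struct sockaddr_in peer;
    socklen_t peer_len;
    ssize_t rc;

    for (;;) {
        peer_len = sizeof(peer);
        rc = recvfrom(sock, buf, sizeof(buf), 0,
                      (struct sockaddr *)&peer, &peer_len);
        if (rc < 0)
            continue;                 /* handle errors properly in real code */
        handle_packet(sock, buf, rc);
    }
    return NULL;
}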
If you need to scale massively, why not use a single set of ports and get the partner address from the client_sockaddr_in structure? If the processing takes a material amount of time, you could extend this by keeping a pool of threads available, assigning one each time a message is received to continue processing the message, and adding the thread back to the pool after the response is sent.
I have a multithreaded program written in C: one thread receives multicast data from the network and stores it in a queue, and another thread keeps reading the queue and writing it to a file. Everything works just great, i.e. no data is lost from the multicast network.
Thread 1: Read Multicast data and store it into a queue
Thread 2: Read from queue and write it to file.
Now I have another source of multicast data from the network, so I need another thread to read that data. I just went and added a for loop to create another thread for the multicast data, and now, when the 2 multicast threads switch back and forth, I lose data from the multicast network!
Does anyone have an idea why datagrams are lost when 2 threads are used? Thanks
It is likely you are not using any concurrency mechanisms like semaphores or mutexes. A classic solution is a monitor. A monitor provides a lock to mediate concurrent access and condition signals to let independent processes block until access is available (no busy waiting). In plain English, this means that only one thread can access the data at a time. This prevents a reading thread from reading data that the writing thread has not yet finished writing. It also allows a reading thread to read data that the other reading thread has not yet read. One approach to implement this is to use a read-write mutex and an access semaphore. Each thread that wants to access the data decrements the access semaphore; the thread will either be granted access or sleep until it gets its turn. The read-write mutex will prevent a read thread from reading until some data has been written.
Is there a way to serialize the C write() so that I can write bytes to a socket shared between k threads with no data loss? I imagine that a solution to this problem involves user-space locking; and what about scalability? Thank you in advance.
I think the right answer depends on whether your threads need to synchronously wait for a response or not. If they just need to write some message to a socket and not wait for the peer to respond, I think the best answer is to have a single thread that is dedicated to writing messages from a queue that the other threads place messages on. That way, the worker threads can simply place their messages on the queue and get on with doing something else.
Of course, the queue has to be protected by a mutex but any one thread only has to hold the lock for as long as it is manipulating the queue (guaranteed to be quite a short time). The more obvious alternative of letting every thread write directly to the socket requires each thread to hold the lock for as long as it takes the write operation to complete. This will always be much longer than just adding an item to a queue since write is a system call and potentially, it could block for a long period.
Even if your threads need a response to their messages, it may still pay to do something similar. Your socket servicing thread becomes more complex because you'll have to do something like select() on the socket for reads and writes to stop it from blocking and you'll also need a way to match up messages to responses and a way to inform the threads when their responses have arrived.
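A minimal sketch of that dedicated-writer design (struct out_msg, queue_pop_blocking and writer_thread are assumptions; the queue itself could be the mutex/condvar FIFO sketched earlier in this page):

#include <sys/socket.h>
#include <pthread.h>
#include <stdlib.h>

/* hypothetical message: length-prefixed byte buffer queued by worker threads */
struct out_msg {
    size_t len;
    char   data[1024];
};

/* assumed to exist: a blocking, mutex-protected FIFO of struct out_msg* */
struct out_msg *queue_pop_blocking(void);

/* the single writer thread: only it ever calls send() on the socket */
void *writer_thread(void *arg)
{
    int sock = *(int *)arg;

    for (;;) {
        struct out_msg *m = queue_pop_blocking();
        size_t off = 0;

        /* send() may write less than asked; loop until the whole message is out */
        while (off < m->len) {
            ssize_t n = send(sock, m->data + off, m->len - off, 0);
            if (n <= 0)
                break;              /* handle errors/EINTR properly in real code */
            off += (size_t)n;
        }
        free(m);
    }
    return NULL;
}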
Since POSIX does not seem to specify atomicity guarantees on send(2), you will likely have to use a mutex. Scalability of course goes down the drain with this sort of serialization.
One possible approach would be to use a locking mechanism. Every thread should wait for a lock before writing anything on the socket and should release the lock once it is done.
If all of your threads are sending exactly the same kind of messages, the receiving end will not have any problem reading the data; but if different threads can send different kinds of data with possibly different info, you should have a unique message id associated with each kind of data, and it's better to send the thread id as well (not necessary, but it might help you in debugging small issues).
You can have a structure like:
typedef struct my_socket_data_st
{
int msg_id;
#ifdef __debug_build__
int thread_id;
#endif
size_t data_size_in_bytes;
.... Followed by your data ....
} my_socket_data_t
Scalability depends on a lot of things, including the hardware resources your application runs on. Since it is a network application, you will have to think about the network bandwidth as well. Although there is no OS limitation (there are a few, but I think you can ignore them for now for your application) on sending/receiving data over a socket, you will have to consider making the send synchronous or asynchronous based on your requirements. Also, since you are taking a lock, you will have to think about lock contention as well. If the lock is not easily available to other threads, that will degrade performance by a huge factor.
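A minimal sketch of that locked send, reusing the my_socket_data_t header above (send_message and send_lock are hypothetical names; error and partial-send handling is only hinted at):

#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

static pthread_mutex_t send_lock = PTHREAD_MUTEX_INITIALIZER;

/* each thread calls this instead of writing directly, so the header and
 * payload of one message are never interleaved with another thread's bytes */
int send_message(int sock, int msg_id, const void *data, size_t len)
{
    my_socket_data_t hdr;
    int rc = 0;

    memset(&hdr, 0, sizeof(hdr));
    hdr.msg_id = msg_id;
    hdr.data_size_in_bytes = len;

    pthread_mutex_lock(&send_lock);
    if (send(sock, &hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr) ||
        send(sock, data, len, 0) != (ssize_t)len)
        rc = -1;                  /* handle partial sends/errors in real code */
    pthread_mutex_unlock(&send_lock);
    return rc;
}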