Monitoring thread performance of a server - C

I have developed a C server using gcc and pthreads that receives UDP packets and, depending on the configuration, either drops them or forwards them to specific targets. In some cases the packets are untouched and simply redirected, in some cases headers in the packet are modified, and in other cases another module of the server modifies every byte of the packet.
To configure this server, there is a GUI written in Java that connects to the C server over TCP (to exchange configuration commands). Multiple GUIs can be connected at the same time.
In order to measure the utilization of the server, I have written a sort of monitoring module that starts two separate threads (#2 and #3). The main thread (#1), which does the whole forwarding work, essentially works like the following:
struct monitoring_struct data; // contains 2 * uint64_t for start and end time, among other fields

for (;;) {
    recvfrom();
    data.start = current_time();
    modifyPacket();
    sendPacket(); // sometimes to multiple destinations
    data.end = current_time();
    writeDataToPipe();
}
The current_time function:
// returns a timestamp with microsecond precision
uint64_t current_time(void)
{
    struct timespec spec;
    clock_gettime(CLOCK_REALTIME, &spec);
    uint64_t ts = (uint64_t) ((((double) spec.tv_sec) * 1.0e6) +
                              (((double) spec.tv_nsec) / 1.0e3));
    return ts;
}
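For comparison, an equivalent integer-only formulation would look like the sketch below (the name current_time_int is illustrative; it truncates the nanosecond fraction instead of rounding, but avoids the double conversions):

// integer-only variant for comparison; needs <stdint.h> and <time.h>
uint64_t current_time_int(void)
{
    struct timespec spec;
    clock_gettime(CLOCK_REALTIME, &spec);
    return (uint64_t) spec.tv_sec * 1000000u + (uint64_t) spec.tv_nsec / 1000u;
}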
As indicated in the main thread, the data struct is written into a pipe, from which thread #2 waits to read. Every time there is data to be read from the pipe, thread #2 uses a given aggregation function that stores the data in another place in memory. Thread #3 is a loop that always sleeps for ~1 sec, then sends out the aggregated values (median, avg, min, max, lower quartile, upper quartile, ...) and then resets the aggregated data. Threads #2 and #3 are synchronized by mutexes.
This aggregated data is sent out via UDP to listeners (there can be more than one); the GUI listens to it (if the monitoring window is open) and converts the numbers into diagrams, graphs and "pressure" indicators.
I came up with this design because, to my mind, it is the solution that interferes least with thread #1 (assuming the server runs on a multicore system, which it always does, and exclusively apart from the OS and maybe SSH).
As performance is critical for my server (version "1.0", with a simpler configuration, was able to manage the maximum number of streams possible over gigabit Ethernet), I would like to ask whether my solution may not be as good as I think it is at ensuring the smallest performance hit on thread #1, and whether you think there would be better designs for that. At least I am unable to think of another solution that is not using locks on the data itself (avoiding the pipe, but potentially blocking thread #1) or a shared list implementation using an rwlock, with possible reader starvation.
There are scenarios where packets are larger, but for performance measuring we currently use the mode where one stream sends exactly 1000 packets per second. We currently want to ensure that version 2.0 can work with at least 12 streams (hence 12,000 packets per second); previously the server was able to manage 84 streams.
In the future I would like to add other milestone timestamps to thread #1, e.g. inside modifyPacket() (there are multiple steps) and before sendPacket().
I have tried tinkering with the current_time() function, mostly trying to remove it to save time by just storing the raw value from clock_gettime(), but in my simple test program the current_time() function always beat the plain clock_gettime() variant.
Thanks in advance for any input.

if you think there would be better designs for that?
The short answer is to use the Data Plane Development Kit (DPDK) with its design patterns and libraries. It might be quite a learning curve, but in terms of performance it is the best solution at the moment. It is free and open source (BSD license).
A bit more detailed answer:
the data struct is written into a pipe
Since threads #1 and #2 are threads of the same process, it would be much faster to pass data using shared memory, not pipes, just as you already do between threads #2 and #3.
thread #2 uses a given aggregation function that stores the data in another place in memory
Those two threads seem unnecessary. Couldn't thread #2 read the data passed by thread #1, aggregate it, and send it out itself?
I am unable to think of another solution that is not using locks on the data itself
Have a look at lockless queues, which are called "rings" in DPDK. The idea is to have a common circular buffer between threads and use lockless algorithms to enqueue/dequeue to/from the buffer.
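For illustration, a minimal single-producer/single-consumer ring in C11 could look like the sketch below (this is not DPDK's actual rte_ring; the payload is reduced to the two timestamps from your struct, and the size is illustrative). It works without locks because thread #1 is the only writer of ring_head and thread #2 the only writer of ring_tail:

/* minimal SPSC ring sketch; needs a C11 compiler */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024 /* must be a power of two */

struct monitoring_struct { uint64_t start, end; };

static struct monitoring_struct ring[RING_SIZE];
static _Atomic uint32_t ring_head; /* only written by the producer */
static _Atomic uint32_t ring_tail; /* only written by the consumer */

/* producer (thread #1): returns false if the ring is full */
bool ring_push(const struct monitoring_struct *d)
{
    uint32_t head = atomic_load_explicit(&ring_head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&ring_tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false; /* full: drop or count the lost sample */
    ring[head & (RING_SIZE - 1)] = *d;
    atomic_store_explicit(&ring_head, head + 1, memory_order_release);
    return true;
}

/* consumer (thread #2): returns false if the ring is empty */
bool ring_pop(struct monitoring_struct *d)
{
    uint32_t tail = atomic_load_explicit(&ring_tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&ring_head, memory_order_acquire);
    if (head == tail)
        return false;
    *d = ring[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&ring_tail, tail + 1, memory_order_release);
    return true;
}

Note that ring_push never blocks thread #1: when the ring is full the sample is simply dropped, which is usually acceptable for monitoring data.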
We currently want to ensure that version 2.0 can work with at least 12 streams (hence 12,000 packets per second); previously the server was able to manage 84 streams.
Measure the performance and find the bottlenecks (it seems you are still not 100% sure what the bottleneck in the code is).
Just for reference, Intel publishes performance reports for DPDK. The reference numbers for L3 forwarding (i.e. routing) are up to 30 million packets per second.
Sure, you might have a less powerful processor and NIC, but a few million packets per second are reachable quite easily using the right techniques.

Related

Compute data from multiple clients simultaneously

I'm trying to write a server in C able to handle multiple (more than a thousand) client connections concurrently. Every connection is meant to accomplish three things:
Send data to the server
The server processes the data
The server returns data to the client
I am using non-blocking sockets and epoll() to handle all the connections. My problem arises right after the server receives the data from one client: it has to call a function that spends several seconds processing the data before it returns the result, which then has to be sent back to the client before closing the connection.
My question is, what paradigm can I use in order to be able to keep handling more connections while the data of one client "is cooking"?
I've been researching the possibility of creating a thread or a process every time I need to call the computing function, but I'm not sure this would be feasible given the number of possible concurrent connections. That's why I came here, expecting that someone more experienced than me in the matter could shed some light on my ignorance.
Code snippet:
while (1)
{
    ssize_t count;
    char buf[512];

    count = read(events[i].data.fd, buf, sizeof buf); // read the data
    if (count == -1)
    {
        /* If errno == EAGAIN, that means we have read all
           data. So go back to the main loop. */
        if (errno != EAGAIN)
        {
            perror("read");
            done = 1;
        }
        /* Here is where I should call the processing function before
           exiting the loop and closing the actual connection */
        answer = proc_function(buf);
        /* send the answer to the client; answer_len stands for the actual
           answer length (sizeof answer would only give the pointer size) */
        count = write(events[i].data.fd, answer, answer_len);
        break;
    }
...
Thanks in advance.
It seems sensible to multi-thread or multi-process to some degree to accomplish this. The degree to which you multi-thread or multi-process is the question.
1) You could dump the polling system entirely and use a thread/process per connection. That thread can then stall as long as it wants while working on the processing for that connection. You'd then have to decide between creating/killing a thread/process each time (probably easiest) and having a pool of threads/processes (probably fastest).
2) You could have a thread/process for the networky bits and hand off the processing to one other thread. This is less parallel, but it does mean you can at least keep handling network connections whilst you're chopping through the list of work. This gives you control of what processing is being handled at least. It would be easy to prioritise incoming connections this way, whereas option 1 might not.
3) (sort of a mix of 1 & 2) You could use asynchronous I/O to multiplex your connections. You still have to handle the processing in the same way as in 1 & 2 above.
You also have the question of threads vs processes. Threads are probably quicker to get going but it's more difficult to ensure data integrity. Processes are going to be more resilient but require more interfacing between them.
You also have to decide on a way to pass data between the threads/processes. This is less of an issue for option 1 as you only have to pass off the connection to the thread. Option 2 may (depending on what your data is) be more of a problem. You could use a message queue for passing the messages about but if you have a lot of data to send shared memory is more appropriate. Shared memory is a pain to engineer for processes but easy with threads (as all threads share the same memory space).
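As a minimal sketch of option 2 under pthreads (all names here are illustrative, and a pool would simply run several workers on the same queue):

#include <pthread.h>
#include <stdlib.h>

struct work_item {
    int fd;                 /* connection to answer on */
    char buf[512];          /* copy of the received request */
    struct work_item *next;
};

static struct work_item *head, *tail;  /* FIFO of pending work */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* called from the network thread once a complete request has been read */
void enqueue_work(struct work_item *w)
{
    w->next = NULL;
    pthread_mutex_lock(&q_lock);
    if (tail) tail->next = w; else head = w;
    tail = w;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* worker thread body: block until work arrives, then process it */
void *worker(void *arg)
{
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (head == NULL)
            pthread_cond_wait(&q_cond, &q_lock);
        struct work_item *w = head;
        head = w->next;
        if (head == NULL)
            tail = NULL;
        pthread_mutex_unlock(&q_lock);

        /* the slow processing runs here, outside the polling loop;
           e.g. process w->buf and write() the answer back on w->fd */
        free(w);
    }
    return NULL;
}

The polling loop stays responsive because it only copies the request into a work_item and returns to polling; back-pressure (bounding the queue) is left out for brevity.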
There are performance issues as you get to this scale, too. It's worth investigating the performance characteristics of these things: the difference in how calls like select and poll scale is significant when you're dealing with a lot of connections.
Without knowledge of what data is being sent and received it's hard to give solid recommendations.
Incidentally, this isn't a new problem. Dan Kegel had a good article about it a few years back. It's now out of date, but the overview is still good; you should research the current state of the art for the concepts he discusses, though.

Overlapping communications with computations in MPI (mvapich2) for large messages

I have a very simple code: a data decomposition problem in which, in a loop, each process sends two large messages to the ranks before and after itself at each cycle. I run this code on a cluster of SMP nodes (AMD Magny-Cours, 32 cores per node, 8 cores per socket). I have been optimizing this code for a while now. I have used pgprof and tau for profiling, and it looks to me like the bottleneck is the communication. I have tried to overlap the communication with the computations in my code, but it looks like the actual communication only starts when the computations finish :(
I use persistent communication in ready mode (MPI_Rsend_init), and between MPI_Startall and MPI_Waitall the bulk of the computation is done. The code looks like this:
void main(int argc, char *argv[])
{
    some definitions;
    some initializations;

    MPI_Init(&argc, &argv);
    MPI_Rsend_init( channel to the rank before );
    MPI_Rsend_init( channel to the rank after );
    MPI_Recv_init( channel to the rank before );
    MPI_Recv_init( channel to the rank after );

    for (timestep = 0; timestep < Time; timestep++)
    {
        prepare data for send;
        MPI_Startall();
        do computations;
        MPI_Waitall();
        do work on the received data;
    }

    MPI_Finalize();
}
Unfortunately the actual data transfer does not start until the computations are done, and I don't understand why. The network uses a QDR InfiniBand interconnect and mvapich2. Each message is 23 MB (46 MB is sent in total). I tried to switch the message passing to eager mode, since the memory in the system is large enough. I use the following flags in my job script:
MV2_SMP_EAGERSIZE=46M
MV2_CPU_BINDING_LEVEL=socket
MV2_CPU_BINDING_POLICY=bunch
This gives me an improvement of about 8%, probably because of better placement of the ranks inside the SMP nodes, but the problem with the communication remains. My question is: why can't I effectively overlap the communications with the computations? Is there any flag that I should use and am missing? I know something is wrong, but whatever I have done has not been enough.
Given the order of the ranks inside the SMP nodes, the actual message size between the nodes is also 46 MB (2 x 23 MB), and the ranks form a loop. Can you please help me? To see which flags other users use I have checked /etc/mvapich2.conf, but it is empty.
Is there any other method I should use? Do you think one-sided communication would give better performance? I feel there is a flag or something that I'm not aware of.
Thanks a lot.
There is something called progression of operations in MPI. The standard allows non-blocking operations to be progressed to completion only once the proper testing/waiting call has been made:
A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.
(words in bold are also bolded in the standard text; emphasis added by me)
Although this text comes from the section about non-blocking communication (§3.7 of MPI-3.0; the text is exactly the same in MPI-2.2), it also applies to persistent communication requests.
I haven't used MVAPICH2, but I can speak about how things are implemented in Open MPI. Whenever a non-blocking operation is initiated or a persistent communication request is started, the operation is added to a queue of pending operations and is then progressed in one of two possible ways:
if Open MPI was compiled without an asynchronous progression thread, outstanding operations are progressed on each call to a send/receive or to some of the wait/test operations;
if Open MPI was compiled with an asynchronous progression thread, operations are progressed in the background even if no further communication calls are made.
The default behaviour is not to enable the asynchronous progression thread, as doing so somewhat increases the latency of the operations.
The MVAPICH site is unreachable from here at the moment, but earlier I saw a mention of asynchronous progress in the features list. That is probably where you should start: search for ways to enable it.
Also note that MV2_SMP_EAGERSIZE controls the shared memory protocol eager message size and does not affect the InfiniBand protocol, i.e. it can only improve the communication between processes that reside on the same cluster node.
By the way, there is no guarantee that the receive operations will be started before the ready send operations in the neighbouring ranks, so they might not function as expected; the ordering in time is very important for ready mode.
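If asynchronous progression cannot be enabled, a common workaround is to slice the computation into chunks and call MPI_Testall in between, which gives the library explicit opportunities to progress the transfers. A sketch (NCHUNKS and compute_chunk() are placeholders for however you slice your computation):

MPI_Request reqs[4]; /* the four persistent requests set up earlier */
MPI_Startall(4, reqs);
for (int chunk = 0; chunk < NCHUNKS; chunk++) {
    compute_chunk(chunk);                             /* one slice of the computation */
    int done;
    MPI_Testall(4, reqs, &done, MPI_STATUSES_IGNORE); /* progress hook */
}
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

Whether this helps depends on how the implementation progresses large transfers, but it is cheap to try.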
For MPICH, you can set the MPICH_ASYNC_PROGRESS=1 environment variable when running mpiexec/mpirun. This will spawn a spare thread which does the "asynchronous progress" work:
MPICH_ASYNC_PROGRESS - Initiates a spare thread to provide
asynchronous progress. This improves progress semantics for
all MPI operations including point-to-point, collective,
one-sided operations and I/O. Setting this variable would
increase the thread-safety level to
MPI_THREAD_MULTIPLE. While this improves the progress
semantics, it might cause a small amount of performance
overhead for regular MPI operations.
from MPICH Environment Variables
I have tested this on my cluster with MPICH 3.1.4, and it worked! I believe MVAPICH will also work.

Suggestion for callbacks oriented library in C

I'm making a small library for controlling various embedded devices using C. I'm using UDP sockets to communicate with each of the devices. The devices send me various interesting data, alarms and notifications, and at the same time they send some data that is used internally by the library but may not be interesting to users. So I've implemented a callback approach, where the user can register a callback function for the interesting events on each of the devices. Right now, the overall design of this library is something like this:
I have two threads running.
In one thread there is an infinite while event loop that uses select and non-blocking sockets to maintain the communication with each of the devices.
Basically, every time I receive a packet from any of the devices, I strip off the header, which is 20 bytes of some useless information, and add my own header containing DEVICE_ID, REQUEST_TIME (the time the request was sent to retrieve that packet), RETRIEVAL_TIME (the time the packet actually arrived), REQUEST_ID and REQUEST_TYPE (alarm, data, notification etc.).
Now this thread (the one with the infinite loop) puts the packet with the new header into a ring buffer and then notifies the other thread (thread #2) to parse this information.
In thread #2, when the notification is received, it locks the buffer, pops the packet and starts parsing it.
Every message contains some information the user may not be interested in, so I'm providing the user callback approach to act upon the data which is useful to the user.
Basically, I'm doing something like this in thread #2:
THREAD #2

wait(data_put_in_buffer_cond)
lock(buffer_mutex)
packet_t* packet = pop_packet_from_buffer(buf);
unlock(buffer_mutex)

/* parsing the packet... */
parsed_packet_t* parsed_packet = parse_and_change_endianess(packet->data);

/* the header put by thread #1 is in host byte order, no parsing necessary */
header_t* header = get_header(packet);

/* thread #1 sets a free callback for the kind of packet it puts in the buffer;
 * this is not a critical section of the buffer, so it is fine without locks */
buffer.free_callback(packet);

foreach attribute in parsed_packet->attribute_list {
    register_info_t* rinfo = USER_REGISTERED_EVENT_TABLE[header->device_id][attribute.attr_id];
    /* the user is registered for this attribute ID on this device ID */
    if (rinfo != NULL) {
        rinfo->callback(packet);
    }
    // Do some other stuff with this attribute...
}

free(parsed_packet);
Now, my concern is: what will happen if the callback function that the user implements takes some time to complete, and meanwhile packets are dropped because the ring buffer is in overwriting mode? I've tested my API with 3 to 4 devices, and I don't see many drop events even if the callback function waits a decent amount of time, but I suspect this approach may not be the best.
Would it be a better design to use some sort of thread pool to run the user callback functions? In that case I would need to make an explicit copy of each packet before I send it to the user callback. Each packet is about 500 to 700 bytes, and I get around 2 packets per second from each device. Any suggestions or comments on improving the current design or solving these issues would be appreciated.
Getting 500-700 bytes per device is not a problem at all, especially if you only have 3-4 devices. Even if you had, let's say, 100 devices, it should not be a problem. The copy overhead would most probably be negligible. So my suggestion would be: do not try to optimize beforehand until you are certain that buffer copying is your bottleneck.
About losing packets: as you say in your question, you are already using a ring buffer (I assume that is something like a circular queue, right?). If the queue becomes full, then you just need to make thread #1 wait until there is some available space in the queue, as sketched below. Clearly, more events from different devices may arrive in the meantime, but that should not be a problem: once you have space again, select will let you know that you have data available from different devices, so you will just need to process all that data. Of course, in order to have a balanced system, you can set the size of the queue to a value that reduces as much as possible the number of times the queue is full and thread #1 needs to wait.
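A sketch of such a bounded, blocking hand-off (packet_t as in your snippet; the queue size is illustrative):

#include <pthread.h>
#include <stddef.h>

#define QUEUE_SIZE 256

static packet_t *queue[QUEUE_SIZE];
static size_t q_head, q_tail, q_count;
static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;

void queue_put(packet_t *p)      /* thread #1: blocks instead of overwriting */
{
    pthread_mutex_lock(&q_mutex);
    while (q_count == QUEUE_SIZE)
        pthread_cond_wait(&q_not_full, &q_mutex);
    queue[q_head] = p;
    q_head = (q_head + 1) % QUEUE_SIZE;
    q_count++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_mutex);
}

packet_t *queue_get(void)        /* thread #2: blocks while the queue is empty */
{
    pthread_mutex_lock(&q_mutex);
    while (q_count == 0)
        pthread_cond_wait(&q_not_empty, &q_mutex);
    packet_t *p = queue[q_tail];
    q_tail = (q_tail + 1) % QUEUE_SIZE;
    q_count--;
    pthread_cond_signal(&q_not_full);
    pthread_mutex_unlock(&q_mutex);
    return p;
}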

One Socket Multiple Threads

I'm coding part of a somewhat complex communication protocol to control multiple medical devices from a single computer terminal. The computer terminal needs to manage about 20 such devices. Every device uses the same protocol for communication, called DEP. Now, I've created a loop that multiplexes between the different devices to send requests and receive the patient data associated with a particular device. The structure of this loop, in general, is something like this:
Begin Loop
    Select Device i
        if Device.Socket has Data
            Strip Header
            Copy Data on Queue
        end if
        rem_time = TIMEOUT - (CurrentTime - Device.Session.LastRequestTime)
        if rem_time <= 0
            Send Re-association Request to Device
        else
            Sort Pending Requests According to Time
            Select First Request
            Send the Request
            Set Request Priority Least
        end if
    end Select
end Loop
I might have made some mistakes in the above pseudo-code, but I hope I've made clear what this loop is trying to do. I have a priority list structure that selects the device and the pending request for that device, so that all the requests and devices are selected at good, optimal intervals.
I forgot to mention: the above loop does not actually parse the received data; it only strips off the header and puts the data in a queue. The data in the queue is parsed in a different thread and recorded in a file or database.
I wish to add a feature so that other computers can also import the data and control the devices attached to the computer terminal remotely. For this, I would need to create a socket that listens for commands in this INFINITE LOOP and sends the data in the different thread where the PARSING is performed.
Now, my question to all the concurrency experts is:
Is it a good design to use a single socket for reading and writing in two different threads, where each thread is strictly involved in either reading or writing, but not both? Also, I believe a socket is synchronized at the process level, so do I need locks to synchronize reads and writes over one socket from different threads?
There is nothing inherently wrong with having multiple threads handle a single socket; however, there are many good and bad designs based around this one very general idea. If you do not want to rediscover the problems as you code your application, I suggest you search around for designs that best fit your planned particular style of packet handling.
There is also nothing inherently wrong with having a single thread handle a single socket; however, if you put the logic handling on that thread, then you have selected a bad design, as that thread cannot handle requests while it is "working" on the last request.
In your particular code, you might have an issue. If your packets support fragmentation, or even if your algorithm gets a little ahead of the hardware due to timing issues, you might have just part of the packet "received" in the buffer. In that case, your algorithm will fail in two ways.
It will process a partial packet, one which has only the first part of its data.
It will mis-process the subsequent packet, as the information in the buffer will not start with a valid packet header.
Such failures are difficult to conceive and diagnose until they are encountered. Perhaps your library already buffers and splits messages, perhaps not.
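One common defence is to buffer incoming bytes per connection and hand only complete frames to the parser. A sketch, assuming a 2-byte big-endian length prefix (DEP's actual framing may differ):

#include <stdint.h>
#include <string.h>

struct conn_buf {
    uint8_t data[4096]; /* accumulated, not-yet-framed bytes */
    size_t  used;
};

/* append newly received bytes, then extract every complete frame */
void feed_bytes(struct conn_buf *cb, const uint8_t *in, size_t n,
                void (*on_frame)(const uint8_t *frame, size_t len))
{
    if (n > sizeof cb->data - cb->used)
        n = sizeof cb->data - cb->used; /* real code: grow or signal an error */
    memcpy(cb->data + cb->used, in, n);
    cb->used += n;

    for (;;) {
        if (cb->used < 2)
            break; /* length field itself is incomplete */
        size_t len = ((size_t) cb->data[0] << 8) | cb->data[1];
        if (cb->used < 2 + len)
            break; /* frame body incomplete: wait for more bytes */
        on_frame(cb->data + 2, len);
        memmove(cb->data, cb->data + 2 + len, cb->used - 2 - len);
        cb->used -= 2 + len;
    }
}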
In short, your design is not dictated by how many threads are accessing your socket: how many threads access your socket is dictated by your design.

Socket client/server input/output polling vs. read/write in linux

Basically, I set up a test to see which method is the fastest way to get data from another computer on my network, for a server with only a few clients (10 at most, 1 at minimum).
I tried two methods, both done in a thread-per-client fashion, and looped the read 10000 times. I timed the loop from the creation of the threads to the joining of the threads right after. In my threads I used these two methods; both used standard read(2)/write(2) calls and SOCK_STREAM/AF_INET:
In one, I polled for data in my client, reading (non-blocking) whenever data was available; in my server, I sent data instantly whenever I got a connection. My thread returned on a read of the correct number of bytes (which happened every time).
In the other, my client sent a message to the server on connect, and my server sent a message to my client on a read (both sides blocked here to make this more turn-based and synchronous). My thread returned after my client read.
I was pretty sure polling would be faster. I made a histogram of the times to complete threads and, as expected, polling was faster by a slight margin, but two things about the read/write method were unexpected. Firstly, the read/write method gave me two distinct time spikes, i.e. some event occasionally occurred which slowed the read/write down by about .01 microseconds. I ran this test through a switch initially and thought this might be a collision of packets, but then I ran the server and client on the same computer and still got these two distinct time spikes. Does anyone know what event may be occurring?
The other: my read function sometimes returned too many bytes, and some of the bytes were garbage. I know streams don't guarantee you'll get all the data correctly, but why would the read function return extra garbage bytes?
It seems you are confusing the purpose of these two alternatives:
The connection-per-thread approach does not need polling (unless your protocol allows for a random sequence of messages either way, which would be very confusing to implement). Blocking reads and writes will always be faster here, since you skip one extra system call to select(2)/poll(2)/epoll(7).
The polling approach allows you to multiplex I/O on many sockets/files in a single-threaded or fixed-number-of-threads setup. This is how web servers like nginx handle thousands of client connections in very few threads. The idea is that waiting on any given file descriptor does not block the others: you wait on all of them.
So I would say you are comparing apples and goblins :) Take a look here:
High Performance Server Architecture
The C10K problem
libevent
As for the spikes: check whether TCP gets into retransmission mode, i.e. whether one of the sides is not reading fast enough to drain the receive buffers, and play with the SO_RCVBUF and SO_SNDBUF socket options.
Too many bytes is definitely wrong; it looks like API misuse. Check whether you are comparing signed and unsigned numbers, and compile with a high warning level.
Edit:
It looks like you are caught between two separate issues: data corruption and data transfer performance. I would strongly recommend focusing on the first one before tackling the second. Reduce the test to a minimum and try to figure out what you are doing wrong with the sockets, i.e. where that garbage data comes from. Do you check the return values of the read(2) and write(2) calls? Do you share buffers between threads? Paste the reduced code sample into the question (or provide a link to it) if you are really stuck.
Hope this helps.
I know streams don't guarantee you'll get all the data correctly, but why would the read function return extra garbage bytes?
Actually, streams do guarantee that you will get all the data, correctly and in order. Datagrams (UDP), i.e. SOCK_DGRAM, are what you were thinking of, and that is not what you are using. Within AF_INET, SOCK_STREAM means TCP, and TCP means reliable.
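On that note, a frequent cause of "extra garbage bytes" is using the buffer beyond what read(2) actually returned, since a stream socket may deliver fewer bytes than requested. A sketch of a receive loop that honours every return value (the helper name read_exact is illustrative):

#include <errno.h>
#include <unistd.h>

/* read exactly len bytes from a blocking stream socket;
   returns the byte count actually read, or -1 on error */
ssize_t read_exact(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *) buf + got, len - got);
        if (n == 0)
            return (ssize_t) got;  /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted: retry */
            return -1;             /* real error: caller inspects errno */
        }
        got += (size_t) n;
    }
    return (ssize_t) got;
}

Only the first got bytes of buf are ever valid; anything past that is whatever happened to be in the buffer before.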
