Server Architecture for Embedded Device - c

I am working on a server application for an embedded ARM platform. The ARM board is connected to various digital IOs, ADCs, etc., that the system will continuously poll. It is currently running a Linux kernel with the hardware interfaces developed as drivers. The idea is to have a client application which can connect to the embedded device, receive the sensory data as it is updated, and issue commands to the device (shut down sensor 1, restart sensor 2, etc.). Assume that access to the sensory devices is done through typical ioctl calls.
Now my question relates to the design/architecture of this server application running on the embedded device. At first I was thinking to use something like libevent or libev, lightweight C event handling libraries. The application would prioritize the sensor polling event (and then send the information to the client after the polling is done) and process client commands as they are received (over a typical TCP socket). The server would typically have a single connection but may have up to a dozen or so, but not something like thousands of connections. Is this the best approach to designing something like this? Of the two event handling libraries I listed, is one better for embedded applications or are there any other alternatives?
The other approach under consideration is a multi-threaded application in which the sensor polling is done in a prioritized/blocking thread which reads the sensory data, and each client connection is handled in a separate thread. The sensory data is updated into some sort of buffer/data structure, and the connection threads handle sending out the data to the client and processing client commands (I suppose you would still need an event loop of sorts in these threads to monitor for incoming commands). Are there any libraries or typical packages used which facilitate designing an application like this, or is this something you have to start from scratch?
How would you design what I am trying to accomplish?

I would use a Unix domain socket -- and write the library myself. I can't see any advantage to using libevent since the application is tied to Linux, and libevent is also aimed at hundreds of connections. You can do all of what you are trying to do with a single thread in your daemon. KISS.
You don't need a dedicated master thread for priority queues; you just need to write your threads so that they always process high-priority events before anything else.
In terms of libraries, you will possibly benefit from Google's protocol buffers (for serialization and representing your protocol) -- however, it only has first-class support for C++, and the over-the-wire (serialization) format does a bit of simple bit shifting on numeric data. I doubt it will add any serious overhead. An alternative is ASN.1 (asn1c).
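To make the single-thread suggestion concrete, here is a minimal sketch (not the poster's actual code): one poll() loop that services a Unix domain socket and polls the sensors between socket events. The socket path and read_sensors() are placeholders for the real ioctl-based drivers.

```c
/* Sketch only: single-threaded daemon serving sensor data over a Unix
 * domain socket. The socket path and read_sensors() are placeholders. */
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define SOCK_PATH "/tmp/sensord.sock"   /* hypothetical path */

static void read_sensors(char *buf, size_t len)
{
    /* placeholder for the real ioctl()-based polling */
    snprintf(buf, len, "sensor1=42 sensor2=17\n");
}

int main(void)
{
    int lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(SOCK_PATH);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 8);

    struct pollfd fds[16] = { { .fd = lfd, .events = POLLIN } };
    int nfds = 1;
    char data[128];

    for (;;) {
        /* 100 ms timeout: sensor polling gets priority over client traffic */
        int ready = poll(fds, nfds, 100);
        read_sensors(data, sizeof(data));

        if (ready > 0 && (fds[0].revents & POLLIN) && nfds < 16)
            fds[nfds++] = (struct pollfd){ .fd = accept(lfd, NULL, NULL),
                                           .events = POLLIN };

        for (int i = 1; i < nfds; i++) {
            char cmd[64];
            if (ready > 0 && (fds[i].revents & POLLIN) &&
                read(fds[i].fd, cmd, sizeof(cmd)) <= 0) {
                close(fds[i].fd);            /* client gone */
                fds[i--] = fds[--nfds];
                continue;
            }
            write(fds[i].fd, data, strlen(data));  /* push latest sample */
        }
    }
}
```

Client commands arriving on the socket would be parsed where the read() happens and dispatched to the matching ioctl; that part is left out of the sketch.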

My suggestion would be a modified form of your second proposal. I would create a server that has two threads: one thread polling the sensors, and another for ALL of your client connections. I have used the boost::asio library in embedded devices (MIPS) with great results.
A single thread that handles all socket connections asynchronously can usually handle the load easily (of course, it depends on how many clients you have). It would then serve the data it has from a shared buffer. To reduce the number and complexity of mutexes, I would create two buffers, one 'active' and another 'inactive', and a flag to indicate the current active buffer. The polling thread would read data and put it in the inactive buffer. When it finished and had created a 'consistent' state, it would flip the flag and swap the active and inactive buffers. This could be done atomically and should therefore not require anything more complex than this.
This would all be very simple to set up, since you would pretty much have only two threads that need to know almost nothing about each other.
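A minimal sketch of that two-buffer swap, assuming C11 atomics rather than boost::asio (the flipping logic is the same idea; sensor_sample_t and fill_from_sensors() are placeholders):

```c
/* Sketch of the active/inactive buffer swap described above.
 * sensor_sample_t and fill_from_sensors() are placeholders. */
#include <stdatomic.h>

typedef struct { int values[16]; } sensor_sample_t;

static sensor_sample_t buffers[2];
static atomic_int active = 0;              /* index of the readable buffer */

static void fill_from_sensors(sensor_sample_t *out)
{
    /* placeholder for the real sensor reads */
    for (int i = 0; i < 16; i++)
        out->values[i] = i;
}

/* Polling thread: write into the inactive buffer, then publish it. */
void polling_thread_iteration(void)
{
    int inactive = 1 - atomic_load_explicit(&active, memory_order_relaxed);
    fill_from_sensors(&buffers[inactive]);
    /* release ordering: the data is visible before the flag flips */
    atomic_store_explicit(&active, inactive, memory_order_release);
}

/* Connection thread: always sees a consistent, fully written snapshot. */
const sensor_sample_t *latest_sample(void)
{
    return &buffers[atomic_load_explicit(&active, memory_order_acquire)];
}

int main(void)     /* single-threaded demo: one write, one read */
{
    polling_thread_iteration();
    return latest_sample()->values[0];
}
```

The usual caveat applies: this is only safe with one writer and a reader that is done with the old buffer before the writer reuses it, which holds for the slow-polling, copy-the-snapshot pattern described above.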

Related

Execution Pattern of a Multi-Threaded Server on Linux

I'd like to know what the execution pattern of multiple threads should be for a server implementing the TCP request-response cycle in a high-performance server (e.g. handling dozens of packets with a single system call or none, on Linux, using packet MMAP or some other mechanism).
Design 1) For simplicity, start two threads in main() at the start of the server program. One thread just reads packets directly from the network interface(s) like wlan0/eth0 (using a while loop with poll() on Linux), and once a number of packets have been read in one cycle, it wakes the other thread with a condition-variable signal. After waking up, the other thread (the sender) processes the packets and sends them as TCP responses.
Design 2) Start the receiver thread at the start of the main program. The packet receiver thread reads packets from the interfaces using a while loop and poll(). When a number of packets have been received, it creates a sender thread and passes the number of packets received in that cycle to the sender as a parameter. The sender thread processes the packets and responds over TCP.
(I think Design 2 will be easier to implement, but is there any design issue or possible performance issue with this approach? That is the question.) The buffer passed from the receiver thread to the sender thread needs to be allocated before packets are received, so I know the size of the buffer to allocate in advance. Also, in this execution pattern I am creating a new thread each time (which returns and ends execution after processing its packets and sending the TCP responses). I'd like to know what the performance issues with this approach will be, since I am creating a new thread every time I get a batch of packets from the interfaces.
In the first approach I am not creating more than two threads (or a limited number of threads, which can be tracked easily for logging and debugging since I know how many threads were created initially). In the second approach I don't know how many threads are hanging around and executing concurrently.
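For concreteness, here is a minimal sketch of the hand-off in Design 1, assuming pthreads and stubbed receive/send bodies in place of the real poll()/packet-MMAP path:

```c
/* Sketch of Design 1: receiver signals a long-lived sender thread
 * through a condition variable. The receive/send bodies are stubs. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  batch_ready = PTHREAD_COND_INITIALIZER;
static int pending_packets = 0;         /* packets received in last cycle */

static void *sender(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending_packets == 0)
            pthread_cond_wait(&batch_ready, &lock);
        int n = pending_packets;
        pending_packets = 0;
        pthread_mutex_unlock(&lock);

        printf("processing and sending %d packets as TCP responses\n", n);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, sender, NULL);

    for (;;) {
        /* placeholder for the poll()/recv loop on wlan0/eth0 */
        int received = 32;

        pthread_mutex_lock(&lock);
        pending_packets += received;
        pthread_cond_signal(&batch_ready);
        pthread_mutex_unlock(&lock);
        sleep(1);                       /* stand-in for waiting on poll() */
    }
}
```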
I'd like some advice on how real websites like YouTube and others may have handled this in their high-performance servers, if they followed this way of implementing their front-facing servers.
First, when going to a 'real' website, the magic lies in having load balancers and a whole bunch of worker nodes to take the load; you quickly exceed the boundary of a single system. For example, take a look at the following AWS reference architecture for serving web pages at scale: the AWS Cloud Architecture for serving web whitepaper.
That being said, taking this one level down, it is always interesting to look at how other well-known products have solved this issue. For example, NGINX has an excellent infographic available and a matching blog post describing their architecture and threading.

Should I use processes or threads for my application?

I have an ARM device running a Linux 2.6 kernel, with 64 MB of RAM in total.
There is a data source, which consists of a meter that is queried by the Linux box over RS485, using Modbus as the application protocol.
There is another task that consists of reading these values, building a JSON object, and doing an HTTP POST to a specific server.
Network operation might be slower than serial, especially with poor GPRS coverage.
I need concurrency; the program is written in C.
How would you implement concurrency: using select() or using pthreads?
When analyzing this particular application there's really only one question relevant to choosing pthreads:
Do the sensor reader and network writer need to share an address space?
In this instance I think the answer is clearly "no". Of course that isn't the only possible question, but the only germane one. There are reasons to prefer separate processes:
the two halves of the application have no common code; RS485 is wildly different from HTTP/JSON
segregation of responsibility: if the RS485 side is waiting on a UART, do you really want to block the HTTP side?
letting the OS do its job so you don't have to: with pthreads you have to handle a lot of the synchronization and preemption that the kernel otherwise does for you for free, and code that you don't have to write has no new bugs.
Further analysis would require more detail than you've given, but here is one additional way to think about the choice: threads were invented to mitigate some limitations of the process model. Unless you know that you are going to hit those limitations, use separate processes.
added in response to comments:
I half agree with psusi's suggested design. There need only be two processes: one (let's say the sensor reader, that's a fine choice) which forks one and only one HTTP sender. The two processes can communicate using traditional IPC like a pipe. The sensor process sends data down the pipe when it has some, and the child (HTTP) process packs it up in JSON and sends it on its way.
It only takes two long-lived processes, it uses probably about the same amount of memory as a pthread implementation would, and it is far, far easier to get right.
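A minimal sketch of that two-process layout, with read_meter() and post_json() as placeholders for the RS485/Modbus read and the HTTP POST:

```c
/* Sketch: sensor-reader parent forks a single HTTP-sender child and
 * streams readings to it over a pipe. read_meter() and post_json()
 * are placeholders for the Modbus and HTTP parts. */
#include <stdio.h>
#include <unistd.h>

static int read_meter(void) { return 42; }   /* placeholder RS485 read */

static void post_json(const char *line)      /* placeholder HTTP POST */
{
    printf("POSTing: %s", line);
}

int main(void)
{
    int fd[2];
    if (pipe(fd) < 0) return 1;

    if (fork() == 0) {                 /* child: HTTP sender */
        close(fd[1]);
        FILE *in = fdopen(fd[0], "r");
        char line[256];
        while (fgets(line, sizeof(line), in))
            post_json(line);           /* a slow network can't stall the
                                          parent beyond the pipe's buffer */
        return 0;
    }

    close(fd[0]);                      /* parent: sensor reader */
    FILE *out = fdopen(fd[1], "w");
    for (;;) {
        char json[256];
        snprintf(json, sizeof(json), "{\"meter\":%d}\n", read_meter());
        fputs(json, out);
        fflush(out);
        sleep(5);                      /* whatever the polling interval is */
    }
}
```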
select() is more efficient because it avoids the context switching that comes with multiple threads. And threads would be more efficient than separate processes, because you avoid having to copy the data (unless you set up shared memory, but at that point you might as well have gone with threads). However, non-blocking I/O, as with select(), is harder to write and get right, and doesn't enjoy the multitasking that comes with multiple threads. Multiple processes are likely to be the easiest implementation, especially because you can use curl rather than writing the HTTP POST half yourself.
Why do you need concurrency? Does the meter have to be polled at a strict time interval?
If the answer is YES: just use two processes, one polling the meter data and writing it to a ring buffer in NAND storage, the other reading the data from the ring buffer and sending it over HTTP.
If the answer is NO: you don't need concurrency or non-blocking I/O at all. A big loop in main() is enough.
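If the answer really is NO, the big loop can be as small as this sketch (assuming libcurl for the POST; the meter read and the endpoint URL are made-up placeholders):

```c
/* Sketch of the non-concurrent "big loop": poll the meter, build JSON,
 * POST it, sleep, repeat. read_meter() and the URL are placeholders. */
#include <curl/curl.h>
#include <stdio.h>
#include <unistd.h>

static int read_meter(void) { return 42; }   /* placeholder Modbus read */

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    struct curl_slist *hdrs =
        curl_slist_append(NULL, "Content-Type: application/json");

    for (;;) {
        char json[128];
        snprintf(json, sizeof(json), "{\"meter\":%d}", read_meter());

        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/ingest");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, json);
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 30L);  /* bound slow GPRS */

        if (curl_easy_perform(curl) != CURLE_OK)
            fprintf(stderr, "POST failed, will retry next cycle\n");

        sleep(5);                       /* meter polling interval */
    }
}
```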

Maintaining connections with many real-time devices

I'm writing a program on Linux to control about 1000 patient monitors at the same time over UDP sockets. I've successfully written a library to parse and send messages to collect the data from a single patient monitor device. There are various scheduling constraints on the device, listed below:
Each device must constantly receive an alive-request from the computer client within a maximum time period of 300 milliseconds (may differ for different devices), otherwise the connection is lost.
The computer client must send a poll-request to a device in order to fetch the data within some time period. I'm polling for about 5 seconds of averaged data from the patient monitor, therefore I'm required to send a poll-request every 5 * 3 = 15 seconds. If I fail to send the request within the 15-second time frame, I lose the connection to the device.
Now I'm trying to extend my current program so that it is capable of handling 1000+ devices at the same time. Right now, my program can efficiently handle and parse the response from just one device. In the case of handling multiple devices, it is necessary to synchronize the responses from the different devices, serialize them, and stream them over a TCP socket so that remote computers can also analyze the data. That is not a problem, because it is the well-known multiple-producer, single-consumer problem. My main concern is what approach I should use in order to maintain an alive connection with 1000+ devices.
After reading around the Internet and browsing similar questions on this website, I'm mainly considering two options:
Use one thread per device. In order to control 1000+ devices, I would end up creating 1000+ threads, which does not look feasible to me.
Use a multiplexing approach: select the FDs that require attention and deal with them one at a time. I'm not sure how I would go about it, and whether a multiplexing approach would be able to maintain alive connections with all the devices given the two constraints above.
I need some suggestions and advice on how to deal with this situation where you need to control 1000+ real-time devices over UDP sockets. Each device requires an alive-signal every 300 milliseconds (differs for different devices), and they require a poll request at about 3 times the time interval mentioned during the association phase. For example, patient monitors in the ICU may require real-time (1-second averaged) data whereas patient monitors in general wards may require 10-second averaged data; therefore, the poll periods for those two devices would be 3*1 (3 seconds) and 3*10 (30 seconds) respectively.
Thanks
Shivam Kalra
For the most part either approach is at least functionally capable of handling what you describe, but by the sound of things performance will be a crucial issue. From the figures you have provided it seems that the application could be CPU-bound.
A multithreaded approach has the advantage of using all of the available CPU cores on the machine, but multithreaded programs are notorious for being difficult to make reliable and robust.
You could also use Apache's old tried-and-true forked-worker model: create, say, a separate process to handle a maximum of 100 devices each. You would then need to write code to manage the mapping of connections to processes.
You could also use multiple hosts and some mechanism to distribute devices among them. This would have the advantage of making it easier to handle recovery situations. It sounds like your application could well be mission critical, and it may need to be architected so that if any one piece of hardware breaks then other hardware will take over automatically.
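To make the multiplexing option from the question more concrete, here is a minimal single-threaded sketch: one UDP socket, one epoll loop, and per-device keep-alive deadlines tracked in user space. Device addressing, message formats, and the device count are placeholders.

```c
/* Sketch of the single-threaded multiplexing option: one UDP socket,
 * one epoll loop, per-device keep-alive deadlines tracked in user space. */
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <time.h>

#define NDEV 1000

struct device {
    struct sockaddr_in addr;     /* would be filled in during association */
    int64_t keepalive_ms;        /* e.g. 300, may differ per device */
    int64_t next_due_ms;         /* next time an alive-request is owed */
};

static int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

int main(void)
{
    struct device dev[NDEV] = {0};
    for (int i = 0; i < NDEV; i++) {
        dev[i].keepalive_ms = 300;
        dev[i].next_due_ms = now_ms();
    }

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
    epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

    for (;;) {
        /* wake up for whichever device is owed a keep-alive soonest */
        int64_t t = now_ms(), next = t + 1000;
        for (int i = 0; i < NDEV; i++)
            if (dev[i].next_due_ms < next) next = dev[i].next_due_ms;
        int timeout = next > t ? (int)(next - t) : 0;

        struct epoll_event out;
        int n = epoll_wait(ep, &out, 1, timeout);

        if (n > 0) {                              /* a monitor replied */
            char buf[1500];
            struct sockaddr_in from; socklen_t fl = sizeof(from);
            recvfrom(sock, buf, sizeof(buf), 0,
                     (struct sockaddr *)&from, &fl);
            /* parse, hand off to the producer/consumer queue, etc. */
        }

        t = now_ms();
        for (int i = 0; i < NDEV; i++) {          /* send what is due */
            if (dev[i].next_due_ms > t) continue;
            sendto(sock, "ALIVE", 5, 0,
                   (struct sockaddr *)&dev[i].addr, sizeof(dev[i].addr));
            dev[i].next_due_ms = t + dev[i].keepalive_ms;
        }
    }
}
```

With 1000 devices the linear scan per wakeup is still cheap; a min-heap or timer wheel of deadlines would tighten it if the device count grows.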

Best approach to non blocking server/listening socket in a multi-thread application on Windows?

I'm writing a TCP server/client application on Windows, to become familiar with the Winsock API. I come from a UNIX background and would like to know which of these could be the best approach to implement the application:
First, the specification:
Must scale well on multiprocessor and single-processor systems.
No hardset limit of connections.
Application can both listen for connections, acting as server, and act as client.
Multi threaded.
First approach:
Non-blocking select-like socket for listening, in the 'server' thread.
for each client connecting we spawn a separate thread.
Second approach:
Blocking socket for listening, in the 'server' thread.
for each client connecting we spawn a separate thread.
Third approach:
Non-blocking select-like socket for listening, in the 'server' thread.
No separate thread for each incoming connection, the protocol would need state information kept across sessions I suppose.
I wonder what is the most efficient and scalable approach, and especially if it can work with a UDP socket too.
Note: I'm writing the application in plain old C. No .NET nor C++ involved; C++ exceptions are disabled too.
As Gary says, I/O Completion Ports are the most efficient way to manage multiple network connections in a non-blocking/async manner on Windows platforms.
With IOCP you get notified when your networking operations complete, and you can process these completions with a small number of threads. You get to decide how many threads you allocate to process the completions, and the kernel decides when to use the threads that you're providing. It uses them in LIFO order to reduce context switching, so you end up using the minimal number of threads required at any point and reusing the same threads, rather than cycling through all of the threads that you have available for use.
The asynchronous nature of IOCP programming can be a little confusing to start with, but once you get the hang of it it's fairly straightforward.
I have some free IOCP server code which demonstrates the basics and provides some example servers that are pretty easy to build on. You can find the code here: http://www.serverframework.com/products---the-free-framework.html. That page also links to some articles that I wrote to explain the code.
Relating this to the details of your question: you should be looking at a variation on your third approach. Use AcceptEx() to accept new connections; it can be used in an asynchronous manner, so you don't need a separate thread for connection acceptance and can use the threads that are also processing your overlapped/async read and write operations.
I've written an asynchronous client which does not use blocking sockets, so if you're interested in that approach, then take a look at my client: http://codesprout.blogspot.com/2011/04/asynchronous-http-client.html
It's an HTTP client, but I've shown very little HTTP protocol processing in there; it's all just .NET sockets. The server would work in a similar way: you can take advantage of the *Async methods such as AcceptAsync.
Under Windows, the best performance is achieved by using I/O completion ports.
This is because the lists and queuing mechanism is done in the kernel, far from the heavy user-mode overhead (which drags your code down if you dare to do the hard work yourself).
Unfortunately, Windows I/O completion ports need to allocate many threads to scale, and this quickly kills performance (as compared to Linux epoll, which can scale independently of the number of worker threads you decide to involve in the task).
Recently I discovered http://gwan.com/, a web server which started on Windows and was then ported to Linux, and its authors describe the problem in detail on their forum.

How to scale a TCP listener on modern multicore/multisocket machines

I have a daemon to write in C that will need to handle 20-150K TCP connections simultaneously. They are long-running connections and rarely ever tear down. They have a very small amount of data in transit at any given time (rarely exceeding the MTU even; it's a stimulus/response protocol), but response times to them are critical. I'm wondering what the current UNIX community is using to handle large numbers of sockets while minimizing response latency. I've seen designs revolving around multiplexing connections to forked worker pools, a thread per connection, and static-sized thread pools. Any suggestions?
The easiest suggestion is to use libevent; it makes it easy to write a simple non-blocking single-threaded server that would comply with your requirements.
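For reference, a single-threaded libevent server along those lines can be as small as this sketch (the port and the echo-style response are placeholders for the real protocol):

```c
/* Sketch of the single-threaded libevent option: one event loop, all
 * connections handled by callbacks. Port 9999 is a placeholder. */
#include <event2/event.h>
#include <event2/listener.h>
#include <event2/bufferevent.h>
#include <event2/buffer.h>
#include <netinet/in.h>

static void read_cb(struct bufferevent *bev, void *ctx)
{
    /* stimulus/response: echo the request back as the "response" stub */
    evbuffer_add_buffer(bufferevent_get_output(bev),
                        bufferevent_get_input(bev));
}

static void event_cb(struct bufferevent *bev, short events, void *ctx)
{
    if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
        bufferevent_free(bev);          /* connection torn down */
}

static void accept_cb(struct evconnlistener *lst, evutil_socket_t fd,
                      struct sockaddr *addr, int socklen, void *ctx)
{
    struct bufferevent *bev = bufferevent_socket_new(
        evconnlistener_get_base(lst), fd, BEV_OPT_CLOSE_ON_FREE);
    bufferevent_setcb(bev, read_cb, NULL, event_cb, NULL);
    bufferevent_enable(bev, EV_READ | EV_WRITE);
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct sockaddr_in sin = { .sin_family = AF_INET,
                               .sin_port = htons(9999) };

    evconnlistener_new_bind(base, accept_cb, NULL,
                            LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE,
                            -1, (struct sockaddr *)&sin, sizeof(sin));
    return event_base_dispatch(base);
}
```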
If the processing for each response takes some time, or if it uses some blocking API (like almost anything from a DB), then you'll need some threading.
One answer is worker threads: you spawn a set of threads, each listening on some queue for work. They can be separate processes instead of threads if you like; the main difference would be the communication mechanism used to tell the workers what to do.
A different way is to use several threads and give each of them a portion of those 150K connections. Each will have its own event loop and work mostly like the single-threaded server, except for the listening port, which will be handled by a single thread. This helps spread the load between cores, but if you use a blocking resource, it will block all the connections handled by that specific thread.
libevent lets you use the second way if you're careful; but there's also an alternative: libev. It's not as well known as libevent, but it specifically supports the multi-loop scheme.
If performance is critical then you'll really want to go for a multithreaded event loop solution - i.e. a pool of worker threads to handle your connections. Unfortunately, there is no abstraction library to do this that works on most Unix platforms (note that libevent is only single-threaded, as are most of these event-loop libraries), so you'll have to do the dirty work yourself.
On Linux that means using edge-triggered epoll with a pool of worker threads (Windows would have I/O completion ports, which also work fine in a multithreaded environment - I am not sure about other Unixes).
BTW, I have done some work trying to abstract edge-triggered epoll on Linux and Windows I/O completion ports on http://nginetd.cmeerw.org (it is work in progress, but might provide some ideas).
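A minimal sketch of that pattern on Linux: one shared epoll instance, a small pool of worker threads blocking in epoll_wait(), and EPOLLET | EPOLLONESHOT so each ready socket is handled by exactly one thread at a time. The listening/accept side and the real protocol are left out.

```c
/* Sketch: pool of worker threads sharing one epoll instance.
 * EPOLLET|EPOLLONESHOT means a ready fd is delivered to exactly one
 * thread, which re-arms it when done. Listening/accept is omitted. */
#include <errno.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NWORKERS 4
static int epfd;

static void handle(int fd)
{
    char buf[2048];
    for (;;) {                       /* edge-triggered: drain until EAGAIN */
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) { write(fd, buf, n); continue; }   /* stub response */
        if (n < 0 && errno == EAGAIN) break;
        close(fd);                   /* EOF or error */
        return;
    }
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
                              .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);          /* re-arm */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        struct epoll_event ev;
        if (epoll_wait(epfd, &ev, 1, -1) == 1)
            handle(ev.data.fd);
    }
    return NULL;
}

int main(void)
{
    epfd = epoll_create1(0);
    /* ... create a non-blocking listening socket, accept connections and
     * add each one with EPOLLIN | EPOLLET | EPOLLONESHOT ... */
    pthread_t tids[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
}
```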
If you have system configuration access, don't over-engineer it: set up some iptables/pf/etc. rules to load-balance connections across n daemon instances (processes), as this will work out of the box. Depending on how blocking the daemon is, n should range from the number of cores on the system to several times higher. This approach looks crude but it can handle broken daemons and even restart them if necessary. Migration would also be smooth, as you could start diverting new connections to another set of processes (for example, a new release or a migration to a new box) without service interruptions. On top of that you get several features like source affinity, which can significantly help with caching and with contention from problematic sessions.
If you don't have system access (or ops can't be bothered), you can use a load balancer daemon (there are plenty of open-source ones) instead of iptables/pf/etc., and likewise use n service daemons, as above.
This approach also helps with separating port privileges: if the external service needs to listen on a low port (<1024), only the load balancer needs to run as privileged/admin/root, or in the kernel.
I've written several IP load balancers in the past and it can be very error-prone in production. You don't want to support and debug that. Also, operations and management will tend to second-guess your code more than external code.
I think Javier's answer makes the most sense. If you want to test the theory out, then check out the Node JavaScript project.
Node is based on Google's V8 engine, which compiles JavaScript to machine code and is as fast as C for certain tasks. It is also based on libev and is designed to be completely non-blocking, meaning you don't have to worry about context switching between threads (everything runs on a single event loop). It is very similar to Erlang in that respect.
Writing high-performance servers in JavaScript is now really, really easy with Node. You could also, with a little bit of effort, write your custom code in C and create bindings for Node to call into it to do your actual processing (look at the Node source to see how to do this - documentation is a little sketchy at the moment). As an uglier alternative, you could build your custom C code as an application and use stdin/stdout to communicate with it.
I've tested Node myself with upwards of 150K connections with absolutely no issues (of course you will need some serious hardware if all these connections are going to be communicating at once). A TCP connection in Node.js on average uses only 2-3 KB of memory, so you could theoretically handle 350-500K connections per 1 GB of RAM.
Note - Node.js is not currently supported on Windows, but it is only at an early stage of development and I'd imagine it will be ported at some stage.
Note 2 - you will have to ensure the code you are calling into from Node does not block.
Several systems have been developed to improve on select(2) performance: kqueue, epoll, and /dev/poll. In all these systems, you can have a pool of worker threads waiting for tasks; you will not be forced to set up all the file handles over and over again when done with one of them.
Do you have to start from scratch? You could use something like Gearman.