I have a server (TCP), implemented in "C", running on modern Linux servers.
The server handles a large number of requests (>50 req/second). It is single-threaded because it integrates 3rd-party code that is NOT thread-safe (and cannot be modified). Some requests are "fast" (milliseconds), and some are "slow" (requiring a lot of processing).
Over time, the heap of the "main" server process grows significantly (>8 GB).
The current logic is to conditionally fork() the "slow" requests and handle the "fast" requests in the main thread.
I'm trying to optimize the performance of the "slow" requests, because I found that forking becomes expensive once the "main" server's heap is large. I'm trying to use posix_spawn() (on modern Linux) to create a fresh instance of the server, passing minimal information to it via the command line and letting it read the payload from a file descriptor. Basically:
read request header (short - about 50 bytes)
if slow-request then
    if "big-heap" then
        spawn a new instance of the server, pass the header as command-line parameters
        read the request data via the passed fd (stdin?)
    else
        read the request data (in the forked process)
    process the request - in parallel with the main thread
    exit
else
    run in the main thread
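For concreteness, the spawn branch I have in mind looks roughly like the sketch below (the "--worker" flag, the header string format, and the error handling are placeholders, not how my server actually works): posix_spawn() a fresh copy of the executable, dup the request socket onto its stdin via file actions, and pass the already-parsed header on the command line.

/* Sketch of the "big heap" branch: start a fresh copy of this executable
 * with posix_spawn(), give it the request socket as its stdin, and pass
 * the parsed header on the command line. The path, the "--worker" flag
 * and the header string are placeholders. */
#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

extern char **environ;

int spawn_slow_worker(const char *self_path, const char *header_str, int request_fd)
{
    pid_t pid;
    posix_spawn_file_actions_t fa;

    posix_spawn_file_actions_init(&fa);
    /* The child reads the request payload from stdin = the request socket. */
    posix_spawn_file_actions_adddup2(&fa, request_fd, STDIN_FILENO);

    char *argv[] = { (char *)self_path, "--worker", (char *)header_str, NULL };

    int err = posix_spawn(&pid, self_path, &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);

    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return -1;
    }
    /* The parent can close its copy of the fd; the child owns it now. */
    close(request_fd);
    return 0;
}

The child would then parse argv, read the payload from stdin (the duplicated socket), and process the request on its own.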
My Question:
Does anyone have experience with this approach? Specifically, what is a good rule for deciding when to execute a new instance of the process instead of fork()?
Any pitfalls to be aware of?
Is there any existing benchmarking for defining the "big-heap" threshold?
I have not dug too deeply into how "spawn" works on Linux, but I believe it is smarter/more efficient than a plain fork/exec.
Related
I would like to know what the execution pattern of multiple threads should be for a high-performance server implementing a TCP request-response cycle (for example, reading dozens of packets with a single system call, or none, on Linux using PACKET_MMAP or some other mechanism).
Design 1) For simplicity, start two threads in main at the start of the server program. One thread just reads packets directly from the network interface(s), e.g. wlan0/eth0, in a loop using poll() on Linux. Once a number of packets have been read in one cycle, it wakes the other thread with a condition-variable signal. After waking up, the other thread (the sender) processes the packets and sends them as TCP responses. (A minimal sketch of this hand-off appears after the discussion below.)
Design 2) Start only the receiver thread at the start of the main program. The receiver thread reads packets from the interfaces in a loop using poll(). When a number of packets have been received, it creates a sender thread and passes the number of packets received in that cycle to the sender as a parameter. The sender thread processes the packets and responds with the TCP responses.
(I think Design 2 will be easier to implement, but is there any design issue or possible performance issue with this approach? That is the question.) The buffer passed from the receiver thread to the sender thread needs to be allocated before the packets are received, so I know the size of the buffer to allocate. Also, in this execution pattern I am creating a new thread each time (which returns and ends execution after processing the packets and sending the TCP responses). I would like to know what the performance issues with this approach will be, since I am creating a new thread every time I get a batch of packets from the interfaces.
In the first approach I am not creating more than two threads (or a limited number of threads, which can be tracked easily for logging and debugging since I know how many threads are created initially). In the second approach I don't know how many threads are hanging around and executing concurrently.
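To make Design 1 concrete, here is a minimal sketch of the two-thread hand-off I have in mind (the packet I/O is faked with sleep(); BATCH and the printf are placeholders for real processing):

/* Design 1 sketch: one receiver thread "reads" a batch of packets and wakes
 * a sender thread with a condition variable. The actual packet I/O
 * (poll()/PACKET_MMAP) is replaced with sleep() so only the hand-off
 * logic is shown. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define BATCH 32

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t batch_ready = PTHREAD_COND_INITIALIZER;
static int pending = 0;               /* packets received but not yet sent */

static void *receiver(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);                     /* stand-in for poll() + packet read */
        pthread_mutex_lock(&lock);
        pending += BATCH;             /* pretend we read a batch */
        pthread_cond_signal(&batch_ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *sender(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending == 0)
            pthread_cond_wait(&batch_ready, &lock);
        int n = pending;
        pending = 0;
        pthread_mutex_unlock(&lock);

        printf("processing and sending %d packets\n", n);  /* TCP responses here */
    }
    return NULL;
}

int main(void)
{
    pthread_t rx, tx;
    pthread_create(&rx, NULL, receiver, NULL);
    pthread_create(&tx, NULL, sender, NULL);
    pthread_join(rx, NULL);
    pthread_join(tx, NULL);
    return 0;
}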
I would like advice on how real websites like YouTube and others may have handled this in their high-performance, front-facing servers, if they followed this way of implementing them.
First, for a 'real' website the magic lies in having load balancers and a whole bunch of worker nodes to take the load, and you quickly exceed the boundary of a single system. For example, take a look at the following AWS reference architecture for serving web pages at scale: the AWS Cloud Architecture for serving web whitepaper.
That being said, taking this one level down, it is always interesting to look at how other well-known products have solved this issue. For example, NGINX has an excellent infographic available and a matching blog post describing its architecture and threading.
I've been looking into how I could embed languages (let's use Lua as an example) in Erlang. This of course isn't a new idea, and there are many libraries out there that can do this. However, I was wondering if it is possible to start a GenServer whose state is modified by Lua. This means that once you start the GenServer, it will start a (long-running) Lua process to manipulate the GenServer's state. I know this is possible as well, but I was wondering if I could spawn 1,000, 10,000 or even 100,000 of these processes.
I'm not really familiar with this topic but I have done some research.
(Please correct me if I'm wrong on any of these options).
TLDR; Skip to the last paragraph.
First option: NIFs:
This doesn't seem like an option since it will block the Erlang scheduler running the current process. If I want to spawn a large number of these, it will freeze the entire runtime.
Second option: Port Driver:
It's like a NIF, but it communicates by sending data to a specified port, which can also send data back to Erlang. This is nice, although it also seems to block the scheduler. I've tried a library that does the boilerplate for you as well, but that seemed to block the scheduler after spawning 10 processes. I've also looked into the postgresql example in the Erlang documentation, which is said to be async, but I couldn't get the example code to work (R13?). Is it even possible to run that many port driver processes without blocking the runtime?
Third option: C Nodes:
I thought this was very interesting and wanted to try it out, but apparently the "erlang-lua" project already does this. It's nice because it won't crash your Erlang VM if something goes wrong, and the processes are isolated. But in order to spawn a single process you need to spawn an entire node. I have no idea how expensive that is, nor am I sure what the limit is on connecting nodes in a cluster, but I don't see myself spawning 100,000 C nodes.
Fourth option: Ports:
At first I thought this was the same as a port driver, but it's actually different. You spawn a process that executes an application and communicates through STDIN and STDOUT. This would work well for spawning a large number of processes, and (I think?) they aren't a threat to the Erlang VM. But if I'm going to communicate through STDIN/STDOUT, why even bother with an embeddable language to begin with? I might as well use any other scripting language.
And so, after much research in a field I'm not familiar with, I've come to this. You could use a GenServer as an "entity" whose AI is written in Lua, which is why I'd like to have a process for each entity. My question is: how do I achieve spawning many GenServers that communicate with long-running Lua processes? Is this even possible? Should I be tackling my problem differently?
If you can make the Lua code — or more accurately, its underlying native code — cooperate with the Erlang VM, you have a few choices.
Consider one of the most important functions of the Erlang VM: managing the execution of a (potentially large number of) Erlang's lightweight processes across a relatively small set of scheduler threads. It uses several techniques to know when a process has used up its timeslice or is waiting and so should be scheduled out to give another process a chance to run.
You seem to be asking how you can get native code to run however it likes within the VM, but as you've already hinted, the reason native code can cause problems for the VM is that it has no practical way to stop the native code from completely taking over a scheduler thread and thus preventing regular Erlang processes from executing. Because of this, native code has to cooperatively yield the scheduler thread back to the VM.
For older NIFs, the choices for such cooperation were:
Keep the amount of time NIF calls ran on a scheduler thread to 1ms or less.
Create one or more private threads. Transition each long-running NIF call from its scheduler thread over to a private thread for execution, then return the scheduler thread to the VM.
The problems here are that not all calls can complete in 1ms or less, and that managing private threads can be error-prone. To get around the first problem, some developers would break the work down into chunks and use an Erlang function as a wrapper to manage a series of short NIF calls, each of which completed one chunk of work. As for the second problem, well, sometimes you just can't avoid it, despite its inherent difficulty.
NIFs running on Erlang 17.3 or later can also cooperatively yield the scheduler thread using the enif_schedule_nif function. To use this feature, the native code has to be able to do its work in chunks such that each chunk can complete within the usual 1ms NIF execution window, similar to the approach mentioned earlier but without the need to artificially return to an Erlang wrapper. My bitwise example code provides many details about this.
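For example, here is a stripped-down sketch (not the bitwise code itself; the module name, function name, and chunk size are made up) of a NIF that carries its remaining work in its arguments and reschedules itself with enif_schedule_nif:

/* A minimal chunked NIF: it "counts down" a large number one chunk per
 * call and yields the scheduler between chunks via enif_schedule_nif.
 * The countdown stands in for real per-chunk work; CHUNK is a made-up
 * tuning knob. */
#include <erl_nif.h>

#define CHUNK 100000UL

static ERL_NIF_TERM countdown_nif(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    unsigned long remaining;

    if (argc != 1 || !enif_get_ulong(env, argv[0], &remaining))
        return enif_make_badarg(env);

    /* Do at most one chunk of work per scheduler timeslice. */
    unsigned long todo = remaining < CHUNK ? remaining : CHUNK;
    remaining -= todo;                  /* stand-in for real per-chunk work */

    if (remaining == 0)
        return enif_make_atom(env, "done");

    /* Reschedule ourselves with the updated state instead of looping on
     * the scheduler thread. Flag 0 = run on a regular scheduler. */
    ERL_NIF_TERM newargv[1] = { enif_make_ulong(env, remaining) };
    return enif_schedule_nif(env, "countdown", 0, countdown_nif, 1, newargv);
}

static ErlNifFunc nif_funcs[] = {
    {"countdown", 1, countdown_nif}
};

ERL_NIF_INIT(chunked_demo, nif_funcs, NULL, NULL, NULL, NULL)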
Erlang 17 also brought an experimental feature, off by default, called dirty schedulers. This is a set of VM schedulers that do not have the same native code execution time constraints as the regular schedulers; work there can block for essentially infinite periods without disrupting normal VM operation.
Dirty schedulers come in two flavors: CPU schedulers for CPU-bound work, and I/O schedulers for I/O-bound work. In a VM compiled to enable dirty schedulers, there are by default as many dirty CPU schedulers as there are regular schedulers, and there are 10 I/O schedulers. These numbers can be altered using command-line switches, but note that to try to prevent regular scheduler starvation, you can never have more dirty CPU schedulers than regular schedulers. Applications use the same enif_schedule_nif function mentioned earlier to execute NIFs on dirty schedulers. My bitwise example code provides many details about this too. Dirty schedulers will remain an experimental feature for Erlang 18 as well.
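As a sketch of the dirty variant (assuming a VM built with dirty scheduler support; the names are again made up), the only real difference is the flag passed to enif_schedule_nif:

/* Sketch: dispatch a long-running NIF onto a dirty CPU scheduler. On
 * releases where dirty schedulers are experimental, this code may need
 * to be guarded with ERL_NIF_DIRTY_SCHEDULER_SUPPORT. */
#include <erl_nif.h>

static ERL_NIF_TERM do_heavy_work(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    /* ...arbitrarily long CPU-bound work; it never holds a regular scheduler... */
    return enif_make_atom(env, "ok");
}

/* The exported NIF just reschedules the real work onto a dirty CPU scheduler. */
static ERL_NIF_TERM heavy_work(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    return enif_schedule_nif(env, "do_heavy_work",
                             ERL_NIF_DIRTY_JOB_CPU_BOUND,
                             do_heavy_work, argc, argv);
}

static ErlNifFunc nif_funcs[] = {
    {"heavy_work", 0, heavy_work}
};

ERL_NIF_INIT(dirty_demo, nif_funcs, NULL, NULL, NULL, NULL)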
Native code in linked-in port drivers is subject to the same on-scheduler execution time constraints as NIFs, but drivers have two features NIFs don't:
Driver code can register file descriptors into the VM polling subsystem and be notified when any of those file descriptors becomes I/O-ready.
The driver API supports access to a non-scheduler async thread pool, the size of which is configurable but by default has 10 threads.
The first feature allows native driver code to avoid blocking a thread for I/O. For example, instead of performing a blocking recv call, driver code can register the socket file descriptor so the VM can poll it and call the driver back when the file descriptor becomes readable.
The second feature provides a separate thread pool useful for driver tasks that can't conform to the scheduler thread native code execution time constraints. You can achieve the same in a NIF but you have to set up your own thread pool and write your own native code to manage and access it. But regardless of whether you use the driver async thread pool, your own NIF thread pool, or dirty schedulers, note that they are all regular operating system threads, and so trying to start a huge number of them simply isn't practical.
Native driver code does not yet have dirty scheduler access, but this work is on-going and it might become available as an experimental feature in an 18.x release.
If your Lua code can make use of one or more of these features to cooperate with the Erlang VM, then what you're attempting may be possible.
This means, for example, a module can start compressing the response from a backend server and stream it to the client before the module has received the entire response from the backend.
Nice!
I know it's some kind of asynchronous IO, but an explanation that simple isn't enough for me.
Does anyone know how this works?
Without looking at the source code of an actual implementation, I'm speculating here:
It's most likely some kind of stream (abstract buffered IO) that is passed from one module to the other ("chaining"). One module (maybe a servlet container) writes to a stream that is read by another module (the compression module in your example), which then writes its output to another stream. The contents of that stream may then be processed further or transmitted to the client.
The backend may need to wait on IO before it can fully produce the page. Modules can begin compressing the start of the message before the backend is entirely done writing it.
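To make the chaining idea concrete, here is a minimal sketch of streaming compression with zlib (an illustration of the principle, not nginx's actual filter code): each chunk read from the "backend" is deflated and written out immediately, so the consumer starts receiving compressed data before the producer has finished.

/* Streaming ("chained") compression: compress stdin to stdout chunk by
 * chunk, emitting compressed output as soon as it is available instead of
 * waiting for the whole input. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384

int compress_stream(FILE *src, FILE *dst)
{
    unsigned char in[CHUNK], out[CHUNK];
    z_stream zs;
    memset(&zs, 0, sizeof zs);

    if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;

    int flush;
    do {
        zs.avail_in = fread(in, 1, CHUNK, src);   /* one chunk from the "backend" */
        flush = feof(src) ? Z_FINISH : Z_NO_FLUSH;
        zs.next_in = in;

        do {                                       /* emit whatever is ready now */
            zs.avail_out = CHUNK;
            zs.next_out = out;
            deflate(&zs, flush);
            fwrite(out, 1, CHUNK - zs.avail_out, dst);
        } while (zs.avail_out == 0);
    } while (flush != Z_FINISH);

    deflateEnd(&zs);
    return 0;
}

int main(void)
{
    return compress_stream(stdin, stdout);         /* e.g. ./a.out < page.html > page.z */
}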
To understand why this is useful, you need to understand how nginx is structured. nginx is a server that relies on non-blocking input and output. Normally, a server uses blocking input and output: it listens on a connection, and when a connection arrives, it processes the page. In order to increase throughput, multiple threads are spawned, called 'workers'.
Contrast this with nginx: it continually asks the kernel, "Are any of my IO requests ready?" This allows it to handle the same number of pages with 1) less overhead from all the different processes, and 2) lower memory usage. It has some downsides, however. For extremely low-volume applications, nginx may use more CPU than a blocking server. Second, it's much less portable; Windows uses an entirely different model for non-blocking IO.
Getting back to your original question, compressing the beginning of a page is useful because that work can be done, and the compressed output already sent, while the server is still accessing a database or reading from a disk (or what-have-you) to produce the rest of the page.
I wonder what the most efficient file logging strategy would be in a server written in C?
I can see the following options:
fopen() in append mode, then fwrite() the data for a time frame of, say, 1 hour, then fclose()?
Caching the data and then occasionally open() in append mode, write(), and close()?
Using a thread is usually a good solution; we adopted it with interesting results.
The main thread that needs to log prepares the log string and passes it to a second thread. To feed the second thread we use a lock-free queue plus a circular memory buffer in order to minimize the amount of alloc/free and the wait time.
The second thread waits on the lock-free queue. When it finds there's some work to do, it consumes a slot of the queue and logs the data.
Using a separate thread saves a great deal of time.
After we decided to use a second thread we had to face another problem: many instances of the same program (a full-text search engine) must all log to the same file, so the resource has to be shared properly among every instance of the server.
We could have used a semaphore or another synchronization method, but we found a different solution: the second thread sends a UDP packet to a local log server that listens on a known port. This server reads each message and logs it to the file (the server is actually the only one that owns the file while it is being written). The UDP socket itself guarantees serialization of the logs.
I've been using this solution for more than 10 years and have never lost a single line of my log files; using the second thread I also saved a great percentage of time for every operation (we log a lot of information for every single command the server receives).
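For illustration, here is a minimal sketch of the UDP hand-off (the lock-free queue and circular buffer are omitted; the port number and plain-text message format are made up):

/* The logging thread sends each prepared log line to a local log server
 * on a well-known port. The log server (not shown) is the only process
 * that writes the file. */
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define LOG_PORT 5140          /* hypothetical local log-server port */

static int log_sock = -1;
static struct sockaddr_in log_addr;

int log_open(void)
{
    log_sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (log_sock < 0)
        return -1;
    memset(&log_addr, 0, sizeof log_addr);
    log_addr.sin_family = AF_INET;
    log_addr.sin_port = htons(LOG_PORT);
    inet_pton(AF_INET, "127.0.0.1", &log_addr.sin_addr);
    return 0;
}

/* Called by the second (logging) thread for each line popped off the queue. */
void log_send(const char *line, size_t len)
{
    sendto(log_sock, line, len, 0,
           (struct sockaddr *)&log_addr, sizeof log_addr);
}

void log_close(void)
{
    if (log_sock >= 0)
        close(log_sock);
}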
HTH
Why don't you log your data directly when the events occur?
If your server crashes, you want to retrieve the data from the time it crashed. If you only flush your buffered logs once an hour, you'll miss the interesting logs.
File streams are usually buffered anyway (by the C library and the OS).
If you believe this makes your server slow due to hard-drive writes, you might consider logging in a separate thread. But I wonder if that is really the problem. Premature optimization?
Unless you've benchmarked and found that it's a bottleneck, use fopen and fprintf. There's no reason to put your own complex buffering layer on top unless stdio is too slow for you (and if it is too slow, you might think about rethinking the OS/C library your server runs on).
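A minimal sketch of that plain-stdio approach (the file name and the flush policy are just examples):

/* Plain stdio logging: fopen in append mode once, fprintf per event,
 * and let stdio's own buffering do its job. fflush() only where you
 * really need the line on disk right away (e.g. errors). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    FILE *log = fopen("server.log", "a");   /* example file name */
    if (!log)
        return 1;

    time_t now = time(NULL);
    fprintf(log, "%ld request handled in %d ms\n", (long)now, 42);
    fflush(log);                            /* optional: force it out now */

    fclose(log);
    return 0;
}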
The slowest part of writing a system log is the output operation to the physical disks.
Buffering and checksumming the log records are necessary to ensure that you don't lose any log data and that the log data can't be tampered with after the fact, respectively.
Is a server essentially a background process running an infinite loop listening on a port? For example:
while (1) {
    command = read(127.0.0.1:xxxx);
    if (command) {
        execute(command);
    }
}
When I say server, I obviously am not referring to a physical server (computer). I am referring to a MySQL server, or Apache, etc.
Full disclosure - I haven't had time to poke through any source code. Actual code examples would be great!
That's more or less what server software generally does.
Usually it gets more complicated because the infinite loop "only" accepts the connection, and each connection can often handle multiple "commands" (or whatever they are called in the protocol being used), but the basic idea is roughly this.
There are three kinds of 'servers' - forking, threading, and single-threaded (non-blocking). All of them generally loop the way you show; the difference is what happens when there is something to be serviced.
A forking service is just that. For every request, fork() is invoked creating a new child process that handles the request, then exits (or remains alive, to handle subsequent requests, depending on the design).
A threading service is like a forking service, but instead of a whole new process, a new thread is created to serve the request. Like forks, sometimes threads stay around to handle subsequent requests. The difference in performance and footprint is simply the difference between threads and forks. Depending on how much memory is in use that has nothing to do with servicing the client (and is prone to changing), it's usually better not to clone the entire address space. The only added complexity here is synchronization.
A single process (aka single threaded) server will fork only once to daemonize. It will not spawn new threads, it will not spawn child processes. It will continue to poll() the socket to find out when the file descriptor is ready to receive data, or has data available to be processed. Data for each connection is kept in its own structure, identified by various states (writing, waiting for ACK, reading, closing, etc). This can be an extremely efficient design, if done properly. Instead of having multiple children or threads blocking while waiting to do work, you have a single process and event loop servicing requests as they are ready.
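To illustrate, here is a minimal sketch of such a single-threaded poll() loop (it just echoes data back; the port number and limits are arbitrary):

/* One process, one loop, no worker threads: accept connections and echo
 * back whatever arrives, using poll() to find out which fd is ready. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FDS 64

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(7777);

    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    if (bind(listener, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(listener, 16) < 0) {
        perror("bind/listen");
        return 1;
    }

    struct pollfd fds[MAX_FDS];
    int nfds = 1;
    fds[0].fd = listener;
    fds[0].events = POLLIN;

    for (;;) {                                  /* the "endless loop" */
        if (poll(fds, nfds, -1) < 0)
            continue;

        /* New connection ready to be accepted? */
        if ((fds[0].revents & POLLIN) && nfds < MAX_FDS) {
            int c = accept(listener, NULL, NULL);
            if (c >= 0) {
                fds[nfds].fd = c;
                fds[nfds].events = POLLIN;
                nfds++;
            }
        }

        /* Existing connections with data available. */
        for (int i = 1; i < nfds; i++) {
            if (!(fds[i].revents & POLLIN))
                continue;
            char buf[4096];
            ssize_t n = read(fds[i].fd, buf, sizeof buf);
            if (n <= 0) {                       /* closed or error: drop it */
                close(fds[i].fd);
                fds[i] = fds[nfds - 1];
                nfds--;
                i--;
            } else {
                write(fds[i].fd, buf, n);       /* "process" = echo back */
            }
        }
    }
}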
There are instances where single-threaded services spawn multiple threads; however, the additional threads aren't working on servicing incoming requests. One might (for instance) set up a local socket in a thread that allows an administrator to obtain the status of all connections.
A little googling for "non-blocking http server" will yield some interesting hand-rolled web servers written as code-golf challenges.
In short, the difference is what happens once the endless loop is entered, not just the endless loop :)
In a manner of speaking, yes. A server is simply something that "loops forever" and serves. However, typically you'll find that "daemons" do things like reopen STDOUT and STDERR onto file handles or /dev/null, along with double-forking, among other things. Your code is a very simplistic "server" in a sense.
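For completeness, a minimal sketch of the double-fork daemonization mentioned above (error handling trimmed):

/* Classic double-fork daemonization: detach from the controlling terminal,
 * become a session leader once, fork again so the process can never
 * reacquire a terminal, and point the standard streams at /dev/null. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void daemonize(void)
{
    if (fork() > 0) exit(0);      /* parent exits, child is re-parented to init */
    setsid();                     /* new session, no controlling terminal */
    if (fork() > 0) exit(0);      /* second fork: no longer a session leader */

    umask(0);
    chdir("/");

    int devnull = open("/dev/null", O_RDWR);
    dup2(devnull, STDIN_FILENO);
    dup2(devnull, STDOUT_FILENO);
    dup2(devnull, STDERR_FILENO);
    if (devnull > STDERR_FILENO)
        close(devnull);
}

int main(void)
{
    daemonize();
    for (;;)                      /* the server's "loop forever" would go here */
        pause();
}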