So I have just discovered that libuv is a fairly small library as far as C libraries go (compared to FFmpeg). I have spent the past 6 hours reading through the source code to get a feel for the event loop at a deeper level, but I am still not seeing where the "nonblockingness" is implemented, that is, where some event interrupt signal or whatnot is being invoked in the codebase.
I have been using Node.js for over 8 years, so I am familiar with how to use an async non-blocking event loop, but I never actually looked into the implementation.
My question is twofold:
Where exactly is the "looping" occurring within libuv?
What are the key steps in each iteration of the loop that make it non-blocking and async?
So we start with a hello world example. All that is required is this:
#include <stdio.h>
#include <stdlib.h>
#include <uv.h>

int main() {
    uv_loop_t *loop = malloc(sizeof(uv_loop_t));
    uv_loop_init(loop); // initialize datastructures.
    uv_run(loop, UV_RUN_DEFAULT); // infinite loop as long as queue is full?
    uv_loop_close(loop);
    free(loop);
    return 0;
}
The key function which I have been exploring is uv_run. The uv_loop_init function essentially initializes data structures, so there is not too much fanciness there, I don't think. But the real magic seems to happen in uv_run, somewhere. A high-level set of code snippets from the libuv repo is in this gist, showing what the uv_run function calls.
Essentially it seems to boil down to this:
while (NOT_STOPPED) {
uv__update_time(loop)
uv__run_timers(loop)
uv__run_pending(loop)
uv__run_idle(loop)
uv__run_prepare(loop)
uv__io_poll(loop, timeout)
uv__run_check(loop)
uv__run_closing_handles(loop)
// ... cleanup
}
Those functions are in the gist.
uv__run_timers: runs timer callbacks? loops with for (;;) {.
uv__run_pending: runs regular callbacks? loops through queue with while (!QUEUE_EMPTY(&pq)) {.
uv__run_idle: no source code
uv__run_prepare: no source code
uv__io_poll: does io polling? (can't quite tell what this means though). Has 2 loops: while (!QUEUE_EMPTY(&loop->watcher_queue)) {, and for (;;) {.
And then we're done. And the program exits, because there is no "work" to be done.
So I think I have answered the first part of my question after all this digging, and the looping is specifically in these 3 functions:
uv__run_timers
uv__run_pending
uv__io_poll
But not having implemented anything with kqueue or multithreading and having dealt relatively little with file descriptors, I am not quite following the code. This will probably help out others along the path to learning this too.
So the second part of the question is what are the key steps in these 3 functions that implement the nonblockingness? Assuming this is where all the looping exists.
Not being a C expert, does for (;;) { "block" the event loop? Or can that run indefinitely and somehow other parts of the code are jumped to from OS system events or something like that?
So uv__io_poll calls poll(...) in that endless loop. I don't think that is non-blocking, is that correct? That seems to be mainly all it does.
Looking into kqueue.c there is also a uv__io_poll, so I assume the poll implementation is a fallback and kqueue on Mac is used, which is non-blocking?
So is that it? Is it just looping in uv__io_poll and each iteration you can add to the queue, and as long as there's stuff in the queue it will run? I still don't see how it's non-blocking and async.
Can someone outline, in a similar way, how it is async and non-blocking, and which parts of the code to take a look at? Basically, I would like to see where the "free processor idleness" exists in libuv. Where is the processor ever free in the call to our initial uv_run? If it is free, how does it get reinvoked, like an event handler? (Like a browser event handler from the mouse, an interrupt.) I feel like I'm looking for an interrupt but not seeing one.
I ask this because I want to implement an MVP event loop in C, but just don't understand how nonblockingness actually is implemented. Where the rubber meets the road.
I think that trying to understand libuv is getting in your way of understanding how reactors (event loops) are implemented in C, and it is this that you need to understand, as opposed to the exact implementation details behind libuv.
(Note that when I say "in C", what I really mean is "at or near the system call interface, where userland meets the kernel".)
All of the different backends (select, poll, epoll, etc) are, more-or-less, variations on the same theme. They block the current process or thread until there is work to be done, like servicing a timer, reading from a socket, writing to a socket, or handling a socket error.
When the current process is blocked, it literally is not getting any CPU cycles assigned to it by the OS scheduler.
Part of the issue behind understanding this stuff IMO is the poor terminology: async, sync in JS-land, which don't really describe what these things are. Really, in C, we're talking about non-blocking vs blocking I/O.
When we read from a blocking file descriptor, the process (or thread) is blocked -- prevented from running -- until the kernel has something for it to read; when we write to a blocking file descriptor, the process is blocked until the kernel accepts the entire buffer.
In non-blocking I/O, it's exactly the same, except the kernel won't stop the process from running when there is nothing to do: instead, when you read or write, it tells you how much you read or wrote (or if there was an error).
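To make the difference concrete, here is a minimal sketch assuming a POSIX system. It puts the read end of an empty pipe into non-blocking mode and shows that read() returns immediately with EAGAIN instead of putting the process to sleep. The helper names set_nonblocking and empty_read_returns_eagain are made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Reads from an empty, non-blocking pipe.
   Returns 1 if read() failed right away with EAGAIN/EWOULDBLOCK,
   0 if it behaved any other way, -1 if the setup itself failed. */
static int empty_read_returns_eagain(void) {
    int fds[2];
    char buf[16];
    ssize_t n;
    int result;

    if (pipe(fds) == -1)
        return -1;
    if (set_nonblocking(fds[0]) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Nothing has been written, so a blocking read would sleep here.
       The non-blocking read reports "nothing yet" and returns. */
    n = read(fds[0], buf, sizeof buf);
    result = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling empty_read_returns_eagain() from main should yield 1 on a POSIX system. Without O_NONBLOCK the same read() would never return, because nobody writes to the pipe.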
The select system call (and friends) prevent the C developer from having to try and read from a non-blocking file descriptor over and over again -- select() is, in effect, a blocking system call that unblocks when any of the descriptors or timers you are watching are ready. This lets the developer build a loop around select, servicing any events it reports, like an expired timeout or a file descriptor that can be read. This is the event loop.
So, at its very core, what happens at the C-end of a JS event loop is roughly this algorithm:
while (true) {
    select(open fds, timeout);
    if (did_the_timeout_expire())
        run_js_timers();
    for (each error fd)
        run_js_error_handler(fdJSObjects[fd]);
    for (each read-ready fd)
        emit_data_events(fdJSObjects[fd], read_as_much_as_I_can(fd));
    for (each write-ready fd) {
        if (!pendingData(fd))
            continue;  // nothing queued for this fd; check the next one
        write_as_much_as_I_can(fd);
        pendingData = whatever_was_leftover_that_couldnt_write;
    }
}
FWIW - I have actually written an event loop for v8 based around select(): it really is this simple.
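A runnable reduction of that loop might look like the following sketch. It assumes a POSIX system and watches a single pipe; the names wait_once, run_tiny_loop and the LOOP_* codes are invented for the example, and the "timer" is simply the select() timeout. While select() is blocked, the process consumes no CPU and the kernel wakes it when the descriptor becomes readable or the timeout expires.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical outcome codes for one wait. */
enum { LOOP_ERROR = -1, LOOP_TIMEOUT = 0, LOOP_READABLE = 1 };

/* Blocks in select() until fd is readable or timeout_ms elapses. */
static int wait_once(int fd, int timeout_ms) {
    fd_set readfds;
    struct timeval tv;
    int n;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n == -1)
        return LOOP_ERROR;
    if (n == 0)
        return LOOP_TIMEOUT;   /* an event loop would run its timers here */
    return FD_ISSET(fd, &readfds) ? LOOP_READABLE : LOOP_ERROR;
}

/* A tiny event loop over one pipe. The first wait times out because the
   pipe is empty; the "timer" then queues one byte, so the next wait
   reports the pipe as readable and the handler consumes the byte.
   Returns the number of bytes handled, or -1 on setup failure. */
static int run_tiny_loop(void) {
    int fds[2];
    int handled = 0;
    char c;

    if (pipe(fds) == -1)
        return -1;

    while (handled == 0) {
        int ev = wait_once(fds[0], 20);
        if (ev == LOOP_TIMEOUT) {
            if (write(fds[1], "x", 1) != 1)   /* pretend a timer produced work */
                break;
        } else if (ev == LOOP_READABLE) {
            if (read(fds[0], &c, 1) == 1)     /* the "data" handler */
                handled++;
        } else {
            break;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return handled;
}
```

Calling run_tiny_loop() from main should return 1. A real loop would watch many descriptors and compute the timeout from its nearest timer, but the shape stays the same.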
It's important also to remember that JS always runs to completion. So, when you call a JS function (via the v8 api) from C, your C program doesn't do anything until the JS code returns.
NodeJS uses some optimizations like handling pending writes in separate pthreads, but these all happen in "C space" and you shouldn't think/worry about them when trying to understand this pattern, because they're not relevant.
You might also be fooled into thinking that JS isn't run to completion when dealing with things like async functions -- but it absolutely is, 100% of the time -- if you're not up to speed on this, do some reading with respect to the event loop and the micro task queue. Async functions are basically a syntax trick, and their "completion" involves returning a Promise.
I just took a dive into libuv's source code, and found at first that it seems like it does a lot of setup, and not much actual event handling.
Nonetheless, a look into src/unix/kqueue.c reveals some of the inner mechanics of event handling:
int uv__io_check_fd(uv_loop_t* loop, int fd) {
  struct kevent ev;
  int rc;

  rc = 0;
  EV_SET(&ev, fd, EVFILT_READ, EV_ADD, 0, 0, 0);
  if (kevent(loop->backend_fd, &ev, 1, NULL, 0, NULL))
    rc = UV__ERR(errno);

  EV_SET(&ev, fd, EVFILT_READ, EV_DELETE, 0, 0, 0);
  if (rc == 0)
    if (kevent(loop->backend_fd, &ev, 1, NULL, 0, NULL))
      abort();

  return rc;
}
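The file descriptor is registered here by "setting" the event with EV_SET (similar to how you use FD_SET before checking with select()) and submitting it with kevent(). Strictly speaking, this function only verifies that the descriptor can be added to the kqueue and then removes it again; the actual wait for events happens in uv__io_poll in the same file, which calls kevent() with a timeout.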
This is specific to the kqueue style events (mainly used on BSD-likes a la MacOS), and there are many other implementations for different Unices, but they all use the same function name to do nonblocking IO checks. See here for another implementation using epoll.
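For comparison, the same register-then-wait pattern with epoll can be sketched as follows. This is Linux-only and is not libuv's code: the function name epoll_demo is made up, and a pipe stands in for a socket. It registers the read end, confirms that epoll_wait() times out while nothing is readable, then writes a byte and confirms that epoll_wait() reports the descriptor.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if epoll behaved as described above, 0 if not,
   -1 if the setup failed. */
static int epoll_demo(void) {
    int fds[2];
    int epfd;
    struct epoll_event ev;
    struct epoll_event out;
    int ok = 0;

    if (pipe(fds) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Register interest in "readable" on the pipe's read end
       (the epoll counterpart of EV_SET + kevent with EV_ADD). */
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
        /* Nothing written yet: the wait should time out and return 0. */
        int idle = epoll_wait(epfd, &out, 1, 10);

        if (idle == 0 && write(fds[1], "x", 1) == 1) {
            /* Now the kernel has data, so the wait returns one event. */
            int ready = epoll_wait(epfd, &out, 1, 1000);
            ok = (ready == 1 && out.data.fd == fds[0] &&
                  (out.events & EPOLLIN) != 0);
        }
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

Calling epoll_demo() from main should return 1 on Linux. The blocking call with a timeout, epoll_wait() here and kevent() on BSD-likes, is where the process sits idle until the kernel has something to report.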
To answer your questions:
1) Where exactly is the "looping" occurring within libuv?
The QUEUE data structure is used for storing and processing events. This queue is filled by the platform- and IO- specific event types you register to listen for. Internally, it uses a clever linked-list using only an array of two void * pointers (see here):
typedef void *QUEUE[2];
I'm not going to get into the details of this list; all you need to know is that it implements a queue-like structure for adding and popping elements.
Once you have file descriptors in the queue that are generating data, the asynchronous I/O code mentioned earlier will pick it up. The backend_fd within the uv_loop_t structure is the kqueue (or epoll) descriptor through which the kernel reports readiness for every watched descriptor.
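For the curious, the two-pointer trick can be sketched like this. These are not libuv's actual macros: the names queue_t, queue_init, queue_insert_tail and so on are invented, and plain functions are used instead of macros for readability. Slot 0 holds the "next" pointer and slot 1 the "prev" pointer of a circular doubly linked list, and the node is embedded inside the structure it links.

```c
#include <assert.h>
#include <stddef.h>

typedef void *queue_t[2];   /* [0] = next, [1] = prev */

static void queue_init(queue_t *q) {
    (*q)[0] = q;            /* an empty queue points at itself */
    (*q)[1] = q;
}

static int queue_empty(queue_t *h) {
    return (*h)[0] == (void *) h;
}

static queue_t *queue_head(queue_t *h) {
    return (*h)[0];
}

static void queue_insert_tail(queue_t *h, queue_t *q) {
    queue_t *prev = (*h)[1];
    (*q)[0] = h;
    (*q)[1] = prev;
    (*prev)[0] = q;
    (*h)[1] = q;
}

static void queue_remove(queue_t *q) {
    queue_t *next = (*q)[0];
    queue_t *prev = (*q)[1];
    (*prev)[0] = next;
    (*next)[1] = prev;
}

/* Hypothetical payload: the queue node lives inside the struct. */
struct task {
    int id;
    queue_t node;
};

/* Recover the enclosing struct from a pointer to its embedded node. */
#define TASK_FROM_NODE(q) \
    ((struct task *) ((char *) (q) - offsetof(struct task, node)))

/* Pushes tasks 1, 2, 3 and drains them, encoding the order as digits. */
static int drain_in_fifo_order(void) {
    queue_t head;
    struct task a = { 1, { NULL, NULL } };
    struct task b = { 2, { NULL, NULL } };
    struct task c = { 3, { NULL, NULL } };
    int seen = 0;

    queue_init(&head);
    queue_insert_tail(&head, &a.node);
    queue_insert_tail(&head, &b.node);
    queue_insert_tail(&head, &c.node);

    while (!queue_empty(&head)) {
        queue_t *q = queue_head(&head);
        queue_remove(q);
        seen = seen * 10 + TASK_FROM_NODE(q)->id;
    }
    return seen;
}
```

drain_in_fifo_order() returns 123, showing first-in, first-out order. Because the node is embedded, adding or removing an element never allocates memory.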
2) What are the key steps in each iteration of the loop that make it non-blocking and async?
libuv is essentially a wrapper (with a nice API) around the real workhorses here, namely kqueue, epoll, select, etc. To answer this question completely, you'd need a fair bit of background in kernel-level file descriptor implementation, and I'm not sure if that's what you want based on the question.
The short answer is that the underlying operating systems all have built-in facilities for non-blocking (and therefore async) I/O. How each system works is a little outside the scope of this answer, I think, but I'll leave some reading for the curious:
https://www.quora.com/Network-Programming-How-is-select-implemented?share=1
The first thing to keep in mind is that work must be added to libuv's queues using its API; one cannot just load up libuv, start its main loop, and then code up some I/O and get async I/O.
The queues maintained by libuv are managed by looping. The infinite loop in uv__run_timers isn't actually infinite; notice that the first check verifies that a soonest-expiring timer exists (presumably, if the list is empty, this is NULL), and if not, breaks the loop and the function returns. The next check breaks the loop if the current (soonest-expiring) timer hasn't expired. If neither of those conditions breaks the loop, the code continues: it restarts the timer, calls its timeout handler, and then loops again to check more timers. Most times when this code runs, it's going to break the loop and exit, allowing the other loops to run.
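The two early exits described above can be sketched as follows. This is a simplified stand-in rather than libuv's implementation: it assumes one-shot timers kept in a sorted singly linked list (libuv keeps its timers in a heap), and the names struct timer, timer_insert and run_timers are invented. Setting the fired flag stands in for calling the timeout handler.

```c
#include <assert.h>
#include <stddef.h>

struct timer {
    unsigned long due_ms;   /* absolute expiry time */
    int fired;              /* stand-in for "the callback ran" */
    struct timer *next;
};

/* Keeps the list sorted so the soonest-expiring timer is first. */
static void timer_insert(struct timer **list, struct timer *t) {
    while (*list != NULL && (*list)->due_ms <= t->due_ms)
        list = &(*list)->next;
    t->next = *list;
    *list = t;
}

/* Runs every timer that is due at now_ms; returns how many ran. */
static int run_timers(struct timer **list, unsigned long now_ms) {
    int ran = 0;

    for (;;) {
        struct timer *t = *list;
        if (t == NULL)
            break;              /* first check: no timers at all */
        if (t->due_ms > now_ms)
            break;              /* second check: soonest is not due yet */
        *list = t->next;        /* one-shot: remove it from the list */
        t->fired = 1;           /* "call the timeout handler" */
        ran++;
    }
    return ran;
}

/* Returns 1 if the loop exits early or runs timers as expected. */
static int timers_demo(void) {
    struct timer a = { 30, 0, NULL };
    struct timer b = { 10, 0, NULL };
    struct timer c = { 20, 0, NULL };
    struct timer *list = NULL;
    int first, second, third;

    timer_insert(&list, &a);
    timer_insert(&list, &b);
    timer_insert(&list, &c);

    first = run_timers(&list, 5);    /* nothing due yet */
    second = run_timers(&list, 25);  /* b (10) and c (20) are due */
    third = run_timers(&list, 25);   /* a (30) is still pending */

    return first == 0 && second == 2 && third == 0 &&
           b.fired && c.fired && !a.fired;
}
```

timers_demo() returns 1. The for (;;) loop only spins while there are expired timers to service; in the common case it hits one of the two break statements on the first pass.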
What makes all this non-blocking is the caller/user following the guidelines and API of libuv: adding your work to queues, and allowing libuv to perform its work on those queues. Processing-intensive work may block these loops and other work from running, so it's important to break your work into chunks.
By the way, the source code for uv__run_idle, uv__run_check and uv__run_prepare is defined in src/unix/loop-watcher.c.
I'm trying to write a server able to handle multiple (more than a thousand) client connections concurrently in C language. Every connection is meant to accomplish three things:
Send data to the server
The server processes the data
The server returns data to the client
I am using non-blocking sockets and epoll() for handling all the connections, but my problem is right in the moment after the server receives the data from one client and has to call a function which spends several seconds in processing the data before it returns the result that has to be sent back to the client before closing the connection.
My question is, what paradigm can I use in order to be able to keep handling more connections while the data of one client "is cooking"?
I've been researching the possibility of creating a thread or a process every time I need to call the computing function, but I'm not sure whether this would be feasible given the number of possible concurrent connections. That's why I came here, hoping that someone more experienced than me in the matter could shed some light on my ignorance.
Code snippet:
while (1)
{
ssize_t count;
char buf[512];
count = read (events[i].data.fd, buf, sizeof buf); // read the data
if (count == -1)
{
/* If errno == EAGAIN, that means we have read all
data. So go back to the main loop. */
if (errno != EAGAIN)
{
perror ("read");
done = 1;
}
/* Here is where I should call the processing function before
exiting the loop and closing the actual connection */
answer = proc_function(buf);
count = write (events[i].data.fd, answer, sizeof answer); // send the answer to the client
break;
}
...
Thanks in advance.
It seems sensible to multi-thread or multi-process to some degree to accomplish this. The degree to which you multi-thread or multi-process is the question.
1) You could dump the polling system entirely and use a thread/process per connection. That thread can then stall as long as it wants working on the processing for that connection. You'd then have to decide on creating/killing a thread/process each time (probably easiest) or having a pool of threads/processes (probably fastest).
2) You could have a thread/process for the networky bits and hand off the processing to one other thread. This is less parallel, but it does mean you can at least keep handling network connections whilst you're chopping through the list of work. This gives you control of what processing is being handled at least. It would be easy to prioritise incoming connections this way, whereas option 1 might not.
3) (sort of possible with 1 & 2) You could use asynchronous I/O to multiplex your connections. You still have to handle the processing in the same way as in 1 & 2 above.
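As a rough sketch of option 2, assuming POSIX threads: the event loop hands the slow job to a worker thread and keeps polling, and the worker announces completion by writing one byte to a pipe that the loop watches like any other descriptor. poll() is used here for brevity; with epoll you would register the pipe's read end in the same way. The names struct job, worker_main and offload_demo are hypothetical, and a short sleep stands in for the several seconds of processing.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

struct job {
    int input;
    int output;
    int notify_fd;   /* write end of the pipe the event loop watches */
};

static void *worker_main(void *arg) {
    struct job *job = arg;
    struct timespec pause = { 0, 50 * 1000 * 1000 };  /* 50 ms of "work" */

    nanosleep(&pause, NULL);
    job->output = job->input * 2;
    if (write(job->notify_fd, "x", 1) != 1)           /* wake the loop */
        job->output = -1;
    return NULL;
}

/* Returns the worker's result (42), or -1 on failure. */
static int offload_demo(void) {
    int fds[2];
    struct job job;
    struct pollfd pfd;
    pthread_t tid;
    int idle_ticks = 0;
    char c;

    if (pipe(fds) == -1)
        return -1;
    job.input = 21;
    job.output = 0;
    job.notify_fd = fds[1];
    if (pthread_create(&tid, NULL, worker_main, &job) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    for (;;) {
        int n = poll(&pfd, 1, 5);
        if (n > 0 && (pfd.revents & POLLIN))
            break;         /* the worker says the result is ready */
        if (n < 0)
            break;
        idle_ticks++;      /* the loop is free to serve other clients here */
    }

    if (read(fds[0], &c, 1) != 1)   /* consume the notification byte */
        idle_ticks = -1;
    pthread_join(tid, NULL);        /* also makes job.output safe to read */
    close(fds[0]);
    close(fds[1]);
    return idle_ticks < 0 ? -1 : job.output;
}
```

offload_demo() returns 42. With a thousand clients you would normally feed a fixed pool of such workers from a queue rather than create one thread per request, and send each result back when its notification arrives.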
You also have the question of threads vs processes. Threads are probably quicker to get going but it's more difficult to ensure data integrity. Processes are going to be more resilient but require more interfacing between them.
You also have to decide on a way to pass data between the threads/processes. This is less of an issue for option 1 as you only have to pass off the connection to the thread. Option 2 may (depending on what your data is) be more of a problem. You could use a message queue for passing the messages about but if you have a lot of data to send shared memory is more appropriate. Shared memory is a pain to engineer for processes but easy with threads (as all threads share the same memory space).
There are performance issues as you get to this scale too. It's worth investigating performance characteristics for these things. The difference in how calls like select and poll scale is significant when you're dealing with a lot of connections.
Without knowledge of what data is being sent and received it's hard to give solid recommendations.
Incidentally, this isn't a new problem. Dan Kegel had a good article about it a few years back. It's now out-of-date, but the overview is still good. You should research the current state of the art for the concepts he discusses though.
I'm coding part of a moderately complex communication protocol to control multiple medical devices from a single computer terminal. The terminal needs to manage about 20 such devices. Every device uses the same communication protocol, called DEP. I've created a loop that multiplexes between the devices to send requests and receive the patient data associated with a particular device. The structure of this loop is, in general, something like this:
Begin Loop
    Select Device i
        if Device.Socket has Data
            Strip Header
            Copy Data on Queue
        end if
        rem_time = TIMEOUT - (CurrentTime - Device.Session.LastRequestTime)
        if rem_time <= 0
            Send Re-association Request to Device
        else
            Sort Pending Request According to Time
            Select First Request
                Send the Request
                Set Request Priority Least
            end Select
        end if
    end Select
end Loop
I might have made some mistakes in the above pseudo-code, but I hope I've made clear what this loop is trying to do. I have a priority list structure that selects the device and the pending request for that device, so that all the requests and devices are selected at reasonable intervals.
I forgot to mention: the above loop does not actually parse the received data; it only strips off the header and puts the data in a queue. The data in the queue is parsed in a different thread and recorded in a file or database.
I wish to add a feature so that other computers may also import the data and control the devices attached to the computer terminal remotely. For this, I would need to create a socket that listens for commands in this INFINITE LOOP and sends the data from the other thread, where PARSING is performed.
Now, my question to all the concurrency experts is that:
Is it a good design to use a single socket for reading and writing in two different threads, where each thread is strictly involved in either reading or writing, not both? Also, I believe a socket is synchronized at the process level, so do I need locks to synchronize the reads and writes over one socket from different threads?
There is nothing inherently wrong with having multiple threads handle a single socket; however, there are many good and bad designs based around this one very general idea. If you do not want to rediscover the problems as you code your application, I suggest you search around for designs that best fit your planned particular style of packet handling.
There is also nothing inherently wrong with having a single thread handle a single socket; however, if you put the logic handling on that thread, then you have selected a bad design, as that thread cannot handle new requests while it is "working" on the last request.
In your particular code, you might have an issue. If your packets support fragmentation, or even if your algorithm gets a little ahead of the hardware due to timing issues, you might have just part of the packet "received" in the buffer. In that case, your algorithm will fail in two ways.
It will process a partial packet, one which has only the first part of its data.
It will mis-process the subsequent packet, as the information in the buffer will not start with a valid packet header.
Such failures are difficult to conceive and diagnose until they are encountered. Perhaps your library already buffers and splits messages, perhaps not.
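One common defence is to accumulate bytes in a buffer and hand a packet to the parser only once all of it has arrived. The sketch below assumes a made-up wire format, a 2-byte big-endian length header followed by the payload, because the real DEP header layout is not given here. The names struct framer, framer_feed and framer_next are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_LEN 2      /* assumed header: 2-byte big-endian payload length */
#define BUF_CAP 256

struct framer {
    unsigned char buf[BUF_CAP];
    size_t used;
};

/* Appends whatever the socket delivered; returns -1 if it does not fit. */
static int framer_feed(struct framer *f, const void *data, size_t len) {
    if (len > BUF_CAP - f->used)
        return -1;
    memcpy(f->buf + f->used, data, len);
    f->used += len;
    return 0;
}

/* Copies one complete payload into out and returns its length.
   Returns -1 if no complete packet is buffered yet (nothing is consumed),
   or -2 if the packet cannot fit in the buffer or in out. */
static int framer_next(struct framer *f, unsigned char *out, size_t out_cap) {
    size_t payload_len;

    if (f->used < HDR_LEN)
        return -1;
    payload_len = ((size_t) f->buf[0] << 8) | f->buf[1];
    if (payload_len > BUF_CAP - HDR_LEN || payload_len > out_cap)
        return -2;
    if (f->used < HDR_LEN + payload_len)
        return -1;                       /* partial packet: keep waiting */

    memcpy(out, f->buf + HDR_LEN, payload_len);
    f->used -= HDR_LEN + payload_len;
    /* Shift the leftover bytes so the next header starts at buf[0]. */
    memmove(f->buf, f->buf + HDR_LEN + payload_len, f->used);
    return (int) payload_len;
}

/* Feeds a packet split across two reads, followed by a second packet.
   Returns 1 if both packets come out whole and in order. */
static int framing_demo(void) {
    struct framer f = { { 0 }, 0 };
    unsigned char out[16];
    const unsigned char part1[] = { 0x00, 0x03, 'a' };
    const unsigned char part2[] = { 'b', 'c', 0x00, 0x01, 'z' };

    framer_feed(&f, part1, sizeof part1);
    if (framer_next(&f, out, sizeof out) != -1)
        return 0;                        /* must not emit a partial packet */

    framer_feed(&f, part2, sizeof part2);
    if (framer_next(&f, out, sizeof out) != 3 || memcmp(out, "abc", 3) != 0)
        return 0;
    if (framer_next(&f, out, sizeof out) != 1 || out[0] != 'z')
        return 0;
    return framer_next(&f, out, sizeof out) == -1;
}
```

framing_demo() returns 1: the fragmented first packet is held back until its last bytes arrive, and the second packet is still parsed from a valid header.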
In short, your design is not dictated by how many threads are accessing your socket: how many threads access your socket is dictated by your design.
I'm trying to implement a UDP-based server that maintains two sockets, one for controlling(ctrl_sock) and the other for data transmission(data_sock). The thing is, ctrl_sock is always uplink and data_sock is downlink. That is, clients will request data transmission/stop via the ctrl_sock and data will be sent to them via data_sock.
Now the problem is, since the model is connectionless, the server has to maintain a list of registered clients' information (I call it peers_context) so that it can "blindly" push data to them until they ask it to stop. During this blind transmission, clients may asynchronously send control messages to the server via ctrl_sock. Besides the initial Request and Stop, these messages can also carry, for example, preferences for file parts. Therefore, peers_context has to be updated asynchronously. However, the transmission over data_sock relies on this peers_context structure, which raises a synchronization problem between ctrl_sock and data_sock. My question is: what can I do to safely maintain these two sockets and the peers_context structure so that asynchronous updates of peers_context won't cause havoc? By the way, updates of peers_context would not be very frequent, which is why I need to avoid the request-reply model.
My initial idea for the implementation is to maintain ctrl_sock in the main thread (listener thread) and handle transmission over data_sock in another thread (worker thread). However, I found it difficult to synchronize in this case. For example, if I use mutexes in peers_context, whenever the worker thread locks peers_context, the listener thread won't have access to it when it needs to modify peers_context, because the worker thread works endlessly. On the other hand, if the listener thread holds peers_context and writes to it, the worker thread would fail to read peers_context and terminate. Can anybody give me some suggestions?
By the way, the implementation is done in Linux environment in C. Only the listener thread would need to modify peers_context occasionally, the worker thread only needs to read. Thanks sincerely!
If there is strong contention for your peers_context then you need to shorten your critical sections. You talked about using a mutex. I assume you've already considered changing to a reader+writer lock and rejected it because you don't want the constant readers to starve a writer. How about this?
Make a very small structure that is an indirect reference to a peers_context like this:
struct peers_context_handle {
    pthread_mutex_t ref_lock;
    struct peers_context *the_actual_context;
    pthread_mutex_t write_lock;
};
Packet senders (readers) and control request processors (writers) always access the peers_context through this indirection.
Assumption: the packet senders never modify the peers_context, nor do they ever free it.
Packet senders briefly lock the handle, obtain the current version of the peers_context and unlock it:
pthread_mutex_lock(&(handle->ref_lock));
peers_context = handle->the_actual_context;
pthread_mutex_unlock(&(handle->ref_lock));
(In practice, you can even do away with the lock if you introduce memory barriers, because an aligned pointer load or store is atomic on all platforms that Linux supports, but I wouldn't recommend it since you would have to start delving into memory barriers and other low-level stuff, and neither C nor POSIX guarantees that it will work anyway.)
Request processors don't update the peers_context, they make a copy and completely replace it. That's how they keep their critical section small. They do use write_lock to serialize updates, but updates are infrequent so that's not a problem.
pthread_mutex_lock(&(handle->write_lock));
/* Short CS to get the old version */
pthread_mutex_lock(&(handle->ref_lock));
old_peers_context = handle->the_actual_context;
pthread_mutex_unlock(&(handle->ref_lock));
new_peers_context = allocate_new_structure();
*new_peers_context = *old_peers_context;
/* Now make the changes that are requested */
new_peers_context->foo = 42;
new_peers_context->bar = 52;
/* Short CS to replace the context */
pthread_mutex_lock(&(handle->ref_lock));
handle->the_actual_context = new_peers_context;
pthread_mutex_unlock(&(handle->ref_lock));
pthread_mutex_unlock(&(handle->write_lock));
magic(old_peers_context);
What's the catch? It's the magic in the last line of code. You have to free the old copy of the peers_context to avoid a memory leak but you can't do it because there might be packet senders still using that copy.
The solution is similar to RCU, as used inside the Linux kernel. You have to wait for all of the packet sender threads to have entered a quiescent state. I'm leaving the implementation of this as an exercise for you :-) but here are the guidelines:
The magic() function adds old_peers_context to a to-be-freed queue (which has to be protected by a mutex).
One dedicated thread frees this list in a loop:
It locks the to-be-freed list
It obtains a pointer to the list
It replaces the list with a new empty list
It unlocks the to-be-freed list
It clears a mark associated with each worker thread
It waits for all marks to be set again
It frees each item in its previously obtained copy of the to-be-freed list
Meanwhile, each worker thread sets its own mark at an idle point in its event loop (i.e. a point when it is not busy sending any packets or holding any peers_context).
I am trying to get Thread A to communicate with Thread B. I should be using message passing between threads to do this but I am trying to find some sample source code which explains message passing.
Does anyone have any good link to some sample source code (in C) which explains message passing ?
While not having a link, there are many ways to implement this.
First is to use sockets. This is not actually a method I would recommend, as it can be quite a lot of work to get it to work right.
The second is related to the first method, and is to use something called anonymous pipes.
A third way, and the one I usually use, is "inspired" by how message passing worked on the old Amiga operating system: Simply use a queue. Since memory is shared between threads it's easy to just pass pointers around. Use one queue per thread. Just remember to protect the queues, with something like a mutex.
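A minimal sketch of such a queue in C with POSIX threads could look like this. It assumes a fixed-size ring buffer of pointers protected by one mutex and two condition variables. All names (struct msg_queue, mq_send, mq_receive, and so on) are invented for the example, and a NULL pointer is used as an "end of stream" sentinel.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MQ_CAP 8

struct msg_queue {
    void *slots[MQ_CAP];
    size_t head;             /* index of the oldest message */
    size_t count;            /* number of queued messages */
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
};

static void mq_init(struct msg_queue *q) {
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Blocks while the queue is full, then appends msg. */
static void mq_send(struct msg_queue *q, void *msg) {
    pthread_mutex_lock(&q->lock);
    while (q->count == MQ_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[(q->head + q->count) % MQ_CAP] = msg;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty, then removes the oldest message. */
static void *mq_receive(struct msg_queue *q) {
    void *msg;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)    /* loop guards against spurious wakeups */
        pthread_cond_wait(&q->not_empty, &q->lock);
    msg = q->slots[q->head];
    q->head = (q->head + 1) % MQ_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return msg;
}

static int numbers[] = { 1, 2, 3, 4, 5 };

/* Thread A: sends pointers to five numbers, then the NULL sentinel. */
static void *producer_main(void *arg) {
    struct msg_queue *q = arg;
    size_t i;

    for (i = 0; i < sizeof numbers / sizeof numbers[0]; i++)
        mq_send(q, &numbers[i]);
    mq_send(q, NULL);
    return NULL;
}

/* Thread B (the caller): sums everything it receives. */
static int mq_demo(void) {
    struct msg_queue q;
    pthread_t tid;
    int sum = 0;
    int *p;

    mq_init(&q);
    if (pthread_create(&tid, NULL, producer_main, &q) != 0)
        return -1;
    while ((p = mq_receive(&q)) != NULL)
        sum += *p;
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&q.lock);
    pthread_cond_destroy(&q.not_empty);
    pthread_cond_destroy(&q.not_full);
    return sum;
}
```

mq_demo() returns 15. Only pointers cross the queue, so the sender must keep each message alive until the receiver is done with it; for two-way traffic, give each thread its own queue.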
The platform you are using will probably have other ways of communication. Do a Google search for (for example) linux ipc.
One very easy way that's fairly fast, on Linux at least, is to use either TCP or UDP sockets for message passing between threads. The Linux kernel is pretty smart, and if I remember correctly it will bypass the network stack, which makes it pretty fast. Then you don't have to worry about locking and various other issues, which are basically handled by the kernel. It should be good enough for a homework assignment.
Uri's TCP/IP Resources List
FAQs, tutorials, guides, web pages & sites, and books about TCP/IP
Each thread in a process can see all of the memory of other threads. If two threads hold a pointer to the same location in memory, then they can both access it.
The following code (C++ with pthreads) has not been tested.
#include <pthread.h>
#include <queue>
#include <string>

struct MessageQueue
{
    std::queue<std::string> msg_queue;
    pthread_mutex_t mu_queue;
    pthread_cond_t cond;
};
{
// In a reader thread, far, far away...
MessageQueue *mq = <a pointer to the same instance that the main thread has>;
std::string msg = read_a_line_from_irc_or_whatever();
pthread_mutex_lock(&mq->mu_queue);
mq->msg_queue.push(msg);
pthread_mutex_unlock(&mq->mu_queue);
pthread_cond_signal(&mq->cond);
}
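{
    // Main thread
    MessageQueue *mq = <a pointer to the same instance that the reader thread has>;
    while(1)
    {
        pthread_mutex_lock(&mq->mu_queue);
        // Wait in a loop: pthread_cond_wait can wake up spuriously, and it
        // returns with the mutex held, so the mutex must not be locked again.
        while(mq->msg_queue.empty())
        {
            pthread_cond_wait(&mq->cond, &mq->mu_queue);
        }
        std::string s = mq->msg_queue.front(); // std::queue has front(), not top()
        mq->msg_queue.pop();
        pthread_mutex_unlock(&mq->mu_queue);
        handle_that_string(s);
    }
}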