How do I replace select() with kevent() for higher performance? - c

From the Kqueue Wikipedia Page:
Kqueue provides efficient input and output event pipelines between the kernel and userland. Thus, it is possible to modify event filters as well as receive pending events while using only a single system call to kevent(2) per main event loop iteration. This contrasts with older traditional polling system calls such as poll(2) and select(2) which are less efficient, especially when polling for events on a large number of file descriptors
That sounds great. I target FreeBSD for my server, and I'm handling a significant amount of network socket fd's - using a select() on them all and figuring out who to read data from. I'd rather use kevent() calls to get higher performance, since that's what it's there for!
I've read the man page for kevent on FreeBSD here but it's cryptic to me and I'm not finding good resources that explain it. An example of using kevent to replace select would solve my problem, and would also help me get a better idea of how kevent() is used.

First, create a new kqueue:
int kq = kqueue();
Now register your fd in kq:
struct kevent kev;
kev.ident = your_fd;
kev.flags = EV_ADD | EV_CLEAR;
kev.filter = EVFILT_READ;
kev.fflags = 0;
kev.data = 0;
kev.udata = &your_data;
int res = kevent(kq, &kev, 1, NULL, 0, NULL);
Finally, wait for data to arrive on your socket:
struct kevent res_kevs[5];
int res = kevent(kq, NULL, 0, res_kevs, 5, NULL);
After it returns, res_kevs[i].ident will contain your socket's descriptor, and res_kevs[i].data the number of bytes ready to be read.
See man kevent for more details and features.
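To tie those pieces together, here is a minimal sketch of what a select()-style accept/read loop might look like rebuilt around kevent() (listen_fd and handle_data are placeholders, and error handling is reduced to the essentials):
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <unistd.h>
#include <err.h>

void handle_data(int fd, const char *buf, ssize_t len);   /* hypothetical application callback */

void serve(int listen_fd) {
    int kq = kqueue();
    if (kq == -1)
        err(1, "kqueue");

    struct kevent kev;
    EV_SET(&kev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
        err(1, "kevent: add listen_fd");

    struct kevent events[64];
    for (;;) {
        /* Blocks until at least one registered fd has activity. */
        int n = kevent(kq, NULL, 0, events, 64, NULL);
        if (n == -1)
            err(1, "kevent: wait");

        for (int i = 0; i < n; i++) {
            int fd = (int) events[i].ident;
            if (fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                EV_SET(&kev, client, EVFILT_READ, EV_ADD, 0, 0, NULL);
                kevent(kq, &kev, 1, NULL, 0, NULL);
            } else {
                /* events[i].data holds the number of bytes ready to read */
                char buf[4096];
                ssize_t got = read(fd, buf, sizeof buf);
                if (got <= 0)
                    close(fd);          /* closing the fd also removes its kqueue events */
                else
                    handle_data(fd, buf, got);
            }
        }
    }
}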

What you normally do is use libevent, which takes care of all the details for you, and also means you can move your program to another OS that uses a different mechanism (e.g. Linux and epoll) for doing something similar.
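If you want a feel for what that looks like, here is a minimal sketch against the libevent 2.x API (your_fd and on_readable are placeholders):
#include <event2/event.h>

/* Called by libevent whenever your_fd becomes readable. */
static void on_readable(evutil_socket_t fd, short what, void *arg) {
    /* read from fd here */
    (void) what;
    (void) arg;
}

int run_loop(int your_fd) {
    struct event_base *base = event_base_new();
    struct event *ev = event_new(base, your_fd, EV_READ | EV_PERSIST, on_readable, NULL);
    event_add(ev, NULL);                /* NULL timeout: wait indefinitely */
    event_base_dispatch(base);          /* uses kqueue on FreeBSD, epoll on Linux, etc. */
    event_free(ev);
    event_base_free(base);
    return 0;
}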

Related

How does the libuv implementation of *non-blockingness* work exactly?

So I have just discovered that libuv is a fairly small library as far as C libraries go (compared to FFmpeg). I have spent the past 6 hours reading through the source code to get a feel for the event loop at a deeper level. But I'm still not seeing where the "non-blockingness" is implemented, where some event interrupt signal or whatnot is being invoked in the codebase.
I have been using Node.js for over 8 years so I am familiar with how to use an async non-blocking event loop, but I never actually looked into the implementation.
My question is twofold:
Where exactly is the "looping" occurring within libuv?
What are the key steps in each iteration of the loop that make it non-blocking and async?
So we start with a hello world example. All that is required is this:
#include <stdio.h>
#include <stdlib.h>
#include <uv.h>
int main() {
    uv_loop_t *loop = malloc(sizeof(uv_loop_t));
    uv_loop_init(loop);            // initialize data structures
    uv_run(loop, UV_RUN_DEFAULT);  // infinite loop as long as the queue is full?
    uv_loop_close(loop);
    free(loop);
    return 0;
}
The key function I have been exploring is uv_run. The uv_loop_init function essentially initializes data structures, so not too much fanciness there, I don't think. But the real magic seems to happen with uv_run, somewhere. A high-level set of code snippets from the libuv repo is in this gist, showing what the uv_run function calls.
Essentially it seems to boil down to this:
while (NOT_STOPPED) {
    uv__update_time(loop)
    uv__run_timers(loop)
    uv__run_pending(loop)
    uv__run_idle(loop)
    uv__run_prepare(loop)
    uv__io_poll(loop, timeout)
    uv__run_check(loop)
    uv__run_closing_handles(loop)
    // ... cleanup
}
Those functions are in the gist.
uv__run_timers: runs timer callbacks? loops with for (;;) {.
uv__run_pending: runs regular callbacks? loops through queue with while (!QUEUE_EMPTY(&pq)) {.
uv__run_idle: no source code
uv__run_prepare: no source code
uv__io_poll: does I/O polling? (can't quite tell what this means, though). Has 2 loops: while (!QUEUE_EMPTY(&loop->watcher_queue)) {, and for (;;) {.
And then we're done. And the program exits, because there is no "work" to be done.
So I think I have answered the first part of my question after all this digging, and the looping is specifically in these 3 functions:
uv__run_timers
uv__run_pending
uv__io_poll
But not having implemented anything with kqueue or multithreading and having dealt relatively little with file descriptors, I am not quite following the code. This will probably help out others along the path to learning this too.
So the second part of the question is what are the key steps in these 3 functions that implement the nonblockingness? Assuming this is where all the looping exists.
Not being a C expert, does for (;;) { "block" the event loop? Or can that run indefinitely and somehow other parts of the code are jumped to from OS system events or something like that?
So uv__io_poll calls poll(...) in that endless loop. I don't think that is non-blocking, is that correct? That seems to be all it mainly does.
Looking into kqueue.c there is also a uv__io_poll, so I assume the poll implementation is a fallback and kqueue on Mac is used, which is non-blocking?
So is that it? Is it just looping in uv__io_poll and each iteration you can add to the queue, and as long as there's stuff in the queue it will run? I still don't see how it's non-blocking and async.
Can someone outline, similarly to this, how it is async and non-blocking, and which parts of the code to take a look at? Basically, I would like to see where the "free processor idleness" exists in libuv. Where is the processor ever free in the call to our initial uv_run? If it is free, how does it get reinvoked, like an event handler? (Like a browser event handler for the mouse, an interrupt.) I feel like I'm looking for an interrupt but not seeing one.
I ask this because I want to implement an MVP event loop in C, but just don't understand how nonblockingness actually is implemented. Where the rubber meets the road.
I think that trying to understand libuv is getting in the way of your understanding of how reactors (event loops) are implemented in C, and it is this that you need to understand, as opposed to the exact implementation details behind libuv.
(Note that when I say "in C", what I really mean is "at or near the system call interface, where userland meets the kernel".)
All of the different backends (select, poll, epoll, etc) are, more-or-less, variations on the same theme. They block the current process or thread until there is work to be done, like servicing a timer, reading from a socket, writing to a socket, or handling a socket error.
When the current process is blocked, it literally is not getting any CPU cycles assigned to it by the OS scheduler.
Part of the issue behind understanding this stuff IMO is the poor terminology: async, sync in JS-land, which don't really describe what these things are. Really, in C, we're talking about non-blocking vs blocking I/O.
When we read from a blocking file descriptor, the process (or thread) is blocked -- prevented from running -- until the kernel has something for it to read; when we write to a blocking file descriptor, the process is blocked until the kernel accepts the entire buffer.
In non-blocking I/O, it's exactly the same, except the kernel won't stop the process from running when there is nothing to do: instead, when you read or write, it tells you how much you read or wrote (or if there was an error).
The select system call (and friends) prevent the C developer from having to try and read from a non-blocking file descriptor over and over again -- select() is, in effect, a blocking system call that unblocks when any of the descriptors or timers you are watching are ready. This lets the developer build a loop around select, servicing any events it reports, like an expired timeout or a file descriptor that can be read. This is the event loop.
So, at its very core, what happens at the C-end of a JS event loop is roughly this algorithm:
while (true) {
    select(open fds, timeout);
    did_the_timeout_expire(run_js_timers());
    for (each error fd)
        run_js_error_handler(fdJSObjects[fd]);
    for (each read-ready fd)
        emit_data_events(fdJSObjects[fd], read_as_much_as_I_can(fd));
    for (each write-ready fd) {
        if (!pendingData(fd))
            break;
        write_as_much_as_I_can(fd);
        pendingData = whatever_was_leftover_that_couldnt_write;
    }
}
FWIW - I have actually written an event loop for v8 based around select(): it really is this simple.
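As a concrete, stripped-down C sketch of that algorithm (on_readable, run_expired_timers and ms_until_next_timer are hypothetical application callbacks; writes and error fds are omitted for brevity):
#include <sys/select.h>
#include <sys/time.h>

void on_readable(int fd);             /* provided by the application */
void run_expired_timers(void);        /* fire any timers that are due */
long ms_until_next_timer(void);       /* -1 if no timers are pending */

void event_loop(int fds[], int nwatched) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        int maxfd = -1;
        for (int i = 0; i < nwatched; i++) {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        /* Block until a descriptor is readable or the next timer is due. */
        struct timeval tv, *tvp = NULL;
        long ms = ms_until_next_timer();
        if (ms >= 0) {
            tv.tv_sec  = ms / 1000;
            tv.tv_usec = (ms % 1000) * 1000;
            tvp = &tv;
        }
        int ready = select(maxfd + 1, &readfds, NULL, NULL, tvp);

        run_expired_timers();
        for (int i = 0; ready > 0 && i < nwatched; i++)
            if (FD_ISSET(fds[i], &readfds))
                on_readable(fds[i]);
    }
}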
It's important also to remember that JS always runs to completion. So, when you call a JS function (via the v8 api) from C, your C program doesn't do anything until the JS code returns.
NodeJS uses some optimizations, like handling pending writes in separate pthreads, but these all happen in "C space" and you shouldn't think/worry about them when trying to understand this pattern, because they're not relevant.
You might also be fooled into thinking that JS isn't run to completion when dealing with things like async functions, but it absolutely is, 100% of the time. If you're not up to speed on this, do some reading with respect to the event loop and the microtask queue. Async functions are basically a syntax trick, and their "completion" involves returning a Promise.
I just took a dive into libuv's source code, and found at first that it seems like it does a lot of setup, and not much actual event handling.
Nonetheless, a look into src/unix/kqueue.c reveals some of the inner mechanics of event handling:
int uv__io_check_fd(uv_loop_t* loop, int fd) {
  struct kevent ev;
  int rc;

  rc = 0;
  EV_SET(&ev, fd, EVFILT_READ, EV_ADD, 0, 0, 0);
  if (kevent(loop->backend_fd, &ev, 1, NULL, 0, NULL))
    rc = UV__ERR(errno);

  EV_SET(&ev, fd, EVFILT_READ, EV_DELETE, 0, 0, 0);
  if (rc == 0)
    if (kevent(loop->backend_fd, &ev, 1, NULL, 0, NULL))
      abort();

  return rc;
}
The file descriptor polling is done here, "setting" the event with EV_SET (similar to how you use FD_SET before checking with select()), and the handling is done via the kevent handler.
This is specific to the kqueue style events (mainly used on BSD-likes a la MacOS), and there are many other implementations for different Unices, but they all use the same function name to do nonblocking IO checks. See here for another implementation using epoll.
To answer your questions:
1) Where exactly is the "looping" occurring within libuv?
The QUEUE data structure is used for storing and processing events. This queue is filled by the platform- and IO- specific event types you register to listen for. Internally, it uses a clever linked-list using only an array of two void * pointers (see here):
typedef void *QUEUE[2];
I'm not going to get into the details of this list, all you need to know is it implements a queue-like structure for adding and popping elements.
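As a rough illustration (simplified; close to, but not necessarily identical to, libuv's actual queue.h macros), the head and each node are just a pair of pointers forming a circular doubly-linked list:
typedef void *QUEUE[2];                /* [0] = next, [1] = prev */

#define QUEUE_NEXT(q)  (*(QUEUE **) &((*(q))[0]))
#define QUEUE_PREV(q)  (*(QUEUE **) &((*(q))[1]))

/* An empty queue points at itself. */
#define QUEUE_INIT(q)   do { QUEUE_NEXT(q) = (q); QUEUE_PREV(q) = (q); } while (0)
#define QUEUE_EMPTY(q)  ((const QUEUE *) (q) == (const QUEUE *) QUEUE_NEXT(q))

/* Insert node q before head h, i.e. at the tail of the list. */
#define QUEUE_INSERT_TAIL(h, q)          \
  do {                                   \
    QUEUE_NEXT(q) = (h);                 \
    QUEUE_PREV(q) = QUEUE_PREV(h);       \
    QUEUE_NEXT(QUEUE_PREV(q)) = (q);     \
    QUEUE_PREV(h) = (q);                 \
  } while (0)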
Once you have file descriptors in the queue that are generating data, the asynchronous I/O code mentioned earlier will pick it up. The backend_fd within the uv_loop_t structure is the generator of data for each type of I/O.
2) What are the key steps in each iteration of the loop that make it non-blocking and async?
libuv is essentially a wrapper (with a nice API) around the real workhorses here, namely kqueue, epoll, select, etc. To answer this question completely, you'd need a fair bit of background in kernel-level file descriptor implementation, and I'm not sure if that's what you want based on the question.
The short answer is that the underlying operating systems all have built-in facilities for non-blocking (and therefore async) I/O. How each system works is a little outside the scope of this answer, I think, but I'll leave some reading for the curious:
https://www.quora.com/Network-Programming-How-is-select-implemented?share=1
The first thing to keep in mind is that work must be added to libuv's queues using its API; one cannot just load up libuv, start its main loop, and then code up some I/O and get async I/O.
The queues maintained by libuv are managed by looping. The infinite loop in uv__run_timers isn't actually infinite; notice that the first check verifies that a soonest-expiring timer exists (presumably, if the list is empty, this is NULL), and if not, breaks the loop and the function returns. The next check breaks the loop if the current (soonest-expiring) timer hasn't expired. If neither of those conditions breaks the loop, the code continues: it restarts the timer, calls its timeout handler, and then loops again to check more timers. Most times when this code runs, it's going to break the loop and exit, allowing the other loops to run.
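Paraphrased in C-like pseudocode (names simplified; this is not the literal libuv source), that control flow is roughly:
for (;;) {
    handle = heap_min(timer_heap);        /* the soonest-expiring timer, or NULL */
    if (handle == NULL)
        break;                            /* no timers registered: nothing to do */
    if (handle->timeout > loop->time)
        break;                            /* the soonest timer hasn't expired yet */

    timer_stop(handle);                   /* remove it from the heap */
    timer_restart_if_repeating(handle);   /* re-arm interval timers */
    handle->timer_cb(handle);             /* run the user's callback */
}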
What makes all this non-blocking is the caller/user following the guidelines and API of libuv: adding your work to queues, and allowing libuv to perform its work on those queues. Processing-intensive work may block these loops and other work from running, so it's important to break your work into chunks.
By the way, the source code for uv__run_idle, uv__run_check and uv__run_prepare is defined in src/unix/loop-watcher.c.

What is the minimum size of data guaranteed to be sent in a single call of send() (tcp sockets)? [duplicate]

This question already has an answer here:
Why is it assumed that send may return with less than requested data transmitted on a blocking socket?
(1 answer)
Closed 9 years ago.
After select() returns with the write fd set for a TCP socket, if I try to send data on that socket, what is the minimum guaranteed size of data to be sent at once using the send API? I understand that I have to run a loop to make sure all the data is sent. Still, I want to understand what the minimum guaranteed data sent is, and why.
This has come up before. I'm still searching for the referenced answer.
Let's start with the function prototype for send()
ssize_t send(int sockfd, const void *buf, size_t len, int flags);
For blocking TCP sockets - all the documentation will suggest that send() and write() will return a value between [1..len], unless there was an error. However, the reality is that no one I know has ever observed send() returning something other than -1 (error) or just "len" in the success case to indicate all of "buf" was sent in one call. I've never felt good about this, so I code defensively and just put my blocking send calls in a loop until the entire buffer is sent.
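Such a defensive loop looks roughly like this (a sketch; send_all is my own name for it, and retrying on EINTR is left to the caller):
#include <sys/types.h>
#include <sys/socket.h>

/* Keep calling send() until the whole buffer has been accepted or an error occurs.
   Returns 0 on success, -1 on error (check errno; you may want to retry on EINTR). */
int send_all(int sockfd, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(sockfd, buf + sent, len - sent, 0);
        if (n < 0)
            return -1;
        sent += (size_t) n;
    }
    return 0;
}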
For non-blocking TCP sockets - you should just code as if the minimum was "1" (or -1 on error). Don't make any assumptions about a minimum data size.
And for recv(), you should always assume recv() will return some random value between 1..len in the success case, or 0 (closed), or -1 (error). Don't EVER assume recv will return a full buffer.
From the official POSIX reference:
A descriptor shall be considered ready for writing when a call to an output function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully.
As you can see, it doesn't actually mention any size, or say that the write will even be successful, just that you will be able to write to the socket without blocking.
So the answer is that there is no minimum guaranteed size.
When select() indicates that a socket is writable it is guaranteed you can transfer at least one byte without incurring EWOULDBLOCK or EAGAIN. Nothing to say you won't incur a different error though :-)
I don't think that at the POSIX layer you have any control over that. send() will accept anything you pass into the internal buffers of the TCP/IP stack implementation. What happens next with it is another story, which you can monitor using the same select() call.
If you are asking about the size that will fit in one packet, that's the MTU (but this assumption should also be used with care with regard to fragmentation and reassembly of packets along the way).
UPD:
I will answer your comment here. No, you shouldn't bother about fragmentation at all; leave it to the TCP/IP stack. There are a lot of reasons why you shouldn't do this. One as an example: your application works at the Application (7) layer of the OSI model (although I consider the OSI model in most cases to be an evil thing, it's really applicable for this example), and from that layer you are trying to affect the functionality/properties of logic that sits at much lower layers (Session/Transport). You shouldn't do this. POSIX calls like send() and recv() are designed to give your application the ability to instruct the layers underneath that you need to pass a certain amount of data, plus a way to monitor the execution of that command (select()); that's all you have to do. The lower layers are supposed to do their best to deliver the data you give them in the most optimal way, depending on OS network settings etc.
UPD2: everything above mostly concerns NON-BLOCKING sockets. Sorry, I forgot to mention this; I haven't used blocking sockets in my projects for ages. In case your socket is blocking, I would still consider passing everything at once and just waiting for the operation result in another thread, for example, because trying to optimise this could lead to very OS/driver-dependent code.

Non-Blocking Sockets required for a proxy?

I have the following code:
{
    send(dstSocket, rcvBuffer, recvMsgSize, 0);
    sndMsgSize = recv(dstSocket, sndBuffer, RCVBUFSIZE, 0);
    send(rcvSocket, sndBuffer, sndMsgSize, 0);
    recvMsgSize = recv(rcvSocket, rcvBuffer, RCVBUFSIZE, 0);
}
which eventually should become part of a generic TCP proxy. Now, as it stands, it doesn't work quite correctly, since the recv() waits for input, so the data only gets transmitted in chunks, depending on where it currently is.
What I've read is that I need something like "non-blocking sockets" and a mechanism to monitor them. This mechanism, as I found out, is select, poll or epoll on Linux. Could anyone confirm that I am on the right track here? Or could this exercise also be done with blocking sockets?
Regards
You are on the right track.
"select" and "poll" are system calls where you can pass in one or more sockets and block (for a specific amount of time) until data has been received (or ready for sending) on one of those sockets.
"non-blocking sockets" is a setting you can apply to a socket (or a recv call flag) such that if you try to call recv, but no data is available, the call will return immediately. Similar semantics exist for "send". You can use non-blocking sockets with or without the select/poll method described above. It's usually not a bad idea to use non-blocking operations just in case you get signaled for data that isn't there.
"epoll" is a highly scalable version of select and poll. A "select" set is actually limited to something like 64-256 sockets for monitoring at a time and it takes a perf hit as the number of monitored sockets goes up. "epoll" can scale up to thousands of simultaneous network connections.
Yes, you are on the right track. Use non-blocking sockets, passing their file descriptors to select() (see FD_SET()).
This way select() will monitor them for events (read/write).
When select() returns, you can check which fd an event has occurred on (look at FD_ISSET()) and handle it.
You can also set a timeout on select(), and it will return after that period even if no events have occurred.
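In outline, using the socket names from the question (a sketch only; assumes <sys/select.h> is included and dstSocket/rcvSocket are the two connected descriptors):
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(dstSocket, &readfds);
FD_SET(rcvSocket, &readfds);
int maxfd = (dstSocket > rcvSocket) ? dstSocket : rcvSocket;

struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };   /* optional timeout */
int ready = select(maxfd + 1, &readfds, NULL, NULL, &tv);

if (ready > 0) {
    if (FD_ISSET(dstSocket, &readfds)) {
        /* recv() on dstSocket will not block now */
    }
    if (FD_ISSET(rcvSocket, &readfds)) {
        /* recv() on rcvSocket will not block now */
    }
} else if (ready == 0) {
    /* timeout expired with no events */
}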
Yes, you'll have to use one of those mechanisms. poll is portable and IMO the easiest one to use. You don't have to turn off blocking in this case, provided you use a small enough value for RCVBUFSIZE (around 2k-10k should be appropriate). Non-blocking sockets are a bit more complicated to handle, since if you get EAGAIN on send you can't just loop to try again (well, you can, but you shouldn't, since it uses CPU unnecessarily).
But I would recommend using a wrapper such as libevent. In this case a struct bufferevent would work particularly well. It will make a callback when new data is available, and you just queue it up for sending on the other socket.
I tried to find a bufferevent example but they seem to be a bit thin on the ground. The documentation is here anyway: http://monkey.org/~provos/libevent/doxygen-2.0.1/index.html
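In lieu of an official example, here is a rough sketch of a bufferevent-based relay between two already-connected sockets using the libevent 2.x API (an untested outline; client_fd and server_fd are placeholders):
#include <event2/event.h>
#include <event2/bufferevent.h>
#include <event2/buffer.h>

/* Whenever one side produces data, copy it straight to the other side. */
static void relay_read_cb(struct bufferevent *bev, void *ctx) {
    struct bufferevent *partner = ctx;
    bufferevent_write_buffer(partner, bufferevent_get_input(bev));
}

/* Tear both sides down on EOF or error (a real proxy would flush pending output first). */
static void relay_event_cb(struct bufferevent *bev, short what, void *ctx) {
    if (what & (BEV_EVENT_EOF | BEV_EVENT_ERROR)) {
        struct bufferevent *partner = ctx;
        bufferevent_free(partner);
        bufferevent_free(bev);
    }
}

void start_relay(struct event_base *base, int client_fd, int server_fd) {
    struct bufferevent *b_client = bufferevent_socket_new(base, client_fd, BEV_OPT_CLOSE_ON_FREE);
    struct bufferevent *b_server = bufferevent_socket_new(base, server_fd, BEV_OPT_CLOSE_ON_FREE);
    bufferevent_setcb(b_client, relay_read_cb, NULL, relay_event_cb, b_server);
    bufferevent_setcb(b_server, relay_read_cb, NULL, relay_event_cb, b_client);
    bufferevent_enable(b_client, EV_READ | EV_WRITE);
    bufferevent_enable(b_server, EV_READ | EV_WRITE);
}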

Reading from the serial port in a multi-threaded program on Linux

I'm writing a program in linux to interface, through serial, with a piece of hardware. The device sends packets of approximately 30-40 bytes at about 10Hz. This software module will interface with others and communicate via IPC so it must perform a specific IPC sleep to allow it to receive messages that it's subscribed to when it isn't doing anything useful.
Currently my code looks something like:
while (1) {
    IPC_sleep(some_time);
    read_serial();
    process_serial_data();
}
The problem with this is that sometimes the read will be performed while only a fraction of the next packet is available at the serial port, which means that it isn't all read until next time around the loop. For the specific application it is preferable that the data is read as soon as it's available, and that the program doesn't block while reading.
What's the best solution to this problem?
The best solution is not to sleep! What I mean is that a good solution is probably to mix the IPC event and the serial event. select() is a good tool for doing this. Then you have to find an IPC mechanism that is select-compatible.
socket-based IPC is select()able
pipe-based IPC is select()able
POSIX message queues are also select()able
And then your loop looks like this
while (1) {
    select(serial_fd | ipc_fd);  // of course this is pseudo code
    if (FD_ISSET(serial_fd, &fd_set)) {
        parse_serial(serial_fd, serial_context);
        if (complete_serial_message)
            process_serial_data(serial_context);
    }
    if (FD_ISSET(ipc_fd, &fd_set)) {
        do_ipc();
    }
}
read_serial is replaced with parse_serial, because if you spend all your time waiting for a complete serial packet, then all the benefit of the select is lost. But from your question, it seems you are already doing that, since you mention getting serial data across two different loop iterations.
With the proposed architecture you have good reactivity on both the IPC and the serial side. You read serial data as soon as they are available, but without stopping to process IPC.
Of course it assumes you can change the IPC mechanism. If you can't, perhaps you can make a "bridge process" that interface on one side with whatever IPC you are stuck with, and on the other side uses a select()able IPC to communicate with your serial code.
Store away what you got so far of the message in a buffer of some sort.
If you don't want to block while waiting for new data, use something like select() on the serial port to check that more data is available. If not, you can continue doing some processing or whatever needs to be done instead of blocking until there is data to fetch.
When the rest of the data arrives, add to the buffer and check if there is enough to comprise a complete message. If there is, process it and remove it from the buffer.
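A sketch of that accumulate-and-extract pattern (PACKET_LEN, the buffer size and process_serial_packet are made-up names; a real parser would derive the message length and framing from the device's protocol):
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

#define PACKET_LEN 40                 /* assumed fixed-size packet, for illustration only */

void process_serial_packet(const char *pkt);   /* provided elsewhere */

static char msgbuf[4096];
static size_t msglen = 0;             /* bytes accumulated so far */

/* Call this whenever select() reports the serial fd readable. */
void on_serial_readable(int serial_fd) {
    ssize_t n = read(serial_fd, msgbuf + msglen, sizeof(msgbuf) - msglen);
    if (n <= 0)
        return;                       /* error or nothing to read */
    msglen += (size_t) n;

    /* Extract as many complete packets as the buffer now holds. */
    while (msglen >= PACKET_LEN) {
        process_serial_packet(msgbuf);
        memmove(msgbuf, msgbuf + PACKET_LEN, msglen - PACKET_LEN);
        msglen -= PACKET_LEN;
    }
}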
You must cache enough of the message to know whether or not it is complete, or whether you will ever have a complete, valid message.
If it is not valid or won't be in an acceptable timeframe, then you toss it. Otherwise, you keep it and process it.
This is typically called implementing a parser for the device's protocol.
This is the algorithm (blocking) that is needed:
while (!complete_packet(p) && time_taken < timeout)
{
    p += reading_device.read(); // only blocks for t << 1 sec
    time_taken.update();
}
// now you have a complete packet or a timeout
You can intersperse a callback if you like, or inject relevant portions in your processing loops.

select()-able timers

select() is a great system call. You can pack any number of file descriptors, socket descriptors, pipes, etc. and get notified in a synchronous fashion when input becomes available.
Is there a way to create an interval/oneshot timer and use it with select()? That would save me from having multiple threads for IO and timing.
timerfd_create does exactly this. It's a fairly recent addition to the Linux kernel and might not be available on all distros yet, though.
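A minimal Linux-only sketch: create the timer fd, arm it with an interval, and then treat it like any other descriptor in your select() set (error handling omitted):
#include <sys/timerfd.h>
#include <sys/select.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

void loop_with_timerfd(void) {
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

    /* 500 ms interval timer: first expiry after 500 ms, then every 500 ms. */
    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 500 * 1000 * 1000 },
        .it_interval = { .tv_sec = 0, .tv_nsec = 500 * 1000 * 1000 },
    };
    timerfd_settime(tfd, 0, &its, NULL);

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(tfd, &rfds);
        /* FD_SET() your other descriptors here and widen the first select() argument */
        select(tfd + 1, &rfds, NULL, NULL, NULL);

        if (FD_ISSET(tfd, &rfds)) {
            uint64_t expirations;
            read(tfd, &expirations, sizeof expirations);   /* number of expiries since last read */
            /* handle the timer tick(s) */
        }
    }
}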
Use the timeout parameter - keep your timer events in a priority queue, check the top item and set the timeout accordingly - if the timeout is reached, then you can check that the event is ready to run, run the event and continue.
At least that's what I do.
Note that poll has a nicer interface (in some ways) and may be more efficient with lots of file descriptors.
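Sketched in C (now_ms, next_deadline_ms and run_due_timers are hypothetical helpers over your own priority queue):
#include <sys/select.h>
#include <sys/time.h>
#include <limits.h>

long long now_ms(void);              /* current monotonic time in ms */
long long next_deadline_ms(void);    /* soonest timer deadline, or LLONG_MAX if none queued */
void run_due_timers(void);           /* fire every timer whose deadline has passed */

void wait_for_io_or_timer(fd_set *readfds, int maxfd) {
    struct timeval tv, *tvp = NULL;  /* NULL timeout means "block until an fd is ready" */
    long long deadline = next_deadline_ms();
    if (deadline != LLONG_MAX) {
        long long wait = deadline - now_ms();
        if (wait < 0)
            wait = 0;                /* a timer is already overdue: poll and return */
        tv.tv_sec  = wait / 1000;
        tv.tv_usec = (wait % 1000) * 1000;
        tvp = &tv;
    }
    select(maxfd + 1, readfds, NULL, NULL, tvp);
    run_due_timers();
}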
MarkR has a nice portable solution, but here's another:
Use a POSIX timer (timer_create) and you can transform the problem into "select-able signals". This problem has a classic solution: writing to a pipe from the signal handler and selecting on the read end of the pipe.
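A sketch of that classic trick, assuming SIGRTMIN and a single global pipe (error handling omitted; link with -lrt on older glibc):
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <string.h>

static int timer_pipe[2];            /* [0] = read end (select on this), [1] = write end */

static void on_timer_signal(int sig) {
    char byte = 0;
    (void) sig;
    write(timer_pipe[1], &byte, 1);  /* async-signal-safe: just poke the pipe */
}

void setup_selectable_timer(void) {
    pipe(timer_pipe);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_timer_signal;
    sigaction(SIGRTMIN, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;

    timer_t timerid;
    timer_create(CLOCK_REALTIME, &sev, &timerid);

    struct itimerspec its = {
        .it_value    = { .tv_sec = 1, .tv_nsec = 0 },   /* first expiry in 1 s */
        .it_interval = { .tv_sec = 1, .tv_nsec = 0 },   /* then every 1 s */
    };
    timer_settime(timerid, 0, &its, NULL);

    /* Now add timer_pipe[0] to your select() read set; when it becomes readable,
       drain it with read() and run whatever the timer was scheduled to do. */
}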
Building on MarkR's answer: use a sorted structure to store a callback+closure together with an int and a pointer to an int. If the two ints have the same value, the event is active; otherwise it was discarded.
This way events can be discarded simply by incrementing an int. Perhaps not the most straightforward solution, but it was all I could think of.
https://github.com/cheako/tor2web/tree/6ac67f80daaea01d14a5d07e6026e1af4258dc96/src
hextree.c contains the code for the data structure used.
schedule.c:156 is where the int is changed.
gnutls.c:197 is where the timers are created.
