How to use libuv for direct file descriptor reads? - libuv

As part of an investigation for a project I am working on, I've been looking into different event loop mechanisms/libraries to use for detection and reading of data from sockets. Specifically, what I need to do is simple:
Detect data from client connections
Pass the file descriptor to worker threads to read and process
Epoll edge triggering worked great for this purpose, and I like the edge-triggered behavior because I only get notified once when data is available. I tried implementing this with libev, doing something like the pseudo code below, and it appears to work:
void read_cb(struct ev_loop *loop, struct ev_io *watcher, int revents) {
1. Check for errors
2. ev_io_stop(loop, watcher) so I don't get constantly notified
3. Assign the ev_io watcher pointer into a worker-thread-accessible data structure
4. Signal the worker thread
5. Worker thread begins reading from watcher->fd
6. When the worker thread gets EAGAIN, start the watcher again
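In real libev code, the callback above might look roughly like this. This is only a sketch; enqueue_for_worker is a placeholder for whatever queue/signal mechanism the worker pool uses, and restarting the watcher from the loop thread (e.g. via an ev_async) is left out:
#include <ev.h>

/* Hypothetical hand-off: stores the watcher somewhere the worker can see it
 * and signals the worker thread (condition variable, queue, etc.). */
void enqueue_for_worker(ev_io *watcher);

static void read_cb(struct ev_loop *loop, ev_io *watcher, int revents)
{
    if (revents & EV_ERROR)
        return;                       /* 1. check for errors */
    ev_io_stop(loop, watcher);        /* 2. stop so we aren't re-notified */
    enqueue_for_worker(watcher);      /* 3./4. hand off and signal the worker */
    /* 5./6. the worker read()s watcher->fd until EAGAIN, then asks the
     * event-loop thread (e.g. via an ev_async) to ev_io_start() it again */
}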
Since libuv is intended for a similar purpose and is edge triggered, I am trying to do something similar but haven't been successful yet. With libuv, I understand that you can use uv_read_start for reading data from streams, but with this method the uv_read_cb hands you a buffer already filled with the data. As I need to control the amount of data that is read, and want to avoid an extra copy from this buffer into a different structure, I'd like to be able to read directly from the socket.
Is this scenario something that libuv can be used for?
Thanks in advance!

This commit adds the possibility to get the file descriptor of an underlying stream: https://github.com/joyent/libuv/commit/4ca9a363897cfa60f4e2229e4f15ac5abd7fd103
You can use:
int uv_fileno(const uv_handle_t* handle, uv_os_fd_t* fd);
Then read from the FD however you see fit.
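For illustration, a sketch of what that could look like: libuv's TCP sockets are already non-blocking, so a worker can read until EAGAIN. The helper names fd_of and drain_fd are mine, not part of libuv:
#include <errno.h>
#include <unistd.h>
#include <uv.h>

/* In the loop thread: grab the raw descriptor so it can be handed to a worker. */
static int fd_of(uv_tcp_t *client)
{
    uv_os_fd_t fd;
    if (uv_fileno((uv_handle_t *)client, &fd) != 0)
        return -1;              /* handle not open / no descriptor yet */
    return (int)fd;             /* on Unix uv_os_fd_t is an int */
}

/* In the worker thread: read directly from the socket until EAGAIN. */
static void drain_fd(int fd)
{
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* parse/process the n bytes in place -- no extra copy */
    }
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
        /* real error: report and close */
    }
}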

I was finally able to find an example that does what I describe in my previous post. For those who'd be interested in how this is done, here is the link.
Testing this has yielded additional questions, but I'll post those separately as they relate more to edge/level trigger behaviour than to the library itself.

Related

How do I implement event driven POSIX threads?

I'm coding for a linux platform using C. Let's say I have 2 threads. A and B.
A is an infinite loop constantly trying to find out if there is data on the socket localhost:8080, whereas B is a thread that spends most of its time in a blocked state until A calls the unlock function on a mutex that B uses to block itself. A will unlock B when it receives appropriate data on the socket.
So you see here is the problem. B is largely "event driven" whereas A is in a constant running state. My target platform isn't resource rich, so I wish A could be "activated" and enter the running state only when it receives data on the socket, instead of constantly looping.
So how can I do that? If it matters - I wish to do this for both UDP and TCP sockets.
There are multiple ways of doing what you want cleanly. One approach, which you are kind of using already, is an event system. A full event system would be overkill for the kind of problem you are dealing with, but one can be found here. This is a (random) better implementation, capable of listening for multiple file descriptors and time-based events, all in a single thread.
If you want to build one yourself, you should take a look at the select() or poll() functions.
But I agree with @Jeremy Friesner: you should definitely use the functions made for socket programming, they are perfect for your kind of problem. Only use the event system approach if you really need it (with multiple sockets/timed events).
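If you do go the select() route, a minimal single-socket sketch might look like this: the thread sleeps in the kernel until data arrives, so nothing spins. It assumes `sock` is an already-connected TCP socket or a bound UDP socket:
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Block until the socket is readable, then handle the data -- no busy loop. */
static void wait_and_handle(int sock)
{
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);

        int ready = select(sock + 1, &rfds, NULL, NULL, NULL);
        if (ready < 0) {
            perror("select");
            break;
        }
        if (FD_ISSET(sock, &rfds)) {
            char buf[1024];
            ssize_t n = recv(sock, buf, sizeof(buf), 0);
            if (n <= 0)
                break;          /* error or peer closed */
            /* process n bytes / wake thread B here */
        }
    }
}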
You simply call recv (or recvfrom, recvmsg, etc) and it doesn't return until some data has been received. There's no need to "constantly try to find out if there is data" - that's silly.
If you set the socket to non-blocking mode then recv will return even if there's no data. If that's what you're doing, then the solution is simple: don't set the socket to non-blocking mode.

epoll IO with worker threads in C

I am writing a small server that will receive data from multiple sources and process this data. The number of sources and the amount of data received are significant, but no more than epoll should be able to handle quite well. However, all received data must be parsed and run through a large number of tests, which is time consuming and will block a single thread despite epoll multiplexing. Basically, the pattern should be something like the following: the IO loop receives data and bundles it into a job, sends it to the first available thread in the pool, the bundle is processed by the worker, and the result is passed back to the IO loop for writing to a file.
I have decided to go for a single IO thread and N worker threads. The IO thread for accepting TCP connections and reading data is easy to implement using the example provided at:
http://linux.die.net/man/7/epoll
Threads are also usually easy enough to deal with, but I am struggling to combine the epoll IO loop with a thread pool in an elegant manner. I am unable to find any "best practice" for using epoll with a worker pool online either, only quite a few questions regarding the same topic.
I therefore have some questions I hope someone can help me answer:
Could (and should) eventfd be used as a mechanism for 2-way synchronization between the IO thread and all the workers? For instance, is it a good idea for each worker thread to have its own epoll routine waiting on a shared eventfd (with a struct pointer, containing data/info about the job) i.e. using the eventfd as a job queue somehow? Also perhaps have another eventfd to pass results back into the IO thread from multiple worker threads?
After the IO thread is signaled about more data on a socket, should the actual recv take place on the IO thread, or should the workers recv the data on their own in order not to block the IO thread while parsing data frames etc.? In that case, how can I ensure safety, e.g. if recv reads 1.5 frames of data in one worker thread and another worker thread receives the last 0.5 frame of data from the same connection?
If the worker thread pool is implemented through mutexes and such, will waiting for locks block the IO thread if N+1 threads are trying to use the same lock?
Are there any good practice patterns for how to build a worker thread pool around epoll with two way communication (i.e. both from IO to workers and back)?
EDIT: Could one possible solution be to update a ring buffer from the IO loop and, after the update, send the ring buffer index to the workers through a pipe shared by all workers (thus giving control of that index to the first worker that reads it off the pipe), let that worker own the index until the end of processing, and then send the index number back to the IO thread through a pipe again, thus returning control?
My application is Linux-only, so I can use Linux-only functionality in order to achieve this in the most elegant way possible. Cross platform support is not needed, but performance and thread safety is.
In my tests, one epoll instance per thread outperformed complicated threading models by far. If listener sockets are added to all epoll instances, the workers would simply accept(2) and the winner would be awarded the connection and process it for its lifetime.
Your workers could look something like this:
struct epoll_event evs[1024];
int nfds, i;

for (;;) {
    nfds = epoll_wait(worker->efd, evs, 1024, -1);
    for (i = 0; i < nfds; i++)
        ((struct socket_context *)evs[i].data.ptr)->handler(
            evs[i].data.ptr,
            evs[i].events);
}
And every file descriptor added to an epoll instance could have a struct socket_context associated with it:
void listener_handler(struct socket_context* ctx, int ev)
{
    /* allocate a context for the new connection */
    struct socket_context* conn = malloc(sizeof(*conn));

    conn->fd = accept(ctx->fd, NULL, NULL);
    conn->handler = conn_handler;
    /* add to calling worker's epoll instance or implement some form
     * of load balancing */
}

void conn_handler(struct socket_context* ctx, int ev)
{
    /* read all available data and process. if incomplete, stash
     * data in ctx and continue next time handler is called */
}

void dummy_handler(struct socket_context* ctx, int ev)
{
    /* handle exit condition async by adding a pipe with its
     * own handler */
}
I like this strategy because:
very simple design;
all threads are identical;
workers and connections are isolated--no stepping on each other's toes or calling read(2) in the wrong worker;
no locks are required (the kernel gets to worry about synchronization on accept(2));
somewhat naturally load balanced since no busy worker will actively contend on accept(2).
And some notes on epoll:
use edge-triggered mode, non-blocking sockets and always read until EAGAIN;
avoid dup(2) family of calls to spare yourself from some surprises (epoll registers file descriptors, but actually watches file descriptions);
you can epoll_ctl(2) other threads' epoll instances safely;
use a large struct epoll_event buffer for epoll_wait(2) to avoid starvation.
Some other notes:
use accept4(2) to save a system call;
use one thread per core (1 for each physical core if CPU-bound, or 1 for each logical core if I/O-bound);
poll(2)/select(2) is likely faster if connection count is low.
I hope this helps.
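To make the edge-triggered notes above concrete, here is a sketch of registering a socket in edge-triggered mode and draining it until EAGAIN. The helper names are mine, not from the answer's code:
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd with a worker's epoll instance in edge-triggered mode.
 * The socket must be non-blocking; ctx is whatever per-connection state
 * the handler needs (e.g. a struct socket_context). */
static int watch_fd(int epfd, int fd, void *ctx)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET;     /* edge-triggered read readiness */
    ev.data.ptr = ctx;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* In the handler: always read until EAGAIN, or the edge is lost. */
static void drain(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                  /* append to the connection's buffer here */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                     /* drained: wait for the next edge */
        break;                         /* n == 0 (peer closed) or a real error */
    }
}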
With this model, because we only know the packet size once we have fully received the packet, we unfortunately cannot offload the receive itself to a worker thread. Instead, the best we can do is dedicate a thread to receiving the data, which then passes off pointers to fully received packets.
The data itself is probably best held in a circular buffer; however, we will want a separate buffer for each input source (if we get a partial packet, we can continue receiving from other sources without splitting up the data). The remaining question is how to inform the workers when a new packet is ready and give them a pointer to the data in said packet. Because there is little data here, just some pointers, the most elegant way of doing this would be with POSIX message queues. These allow multiple senders and multiple receivers to write and read messages, always ensuring every message is received, and by precisely one thread.
You will want a struct resembling the one below for each data source; I shall go through the fields' purposes now.
struct DataSource
{
    int   SourceFD;
    char  DataBuffer[MAX_PACKET_SIZE * (THREAD_COUNT + 1)];
    char *LatestPacket;
    char *CurrentLocation;
    int   SizeLeft;
};
SourceFD is obviously the file descriptor of the data stream in question. DataBuffer is where packet contents are held while being processed; it is a circular buffer. The LatestPacket pointer temporarily holds a pointer to the most recent packet in case we receive a partial packet and move on to another source before passing the packet off. CurrentLocation stores where the latest packet ends so that we know where to place the next one, or where to carry on in case of a partial receive. SizeLeft is the room left in the buffer; it is used to tell whether we can fit the packet or need to circle back around to the beginning.
The receiving function will thus effectively:
Copy the contents of the packet into the buffer
Move CurrentLocation to point to the end of the packet
Update SizeLeft to account for the now decreased buffer
If we cannot fit the packet in the end of the buffer we cycle around
If there is no room there either we try again a bit later, going to another source meanwhile
If we had a partial receive, set the LatestPacket pointer to point to the start of the packet and go to another stream until we get the rest
Send a message using a POSIX message queue to a worker thread so it can process the data. The message will contain a pointer to the DataSource structure so it can work on it; it also needs a pointer to the packet it is working on, and its size, which can be calculated when we receive the packet
The worker thread will do its processing using the received pointers and then increase SizeLeft so the receiver thread knows it can carry on filling the buffer. Atomic functions will be needed to work on the size value in the struct so we don't get race conditions on the size property (it can be written by a worker and the IO thread simultaneously, causing lost writes; see my comment below); they are listed here and are simple and extremely useful.
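For illustration, that SizeLeft bookkeeping with the GCC __sync builtins linked below could look like this (a sketch; call these with &source->SizeLeft):
/* Lock-free bookkeeping on DataSource.SizeLeft using GCC __sync builtins. */
static void reserve_room(int *size_left, int packet_size)
{
    __sync_sub_and_fetch(size_left, packet_size);   /* IO thread, when storing a packet */
}

static void release_room(int *size_left, int packet_size)
{
    __sync_add_and_fetch(size_left, packet_size);   /* worker, when finished with it */
}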
Now, I have given some general background but will address the points given specifically:
Using an eventfd as a synchronization mechanism is largely a bad idea: you will find yourself using a fair amount of unneeded CPU time, and it is very hard to perform any real synchronization. In particular, if you have multiple threads pick up the same file descriptor, you could have major problems. This is in effect a nasty hack that will work sometimes but is no real substitute for proper synchronization.
It is also a bad idea to try to offload the receive, as explained above. You can get around the issue with complex IPC, but frankly it is unlikely that receive IO will take enough time to stall your application; your IO is also likely much slower than your CPU, so receiving with multiple threads will gain little (this assumes you do not, say, have several 10 gigabit network cards).
Using mutexes or locks is a silly idea here; the problem fits much better into lockless coding given the low amount of (simultaneously) shared data, since you are really just handing off work and data. This will also boost the performance of the receive thread and make your app far more scalable. Using the functions mentioned here http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html you can do this nicely and easily. If you did do it this way, what you would need is a semaphore; it can be posted every time a packet is received and waited on by each thread that starts a job, to allow dynamically more threads in if more packets are ready. That would have far less overhead than a homebrew solution with mutexes.
There is not really much difference here from any thread pool: you spawn a lot of threads and then have them all block in mq_receive on the data message queue to wait for messages. When they are done, they send their result back to the main thread, which adds the result message queue to its epoll list. It can then receive results this way; it is simple and very efficient for small data payloads like pointers. This will also use little CPU and not force the main thread to waste time managing workers.
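A sketch of that hand-off with POSIX message queues (the queue name and the Job struct are mine; the payload is just a few pointers, and on older glibc you link with -lrt):
#include <fcntl.h>
#include <mqueue.h>

struct DataSource;                  /* as defined earlier in this answer */

/* The job message: a couple of pointers and a length. */
struct Job {
    struct DataSource *source;
    char              *packet;
    int                size;
};

/* IO thread: create/open the queue once, then post jobs to it. */
static mqd_t open_job_queue(void)
{
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = sizeof(struct Job) };
    return mq_open("/job_queue", O_CREAT | O_RDWR, 0600, &attr);
}

static int post_job(mqd_t q, const struct Job *job)
{
    return mq_send(q, (const char *)job, sizeof(*job), 0);
}

/* Worker thread: block in mq_receive until a job arrives. */
static int take_job(mqd_t q, struct Job *job)
{
    return mq_receive(q, (char *)job, sizeof(*job), NULL) == sizeof(*job) ? 0 : -1;
}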
Finally, your edit is fairly sensible, except that, as I have suggested, message queues are far better than pipes here, as they very efficiently signal events, guarantee a full message read and provide automatic framing.
I hope this helps, however it is late so if I missed anything or you have questions feel free to comment for clarification or more explanation.
I posted the same answer to another question: I want to wait on both a file descriptor and a mutex, what's the recommended way to do this?
==========================================================
This is a very commonly seen problem, especially when you are developing a network server-side program. Most Linux server-side programs' main loop looks like this:
epoll_add(serv_sock);
while(1){
    ret = epoll_wait();
    foreach(ret as fd){
        req = fd.read();
        resp = proc(req);
        fd.send(resp);
    }
}
This is a single-threaded (main thread only), epoll-based server framework. The problem is exactly that it is single-threaded, not multi-threaded: it requires that proc() never block or run for a significant time (say 10 ms in common cases).
If proc() ever runs for a long time, WE NEED MULTIPLE THREADS, and must execute proc() in a separate thread (the worker thread).
We can submit a task to the worker thread without blocking the main thread, using a mutex-based message queue; it is fast enough.
Then we need a way to obtain the task result from a worker thread. How? If we just check the message queue directly, before or after epoll_wait(), the check will only run once epoll_wait() ends, and epoll_wait() usually blocks for its full timeout (say 10 ms in common cases) if none of the file descriptors it waits on are active.
For a server, 10 ms is quite a long time! Can we signal epoll_wait() to end immediately when a task result is generated?
Yes! I will describe how it is done in one of my open source projects.
Create a pipe shared by all worker threads, and have epoll wait on that pipe as well. Once a task result is generated, the worker thread writes one byte into the pipe, and epoll_wait() ends almost immediately - a Linux pipe has 5 us to 20 us of latency.
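In plain C the idea might look like this (a sketch, not SSDB's actual SelectableQueue code):
#include <sys/epoll.h>
#include <unistd.h>

/* One pipe shared by the workers: the read end sits in the main thread's
 * epoll set, the write end is used to wake it up. */
static int wake_fds[2];               /* [0] = read end, [1] = write end */

static int setup_wakeup(int epfd)
{
    if (pipe(wake_fds) < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = wake_fds[0] };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, wake_fds[0], &ev);
}

/* Worker thread, right after pushing a result onto the result queue: */
static void wake_main_thread(void)
{
    char b = 1;
    write(wake_fds[1], &b, 1);        /* epoll_wait() returns almost at once */
}

/* Main thread, when epoll reports wake_fds[0] readable: */
static void drain_wakeups(void)
{
    char buf[64];
    read(wake_fds[0], buf, sizeof(buf));   /* discard the bytes, then pop results */
}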
In my project SSDB (a Redis-protocol-compatible on-disk NoSQL database), I created a SelectableQueue for passing messages between the main thread and worker threads. As its name implies, SelectableQueue has a file descriptor which can be waited on by epoll.
SelectableQueue: https://github.com/ideawu/ssdb/blob/master/src/util/thread.h#L94
Usage in main thread:
epoll_add(serv_sock);
epoll_add(queue->fd());
while(1){
    ret = epoll_wait();
    foreach(ret as fd){
        if(fd is worker_thread){
            sock, resp = worker->pop_result();
            sock.send(resp);
        }
        if(fd is client_socket){
            req = fd.read();
            worker->add_task(fd, req);
        }
    }
}
Usage in worker thread:
fd, req = queue->pop_task();
resp = proc(req);
queue->add_result(fd, resp);

How to handle hooked WSARecv

I'm working on a project that involves hooking WSARecv. I know how to hook this function; I mean, it's just the same as hooking any other function. Anyway, the hard part is when WSARecv is used to perform overlapped operations. The idea is to intercept the data when an application receives it and make it possible to modify it; I'm using pipes for this. The native DLL tunnels all data to a managed 'server', which processes the input etc. and returns it to the native DLL. This works great for WSASend, send and recv. However, the hard part is when an application uses overlapped sockets.
So I need to get the received data before I can process it, and this is the hard part. How would I do something like this? I thought of the following, but both options seem like a mess:
When WSARecv is called with a WSAOVERLAPPED structure:
Create a new thread, call WaitForSingleObject and pass it the hEvent of the WSAOVERLAPPED structure. When the event is signaled, send the data to the managed server for processing and then pass the data on to the program.
When WSARecv is called with a completion routine:
Create a new thread and modify the call to the original function so that lpCompletionRoutine points to a new function. Use SleepEx to put the thread in an alertable state. When the completion routine is called, process the data and pass the data back to the program.
I could post my code, but I didn't write it because it seems like a bad solution, so there is not really a point to that.
I cannot think of a better solution, and this one seems horrible: when an application calls WSARecv a lot (for example, a large server using overlapped sockets to handle lots of clients), it creates a new thread for every call, and that just seems like a bad idea.
So how can I do such thing?
There's no need to create a thread for each overlapped IO call.
When overlapped operations are used, they either have an associated event (which you can safely ignore), a completion routine, or are associated with an I/O Completion port.
To handle the first two cases you should hook both WSARecv() and WSAGetOverlappedResult().
If you need to handle the last, you'll also need to hook GetQueuedCompletionStatus()
Now, when you get a call to WSARecv(), for the event case, you do nothing special there (except possibly save some information in relation to the lpOverlapped, eg. the buffer), and process the data in WSAGetOverlappedResult() (which the application must call to get the success/error and bytes transferred.)
If a completion routine is present, save the lpOverlapped and lpCompletionRoutine, and pass your own completion routine to the real WSARecv().
Your routine should process the data and call the original completion routine.
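A sketch of what that wrapper could look like; Real_WSARecv, save_original and find_original are placeholders for your detour and bookkeeping code (a real hook would also save the WSABUF pointers and protect the lookup table with a lock):
#include <winsock2.h>

static LPWSAOVERLAPPED_COMPLETION_ROUTINE find_original(LPWSAOVERLAPPED ov);
static void save_original(LPWSAOVERLAPPED ov, LPWSAOVERLAPPED_COMPLETION_ROUTINE r);
extern int (WSAAPI *Real_WSARecv)(SOCKET, LPWSABUF, DWORD, LPDWORD, LPDWORD,
                                  LPWSAOVERLAPPED, LPWSAOVERLAPPED_COMPLETION_ROUTINE);

static void CALLBACK my_completion(DWORD err, DWORD transferred,
                                   LPWSAOVERLAPPED ov, DWORD flags)
{
    /* inspect/modify the `transferred` bytes in the saved buffers here */
    LPWSAOVERLAPPED_COMPLETION_ROUTINE orig = find_original(ov);
    if (orig)
        orig(err, transferred, ov, flags);    /* keep the application working */
}

int WSAAPI Hooked_WSARecv(SOCKET s, LPWSABUF bufs, DWORD count, LPDWORD recvd,
                          LPDWORD flags, LPWSAOVERLAPPED ov,
                          LPWSAOVERLAPPED_COMPLETION_ROUTINE routine)
{
    if (ov && routine) {
        save_original(ov, routine);           /* remember the app's routine */
        return Real_WSARecv(s, bufs, count, recvd, flags, ov, my_completion);
    }
    return Real_WSARecv(s, bufs, count, recvd, flags, ov, routine);
}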
To handle the I/O completion port case, have WSARecv() save lpOverlapped, the buffers, etc.; then in GetQueuedCompletionStatus(), call the original, and if the returned overlapped structure matches one you saved, handle the data.
You should also note that overlapped operations may complete immediately, in which case the event isn't signaled, the completion routine isn't called, and (IIRC) no completion is queued on the IOCP.

Proper way to handle pthread communication / signals in this instance?

I'm writing a small client/server demo that shares files between peers. Once a peer gets a list of IP addresses from the main server, the main thread creates a thread for each respective file. The process looks like this:
Main thread gets list of files from server
Thread created for each file (detached)
In each created thread, connect to the peers specified / associated with a file
Thread downloads the file in chunks
Thread announces the file was complete
My problem comes into play when trying to "query" a thread. In each thread, I keep track of the progress of a transfer. In my main thread, I would like the user to be able to see the progress of all of the transfers taking place. What would be the best way to do so? I was thinking about sending a signal using pthread_kill to each thread respectively, although it seems like there should be a better way. If anyone has an idea, I'd love to hear it.
When you create your thread, you pass in a void * that can point to anything you wish. In your example, you could declare an array of progress values and pass the address of one of them to each thread you create, let the thread perform a simple update when it needs to, and your main thread can periodically check the values.
If you're already using that parameter for something, you will need to create a structure comprising this new value and whatever you're already using, and pass the address of it so the thread gets everything it needs.
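A minimal sketch of that idea: an array of progress slots, one per file, with the slot's address passed to pthread_create. For strictly race-free sharing you might make the slots _Atomic or guard them with a mutex; a plain int is shown here for brevity:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_FILES 4

static int progress[NUM_FILES];          /* 0..100, one slot per transfer */

static void *download_file(void *arg)
{
    int *my_progress = arg;
    for (int pct = 0; pct <= 100; pct += 10) {
        /* ... download the next chunk here ... */
        *my_progress = pct;              /* the thread only writes its own slot */
        usleep(100 * 1000);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_FILES];
    for (int i = 0; i < NUM_FILES; i++)
        pthread_create(&tids[i], NULL, download_file, &progress[i]);

    for (;;) {                           /* main thread: display progress */
        int done = 0;
        for (int i = 0; i < NUM_FILES; i++) {
            printf("file %d: %3d%%  ", i, progress[i]);
            done += (progress[i] == 100);
        }
        printf("\n");
        if (done == NUM_FILES)
            break;
        sleep(1);
    }
    for (int i = 0; i < NUM_FILES; i++)
        pthread_join(tids[i], NULL);
    return 0;
}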

How to signal select() to return immediately?

I have a worker thread that is listening to a TCP socket for incoming traffic, and buffering the received data for the main thread to access (let's call this socket A). However, the worker thread also has to do some regular operations (say, once per second), even if there is no data coming in. Therefore, I use select() with a timeout, so that I don't need to keep polling. (Note that calling receive() on a non-blocking socket and then sleeping for a second is not good: the incoming data should be immediately available for the main thread, even though the main thread might not always be able to process it right away, hence the need for buffering.)
Now, I also need to be able to signal the worker thread to do some other stuff immediately; from the main thread, I need to make the worker thread's select() return right away. For now, I have solved this as follows (approach basically adopted from here and here):
At program startup, the worker thread creates for this purpose an additional socket of the datagram (UDP) type, and binds it to some random port (let's call this socket B). Likewise, the main thread creates a datagram socket for sending. In its call to select(), the worker thread now lists both A and B in the fd_set. When the main thread needs to signal, it sendto()'s a couple of bytes to the corresponding port on localhost. Back in the worker thread, if B remains in the fd_set after select() returns, then recvfrom() is called and the bytes received are simply ignored.
This seems to work very well, but I can't say I like the solution, mainly as it requires binding an extra port for B, and also because it adds several additional socket API calls which may fail I guess – and I don't really feel like figuring out the appropriate action for each of the cases.
I think ideally, I would like to call some function which takes A as input, and does nothing except makes select() return right away. However, I don't know such a function. (I guess I could for example shutdown() the socket, but the side effects are not really acceptable :)
If this is not possible, the second best option would be creating a B which is much dummier than a real UDP socket, and doesn't really require allocating any limited resources (beyond a reasonable amount of memory). I guess Unix domain sockets would do exactly this, but: the solution should not be much less cross-platform than what I currently have, though some moderate amount of #ifdef stuff is fine. (I am targeting mainly for Windows and Linux – and writing C++ by the way.)
Please don't suggest refactoring to get rid of the two separate threads. This design is necessary because the main thread may be blocked for extended periods (e.g., doing some intensive computation – and I can't start periodically calling receive() from the innermost loop of calculation), and in the meanwhile, someone needs to buffer the incoming data (and due to reasons beyond what I can control, it cannot be the sender).
Now that I was writing this, I realized that someone is definitely going to reply simply "Boost.Asio", so I just had my first look at it... Couldn't find an obvious solution, though. Do note that I also cannot (easily) affect how socket A is created, but I should be able to let other objects wrap it, if necessary.
You are almost there. Use the "self-pipe" trick: open a pipe, add its read end to your select() read fd_set, and write to it from the main thread to unblock the worker thread. It is portable across POSIX systems.
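On POSIX the trick looks roughly like this (call pipe(wakeup_pipe) once at startup; note that on Windows select() only accepts sockets, hence the #ifdef variant mentioned below):
#include <sys/select.h>
#include <unistd.h>

static int wakeup_pipe[2];          /* [0] read end (worker), [1] write end (main) */

/* Worker thread: wait on socket A and the pipe's read end together. */
static void worker_wait(int sock_a, struct timeval *timeout)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(sock_a, &rfds);
    FD_SET(wakeup_pipe[0], &rfds);
    int maxfd = (sock_a > wakeup_pipe[0] ? sock_a : wakeup_pipe[0]) + 1;

    if (select(maxfd, &rfds, NULL, NULL, timeout) > 0) {
        if (FD_ISSET(wakeup_pipe[0], &rfds)) {
            char buf[16];
            read(wakeup_pipe[0], buf, sizeof(buf));   /* consume and ignore */
            /* handle the "do something now" request */
        }
        if (FD_ISSET(sock_a, &rfds)) {
            /* receive and buffer data from A */
        }
    }
}

/* Main thread: make the worker's select() return right away. */
static void poke_worker(void)
{
    char b = 1;
    write(wakeup_pipe[1], &b, 1);
}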
I have seen a variant of a similar technique for Windows in one system (in fact used together with the method above, separated by #ifdef WIN32). Unblocking can be achieved by adding a dummy (unbound) datagram socket to the fd_set and then closing it. The downside is, of course, that you have to re-open it every time.
However, in the aforementioned system, both of these methods are used rather sparingly, and for unexpected events (e.g., signals, termination requests). Preferred method is still a variable timeout to select(), depending on how soon something is scheduled for a worker thread.
Using a pipe rather than socket is a bit cleaner, as there is no possibility for another process to get hold of it and mess things up.
Using a UDP socket definitely creates the potential for stray packets to come in and interfere.
An anonymous pipe will never be available to any other process (unless you give it to it).
You could also use signals, but in a multithreaded program you'll want to make sure that all threads except for the one you want have that signal masked.
On Unix it is straightforward using a pipe. If you are on Windows and want to keep using select() so your code stays compatible with Unix, the trick of creating an unbound UDP socket and closing it works well and is easy. But you have to make it thread-safe.
The only way I found to make this thread-safe is to close and recreate the socket in the same thread the select statement is running in. Of course this is difficult if that thread is blocking on the select. This is where the Windows call QueueUserAPC comes in. While Windows is blocking in the select statement, the thread can handle Asynchronous Procedure Calls. You can schedule one from a different thread using QueueUserAPC. Windows interrupts the select, executes your function in the same thread, and continues with the select statement. You can then close the socket and recreate it in your APC method. Guaranteed thread-safe, and you will never lose a signal.
To keep it simple:
Save the socket handle in a global variable, then close that global socket and select() will return immediately: closesocket(g_socket);
