If FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is set on a file handle that is bound to an I/O completion port, then an OVERLAPPED structure needs to be deallocated when its I/O completes synchronously.
Otherwise, it needs to stay alive until a worker processes the notification from an I/O completion port.
This all sounds good until you realize that this only works if you manage the file handle yourself.
But if someone else gives you the file handle, how are you supposed to know when you should free the OVERLAPPED structure? Is there any way to discover this after the fact?
Otherwise, does this basically imply that you cannot correctly perform overlapped I/O on any file handle whose completion-notification state you cannot guarantee?
I'm not sure that your scenario makes sense.
Your clarified scenario - successfully performing I/O on an arbitrary file handle, without even knowing whether it is asynchronous or not - is challenging, I think very unusual, and almost certainly not how the API was designed to be used, but perhaps (as you suggest) not entirely implausible.
(Although I don't think you can avoid requiring some cooperation between the caller and your code, because in the IOCP case, the caller has to be able to tell whose I/O a dequeued packet belongs to. You could do this by having the caller allocate the OVERLAPPED structures, as RbMm suggests, but it might be simpler to ask them for a completion key to use.)
I'm not certain offhand how Windows behaves if you provide a redundant event handle, e.g., when the I/O is actually synchronous or using IOCP. But I would guess that it isn't going to be a problem in practice, so provided you're not too worried about future-proofing, you're probably OK.
At any rate, it isn't all that difficult to deal with the particular issue your question asks about. Basically, you just need to prevent the structure from being released twice.
Before making each call, assign a unique completion key and add it to a linked list or other suitable global structure. (The structure must be capable of an atomic find-and-remove operation, or be protected by a critical section or similar.)
If the call succeeds immediately, i.e., does not report that the I/O is pending, treat it exactly as if a queued packet were received from the IOCP queue. Typically, you would either use a common function that is called by both your IOCP thread and your I/O thread, or call PostQueuedCompletionStatus to manually insert a packet into the IOCP queue.
When a packet is received (or when the call succeeds immediately), first perform a find-and-remove for the completion key against the global structure. If the find fails, you know that you have already been notified of the success of the I/O, and don't need to do anything.
If the find-and-remove succeeds, process the I/O as appropriate and release the OVERLAPPED structure.
There are undoubtedly ways to optimize the same basic approach.
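To make that concrete, here is a minimal sketch of the find-and-remove bookkeeping, using a linked list protected by a critical section. All the names here (pending_io_t, pending_take, and so on) are illustrative, not part of any real API:

#include <windows.h>

typedef struct pending_io {
    struct pending_io *next;
    ULONG_PTR key;          /* unique completion key for this request */
    OVERLAPPED *ovl;        /* the structure we may need to free */
} pending_io_t;

static pending_io_t *pending_list;
static CRITICAL_SECTION pending_lock;   /* InitializeCriticalSection at startup */

/* Register a request before issuing the I/O call. */
void pending_add(pending_io_t *io)
{
    EnterCriticalSection(&pending_lock);
    io->next = pending_list;
    pending_list = io;
    LeaveCriticalSection(&pending_lock);
}

/* Atomic find-and-remove by completion key; returns NULL if the
   request was already handled, i.e., we were notified twice. */
pending_io_t *pending_take(ULONG_PTR key)
{
    pending_io_t **pp, *p;
    EnterCriticalSection(&pending_lock);
    for (pp = &pending_list; (p = *pp) != NULL; pp = &p->next) {
        if (p->key == key) {
            *pp = p->next;
            break;
        }
    }
    LeaveCriticalSection(&pending_lock);
    return p;
}

Both the IOCP thread and the thread whose call succeeded immediately call pending_take with the same key; whichever one gets a non-NULL result processes the I/O and releases the OVERLAPPED structure, and the other does nothing.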
Addendum: if the caller is processing the IOCP packets, and providing you with a completion key to use, you won't be able to use a unique completion key on each request. In this scenario, you can use the pointer to the OVERLAPPED structure instead.
The reason (in the general case) for not using the pointer is that you might receive a packet containing a completion key from one I/O request along with an OVERLAPPED structure from a different one, because the OVERLAPPED structure might be both released and reassigned before a duplicate notification is processed. That doesn't matter in this case, because all of your requests will use the same completion key anyway.
Addendum^2: if you don't know anything about the handle, you'll also need to provide an event object for each OVERLAPPED structure, and wait on them in case notification of the I/O completion arrives that way. It's getting too late in the day for me to try to figure out the exact consequences of that, but it may mean that under some circumstances you get three notifications for the same I/O operation. You might be able to avoid that, but if not, this approach will still work.
Is there any way to discover this after the fact?
Yes, there is: you need to use ZwQueryInformationFile with FileIoCompletionNotificationInformation.
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION is defined in wdm.h.
So the code we need for the query is:

FILE_IO_COMPLETION_NOTIFICATION_INFORMATION ficni;
ZwQueryInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation);

Demo code for setting and querying:
HANDLE hFile;
IO_STATUS_BLOCK iosb;
STATIC_OBJECT_ATTRIBUTES(oa, "\\systemroot\\notepad.exe");

if (0 <= ZwOpenFile(&hFile, FILE_GENERIC_READ, &oa, &iosb, FILE_SHARE_VALID_FLAGS, 0))
{
    FILE_IO_COMPLETION_NOTIFICATION_INFORMATION ficni = { FILE_SKIP_COMPLETION_PORT_ON_SUCCESS };

    if (0 <= ZwSetInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation))
    {
        ficni.Flags = 0x12345678;

        if (0 > ZwQueryInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation) ||
            !(ficni.Flags & FILE_SKIP_COMPLETION_PORT_ON_SUCCESS))
        {
            __debugbreak();
        }
    }
    ZwClose(hFile);
}
Also, let me copy and paste from wdm.h (so no one can say this is "undocumented"):
//
// Don't queue an entry to an associated completion port if returning success
// synchronously.
//
#define FILE_SKIP_COMPLETION_PORT_ON_SUCCESS 0x1
//
// Don't set the file handle event on IO completion.
//
#define FILE_SKIP_SET_EVENT_ON_HANDLE 0x2
//
// Don't set user supplied event on successful fast-path IO completion.
//
#define FILE_SKIP_SET_USER_EVENT_ON_FAST_IO 0x4
typedef struct _FILE_IO_COMPLETION_NOTIFICATION_INFORMATION {
    ULONG Flags;
} FILE_IO_COMPLETION_NOTIFICATION_INFORMATION, *PFILE_IO_COMPLETION_NOTIFICATION_INFORMATION;
I do have a question, though: for what reason is this declared in wdm.h?
So I have just discovered that libuv is a fairly small library as far as C libraries go (compared to FFmpeg). I have spent the past 6 hours reading through the source code to get a feel for the event loop at a deeper level. But I am still not seeing where the "non-blockingness" is implemented, i.e., where some event or interrupt signal or whatnot is invoked in the codebase.
I have been using Node.js for over 8 years, so I am familiar with how to use an async non-blocking event loop, but I never actually looked into the implementation.
My question is twofold:
Where exactly is the "looping" occurring within libuv?
What are the key steps in each iteration of the loop that make it non-blocking and async?
So we start with a hello world example. All that is required is this:
#include <stdio.h>
#include <stdlib.h>
#include <uv.h>
int main() {
    uv_loop_t *loop = malloc(sizeof(uv_loop_t));
    uv_loop_init(loop);            // initialize datastructures.
    uv_run(loop, UV_RUN_DEFAULT);  // infinite loop as long as queue is full?
    uv_loop_close(loop);
    free(loop);
    return 0;
}
The key function I have been exploring is uv_run. The uv_loop_init function essentially initializes data structures, so not too much fanciness there, I don't think. But the real magic seems to happen with uv_run, somewhere. A high-level set of code snippets from the libuv repo is in this gist, showing what the uv_run function calls.
Essentially it seems to boil down to this:
while (NOT_STOPPED) {
    uv__update_time(loop)
    uv__run_timers(loop)
    uv__run_pending(loop)
    uv__run_idle(loop)
    uv__run_prepare(loop)
    uv__io_poll(loop, timeout)
    uv__run_check(loop)
    uv__run_closing_handles(loop)
    // ... cleanup
}
Those functions are in the gist.
uv__run_timers: runs timer callbacks? Loops with for (;;) {.
uv__run_pending: runs regular callbacks? Loops through the queue with while (!QUEUE_EMPTY(&pq)) {.
uv__run_idle: no source code
uv__run_prepare: no source code
uv__io_poll: does I/O polling? (Can't quite tell what this means, though.) Has 2 loops: while (!QUEUE_EMPTY(&loop->watcher_queue)) { and for (;;) {.
And then we're done. And the program exits, because there is no "work" to be done.
So I think I have answered the first part of my question after all this digging, and the looping is specifically in these 3 functions:
uv__run_timers
uv__run_pending
uv__io_poll
But not having implemented anything with kqueue or multithreading and having dealt relatively little with file descriptors, I am not quite following the code. This will probably help out others along the path to learning this too.
So the second part of the question is what are the key steps in these 3 functions that implement the nonblockingness? Assuming this is where all the looping exists.
Not being a C expert, does for (;;) { "block" the event loop? Or can that run indefinitely and somehow other parts of the code are jumped to from OS system events or something like that?
So uv__io_poll calls poll(...) in that endless loop. I don't think that is non-blocking, is that correct? That seems to be all it mainly does.
Looking into kqueue.c there is also a uv__io_poll, so I assume the poll implementation is a fallback and kqueue on Mac is used, which is non-blocking?
So is that it? Is it just looping in uv__io_poll and each iteration you can add to the queue, and as long as there's stuff in the queue it will run? I still don't see how it's non-blocking and async.
Can someone outline, similar to this, how it is async and non-blocking, and which parts of the code to look at? Basically, I would like to see where the "free processor idleness" exists in libuv. Where is the processor ever free during the call to our initial uv_run? If it is free, how does it get reinvoked, like an event handler? (Like a browser event handler from the mouse, or an interrupt.) I feel like I'm looking for an interrupt but not seeing one.
I ask this because I want to implement an MVP event loop in C, but just don't understand how nonblockingness actually is implemented. Where the rubber meets the road.
I think that trying to understand libuv is getting in your way of understanding how reactors (event loops) are implemented in C, and it is this that you need to understand, as opposed to the exact implementation details behind libuv.
(Note that when I say "in C", what I really mean is "at or near the system call interface, where userland meets the kernel".)
All of the different backends (select, poll, epoll, etc) are, more-or-less, variations on the same theme. They block the current process or thread until there is work to be done, like servicing a timer, reading from a socket, writing to a socket, or handling a socket error.
When the current process is blocked, it literally is not getting any CPU cycles assigned to it by the OS scheduler.
Part of the issue behind understanding this stuff, IMO, is the poor terminology: "async" and "sync" in JS-land don't really describe what these things are. Really, in C, we're talking about non-blocking vs. blocking I/O.
When we read from a blocking file descriptor, the process (or thread) is blocked -- prevented from running -- until the kernel has something for it to read; when we write to a blocking file descriptor, the process is blocked until the kernel accepts the entire buffer.
In non-blocking I/O, it's exactly the same, except the kernel won't stop the process from running when there is nothing to do: instead, when you read or write, it tells you how much you read or wrote (or if there was an error).
The select system call (and friends) prevent the C developer from having to try and read from a non-blocking file descriptor over and over again -- select() is, in effect, a blocking system call that unblocks when any of the descriptors or timers you are watching are ready. This lets the developer build a loop around select, servicing any events it reports, like an expired timeout or a file descriptor that can be read. This is the event loop.
So, at its very core, what happens at the C-end of a JS event loop is roughly this algorithm:
while (true) {
    select(open fds, timeout);
    if (did_the_timeout_expire())
        run_js_timers();
    for (each error fd)
        run_js_error_handler(fdJSObjects[fd]);
    for (each read-ready fd)
        emit_data_events(fdJSObjects[fd], read_as_much_as_I_can(fd));
    for (each write-ready fd) {
        if (!pendingData(fd))
            break;
        write_as_much_as_I_can(fd);
        pendingData = whatever_was_leftover_that_couldnt_write;
    }
}
FWIW - I have actually written an event loop for v8 based around select(): it really is this simple.
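For illustration, here is a minimal, runnable sketch of that algorithm (my own toy example, not Node's or libuv's code): it watches a single descriptor, stdin, and uses the select timeout as a one-second timer.

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        fd_set readfds;
        struct timeval tv = { 1, 0 };   /* 1-second "timer" */

        FD_ZERO(&readfds);
        FD_SET(STDIN_FILENO, &readfds);

        /* The process sleeps here, consuming no CPU, until stdin is
           readable or the timeout expires. */
        int n = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &tv);
        if (n < 0)
            break;                          /* error */

        if (n == 0) {
            printf("timer fired\n");        /* run expired timers here */
        } else if (FD_ISSET(STDIN_FILENO, &readfds)) {
            char buf[256];
            ssize_t r = read(STDIN_FILENO, buf, sizeof buf);
            if (r <= 0)
                break;                      /* EOF or error */
            printf("read %zd bytes\n", r);  /* emit "data" events here */
        }
    }
    return 0;
}

The "free processor idleness" the question asks about lives entirely inside the select() call: the kernel suspends the process there and only schedules it again when a watched descriptor becomes ready or the timeout expires.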
It's important also to remember that JS always runs to completion. So, when you call a JS function (via the v8 api) from C, your C program doesn't do anything until the JS code returns.
NodeJS uses some optimizations, like handling pending writes in separate pthreads, but these all happen in "C space" and you shouldn't think/worry about them when trying to understand this pattern, because they're not relevant.
You might also be fooled into the thinking that JS isn't run to completion when dealing with things like async functions -- but it absolutely is, 100% of the time -- if you're not up to speed on this, do some reading with respect to the event loop and the micro task queue. Async functions are basically a syntax trick, and their "completion" involves returning a Promise.
I just took a dive into libuv's source code, and found at first that it seems like it does a lot of setup, and not much actual event handling.
Nonetheless, a look into src/unix/kqueue.c reveals some of the inner mechanics of event handling:
int uv__io_check_fd(uv_loop_t* loop, int fd) {
  struct kevent ev;
  int rc;

  rc = 0;
  EV_SET(&ev, fd, EVFILT_READ, EV_ADD, 0, 0, 0);
  if (kevent(loop->backend_fd, &ev, 1, NULL, 0, NULL))
    rc = UV__ERR(errno);

  EV_SET(&ev, fd, EVFILT_READ, EV_DELETE, 0, 0, 0);
  if (rc == 0)
    if (kevent(loop->backend_fd, &ev, 1, NULL, 0, NULL))
      abort();

  return rc;
}
The file descriptor polling is done here, "setting" the event with EV_SET (similar to how you use FD_SET before checking with select()), and the handling is done via the kevent handler.
This is specific to the kqueue style events (mainly used on BSD-likes a la MacOS), and there are many other implementations for different Unices, but they all use the same function name to do nonblocking IO checks. See here for another implementation using epoll.
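The actual waiting in a kqueue backend happens in a kevent() call that is given a timeout; conceptually it looks like this (a simplified sketch of the idea, not libuv's actual code):

#include <sys/event.h>
#include <time.h>

/* Block on the kqueue descriptor until a registered fd is ready or the
   timeout (in milliseconds) expires; returns the number of events. */
static int poll_once(int kq_fd, int timeout_ms)
{
    struct kevent events[64];
    struct timespec spec;

    spec.tv_sec  = timeout_ms / 1000;
    spec.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* The thread sleeps here, consuming no CPU, until something is ready. */
    int nfds = kevent(kq_fd, NULL, 0, events, 64, &spec);

    for (int i = 0; i < nfds; i++) {
        /* dispatch the watcher callback registered for events[i].ident */
    }
    return nfds;
}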
To answer your questions:
1) Where exactly is the "looping" occurring within libuv?
The QUEUE data structure is used for storing and processing events. This queue is filled by the platform- and IO- specific event types you register to listen for. Internally, it uses a clever linked-list using only an array of two void * pointers (see here):
typedef void *QUEUE[2];
I'm not going to get into the details of this list, all you need to know is it implements a queue-like structure for adding and popping elements.
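For a flavor of how two void pointers become a circular doubly linked list, here are a few of the macros (slightly abridged) from libuv's src/queue.h: slot 0 is the "next" pointer, slot 1 is "prev", and an empty queue points at itself.

#define QUEUE_NEXT(q)       (*(QUEUE **) &((*(q))[0]))
#define QUEUE_PREV(q)       (*(QUEUE **) &((*(q))[1]))

#define QUEUE_INIT(q)                                                         \
  do {                                                                        \
    QUEUE_NEXT(q) = (q);                                                      \
    QUEUE_PREV(q) = (q);                                                      \
  }                                                                           \
  while (0)

#define QUEUE_EMPTY(q)                                                        \
  ((const QUEUE *) (q) == (const QUEUE *) QUEUE_NEXT(q))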
Once you have file descriptors in the queue that are generating data, the asynchronous I/O code mentioned earlier will pick them up. The backend_fd within the uv_loop_t structure is the generator of data for each type of I/O.
2) What are the key steps in each iteration of the loop that make it non-blocking and async?
libuv is essentially a wrapper (with a nice API) around the real workhorses here, namely kqueue, epoll, select, etc. To answer this question completely, you'd need a fair bit of background in kernel-level file descriptor implementation, and I'm not sure if that's what you want based on the question.
The short answer is that the underlying operating systems all have built-in facilities for non-blocking (and therefore async) I/O. How each system works is a little outside the scope of this answer, I think, but I'll leave some reading for the curious:
https://www.quora.com/Network-Programming-How-is-select-implemented?share=1
The first thing to keep in mind is that work must be added to libuv's queues using its API; one cannot just load up libuv, start its main loop, and then code up some I/O and get async I/O.
The queues maintained by libuv are managed by looping. The infinite loop in uv__run_timers isn't actually infinite; notice that the first check verifies that a soonest-expiring timer exists (presumably, if the list is empty, this is NULL), and if not, breaks the loop and the function returns. The next check breaks the loop if the current (soonest-expiring) timer hasn't expired. If neither of those conditions breaks the loop, the code continues: it restarts the timer, calls its timeout handler, and then loops again to check more timers. Most times when this code runs, it's going to break the loop and exit, allowing the other loops to run.
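In outline, that control flow looks like this (a close paraphrase of uv__run_timers, not the exact source):

for (;;) {
    heap_node = heap_min(timer_heap(loop));    /* soonest-expiring timer */
    if (heap_node == NULL)
        break;                                 /* no timers at all */

    handle = container_of(heap_node, uv_timer_t, heap_node);
    if (handle->timeout > loop->time)
        break;                                 /* soonest timer not yet due */

    uv_timer_stop(handle);                     /* remove it from the heap */
    uv_timer_again(handle);                    /* re-add it if it repeats */
    handle->timer_cb(handle);                  /* run the user callback */
}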
What makes all this non-blocking is the caller/user following the guidelines and API of libuv: adding your work to queues, and allowing libuv to perform its work on those queues. Processing-intensive work may block these loops and other work from running, so it's important to break your work into chunks.
By the way, the source code for uv__run_idle, uv__run_check, and uv__run_prepare is defined in src/unix/loop-watcher.c.
I have a program written in C and I'm pondering on how to design a part of it.
No C++ here, the existing code is C and so this part must also be C.
Basically, I have a file which splits up and combines parts of data for transmission. I'm just working on the receive part of the code.
It works like this:
If you send it data which wasn't split up because it was small enough (but the caller wasn't to know that yet), the function simply returns DATA_AVAILABLE so the caller can call GetData().
However, if you send a chunk of data, the function returns PARTIAL_PACKET, and the caller has to keep sending data until the function returns DATA_AVAILABLE, at which point the caller can call GetData() to get the fully reassembled data (see the sketch below).
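For illustration, the caller side of this protocol might look like the following sketch; ProcessChunk and read_from_wire are hypothetical names, and only DATA_AVAILABLE, PARTIAL_PACKET and GetData come from the design above.

unsigned char chunk[512];
void *data;
int rc;

do {
    size_t len = read_from_wire(chunk, sizeof chunk);  /* next chunk off the wire */
    rc = ProcessChunk(chunk, len);                     /* feed the reassembler */
} while (rc == PARTIAL_PACKET);

if (rc == DATA_AVAILABLE)
    data = GetData();                                  /* fully reassembled data */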
QUESTION: Is this the best way to do it, or should I apply some kind of event system? E.g., the caller does something like SetOnDataReceived(&processData) and then just feeds data to the function, not caring about the result code, knowing that the function processData will be called once valid data has been received.
Perhaps you could implement a thread to maintain a 'ring-buffer'. The thread would listen for incoming data, and store the data in the buffer. The thread could also parse the data received to delineate when each packet has been fully received.
Then, perhaps your code could offer a suite of functions to the caller. Such as:
/* Initialize your ring-buffer, and start listening for packets, etc. */
int STEVE_Initialize();

/* Returns the number of fully received packets ready for processing. */
int STEVE_PacketsReadyCount();

/* Read a full packet from the ring-buffer and return it to the caller. */
int STEVE_FetchNextPacket();

/* Stop listening for packets, free ring-buffer, etc. */
int STEVE_Terminate();
POLL METHOD
Given such an implementation, the caller can use STEVE_PacketsReadyCount() to implement a polling loop. When a packet is ready, the caller would call STEVE_FetchNextPacket() to obtain the next full packet (at which time it would be removed from the ring-buffer).
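Such a polling loop might look like this sketch (keep_running and the sleep interval are assumptions, since the prototypes above are only outlines):

#include <unistd.h>

STEVE_Initialize();

while (keep_running) {
    while (STEVE_PacketsReadyCount() > 0) {
        STEVE_FetchNextPacket();   /* pops one full packet from the ring-buffer */
        /* ... process the packet ... */
    }
    usleep(10 * 1000);             /* back off briefly when idle */
}

STEVE_Terminate();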
SIGNAL METHOD
A more elaborate implementation might use system signals, such as the USR1 signal, to signal the caller that a full packet is ready. The caller would be alerted to the complete arrival of a full packet from a caller-implemented signal handler.
SEMAPHORE METHOD
Perhaps your code could provide a semaphore to the client that could be used by the client to "sleep" until a full packet arrives.
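A sketch of that handoff, assuming POSIX semaphores (the semaphore name is illustrative):

#include <semaphore.h>

sem_t packets_ready;   /* sem_init(&packets_ready, 0, 0) during STEVE_Initialize */

/* listener thread, after a full packet lands in the ring-buffer: */
sem_post(&packets_ready);

/* client: sleeps, consuming no CPU, until at least one packet is available */
sem_wait(&packets_ready);
STEVE_FetchNextPacket();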
There are many more methods that might be used, including some sort of message queue implementation. However it is designed, perhaps the most important input to the architecture is "what would be easiest for the caller". In order to more fully answer your question, more detail about the needs of the caller is required.
I have to send a command over serial and receive back an answer based on the command, and do something based on the message received. I was told that I have to use callbacks, as this is an asynchronous operation.
I have 2 threads, one that can send messages and one that receives the messages.
Example:
// Thread 1
sendMessage("Initialize");

// Thread 2
while (1)
{
    checkForMessages();
}
How can I write a function that is initialized for a specific message and handles the message received?
Example:
CommHandle(Command, MsgReceived)
{
    if (command)
    {
        if (MsgReceived == ok)
            ...
        if (MsgReceived == error)
            ...
    }
}
I was told that I have to use callbacks as this is an asynchronous operation.
Not necessarily. There is something in Windows called "asynchronous I/O"; this is to be regarded as an internal Windows term, synonymous with "overlapped I/O" (explanation here). When you are using overlapped I/O, you will get a callback when the transmission is finished. This is nice, since it reduces CPU load, but it is not really necessary if your program has nothing better to do while waiting. So it depends on the nature of your application.
But no matter the nature of your application, you should indeed handle all serial communication through threads, so that you won't cause the main GUI thread to freeze up in embarrassing ways.
Having one rx and one tx thread gives you a dilemma though: they are using the same port handle and they cannot freely access it, because that wouldn't be thread-safe. The solution is to either make one single super-thread handling all transmissions, or to protect the port handle through a mutex.
I'm not sure which method is best, so I have no recommendation. I have only used the "super-thread" one myself: one obvious advantage was that I could centralize WaitFor instructions like "kill thread", "port is open", "port is closed" in one place. But at the same time, the code turned out rather complex.
How can I write a function that is initialized for a specific message and handles the message received?
Let your thread(s) shovel their received data into some buffers. A tx buffer and a rx buffer. Depending on your serial protocol and performance, you might have to use double buffers: one that is used for the current transmission and one that contains the most recently received data.
Then, from main, pick up the data from the buffers. They need to be thread-safe. Once you have gotten that far, simply write a parser like you would for any form of data and take actions from there (a minimal sketch of such a buffer handoff follows below).
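A minimal sketch of that buffer handoff, assuming POSIX threads; all the names are illustrative, and a real implementation would wrap this in a proper ring buffer:

#include <pthread.h>
#include <string.h>

static unsigned char rx_buf[4096];
static size_t rx_len;
static pthread_mutex_t rx_lock = PTHREAD_MUTEX_INITIALIZER;

/* rx thread: append freshly received bytes */
void rx_store(const unsigned char *data, size_t n)
{
    pthread_mutex_lock(&rx_lock);
    if (rx_len + n <= sizeof rx_buf) {
        memcpy(rx_buf + rx_len, data, n);
        rx_len += n;
    }
    pthread_mutex_unlock(&rx_lock);
}

/* main thread: drain whatever has arrived, then parse it */
size_t rx_drain(unsigned char *out, size_t cap)
{
    pthread_mutex_lock(&rx_lock);
    size_t n = rx_len < cap ? rx_len : cap;
    memcpy(out, rx_buf, n);
    memmove(rx_buf, rx_buf + n, rx_len - n);
    rx_len -= n;
    pthread_mutex_unlock(&rx_lock);
    return n;
}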
I'm making a small library for controlling various embedded devices using the C language. I'm using UDP sockets to communicate with each of the devices. Devices send me various interesting data, alarms, and notifications, and at the same time they send some data that is used internally by the library but may not be interesting to users. So, I've implemented a callback approach, where the user can register a callback function for interesting events on each of the devices. Right now, the overall design of this library is something like this:
I've two threads running.
In one of the threads, there is an infinite while event-loop that uses select and non-blocking sockets to maintain the communication with each of the devices.
Basically, every time I receive a packet from any of the devices, I strip off the header, which is 20 bytes of some useless information, and add my own header containing DEVICE_ID, REQUEST_TIME (the time the request was sent to retrieve that packet), RETRIEVAL_TIME (the time the packet actually arrived), REQUEST_ID, and REQUEST_TYPE (alarm, data, notification, etc.); a sketch of this header appears below.
Now, this thread (the one with the infinite loop) puts the packet with the new header into a ring buffer and then notifies the other thread (thread #2) to parse this information.
In thread #2, when the notification is received, it locks the buffer, pops the packet, and starts parsing it.
Every message contains some information that the user may not be interested in, so I'm providing a user-callback approach to act upon the data that is useful to the user.
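For reference, the internal header described above might look conceptually like this (the field types are illustrative guesses):

#include <stdint.h>

typedef struct {
    uint32_t device_id;       /* DEVICE_ID */
    uint64_t request_time;    /* REQUEST_TIME: when the request was sent */
    uint64_t retrieval_time;  /* RETRIEVAL_TIME: when the packet arrived */
    uint32_t request_id;      /* REQUEST_ID */
    uint32_t request_type;    /* REQUEST_TYPE: alarm, data, notification, ... */
} header_t;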
Basically, I'm doing something like this in thread 2:

THREAD #2

wait(data_put_in_buffer_cond)
lock(buffer_mutex)
packet_t* packet = pop_packet_from_buffer(buf);
unlock(buffer_mutex)

/* parsing the packet... */
parsed_packet_t* parsed_packet = parse_and_change_endianess(packet->data);

/* header was put by thread #1 in host byte order, so no parsing necessary */
header_t* header = get_header(packet);

/* thread #1 sets the free callback for the kind of packet it puts in the buffer.
 * This is not a critical section of the buffer, so it is fine without locks.
 */
buffer.free_callback(packet);

foreach attribute in parsed_packet->attribute_list {
    register_info_t* rinfo = USER_REGISTERED_EVENT_TABLE[header->device_id][attribute.attr_id];
    /* user is registered for this attribute ID on this device ID */
    if (rinfo != NULL) {
        rinfo->callback(packet);
    }
    // Do some other stuff with this attribute...
}

free(parsed_packet);
Now, my concern is: what will happen if the callback function the user implements takes some time to complete, and meanwhile I drop some packets because the ring buffer is in overwriting mode? I've tested my API with 3 to 4 devices, and I don't see many drop events even if the callback function waits a decent amount of time. But I'm speculating that this approach may not be the best.
Would it be a better design if I used some sort of thread pool to run the user callback functions? In that case, would I need to make an explicit copy of each packet before I send it to the user callback? Each packet is about 500 to 700 bytes, and I get around 2 packets per second from each device. Any suggestions or comments on improving the current design or solving these issues would be appreciated.
Getting 500-700 bytes per device is not a problem at all, especially if you only have 3-4 devices. Even if you had, let's say, 100 devices, it should not be a problem. The copy overhead would most probably be negligible. So, my suggestion would be: do not try to optimize beforehand until you are certain that buffer copying is your bottleneck.
About losing packets: as you say in your question, you are already using a ring buffer (I assume that is something like a circular queue, right?). If the queue becomes full, then you just need to make thread #1 wait until there is some available space in the queue. Clearly, more events from different devices may arrive, but that should not be a problem. Once you have space again, select will let you know that you have available data from different devices, so you will just need to process all that data. Of course, in order to have a balanced system, you can set the size of the queue to a value that reduces as much as possible the number of times that the queue is full and thread #1 needs to wait.
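A sketch of that back-pressure, assuming a mutex and condition variables around your ring buffer (ring_buffer_put and ring_buffer_get stand in for your existing ring-buffer operations):

#include <pthread.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;
static size_t q_count, q_capacity = 256;

/* thread #1: block instead of overwriting when the ring is full */
void q_push(packet_t *p)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == q_capacity)
        pthread_cond_wait(&q_not_full, &q_lock);
    ring_buffer_put(p);                /* your existing ring-buffer insert */
    q_count++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_lock);
}

/* thread #2: wake thread #1 up after consuming a slot */
packet_t *q_pop(void)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == 0)
        pthread_cond_wait(&q_not_empty, &q_lock);
    packet_t *p = ring_buffer_get();   /* your existing ring-buffer remove */
    q_count--;
    pthread_cond_signal(&q_not_full);
    pthread_mutex_unlock(&q_lock);
    return p;
}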
I'm trying to implement a UDP-based server that maintains two sockets, one for control (ctrl_sock) and the other for data transmission (data_sock). The thing is, ctrl_sock is always uplink and data_sock is downlink. That is, clients request data transmission/stop via the ctrl_sock, and data is sent to them via the data_sock.
Now the problem is, since the model is connection-less, the server has to maintain a list of registered clients' information (I call it peers_context) so that it can "blindly" push data to them until they ask it to stop. During this blind transmission, clients may send control messages to the server via the ctrl_sock asynchronously. This information, besides the initial Request and Stop, can also include, for example, preferences for file parts. Therefore, the peers_context has to be updated asynchronously. However, the transmission over the data_sock relies on this peers_context structure, which raises a synchronization problem between ctrl_sock and data_sock. My question is: what can I do to safely maintain these two socks and the peers_context structure such that the asynchronous updates of peers_context won't cause havoc? By the way, the updates of peers_context wouldn't be very frequent, which is why I want to avoid the request-reply model.
My initial idea for the implementation is to maintain the ctrl_sock in the main thread (listener thread), and handle transmission over the data_sock in another thread (worker thread). However, I found it difficult to synchronize in this case. For example, if I use mutexes in peers_context, whenever the worker thread locks peers_context, the listener thread no longer has access to it when it needs to modify peers_context, because the worker thread works endlessly. On the other hand, if the listener thread holds the peers_context and writes to it, the worker thread fails to read peers_context and terminates. Can anybody give me some suggestions?
By the way, the implementation is done in a Linux environment in C. Only the listener thread would need to modify peers_context, and only occasionally; the worker thread only needs to read. Thanks sincerely!
If there is strong contention for your peers_context, then you need to shorten your critical sections. You talked about using a mutex. I assume you've already considered changing to a reader+writer lock and rejected it because you don't want the constant readers to starve a writer. How about this?
Make a very small structure that is an indirect reference to a peers_context like this:
struct peers_context_handle {
    pthread_mutex_t ref_lock;
    struct peers_context *the_actual_context;
    pthread_mutex_t write_lock;
};
Packet senders (readers) and control request processors (writers) always access the peers_context through this indirection.
Assumption: the packet senders never modify the peers_context, nor do they ever free it.
Packet senders briefly lock the handle, obtain the current version of the peers_context, and unlock it:
pthread_mutex_lock(&(handle->ref_lock));
peers_context = handle->the_actual_context;
pthread_mutex_unlock(&(handle->ref_lock));
(In practice, you can even do away with the lock if you introduce memory barriers, because a pointer dereference is atomic on all platforms that Linux supports, but I wouldn't recommend it since you would have to start delving into memory barriers and other low-level stuff, and neither C nor POSIX guarantees that it will work anyway.)
Request processors don't update the peers_context, they make a copy and completely replace it. That's how they keep their critical section small. They do use write_lock to serialize updates, but updates are infrequent so that's not a problem.
pthread_mutex_lock(&(handle->write_lock));
/* Short CS to get the old version */
pthread_mutex_lock(&(handle->ref_lock));
old_peers_context = handle->the_actual_context;
pthread_mutex_unlock(&(handle->ref_lock));
new_peers_context = allocate_new_structure();
*new_peers_context = *old_peers_context;
/* Now make the changes that are requested */
new_peers_context->foo = 42;
new_peers_context->bar = 52;
/* Short CS to replace the context */
pthread_mutex_lock(&(handle->ref_lock));
handle->the_actual_context = new_peers_context;
pthread_mutex_unlock(&(handle->ref_lock));
pthread_mutex_unlock(&(handle->write_lock));
magic(old_peers_context);
What's the catch? It's the magic in the last line of code. You have to free the old copy of the peers_context to avoid a memory leak but you can't do it because there might be packet senders still using that copy.
The solution is similar to RCU, as used inside the Linux kernel. You have to wait for all of the packet sender threads to have entered a quiescent state. I'm leaving the implementation of this as an exercise for you :-) but here are the guidelines:
The magic() function adds old_peers_context to a to-be-freed queue (which has to be protected by a mutex).
One dedicated thread frees this list in a loop:
It locks the to-be-freed list
It obtains a pointer to the list
It replaces the list with a new empty list
It unlocks the to-be-freed list
It clears a mark associated with each worker thread
It waits for all marks to be set again
It frees each item in its previously obtained copy of the to-be-freed list
Meanwhile, each worker thread sets its own mark at an idle point in its event loop (i.e., a point when it is not busy sending any packets or holding any peer_contexts).
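To round this off, here is a rough sketch of the dedicated freeing thread from the guidelines above. Every name is illustrative, the marks would need to be atomic (or volatile, with barriers) in real code, and the details are still left as the promised exercise:

for (;;) {
    pthread_mutex_lock(&free_list_lock);
    struct peers_context_node *batch = to_be_freed;  /* grab the current list */
    to_be_freed = NULL;                              /* replace it with an empty one */
    pthread_mutex_unlock(&free_list_lock);

    for (int i = 0; i < num_workers; i++)
        worker_mark[i] = 0;                          /* clear all marks */

    /* wait until every worker has passed its idle point at least once */
    for (int i = 0; i < num_workers; i++)
        while (!worker_mark[i])
            usleep(1000);

    free_context_list(batch);                        /* now nobody can hold these */
}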