sharing information among two sockets

I'm trying to implement a UDP-based server that maintains two sockets, one for control (ctrl_sock) and the other for data transmission (data_sock). The thing is, ctrl_sock is always uplink and data_sock is downlink. That is, clients will ask for data transmission to start or stop via ctrl_sock, and data will be sent to them via data_sock.
Now the problem is that, since the model is connectionless, the server has to maintain a list of registered clients' information (I call it peers_context) so that it can "blindly" push data to them until they ask it to stop. During this blind transmission, clients may send control messages to the server via ctrl_sock asynchronously. Besides the initial Request and Stop, these messages can also carry, for example, preferences for file parts. Therefore, peers_context has to be updated asynchronously. However, the transmission over data_sock relies on this peers_context structure, which raises a synchronization problem between ctrl_sock and data_sock. My question is: what can I do to safely maintain these two sockets and the peers_context structure so that the asynchronous updates of peers_context don't wreak havoc? By the way, updates of peers_context won't be very frequent, which is why I need to avoid the request-reply model.
My initial idea for the implementation is to handle ctrl_sock in the main thread (listener thread) and to handle transmission over data_sock in another thread (worker thread). However, I found it difficult to synchronize in this case. For example, if I protect peers_context with a mutex, whenever the worker thread locks peers_context, the listener thread no longer has access to it when it needs to modify peers_context, because the worker thread works endlessly. On the other hand, if the listener thread holds peers_context and writes to it, the worker thread would fail to read peers_context and terminate. Can anybody give me some suggestions?
By the way, the implementation is done in Linux environment in C. Only the listener thread would need to modify peers_context occasionally, the worker thread only needs to read. Thanks sincerely!

If there is strong contention for your peers_context then you need to shorten your critical sections. You talked about using a mutex. I assume you've already considered changing to a reader+writer lock and rejected it because you don't want the constant readers to starve a writer. How about this?
Make a very small structure that is an indirect reference to a peers_context like this:
struct peers_context_handle {
pthread_mutex_t ref_lock;
struct peers_context *the_actual_context;
pthread_mutex_t write_lock;
};
Packet senders (readers) and control request processors (writers) always access the peers_context through this indirection.
Assumption: the packet senders never modify the peers_context, nor do they ever free it.
Packet senders briefly lock the handle, obtain the current version of the peers_context and unlock it:
pthread_mutex_lock(&(handle->ref_lock));
peers_context = handle->the_actual_context;
pthread_mutex_unlock(&(handle->ref_lock));
(In practice, you can even do away with the lock if you introduce memory barriers, because an aligned pointer load or store is atomic on all platforms that Linux supports, but I wouldn't recommend it since you would have to start delving into memory barriers and other low-level stuff, and neither C nor POSIX guarantees that it will work anyway.)
Request processors don't update the peers_context, they make a copy and completely replace it. That's how they keep their critical section small. They do use write_lock to serialize updates, but updates are infrequent so that's not a problem.
pthread_mutex_lock(&(handle->write_lock));
/* Short CS to get the old version */
pthread_mutex_lock(&(handle->ref_lock));
old_peers_context = handle->the_actual_context;
pthread_mutex_unlock(&(handle->ref_lock));
new_peers_context = allocate_new_structure();
*new_peers_context = *old_peers_context;
/* Now make the changes that are requested */
new_peers_context->foo = 42;
new_peers_context->bar = 52;
/* Short CS to replace the context */
pthread_mutex_lock(&(handle->ref_lock));
handle->the_actual_context = new_peers_context;
pthread_mutex_unlock(&(handle->ref_lock));
pthread_mutex_unlock(&(handle->write_lock));
magic(old_peers_context);
What's the catch? It's the magic in the last line of code. You have to free the old copy of the peers_context to avoid a memory leak but you can't do it because there might be packet senders still using that copy.
The solution is similar to RCU, as used inside the Linux kernel. You have to wait for all of the packet sender threads to have entered a quiescent state. I'm leaving the implementation of this as an exercise for you :-) but here are the guidelines:
The magic() function adds old_peers_context to a to-be-freed queue (which has to be protected by a mutex).
One dedicated thread frees this list in a loop:
It locks the to-be-freed list
It obtains a pointer to the list
It replaces the list with a new empty list
It unlocks the to-be-freed list
It clears a mark associated with each worker thread
It waits for all marks to be set again
It frees each item in its previously obtained copy of the to-be-freed list
Meanwhile, each worker thread sets its own mark at an idle point in its event loop (i.e. a point when it is not busy sending any packets or holding any peers_context).
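The guidelines above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answer's actual implementation: the names retire_context, worker_quiescent, reclaim_begin, all_quiescent and reclaim_finish are hypothetical, the number of worker threads is assumed to be fixed, and each old context is assumed to have come from malloc. The waiting step is left to the caller so that the pieces stay small.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define NUM_WORKERS 2 /* assumption: a fixed, known number of sender threads */

struct retired {            /* one entry in the to-be-freed queue */
    struct retired *next;
    void *ctx;              /* an old peers_context awaiting release */
};

static pthread_mutex_t retire_lock = PTHREAD_MUTEX_INITIALIZER;
static struct retired *retire_list;     /* protected by retire_lock */
static pthread_mutex_t mark_lock = PTHREAD_MUTEX_INITIALIZER;
static bool mark[NUM_WORKERS];          /* protected by mark_lock */

/* Plays the role of magic(): queue an old context for later release. */
int retire_context(void *ctx)
{
    struct retired *r = malloc(sizeof *r);
    if (r == NULL)
        return -1;
    r->ctx = ctx;
    pthread_mutex_lock(&retire_lock);
    r->next = retire_list;
    retire_list = r;
    pthread_mutex_unlock(&retire_lock);
    return 0;
}

/* Called by worker `id` at an idle point, while holding no context pointer. */
void worker_quiescent(int id)
{
    assert(id >= 0 && id < NUM_WORKERS);
    pthread_mutex_lock(&mark_lock);
    mark[id] = true;
    pthread_mutex_unlock(&mark_lock);
}

/* Reclaimer step 1: detach the queue, then clear every worker's mark. */
struct retired *reclaim_begin(void)
{
    pthread_mutex_lock(&retire_lock);
    struct retired *batch = retire_list;
    retire_list = NULL;
    pthread_mutex_unlock(&retire_lock);

    pthread_mutex_lock(&mark_lock);
    for (int i = 0; i < NUM_WORKERS; i++)
        mark[i] = false;
    pthread_mutex_unlock(&mark_lock);
    return batch;
}

/* Reclaimer step 2: true once every worker has set its mark again. */
bool all_quiescent(void)
{
    bool ok = true;
    pthread_mutex_lock(&mark_lock);
    for (int i = 0; i < NUM_WORKERS; i++)
        ok = ok && mark[i];
    pthread_mutex_unlock(&mark_lock);
    return ok;
}

/* Reclaimer step 3: free the detached batch; returns how many were freed. */
int reclaim_finish(struct retired *batch)
{
    int n = 0;
    while (batch != NULL) {
        struct retired *next = batch->next;
        free(batch->ctx);
        free(batch);
        batch = next;
        n++;
    }
    return n;
}
```

A dedicated thread would call reclaim_begin(), sleep or wait until all_quiescent() returns true, and only then call reclaim_finish() on the batch it detached. Contexts retired after reclaim_begin() simply wait for the next pass.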

How to juggle FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, IOCP, and cleanup

If FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is set on a file handle that is bound to an I/O completion port, then an OVERLAPPED structure needs to be deallocated when its I/O completes synchronously.
Otherwise, it needs to stay alive until a worker processes the notification from an I/O completion port.
This all sounds good until you realize that this only works if you manage the file handle yourself.
But if someone else gives you the file handle, how are you supposed to know when you should free the OVERLAPPED structure? Is there any way to discover this after the fact?
Otherwise, does this basically imply you cannot correctly perform overlapped I/O on any file handle that you cannot guarantee the completion notification state of...?

I'm not sure that your scenario makes sense.
Your clarified scenario - successfully performing I/O on an arbitrary file handle, without even knowing whether it is asynchronous or not - is challenging, I think very unusual, and almost certainly not how the API was designed to be used, but perhaps (as you suggest) not entirely implausible.
(Although I don't think you can avoid requiring some cooperation between the caller and your code, because in the IOCP case, the caller has to be able to tell whose I/O a dequeued packet belongs to. You could do this by having the caller allocate the OVERLAPPED structures, as RbMm suggests, but it might be simpler to ask them for a completion key to use.)
I'm not certain offhand how Windows behaves if you provide a redundant event handle, e.g., when the I/O is actually synchronous or using IOCP. But I would guess that it isn't going to be a problem in practice, so provided you're not too worried about future-proofing, you're probably OK.
At any rate, it isn't all that difficult to deal with the particular issue your question asks about. Basically, you just need to prevent the structure from being released twice.
Before making each call, assign a unique completion key and add it to a linked list or other suitable global structure. (The structure must be capable of an atomic find-and-remove operation, or protected by a critical section or similar.)
If the call succeeds immediately, i.e., does not report that the I/O is pending, treat it exactly as if a queued packet were received from the IOCP queue. Typically, you would either use a common function that is called by both your IOCP thread and your I/O thread, or a call to PostQueuedCompletionStatus to manually insert a packet to the IOCP queue.
When a packet is received (or when the call succeeds immediately) first perform a find-and-remove for the completion key against the global structure. If the find fails, you know that you have already been notified of the success of the I/O, and don't need to do anything.
If the find-and-remove succeeds, process the I/O as appropriate and release the OVERLAPPED structure.
There are undoubtedly ways to optimize the same basic approach.
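As a rough illustration of the find-and-remove idea, here is a platform-neutral sketch. It is an assumption-laden stand-in rather than Windows code: a pthread mutex takes the place of a critical section, the key is a plain unsigned long, the per-request structure is an opaque malloc'ed pointer, and the names pending_add, pending_claim and complete_once are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct pending {
    struct pending *next;
    unsigned long key;   /* unique per request (hypothetical key type) */
    void *op;            /* stands in for the per-request OVERLAPPED block */
};

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pending *pending_head;

/* Register a request before issuing the I/O call. */
int pending_add(unsigned long key, void *op)
{
    assert(op != NULL);  /* NULL is reserved to mean "not found" */
    struct pending *p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->key = key;
    p->op = op;
    pthread_mutex_lock(&pending_lock);
    p->next = pending_head;
    pending_head = p;
    pthread_mutex_unlock(&pending_lock);
    return 0;
}

/* Atomic find-and-remove: only the first caller for a key gets the pointer. */
void *pending_claim(unsigned long key)
{
    void *op = NULL;
    pthread_mutex_lock(&pending_lock);
    for (struct pending **pp = &pending_head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->key == key) {
            struct pending *hit = *pp;
            *pp = hit->next;
            op = hit->op;
            free(hit);
            break;
        }
    }
    pthread_mutex_unlock(&pending_lock);
    return op;
}

/* Called from both paths: immediate success and a dequeued packet.
 * Returns 1 if this call released the structure, 0 for a duplicate. */
int complete_once(unsigned long key)
{
    void *op = pending_claim(key);
    if (op == NULL)
        return 0;        /* already handled through the other path */
    /* ... process the I/O result here ... */
    free(op);
    return 1;
}
```

Whichever notification arrives first claims and frees the structure; any later notification for the same key finds nothing and does no work.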
Addendum: if the caller is processing the IOCP packets, and providing you with a completion key to use, you won't be able to use a unique completion key on each request. In this scenario, you can use the pointer to the OVERLAPPED structure instead.
The reason (in the general case) for not using the pointer is that you might receive a packet containing a completion key from one I/O request along with an OVERLAPPED structure from a different one, because the OVERLAPPED structure might be both released and reassigned before a duplicate notification is processed. That doesn't matter in this case, because all of your requests will use the same completion key anyway.
Addendum^2: if you don't know anything about the handle, you'll also need to provide an event object for each OVERLAPPED structure, and wait on them in case notification of the I/O completion arrives that way. It's getting too late in the day for me to try to figure out the exact consequences of that, but it may mean that under some circumstances you get three notifications for the same I/O operation. You might be able to avoid that, but if not, this approach will still work.

Is there any way to discover this after the fact?
Yes, there is: use ZwQueryInformationFile with FileIoCompletionNotificationInformation.
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION is defined in wdm.h
So the code we need for the query is:
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION ficni;
ZwQueryInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation);
Demo code for setting and querying the flag:
HANDLE hFile;
IO_STATUS_BLOCK iosb;
STATIC_OBJECT_ATTRIBUTES(oa, "\\systemroot\\notepad.exe");
if (0 <= ZwOpenFile(&hFile, FILE_GENERIC_READ, &oa, &iosb, FILE_SHARE_VALID_FLAGS, 0))
{
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION ficni = { FILE_SKIP_COMPLETION_PORT_ON_SUCCESS };
if (0 <= ZwSetInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation))
{
ficni.Flags = 0x12345678;
if (
0 > ZwQueryInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation)
||
!(ficni.Flags & FILE_SKIP_COMPLETION_PORT_ON_SUCCESS)
)
{
__debugbreak();
}
}
ZwClose(hFile);
}
Here are the relevant definitions copied from wdm.h (so that nobody can say this is "undocumented"):
//
// Don't queue an entry to an associated completion port if returning success
// synchronously.
//
#define FILE_SKIP_COMPLETION_PORT_ON_SUCCESS 0x1
//
// Don't set the file handle event on IO completion.
//
#define FILE_SKIP_SET_EVENT_ON_HANDLE 0x2
//
// Don't set user supplied event on successful fast-path IO completion.
//
#define FILE_SKIP_SET_USER_EVENT_ON_FAST_IO 0x4
typedef struct _FILE_IO_COMPLETION_NOTIFICATION_INFORMATION {
ULONG Flags;
} FILE_IO_COMPLETION_NOTIFICATION_INFORMATION, *PFILE_IO_COMPLETION_NOTIFICATION_INFORMATION;
I have a question of my own: why is this declared in wdm.h?

Deadlock of powerfail sequence during write to flash page

I'm currently working on an embedded project using an ARM Cortex M3 microcontroller with FreeRTOS as system OS. The code was written by a former colleague and sadly the project has some weird bugs which I have to find and fix as soon as possible.
Short description: The device is integrated into vehicles and sends some "special" data using an integrated modem to a remote server.
The main problem: Since the device is integrated into a vehicle, the power supply of the device can be lost at any time. Therefore the device stores some parts of the "special" data to two reserved flash pages. This code module is laid out as an eeprom emulation on two flash pages(for wear leveling and data transfer from one flash page to another).
The eeprom emulation works with so called "virtual addresses", where you can write data blocks of any size to the currently active/valid flash page and read it back by using those virtual addresses.
The former colleague implemented the eeprom emulation as multitasking module, where you can read/write to the flash pages from every task in the application. At first sight everything seems fine.
But my project manager told me that the device always loses some of the "special" data at moments when the supply voltage in the vehicle drops to a few volts and the device tries to save the data to flash.
Normally the power supply is about 10-18 volts, but if it goes down to under 7 volts, the device receives an interrupt called powerwarn and it triggers a task called powerfail task.
The powerfail task has the highest priority of all tasks and executes some callbacks where e.g. the modem is turned off and also where the "special" data is stored in the flash page.
I tried to understand the code and debugged for days/weeks and now I'm quite sure that I found the problem:
Within the callbacks that the powerfail task executes (called powerfail callbacks), there are RTOS calls that suspend other tasks. Unfortunately, one of those suspended tasks could be in the middle of an unfinished EE_WriteBlock() call when the powerwarn interrupt is received.
The powerfail task then executes the callbacks, and in one of them there is an EE_WriteBlock() call where the task can't take the mutex in EE_WriteBlock(), since another task (which was suspended) has taken it already --> Deadlock!
This is the routine to write data to flash:
uint16_t
EE_WriteBlock (EE_TypeDef *EE, uint16_t VirtAddress, const void *Data, uint16_t Size)
{
.
.
xSemaphoreTakeRecursive(EE->rw_mutex, portMAX_DELAY);
/* Write the variable virtual address and value in the EEPROM */
.
.
.
xSemaphoreGiveRecursive(EE->rw_mutex);
return Status;
}
This is the RTOS specific code when 'xSemaphoreTakeRecursive()' is called:
portBASE_TYPE xQueueTakeMutexRecursive( xQueueHandle pxMutex, portTickType xBlockTime )
{
portBASE_TYPE xReturn;
/* Comments regarding mutual exclusion as per those within
xQueueGiveMutexRecursive(). */
traceTAKE_MUTEX_RECURSIVE( pxMutex );
if( pxMutex->pxMutexHolder == xTaskGetCurrentTaskHandle() )
{
( pxMutex->uxRecursiveCallCount )++;
xReturn = pdPASS;
}
else
{
xReturn = xQueueGenericReceive( pxMutex, NULL, xBlockTime, pdFALSE );
/* pdPASS will only be returned if we successfully obtained the mutex,
we may have blocked to reach here. */
if( xReturn == pdPASS )
{
( pxMutex->uxRecursiveCallCount )++;
}
else
{
traceTAKE_MUTEX_RECURSIVE_FAILED( pxMutex );
}
}
return xReturn;
}
My project manager is happy that I've found the bug but he also forces me to create a fix as quickly as possible, but what I really want is a rewrite of the code.
Maybe one of you might think, just avoid the suspension of the other tasks and you are done, but that is not a possible solution, since this could trigger another bug.
Does anybody have a quick solution/idea how I could fix this deadlock problem?
Maybe I could use xTaskGetCurrentTaskHandle() in EE_WriteBlock() to determine who has the ownership of the mutex and then give it if the task is not running anymore.
Thx

Writing flash, on many systems, requires interrupts to be disabled for the duration of the write, so I'm not sure how powerFail can be made to run while a write is in progress, but anyway:
Don't control access to the reserved flash pages directly with a mutex - use a blocking producer-consumer queue instead.
Delegate all those writes to one 'flashWriter' thread by queueing requests to it. If the threads requesting the writes require synchronous access, include an event or semaphore in the request struct that the requesting thread waits on after pushing its request. The flashWriter can signal it when done, (or after loading the struct with an error indication:).
There are variations on a theme - if all the write requesting threads need only synchronous access, maybe they can keep their own static request struct with their own semaphore and just queue up a pointer to it.
Use a producer-consumer queue class that allows a high-priority push at the head of the queue and, when powerfail runs, push a 'stopWriting' request at the front of the queue. The flashWriter will then complete any write operation in progress, pop the stopWriting request and so be instructed to suspend itself, (or you could use a 'stop' volatile boolean that the flashWriter checks every time before attempting to pop the queue).
That should prevent deadlock by removing the hard mutex lock from the flash write requests pushed in the other threads. It won't matter if other threads continue to queue up write requests - they will never be executed.
Edit: I've just had two more coffees and, thinking about this, the 'flashWriter' thread could easily become the 'FlashWriterAndPowerFail' thread:
You could arrange for your producer-consumer queue to return a pop() result of null if a volatile 'stop' boolean is set, whether or not there are entries on the queue. In the 'FWAPF' thread, do a null-check after every pop() return and do the powerFail actions upon null or flashWrite actions if not.
When the powerFail interrupt occurs, set the stop bool and signal the 'count' semaphore in the queue to ensure that the FWAPF thread is made running if it's currently blocked on the queue.
That way, you don't need a separate 'powerFail' thread and stack - one thread can do the flashWrite and powerFail while still ensuring that there are no mutex deadlocks.
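To make the "push at the head of the queue" idea concrete, here is a minimal sketch of such a queue in plain C. It is deliberately not FreeRTOS code and contains no locking: the names req_queue, rq_push_back, rq_push_front, rq_pop and flash_writer_drain are hypothetical, and on the real target the RTOS's own queue would be used instead (FreeRTOS, for instance, provides xQueueSendToFront() for queue-jumping).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REQ_CAP 8   /* assumption: small fixed capacity */

enum req_kind { REQ_WRITE, REQ_STOP };

struct request {
    enum req_kind kind;
    unsigned virt_addr;     /* hypothetical payload: virtual address to write */
};

struct req_queue {
    struct request slot[REQ_CAP];
    size_t head;            /* index of the oldest entry */
    size_t count;           /* number of queued entries */
};

/* Normal producers append at the tail. */
bool rq_push_back(struct req_queue *q, struct request r)
{
    if (q->count == REQ_CAP)
        return false;
    q->slot[(q->head + q->count) % REQ_CAP] = r;
    q->count++;
    return true;
}

/* The powerfail path jumps the queue by inserting at the head. */
bool rq_push_front(struct req_queue *q, struct request r)
{
    if (q->count == REQ_CAP)
        return false;
    q->head = (q->head + REQ_CAP - 1) % REQ_CAP;
    q->slot[q->head] = r;
    q->count++;
    return true;
}

bool rq_pop(struct req_queue *q, struct request *out)
{
    assert(out != NULL);
    if (q->count == 0)
        return false;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % REQ_CAP;
    q->count--;
    return true;
}

/* Writer loop body: process requests until the queue is empty or a
 * stop request is seen. Returns the number of writes performed. */
int flash_writer_drain(struct req_queue *q)
{
    struct request r;
    int writes = 0;
    while (rq_pop(q, &r)) {
        if (r.kind == REQ_STOP)
            break;          /* later requests stay queued and never run */
        writes++;           /* a real writer would program the flash here */
    }
    return writes;
}
```

Because the stop request is inserted at the head, write requests that were queued before it are left untouched once the writer sees it, which matches the behaviour described above.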

Overlapping communications with computations in MPI (mvapich2) for large messages

I have a very simple code, a data decomposition problem in which, at each cycle of a loop, each process sends two large messages to the ranks before and after itself. I run this code on a cluster of SMP nodes (AMD Magny-Cours, 32 cores per node, 8 cores per socket). I have been optimizing this code for a while. I have used pgprof and tau for profiling, and it looks to me as if the bottleneck is the communication. I have tried to overlap the communication with the computations in my code, but it looks as if the actual communication only starts when the computations finish :(
I use persistent communication in ready mode (MPI_Rsend_init), and the bulk of the computation is done between MPI_Startall and MPI_Waitall. The code looks like this:
int main(int argc, char *argv[])
{
    some definitions;
    some initializations;
    MPI_Init(&argc, &argv);
    MPI_Rsend_init( channel to the rank before );
    MPI_Rsend_init( channel to the rank after );
    MPI_Recv_init( channel to the rank before );
    MPI_Recv_init( channel to the rank after );
    for (timestep=0; timestep<Time; timestep++)
    {
        prepare data for send;
        MPI_Startall();
        do computations;
        MPI_Waitall();
        do work on the received data;
    }
    MPI_Finalize();
}
Unfortunately the actual data transfer does not start until the computations are done, I don't understand why. The network uses QDR InfiniBand Interconnect and mvapich2. each message size is 23MB (totally 46 MB message is sent). I tried to change the message passing to eager mode, since the memory in the system is large enough. I use the following flags in my job script:
MV2_SMP_EAGERSIZE=46M
MV2_CPU_BINDING_LEVEL=socket
MV2_CPU_BINDING_POLICY=bunch
Which gives me an improvement of about 8%, probably because of better placement of the ranks inside the SMP nodes however still the problem with communication remains. My question is why can't I effectively overlap the communications with the computations? Is there any flag that I should use and I'm missing it? I know something is wrong, but whatever I have done has not been enough.
By the order of ranks inside the SMP nodes the actual message sizes between the nodes is also 46MB (2x23MB) and the ranks are in a loop. Can you please help me? To see the flags that other users use I have checked /etc/mvapich2.conf however it is empty.
Is there any other method that I should use? do you think one sided communication gives better performance? I feel there is a flag or something that I'm not aware of.
Thanks a lot.

There is something called progression of operations in MPI. The standard allows for non-blocking operations to only be progressed to completion once the proper testing/waiting call was made:
A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.
(words in bold are also bolded in the standard text; emphasis added by me)
Although this text comes from the section about non-blocking communication (§3.7 of MPI-3.0; the text is exactly the same in MPI-2.2), it also applies to persistent communication requests.
I haven't used MVAPICH2, but I am able to speak about how things are implemented in Open MPI. Whenever a non-blocking operation is initiated or a persistent communication request is started, the operation is added to a queue of pending operations and is then progressed in one of the two possible ways:
if Open MPI was compiled without an asynchronous progression thread, outstanding operations are progressed on each call to a send/receive or to some of the wait/test operations;
if Open MPI was compiled with an asynchronous progression thread, operations are progressed in the background even if no further communication calls are made.
The default behaviour is not to enable the asynchronous progression thread, as doing so somewhat increases the latency of the operations.
The MVAPICH site is unreachable at the moment from here, but earlier I saw a mention of asynchronous progress in the features list. Probably that's where you should start from - search for ways to enable it.
Also note that MV2_SMP_EAGERSIZE controls the shared memory protocol eager message size and does not affect the InfiniBand protocol, i.e. it can only improve the communication between processes that reside on the same cluster node.
By the way, there is no guarantee that the receive operations would be started before the ready send operations in the neighbouring ranks, so they might not function as expected as the ordering in time is very important there.
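When no asynchronous progress thread is available, one workaround that follows from the progression rules described above is to split the computation into chunks and make a test call between chunks, so that the library gets regular opportunities to progress the outstanding requests. The sketch below shows only the control flow and is written without MPI so that it stands alone: compute_with_polling, chunk_fn, poll_fn and the demo_* helpers are hypothetical names, and in a real program the poll callback would wrap a call such as MPI_Testall() on the persistent requests. Whether this actually yields overlap depends on the MPI implementation and the interconnect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*chunk_fn)(size_t chunk, void *state);
typedef bool (*poll_fn)(void *state);  /* true once all transfers completed */

/* Runs `nchunks` pieces of work and polls the communication layer after
 * each piece until it reports completion. Returns the number of polls. */
size_t compute_with_polling(size_t nchunks, chunk_fn work, poll_fn poll,
                            void *state)
{
    assert(work != NULL && poll != NULL);
    bool done = false;
    size_t polls = 0;
    for (size_t i = 0; i < nchunks; i++) {
        work(i, state);
        if (!done) {
            done = poll(state);   /* with MPI: wrap MPI_Testall() here */
            polls++;
        }
    }
    return polls;
}

/* Hypothetical stand-ins so the sketch runs without an MPI library. */
struct demo {
    size_t chunks_done;
    size_t polls_until_done;  /* pretend the transfer needs this many polls */
};

static void demo_work(size_t chunk, void *state)
{
    (void)chunk;
    ((struct demo *)state)->chunks_done++;
}

static bool demo_poll(void *state)
{
    struct demo *d = state;
    if (d->polls_until_done > 0)
        d->polls_until_done--;
    return d->polls_until_done == 0;
}
```

In the original loop this would replace the single "do computations" step, with the final wait call still in place afterwards to cover transfers that have not finished by the time the computation ends.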

For MPICH, you can set the MPICH_ASYNC_PROGRESS=1 environment variable when running mpiexec/mpirun. This will spawn a background thread which does the "asynchronous progress" work.
MPICH_ASYNC_PROGRESS - Initiates a spare thread to provide
asynchronous progress. This improves progress semantics for
all MPI operations including point-to-point, collective,
one-sided operations and I/O. Setting this variable would
increase the thread-safety level to
MPI_THREAD_MULTIPLE. While this improves the progress
semantics, it might cause a small amount of performance
overhead for regular MPI operations.
from MPICH Environment Variables
I have tested on my cluster with MPICH-3.1.4, it worked! I believe MVAPICH will also work.

Suggestion for callbacks oriented library in C

I'm making a small library in C for controlling various embedded devices. I'm using UDP sockets to communicate with each of the devices. The devices send me various interesting data, alarms and notifications, and at the same time they send some data that is used internally by the library but may not be interesting to users. So I've implemented a callback approach, where the user can register a callback function for interesting events on each of the devices. Right now, the overall design of this library is something like this:
I have two threads running.
In one of the threads, there is an infinite while event loop that uses select and non-blocking sockets to maintain the communication with each of the devices.
Basically, every time I receive a packet from any of the devices, I strip off the header, which is 20 bytes of useless information, and add my own header containing DEVICE_ID, REQUEST_TIME (the time the request was sent to retrieve that packet), RETRIEVAL_TIME (the time the packet actually arrived), REQUEST_ID and REQUEST_TYPE (alarm, data, notification, etc.).
Now, this thread (the one with the infinite loop) puts the packet with the new header into a ring buffer and then notifies the other thread (thread #2) to parse this information.
In thread #2, when the notification is received, it locks the buffer, pops the packet and starts parsing it.
Every message contains some information that the user may not be interested in, so I'm providing a callback approach that lets the user act upon the data that is useful to them.
Basically, I'm doing something like this in thread 2:-
THREAD #2
wait(data_put_in_buffer_cond)
lock(buffer_mutex)
packet_t* packet = pop_packet_from_buffer(buf);
unlock(buffer_mutex)
/* parsing the package... */
parsed_packet_t* parsed_packet = parse_and_change_endianess(packet->data);
/* the header was put there by thread #1 in host byte order, so no parsing is necessary */
header_t* header = get_header(packet);
/* thread 1 sets the free callback for the kind of packet it puts in the buffer.
* This is not a critical section of the buffer, so it is fine without locks.
*/
buffer.free_callback(packet);
foreach attribute in parsed_packet->attribute_list {
register_info_t* rinfo = USER_REGISTRED_EVENT_TABLE[header->device_id][attribute.attr_id];
/*user is register with this attribute ID on this device ID */
if(rinfo != NULL) {
rinfo->callback(packet);
}
// Do some other stuff with this attribute..
}
free(parsed_packet);
Now, my concern is: what will happen if the callback function that the user implements takes some time to complete? Meanwhile I may drop some packets, because the ring buffer is in overwriting mode. I've tested my API with 3 to 4 devices, and I don't see many drops even if the callback function waits a decent amount of time. Still, I suspect that this approach may not be the best.
Would it be a better design to use some sort of thread pool to run the user callback functions? In that case, would I need to make an explicit copy of the packet before I pass it to the user callback? Each packet is about 500 to 700 bytes, and I get around 2 packets per second from each device. Any suggestions or comments on improving the current design or solving this issue would be appreciated.

Getting 500-700 bytes per device is not a problem at all, especially if you only have 3-4 devices. Even if you had, let's say, 100 devices, it should not be a problem. The copy overhead would most probably be negligible. So my suggestion would be: do not try to optimize until you are certain that buffer copying is your bottleneck.
About losing packets: as you say in your question, you are already using a ring buffer (I assume that is something like a circular queue, right?). If the queue becomes full, then you just need to make thread #1 wait until there is some available space in the queue. Clearly, more events from different devices may arrive in the meantime, but that should not be a problem. Once you have space again, select will let you know that you have data available from different devices, so you will just need to process all that data. Of course, in order to have a balanced system, you can set the size of the queue to a value that reduces as much as possible the number of times the queue is full and thread #1 needs to wait.
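A bounded queue in which the producer waits instead of overwriting could look roughly like the following sketch. It is not the asker's ring buffer: the names bqueue, bq_init, bq_push and bq_pop are hypothetical, the capacity is an arbitrary assumption, and packets are passed as opaque pointers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QCAP 4  /* assumption: tune so that the producer rarely has to wait */

struct bqueue {
    void *item[QCAP];
    size_t head;            /* index of the oldest entry */
    size_t count;           /* number of queued entries */
    pthread_mutex_t lock;
    pthread_cond_t not_full;
    pthread_cond_t not_empty;
};

void bq_init(struct bqueue *q)
{
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Producer (thread #1): waits while the queue is full, never overwrites. */
void bq_push(struct bqueue *q, void *packet)
{
    assert(packet != NULL);
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->item[(q->head + q->count) % QCAP] = packet;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer (thread #2): waits while the queue is empty. */
void *bq_pop(struct bqueue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *packet = q->item[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return packet;
}
```

While thread #1 is blocked in bq_push(), incoming datagrams accumulate in the kernel's socket receive buffers, so the socket buffer size bounds how long the producer can afford to wait.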

How to make a UDP socket replace old messages (not yet recv()'d) when new arrive?

First, a little bit of context to explain why I am on the "UDP sampling" route:
I would like to sample data produced at a fast rate for an unknown period of time. The data I want to sample is on another machine than the one consuming the data. I have a dedicated Ethernet connection between the two so bandwidth is not an issue. The problem I have is that the machine consuming the data is much slower than the one producing it. An added constraint is that while it's ok that I don't get all the samples (they are just samples), it is mandatory that I get the last one.
My first solution was to make the data producer send a UDP datagram for each produced sample, let the data consumer get the samples it could, and let the others be discarded by the socket layer when the UDP socket buffer is full. The problem with this solution is that when new UDP datagrams arrive and the socket buffer is full, it is the new datagrams that get discarded and not the old ones. Therefore I am not guaranteed to have the last one!
My question is: is there a way to make a UDP socket replace old datagrams when new arrive?
The receiver is currently a Linux machine, but that could change in favor of another unix-like OS in the future (windows may be possible as it implements BSD sockets, but less likely)
The ideal solution would use widespread mechanisms (like setsockopt()s) to work.
PS: I thought of other solutions but they are more complex (they involve heavy modification of the sender), therefore I would like first to have a definite status on the feasibility of what I ask! :)
Updates:
- I know that the OS on the receiving machine can handle the network load + reassembly of the traffic generated by the sender. It's just that its default behaviour is to discard new datagrams when the socket buffer is full. And because of the processing times in the receiving process, I know it will become full whatever I do (wasting half of the memory on a socket buffer is not an option :)).
- I would really like to avoid having a helper process doing what the OS could have done at packet-dispatching time, wasting resources just copying messages into a SHM.
- The problem I see with modifying the sender is that the code I have access to is just a PleaseSendThisData() function; it has no knowledge that this may be the last time it is called for a long time, so I don't see any doable tricks at that end... but I'm open to suggestions! :)
If there is really no way to change the UDP receiving behaviour in a BSD socket, then well... just tell me, I am prepared to accept this terrible truth and will start working on the "helper process" solution when I go back to it :)

Just set the socket to non-blocking, and loop on recv() until it returns < 0 with errno == EAGAIN. Then process the last packet you got, rinse and repeat.

I agree with "caf".
Set the socket to a non-blocking mode.
Whenever you receive something on the socket - read in a loop until nothing more is left. Then handle the last read datagram.
Only one note: you should set a large system receive buffer for the socket
int nRcvBufSize = 5*1024*1024; // or whatever you think is ok
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char*) &nRcvBufSize, sizeof(nRcvBufSize));
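Since the kernel may not grant exactly the size that was asked for, it can be worth reading the value back. The helper below is a small sketch with a hypothetical name, set_rcvbuf. On Linux the reported value is normally double the request (the kernel reserves room for bookkeeping) and the request is capped by the net.core.rmem_max sysctl, so a 5 MB request may be silently reduced unless that limit is raised.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Requests a receive buffer of `requested` bytes and returns the size the
 * kernel reports afterwards, or -1 if either socket call fails. */
int set_rcvbuf(int sock, int requested)
{
    int granted = 0;
    socklen_t len = sizeof granted;

    assert(requested > 0);
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof requested) < 0)
        return -1;
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

Comparing the returned value with the requested one at start-up makes a clamped buffer visible instead of leaving it to show up later as dropped datagrams.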

This will be difficult to get completely right just on the listener side since it could actually miss the last packet in the Network Interface Chip, which will keep your program from ever having had a chance at seeing it.
The operating system's UDP code would be the best place to try to deal with this since it will get new packets even if it decides to discard them because it already has too many queued up. Then it could make the decision of dropping an old one or dropping a new one, but I don't know how to go about telling it that this is what you would want it to do.
You can try to deal with this on the receiver by having one program or thread that always reads in the newest packet and another that always processes that newest packet. How to do this would differ based on whether you did it as two separate programs or as two threads.
As threads you would need a mutex (semaphore or something like it) to protect a pointer (or reference) to a structure used to hold 1 UDP payload and whatever else you wanted in there (size, sender IP, sender port, timestamp, etc).
The thread that actually read packets from the socket would store the packet's data in a struct, acquire the mutex protecting that pointer, swap out the current pointer for a pointer to the struct it just made, release the mutex, signal the processor thread that it has something to do, and then clear out the structure that it just got a pointer to and use it to hold the next packet that comes in.
The thread that actually processes packet payloads should wait on the signal from the other thread, and also wake up periodically (500 ms or so is probably a good starting point for this, but you decide). Each time it wakes it should acquire the mutex, swap its pointer to a UDP payload structure with the one that is there, release the mutex, and then, if the structure has any packet data, process it and wait on the next signal. If it did not get any data it should just go ahead and wait on the next signal.
The processor thread should probably run at a lower priority than the UDP listener so that the listener is less likely to ever miss a packet. When processing the last packet (the one you really care about) the processor will not be interrupted because there are no new packets for the listener to hear.
You could extend this by using a queue rather than just a single pointer as the swapping place for the two threads. The single pointer is just a queue of length 1 and is very easy to process.
You could also extend this by having the listener thread detect whether there are multiple packets waiting and only put the last of those into the queue for the processor thread. How you do this will differ by platform, but if you are using a *nix then the FIONREAD ioctl below should report 0 bytes for a socket with nothing waiting:
while (keep_doing_this()) {
    ssize_t len = read(udp_socket_fd, my_udp_packet->buf, my_udp_packet->buf_len);
    // this could have been recv or recvfrom
    if (len < 0) {
        error();
    }
    int sz;
    // FIONREAD reports how many bytes are waiting to be read on the socket
    int rc = ioctl(udp_socket_fd, FIONREAD, &sz);
    if (rc < 0) {
        error();
    }
    if (!sz) {
        // There aren't any more packets ready, so queue up the one we got
        my_udp_packet->current_len = len;
        my_udp_packet = swap_udp_packet(my_udp_packet);
        /* swap_udp_packet is code you would have to write to implement what I talked
           about above. */
        tgkill(this_group, processor_thread_tid, SIGUSR1);
    } else if (sz > my_udp_packet->buf_len) {
        /* You could resize the buffer for the packet payload here if it is too small. */
    }
}
A udp_packet would have to be allocated for each thread as well as 1 for the swapping pointer. If you use a queue for swapping then you must have enough udp_packets for each position in the queue -- since the pointer is just a queue of length 1 it only needs 1.
If you are using a POSIX system then consider not using a real time signal for the signaling because they queue up. Using a regular signal will allow you to treat being signaled many times the same as being signaled just once until the signal is handled, while real time signals queue up. Waking up periodically to check the queue also allows you to handle the possibility of the last signal arriving just after you have checked to see if you had any new packets but before you call pause to wait on a signal.
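
The single-pointer hand-off described in this answer could be sketched roughly as follows. Note that this sketch swaps in a pthread condition variable where the answer uses a signal, and it leaves out the periodic wake-up. All names (struct udp_packet, struct mailbox, mailbox_publish, mailbox_take) and the 1500-byte buffer size are assumptions made for illustration. As noted above, it needs three buffers in total: one per thread plus one sitting in the swap slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct udp_packet {
    size_t len;              /* 0 means "no data" */
    char   buf[1500];        /* assumed maximum payload size */
};

struct mailbox {
    pthread_mutex_t    lock;
    pthread_cond_t     ready;
    struct udp_packet *slot; /* the swapping pointer: a queue of length 1 */
};

void mailbox_init(struct mailbox *mb, struct udp_packet *initial)
{
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->ready, NULL);
    initial->len = 0;
    mb->slot = initial;
}

/* Listener side: swap a freshly filled packet into the slot and get back a
   buffer to reuse for the next read. If the processor never picked up the
   previous packet, it is simply dropped in favour of the newer one. */
struct udp_packet *mailbox_publish(struct mailbox *mb, struct udp_packet *filled)
{
    assert(filled->len > 0);
    pthread_mutex_lock(&mb->lock);
    struct udp_packet *spare = mb->slot;
    mb->slot = filled;
    pthread_cond_signal(&mb->ready);
    pthread_mutex_unlock(&mb->lock);
    spare->len = 0;          /* clear it before reusing it */
    return spare;
}

/* Processor side: wait until the slot holds data, then swap our own empty
   buffer in and return the newest packet. */
struct udp_packet *mailbox_take(struct mailbox *mb, struct udp_packet *empty)
{
    empty->len = 0;
    pthread_mutex_lock(&mb->lock);
    while (mb->slot->len == 0)
        pthread_cond_wait(&mb->ready, &mb->lock);
    struct udp_packet *newest = mb->slot;
    mb->slot = empty;
    pthread_mutex_unlock(&mb->lock);
    return newest;
}
```

The listener would call mailbox_publish() after every read, and the processor would loop on mailbox_take(), handing back the buffer it has just finished processing.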

Another idea is to have a dedicated reader process that does nothing but loop on the socket and read incoming packets into a circular buffer in shared memory (you'll have to worry about proper write ordering). Something like kfifo. Non-blocking is fine here too. New data overwrites old data. The other process(es) would then always have access to the latest block at the head of the queue, plus all the previous chunks not yet overwritten.
Might be too complicated for a simple one-way reader, just an option.
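
To illustrate only the overwrite-on-wrap idea, here is a single-process sketch. It is not kfifo, and it deliberately leaves out the shared-memory mapping and the memory barriers that a real cross-process version would need for correct write ordering. The names struct ring, ring_put and ring_latest, and the slot count and size, are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4        /* assumed number of datagrams kept */
#define SLOT_SIZE  1500     /* assumed maximum payload size */

struct ring {
    unsigned long head;                 /* datagrams written so far */
    size_t        len[RING_SLOTS];
    char          data[RING_SLOTS][SLOT_SIZE];
};

/* Writer: store a datagram, overwriting the oldest slot once the ring
   is full. The head counter is bumped only after the data is in place. */
void ring_put(struct ring *r, const void *buf, size_t n)
{
    size_t i = r->head % RING_SLOTS;

    assert(n <= SLOT_SIZE);
    memcpy(r->data[i], buf, n);
    r->len[i] = n;
    r->head++;
}

/* Reader: copy the newest datagram into out (truncated to cap).
   Returns the number of bytes copied, or 0 if nothing was written yet. */
size_t ring_latest(const struct ring *r, void *out, size_t cap)
{
    size_t i, n;

    if (r->head == 0)
        return 0;
    i = (r->head - 1) % RING_SLOTS;
    n = r->len[i] < cap ? r->len[i] : cap;
    memcpy(out, r->data[i], n);
    return n;
}
```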

I'm pretty sure that this is a provably insoluble problem closely related to the Two Army Problem.
I can think of a dirty solution: establish a TCP "control" sideband connection which carries the last packet, which also serves as an "end transmission" indication. Otherwise you need to use one of the more general pragmatic means noted in Engineering Approaches.

This is an old question, but you are basically wanting to turn the socket queue (FIFO) into a stack (LIFO). It's not possible, unless you want to fiddle with the kernel.
You'll need to move the datagrams from kernel space to user space and then process. Easiest approach would be a loop like this...
Block until there is data on the socket (see select, poll, epoll)
Drain the socket, storing datagrams per your own selection policy
Process the stored datagrams
Repeat
