SocketCAN select() and write() don't block - c

I'm testing the CAN interface on an embedded device (SOC / ARM core / Linux) using SocketCAN, and I want to send data as fast as possible for testing, using efficient code.
I can open the CAN device ("can0") as a BSD socket, and send frames with "write". This all works well.
My desktop can obviously generate frames faster than the CAN transmission rate (I'm using 500000 bps). To send efficiently, I tried using a "select" on the socket file descriptor to wait for it to become ready, followed by the "write". However, the "select" seems to return immediately regardless of the state of the send buffer, and "write" also doesn't block. This means that when the buffer fills up, I get an error from "write" (return value -1), and errno is set to 105 ("No buffer space available").
This means I have to wait an arbitrary amount of time and then retry the write, which seems very inefficient (polling!).
Here's my code (C, edited for brevity):
printf("CAN Data Generator\n");
int skt; // CAN raw socket
struct sockaddr_can addr;
struct canfd_frame frame;
const int WAIT_TIME = 500;
// Create socket:
skt = socket(PF_CAN, SOCK_RAW, CAN_RAW);
// Get the index of the supplied interface name:
unsigned int if_index = if_nametoindex(argv[1]);
// Bind CAN device to socket created above:
addr.can_family = AF_CAN;
addr.can_ifindex = if_index;
bind(skt, (struct sockaddr *)&addr, sizeof(addr));
// Generate example CAN data: 8 bytes; 0x11,0x22,0x33,...
// ...[Omitted]
// Send CAN frames:
fd_set fds;
const struct timeval timeout = { .tv_sec=2, .tv_usec=0 };
struct timeval this_timeout;
int ret;
ssize_t bytes_writ;
while (1)
{
    // Use 'select' to wait for socket to be ready for writing:
    FD_ZERO(&fds);
    FD_SET(skt, &fds);
    this_timeout = timeout;
    ret = select(skt+1, NULL, &fds, NULL, &this_timeout);
    if (ret < 0)
    {
        printf("'select' error (%d)\n", errno);
        return 1;
    }
    else if (ret == 0)
    {
        // Timeout waiting for buffer to be free
        printf("ERROR - Timeout waiting for buffer to clear.\n");
        return 1;
    }
    else
    {
        if (FD_ISSET(skt, &fds))
        {
            // Ready to write!
            bytes_writ = write(skt, &frame, CAN_MTU);
            if (bytes_writ != CAN_MTU)
            {
                if (errno == 105)
                {
                    // Buffer full!
                    printf("X"); fflush(stdout);
                    usleep(20); // Wait for buffer to clear
                }
                else
                {
                    printf("FAIL - Error writing CAN frame (%d)\n", errno);
                    return 1;
                }
            }
            else
            {
                printf("."); fflush(stdout);
            }
        }
        else
        {
            printf("-"); fflush(stdout);
        }
    }
    usleep(WAIT_TIME);
}
When I set the per-frame WAIT_TIME to a high value (e.g. 500 µs) so that the buffer never fills, I see this output:
CAN Data Generator
...............................................................................
................................................................................
...etc
Which is good! At 500 µs I get 54% CAN bus utilisation (according to the canbusload utility).
However, when I try a delay of 0 to max out my transmission rate, I see:
CAN Data Generator
................................................................................
............................................................X.XX..X.X.X.X.XXX.X.
X.XX..XX.XX.X.XX.X.XX.X.X.X.XX..X.X.X.XX..X.X.X.XX.X.XX...XX.X.X.X.X.XXX.X.XX.X.
X.X.XXX.X.XX.X.X.X.XXX.X.X.X.XX.X.X.X.X.XX..X..X.XX.X..XX.X.X.X.XX.X..X..X..X.X.
.X.X.XX.X.XX.X.X.X.X.X.XX.X.X.XXX.X.X.X.X..XX.....XXX..XX.X.X.X.XXX.X.XX.XX.XX.X
.X.X.XX.XX.XX.X.X.X.X.XX.X.X.X.X.XX.XX.X.XXX...XX.X.X.X.XX..X.XX.X.XX.X.X.X.X.X.
The initial dots "." show the buffer filling up; once the buffer is full, "X" starts appearing, meaning that the "write" call failed with error 105.
Tracing through the logic, this means the "select" must have returned with "FD_ISSET(skt, &fds)" true even though the buffer was full! (Or did I miss something?)
The SocketCAN docs just say "Writing CAN frames can be done similarly, with the write(2) system call"
This post suggests using "select".
This post suggests that "write" won't block for CAN priority arbitration, but doesn't cover other circumstances.
So is "select" the right way to do it? Should my "write" block? What other options could I use to avoid polling?

After a quick look at canbusload:184, it seems that it computes efficiency as (#data bits / #total bits on the bus).
On the other hand, according to this, the maximum efficiency for a CAN bus is around 57% for 8-byte frames, so you do not seem to be far from that 57%... I would say you are indeed flooding the bus.
With a 500 µs delay between 8-byte frames on a 500 kbps bus, you get a (control+data) bitrate of 228 kbps, which is lower than the maximum bitrate of the CAN bus, so there is no bottleneck there.
Also, since only one socket is being monitored in this case, you don't really need (p)select: anything you can do with (p)select on a single socket can be done with write alone.
(Disclaimer: from here on this is just guessing, since I cannot test it right now, sorry.)
As for why (p)select behaves this way: the send buffer may have byte semantics, so it tells you there is still room for more bytes (at least 1), not necessarily for a whole can_frame. In other words, when (p)select returns, it is not telling you that you can send an entire CAN frame. I guess you could work around this by using SIOCOUTQ together with the maximum size of the send buffer, SO_SNDBUF, but I am not sure whether that works for CAN sockets (the nice solution would be SO_SNDLOWAT, but it is not changeable in Linux's implementation).
So, to answer your questions:
Is "select" the right way to do it?
Well, you can do it both ways, either (p)select or plain write; since you are only waiting on one file descriptor, there is no real difference.
Should my "write" block? It should, if there is not a single byte of room left in the send buffer.
What other options could I use to avoid polling? Maybe by ioctl'ing SIOCOUTQ, getsockopt'ing SO_SNDBUF and subtracting... you will need to check this yourself. Alternatively, you could set the send buffer size to a multiple of sizeof(can_frame) and see whether it still signals you when less than sizeof(can_frame) bytes are available.
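As a rough sketch of that idea (illustrative only and untested on CAN sockets, as said above; the helper name is made up):

#include <linux/sockios.h>   /* SIOCOUTQ */
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Rough estimate of the free space in the socket send buffer. */
static int send_buffer_free_bytes(int skt)
{
    int queued = 0;      /* bytes queued but not yet handed to the interface */
    int sndbuf = 0;      /* send buffer size granted by the kernel */
    socklen_t len = sizeof(sndbuf);

    if (ioctl(skt, SIOCOUTQ, &queued) < 0)
        return -1;
    if (getsockopt(skt, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0)
        return -1;

    /* Only attempt a write when this is at least sizeof(struct can_frame). */
    return sndbuf - queued;
}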
Anyhow, if you are interested in more precise timing, you could use a BCM socket. There you can instruct the kernel to send a specific frame at a specific interval; once set up, the transmission runs in kernel space without any further system calls, so the user/kernel buffering problem is avoided. I would test different rates until canbusload shows no further rise in bus utilization.
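For illustration, a minimal sketch of such a BCM setup (the CAN ID, payload, 500 µs interval and function name are made up for the example, and error handling is trimmed):

#include <linux/can.h>
#include <linux/can/bcm.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct bcm_tx_setup {
    struct bcm_msg_head head;
    struct can_frame frame[1];
};

int start_cyclic_tx(const char *ifname)
{
    int s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);

    struct sockaddr_can addr = { 0 };
    addr.can_family = AF_CAN;
    addr.can_ifindex = if_nametoindex(ifname);
    connect(s, (struct sockaddr *)&addr, sizeof(addr));

    struct bcm_tx_setup msg = { 0 };
    msg.head.opcode        = TX_SETUP;             /* create a cyclic TX job in the kernel */
    msg.head.flags         = SETTIMER | STARTTIMER;
    msg.head.count         = 0;                    /* 0 = repeat forever at ival2 */
    msg.head.ival2.tv_usec = 500;                  /* one frame every 500 us */
    msg.head.can_id        = 0x123;
    msg.head.nframes       = 1;
    msg.frame[0].can_id    = 0x123;
    msg.frame[0].can_dlc   = 8;
    memcpy(msg.frame[0].data, "\x11\x22\x33\x44\x55\x66\x77\x88", 8);

    write(s, &msg, sizeof(msg));   /* from now on the kernel transmits the frame periodically */
    return s;
}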

select and poll worked correctly for me with SocketCAN. However, careful configuration is required.
Some background:
Between the user app and the HW there are 2 buffers:
the socket buffer, whose size (in bytes) is controlled by setsockopt's SO_SNDBUF option
the driver's qdisc, whose size (in packets) is controlled by the "ifconfig can0 txqueuelen 5" command.
The data path is: user app "write" call -> socket buffer -> driver's qdisc -> HW TX mailbox.
2 flow-control points exist along this path:
when there is no free TX mailbox, the driver freezes its qdisc (__QUEUE_STATE_DRV_XOFF) to prevent more packets from being dequeued from the qdisc into the HW. It is unfrozen when a TX mailbox becomes free (upon the TX completion interrupt).
when the socket buffer fills above half of its capacity, poll/select blocks until the socket buffer drains back below half of its capacity.
Now assume that the socket buffer has room for 20 packets, while the driver's qdisc has room for 5 packets. Let's also assume the HW has a single TX mailbox.
poll/select lets the user write up to 10 packets (half of the socket buffer).
Those packets are moved down into the socket buffer.
5 of those packets continue on and fill the driver's qdisc.
The driver dequeues the 1st packet from its qdisc, puts it into the HW TX mailbox and freezes the qdisc (= no more dequeuing). Now there is room for 1 packet in the driver's qdisc.
The 6th packet is moved down successfully from the socket buffer to the driver's qdisc.
The 7th packet is moved down from the socket buffer to the driver's qdisc, but since there is no room, it is dropped and error 105 ("No buffer space available") is generated.
What is the solution?
Under the above assumptions, let's configure the socket buffer for 8 packets. In this case, poll/select will block the user app after 4 packets, ensuring that there is room in the driver's qdisc for all 4 of those packets.
However, the socket buffer is configured in bytes, not packets. The translation works as follows: each CAN packet occupies ~704 bytes in the socket buffer (most of that for the socket structure), so to configure the socket buffer for 8 packets, the size in bytes should be 8*704:
int size = 8*704;
setsockopt(s, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
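Putting it together, a sketch under this answer's assumptions (the ~704 bytes/frame figure comes from above and may differ between kernels; note that Linux reports back double the requested SO_SNDBUF value to account for bookkeeping overhead):

/* Size the driver's qdisc first, e.g. from the shell:
 *     ifconfig can0 txqueuelen 5
 */
int size = 8*704;              /* room for ~8 CAN frames in the socket buffer */
if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
    perror("setsockopt(SO_SNDBUF)");

/* Read back what the kernel actually granted (it reports the doubled value). */
int granted = 0;
socklen_t optlen = sizeof(granted);
if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &granted, &optlen) == 0)
    printf("SO_SNDBUF is now %d bytes\n", granted);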

Related

Is it OK to loop over recv / read to read all data from socket

I'm building a multi-client<->server messaging application over TCP.
I created a non blocking server using epoll to multiplex linux file descriptors.
When a fd receives data, I read() /or/ recv() into buf.
I know that I need to either specify a data length* at the start of the transmission, or use a delimiter** at the end of the transmission to segregate the messages.
*using a data length:
char *buffer_ptr = buffer;
do {
    switch (recvd_bytes = recv(new_socket, buffer_ptr, rem_bytes, 0)) {
        case -1: return SOCKET_ERR;
        case 0:  return CLOSE_SOCKET;
        default: break;
    }
    buffer_ptr += recvd_bytes;
    rem_bytes  -= recvd_bytes;
} while (rem_bytes != 0);
**using a delimiter:
void get_all_buf(int sock, std::string & inStr)
{
    int n = 1, total = 0, found = 0;
    char c;
    char temp[1024*1024];
    // Keep reading up to a '\n'
    while (!found) {
        n = recv(sock, &temp[total], sizeof(temp) - total - 1, 0);
        if (n == -1) {
            /* Error, check 'errno' for more details */
            break;
        }
        total += n;
        temp[total] = '\0';
        found = (strchr(temp, '\n') != 0);
    }
    inStr = temp;
}
My question: Is it OK to loop over recv() until one of those conditions is met? What if a client sends a bogus message length or no delimiter, or there is packet loss? Won't I be stuck looping recv() in my program forever?
Is it OK to loop over recv() until one of those conditions is met?
Probably not, at least not for production-quality code. As you suggested, the problem with looping until you get the full message is that it leaves your thread at the mercy of the client -- if a client decides to only send part of the message and then wait for a long time (or even forever) without sending the last part, then your thread will be blocked (or looping) indefinitely and unable to serve any other purpose -- usually not what you want.
What if a client sends a bogus message length
Then you're in trouble (although if you've chosen a maximum message size you can detect obviously bogus message lengths that are larger than that size, and defend yourself by e.g. forcibly closing the connection).
or there is packet loss?
If there is a reasonably small amount of packet loss, the TCP layer will automatically retransmit the data, so your program won't notice the difference (other than the message officially "arriving" a bit later than it otherwise would have). If there is really bad packet loss (e.g. someone pulled the Ethernet cable out of the wall for 5 minutes), then the rest of the message might be delayed for several minutes or more (until connectivity recovers, or the TCP layer gives up and closes the TCP connection), trapping your thread in the loop.
So what is the industrial-grade, evil-client-and-awful-network-proof solution to this dilemma, so that your server can remain responsive to other clients even when a particular client is not behaving itself?
The answer is this: don't depend on receiving the entire message all at once. Instead, you need to set up a simple state-machine for each client, such that you can recv() as many (or as few) bytes from that client's TCP socket as it cares to send to you at any particular time, and save those bytes to a local (per-client) buffer that is associated with that client, and then go back to your normal event loop even though you haven't received the entire message yet. Keep careful track of how many valid received-bytes-of-data you currently have on-hand from each client, and after each recv() call has returned, check to see if the associated per-client incoming-data-buffer contains an entire message yet, or not -- if it does, parse the message, act on it, then remove it from the buffer. Lather, rinse, and repeat.
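For instance, a minimal sketch of that per-client state machine for length-prefixed messages (client_state_t and handle_message are illustrative names, not from the question):

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define CLIENT_BUF_SIZE 65536

typedef struct {
    char   buf[CLIENT_BUF_SIZE];
    size_t used;                     /* valid bytes accumulated so far */
} client_state_t;

/* Call whenever epoll reports the client's fd readable; returns -1 on error/close. */
int client_on_readable(int fd, client_state_t *c)
{
    ssize_t n = recv(fd, c->buf + c->used, sizeof(c->buf) - c->used, 0);
    if (n <= 0)
        return -1;                   /* error or peer closed (handle EAGAIN as you prefer) */
    c->used += (size_t)n;

    /* Extract every complete message: 4-byte length prefix + payload. */
    for (;;) {
        uint32_t msg_len;
        if (c->used < 4)
            break;
        memcpy(&msg_len, c->buf, 4);
        if (msg_len > CLIENT_BUF_SIZE - 4)
            return -1;               /* bogus length: drop this client */
        if (c->used < 4 + msg_len)
            break;                   /* message not complete yet: wait for more data */

        /* handle_message(c->buf + 4, msg_len); */

        memmove(c->buf, c->buf + 4 + msg_len, c->used - 4 - msg_len);
        c->used -= 4 + msg_len;
    }
    return 0;
}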

Why are particular UDP messages always getting dropped below a particular buffer size?

3 different messages are being sent to the same port at different rates:
Message   Size (bytes)   Sent every   Transmit speed
High      232            10 ms        100 Hz
Medium    148            20 ms        50 Hz
Low       20             60 ms        16.6 Hz
I can only process one message every ~ 6 ms.
Single threaded. Blocking read.
A strange situation is occurring, and I don't have an explanation for it.
When I set my receive buffer to 4,799 bytes, all of my low speed messages get dropped.
I see maybe one or two get processed, and then nothing.
When I set my receive buffer to 4,800 (or higher!), it appears as though all of the low speed messages start getting processed. I see about 16/17 a second.
This has been observed consistently. The application sending the packets is always started before the receiving application. The receiving application always has a long delay after the sockets are created, and before it begins processing. So the buffer is always full when the processing starts, and it is not the same starting buffer each time a test occurs. This is because the socket is created after the sender is already sending out messages, so the receiver might start listening in the middle of a send cycle.
Why does increasing the receive buffer size by a single byte cause such a huge change in low speed message processing?
I built a table to better visualize the expected processing:
As some of these messages get processed, more messages presumably get put on the queue instead of being dropped.
Nonetheless, I would expect a 4,799 byte buffer to behave the same way as 4,800 bytes.
However that is not what I have observed.
I think the issue is related to the fact that the low speed messages are sent at the same time as the other two messages. The low speed message is always received after the high/medium speed messages (this has been confirmed with Wireshark).
For example, assuming the buffer was empty to begin with, it is clear that the low speed message would need to be queued longer than the other messages.
*1 message every 6ms is about 5 messages every 30ms.
This still doesn't explain the buffer size.
We are running VxWorks, and using their sockLib, which is an implementation of Berkeley sockets. Here is a snippet of what our socket creation looks like:
SOCKET_BUFFER_SIZE is what I'm changing.
struct sockaddr_in tSocketAddress;                      // Socket address
int nSocketAddressSize = sizeof(struct sockaddr_in);    // Size of socket address structure
int nSocketOption = 0;

// Already created
if (*ptParameters->m_pnIDReference != 0)
    return FALSE;

// Create UDP socket
if ((*ptParameters->m_pnIDReference = socket(AF_INET, SOCK_DGRAM, 0)) == ERROR)
{
    // Error
    CreateSocketMessage(ptParameters, "CreateSocket: Socket create failed with error.");
    // Not successful
    return FALSE;
}

// Valid local address
if (ptParameters->m_szLocalIPAddress != SOCKET_ADDRESS_NONE_STRING && ptParameters->m_usLocalPort != 0)
{
    // Set up the local parameters/port
    bzero((char*)&tSocketAddress, nSocketAddressSize);
    tSocketAddress.sin_len = (u_char)nSocketAddressSize;
    tSocketAddress.sin_family = AF_INET;
    tSocketAddress.sin_port = htons(ptParameters->m_usLocalPort);

    // Check for any address
    if (strcmp(ptParameters->m_szLocalIPAddress, SOCKET_ADDRESS_ANY_STRING) == 0)
        tSocketAddress.sin_addr.s_addr = htonl(INADDR_ANY);
    else
    {
        // Convert IP address for binding
        if ((tSocketAddress.sin_addr.s_addr = inet_addr(ptParameters->m_szLocalIPAddress)) == ERROR)
        {
            // Error
            CreateSocketMessage(ptParameters, "Unknown IP address.");
            // Cleanup socket
            close(*ptParameters->m_pnIDReference);
            *ptParameters->m_pnIDReference = ERROR;
            // Not successful
            return FALSE;
        }
    }

    // Bind the socket to the local address
    if (bind(*ptParameters->m_pnIDReference, (struct sockaddr *)&tSocketAddress, nSocketAddressSize) == ERROR)
    {
        // Error
        CreateSocketMessage(ptParameters, "Socket bind failed.");
        // Cleanup socket
        close(*ptParameters->m_pnIDReference);
        *ptParameters->m_pnIDReference = ERROR;
        // Not successful
        return FALSE;
    }
}

// Receive socket
if (ptParameters->m_eType == SOCKTYPE_RECEIVE || ptParameters->m_eType == SOCKTYPE_RECEIVE_AND_TRANSMIT)
{
    // Set the receive buffer size
    nSocketOption = SOCKET_BUFFER_SIZE;
    if (setsockopt(*ptParameters->m_pnIDReference, SOL_SOCKET, SO_RCVBUF, (char *)&nSocketOption, sizeof(nSocketOption)) == ERROR)
    {
        // Error
        CreateSocketMessage(ptParameters, "Socket buffer size set failed.");
        // Cleanup socket
        close(*ptParameters->m_pnIDReference);
        *ptParameters->m_pnIDReference = ERROR;
        // Not successful
        return FALSE;
    }
}
and the socket receive that's being called in an infinite loop:
*The buffer size is definitely large enough
int SocketReceive(int nSocketIndex, char *pBuffer, int nBufferLength)
{
    int nBytesReceived = 0;
    char szError[256];

    // Invalid index or socket
    if (nSocketIndex < 0 || nSocketIndex >= SOCKET_COUNT || g_pnSocketIDs[nSocketIndex] == 0)
    {
        sprintf(szError, "SocketReceive: Invalid socket (%d) or ID (%d)", nSocketIndex, g_pnSocketIDs[nSocketIndex]);
        perror(szError);
        return -1;
    }

    // Invalid buffer length
    if (nBufferLength == 0)
    {
        perror("SocketReceive: zero buffer length");
        return 0;
    }

    // Receive data
    nBytesReceived = recv(g_pnSocketIDs[nSocketIndex], pBuffer, nBufferLength, 0);

    // Error in receiving
    if (nBytesReceived == ERROR)
    {
        // Create error string
        sprintf(szError, "SocketReceive: Data Receive Failure: <%d> ", errno);
        // Set error message
        perror(szError);
        // Return error
        return ERROR;
    }

    // Bytes received
    return nBytesReceived;
}
Any clues on why increasing the buffer size to 4,800 results in successful and consistent reading of low speed messages?
The basic answer to why a SO_RCVBUF size of 4,799 results in lost low speed messages while a size of 4,800 works fine is this: given the mixture of UDP packets coming in, the rate at which they arrive, the rate at which you process incoming packets, and the sizing of the mbuf and cluster counts in your VxWorks kernel, the larger size allows just enough network stack throughput that the low speed messages are no longer discarded.
The SO_SNDBUF option description in the setsockopt() man page at http://www.vxdev.com/docs/vx55man/vxworks/ref/sockLib.html#setsockopt (mentioned in a comment above) has this to say about the size specified and its effect on mbuf usage:
The effect of setting the maximum size of buffers (for both SO_SNDBUF and SO_RCVBUF, described below) is not actually to allocate the mbufs from the mbuf pool. Instead, the effect is to set the high-water mark in the protocol data structure, which is used later to limit the amount of mbuf allocation.
UDP packets are discrete units. If you send 10 packets of size 232, that is not treated as 2320 bytes of data in contiguous memory; it is 10 separate memory buffers within the network stack, because UDP deals in discrete packets while TCP is a continuous stream of bytes.
See How do I tune the network buffering in VxWorks 5.4? in the DDS community web site which gives a discussion about the interdependence of the mixture of mbuff sizes and network clusters.
See How do I resolve a problem with VxWorks buffers? in the DDS community web site.
See this PDF of a slide presentation, A New Tool to study Network Stack Exhaustion in VxWorks from 2004 which discusses using various tools such as mBufShow and inetStatShow to see what is happening in the network stack.
Without a detailed analysis of every network stack implementation along the path your UDP messages take, it is nearly impossible to state the resulting behaviour.
UDP implementations are allowed to drop any packet at their own discretion. Usually this happens when a stack concludes that it would need to drop packets to be able to receive new ones. There is no formal requirement that the packets dropped are the oldest or the newest being received. It could also be that a certain size class is more affected due to internal memory management strategies.
Of the IP stacks involved, the most interesting one is the one on the receiving machine.
You will certainly get a better receive experience if you give the receiving side a receive buffer that can hold several seconds' worth of expected messages. I'd start with at least 10k.
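As a rough sketch in the question's own code style (the value is illustrative; the three flows above add up to roughly 31 KB of payload per second, so this buys a couple of seconds of headroom):

/* Give the receive path a few seconds of headroom instead of ~4.8 KB. */
nSocketOption = 64 * 1024;   /* illustrative value; verify with mBufShow / inetStatShow */
if (setsockopt(*ptParameters->m_pnIDReference, SOL_SOCKET, SO_RCVBUF,
               (char *)&nSocketOption, sizeof(nSocketOption)) == ERROR)
{
    CreateSocketMessage(ptParameters, "Socket buffer size set failed.");
}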
The observed "change" in behaviour when going from 4,799 to 4,800 may result from the latter just allowing one of the small messages to be received before it needs to be dropped again, while the smaller size causes it to be dropped slightly earlier. If the receiving application is quick enough to read the pending message, you will receive small messages in the one case and no small messages in the other case.

domain socket fragmentation

I am using domain sockets (AF_UNIX) to communicate between two threads for inter-process communication. This was chosen to work well with libev: I use it on the receiving end of the domain socket. This works very well except that the data I am sending is a constant 4864 bytes. I cannot afford to have this data fragmented. I always thought domain sockets wouldn't fragment data, but as it turns out they do. When the communication is at its peak between the threads, I observe the following
Thread 1:
SEND = 4864 actual size = 4864
Thread 2:
READ = 3328 actual size = 4864
Thread 1:
SEND = 4864 actual size = 4864
Thread 2:
READ = 1536 actual size = 4864
As you can see, thread 2 received the data in fragments (3328 + 1536). This is really bad for my application. Is there any way to prevent this fragmentation? I understand that IP_DONTFRAG can only be set on the AF_INET family. Can someone suggest an alternative?
Update: sendto code
ssize_t
socket_domain_writer_dgram_send(int *domain_sd, domain_packet_t *pkt) {
    struct sockaddr_un remote;
    unsigned long len = 0;
    ssize_t ret = 0;

    memset(&remote, '\0', sizeof(struct sockaddr_un));
    remote.sun_family = AF_UNIX;
    strncpy(remote.sun_path, DOMAIN_SOCK_PATH, strlen(DOMAIN_SOCK_PATH));
    len = strlen(remote.sun_path) + sizeof(remote.sun_family) + 1;

    ret = sendto(*domain_sd, pkt, sizeof(*pkt), 0, (struct sockaddr *)&remote, sizeof(struct sockaddr_un));
    if (ret == -1) {
        bps_log(BPS_LOGGER_RD, ASL_LEVEL_ERR, "Domain writer could not connect send packets", errno);
    }

    return ret;
}
SOCK_STREAM by definition doesn't preserve message boundaries. Try again with SOCK_DGRAM or SOCK_SEQPACKET:
http://man7.org/linux/man-pages/man7/unix.7.html
On the other hand, consider that you may be passing messages larger than your architecture page size. For example, for amd64, a memory page is 4K. If that's a problem for any reason it might make sense to split the packets in 2.
Note, however, that it is not really a problem for the data to arrive fragmented. It's common to have a packet assembler on the receiving end of the socket. What's wrong with implementing one?
4864 + 3328 = 8192. My guess is that you're transmitting two 4864-byte packets back to back in some cases, and it's filling an 8 KB kernel buffer somewhere. IP_DONTFRAG isn't applicable because IP is not involved here — the "fragmentation" you're seeing is happening via a completely different mechanism.
If all the data you're transmitting consists of packets, you would do well to use a datagram socket (SOCK_DGRAM) instead of a stream. This should make the send() block when the kernel buffer doesn't have sufficient space to store an entire packet, rather than allowing a partial write through, and will make each recv() return exactly one packet, so you don't need to deal with framing.
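A minimal sketch of that datagram approach, here using a socketpair since both ends live in the same process (the original code targets a named socket at DOMAIN_SOCK_PATH instead):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];

    /* A connected AF_UNIX datagram pair: each send() is delivered as one whole message. */
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    char msg[4864];
    memset(msg, 0xAB, sizeof(msg));

    if (send(sv[0], msg, sizeof(msg), 0) < 0)   /* blocks if the receive buffer is full */
        perror("send");

    char buf[8192];
    ssize_t n = recv(sv[1], buf, sizeof(buf), 0);
    printf("received %zd bytes\n", n);          /* 4864, never a fragment */

    close(sv[0]);
    close(sv[1]);
    return 0;
}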

Get the number of bytes available in socket by 'recv' with 'MSG_PEEK' in C++

C++ has the following function to receive bytes from a socket; it can check the number of bytes available with the MSG_PEEK flag. With MSG_PEEK, the return value of 'recv' is the number of bytes available in the socket:
#include <sys/socket.h>
ssize_t recv(int socket, void *buffer, size_t length, int flags);
I need to get the number of bytes available in the socket without creating a buffer (without allocating memory for the buffer). Is that possible, and how?
What you're looking for is ioctl(fd, FIONREAD, &bytes_available), and under Windows ioctlsocket(socket, FIONREAD, &bytes_available).
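For example, a minimal sketch on Linux (fd is assumed to be your already-open socket):

#include <stdio.h>
#include <sys/ioctl.h>

int bytes_available = 0;
if (ioctl(fd, FIONREAD, &bytes_available) < 0)
    perror("ioctl(FIONREAD)");
else
    printf("%d bytes can be read without blocking\n", bytes_available);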
Be warned though: the OS doesn't necessarily guarantee how much data it will buffer for you, so if you are waiting for a lot of data you are better off reading it in as it arrives and storing it in your own buffer until you have everything you need to process something.
To do this, you normally just read a chunk at a time, such as:
char buf[4096];
ssize_t bytes_read;
do {
    bytes_read = recv(socket, buf, sizeof(buf), 0);
    if (bytes_read > 0) {
        /* do something with buf, such as append it to a larger buffer or
         * process it */
    }
} while (bytes_read > 0);
And if you don't want to sit there waiting for data, you should look into select or epoll to determine when data is ready to be read; the O_NONBLOCK flag for sockets is also very handy if you want to ensure you never block in a recv.
On Windows, you can use the ioctlsocket() function with the FIONREAD flag to ask the socket how many bytes are available without needing to read/peek the actual bytes themselves. The value returned is the minimum number of bytes recv() can return without blocking. By the time you actually call recv(), more bytes may have arrived.
Be careful when using FIONREAD! The problem with using ioctl(fd, FIONREAD, &available) is that it will always return the total number of bytes available for reading in the socket buffer on some systems.
This is no problem for STREAM sockets (TCP) but misleading for DATAGRAM sockets (UDP), because for datagram sockets read requests are capped at the size of the first datagram in the buffer, and when you read less than the size of that datagram, all of its unread bytes are still discarded. So ideally you want to know only the size of the next datagram in the buffer.
E.g. on macOS/iOS it is documented that FIONREAD always returns the total amount (see comments about SO_NREAD). To only get the size of the next datagram (and total size for stream sockets), you can use the code below:
int available;
socklen_t optlen = sizeof(available);
int err = getsockopt(soc, SOL_SOCKET, SO_NREAD, &available, &optlen);
On Linux FIONREAD is documented to only return the size of the next datagram for UDP sockets.
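Another Linux-specific option worth checking against udp(7) (this is an assumption, not something claimed above): recv with MSG_PEEK | MSG_TRUNC and a zero-length buffer reports the real size of the next datagram without copying or consuming it:

#include <stdio.h>
#include <sys/socket.h>

ssize_t next_dgram_size = recv(sock, NULL, 0, MSG_PEEK | MSG_TRUNC);
if (next_dgram_size < 0)
    perror("recv(MSG_PEEK|MSG_TRUNC)");
else
    printf("next datagram is %zd bytes\n", next_dgram_size);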
On Windows ioctlsocket(socket, FIONREAD, &available) is documented to always give the total size:
If the socket passed in the s parameter is message oriented (for example, type SOCK_DGRAM), FIONREAD returns the reports the total number of bytes available to read, not the size of the first datagram (message) queued on the socket.
Source: https://learn.microsoft.com/en-us/windows/win32/api/ws2spi/nc-ws2spi-lpwspioctl
I am unaware of a way how to get the size of the first datagram only on Windows.
The short answer is: this cannot be done with MS-Windows WinSock2, as I discovered over the last week of trying.
Glad to have finally found this post, which sheds some light on the issues I've been having with the latest Windows 10 Pro, version 20H2 Build 19042.867 (x86/x86_64):
On a bound, disconnected UDP socket 'sk' (in Listening / Server mode):
1. Any attempt to use either ioctlsocket(sk, FIONREAD, &n_bytes) OR WsaIoctl with a shifted FIONREAD argument, though it succeeds and returns 0 (after a call to select() returns with that 'sk' FD bit set in the read FD set) with n_bytes > 0, leaves the socket sk in a state where any subsequent call to recv(), recvfrom(), or ReadFile() returns SOCKET_ERROR with a WSAGetLastError() of 10045, Operation Not Supported, or ReadFile error 87, 'Invalid Parameter'.
Moreover, even worse:
2. Any attempt to use recv or recvfrom with the 'MSG_PEEK' msg_flags parameter returns -1, and WSAGetLastError returns 10040: 'A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.'
Yet for that socket I DID successfully call:
setsockopt(s, SOL_SOCKET, SO_RCVBUF, bufsz = 4096 , sizeof(bufsz) )
and the UDP packet being received was only 120 bytes in size.
In short, with modern Windows WinSock2 (winsock2.h / Ws2_32.dll), there appears to be absolutely no way to use any documented API to determine the number of bytes received on a bound UDP socket before calling recv() / recvfrom() in MSG_WAITALL blocking mode to actually receive the whole packet.
If you do not call ioctlsocket() or WsaIoctl or recv{,from}(..., MSG_PEEK, ...) before entering recv{,from}(..., MSG_WAITALL, ...), then the recv{,from} succeeds.
I am considering advising clients that they must install and run a Linux instance with MS Services for Linux under their Windows installation, and developing some API to communicate with it from Windows, so that reliable asynchronous UDP communication can be achieved - or does anyone know of a good open source replacement for WinSock2?
I need access to a "C" library TCP+UDP/IP implementation for modern Windows 10 that conforms to its own documentation, unlike WinSock2 - does anyone know of one?

Effect of SO_SNDBUF

I am unable to make sense of how and why the following code segments work:
/* Now lets try to set the send buffer size to 5000 bytes */
size = 5000;
err = setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(int));
if (err != 0) {
    printf("Unable to set send buffer size, continuing with default size\n");
}
If we check the value of the send buffer, it is indeed correctly set to 5000*2 = 10000.
However, if we try to send more than the send buffer size, it does send all of it. For example:
n = send(sockfd, buf, 30000, 0);
/* Let's check how much was actually sent */
printf("No. of bytes sent is %d\n", n);
This prints out 30000.
How exactly did this work? Didn't the fact that the send buffer size was limited to 10000 have any effect? If it did, what exactly happened? Some kind of fragmentation?
UPDATE: What happens if the socket is in non-blocking mode? I tried the following:
Changing buffer size to 10000 (5000*2) causes 16384 bytes to be sent
Changing buffer size to 20000 (10000*2) causes 30000 bytes to be sent
Once again, why?
The effect of setting SO_SNDBUF option is different for TCP and UDP.
For UDP this sets the limit on the size of the datagram, i.e. anything larger will be discarded.
For TCP this just sets the size of in-kernel buffer for given socket (with some rounding to page boundary and with an upper limit).
Since it looks like you are talking about TCP, the effect you are observing is explained by the socket being in blocking mode: send(2) blocks until the kernel can accept all of your data, while the network stack asynchronously dequeues data and pushes it to the network card, thus freeing space in the buffer.
Also, TCP is a stream protocol, it does not preserve any "message" structure. One send(2) can correspond to multiple recv(2)s on the other side, and the other way around. Treat it as byte-stream.
SO_SNDBUF configures the buffer that the socket implementation uses internally. If your socket is non-blocking, you can only send up to the configured size; if your socket is blocking, there is no limit for your call.
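To illustrate the non-blocking case, a sketch of the usual partial-send loop (sockfd and buf are from the question; the rest is illustrative):

#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/* Switch the socket to non-blocking mode. */
fcntl(sockfd, F_SETFL, fcntl(sockfd, F_GETFL, 0) | O_NONBLOCK);

size_t total = 30000, off = 0;
while (off < total) {
    ssize_t n = send(sockfd, buf + off, total - off, 0);
    if (n > 0) {
        off += n;                            /* partial write: advance and retry */
    } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        break;                               /* buffer full: wait with select()/poll(), then retry */
    } else {
        break;                               /* real error */
    }
}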
