Linux TCP recv() with MSG_TRUNC - writes to buffer?

I've just encountered a surprising buffer overflow while trying to use the flag MSG_TRUNC in recv() on a TCP socket. It seems to happen only with gcc (not clang), and only when compiling with optimization.
According to this link: http://man7.org/linux/man-pages/man7/tcp.7.html
Since version 2.4, Linux supports the use of MSG_TRUNC in the flags argument of recv(2) (and recvmsg(2)). This flag causes the received bytes of data to be discarded, rather than passed back in a caller-supplied buffer. Since Linux 2.4.4, MSG_PEEK also has this effect when used in conjunction with MSG_OOB to receive out-of-band data.
Does this mean that a supplied buffer will not be written to? I expected so, but was surprised.
If you pass a buffer (a non-null pointer) and a size bigger than the buffer's actual size, it results in a buffer overflow when the client sends something bigger than the buffer. It doesn't actually seem to write the message to the buffer if the message is small and fits in the buffer (no overflow).
Apparently if you pass a null pointer the problem goes away.
Client is a simple netcat sending a message bigger than 4 characters.
Server code is based on:
http://www.linuxhowtos.org/data/6/server.c
Changed read to recv with MSG_TRUNC, and buffer size to 4 (bzero to 4 as well).
Compiled on Ubuntu 14.04. These compilations work fine (no warnings):
gcc -o server.x server.c
clang -o server.x server.c
clang -O2 -o server.x server.c
This is the buggy (?) compilation, it also gives a warning hinting about the problem:
gcc -O2 -o server.x server.c
Anyway, as I mentioned, changing the pointer to NULL fixes the problem, but is this a known issue? Or did I miss something in the man page?
UPDATE:
The buffer overflow happens also with gcc -O1.
Here is the compilation warning:
In function ‘recv’,
inlined from ‘main’ at server.c:47:14:
/usr/include/x86_64-linux-gnu/bits/socket2.h:42:2: warning: call to ‘__recv_chk_warn’ declared with attribute warning: recv called with bigger length than size of destination buffer [enabled by default]
return __recv_chk_warn (__fd, __buf, __n, __bos0 (__buf), __flags);
Here is the buffer overflow:
./server.x 10003
*** buffer overflow detected ***: ./server.x terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fcbdc44b38f]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fcbdc4e2c9c]
/lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fcbdc4e1b60]
/lib/x86_64-linux-gnu/libc.so.6(+0x10a023)[0x7fcbdc4e2023]
./server.x[0x400a6c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fcbdc3f9ec5]
./server.x[0x400879]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:01 17732 /tmp/server.x
... more messages here
Aborted (core dumped)
And gcc version:
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
The buffer and recv call:
char buffer[4];
n = recv(newsockfd,buffer,255,MSG_TRUNC);
And this seems to fix it:
n = recv(newsockfd,NULL,255,MSG_TRUNC);
This will not generate any warnings or errors:
gcc -Wall -Wextra -pedantic -o server.x server.c
And here is the complete code:
/* A simple server in the internet domain using TCP.
   The port number is passed as an argument. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h> /* for bzero() */
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

void error(const char *msg)
{
    perror(msg);
    exit(1);
}

int main(int argc, char *argv[])
{
    int sockfd, newsockfd, portno;
    socklen_t clilen;
    char buffer[4];
    struct sockaddr_in serv_addr, cli_addr;
    int n;

    if (argc < 2) {
        fprintf(stderr, "ERROR, no port provided\n");
        exit(1);
    }
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0)
        error("ERROR opening socket");
    bzero((char *) &serv_addr, sizeof(serv_addr));
    portno = atoi(argv[1]);
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = INADDR_ANY;
    serv_addr.sin_port = htons(portno);
    if (bind(sockfd, (struct sockaddr *) &serv_addr,
             sizeof(serv_addr)) < 0)
        error("ERROR on binding");
    listen(sockfd, 5);
    clilen = sizeof(cli_addr);
    newsockfd = accept(sockfd,
                       (struct sockaddr *) &cli_addr,
                       &clilen);
    if (newsockfd < 0)
        error("ERROR on accept");
    bzero(buffer, 4);
    n = recv(newsockfd, buffer, 255, MSG_TRUNC);
    if (n < 0) error("ERROR reading from socket");
    printf("Here is the message: %s\n", buffer);
    n = write(newsockfd, "I got your message", 18);
    if (n < 0) error("ERROR writing to socket");
    close(newsockfd);
    close(sockfd);
    return 0;
}
UPDATE:
Happens also on Ubuntu 16.04, with gcc version:
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609

I think you have misunderstood.
With datagram sockets, MSG_TRUNC option behaves as described in man 2 recv man page (at Linux man pages online for most accurate and up to date information).
With TCP sockets, the explanation in the man 7 tcp man page is a bit poorly worded. I believed it was not a discard flag, but a truncate (or "throw away the rest") operation. However, the implementation (in particular, the net/ipv4/tcp.c:tcp_recvmsg() function in the Linux kernel, which handles the details for TCP/IPv4 and TCP/IPv6 sockets) indicates otherwise.
There is also a separate MSG_TRUNC socket flag. Such notifications are stored in the error queue associated with the socket and can be read using recvmsg(socketfd, &msg, MSG_ERRQUEUE). The flag indicates that a datagram that was read was longer than the buffer, so some of it was lost (truncated). This is rarely used, because it is really only relevant to datagram sockets, and there are much easier ways to determine overlength datagrams.
Datagram sockets:
With datagram sockets, the messages are separate, and not merged. When read, the unread part of each received datagram is discarded.
If you use
nbytes = recv(socketfd, buffer, buffersize, MSG_TRUNC);
it means that the kernel will copy up to the first buffersize bytes of the next datagram and discard the rest of that datagram if it is longer (as usual), but nbytes will reflect the true length of the datagram.
In other words, with MSG_TRUNC, nbytes may exceed buffersize, even though only up to buffersize bytes are copied to buffer.
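For example, a minimal sketch (assuming sockfd is an already-bound UDP socket) that uses this to detect and report truncation:

    char buf[512];
    ssize_t nbytes = recv(sockfd, buf, sizeof buf, MSG_TRUNC);
    /* nbytes is the real datagram length; only up to sizeof buf bytes
       were actually copied into buf */
    if (nbytes > (ssize_t)sizeof buf)
        fprintf(stderr, "datagram was %zd bytes, %zd of them discarded\n",
                nbytes, nbytes - (ssize_t)sizeof buf);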
TCP sockets in Linux, kernels 2.4 and later, edited:
A TCP connection is stream-like: there are no "messages" or "message boundaries", just a sequence of bytes flowing. (There can be out-of-band data, but that is not pertinent here.)
If you use
nbytes = recv(socketfd, buffer, buffersize, MSG_TRUNC);
the kernel will discard up to the next buffersize bytes of whatever is already buffered (but will block until at least one byte is buffered, unless the socket is in non-blocking mode or MSG_TRUNC | MSG_DONTWAIT is used instead). The number of bytes discarded is returned in nbytes.
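As a sketch (anticipating the buffer-validity requirement explained next), skipping already-buffered bytes on a TCP socket might look like this; skip_buffered and scratch are hypothetical names:

    /* Discard up to sizeof scratch already-buffered bytes; returns the
       number discarded.  A real, valid buffer is passed because the
       kernel still verifies it (see below). */
    static ssize_t skip_buffered(int fd)
    {
        char scratch[256];
        return recv(fd, scratch, sizeof scratch, MSG_TRUNC | MSG_DONTWAIT);
    }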
However, both buffer and buffersize should be valid, because a recv() or recvfrom() call goes through the kernel net/socket.c:sys_recvfrom() function, which verifies buffer and buffersize are valid, and if so, populates the internal iterator structure to match, before calling the aforementioned net/ipv4/tcp.c:tcp_recvmsg().
In other words, the recv() with a MSG_TRUNC flag does not actually try to modify buffer. However, the kernel does check if buffer and buffersize are valid, and if not, will cause the recv() syscall to fail with -EFAULT.
When buffer overflow checks are enabled, the GCC/glibc recv() wrapper does not just return -1 with errno == EFAULT; it instead halts the program, producing the backtrace shown. Some of these checks include mapping the zero page (where the target of a NULL pointer resides in Linux on x86 and x86-64), in which case the access check done by the kernel (before actually trying to read or write to it) succeeds.
To avoid the GCC/glibc wrappers (so that code compiled with e.g. gcc and clang should behave the same), one can use real_recv() instead,
    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Invoke the recvfrom syscall directly, bypassing the glibc fortify
       wrapper.  glibc's syscall() already returns -1 and sets errno on
       failure, so the result can be passed through unchanged. */
    ssize_t real_recv(int fd, void *buf, size_t n, int flags)
    {
        return (ssize_t)syscall(SYS_recvfrom, fd, buf, n, flags, NULL, NULL);
    }
which calls the syscall directly. Note that this does not include the pthreads cancellation logic; use this only in single-threaded test programs.
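As a usage sketch (fd being any connected TCP socket, and <errno.h> included), one can then probe the kernel's own verdict on a NULL buffer, as suggested in the summary below:

    ssize_t n = real_recv(fd, NULL, 255, MSG_TRUNC);
    if (n == -1 && errno == EFAULT)
        ; /* the kernel itself rejected the NULL buffer */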
In summary, with the stated problem regarding the MSG_TRUNC flag for recv() on TCP sockets, several factors complicate the full picture:
- recv(sockfd, data, size, flags) actually calls the recvfrom(sockfd, data, size, flags, NULL, NULL) syscall (there is no separate recv syscall in Linux)
- With a TCP socket, recv(sockfd, data, size, MSG_TRUNC) acts as if it read up to size bytes into data, provided (char *)data+0 to (char *)data+size-1 are valid; it just does not copy them into data. The number of bytes thus skipped is returned.
- The kernel verifies that data (from (char *)data+0 to (char *)data+size-1, inclusive) is readable, first. (I suspect this check is erroneous and might be turned into a writability check sometime in the future, so do not rely on it being a readability test.)
- Buffer overflow checks can detect the -EFAULT result from the kernel, and instead halt the program with some kind of "out of bounds" error message (with a stack trace).
- Buffer overflow checks may make a NULL pointer seem valid from the kernel's point of view (because the kernel test is currently for reading), in which case the kernel verification accepts the NULL pointer. (One can verify whether this is the case by recompiling without buffer overflow checks, using e.g. the above real_recv(), and seeing if a NULL pointer causes an -EFAULT result then.)
- The reason for such a mapping (one that, if allowed by the hardware and kernel structures, merely exists and is neither readable nor writable) is that any access to it generates a SIGSEGV (or, on some systems, SIGBUS) signal, which a library- or compiler-provided signal handler can catch, and dump not just a stack trace but more details about the exact access (the address, the code that attempted the access, and so on).
- I do believe the kernel access check treats such mappings as readable and writable, because an actual read or write attempt is needed for the signal to be generated.
- Buffer overflow checks are done by both the compiler and the C library, so different compilers may implement the checks, and the NULL pointer case, differently.
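One way to compare the two behaviours is to rebuild the same source with the fortify checks disabled (a sketch of the invocation; on Ubuntu, -O2 otherwise enables them by default):

    gcc -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -o server.x server.c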

Nota bene: I’m adding this answer here after all this time, as this is still one of the first results on google for recv buffer overflow MSG_TRUNC, and if someone else ends up here, they’ll save themselves a lot of grief, searching and trial-and-error.
The original question is answered well enough already, but the subtlety I wanted to highlight, is the difference between stream and datagram sockets.
A common code pattern is to use recv(socket_, NULL, 0, MSG_DONTWAIT | MSG_PEEK | MSG_TRUNC) to find out how much data is queued before a read. This works perfectly for stream sockets (TCP and SCTP), but for datagram sockets (UDP, UDPLite and DCCP) it will intermittently trigger a buffer-overflow abort, and only if the executable is compiled with gcc and with optimisations enabled. Without optimisations it seems to work perfectly, which means it will sail through development QA, only to fail in staging/live.
Finding this was a total PITA. You’re welcome. ;)

Related

Create unbuffered file descriptor in C under linux

For testing purposes I want to create a fully unbuffered file descriptor under linux in C.
It shall have a reading and a writing side.
It can be one of:
fifo
pipe
local or tcp socket
using stdin/stdout
virtual kernel file (e.g. /proc/uptime)
(I think the list is complete)
So a call to write() will copy the contents directly into the buffer provided by read().
Is this even possible or is there always a buffer in between?
If it's possible, then how to achieve this?
This is my code:
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int len = 0;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &len, sizeof(len));

    struct sockaddr_in serv_addr = {0};
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(80);
    inet_pton(AF_INET, "127.0.0.1", &serv_addr.sin_addr);
    connect(fd, (struct sockaddr *)&serv_addr, sizeof(serv_addr));

    char buf;
    read(fd, &buf, 1);
    return 0;
}
But the data is not read directly into buf. The reason is that my setsockopt() call does not really set the receive buffer to zero. Instead it is set to the minimum value of 256, as described in the socket(7) man page.
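A quick way to see what the kernel actually set is to read the value back (a sketch; fd as in the code above):

    int rcvbuf = 0;
    socklen_t optlen = sizeof(rcvbuf);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &optlen);
    printf("effective SO_RCVBUF: %d\n", rcvbuf); /* kernel doubles and clamps the requested value */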
So a call to write() will copy the contents directly into the buffer provided by read(). Is this even possible?
No, except when using splice() or sendfile() or a similar system call. This strongly depends on what the actual file descriptor refers to.
Is there always a buffer in between? If it's possible, then how to achieve this?
Generally, yes, there is a buffer in between. For each case you listed:
- fifo / pipe: use splice() or sendfile() (a minimal sketch follows below), or use shared memory instead
- local or tcp socket: open /dev/mem and write directly into the hardware, or write/use a specific kernel device driver that does the same (I do not know, but maybe copy_from_user allows copying straight to a hardware address)
- using stdin/stdout: a file descriptor is just a reference - both stdin and stdout can refer to a tcp socket, to a fifo, or to anything else, so this depends on what the file descriptor refers to
- virtual kernel file (e.g. /proc/uptime): depends on the actual "virtual kernel file" implementation; most, if not all, implement seeking/support partial reading or use struct seq_file, which has to have an intermediary buffer
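For the splice() route mentioned above, a minimal sketch (pfd being the read end of a pipe and sfd a connected socket, both assumed to be set up elsewhere) could look like:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Move up to 4096 bytes from the pipe into the socket without
       copying them through a userspace buffer; splice() requires
       at least one side to be a pipe. */
    ssize_t moved = splice(pfd, NULL, sfd, NULL, 4096, SPLICE_F_MOVE);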

How can I reliably make send(2) do a short send?

send(2) takes a buffer and a buffer length. It can return either an error, or some number of bytes successfully sent up to the size of the buffer length. In some cases, send will send fewer than the number of bytes requested (e.g. https://stackoverflow.com/a/2618755/939259).
Is there a way to consistently trigger a short send in a unit test, other than sending a big message and firing a signal from another thread and hoping to get lucky?
Just roll your own:
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* A drop-in send() wrapper that, when WANT_PARTIAL_SEND is set,
       sends only a random prefix (1 .. len-1 bytes) of the buffer. */
    ssize_t mysend(int fd, const void *buff, size_t len, int flags)
    {
    #if WANT_PARTIAL_SEND
        if (len > 1)
            len = 1 + (size_t)rand() % (len - 1);
    #endif
        return send(fd, buff, len, flags);
    }
If you pack the code-to-be tested into a shared library (or a static library with the symbols weakened), then your testing executable (which links with the library) will be able to override send for both itself and the libraries it links.
Example (overrides write rather than send):
#!/bin/sh -eu
cat > code.c <<EOF
#include <unistd.h>
#include <stdio.h>

ssize_t hw(void)
{
    static const char b[] = "hello world\n";
    return write(1, b, sizeof(b) - 1);
}

void libfunc(void)
{
    puts(__func__);
    hw();
}
EOF
cat > test.c <<'EOF'
#include <stdio.h>
#include <sys/types.h>

void libfunc(void);
ssize_t hw(void);

#if TEST
ssize_t hw(void)
{
    puts("override");
    return 42;
}
#endif

int main()
{
    libfunc();
    puts("====");
    printf("%zd\n", hw());
}
EOF
gcc code.c -fpic -shared -o libcode.so
gcc test.c $PWD/libcode.so -o real
gcc -DTEST test.c $PWD/libcode.so -o mocked
set -x
./real
./mocked
Example output:
hello world
hello world
libfunc
====
12
libfunc
override
====
override
42
This overshadows the libc implementation of the symbol, and while there are mechanisms for accessing the overridden function (namely dlopen and/or -Wl,--wrap), you shouldn't need to access it in a unit test (if you do need it in other unit tests, it's simplest to just put those other unit tests in a different program).
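For completeness, a sketch of the -Wl,--wrap mechanism mentioned above (link the test binary with gcc ... -Wl,--wrap=send; the __wrap_send/__real_send names are what the linker expects, the halving is an arbitrary choice for illustration):

    #include <sys/types.h>
    #include <sys/socket.h>

    ssize_t __real_send(int fd, const void *buf, size_t len, int flags);

    /* The linker redirects every call to send() here; halve the length
       to force a short send, then call the real implementation. */
    ssize_t __wrap_send(int fd, const void *buf, size_t len, int flags)
    {
        return __real_send(fd, buf, len > 1 ? len / 2 : len, flags);
    }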
send(2) by default returns only when all data was successfully copied to the send buffers.
The possible ways to force it to send less bytes highly depend on the circumstances.
If you
1. can access the socket, and
2. do not want to alter the behaviour of all calls to send in your linked binary,
then you could set the socket non-blocking. A call to send will then send as many octets as possible; the number of octets sent depends mainly on the amount of free space in the send buffer of the socket you are sending on.
Thus, if you got

    uint8_t my_data[NUM_BYTES_TO_SEND] = {0}; /* we don't care what the buffer actually contains in this example ... */
    size_t num_bytes = sizeof(my_data);
    send(fd, my_data, num_bytes, 0);

and want send to send fewer than num_bytes bytes, you could try to decrease the send buffer of your socket fd.
Whether this is possible, and how to accomplish it, might depend on your OS.
Under Linux, you could try to shrink the send buffer by setting the buffer size manually using setsockopt(2) via the option SO_SNDBUF, described in the man page socket(7):
    #include <fcntl.h>

    uint8_t my_data[NUM_BYTES_TO_SEND] = {0};
    size_t num_bytes = sizeof(my_data);
    int max_bytes_to_send = (int)num_bytes - 1; /* force send to accept at most 1 byte less than our buffer holds */

    /* Set the socket non-blocking - you should check status afterwards */
    int status = fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    /* Reduce the size of the send buffer below the number of bytes you want to send */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &max_bytes_to_send, sizeof(max_bytes_to_send));
    ...
    send(fd, my_data, num_bytes, 0);
    /* Possibly restore the old socket state */
Possibly you also have to fiddle around with the option SO_SNDBUFFORCE.
Further info:
Setting non-blocking: How do I change a TCP socket to be non-blocking?
Infos on SO_SNDBUF etc: What are SO_SNDBUF and SO_RECVBUF
Anyhow, the best way to go depends on the circumstances.
If you are looking for a reliable solution to check code you perhaps cannot even access, but link into your project dynamically, you might go with the other approach suggested here: overshadow the send symbol in your compiled code.
On the other hand, that will impact all calls to send in your code (you could, of course, bypass this problem, e.g. by having your send replacement depend on some flag you can set).
If you can access the socket fd and want only specific send calls to be affected (as I guess is the case for you, since you talk about unit tests, and checking for sending fewer bytes than expected in all your tests is probably not what you want), then shrinking the send buffer could be the way to go.

C - recvfrom error 22

Okay first here is the code:
int recvMast_sock;
struct sockaddr_in serv_addr, cli_addr;
socklen_t cli_len;

if ((recvMast_sock = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
{
    critErr("listen:socket=");
}

fillSockaddrAny(&serv_addr, UDP_NODE_LISTEN_PORT); /* fills the sockaddr_in, works fine elsewhere */

if ((bind(recvMast_sock, (struct sockaddr *) &serv_addr, sizeof serv_addr)) < 0)
{
    critErr("listen:bind recv_mast_sock:");
}

recvReturn_i = recvfrom(recvMast_sock, &recvBuff[0], (size_t)1, 0,
                        (struct sockaddr *) &cli_addr, &cli_len);
if (recvReturn_i < 0)
    printf("recv error%d\n", errno);
critErr is a function to handle errors which also includes a print of the error and an exit.
This runs in a thread, if that is of any relevance. I compile and run this on a Zedboard (ZYNQ-7000 SoC), which has an ARM Cortex-A9 and runs Linaro Linux (based on precise Ubuntu). It prints error 22 but still has the received value in recvBuff[0].
Running this in my VM with xubuntu it works fine.
Error 22 equals EINVAL which is described as Invalid argument.
In the manpage of recvfrom(2) it states EINVAL means that the MSG_OOB flag is set but I don't use any flags (passing 0).
Before leaving on Friday I started an apt-get upgrade, because I hope it is a faulty library or something like that. I can check back on Monday, but maybe someone here has another idea of what is wrong.
You need to initialize cli_len before passing it to recvfrom():
cli_len = sizeof(cli_addr);
You are not initializing it, so it has a random value. If that value happens to be < sizeof(cli_addr), recvfrom() can fail with EINVAL, or at least truncate the address, because it thinks cli_addr is not large enough to receive the client address. If the value is vastly larger than sizeof(cli_addr), recvfrom() might consider the buffer to be outside of the valid memory range.
You have to tell recvfrom() how large cli_addr actually is. This is clearly stated in the documentation:
The argument addrlen is a value-result argument, which the caller should initialize before the call to the size of the buffer associated with src_addr, and modified on return to indicate the actual size of the source address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
So you have to initialize cli_len with the total size of cli_addr before calling recvfrom(), then recvfrom() updates cli_len with the size of the address that was actually written into cli_addr. cli_addr can be larger than the address, for instance when using a sockaddr_storage structure to accept either IPv4 or IPv6 addresses on a dual-stack socket. In the example in the question, an IPv4 socket is being used, so cli_len must be initialized to a value >= sizeof(sockaddr_in).
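As a sketch, the corrected call based on the code in the question would be:

    cli_len = sizeof(cli_addr); /* initialize before every recvfrom() call */
    recvReturn_i = recvfrom(recvMast_sock, &recvBuff[0], (size_t)1, 0,
                            (struct sockaddr *) &cli_addr, &cli_len);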
This was not caused by the OS or the architecture. The function was not called on the x86 system because of a blocked mutex, so I didn't get the error there.
The problem was that I passed the socket to this function from 'main' (which I did not state in the question because I thought it was irrelevant, my bad...).
I used the socket both in 'main' and in this function. Even though the uses were mutually exclusive, this error occurred.
Remy's answer was also relevant, but not a solution to the problem. Not setting cli_len beforehand just leads to the sockaddr being truncated if it is too small; no error was generated for that.

segmentation fault in linux (socket programming (TCP) in C)

I am just learning socket programming on Linux from some websites, and here are some parts of my code on the server side using TCP:
#define BufferLength 100
#define SERVPORT 3111

int main()
{
    /* Variable and structure definitions. */
    int sd, sd2, rc, length = sizeof(int);
    int totalcnt = 0, on = 1;
    char temp;
    char buffer[BufferLength];
    struct sockaddr_in serveraddr;
    struct sockaddr_in their_addr;
    fd_set read_fd;

    /* Get a socket descriptor */
    if ((sd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
    {
        perror("Server-socket() error");
        exit(-1);
    }
    else
        printf("Server-socket() is OK\n");

    /* Allow socket descriptor to be reusable */
    if ((rc = setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, (char *)&on, sizeof(on))) < 0)
    {
        perror("Server-setsockopt() error");
        close(sd);
        exit(-1);
    }
    else
        printf("Server-setsockopt() is OK\n");

    /* bind to an address */
    memset(&serveraddr, 0x00, sizeof(struct sockaddr_in));
    serveraddr.sin_family = AF_INET;
    serveraddr.sin_port = htons(SERVPORT);
    serveraddr.sin_addr.s_addr = htonl(INADDR_ANY);
    printf("Using %s, listening at %d\n", inet_ntoa(serveraddr.sin_addr), SERVPORT);
    /* continue */
}
When execution reaches the last line (the printf("Using ...")), I get a segmentation fault. Why? Thanks.
The code as shown fails to #include any headers, so as it stands it won't compile, due to undefined symbols.
It would, however, compile if only the prototypes of the referenced library functions were missing, in which case the compiler assumes every undeclared function returns int.
The latter fact might be fatal or not.
On a 64-bit system at least, it is fatal in the case of inet_ntoa() used as a parameter to printf(): there it returns a 64-bit (char-pointer) value, but with the prototype missing the compiler assumes inet_ntoa() returns a 32-bit int, which "chops off" the most significant 32 bits of the address returned. Trying to printf() from such a crippled, and therefore (most likely) invalid, address provokes undefined behaviour, and in your case leads to the segmentation violation observed.
To fix this, add the relevant prototype (for inet_ntoa()) by adding:
#include <arpa/inet.h>
The compiler should have warned you about this. To enable all of the compiler's warnings for gcc, use the options -Wall -Wextra -pedantic. Take such warnings seriously.
It seems likely that inet_ntoa() is somehow returning NULL, leading to the segfault when it is dereferenced in the printf(). I can't find a direct reference plainly stating that this is possible with the Linux version of inet_ntoa, but I found several people who made that claim, and it is the only point in that code where a pointer is being dereferenced.
The answer at the bottom of this question: segmentation fault for inet_ntoa makes the claim that inet_ntoa can return NULL. However, following his reference links, I couldn't find an actual statement of that fact.
There is an MSDN article (which is suggestive, but of course doesn't apply directly to Linux code) that does state plainly that inet_ntoa() can return NULL here: https://msdn.microsoft.com/en-us/library/windows/desktop/ms738564%28v=vs.85%29.aspx

Get the number of bytes available in socket by 'recv' with 'MSG_PEEK' in C++

C++ has the following function to receive bytes from a socket; it can check the number of bytes available with the MSG_PEEK flag. With MSG_PEEK, the return value of 'recv' is the number of bytes available in the socket:
#include <sys/socket.h>
ssize_t recv(int socket, void *buffer, size_t length, int flags);
I need to get the number of bytes available in the socket without creating a buffer (without allocating memory for a buffer). Is it possible, and how?
What you're looking for is ioctl(fd, FIONREAD, &bytes_available), and under Windows ioctlsocket(socket, FIONREAD, &bytes_available).
Be warned though, the OS doesn't necessarily guarantee how much data it will buffer for you, so if you are waiting for very much data you are going to be better off reading in data as it comes in and storing it in your own buffer until you have everything you need to process something.
To do this, what is normally done is that you simply read a chunk at a time, such as:
    char buf[4096];
    ssize_t bytes_read;
    do {
        bytes_read = recv(socket, buf, sizeof(buf), 0);
        if (bytes_read > 0) {
            /* do something with buf, such as append it to a larger
             * buffer or process it */
        }
    } while (bytes_read > 0);
And if you don't want to sit there waiting for data, you should look into select() or epoll to determine when data is ready to be read, and the O_NONBLOCK flag for sockets is very handy if you want to ensure you never block on a recv.
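For instance, a minimal select() sketch (sock being the socket in question) that waits until recv() will not block:

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(sock, &rfds);
    if (select(sock + 1, &rfds, NULL, NULL, NULL) > 0 && FD_ISSET(sock, &rfds)) {
        /* data (or EOF) is ready; recv() will return without blocking */
    }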
On Windows, you can use the ioctlsocket() function with the FIONREAD flag to ask the socket how many bytes are available without needing to read/peek the actual bytes themselves. The value returned is the minimum number of bytes recv() can return without blocking. By the time you actually call recv(), more bytes may have arrived.
Be careful when using FIONREAD! The problem with using ioctl(fd, FIONREAD, &available) is that on some systems it will always return the total number of bytes available for reading in the socket buffer.
This is no problem for stream sockets (TCP) but misleading for datagram sockets (UDP): for datagram sockets, read requests are capped to the size of the first datagram in the buffer, and when reading less than the size of the first datagram, all unread bytes of that datagram are still discarded. So ideally you want to know only the size of the next datagram in the buffer.
E.g. on macOS/iOS it is documented that FIONREAD always returns the total amount (see the comments about SO_NREAD). To get only the size of the next datagram (and the total size for stream sockets), you can use the code below:
    int available;
    socklen_t optlen = sizeof(available);
    int err = getsockopt(soc, SOL_SOCKET, SO_NREAD, &available, &optlen);
On Linux FIONREAD is documented to only return the size of the next datagram for UDP sockets.
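A sketch of that on Linux (sock being a bound UDP socket):

    #include <sys/ioctl.h>

    int next_size = 0;
    if (ioctl(sock, FIONREAD, &next_size) == 0)
        printf("next datagram: %d bytes\n", next_size);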
On Windows ioctlsocket(socket, FIONREAD, &available) is documented to always give the total size:
If the socket passed in the s parameter is message oriented (for example, type SOCK_DGRAM), FIONREAD returns the total number of bytes available to read, not the size of the first datagram (message) queued on the socket.
Source: https://learn.microsoft.com/en-us/windows/win32/api/ws2spi/nc-ws2spi-lpwspioctl
I am unaware of a way how to get the size of the first datagram only on Windows.
The short answer is: this cannot be done with MS-Windows WinSock2, as I discovered over the last week of trying.
Glad to have finally found this post, which sheds some light on the issues I've been having using the latest Windows 10 Pro, version 20H2 Build 19042.867 (x86/x86_64):
On a bound, disconnected UDP socket 'sk' (in listening / server mode):
1. Any attempt to use either ioctlsocket(sk, FIONREAD, &n_bytes) or WSAIoctl with a shifted FIONREAD argument, though it succeeds and returns 0 (after a call to select() returns > 0 with that 'sk' FD bit set in the read FD set), and n_bytes is > 0, puts the socket sk into a state where any subsequent call to recv(), recvfrom(), or ReadFile() returns SOCKET_ERROR, with a WSAGetLastError() of 10045 (Operation Not Supported) or a ReadFile error of 87 (Invalid Parameter).
Moreover, even worse:
2. Any attempt to use recv or recvfrom with the MSG_PEEK msg_flags parameter returns -1, and WSAGetLastError returns 10040: 'A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.'
Yet for that socket I DID successfully call:
    setsockopt(sk, SOL_SOCKET, SO_RCVBUF, bufsz = 4096, sizeof(bufsz))
and the UDP packet being received was only 120 bytes in size.
In short, with modern Windows WinSock2 (winsock2.h / Ws2_32.dll), there appears to be absolutely no way to use any documented API to determine the number of bytes received on a bound UDP socket before calling recv() / recvfrom() in MSG_WAITALL blocking mode to actually receive the whole packet.
If you do not call ioctlsocket() or WSAIoctl or recv{,from}(..., MSG_PEEK, ...) before entering recv{,from}(..., MSG_WAITALL, ...), then the recv{,from} succeeds.
I am considering advising clients that they must install and run a Linux instance with MS Services for Linux under their Windows installation, and developing some API to communicate with it from Windows, so that reliable asynchronous UDP communication can be achieved - or does anyone know of a good open-source replacement for WinSock2?
I need access to a "C" library TCP+UDP/IP implementation for modern Windows 10 that conforms to its own documentation, unlike WinSock2 - does anyone know of one?
