Cannot capture packets using libpcap - C

I am kind of new to using libpcap.
I am using this library to capture packets, and the code I wrote to capture them is below.
The interface that I am tapping is constantly flooded with ARP packets, so there are always packets arriving on the interface, but I am not able to capture them. The interface is UP and running.
I got no error from pcap_open_live().
The code is in C, and I am running it on a 32-bit FreeBSD 10 machine.
void capture_packet(char *ifname, int snaplen)
{
    char ebuf[PCAP_ERRBUF_SIZE];
    int pflag = 0; /* promiscuous mode */
    snaplen = 100;

    pcap_t *pcap = pcap_open_live(ifname, snaplen, !pflag, 0, ebuf);
    if (pcap != NULL) {
        printf("pcap_open_live for %s \n", ifname);
    }

    int fd = pcap_get_selectable_fd(pcap);
    pcap_setnonblock(pcap, 1, ebuf);

    fd_set fds;
    struct timeval tv;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    tv.tv_sec = 3;
    tv.tv_usec = 0;

    int retval = select(fd + 1, &fds, NULL, NULL, &tv);
    if (retval == -1)
        perror("select()");
    else if (retval) {
        printf("Data is available now.\n");
        printf("calling pcap_dispatch \n");
        pcap_dispatch(pcap, -1, (pcap_handler) callback, NULL);
    }
    else
        printf("No data within 3 seconds.\n");
}

void
callback(const char *unused, struct pcap_pkthdr *h, uint8_t *packet)
{
    printf("got some packet \n");
}
I always get retval as 0, which means timeout.
I don't know what is happening under the hood. I followed a tutorial that does exactly the same thing, so I don't know what I am missing.
I also want to understand how a packet received at the Ethernet layer gets copied into the BPF device opened by pcap_open_live, and how the buffer is copied from kernel space to user space.
And for how long can we tap the packet before the kernel consumes or rejects it?

The pcap_open_live() call provided 0 as the packet buffer timeout value (the fourth argument). libpcap does not specify what a value of 0 means, because different packet capture mechanisms, on different operating systems, treat that value differently.
On systems using BPF, such as the BSDs and macOS, it means "wait until the packet buffer is completely full before providing the packets". If the packet buffer is large (it defaults to about 256K on FreeBSD), and the packets are small (60 bytes for ARP packets), it may take a significant amount of time for the buffer to fill - longer than the timeout you're handing to select().
It's probably best to have a timeout value of between 100 milliseconds and 1 second, so pass an argument of somewhere between 100 and 1000, not 0.
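For example, a minimal sketch of the same open/select/dispatch flow with a 500 ms buffer timeout (capture_with_timeout and handler are just illustrative names, and 500 is only one value in the suggested range):
#include <pcap/pcap.h>
#include <stdio.h>
#include <sys/select.h>

/* Sketch only: same flow as the question's code, but with a 500 ms packet
 * buffer timeout instead of 0, so BPF delivers packets before its buffer
 * fills completely. */
static void handler(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    printf("got a packet, caplen=%u\n", h->caplen);
}

int capture_with_timeout(const char *ifname)
{
    char ebuf[PCAP_ERRBUF_SIZE];
    pcap_t *pcap = pcap_open_live(ifname, 100, 1, 500, ebuf); /* 500 ms timeout */
    if (pcap == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", ebuf);
        return -1;
    }
    if (pcap_setnonblock(pcap, 1, ebuf) == -1) {
        fprintf(stderr, "pcap_setnonblock: %s\n", ebuf);
        pcap_close(pcap);
        return -1;
    }

    int fd = pcap_get_selectable_fd(pcap);
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    struct timeval tv = { .tv_sec = 3, .tv_usec = 0 };

    int retval = select(fd + 1, &fds, NULL, NULL, &tv);
    if (retval > 0)
        pcap_dispatch(pcap, -1, handler, NULL);   /* drains whatever BPF delivered */
    else if (retval == 0)
        printf("No data within 3 seconds.\n");
    else
        perror("select()");

    pcap_close(pcap);
    return 0;
}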

Related

Getting Ethernet frame length on raw socket (non-blocking)

I am trying to send and receive raw Ethernet frames to include a network device as a media access controller in a simulation environment.
It is therefore important that receiving packets works through non-blocking calls.
The sending of raw Ethernet frames works fine, but there is one thing about the receive path that confuses me:
how do I know where one frame ends and the next frame begins?
What I fundamentally do is to open a raw socket:
device.socket = socket(AF_PACKET, SOCK_RAW, IPPROTO_RAW);
setting it up as non blocking:
flags = fcntl(s,F_GETFL,0);
assert(flags != -1);
fcntl(s, F_SETFL, flags | O_NONBLOCK);
and then call the recv() function cyclically to get the data from the socket:
length = recv(s, buffer, ETH_FRAME_LEN_MY, 0);
But as far as I know, the recv() function only returns the number of bytes currently available in the receive buffer, so I do not know whether another frame starts or whether I am still reading the "old" packet.
And because the length of the Ethernet frame is not included in the header, I cannot work this out on my own.
Thank you in advance!
If anyone runs into the same problem, here's a possible solution:
You can use the libpcap library (on Windows, WinPcap) to open the device as a capture device:
char errbuf[PCAP_ERRBUF_SIZE]; /* error buffer */
pcap_t *handle; /* packet capture handle */
/* open capture device*/
/* max possible length, not in promiscuous mode, no timeout! */
handle = pcap_open_live(dev, 65536, 0, 0, errbuf);
if (handle == NULL) {
fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
}
/* set capture device to non blocking*/
/* set capture device to non blocking */
if (pcap_setnonblock(handle, 1, errbuf)) {
    fprintf(stderr, "Could not set pcap interface in non blocking mode: %s\n", errbuf);
}
Now you can cyclically call the pcap_dispatch() function to receive packet(s):
int pcap_dispatch(pcap_t *p, int cnt, pcap_handler callback, u_char *user);
You have to provide a callback function in which the data is handled.
See https://www.freebsd.org/cgi/man.cgi?query=pcap_dispatch&apropos=0&sektion=3&manpath=FreeBSD+11-current&format=html for further information.
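For illustration, a minimal callback matching the pcap_handler signature could look like this (packet_callback is just an example name); note that libpcap hands you exactly one captured frame per callback invocation, with its captured length in header->caplen:
/* Illustrative callback only; libpcap passes each captured frame here. */
void packet_callback(u_char *user, const struct pcap_pkthdr *header,
                     const u_char *packet)
{
    /* header->caplen is the number of bytes captured for this frame, which
     * answers "where does one frame end": pcap delivers one frame per call. */
    printf("frame of %u bytes captured\n", header->caplen);
}

/* Called cyclically from the main loop: */
pcap_dispatch(handle, -1, packet_callback, NULL);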
You can send raw Ethernet frames by using the inject function:
pcap_inject(pcap,frame,sizeof(frame));

SocketCAN select() and write() don't block

I'm testing the CAN interface on an embedded device (SOC / ARM core / Linux) using SocketCAN, and I want to send data as fast as possible for testing, using efficient code.
I can open the CAN device ("can0") as a BSD socket, and send frames with "write". This all works well.
My desktop can obviously generate frames faster than the CAN transmission rate (I'm using 500000 bps). To send efficiently, I tried using a "select" on the socket file descriptor to wait for it to become ready, followed by the "write". However, the "select" seems to return immediately regardless of the state of the send buffer, and "write" also doesn't block. This means that when the buffer fills up, I get an error from "write" (return value -1), and errno is set to 105 ("No buffer space available").
This means I have to wait an arbitrary amount of time, then try the write again, which seems very inefficient (polling!).
Here's my code (C, edited for brevity):
printf("CAN Data Generator\n");
int skt; // CAN raw socket
struct sockaddr_can addr;
struct canfd_frame frame;
const int WAIT_TIME = 500;
// Create socket:
skt = socket(PF_CAN, SOCK_RAW, CAN_RAW);
// Get the index of the supplied interface name:
unsigned int if_index = if_nametoindex(argv[1]);
// Bind CAN device to socket created above:
addr.can_family = AF_CAN;
addr.can_ifindex = if_index;
bind(skt, (struct sockaddr *)&addr, sizeof(addr));
// Generate example CAN data: 8 bytes; 0x11,0x22,0x33,...
// ...[Omitted]
// Send CAN frames:
fd_set fds;
const struct timeval timeout = { .tv_sec=2, .tv_usec=0 };
struct timeval this_timeout;
int ret;
ssize_t bytes_writ;
while (1)
{
// Use 'select' to wait for socket to be ready for writing:
FD_ZERO(&fds);
FD_SET(skt, &fds);
this_timeout = timeout;
ret = select(skt+1, NULL, &fds, NULL, &this_timeout);
if (ret < 0)
{
printf("'select' error (%d)\n", errno);
return 1;
}
else if (ret == 0)
{
// Timeout waiting for buffer to be free
printf("ERROR - Timeout waiting for buffer to clear.\n");
return 1;
}
else
{
if (FD_ISSET(skt, &fds))
{
// Ready to write!
bytes_writ = write(skt, &frame, CAN_MTU);
if (bytes_writ != CAN_MTU)
{
if (errno == 105)
{
// Buffer full!
printf("X"); fflush(stdout);
usleep(20); // Wait for buffer to clear
}
else
{
printf("FAIL - Error writing CAN frame (%d)\n", errno);
return 1;
}
}
else
{
printf("."); fflush(stdout);
}
}
else
{
printf("-"); fflush(stdout);
}
}
usleep(WAIT_TIME);
}
When I set the per-frame WAIT_TIME to a high value (e.g. 500 µs) so that the buffer never fills, I see this output:
CAN Data Generator
...............................................................................
................................................................................
...etc
Which is good! At 500 µs I get 54% CAN bus utilisation (according to the canbusload utility).
However, when I try a delay of 0 to max out my transmission rate, I see:
CAN Data Generator
................................................................................
............................................................X.XX..X.X.X.X.XXX.X.
X.XX..XX.XX.X.XX.X.XX.X.X.X.XX..X.X.X.XX..X.X.X.XX.X.XX...XX.X.X.X.X.XXX.X.XX.X.
X.X.XXX.X.XX.X.X.X.XXX.X.X.X.XX.X.X.X.X.XX..X..X.XX.X..XX.X.X.X.XX.X..X..X..X.X.
.X.X.XX.X.XX.X.X.X.X.X.XX.X.X.XXX.X.X.X.X..XX.....XXX..XX.X.X.X.XXX.X.XX.XX.XX.X
.X.X.XX.XX.XX.X.X.X.X.XX.X.X.X.X.XX.XX.X.XXX...XX.X.X.X.XX..X.XX.X.XX.X.X.X.X.X.
The initial dots "." show the buffer filling up; once the buffer is full, "X" starts appearing, meaning that the "write" call failed with error 105.
Tracing through the logic, this means the "select" must have returned and the "FD_ISSET(skt, &fds)" was true, although the buffer was full! (or did I miss something?).
The SocketCAN docs just say "Writing CAN frames can be done similarly, with the write(2) system call"
This post suggests using "select".
This post suggests that "write" won't block for CAN priority arbitration, but doesn't cover other circumstances.
So is "select" the right way to do it? Should my "write" block? What other options could I use to avoid polling?
After a quick look at canbusload (around line 184 of its source), it seems that it computes efficiency as #data bits / #total bits on the bus.
On the other hand, according to this, the maximum efficiency for a CAN bus is around 57% for 8-byte frames, so you don't seem to be far from that 57%... I would say you are indeed flooding the bus.
With a 500 µs delay, a 500 kbps bus bitrate and 8-byte frames, you get a (control+data) bitrate of about 228 kbps, which is lower than the maximum bitrate of the CAN bus, so there is no bottleneck here.
Also, since in this case only 1 socket is being monitored, you don't need pselect, really. All you can do with pselect and 1 socket can be done without pselect and using write.
(Disclaimer: from here on this is just guessing, since I cannot test it right now, sorry.)
As for why pselect behaves this way: think of the buffer as having byte semantics, so it tells you there is still room for more bytes (at least 1), not necessarily for more can_frames. So, when it returns, pselect does not tell you that you can send a whole CAN frame. I guess you could solve this by using SIOCOUTQ and the maximum size of the send buffer, SO_SNDBUF, and subtracting, but I am not sure whether that works for CAN sockets (the nice thing would be to use SO_SNDLOWAT, but it is not changeable in Linux's implementation).
So, to answer your questions:
Is "select" the right way to do it?
Well, you can do it both ways, either (p)select or write, since you are only waiting for one file descriptor, there is no real difference.
Should my "write" block? It should if there is no single byte available in the send buffer.
What other options could I use to avoid polling? Maybe by ioctl'ing SIOCOUTQ and getsockopt'ing SO_SNDBUF and substracting... you will need to check this yourself. Alternatively, maybe you could set the send buffer size to a multiple of sizeof(can_frame) and see if it keeps you signaling when less than sizeof(can_frame) are available.
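A rough, untested sketch of that SIOCOUTQ idea (send_buffer_free is an illustrative name, and whether the accounting matches CAN sockets is exactly the open question above):
#include <linux/sockios.h>   /* SIOCOUTQ */
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Sketch only: estimate the free space in the socket send buffer.
 * Returns the number of bytes believed to be free, or -1 on error. */
static int send_buffer_free(int skt)
{
    int queued = 0;
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);

    if (ioctl(skt, SIOCOUTQ, &queued) < 0)        /* bytes still queued for TX */
        return -1;
    if (getsockopt(skt, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0)
        return -1;

    return sndbuf - queued;
}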
Anyhow, if you are interested in more precise timing, you could use a BCM socket. There, you can instruct the kernel to send a specific frame at a specific interval. Once set up, the transmission runs in kernel space, without any further system calls, so the user/kernel buffer problem is avoided. I would test different rates until canbusload shows no rise in bus utilization.
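A rough sketch of such a BCM setup, assuming Linux's <linux/can/bcm.h>; start_cyclic_tx, the CAN ID 0x123, the payload and the 500 µs interval are all illustrative:
#include <linux/can.h>
#include <linux/can/bcm.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct bcm_tx {
    struct bcm_msg_head head;
    struct can_frame frame;
};

int start_cyclic_tx(const char *ifname)
{
    int s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);
    if (s < 0)
        return -1;

    struct sockaddr_can addr = { 0 };
    addr.can_family = AF_CAN;
    addr.can_ifindex = if_nametoindex(ifname);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }

    struct bcm_tx msg = { 0 };
    msg.head.opcode  = TX_SETUP;                /* install a cyclic TX job */
    msg.head.flags   = SETTIMER | STARTTIMER;   /* start sending immediately */
    msg.head.count   = 0;                       /* 0: repeat forever at ival2 */
    msg.head.ival2.tv_sec  = 0;
    msg.head.ival2.tv_usec = 500;               /* example: one frame every 500 µs */
    msg.head.can_id  = 0x123;
    msg.head.nframes = 1;
    msg.frame.can_id  = 0x123;
    msg.frame.can_dlc = 8;
    memcpy(msg.frame.data, "\x11\x22\x33\x44\x55\x66\x77\x88", 8);

    /* One write installs the job; the kernel then keeps sending the frame
     * at the given interval without any further system calls. */
    if (write(s, &msg, sizeof(msg)) != sizeof(msg)) {
        close(s);
        return -1;
    }
    return s;
}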
select and poll worked fine for me with SocketCAN. However, careful configuration is required.
Some background:
Between the user app and the HW there are 2 buffers:
the socket buffer, whose size (in bytes) is controlled by setsockopt's SO_SNDBUF option;
the driver's qdisc, whose size (in packets) is controlled by the "ifconfig can0 txqueuelen 5" command.
The data path is: user app "write" call -> socket buffer -> driver's qdisc -> HW TX mailbox.
2 flow control points exist along this path:
when there is no free TX mailbox, the driver freezes its qdisc (__QUEUE_STATE_DRV_XOFF) to prevent more packets from being dequeued from the qdisc into the HW; it is un-frozen when a TX mailbox becomes free (upon the TX completion interrupt).
when the socket buffer fills above half of its capacity, poll/select blocks until it drains back below half of its capacity.
Now assume that the socket buffer has room for 20 packets, while the driver's qdisc has room for 5 packets. Let's also assume that the HW has a single TX mailbox.
poll/select lets the user app write up to 10 packets.
Those packets are moved down to the socket buffer.
5 of those packets continue and fill the driver's qdisc.
The driver dequeues the 1st packet from its qdisc, puts it into the HW TX mailbox and freezes the qdisc (= no more dequeueing). Now there is room for 1 packet in the qdisc.
The 6th packet is moved down successfully from the socket buffer to the qdisc.
The 7th packet is moved down from the socket buffer to the qdisc, but since there is no room, it is dropped and error 105 ("No buffer space available") is generated.
What is the solution?
Under the above assumptions, configure the socket buffer for 8 packets. In that case, poll/select will block the user app after 4 packets, ensuring that there is room in the driver's qdisc for all 4 of them.
However, the socket buffer is configured in bytes, not packets. The translation works as follows: each CAN packet occupies ~704 bytes in the socket buffer (most of that for the socket structure), so to configure the socket buffer for 8 packets, the size in bytes should be 8*704:
int size = 8*704;
setsockopt(s, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));

WinPcap doesn't catch any ARP packets

I am trying to sniff all the ARP traffic. Here is my code:
void start() {
    pcap_if_t *alldevs;
    pcap_if_t *d;
    char errbuf[PCAP_ERRBUF_SIZE];
    int choice;
    pcap_t* pcap_handle;
    struct bpf_program filter;
    int i = 0;

    if (pcap_findalldevs(&alldevs, errbuf) == -1) {
        fatal("pcap_findalldevs", errbuf);
        return;
    }
    d = alldevs;
    for (d = alldevs; d; d = d->next) {
        cout << "#" << i << " " << d->name << endl;
        if (d->description)
            cout << "description: " << d->description << (d->addresses == NULL ? " invalid" : " valid") << endl;
        ++i;
    }
    if (i == 0) {
        cout << "No interfaces!" << endl;
    }
    while (true) {
        cout << "choose interface number: ";
        cin >> choice;
        if (choice < 0 || choice > i - 1) {
            cout << "choice is out of range!" << endl;
            pcap_freealldevs(alldevs);
            return;
        }
        d = alldevs;
        for (int j = 0; j < choice; ++j) {
            d = d->next;
        }
        if (d->addresses == NULL)
            cout << "device is invalid!" << endl;
        else
            break;
        if (i == 1) {
            return;
        }
    }
    cout << "###\tGuarding device #" << choice << " " << d->name << endl;
    pcap_handle = pcap_open_live(d->name, 65535, 1, 0, errbuf);
    pcap_freealldevs(alldevs);
    if (pcap_handle == NULL) {
        fatal("pcap_open_live", errbuf);
        return;
    }
    unsigned int netmask = ((struct sockaddr_in *)(d->addresses->netmask))->sin_addr.S_un.S_addr;
    if (pcap_compile(pcap_handle, &filter, "arp", 1, netmask) < 0) {
        fatal("pcap_compile", errbuf);
        return;
    }
    if (pcap_setfilter(pcap_handle, &filter) < 0) {
        fatal("pcap_setfilter", errbuf);
        return;
    }
    pcap_loop(pcap_handle, 5, packet_handler, NULL);
    pcap_close(pcap_handle);
}
Unfortunately, no ARP packets are caught (Wireshark shows me that there is ARP traffic!). If I change the filter to "ip", packets are caught. Any ideas?
Regards
I would change the pcap_open_live() call to
pcap_handle = pcap_open_live(d->name, 65535, 1, 1000, errbuf);
On several platforms, including Windows, the mechanism that libpcap/WinPcap uses buffers packets up as they pass the filter, and only delivers packets to the application when the buffer fills up or the timeout expires; this is done to deliver multiple packets in one kernel->user transition, to reduce the overhead of packet capture with high volumes of traffic.
You were supplying a timeout value of 0; on several platforms, including Windows, this means "no timeout", so packets won't be delivered until the buffer fills up. ARP packets are small, and the buffer is big enough that it could take many ARP packets to fill it up; ARP packets are also relatively rare, so it could take a long time for enough ARP packets to arrive to fill it up. IP packets are bigger, sometimes much bigger, and are more frequent, so the buffer probably doesn't take too long to fill up.
A timeout value of 1000 is a timeout of 1 second, so the packets should show up within a second; that's the timeout value tcpdump uses. You can also use a lower value.

UDP buffer overflow w/o filling the receive buffer?

If I send 1000 "Hello World!" UDP messages (12 bytes + 28 bytes of IP/UDP overhead), I observe that on the receiving side I only buffer 658 of them (always the same number, 658*40 = 26320 bytes). I do this by sending the UDP messages while the server is sleeping (after creating the socket).
Curiously, the SO_RCVBUF option on the server is 42080 bytes, so I wonder why I cannot buffer the 1000 messages. Do you know where the remaining 15760 bytes are spent?
Below the server code (where distrib.h contains basic error handling wrappers of the socket and signal handling functions):
#include "distrib.h"
static int count;
static void sigint_handler(int s) {
printf("\n%d UDP messages received\n",count);
exit(0);
}
int main(int argc, char **argv)
{
struct addrinfo* serverinfo;
struct addrinfo hints;
struct sockaddr_storage sender;
socklen_t len;
int listenfd,n;
char buf[MAXLINE+1];
if (argc != 2) {
log_error("usage: %s <port>\n", argv[0]);
exit(1);
}
Signal(SIGINT,sigint_handler);
bzero(&hints,sizeof(hints));
hints.ai_family = AF_INET;
hints.ai_socktype = SOCK_DGRAM;
hints.ai_protocol = IPPROTO_UDP;
Getaddrinfo("127.0.0.1", argv[1], &hints, &serverinfo);
listenfd = Socket(serverinfo->ai_family, serverinfo->ai_socktype,
serverinfo->ai_protocol);
Bind(listenfd, serverinfo->ai_addr,serverinfo->ai_addrlen);
freeaddrinfo(serverinfo);
count =0;
sleep(20);
while(true) {
bzero(buf,sizeof(buf));
len = sizeof(sender);
n = Recvfrom(listenfd, buf, MAXLINE, 0, (struct sockaddr*)&sender,&len);
buf[n]='\0';
count++;
}
close(listenfd);
return 0;
}
It's more informative to do the reverse calculation -- your buffer is 42080 and it's buffering 658 packets before it starts dropping. Now 42080/658 = 63.95, so it looks like it is counting each packet as 64 bytes and dropping packets if the total size of the packets buffered so far is at or above the limit. Since it buffers entire packets, it actually ends up buffering slightly more than the limit.
Why 64 bytes instead of 40? Perhaps it's including some queuing overhead or perhaps it's rounding up to a multiple of some power of 2 for alignment, or perhaps some combination of both.
I don't have a complete answer, but I tested this on my Linux box and this is what I observed.
When I send one "Hello World!\n" with a terminating '\0', I get:
Client:
$./sendto
sent 14 bytes
Socket "Recv-Q" has 768 bytes (seems probable its in bytes, did not check ss sources):
$ ss -ul|grep 55555
UNCONN 768 0 127.0.0.1:55555 *:*
When I send 1000 packets I get:
$ ./sendto
sent 14000 bytes
Recv-Q:
$ ss -ul|grep 55555
UNCONN 213504 0 127.0.0.1:55555 *:*
Your server (after ctrl-c):
$ ./recvfrom 55555
^C
278 UDP messages received
Incidentally, 213504/768 = 278. With quick experimentation I could not figure out which setting to tune to increase the buffered amount. Also, I don't know why a received packet takes up so much space in this queue. Lots of metadata, maybe? As on your OS X box, the dropped packets show up in netstat -su.
EDIT: Additional observation with ss -ulm, which prints "socket memory usage" in more detail:
UNCONN 213504 0 127.0.0.1:55555 *:*
skmem:(r213504,rb212992,t0,tb212992,f3584,w0,o0,bl0)
The 213504 bytes buffered are 512 bytes above the rb value. Might not be a coincidence, but would require reading the kernel source to find out.
Did you check how much one UDP datagram takes up on OS X?
EDIT 2:
This is still not a suitable answer for OS X, but on Linux I found that increasing the kernel memory allocated for receive buffers allowed me to buffer all 1000 packets sent.
A bit of overkill, but I used these (disclaimer: tweaking the buffer values randomly might seriously mess up your networking and kernel):
net.core.rmem_max=1048568
net.core.rmem_default=1048568
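An application can also request a larger receive buffer itself with SO_RCVBUF; the kernel caps the request at net.core.rmem_max and, on Linux, doubles it to account for bookkeeping overhead. A small sketch (grow_rcvbuf and the 1048568 value are just illustrative, matching the sysctls above):
#include <stdio.h>
#include <sys/socket.h>

/* Sketch: request a bigger receive buffer from the application side.
 * The kernel silently caps the request at net.core.rmem_max. */
static void grow_rcvbuf(int fd)
{
    int requested = 1048568;        /* example value */
    int actual = 0;
    socklen_t len = sizeof(actual);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("receive buffer is now %d bytes\n", actual); /* Linux reports 2x the request */
}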

Flush kernel's TCP buffer for `MSG_MORE`-flagged packets

send()'s man page reveals the MSG_MORE flag which is asserted to act like TCP_CORK. I have a wrapper function around send():
int SocketConnection_Write(SocketConnection *this, void *buf, int len) {
    errno = 0;
    int sent = send(this->fd, buf, len, MSG_NOSIGNAL);
    if (errno == EPIPE || errno == ENOTCONN) {
        throw(exc, &SocketConnection_NotConnectedException);
    } else if (errno == ECONNRESET) {
        throw(exc, &SocketConnection_ConnectionResetException);
    } else if (sent != len) {
        throw(exc, &SocketConnection_LengthMismatchException);
    }
    return sent;
}
Assuming I want to use the kernel buffer, I could go with TCP_CORK, enable it whenever necessary and then disable it to flush the buffer. On the other hand, that creates the need for an additional system call. The use of MSG_MORE therefore seems more appropriate to me. I'd simply change the above send() line to:
int sent = send(this->fd, buf, len, MSG_NOSIGNAL | MSG_MORE);
According to lwn.net, packets will be flushed automatically if they are large enough:
If an application sets that option on a socket, the kernel will not send out short packets. Instead, it will wait until enough data has shown up to fill a maximum-size packet, then send it. When TCP_CORK is turned off, any remaining data will go out on the wire.
But this section only refers to TCP_CORK. Now, what is the proper way to flush MSG_MORE packets?
I can only think of two possibilities:
Call send() with an empty buffer and without MSG_MORE being set
Re-apply the TCP_CORK option as described on this page
Unfortunately the whole topic is very poorly documented and I couldn't find much on the Internet.
I am also wondering how to check that everything works as expected. Obviously, running the server through strace is not an option. So the simplest way would be to use netcat and then look at its strace output? Or will the kernel handle traffic transmitted over a loopback interface differently?
I have taken a look at the kernel source and both assumptions seem to be true. The following excerpts are from net/ipv4/tcp.c (2.6.33.1).
static inline void tcp_push(struct sock *sk, int flags, int mss_now,
                            int nonagle)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (tcp_send_head(sk)) {
        struct sk_buff *skb = tcp_write_queue_tail(sk);
        if (!(flags & MSG_MORE) || forced_push(tp))
            tcp_mark_push(tp, skb);
        tcp_mark_urg(tp, flags, skb);
        __tcp_push_pending_frames(sk, mss_now,
                                  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
    }
}
Hence, if the flag is not set, the pending frames will definitely be flushed. But that is only the case when the buffer is not empty:
static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
                                size_t psize, int flags)
{
    (...)
    ssize_t copied;
    (...)

    copied = 0;
    while (psize > 0) {
        (...)
        if (forced_push(tp)) {
            tcp_mark_push(tp, skb);
            __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
        } else if (skb == tcp_send_head(sk))
            tcp_push_one(sk, mss_now);
        continue;

wait_for_sndbuf:
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
        if (copied)
            tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

        if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
            goto do_error;

        mss_now = tcp_send_mss(sk, &size_goal, flags);
    }

out:
    if (copied)
        tcp_push(sk, flags, mss_now, tp->nonagle);
    return copied;

do_error:
    if (copied)
        goto out;
out_err:
    return sk_stream_error(sk, flags, err);
}
The while loop's body will never be executed because psize is not greater than 0. Then, in the out section, there is another chance: tcp_push() gets called, but because copied still has its default value of 0, nothing is pushed there either.
So sending a packet of length 0 will never result in a flush.
The next theory was to re-apply TCP_CORK. Let's take a look at the code first:
static int do_tcp_setsockopt(struct sock *sk, int level,
                             int optname, char __user *optval, unsigned int optlen)
{
    (...)
    switch (optname) {
    (...)
    case TCP_NODELAY:
        if (val) {
            /* TCP_NODELAY is weaker than TCP_CORK, so that
             * this option on corked socket is remembered, but
             * it is not activated until cork is cleared.
             *
             * However, when TCP_NODELAY is set we make
             * an explicit push, which overrides even TCP_CORK
             * for currently queued segments.
             */
            tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
            tcp_push_pending_frames(sk);
        } else {
            tp->nonagle &= ~TCP_NAGLE_OFF;
        }
        break;
    case TCP_CORK:
        /* When set indicates to always queue non-full frames.
         * Later the user clears this option and we transmit
         * any pending partial frames in the queue. This is
         * meant to be used alongside sendfile() to get properly
         * filled frames when the user (for example) must write
         * out headers with a write() call first and then use
         * sendfile to send out the data parts.
         *
         * TCP_CORK can be set together with TCP_NODELAY and it is
         * stronger than TCP_NODELAY.
         */
        if (val) {
            tp->nonagle |= TCP_NAGLE_CORK;
        } else {
            tp->nonagle &= ~TCP_NAGLE_CORK;
            if (tp->nonagle&TCP_NAGLE_OFF)
                tp->nonagle |= TCP_NAGLE_PUSH;
            tcp_push_pending_frames(sk);
        }
        break;
    (...)
As you can see, there are two ways to flush. You can either set TCP_NODELAY to 1 or TCP_CORK to 0. Luckily, neither checks whether the flag is already set. Thus, my initial plan to re-apply the TCP_CORK flag can be simplified to just disabling it, even if it is currently not set.
I hope this helps someone with similar issues.
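Based on that, a tiny helper along the following lines should flush anything queued with MSG_MORE (flush_more is just an illustrative name, and setting TCP_NODELAY to 1 would work equally well per the kernel code above):
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Flush segments previously queued with MSG_MORE by clearing TCP_CORK,
 * even if it was never explicitly set. Error handling omitted. */
static void flush_more(int fd)
{
    int zero = 0;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &zero, sizeof(zero));
}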
That's a lot of research... all I can offer is this empirical post note:
Sending a bunch of packets with MSG_MORE set, followed by a packet without MSG_MORE, sends the whole lot out. It works a treat for something like this:
for (i=0; i<mg_live.length; i++) {
// [...]
if ((n = pth_send(sock, query, len, MSG_MORE | MSG_NOSIGNAL)) < len) {
printf("error writing to socket (sent %i bytes of %i)\n", n, len);
exit(1);
}
}
}
pth_send(sock, "END\n", 4, MSG_NOSIGNAL);
That is, when you're sending out all the packets at once, and have a clearly defined end... AND you are only using one socket.
If you tried writing to another socket in the middle of the above loop, you may find that Linux releases the previously held packets. At least that appears to be the trouble I'm having right now. But it might be an easy solution for you.
