Zero-copy with and without Scatter/Gather operations - c

I just read an article that explains the zero-copy mechanism.
It talks about the difference between zero-copy with and without Scatter/Gather support.
For a NIC without SG support, the copies are: DMA from disk into the kernel (page cache) buffer, a CPU copy from the kernel buffer into the socket buffer, then DMA from the socket buffer to the NIC.
For a NIC with SG support, the copies are: DMA from disk into the kernel buffer, then DMA directly from the kernel buffer to the NIC, driven by descriptors holding the addresses and lengths of the data.
In a word, zero-copy with SG support eliminates one CPU copy.
My question is: why could the data in the kernel buffer be scattered?

Because the Linux kernel's memory-mapping and allocation facilities by default create virtually contiguous but possibly physically disjoint memory regions.
That means the read from the filesystem which sendfile() does internally goes to a buffer in kernel virtual memory, which the DMA code has to "transmogrify" (for lack of a better word) into something that the network card's DMA engine can grok.
Since DMA (often, but not always) uses physical addresses, you must either duplicate the data buffer (into a specially allocated, physically contiguous region of memory: your socket buffer above), or else transfer it one physical page at a time.
If, on the other hand, your DMA engine is capable of aggregating multiple physically disjoint memory regions into a single data transfer (that is called "scatter-gather"), then instead of copying the buffer you can simply pass a list of physical addresses (pointing to physically contiguous sub-segments of the kernel buffer: your aggregate descriptors above), and you no longer need to start a separate DMA transfer for each physical page. That is usually faster, but whether it can be done depends on the capabilities of the DMA engine.

Re: why could the data in the kernel buffer be scattered?
Because it already is scattered. The data queue in front of a TCP socket is not divided into the datagrams that will go out onto the network interface. Scatter allows you to keep the data where it is and not have to copy it to make a flat buffer that is acceptable to the hardware.
With the gather feature, you can give the network card a datagram which is broken into pieces at different addresses in memory, which can be references to the original socket buffers. The card will read it from those locations and send it as a single unit.
Without gather (when the hardware requires simple, linear buffers), a datagram has to be prepared as a contiguously allocated byte string, and all the data belonging to it has to be memcpy'd into place from the buffers queued for transmission on the socket.

Because when you write to a socket, the headers of the packet are assembled in a different place from your user-data, so to be coalesced into a network packet, the device needs "gather" capability, at least to get the headers and data.
Also to avoid the CPU having to read the data (and thus, fill its cache up with useless stuff it's never going to need again), the network card also needs to generate its own IP and TCP checksums (I'm assuming TCP here, because 99% of your bulk data transfers are going to be TCP). This is OK, because nowadays they all can.
What I'm not sure about is how this all interacts with TCP_CORK.
Most protocols tend to have their own headers, so a hypothetical protocol looks like:
Client: Send request
Server: Send some metadata; send the file data
So we tend to have a server application assembling some headers in memory, issuing a write(), followed by a sendfile()-like operation. I suppose the headers still get copied into a kernel buffer in this case.

Related

How to do IPC with a DPDK process involved?

My goal is to create a DPDK app which will act as a middle-man between a virtual machine manager (which in my case is a userspace process) and the NIC hardware.
So far I have tried to do something on a smaller scale. Instead of using the VMM, I created a dummy process in C.
I managed to "bind" the dummy process to the DPDK process using named semaphores and shared memory.
Basically, in this little demo, the DPDK app reads from the RX buffer and puts the content into shared memory. Then the dummy process gets the data and prints it to stdout.
All of DPDK's support for multi-process communication targets the specific case where both apps use DPDK's libraries.
I am wondering if there is some sort of support for the case where one app is not using those libraries.
Why? Because the VMM is written in Rust and I do not know how to add DPDK libraries in Rust.
What do you think it would be the most efficient way of communication?
I was thinking if it is possible to put the mempool inside the shared memory and access the mbufs directly from the dummy process.
I am currently using DPDK 20.11 on Ubuntu 20.04.
Thank you!
UPDATE 1:
"Is your question: can I interface/interact a DPDK application with a non-DPDK application?"
What I am actually struggling to find is this: how do I efficiently move data received on RX buffer to a non dpdk app?
My current approach is this: https://imgur.com/a/cF2lq29
That is the main logic loop for a dpdk app which gets data on RX buffer and sends it to the "non dpdk app".
How it is happening:
Read data from RX buffer.
Wait until the non-DPDK app says "I am not using the shared memory, you can write to it".
Write to the shared memory (nb_rx is written instead of the whole packet just for simplicity).
Signal the non-DPDK app that the shared memory is now available for reading.
As one can see, it is not quite efficient and I am afraid my synchronization method will create a bottleneck.
So this makes me wonder: "are there any better, by-the-book ways of accomplishing this communication?"
There are 3 ways to solve the HOST-to-GUEST/Docker problem.
Common way: expose the physical NICs through a DPDK application such as SPP, OVS, VPP, or a DPDK primary-secondary setup, and leverage virtio, the vhost library, memif, or a shared mmap'd hugepage to allow copy/zero-copy mode to the VM/Docker.
Complex copy way: create a shared memory location between the DPDK application on the host and the non-DPDK application that runs in the HOST/GUEST/Docker.
Mimic zero-copy way: the non-DPDK application creates DMA buffer areas in shared memory at a fixed location. The DPDK application uses external memory buffers (MBUF) for the physical ports; a DPDK PMD that supports external MBUFs can then DMA the packet to the shared area.
Since option 2 and 3 are not common, let me explain how you might end up developing the solution.
Option-2:
Develop a simple non-DPDK application using the shared mmap area; divide the area into fixed packet-size slots (max size), then distribute one half for TX and one half for RX.
Initialize the DPDK application to make use of the mmap area that was created.
Maintain packet access with atomic head and tail pointers.
After rx_burst, when the DPDK application receives packets, it gets a blank index by querying the head pointer, then memcpys the packet to that index. Once done, the DPDK application invokes rte_pktmbuf_free.
The non-DPDK application can then use the tail pointer to fetch valid RX packets from shared memory.
Perform a similar operation for TX using a separate location index and head/tail pointers.
Disadvantages:
Packet throughput is heavily reduced.
Copying packet memory consumes CPU cycles.
A complex common library is needed to maintain the index and the head/tail pointers for RX and TX.
Memory space is over-provisioned for the largest packet, since traffic cannot be predicted.
Option-3:
Create a shared mmap with the posix_memalign API: multiple regions of 2000 bytes each.
Use a simple data structure (descriptor) to hold {virtual address, physical address, length, value}.
Create the SHM area with each index populated in the above format.
Initialize the DPDK application to access both the SHM and the mmap'd area.
In the DPDK application, create the pool with rte_pktmbuf_pool_create_extbuf, where ext_mem represents the DMA region populated under the SHM physical address (not sure if this will work, as the original intent of that API was a different use case).
Register the callback handler to do garbage collection once we rx_burst the packet.
For TX there are 2 options: a) the easiest way is to simply copy the buffer into an rte_mbuf, or b) create an indirect buffer to attach the rte_mbuf to the external buffer and wait until the NIC has actually sent the packet (via the completion queue).
Disadvantages of option-3:
Complex way of using zero copy on the RX side.
Copy mode is the easiest method to implement.
Buffer management is fragile.
Assumes 1 thread for RX-TX is applicable.
Recommendation: if the intention is not to use VPP, SPP, or OVS, then the simplest approach is the DPDK primary-secondary format, where all the RX and TX mbufs are available to both processes since they are mmapped in hugepages.

get_user_pages_fast() for DMA?

I have a Linux driver that does DMA transfers to/from a device. For sending data to the device (to prevent copy operations) the driver maps the userspace buffer and uses it for DMA directly via get_user_pages_fast(). The user pages are then added to a scatter-gather list and used for DMA.
This works rather well, but the one issue is that this forces the userspace buffer to have various alignment requirements to the cache line of the CPU. My system returns 128 when you call dma_get_cache_alignment(), which means that in userspace I have to ensure that the start address is aligned to this value. Also, I have to check that the buffer is sized to a multiple of 128.
I see two options for handling this:
Deal with it. That is, in userspace ensure that the buffer is properly aligned. This sounds reasonable, but I have run into some issues since my device has to be integrated into a larger project, and I don't have control over the buffers that get passed to me. As a result, I have to allocate a properly aligned buffer in userspace to sit between the driver and the application and use that buffer in the event the caller's buffer is not aligned. This adds a copy operation and isn't the end of the world, but the resulting code is rather messy.
Rework the driver to use a kernel space buffer. That is, change the code such that the driver uses copy_from_user() to move the data into a properly aligned kernel space buffer. I'm not too concerned about the performance here, so this is an option, but would require a good amount of rework.
Is there anything that I'm missing? I'm hoping that there might be some magic flag or something that I overlooked to remove the alignment requirement altogether.

Pipelining a set of C buffers

I am creating Ethernet packets in an embedded system. I have my Data / IP and UDP packet headers defined in pre-allocated buffers and I have a large buffer that is used to grab data from the FPGA's fabric using DMA.
I also have some user data headers and footers where the data comes from the fabric in other ways, mostly SPI transfer of temperature, PCB address etc. Or even grabs of some of the configuration registers (single transaction, on-boot).
Now, at the moment I concatenate these using memcpy into a new larger buffer (also pre-allocated), and then send to the Transmit buffer of the on-FPGA MAC.
My issues:
1) All these buffers are on the FPGA, hence requiring memory. I could copy them one at a time into the MAC Tx buffer, but this would prevent my second idea.
2) All being buffers gives the possibility of forming a pipeline, where new data (DN+1) can be put into the first buffers while subsequent buffers are storing and concatenating the data of (DN+0).
If I have nicely modularised code, how do I create a pipeline from buffer to buffer? In hardware I'd use flags, only passing data from buffer A to B when buffer B has finished passing its data to C. In C, memcpy and memmove give no completion status; I'd therefore need to make my own boolean flag that is set after memcpy finishes, and I'd need to make these flags global so that I can easily pass their status to other functions.
Finally, as this is embedded, I don't have access to the full C libraries and both time and memory are at a premium.
Thanks
Ed

how to design a server for variable size messages

I want some feedback or suggestion on how to design a server that handles variable size messages.
To simplify the answer, let's assume:
single thread epoll() based
the protocol is: data-size + data
data is stored on a ringbuffer
the read code, with some simplification for clarity, looks like this:
    if (client->readable) {
        if (client->remaining > 0) {
            /* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
            rd = read(client->sock, client->buffer, client->remaining);
            client->buffer += rd;
            client->remaining -= rd;
        } else {
            /* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
            read(client->sock, &(client->remaining), 4);
            client->buffer = acquire_ringbuf_slot(client->remaining);
        }
    }
Please do not focus on the 4 bytes; just assume we have the data size at the beginning. Compressed or not makes no difference for this discussion.
Now, the question is: what is the best way to do the above?
Assume both small "data" of a few bytes and large data of several MBs.
How can we reduce the number of read() calls? E.g. if we have 4 messages of 16 bytes each on the stream, it seems a waste to do 8 calls to read().
are there better alternatives to this design?
PART of the solution depends on the transport layer protocol you use.
I assume you are using TCP which provides connection oriented and reliable communication.
From your code, I assume you understand that TCP is a stream-oriented protocol
(so when a client sends a piece of data, that data is stored in the socket send buffer, and TCP may use one or more TCP segments to convey it to the other end, the server).
So the code looks very good so far (assuming you have error checks and the rest in the real code).
Now for your questions, I give you my responses, what I think is best based on my experience (but there could be better solutions):
1-This is a solution with challenges similar to how an OS manages memory, dealing with fragmentation.
For handling different message sizes, you have to understand there are always trade-offs depending on your performance goals.
One solution to improve memory utilization and parallelization is to have a list of free buffer chunks of a certain size, say 4KB.
You retrieve as many as you need to store the received message; the last one will be partially unused. That is the internal-fragmentation trade-off.
The drawback could be when you need to apply certain type of processing (maybe a visitor pattern) on the message, like parsing/routing/transformation/etc. It will be more complex and less efficient than a case of a huge buffer of contiguous memory. On the other side, the drawback of a huge buffer is much less efficient memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in the middle (think about chunks that are also contiguous whenever available), always depending on your goals. Something useful is to implement an abstraction over the fragmented memory, so that every function (or visitor) applied to it works as if it were dealing with contiguous memory.
If you use these chunks, then when the message has been processed and dropped/forwarded/consumed/whatever, you return the chunks to the free list.
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at the TCP layer is to increase the server's receive buffer. Thus, it can receive more data even when the server cannot read it fast enough.
3-The ring buffer is OK; if you use chunks, the ring buffer should provide the abstraction. But I don't know why you need a ring buffer in the first place.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.
A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should only read as much data as possible in one read syscall. This function must not know anything about the protocol. This way you do not need to re-implement this function for different wire formats.
When your read function has read into the user-space buffer, it should invoke a callback, passing the iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass them to another, higher-level callback. The parser should return the number of bytes it consumed, so that those bytes can be discarded from the user-space socket buffer.

Socket and buffers

I know the standard C library functions fwrite and fread are buffering wrappers around the write and read system calls; those buffers exist for performance reasons, which I totally understand.
What I don't understand is the role of buffers in the socket functions write and read.
Can you help me understand what they are used for, highlighting differences and similarities with file buffers?
I'm a newbie in socket programming...
When the kernel receives packets it has to put the data somewhere: it stores it in the receive buffer, and when your app does the next read it fetches the data from there. With a UDP socket, if your app doesn't drain those buffers they fill up and the kernel starts to drop received packets. With a TCP connection, the kernel will acknowledge packets as long as there is free space in the buffer, but after that it signals the peer that it cannot accept more.
Write buffers are necessary because the network interface is a scarce resource and the kernel typically cannot send a packet immediately. If you do a big write(), it could be chopped up into hundreds of packets, so the kernel stores the data in the buffers. The buffer also does a good job if you do a lot of small writes; see Nagle's algorithm.
Imagine if you sent your information one byte at a time. You'd be generating an entire packet's worth of headers to send 1 byte, and on a TCP connection, depending on the implementation, waiting until you got an ACK before sending more. Sounds pretty inefficient to me.
Instead, you use a buffer to store up a large amount of data and send that across in a single packet, just like storing up data before writing to disk.
