How to do IPC with a DPDK process involved?

My goal is to create a DPDK app which will act as a middle man between a virtual machine manager (which in my case is a userspace process) and the NIC hardware.
So far I have tried to do something on a smaller scale. Instead of using the VMM, I created a dummy process in C.
I managed to "bind" the dummy process to the DPDK process using named semaphores and shared memory.
Basically, in this little demo the DPDK app reads from the RX buffer and puts the contents into the shared memory. Then the dummy process gets the data and prints it to stdout.
All of the DPDK support for multi-process communication targets the specific case where both apps use the DPDK libraries.
I am wondering if there is some sort of support for the case where one app does not use those libraries.
Why? Because the VMM is written in Rust and I do not know how to use the DPDK libraries from Rust.
What do you think would be the most efficient way of communicating?
I was wondering whether it is possible to put the mempool inside the shared memory and access the mbufs directly from the dummy process.
I am currently using DPDK 20.11 on Ubuntu 20.04.
Thank you!
UPDATE 1:
"Is your question: can I interface/interact a DPDK application with a non-DPDK application?"
What I am actually struggling to find is this: how do I efficiently move data received on the RX buffer to a non-DPDK app?
My current approach is this: https://imgur.com/a/cF2lq29
That is the main logic loop for a DPDK app which receives data on the RX buffer and sends it to the "non-DPDK app".
How it is happening:
Read data from the RX buffer.
Wait until the non-DPDK app says "I am not using the shared memory, you can write to it".
Write to the shared memory (nb_rx is written instead of the whole packet just for simplicity).
Signal the non-DPDK app that the shared memory is now available to be read.
As one can see, it is not very efficient, and I am afraid my synchronization method will create a bottleneck.
So this makes me wonder: are there any better, by-the-book ways of accomplishing this communication?
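For reference, a minimal sketch of that loop on the DPDK side, assuming two POSIX named semaphores and a single-packet shared-memory slot (all names and sizes here are illustrative, not part of any DPDK API):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <semaphore.h>
    #include <string.h>

    #define BURST_SIZE 32

    /* Copies each received frame into the shared slot, one at a time. */
    static void rx_to_shm_loop(uint16_t port_id, uint8_t *shm_slot,
                               sem_t *slot_free, sem_t *slot_filled)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* Wait until the non-DPDK app says the slot is writable. */
                sem_wait(slot_free);

                /* Copy the packet (the length would also be published in practice). */
                memcpy(shm_slot, rte_pktmbuf_mtod(bufs[i], void *),
                       rte_pktmbuf_data_len(bufs[i]));

                /* Tell the non-DPDK app that data is ready to be read. */
                sem_post(slot_filled);
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }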

There are 3 ways to solve the HOST-to-GUEST/Docker problem.
Common way: put all physical NICs under a DPDK application such as SPP, OVS, VPP, or a DPDK primary-secondary setup, and leverage virtio, the vhost library, memif, or a shared mmap'ed hugepage to provide copy/zero-copy mode to the VM/Docker.
Complex copy way: create a shared memory location between the DPDK application on the host and the non-DPDK application that runs in the HOST/GUEST/Docker.
Mimic zero-copy way: the non-DPDK application creates DMA buffer areas in shared memory at a fixed location. In the DPDK application, use external memory buffers (mbufs) for the physical ports; a DPDK PMD that supports external mbufs can then DMA the packet into the shared area.
Since options 2 and 3 are not common, let me explain how you might end up developing the solution.
Option-2:
Develop a simple non-DPDK application using the shared mmap area; divide the area into fixed-size packet slots (max packet size), then dedicate one half to TX and the other half to RX.
Initialize the DPDK application to make use of the mmap area that was created.
Maintain packet access with atomic head and tail pointers.
After rx_burst, when the DPDK application receives packets, it obtains a free index by querying the head pointer and then memcpy's each packet to that index. Once done, the DPDK application invokes rte_pktmbuf_free.
The non-DPDK application can then use the tail pointer to fetch valid RX packets from the shared memory.
Perform a similar operation for TX using a separate region and its own head/tail pointers (a minimal sketch of such a ring follows below).
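A minimal sketch of the kind of shared ring option-2 describes; the layout, slot size, and field names below are assumptions for illustration, not an existing API:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define SLOT_SIZE 2048          /* fixed max packet size per slot */
    #define NB_SLOTS  1024          /* must be a power of two         */

    struct shm_slot {
        uint16_t len;
        uint8_t  data[SLOT_SIZE];
    };

    struct shm_ring {
        _Atomic uint32_t head;      /* advanced by the DPDK producer     */
        _Atomic uint32_t tail;      /* advanced by the non-DPDK consumer */
        struct shm_slot  slots[NB_SLOTS];
    };

    /* Producer side (called after rx_burst): 0 on success, -1 if full. */
    static int ring_enqueue(struct shm_ring *r, const void *pkt, uint16_t len)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == NB_SLOTS)
            return -1;                              /* no free slot */

        struct shm_slot *s = &r->slots[head & (NB_SLOTS - 1)];
        s->len = len;
        memcpy(s->data, pkt, len);

        /* Publish the slot only after the copy has completed. */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
    }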
Disadvantages:
packet throughput is heavily reduced.
copying packet memory burns CPU cycles.
a fairly complex common library is needed to maintain the indexes and head/tail pointers for RX and TX.
memory is over-provisioned for the largest packet size, since traffic is not predictable.
Option-3:
Create a shared mmap with the posix_memalign API, as multiple regions of 2000 bytes each.
Use a simple data structure (descriptor) to hold {virtual address, physical address, length, value}.
Create the SHM area with each index populated in the above format.
Initialize the DPDK application to access both the SHM and the mmap'ed area.
In the DPDK application, create the pool with rte_pktmbuf_pool_create_extbuf, where ext_mem represents the DMA region backed by the SHM physical address (not sure if this will work, as the original intent of that API was a different use case); see the sketch after this list.
Register a callback handler to do garbage collection once we rx_burst the packet.
For TX there are 2 options: a) the easiest way is to simply copy the buffer into an rte_mbuf, and b) create an indirect buffer, attach the rte_mbuf to the external buffer, and wait until the NIC has actually sent the packet (via the completion queue).
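A rough sketch of the extbuf step above, using the DPDK 20.11 rte_pktmbuf_pool_create_extbuf() API; the shared-memory address, IOVA, and sizes are placeholders, and as noted the PMD must actually support external mbufs:

    #include <rte_mbuf.h>
    #include <rte_lcore.h>
    #include <rte_malloc.h>

    /* Build a pktmbuf pool whose data buffers live in the shared DMA area. */
    static struct rte_mempool *
    make_extbuf_pool(void *shm_va, rte_iova_t shm_iova, size_t shm_len)
    {
        struct rte_pktmbuf_extmem ext_mem = {
            .buf_ptr  = shm_va,       /* virtual address of the shared area   */
            .buf_iova = shm_iova,     /* IO/physical address of the same area */
            .buf_len  = shm_len,
            .elt_size = 2048,         /* one data buffer per 2048-byte slot   */
        };

        /* The area must be registered and DMA-mapped beforehand, e.g. with
         * rte_extmem_register() and rte_dev_dma_map(), for the PMD to use it. */
        return rte_pktmbuf_pool_create_extbuf("shm_pool", 4096, 256, 0,
                                              ext_mem.elt_size, rte_socket_id(),
                                              &ext_mem, 1);
    }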
Disadvantages of option-3:
the zero-copy RX path is complex to get right.
copy mode is the easiest method to implement.
buffer management is fragile.
a single thread for RX and TX is assumed.
Recommendation: if the intention is not to use VPP, SPP, or OVS, then the simplest approach is the DPDK primary-secondary format, where all the RX and TX mbufs are available to both processes because they are mmap'ed in the same hugepages.
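For the primary-secondary route, the secondary side typically just attaches to objects the primary created; a minimal sketch (the ring name and EAL arguments are whatever your primary uses):

    #include <rte_eal.h>
    #include <rte_ring.h>
    #include <rte_mbuf.h>

    int main(int argc, char **argv)
    {
        /* Run with --proc-type=secondary and the same --file-prefix as the
         * primary so both processes share the same hugepage mappings. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* Look up a ring that the primary created with rte_ring_create(). */
        struct rte_ring *rx_ring = rte_ring_lookup("rx_ring");
        if (rx_ring == NULL)
            return -1;

        struct rte_mbuf *m;
        for (;;) {
            if (rte_ring_dequeue(rx_ring, (void **)&m) == 0) {
                /* The mbuf data is directly accessible here: zero copy. */
                rte_pktmbuf_free(m);
            }
        }
    }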

Related

How to create vm_area mapping if using __get_free_pages() with order greater than 1?

I am re-implementing mmap in a device driver for DMA.
I saw this question: Linux Driver: mmap() kernel buffer to userspace without using nopage. It has an answer using vm_insert_page() to map one page at a time; hence, for multiple pages, it has to be called in a loop. Is there another API that handles this?
Previously I used dma_alloc_coherent to allocate a chunk of memory for DMA and used remap_pfn_range to build a page table that associates the process's virtual memory with the physical memory.
Now I would like to allocate a much larger chunk of memory using __get_free_pages with an order greater than 1. I am not sure how to build the page table in that case. The reason is as follows:
I checked the book Linux Device Drivers and noticed the following:
Background:
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA.
Problem with remap_pfn_range:
remap_pfn_range won’t allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for.
The corresponding implementation in the scullp device driver uses get_free_pages with order 0, i.e. only one page:
The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations.
May I know if there is a way to create a VMA for pages obtained using __get_free_pages with an order greater than 1?
I checked the Linux source code and noticed there are some drivers re-implementing struct dma_map_ops->alloc() and struct dma_map_ops->map_page(). May I know if this is the correct way to do it?
I think I got the answer to my question. Feel free to correct me if I am wrong.
I happened to see this patch: mm: Introduce new vm_map_pages() and vm_map_pages_zero() API while I was googling for vm_insert_page.
Previously, drivers had their own way of mapping a range of kernel pages/memory into a user vma, and this was done by invoking vm_insert_page() within a loop.
As this pattern is common across different drivers, it can be generalized by creating new functions and using them across the drivers.
vm_map_pages() is the API that can be used to map kernel memory/pages in drivers that have considered vm_pgoff.
After reading it, I knew I found what I want.
That function also could be found in Linux Kernel Core API Documentation.
As for the difference between remap_pfn_range() and vm_insert_page(), which requires a loop for a list of contiguous pages, I found the answer to this question extremely helpful; it includes a link to an explanation by Linus.
As a side note, the patch mm: Introduce new vm_insert_range and vm_insert_range_buggy API indicates that the earlier version of vm_map_pages() was vm_insert_range(), but we should stick to vm_map_pages(), since under the hood vm_map_pages() calls vm_insert_range().
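A minimal sketch of how the mmap handler might look with vm_map_pages() for a higher-order __get_free_pages() allocation (error handling trimmed; the allocation itself is assumed to happen at probe/init time):

    #include <linux/fs.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define ALLOC_ORDER 4                     /* 2^4 = 16 pages, as an example */
    #define NUM_PAGES   (1 << ALLOC_ORDER)

    static unsigned long kbuf;                /* __get_free_pages(GFP_KERNEL, ALLOC_ORDER) */

    static int my_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        struct page *pages[NUM_PAGES];
        int i;

        /* Build the page array covering the whole higher-order allocation. */
        for (i = 0; i < NUM_PAGES; i++)
            pages[i] = virt_to_page(kbuf + i * PAGE_SIZE);

        /* Maps the entire range in one call, taking vma->vm_pgoff into account. */
        return vm_map_pages(vma, pages, NUM_PAGES);
    }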

Mapping kmalloc memory to user space

I want to create a single kernel module driver for my application.
It interfaces with an AXIS FIFO in Programmable logic and I need to send the physical addresses of allocated memory to this device to be used in programmable logic.
My platform driver recognises the AXIS FIFO device, and using mmap makes its registers available to my user space app. (previous post of mine)
I also want to allocate memory to be used by the programmable logic, and I do this using an IOCTL command which calls kmalloc with the given size as its argument. Since I want to use the physical address, I get it using __pa(x).
If I want to access this allocated memory to verify that the correct info was stored in RAM, how do I do this? Through
fd = open("/dev/mem", ...)
va = mmap (phys_address, ....)
The problem I have with this is that I can still wrongfully access parts of memory that I shouldn't. Is there a better way to do this?
Thanks.
I think the best way to do this is to create a /proc device file that maps to the allocated memory. Your kernel module kmalloc's the memory, creates the proc device, and services all the I/O calls to the device. Your userspace program reads and writes to this device, or possibly mmaps to it (if that will work, I'm not sure...).
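A rough sketch of that idea (kernel 5.6+ proc_ops shown; whether userspace mmaps it or just read()s it is up to you, and remapping kmalloc'd memory comes with the usual caveats):

    #include <linux/io.h>
    #include <linux/mm.h>
    #include <linux/module.h>
    #include <linux/proc_fs.h>
    #include <linux/slab.h>

    #define BUF_SIZE PAGE_SIZE

    static void *kbuf;                      /* the buffer handed to the PL */

    static int mybuf_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        /* kmalloc memory is physically contiguous, so one remap covers it.
         * (Older kernels may additionally require SetPageReserved().)      */
        return remap_pfn_range(vma, vma->vm_start,
                               virt_to_phys(kbuf) >> PAGE_SHIFT,
                               vma->vm_end - vma->vm_start, vma->vm_page_prot);
    }

    static const struct proc_ops mybuf_ops = {
        .proc_mmap = mybuf_mmap,
    };

    static int __init mybuf_init(void)
    {
        kbuf = kmalloc(BUF_SIZE, GFP_KERNEL);
        if (!kbuf)
            return -ENOMEM;
        proc_create("mybuf", 0600, NULL, &mybuf_ops);
        return 0;
    }
    module_init(mybuf_init);
    MODULE_LICENSE("GPL");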

Pipelining a set of C buffers

I am creating Ethernet packets in an embedded system. I have my data / IP and UDP packet headers defined in pre-allocated buffers, and I have a large buffer that is used to grab data from the FPGA's fabric using DMA.
I also have some user data headers and footers where the data comes from the fabric in other ways, mostly SPI transfers of temperature, PCB address, etc., or even grabs of some of the configuration registers (single transaction, on boot).
Now, at the moment I concatenate these using memcpy into a new larger buffer (also pre-allocated), and then send that to the transmit buffer of the on-FPGA MAC.
My issues:
1) All these buffers are on the FPGA and hence consume memory. I could copy them one at a time into the MAC Tx buffer, but this would prevent my second idea.
2) Everything being buffers opens the possibility of forming a pipeline, where new data (DN+1) can be put into the first buffers while subsequent buffers are still storing and concatenating the data of (DN+0).
Given nicely modularised code, how do I create a pipeline from buffer to buffer? In hardware I'd use flags, only passing data from buffer A to B when buffer B has finished passing its data to C. In C terms, memcpy and memmove give no completion signalling of their own (they just return the destination pointer), so I'd need my own boolean flag that is set after memcpy finishes, and I'd need to make these flags global so that I can easily pass their status to other functions.
Finally, as this is embedded, I don't have access to the full C libraries and both time and memory are at a premium.
Thanks
Ed
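For illustration, a minimal sketch of the flag-based handoff described in the question (all names are hypothetical; on a real target the flags would typically be volatile or tied into whatever interrupt/DMA-completion mechanism is available):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define FRAME_SIZE 1500u

    struct stage {
        uint8_t       buf[FRAME_SIZE];
        size_t        len;
        volatile bool ready;     /* set when buf holds data not yet consumed */
    };

    static struct stage stage_a, stage_b;

    /* Move data from A to B only when B has already been drained. */
    static bool pipeline_step(void)
    {
        if (!stage_a.ready || stage_b.ready)
            return false;                     /* nothing to do, or B still busy */

        memcpy(stage_b.buf, stage_a.buf, stage_a.len);
        stage_b.len   = stage_a.len;
        stage_b.ready = true;                 /* B now owns valid data          */
        stage_a.ready = false;                /* A is free for new DMA data     */
        return true;
    }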

maximum allocated memory by linux-kernel module

I want to write a module whose task is to capture incoming packets without sending them to the user-space application, do some modification on the captured packets, and then hand them back to the NIC for transmission.
The main problem is that my module is very big and does a lot of processing. So is it better to do this processing inside the kernel module, or should the information and packets be passed to user space for processing to avoid complexity?
I am doing this only to make packet processing very fast.
So, what is the maximum amount of memory that can be allocated by a Linux kernel module?
Packet processing will always be faster in kernel space than in user space. Remember that the packet has to be copied to user space, which is an expensive operation. However, not everything should run in kernel space, as that would make the system very unstable, because every bug is a potential kernel crash.
So whether you should put your application in kernel or user space depends heavily on your requirements.
In contrast, the amount of memory to be allocated is unlikely to be the limiting factor. With kmalloc() a module can keep allocating physically contiguous chunks (each call capped at KMALLOC_MAX_SIZE, a few MiB on most configurations) until physical memory runs out, and vmalloc() can serve much larger virtually contiguous buffers, so you should be fine.
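A hedged sketch of that allocation choice inside a module (kvmalloc() is the modern shorthand for exactly this fallback):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *alloc_work_buffer(size_t size)
    {
        /* kmalloc: physically contiguous, but capped at KMALLOC_MAX_SIZE
         * and increasingly likely to fail as the size grows.              */
        if (size <= KMALLOC_MAX_SIZE) {
            void *p = kmalloc(size, GFP_KERNEL);
            if (p)
                return p;
        }

        /* vmalloc: only virtually contiguous, but can be much larger. */
        return vmalloc(size);
    }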

Zero-copy with and without Scatter/Gather operations

I just read an article that explains the zero-copy mechanism.
It talks about the difference between zero-copy with and without Scatter/Gather supports.
The article shows the data-copy steps for a NIC without SG support and for a NIC with SG support (diagrams not reproduced here).
In a word, zero-copy with SG support can eliminate one CPU copy.
My question is that why data in kernel buffer could be scattered?
Because the Linux kernel's mapping / memory allocation facilities by default will create virtually-contiguous but possibly physically-disjoint memory regions.
That means the read from the filesystem which sendfile() does internally goes to a buffer in kernel virtual memory, which the DMA code has to "transmogrify" (for lack of a better word) into something that the network card's DMA engine can grok.
Since DMA (often but not always) uses physical addresses, that means you either duplicate the data buffer (into a specially-allocated, physically contiguous region of memory, your socket buffer above), or else transfer it one physical page at a time.
If your DMA engine, on the other hand, is capable of aggregating multiple physically disjoint memory regions into a single data transfer (that's called "scatter-gather"), then instead of copying the buffer you can simply pass a list of physical addresses (pointing to physically contiguous sub-segments of the kernel buffer; that's your aggregate descriptors above), and you no longer need to start a separate DMA transfer for each physical page. This is usually faster, but whether it can be done depends on the capabilities of the DMA engine.
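Conceptually, that "list of physical addresses" handed to a scatter-gather capable DMA engine is just an array of (address, length) descriptors; a purely hypothetical layout:

    #include <stdint.h>

    /* Hypothetical scatter-gather descriptor list for a DMA engine. */
    struct sg_descriptor {
        uint64_t phys_addr;   /* physical address of one contiguous piece */
        uint32_t length;      /* length of that piece in bytes            */
        uint32_t flags;       /* e.g. "last descriptor of this packet"    */
    };

    /* One network frame scattered across three physical pages: the NIC
     * walks the array and gathers the pieces into a single transmit unit. */
    struct sg_descriptor pkt[3] = {
        { 0x1a2b3000, 4096, 0 },
        { 0x77f10000, 4096, 0 },
        { 0x09c44000,  512, 1 /* LAST */ },
    };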
Re: My question is that why data in kernel buffer could be scattered?
Because it already is scattered. The data queue in front of a TCP socket is not divided into the datagrams that will go out onto the network interface. Scatter allows you to keep the data where it is and not have to copy it to make a flat buffer that is acceptable to the hardware.
With the gather feature, you can give the network card a datagram which is broken into pieces at different addresses in memory, which can be references to the original socket buffers. The card will read it from those locations and send it as a single unit.
Without gather (i.e. hardware that requires simple, linear buffers), a datagram has to be prepared as a contiguously allocated byte string, and all the data that belongs to it has to be memcpy'd into place from the buffers queued for transmission on the socket.
Because when you write to a socket, the headers of the packet are assembled in a different place from your user-data, so to be coalesced into a network packet, the device needs "gather" capability, at least to get the headers and data.
Also to avoid the CPU having to read the data (and thus, fill its cache up with useless stuff it's never going to need again), the network card also needs to generate its own IP and TCP checksums (I'm assuming TCP here, because 99% of your bulk data transfers are going to be TCP). This is OK, because nowadays they all can.
What I'm not sure about is how this all interacts with TCP_CORK.
Most protocols tend to have their own headers, so a hypothetical protocol looks like:
Client: Send request
Server: Send some metadata; send the file data
So we tend to have a server application assembling some headers in memory, issuing a write(), followed by a sendfile()-like operation. I suppose the headers still get copied into a kernel buffer in this case.
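In practice that server pattern, corked so the header write and the sendfile() payload can share segments where possible, looks roughly like this:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send an application header followed by file data over a TCP socket. */
    static int send_response(int sock, int file_fd, const char *hdr,
                             size_t hdr_len, size_t file_len)
    {
        int one = 1, zero = 0;
        off_t off = 0;

        /* Cork the socket so the header is not flushed as a tiny segment. */
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));

        if (write(sock, hdr, hdr_len) < 0)
            return -1;

        /* File data goes kernel-to-socket; no user-space copy of the payload. */
        if (sendfile(sock, file_fd, &off, file_len) < 0)
            return -1;

        /* Uncork: flush whatever is still buffered. */
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof(zero));
        return 0;
    }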
