I want to write a module whose task is to capture incoming packets without sending them to the user-space application, do some modification on the captured packets, and then hand the modified packets back to the NIC for transmission.
The main problem is that my module is very big and does a lot of processing. So is it better to do this processing inside the kernel module, or should I pass the packet and the related information to user space for processing to avoid the complexity?
I am doing this only to make packet processing as fast as possible.
Also, what is the maximum amount of memory a Linux kernel module can allocate?
Packet processing will always be faster in kernel space than in user space. Remember that every packet would have to be copied to user space, which is an expensive operation. However, not everything should run in kernel space, as that makes the system less stable: every bug is a potential kernel crash.
So whether you should implement your application in kernel space or user space depends heavily on your requirements.
The amount of memory to be allocated, in contrast, matters far less. Using kmalloc() in a kernel module you can allocate memory as long as it is physically available (keeping in mind that kmalloc() returns physically contiguous memory, so very large buffers are usually better served by vmalloc()), so you should be fine.
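As a rough illustration (a minimal, hypothetical module, not your driver): kmalloc() is the right tool for small, physically contiguous allocations, while vmalloc() is the usual choice for large buffers that only need to be virtually contiguous.

```c
/* Minimal allocation sketch for a kernel module. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *small_buf;
static void *big_buf;

static int __init alloc_demo_init(void)
{
	small_buf = kmalloc(4096, GFP_KERNEL);   /* physically contiguous        */
	big_buf   = vmalloc(16 * 1024 * 1024);   /* 16 MB, virtually contiguous  */
	if (!small_buf || !big_buf) {
		kfree(small_buf);                /* both calls are NULL-safe */
		vfree(big_buf);
		return -ENOMEM;
	}
	return 0;
}

static void __exit alloc_demo_exit(void)
{
	kfree(small_buf);
	vfree(big_buf);
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");
```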
My goal is to create a DPDK app which will act as a middle-man between a virtual machine manager (which in my case is a userspace process) and the NIC hardware.
So far I have tried to do something on a smaller scale. Instead of using the VMM, I created a dummy process written in C.
I managed to "bind" the dummy process to the DPDK process using named semaphores and shared memory.
Basically in this little demo, the DPDK app is reading from the RX buffer and puts the content on the shared memory. Then the dummy process gets the data and prints it to stdout.
All of the DPDK support for multi-process communication targets the specific case where both apps are using the DPDK libraries.
I am wondering if there is some sort of support for the case where one app is not using those libraries.
Why? Because the VMM is written in Rust and I do not know how to add DPDK libraries in Rust.
What do you think would be the most efficient way of communicating?
I was wondering if it is possible to put the mempool inside the shared memory and access the mbufs directly from the dummy process.
I am currently using DPDK 20.11 on Ubuntu 20.04.
Thank you!
UPDATE 1:
Is your question "Can I interface/interact a DPDK application with a non-DPDK application?"
What I am actually struggling to find is this: how do I efficiently move data received on the RX buffer to a non-DPDK app?
My current approach is this: https://imgur.com/a/cF2lq29
That is the main logic loop for a DPDK app which gets data from the RX buffer and sends it to the "non-DPDK app".
This is how it happens (a rough sketch of the loop is shown after the list):
Read data from the RX buffer.
Wait until the non-DPDK app says "I am not using the shared memory, you can write to it".
Write to the shared memory (nb_rx is written instead of the whole packet, just for simplicity).
Signal the non-DPDK app that the shared memory is now available to be read.
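A minimal sketch of that loop (the semaphore names, shm_ptr, and the port/queue ids are illustrative, not taken from the actual code):

```c
/* Illustrative handoff loop: rx_burst, wait for the consumer to release the
 * shared region, copy into it, then signal the consumer. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <semaphore.h>
#include <string.h>

#define BURST_SIZE 32

static void forward_loop(uint16_t port_id, void *shm_ptr,
                         sem_t *sem_free, sem_t *sem_data)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        sem_wait(sem_free);                      /* consumer is done with the region */
        memcpy(shm_ptr, &nb_rx, sizeof(nb_rx));  /* demo: only nb_rx is copied       */
        sem_post(sem_data);                      /* tell the consumer data is ready  */

        for (uint16_t i = 0; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```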
As one can see, it is not quite efficient and I am afraid my synchronization method will create a bottleneck.
So this makes me wonder: are there any better, by-the-book ways of accomplishing this communication?
There are 3 ways to solve the HOST to GUEST/Docker problem.
Common way: run all physical NICs in a DPDK application such as SPP, OVS, VPP, or a DPDK primary-secondary setup, and leverage virtio, the vhost library, memif, or a shared mmap'd hugepage to provide copy/zero-copy mode to the VM/Docker.
Complex copy way: create a shared memory region between the DPDK application on the host and the non-DPDK application running in the HOST/GUEST/Docker.
Mimic zero-copy way: the non-DPDK application creates DMA buffer areas in shared memory at a fixed location. The DPDK application uses external memory buffers (external MBUF) for the physical ports. A DPDK PMD that supports external MBUFs can then DMA the packet directly to the shared area.
Since options 2 and 3 are not common, let me explain how you might develop such a solution.
Option-2:
Develop a simple non-DPDK application that uses a shared mmap area; divide the area into fixed-size slots (maximum packet size), and dedicate one half to TX and the other half to RX.
Initialize the DPDK application to make use of the mmap'd area.
Maintain packet access with atomic head and tail pointers.
After rx_burst, when the DPDK application receives packets, it gets a free slot by querying the head pointer and then memcpy's each packet into a specific slot. Once done, the DPDK application invokes rte_pktmbuf_free.
The non-DPDK application uses the tail pointer to fetch valid RX packets from shared memory.
Perform a similar operation for TX using a separate region with its own head/tail pointers (a sketch of such a ring follows this list).
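A minimal sketch of such a shared ring (slot size, slot count, and structure names are illustrative assumptions, not part of any DPDK API):

```c
/* Illustrative single-producer/single-consumer ring living in shared memory.
 * The DPDK side copies packets into slots at 'head'; the non-DPDK side
 * consumes them at 'tail'. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 2048   /* assumed max packet size */
#define NUM_SLOTS 1024   /* must be a power of two  */

struct pkt_slot {
    uint16_t len;
    uint8_t  data[SLOT_SIZE];
};

struct shm_ring {
    _Atomic uint32_t head;   /* advanced by the producer (DPDK app) */
    _Atomic uint32_t tail;   /* advanced by the consumer (non-DPDK) */
    struct pkt_slot  slots[NUM_SLOTS];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_enqueue(struct shm_ring *r, const void *pkt, uint16_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == NUM_SLOTS)
        return -1;                                   /* full */

    struct pkt_slot *slot = &r->slots[head & (NUM_SLOTS - 1)];
    slot->len = len;
    memcpy(slot->data, pkt, len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns the packet length, or -1 if the ring is empty. */
static int ring_dequeue(struct shm_ring *r, void *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return -1;                                   /* empty */

    struct pkt_slot *slot = &r->slots[tail & (NUM_SLOTS - 1)];
    int len = slot->len;
    memcpy(out, slot->data, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return len;
}
```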
Disadvantages:
packet throughput is heavily reduced.
copying packet memory burns CPU cycles.
a common library is needed to maintain the indexes and head/tail pointers for RX and TX.
memory is over-provisioned for the largest packet size, since traffic is not predictable.
Option-3:
Create the shared area with the posix_memalign API, as multiple regions of 2000 bytes each.
Use a simple data structure (descriptor) to hold {virtual address, physical address, length, value}.
Create the SHM area with each index populated in the above format.
Initialize the DPDK application to access both the SHM and the mmap'd area.
In the DPDK application, populate rte_pktmbuf_pool_create_extbuf, where ext_mem represents the DMA region backed by the SHM physical addresses (not sure if this will work, as the API was originally intended for a different purpose).
Register the free callback handler to do garbage collection once we rx_burst the packet.
For TX there are 2 options: a) the easiest way is to simply copy the buffer into an rte_mbuf, and b) create an indirect buffer attaching the rte_mbuf to the external buffer and wait till the NIC has actually sent the packet (via the completion queue). A sketch of the extbuf pool setup follows this list.
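A rough sketch of the extbuf pool setup (the shared-memory pointer, IOVA, and sizes are assumptions; whether a given PMD accepts this external area still has to be verified):

```c
/* Illustrative: build an mbuf pool whose data buffers live in an externally
 * managed (shared) memory area via rte_pktmbuf_pool_create_extbuf(). */
#include <rte_mbuf.h>
#include <rte_lcore.h>

static struct rte_mempool *
make_extbuf_pool(void *shm_va, rte_iova_t shm_iova, size_t shm_len)
{
    struct rte_pktmbuf_extmem ext_mem = {
        .buf_ptr  = shm_va,      /* VA of the shared area      */
        .buf_iova = shm_iova,    /* IOVA/PA the NIC can DMA to */
        .buf_len  = shm_len,     /* total size of the area     */
        .elt_size = 2048,        /* per-packet slot size       */
    };

    return rte_pktmbuf_pool_create_extbuf("shm_pool",
                                          4096,              /* number of mbufs */
                                          256,               /* cache size      */
                                          0,                 /* priv size       */
                                          ext_mem.elt_size,  /* data room size  */
                                          rte_socket_id(),
                                          &ext_mem, 1);
}
```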
Disadvantages of option-3:
a complex way of doing zero copy on the RX side.
copy mode is the easiest method to implement.
buffer management is fragile.
it assumes a single thread for RX-TX.
Recommendation: if the intention is not to use VPP, SPP, or OVS, then the simplest approach is the DPDK primary-secondary model, where all RX and TX mbufs are available to both processes because they are mmap'd in hugepages.
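For completeness, a minimal sketch of how a primary and a secondary DPDK process could hand mbufs to each other through a named ring (names and sizes are illustrative):

```c
/* Illustrative primary/secondary handoff: the primary creates a ring and
 * enqueues received mbuf pointers; the secondary looks the ring up by name
 * and dequeues them (the mbufs live in hugepage memory mapped by both). */
#include <rte_eal.h>
#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

static struct rte_ring *get_pkt_ring(void)
{
    if (rte_eal_process_type() == RTE_PROC_PRIMARY)
        return rte_ring_create("PKT_RING", 1024, rte_socket_id(),
                               RING_F_SP_ENQ | RING_F_SC_DEQ);
    return rte_ring_lookup("PKT_RING");
}

/* Primary: hand one received mbuf to the secondary. */
static void hand_off(struct rte_ring *ring, struct rte_mbuf *m)
{
    if (rte_ring_enqueue(ring, m) != 0)
        rte_pktmbuf_free(m);   /* ring full: drop the packet */
}

/* Secondary: consume one mbuf, then release it back to the pool. */
static void consume(struct rte_ring *ring)
{
    struct rte_mbuf *m;

    if (rte_ring_dequeue(ring, (void **)&m) == 0) {
        /* ... process rte_pktmbuf_mtod(m, void *) ... */
        rte_pktmbuf_free(m);
    }
}
```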
I have a Linux driver that does DMA transfers to/from a device. For sending data to the device (to prevent copy operations) the driver maps the userspace buffer and uses it for DMA directly via get_user_pages_fast(). The user pages are then added to a scatter-gather list and used for DMA.
This works rather well, but the one issue is that it forces the userspace buffer to meet alignment requirements based on the CPU cache line. My system returns 128 when you call dma_get_cache_alignment(), which means that in userspace I have to ensure that the start address is aligned to this value. I also have to check that the buffer size is a multiple of 128.
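For context, the mapping path looks roughly like this (a simplified sketch, not the actual driver; error unwinding and page unpinning are omitted, and on older kernels get_user_pages_fast() takes a write flag instead of gup_flags):

```c
/* Sketch of the zero-copy path: pin the user pages, build a scatter-gather
 * table, and map it for DMA. */
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

static int map_user_buf(struct device *dev, unsigned long uaddr, size_t len,
			struct sg_table *sgt)
{
	unsigned int offset = uaddr & ~PAGE_MASK;
	int nr_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
	struct page **pages;
	int pinned;

	pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	pinned = get_user_pages_fast(uaddr & PAGE_MASK, nr_pages,
				     FOLL_WRITE, pages);
	if (pinned != nr_pages)
		return -EFAULT;          /* real code must unpin/free here */

	if (sg_alloc_table_from_pages(sgt, pages, nr_pages, offset, len,
				      GFP_KERNEL))
		return -ENOMEM;

	return dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE) ? 0 : -EIO;
}
```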
I see two options for handling this:
Deal with it. That is, in userspace ensure that the buffer is properly aligned. This sounds reasonable, but I have run into some issues since my device has to be integrated into a larger project, and I don't have control over the buffers that get passed to me. As a result, I have to allocate a properly aligned buffer in userspace to sit between the driver and the application and use that buffer in the event the caller's buffer is not aligned. This adds a copy operation and isn't the end of the world, but the resulting code is rather messy.
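A small illustration of such an intermediate buffer (128 is the dma_get_cache_alignment() value mentioned above; the function name is hypothetical):

```c
/* Userspace bounce buffer: aligned to the DMA cache-line requirement and
 * padded up to a multiple of it. */
#include <stdlib.h>
#include <string.h>

#define DMA_ALIGN 128   /* value reported by dma_get_cache_alignment() */

void *make_dma_copy(const void *user_buf, size_t len, size_t *out_len)
{
    size_t padded = (len + DMA_ALIGN - 1) & ~(size_t)(DMA_ALIGN - 1);
    void *aligned;

    if (posix_memalign(&aligned, DMA_ALIGN, padded) != 0)
        return NULL;

    memcpy(aligned, user_buf, len);   /* the extra copy mentioned above */
    *out_len = padded;
    return aligned;
}
```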
Rework the driver to use a kernel space buffer. That is, change the code such that the driver uses copy_from_user() to move the data into a properly aligned kernel space buffer. I'm not too concerned about the performance here, so this is an option, but would require a good amount of rework.
Is there anything that I'm missing? I'm hoping that there might be some magic flag or something that I overlooked to remove the alignment requirement altogether.
I just started to work in a new company and I'm new in the embedded world.
They gave me a task, I have done it and it's working but I don't know if I did it the right way.
I will describe the task and what I have done.
I was asked to hide a small piece of the DDR from the Linux OS so that a HW feature can write something into this reserved piece of memory. After that, I need to be able to read this piece of memory out to a file.
To hide a chunk of the DDR from Linux, I just changed the kernel memory boot argument to be equal to the real memory size minus (the size I needed + a small safety margin). I got this idea, and the idea for the driver I will describe in a second, from this post.
After that, Linux sees less memory than the HW actually has, the top section of the DDR is hidden from the kernel, and I can use it for my storage without worry.
I think I have done this part right, which is not something I can say about the next part.
For the next part, to be able to read this reserved piece of DDR, I wrote a char device driver. It works: it reads the reserved DDR chunk out to a file piece by piece, each piece no larger than some size I decided on. I can't do it in one copy because that would require allocating a big buffer, and I don't have enough RAM for that.
Now I have read about block devices and started to think that maybe a block device fits my program better, but I'm not really sure: first, it's working, and if it's not broken... second, I have never written a block device driver (I had also never written a char device driver until the one described above), so I'm not sure whether this is the time to use a block device over a char device.
This depends on the intended use, but according to your description a character device is much more likely to be what you want. The difference:
a character device takes simple read and write commands and gets no help from the kernel. This is suitable for reading from or writing to devices (and anything that resembles a device, whether it is an actual stream that is read sequentially or something that supports 'seek' and can read the same data over and over again).
a block device hooks into the kernel's memory paging system and is capable of serving as a back-end for virtual memory pages. It can host a swap space, be the storage for a file system, etc. It is a much more complex beast than a character device. You need this only for something that stores a large amount of data that needs to be accessed by mapping it into the address space of a process (normally this is needed only if you put a file system on it).
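As an illustration of why the char device fits, here is a condensed sketch of the kind of read handler described in the question (the base address, size, and names are made up; memremap() is used for the reserved RAM, although older kernels would use ioremap() instead, and the char device registration itself is omitted):

```c
/* Sketch of a char-device read handler exposing a reserved DDR region. */
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/uaccess.h>
#include <linux/module.h>

#define RESERVED_BASE 0x3F000000UL        /* hypothetical physical address */
#define RESERVED_SIZE (2 * 1024 * 1024)   /* hypothetical reserved size    */

static void *res_va;   /* kernel mapping of the reserved region */

static ssize_t res_read(struct file *f, char __user *buf,
			size_t count, loff_t *ppos)
{
	if (*ppos >= RESERVED_SIZE)
		return 0;                               /* EOF */
	if (count > RESERVED_SIZE - *ppos)
		count = RESERVED_SIZE - *ppos;          /* clamp to the region */
	if (copy_to_user(buf, res_va + *ppos, count))
		return -EFAULT;
	*ppos += count;
	return count;
}

static const struct file_operations res_fops = {
	.owner = THIS_MODULE,
	.read  = res_read,
};

static int __init res_init(void)
{
	res_va = memremap(RESERVED_BASE, RESERVED_SIZE, MEMREMAP_WB);
	if (!res_va)
		return -ENOMEM;
	/* register_chrdev()/cdev setup for res_fops omitted for brevity */
	return 0;
}

module_init(res_init);
MODULE_LICENSE("GPL");
```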
I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using the alloc_bootmem_low_pages call, because it is not possible to allocate a buffer larger than 4 MB using other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer space to 22MB+192kB. As I've read, the Linux kernel has only 16MB of DMA capable memory. Therefore, theoretically this is not possible unless there is a way to tweak this setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.
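In code that change is essentially a one-liner (the size is taken from the question; the call still has to happen at boot time, while the bootmem allocator is active):

```c
#include <linux/bootmem.h>

#define CAPTURE_BUF_SIZE ((22 * 1024 * 1024) + (192 * 1024))  /* 22MB + 192kB */

static void *capture_buf;

/* Called early at boot, while the bootmem allocator is still in use. */
void __init reserve_capture_buffer(void)
{
	/* Before: restricted to ZONE_DMA (first 16MB on affected platforms) */
	/* capture_buf = alloc_bootmem_low_pages(CAPTURE_BUF_SIZE); */

	/* After: allocated without the ZONE_DMA restriction */
	capture_buf = alloc_bootmem_pages(CAPTURE_BUF_SIZE);
}
```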
It looks like my first attempt at an answer was a little too generic. I think that for the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter which can be made as large as 32Megs. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture it seems safe to modify it, as it looks like it's not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32Meg and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of those will work.
Otherwise, I think I'm out of ideas.
I want to create a single kernel module driver for my application.
It interfaces with an AXIS FIFO in Programmable logic and I need to send the physical addresses of allocated memory to this device to be used in programmable logic.
My platform driver recognises the AXIS FIFO device, and using mmap makes its registers available to my user space app. (previous post of mine)
I also want to allocate memory to be used by the programmable logic, and I do this by using an IOCTL command which calls kmalloc with the given size as its argument. Since I want to use the physical address, I get it using __pa(x).
If I want to access this allocated memory to verify that the correct info was stored in RAM, how do I do this? Through
fd = open("/dev/mem", ...)
va = mmap (phys_address, ....)
The problem I have with this is that I can still wrongfully access parts of memory that I shouldn't. Is there a better way to do this?
Thanks.
I think the best way to do this is to create a /proc device file that maps to the allocated memory. Your kernel module kmalloc's the memory, creates the proc device, and services all the I/O calls to the device. Your userspace program reads and writes to this device, or possibly mmaps to it (if that will work, I'm not sure...).
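A rough sketch of that idea (names and sizes are made up; on kernels 5.6 and newer proc_create() takes a struct proc_ops instead of file_operations, and for a fully robust mmap of kmalloc'd memory the pages are usually also marked reserved, which is skipped here). The mmap path uses remap_pfn_range() on the kmalloc'd buffer, which also avoids going through /dev/mem:

```c
/* Illustrative /proc entry exposing a kmalloc'd buffer via read and mmap. */
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/mm.h>
#include <linux/io.h>

#define BUF_SIZE (64 * 1024)   /* hypothetical buffer size */

static void *dma_buf;          /* buffer whose physical address goes to the PL */

static ssize_t buf_read(struct file *f, char __user *ubuf,
			size_t count, loff_t *ppos)
{
	return simple_read_from_buffer(ubuf, count, ppos, dma_buf, BUF_SIZE);
}

static int buf_mmap(struct file *f, struct vm_area_struct *vma)
{
	unsigned long pfn = virt_to_phys(dma_buf) >> PAGE_SHIFT;

	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}

static const struct file_operations buf_fops = {
	.owner = THIS_MODULE,
	.read  = buf_read,
	.mmap  = buf_mmap,
};

static int __init buf_init(void)
{
	dma_buf = kmalloc(BUF_SIZE, GFP_KERNEL);
	if (!dma_buf)
		return -ENOMEM;
	if (!proc_create("axis_dma_buf", 0444, NULL, &buf_fops)) {
		kfree(dma_buf);
		return -ENOMEM;
	}
	return 0;
}

static void __exit buf_exit(void)
{
	remove_proc_entry("axis_dma_buf", NULL);
	kfree(dma_buf);
}

module_init(buf_init);
module_exit(buf_exit);
MODULE_LICENSE("GPL");
```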