get_user_pages_fast() for DMA?

I have a Linux driver that does DMA transfers to/from a device. For sending data to the device (to prevent copy operations) the driver maps the userspace buffer and uses it for DMA directly via get_user_pages_fast(). The user pages are then added to a scatter-gather list and used for DMA.
This works rather well, but the one issue is that it forces the userspace buffer to meet alignment requirements tied to the CPU cache line. My system returns 128 when you call dma_get_cache_alignment(), which means that in userspace I have to ensure that the start address is aligned to this value. I also have to ensure that the buffer size is a multiple of 128.
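Concretely, the path looks roughly like this (a simplified sketch of what the driver does; the helper name is mine, error unwinding is omitted, and on older kernels get_user_pages_fast() takes a write/force flag instead of gup_flags):

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    /* Pin a user buffer and build a scatter-gather list for a to-device DMA. */
    static int map_user_buffer(struct device *dev, unsigned long uaddr, size_t len,
                               struct sg_table *sgt, struct page ***pages_out)
    {
        unsigned long first = uaddr >> PAGE_SHIFT;
        unsigned long last = (uaddr + len - 1) >> PAGE_SHIFT;
        int nr_pages = last - first + 1;
        struct page **pages;
        int pinned, ret;

        pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
            return -ENOMEM;

        pinned = get_user_pages_fast(uaddr, nr_pages, 0, pages);  /* 0: pages are only read */
        if (pinned != nr_pages)
            return -EFAULT;                     /* real code must unpin and free here */

        ret = sg_alloc_table_from_pages(sgt, pages, nr_pages,
                                        offset_in_page(uaddr), len, GFP_KERNEL);
        if (ret)
            return ret;

        if (!dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE))
            return -EIO;

        /* Caller must later dma_unmap_sg(), sg_free_table(), put and free the pages. */
        *pages_out = pages;
        return 0;
    }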
I see two options for handling this:
Deal with it. That is, in userspace ensure that the buffer is properly aligned. This sounds reasonable, but I have run into some issues since my device has to be integrated into a larger project, and I don't have control over the buffers that get passed to me. As a result, I have to allocate a properly aligned buffer in userspace to sit between the driver and the application and use that buffer in the event the caller's buffer is not aligned. This adds a copy operation and isn't the end of the world, but the resulting code is rather messy.
Rework the driver to use a kernel space buffer. That is, change the code such that the driver uses copy_from_user() to move the data into a properly aligned kernel space buffer. I'm not too concerned about the performance here, so this is an option, but would require a good amount of rework.
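If I do end up going that second route, the core of it would presumably look something like this (a rough sketch only; names are illustrative and the cleanup path is trimmed):

    #include <linux/kernel.h>
    #include <linux/slab.h>
    #include <linux/uaccess.h>
    #include <linux/dma-mapping.h>

    /* Bounce the user data through a kernel buffer that meets the DMA alignment. */
    static int send_via_kernel_bounce(struct device *dev, const char __user *ubuf,
                                      size_t len)
    {
        size_t aligned = ALIGN(len, dma_get_cache_alignment());
        void *kbuf = kmalloc(aligned, GFP_KERNEL);  /* kmalloc honours ARCH_KMALLOC_MINALIGN */
        dma_addr_t dma;

        if (!kbuf)
            return -ENOMEM;
        if (copy_from_user(kbuf, ubuf, len)) {
            kfree(kbuf);
            return -EFAULT;
        }

        dma = dma_map_single(dev, kbuf, aligned, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, dma)) {
            kfree(kbuf);
            return -EIO;
        }
        /* ... start the transfer, then dma_unmap_single() and kfree() ... */
        return 0;
    }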
Is there anything that I'm missing? I'm hoping that there might be some magic flag or something that I overlooked to remove the alignment requirement altogether.

Related

Should I use block device over char device for reading and writing to memory?

I just started working at a new company and I'm new to the embedded world.
They gave me a task; I have done it and it's working, but I don't know if I did it the right way.
I will describe the task and what I have done.
I was asked to hide a small piece of the DDR from the Linux OS so that a HW feature can write something to this reserved piece of memory. After that, I need to be able to read this piece of memory out to a file.
To hide a chunk of the DDR from Linux, I just changed the Linux memory boot argument to the real memory size minus (the size I needed + a small margin for safety). I got this idea, and the idea for the driver I will describe in a second, from this post.
After that, Linux sees less memory than the HW actually has, the top section of the DDR is hidden from the kernel, and I can use it for my storage without worry.
I think I have done this part right, which is not something I can say about the next part.
For the next part, to be able to read the piece of DDR I reserved, I wrote a char device driver. It works: it reads the reserved DDR chunk out to a file piece by piece, each piece no larger than some value I decided on. I can't do it in one copy because that would require allocating a big buffer, and I don't have enough RAM for that.
Now I have read about block devices and started to think that maybe a block device fits my program better, but I'm not really sure: first, it's working, and if it's not broken... Second, I have never written a block device driver (I had never written a char device driver either, until the one described above), so I'm not sure whether this is the time to use a block device instead of a char device.
This depends on the intended use, but according to your description a character device is much more likely to be what you want. The difference:
a character device takes simple read and write commands and gets no help from the kernel. This makes it suitable for reading from or writing to devices (and anything that resembles a device), whether it is an actual stream that is read sequentially or something that supports 'seek' and can read the same data over and over again.
a block device hooks into the kernel's memory paging system and is capable of serving as a back-end for virtual memory pages. It can host a swap space, be the storage for a file system, etc. It is a much more complex beast than a character device. You need this only for something that stores a large amount of data that needs to be accessed by mapping it into the address space of a process (normally this is needed only if you put a file system on it).
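For completeness: the read() side of a char device that exposes a reserved DDR region can stay very small. Below is a minimal sketch (addresses and names here are hypothetical, and it assumes a reasonably recent kernel that provides memremap(); on older kernels ioremap() plus memcpy_fromio() is the rough equivalent):

    #include <linux/fs.h>
    #include <linux/io.h>
    #include <linux/uaccess.h>

    #define RESERVED_PHYS 0x3f000000UL   /* start of the hidden DDR chunk (example) */
    #define RESERVED_SIZE 0x00100000UL   /* its size (example) */

    static u8 *reserved_va;              /* kernel mapping, set up once in init */

    static ssize_t reserved_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *ppos)
    {
        if (*ppos >= RESERVED_SIZE)
            return 0;                            /* EOF */
        if (count > RESERVED_SIZE - *ppos)
            count = RESERVED_SIZE - *ppos;       /* clamp to the region */

        if (copy_to_user(buf, reserved_va + *ppos, count))
            return -EFAULT;

        *ppos += count;
        return count;
    }

    /* In init: reserved_va = memremap(RESERVED_PHYS, RESERVED_SIZE, MEMREMAP_WB); */

Userspace then just does something like cat /dev/mydev > dump.bin, and the piece-by-piece behaviour falls out of read() being called repeatedly with whatever count the reader chooses.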

Increasing Linux DMA_ZONE memory on ARM i.MX287

I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using "alloc_bootmem_low_pages" call, because it is not possible to allocate more than 4 MB buffer using other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer space to 22MB+192kB. As I've read, the Linux kernel has only 16MB of DMA capable memory. Therefore, theoretically this is not possible unless there is a way to tweak this setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.
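If it helps, the change is essentially a one-line swap in the early/board init code where the existing bootmem allocation lives (a sketch against the 2.6.x bootmem API; the size and function name are examples):

    #include <linux/init.h>
    #include <linux/bootmem.h>

    #define CAPTURE_BUF_SIZE ((22UL << 20) + (192UL << 10))   /* 22MB + 192kB */

    static void *capture_buf;

    void __init reserve_capture_buf(void)
    {
        /* Unlike alloc_bootmem_low_pages(), alloc_bootmem_pages() does not
         * insist on the low/DMA area, so a 22MB request can be satisfied
         * even when ZONE_DMA is only 16MB. */
        capture_buf = alloc_bootmem_pages(CAPTURE_BUF_SIZE);
    }

Note that bootmem allocations must still happen before the normal page allocator takes over, i.e. from the same early boot path the current alloc_bootmem_low_pages() call is made from.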
It looks like my first attempt at an answer was a little too generic. I think that for the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter which can be made as large as 32Megs. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture, it seems safe to modify it, as it looks like it's not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32Meg and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of those will work.
Otherwise, I think I'm out of ideas.

Pipelining a set of C buffers

I am creating Ethernet packets in an embedded system. I have my Data / IP and UDP packet headers defined in pre-allocated buffers and I have a large buffer that is used to grab data from the FPGA's fabric using DMA.
I also have some user data headers and footers where the data comes from the fabric in other ways, mostly SPI transfer of temperature, PCB address etc. Or even grabs of some of the configuration registers (single transaction, on-boot).
Now, at the moment I concatenate these using memcpy into a new, larger buffer (also pre-allocated), and then send that to the Transmit buffer of the on-FPGA MAC.
My issues:
1) All these buffers are on the FPGA and hence require memory. I could copy them one at a time into the MAC Tx buffer, but this would prevent my second idea.
2) Since these are all buffers, there is the possibility of forming a pipeline: new data (DN+1) can be put into the first buffers while subsequent buffers are storing and concatenating the data of (DN+0).
If I have nicely modularised code, how do I create a pipeline from buffer to buffer? In hardware I'd use flags, only passing data from Buffer A to B when Buffer B has finished passing its data to C. In C, memcpy and memmove only return the destination pointer, so I'd need to make my own boolean flag that is set after memcpy finishes, and I'd need to make these flags globals so that I can easily pass their status into other functions.
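In case it helps to make that concrete, here is a minimal sketch of such a staged handoff (the names stage_t and push_stage are purely illustrative):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct {
        unsigned char buf[1536];   /* large enough for one frame (example size) */
        size_t        len;
        volatile bool full;        /* set after data is copied in, cleared when drained */
    } stage_t;

    /* Move one frame from src to dst; returns false if dst has not drained yet. */
    static bool push_stage(stage_t *dst, stage_t *src)
    {
        if (dst->full || !src->full)
            return false;          /* downstream still busy, or nothing to move */

        memcpy(dst->buf, src->buf, src->len);
        dst->len  = src->len;
        dst->full = true;          /* flag only after the copy has completed */
        src->full = false;         /* upstream may now accept new data */
        return true;
    }

If the stages are advanced from an interrupt handler or by DMA completion, volatile alone is not a synchronisation primitive; the flag updates would still need interrupts disabled or whatever barrier the platform provides.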
Finally, as this is embedded, I don't have access to the full C libraries and both time and memory are at a premium.
Thanks
Ed

read() on Linux and page-aligned buffers

I was implementing an efficient text file loader and found some good advice from the author of GNU grep in this post:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
One of things he suggests is to do read() calls of page aligned blocks of data into page aligned buffers. Apparently this allows the kernel to avoid some extra buffering.
I've been searching and I haven't heard anyone else back up this claim. Is it true that calling read() into a page-aligned buffer (perhaps allocated with mmap/posix_memalign etc.) is actually more efficient? If it's not true, is it something that used to be true? Does it heavily depend on the underlying file system or other factors like that?
Thanks!
Normally, read() will read into a kernel buffer, then copy it to user space. This extra copy is what is being discussed.
Linux supports "direct I/O" via the O_DIRECT flag to open(). This will skip kernel buffering and read directly into the userspace buffer. However, this direct I/O requires aligned accesses and buffers. So I don't think the author of that post meant that magic happens when you're aligned, but rather that if you align carefully, you can use "closer-to-the-metal" techniques to extract more performance.
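For illustration, a minimal sketch of the O_DIRECT variant (Linux-specific; "testfile" is a placeholder, and the exact alignment requirement depends on the filesystem and device, commonly 512 bytes or the 4096-byte block size used here):

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t blk = 4096;
        void *buf;

        if (posix_memalign(&buf, blk, 64 * blk))   /* page-aligned buffer */
            return 1;

        int fd = open("testfile", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        /* Offset, length and buffer address are all multiples of blk. */
        ssize_t n = read(fd, buf, 64 * blk);

        close(fd);
        free(buf);
        return n < 0;
    }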
mmap() is a much easier way to get the same effect. When the mapping is first set up, no I/O happens. When the user first accesses a page in the mapping, a page fault is triggered, which the kernel handles by allocating the user's page and performing the I/O to fill it. No copy. But again, the I/O happens in page-sized chunks, on page-aligned boundaries.
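Again for illustration, the mmap() route looks like this (a sketch; "testfile" is a placeholder):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Each new page touched here triggers a fault that the kernel services
         * by reading the file directly into that page -- no extra copy. */
        size_t newlines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            newlines += (p[i] == '\n');
        printf("%zu lines\n", newlines);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }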
Whether this is a big deal or not depends on how fast memory copies are relative to the I/O, and what proportion of CPU time is spent copying rather than doing real work. A web server, for instance, often doesn't even have to look at what it's reading: it just writes it back out to a socket (which incurs another copy). That's why a bunch of work has gone into "zerocopy" techniques like the sendfile() and splice() system calls. These are specialized workloads, though. Normally, the buffering is too small an effect to worry about.

read() system call does a copy of data instead of passing the reference

The read() system call causes the kernel to copy the data instead of passing the buffer by reference. I was asked the reason for this in an interview. The best I could come up with were:
To avoid concurrent writes on the same buffer across multiple processes.
If the user-level process tries to access a buffer mapped to kernel virtual memory area it will result in a segfault.
As it turns out, the interviewer was not entirely satisfied with either of these answers. I would greatly appreciate it if anybody could elaborate on the above.
A zero copy implementation would mean the user level process would have to be given access to the buffers used internally by the kernel/driver for reading. The user would have to make an explicit call to the kernel to free the buffer after they were done with it.
Depending on the type of device being read from, the buffers could be more than just an area of memory. (For example, some devices could require the buffers to be in a specific area of memory. Or they could only support writing to a fixed area of memory given to them at startup.) In this case, failure of the user program to "free" those buffers (so that the device could write more data to them) could cause the device and/or its driver to stop functioning properly, something a user program should never be able to do.
The buffer is specified by the caller, so the only way to get the data there is to copy them. And the API is defined the way it is for historical reasons.
Note that your two points above are not a problem for the alternative, mmap, which does pass the buffer by reference (and writing to it then writes to the file, so you can't process the data in place, whereas many users of read do just that).
I might have been prepared to dispute the interviewer's assertion. The buffer in a read() call is supplied by the user process and therefore comes from the user address space. It's also not guaranteed to be aligned in any particular way with respect to page frames. That makes it tricky to do what is necessary to perform I/O directly into the buffer, i.e. map the buffer into the device driver's address space or wire it down for DMA. However, in limited circumstances, this may be possible.
I seem to remember that the BSD subsystem Mac OS X uses to copy data between address spaces had an optimisation in this respect, although I may be completely mistaken.
