My embedded ARM device has a 800x480 16 bit Linux framebuffer LCD which needs to be double-buffered manually.
At the moment I'm just using memcpy() to write the double buffer to the framebuffer which is awfully slow. A while(1){memcpy(lfb,dbuf)} loop maxes out the CPU at 100% and updates at approx 40 FPS.
The ARM device I'm using, and the Linux kernel does support DMA memory-memory copy, but I'm having trouble working out how I can access this in a userspace program.
It seems linux/dmaengine.h and dma_async_memcpy_buf_to_buf() is what I need to use, but it appears these are only available from within the kernel?
The proper way of what you are doing is not to copy memory at all, but just use your embedded display controller feature to read next frame from another mem buffer.
Ensure you SoC has/doesn't have this feature. I'm 80% sure it has.
this information, from: http://pandorawiki.org/Kernel_interface
may be helpful>
framebuffer interface
Framebuffers can be accessed Linux fbdev interface:
fbdev = open("/dev/fb0", O_RDWR);
buffer = mmap(0, 800*480*2, PROT_READ | PROT_WRITE, MAP_SHARED, fbdev, 0);
(this is basic example, no error checks)
the returned pointer can be used to draw on the screen.
Be sure to #include to get access to the FB device ioctl interface, and for access to ioctl itself.
double buffering
This can be achieved using FBIOPAN_DISPLAY ioctl system call. For this you need to mmap framebuffer of double size
buffer1 = mmap(0, 800*480*2 * 2, PROT_WRITE, MAP_SHARED, fbdev, 0);
buffer2 = (char *)mem + 800*480*2;
Then to display buffer2 you would call:
struct fb_var_screeninfo fbvar;
ioctl(fbdev, FBIOGET_VSCREENINFO, &fbvar);
fbvar.yoffset = 480;
ioctl(fbdev, FBIOPAN_DISPLAY, &fbvar);
going back to buffer1 would be repeating above with fbvar.yoffset = 0. Tripple or quad buffering can be implemented using the same technique.
Related
I have a PCIe endpoint device connected to the host. The ep's (endpoints) 512MB BAR is mmapped and memcpy is used to transfer data. Memcpy is quite slow (~2.5s). When I don't map all of the BAR (100bytes), but run memcpy for the full 512MB, I get a segfault within 0.5s, however when reading back the end of the BAR, the data shows the correct data. Meaning that the data reads the same as if I did mmap the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC)
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC)
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem
https://github.com/billfarrow/pcimem
Edit: I was being dumb while consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore my conclusion that memcpy was completing after the segfault was false. I am accepting the answer as it provided me useful information.
Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)
// remove the output file
unlink(filename);
// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT,0644)
// prevent segfault by providing space in the file
ftruncate(ofile_fd,SIZE);
map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);
// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif
rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, SIZE);
if (munmap(map_base, map_size) == -1) {
PRINT_ERROR;
}
// close the output file
close(ofile_fd);
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned--which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to. –
userYou
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has requested. It also shows the turnaround time (e.g. Bus read request time until data packet from device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial-and-error and we (I) tried some 27 different combinations before deciding on the optimum config
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code. Either the library function and/or code that it may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>
#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif
#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif
// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif
typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;
// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst,qmemcpy_pc src,size_t size)
{
size_t cnt;
size_t idx;
cnt = size / sizeof(qmemcpy_t);
size = cnt * sizeof(qmemcpy_t);
if (OPT_QCPYIDX) {
for (idx = 0; idx < cnt; ++idx)
dst[idx] = src[idx];
}
else {
for (; cnt > 0; --cnt, ++dst, ++src)
*dst = *src;
}
return size;
}
// qmemcpy -- fast memcpy
void
qmemcpy(void *dst,const void *src,size_t size)
{
size_t xlen;
// use fast memcpy for aligned size
if (OPT_QMEMCPY > 0) {
xlen = _qmemcpy(dst,src,size);
src += xlen;
dst += xlen;
size -= xlen;
}
// copy remainder with ordinary memcpy
if (size > 0)
memcpy(dst,src,size);
}
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of cuda
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part. –
userYou
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory.
A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form), would read in bytes from the firmware/microcode .xsvf file used (e.g.) fscanf(fi,"%c",&myint) to get the bytes.
This was horrible. I refactored the utility to fix that and the processing of the state machine and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.
The issue that I am experimenting is not related with open() or mmap() function, which are executed properly. I have disabled CONFIG_STRICT_DEVMEM in the kernel so I can read from /dev/mem without problems. Actually, I can do the following:
const char *path = "/dev/mem"
int fd = open(path, O_RDWR); /* read and write flags*/
p = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BASE_ADDR); /* read and write flags*/
And the code does not fail. Nonetheless, I am using this code to write in the PCI address space. So, basically the BASE_ADDR is 0xc000000, and the size is 256 MiB (0x10000000, all the PCI address space).
Said that, when I try to write on these positions (with a specific offset, BDF format), nothing is written; again the code does not fail, it just does not write anything.
In case my code was wrong, I tried BusyBox, with the following parameters:
[horro# ~]$ sudo busybox devmem 0xc00b0a8c w 0xffffffff
[horro# ~]$ sudo busybox devmem 0xc00b0a8c
0x00000000
So, basically it is not writing anything.
There is a CONFIG_STRICT_DEVMEM kernel config option. My understanding is that it must be set at compile-time as CONFIG_STRICT_DEVMEM=n. This is a security reasons.
I'm working on Raspberry PI (Linux rpi 3.12.28+) and I have the following C code that I can use to manipulate GPIO ports:
// IO Acces
struct bcm2835_peripheral {
unsigned long addr_p;
int mem_fd; // memory file descriptor
void *map;
volatile unsigned int *addr;
};
struct bcm2835_peripheral gpio = {0x40000000};
// Exposes the physical address defined in the passed structure using mmap on /dev/mem
int map_peripheral(struct bcm2835_peripheral *p)
{
// Open /dev/mem
if ((p->mem_fd = open("/dev/mem", O_RDWR | O_SYNC) ) < 0) {
return -1;
}
p->map = mmap(
NULL,
BLOCK_SIZE,
PROT_READ | PROT_WRITE,
MAP_SHARED,
p->mem_fd, // File descriptor to physical memory virtual file '/dev/mem'
p->addr_p // Address in physical map that we want this memory block to expose
);
if (p->map == MAP_FAILED) {
return -1;
}
p->addr = (volatile unsigned int *)p->map;
return 0;
}
Above code works fine for normal programs (user space). But I need to create a Linux kernel module that will do the same. The problem is that compiler doesn't recognize methods like open() or mmap(). What is an appropriate approach to convert this code to the kernel module (driver)? Are these functions available for kernel programming or should I do that in a different way? I've seen methods like syscall_open(), filp_open(), sys_mmap2() but I'm confused. I will appreciate any help.
You don't have system calls (open, close, read, write, etc..) in kernel space, instead, you'd have to use the internal interfaces provided by the modules, but seems that is not your case.
Considering you are accessing /dev/mem I suppose you are trying to read the physical memory of the RaspberyPi. From kernel space you can access it directly since there's no memory protection, but, you'd have to use phys_to_virt function to translate the addresses.
It is true that there is no need to access /dev/mem in kernel modules. Accessing memory directly using phys_to_virt is a solution to manipulate memory, but it will not work on Raspberry PI if the goal is to manipulate GPIO ports.
The solution is to access hardware registers. I have found excellent tutorial here:
Creating a Basic LED Driver for Raspberry Pi
I'm working in a driver that uses a buffer backed by hugepages, and I'm finding some problems with the sequentality of the hugepages.
In userspace, the program allocates a big buffer backed by hugepages using the mmap syscall. The buffer is then communicated to the driver through a ioctl call. The driver uses the get_user_pages function to get the memory address of that buffer.
This works perfectly with a buffer size of 1 GB (1 hugepage). get_user_pages returns a lot of pages (HUGE_PAGE_SIZE / PAGE_SIZE) but they're all contigous, so there's no problem. I just grab the address of the first page with page_address and work with that. The driver can also map that buffer back to userspace with remap_pfn_range when another program does a mmap call on the char device.
However, things get complicated when the buffer is backed by more than one hugepage. It seems that the kernel can return a buffer backed by non-sequential hugepages. I.e, if the hugepage pool's layout is something like this
+------+------+------+------+
| HP 1 | HP 2 | HP 3 | HP 4 |
+------+------+------+------+
, a request for a hugepage-backed buffer could be fulfilled by reserving HP1 and HP4, or maybe HP3 and then HP2. That means that when I get the pages with get_user_pages in the last case, the address of page 0 is actually 1 GB after the address of page 262.144 (the next hugepage's head).
Is there any way to sequentalize access to those pages? I tried reordering the addresses to find the lower one so I can use the whole buffer (e.g., if kernel gives me a buffer backed by HP3, HP2 I use as base address the one of HP2), but it seems that would scramble the data in userspace (offset 0 in that reordered buffer is maybe offset 1GB in the userspace buffer).
TL;DR: Given >1 unordered hugepages, is there any way to access them sequentially in a Linux kernel driver?
By the way, I'm working on a Linux machine with 3.8.0-29-generic kernel.
Using the function suggested by CL, vm_map_ram, I was able to remap the memory so it can be accesed sequentially, independently of the number of hugepages mapped. I leave the code here (error control not included) in case it helps anyone.
struct page** pages;
int retval;
unsigned long npages;
unsigned long buffer_start = (unsigned long) huge->addr; // Address from user-space map.
void* remapped;
npages = 1 + ((bufsize- 1) / PAGE_SIZE);
pages = vmalloc(npages * sizeof(struct page *));
down_read(¤t->mm->mmap_sem);
retval = get_user_pages(current, current->mm, buffer_start, npages,
1 /* Write enable */, 0 /* Force */, pages, NULL);
up_read(¤t->mm->mmap_sem);
nid = page_to_nid(pages[0]); // Remap on the same NUMA node.
remapped = vm_map_ram(pages, npages, nid, PAGE_KERNEL);
// Do work on remapped.
I have implemented a linux kernel driver which uses deferred IO mechanism to track the changes in framebuffer node.
static struct fb_deferred_io fb_defio = {
.delay = HZ/2,
.deferred_io = fb_dpy_deferred_io,
};
Per say the registered framebuffer node is /dev/graphics/fb1.
The sample application code to access this node is:
fbfd = open("/dev/graphics/fb1", O_RDWR);
if (!fbfd) {
printf("error\n");
exit(0);
}
screensize = 540*960*4;
/* Map the device to memory */
fbp = (unsigned char *)mmap(0, screensize, PROT_READ | PROT_WRITE, MAP_SHARED,
fbfd, 0);
if ((int)fbp == -1) {
printf("Error: failed to start framebuffer device to memory.");
}
int grey = 0x1;
for(cnt = 0; cnt < screensize; cnt++)
*(fbp + cnt) = grey<<4|grey;
This would fill up entire fb1 node with 1's.
The issue now is at the kernel driver when i try to read the entire buffer I find data mismatch at different locations.
The buffer in kernel is mapped as:
par->buffer = dma_alloc_coherent(dev, roundup((dpyw*dpyh*BPP/8), PAGE_SIZE),(dma_addr_t *) &DmaPhysBuf, GFP_KERNEL);
if (!par->buffer) {
printk(KERN_WARNING "probe: dma_alloc_coherent failed.\n");
goto err_vfree;
}
and finally the buffer is registered through register_framebuffer function.
On reading the source buffer I find that at random locations the data is not been written instead the old data is reflected.
For example:
At buffer location 3964 i was expecting 11111111 but i found FF00FF00.
On running the same application program with value of grey changed to 22222222
At buffer location 3964 i was expecting 22222222 but i found 11111111
It looks like there is some delayed write in the buffer. Is there any solution to this effect, because of partially wrong data my image is getting corrupted.
Please let me know if any more information is required.
Note: Looks like an issue of mapped buffer being cacheable or not. Its a lazy write to copy the data from cache to ram. Need to make sure that the data is copied properly but how still no idea.. :-(
"Deferred io" means that frame buffer memory is not really mapped to a display device. Rather, it's an ordinary memory area shared between user process and kernel driver. Thus it needs to be "synced" for kernel to actually do anything about it:
msync(fbp, screensize, MS_SYNC);
Calling fsync(fbfd) may also work.
You may also try calling ioctl(fbfd, FBIO_WAITFORVSYNC, 0) if your driver supports it. The call will make your application wait until vsync happens and the frame buffer data was definitely transferred to the device.
I was having a similar issue where I was having random artifacts displaying on the screen. Originally, the framebuffer driver was not using dma at all.
I tried the suggestion of using msync(), which improved the situation (artifacts happened less frequently), but it did not completely solve the issue.
After doing some research I came to the conclusion that I need to use dma memory because it is not cached. There is still the issue with mmap because it is mapping the kernel memory to userspace. However, I found that there is already a function in the kernel to handle this.
So, my solution was in my framebuffer driver, set the mmap function:
static int my_fb_mmap(struct fb_info *info, struct vm_area_struct *vma)
{
return dma_mmap_coherent(info->dev, vma, info->screen_base,
info->fix.smem_start, info->fix.smem_len);
}
static struct fb_ops my_fb_ops = {
...
.fb_mmap = my_fb_mmap,
};
And then in the probe function:
struct fb_info *info;
struct my_fb_par *par;
dma_addr_t dma_addr;
char *buf
info = framebuffer_alloc(sizeof(struct my_fb_par), &my_parent->dev);
...
buf = dma_alloc_coherent(info->dev, MY_FB_SIZE, dma_addr, GFP_KERNEL);
...
info->screen_base = buf;
info->fbops = &my_fb_ops;
info->fix = my_fb_fix;
info->fix.smem_start = dma_addr;
info->fix.smem_len = MY_FB_SIZE;
...
par = info->par
...
par->buffer = buf;
Obviously, I've left out the error checking and unwinding, but hopefully I have touched on all of the important parts.
Note: Comments in the kernel source say that dmac_flush_range() is for private use only.
Well eventually i found a better way to solve the issue. The data written through app at mmaped device node is first written in cache which is later written in RAM through delayed write policy. In order to make sure that the data is flushed properly we need to call the flush function in kernel. I used
dmac_flush_range((void *)pSrc, (void *)pSrc + bufSize);
to flush the data completely so that the kernel receives a clean data.