I have a PCIe endpoint device connected to the host. The ep's (endpoints) 512MB BAR is mmapped and memcpy is used to transfer data. Memcpy is quite slow (~2.5s). When I don't map all of the BAR (100bytes), but run memcpy for the full 512MB, I get a segfault within 0.5s, however when reading back the end of the BAR, the data shows the correct data. Meaning that the data reads the same as if I did mmap the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC)
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC)
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem
https://github.com/billfarrow/pcimem
Edit: I was being dumb while consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore my conclusion that memcpy was completing after the segfault was false. I am accepting the answer as it provided me useful information.
Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)
// remove the output file
unlink(filename);
// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT,0644)
// prevent segfault by providing space in the file
ftruncate(ofile_fd,SIZE);
map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);
// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif
rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, SIZE);
if (munmap(map_base, map_size) == -1) {
PRINT_ERROR;
}
// close the output file
close(ofile_fd);
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned--which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to. –
userYou
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has requested. It also shows the turnaround time (e.g. Bus read request time until data packet from device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial-and-error and we (I) tried some 27 different combinations before deciding on the optimum config
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code. Either the library function and/or code that it may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>
#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif
#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif
// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif
typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;
// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst,qmemcpy_pc src,size_t size)
{
size_t cnt;
size_t idx;
cnt = size / sizeof(qmemcpy_t);
size = cnt * sizeof(qmemcpy_t);
if (OPT_QCPYIDX) {
for (idx = 0; idx < cnt; ++idx)
dst[idx] = src[idx];
}
else {
for (; cnt > 0; --cnt, ++dst, ++src)
*dst = *src;
}
return size;
}
// qmemcpy -- fast memcpy
void
qmemcpy(void *dst,const void *src,size_t size)
{
size_t xlen;
// use fast memcpy for aligned size
if (OPT_QMEMCPY > 0) {
xlen = _qmemcpy(dst,src,size);
src += xlen;
dst += xlen;
size -= xlen;
}
// copy remainder with ordinary memcpy
if (size > 0)
memcpy(dst,src,size);
}
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of cuda
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part. –
userYou
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory.
A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form), would read in bytes from the firmware/microcode .xsvf file used (e.g.) fscanf(fi,"%c",&myint) to get the bytes.
This was horrible. I refactored the utility to fix that and the processing of the state machine and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.
I am trying to write an integer (1114129) from my HPS on Cyclone V Altera FPGA from a PUTTY window to a 32bit PIO on the FPGA side via lightweight axis interface. I am using mmap() and cannot get it to work, have been trying for months.
I have set up the hardware side correctly because I was able to read from another 32bit PIO at the address 0xff205000, this works fine but I can't read or write to the second PIO. I have tried multiple different addresses as well and it doesn't seem to make a difference.
Here is the error I get, which is because of mmap returns MAP_FAILED
open /dev/mem successfully !
npheap_alloc(): Invalid argument
#
As you can see above the file is opened correctly and then it fails on the mmap call.
Below is the c code im using. There are no compiling issues.
If anyone could help or even point me in the right direction i would be really appreciative its driving me insane.
#define MAPPED_SIZE 4
#define DDR_RAM_PHYS 0xff205010
int main(void)
{
int _fdmem;
void *map;
const char memDevice[] = "/dev/mem";
_fdmem = open( memDevice, O_RDWR | O_SYNC );
if (_fdmem < 0){
printf("Failed to open the /dev/mem !\n");
return 0;
}
else{
printf("open /dev/mem successfully !\n");
}
/* mmap() the opened /dev/mem */
map = mmap(0, MAPPED_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, _fdmem, DDR_RAM_PHYS);
if (map == MAP_FAILED) { perror("npheap_alloc()"); exit(1); }
*(unsigned int*)(map+(0xff205010)) = (unsigned short)1114129;
//int *q = (int *)map;
//*q = 1114129;
/* unmap the area & error checking */
if (munmap(map,MAPPED_SIZE)==-1){
perror("Error un-mmapping the file");
}
/* close the character device */
close(_fdmem);
}
0xff205010 is not a valid mmap offset: it's not page-aligned. Also, adding 0xff205010 to map would not be valid even if it were; map would already refer to that physical address.
Instead, You need to mmap the correct page (0xff205010 & -PAGESIZE) then use (0xff205010 % PAGESIZE) as the offset into the mapping.
I'm given a physical address, specifically 0x000000368d76c0. I'm trying to mmap it into my program. The code that I'm using is
void *mmap64;
off_t offset = 0x000000368d76c0;
int memFd = open("/dev/mem", O_RDWR);
if (-1 == memFd)
perror("Error ");
mmap64 = mmap(0, sizeof(uint64_t), PROT_WRITE | PROT_READ, MAP_SHARED, memFd, offset);
if (MAP_FAILED == mmap64) {
perror("Error ");
return -1;
}
For some reason when I run this code I get a failure on mmap. Specifically it says Error Invalid argument. I'm pretty sure it is because of the offset value, but I don't know what is wrong with it.
I would appreciate any help on it.
According to mmap(2) - Linux manual page,
offset must be a multiple of the page size as
returned by sysconf(_SC_PAGE_SIZE).
When the page size is 4096 (a page size used in x86 CPU), 0x000000368d76c0 is not a multiple of 4096 and will be considered as invalid.
For that reason, you will have to adjust the offset.
Hi im using a beaglebone black running on debian, and i use mmap on /dev/mem file to access GPIO registers.
I have a .c file that contains my mapping function:
//sample code
unsigned int *gpio_get_map(int gpio)
{
unsigned int *gpio_addr = NULL;
int fd = open("/dev/mem", O_RDWR);
gpio_addr = mmap(0, GPIO_SIZE, PROT_READ | PROT_WRITE,MAP_SHARED,fd, gpio_get_number(gpio));
if(gpio_addr == MAP_FAILED)
{
printf("Unable to map GPIO : %s\n",strerror(errno));
close(fd);
return NULL;
}
close(fd);
return gpio_addr;
}
Then i call this function in another .c file to get the value of gpio_addr and use it to manipulate the GPIOs, it works fine but i am not sure how long the gpio_addr will be valid.
Will the address given by the gpio_addr be always valid? Or should i call another mmap after some period of time? Thanks.
You don't need to renew the mapping. It will stay valid in the calling process until you explicitly unmap it (or the process ends).
I'm currently experimenting with KVM, and trying to get US (userspace) I/O working. Currently, output (i.e. out dx, eax) works, and the US code can see the written value, but input (in eax, dx) does not seem to work - the VM doesn't receive the value written by the US code.
if (run->io.port == 0xface && run->io.direction == KVM_EXIT_IO_IN)
{
printf("Port 0xface read\n");
*(volatile uint32_t *)((uintptr_t)run + run->io.data_offset) = 0xdeadbeefu;
continue;
}
run is a pointer to a struct kvm_run that was mmaped earlier and has enough space (i.e. run->io.data_offset is a valid offset from the pointer). The continue statement eventually causes the VM to restart, and the code continues normally. However, when I try to get the VM's rax register (which should be 0xdeadbeef), I get zero. From what I read in the docs (kvm/Documentation/api.txt), this is how I should be doing it. Am I missing something?
On a semi-related note, if I precede the continue statement with run->io.count = run->io.count;, the I/O is triggered again (even though count isn't changed). Is this expected behavior? Or am I triggering undefined behavior?
The issue is with the actual mmap call:
run = mmap(NULL, mapSize, PROT_READ | PROT_WRITE, MAP_PRIVATE, vcpuID, 0);
The flags parameter should be MAP_SHARED instead of MAP_PRIVATE:
run = mmap(NULL, mapSize, PROT_READ | PROT_WRITE, MAP_SHARED, vcpuID, 0);
^^^^^^^^^^
The virtual machine will see the updated values when KVM_RUN is issued to restart it.