PCIe DMA into system memory without a kernel module - c

I have a PCIe device with DMA functionality residing under Ubuntu Linux 14.04. I can create a DMA transfer coming from the device (confirmed with an analyzer and target application of DMAing into another device's memory). However, I am struggling to understand how to receive the data from the device to Linux system memory.
I boot with maxcpus=1 to make sure I do not run into cache issues (https://bakhi.github.io/devmem/) and mem=2048M (https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch15.html) kernel parameters to make sure the kernel does not use the memory I would like to use for my DMA buffer. The Ubuntu PC has 16 GB of total RAM and I am trying to target physical address 0x90000000.
I followed this answer https://stackoverflow.com/a/41713401 to try to map the physical memory. It appears that the mmap is successful (returns an address). When I try to mmap memory below 2048M, it fails as expected, since it is allocated to the kernel. Before attempting a DMA transfer, I tried simply reading and writing the memory in the C program. When I read the memory, I get some values back. When I read the memory again, I get the same values. This makes me think that the reading might actually be working (not a very strong argument...). When I attempt to read the memory, write the memory and then read it again, I see the original values, as if the write never happened. I tried to start a DMA transfer but I do not see the data arriving in the physical memory as expected.
Here is the code of my exercise:
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
int main()
{
static volatile uint32_t *gpio = NULL;
int fd;
if ((fd = open ("/dev/mem", O_RDWR | O_SYNC | O_CLOEXEC) ) < 0) return -1;
gpio = (uint32_t *)mmap(0, 0x8000000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x90000000);
if (gpio == MAP_FAILED) return -1;
int i;
for (i = 0; i < 32; i++)
{
printf("0x%08x\n", *(gpio+i));
}
printf("Press any key when ready to continue...\n");
getchar();
*(gpio + 0) = 0x11223344;
*(gpio + 1) = 0x55667788;
printf("---\n");
for (i = 0; i < 32; i++)
{
printf("0x%08x\n", *(gpio+i));
}
munmap((void*)gpio, 0x8000000);
close(fd);
return 0;
}
My goal is to write some data through a PCIe DMA transfer from an end-point to the system memory of the host computer without a kernel module and read the data in the host computer once I know the DMA transfer has occurred. Is this even possible?

Related

Memcpy Complete After Segfault

I have a PCIe endpoint device connected to the host. The endpoint's 512 MB BAR is mmapped and memcpy is used to transfer data. Memcpy is quite slow (~2.5 s). When I map only 100 bytes of the BAR but still run memcpy for the full 512 MB, I get a segfault within 0.5 s. However, when I read back the end of the BAR, the data appears correct, that is, it reads the same as if I had mapped the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem
https://github.com/billfarrow/pcimem
Edit: I was being dumb. While consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore, my conclusion that memcpy was completing after the segfault was false. I am accepting the answer, as it provided useful information.
Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)
// remove the output file
unlink(filename);
// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT, 0644);
// prevent segfault by providing space in the file
ftruncate(ofile_fd,SIZE);
map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);
// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif
rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, SIZE);
if (munmap(map_base, SIZE) == -1) {
PRINT_ERROR;
}
// close the output file
close(ofile_fd);
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned--which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to.
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has requested. It also shows the turnaround time (e.g. Bus read request time until data packet from device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial-and-error and we (I) tried some 27 different combinations before deciding on the optimum config
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code. Either the library function and/or code that it may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>
#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif
#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif
// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif
typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;
// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst,qmemcpy_pc src,size_t size)
{
size_t cnt;
size_t idx;
cnt = size / sizeof(qmemcpy_t);
size = cnt * sizeof(qmemcpy_t);
if (OPT_QCPYIDX) {
for (idx = 0; idx < cnt; ++idx)
dst[idx] = src[idx];
}
else {
for (; cnt > 0; --cnt, ++dst, ++src)
*dst = *src;
}
return size;
}
// qmemcpy -- fast memcpy
void
qmemcpy(void *dst,const void *src,size_t size)
{
size_t xlen;
// use fast memcpy for aligned size
if (OPT_QMEMCPY > 0) {
xlen = _qmemcpy(dst,src,size);
src += xlen;
dst += xlen;
size -= xlen;
}
// copy remainder with ordinary memcpy
if (size > 0)
memcpy(dst,src,size);
}
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of cuda
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part.
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory.
A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form) would read in bytes from the firmware/microcode .xsvf file using (e.g.) fscanf(fi,"%c",&myint) to get the bytes.
This was horrible. I refactored the utility to fix that and the processing of the state machine and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.

mmap() keeps returning MAP_FAILED when trying to read or write to a PIO address on Cyclone V FPGA

I am trying to write an integer (1114129) from the HPS on my Altera Cyclone V FPGA, via a PuTTY window, to a 32-bit PIO on the FPGA side over the lightweight AXI interface. I am using mmap() and cannot get it to work; I have been trying for months.
I have set up the hardware side correctly because I was able to read from another 32bit PIO at the address 0xff205000, this works fine but I can't read or write to the second PIO. I have tried multiple different addresses as well and it doesn't seem to make a difference.
Here is the error I get, which occurs because mmap returns MAP_FAILED:
open /dev/mem successfully !
npheap_alloc(): Invalid argument
#
As you can see above the file is opened correctly and then it fails on the mmap call.
Below is the C code I'm using. There are no compile errors.
If anyone could help or even point me in the right direction, I would really appreciate it. It's driving me insane.
#define MAPPED_SIZE 4
#define DDR_RAM_PHYS 0xff205010
int main(void)
{
int _fdmem;
void *map;
const char memDevice[] = "/dev/mem";
_fdmem = open( memDevice, O_RDWR | O_SYNC );
if (_fdmem < 0){
printf("Failed to open the /dev/mem !\n");
return 0;
}
else{
printf("open /dev/mem successfully !\n");
}
/* mmap() the opened /dev/mem */
map = mmap(0, MAPPED_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, _fdmem, DDR_RAM_PHYS);
if (map == MAP_FAILED) { perror("npheap_alloc()"); exit(1); }
*(unsigned int*)(map+(0xff205010)) = (unsigned short)1114129;
//int *q = (int *)map;
//*q = 1114129;
/* unmap the area & error checking */
if (munmap(map,MAPPED_SIZE)==-1){
perror("Error un-mmapping the file");
}
/* close the character device */
close(_fdmem);
}
0xff205010 is not a valid mmap offset: it's not page-aligned. Also, adding 0xff205010 to map would not be valid even if it were; map would already refer to that physical address.
Instead, you need to mmap the correct page (0xff205010 & -PAGESIZE), then use (0xff205010 % PAGESIZE) as the offset into the mapping.
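The split into a page base and an in-page offset can be sketched as follows. The helper names page_base, page_offset and map_reg are made up for this example, and the sketch assumes a power-of-two page size and a 32-bit register. It has not been run against real hardware.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a physical address down to the start of its page.
 * page_size must be a power of two. */
static uint64_t page_base(uint64_t phys, uint64_t page_size)
{
    return phys & ~(page_size - 1);
}

/* Offset of a physical address within its page. */
static uint64_t page_offset(uint64_t phys, uint64_t page_size)
{
    return phys & (page_size - 1);
}

/* Map the page containing phys from an open /dev/mem descriptor and return
 * a pointer to the 32-bit register at phys, or NULL on failure. *map_out
 * receives the page-aligned mapping so the caller can munmap() it later. */
static volatile uint32_t *map_reg(int fd, uint64_t phys, void **map_out)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *map;

    assert(map_out != NULL);
    if (ps <= 0)
        return NULL;
    map = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
               (off_t)page_base(phys, (uint64_t)ps));
    if (map == MAP_FAILED)
        return NULL;
    *map_out = map;
    return (volatile uint32_t *)((char *)map + page_offset(phys, (uint64_t)ps));
}
```

With a 4096-byte page, 0xff205010 splits into a mapping offset of 0xff205000 and an in-page offset of 0x10. The write then becomes *reg = 1114129; on the pointer returned by map_reg, followed by munmap(map, page size) on the saved base.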

Access /dev/mem from Linux kernel module

I'm working on Raspberry PI (Linux rpi 3.12.28+) and I have the following C code that I can use to manipulate GPIO ports:
// IO Access
struct bcm2835_peripheral {
unsigned long addr_p;
int mem_fd; // memory file descriptor
void *map;
volatile unsigned int *addr;
};
struct bcm2835_peripheral gpio = {0x40000000};
// Exposes the physical address defined in the passed structure using mmap on /dev/mem
int map_peripheral(struct bcm2835_peripheral *p)
{
// Open /dev/mem
if ((p->mem_fd = open("/dev/mem", O_RDWR | O_SYNC) ) < 0) {
return -1;
}
p->map = mmap(
NULL,
BLOCK_SIZE,
PROT_READ | PROT_WRITE,
MAP_SHARED,
p->mem_fd, // File descriptor to physical memory virtual file '/dev/mem'
p->addr_p // Address in physical map that we want this memory block to expose
);
if (p->map == MAP_FAILED) {
return -1;
}
p->addr = (volatile unsigned int *)p->map;
return 0;
}
The above code works fine for normal programs (user space), but I need to create a Linux kernel module that does the same. The problem is that the compiler doesn't recognize functions like open() or mmap(). What is an appropriate approach to convert this code to a kernel module (driver)? Are these functions available for kernel programming, or should I do this in a different way? I've seen functions like syscall_open(), filp_open() and sys_mmap2(), but I'm confused. I will appreciate any help.
You don't have system calls (open, close, read, write, etc.) in kernel space. Instead, you have to use the internal interfaces provided by the kernel and its modules, but that does not seem to be your case.
Considering you are accessing /dev/mem, I suppose you are trying to read the physical memory of the Raspberry Pi. From kernel space you can access it directly since there's no memory protection, but you have to use the phys_to_virt function to translate the addresses.
It is true that there is no need to access /dev/mem in kernel modules. Accessing memory directly using phys_to_virt is a solution to manipulate memory, but it will not work on Raspberry PI if the goal is to manipulate GPIO ports.
The solution is to access the hardware registers. I have found an excellent tutorial here:
Creating a Basic LED Driver for Raspberry Pi

Linux is not allowing me to access a fixed region of memory

I have some data stored in a FLASH memory that I need to access with C pointers to be able to make a non-Linux graphics driver work (I think this requirement is DMA related, not sure). Calling read works, but I don't want to have intermediate RAM buffers between the FLASH and the non-Linux driver.
However, just creating a pointer and storing the address that I want on it is making Linux emit an exception about invalid access on me.
void *ptr = 0xdeadbeef;
int a = *ptr; // invalid access!
What am I missing here? And could someone point me to some material that makes these concepts clear to me?
I'm reading about mmap but I'm not sure that this is what I need.
The problem you have is that Linux runs your program in a virtual address space. Every address you use directly in the code (like 0xdeadbeef) is a virtual address that the memory management unit translates into a physical address, which is not necessarily the same as the virtual one. This allows easy separation of multiple independent processes, and other things like paging.
The problem in your case is that no physical address is mapped to the virtual address 0xdeadbeef, which causes the kernel to abort execution.
The mmap call you already found asks the kernel to map a specific file (starting at a specific offset) to a virtual address in your process. Note that the address returned by mmap could be a completely different address, so don't make any assumptions about the virtual address you get.
That is why there are examples using mmap on /dev/mem, where the offset into the memory device is the physical address. Once the kernel has mapped the file at the offset you gave to a virtual address in your process, you can access the memory area as if it were a direct access.
Once you no longer need the area, don't forget to munmap it. Otherwise you'll cause something similar to a memory leak.
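As a hardware-free illustration of the mmap/munmap life cycle described above, here is a small sketch that maps an ordinary temporary file instead of /dev/mem. The helper name mmap_roundtrip and the /tmp path template are assumptions for the example. The mechanics are the same for /dev/mem: an offset goes in, and a kernel-chosen virtual address comes out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write msg to a temporary file, map the file, and copy the contents back
 * out through the mapping. Returns 0 on success, -1 on failure. */
static int mmap_roundtrip(const char *msg, char *out, size_t out_len)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";   /* hypothetical scratch file */
    size_t len = strlen(msg) + 1;
    int fd;
    void *p;

    assert(out != NULL && out_len >= len);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file vanishes once unused */
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    /* The kernel picks the virtual address; the final 0 is a file offset. */
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping outlives the fd */
    if (p == MAP_FAILED)
        return -1;
    printf("file offset 0 is visible at virtual address %p\n", p);
    memcpy(out, p, len);
    return munmap(p, len);                   /* release the mapping */
}
```

The printed address typically differs from run to run, which is why code should only ever use the pointer mmap returns and never a hard-coded value.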
One problem with the /dev/mem method is that the user running the process needs access to this device. This could introduce a security issue (e.g. Samsung recently introduced such a security hole in their handheld devices).
A more secure way is described in an article I found (The Userspace I/O HOWTO), as you still have control over the memory areas accessible by the user's process.
You need to access the memory differently. Basically you need to open /dev/mem and use mmap(). (as you suggested). Simple example:
/* Shared state for the two helpers below. */
static void *mmap_base_address;
static unsigned int mask;
int openMem(unsigned int address, unsigned int size)
{
int mmapFD;
/* size must be a power of two and a multiple of the page size, so that
 * size - 1 splits an address into a window base and an offset. */
mask = size - 1;
/* Open the file that will be mapped. */
if((mmapFD = open("/dev/mem", (O_RDWR | O_SYNC))) == -1)
{
printf("Opening shared memory device failed\n");
return -1;
}
/* The mmap offset must be page-aligned, so map the size-aligned window
 * that contains the requested address. */
mmap_base_address = mmap(0, size, (PROT_READ|PROT_WRITE), MAP_SHARED, mmapFD, (off_t)(address & ~mask));
if(mmap_base_address == MAP_FAILED)
{
printf("Mapping memory failed\n");
return -1;
}
return 0;
}
unsigned int *getAddress(unsigned int address)
{
/* Add the offset within the window to the virtual base of the mapping. */
return (unsigned int *)((char *)mmap_base_address + (address & mask));
}
...
result = openMem(address, 0x10000);
if (result < 0)
return result;
target_address = getAddress(address);
*(unsigned int*)target_address = value;
This writes "value" to the physical address "address".
You need to call ioremap - something like:
void *myaddr = ioremap(0xdeadbeef, size);
where size is the size of your memory region. You probably want to use a page-aligned address for the first argument, e.g. 0xdeadb000 - but I expect your actual device isn't at "0xdeadbeef" anyways.
Edit: The call to ioremap must be done from a driver!

Reading / writing from using I2C on Linux

I'm trying to read/write to an FM24CL64-GTR FRAM chip that is connected over an I2C bus at address 0b 1010 011.
When I try to write 3 bytes (2 bytes of data address + 1 byte of data), I get a kernel message ([12406.360000] i2c-adapter i2c-0: sendbytes: NAK bailout.), and write returns a value != 3. See the code below:
#include <linux/i2c-dev.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
int file;
char filename[20];
int addr = 0x53; // 0b1010011; /* The I2C address */
uint16_t dataAddr = 0x1234;
uint8_t val = 0x5c;
uint8_t buf[3];
sprintf(filename,"/dev/i2c-%d",0);
if ((file = open(filename,O_RDWR)) < 0)
exit(1);
if (ioctl(file,I2C_SLAVE,addr) < 0)
exit(2);
buf[0] = dataAddr >> 8;
buf[1] = dataAddr & 0xff;
buf[2] = val;
if (write(file, buf, 3) != 3)
exit(3);
...
However when I write 2 bytes, then write another byte, I get no kernel error, but when trying to read from the FRAM, I always get back 0. Here is the code to read from the FRAM:
uint8_t val;
if ((file = open(filename,O_RDWR)) < 0)
exit(1);
if (ioctl(file,I2C_SLAVE,addr) < 0)
exit(2);
if (write(file, &dataAddr, 2) != 2)
exit(3);
if (read(file, &val, 1) != 1)
exit(3);
None of the functions return an error value, and I have also tried it with:
#include <linux/i2c.h>
struct i2c_rdwr_ioctl_data work_queue;
struct i2c_msg msg[2];
uint8_t ret;
work_queue.nmsgs = 2;
work_queue.msgs = msg;
work_queue.msgs[0].addr = addr;
work_queue.msgs[0].len = 2;
work_queue.msgs[0].flags = 0;
work_queue.msgs[0].buf = &dataAddr;
work_queue.msgs[1].addr = addr;
work_queue.msgs[1].len = 1;
work_queue.msgs[1].flags = I2C_M_RD;
work_queue.msgs[1].buf = &ret;
if (ioctl(file,I2C_RDWR,&work_queue) < 0)
exit(3);
Which also succeeds, but always returns 0. Does this indicate a hardware issue, or am I doing something wrong?
Are there any FRAM drivers for FM24CL64-GTR over I2C on Linux, and what would the API be? Any link would be helpful.
I do not have experience with that particular device, but in our experience many I2C devices have "quirks" that require a work-around, typically above the driver level.
We use Linux (CELinux) with an I2C device driver as well. But our application code also has a non-trivial I2C module that contains all the work-around intelligence for dealing with the various devices we have experience with.
Also, when dealing with I2C issues, I often find that I need to re-acquaint myself with the source spec:
http://www.nxp.com/acrobat_download/literature/9398/39340011.pdf
as well as the usage of a decent oscilloscope.
Good luck,
Above link is dead, here are some other links:
http://www.nxp.com/documents/user_manual/UM10204.pdf
and of course wikipedia:
http://en.wikipedia.org/wiki/I%C2%B2C
The NAK was a big hint: the WriteProtect pin was externally pulled up, and had to be driven to ground, after that a single write of the address followed by data-bytes is successful (first code segment).
For reading, the address can be written out first (using write()), and then sequential data can be read starting from that address.
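A minimal sketch of that read sequence, assuming the device expects the high address byte first (as the first code segment in the question sends it) and that the caller has already opened the bus and selected the slave with ioctl(file, I2C_SLAVE, addr). The helper names fram_addr_bytes and fram_read are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Split a 16-bit memory address into two bytes, high byte first, so the
 * order on the wire does not depend on the host's endianness. */
static void fram_addr_bytes(uint16_t mem_addr, uint8_t out[2])
{
    out[0] = (uint8_t)(mem_addr >> 8);
    out[1] = (uint8_t)(mem_addr & 0xff);
}

/* Read len bytes starting at mem_addr. fd must already be open on the
 * I2C bus with the slave address selected.
 * Returns 0 on success, -1 on failure. */
static int fram_read(int fd, uint16_t mem_addr, uint8_t *data, size_t len)
{
    uint8_t a[2];

    assert(data != NULL);
    fram_addr_bytes(mem_addr, a);
    if (write(fd, a, sizeof a) != (ssize_t)sizeof a)   /* set the address */
        return -1;
    if (read(fd, data, len) != (ssize_t)len)           /* sequential read */
        return -1;
    return 0;
}
```

Building the address bytes explicitly avoids passing &dataAddr to write(), which on a little-endian host would send the low byte first.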
Note that the method using struct i2c_rdwr_ioctl_data and struct i2c_msg (that is, the last code part you've given) is more efficient than the others, since it uses the repeated start feature of I2C.
This means you avoid a STA-WRITE-STO -> STA-READ-<data>...-STO sequence, because your communication becomes STA-WRITE-RS-READ-<data>...STO (RS = repeated start). So it saves you a redundant STO-STA transition.
Not that it makes a big difference in time, but if the extra transition isn't needed, why pay for it...
Just my 2 ct.
Best rgds,
You had some mistakes!
The address of the IC is Ax in hex: x can be anything, but the upper 4 bits must be A (binary 1010)!

Resources