KVM Userspace Port I/O

KVM Userspace Port I/O - c

I'm currently experimenting with KVM, and trying to get US (userspace) I/O working. Currently, output (i.e. out dx, eax) works, and the US code can see the written value, but input (in eax, dx) does not seem to work - the VM doesn't receive the value written by the US code.
if (run->io.port == 0xface && run->io.direction == KVM_EXIT_IO_IN)
{
printf("Port 0xface read\n");
*(volatile uint32_t *)((uintptr_t)run + run->io.data_offset) = 0xdeadbeefu;
continue;
}
run is a pointer to a struct kvm_run that was mmaped earlier and has enough space (i.e. run->io.data_offset is a valid offset from the pointer). The continue statement eventually causes the VM to restart, and the code continues normally. However, when I try to get the VM's rax register (which should be 0xdeadbeef), I get zero. From what I read in the docs (kvm/Documentation/api.txt), this is how I should be doing it. Am I missing something?
On a semi-related note, if I precede the continue statement with run->io.count = run->io.count;, the I/O is triggered again (even though count isn't changed). Is this expected behavior? Or am I triggering undefined behavior?

The issue is with the actual mmap call:
run = mmap(NULL, mapSize, PROT_READ | PROT_WRITE, MAP_PRIVATE, vcpuID, 0);
The flags parameter should be MAP_SHARED instead of MAP_PRIVATE:
run = mmap(NULL, mapSize, PROT_READ | PROT_WRITE, MAP_SHARED, vcpuID, 0);
^^^^^^^^^^
The virtual machine will see the updated values when KVM_RUN is issued to restart it.

Related

Memcpy Complete After Segfault

I have a PCIe endpoint device connected to the host. The ep's (endpoints) 512MB BAR is mmapped and memcpy is used to transfer data. Memcpy is quite slow (~2.5s). When I don't map all of the BAR (100bytes), but run memcpy for the full 512MB, I get a segfault within 0.5s, however when reading back the end of the BAR, the data shows the correct data. Meaning that the data reads the same as if I did mmap the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC)
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC)
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem
https://github.com/billfarrow/pcimem
Edit: I was being dumb while consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore my conclusion that memcpy was completing after the segfault was false. I am accepting the answer as it provided me useful information.

Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)
// remove the output file
unlink(filename);
// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT,0644)
// prevent segfault by providing space in the file
ftruncate(ofile_fd,SIZE);
map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);
// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif
rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, SIZE);
if (munmap(map_base, map_size) == -1) {
PRINT_ERROR;
}
// close the output file
close(ofile_fd);
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned--which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to. –
userYou
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has requested. It also shows the turnaround time (e.g. Bus read request time until data packet from device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial-and-error and we (I) tried some 27 different combinations before deciding on the optimum config
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code. Either the library function and/or code that it may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>
#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif
#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif
// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif
typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;
// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst,qmemcpy_pc src,size_t size)
{
size_t cnt;
size_t idx;
cnt = size / sizeof(qmemcpy_t);
size = cnt * sizeof(qmemcpy_t);
if (OPT_QCPYIDX) {
for (idx = 0; idx < cnt; ++idx)
dst[idx] = src[idx];
}
else {
for (; cnt > 0; --cnt, ++dst, ++src)
*dst = *src;
}
return size;
}
// qmemcpy -- fast memcpy
void
qmemcpy(void *dst,const void *src,size_t size)
{
size_t xlen;
// use fast memcpy for aligned size
if (OPT_QMEMCPY > 0) {
xlen = _qmemcpy(dst,src,size);
src += xlen;
dst += xlen;
size -= xlen;
}
// copy remainder with ordinary memcpy
if (size > 0)
memcpy(dst,src,size);
}
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of cuda
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part. –
userYou
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory.
A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form), would read in bytes from the firmware/microcode .xsvf file used (e.g.) fscanf(fi,"%c",&myint) to get the bytes.
This was horrible. I refactored the utility to fix that and the processing of the state machine and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.

mmap() causing segmentation fault in C

I'm pretty sure my mistake is very evident, but I just can't seem to find where the problem is.
I'm learning how to use mmap() in C, everything looks correct to me, but I get a segmentation fault.
Here is my code:
int n=50;
char * tab = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, -1, 0);
for(int i=0; i<n; i++)
{
tab[i] = 1;
}
Using valgrind, I get an error saying "Invalid write of size 1" at the line where I do tab[i]=1, (I have tried replacing 1 by '1' thinking that maybe a char has a smaller size than an int, but still get the same error), followed by "Address 0xfffff..ff is not stack'd, malloc'd, or (recently) free'd".
I have no idea where my mistake is. Can somebody help me find it?

From man 2 mmap:
The contents of a file mapping (as opposed to an anonymous mapping;
see MAP_ANONYMOUS below), are initialized using length bytes starting
at offset offset in the file (or other object) referred to by the
file descriptor fd.
I suppose that you are trying to create an anonymous mapping (i.e. not backed by a file). In such case, you need to add MAP_ANONYMOUS to the flags, otherwise the system will try to read from the specified fd, which is invalid (-1) and will fail.
The correct code is:
char *tab = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (tab == MAP_FAILED) {
perror("mmap");
exit(1);
}
For the future, note that you can easily detect the error like I did above with a simple call to perror() in case the returned value indicates failure. In your case it should have printed the following:
mmap: Bad file descriptor
Checking the manual again you can see in the "ERRORS" section:
EBADF: fd is not a valid file descriptor (and MAP_ANONYMOUS was not set).

Using mmap() instead of malloc()

I am trying to complete an exercise that is done with system calls and need to allocate memory for a struct *. My code is:
myStruct * entry = (myStruct *)mmap(0, SIZEOF(myStruct), PROT_READ|PROT_WRITE,
MAP_ANONYMOUS, -1, 0);
To clarify, I cannot use malloc() but can use mmap(). I was having no issues with this on Windows in Netbeans, now however I'm compiling and running from command line on Ubuntu I am getting "Segmentation Fault" each time I try to access it.
Is there a reason why it will work on one and not the other, and is mmap() a valid way of allocating memory in this fashion? My worry was I was going to be allocating big chunks of memory for each mmap() call initially, now I just cannot get it to run.
Additionally, the error returned my mmap is 22 - Invalid Argument (I did some troubleshooting while writing the question so the error check isn't in the above code). Address is 0, the custom SIZEOF() function works in other mmap arguments, I am using MAP_ANONYMOUS so the fd and offsetparameters must -1 and 0 respectively.
Is there something wrong with the PROT_READ|PROT_WRITE sections?

You need to specify MAP_PRIVATE in your flags.
myStruct * entry = (myStruct *)mmap(0, SIZEOF(myStruct),
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
From the manual page:
The flags argument determines whether updates to the mapping are
visible to other processes mapping the same region, and whether
updates are carried through to the underlying file. This behavior is
determined by including exactly one of the following values in flags:
You need exactly one of the flags MAP_PRIVATE or MAP_SHARED - but you didn't give either of them.
A complete example:
#include <sys/mman.h>
#include <stdio.h>
typedef struct
{
int a;
int b;
} myStruct;
int main()
{
myStruct * entry = (myStruct *)mmap(0, sizeof(myStruct),
PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (entry == MAP_FAILED) {
printf("Map failed.\n");
}
else {
entry->a = 4;
printf("Success: entry=%p, entry->a = %d\n", entry, entry->a);
}
return 0;
}
(The above, without MAP_PRIVATE of course, is a good example of what you might have provided as a an MCVE. This makes it much easier for others to help you, since they can see exactly what you've done, and test their proposed solutions. You should always provide an MCVE).

The man page for mmap() says that you must specify exactly one of MAP_SHARED and MAP_PRIVATE in the flags argument. In your case, to act like malloc(), you'll want MAP_PRIVATE:
myStruct *entry = mmap(0, sizeof *entry,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
(I've also made this more idiomatic C by omitting the harmful cast and matching the sizeof to the actual variable rather than its type).

mmap() fails while devmem2 succeeds (C/CPP) [Allwinner A20]

I am trying to access the hardware registers of an A20 SOM by mapping them to userspace. In this case, target is the PIO, listed at physical address 0x01C20800.
The official Olimex Debian7 (wheezy) image is being used. Kernel Linux a20-olimex 3.4.90+
I was able to verify the location by using the devmem2 tool and Allwinner's documentation on the said memory space (switched the pinmode and level with devmem).
The mmap call on the other hand
*map = mmap(
NULL,
BLOCK_SIZE, // = (4 * 1024)
PROT_READ | PROT_WRITE,
MAP_SHARED,
*mem_fd,
*addr_p
);
fails with mmap error: Invalid argument
Here's a more complete version of the code: http://pastebin.com/mfEuVdbJ
Don't worry about the pointers as the same code does work when accessing UART0 at 0x01C28000.
Although only UART0 (and UART4), which is used as serial console.
I've decompiled the script.bin (still in use despite DTB) without success, as UART 0, 7 and 8 are enabled there.
I am also logged in as user root
I would still guess something related to permissions but I'm pretty lost right now since devmem has no problem at all
> root#a20-olimex:~# devmem2 0x01c20800 w /dev/mem opened. Memory mapped
> at address 0xb6f85000.

While sourcejedi didn't certainly fix my issue, he gave me the right approach.
I took a look at the forementioned devmem tool's source to discover that the mmap call's address is masked
address & ~MAP_MASK to get the entire page, which is essentially the same operation as in my comment.
However, to get back to the right place after the mapping has been done, you have to add the mask back
final_address = mapped_address + (target_address & MAP_MASK);
This resulted in following code (based on OP's pastebin)
Where
MAP_MASK = (sysconf(_SC_PAGE_SIZE) - 1) in this case 4095
int map_peripheral(unsigned long *addr_p, int *mem_fd, void **map, volatile unsigned int **addr)
{
if (!(*addr_p)) {
printf("Called map_peripheral with uninitilized struct.\n");
return -1;
}
// Open /dev/mem
if ((*mem_fd = open("/dev/mem", O_RDWR | O_SYNC)) < 0) {
printf("Failed to open /dev/mem, try checking permissions.\n");
return -1;
}
*map = mmap(
NULL,
MAP_SIZE,
PROT_READ | PROT_WRITE,
MAP_SHARED,
*mem_fd, // file descriptor to physical memory virtual file '/dev/mem'
*addr_p & ~MAP_MASK // address in physical map to be exposed
/************* magic is here **************************************/
);
if (*map == MAP_FAILED) {
perror("mmap error");
return -1;
}
*addr = (volatile unsigned int *)(*map + (*addr_p & MAP_MASK));
/************* and here ******************************************/
return 0;
}

if you read the friendly manual
EINVAL Invalid argument (POSIX.1)
is the error code. (Not EPERM!). So we look it up for the specific function
EINVAL We don't like addr, length, or offset (e.g., they are too large,
or not aligned on a page boundary).
BLOCK_SIZE, // 1024 - ?
You want a multiple of sysconf(_SC_PAGE_SIZE). In practice it will be 4096. I won't bother with the fully general math to calculate it - you'll find examples if you need it.

mmap on /proc/pid/mem

Has anybody succeeded in mmap'ing a /proc/pid/mem file with Linux kernel 2.6? I am getting an ENODEV (No such device) error. My call looks like this:
char * map = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, mem_fd, offset);
And I have verified by looking at the /proc/pid/maps file while debugging that, when execution reaches this call, offset has the value of the top of the stack minus PAGE_SIZE. I have also verified with ptrace that mmap is setting errno to ENODEV.

See proc_mem_operations in /usr/src/linux/fs/proc/base.c: /proc/.../mem does not support mmap.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

KVM Userspace Port I/O - c

Related

Memcpy Complete After Segfault

mmap() causing segmentation fault in C

Using mmap() instead of malloc()

mmap() fails while devmem2 succeeds (C/CPP) [Allwinner A20]

mmap on /proc/pid/mem

Categories

Resources