How does QEMU emulate PCIe devices?

I'm writing an open source document about QEMU internals, so if you help me you're also helping the QEMU project grow.
The closest answer I found was to the question "In which conditions the ioctl KVM_RUN returns?"
This is the thread loop for a single CPU running on KVM:
static void *qemu_kvm_cpu_thread_fn(void *arg)
{
    CPUState *cpu = arg;
    int r;

    rcu_register_thread();

    qemu_mutex_lock_iothread();
    qemu_thread_get_self(cpu->thread);
    cpu->thread_id = qemu_get_thread_id();
    cpu->can_do_io = 1;
    current_cpu = cpu;

    r = kvm_init_vcpu(cpu);
    if (r < 0) {
        error_report("kvm_init_vcpu failed: %s", strerror(-r));
        exit(1);
    }

    kvm_init_cpu_signals(cpu);

    /* signal CPU creation */
    cpu->created = true;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_guest_random_seed_thread_part2(cpu->random_seed);

    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu);
    } while (!cpu->unplug || cpu_can_run(cpu));

    qemu_kvm_destroy_vcpu(cpu);
    cpu->created = false;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_mutex_unlock_iothread();
    rcu_unregister_thread();
    return NULL;
}
You can see here:
    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu);
    } while (!cpu->unplug || cpu_can_run(cpu));
that every time KVM returns, QEMU gets an opportunity to emulate things. I suppose that when the kernel on the guest tries to access a PCIe device, KVM on the host returns. What I don't know is how KVM knows when to return. Maybe KVM maintains the addresses of the PCIe device and tells Intel's VT-d or AMD's IOV which addresses should generate an exception. Can someone clarify this?
Well, by the look of qemu_kvm_cpu_thread_fn, the only place where a PCIe access could be emulated is qemu_wait_io_event(cpu), which is defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1266. It calls qemu_wait_io_event_common, defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1241, which in turn calls process_queued_cpu_work, defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus-common.c#L309.
Let's look at the code which executes the queued functions:
    while (cpu->queued_work_first != NULL) {
        wi = cpu->queued_work_first;
        cpu->queued_work_first = wi->next;
        if (!cpu->queued_work_first) {
            cpu->queued_work_last = NULL;
        }
        qemu_mutex_unlock(&cpu->work_mutex);
        if (wi->exclusive) {
            /* Running work items outside the BQL avoids the following deadlock:
             * 1) start_exclusive() is called with the BQL taken while another
             * CPU is running; 2) cpu_exec in the other CPU tries to take the
             * BQL, so it goes to sleep; start_exclusive() is sleeping too, so
             * neither CPU can proceed.
             */
            qemu_mutex_unlock_iothread();
            start_exclusive();
            wi->func(cpu, wi->data);
It looks like the only thing the vCPU thread qemu_kvm_cpu_thread_fn can do when KVM returns is execute the queued functions:
    wi->func(cpu, wi->data);
This would mean that a PCIe device has to constantly queue itself as a function for QEMU to execute, and I don't see how that would work.
The functions that are able to queue work on this CPU have run_on_cpu in their names. Searching for it in VS Code, I found some functions that queue work, but none related to PCIe or even emulation. The nicest function I found was this one, which apparently patches instructions: https://github.com/qemu/qemu/blob/stable-4.2/hw/i386/kvmvapic.c#L446. Nice, I wanted to know about that too.
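For reference, this is roughly how such work items get queued. A minimal sketch (my_work is an illustrative name; run_on_cpu() and RUN_ON_CPU_NULL are QEMU's real API):

static void my_work(CPUState *cpu, run_on_cpu_data data)
{
    /* Runs on the target vCPU thread, between two KVM_RUN calls. */
}

/* Queue the item on cpu's work list and wait for it to finish: */
run_on_cpu(cpu, my_work, RUN_ON_CPU_NULL);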

Device emulation (of all devices, not just PCI) under KVM gets handled by the "case KVM_EXIT_IO" (for x86-style I/O ports) and "case KVM_EXIT_MMIO" (for memory-mapped I/O, including PCI) in the "switch (run->exit_reason)" inside kvm_cpu_exec(). qemu_wait_io_event() is unrelated to device emulation.
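Condensed from kvm_cpu_exec() in the 4.2 tree's kvm-all.c (declarations, error handling and the many other exit reasons are trimmed), the dispatch looks like this:

do {
    /* Enter the guest; blocks until KVM has something for userspace.
     * run points at the kvm_run structure shared with the kernel. */
    run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);

    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        /* The guest executed in/out on an emulated port. */
        kvm_handle_io(run->io.port, attrs,
                      (uint8_t *)run + run->io.data_offset,
                      run->io.direction, run->io.size, run->io.count);
        ret = 0;
        break;
    case KVM_EXIT_MMIO:
        /* The guest touched an emulated MMIO address, e.g. a PCI BAR. */
        address_space_rw(&address_space_memory,
                         run->mmio.phys_addr, attrs,
                         run->mmio.data, run->mmio.len, run->mmio.is_write);
        ret = 0;
        break;
    /* ... */
    }
} while (ret == 0);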
Want to know how execution gets to "emulate a register read on a PCI device"? Run QEMU under gdb, set a breakpoint on, say, the register read/write function for the Ethernet PCI card you're using, and then, when you get dropped into the debugger, look at the stack backtrace. (Compile QEMU with --enable-debug to get better debug info for this kind of thing.)
PS: If you're examining QEMU internals for educational purposes, you'd be better off using the current code, not a year-old release of it.

Related

Synchronizing a usermode application to a kernel driver

We have a driver that schedules work with a timer (add_timer). Whenever the work results in a change, an application should be notified of this change.
Currently this is done by providing a sysfs entry from the driver, that blocks a read operation until there is a change in the data.
(Please see the relevant code from the driver and the application in the code block below.)
I have inspected the source of most functions related to this in the Linux kernel (4.14.98), and I did not notice any obvious problems in dev_attr_show, sysfs_kf_seq_show, __vfs_read and vfs_read.
However, in seq_read the mutex file->private_data->lock is held for the duration of the read:
https://elixir.bootlin.com/linux/v4.14.98/source/fs/seq_file.c#L165
Will this pose a problem?
Are there any other (potential?) problems that I should be aware of?
Please note that:
The data will change at least once per second, usually way faster
The application should respond as soon as possible to a change in the data
This runs in a controlled (embedded) environment
// driver.c
static DECLARE_WAIT_QUEUE_HEAD(wq);
static int xxx_block_c = 0;

// The driver calls this to notify the application that something changed
// (from a timer, `timer_list`).
void xxx_Persist(void)
{
    xxx_block_c = 1;
    wake_up_interruptible(&wq);
}

// Sysfs entry that blocks until there is a change in the data.
static ssize_t xxx_show(struct device *dev,
                        struct device_attribute *attr,
                        char *buf)
{
    if (wait_event_interruptible(wq, xxx_block_c != 0))
        return -ERESTARTSYS;
    xxx_block_c = 0;
    /* Remainder of the implementation */
}
// application.cpp
std::ifstream xxx;
xxx.open("xxx", std::ios::binary | std::ios::in);
while (true)
{
    xxx.clear();
    xxx.seekg(0, std::ios_base::beg);
    xxx.read(data, size);
    /* Do something with data */
}
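For comparison, the stock sysfs notification mechanism is sysfs_notify() in the driver paired with poll() in the application; a minimal sketch under that assumption (file descriptor and attribute names are illustrative):

/* driver side, after updating the data: */
sysfs_notify(&dev->kobj, NULL, "xxx");

/* application side, waiting for the attribute to change: */
struct pollfd pfd = { .fd = fd, .events = POLLERR | POLLPRI };
poll(&pfd, 1, -1);          /* returns when sysfs_notify() fires */
lseek(fd, 0, SEEK_SET);     /* rewind, then re-read the value */
read(fd, buf, sizeof(buf));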

Get maximum available register for Linux PCI device

I am currently debugging a Linux kernel driver.
I want to sweep a PCI device's MMIO registers to scan for certain information.
This is the function I wrote so far.
void _sweep_registers(struct pci_dev *dev)
{
    int i;
    u32 activecontrolstatus;
    u32 activestatus;

    /* readl() returns 32 bits, so advance 4 bytes per read;
     * privdata holds the driver's private data with the ioremap'ed base. */
    for (i = 0; i < AMD_P2C_MSG_INTSTS; i += 4) {
        activecontrolstatus = readl(privdata->mmio + i);
        activestatus = activecontrolstatus >> 4;
        dev_info(&dev->dev, "activecontrolstatus = 0x%x / activestatus = 0x%x",
                 activecontrolstatus, activestatus);
    }
}
Currently I am reading MMIO up to the offset specified by AMD_P2C_MSG_INTSTS (which is 0x10694).
But how far can I actually go?
I have zero knowledge of Linux kernel development and only rudimentary knowledge of C.
Background
My goal is to find information about which sensors of the AMD Sensor Fusion Hub are marked as active.
They should be under register 0x1068C, but on my system that register reads 0x0, even though at least an accelerometer is available, so the bitmask should match at least 0x1.
I want to see whether they are stored somewhere else.
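As for how far a sweep can safely go: the MMIO window is only as large as the PCI BAR backing it, and that size can be queried with pci_resource_len(). A sketch, assuming the region lives in BAR 2 (use whichever BAR the driver actually ioremaps):

#include <linux/pci.h>
#include <linux/io.h>

static void sweep_bar(struct pci_dev *dev, void __iomem *mmio)
{
    resource_size_t len = pci_resource_len(dev, 2);  /* BAR index is an assumption */
    resource_size_t off;

    for (off = 0; off < len; off += 4)
        dev_info(&dev->dev, "0x%llx = 0x%x",
                 (unsigned long long)off, readl(mmio + off));
}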

concurrent I/O - buffers corruption, block device driver

I am developing a layered block device driver. I intercept WRITE requests and encrypt the data, and decrypt the data in the end_bio() routine (while processing a READ request).
All of this works fine in a single stream, but I get buffer content corruption when I/O is performed from two or more processes simultaneously. I don't have any local storage for buffers.
Do I need to account for BIO merging in my driver?
Does the Linux I/O subsystem have requirements related to the number of concurrent I/O requests?
Are there some tips and tricks related to stack usage or compilation?
This is under kernel 4.15.
At the moment I use the following construction to run over the disk sectors:
/*
 * A portion of the bio_copy_data() ...
 */
for (vcnt = 0, src_iter = src->bi_iter; ; vcnt++)
{
    if (!src_iter.bi_size)
    {
        if (!(src = src->bi_next))
            break;
        src_iter = src->bi_iter;
    }

    src_bv = bio_iter_iovec(src, src_iter);
    src_p = bv_page = kmap_atomic(src_bv.bv_page);
    src_p += src_bv.bv_offset;
    nlbn = src_bv.bv_len / 512;

    for ( ; nlbn--; lbn++, src_p += 512)
    {
        /* Simulate processing of the data in the I/O buffer */
        char *srcp = src_p, *dstp = src_p;
        int count = DUDRV$K_SECTORSZ;
        while (count--)
        {
            *(dstp++) = ~(*(srcp++));
        }
    }

    kunmap_atomic(bv_page);
    bio_advance_iter(src, &src_iter, src_bv.bv_len);
}
Is this correct? Or do I need to use something like bio_for_each_segment(bvl, bio, iter)?
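For reference, a minimal sketch of what that macro does: it walks the segments of a single bio via its bi_iter (note that it does not follow the bi_next chain):

struct bio_vec bvl;
struct bvec_iter iter;

bio_for_each_segment(bvl, bio, iter) {
    char *p = kmap_atomic(bvl.bv_page);
    /* process bvl.bv_len bytes starting at p + bvl.bv_offset */
    kunmap_atomic(p);
}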
The root of the problem is a "feature" of the block I/O methods. In particular (see the description in the kernel's Documentation/block/biovecs.txt):
    Biovecs can be shared between multiple bios - a bvec iter can represent an arbitrary range of an existing biovec, both starting and ending midway through biovecs. This is what enables efficient splitting of arbitrary bios. Note that this means we only use bi_size to determine when we've reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes bi_size into account when constructing biovecs.
So, in my case this is what caused the overrun of the buffer holding the disk sector.
The trick is to set REQ_NOMERGE_FLAGS in .bi_opf before sending the BIO to the backing device driver.
The second reason is that the .bi_iter returned by the backing device driver is no longer the original one. So we need to save it (before submitting the BIO request to the backend) and restore it in our bio_endio() routine.
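A sketch of that save/restore idea (my_ctx, my_submit and my_end_io are names of my own invention; generic_make_request() is the 4.15-era submission API):

#include <linux/bio.h>
#include <linux/slab.h>

struct my_ctx {
    struct bvec_iter saved_iter;
    bio_end_io_t *orig_end_io;
    void *orig_private;
};

static void my_end_io(struct bio *bio)
{
    struct my_ctx *ctx = bio->bi_private;

    bio->bi_iter = ctx->saved_iter;   /* the backend consumed the iter; restore it */
    /* ... decrypt the range described by bio->bi_iter here ... */
    bio->bi_end_io = ctx->orig_end_io;
    bio->bi_private = ctx->orig_private;
    kfree(ctx);
    bio_endio(bio);                   /* complete the original request */
}

static void my_submit(struct bio *bio)
{
    struct my_ctx *ctx = kmalloc(sizeof(*ctx), GFP_NOIO);

    ctx->saved_iter = bio->bi_iter;   /* save before the backend advances it */
    ctx->orig_end_io = bio->bi_end_io;
    ctx->orig_private = bio->bi_private;
    bio->bi_private = ctx;
    bio->bi_end_io = my_end_io;
    generic_make_request(bio);
}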
Have you considered using vmap with global synchronization instead?
The use of kmap_atomic has some restrictions:
    Since the mapping is restricted to the CPU that issued it, it performs well, but the issuing task is therefore required to stay on that CPU until it has finished, lest some other task displace its mappings. kmap_atomic() may also be used by interrupt contexts, since it does not sleep, and the caller may not sleep until after kunmap_atomic() is called.
Reference: https://www.kernel.org/doc/Documentation/vm/highmem.txt
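A sketch of the vmap() alternative (map_pages is an illustrative helper; the caller still has to gather the pages and provide the global synchronization mentioned above):

#include <linux/vmalloc.h>

/* Map a set of pages into one contiguous kernel virtual range, so the
 * transform may sleep and migrate between CPUs (unlike kmap_atomic). */
static void *map_pages(struct page **pages, unsigned int npages)
{
    return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
}

/* ... process the buffer, then release the mapping with vunmap(addr); */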

Where is the kernel time stored in memory in Linux?

For an assignment I have to use a video driver and a system timer handler to display the current running time of the Linux system in the corner of the screen.
However, I have not found anything that points me in the direction of obtaining the system time from the kernel when my program runs. I am guessing it is in kernel memory at some address, and I can just do something like:
hour = get_word(MEM_LOCATION_OF_HOUR);
sec = get_word(MEM_LOCATION_OF_SEC);
etc.
But I cannot find out whether this is possible. My guess is that we are not allowed to use library calls like clock(), but if that is the only possible way then maybe we are.
Thanks
Can't use library calls? That's just craziness. Anyway:
getnstimeofday(struct timespec *ts); is one of many methods from the kernel's timekeeping API.
In the kernel, ktime can be used.
A simple example (calculating a time difference) for your reference:
#include <linux/ktime.h>

int fun(void)
{
    ktime_t entry_stamp, now;
    s64 delta;

    /* Get the current time. */
    entry_stamp = ktime_get();

    /* Do your stuff... */
    now = ktime_get();

    delta = ktime_to_ns(ktime_sub(now, entry_stamp));
    printk(KERN_INFO "Time taken: %lld ns to execute\n", (long long)delta);
    return 0;
}
I have found that the real-time clock holds the correct values on boot. The CMOS contains all the needed info.
Here is a link to what I found: http://wiki.osdev.org/CMOS#The_Real-Time_Clock
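A sketch of reading it, based on that page (seconds live at offset 0x00, minutes at 0x02, hours at 0x04; many machines store these BCD-encoded, which this ignores):

#include <linux/io.h>

/* Read one CMOS/RTC register through the index (0x70) and data (0x71) ports. */
static unsigned char cmos_read(unsigned char reg)
{
    outb(reg, 0x70);
    return inb(0x71);
}

/* e.g.: hour = cmos_read(0x04); sec = cmos_read(0x00); */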

Delay using timer on raspberry pi

I need to create an accurate delay (around 100us) inside a thread function. I tried using the nanosleep function, but it was not accurate enough. I read some posts about how to read the hardware 1 MHz timer, so in my function, in order to create a 100us delay, I tried something like this:
prev = *timer;
do {
    t = *timer;
} while ((t - prev) < 100);
However, the program seems to stay inside the loop forever. But if I insert a small nanosleep inside the loop, it works (though losing precision):
sleeper.tv_sec = 0;
sleeper.tv_nsec = (long)(1);

prev = *timer;
do {
    nanosleep(&sleeper, &dummy);
    t = *timer;
} while ((t - prev) < 500);
I tried the first version in a stand-alone program and it works, but in my main program, where this runs inside a thread, it does not.
Does anyone know why the first version (without the small nanosleep) does not work?
I'm sorry to say, but Raspberry Pi's OS is not a "real-time OS". In other words, you won't get consistent 100us precision in a user-space program, due to inherent OS scheduling limitations. If you need that kind of precision, you should use an embedded controller like an Arduino.
