Linux PCI Driver Setup and Teardown

After looking at the kernel docs here: https://www.kernel.org/doc/Documentation/PCI/pci.txt I am lost as to the ordering of function calls to set up and tear down a PCI driver.
I have two questions:
For setup, does pci_enable_device() always come before
pci_request_regions()? The documentation seems to point to this
fact, but does state:
OS BUG: we don't check resource allocations before enabling those
resources. The sequence would make more sense if we called
pci_request_resources() before calling pci_enable_device().
Currently, the device drivers can't detect the bug when two
devices have been allocated the same range. This is not a common
problem and unlikely to get fixed soon. This has been discussed before but not changed as of 2.6.19: http://lkml.org/lkml/2006/3/2/194
However, after a quick look through the source code of several
drivers, the consensus is that pci_enable_device() always comes
first. Which one of these calls is supposed to come first and why?
For tearing down the driver, I get even more confused. Assuming pci_enable_device() comes first, I would expect that you first call pci_release_regions() prior to calling pci_disable_device() (i.e., following some symmetry). However, the kernel docs say that pci_release_regions() should come last. What makes matters more complicated is that I looked at many drivers and almost all of them had pci_release_regions() before pci_disable_device(), like I would expect. However, I then stumbled across this driver: https://elixir.bootlin.com/linux/v4.12/source/drivers/infiniband/hw/hfi1/pcie.c (code is reproduced below).
void hfi1_pcie_cleanup(struct pci_dev *pdev)
{
	pci_disable_device(pdev);

	/*
	 * Release regions should be called after the disable. OK to
	 * call if request regions has not been called or failed.
	 */
	pci_release_regions(pdev);
}
Which function is supposed to come first when tearing down the driver? It seems that drivers in the kernel itself can't agree.

The statement that gives the final say is the documentation's description of what pci_enable_device() does:
o wake up the device if it was in suspended state,
o allocate I/O and memory regions of the device (if BIOS did not),
o allocate an IRQ (if BIOS did not).
So it makes no sense to ask the kernel to reserve a resource that has not been assigned yet. In most cases the BIOS has already done the allocation, so pci_enable_device() has nothing left to allocate; in those cases either call can come first, but only deviate from the conventional order if you are absolutely sure.
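For reference, here is a minimal probe/remove skeleton in the order that most in-tree drivers follow (enable before request, release before disable). This is a sketch rather than code from any particular driver; the my_probe/my_remove/"my_driver" names are placeholders:

#include <linux/pci.h>

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err;

	/* Wake the device and let the core assign resources if the BIOS did not. */
	err = pci_enable_device(pdev);
	if (err)
		return err;

	/* Reserve the BARs so no other driver can claim them. */
	err = pci_request_regions(pdev, "my_driver");
	if (err)
		goto err_disable;

	/* ... ioremap BARs, request the IRQ, etc. ... */
	return 0;

err_disable:
	pci_disable_device(pdev);
	return err;
}

static void my_remove(struct pci_dev *pdev)
{
	/* ... free the IRQ, iounmap, etc. ... */
	pci_release_regions(pdev);	/* mirror image of probe */
	pci_disable_device(pdev);
}

As the hfi1 comment quoted in the question suggests, the reversed teardown order also works, since pci_disable_device() does not release the region reservations itself.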

How to create vm_area mapping if using __get_free_pages() with order greater than 1?

I am re-implementing mmap in a device driver for DMA.
I saw this question: Linux Driver: mmap() kernel buffer to userspace without using nopage, which has an answer that uses vm_insert_page() to map one page at a time; hence, for multiple pages, it has to be called in a loop. Is there another API that handles this?
Previously I used dma_alloc_coherent to allocate a chunk of memory for DMA and used remap_pfn_range to build a page table that associates the process's virtual memory with the physical memory.
Now I would like to allocate a much larger chunk of memory using __get_free_pages with an order greater than 1. I am not sure how to build the page table in that case. The reason is as follows:
I checked the book Linux Device Drivers and noticed the following:
Background:
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA.
Problem with remap_pfn_range:
remap_pfn_range won’t allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for.
The corresponding limitation of the scullp device driver, whose mmap implementation only supports get_free_pages allocations of order 0, i.e. a single page:
The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations.
May I know if there is a way to create a VMA for pages obtained using __get_free_pages with an order greater than 1?
I checked the Linux source code and noticed that some drivers re-implement struct dma_map_ops->alloc() and struct dma_map_ops->map_page(). May I know if this is the correct way to do it?
I think I got the answer to my question. Feel free to correct me if I am wrong.
I happened to see this patch: mm: Introduce new vm_map_pages() and vm_map_pages_zero() API while I was googling for vm_insert_page.
Previously, drivers had their own ways of mapping a range of kernel pages/memory into a user vma, and this was done by invoking vm_insert_page() within a loop.
As this pattern is common across different drivers, it can be generalized by creating new functions and using them across the drivers.
vm_map_pages() is the API that can be used to map kernel memory/pages in drivers that take vm_pgoff into account.
After reading it, I knew I had found what I wanted.
That function can also be found in the Linux Kernel Core API Documentation.
As for the difference between remap_pfn_range() and vm_insert_page(), which requires a loop for a list of contiguous pages, I found the answer to this question extremely helpful; it includes a link to an explanation by Linus.
As a side note, the patch mm: Introduce new vm_insert_range and vm_insert_range_buggy API shows that the earlier version of vm_map_pages() was called vm_insert_range(), but we should stick to vm_map_pages(), since that is the name under which the API was finally merged.
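To make this concrete, here is a rough sketch of an mmap handler built on vm_map_pages(); it is not taken from a real driver, the my_* names are made up, and it assumes the allocation is followed by split_page() so that every constituent page carries its own reference count (the refcount problem the LDD3 quote alludes to):

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define MY_ORDER 4	/* 2^4 = 16 physically contiguous pages, just an example */

static unsigned long my_buf;

static int my_alloc(void)
{
	my_buf = __get_free_pages(GFP_KERNEL, MY_ORDER);
	if (!my_buf)
		return -ENOMEM;

	/* Give each of the 2^MY_ORDER pages its own refcount so that they can
	 * be inserted into (and later unmapped from) a user VMA individually. */
	split_page(virt_to_page(my_buf), MY_ORDER);
	return 0;
}

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct page *pages[1 << MY_ORDER];
	unsigned long i;

	/* Build the struct page array that vm_map_pages() expects. */
	for (i = 0; i < (1 << MY_ORDER); i++)
		pages[i] = virt_to_page(my_buf + i * PAGE_SIZE);

	/* vm_map_pages() honours vma->vm_pgoff and replaces the old
	 * vm_insert_page() loop. */
	return vm_map_pages(vma, pages, 1 << MY_ORDER);
}

Note that after split_page() the buffer has to be freed page by page (with __free_page() on each constituent page) rather than with a single free_pages() call.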

Calling system calls from the kernel code

I am trying to create a mechanism to read performance counters for processes. I want this mechanism to be executed from within the kernel (version 4.19.2) itself.
I am able to do it from user space using the sys_perf_event_open() system call as follows.
syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
I would like to invoke this call from kernel space. I got the basic idea from here: How do I use a Linux System call from a Linux Kernel Module
Here are the steps I took to achieve this:
To make sure that kernel virtual addresses are treated as valid, I have used set_fs(), get_fs() and get_ds().
Since sys_perf_event_open() is declared in include/linux/syscalls.h, I have included that header in the code.
Eventually, the code for calling the system call looks something like this:
mm_segment_t fs;
fs = get_fs();
set_fs(get_ds());
long ret = sys_perf_event_open(&pe, pid, cpu, group_fd, flags);
set_fs(fs);
Even after these measures, I get an error claiming "implicit declaration of function ‘sys_perf_event_open’". Why is this popping up when the header file declaring it is already included? Does it have something to do with the way one should call system calls from within kernel code?
In general (not specific to Linux) the work done for system calls can be split into 3 categories:
switching from user context to kernel context (and back again on the return path). This includes things like changing the processor's privilege level, messing with gs, fiddling with stacks, and doing security mitigations (e.g. for Meltdown). These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
using a "function number" parameter to find the right function to call, and calling it. This typically includes some sanity checks (does the function exist?) and a table lookup, plus code to mangle input and output parameters that's needed because the calling conventions used for system calls (in user space) is not the same as the calling convention that normal C functions use. These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
the final normal C function that ends up being called. This is the function that you might have (see note) been able to call directly without using any of the expensive, useless and/or dangerous system call junk.
Note: If you aren't able to call the final normal C function directly without using (any part of) the system call junk (e.g. if the final normal C function isn't exposed to other kernel code); then you must determine why. For example, maybe it's not exposed because it alters user-space state, and calling it from kernel will corrupt user-space state, so it's not exposed/exported to other kernel code so that nobody accidentally breaks everything. For another example, maybe there's no reason why it's not exposed to other kernel code and you can just modify its source code so that it is exposed/exported.
Calling system calls from inside the kernel using the sys_* interface is discouraged for the reasons that others have already mentioned. In the particular case of x86_64 (which I guess is your architecture), starting from kernel version v4.17 it is a hard requirement not to use such an interface (save for a few exceptions). It was possible to invoke system calls directly prior to this version, but now the error you are seeing pops up (that's why there are plenty of tutorials on the web using sys_*). The alternative proposed in the Linux documentation is to define a wrapper between the syscall and the actual syscall's code that can be called from within the kernel like any other function:
int perf_event_open_wrapper(...) {
	// actual perf_event_open() code
}

SYSCALL_DEFINE5(perf_event_open, ...) {
	return perf_event_open_wrapper(...);
}
source: https://www.kernel.org/doc/html/v4.19/process/adding-syscalls.html#do-not-call-system-calls-in-the-kernel
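For perf in particular, you may not even need to add a wrapper yourself: the kernel already exports an in-kernel counterpart of the syscall's core logic, perf_event_create_kernel_counter(), along with perf_event_read_value() and perf_event_release_kernel(). A rough sketch of using it from kernel code (the my_* names are placeholders and error handling is minimal):

#include <linux/err.h>
#include <linux/perf_event.h>
#include <linux/sched.h>

static struct perf_event *my_event;

static int my_start_counting(struct task_struct *task)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_HARDWARE,
		.size	= sizeof(attr),
		.config	= PERF_COUNT_HW_INSTRUCTIONS,
	};

	/* cpu == -1: follow the task on whichever CPU it runs; no overflow handler. */
	my_event = perf_event_create_kernel_counter(&attr, -1, task, NULL, NULL);
	return IS_ERR(my_event) ? PTR_ERR(my_event) : 0;
}

static u64 my_read_counter(void)
{
	u64 enabled, running;

	return perf_event_read_value(my_event, &enabled, &running);
}

static void my_stop_counting(void)
{
	perf_event_release_kernel(my_event);
}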
Which kernel version are we talking about?
Anyhow, you could either get the address of sys_call_table by looking at the System.map file or, if it is exported, look the symbol up (have a look at kallsyms.h). Once you have the address of the syscall table, you may treat it as an array of void pointers (void **) and index it to find your desired function, i.e. sys_call_table[__NR_open] would be open's address, so you could store it in a pointer and then call it.
Edit: What are you trying to do, and why can't you do it without calling syscalls? You must understand that syscalls are the kernel's API to userland and should not really be used from inside the kernel, so such a practice should be avoided.
calling system calls from kernel code
(I am mostly answering that title; to summarize: it is forbidden to even think of that)
I don't understand your actual problem (I feel you need to explain it more in your question, which is unclear and lacks a lot of useful motivation and context). But a general piece of advice, following the Unix philosophy, is to minimize the size and vulnerability surface of your kernel or kernel-module code, and to move such code, as much as is convenient, to user-land (in particular with the help of systemd) as soon as your kernel code requires some system calls. Your question is by itself a violation of most Unix and Linux cultural norms.
Have you considered using efficient kernel-to-user-land communication, in particular netlink(7) with socket(7)? Perhaps you also want some driver-specific kernel thread.
My intuition would be that (in some user-land daemon started from systemd early at boot time) AF_NETLINK with socket(2) is exactly what fits your (unexplained) needs. And eventfd(2) might also be relevant.
But just thinking of using system calls from inside the kernel triggers a huge flashing red light in my brain and I tend to believe it is a symptom of a major misunderstanding of operating system kernels in general. Please take time to read Operating Systems: Three Easy Pieces to understand OS philosophy.

RTEMS: how to get DMA-accessible memory

I'm implementing an RTEMS driver for an Ethernet card by porting it from Linux. Much of the work is done: processor I/O mode is working OK, as is interrupt handling. Now I'm having problems implementing DMA.
Specifically, the Linux driver I use as a base calls dma_alloc_coherent(). This function returns two different addresses: one is the address that the driver code (host CPU) will see, and the other is the address that the card will use to access the same memory region via PCI during DMA.
I'm having trouble finding an appropriate replacement function. First I thought of using malloc() and then pci_pci2cpu to translate this address to the one the card can access; however, pci_pci2cpu returns 0xFFFFFFFF for IO and 0x0 for the remaining two modes.
The second approach I considered is using the dual-ported memory manager, but I can't find useful examples of its usage. For example, the rtems_port_create() function requires the pointers *internal_start and *external_start to be provided, but I'm not sure where these pointers come from.
I use Gaisler RTEMS version 4.11 and Sparc architecture (LEON3 cpu).
Ok basically I got this figured out.
First, RTEMS has a flat memory model, so the address that malloc() returns is the actual physical address in memory. That means I don't need dma_alloc_coherent(), as malloc() already does the same thing. For aligned memory I used posix_memalign(), which is also supported.
Second, I needed to see whether there is any address translation between the card and memory. This is not related to RTEMS but rather to the system architecture, so after looking into the GRLIB user manual and at the RTEMS initialization code for the grpci2 core, I found that there is no address translation (it's set up 1:1).
The bottom line is that if I allocate a buffer with a simple malloc and give that address to the PCI card, the card will be able to access (read/write) this buffer.
These were all assumptions that I started with, but in the end my problems were caused by a faulty DMA chip. :)
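For what it's worth, the allocation side then boils down to something like the sketch below; alloc_dma_ring() and the commented-out register write are made-up placeholders, and it assumes the 1:1 grpci2 mapping described above:

#include <stdint.h>
#include <stdlib.h>

#define DMA_RING_SIZE   4096
#define DMA_RING_ALIGN  4096	/* assuming the card wants a page-aligned ring */

static void *dma_ring;

static int alloc_dma_ring(void)
{
	/* With a flat 1:1 memory map, the CPU address returned here is also
	 * the bus address that must be programmed into the card. */
	if (posix_memalign(&dma_ring, DMA_RING_ALIGN, DMA_RING_SIZE) != 0)
		return -1;

	/* card_write_dma_base() stands in for the card-specific register
	 * write that hands the ring address to the device:
	 * card_write_dma_base((uint32_t)(uintptr_t)dma_ring); */

	return 0;
}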
I'm not sure I got the question right but anyway:
RTEMS does not implement a handler for the LEON family DMA.
To use the DMA you need to use the LEON structure you can find in the leon.h header file.
That structure is mapped to the memory addresses of the LEON3 processor.
Alternatively, you can address the registers directly.
After that you need to go to http://www.gaisler.com/index.php/products/components/ut699
and download the functional manual of the UT699 (or search for the SoC you are using :) ).
There you will find how to write the registers in the correct order to initiate a DMA transfer from/to a PCI target.

How to use Readlink

How do I use Readlink for fetching the values?
The answer is:
Don't do it
At least not in the way you're proposing.
You specified a solution here without specifying what you really want to do [and why?]. That is, what are your needs/requirements? Assuming you get it, what do you want to do with the filename? You posted a bare fragment of your userspace application but didn't post any of your kernel code.
As a long time kernel programmer, I can tell you that this won't work, can't work, and is a terrible hack. There is a vast difference in methods to use inside the kernel vs. userspace.
/proc is strictly for userspace applications to snoop on kernel data. The /proc filesystem drivers assume userspace, so they always do copy_to_user. Data will be written to user address space, and not kernel address space, so this will never work from within the kernel.
Even if you could use /proc from within the kernel, it is a genuinely awful way to do it.
You can get the equivalent data, but it's a bit more complicated than that. If you're intercepting the read syscall inside the kernel, you [already] have access to the current task struct and the fd number used in the call. From this, you can locate the struct for the given open file, and get whatever you want, directly, without involving /proc at all. Use this as a starting point.
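As a hedged illustration of that starting point, assuming you are running in the context of the task that owns the fd (print_fd_path() is a made-up helper, not an existing kernel function):

#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/limits.h>
#include <linux/printk.h>
#include <linux/slab.h>

static int print_fd_path(unsigned int fd)
{
	struct file *filp;
	char *buf, *name;

	filp = fget(fd);		/* takes a reference on current's open file */
	if (!filp)
		return -EBADF;

	buf = kmalloc(PATH_MAX, GFP_KERNEL);
	if (!buf) {
		fput(filp);
		return -ENOMEM;
	}

	name = d_path(&filp->f_path, buf, PATH_MAX);	/* returns a pointer inside buf */
	if (!IS_ERR(name))
		pr_info("fd %u -> %s\n", fd, name);

	kfree(buf);
	fput(filp);
	return 0;
}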
Note that doing this will require you to read kernel documentation, sources for filesystem drivers, syscalls, etc.: how to lock data structures and lists with the various locking methods (e.g. RCU, rw locks, spinlocks); per-CPU variables; kernel thread preemption; how to properly traverse the necessary filesystem-related lists and structs to get the information you want. All this without causing lockups, panics, segfaults, deadlocks, or UB based on stale or inconsistent/dynamically changing data.
You'll need to study all this to become familiar with the way the kernel does things internally, and understand it, before you try doing something like this. If you had, you would have read the source code for the /proc drivers and already known why things were failing.
As a suggestion, forget anything that you've learned about how a userspace application does things. It won't apply here. Internally, the kernel is organized in a completely different way than what you've been used to.
You have no need to use readlink inside the kernel in this instance. That's the way a userspace application would have to do it, but in the kernel it's like driving 100 miles out of your way to get data you already have nearby, and, as I mentioned previously, won't even work.

Get user stackpointer from task_struct

I have a kcore and I want to get a userspace backtrace from it, because someone in our application is making a lot of munmap calls and hanging the system (CPU soft lockup 22s!). I looked at some macros, but they still only give me the kernel backtrace. What I want is the userspace backtrace.
The good news is I have a pointer to the task_struct.
task_struct->thread->sp (Kernel stack pointer)
task_struct->thread->usersp (user stack pointer) but this is junk
My question is how to get a userspace backtrace from kcore or task_struct.
First of all, a vmcore is an immediate full memory snapshot, so it contains all pages (including user pages). But if the user pages are swapped out, they can't be accessed. That is why kdump (and similar tools such as your gdb python script) focuses on kernel debugging functionality only. For userspace debugging and stacktraces you have to use coredump functionality. By default, coredumps are produced when the kernel sends (for example) SIGSEGV to your app, but you can create them whenever you want by using gcore or by modifying the kernel. There is also a "userspace" way of making a process dump; see the google-coredumper project.
Also, you can try to unwind the user stacktrace directly from kcore, but this is tricky, and you will have to hope that the userspace stack is not swapped out at the moment (do you use swap?). Have a look at __save_stack_trace_user; it will give you a sense of how to retrieve the userspace context.
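On a related note about the "junk" user stack pointer: the user-mode register state saved at kernel entry can be reached through task_pt_regs(). A hedged sketch, assuming x86_64 (dump_user_regs() is a made-up helper, and the values are only meaningful for a task that is currently inside the kernel, not for kernel threads):

#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>	/* task_pt_regs() */

static void dump_user_regs(struct task_struct *task)
{
	/* pt_regs saved on the kernel stack when the task entered the kernel;
	 * sp and ip hold the user-space stack and instruction pointers. */
	struct pt_regs *regs = task_pt_regs(task);

	pr_info("%s[%d]: user sp=%lx ip=%lx\n",
		task->comm, task->pid, regs->sp, regs->ip);
}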
First of all, vmcores typically don't contain user pages. I'm unaware of any magic that would help here: you would have to inspect the VM mappings for the given task's address space and then inspect the physical pages, and I highly doubt the debugger knows how to do that.
But most importantly you likely don't have any valid reason to do it in the first place.
So, what are you trying to achieve?
=======================
Given the edit:
some one from our application is making lot of munmap and making the
system hang(CPU soft lockup 22s!).
There may or may not be an actual kernel issue which must be debugged. I don't see any use for userspace stacktraces for this one though.
So, as I understand it, the presumed issue is excessive mmap + munmap calls from the application. Inspecting the backtrace of the thread reported in said lockup may or may not happen to catch the culprit. What you really want is to collect backtraces of /all/ callers and sort them by frequency. This can be done (albeit with pain) with systemtap.
