Clarification regarding PCI device initialization

Wikipedia says:
To address a PCI device, it must be enabled by being mapped into the system's I/O port address space or memory-mapped address space. The system's firmware, device drivers or the operating system program the Base Address Registers (commonly called BARs) to inform the device of its address mapping by writing configuration commands to the PCI controller.
Does this mean that a PCI device gets initialized when an address is written to the BAR? I'm asking because I'm trying to initialize the Bochs VGA card on QEMU AArch64 from bare-metal code. Thanks!

Writing to the BAR simply tells the device what address range it should respond to. (It doesn't even enable the device to respond to that range; for that you need to set MSE [Memory Space Enable] in the PCI command register.) There are typically many steps needed to initialize a device. Some of them are common to all PCI devices, and others are completely device-specific.
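To make the difference between the two steps concrete, here is a minimal sketch in C. It assumes a hypothetical fake_pci_dev type that models the 256-byte configuration space as a plain array; on real hardware (or in QEMU) the same reads and writes would go through the platform's configuration access mechanism instead. The offsets (command register at 0x04, BAR0 at 0x10) and the Memory Space Enable bit (bit 1 of the command register) are the standard ones; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Standard PCI configuration header offsets and command-register bit. */
#define PCI_COMMAND         0x04u
#define PCI_COMMAND_MEMORY  0x0002u  /* MSE: respond to memory-space accesses */
#define PCI_BASE_ADDRESS_0  0x10u

/* Hypothetical stand-in for one device's configuration space. */
typedef struct {
    uint8_t cfg[256];
} fake_pci_dev;

static uint16_t cfg_read16(const fake_pci_dev *d, unsigned off) {
    uint16_t v;
    assert(off + sizeof v <= sizeof d->cfg);
    memcpy(&v, &d->cfg[off], sizeof v);
    return v;
}

static void cfg_write16(fake_pci_dev *d, unsigned off, uint16_t v) {
    assert(off + sizeof v <= sizeof d->cfg);
    memcpy(&d->cfg[off], &v, sizeof v);
}

static uint32_t cfg_read32(const fake_pci_dev *d, unsigned off) {
    uint32_t v;
    assert(off + sizeof v <= sizeof d->cfg);
    memcpy(&v, &d->cfg[off], sizeof v);
    return v;
}

static void cfg_write32(fake_pci_dev *d, unsigned off, uint32_t v) {
    assert(off + sizeof v <= sizeof d->cfg);
    memcpy(&d->cfg[off], &v, sizeof v);
}

/* Step 1: tell the device which address range it owns. */
static void assign_bar0(fake_pci_dev *d, uint32_t base) {
    cfg_write32(d, PCI_BASE_ADDRESS_0, base);
}

/* Step 2: allow the device to actually respond to that range. */
static void enable_memory_decoding(fake_pci_dev *d) {
    uint16_t cmd = cfg_read16(d, PCI_COMMAND);
    cfg_write16(d, PCI_COMMAND, (uint16_t)(cmd | PCI_COMMAND_MEMORY));
}

static bool memory_decoding_enabled(const fake_pci_dev *d) {
    return (cfg_read16(d, PCI_COMMAND) & PCI_COMMAND_MEMORY) != 0;
}
```

After assign_bar0() alone the BAR holds the address but memory_decoding_enabled() is still false; only enable_memory_decoding() flips that. Any device-specific setup (for the Bochs VGA card, programming its display registers) would come after both steps.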

Related

Kernel-Mode Driver Development in Windows

I'm developing a new Kernel-Mode driver, that should run on Windows 10 (64-bit).
This driver should allocate 48GB of contiguous physical memory and map its base address to a virtual address in the user space of the Windows application that will use it. The system has 64GB of RAM installed, so it may be necessary to dedicate a segment of memory to this use, perhaps by changing boot entry information.
In addition, the driver should somehow reveal this base address to an FPGA-based device located in a PCIe slot. The purpose is to use the 48GB of contiguous physical memory as a DMA (Direct Memory Access) buffer. Namely, the FPGA-based device will generate data and write it to the appropriate location in the buffer. The host software will read from that location in a cyclic fashion. That is, the FPGA will overwrite the data in the buffer, and the host will try to keep pace and read the data before it is overwritten.
Please note that this question only deals with the host side (the driver), and not with the FPGA side.
So, basically my questions are:
How do I make such an allocation (as described above)?
How do I map the base address from the virtual address in the Kernel-Space, to the appropriate virtual address in the User-Space?
How do I reveal that base address to the FPGA (located on the PCIe slot), so it will know where to perform its write operations?
What other Callback-Functions should this driver implement, in terms of events handling?
Thanks a lot!

Is virtual memory used when using Port-mapped I/O?

If I have a Memory-mapped I/O device, and I want to write to a register for this device located at address 0x16D34, the 0x16D34 address is actually a virtual address, and the CPU will translate it to a physical address first, and then write the data to the physical address.
But what about port-mapped I/O devices (for example, a serial port)? If I want to write to a register of a serial port located at address 0x3F8, is 0x3F8 a physical address or a virtual address?
Edit: I am on x86 architecture.

Port-mapped I/O on x86/x86-64 (most other modern architectures don't even support it) happens in an entirely separate address space. This address space is not subject to memory mapping, so there are no virtual port addresses, only physical ones. Special in and out instructions must be used to perform port I/O; simple memory access (e.g. with mov) can't reach this separate address space. Access protection based on privilege level is possible; most modern OSes prevent user space processes from accessing I/O ports by default.
For details, you can for example check the chapter "INPUT/OUTPUT" of Intel's "Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 1" (chapter 18 as of this writing).
Note that in the early days of x86, port addresses were hardwired in each device, including ISA add-in cards. If you were lucky, the card had a set of jumpers for selecting one of a limited set of possible port ranges for the device, in order to avoid range clashes between devices. Later, Plug & Play was introduced to make the selection dynamically during system boot. PCI further refined this, so that I/O BARs can pretty much be mapped anywhere within the 0x0000-0xffff address space by the operating system and/or firmware. Port-mapped I/O is now strongly discouraged when designing new hardware due to its many inherent limitations.
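As a small illustration of the last point, this C sketch decodes a raw 32-bit BAR value. It relies on the standard PCI encoding: bit 0 set means an I/O space BAR (the low two bits are flags), bit 0 clear means a memory BAR (the low four bits are flags). The function names are made up for this example, and 64-bit memory BARs, which span two registers, are deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BAR_SPACE_IO  0x1u     /* bit 0: 1 = I/O space, 0 = memory space */
#define PCI_BAR_IO_MASK   (~0x3u)  /* I/O BAR: low 2 bits are flags */
#define PCI_BAR_MEM_MASK  (~0xFu)  /* memory BAR: low 4 bits are flags */

/* True if the BAR describes a range in the I/O port address space. */
static bool bar_is_io(uint32_t bar) {
    return (bar & PCI_BAR_SPACE_IO) != 0;
}

/* Strip the flag bits and return the base address the BAR was programmed with. */
static uint32_t bar_base(uint32_t bar) {
    return bar_is_io(bar) ? (bar & PCI_BAR_IO_MASK) : (bar & PCI_BAR_MEM_MASK);
}

/* On x86 an I/O BAR is only usable if its base lies in the 16-bit port space. */
static bool io_base_fits_x86_port_space(uint32_t bar) {
    assert(bar_is_io(bar));
    return bar_base(bar) <= 0xFFFFu;
}
```

For example, a raw value of 0x0000C001 decodes as an I/O BAR at port 0xC000, while 0xFEB00008 decodes as a memory BAR at physical address 0xFEB00000. Only the second kind of address ever passes through the CPU's MMU.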

It seems your question is about the differences between memory-mapped I/O and port-mapped I/O. There are normally two methods for a processor to connect to external devices: memory-mapped I/O and port-mapped I/O.
Memory-mapped I/O
Memory-mapped I/O uses the same address space to address both memory and I/O devices. So when an address is accessed by the CPU, it may refer to a portion of physical RAM, but it can also refer to memory of the I/O device (Based on Memory-mapped I/O on Wiki).
The value 0x16D34 in your first example would be a virtual address, which is mapped to a physical address. The I/O device responds to that same physical address, which is what allows the CPU to access it.
Port mapped I/O
Port mapped I/O uses a separate, dedicated address space and is accessed via a dedicated set of microprocessor instructions. The 0x3F8 in your second example is an address in that dedicated I/O address space. It is not an address in the space shared between memory and I/O devices, as described above for memory-mapped I/O. You can find more detail in "Memory-mapped IO vs Port-mapped IO".

Coherently understand the software-hardware interaction with regard to DMA and buses

I've gathered some knowledge about several components (both software and hardware) involved in general DMA transactions on ARM-based boards, but I don't understand how it is all integrated, and I haven't found a full, coherent description of this.
I'll write down the high-level knowledge I already have, and I hope that someone can correct me where I'm wrong and fill in the missing parts so that the whole picture becomes clear. My description starts with the userspace software and drills down to the hardware components. The misunderstood parts are in italic-bold format.
The user-mode application requests to read/write from some device, i.e. makes I/O operation.
The operating system receives the request and hands it to the appropriate driver (every OS has its own mechanism for this; I don't need a further drill-down here, but if you want to share insights you are welcome to).
The driver in charge of handling the I/O request has to know the address to which the device is mapped (since I'm interested in ARM-based boards, afaik there is only memory-mapped I/O and no port I/O). In most cases (if we consider smartphone-like boards) there is a Linux kernel that parses the device addresses from the device tree passed by the bootloader at boot time (the modern approach), or the kernel is precompiled for the specific model family and board with the device addresses hardcoded in its source (the older, obsolete? approach). In some cases (this happens a lot in smartphones) some of the drivers are precompiled and just packaged into the kernel, i.e. their source is closed, so the addresses corresponding to the devices are unknown. Is this correct?
Given that the driver knows the addresses of the relevant registers of the device it wants to communicate with, it allocates a buffer (usually in kernel space) to which the device will write its data (with the help of DMA). The driver needs to inform the device about the location of that buffer, but the addresses that devices work with (to manipulate memory) are different from the addresses that drivers (the CPU) work with; hence, the driver needs to inform the device about the 'bus address' of the buffer it has just allocated. How does the driver inform the device about that address? How common is it to use an IOMMU? When using an IOMMU, is there one hardware component that manages addressing, or one per device?
Then the driver commands the device to do its job (by manipulating its registers) and the device transfers its output data directly to the allocated buffer in memory. Here I'm a bit confused about the relation device-driver:bus:bus-controller:actual-device. Take for example some imaginary device that communicates over the I2C protocol; the SoC specifies an I2C bus interface - what is this actually? Does the I2C bus have some kind of bus controller? Does the CPU communicate with the I2C bus interface or directly with the device (i.e. is the I2C bus interface transparent)? I guess that someone with some experience with device drivers could answer this easily.
The device populates a DMA channel. Since the device is not connected directly to memory but rather is connected through some bus to the DMA controller (which masters the bus), it interacts with the DMA controller to transfer the required data to the allocated buffer in memory. When the board vendor uses ARM IP cores and bus specifications, this step involves transactions over a bus from the AMBA spec (i.e. AHB/multi-AHB/AXI), and some protocol between the device and a DMAC on top of it. I would like to know more about this step: what actually happens? There are many specifications for DMA controllers by ARM; which one is popular? Which is obsolete?
When the device is done, it sends an interrupt, which travels to the OS through the interrupt controller, and the OS's interrupt handler directs it to the appropriate driver, which now knows that the DMA transfer is complete.

You've slightly conflated two things here - there are some devices (e.g. UARTs, MMC controllers, audio controllers, typically lower-bandwidth devices) which rely on an external DMA controller ("DMA engine" in Linux terminology), but many devices are simply bus masters in their own right and perform their own DMA directly (e.g. GPUs, USB host controllers, and of course the DMA controllers themselves). The former involves a bunch of extra complexity with the CPU programming the DMA controller, so I'm going to ignore it and just consider straightforward bus-master DMA.
In a typical ARM SoC, the CPU clusters and other master peripherals, and the memory controller and other slave peripherals, are all connected together with various AMBA interconnects, forming a single "bus" (generally all mapped to the "platform bus" in Linux), over which masters address slaves according to the address maps of the interconnect. You can safely assume that the device drivers know (whether by device tree or hardcoded) where devices appear in the CPU's physical address map, because otherwise they'd be useless.
On simpler systems, there is a single address map, so the physical addresses used by the CPU to address RAM and peripherals can be freely shared with other masters as DMA addresses. Other systems are more complex - one of the more well-known is the Raspberry Pi's BCM2835, in which the CPU and GPU have different address maps; e.g. the interconnect is hard-wired such that where the GPU sees peripherals at "bus address" 0x7e000000, the CPU sees them at "physical address" 0x20000000. Furthermore, in LPAE systems with 40-bit physical addresses, the interconnect might need to provide different views to different masters - e.g. in the TI Keystone 2 SoCs, all the DRAM is above the 32-bit boundary from the CPUs' point of view, so the 32-bit DMA masters would be useless if the interconnect didn't show them a different address map. For Linux, check out the dma-ranges device tree property for how such CPU→bus translations are described. The CPU must take these translations into account when telling a master to access a particular RAM or peripheral address; Linux drivers should be using the DMA mapping API which provides appropriately-translated DMA addresses.
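A fixed interconnect offset like the BCM2835 one can be modelled in a few lines of C. This is only a sketch of the arithmetic, not the Linux DMA mapping API: the dma_range type and both helper functions are hypothetical, and the 16 MiB window size is an assumption chosen for the example. Only the two base addresses come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One fixed translation window, similar in spirit to a dma-ranges entry. */
typedef struct {
    uint64_t bus_base;  /* address as seen by the other bus master */
    uint64_t cpu_base;  /* address as seen by the CPU */
    uint64_t size;      /* length of the window in bytes */
} dma_range;

/* BCM2835 peripheral window; the size is an assumption for illustration. */
static const dma_range bcm2835_periph = {
    0x7e000000u, 0x20000000u, 0x01000000u
};

/* Translate a CPU physical address into the bus address inside the window. */
static bool cpu_to_bus(const dma_range *r, uint64_t cpu, uint64_t *bus) {
    assert(r && bus);
    if (cpu < r->cpu_base || cpu - r->cpu_base >= r->size)
        return false;  /* outside the window: no translation exists */
    *bus = r->bus_base + (cpu - r->cpu_base);
    return true;
}

/* Translate a bus address back into the CPU physical address. */
static bool bus_to_cpu(const dma_range *r, uint64_t bus, uint64_t *cpu) {
    assert(r && cpu);
    if (bus < r->bus_base || bus - r->bus_base >= r->size)
        return false;
    *cpu = r->cpu_base + (bus - r->bus_base);
    return true;
}
```

With this window, CPU physical address 0x20200000 corresponds to bus address 0x7e200000, and an address outside the window is rejected rather than silently passed through. A driver that hands the untranslated CPU address to the other master would make it access the wrong location.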
IOMMUs provide more flexibility than fixed interconnect offsets - typically, addresses can be remapped dynamically, and for system integrity masters can be prevented from accessing any addresses other than those mapped for DMA at any given time. Furthermore, in an LPAE or AArch64 system with more than 4GB of RAM, an IOMMU becomes necessary if a 32-bit peripheral needs to be able to access buffers anywhere in RAM. You'll see IOMMUs on a lot of the current 64-bit systems for the purpose of integrating legacy 32-bit devices, but they are also increasingly popular for the purpose of device virtualisation.
IOMMU topology depends on the system and the IOMMUs in use - the system I'm currently working with has 7 separate ARM MMU-401/400 devices in front of individual bus-master peripherals; the ARM MMU-500 on the other hand can be implemented as a single system-wide device with a separate TLB for each master; other vendors have their own designs. Either way, from a Linux perspective, most device drivers should be using the aforementioned DMA mapping API to allocate and prepare physical buffers for DMA, which will also automatically set up the appropriate IOMMU mappings if the device is attached to one. That way, individual device drivers need not care about the presence of an IOMMU or not. Other drivers (typically GPU drivers) however, depend on an IOMMU and want complete control, so manage the mappings directly via the IOMMU API. Essentially, the IOMMU's page tables are set up to map certain ranges of physical addresses* to ranges of I/O virtual addresses, those IOVAs are given to the device as DMA (i.e. bus) addresses, and the IOMMU translates the IOVAs back to physical addresses as the device accesses them. Once the DMA operation is finished, the driver typically removes the IOMMU mapping, both to free up IOVA space and so that the device no longer has access to RAM.
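The map, translate, unmap cycle described here can be sketched as a toy model in C. Everything in it is hypothetical (the toy_iommu type, the 16-page IOVA space, the single-level table); a real IOMMU uses multi-level page tables, and a Linux driver would go through the DMA mapping API rather than anything like this. It is only meant to show how a buffer above 4GB can be reached through a small IOVA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)
#define TOY_NUM_PAGES  16  /* toy IOVA space: 16 pages = 64 KiB */

/* Single-level "page table": slot i maps IOVA page i to a physical page. */
typedef struct {
    uint64_t phys[TOY_NUM_PAGES];
    bool     valid[TOY_NUM_PAGES];
} toy_iommu;

/* Map one physical page; returns the IOVA, or UINT64_MAX if no slot is free. */
static uint64_t toy_map(toy_iommu *m, uint64_t phys_page) {
    assert((phys_page & (TOY_PAGE_SIZE - 1)) == 0);  /* must be page-aligned */
    for (size_t i = 0; i < TOY_NUM_PAGES; i++) {
        if (!m->valid[i]) {
            m->valid[i] = true;
            m->phys[i] = phys_page;
            return (uint64_t)i << TOY_PAGE_SHIFT;
        }
    }
    return UINT64_MAX;
}

/* What the IOMMU does on each device access: IOVA in, physical address out. */
static bool toy_translate(const toy_iommu *m, uint64_t iova, uint64_t *phys) {
    uint64_t slot = iova >> TOY_PAGE_SHIFT;
    if (slot >= TOY_NUM_PAGES || !m->valid[slot])
        return false;  /* unmapped: the access would fault */
    *phys = m->phys[slot] | (iova & (TOY_PAGE_SIZE - 1));
    return true;
}

/* Remove the mapping so the device loses access to that page. */
static void toy_unmap(toy_iommu *m, uint64_t iova) {
    uint64_t slot = iova >> TOY_PAGE_SHIFT;
    if (slot < TOY_NUM_PAGES)
        m->valid[slot] = false;
}
```

Mapping the physical page 0x123456000 (which lies above the 32-bit boundary) yields an IOVA that fits comfortably in 32 bits; the device uses that IOVA as its DMA address, and after toy_unmap() the same access no longer translates.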
Note that in some cases the DMA transfer is cyclic and never "finishes". With something like a display controller, the CPU might just map a buffer for DMA, pass that address to the controller and trigger it to start, and it will then continuously perform DMA reads to scan out whatever the CPU writes to that buffer until it is told to stop.
Other peripheral buses beyond the SoC interconnect, like I2C/SPI/USB/etc. work as you suspect - there is a bus controller (which is itself a device on the AMBA bus, so any of the above might apply to it) with its own device driver. In a crude generalisation, the CPU doesn't communicate directly with devices on the external bus - where a driver for an AMBA device says "write X to register Y", that just happens by the CPU performing a store to a memory-mapped address; where an I2C device driver says "write X to register Y", the OS usually has some bus abstraction layer which the bus controller driver implements, whereby the CPU programs the controller with a command saying "write X to register Y on device Z", the bus controller hardware will go off and do that, then notify the OS of the peripheral device's response via an interrupt or some other means.
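The layering in that last paragraph can be sketched in C as follows. All names here are hypothetical (i2c_bus, fake_i2c_controller, the sensor at address 0x48 and its register 0x01); the fake controller merely records the command where real hardware would drive the bus and raise an interrupt. The point is only that the device driver never touches the wire itself; it asks the bus abstraction, which the controller driver implements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bus abstraction: implemented by a bus controller driver. */
typedef struct i2c_bus {
    int (*write_reg)(struct i2c_bus *bus, uint8_t dev_addr,
                     uint8_t reg, uint8_t val);
    void *controller;  /* private state of the controller driver */
} i2c_bus;

/* Fake controller: records the last command instead of driving real pins. */
typedef struct {
    uint8_t  last_dev, last_reg, last_val;
    unsigned transfers;
} fake_i2c_controller;

/* Controller driver: "write X to register Y on device Z". */
static int fake_ctrl_write_reg(i2c_bus *bus, uint8_t dev_addr,
                               uint8_t reg, uint8_t val) {
    fake_i2c_controller *c = bus->controller;
    c->last_dev = dev_addr;
    c->last_reg = reg;
    c->last_val = val;
    c->transfers++;
    return 0;  /* 0 = the (simulated) device acknowledged */
}

static void fake_ctrl_init(i2c_bus *bus, fake_i2c_controller *c) {
    bus->write_reg = fake_ctrl_write_reg;
    bus->controller = c;
}

/* Device driver for an imaginary sensor: only knows "write X to register Y". */
#define SENSOR_I2C_ADDR   0x48u
#define SENSOR_REG_CONFIG 0x01u

static int sensor_set_config(i2c_bus *bus, uint8_t value) {
    assert(bus && bus->write_reg);
    return bus->write_reg(bus, SENSOR_I2C_ADDR, SENSOR_REG_CONFIG, value);
}
```

Calling sensor_set_config(&bus, 0x80) ends up as one recorded transfer to device 0x48, register 0x01. Swapping in a different controller only means providing another write_reg implementation; the sensor driver stays unchanged.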
* technically, the IOMMU itself, being more or less "just another device", could have a different address map in the interconnect as previously described, but I would doubt the sanity of anyone actually building a system like that.

Interfacing a linux device driver with dummy PCI device

I have a user space program that simulates a PCI device. I have downloaded the NVMe Linux device driver, which interacts with the PCI device using the NVMe standard. I have to verify that my userspace program is compatible with the standard.
nvme.c (the Linux device driver) contains the nvme_probe() function, which is called when the device is plugged in. Since I do not have the device, I think I will incorporate the probe functionality into the nvme_init() function.
Now I have studied quite a lot on the internet to understand how to emulate a PCI device, posts such as
Installing PCI driver without connection to device,
emulating a PCI device on linux
I do not understand how to supply the populated struct pci_dev that nvme_probe() passes to pci_set_drvdata(pdev, dev);
Could you also suggest a tutorial on how to manually populate the pci_dev struct with a dummy device configuration and the memory addresses of the userspace program's function pointers, in order to emulate interaction with the nvme driver?

I don't think it is possible to fake such a thing with the standard Linux kernel.
In module_init() you tell the kernel's PCI subsystem which operation handlers (callbacks, via function pointers) to call when a certain device is present in the system (via id_table).
So once you insmod your module, the kernel's PCI subsystem knows to bind your driver whenever a device with a matching vid/pid is plugged into the PCIe slot. The sequence is as follows -
In module_init (or _init), tell the kernel to use {my_driver.ko} when a PCI device with this {vid/pid} is found.
After that, whenever a matching {vid/pid} device is connected to the system, the kernel calls the .probe callback of {my_driver.ko}.
In .probe you may initialize the device (for a real device) or just return 0 to tell the kernel that the device has been initialized correctly.
You can also register a new driver type from this probe function (for read/write).
I am not aware of any magic VID/PID number that causes the PCI subsystem to always load the driver.
But you can get the PCI driver loaded by using an actual PCI device.
Just remove the driver currently bound to a real PCI device and use that device's VID and PID in your driver's id_table. The PCI subsystem will then load your driver, and you can use it to test your simulated PCI device.
Hope this helps,
regards.
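The match-then-probe sequence described in this answer can be modelled in plain userspace C. To be clear, this is not the kernel API: toy_pci_id, toy_pci_dev, toy_pci_driver and toy_bus_add_device are hypothetical stand-ins, and the IDs 0xABCD:0x0001 are made-up values. It only illustrates why probe never runs unless a device with a matching ID shows up on the bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t vendor, device;
} toy_pci_id;

typedef struct {
    toy_pci_id id;
    void *drvdata;  /* per-device driver state, set during probe */
} toy_pci_dev;

typedef struct {
    const toy_pci_id *id_table;
    size_t n_ids;
    int (*probe)(toy_pci_dev *dev);  /* returns 0 on success */
} toy_pci_driver;

/* What the bus core does when a device appears: match IDs, then probe. */
static int toy_bus_add_device(const toy_pci_driver *drv, toy_pci_dev *dev) {
    assert(drv && dev);
    for (size_t i = 0; i < drv->n_ids; i++) {
        if (drv->id_table[i].vendor == dev->id.vendor &&
            drv->id_table[i].device == dev->id.device)
            return drv->probe(dev);
    }
    return -1;  /* no match: the driver's probe is never called */
}

/* Example driver with made-up IDs. */
static int demo_state;

static int demo_probe(toy_pci_dev *dev) {
    dev->drvdata = &demo_state;  /* loosely analogous to pci_set_drvdata() */
    return 0;
}

static const toy_pci_id demo_ids[] = { { 0xABCD, 0x0001 } };
static const toy_pci_driver demo_driver = { demo_ids, 1, demo_probe };
```

A device with ID 0xABCD:0x0001 gets probed and ends up with drvdata set; a device with any other ID is ignored. In the real kernel the struct pci_dev is created by the PCI core when it enumerates the bus, which is why the driver cannot simply fabricate one for itself.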

DMA transfer to a slave PCI device in Linux

I am a bit confused regarding DMA transfers with a PCIe device.
Say, for example, I have a slave PCIe device, and I want to transfer a block of data from the device to RAM using a DMA transaction. Note that the device is a slave and does not have a DMA "machine" on it.
I know I need to obtain a DMA-able buffer in RAM first (either by allocating a coherent one, or by mapping a page).
But what's next? What's the API to start a DMA transfer of N bytes from address S to address D?
Can modern systems issue a DMA transfer to/from a slave PCI device? If so, what is the Linux API for that?
As explained here:
[ISA]
In the original IBM PC, there was only one Intel 8237 DMA controller [...]
A PCI architecture has no central DMA controller, unlike ISA. Instead, any PCI component can request control of the bus ("become the bus master") and request to read from and write to system memory
The PCI bus does not have a "central" DMA controller - instead, each device can be a DMA "controller".

First of all, there are no slaves and slave holders inside a modern PC. There is a south bridge (in PCI) or a Root Complex (the root of the PCI Express device tree), and there are other PCI/PCIe actors such as bridges, soldered chips, plugged-in cards, hardware debuggers, etc. I'll assume that you are asking about a plugged-in card or some other peripheral device, like a soldered sound card or Ethernet chip.
According to this detailed description of "Transaction Layer Packet" (TLP, "PCIe’s uppermost layer"), there is "Bus Mastership (DMA)":
On PCIe, it’s significantly less exotic. ... anyone on the bus can send read and write TLPs on the bus, exactly like the Root Complex. This allows the peripheral to access the CPU’s memory directly (DMA) or exchange TLPs with peer peripherals (to the extent that the switching entities support that).
Also, DMA capability in plugged-in devices has a well-known consequence: the DMA attack. PCIe is listed as capable of initiating DMA transfers:
Systems may be vulnerable to a DMA attack by an external device if they have a FireWire, ExpressCard, Thunderbolt, or other expansion port that, like PCI and PCI-Express in general, hooks up attached devices directly to the physical address space.
I think there is no universal API for programming DMA transfers that are initiated by the peripheral device itself. It depends on what the device is, when the DMA should be started, and what will be sent.
