I've gathered some level of knowledge on several components (including software and hardware) which are involved in general DMA transactions in ARM based boards, but I don't understand how is it all perfectly integrated, I didn't find a full coherent description about this.
I'll write down the high level of the knowledge I already have and I hope that someone could fix me where I'm wrong and complete the missing parts so the whole picture would be clear. My description starts with the userspace software and drills down to the hardware components. The misunderstood parts are in italic-bold format.
The user-mode application requests to read/write from some device, i.e. makes I/O operation.
The operating system receives the request and hand it to the appropriate driver (every OS has its own mechanism to do this, I don't need a further drill down here but if you want to share insights here you are welcome)
The driver which is on charge to handle the I/O request, has to know the address to which the device is mapped to (since I'm interested in ARM based boards, afaik there is only memory-mapped I/O and no port I/O). In most of the cases (if we consider smartphone-like boards) there is a linux kernel that parses the devices addresses from the device-tree which is given from the bootloader at the boot time (the modern approach), or the linux is precompiled for the specific model family and board with the device addresses within it (hardcoded in its source code) (in older and obsolete? approach). In some cases (happens a lot in smartphones) part of the drivers are precompiled and are just packaged into the kernel, i.e. their source is closed, thus, the addresses correspond to the devices are unknown. Is it correct?
Given that the driver knows the address of the relevant registers of the device it want to communicate with, it allocate a buffer (usually in the kernel space) to which the device would write its data (with the help of the DMA). The driver needs to inform the device about the location of that buffer, but the addresses that the devices work with (to manipulate memory) are different from the addresses that the drivers (cpu) work with, hence, the driver needs to inform the device about the 'bus address' of the buffer it has just allocated. How does the driver inform the device about that address? How popular is to use an IOMMU? when using IOMMU is there one hardware component that manages addressing or one per device?
Then the driver commands the device to do its job (by manipulating its registers) and the device transfers output data directly to the allocated buffer in the memory. Here I'm confused a bit with the relation of device-driver:bus:bus-controller:actual-device. Take for example some imaginary device which knows to communicate in the I2C protocol; the SoC specify an I2C bus interface - what is this actually? does the I2C bus has some kind of bus controller? Does the cpu communicate with the I2C bus interface or directly with the device? (i.e. the I2C bus interface is seamless). I guess that someone with some experience with device drivers could answer this easily..
The device populates a DMA channel. Since the device is not connected directly to the memory but rather is connected through some bus to the DMA controller (which masters the bus), it interacts with the DMA to transfer the required data to the allocated buffer in the memory. When the board vendor uses ARM IP cores and bus specifications then this step involves transactions over a bus from the AMBA spec (i.e. AHB/multi-AHB/AXI), and some protocol between the device and a DMAC on top of it. I would like to know more about this step, what actually happens? There are many specifications for DMA controller by ARM, which one is the popular? which is obsolete?
When the device is done, it sends an interrupt, which travel to the OS through the interrupt controller, and the OS's interrupt handler direct it to the appropriate driver which now knows that the DMA transfer is completed.
You've slightly conflated two things here - there are some devices (e.g. UARTs, MMC controllers, audio controllers, typically lower-bandwidth devices) which rely on an external DMA controller ("DMA engine" in Linux terminology), but many devices are simply bus masters in their own right and perform their own DMA directly (e.g. GPUs, USB host controllers, and of course the DMA controllers themselves). The former involves a bunch of extra complexity with the CPU programming the DMA controller, so I'm going to ignore it and just consider straightforward bus-master DMA.
In a typical ARM SoC, the CPU clusters and other master peripherals, and the memory controller and other slave peripherals, are all connected together with various AMBA interconnects, forming a single "bus" (generally all mapped to the "platform bus" in Linux), over which masters address slaves according to the address maps of the interconnect. You can safely assume that the device drivers know (whether by device tree or hardcoded) where devices appear in the CPU's physical address map, because otherwise they'd be useless.
On simpler systems, there is a single address map, so the physical addresses used by the CPU to address RAM and peripherals can be freely shared with other masters as DMA addresses. Other systems are more complex - one of the more well-known is the Raspberry Pi's BCM2835, in which the CPU and GPU have different address maps; e.g. the interconnect is hard-wired such that where the GPU sees peripherals at "bus address" 0x7e000000, the CPU sees them at "physical address" 0x20000000. Furthermore, in LPAE systems with 40-bit physical addresses, the interconnect might need to provide different views to different masters - e.g. in the TI Keystone 2 SoCs, all the DRAM is above the 32-bit boundary from the CPUs' point of view, so the 32-bit DMA masters would be useless if the interconnect didn't show them a different addresses map. For Linux, check out the dma-ranges device tree property for how such CPU→bus translations are described. The CPU must take these translations into account when telling a master to access a particular RAM or peripheral address; Linux drivers should be using the DMA mapping API which provides appropriately-translated DMA addresses.
IOMMUs provide more flexibility than fixed interconnect offsets - typically, addresses can be remapped dynamically, and for system integrity masters can be prevented from accessing any addresses other than those mapped for DMA at any given time. Furthermore, in an LPAE or AArch64 system with more than 4GB of RAM, an IOMMU becomes necessary if a 32-bit peripheral needs to be able to access buffers anywhere in RAM. You'll see IOMMUs on a lot of the current 64-bit systems for the purpose of integrating legacy 32-bit devices, but they are also increasingly popular for the purpose of device virtualisation.
IOMMU topology depends on the system and the IOMMUs in use - the system I'm currently working with has 7 separate ARM MMU-401/400 devices in front of individual bus-master peripherals; the ARM MMU-500 on the other hand can be implemented as a single system-wide device with a separate TLB for each master; other vendors have their own designs. Either way, from a Linux perspective, most device drivers should be using the aforementioned DMA mapping API to allocate and prepare physical buffers for DMA, which will also automatically set up the appropriate IOMMU mappings if the device is attached to one. That way, individual device drivers need not care about the presence of an IOMMU or not. Other drivers (typically GPU drivers) however, depend on an IOMMU and want complete control, so manage the mappings directly via the IOMMU API. Essentially, the IOMMU's page tables are set up to map certain ranges of physical addresses* to ranges of I/O virtual addresses, those IOVAs are given to the device as DMA (i.e. bus) addresses, and the IOMMU translates the IOVAs back to physical addresses as the device accesses them. Once the DMA operation is finished, the driver typically removes the IOMMU mapping, both to free up IOVA space and so that the device no longer has access to RAM.
Note that in some cases the DMA transfer is cyclic and never "finishes". With something like a display controller, the CPU might just map a buffer for DMA, pass that address to the controller and trigger it to start, and it will then continuously perform DMA reads to scan out whatever the CPU writes to that buffer until it is told to stop.
Other peripheral buses beyond the SoC interconnect, like I2C/SPI/USB/etc. work as you suspect - there is a bus controller (which is itself a device on the AMBA bus, so any of the above might apply to it) with its own device driver. In a crude generalisation, the CPU doesn't communicate directly with devices on the external bus - where a driver for an AMBA device says "write X to register Y", that just happens by the CPU performing a store to a memory-mapped address; where an I2C device driver says "write X to register Y", the OS usually has some bus abstraction layer which the bus controller driver implements, whereby the CPU programs the controller with a command saying "write X to register Y on device Z", the bus controller hardware will go off and do that, then notify the OS of the peripheral device's response via an interrupt or some other means.
* technically, the IOMMU itself, being more or less "just another device", could have a different address map in the interconnect as previously described, but I would doubt the sanity of anyone actually building a system like that.
Related
I would like to understand how the MMIO works on ARM architecture.
I realized that ARM provides 1:1 mapping from physical address to specific peripheral.
For example, to manage the GPIOX on arm, for example in Raspberry Pi, the processor accesses the specific physical addresses (seems that preconfigured by the manufacturer?) without configuring some registers beforehand.
I thought that there are some specific BAR register that arbitrates the read/write request to specific physical addresses to peripherals. However, when I check the spec for BCM2835 (Raspberry pi 3), the physical addresses are translated by another MMU called VC/ARM MMU. Is it a common design to have another MMU that translates the physical addresses to bus addresses in ARM architecture?
Also, I was wondering how the SMMU (IOMMU in x86) is utilized in this concept. I found one article mentioning that the VC/ARM MMU is an example of SMMU but I think that is not true? When I check some monitor code & kernel driver code implementation (not raspberry pi), it seems that the SMMU is also mapped to specific physical addresses and the monitor/kernel uses those addresses to initialize and communicate with SMMU. If the arm architecture utilize VC/ARM MMU as an SMMU, how that physical address mapping for SMMU itself can be accessed to initialize the SMMU..?
Lastly, I thought that all peripherals are managed by the SMMU. If some peripherals are always mapped to fixed physical addresses, what is the role of SMMU? Why some peripherals are communicated with PE (CPU) through fixed physical addresses, and some are communicated with SMMU..? How exactly the peripherals and PE (Processors) can communicate in ARM architecture..?
I am just answering your questions. There are many things wrong with some assumptions you have, which Old Timer tries to clarify.
Is it a common design to have another MMU that translates the physical addresses to bus addresses in ARM architecture?
No. Broadcom has an architecture license. They designed there own HDL code for the ARM CPU and system details can be different. In fact, it is quite common for even direct license (using ARM HDL) that the systems differ from vendor to vendor.
If the arm architecture utilize VC/ARM MMU as an SMMU, how that physical address mapping for SMMU itself can be accessed to initialize the SMMU?
If some peripherals are always mapped to fixed physical addresses, what is the role of SMMU? Why some peripherals are communicated with PE (CPU) through fixed physical addresses, and some are communicated with SMMU..? How exactly the peripherals and PE (Processors) can communicate in ARM architecture..?
The typical solution to this is 'TrustZone'. Here, the access is defined as having a 'secure' or 'normal' access. The master (CPU) tags the access and either the peripheral or a bus access controller permits or prohibits access to the peripheral. These are best statically mapped at boot time and locked.
By defining permitted use cases, the peripheral/master access patterns can be defined for the system. The complication is 'dynamic' peripherals and masters. The ARM TrustZone CPU is dynamic. A 'world switch' changes the CPU access and there are attacks on the communication interface between 'normal' and 'secure' worlds.
The ARM AXI/AHB buses were originally designed for embedded devices. PCs in contrast have dynamic buses ISA->EISA->PCMIA->PCI (etc.) These addresses are typically dynamic. Note However, this bus structure has an expense. So some of your question is like asking why isn't ARM just like an x86 PC. They had different goals and they are different. You can't put a new grahpics card into your cell phone.
Reference:
Handling ARM Trustzones
Explanation of arm Bus architecture
ARM Differences from PC
Note: It is from the modular nature of the PC system which was part of why the PC dominated Apple, Commodore, Atari, etc. in my opinion (the other aspect was piracy), contrary to others who think the answer is Microsoft. The hardware matters.
I will first say that I'm not expert in the field and my question might contain misunderstanding, in which case, I'll be glad if you correct me and attach resources so I can learn further details.
I'm trying to figure out the way that the system bus and how the various devices that appear in a mobile device (such as sensors chips, wifi/BT SoC, touch panel, etc.) are addressed by the CPU (and by other MCUs).
In the PC world we have the bus arbitrator that route the commands/data to the devices, and, afaik, the addresses are hardwired on the board (correct me if I'm wrong). However, in the mobile world I didn't find any evidence of that type of addressing; I did find that ARM has standardized the Advanced Microcontroller Bus Architecture, I don't know, though, whether that standard applied for the components (cpu-cores) which lies inside the same SoC (that is Exynos, OMAP, Snapdragon etc.) or also influence peripheral interfaces. Specifically I'm asking what component is responsible on allocating addresses to peripheral devices and MMIO addresses?
A more basic question would be whether there even exist a bus management in the mobile device architecture or maybe there is some kind of "star" topology (where the CPU is the center).
From this question I get the impression that these devices are considered as platform devices, i.e., devices that are connected directly to the CPU, and not through a bus. Still, my question is how does the OS knows how to address them? Then other threads, this and this about platform devices/drivers made me confused..
A difference between ARM and the x86 is PIO. There are no special instruction on the ARM to access an I/O device. Everything is done through memory mapped I/O.
A second difference is the ARM (and RISC in general) has a separate load/store unit(s) that are separate from normal logic.
A third difference is that ARM licenses both the architecture and logic core. The first is used by companies like Apple, Samsung, etc who make a clean room version of the cores. For the second set, who actually buy the logic, the ARM CPU will include something from the AMBA family.
Other peripherals from ARM such as a GIC (Cortex-A interrupt controller), NVIC (Cortex-M interrupt controller), L2 controllers, UARTs, etc will all come with an AMBA type interface. 3rd party companies (ChipIdea USB, etc) may also make logic that is setup for a specific ARM bus.
Note AMBA at Wikipedia documents several bus types.
APB - a lower speed peripheral bus; sort of like south bridge.
AHB - several versions (older north bridge).
AXI - a newer multi-CPU (master) high speed bus. Example NIC301.
ACE - an AXI extension.
A single CPU/core may have one, two, or more master connection to an AXI bus. There maybe multiple cores attached to the AXI bus. The load/store and instruction fetch units of a core can use the multiple ports to dispatch requests to separate slaves. The SOC vendor will balance the number of ports with expected memory bandwidth needs. GPUs are also often connected to the AXI BUS along with DDR slaves.
It is true that there is no 100% standard topology; especially if you consider all possible future ARM designs. However, typical topologies will include a top level AXI with some AHB peripherals attached. One or multiple 2nd level APB (buses) will provide access to low speed peripherals. Not every SOC vendor wants to spend time to redesign peripherals and the older AHB interface speeds maybe quite fine for a device.
Your question is tagged embedded-linux. For the most part Linux just needs to know the physical addresses. On occasion, the peripheral BUS controllers may need configuration. For instance, an APB may be configure to allow or disallow user mode. This configuration could be locked at boot time. Generally, Linux doesn't care too much about the bus structure directly. Programmers may have coded a driver with knowledge of the structure (like IRAM is fasters, etc).
Still, my question is how does the OS knows how to address them?
Older Linux kernels put these definitions in a machine file and passed a platform resource structure including interrupt number, and the physical address of a register bank. In newer Linux versions, this information is included with Open Firmware or device tree files.
Specifically I'm asking what component is responsible on allocating addresses to peripheral devices and MMIO addresses?
The physical addresses are set by the SOC manufacturer. Linux platform support will use the MMU to map them as non-cacheable to some un-used range. Often the physical addresses may be very sparse so the virtual remapping pack more densely. Each one incurs a TLB hit (MMU cache).
Here is a sample SOC bus structure using AXI with a Cortex-M and Cortex-A connected.
The PBRIDGE components are APB bridges and it is connected in a star topology. As others suggests, you need to look a your particular SOC documentation for specifics. However, if you have no SOC and are trying to understand ARM generally, some of the information above will help you, no matter what SOC you have.
1) ARM does not make chips, they make IP that is sold to chip vendors who make chips. 2) yes the amba/axi bus is the interface from ARM to the world. But that is on chip, so it is up to the chip vendor to decide what to hook up to it. Within a chip vendor you may find standards or habits, those standards or habits may be that for a family of parts the same peripherals may be find at the same addresses (same uart peripheral, same spi peripheral, clock tree, etc). And of course sometimes the same peripheral at different addresses in the family and sometimes there is no consistency. In the intel x86 world intel makes the processors they have historically made many of the peripherals be they individual parts to super I/O parts to north and south bridges to being in the same package. Intels processor success lies primarily in reverse compatibility so you can still access a clone uart at the same address that you could access it on your original ibm pc. When you have various chip vendors you simply cannot do that, arm does not incorporate the peripherals for the most part, so getting the vendors to agree on stuff simply will not happen. This has driven folks crazy yes, and linux is in a constant state of emergency with arm since it rarely if ever works on any platform. The additions tend to be specific to one chip or vendor or nuance not caring to check that the addition is in the wrong place or the workaround or whatever does not apply everywhere and should not be applied everywhere. The cortex-ms have taken a small step, before the arm7tdmi you had the freedom to use whatever address space you wanted for anything. The cortex-m has divided the space up into some major chunks along with some internal addresses (not just the cortex-ms this is true on a number of the cores). But beyond a system timer and maybe a interrupt controller it is still up to the chip vendor. The x86 reverse compatibility habits extend beyond intel so pcs have a lot of consistency across motherboard vendors (partly driven by software that they want to run on their system namely windows). Embedded in general be it arm or mips or whomever puts stuff wherever and the software simply adapts so embedded/phone software the work is on the developer to select the right drivers and adjust physical addresses, etc.
AMBA/AXI is simply the bus standard like wishbone or isa or pci, usb, etc. It defines how to interface to the arm core the processor from arm, this is basically on chip, the chip vendor then adds or buys from someone IP to bridge the amba/axi bus to pci or usb or dram or flash, etc, on chip or off is their choice it is their product. Other than perhaps a few large chunks the chip vendor is free to define the address space, and certainly free to define what peripherals and where. They dont have to use the same usb IP or dram IP as anyone else.
Is the arm at the center? Well with your smart phone processors you tend to have a graphics coprocessor, so then you have to ask who owns the world the arm, the gpu, or someone else? In the case of the raspberry pi which is to some extent one of these flavor of processors albeit older and slower now, the gpu appears to be the center of the world and the arm is a side fixture that has to time share on the gpu's bus, who knows what the protocol/architecture of that bus is, the arm is axi of course but is the whole chip or does the bridge from the arm to gpu side also switch to some other bus protocol? The point being is the answer to your question is no there is no rule there is no standard sometimes the arm is at the center sometimes it isnt. Up to the chip and board vendors.
not interested in terminology maybe someone else will answer, but I would say outside an elementary sim you wont have just one peripheral (okay I will use that term for generic stuff the processor accesses) tied to the amba/axi bus. You need a first level amba/axi interface that then divides up the address space per your design, and then using amba/axi or whatever bus protocol you want (generally you adapt to the interface for the purchased or designed IP). You, the chip vendor decides on the address space. You the programmer, has to read the documentation from the chip vendor or also board vendor to find the physical address space for each thing you want to talk to and you compile that knowledge into your operating system or application per the rules of that software or build system.
This is not unique to arm based systems you have the same problem with mips and powerpc and other cores you can buy in ip form, for whatever reason arm has dominated the world (there are many arm processors in or outside your computer for every x86 you own, x86 processors are extremely low volume compared to arm based). Like Gates had a desktop in every home, a long time ago ARM had a "touch an ARM once a day" type of a thing to push their product and now most things with a power switch and in particular with a battery has an arm in it somewhere. Which is a nightmare for developers because there are so many arm cores now with nuances and every chip vendor and every family and sometimes members within a family are different so as a developer you simply have to adapt, write your stuff in a modular form, mix and match modules, change addresses, etc. Making one binary like windows does for example that runs everywhere, is not in any way a wise goal for arm based products. Make the modules portable and build the modules per target.
Each SoC will be designed to have its own (possibly configurable) memory map. You will need to read the relevant technical reference manual to get the exact details.
Examples are:
Raspeberry pi datasheet (pdf)
OMAP 5 TRM
I'm having a problem with a parallel connection I've got to establish using DMA (Direct Acces Memory).
I've got to write some characters to a parallel port with a given address, through a C application. I know that for a PIO access, there are the _inp/_outp functions, but I don't know how to manage a direct memory access parallel communication.
Does anyone know how I should do or has any good links (I couldn't find any even after long research on the Web
This is not something that can be answered generically.
DMA access is determined by either a DMA controller (in OLD PC's), or using "bus mastering" (PCI onwards). Either of these solutions requires access to the relevant hardware manuals for the device that you are working with (and the DMA controller, if applicable).
In general, the principle works as this:
Reserve a piece of memory (DMA buffer) for the device to store data in.
Configure the device to store the data in said region (remember that in nearly all cases, DMA happens on physical addresses, as opposed to the virtual addresses that Windows or Linux uses).
When the device has stored the requested data, an interrupt is fired, the software responsible for the device takes the interrupt and signals some higher level software that the data is ready, and (perhaps) reprograms the device to start storing data again (either after copying the DMA buffer to someplace else, or assigning a new DMA buffer).
At the end of the day every piece of code we write eventually gets turned into assembler and then machine language.
If you were writing assembler and wanting to perform a simple connection between two computers, how would you know which memory addresses to use (let alone offsets) within the assembler? Would you need to know specific addresses relating to the operating system?
I'm just wondering how somebody would write a really "clean" and "efficient" message passing library/compiler- the thing which is getting me is what on earth would network communications/IPC look like in assembler?
I think part of this answer could lie with querying known addresses relating to the OS? For example 0x4545456 to 0x 60000000 contains the Linux kernel data for communications X etc.
The addresses are not specific to your OS. They are specific to your hardware/system. Accessing those has nothing to do with assembler vs. another programming language (e.g. C), in fact most device driver code (the code that actually interacts with the networking hardware) is typically written in C.
Here's just one random sample of a network (ethernet) controller:
Intel® 82580EB/82580DB GbE Controller: Datasheet
There are a bunch of registers that your software, either in assembler, or in another language, has to program to get this thing to actually communicate over ethernet. It's probably easier to start with a simpler example, something like a serial port. Let's build a hypothetical, fixed baud rate, serial port controller, mapped to memory:
Address Meaning
0 RX status (reads 0 when no data to read, 1 a byte is available)
1 RX buffer
2 TX status (reads 0 when ready to send, 1 when busy)
3 TX buffer
Now your software, either in assembler or any other language, can transmit data to another computer, by monitoring (polling) address 2 until it's ready, writing the next byte to address 3. We can also received data from another computer by monitoring (polling) address 0 to see when data is ready and reading the byte from address 1 when the data is there.
In a modern operating system/OS those are all physical addresses which need to be somehow mapped into virtual addresses.
Real world hardware, such as the one I linked to, will typically use interrupts, so you don't need to poll. It will usually have DMA, so the hardware can access your data directly rather than you feeding it byte by byte. It will handle various protocols and will have registers for checking and setting various aspects of this protocol.
In a modern OS the actual interaction with the hardware is implemented in a device driver and user software can exchange data with the device driver through some API. Again, this user code may be written in assembler or any other language. The API will vary depending on the OS. Communication/networking is generally built as a "stack" with higher level protocols implemented over the lower level ones. Which part of this stack is in a user library or part of the OS will vary between different operating systems.
For the hypothetical device I described above the API may consist of two single byte blocking calls, read() and write(). You then use some sort of system call mechanism from either assembler or a higher level language to call these and pass parameters/retrieve the output. In some operating systems device I/O may look like file I/O so you would use the generic file read/write to perform operations on the device and the OS will dispatch those to the right device driver. Furthermore, in a typical OS the actual system call will be available through some sort of library, which again you may call from various programming languages.
There are two pieces of code for doing networking in assembly - the kernel code used by the operating system to actually do the networking, and client code that wants to tell the OS what data to send over the network.
Typically, the hardware in a machine has certain memory addresses dedicated to communicating with the network hardware. The machine code for the OS can then write the appropriate values into this memory to control the hardware that ends up sending and receiving bytes. These memory addresses would be hardcoded into the machine code.
In the case of user code that does networking (say, Mozilla Firefox), the process is different. There is typically a machine instruction or set of instructions that are used for user code to tell the operating system to perform some task (in MIPS, for example, this is syscall, while I think x86 uses the int instruction). Client code would work by setting up some buffers with the appropriate data to send to the network, then would use one of the assembly instructions above to tell the OS that it should send the data. The hardware then invokes the OS, which reads the user data and then uses its own machine code (described above) to actually control the network device appropriately. In this way, the OS can guard direct access to the network device by blocking access to the physical addresses controlling the device and moderating access through system calls. It also means that you don't need to know any memory addresses when writing user code to do networking. The OS handles these details, and all you need to know about is what instruction to execute to trigger the system call.
Hope this helps!
I am a bit confused regarding DMA transfers with a PCIe device.
Say, for example, I have a slave PCIe device, and I want to transfer a block of data from the device to the RAM, using a DMA transaction. Note that the device is slave, and does not have a DMA "machine" on it.
I know I need to obtain a DMA-able buffer in RAM (either by allocating a coherent one, or by mapping a page) first.
But what's next? what's the API to start a DMA transfer of N bytes from address S to address D?
Can modern systems issue a DMA transfer to/from a slave pci device? if so, what is the Linux API for that?
As explained here:
[ISA]
In the original IBM PC, there was only one Intel 8237 DMA controller [...]
A PCI architecture has no central DMA controller, unlike ISA. Instead, any PCI component can request control of the bus ("become the bus master") and request to read from and write to system memory
The PCI bus does not have a "central" DMA controller - instead, each device can be a DMA "controller".
First of all, there are no slaves and slave holders inside modern PC. There is south bridge (in PCI) or Root Complex (root of PCI-express device tree) and there are some other PCI/PCIe actors, like bridges, soldered chips, plugged cards, hardware debuggers etc. I'll assume that you are asking about plugged card or some other peripheral device, like soldered Sound Card or Ethernet chip.
According to this detailed description of "Transaction Layer Packet" (TLP, "PCIe’s uppermost layer"), there is "Bus Mastership (DMA)":
On PCIe, it’s significantly less exotic. ... anyone on the bus can send read and write TLPs on the bus, exactly like the Root Complex. This allows the peripheral to access the CPU’s memory directly (DMA) or exchange TLPs with peer peripherals (to the extent that the switching entities support that).
Also, there is some benefits of DMA capability from plugged devices: DMA attack. And PCIe is listed as capable of initiating DMA transfer:
Systems may be vulnerable to a DMA attack by an external device if they have a FireWire, ExpressCard, Thunderbolt, or other expansion port that, like PCI and PCI-Express in general, hooks up attached devices directly to the physical address space.
I think, there is no universal API for programming DMA transfers that are initiated from the peripheral device itself. This depends on the what the device is, when the DMA should be started and what will be sent.