increase SRAM space on STM32F207
Hi,
I am using the STM32F207ZFT, but I don't have enough RAM in SRAM1 for my application:
Question 1: if I don't use the SRAM2 region for DMA (SRAM2, 16 KB: 0x2001 C000 - 0x2001 FFFF), can I use this memory area as normal RAM to extend the BSS area? Since it is contiguous with SRAM1, that would increase the overall RAM available for zero-initialized variables.
Question 2: can I use the backup SRAM (BKPSRAM, 4 KB: 0x4002 4000 - 0x4002 4FFF) to store data buffers or arrays, the same way I would use the normal BSS RAM area?
Leaving aside its low power consumption (on the VBAT pin), are the characteristics of this BKPSRAM comparable to those of SRAM1 (access time, etc.)?
Best regards,
Disclaimer: I'm very familiar with STM32F1xx and somewhat familiar with STM32F4xx, but never used STM32F2xx.
Regarding the first question: from reading the manual (in particular, section 2.3.1), there's nothing special about SRAM2, except that an address in SRAM2 can be accessed at the same time as an address in SRAM1 is being accessed. Other than that, I can't spot any restrictions.
Regarding the second question: BKPSRAM is connected to the bus matrix through the AHB1 bus. In principle this bus runs at the same clock as the core, so if there is no bus contention the speed should be similar. If there are any wait states or anything else that could delay accesses to BKPSRAM, I was unable to find it in the manual. Of course, if you have, say, a DMA transaction that is heavily accessing a peripheral connected to AHB1, then you will have bus contention which might delay accesses to BKPSRAM, whereas SRAM1 and SRAM2 have a direct connection to the bus matrix, not shared with anything else. To summarize: you shouldn't have any problems.
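To make question 2 concrete, here is a minimal C sketch of unlocking the backup SRAM and then using it like an ordinary byte buffer. It assumes the CMSIS-style register and bit names from ST's stm32f2xx.h header (PWREN, DBP, BKPSRAMEN), so double-check those against your device pack. For question 1, since SRAM2 is contiguous with SRAM1, simply extending the RAM region in the linker script from 112 KB to 128 KB should be enough to let the BSS grow into it.

#include <stddef.h>
#include "stm32f2xx.h"               /* vendor CMSIS device header (assumed) */

#define BKPSRAM_START  0x40024000UL  /* BKPSRAM, 4 KB */
#define BKPSRAM_SIZE   4096U

/* Treat the backup SRAM as an ordinary buffer once it is unlocked. */
static volatile uint8_t * const bkpsram = (volatile uint8_t *)BKPSRAM_START;

void bkpsram_enable(void)
{
    RCC->APB1ENR |= RCC_APB1ENR_PWREN;        /* clock the PWR block      */
    PWR->CR      |= PWR_CR_DBP;               /* unlock the backup domain */
    RCC->AHB1ENR |= RCC_AHB1ENR_BKPSRAMEN;    /* clock the BKPSRAM i/f    */
}

void bkpsram_store(const uint8_t *data, size_t len)
{
    /* Plain byte-wise access, just like a buffer in SRAM1/SRAM2. */
    for (size_t i = 0; i < len && i < BKPSRAM_SIZE; i++)
        bkpsram[i] = data[i];
}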
I am reading the book, Professional CUDA C Programming. On page 159, it says:
Aligned memory accesses occur when the first address of a device
memory transaction is an even multiple of the cache granularity being
used to service the transaction (either 32 bytes for L2 cache or 128
bytes for L1 cache).
I am wondering why aligned memory accesses in CUDA need even multiples of the cache granularity rather than just multiples of the cache granularity.
So, I checked the cuda-c-programming-guide from NVIDIA. It says:
Global memory resides in device memory and device memory is accessed
via 32-, 64-, or 128-byte memory transactions. These memory
transactions must be naturally aligned: Only the 32-, 64-, or 128-byte
segments of device memory that are aligned to their size (i.e., whose
first address is a multiple of their size) can be read or written by
memory transactions.
It seems that requiring an even multiple of the cache granularity is unnecessary for an aligned memory access, isn't it?
The quoted sentence from the book seems to be incorrect in two senses:
A memory access has an alignment of N if it is an access to an address that is a multiple of N. That's irrespective of CUDA. What seems to be discussed here is memory access coalescence.
As you suggest, and AFAIK, coalescence requires "multiples of" the cache granularity, not "even multiples of".
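To make the distinction concrete, here is a small host-side C sketch (not CUDA code; the helper name is just for illustration) showing that "naturally aligned" only means "a multiple of the segment size". The address 0x60 (96 = 3 * 32) is an odd multiple of 32, yet it is perfectly aligned for a 32-byte transaction:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Natural alignment: the address is a multiple of the segment size
 * (32, 64 or 128 bytes), not necessarily an *even* multiple of it. */
static bool is_naturally_aligned(uintptr_t addr, uintptr_t segment_size)
{
    return (addr % segment_size) == 0;
}

int main(void)
{
    uintptr_t addr = 0x60;   /* 96 = 3 * 32 */

    printf("32B aligned:  %d\n", is_naturally_aligned(addr, 32));   /* 1 */
    printf("64B aligned:  %d\n", is_naturally_aligned(addr, 64));   /* 0 */
    printf("128B aligned: %d\n", is_naturally_aligned(addr, 128));  /* 0 */
    return 0;
}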
I wonder if there is a way to identify memory used for DMA mapping in some proc files, such as mtrr and iomem, or via lspci -vv.
In my /proc/mtrr, there is only one uncachable region, and it seems to point (almost exactly) at the 'PCI hole' at 3.5-4 GB.
base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
Cross-checking with /proc/iomem, only the last 21 MB of this 512 MB region (just below 4 GB) is NOT consumed by the PCI Bus, and that 21 MB sliver is occupied by things like pnp/IOAPIC/Reserved.
So my questions are:
What is the signature of a DMA region in /proc/mtrr and /proc/iomem?
Are there other places, such as other proc files or commands, that I can use to see DMA regions?
It seems that by adding rows to /proc/mtrr, a privileged user can change the caching mechanism of any memory at runtime. So, besides the fact that DMA has to be in the lower 32-bit address space (assuming no DAC), are there other special requirements for DMA memory allocation? If there are no further requirements, then perhaps the only hint I can use to identify DMA memory is /proc/mtrr?
DMA (Direct Memory Access) is just where a device accesses memory itself (without asking the CPU to feed the data to the device). For a (simplified) example of DMA, imagine a random process does a write(), and this bubbles its way up (through VFS, through the file system, through any RAID layer, etc.) until it reaches some kind of disk controller driver; then the disk controller driver tells its disk controller "transfer N bytes from this physical address to that place on the disk and let me know when the transfer has been done". Most devices (disk controllers, network cards, video cards, sound cards, USB controllers, ...) use DMA in some way. Under load, all the devices in your computer may be doing thousands of transfers (via DMA) per second, potentially scattered across all usable RAM.
As far as I know; there are no files in /proc/ that would help (most likely because it changes too fast and too often to bother providing any, and there'd be very little reason for anyone to ever want to look at it).
The MTRRs are mostly irrelevant - they only control the CPU's caches and have no effect on DMA requests from devices.
The /proc/iomem file is also of little help. It shows how the physical address space is laid out (device registers, reserved regions, System RAM), but it doesn't tell you which parts of RAM are currently being used for DMA.
Note 1: DMA doesn't have to be in the lower 32-bit address space (e.g. most PCI devices have supported 64-bit DMA/bus mastering for a decade or more); and for the rare devices that don't support 64-bit addressing, it's possible for Linux to use an IOMMU to remap their requests (so the device thinks it's using 32-bit addresses when it actually isn't).
Note 2: Once upon a time (a long time ago) there were "ISA DMA controller chips". Like the ISA bus itself, these were restricted to the first 16 MiB of the physical address space (and had other restrictions, e.g. not supporting transfers that cross a 64 KiB boundary). These chips haven't really had a reason to exist since floppy disk controllers became obsolete. You might have a /proc/dma describing these (but if you do, it probably only says "cascade" to indicate how the chips connect, with no devices using them).
I know the question is vague, but here is what I hope to learn: the MCU directs some part of the memory address space to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data going in and out of PCI Express devices is packetized, serialized, and transmitted over lanes, which means each read/write incurs significant overhead, such as packaging (adding headers) and un-packaging. So it is not ideal for user/kernel code to read device memory a byte at a time; instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into the main memory address space: DMA is about letting the device access main memory, while my question is the other way around, letting user/kernel code access device memory. So I am guessing it is not related to the question above; is that correct?
Yes, accessing memory-mapped I/O (MMIO) is slow.
The primary reason that it is slow is that it is typically uncacheable,
so every access has to go all the way to the device.
In x86 systems, which I am most familiar with, cacheable memory is accessed in 64-byte chunks,
even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks.
If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processors to memory is critical to performance and is highly optimized.
The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)
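For completeness, here is a hedged user-space sketch of the "read device memory in bulk" approach on Linux: map a PCI BAR through its sysfs resource file and copy it into ordinary RAM once, rather than touching the uncached mapping a byte at a time. The device path and BAR size below are hypothetical placeholders, this usually requires root, and (as argued above) device-driven DMA remains the preferred mechanism for truly large transfers.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR_SIZE  (64 * 1024)   /* assumed size of the mapped BAR */

int main(void)
{
    /* Hypothetical device: substitute your device's sysfs entry. */
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the BAR into the process address space (uncached MMIO). */
    volatile uint8_t *bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* One bulk copy into ordinary RAM; repeatedly poking the mapping
     * byte-by-byte would pay the full device round trip every time.  */
    uint8_t *shadow = malloc(BAR_SIZE);
    memcpy(shadow, (const void *)bar, BAR_SIZE);

    /* ... work on 'shadow' ... */

    free(shadow);
    munmap((void *)bar, BAR_SIZE);
    close(fd);
    return 0;
}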
STM32F4 controllers (with an ARM Cortex-M4 CPU) allow a so-called physical remap of the lowest addresses in the memory space (0x00000000 to 0x03FFFFFF) using the SYSCFG_MEMRMP register. What I understand is that the register selects which memory (FLASH/RAM/etc.) is aliased to the lowest addresses and therefore from which memory the reset vector and initial stack pointer are fetched after reset.
The documentation [1] also mentions that
In remap mode, the CPU can access the external memory via ICode bus
instead of System bus which boosts up the performance.
This means that after a remap, e.g. to RAM, an instruction fetch from within the alias address space (0x00000000 to 0x03FFFFFF) will use the ICode bus.
Now my question: after such a remap operation, e.g. to RAM, will a fetch from the non-aliased location of the RAM use the System bus or the ICode bus?
The background of the question is that I want to write a linker script for an image executing from RAM only (under control of a debugger). To which memory area should the .text section go? The alias space or the physical space?
[1] ST DocID018909 Rev 7
Thanks to Sean I could find the answer in the ARM® Cortex®‑M4 Processor Technical Reference Manual section 2.3.1 Bus interfaces:
ICode memory interface: Instruction fetches from Code memory space,
0x00000000 to 0x1FFFFFFC, are performed over the [sic!: this] 32-bit AHB-Lite bus.
DCode memory interface: Data and debug accesses to Code memory space,
0x00000000 to 0x1FFFFFFF, are performed over the [sic!: this] 32-bit AHB-Lite bus.
System interface: Instruction fetches and data and debug accesses to
address ranges 0x20000000 to 0xDFFFFFFF and 0xE0100000 to 0xFFFFFFFF
are performed over the [sic!: this] 32-bit AHB-Lite bus.
This also makes clear that the flash memory of STM32F4 MCUs, located at 0x08000000, is always accessed (by the CPU core) using the ICode/DCode buses, regardless of whether it is remapped. This is because both the original location and the remapped location are within the Code memory space (0x00000000 to 0x1FFFFFFF).
However, if the code is located in SRAM at 0x20000000, then access to the remapped location at 0x00000000 uses the ICode/DCode buses, while access to the original location (outside the Code memory space) uses the System bus.
The choice of bus interface on the core depends on the addresses accessed. If you access an instruction at 0x00000004, this is done on the ICode bus. An access to 0x20000004 is done using the System bus.
What the REMAP function does is change the physical memory system so that an access to 0x00000004 (ICode bus) will use the same RAM as you can also access on the system bus. Any access to 0x20000004 will be unaffected, and still be generated on the System bus by the core.
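As a small illustration, this is roughly how the SRAM remap is selected from C, assuming ST's CMSIS-style definitions from stm32f4xx.h (register and bit names are taken from those headers; double-check them against your device pack). MEM_MODE = 0b11 maps the embedded SRAM to 0x00000000:

#include "stm32f4xx.h"   /* vendor CMSIS device header (assumed) */

/* Map the embedded SRAM to 0x00000000 so that code placed in SRAM is
 * also reachable through the ICode/DCode buses via the alias region. */
void remap_sram_to_zero(void)
{
    /* SYSCFG sits on APB2; enable its clock before touching MEMRMP. */
    RCC->APB2ENR |= RCC_APB2ENR_SYSCFGEN;

    /* MEM_MODE[1:0] = 0b11 selects embedded SRAM at address 0. */
    SYSCFG->MEMRMP = (SYSCFG->MEMRMP & ~SYSCFG_MEMRMP_MEM_MODE) | 0x3U;
}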
I understand why memory should be aligned to 4 or 8 bytes based on the data width of the bus, but the following statement confuses me:
"IoDrive requires that all I/O performed on a device using O_DIRECT must be 512-byte
alligned and a multiple of 512 bytes in size."
What is the need for aligning the address to 512 bytes?
Blanket statements blaming DMA for large buffer alignment restrictions are wrong.
Hardware DMA transfers are usually aligned on 4 or 8 byte boundaries, since the PCI bus can physically transfer 32 or 64 bits at a time. Beyond this basic alignment, hardware DMA transfers are designed to work with any address provided.
However, the hardware deals with physical addresses, while the OS deals with virtual memory addresses (a protected-mode construct in the x86 CPU). This means that a contiguous buffer in process space may not be contiguous in physical RAM. Unless care is taken to create physically contiguous buffers, the DMA transfer needs to be broken up at VM page boundaries (typically 4K, possibly 2M).
As for buffers needing to be aligned to disk sector size, this is completely untrue; the DMA hardware is completely oblivious to the physical sector size on a hard drive.
Under Linux 2.4, O_DIRECT required 4K alignment; under 2.6 it has been relaxed to 512B. In either case, it was probably a design decision to prevent single-sector updates from crossing VM page boundaries and therefore requiring split DMA transfers. (An arbitrarily placed 512B buffer has roughly a one-in-eight chance of crossing a 4K page boundary.)
So, while the OS is to blame rather than the hardware, we can see why page aligned buffers are more efficient.
Edit: Of course, if we're writing large buffers anyway (100KB), then the number of VM page boundaries crossed will be practically the same whether we've aligned to 512B or not.
So the main case being optimized by 512B alignment is single sector transfers.
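A few lines of C make the page-crossing argument concrete (4 KB pages assumed; the helper name is just for illustration):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Number of 4K page boundaries the buffer [addr, addr + len) crosses. */
static unsigned long page_boundaries_crossed(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    return (unsigned long)(((addr + len - 1) / PAGE_SIZE) - (addr / PAGE_SIZE));
}

int main(void)
{
    /* A 512B-aligned 512B buffer never straddles a page boundary...   */
    printf("%lu\n", page_boundaries_crossed(0x1E00, 512));  /* prints 0 */
    /* ...while an arbitrarily placed one can (and here it does).      */
    printf("%lu\n", page_boundaries_crossed(0x1F80, 512));  /* prints 1 */
    return 0;
}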
Usually large alignment requirements like that are due to underlying DMA hardware. Large block transfers can sometimes be made much faster by requiring much stronger alignment restrictions than what you have here.
On several ARM processors, the first level translation table has to be aligned on a 16 KB boundary!
If you don't know what you're doing, don't use O_DIRECT.
O_DIRECT means "direct device access". This means it bypasses all OS caches, hitting the disk (or possibly RAID controller, etc) directly. Disk accesses are on a per-sector basis.
EDIT: The alignment requirement is for the IO offset/size; it's not usually a memory-alignment requirement.
EDIT: If you're looking at this page (it appears to be the only hit), it also says that the memory must be page-aligned.
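For reference, here is a minimal sketch of using O_DIRECT on Linux with a page-aligned buffer (which also satisfies 512-byte alignment) and a transfer size that is a multiple of 512 bytes; the file name is just a placeholder:

#define _GNU_SOURCE           /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t block = 4096;   /* multiple of 512, exactly one page */
    void *buf = NULL;

    /* Page-aligned buffer: satisfies both the 512B and the
     * page-alignment requirements discussed above. */
    if (posix_memalign(&buf, 4096, block) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    int fd = open("datafile", O_RDONLY | O_DIRECT);   /* placeholder name */
    if (fd < 0) { perror("open"); free(buf); return 1; }

    /* File offset, size and buffer address are all suitably aligned. */
    ssize_t n = pread(fd, buf, block, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}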