Are memory-mapped registers separate registers on the bus? (ARM)

I will use the TM4C123 ARM microcontroller/board as an example.
Most of its I/O registers are memory mapped, so you can get/set their values using
regular memory load/store instructions.
My question is: is there some kind of register outside the CPU, somewhere on the bus, that is mapped to memory, so that reading/writing through the memory region essentially keeps duplicate values, one in the register and one in memory? Or is the memory itself the register?

There are many buses even in an MCU: bus after bus after bus, splitting off like branches in a tree (and sometimes even merging, unlike a tree).
It may predate the Intel/Motorola battle, but certainly in that time frame you had segmented vs. flat addressing, and port-mapped I/O vs. memory-mapped I/O, since Motorola (and others) did not have a separate I/O bus (well, one extra address signal).
Look at the ARM architecture documents and the chip documentation (ARM makes IP, not chips). You have load and store instructions that operate on addresses. The documentation for the chip (and to some extent ARM, which provides rules for the Cortex-M address space) gives a long list of addresses for things. As a programmer you simply use the right address with the right load or store instruction.
Someone's marketing may still care about terms like memory-mapped I/O, and because Intel x86 still exists, some folks will continue to carry those terms around. To the programmer, though, they are first and foremost just bits that go into registers, and for particular instructions those bits are addresses. If you want to add adjectives to that, go for it.
If the address you use, based on the chip and core documentation, points at an SRAM, then that is a read or write of memory. If it is a flash address, then that is the flash. The UART, the UART; timer 5, then timer 5 control and status registers. Etc.
There are addresses in these MCUs that point at two or three things, but not at the same time (address 0x00000000 and some number of kilobytes after it), but, again, not at the same time. At least in many of these Cortex-M MCUs, these special address spaces do not overlap "memory" and peripherals (I/O); they are instead places from which you can boot the chip and run some code. With these Cortex-Ms I do not think you can even use the MPU (their simplified sort-of MMU) to mix these spaces. On full-sized ARMs and other architectures you definitely can use a full-blown MMU to mix up the address spaces and have a virtual address space that lands on a physical address space of a different type.


Is accessing mapped device memory slow (in terms of latency)?

I know the question is vague, but here is what I hope to learn: the system maps some part of the memory address space to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data in and out of PCI Express devices is packetized, serialized, and transmitted over lanes, which means each read/write incurs significant overhead, such as packetizing (adding headers) and de-packetizing. So it seems it is not ideal for user/kernel code to read device memory a byte at a time; instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into the main memory address space: DMA is about letting the device access main memory, and my question is the other way around, letting user/kernel code access device memory. So I am guessing it is not related to the question above; is that correct?
Yes, accessing memory-mapped I/O (MMIO) is slow.
The primary reason that it is slow is that it is typically uncacheable,
so every access has to go all the way to the device.
In x86 systems, which I am most familiar with, cacheable memory is accessed in 64-byte chunks,
even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks.
If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processor to memory is critical to performance and is highly optimized.
The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)

ARM Cortex-M4 boot sequence

I am a bit confused about the boot sequence of ARM Cortex-M processors. From many different resources, I read that upon reset, the Cortex-M loads the initial stack pointer from address 0x0 and loads the reset handler address from 0x4 into the PC.
My questions are:
1) How does the Cortex-M processor copy these two values into the appropriate registers? I mean, a processor needs LDR/STR instructions to do so, but here the values are copied automatically. How does the processor know that these two words need to be copied?
2) Does the Cortex-M controller contain any built-in firmware that is executed initially?
3) Normally, after reset, processors start executing from a specific memory location, the reset vector, where a jump instruction to the reset handler is placed. But here, in Cortex-M, the processor starts by copying the first two words into registers, and then the program counter points to the reset handler. No jump instruction, no specific memory location the processor jumps to on reset! How is this possible?
2) Does the Cortex-M controller contain any built-in firmware that is executed initially?
It depends highly on the make and model. Example: NXP LPC-series Cortex-M chips (like the LPC17xx) have some masked-ROM code that is executed before the program in flash. Others may have no such memory built in.
1) How does the Cortex-M processor copy these two values into the appropriate registers? I mean, a processor needs LDR/STR instructions to do so
This happens in hardware, before any code executes, so no LDR instructions are needed.
It's ridiculously simple, if you know what a state machine is and how to implement one in a hardware description language like VHDL or Verilog: dedicated reset logic performs the two word reads and loads SP and PC directly.

MMU disabled vs MMU enabled with one-to-one paging

I am trying to understand the difference between these two modes of operation (mostly on ARM processors):
MMU is disabled.
MMU is enabled, but using one-to-one paging, i.e. virtual address is same as physical address.
From my understanding in both cases the memory is accessed as flat memory, one-to-one paging.
Is that correct ?
Thank you.
Sure, you can map virtual to physical however you like, including one-to-one so that they are equal. There are still differences versus having the MMU off: each and every access has to go through the MMU, be looked up, and be converted (even if one-to-one). The translation tables themselves are in RAM as well, and walking them takes time; there is a little cache to help (the TLB), but it is pretty small. Then there are the other settings in the MMU: cacheable or not, and protection, which may require additional lookups within the chip that may or may not take extra clock cycles.
So purely from an addressing perspective, sure, the virtual address and physical address can be the same for the whole address space. Some bits from the MMU table entry replace some bits of the virtual address to form the physical address, and you can set those to match for some or all of the address space.

Significance of Reset Vector in Modern Processors

I am trying to understand how computer boots up in very detail.
I came across two things which made me more curious,
1. RAM is placed below the ROM to avoid memory holes, as in the Z80 processor.
2. A reset vector is used, which takes the processor to a memory location in ROM whose contents point to the actual location (again in ROM) from which the processor actually starts executing instructions (the POST code). Why so?
If you still can't follow me, this link explains it briefly:
http://lateblt.tripod.com/bit68.txt
The processor's logic is generally rigid and fixed, thus the term hardware. Software is something that can be changed, molded, etc., thus the term software.
The hardware needs to start somehow. There are two basic methods:
1) an address, hardcoded in the logic, in the processor's memory space is read, and that value is the address at which to start executing code
2) an address, hardcoded in the logic, is where the processor starts executing code
When the processor is integrated with other hardware, anything can be mapped into any address space. You can put RAM at address 0x1000 or 0x40000000, or both. You can map a peripheral to 0x1000 or 0x4000 or 0xF0000000, or all of the above. Where things go is the choice of the system designers, or a combination of the teams of engineers. One important factor is how the system will boot once reset is released. How the processor boots is well known from its architecture. The designers often choose one of two paths:
1) Put a ROM in the memory space that contains the reset vector or the entry point, depending on the boot method of the processor (no matter the architecture, there is a first address or first block of addresses that is read, and its contents drive the booting of the processor). The software places code or a vector table or both in this ROM so that the processor will boot and run.
2) Put RAM in the memory space, in such a way that some host can download a program into that RAM, then release reset on the processor. The processor then follows its hardcoded boot procedure and the software is executed.
The first is most common; the second is found in some peripherals: mice, network cards, and things like that (some of the firmware in /usr/lib/firmware/ is used for this, for example).
The bottom line, though, is that the processor is usually designed with one boot method, a fixed method, so that all software written for that processor can conform to that one method and not have to keep changing. Also, the processor, when designed, doesn't know its target application, so it needs a generic solution. The target application often defines the memory map, what is where in the processor's memory space, and one of the tasks in that assignment is deciding how the product will boot. From there the software is compiled and placed such that it conforms to the processor's rules and the product's hardware rules.
It completely varies by architecture. There are a few reasons why cores might want to do this, though. Embedded cores (think along the lines of ARM and MicroBlaze) tend to be used within system-on-chip machines with a single address space. Such architectures can have multiple memories all over the place and tend only to dictate that the bottom area of memory (i.e. 0x00) contains the interrupt vectors. This then allows the programmer to easily specify where to boot from. On MicroBlaze, you can attach memory wherever the hell you like in XPS.
In addition, it can be used to easily support bootloaders. These are typically small programs that do a bit of initialization, then fetch a larger program from a medium that can't be accessed simply (e.g. USB or Ethernet). In these cases, the bootloader typically copies itself to high memory, loads the main program below itself, and then jumps there. The reset vector simply allows the programmer to bypass the first step.

How is data from the RAM fetched?

In C, each byte is individually addressable. Suppose an integer (say, one that uses 4 bytes) has address 0xaddr (32 bits, assuming a 32-bit processor with a 32-bit address bus and 32-bit data bus), and suppose the value of the integer is 0x12345678. Now, if I am fetching this value from memory, how does the processor do it? Does the processor place 0xaddr on the address lines and fetch the 8-bit value 0x12, then place 0xaddr+1 on the address lines and fetch another 8-bit value 0x34, and so on for the 4 bytes of the integer? Or does the processor just place 0xaddr and read the 4 bytes at once, thus utilizing its full 32-bit data bus?
This is a well-known article by the GNU C library lead that describes memory access (particularly on x86, i.e. current PC, systems). It goes into far more detail than you could ever possibly need.
The entire article is spread across many parts:
Introduction
CPU Caches
Virtual Memory
NUMA Support
Programmers
More Programmers
Performance Tools
Future
Appendices
One thing I'd add to gbulmer's answer is that in many systems getting a stream of data is faster than you would expect from repeatedly fetching single words. In other words, selecting where you want to read from takes some time, but once you have that selected, reading from that point, then the next 32 or 64 or whatever bits, then the next, is faster than switching to some unconnected place and reading another value.
And what dominates modern programming is not the behaviour of fetching from memory on the motherboard, but whether the data are in a CPU cache.
If you search the web for "Computer Architecture" you are likely to get some answers to your questions.
For your specific question: take a 32-bit computer with a 32-bit data and address bus, in a simple case with no obfuscating hardware. It will read 32 bits from a 32-bit-wide memory.
This is the sort of hardware that has existed since the late 1970s in minicomputers (e.g. the DEC VAX), and still exists in microprocessors (x86, ARM Cortex-A8, MIPS32) and inside some microcontrollers (e.g. ARM Cortex-M3, PIC32, etc.).
The simplest case:
The address bus is a set of signals (wires) which carry address signals to memory, plus a few more signals to communicate whether memory is to be 'read from' or 'written to' (data direction), and whether the signals on the address and data direction wires are valid. In the case of your example, there might be 32 wires to carry the bit pattern of the address.
The data bus is a second set of wires which communicate the value to and from memory. Memory might assert a signal to say the data is valid, but it might just be fast enough that everything 'just works'.
When the processor puts the address on the address signals, says it wants to read from memory (data direction is 'read'), memory will retrieve the value stored at that address, and put it onto the data bus signals. The processor (after suitable delays and signals) will sample the data bus wires, and that is the value it uses.
The processor might read the whole 32 bits and extract a byte (if that is all the instruction requires) internally, or the external address bus may provide extra signals so that the external memory system can be built to provide the appropriate byte, double-byte, or quad-byte values. For many years, versions of the ARM processor architecture could only read the whole 32 bits, and smaller pieces, e.g. a byte, were extracted internally.
You can see an example of this sort of signal set at http://www.cpu-world.com/info/Pinouts/68000.html
That chip has only a 24-bit address bus and a 16-bit data bus.
It has two signals (UDS and LDS) which indicate whether the upper data signals, the lower data signals, or both are being used.
I found a reasonably detailed explanation at research.cs.tamu.edu/prism/lectures/mbsd/mbsd_l15.pdf
I found that by searching for "68000 memory bus cycle".
You might usefully look for MIPS, ARM, or x86 to see their bus cycle.
