ARM TrustZone memory aliasing

I am trying to understand the ARM TrustZone implementation and came across memory aliasing, wherein the same memory is interpreted as secure or non-secure based on the 33rd bit of the address. I am not able to understand the concept of memory aliasing and its use. Can you please explain in detail with a good example?

This answer is an attempt to:
enumerate the overloaded uses of the term 'alias', and
demonstrate what it means in the context of the cited document.
A literal attempt to answer the question is difficult and probably not what the OP intends.
Fundamentally, 'memory alias' means that two addresses refer to the same memory chunk. The cited article is meant more for chip designers, covering how to resolve issues with TrustZone systems and how to develop an SOC peripheral/bus in such a system.
We can have aliases in straight 'C' code through pointers.
#include <string.h>

void test(void)
{
    char buffer[16];
    char *p1 = &buffer[0], *p2 = p1 + 3;   /* two pointers aliasing the same buffer */
    strcpy(p1, "This is a test\n");        /* 15 characters plus NUL fills the buffer */
    /* p1 and p2 refer to the same memory chunk, so this memcpy() has
       overlapping source and destination - undefined behaviour. */
    memcpy(p1, p2, strlen(p2) + 1);
}
The memcpy() can be optimized to perform word-sized moves, and it might copy in a forward or reverse direction. When the pointers are aliased, the order and size of the transfers matter; this is why overlapping copies are undefined for memcpy() and why memmove() exists.
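To make the direction issue concrete, here is a small standalone sketch of my own (not from the cited document): the same overlapping region is copied byte by byte, once low-address-first and once high-address-first, and the two orders give different results.

#include <stdio.h>

static void copy_forward(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* low addresses first */
        dst[i] = src[i];
}

static void copy_backward(char *dst, const char *src, size_t n)
{
    while (n--)                      /* high addresses first */
        dst[n] = src[n];
}

int main(void)
{
    char a[16] = "abcdefgh";
    char b[16] = "abcdefgh";

    copy_forward(a + 2, a, 6);       /* overlapping copy, dst above src */
    copy_backward(b + 2, b, 6);      /* same copy, opposite order       */

    printf("%s\n%s\n", a, b);        /* prints "abababab" and "ababcdef" */
    return 0;
}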
For a cache, you have the additional complication that the cache value and the 'backing store' (the actual memory cell) may not be consistent. There are cache protocols to handle this, and the protocols depend on the cache type:
VIVT - virtually indexed, virtually tagged
PIPT - physically indexed, physically tagged
VIPT - virtually indexed, physically tagged.
TrustZone deals with physical addresses, and it adds an extra bit. For an SOC vendor, you can have a peripheral that uses two memory addresses to refer to the same cell; this is also 'aliasing', since two pointers actually have the same memory behind them. It can be convenient not to deal with TrustZone inside the SOC module, but just provide an alias in the bus connection, so the peripheral responds to two different address ranges. This is implicit in the TrustZone mechanics: a secure access clears 'NS' (the extra 33rd address bit) and a normal-world access sets 'NS'.
Caches need to deal with physical addresses, and this can cause issues in the cache protocols. An easy fix is to not allow the duplicated address to be cached. The addresses in 'C' are the same pointer value, but they get amended with the NS bit by the CPU according to the current world.
Can you please explain in detail with a good example?
Not really example code. I would have to present some Verilog code, a bus connection to an SOC peripheral and master, and a cache with a protocol. I think the explanations above are sufficient without a full 'example'.
Another topic to help/search for might be 'full address decoding'. Non-full (partial) address decoding is sometimes done in hardware for memory devices. This is also an aliasing, and it is much the same thing the cited article is trying to elucidate.
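Not Verilog, but here is a rough, purely illustrative software model of such a decoder (the addresses, widths and register file are all made up): the peripheral only looks at the offset bits it actually decodes and ignores everything above them, including the NS bit, so many different bus addresses land on the same register.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical peripheral with 16 registers (offsets 0x00-0x3C).
   The decoder ignores the NS bit (the 33rd address bit) and any
   undecoded upper bits, so those addresses alias onto one register. */
#define REG_OFFSET_MASK 0x3Full      /* only 6 offset bits are decoded */

static uint32_t regs[16];

static uint32_t *decode(uint64_t bus_addr)   /* 33-bit address incl. NS */
{
    return &regs[(bus_addr & REG_OFFSET_MASK) / 4];
}

int main(void)
{
    uint64_t secure    = 0x4000ULL;              /* NS = 0 (made-up address) */
    uint64_t nonsecure = secure | (1ULL << 32);  /* NS = 1, same cell        */

    *decode(secure) = 0xDEADBEEFu;
    printf("%08X\n", (unsigned)*decode(nonsecure));  /* prints DEADBEEF */
    return 0;
}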

Related

ARM - Memory map leakages

Let's assume that we are using an MCU with an ARM Cortex-M4, 256KB of flash and 64KB of RAM. This CPU uses the standard ARMv7-M memory map: a 512MB code region starting at 0x00000000, a 512MB SRAM region starting at 0x20000000, and peripheral/external regions above that.
If I understand it correctly, the memory map tells us the maximum sizes of the memories, which limits the MCU vendor, and where the CPU will look for them. For example, we cannot use a Cortex-M4 with flash memory above 512MB, right?
In that situation, we have 64KB of RAM, and the limit is 512MB. My question is: does the CPU know about that? Does it have any safety mechanism that will prevent access beyond that 64KB of RAM (e.g. on a stack overflow) by halting in some way? Or will the CPU just work along the lines of "I have these boundaries, I will move around within them if necessary"? I know that compilers may provide some information that can warn the programmer.
If I understand it correctly, the memory map tells us the maximum sizes of the memories, which limits the MCU vendor, and where the CPU will look for them.
Yes.
For example, we cannot use a Cortex-M4 with flash memory above 512MB, right?
Normally the flash is the part between address 0x0 and 0x1FFFFFFF, meaning 512MB indeed (512*1024*1024 = 0x20000000), which is a ridiculously large size for a Cortex-M.
My question is: does the CPU know about that?
Yes and no. The physical memory will exist where the vendor placed it. This can, to some extent, be remapped through the linker script.
The Cortex-M does not have an MMU with support for virtual memory (only an optional MPU), meaning all addresses are physical addresses. It does, however, keep track of various invalid accesses through hardware exceptions. From ARM/Keil AN209, Using Cortex-M3/M4/M7 Fault Exceptions:
Fault exception handlers
Fault exceptions trap illegal memory accesses and illegal program behavior. The following conditions are detected by fault exception handlers:
HardFault: is the default exception and can be triggered because of an error during exception processing, or because an exception cannot be managed by any other exception mechanism.
MemManage: detects memory access violations to regions that are defined in the Memory Protection Unit (MPU); for example, code execution from a memory region with read/write access only.
BusFault: detects memory access errors on instruction fetch, data read/write, interrupt vector fetch, and register stacking (save/restore) on interrupt (entry/exit).
UsageFault: detects execution of undefined instructions, unaligned memory access for load/store multiple. When enabled, divide-by-zero and other unaligned memory accesses are detected.
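As a minimal sketch (mine, not from AN209; it assumes a CMSIS device header and the standard CMSIS startup handler names), you can enable the individual fault exceptions so that, for example, a stray access into unimplemented memory lands in BusFault_Handler instead of escalating straight to HardFault:

#include "stm32f4xx.h"   /* hypothetical CMSIS device header providing SCB */

void fault_init(void)
{
    /* Enable the separate MemManage, BusFault and UsageFault exceptions;
       otherwise these conditions escalate to HardFault. */
    SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk
                | SCB_SHCSR_BUSFAULTENA_Msk
                | SCB_SHCSR_USGFAULTENA_Msk;
}

void BusFault_Handler(void)
{
    /* A read/write outside implemented memory typically ends up here,
       if the bus actually returns an error for that address. */
    for (;;) { }   /* halt; a real handler might log SCB->CFSR / SCB->BFAR */
}

void HardFault_Handler(void)
{
    for (;;) { }   /* default catch-all */
}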
No, the CPU does not know - you specify the memory map in the linker script, and the link will fail if your code and/or data cannot be located in the stated available memory.
If you specify the memory map incorrectly, the linker may locate code/data in non-existent memory, and when you load it, parts will be missing. For flash programming, the programming tool will very likely fail if it is set to read back and verify the code.
Also, if you dynamically load code to non-existent memory, or access memory not allocated by the linker at run time, the results are non-deterministic, other than that it won't do anything useful.
The CPU cannot know, as everyone has said. The MCU vendor buys the processor IP from ARM, as well as IP from other vendors, and also creates some of its own, if nothing else the glue that holds the modules together. The flash itself is likely from some third party.
Some chip designers wrap around; this is not uncommon in hardware or software. For example, the part may have 16KB of flash starting at 0x08000000. This is the chip company's decision; ARM has little to do with it other than, as you have found, defining the wide ranges (likely for caching and other options within their domain). 16K is 16384 bytes, or 0x4000, so 14 bits of address. There is likely an address decoder that sees some number of upper bits (0x08...) and sends that request to the flash logic. At the flash logic it would not surprise me to see only the lower 14 address bits used, meaning that if you were to address 0x08000000 and 0x08008000 you may get the same 0x0000 offset/address in the flash.
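From the software side, a quick (and entirely hypothetical, per-chip) experiment for such a part would look like this; on a device that decodes all the address bits, the aliased read may BusFault instead:

#include <stdint.h>

/* Hypothetical 16KB flash at 0x08000000 whose controller only decodes
   the lower 14 address bits, so addresses differing above bit 13 alias. */
#define FLASH_BASE   0x08000000u
#define FLASH_ALIAS  0x08008000u   /* bit 15 set, assumed not decoded */

int flash_is_aliased(void)
{
    volatile const uint32_t *a = (volatile const uint32_t *)FLASH_BASE;
    volatile const uint32_t *b = (volatile const uint32_t *)FLASH_ALIAS;

    /* Two different pointer values that may hit the same physical cell. */
    return a[0] == b[0];
}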
Some engineers may choose to look at those upper bits and declare a fault.
You have to examine this on a case-by-case basis, not just "an STM32" for example, but each family of STM32, basically for every datasheet. (And there is no reason to expect this level of detail to be documented by the chip vendor.)
The ARM Cortex-M, as with all processors, is very, very stupid; it does what the bits tell it to do. It is our responsibility to feed it a sequential trail of working instructions, just like laying track in front of a train: you can lay a lot of track in the wrong place, with gaps, etc., and if it is not per the rules of the train, then the train will crash or fail in some way.
The others have mentioned the linker script, and to be clear the linker script does not just magically know what chip you have. Ultimately you, the programmer, are responsible for telling the toolchain to build programs that follow the rules of the CPU and the chip in order to be successful. That means the right architecture instructions, or a subset (Cortex-M0/ARMv6-M instructions will run on a Cortex-M4/ARMv7-M), and a linker script that defines addresses for the read-only and read/write areas which match the chip (not the core; the chip vendor is in charge of that definition). Then, barring 100 other ways you can fail, it will run.
You are ultimately responsible, but most folks grab an SDK or sandbox of some sort and hope for the best: blind faith in others. The GNU and LLVM tools are fully capable of being used by you directly, without these third parties, but then you are fully responsible for getting everything right.

Does labelling a block of memory volatile imply the cache is always bypassed? [duplicate]

Cache is controlled by the cache hardware, transparently to the processor, so if we use volatile variables in a C program, how is it guaranteed that my program reads data each time from the actual memory address specified and not from the cache?
My understanding is that:
The volatile keyword tells the compiler that accesses to the variable shouldn't be optimized away and should be performed exactly as written in the code.
The cache is controlled by the cache hardware transparently, hence when the processor issues an address, it doesn't know whether the data is coming from the cache or from memory.
So, if I have a requirement to read a memory address every time it is needed, how can I make sure that it is not served from the cache but from the required address?
Somehow, these two concepts do not fit together well. Please clarify how this is done.
(Imagine we have a write-back policy in the cache, if that is required for analyzing the problem.)
Thank you,
Microkernel :)
Firmware developer here. This is a standard problem in embedded programming, and one that trips up many (even very experienced) developers.
My assumption is that you are attempting to access a hardware register, and that register value can change over time (be it interrupt status, timer, GPIO indications, etc.).
The volatile keyword is only part of the solution, and in many cases may not be necessary. It causes the variable to be re-read from memory each time it is used (as opposed to being optimized out by the compiler or kept in a processor register across multiple uses), but whether the "memory" being read is an actual hardware register or a cached location is unknown to your code and unaffected by the volatile keyword. If your function only reads the register once then you can probably leave off volatile, but as a general rule I suggest that most hardware registers should be defined as volatile.
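For instance, a typical register definition might look like this (the address is made up purely for illustration):

#include <stdint.h>

/* Hypothetical status register; volatile forces a real load on every access. */
#define STATUS_REG (*(volatile uint32_t *)0x40001000u)

void wait_until_ready(void)
{
    while ((STATUS_REG & 0x1u) == 0)   /* re-reads the register each iteration */
        ;
}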
The bigger issue is caching and cache coherency. The easiest approach here is to make sure your register is in uncached address space. That means every time you access the register you are guaranteed to read/write the actual hardware register and not cache memory. A more complex but potentially better performing approach is to use cached address space and have your code manually force cache updates for specific situations like this. For both approaches, how this is accomplished is architecture-dependent and beyond the scope of the question. It could involve MTRRs (for x86), MMU, page table modifications, etc.
Hope that helps. If I've missed something, let me know and I'll expand my answer.
From your question, there is a misconception on your part.
The volatile keyword is not related to the cache in the way you describe.
When the keyword volatile is specified for a variable, it gives a hint to the compiler not to do certain optimizations, as this variable can change from other parts of the program unexpectedly.
What is meant here is that the compiler should not reuse a value already loaded into a register, but should access the memory again, as the value in the register is not guaranteed to be the same as the value stored in memory.
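A tiny sketch of that optimization, assuming a flag set from an interrupt handler (the names are made up):

#include <stdbool.h>

volatile bool data_ready;   /* set by a (hypothetical) interrupt handler */

void wait_for_data(void)
{
    /* Without volatile, the compiler may load data_ready into a register
       once and spin on the stale copy forever; with volatile it performs
       a fresh load from memory (or cache) on every iteration. */
    while (!data_ready)
        ;
}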
The rest, concerning the cache memory, is not directly related to the programmer.
I mean that the synchronization of any CPU cache with the RAM is an entirely different subject.
My suggestion is to mark the page as non-cached by the virtual memory manager.
In Windows, this is done by setting the PAGE_NOCACHE flag when calling VirtualProtect.
For a somewhat different purpose, the SSE2 instruction set has the _mm_stream_xyz intrinsics to prevent cache pollution, although I don't think they apply to your case here.
In either case, there is no portable way of doing what you want in C; you have to use OS functionality.
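For illustration only, a sketch of the Windows route using the same PAGE_NOCACHE flag, here with VirtualAlloc to get a freshly allocated non-cached buffer (error handling omitted):

#include <windows.h>

/* Allocate a buffer whose pages are marked non-cacheable, so accesses
   are not satisfied from the CPU cache. */
void *alloc_uncached(SIZE_T size)
{
    return VirtualAlloc(NULL, size,
                        MEM_COMMIT | MEM_RESERVE,
                        PAGE_READWRITE | PAGE_NOCACHE);
}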
Wikipedia has a pretty good article about MTRR (Memory Type Range Registers) which apply to the x86 family of CPUs.
To summarize it: starting with the Pentium Pro, Intel (and AMD, which copied the feature) had these MTRR registers, which could set uncached, write-through, write-combining, write-protect or write-back attributes on ranges of memory.
Starting with the Pentium III (but, as far as I know, only really useful on the 64-bit processors), the MTRRs are still honored, but they can be overridden by the Page Attribute Table, which lets the CPU set a memory type for each page of memory.
A major use of the MTRRs that I know of is graphics RAM. It is much more efficient to mark it as write-combining. This lets the CPU buffer up (combine) the writes, and it relaxes all of the memory write-ordering rules to allow very high-speed burst writes to a graphics card.
But for your purposes you would want either a MTRR or a PAT setting of either uncached or write-through.
As you say, the cache is transparent to the programmer. The system guarantees that you always see the value that was last written if you access an object through its address. The "only" thing that you may incur if an obsolete value is in your cache is a runtime penalty.
volatile makes sure that data is read every time it is needed, but it says nothing about any hardware cache between the CPU and memory. If you need to read actual data from memory and not cached data, you have two options:
Make a board where said data is not cached. This may already be the case if you address some I/O device.
Use specific CPU instructions that bypass the cache. This is used when you need to scrub memory in order to find possible SEU (single event upset) errors.
The details of second option depend on OS and/or CPU.
Using the _Uncached keyword may help in an embedded OS, like MQX:
#define MEM_READ(addr) (*((volatile _Uncached unsigned int *)(addr)))
#define MEM_WRITE(addr,data) (*((volatile _Uncached unsigned int *)(addr)) = data)

What are the benefits/uses of Device or Strongly-ordered memory type?

My question is regarding the different memory types available on an M4 chip, which I am reading about right now. To summarize, there are three different memory types, i.e. 'normal', 'device' and 'strongly-ordered', that define the order (or whether there is any guaranteed order at all) in which the memory system carries out memory accesses (e.g. ldr or str). It seems that the 'normal' memory type allows the memory system to reorder accesses to improve efficiency, provided that program behaviour is unchanged.
The question is: if the behaviour is unchanged and efficiency is improved, what are the practical uses of the 'device' and 'strongly-ordered' memory types? From my beginner perspective, I understand that there must be a reason for them to exist, but I have yet to gain the personal experience to link to this topic.
Basically, you use the Strongly-ordered (or Device) attribute for memory accesses which have side effects, e.g. FIFOs.
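To see why, consider a hypothetical UART receive FIFO (the addresses and bit layout are made up): every read of the data register pops a byte, so the memory system must issue exactly the accesses the program wrote, in program order, which is what the Device/Strongly-ordered attributes (together with volatile) guarantee.

#include <stdint.h>

/* Hypothetical UART registers; the region must be mapped Device or
   Strongly-ordered because each access has a side effect. */
#define UART_STATUS (*(volatile uint32_t *)0x4000A004u)  /* bit 0: data ready  */
#define UART_DATA   (*(volatile uint32_t *)0x4000A000u)  /* a read pops a byte */

int uart_read_byte(void)
{
    while ((UART_STATUS & 0x1u) == 0)
        ;                                 /* must poll status before data      */
    return (int)(UART_DATA & 0xFFu);      /* exactly one read: a repeated or
                                             merged access would drop bytes    */
}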

ARM bare-metal with MMU: write to non-cachable,non-bufferable mapped area fail

I am using an ARM Cortex-A9 CPU with 2 cores, but I just use 1 core and the other is in a busy loop. I set up the MMU table using sections (1MB per entry) like this:
0x00000000-0x14ffffff => 0x00000000-0x14ffffff (non-cachable, non-bufferable)
0x15000000-0x24ffffff => 0x15000000-0x24ffffff (cachable, bufferable)
0x25000000-0x94ffffff => 0x25000000-0x94ffffff (non-cachable, non-bufferable)
0x15000000-0x24ffffff => 0x95000000-0xa4ffffff (non-cachable, non-bufferable)
0xa5000000-0xffffffff => 0xa5000000-0xffffffff (non-cachable, non-bufferable)
It is rather simple: I just want to have a mirror of 256MB of memory for non-cachable access. However, when I do several writes to the non-cachable memory section at 0x95000000-0xa4ffffff, I find the writes are not actually written to memory until I explicitly do a cache flush.
Am I doing something wrong, or is this kind of mapping not valid? If so, I don't understand how Linux's ioremap works on ARM. It would be good if anyone could give me some explanation. Thanks very much.
First of all: the Cortex-A9 is an ARMv7-A processor. The terms non-cacheable/non-bufferable/cacheable/bufferable are no longer correct descriptions of the mappings.
The actual mapping type is determined by TEX[2:0], C and B bits.
So I am actually having to guess a bit here as to what your mappings actually are.
And my guess is that you have the majority of your mappings set as Strongly-ordered, and the mirrored region as Normal Write-Back cacheable.
Having multiple virtual mappings with different memory types pointing to the same physical location is generally not a good idea in the ARM architecture. It used to be explicitly banned, but the latest version of the ARMv7-AR Architecture reference manual (DDI 0406C.b) has a (fairly long) section dedicated to the implications of "Mismatched memory attributes".
I would recommend finding a different way of achieving your goal.
Simply changing the mapping of the uncached regions to Normal Non-cacheable would be a good start. There is no valid reason for using Strongly-ordered mappings for RAM.
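As a sketch of that suggestion (my own code, not from the answer; it assumes the ARMv7 short-descriptor format, domain 0 and full access permissions, and omits the TLB maintenance you would need afterwards), a Normal Non-cacheable section uses TEX=0b001 with C=B=0, whereas TEX=C=B=0 would be Strongly-ordered:

#include <stdint.h>

/* Build a 1MB first-level section entry mapping virtual 'va' to physical
   'pa' as Normal, Non-cacheable memory. 'ttb' is the translation table. */
static void map_section_normal_nc(uint32_t *ttb, uint32_t va, uint32_t pa)
{
    uint32_t entry = (pa & 0xFFF00000u)   /* section base address [31:20]   */
                   | (1u << 12)           /* TEX = 0b001                    */
                   | (3u << 10)           /* AP[1:0] = 0b11: full access    */
                   /* C = 0, B = 0: with TEX=0b001 this is Normal,
                      Non-cacheable rather than Strongly-ordered            */
                   | 0x2u;                /* bits[1:0] = 0b10: section      */

    ttb[va >> 20] = entry;                /* then invalidate the TLBs       */
}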

What is the difference between far pointers and near pointers?

Can anybody tell me the difference between far pointers and near pointers in C?
On a 16-bit x86 segmented memory architecture, four registers are used to refer to the respective segments:
DS → data segment
CS → code segment
SS → stack segment
ES → extra segment
A logical address on this architecture is written segment:offset. Now to answer the question:
Near pointers refer (as an offset) to the current segment.
Far pointers use segment info and an offset to point across segments. So, to use them, DS or CS must be changed to the specified value, the memory will be dereferenced and then the original value of DS/CS restored. Note that pointer arithmetic on them doesn't modify the segment portion of the pointer, so overflowing the offset will just wrap it around.
And then there are huge pointers, which are normalized to have the highest possible segment for a given address (contrary to far pointers).
On 32-bit and 64-bit architectures, memory models are using segments differently, or not at all.
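A classic illustration from the old 16-bit DOS compilers (Borland/Microsoft; far is non-standard): building a far pointer to the text-mode video memory at segment 0xB800, which a near pointer confined to the default data segment could not reach.

/* 16-bit DOS compilers only; 'far' is a non-standard keyword. */
void write_char_topleft(char c)
{
    /* Segment 0xB800, offset 0x0000: text-mode video memory. */
    char far *video = (char far *)0xB8000000L;

    video[0] = c;      /* character cell                */
    video[1] = 0x07;   /* attribute byte: grey on black */
}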
Since nobody mentioned DOS, let's forget about old DOS PC computers and look at this from a generic point of view. Then, very simplified, it goes like this:
Any CPU has a data bus, which is the maximum amount of data the CPU can process in one single instruction, i.e. equal to the size of its registers. The data bus width is expressed in bits: 8 bits, 16 bits, 64 bits, etc. This is where the term "64-bit CPU" comes from; it refers to the data bus.
Any CPU has an address bus, also with a certain bus width expressed in bits. Any memory cell in your computer that the CPU can access directly has a unique address. The address bus is large enough to cover all the addressable memory you have.
For example, if a computer has 65536 bytes of addressable memory, you can cover these with a 16 bit address bus, 2^16 = 65536.
Most often, but not always, the data bus width is as wide as the address bus width. It is nice if they are of the same size, as it keeps both the CPU instruction set and the programs written for it clearer. If the CPU needs to calculate an address, it is convenient if that address is small enough to fit inside the CPU registers (often called index registers when it comes to addresses).
The non-standard keywords far and near are used to describe pointers on systems where you need to address memory beyond the normal CPU address bus width.
For example, it might be convenient for a CPU with a 16-bit data bus to also have a 16-bit address bus. But the same computer may also need more than 2^16 = 65536 bytes = 64kB of addressable memory.
The CPU will then typically have special instructions (that are slightly slower) which allow it to address memory beyond those 64kB. For example, the CPU can divide its large memory into n pages (also sometimes called banks, segments and other such terms, which can mean different things from one CPU to another), where every page is 64kB. It will then have a "page" register which has to be set first, before addressing that extended memory. Similarly, it will have special instructions for calling/returning from subroutines located in extended memory.
In order for a C compiler to generate the correct CPU instructions when dealing with such extended memory, the non-standard near and far keywords were invented. Non-standard as in they aren't specified by the C standard, but they are de facto industry standard and almost every compiler supports them in some manner.
far refers to memory located in extended memory, beyond the width of the address bus. Since it refers to addresses, most often you use it when declaring pointers. For example: int * far x; means "give me a pointer that points to extended memory". And the compiler will then know that it should generate the special instructions needed to access such memory. Similarly, function pointers that use far will generate special instructions to jump to/return from extended memory. If you didn't use far then you would get a pointer to the normal, addressable memory, and you'd end up pointing at something entirely different.
near is mainly included for consistency with far; it refers to anything in the addressable memory and is equivalent to a regular pointer. So it is mainly a useless keyword, save for some rare cases where you want to ensure that code is placed inside the standard addressable memory. You could then explicitly label something as near. The most typical case is low-level hardware programming where you write interrupt service routines. They are called by hardware from an interrupt vector with a fixed width, which is the same as the address bus width, meaning that the interrupt service routine must be in the standard addressable memory.
The most famous use of far and near is perhaps the mentioned old MS DOS PC, which is nowadays regarded as quite ancient and therefore of mild interest.
But these keywords exist on more modern CPUs too! Most notably in embedded systems where they exist for pretty much every 8 and 16 bit microcontroller family on the market, as those microcontrollers typically have an address bus width of 16 bits, but sometimes more than 64kB memory.
Whenever you have a CPU where you need to address memory beyond the address bus width, you will have the need for far and near. Generally, such solutions are frowned upon though, since it is quite a pain to program with them and to always take the extended memory into account.
One of the main reasons why there was a push to develop the 64-bit PC was actually that the 32-bit PCs had come to the point where their memory usage was starting to hit the address bus limit: they could only address 4GB of RAM (2^32 = 4.29 billion bytes = 4GB). In order to enable the use of more RAM, the options were then either to resort to some burdensome extended memory solution like in the DOS days, or to expand the computers, including their address bus, to 64 bits.
Far and near pointers were used on old platforms like DOS.
I don't think they're relevant on modern platforms. But you can learn about them here and here (as pointed out by other answers). Basically, a far pointer is a way to extend the addressable memory in a computer, i.e. to address more than 64k of memory on a 16-bit platform.
A pointer basically holds an address. As we all know, memory management on 16-bit Intel CPUs is based on segments (CS, DS, SS, ES).
So when the address pointed to by a pointer is within the same segment, it is a near pointer and therefore requires only 2 bytes for the offset.
On the other hand, when a pointer points to an address which is outside the segment (that is, in another segment), that pointer is a far pointer. It consists of 4 bytes: two for the segment and two for the offset.
Well, in DOS it was kind of funny dealing with registers and segments; it was all about the maximum addressable amount of RAM.
Today it is pretty much irrelevant. All you need to read about is the difference between virtual/user space and kernel space.
Since Windows NT4 (when they stole ideas from *nix), Microsoft programmers started to use what was called user/kernel memory spaces,
and avoided direct access to physical controllers since then. Since then, the problem of dealing with direct access to memory segments disappeared as well; everything became read/write through the OS.
However, if you insist on understanding and manipulating far/near pointers, look at the Linux kernel source and how it works; you will never come back, I guess.
And if you still need to use CS (code segment)/DS (data segment) in DOS, look at these:
https://en.wikipedia.org/wiki/Intel_Memory_Model
http://www.digitalmars.com/ctg/ctgMemoryModel.html
I would like to point to the excellent answer above from Lundin. I was too lazy to answer properly; Lundin gave a very detailed and sensible explanation. Thumbs up!

Resources