what orderings are guaranteed by the ARM weak memory model

what orderings are guaranteed by the ARM weak memory model - arm

I understand the basic differences b/w a weak and strong memory model.
But there is no exact definition of weak and it depends on the architecture (here ARM).
I have gone through the documentation on ARM infocenter but still not clear on a lot of things.
Can somebody please list out -
what memory access orderings are guaranteed by ARM.
And hence what orderings must an assembly programmer explicitly enforce in code, when coding for a multi-core ARM system.
Bonus points for anyone who can explain the differences b/w ARM and PPC(Power PC) memory models.

Check out the Cortex-A Series Programmer's Guide, it has a chapter on memory ordering. For example:
Three memory types are defined in the ARM Architecture. All regions of
memory are configured as one of these three types.
Strongly-ordered
Device
Normal.
In addition, for normal and device memory, it is possible to specify
whether the memory is shareable (accessed by other agents) or not. For
normal memory, inner and outer cacheable properties can be specified.

Related

ARM Trustzone memory aliasing

I am trying to understand the ARM Trustzone implementation and came across the memory aliasing wherein the same memory is interpreted as secure and non-secure based on the 33rd bit of that address. I am not able to understand the concept of memory aliasing and its use. Can you please explain in detail with some good example.

The answer is an attempt to,
enumerate the overloaded uses of the term 'alias'.
Demonstrate what it means in the context of the cited document.
A literal attempt to answer the question is difficult and probably not what the OP intends.
Fundamentally, 'memory alias' means that two addresses refer to the same memory chunk. The cited article is meant more for chip designers on how to resolve issues with TrustZone systems and developing an SOC peripheral/bus in such a system.
We can have aliases in straight 'C' code through pointers.
void test(void)
{
char buffer[16];
char *p1 = &buffer[0], *p2 = p1 + 3.
strcpy(p1, "This is a test\n");
// confusion as pointers are referring to same memory chunk.
memcpy(p1,p2,strlen(p2)+1);
}
The memcpy() can be optimized to perform word moves. It might copy in a forward or reverse manner. When the pointers are aliased, the order and size of transferring memory matters.
For a cache, you have additional complication that the cache value and the 'backing store' (actual memory cell) may not be consistent. There are cache protocols to handle this. The protocols are dependant on the cache type.
VIVT - virtually indexed, virtually tagged
PIPT - physically indexed, physically tagged
VIPT - virtually indexed, physically tagged.
Trustzone deals with physical addresses and it adds an extra bit. For an SOC vendor, you can have a peripheral that uses two memory addresses to refer to the same cell. This is also 'aliasing'. So two pointers actually have the same memory behind. This can be convenient to not deal with TrustZone in the SOC module, but just provide an alias in the bus connection. So the peripheral will respond to two different address ranges. This is implicit in the TrustZone mechanics. A secure address clears 'NS' (address bit 33) and a normal access sets 'NS' (address bit 33).
Caches need to deal with physical addresses and this can cause issues in the cache protocols. An easy fix is to not allow the duplicated address to be cached. The address in 'C' are the same pointer value; but get amended by the CPU world.
Can you please explain in detail with some good example.
Not really example code. I would have to present some Verilog code and bus connection to an SOC peripheral and master and a cache with a protocol. I think the explanations above are sufficient without an 'example'.
Another topic to help/search might be 'full address decoding'. Non-full address decode is sometimes done with memory devices by hardware. This is also an aliasing which is much the same as the article is trying to elucidate.

ARM - Memory map leakages

Lets assume that we are using MCU with ARM Cortex-M4, 256KB of FLASH and 64KB of RAM. This CPU contains memory map like showed below:
As I understand it correctly, the memory map tells us what are the maximum sizes of memories, that limits MCU vendor and where that CPU will look for it. For example, we cannot use Cortex-M4 with FLASH memory above 512MB, right?
In that situation, we have 64KB of RAM, and the limit is 512MB. My question is - does CPU know about that? Does it have any safety mechanisms, that will avoid trying to access beyond that 64KB of RAM (stack overflow) by halting in any way? Or maybe the CPU will work in way like "I have that boundaries, I will move around these if necessary". I know, that compilers may provide some information, that can aware the programmer.

As I understand it correctly, the memory map tells us what are the maximum sizes of memories, that limits MCU vendor and where that CPU will look for it.
Yes.
For example, we cannot use Cortex-M4 with FLASH memory above 512MB, right?
Normally the flash is the part between address 0x0 and 0x1FFFFFFF. Meaning 512MB indeed (1024*1024*512=0x20000000). Which is a ridiculously large size for a Cortex M.
My question is - does CPU know about that?
Yes and no. The physical memory will exist where the vendor placed it. This can at some extent be remapped through the linker script.
The Cortex M does not have an advanced MMU/MPU with support virtual memory, meaning all memory is physical addresses. It does however keep track of various invalid accesses through hardware exceptions. From ARM/Keil AN209 Using Cortex-M3/M4/M7 Fault Exceptions:
Fault exception handlers
Fault exceptions trap illegal memory accesses and illegal program behavior. The following conditions are detected by fault exception handlers:
HardFault: is the default exception and can be triggered because of an error during exception processing, or because an exception cannot be managed by any other exception mechanism.
MemManage: detects memory access violations to regions that are defined in the Memory Management Unit (MPU); for example, code execution from a memory region with read/write access only.
BusFault: detects memory access errors on instruction fetch, data read/write, interrupt vector fetch, and register stacking (save/restore) on interrupt (entry/exit).
UsageFault: detects execution of undefined instructions, unaligned memory access for load/store multiple. When enabled, divide-by-zero and other unaligned memory accesses are detected.

No the CPU does not know - you specify the memory map in the linker script, and the link will fail if your code and/or data cannot be located in the stated available memory.
If you specify the memory map incorrectly, the linker may locate code/data in non-existent memory and when you load it, parts will be missing. For the flash programming very likely the programming tool will fail if it is set to read-back verify the code.
Also if you dynamically load code to non existent memory, or access memory not allocated by the linker at run-time, the results are non-deterministic, other than it won't do anything useful.

The CPU cannot know as everyone has said. The MCU vendor buys the processor ip from arm, as well as ip from other vendors as well as creates some of their own if nothing else the glue that holds the modules together. The flash itself is likely from some third party.
Some chip designers wrap around, this is not uncommon in hardware or software, for example the part may have 16Kbytes starting at 0x08000000 this is the CHIP companies decision ARM has little to do with it other than what you have found that they define wide ranges (likely for caching and other options within their domain). 16K is 16384 bytes or 0x4000 so 14 bits of address. There is likely an address decoder that sees some number of upper bits 0x08...and sends that request to the flash logic, then at the flash logic it would not suprise me to see the lower 14 address bits stripped off and used meaning if you were to address 0x08000000 and 0x08008000 you may get the same 0x0000 offset/address in the flash.
Some engineers may choose to look at those upper bits and declare a fault.
You have to examine this on a case by case basis not just an stm32 for example but each family of stm32, for every datasheet basically. (And there is no reason to expect this level of detail is documented by the chip vendor).
The arm cortex-m as with all processors are very very stupid they do what the bits tell them to do it is our responsibility to feed the a sequential trail of working instructions, just like laying track in front of a train you can lay a lot of track in the wrong place, with gaps, etc. If not per the rules of the train then the train will crash or fail in some way.
The others have mentioned the linker script, and to be clear the linker script does not just magically somehow know what chip you have, ultimately you, the programmer are responsible for telling the toolchain to build programs that follow the rules of the cpu AND CHIP, to be successful. So the right architecture instructions (or a subset, cortex-m0 instructions (armv6m will run on a cortex-m4 (armv7m)). And the linker script needs to define addresses for read only and read write areas that match the chip (not the core, the chip as they are in charge of that definition). And then barring 100 other ways you can fail. It will run.
You are ultimately responsible but most folks grab an sdk or sandbox of some sort and hope for the best, blind faith in others. Gnu and llvm tools are fully capable to be used by you directly without these third parties, but then you are fully responsible for getting everything right.

What are the benefits/uses of Device or Strongly-ordered memory type?

My question is regarding different memory types available on an M-4 chip which I am reading about right now. To summarize, there are three different types of memory, i.e. 'normal', 'device' and 'strongly-ordered' that define the sequence (or whether there will be any sequence at all) in which memory system will carry out program instructions (e.g. ldr or str). It seems that 'normal' memory type allows memory system to move instruction execution order around to improve efficiency, provided that program behaviour is unchanged.
The question is - if the behaviour is unchanged and efficiency is improved, what are the practical uses of 'device' and 'strongly-ordered' memory. From my beginner perspective, I understand that there must be a reason for them to exist, but I am yet to have personal experience to link to this topic.

Basically, you use strongly-ordered attribute for memory accesses which have side effects - e.g. FIFOs

Threading and Thread Safety in C

When there is a common set of global data that needs to be shared among several threaded processes, I typically have used a thread token to protect the shared resource:
Edit - 7/22/15 (to incorporate atomics as a viable option, per Jens comments)
My [First] question is, in C, if I write my routines in such a way as to guarantee each thread accesses one, and only one element of an array:
Is there any reason to think that asynchronous and simultaneous access to different indices of the same unprotected array (as shown in diagram) would be a problem?
Second question: Given that an object that can be accessed as
an atomic entity, even in the presence of asynchronous interrupts ( C99 - 7.14 Signal handling ) would using atomics be an effective method for thread protection for an otherwise unprotected variable?
Edit (Clarifications to address questions in comments to this point):
- Specifics for this application:
- Target OS: Windows 7/8/10
- Compiler : C99 compliant (cannot use C11, which include the _Atomic() type specifier )
- H/W : Intel i7 family

This (which looks like a C standard of some sort)
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf sayeth:
NOTE 1 Two threads of execution can update and access separate memory
locations without interfering with each other
NOTE 13 Compiler transformations that introduce assignments to a
potentially shared memory location that would not be modified by the
abstract machine are generally precluded by this standard, since such
an assignment might overwrite another assignment by a different thread
in cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
The way I understand it, this would preclude quamrana's concerns and guarantee you that unprotected writes to separate memory locations should never result in undefined behavior if there is no data race.

In C it will depend on your platform, that is your combination of compiler, processor architecture and operating system.
Your compiler can choose how to use the internal registers and instructions of the cpu to make the executable seem to perform the intent of the program. And C may know nothing about threads. It is usually the job of the operating system to provide a threading library.
There may be processors which might perform the write to an element of your array by reading a much larger patch of memory than just one element, then overwrite just the right bits that forms one element within internal registers and then writing the whole patch back. A single threaded program would work just fine, but two or more threads which interrupt each other could cause chaos in the array.
On the other hand it may work out just fine.
And as has been said, read-only access is always just fine.
Also, google is your friend. It found this stackoverflow question.

If each thread is accessing a different array element, and only the element it is "assigned", this shouldn't be a problem. Both scenarios above are essentially equivalent, since each array element has its own address.

ARM bare-metal with MMU: write to non-cachable,non-bufferable mapped area fail

I am ARM Cortex A9 CPU with 2 cores. But I just use 1 core and the other is just in a busy loop. I setup the MMU table using section (1MB per entry) like this:
0x00000000-0x14ffffff => 0x00000000-0x14ffffff (non-cachable, non-bufferable)
0x15000000-0x24ffffff => 0x15000000-0x24ffffff (cachable, bufferable)
0x25000000-0x94ffffff => 0x25000000-0x94ffffff (non-cachable, non-bufferable)
0x15000000-0x24ffffff => 0x95000000-0xa4ffffff (non-cachable, non-bufferable)
0xa5000000-0xffffffff => 0xa5000000-0xffffffff (non-cachable, non-bufferable)
It is rather simple. I just want to have a mirror of 256MB memory for non-cachable access. However, when I do several write to the the non-cachable memory section at 0x95000000-0xa4ffffff. I find the write is not actually written until I explicitly give a cache flush.
Am I doing something wrong or this kind of mapping is not valid? If that is the case, I don't understand how Linux's ioremap will be working on ARM. It will be good if anyone can give some explanation to me. Thanks very much.

First of all: the Cortex-A9 is an ARMv7-A processor. The terms non-cacheable/non-bufferable/cacheable/bufferable are no longer correct descriptions of the mappings.
The actual mapping type is determined by TEX[2:0], C and B bits.
So I am actually having to guess a bit here as to what your mappings actually are.
And my guess is that you have the majority of your mappings set as Strongly-ordered, and the mirrored region as Normal Write-Back cacheable.
Having multiple virtual mappings with different memory types pointing to the same physical location is generally not a good idea in the ARM architecture. It used to be explicitly banned, but the latest version of the ARMv7-AR Architecture reference manual (DDI 0406C.b) has a (fairly long) section dedicated to the implications of "Mismatched memory attributes".
I would recommend finding a different way of achieving your goal.
Simply changing the mapping of the uncached regions to Normal Non-cacheable would be a good start. There is no valid reason for using Strongly-ordered mappings for RAM.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight