How to debug an aarch64 translation fault?

I am writing a simple kernel in armv8 (aarch64).
MMU config:
48 VA bits (T1SZ=64-48=16)
4K page size
All physical RAM flat mapped into kernel virtual memory (on TTBR1_EL1)
(MMU is active with TTBR0_EL1=0, so I'm only using addresses of the form 0xffff..., i.e. with the top 16 bits set, all flat-mapped to physical memory)
I'm mapping a new address space (starting at 1<<40) to some free physical region. When I try to access address 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):
ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000
Inspecting other registers, I have:
TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003
So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):
ESR (Exception Syndrome Register) decodes as: Exception Class = 0b100101 (data abort without a change in exception level), pages D7-1933 sq.; WnR = 1 (the faulting instruction is a write); DFSC = 0b000100 (translation fault at level 0), page D7-1958;
FAR_EL1 is the faulting address; it indicates TTBR1_EL1 is used (since the high bits are all 1). The top 9 VA bits are 0b000000010, which indicates that entry 2 of the table is used;
Entry 2 in the table indicates a next-level table (low bits 0b11) at physical address 0x82003000.
So, translation fails at level 0, where it should not.
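(For reference, the decode above in code form - a sketch in C using the ESR field positions quoted from the manual and the 4 KB-granule, 48-bit-VA index split; the two constants are just the values from this fault:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t esr = 0x96000044ULL, far = 0xffff010000000000ULL;

    unsigned ec   = (esr >> 26) & 0x3F;  /* 0b100101: data abort, no change in EL */
    unsigned wnr  = (esr >> 6)  & 0x1;   /* 1: the faulting access was a write */
    unsigned dfsc =  esr        & 0x3F;  /* 0b000100: translation fault, level 0 */

    /* 4 KB granule, 48-bit VA: 9 index bits per level, 12-bit page offset. */
    unsigned l0 = (far >> 39) & 0x1FF;   /* 2 for this FAR */
    unsigned l1 = (far >> 30) & 0x1FF;   /* 0 */
    unsigned l2 = (far >> 21) & 0x1FF;   /* 0 */
    unsigned l3 = (far >> 12) & 0x1FF;   /* 0 */

    printf("EC=%#x WnR=%u DFSC=%#x L0=%u L1=%u L2=%u L3=%u\n",
           ec, wnr, dfsc, l0, l1, l2, l3);
    return 0;
}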
My question is: am I doing something wrong? Am I missing some info that could lead to the translation fault? And, more generally, how to debug a translation fault ?
Update:
Everything works when I write to the tables before enabling the MMU.
Whenever I write to the tables AFTER enabling the MMU (through the flat-mapped table region), the mapping never works. I wonder why this happens.
I also tried writing to the selected tables by hand (to remove any side effect from my mapping function): same result (when the writes are done before the MMU is on, it works; after, it fails).
I tried issuing tlbi and dsb sy instructions, followed by isb, with no effect. Only one CPU is running at this point, so caching should not be a problem - the write instructions and the MMU talk to the same caches (but I will test that next).

I had overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and the table walk unit didn't have the same view of memory. The ARMv8 Cortex-A Programming Guide states that the cache has to be cleaned/invalidated to the point of unification (the point where a single core gets a unified view) after modifying translation tables.
Two possibilities can explain this behavior (I don't fully understand how caches work yet):
First possibility: the MMU does not have the required address in its internal walk cache.
In this case, when updating regular data and making it available to other cores' L1 caches, the dsb instruction simply waits for all cores to reach a synchronized state (thanks to the coherency network): other cores will know that the line has to be updated, and when they try to access it, it gets updated in L2 or migrated from the previous core's L1 to their L1.
This does not happen with the MMU (it does not participate in coherency), so it still sees the old value in L2.
However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated well before that), unless all memory is treated as non-cacheable by L1 before the MMU is activated (which is possible - I'll have to double-check that).
A minimal way of fixing the problem might be to change the caching policy for table pages, but the cache maintenance is still necessary to clear possible stale values out of the MMU.
Second possibility: in all cases tested, the MMU already has the faulting address in its internal walk cache, which is not coherent with data L1 or L2.
In that case, only an explicit invalidate can eject the old line from the MMU cache. Before the MMU is turned on, the cache contains nothing and never gets the old value (0), only the new one.
I still think that case is unlikely, because I tested many cases, and sometimes the offset between previously mapped memory (for example, entry 0 in the level 1 table) and newly mapped memory (for example, entry 128 in the same level 1 table) was greater than the cache line size (in this case, 1024 bytes, which is more than any cache line size).
So, I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works.
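For what it's worth, a minimal sketch (GCC-style inline assembly) of the kind of sequence that matches this conclusion. It assumes entry_va is the flat-mapped virtual address of the descriptor being written, and that cleaning to the PoU is enough for your TCR_EL1 walk-cacheability settings; use dc civac (clean+invalidate to the PoC) instead if your table walks are configured non-cacheable:

static inline void publish_pte(volatile unsigned long *entry_va,
                               unsigned long desc, unsigned long mapped_va)
{
    *entry_va = desc;                                         /* write the descriptor */
    asm volatile("dc cvau, %0" :: "r"(entry_va) : "memory");  /* clean the line to PoU */
    asm volatile("dsb ish"          ::: "memory");            /* wait for the clean */
    asm volatile("tlbi vaae1is, %0" :: "r"(mapped_va >> 12)); /* drop stale TLB/walk-cache entries */
    asm volatile("dsb ish"          ::: "memory");            /* wait for the TLBI */
    asm volatile("isb"              ::: "memory");
}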

Related

Why is memory barrier not required for UP? [duplicate]

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out-of-order execution, it only does this when it doesn't change the result. If it were allowed to do that, virtually every sequence of instructions could fail. It would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. on a single-core machine or with green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure the store is globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But OSes for some non-x86 ISAs, which had cache-control instructions designed in from the start, do need to be aware of this: i.e. to make sure the cache is invalidated before reading in new contents from disk, and at least written back to somewhere DMA can read from before asking a device to read from a page.)
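As a concrete driver-style sketch of that pattern (the descriptor layout, register offset and function name here are made up for illustration; wmb() and writel() are the usual Linux primitives): publish a DMA descriptor in memory, then do the MMIO write that tells the device to go read it.

#include <linux/types.h>
#include <linux/io.h>         /* writel() */
#include <asm/barrier.h>      /* wmb() */

struct dma_desc { u64 addr; u32 len; u32 flags; };  /* hypothetical layout */
#define DOORBELL_REG 0x40                           /* hypothetical register offset */

static void kick_dma(struct dma_desc *desc, void __iomem *regs,
                     dma_addr_t buf, u32 len)
{
    desc->addr  = buf;
    desc->len   = len;
    desc->flags = 1;                 /* hypothetical "ready" bit */
    wmb();                           /* descriptor writes reach memory before the doorbell */
    writel(1, regs + DOORBELL_REG);  /* MMIO write that starts the transfer */
}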
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
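A sketch of that StoreLoad case with C11 atomics (illustrative, not part of the question):

#include <stdatomic.h>

atomic_int x, y;                  /* both start at 0 */
int r1, r2;

void cpu0(void) {
    atomic_store_explicit(&x, 1, memory_order_release);
    r1 = atomic_load_explicit(&y, memory_order_acquire);
}

void cpu1(void) {
    atomic_store_explicit(&y, 1, memory_order_release);
    r2 = atomic_load_explicit(&x, memory_order_acquire);
}

/* Even on x86, r1 == 0 && r2 == 0 is a legal outcome when cpu0 and cpu1 run on
 * different cores: each store can sit in the store buffer until after the other
 * core's load. Making the operations memory_order_seq_cst (an mfence or a locked
 * instruction on x86) forbids that outcome. */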
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer for older stores to the location being loaded. This makes sure loads take their data from the last program-order write to any given memory location (footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the addresses of older stores aren't available yet, a CPU may even speculate that they will be to different addresses and load from cache instead of waiting for the store-address part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

PoU and PoC in cache maintenance operations in arm

While reading the ARM Architecture Reference Manual (ARMv7), I found two concepts: point of coherency (PoC) and point of unification (PoU).
For PoC, it looks like the point that all agents (i.e., CPU cores) can see the same copy of memory.
For PoU, it looks like the point that all agents (in this case, CPU cores and MMU) can see the same copy of memory.
I have several follow up questions:
Is my understanding correct?
If so, If I issue DCCMVAC (Data cache clean MVA to PoC) with giving MVA to 0x40000000, (and let say PoC happen to be 0x70000000),
all cache entries between VA of 0x40000000 and 0x70000000 are cleaned?
Then, if I issue DCCMVAC with MVA 0x0, all data cache entries are cleaned?
PoU sounds like the MMU itself has its own data cache (not the TLB) for page table walks in main memory. Is this correct?
According to ARM training materials:
The PoU (Point of Unification) for a processor is the point (physical location within the hardware) where the instruction and data caches and the translation table walks of the processor are guaranteed to see the same copy of a memory location. For example, a unified level 2 cache would be the point of unification in a system with Harvard level 1 caches and a TLB (to cache page table entries). If no external cache is present, main memory would be the Point of unification.
The PoC (Point of [system] Coherency) is the point at which all blocks (for example, CPUs, DSPs, or DMA engines) which can access memory, are guaranteed for a particular address to see the same copy of a memory location. Typically, this will be the main external system memory.
This is an old question, but I'm adding some comments in case someone searches for it.
In my opinion, PoU and PoC were coined by ARM to define levels for cache maintenance. The definitions of PoC and PoU are in the ARM ARM specification, while the ARMv8 programming guide (not the ARM spec) gives some diagrams for better understanding: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html
One point: in some ARMv8 processor implementations the I-side can snoop the D-side - for example, on an I-cache miss it will check the D-cache - so you could treat the PoU as being at the L1 cache level, while other ARMv8 processors may not have this behaviour.
Back to the original questions:
2) DCCMVAC 0x40000000 does a cache clean to the PoC for that address only - generally one cache line.
The PoC is defined by the SoC implementation; it is not an address.
3) As in Q2, DCCMVAC 0x0 only applies to one cache line.
If you want to clean and invalidate the whole cache, you need to use the by set/way operations to walk through the whole cache.
4) No, the PoU has nothing to do with the MMU having its own data cache.
The MMU hardware block owns some buffers to hold TLB entries, which is common practice. As for the page table, it is built by software in memory and is normally mapped as Normal memory, so it can end up in the cache while being set up by CPU instructions or walked by the MMU's hardware walk unit.
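To make the by-VA point above concrete, here is a sketch (AArch64, GCC inline assembly) of cleaning and invalidating a whole VA range to the PoC one line at a time; a real implementation should read the line size from CTR_EL0 instead of assuming 64 bytes:

static void dcache_clean_inval_poc(unsigned long start, unsigned long end)
{
    const unsigned long line = 64;   /* assumed line size; read CTR_EL0 in real code */
    for (unsigned long va = start & ~(line - 1); va < end; va += line)
        asm volatile("dc civac, %0" :: "r"(va) : "memory");  /* clean+invalidate to PoC */
    asm volatile("dsb sy" ::: "memory");                     /* wait for completion */
}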

Does the virtual memory area struct only come into the picture when there is a page fault?

Virtual memory is quite a complex topic for me, and I am trying to understand it. Here is my understanding for a 32-bit system where, for example, the RAM is just 2 GB. I have tried reading many links and I am not confident at the moment, so I would like you to help clear up my concepts. Please confirm my points, and answer wherever you feel something is wrong. I also have a section of points I am confused about. So, here starts the summary.
Every process thinks it is the only one running. It can access 4 GB of memory - its virtual address space.
When a process accesses a virtual address, it is translated to a physical address via the MMU.
The MMU is part of the CPU - it is hardware.
When the MMU cannot translate the address to a physical one, it raises a page fault.
On a page fault, the kernel is notified. The kernel checks the VM area structs; if it finds the address (the data may be on disk), it does some page-in/page-out and gets this memory into RAM.
The MMU then tries again and succeeds this time.
If the kernel cannot find the address, it raises a signal; for example, an invalid access raises SIGSEGV.
Confused points.
Is the page table maintained in the kernel? Does the VM area struct contain a page table?
How can the MMU fail to find the address in physical RAM? Let's say it translates to some wrong address in RAM: the code would still execute, just with a bad address. How does the MMU ensure that it is reading the right data? Does it consult the kernel's VM areas every time?
Is the mapping table (virtual to physical) inside the MMU? I have read that it is maintained by each individual process. If it is inside a process, why can't I see it?
Or, if it is in the MMU, how does the MMU generate the address - is it segment + 12-bit shift -> page frame number, and then adding the offset (the low-order bits) -> physical address?
Does that mean that, for a 32-bit architecture, I can determine the physical address from a virtual address with this calculation?
cat /proc/pid_value/maps shows me the current VM area mappings; basically it reads the vm_area structs and prints them, so they must be important, but I am not able to fit this piece into the complete picture. Is the vm_area struct generated when the program is executed? Does the VM area only come into the picture when the MMU cannot translate an address, i.e. on a page fault? When I print it, it displays the address range, permissions, the mapped file, and an offset; I am sure the file is the one on the hard disk and the offset is into that file.
The high-mem concept is that the kernel cannot directly access memory regions beyond about 1 GB, so it needs page table entries to map them indirectly: it temporarily sets up a mapping to reach such an address. Does HIGHMEM come into the picture every time? User space can translate its addresses directly via the MMU. In what scenario does the kernel really want to access high memory? I believe kernel drivers mostly use kmalloc, which gives a direct (base + offset) address, so no extra mapping is really required there. So the question is: in what scenario does a kernel need to access high memory?
Does the processor have to come with MMU support? Can processors without MMU support not run Linux?
Is the page table maintained in the kernel? Does the VM area struct contain a page table?
Yes. Not exactly: each process has a mm_struct, which contains a list of vm_area_struct's (which represent abstract, processor-independent memory regions, aka mappings), and a field called pgd, which is a pointer to the processor-specific page table (which contains the current state of each page: valid, readable, writable, dirty, ...).
The page table doesn't need to be complete, the OS can generate each part of it from the VMAs.
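Roughly, the relationship looks like this (heavily simplified from the real definitions in the kernel's <linux/mm_types.h>; the field set here is only illustrative):

struct file;                         /* opaque here */
typedef unsigned long pgd_t;         /* stand-in for the arch-specific type */

struct vm_area_struct {
    unsigned long vm_start, vm_end;  /* VA range covered by this mapping */
    unsigned long vm_flags;          /* permissions: read/write/exec, ... */
    struct file  *vm_file;           /* backing file, NULL if anonymous */
    struct vm_area_struct *vm_next;  /* next mapping in this process */
};

struct mm_struct {
    struct vm_area_struct *mmap;     /* the list of VMAs (abstract mappings) */
    pgd_t *pgd;                      /* root of the processor-specific page table */
};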
How can the MMU fail to find the address in physical RAM? Let's say it translates to some wrong address in RAM: the code would still execute, just with a bad address. How does the MMU ensure that it is reading the right data? Does it consult the kernel's VM areas every time?
The translation fails, e.g. because the page was marked as invalid, or a write access was attempted against a readonly page.
Is the mapping table (virtual to physical) inside the MMU? I have read that it is maintained by each individual process. If it is inside a process, why can't I see it?
Or, if it is in the MMU, how does the MMU generate the address - is it segment + 12-bit shift -> page frame number, and then adding the offset (the low-order bits) -> physical address?
Does that mean that, for a 32-bit architecture, I can determine the physical address from a virtual address with this calculation?
There are two kinds of MMUs in common use. One of them only has a TLB (Translation Lookaside Buffer), which is a cache of the page table. When the TLB doesn't have a translation for an attempted access, a TLB miss is generated, the OS does a page table walk, and puts the translation in the TLB.
The other kind of MMU does the page table walk in hardware.
In any case, the OS maintains a page table per process; this maps virtual page numbers to physical frame numbers. This mapping can change at any moment: when a page is paged in, the physical frame it is mapped to depends on the availability of free memory.
cat /proc/pid_value/maps shows me the current VM area mappings; basically it reads the vm_area structs and prints them, so they must be important, but I am not able to fit this piece into the complete picture. Is the vm_area struct generated when the program is executed? Does the VM area only come into the picture when the MMU cannot translate an address, i.e. on a page fault? When I print it, it displays the address range, permissions, the mapped file, and an offset; I am sure the file is the one on the hard disk and the offset is into that file.
To a first approximation, yes. Beyond that, there are many reasons why the kernel may decide to fiddle with a process' memory, e.g: if there is memory pressure it may decide to page out some rarely used pages from some random process. User space can also manipulate the mappings via mmap(), execve() and other system calls.
The high-mem concept is that the kernel cannot directly access memory regions beyond about 1 GB, so it needs page table entries to map them indirectly: it temporarily sets up a mapping to reach such an address. Does HIGHMEM come into the picture every time? User space can translate its addresses directly via the MMU. In what scenario does the kernel really want to access high memory? I believe kernel drivers mostly use kmalloc, which gives a direct (base + offset) address, so no extra mapping is really required there. So the question is: in what scenario does a kernel need to access high memory?
Totally unrelated to the other questions. In summary, high memory is a hack to be able to access lots of memory on a computer with a limited address space.
Basically, the kernel has a limited address space reserved to it (on x86, a typical user/kernel split is 3 GB/1 GB [processes can run in user space or kernel space; a process runs in kernel space when a syscall is invoked. To avoid having to switch the page table on every context switch, on x86 the address space is typically split between user space and kernel space]). So the kernel can directly access up to ~1 GB of memory. To access more physical memory, there is some indirection involved, which is what high memory is all about.
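For instance, when the kernel needs to touch a page that may live in high memory, it has to map it temporarily first; a sketch using the kernel's kmap()/kunmap() API (the function itself is made up):

#include <linux/highmem.h>   /* kmap(), kunmap() */
#include <linux/string.h>    /* memcpy() */

static void copy_from_possibly_high_page(struct page *page, void *dst, size_t len)
{
    void *vaddr = kmap(page);   /* create a temporary kernel mapping for the page */
    memcpy(dst, vaddr, len);
    kunmap(page);               /* tear the temporary mapping down */
}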
Does the processor have to come with MMU support? Can processors without MMU support not run Linux?
Laptop/desktop processors come with an MMU; x86 has supported paging since the 386.
Linux, specifically the variant called µClinux, supports processors without MMUs (!MMU). Many embedded systems (ADSL routers, ...) use processors without an MMU. There are some important restrictions, among them:
Some syscalls don't work at all, e.g. fork().
Some syscalls work with restrictions and non-POSIX-conforming behavior, e.g. mmap().
The executable file format is different, e.g. bFLT or ELF-FDPIC instead of ELF.
The stack cannot grow, and its size has to be set at link-time.
When a program is loaded, does the kernel first set up the VM areas for that process? Do these VM areas actually record where the program sections are in memory/on the HDD? Then the whole story of updating the CR3 register and the page table walk or TLB comes into the picture, right? So, whenever there is a page fault, the kernel updates the page table by looking at the VM areas? But you say the VM areas keep being updated - cat /proc/pid_value/maps keeps changing, so the map is not constant from start to end. So the real information is in the VM area structs? That is, the actual information about where each section of the program lies, whether on the HDD or in physical RAM? And this is filled in during process loading, as the first job? The kernel does page-in/page-out on page faults and updates the VM areas accordingly? Then it should also know where the entire program is on the HDD for page-in/page-out, right? Please correct me here. This is in continuation of my first question in the previous comment.
When the kernel loads a program, it will setup several VMAs (mappings), according to the segments in the executable file (which on ELF files you can see with readelf --segments), which will be text/code segment, data segment, etc... During the lifetime of the program, additional mappings may be created by the dynamic/runtime linkers, by the memory allocator (malloc(), which may also extend the data segment via brk()), or directly by the program via mmap(),shm_open(), etc..
The VMAs contain the necessary information to generate the page table, e.g. they tell whether that memory is backed by a file or by swap (anonymous memory). So, yes, the kernel will update the page table by looking at the VMAs. The kernel will page in memory in response to page faults, and will page out memory in response to memory pressure.
Using x86 no PAE as an example:
On x86 with no PAE, a linear address can be split into 3 parts: the top 10 bits point to an entry in the page directory, and the middle 10 bits point to an entry in the page table pointed to by that page directory entry. The page table entry may contain a valid physical frame number: the top 20 bits of a physical address. The bottom 12 bits of the virtual address are an offset into the page that goes untranslated into the physical address.
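In code, that split looks like this (a sketch):

#include <stdint.h>

/* 32-bit x86, no PAE: 10 + 10 + 12 bit split of a linear address. */
static inline uint32_t pde_index(uint32_t va)   { return va >> 22; }           /* top 10 bits */
static inline uint32_t pte_index(uint32_t va)   { return (va >> 12) & 0x3FF; } /* middle 10 bits */
static inline uint32_t page_offset(uint32_t va) { return va & 0xFFF; }         /* low 12 bits */

/* physical address = (frame number from the PTE << 12) | page_offset(va) */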
Each time the kernel schedules a different process, the CR3 register is written with a pointer to the page directory for the current process. Then, on each memory access, the MMU tries to find a translation cached in the TLB; if it doesn't find one, it looks for one by doing a page table walk starting from CR3. If it still doesn't find one, a page fault is raised, the CPU switches to Ring 0 (kernel mode), and the kernel tries to find one in the VMAs.
Also, I believe this reading from CR3 -> page directory -> page table -> page frame number -> memory address is all done by the MMU. Am I correct?
On x86, yes, the MMU does the page table walk. On other systems (e.g: MIPS), the MMU is little more than the TLB, and on TLB miss exceptions the kernel does the page table walk by software.
Though this is not going to be the best answer, I would like to share my thoughts on the confused points.
1. Is the page table maintained...
Yes, the kernel maintains the page tables. In fact it maintains nested (multi-level) page tables, and the top of the page tables is stored in top_pmd (pmd, I suppose, stands for page mapping directory). You can traverse all the page tables using this structure.
2. How can the MMU fail to find the address in physical RAM...
I am not sure I understood the question, but if, because of some problem, an instruction faults or something outside the instruction area is accessed, you generally get an undefined-instruction exception resulting in an abort. If you look at the crash dumps, you can see it in the kernel log.
3. Is the mapping table (virtual to physical) inside the MMU...
Yes. The MMU is SW + HW; the HW part is things like the TLB, and the mappings are cached there. For instructions, that is for the code section, I have always converted between physical and virtual addresses and they always matched, and almost all the time they match for the data sections as well.
4. cat /proc/pid_value/maps shows me the current VM area mappings...
This is used more for analyzing the virtual addresses of user-space stacks. As you know, practically every user-space program can have 4 GB of virtual addresses, so, unlike with the kernel, if I give you an address like 0xc0100234 you cannot directly point at the instruction; you need this mapping plus the virtual address to locate the instruction from the data you have.
5. The high-mem concept is that the kernel cannot directly access memory...
High-mem corresponds to user-space memory (someone correct me if I am wrong). When the kernel wants to read some data from an address in user space, it will be accessing HIGHMEM.
6. Does the processor have to come with MMU support? Can processors without MMU support not run Linux?
The MMU, as I mentioned, is HW + SW, so mostly it comes with the chipset, and the SW part is generally architecture dependent. You can disable the MMU in the kernel config and build it that way, though I have never tried it. These days most chipsets have one, but on small boards I think they disable the MMU. I am not entirely sure, though.
As all these are conceptual questions, I may be lacking some knowledge and be wrong in places. If so, others please correct me.

ARM11 Translation Lookaside Buffer (TLB) usage?

Is there a decent guide explaining how to use the TLB (Translation Lookaside Buffer) on an ARM1176JZF-S core?
Having looked over the technical documentation for that ARM platform, I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
Apart from that, I have absolutely no clue on how to use them.
What structure does a TLB entry have? How do I create new entries?
How do I handle VM in context switches for user-space threads? How do I ensure that those threads can only access specific pages assigned to their parent processes (enforce memory protection)? Do I save the TLB state for each context?
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Thank you in advance. I'll be really glad if someone provides an explanation of what TLBs are. I'm currently working on a memory mapper for my kernel, and I've pretty much hit a dead end.
The technical reference manual for ARM1176JZF-S appears to be DDI 0301. That document contains all the specific details for that specific ARM core.
I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
A TLB is a cache of the page table. Some processors allow direct access to the TLB while knowing nothing about page tables (e.g. MIPS), while others know about page tables and internally use TLBs that the programmer mostly doesn't see (e.g. x86). In this case, the TLB is managed by hardware, and the system programmer only has to make the TTB (Translation Table Base) registers point to the page table and invalidate the TLB in the appropriate places.
What structure does a TLB entry have? How do I create new entries?
Done by hardware. On a TLB miss, the MMU walks the page table and fills the TLB from there.
How do I handle VM in context switches for user-space threads?
Some platforms have TLBs that simply map virtual addresses to physical addresses (e.g. x86). On these platforms, you have to do a full TLB flush on each context switch. Other platforms (MIPS, this specific ARM core) map (ASID, virtual address) pairs to physical addresses. An ASID is an Address Space Identifier, i.e. an identifier for a process's address space. The MMU uses a register to know which ASID to use (I think it's the Context ID register in this case). Since there may be more processes than ASIDs, occasionally you may need to recycle an ASID (assigning it to a different process) and do a TLB flush (that's what the Invalidate TLB by ASID operation is for).
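A conceptual sketch of that ASID scheme (this is not the real ARM11 sequence; the structure, counter, and flush function are made up to show the recycling idea):

#define NUM_ASIDS 256                    /* ARM ASIDs are 8 bits wide */

struct task { unsigned asid, asid_generation; };

extern void tlb_flush_all(void);         /* or invalidate-by-ASID for each victim */

static unsigned next_asid = 1, generation = 1;

static unsigned asid_for(struct task *t)
{
    if (t->asid_generation != generation) {  /* no ASID yet, or a stale one */
        if (next_asid == NUM_ASIDS) {        /* ran out: recycle the whole ASID space */
            tlb_flush_all();
            next_asid = 1;
            generation++;
        }
        t->asid = next_asid++;
        t->asid_generation = generation;
    }
    return t->asid;                          /* program this into the Context ID register */
}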
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
This is exactly for the same reason you have small separate level-1 caches for instructions and data. Since they are caches, you don't need more than 10 (though having more could improve performance).
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Some memory pages (e.g: some portions of the kernel) are accessed very often. It makes sense to lock them, so they don't get thrown off of the TLB. Also, on realtime systems, a TLB miss or a cache miss may introduce some unwanted unpredictability. So, there is an option to lock a number of TLB entries. The main TLB has more entries, but only those 8 are lockable.

Out of Order Execution and Memory Fences

I know that modern CPUs can execute out of order; however, they always retire the results in order, as described by Wikipedia:
"Out-of-order processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."
Now, memory fences are said to be required when using multicore platforms, because owing to out-of-order execution, the wrong value of x can be printed here.
Processor #1:
while f == 0
;
print x; // x might not be 42 here
Processor #2:
x = 42;
// Memory fence required here
f = 1
Now my question is: since out-of-order processors (cores, in the case of multicore processors, I assume) always retire the results in order, what is the need for memory fences? Do the cores of a multicore processor see only results that other cores have retired, or do they also see results that are in flight?
I mean, in the example I gave above, when Processor #2 eventually retires the results, the result of x should come before f, right? I know that during out-of-order execution it might have modified f before x, but it must not have retired it before x, right?
Now, with in-order retirement of results and the cache coherence mechanism in place, why would you ever need memory fences on x86?
This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are needed to handle read-after-write consistency. This is due to something called the "store buffer".
That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence
store x
load y
then on the processor bus this may be seen as
load y
store x
The reason for this behavior is the afore-mentioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".
See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf
The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up to date view of the data.
If you don't put a memory fence, the cores might be working with stale data. This can be seen especially in scenarios where multiple cores are working on the same data sets. In that case you can ensure that, when CPU 0 has done some action, all changes made to the data set are visible to all other cores, which can then work with up-to-date information.
Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions, including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence.
If a core were to start working with outdated data from the data set, how could it ever get the correct results? It couldn't, no matter whether the end result is presented as if everything had been done in the right order.
The key is in the store buffer, which sits between the cache and the CPU, and does this:
Store buffer invisible to remote CPUs
Store buffer allows writes to memory and/or caches to be saved to optimize interconnect accesses
That means that things will be written to this buffer, and at some point the buffer will be written to the cache. So the cache could contain a view of the data that is not the most recent, and therefore another CPU, through cache coherency, will also not see the latest data. A store-buffer flush is necessary for the latest data to be visible; this, I think, is essentially what the memory fence causes to happen at the hardware level.
EDIT:
For the code you used as an example, Wikipedia says this:
A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.
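In C11 terms, the barrier placement the quote describes looks like this (a sketch; the release store to f cannot become visible before the plain store to x, and the acquire load on the reader side pairs with it):

#include <stdatomic.h>
#include <stdio.h>

int x;
atomic_int f;

void processor2(void) {                  /* the writer */
    x = 42;
    atomic_store_explicit(&f, 1, memory_order_release);  /* "barrier" before f = 1 */
}

void processor1(void) {                  /* the reader */
    while (atomic_load_explicit(&f, memory_order_acquire) == 0)
        ;                                /* spin until f == 1 */
    printf("%d\n", x);                   /* now guaranteed to print 42 */
}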
Just to make explicit what is implicit in the previous answers, this is correct, but is distinct from memory accesses:
CPUs can execute out of order, However they always retire the results in-order
Retirement of the instruction is separate from performing the memory access, the memory access may complete at a different time to instruction retirement.
Each core will act as if its own memory accesses occur at retirement, but other cores may see those accesses at different times.
(On x86 and ARM, I think only stores are observably subject to this, but e.g. Alpha may load an old value from memory. x86 SSE2 has instructions with weaker guarantees than normal x86 behaviour.)
PS. From memory, the abandoned SPARC Rock could in fact retire out of order; it spent power and transistors determining when this was harmless. It was abandoned because of power consumption and transistor count... I don't believe any general-purpose CPU has been brought to market with out-of-order retirement.

Resources