Out of Order Execution and Memory Fences - c

I know that modern CPUs can execute out of order; however, they always retire the results in-order, as described by Wikipedia:
"Out of Order processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."
Now, memory fences are said to be required on multicore platforms because, owing to out-of-order execution, a wrong value of x can be printed here:
Processor #1:
while (f == 0)
    ;
print x; // x might not be 42 here
Processor #2:
x = 42;
// Memory fence required here
f = 1;
Now my question is: since out-of-order processors (cores, in the case of multicore processors, I assume) always retire their results in-order, what is the need for memory fences? Do the cores of a multicore processor see only results that other cores have retired, or do they also see results that are still in flight?
I mean, in the example above, when Processor #2 eventually retires its results, the result of x should come before f, right? I know that during out-of-order execution it might have modified f before x, but it must not have retired f before x, right?
So with in-order retirement of results and a cache coherence mechanism in place, why would you ever need memory fences on x86?
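For reference, here is the same pattern written with C11 atomics and the fences made explicit; this is a minimal sketch, assuming x and f are shared globals and the two functions run on different cores:
#include <stdatomic.h>
#include <stdio.h>

int x = 0;
atomic_int f = 0;

void processor2(void) {                                  // producer
    x = 42;
    atomic_thread_fence(memory_order_release);           // the "memory fence required here"
    atomic_store_explicit(&f, 1, memory_order_relaxed);
}

void processor1(void) {                                  // consumer
    while (atomic_load_explicit(&f, memory_order_relaxed) == 0)
        ;
    atomic_thread_fence(memory_order_acquire);
    printf("%d\n", x);                                   // with both fences in place, x is 42 here
}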

This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are needed to handle read-after-write consistency. This is due to something called the "store buffer".
That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence
store x
load y
then on the processor bus this may be seen as
load y
store x
The reason for this behavior is the aforementioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".
See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf
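As a concrete illustration of that store/load pair, here is the classic two-thread litmus test written with C11 relaxed atomics; a sketch only, with the thread-spawning boilerplate omitted:
#include <stdatomic.h>

atomic_int x = 0, y = 0;
int r1, r2;

void cpu0(void) {                                        // store x, then load y
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
}

void cpu1(void) {                                        // store y, then load x
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
}

// When cpu0() and cpu1() run on different cores, r1 == 0 && r2 == 0 is a
// permitted outcome on x86: each load may pass the earlier store while that
// store still sits in the store buffer. A full fence (or seq_cst accesses)
// between the store and the load rules it out.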

The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up-to-date view of the data.
If you don't put a memory fence, the cores might be working with wrong data; this can be seen especially in scenarios where multiple cores are working on the same datasets. In this case you can ensure that when CPU 0 has done some action, all changes done to the dataset are now visible to all other cores, which can then work with up-to-date information.
Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions, including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence.
If a core were to start working with outdated data from the dataset, how could it ever get the correct results? It couldn't, no matter whether the end result is presented as if everything was done in the right order.
The key is in the store buffer, which sits between the cache and the CPU, and does this:
Store buffer invisible to remote CPUs
Store buffer allows writes to memory and/or caches to be saved to optimize interconnect accesses
That means that things will be written to this buffer, and then at some point the buffer will be written to the cache. So the cache could contain a view of data that is not the most recent, and therefore another CPU, through cache coherency, will also not have the latest data. A store buffer flush is necessary for the latest data to be visible; this, I think, is essentially what the memory fence causes to happen at the hardware level.
EDIT:
For the code you used as an example, Wikipedia says this:
A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.
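In x86 terms, that barrier can be spelled with the SSE2 mfence intrinsic; a minimal sketch, assuming GCC or Clang, with volatile used only to keep the compiler itself from reordering the two stores:
#include <emmintrin.h>   // _mm_mfence

volatile int x, f;

void processor2(void) {
    x = 42;
    _mm_mfence();        // drain the store buffer: x is globally visible before f
    f = 1;
}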

Just to make explicit what is implicit in the previous answers: the following is correct, but it is about instruction retirement, which is distinct from the memory access itself:
CPUs can execute out of order, however they always retire the results in-order
Retirement of the instruction is separate from performing the memory access; the memory access may complete at a different time from instruction retirement.
Each core will act as if its own memory accesses occur at retirement, but other cores may see those accesses at different times.
(On x86 and ARM, I think only stores are observably subject to this, but e.g. Alpha may load an old value from memory. x86 SSE2 has instructions with weaker guarantees than normal x86 behaviour.)
PS. From memory, the abandoned Sparc ROCK could in fact retire out-of-order; it spent power and transistors determining when this was harmless. It got abandoned because of power consumption and transistor count... I don't believe any general-purpose CPU has been brought to market with out-of-order retirement.

Related

Why is memory barrier not required for UP? [duplicate]

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out of order execution, it only does this when it doesn't change the result. If it was allowed to do that, virtually every sequence of instructions would possibly fail. It would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. a single-core machine or green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer; they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 systems, with cache-control instructions designed in from the start, do require the OS to be aware of this: i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure data is at least written back somewhere DMA can read from before asking a device to read from a page.)
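A rough sketch of the MMIO case mentioned above; the register offsets are made up, and on x86 an uncacheable mapping is already strongly ordered, so the explicit fence mostly matters on weaker architectures or with write-combining mappings:
#include <stdint.h>

#define DEV_DATA   0   // hypothetical device register indices
#define DEV_STATUS 1

uint32_t kick_device(volatile uint32_t *mmio) {
    mmio[DEV_DATA] = 0x1234;                  // MMIO store
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  // e.g. mfence on x86: make the store visible first
    return mmio[DEV_STATUS];                  // MMIO load from another register
}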
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
so what makes sure the program order is preserved?
Hardware dependency tracking; loads snoop the store buffer to check for earlier stores to the location being loaded. This makes sure loads take data from the last program-order write to any given memory location (footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

Is clflush or clflushopt atomic when system crash?

Commonly, a cache line is 64 B, but the atomicity of non-volatile memory is 8 B.
For example:
x[1]=100;
x[2]=100;
clflush(x);
x is cacheline aligned, and is initially set to 0.
The system crashes during clflush().
Is it possible that x[1]==0 and x[2]==100 after reboot?
Under the following assumptions:
I assume that the code you've shown represents a sequence of x86 assembly instructions rather than actual C code that is yet to be compiled.
I also assume that the code is being executed on a Cascade Lake processor and not on a later generation of Intel processors (I think CPL or ICX with Barlow Pass support eADR, meaning that explicit flushing is not required for persistence because the caches are in the persistence domain). This answer also applies to existing AMD+NVDIMM platforms.
The global observability order of stores may differ from the persist order on Intel x86 processors. This is referred to as relaxed persistency. The only case in which the order is guaranteed to be the same is for a sequence of stores of type WB to the same cache line (but a store reaching GO doesn't necessarily mean it has become durable). This is because CLFLUSH is atomic and WB stores cannot be reordered in global observability. See: On x86-64, is the "movnti" or "movntdq" instruction atomic when system crash?.
If the two stores cross a cache line boundary, or if the effective memory type of the stores is WC, not even that guarantee holds.
The x86-TSO memory model doesn't allow reordering stores, so it's impossible for another agent to observe x[2] == 100 and x[1] != 100 during normal operation (i.e., in the volatile state, without a crash). However, if the system crashed and rebooted, it's possible for the persistent state to be x[2] == 100 and x[1] != 100. This is possible even if the system crashed after retiring clflush, because the retirement of clflush doesn't necessarily mean that the flushed line has reached the persistence domain.
If you want to eliminate that possibility, you can either move clflush as follows:
x[1]=100;
clflush(x);
x[2]=100;
clflush on Intel processors is ordered with respect to all writes, meaning that the line is guaranteed to reach the persistence domain before any later stores become globally observable. See: Persistent Memory Programming Primary (PDF) and the Intel SDM V2. The second store could be to the same line or any other line.
If you want x[1]=100 to become persistent before x[2]=100 becomes globally observable, add sfence after clflush on Intel CSX, or mfence on AMD processors (clflush is only ordered by mfence on AMD processors); clflush by itself is sufficient to control persist order.
Alternatively, use the sequence clflushopt+sfence (or clwb+sfence) as follows:
x[1]=100;
clflushopt(x);
sfence;
x[2]=100;
In this case, if a crash happened and x[2] == 100 in the persistent state, then it's guaranteed that x[1] == 100. clflushopt by itself doesn't impose any persist ordering.
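With the usual intrinsics, that sequence looks roughly like the following; a sketch assuming GCC or Clang with -mclflushopt, where x stands for the cache-line-aligned array from the question:
#include <immintrin.h>   // _mm_clflushopt, _mm_sfence
#include <stdint.h>

void persist_x1_before_x2(volatile uint8_t *x) {
    x[1] = 100;
    _mm_clflushopt((const void *)&x[1]);   // start flushing the line
    _mm_sfence();                          // flush ordered before the next store
    x[2] = 100;
}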
(Also see @Hadi's answer: x86 TSO store ordering does not guarantee persistence ordering even within one line. This answer doesn't try to address that. My best guess based on Hadi's answer is that a single atomic store to one 32-byte half of a cache line will persist atomically, but that's based on how current HW works, transferring lines in 2 32-byte halves between cores, caches, and memory controllers. If this really matters, look for docs or ask Intel.)
Remember that store data can propagate out of cache (into DRAM or NVDIMM) on its own, before explicit flushing.
The following sequence of events is possible:
x[2]=100; stores the 3rd byte of the cache line first. (Compile-time reordering: this is a C question, not an asm question, and x is apparently plain uint8_t x[64], not _Atomic or volatile, so x[1]=100; and x[2]=100; aren't guaranteed to happen in that order in the asm.)
An interrupt arrives; at some point the cache line containing x[] is evicted all the way out of cache, into the persistence domain. (Perhaps after a context-switch to another thread, so lots of other code runs between those two asm stores).
The system crashes before execution resumes. (Or before x[1]=100; finishes becoming durable.)
If you want to depend on x86 memory ordering rules to control durability order within a cache line, you need to make sure C respects that. volatile would work, or _Atomic with memory_order_release for at least the 2nd store. (Or better, get them done as a single store if it's within an aligned 8-byte chunk.) (x86 asm memory model = program order with a store buffer; no StoreStore reordering.)
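A minimal sketch of the volatile variant, assuming x is the plain byte array from the question; volatile forces the compiler to emit the two stores in program order and not merge them:
#include <stdint.h>
#include <immintrin.h>   // _mm_clflush

volatile uint8_t x[64] __attribute__((aligned(64)));

void store_then_flush(void) {
    x[1] = 100;                     // emitted first, in program order
    x[2] = 100;
    _mm_clflush((const void *)x);   // the clflush(x) from the question
}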
Compile-time reordering doesn't usually happen for no reason (but it can); more often it happens because surrounding code makes it appealing to do so. (And of course the persisted state x[1]==100 / x[2]==0 can happen by this mechanism without any compile-time reordering, if they're 2 separate stores.)
I think a necessary pre-condition for atomicity of durability is being done as a single atomic store. e.g. guaranteed atomic by the ISA, or with a single wider SIMD store (footnote 1), because Intel CPUs in practice don't split those up (but there's no on-paper guarantee of that). Being atomic wrt. interrupts (i.e. a single instruction) but not a single store uop makes it harder to split up but still completely possible (footnote 2), and thus not guaranteed safe. e.g. a 10-byte x87 fstp tbyte that involves 2 separate store-data uops can be split up by an invalidation from another core, which is possible even without false sharing. (See footnote 2 again.)
Without any on-paper atomicity guarantee for 16-byte or wider SIMD stores, you'd be depending on implementation details for SIMD stores or misaligned stores to not be split up.
Even ISA-guaranteed atomicity isn't sufficient, though: a lock cmpxchg that spans a cache-line boundary is still guaranteed atomic wrt. other cores and DMA readers. (Supporting this is very very slow, don't do it.) But there's no way to guarantee those two lines become durable at the same time. But outside of that special case for atomicity, IDK, I can't rule out whole-line atomicity. It's certainly plausible that a plain-store into a single line which is atomic in asm will become durable atomically, with no chance of tearing.
Within a single cache line, I don't know.
I'd guess that an atomic store within an 8-byte-aligned block will make it to persistence atomically or not at all, but I haven't checked Intel's docs. (And in practice perhaps even a whole 64-byte line, which you could store with AVX512). The point of this answer is that you don't even have a single atomic store so there are lots of other mechanisms for breaking your test-case.
Footnote 1: Modern Intel CPUs commit SIMD stores to L1d cache as a single transaction as long as they don't span a cache line. Intel hasn't made a CPU that splits SIMD stores into 2 halves since Sandy/Ivy Bridge which had full-width 256-bit AVX execution units but only 128-bit wide paths to/from cache in the load units and AFAIK in the store-buffer-commit stuff. (The store-data execution unit also took 2 cycles to write 32-byte store data to the store buffer).
Footnote 2: For separate store uops that are part of the same instruction like in fstp tbyte [rdi], this might be possible:
first part commits from the store buffer to L1d cache
an RFO or share-request arrives and is handled before the 2nd store from the same instruction commits: This core's copy is now Invalid or Shared so the commit from the store buffer to L1d is blocked until it regains exclusive ownership. The 2nd part of the store from that one instruction is at the head of the store buffer, not in coherent cache.
The other core that was doing an RFO followed up their store with a clflush, evicting this line to persistent memory before the first core can get it back and finish committing the other data from that one instruction.
An NT store like movnti by another core would force eviction of the line as part of committing the NT store, like a normal store + clflushopt.
This scenario requires false-sharing between two threads trying to persist 2 separate things in the same line, so can be avoided if you avoid false-sharing e.g. with padding. (Or some insane true sharing, or firing off clflush without storing first, on memory that other threads might be in the middle of writing).
(Or more plausible for software, much less plausible for hardware): The line gets evicted on its own before the first writer gets it back, even though a core has a pending RFO for it. (As soon as it loses ownership, the first core would send out an RFO).
(Or fully plausible without false-sharing): Forced eviction from L2/L1d at any time due to eviction from an inclusive cache-line tracking structure. This could be triggered by demand for lines that merely happen to alias the same set in L3, not false sharing.
Skylake-server (SKX) has non-inclusive L3, as do later Intel server CPUs. Cascade Lake (CSX) was the first to support persistent memory. Even though it has a non-inclusive L3, the snoop filter is inclusive and a fill conflict that causes an eviction does cause a back invalidation in the entire NUMA node.
So an invalidate request can arrive at any time, and it's likely the core / store buffer isn't going to hold onto the line for more cycles to commit an unknown number of more stores to the same line.
(By that point, the fact that both store-buffer entries were part of one instruction is probably lost. It's possible for an access pattern to create a stream of store buffer entries that store different parts of the same cache line indefinitely, so waiting until "all stores for this line are done" could let unprivileged code create a denial-of-service for a core that wanted to read it. So I think it's unlikely that HW would have a mechanism to avoid releasing ownership of a cache line between stores that came from the same instruction.)

How is load->store reordering possible with in-order commit?

ARM allows reordering loads with subsequent stores, so that the following pseudocode:
// CPU 0        | // CPU 1
temp0 = x;      | temp1 = y;
y = 1;          | x = 1;
can result in temp0 == temp1 == 1 (and, this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (which, it was my understanding, is present in pretty much all OOO processors). My reasoning goes "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:
Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.
The load can commit before its value is known. I don't have a guess as to how this would be implemented.
Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads to a different thread, even if the load was enqueued earlier?
Something else entirely?
There are a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.
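For reference, the pseudocode above written as a C11 litmus test with relaxed atomics; a sketch, with the code that actually runs the two functions on separate cores omitted:
#include <stdatomic.h>

atomic_int x = 0, y = 0;
int temp0, temp1;

void cpu0(void) {
    temp0 = atomic_load_explicit(&x, memory_order_relaxed);  // load x
    atomic_store_explicit(&y, 1, memory_order_relaxed);      // then store y
}

void cpu1(void) {
    temp1 = atomic_load_explicit(&y, memory_order_relaxed);  // load y
    atomic_store_explicit(&x, 1, memory_order_relaxed);      // then store x
}

// temp0 == 1 && temp1 == 1 is allowed on ARM (the load is reordered with the
// later store) but forbidden under x86-TSO.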
Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) on a load to make sure it can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).
I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.
You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.
(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones, and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing's intro to memory reordering uses that example for LoadStore, but doesn't get into uarch details at all.)
A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.
So the sequence on an in-order pipeline is:
lw r0, [r1] TLB hit, but misses in L1d cache. Load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.
With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions.
any number of other instructions that don't read r0 (an instruction that did read r0 would stall an in-order pipeline).
sw r2, [r3] store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.
Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering)
Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)
That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss. (Which to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)
We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.

Random Memory Reads vs Random Memory Writes

In low-level languages like C, I know you should try to use the CPU cache to your benefit as much as possible, since a cache miss means your program will temporarily have to wait for the RAM to dereference a pointer. However, are writes to memory also affected by this? If you write to memory, it would seem that the CPU does not need to wait on a response.
I'm trying to decide if reordering an array of items would truly be worth it when I need to access items in the array in certain groups repeatedly (so sorting it based on those groups). However, those groups will frequently change, so I would need to keep reordering the array if I do this.
Depending on your architecture, random memory writes can be expensive for at least two reasons.
On today's multi-core machines, almost all writes will require some kind of cache coherence protocol to be run so that the corresponding cache lines on other caches will be invalidated.
Ordinary writes will either always cost a memory operation (with a write-through cache) or only sometimes cause one (with a write-back cache).
You can read more details about the possible behaviors of caches on Wikipedia.
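If you want to see the effect on your own machine, a rough micro-benchmark along these lines is usually more convincing than reasoning in the abstract; the array size and index patterns are arbitrary, and it assumes a POSIX clock_gettime:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                       /* 16M ints: well past typical cache sizes */

static double run(int *a, const int *idx) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        a[idx[i]] = i;                    /* write through the index pattern */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int *a = malloc(N * sizeof *a);
    int *seq = malloc(N * sizeof *seq);
    int *rnd = malloc(N * sizeof *rnd);
    for (int i = 0; i < N; i++) { seq[i] = i; rnd[i] = rand() % N; }
    printf("sequential writes: %.3f s\n", run(a, seq));
    printf("random writes:     %.3f s\n", run(a, rnd));
    free(a); free(seq); free(rnd);
    return 0;
}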
This is a very broad question, so my answer is nearly as broad.
The source code, the compiled code, and the underlying hardware are not necessarily all in sync when it comes to reading and writing memory. Your C/C++ code simply references variables. The compiled code will turn that into appropriate machine language which is close to the source code but can vary in the case of optimization, volatile keyword, etc. Finally the hardware will optimize the 3 main levels of storage: CPU cache (fastest), RAM, and hard disk (yes, your program variables can actually be stored on the hard disk, in the case of swapping).
Whether the CPU waits or not depends partially on what's going on at the hardware layer combined with the machine code (again for example consider data specified as volatile).

Multiple threads and CPU cache

I am implementing an image filtering operation in C using multiple threads and making it as optimized as possible. I have one question though: if memory is accessed by thread-0, and concurrently the same memory is accessed by thread-1, will it get it from the cache? This question stems from the possibility that these two threads could be running on two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory?
Suppose I have a memory layout like the following:
int output[100];
Assume there are 2 CPU cores, and hence I spawn two threads to work concurrently. One scheme could be to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another way could be to let thread-0 work on even indices (0, 2, 4, and so on) while the other thread works on odd indices (1, 3, 5, ...). This latter technique is easier to implement (especially for 3D data), but I am not sure whether I could use the cache efficiently this way.
The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, recent Intel multi-core CPUs have L1 caches that are per-core and an L2 cache that is shared among cores in the same CPU package; however, different CPU packages will have their own L2 caches.
Even in the case when your threads are running on two cores within the one package though, if both threads access data within the same cacheline you will have that cacheline bouncing between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated - you just want to avoid two threads from simultaneously trying to access data that is located on the same cache line, where at least one thread is writing to the data. (As long as all the threads are only reading the data, there's no problem - on most architectures, read-only data can be present in multiple caches).
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64 byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your int is 4 bytes, then each cache line will contain exactly 16 int values. As long as the array starts on a cache line boundary, then output[0] through output[15] will be on one cache line, and output[16] through output[31] on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent int values that is a multiple of 16.
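A sketch of that layout, using the C11 alignment specifier; the worker function is only a stand-in for the real filtering code:
#include <stdalign.h>

#define CACHE_LINE 64
#define N 100

alignas(CACHE_LINE) int output[N];   /* 4-byte ints: 16 elements per 64-byte line */

/* Each thread gets one contiguous block, e.g. 0..49 and 50..99 for two threads,
   so only the line holding indices 48..63 is ever touched by both. */
void worker(int first, int last) {
    for (int i = first; i <= last; i++)
        output[i] = i * i;           /* placeholder for the real work */
}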
If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It will analyse the struct types in your compiled binary, and show you the layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add some padding so that your struct is a multiple of the cache line size.
On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
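Typical usage, sketched for the output array from the question; 64 is the assumed cache line size:
#include <stdlib.h>

int *alloc_aligned_output(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(int)) != 0)   /* 64-byte alignment */
        return NULL;
    return p;   /* starts exactly on a cache line boundary */
}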
In general it is a bad idea to share overlapping memory regions, as when one thread processes 0, 2, 4, ... and the other processes 1, 3, 5, ... Although some architectures may support this, most architectures will not, and you probably cannot specify which machines your code will run on. Also, the OS is free to assign your code to any core it likes (a single one, two on the same physical processor, or two cores on separate processors). Also, each CPU core usually has a separate first-level cache, even if it's on the same processor.
In most situations the 0,2,4.../1,3,5... scheme will slow performance down dramatically, possibly to the point of being slower than a single CPU.
Herb Sutter's "Eliminate False Sharing" demonstrates this very well.
Using the scheme [...n/2-1] and [n/2...n] will scale much better on most systems. It may even lead to super-linear performance, as the combined cache size of all CPUs can potentially be used. The number of threads used should always be configurable and should default to the number of processor cores found.
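When some per-thread state is unavoidable, the usual complement to the chunked scheme is to pad that state out to a cache line so two threads never write the same line; a sketch, with the 64-byte line size assumed:
#include <stdalign.h>

#define CACHE_LINE 64

/* One accumulator per thread; alignas pads each element out to its own
   cache line, so updates by one thread never invalidate the other's line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

struct padded_counter counters[2];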
I might be mistaken, but whether the cores' cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has its own cache or whether the cache is shared.
I was working on image manipulation as well, for a security company, and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU cores and that in rare cases the data was being overwritten or replaced with incorrect data.
Whether this is something to take into account or rather a rare event, I cannot answer.
Intel documentation
Intel publishes per-generation datasheets which may contain this kind of information.
For example, for the i5-3210M processor that I had in my older computer, I look up the 3rd generation Datasheet Volume 1; section 3.3, "Intel Hyper-Threading Technology (Intel HT Technology)", says:
The processor supports Intel Hyper-Threading Technology (Intel HT Technology) that allows an execution core to function as two logical processors. While some execution resources such as caches, execution units, and buses are shared, each logical processor has its own architectural state with its own set of general-purpose registers and control registers.
which confirms that caches are shared between the two hyperthreads of a core in that generation of CPUs.
See also:
similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?

Resources