What is co-Turing recognizable, how do I prove complement of two languages is decidable using Co-Turing concepts? - theory

Let L1 and L2 be two languages such that there exists no string w that belongs to both L1 and L2.
I am struggling with how to prove that, if L1 and L2 are both co-Turing-recognizable, there exists a decidable language A such that L1 ⊆ A and L2 ⊆ A`.
(A` is the complement of A.)

We may assume that neither L1 nor L2 is decidable, since if either is, the solution is trivial (let A = L1 or A' = L2 if L1 or L2 is decidable, respectively). In particular, neither L1 nor L2 is Turing-recognizable, because a language that is both Turing-recognizable and co-Turing-recognizable is decidable.
Given that, A must be equal to the set L1 with some more elements added to it (it must have at least the elements in L1 if it is to be a superset). Because L2 is a subset of A', none of the elements added to L1 to form A can be in L2. Furthermore, we must add infinitely many items, since adding finitely many items cannot render A decidable where L1 is not.
Split up the stuff not in L1 or L2 into two languages R1 and R2 such that those languages have nothing in common and every string is in exactly one of L1, L2, R1 and R2. Furthermore, choose R1 and R2 so that R1 is co-Turing-recognizable, R2 is Turing-recognizable and both sets are infinite. Let A = L1 U R1. Now, A' = L2 U R2.
1. A is co-Turing-recognizable. If w is not in L1, we can eventually recognize that fact. If w is not in R1, we can decide that fact. Therefore, we can eventually recognize that w is in neither.
2. L2 is co-Turing-recognizable. If w is not in L2, we can eventually recognize that fact. If it's not in L2, then it's either in A or R2. But we can decide whether w is in R2 since R2 is decidable. Therefore, if we recognize that w is not in L2 and decide it's not in R2, we have recognized that w is in A. Therefore, A is Turing-recognizable.
We saw in 1 that A is co-Turing-recognizable and in 2 that A is Turing-recognizable. Therefore, A is decidable. Consequently, A' is decidable.
Note that we sort of waved our hands there when we "split up" stuff not in L1 or L2 into two infinite languages, one co-Turing-recognizable and the other Turing-recognizable. It seems safe to assume that in any infinite language, there must exist a proper subset of that language which is recognizable but not decidable. You might want to look that up and/or prove it separately to verify. Proof idea: the elements of any infinite set could be put into lexicographic order, in which case there is a bijection with the language of all strings over the alphabet; because there are recognizable but undecidable languages over the set of all strings, so too must there be recognizable but undecidable languages over this set of strings. It might be important to note that (L1 U L2)' is recognizable, as that might be required to make any argument rigorous.
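For comparison, here is a minimal sketch of a more direct construction that avoids splitting the remainder into R1 and R2. It is a different technique from the one above, offered as an illustration only. Since L1 and L2 are co-Turing-recognizable, their complements have recognizers, and since L1 and L2 are disjoint, every string lies outside at least one of them. Running both recognizers in lockstep therefore always halts. The `bounded_recognizer` interface (a simulation that reports whether the machine accepts within a step budget) and the two toy recognizers are assumptions made up for this sketch.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical interface: returns true if the recognizer accepts w
   within `steps` simulation steps. */
typedef bool (*bounded_recognizer)(const char *w, unsigned long steps);

/* Decides membership in A by dovetailing the recognizers for the
   complements of L1 and L2. Terminates on every input as long as
   L1 and L2 are disjoint, because w is then outside at least one. */
bool in_A(const char *w, bounded_recognizer not_in_L1,
          bounded_recognizer not_in_L2)
{
    for (unsigned long steps = 1;; steps++) {
        if (not_in_L1(w, steps))
            return false;   /* w is not in L1, so it may go into A' */
        if (not_in_L2(w, steps))
            return true;    /* w is not in L2, so it may go into A */
    }
}

/* Toy stand-ins (decidable, for demonstration only):
   L1 = strings starting with 'a', L2 = strings starting with 'b'.
   Each pretends to need strlen(w) + 1 steps before accepting. */
bool toy_not_in_L1(const char *w, unsigned long steps)
{
    return steps > strlen(w) && w[0] != 'a';
}

bool toy_not_in_L2(const char *w, unsigned long steps)
{
    return steps > strlen(w) && w[0] != 'b';
}
```

If w is in L1, the first recognizer never accepts, so the loop can only exit through the second branch and w lands in A; the symmetric argument gives L2 ⊆ A'. Strings in neither language go to whichever side answers first, which is fine because the claim places no constraint on them.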


find nan in array of doubles using simd

This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN is the only value for which a != a is true). Something like:
double *data = ...;
__m256d a, b;
int temp = 0;
// This bit would be in a loop over the array
// I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well, if something like _CMP_EQ_UQ will detect NaNs, I can just use that and then I can compare 4 doubles to 4 doubles and magically look at 8 doubles at once.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?
You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. (See also: What does ordered / unordered comparison mean?)
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x __m256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
#include <immintrin.h>

// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
    __m256d a = _mm256_loadu_pd(p+0);
    __m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
    __m256d c = _mm256_loadu_pd(p+8);
    __m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
    __m256d abcdnan = _mm256_or_pd(abnan, cdnan);
    return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fetestexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
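As a rough sketch of the sticky-flag idea, the following reads MXCSR directly through the SSE intrinsics from `xmmintrin.h`. It assumes x86-64, where scalar and vector double math goes through SSE, and the function `divide_and_check` is a made-up stand-in for whatever computation produces the array. Portable code would use `feclearexcept` / `fetestexcept(FE_INVALID)` from `fenv.h` instead.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical producer: fills out[i] = num[i] / den[i] and reports
   whether the "invalid" sticky flag was raised while doing so.
   Returns 1 if a NaN may have been produced, 0 if none was. */
int divide_and_check(double *out, const double *num, const double *den,
                     size_t n)
{
    _MM_SET_EXCEPTION_STATE(0);          /* clear all sticky flags in MXCSR */
    for (size_t i = 0; i < n; i++)
        out[i] = num[i] / den[i];        /* 0.0 / 0.0 raises "invalid" */
    /* If the flag is clear, no operation since the clear produced a NaN
       from non-NaN inputs. If it is set, the array still has to be
       scanned. */
    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_INVALID) != 0;
}
```

Note that x / 0.0 for non-zero x raises the divide-by-zero flag and yields an infinity, not the invalid flag. An optimizing compiler is not strictly obliged to keep the FP operations between the two MXCSR accesses, so treat this as an illustration of the idea rather than a guaranteed-correct pattern.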
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.

Hypothetical Loop Iterator

Can anyone help me with my homework? I don't get this question. Can you explain it to me?
Take the following shorthand for a hypothetical loop iterator, that could have been in MIPS instruction set.
itr $t6, loop # if(R[rs]>0) R[rs]=R[rs]-1 PC=PC+4+BranchAddr
a) Among the available instruction formats [R, I, J], what is the most appropriate for itr?
b) Implement itr using the existing MIPS instruction set.
c) Elaborate on the reason why this instruction is not available in the MIPS instruction set on the basis of principles of computer architecture and computer organization.
itr $t6, loop # if(R[rs]>0) R[rs]=R[rs]-1 PC=PC+4+BranchAddr
This is an informal syntax and explanation. The syntax is that there is an opcode (itr) that takes two operands: the first one is a register, and the second is a label.
The stuff behind the # indicates what they expect the instruction to do: Roughly, this instruction is a conditional branch that tests if the value in register rs is > 0 — and when greater than zero, it will decrement that register, then take the branch to the target address. If not greater than zero, then not take the branch.
This instruction might be used in a count down loop.
Let's write out the conditional branch using C code:
if ( rs > 0 ) {
    rs--;
    goto loop;
}
// ...else, fall thru to here...
(a) They then want you to say which format this would take if it were a real MIPS instruction. You should draw a format for the instruction and then look at the indicated formats to see which is closest.
(b) They then want you to perform this operation using an existing sequence of MIPS instructions. You might translate the above C into MIPS assembly.
(c) Theorize why this instruction might be difficult or opposed to the design goals of the MIPS processor (these design goals include short cycle duration and simplicity).
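To see how such an instruction would behave in a count-down loop, here is a small C model of the semantics given in the comment on the itr line. It is only an emulation for building intuition, not a MIPS implementation: the helper names `itr` and `run_countdown_loop` are made up, and the loop shape (itr placed at the bottom of the loop) is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models: if (R[rs] > 0) { R[rs] = R[rs] - 1; take the branch }.
   Returns true when the branch would be taken. */
static bool itr(int32_t *rs)
{
    if (*rs > 0) {
        (*rs)--;
        return true;
    }
    return false;
}

/* Runs a loop whose last instruction is `itr $t6, loop` and returns how
   many times the body executed. */
int run_countdown_loop(int32_t initial_count)
{
    int32_t t6 = initial_count;
    int body_runs = 0;
    do {
        body_runs++;        /* loop body at label `loop` */
    } while (itr(&t6));     /* itr $t6, loop */
    return body_runs;
}
```

With itr at the bottom of the loop, the body runs once more than the initial register value (for example, 4 times for an initial count of 3), because the test happens after the body and before the decrement takes effect.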

What is the instruction that gives branchless FP min and max on x86?

To quote (thanks to the author for developing and sharing the algorithm!):
Since modern floating-point instruction sets can compute min and max without branches
Corresponding code by the author is just
static inline double
dmnsn_min(double a, double b)
{
    return a < b ? a : b;
}
I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above obviously is meant to be used in a scalar form.
What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
Is it safe to assume it's going to be applied, or how do I call it?
Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Just in case: I'm familiar with Use of min and max functions in C++, believe it's related but not quite my question.
Warning: Beware of compilers treating _mm_min_ps / _mm_max_ps (and _pd) intrinsics as commutative even in strict FP (not fast-math) mode; even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867 which was fixed in GCC7 but may be back or never fixed for _mm_min_ss etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).
GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.
Unfortunately there isn't a single instruction that implements fmin(a,b) (with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.
Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.
MINSS a,b exactly implements (a<b) ? a : b according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities. (i.e. it keeps the source operand, b, on unordered.) This means C++ compilers can use them for std::min(b,a) and std::max(b,a), because those functions are based on the same expression. Note the b,a operand order for the std:: functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.
MAXSS a,b exactly implements (b<a) ? a : b, again keeping the source operand (b) on unordered. Like std::max(b,a).
Looping over an array with x = std::min(arr[i], x); (i.e. minss or maxss xmm0, [rsi]) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN; outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0 to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.
Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]); (or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.
(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do horizontal min of it, just like summing an array or doing a dot product.)
Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/
But fortunately you don't need to do this manually, compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal: as shown below (Godbolt link):
// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
    return a < b ? a : b;
}
Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.
(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)
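The asymmetric NaN behaviour, and its effect on an array reduction, can be demonstrated in plain C with the same ternary expression. This is a small sketch; the function names are made up, and it assumes strict FP semantics (no -ffast-math).

```c
#include <math.h>
#include <stddef.h>

/* Same expression as MINSD implements: returns b whenever the compare
   is false, which includes the unordered (NaN) case. */
static double min_lt(double a, double b)
{
    return a < b ? a : b;
}

/* Accumulator is the first operand: a NaN element replaces x, and the
   next element then replaces the NaN. Result is the min of the elements
   after the last NaN. Starting from NAN works and handles empty input. */
double min_after_last_nan(const double *arr, size_t n)
{
    double x = NAN;
    for (size_t i = 0; i < n; i++)
        x = min_lt(x, arr[i]);
    return x;
}

/* Accumulator is the second operand: NaN elements never win the compare,
   so they are ignored. Must start from +Infinity, not NaN. */
double min_ignoring_nan(const double *arr, size_t n)
{
    double v = INFINITY;
    for (size_t i = 0; i < n; i++)
        v = min_lt(arr[i], v);
    return v;
}
```

For the input {3, NaN, 5, 4}, the first loop yields 4 and the second yields 3, which is why operand order matters whenever NaNs may be present.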
The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.
I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.
Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss as the intrinsic for MINSS.
When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)
Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.
Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.
As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.
The commutative-_mm_min_ps issue applies to only the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).
You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.
(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)
In C++, use std::min and std::max, which are defined in terms of > or <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax for performance unless you need their NaN behaviour.
In C, I think just write your own min and max functions (or macros if you do it safely).
C & asm on the Godbolt compiler explorer
float minfloat(float a, float b) {
    return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
    minss xmm0, xmm1
// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
# This implementation of std::min uses (b<a) ? b : a;
# So it can produce the result only in the register that b was in
# This isn't worse (when inlined), just opposite
minss xmm1, xmm0
movaps xmm0, xmm1
float minfloat_fmin(float a, float b) { return fminf(a, b); }
# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
movaps xmm2, xmm0
cmpunordss xmm2, xmm2
movaps xmm3, xmm2
andps xmm3, xmm1
minss xmm1, xmm0
andnps xmm2, xmm1
orps xmm2, xmm3
movaps xmm0, xmm2
# Obviously you don't want this if you don't need it.
If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.
If you don't expect NaNs, or want to handle them specially, write stuff like
lowest = _mm_min_ps(lowest, some_loop_variable);
so the register holding lowest can be updated in-place (even without AVX).
Taking advantage of MINPS's NaN behaviour:
Say your scalar code is something like
if(some condition)
lowest = min(lowest, x);
Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31.)
The straight-forward way to implement this would be to blend x with +Inf based on the condition mask. Or do newval = min(lowest, x); and blend newval into lowest. (either BLENDVPS or AND/ANDNOT/OR).
But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:
__m128 inverse_condition = _mm_cmplt_ps(foo, bar);
__m128 x = whatever;
x = _mm_or_ps(x, inverse_condition); // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest); // NaN elements in x mean no change in lowest
// REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
So with only SSE2, we've done a conditional MINPS in two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).
The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (BLENDVPS is 2 uops on most CPUs).
Peter Cordes's answer is great, I just figured I'd jump in with some shorter point-by-point answers:
What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
I was referring to minss/minsd. And even other architectures without such instructions should be able to do this branchlessly with conditional moves.
Is it safe to assume it's going to be applied, or how do I call it?
gcc and clang will both optimize (a < b) ? a : b to minss/minsd, so I don't bother using intrinsics. Can't speak to other compilers though.
Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
The individual a < b tests are pretty much completely unpredictable, so it is very important to avoid branching for those. Tests like if (ray.dir.x != 0.0) are very predictable, so avoiding those branches is less important, but it does shrink the code size and make it easier to vectorize. The most important part is probably removing the divisions though.
Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Yes, minss/minsd behave exactly like (a < b) ? a : b, including their treatment of infinities and NaNs.
Also, I wrote a followup post to the one you referenced that talks about NaNs and min/max in more detail.

Last used cache line versus different cache lines

Let's assume cache lines are 64 bytes wide and I have two arrays a and b which fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question is it faster to now read a[1] than to read b[0]? In other words, is it faster to read from the last used cache line?
Does the set matter? Let's now assume that I have a 32 kb L1 data cache which is 4 way. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
Another way to ask my question (which is what I really care about) is in regards to reading a matrix.
In other words which one of these two code options will be more efficient assuming matrix M fits in the L1 cache and is 64 byte aligned and is already in the L1 cache.
float M[16][16]; //64 byte aligned
Version 1:
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[i][j];
    }
}
Version 2:
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[j][i];
    }
}
Edit: To make this clear, due to SSE/AVX let's assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cache miss)?
Edit:: I'm mostly interested in all processors since Intel Core 2 and Nehalem but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.
With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[0] not recently loaded but still in L1 cache, then there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing M[0][0] and so on are evicted from cache.
When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).
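The mapping from address to cache set in the scenario above can be made concrete with a little arithmetic. This is a simplified model that assumes the set index is taken from the address bits just above the line offset; the function names are made up for this sketch, and real hardware may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of sets in a cache of the given total size, line size and
   associativity (ways). */
size_t cache_num_sets(size_t cache_bytes, size_t line_bytes, size_t ways)
{
    return cache_bytes / (line_bytes * ways);
}

/* Set index for an address: drop the offset within the line, then wrap
   around the number of sets. */
size_t cache_set_index(uintptr_t addr, size_t line_bytes, size_t num_sets)
{
    return (size_t)((addr / line_bytes) % num_sets);
}
```

For a 32 KiB, 8-way cache with 64-byte lines this gives 64 sets, so addresses 4096 bytes apart share a set. For the 32 KiB, 4-way cache in the question it gives 128 sets, so addresses 8192 bytes apart share a set. In either case a column walk with such a stride keeps landing in the same set and evicts lines once more than `ways` of them are live.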
Fetching a[0] and then either a[1] or b[0] should amount to 2 cache accesses that hit the L1 in either case. You didn't say which uArch you're using, but I'm not familiar with any mechanism that does further "caching" of the full cache line above the L1 (anywhere in the memory unit), and I don't think such a mechanism could be feasible (at least not for any reasonable price).
Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line. Your HW would have to not only keep the full cache line somewhere in the memory unit in case it's going to be accessed again (I'm not sure how common that case is, so this feature is probably not worth the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for wb memory). In fact, it could even be a store in the same thread context, and you'll have to guard against that (since most common x86 CPUs today perform loads out of order). If you don't maintain both of these (and probably other safeguards too), you break coherency; if you do, you've created a monster logic that does the same as your L1 already does, just to save a meager 1-2 cycles of access.
However, even though both options would require the same number of cache accesses, there may be other considerations affecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc., all of which depend on your exact machine implementation.
If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same accesses but will be lighter on execution BW. I think that any decent compiler should be able to unroll your loops at this size, and combine the consecutive accesses into a single vector, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would complicate the job for the compiler).
Since you're also asking about fitting the matrix in the L2, that simplifies the question: in that case using the same line(s) multiple times as in option 1 is better, as it allows you to hit the L1, while the alternative is to constantly fetch from the L2, which has higher latency and lower bandwidth. This is the basic principle behind loop tiling / blocking.
Spatial locality is king so version #1 is faster. A good compiler can even vectorize the reads using SSE/AVX.
The CPU rearranges reads, so it doesn't matter which one is first. In out-of-order CPUs it should matter very little whether both cache lines are in the same set.
For large matrices, it is even more important to keep locality so the L1 cache remains hot (fewer cache misses).
Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried / is it possible to find out the answer yourself by some form of benchmarking?
You can get a high resolution timer by some function such as QueryPerformanceCounter (assuming you're on Windows) or OS equivalent, then iterate the reads you want to test by x amount of times, then get the high resolution timer again to get the average time a read took.
Perform this process again for different reads and you should be able to compare average read times for different types of read, which should answer your question. That's not to say that the answer will remain the same on different processors though.
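As a rough sketch of that measurement loop, here is one way it could look on a POSIX system, using clock_gettime with CLOCK_MONOTONIC in place of QueryPerformanceCounter. The function name average_read_ns and its parameters are made up for illustration, and a real benchmark would also want warm-up runs and several repetitions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: average time per read, in nanoseconds, over
 * `repeats` passes across `count` ints. */
double average_read_ns(const volatile int *data, size_t count, size_t repeats)
{
    struct timespec start, end;
    long long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t r = 0; r < repeats; r++)
        for (size_t i = 0; i < count; i++)
            sum += data[i];   /* volatile read, so the loop is not optimised away */
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)sum;

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / ((double)count * (double)repeats);
}
```

Call it once per access pattern you want to compare (for example, once per loop order) and compare the returned averages.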

How would you generically detect cache line associativity from user mode code?

I'm putting together a small patch for the cachegrind/callgrind tool in valgrind which will auto-detect, using completely generic code, CPU instruction and cache configuration (right now only x86/x64 auto-configures, and other architectures don't provide CPUID type configuration to non-privileged code). This code will need to execute entirely in a non-privileged context i.e. pure user mode code. It also needs to be portable across very different POSIX implementations, so grokking /proc/cpuinfo won't do as one of our destination systems doesn't have such a thing.
Detecting the frequency of the CPU, the number of caches, their sizes, and even cache line size can all be done using 100% generic POSIX code which has no CPU-specific opcodes whatsoever (just a lot of reasonable assumptions, such as that adding two numbers together, if without memory or register dependency stalls, probably will be executed in a single cycle). This part is fairly straightforward.
What isn't so straightforward, and why I ask StackOverflow, is how to detect cache line associativity for a given cache? Associativity is how many places in a cache can contain a given cache line from main memory. I can see that L1 cache associativity could be detected, but L2 cache? Surely the L1 associativity gets in the way?
I appreciate this is probably a problem which cannot be solved. But I throw it onto StackOverflow and hope someone knows something I don't. Note that if we fail here, I'll simply hard code in an associativity default of four way, assuming it wouldn't make a huge difference to results.
Here's a scheme:
Have a memory access pattern with a stride S and a number of unique elements accessed N. The test first touches each unique element, and then measures the average time to access each element by accessing the same pattern a very large number of times.
Example: for S = 2 and N = 4 the address pattern would be 0,2,4,6,0,2,4,6,0,2,4,6,...
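A minimal sketch of such a test in plain POSIX C might look like the following. It assumes the stride is given in bytes and is a multiple of the pointer size, and it chases pointers so that each access depends on the previous one. The name avg_access_ns is hypothetical, and the sketch works on virtual addresses, so for a physically indexed cache the results may be noisier than the scheme suggests.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: average nanoseconds per access when cycling over
 * n elements spaced stride_bytes apart. Returns a negative value on bad
 * arguments or allocation failure. */
double avg_access_ns(size_t stride_bytes, size_t n, size_t iterations)
{
    if (n == 0 || iterations == 0 || stride_bytes == 0 ||
        stride_bytes % sizeof(void *) != 0)
        return -1.0;

    char *buf = malloc(stride_bytes * n);
    if (buf == NULL)
        return -1.0;

    /* Build a circular chain: element i points to element (i + 1) % n.
     * Writing the pointers also touches every element once. */
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * stride_bytes) = buf + ((i + 1) % n) * stride_bytes;

    void **p = (void **)buf;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < iterations; i++)
        p = (void **)*p;          /* each load depends on the previous one */
    void *volatile sink = p;      /* keep the final pointer observable */
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)sink;

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    free(buf);
    return elapsed_ns / (double)iterations;
}
```

The idea would be to call this for N = 1, 2, 4, ... with a fixed stride and record the returned averages.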
Consider a multi-level cache hierarchy. You can make the following reasonable assumptions:
Size of n+1 th level-cache is a power of two times the size of the nth cache
The associativity of n+1 th cache is also a power of two times the associativity of the nth cache.
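These two assumptions allow us to say that if two addresses map to the same set in the (n+1)th cache (say L2), then they must also map to the same set in the nth cache (say L1), provided both levels use the same line size and the larger cache has at least as many sets as the smaller one.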
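To make the set-mapping argument concrete, here is a small sketch that computes the set index of an address for a given cache geometry. It assumes a simple modulo-indexed cache with no hashing of index bits, and cache_set_index is an illustrative name rather than anything from a real tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set index of `addr` in a cache of `cache_size` bytes
 * with `ways`-way associativity and `line_size`-byte lines, assuming plain
 * modulo indexing. */
size_t cache_set_index(uintptr_t addr, size_t cache_size, size_t ways,
                       size_t line_size)
{
    size_t num_sets = cache_size / (ways * line_size);
    assert(num_sets > 0);
    return (size_t)((addr / line_size) % num_sets);
}
```

For instance, with an assumed 32 KiB 8-way L1 and 256 KiB 8-way L2, both with 64-byte lines, addresses that are 256 KiB apart give the same index in the L2 (512 sets) and therefore also the same index in the L1 (64 sets), since 512 is a multiple of 64.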
Say you know the sizes of L1, L2 caches. You need to find the associativity of L2 cache.
set stride S = size of L2 cache (so that every access maps to the same set in L2, and in L1 too)
vary N (by powers of 2)
You get the following regimes:
Regime 1: N <= associativity of L1. (All accesses HIT in L1)
Regime 2: associativity of L1 < N <= associativity of L2 (All accesses miss in L1, but HIT in L2)
Regime 3: N > associativity of L2 ( All accesses miss in L2)
So, if you plot average access time against N (when S = size of L2), you will see a step-like plot. The end of the lowest step gives you the associativity of L1. The next step gives you the associativity of L2.
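Reading the steps off the plot could be automated along these lines. This is only a sketch: it assumes the timings were taken for N = 1, 2, 4, ... and treats any jump larger than a chosen ratio as a step boundary. Both find_steps and the jump_ratio parameter are hypothetical, and real measurements would likely need smoothing first.

```c
#include <stddef.h>

/* times[i] is the average access time measured with N = 1 << i.
 * Stores the last N of each plateau in steps[] and returns how many
 * plateau ends were found (at most max_steps). */
size_t find_steps(const double *times, size_t count, double jump_ratio,
                  size_t *steps, size_t max_steps)
{
    size_t found = 0;

    for (size_t i = 0; i + 1 < count && found < max_steps; i++) {
        /* A jump between N = 1 << i and the next N ends a plateau at 1 << i. */
        if (times[i + 1] > times[i] * jump_ratio)
            steps[found++] = (size_t)1 << i;
    }
    return found;
}
```

Under the scheme above, steps[0] would be the estimate for the L1 associativity and steps[1] the estimate for the L2 associativity.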
You can repeat the same procedure between L2-L3 and so-on. Please let me know if that helps. The method of obtaining cache parameters by varying the stride of a memory access pattern is similar to that used by the LMBENCH benchmark. I don't know if lmbench infers associativity too.
Could you write a small program that only accesses lines from the same set? Then you can increase the stack distance between the accesses, and when the execution time rises dramatically, you can assume you have reached the associativity.
It's probably not very stable, but maybe that could give a lead, don't know. I hope it can help.
For x86 platform you can use cpuid:
See http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html for details.
You need something like:
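unsigned int _eax, _ebx, _ecx, _edx;
unsigned int op = func; /* CPUID leaf to query */
/* GCC-style inline asm; CPUID returns its results in EAX, EBX, ECX and EDX */
asm volatile ("cpuid"
: "=a" (_eax),
"=b" (_ebx),
"=c" (_ecx),
"=d" (_edx)
: "a" (op),
"c" (0)); /* ECX = 0 selects the first sub-leaf for leaves that take one */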
Then use the info according to the doc in the link mentioned above.
