Background
I've assumed for a while that gcc will convert while-loops into do-while form. (See Why are loops always compiled into "do...while" style (tail jump)?)
And that, at -O0, while-loops...
while (test-expr)
body-statement
...will generate code of the jump-to-middle do-while form:
goto test;
loop:
body-statement
test:
if (test-expr) goto loop;
And that gcc -O2 will generate a guarded do-while:
if (!test-expr)
goto done;
loop:
body-statement
if (test-expr) goto loop;
done:
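For a source-level picture of the guarded-do shape, here is a hand-written sketch (not compiler output) of a summing while-loop rewritten that way:

#include <stddef.h>

/* Hand-written guarded do-while rewrite of `while (i < N) { s += a[i]; i++; }`:
   the guard skips the loop entirely when it would run zero times, and the
   do-while then tests the condition only at the bottom of each iteration. */
int sum_guarded(int a[], size_t N) {
    int s = 0;
    if (N != 0) {
        size_t i = 0;
        do {
            s += a[i];
            i++;
        } while (i < N);
    }
    return s;
}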
Concrete examples
Here are godbolt examples of functions for which gcc generates the kind of control flow I'm describing above (I use for-loops but a while loop will give the same code).
This simple function...
int sum1(int a[], size_t N) {
int s = 0;
for (size_t i = 0; i < N; i++) {
s += a[i];
}
return s;
}
...will, at -O0, generate this jump-to-middle code:
sum1:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov QWORD PTR [rbp-32], rsi
mov DWORD PTR [rbp-4], 0
mov QWORD PTR [rbp-16], 0
jmp .L2
.L3:
mov rax, QWORD PTR [rbp-16]
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
add DWORD PTR [rbp-4], eax
add QWORD PTR [rbp-16], 1
.L2:
mov rax, QWORD PTR [rbp-16]
cmp rax, QWORD PTR [rbp-32]
jb .L3
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
At -O2, it will instead generate this guarded-do code:
sum1:
test rsi, rsi
je .L4
lea rdx, [rdi+rsi*4]
xor eax, eax
.L3:
add eax, DWORD PTR [rdi]
add rdi, 4
cmp rdi, rdx
jne .L3
ret
.L4:
xor eax, eax
ret
My question
What I'm after is a hand-wavy rule to apply when looking at -Os loops. I'm more used to looking at -O2 code, and now that I'm working in the embedded field, where -Os is more prevalent, I'm surprised by the form of the loops I see.
It seems that gcc -Og and -Os both generate while-style code, with a jmp at the bottom and an if () break at the top. Clang, on the other hand, generates a guarded do-while. (A godbolt link to gcc and clang output.)
Here is an example of gcc -Os output for the above function:
sum1:
xor eax, eax
xor r8d, r8d
.L2:
cmp rax, rsi
je .L5
add r8d, DWORD PTR [rdi+rax*4]
inc rax
jmp .L2
.L5:
mov eax, r8d
ret
Am I correct in assuming that gcc -Og and -Os generate code of the form I described above?
Does anyone have a resource that describes the rationale for using the while-form for -Og and -Os? Is it by design, or an accidental fall-out from the way the optimization passes are organized?
I thought that converting loops into do-while form was part of the early canonicalization done by compilers. How come gcc -O0 generates do-while but gcc -Og gives while-loops? Does that canonicalization only happen when optimization is enabled?
Sidenote: I'm surprised by how much the code generated with -Os and -O2 differs, given that not many compiler flags differ between them. Maybe many passes check some variable like tradeoff_speed_vs_space.
Related
I have these two source files:
const ARR_LEN: usize = 128 * 1024;
pub fn plain_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
for i in 0..ARR_LEN {
result[i] = x[i] % m;
}
}
and
#include <stdint.h>
#define ARR_LEN (128 * 1024)
void plain_mod_test(uint64_t *x, uint64_t m, uint64_t *result) {
for (int i = 0; i < ARR_LEN; ++ i) {
result[i] = x[i] % m;
}
}
Is my C code a good approximation to the Rust code?
When I compile the C code on godbolt.org with x86-64 gcc 12.2 -O3, I get the sensible:
plain_mod_test:
mov r8, rdx
xor ecx, ecx
.L2:
mov rax, QWORD PTR [rdi+rcx]
xor edx, edx
div rsi
mov QWORD PTR [r8+rcx], rdx
add rcx, 8
cmp rcx, 1048576
jne .L2
ret
But when I do the same for rustc 1.66 -C opt-level=3 I get this verbose output:
example::plain_mod_test:
push rax
test rsi, rsi
je .LBB0_10
mov r8, rdx
xor ecx, ecx
jmp .LBB0_2
.LBB0_7:
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx + 8], rdx
mov rcx, r9
cmp r9, 131072
je .LBB0_9
.LBB0_2:
mov rax, qword ptr [rdi + 8*rcx]
mov rdx, rax
or rdx, rsi
shr rdx, 32
je .LBB0_3
xor edx, edx
div rsi
jmp .LBB0_5
.LBB0_3:
xor edx, edx
div esi
.LBB0_5:
mov qword ptr [r8 + 8*rcx], rdx
mov rax, qword ptr [rdi + 8*rcx + 8]
lea r9, [rcx + 2]
mov rdx, rax
or rdx, rsi
shr rdx, 32
jne .LBB0_7
xor edx, edx
div esi
mov qword ptr [r8 + 8*rcx + 8], rdx
mov rcx, r9
cmp r9, 131072
jne .LBB0_2
.LBB0_9:
pop rax
ret
.LBB0_10:
lea rdi, [rip + str.0]
lea rdx, [rip + .L__unnamed_1]
mov esi, 57
call qword ptr [rip + core::panicking::panic@GOTPCREL]
ud2
How do I write Rust code which compiles to assembly similar to that produced by gcc for C?
Update: When I compile the C code with clang 12.0.0 -O3 I get output which looks far more like the Rust assembly than the GCC/C assembly.
i.e. This looks like a GCC vs clang issue, rather than a C vs Rust difference.
plain_mod_test: # #plain_mod_test
mov r8, rdx
xor ecx, ecx
jmp .LBB0_1
.LBB0_6: # in Loop: Header=BB0_1 Depth=1
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
je .LBB0_8
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov rax, qword ptr [rdi + 8*rcx]
mov rdx, rax
or rdx, rsi
shr rdx, 32
je .LBB0_2
xor edx, edx
div rsi
jmp .LBB0_4
.LBB0_2: # in Loop: Header=BB0_1 Depth=1
xor edx, edx
div esi
.LBB0_4: # in Loop: Header=BB0_1 Depth=1
mov qword ptr [r8 + 8*rcx], rdx
mov rax, qword ptr [rdi + 8*rcx + 8]
mov rdx, rax
or rdx, rsi
shr rdx, 32
jne .LBB0_6
xor edx, edx
div esi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
jne .LBB0_1
.LBB0_8:
ret
Don’t compare apples to orange crabs.
Most of the difference between the assembly outputs is due to loop unrolling, which the LLVM code generator used by rustc does much more aggressively than GCC’s, and working around a CPU performance pitfall, as explained in Peter Cordes’ answer. When you compile the same C code with Clang 15, it outputs:
mov r8, rdx
xor ecx, ecx
jmp .LBB0_1
.LBB0_6:
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
je .LBB0_8
.LBB0_1:
mov rax, qword ptr [rdi + 8*rcx]
mov rdx, rax
or rdx, rsi
shr rdx, 32
je .LBB0_2
xor edx, edx
div rsi
jmp .LBB0_4
.LBB0_2:
xor edx, edx
div esi
.LBB0_4:
mov qword ptr [r8 + 8*rcx], rdx
mov rax, qword ptr [rdi + 8*rcx + 8]
mov rdx, rax
or rdx, rsi
shr rdx, 32
jne .LBB0_6
xor edx, edx
div esi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
jne .LBB0_1
.LBB0_8:
ret
which is pretty much the same as the Rust version.
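The 2x unrolling corresponds roughly to rewriting the loop by hand like this (an illustrative sketch only, with the function renamed; it relies on ARR_LEN being even, so no remainder handling is needed):

#include <stdint.h>
#include <stddef.h>
#define ARR_LEN (128 * 1024)

/* Hand-unrolled-by-two version of the loop, mirroring what LLVM's unroller does. */
void plain_mod_test_unrolled(const uint64_t *x, uint64_t m, uint64_t *result) {
    for (size_t i = 0; i < ARR_LEN; i += 2) {
        result[i]     = x[i]     % m;
        result[i + 1] = x[i + 1] % m;
    }
}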
Using Clang with -Os results in assembly much closer to that of GCC:
mov r8, rdx
xor ecx, ecx
.LBB0_1:
mov rax, qword ptr [rdi + 8*rcx]
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx], rdx
inc rcx
cmp rcx, 131072
jne .LBB0_1
ret
rustc with -C opt-level=s does likewise:
push rax
test rsi, rsi
je .LBB6_4
mov r8, rdx
xor ecx, ecx
.LBB6_2:
mov rax, qword ptr [rdi + 8*rcx]
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx], rdx
lea rax, [rcx + 1]
mov rcx, rax
cmp rax, 131072
jne .LBB6_2
pop rax
ret
.LBB6_4:
lea rdi, [rip + str.0]
lea rdx, [rip + .L__unnamed_1]
mov esi, 57
call qword ptr [rip + core::panicking::panic@GOTPCREL]
ud2
Of course, there is still the check whether m is zero, leading to a panicking branch. You can eliminate that branch by narrowing the type of the argument to exclude zero:
const ARR_LEN: usize = 128 * 1024;
pub fn plain_mod_test(x: &[u64; ARR_LEN], m: std::num::NonZeroU64, result: &mut [u64; ARR_LEN]) {
for i in 0..ARR_LEN {
result[i] = x[i] % m
}
}
Now the function compiles to assembly identical to Clang's output.
rustc uses the LLVM back-end optimizer, so compare against clang. LLVM unrolls small loops by default.
Recent LLVM is also tuning for Intel CPUs before Ice Lake, where div r64 is much slower than div r32, so much slower that it's worth branching on.
It's checking whether a uint64_t actually fits in uint32_t and, if so, using 32-bit operand-size for div. The shr/je implements if ((dividend|divisor)>>32 == 0) use 32-bit, i.e. it checks the high halves of both operands for being all zero. If it checked the high half of m once and made two versions of the loop, the per-element test would be simpler. But this code will bottleneck on division throughput anyway.
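In C, the per-element logic the compiler emits is roughly this (a sketch of the generated code's behaviour, not anything present in the source):

#include <stdint.h>

/* Sketch of LLVM's opportunistic 32-bit division: when the high halves of both
   dividend and divisor are zero, a cheaper 32-bit div produces the same remainder. */
static uint64_t mod_opportunistic(uint64_t dividend, uint64_t divisor) {
    if (((dividend | divisor) >> 32) == 0) {
        return (uint32_t)dividend % (uint32_t)divisor;  /* div r32 path */
    }
    return dividend % divisor;                          /* div r64 path */
}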
This opportunistic div r32 code-gen will eventually become obsolete, as Ice Lake's integer divider is wide enough not to need way more micro-ops for 64-bit, so performance just depends on the actual values, regardless of whether there are an extra 32 bits of zeros above it or not. AMD has been like that for a while.
But Intel sold a lot of CPUs based on re-spins of Skylake (including Cascade Lake servers, and client CPUs up to Comet Lake). While those are still in widespread use, LLVM -mtune=generic should probably keep doing this.
For more details:
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux (a case where we know at compile-time the 64-bit integers will only hold small values, and rewriting the binary to use 32-bit operand-size with zero other changes to alignment or machine code makes it 3x faster on my Skylake CPU.)
uops for integer DIV instruction
Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?
On macOS (on an Intel machine), I tried to make a simple hybrid x86 program with the main module written in C and a function written in x86 assembly language (NASM assembler).
The following function is supposed to reverse the string passed as its argument.
My C code is
#include <stdio.h>
char *revstring(char *s);
int main(int argc, char* argv[]){
for (int i=1; i<argc; i++){
printf("%s->", argv[i]);
printf("%s\n", revstring(argv[i]));
}
}
Then my assembly code
section .text
global _revstring
_revstring:
push rbp
mov rbp, rsp
mov rax, [rbp+8]
mov rcx, rax
_find_end:
mov dl, [rax]
inc rax
test dl, dl
jnz _find_end
sub rax, 2
_swap:
cmp rax, rcx
jbe _fin
mov dl, [rax]
xchg dl, [rcx]
mov [rax], dl
dec rax
inc rcx
jmp _swap
_fin:
mov rax, [rbp+8]
pop rbp
ret
Or
section .text
global _revstring
_revstring:
push rbp
mov rbp, rsp
mov rax, [rbp+8]
mov rcx, rax
find_end:
mov dl, [rax]
inc rax
test dl, dl
jnz find_end
sub rax, 2
swap:
cmp rax, rcx
jbe fin
mov dl, [rax]
xchg dl, [rcx]
mov [rax], dl
dec rax
inc rcx
jmp swap
fin:
mov rax, [rbp+8]
pop rbp
ret
Current macOS cannot run 32-bit programs, so I built the program using these commands:
cc -m64 -std=c99 -c revs.c
nasm -f macho64 revstring.s
cc -m64 -o revs revs.o revstring.o
But when I enter
./revs abc123
the following error occurred:
zsh: bus error ./revs abc123
I cannot find any solutions, so could anyone help me?
This question already has answers here:
Why does this example of what compilers aren't allowed to do cause null pointer dereferencing using cmov?
Hard to debug SEGV due to skipped cmov from out-of-bounds memory
I'm working on exercise 3.61 of CSAPP, which requires writing a very simple function that checks whether a pointer is NULL before trying to dereference it, based on a conditional move instruction rather than a jump. Here's an example I found online:
long cond(long* p) {
return (!p) ? 0 : *p;
}
According to claims, the function can be compiled into the following assembly:
cond:
xor eax, eax
test rdi, rdi
cmovne rax, QWORD PTR [rdi]
ret
I am running GCC 7.3.0 (from APT package gcc/bionic-updates,now 4:7.3.0-3ubuntu2.1 amd64) on Ubuntu 18.04 on WSL. The computer is running on an Intel Coffee Lake (i.e. 8th-gen Core-i) processor.
I have tried the following commands:
gcc -S a.c -O3
gcc -S a.c -O3 -march=x86-64
gcc -S a.c -O3 -march=core2
gcc -S a.c -O3 -march=k8
Honestly, I wasn't able to observe any difference in the generated a.s files; all of them look like:
cond:
xorl %eax, %eax
testq %rdi, %rdi
je .L1
movq (%rdi), %rax
.L1:
ret
Is there any possibility to have such a function that compiles into a conditional move, without a jump?
Edit: As pointed out in the comments, the CMOVcc family of instructions loads its source operand unconditionally, and only the write to the destination register is conditional, so there's no way to safely use *p (i.e. (%rdi)) as the memory source operand of a CMOV, right?
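In other words, a rough C model of what cmovne rax, QWORD PTR [rdi] has to do (a sketch of the instruction's semantics, not real code from any source) makes the problem visible:

#include <stdint.h>

/* Sketch: the memory source operand of CMOVcc is read regardless of the condition;
   only the write to the destination register is predicated. */
static uint64_t cmovne_model(int zf_clear, uint64_t old_rax, const uint64_t *rdi) {
    uint64_t tmp = *rdi;              /* this load happens even when the condition is false,
                                         so a NULL rdi still faults */
    return zf_clear ? tmp : old_rax;  /* only the register update is conditional */
}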
The claim is on this page but I think it's invalid.
Here is a branch-less version:
#include <stdint.h>   /* for uintptr_t */

inline long* select(long* p, long* q) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    uintptr_t c = !a;                             /* 1 if p is NULL, 0 otherwise */
    /* c - 1 is all-ones when p != NULL and zero when p == NULL; !c - 1 is the opposite mask */
    uintptr_t r = (a & (c - 1)) | (b & (!c - 1)); /* r = p if p != NULL, else q */
    return (long*)r;
}

long cond(long* p) {
    long t = 0;
    return *select(p, &t);                        /* always dereferences a valid pointer */
}
Assembly for gcc-8.2:
cond(long*):
mov QWORD PTR [rsp-8], 0
xor eax, eax
test rdi, rdi
sete al
lea rdx, [rax-1]
neg rax
and rdi, rdx
lea rdx, [rsp-8]
and rax, rdx
or rdi, rax
mov rax, QWORD PTR [rdi]
ret
Assembly for clang-7:
cond(long*): # #cond(long*)
mov qword ptr [rsp - 8], 0
xor eax, eax
test rdi, rdi
lea rcx, [rsp - 8]
cmovne rcx, rax
or rcx, rdi
mov rax, qword ptr [rcx]
ret
I was studying one of my courses when I ran into a specific exercise that I cannot seem to resolve... It is pretty basic because I am VERY new to assembly. So let's begin.
I have a C function
unsigned int func(int *ptr, unsigned int j) {
unsigned int res = j;
int i = ptr[j+1];
for(; i<8; ++i) {
res >>= 1;
}
return res;
}
I translated it to assembly with gcc:
.file "func.c"
.intel_syntax noprefix
.text
.globl func
.type func, @function
func:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-28]
add eax, 1
mov eax, eax
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
jmp .L2
.L3:
shr DWORD PTR [rbp-8]
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 7
jle .L3
mov eax, DWORD PTR [rbp-8]
pop rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size func, .-func
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",@progbits
The question is as follows: which instruction places j (the variable in the C function) on top of the stack?
I sincerely cannot figure it out, please enlighten me XD.
The variable j is the second parameter of func; it is passed in the register esi in the x86-64 System V ABI calling convention. The instruction mov DWORD PTR [rbp-28], esi puts j onto the stack.
You can see it very clearly by writing a simple function that calls "func" and compiling it with -O0 (or with -O2 and marking it as noinline, or only providing a prototype so there's nothing for the compiler to inline).
unsigned int func(int *ptr, unsigned int j) {
unsigned int res = j;
int i = ptr[j+1];
for(; i<8; ++i) {
res >>= 1;
}
return res;
}
int main()
{
int a = 1;
int array[10];
func (array, a);
return 0;
}
Using the Godbolt compiler explorer, we can easily get gcc -O0 -fverbose-asm assembly output.
Please focus on the following instructions:
# in main:
...
mov DWORD PTR [rbp-4], 1
mov edx, DWORD PTR [rbp-4]
...
mov esi, edx
...
func(int*, unsigned int):
...
mov DWORD PTR [rbp-28], esi # j, j
...
j, j is a comment added by gcc -fverbose-asm, telling you that the source and destination operands of that instruction are both the C variable j.
The full assembly instructions:
func(int*, unsigned int):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-28]
add eax, 1
mov eax, eax
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-8], eax
jmp .L2
.L3:
shr DWORD PTR [rbp-4]
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 7
jle .L3
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 48
mov DWORD PTR [rbp-4], 1
mov edx, DWORD PTR [rbp-4]
lea rax, [rbp-48]
mov esi, edx
mov rdi, rax
call func(int*, unsigned int)
mov eax, 0
leave
ret
Taking into account these instructions
mov eax, DWORD PTR [rbp-28]
add eax, 1
it seems that j is stored at address rbp-28, while ptr is stored at address rbp-24.
These are the instructions where the values are stored on the stack:
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
It seems the arguments are passed to the function in the registers rdi and esi. In the x86-64 System V calling convention, small integer and pointer arguments are passed in registers rather than on the stack. Within the function, the compiler may then use the stack to temporarily store the arguments that arrived in registers, which is what this unoptimized code does.
Just a suggestion for further exploration on your own: use gcc -O0 -g2 f.c -Wa,-adhln. It will turn off optimizations and generate assembly code intermixed with the source. It might give you a better idea of what the compiler does.
As an alternative, you can run objdump -Sd f.o on the output '.o' file or the executable. Just make sure that you add debugging info and turn off optimizations when compiling.
I wrote this snippet in a recent argument over the supposed speed of array[i++] vs array[i]; i++.
int array[10];
int main(){
int i=0;
while(i < 10){
array[i] = 0;
i++;
}
return 0;
}
Snippet at the compiler explorer: https://godbolt.org/g/de7TY2
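For completeness, the array[i++] variant from the argument (not shown above; a sketch) differs only in the loop body:

int array[10];

int main(void) {
    int i = 0;
    while (i < 10) {
        array[i++] = 0;   /* post-increment folded into the indexing expression */
    }
    return 0;
}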
As expected, the compilers output identical asm for array[i++] and array[i]; i++ at -O1 and above. However, what surprised me was the seemingly random placement of the xor eax, eax within the function at higher optimization levels.
GCC
At -O2, GCC places the xor before the ret as expected
mov DWORD PTR [rax], 0
add rax, 4
cmp rax, OFFSET FLAT:array+40
jne .L2
xor eax, eax
ret
However it places the xor after the second mov at -O3
mov QWORD PTR array[rip], 0
mov QWORD PTR array[rip+8], 0
xor eax, eax
mov QWORD PTR array[rip+16], 0
mov QWORD PTR array[rip+24], 0
mov QWORD PTR array[rip+32], 0
ret
icc
icc places it normally at -O1:
push rsi
xor esi, esi
push 3
pop rdi
call __intel_new_feature_proc_init
stmxcsr DWORD PTR [rsp]
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
..B1.2:
mov DWORD PTR [array+rax*4], 0
inc rax
cmp rax, 10
jl ..B1.2
xor eax, eax
pop rcx
ret
but in a strange place at -O2
push rbp
mov rbp, rsp
and rsp, -128
sub rsp, 128
xor esi, esi
mov edi, 3
call __intel_new_feature_proc_init
stmxcsr DWORD PTR [rsp]
pxor xmm0, xmm0
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
movdqu XMMWORD PTR array[rip], xmm0
movdqu XMMWORD PTR 16+array[rip], xmm0
mov DWORD PTR 32+array[rip], eax
mov DWORD PTR 36+array[rip], eax
mov rsp, rbp
pop rbp
ret
and -O3
and rsp, -128
sub rsp, 128
mov edi, 3
call __intel_new_proc_init
stmxcsr DWORD PTR [rsp]
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
mov rsp, rbp
pop rbp
ret
Clang
Only clang places the xor directly in front of the ret at all optimization levels:
xorps xmm0, xmm0
movaps xmmword ptr [rip + array+16], xmm0
movaps xmmword ptr [rip + array], xmm0
mov qword ptr [rip + array+32], 0
xor eax, eax
ret
Since GCC and ICC both do this at higher optimisation levels, I presume there must be some kind of good reason.
Why do some compilers do this?
The code is semantically identical of course and the compiler can reorder it as it wishes, but since this only changes at higher optimization levels this must be caused by some kind of optimization.
Since eax isn't used until the return, compilers can zero the register whenever they want, and it works as expected.
An interesting thing that you didn't notice is the icc -O2 version:
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
movdqu XMMWORD PTR array[rip], xmm0
movdqu XMMWORD PTR 16+array[rip], xmm0
mov DWORD PTR 32+array[rip], eax ; set to 0 using the value of eax
mov DWORD PTR 36+array[rip], eax
Notice that eax is zeroed for the return value, but is also used to zero two memory locations (the last two instructions), probably because the instruction using eax is shorter than the one with an immediate zero operand.
So two birds with one stone.
Different instructions have different latencies. Sometimes changing the order of instructions can speed up the code for several reasons. For example:
If a certain instruction takes several cycles to complete and it sits at the end of the function, the program simply waits until it is done. If it is earlier in the function, other things can happen while that instruction finishes. That is unlikely to be the actual reason here, though, since xor of a register with itself is a low-latency instruction. Latencies are processor-dependent, though.
However, placing the XOR there may have to do with separating the mov instructions it is placed between.
There are also optimizations that take advantage of the capabilities of modern processors, such as pipelining and branch prediction (not the case here as far as I can see), etc. You need a pretty deep understanding of these capabilities to understand what an optimizer may do to exploit them.
You might find this informative. It pointed me to Agner Fog's site, a resource I had not seen before, but it has a lot of the information you wanted (or didn't want :-) ) to know but were afraid to ask :-)
Those memory accesses are expected to burn at least several clock cycles. You can move the xor without changing the functionality of the code. By pulling it back so that one or more memory accesses come after it, it becomes essentially free: it doesn't cost any execution time because it runs in parallel with the external access (the processor finishes the xor and then waits on the external activity, rather than just waiting on the external activity). If you put it in a clump of instructions without memory accesses, it costs at least a clock. And, as you probably know, using xor instead of mov with an immediate reduces the size of the instruction, probably not costing clocks but saving space in the binary. A gee-whiz kinda cool optimization that dates back to the original 8086 and is still used today, even if it doesn't save you much in the end.
Where the compiler sets that particular value depends on the point at which, walking the execution tree, it is sure that the register will not be needed anymore and will not be changed by anything else.
Here is a less trivial example:
https://godbolt.org/g/6AowMJ
Here the compiler zeroes eax after the call to memset, because the call could change eax's value. Exactly where this happens depends on the analysis of a complex tree, and the result may not look logical to a human.